Fetch a document API

mjyurag · February 10, 2017, 12:13pm

Hi Guys,

I am having some trouble exploring the API for fetching a document.

It asks me for the following:

ID (The unique identifier for the document) → where can I find this?
Authorization (This header parameter contains the token_type and the access_token) → what does this mean? How can I obtain this?
Accept (Gives the Content-Type that the document will be returned as. If the Content-Type is unsupported, a 406 error will be generated.) → what is this? Can you please give me a sample?

Please help me.

Thank you guys a lot!

MJ

MArkWilliams · February 14, 2017, 10:13am

Please read :-

and

if you are still having trouble after reading those, feel free to reply to this post.

ashtekarmukesh · February 17, 2018, 4:43pm

I am facing the same problem. Could anyone clarify the usage of this Document API, please?

Thanks,
Mukesh Ashtekar

voracityemail · February 18, 2018, 3:20pm

Welcome to the forum!

If you’re trying to use the “Explore this API” form on the Document Download page I don’t think this works - see e.g.
http://forum.aws.chdev.org/t/not-able-to-download-companies-account-data-using-api/569

Otherwise, you should be able to find the answers in @MArkWilliams's post above yours. As he says, the place to start is the CH API documentation.

For detail on the general process for downloading a document you might find my response here helpful:
http://forum.aws.chdev.org/t/document-problems-with-cors/1627/2

An overview of the process of requesting a CH document goes something like:

1. Find the document filing you’re interested in (e.g. make a filing history request for the company). Parse the response for the link to the document in the field “links” : { “document_metadata” : “link URI fragment here” }.
2. For a given document, request the document metadata via the CH Document API. Parse the response to get the document (MIME) types available and the link to the actual document data (document URI fragment).
3. Request the actual document, specifying the MIME type (e.g. “application/pdf”).
4. CH sends back a response which is a redirect (HTTP 302) to the document. The documents are stored on Amazon servers at the moment.
5. Request this URI from Amazon, again passing the content type you want.
6. Amazon sends the actual document data.

Depending on the system you’re using, steps 4-6 may occur automatically, e.g. if using JavaScript I think you just make the request (step 3) and your code will receive the data (step 6). More details in the post above if needed.
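The steps above can be sketched in R with httr and jsonlite (the language used later in this thread). This is only a rough sketch under assumptions: `api_key` is a placeholder for your own key, `fetch_document_bytes` is a made-up helper name, and it assumes the first filing-history item carries a full `document_metadata` URL, which may not hold for every filing. The redirect is handled by hand so that steps 4-6 stay visible.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: walks steps 1-6 and returns the document as a raw vector.
fetch_document_bytes <- function(company_number, api_key,
                                 mime = "application/pdf") {
  # Step 1: filing history. HTTP Basic auth: API key as user name, empty password.
  fh_url <- modify_url(
    "https://api.companieshouse.gov.uk/",
    path = paste0("/company/", toupper(company_number), "/filing-history")
  )
  fh <- GET(fh_url, authenticate(api_key, ""))
  stopifnot(status_code(fh) == 200)
  items <- fromJSON(content(fh, "text", encoding = "UTF-8"), flatten = TRUE)$items
  meta_url <- items$links.document_metadata[1]  # assumed present on the first item

  # Step 2: document metadata; lists available MIME types and the content link.
  md <- GET(meta_url, authenticate(api_key, ""))
  stopifnot(status_code(md) == 200)
  md_parsed <- fromJSON(content(md, "text", encoding = "UTF-8"))
  stopifnot(mime %in% names(md_parsed$resources))

  # Steps 3-4: ask for the content without following the redirect.
  redirect <- GET(md_parsed$links$document,
                  authenticate(api_key, ""),
                  add_headers(Accept = mime),
                  config(followlocation = 0L))
  stopifnot(status_code(redirect) == 302)

  # Steps 5-6: fetch the redirect target. The CH API key is not sent here.
  doc <- GET(headers(redirect)[["location"]])
  stopifnot(status_code(doc) == 200)
  content(doc, "raw")
}
```

A call would look like `bytes <- fetch_document_bytes("FC013908", "my_key")`, with a real key in place of the placeholder.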

Hope this helps. If not please state clearly the problem you’re trying to solve and if you have errors or think there’s a bug provide a minimal test case and / or examples of the data which is not correct.

kasia_kulma · June 19, 2018, 2:59pm

Hi there, thanks for your clarification. I’ve been following the above steps (in R) with partial success, but I still fail to get access to the documents’ content:

Find the document filing you’re interested in (e.g. make a filing history request for the company). Parse the response for the link to the document in the field “links” : { “document_metadata” : “link URI fragment here” }.

No problem:

library(httr)
library(jsonlite)
library(openssl)

### retrieving filing history ####
company_num = 'FC013908'
key = 'my_key'  # placeholder for your API key
# toupper() is base R; str_to_upper() would need library(stringr)
fh_path = paste0('/company/', toupper(company_num), "/filing-history")
fh_url <- modify_url("https://api.companieshouse.gov.uk/", path = fh_path)
# HTTP Basic auth: API key as user name, empty password
fh_test <- GET(fh_url, authenticate(key, "")) #status_code = 200
fh_parsed <- jsonlite::fromJSON(content(fh_test, "text", encoding = "utf-8"), flatten = TRUE)
docs <- fh_parsed$items  # one row per filing

Done.

2. For a given document request the document metadata via CH Document API. Parse the response to get the document (mime) types available and the link to the actual document data (document URI fragment).

No problems here:

md_meta_url = docs$links.document_metadata[1]  
key_pass <- paste0(key,":")
decoded_auth <- paste0('Basic ', base64_encode(key_pass))

md_test <- GET(md_meta_url,
               add_headers(Authorization = decoded_auth)
               )
md_test #status_code = 200!
md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)

This way I can obtain the content URL:

cont_url = md_parsed$links$document

Request the actual document, specifying the mime type (e.g. “application/pdf”).

I do it while NOT following the redirect and, as expected, I get the 302 status code with the location header:

accept = 'application/pdf'
cont_test <- GET(cont_url, 
           add_headers(Authorization = decoded_auth,
                       Accept = accept),
           config(followlocation = FALSE)
)

final_url <- cont_test$headers$location

> final_url
[1] "https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D"

However, when I try to

Request this URI from Amazon again passing the content type you want again.

I get a 400 error:

 final_test <- GET(final_url, 
                 add_headers(Authorization = decoded_auth,
                             Accept = accept
                             ))

> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D]
  Date: 2018-06-20 08:37
  Status: 400
  Content-Type: application/xml
  Size: 523 B
<BINARY BODY>

Needless to say, executing

browseURL(final_test$url)

returns an Access Denied error. I suspect it may have something to do with AWS authorization problems similar to those described here. Any ideas how to solve this final hurdle?

Thanks!

voracityemail · June 27, 2018, 9:46am

See my update to your re-post at Can't access documents from AmazonS3 server
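For readers who land here first: a likely cause of the 400 above (stated as an assumption, not confirmed in this thread) is that the redirect target is a pre-signed S3 URL, and S3 rejects a request that carries both the signed query string and an Authorization header. A minimal sketch in R, where `fetch_presigned` is a hypothetical helper name and `final_url` is the Location value from the 302 response:

```r
library(httr)

# Hypothetical helper: requests a pre-signed URL with no Authorization header.
fetch_presigned <- function(final_url) {
  # The signature in the query string is the only credential S3 needs;
  # the CH API key is deliberately not sent.
  resp <- GET(final_url)
  if (http_error(resp)) {
    # S3 reports errors as XML; printing the body shows the error code and message.
    message(content(resp, "text", encoding = "UTF-8"))
    stop("Request failed with status ", status_code(resp))
  }
  content(resp, "raw")  # document bytes
}
```

Calling `content(final_test, "text", encoding = "UTF-8")` on the failed response from the post above should likewise show which S3 error was returned. The URL also carries an Expires parameter, so it needs to be requested soon after the 302 is received.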

miroslaw_storoniak · January 29, 2020, 7:05pm

Hi,

I have a problem with the second point:

“For a given document request the document metadata via CH Document API. Parse the response to get the document (mime) types available and the link to the actual document data (document URI fragment).”

Request
GET /document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM HTTP/1.1
Host: document-api.companieshouse.gov.ukundefined
Authorization: Basic NEtxTVR0Y1dIM3YzNHliUnJSUzQ3Q0wtYVlVRGhZaktPQlRULVdEWcccg==

Token: encoded to base64
ID: collected according to point 1

The error code I get is 0 in the online test, or 500 when testing in Postman.

Please advise.

Regards,
M.

voracityemail · February 5, 2020, 5:12pm

Is that last “undefined” an artifact of your logging script / process? If not, something there needs to be fixed.

Next step would be to check your API key / which server you’re requesting from / exactly the request you’re making as the endpoint you requested works OK for me:

curl -uMY_API_KEY_HERE: "https://document-api.companieshouse.gov.uk/document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM"

Gives

{
    "company_number": "SC327000",
    "barcode": "X8CY8ZLT",
    "significant_date": null,
    "significant_date_type": "",
    "category": "annual-returns",
    "pages": 3,
    "created_at": "2019-08-30T12:53:41.19681517Z",
    "etag": "",
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM",
        "document": "https://document-api.companieshouse.gov.uk/document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM/content"
    },
    "resources": {
        "application/pdf": {
            "content_length": 19662
        }
    }
}

That’s the following company: BANK OF SCOTLAND PLC.
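As a sketch of how the metadata response above might be parsed in R with jsonlite: the MIME types on offer are the names under "resources", and the link to request next is "links" → "document". The JSON below is a trimmed copy of the response shown above; `metadata_json` and `md` are just illustrative variable names.

```r
library(jsonlite)

# Trimmed copy of the metadata response shown above.
metadata_json <- '{
  "company_number": "SC327000",
  "pages": 3,
  "links": {
    "self": "https://document-api.companieshouse.gov.uk/document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM",
    "document": "https://document-api.companieshouse.gov.uk/document/GdA8vcfuhlN6jp_ckh5Kfd3VAvwFexZFmvATUZUnmZM/content"
  },
  "resources": {
    "application/pdf": { "content_length": 19662 }
  }
}'

md <- fromJSON(metadata_json)

available_types <- names(md$resources)  # MIME types that can go in the Accept header
content_url     <- md$links$document    # URL to request the document itself
pdf_size        <- md$resources[["application/pdf"]]$content_length  # size in bytes

print(available_types)
print(content_url)
```

Whatever appears in `available_types` is what the Accept header should be set to when requesting `content_url`.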

No need to post your API key on the forum! Just check you have it correct and that it is encoded correctly per HTTP Basic authentication.
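For reference, HTTP Basic here means: take the API key as the user name with an empty password, join them as "key:", base64-encode that, and prefix it with "Basic ". A small sketch in R using the openssl package already used in this thread; `basic_auth_header` is a hypothetical helper name and "my_key" is a placeholder, not a real key.

```r
library(openssl)

# Hypothetical helper: builds the Authorization header value for an API key.
basic_auth_header <- function(api_key) {
  # User name is the key, password is empty, so the string to encode is "key:".
  paste("Basic", base64_encode(paste0(api_key, ":")))
}

header_value <- basic_auth_header("my_key")  # placeholder key
print(header_value)

# Decoding the token should give back "my_key:" (note the trailing colon).
print(rawToChar(base64_decode(sub("^Basic ", "", header_value))))
```

curl's `-u KEY:` and httr's `authenticate(key, "")` build the same header for you, so hand-encoding is only needed when a tool wants the raw header value.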

miroslaw_storoniak · February 5, 2020, 8:15pm

Thanks! I solved that issue some time ago. Please refer to the following post:
https://forum.aws.chdev.org/t/help-with-document-api-fetch-a-document/255/10?u=miroslaw_storoniak

I posted another question related to this topic there.
I would be grateful for any support.

Best Regards,
M.

sunil342it · May 6, 2020, 8:08am

browseURL(final_test$url)
This only shows the PDF in the web browser. How can I download the PDF using R, so that I can build a data frame from it in RStudio? Please help.
When I use the download.file function in R, I get the error “cannot open URL”.
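One possible approach, sketched in R: `download.file` does not send the API key, which may be why it cannot open the URL (an assumption, since the exact URL used is not shown). The bytes can instead be fetched with httr as earlier in the thread and written to disk with `writeBin`. `save_pdf` is a hypothetical helper name; the "%PDF" check is just a sanity test that the body really is a PDF and not an XML error page.

```r
# Hypothetical helper: writes a raw vector to disk after checking it looks like a PDF.
save_pdf <- function(bytes, dest) {
  stopifnot(is.raw(bytes), length(bytes) >= 4)
  # PDF files start with the ASCII characters "%PDF".
  if (!identical(bytes[1:4], charToRaw("%PDF"))) {
    stop("Body does not look like a PDF; it may be an error response")
  }
  writeBin(bytes, dest)
  invisible(dest)
}

# Usage, assuming `final_test` is a successful httr response for the document:
# save_pdf(httr::content(final_test, "raw"), "document.pdf")
```

Turning the saved PDF into a data frame is a separate text-extraction step (for example with a package such as pdftools) and is outside this sketch.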