As should be clear from the above, it sounds like you’re getting as far as the last step but one. Downloading the document metadata gets you only the metadata. After that you need to parse that data to get the correct URL to download the document content.
Please see my linked post. The relevant steps are:
(3) Request the actual document, specifying the mime type (e.g. “application/pdf”).
(4) CH send back a response which is a redirect (http 302) to the document. The documents are stored on Amazon servers at the moment.
(5) Request this URI from Amazon, again passing the content type you want.
(6) Amazon send the actual document data.
Again, this is why I recommend using curl initially: it’s easier to see what is happening at each step.
You requested:
https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY
That should give you a JSON response like the one shown below (I’ve snipped some parts, marked “…”):
{
"company_number": "00048839",
...
"links": {
"self": ...
"document": "https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content"
},
"resources": {
"application/pdf": {
"content_length": 437669
}
}
}
You need to check the “resources” member for the available data types (mime types), then request the URL given in “document” (under “links”). I think the resource type defaults to the given one if there is only one, but it makes sense to always specify which one you want, especially if there is more than one type of resource. You do that, as the documentation says, by setting it in the http “Accept” header.
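If it helps, here is a minimal Python sketch of that parsing step. The function name choose_content_request and the fallback to the first listed type are my own assumptions, not anything from the Companies House documentation; the sample JSON contains only the fields shown above, with the snipped parts left out.

```python
import json


def choose_content_request(metadata, preferred="application/pdf"):
    """Return (url, headers) for the document content request.

    `metadata` is the parsed JSON from the document metadata endpoint.
    This helper is hypothetical, written only to illustrate the step.
    """
    resources = metadata.get("resources", {})
    if not resources:
        raise ValueError("metadata lists no resources")
    if preferred in resources:
        mime_type = preferred
    else:
        # Assumption: fall back to the first type listed if ours is missing.
        mime_type = next(iter(resources))
    # The content URL is the "document" entry under "links".
    url = metadata["links"]["document"]
    return url, {"Accept": mime_type}


# Only the fields shown in the post; the snipped ones are omitted.
sample = json.loads("""
{
  "company_number": "00048839",
  "links": {
    "document": "https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content"
  },
  "resources": {
    "application/pdf": {"content_length": 437669}
  }
}
""")

url, headers = choose_content_request(sample)
print(url)
print(headers)
```

The returned URL and “Accept” header are what you would then pass to your HTTP client, along with your API key.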
So I’d request:
curl -u YOUR_API_KEY_HERE: -H "Accept: application/pdf" https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content
Note the “/content” at the end there: that tells Companies House you want the actual data.
You do not get back the actual data immediately. You get a redirect to Amazon’s AWS servers, so the response is an http 302:
< HTTP/1.1 302 Found
< Date: Mon, 20 Feb 2023 12:20:25 GMT
< Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/application-pdf?...
(Again, some data snipped here, marked “…”, to save space.)
Depending on your language / tool you may have to catch that redirect URL in your own code. This is because if your tool / code follows the link automatically and sends the Companies House authorisation (e.g. the API key) on to Amazon, Amazon will send you an error. You will also need to request the link from Amazon quickly, as it is time-limited, i.e. it will expire at some point.
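As a rough illustration of catching the redirect yourself, here is a sketch using only the Python standard library. The names NoRedirect, basic_auth_header and fetch_document are hypothetical, and it assumes the API key is sent as the basic-auth username with an empty password, which is what the curl “-u YOUR_API_KEY_HERE:” form above does. I have not tested it against every response the service can return, so treat it as a starting point.

```python
import base64
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects automatically."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises
        # HTTPError for the 302, which we handle ourselves below.
        return None


def basic_auth_header(api_key):
    # Same as curl's `-u KEY:`: key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"


def fetch_document(content_url, api_key, mime_type="application/pdf"):
    """Request the /content URL, then fetch the redirect target without auth."""
    opener = urllib.request.build_opener(NoRedirect)
    first = urllib.request.Request(content_url, headers={
        "Authorization": basic_auth_header(api_key),
        "Accept": mime_type,
    })
    try:
        with opener.open(first) as resp:
            return resp.read()  # no redirect was issued
    except urllib.error.HTTPError as err:
        if err.code != 302:
            raise
        location = err.headers["Location"]

    # Second request goes to Amazon: Accept header only, no Authorization.
    second = urllib.request.Request(location, headers={"Accept": mime_type})
    with urllib.request.urlopen(second) as resp:
        return resp.read()
```

Calling fetch_document with the “/content” URL and your own key should then return the document bytes. The second request is made straight away because the redirect link expires.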
I did this with your example and correctly downloaded a PDF file of 437669 bytes, which matches the “content_length” in the metadata.
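If you want to make the same sanity check in code, something like the sketch below would do. The helper name matches_expected_size is made up for illustration; it only compares the number of bytes you received with the “content_length” given for that mime type in the metadata.

```python
def matches_expected_size(data, metadata, mime_type="application/pdf"):
    """Return True if the downloaded bytes match the advertised length."""
    expected = metadata["resources"][mime_type]["content_length"]
    return len(data) == expected


# Metadata reduced to the one field this check needs.
metadata = {"resources": {"application/pdf": {"content_length": 437669}}}
```

Here matches_expected_size(pdf_bytes, metadata) would be True for a complete download, where pdf_bytes is whatever your own download code returned.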
Good luck.