Filing History API Returns 500 Error on document_metadata Link—How to Retrieve PDF?

geork · October 16, 2024, 1:10pm

I’m trying to retrieve a PDF document for a specific company using its company ID. After calling the Filing History API, I received the following response object:

"links": {
  "self": "/company/14451541/filing-history/MzQxNTkzNTkxOWFkaXF6a2N4",
  "document_metadata": "https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM"
}

However, when I access the document_metadata link, it returns an Internal Server Error (500). Is there an alternative method to obtain the PDF link via the API?

voracityemail · October 16, 2024, 4:35pm

The link works fine for me, using curl:

curl -u MYAPIKEYHERE: "https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM"
{
"company_number":"14451541","barcode":"XCZNCOOA","significant_date":"2023-09-30T00:00:00Z","significant_date_type":"made-up-date","category":"accounts","pages":6,"filename":"14451541_aa_2024-03-25","created_at":"2024-03-25T11:35:21.766731669Z","etag":"",
"links":{"self":"https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM","document":"https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/content"},"resources":{"application/pdf":{"content_length":56537},"application/xhtml+xml":{"content_length":30525}}
}

Note: you can retrieve other file types, where available, by setting the HTTP Accept header (the available types are listed in the “resources” member above).
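As a rough sketch of that (in Python, which is my assumption since the original post doesn't say what language is in use), you could pick the Accept value from the “resources” member before requesting the content. The function name pick_content_type and the order of preference are made up for illustration; the sample data is trimmed from the response above.

```python
def pick_content_type(metadata, preferred=("application/pdf", "application/xhtml+xml")):
    """Return the first preferred content type listed under "resources", or None.

    Hypothetical helper: the name and the preference order are illustrative only.
    """
    resources = metadata.get("resources", {})
    for content_type in preferred:
        if content_type in resources:
            return content_type
    return None


# Trimmed copy of the metadata response shown above.
sample = {
    "links": {
        "document": "https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/content"
    },
    "resources": {
        "application/pdf": {"content_length": 56537},
        "application/xhtml+xml": {"content_length": 30525},
    },
}

# Header to send along with the request for links["document"].
headers = {"Accept": pick_content_type(sample)}
print(headers)  # {'Accept': 'application/pdf'}
```

If nothing you prefer is listed, the function returns None and you would need to decide what to do (for example, not setting the header at all).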

If you request the document link above, you immediately get a 302 redirect. Depending on your language / library / tool you may not see it, because your system may simply follow it.

curl -u MYAPIKEYHERE: -v "https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/content"
...
> GET /document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/content HTTP/1.1
> Host: document-api.company-information.service.gov.uk
...
< Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/application-xhtml%252Bxml?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...

Exactly how this works depends on what is making the HTTP request for you. For example, with curl I can tell it to follow redirects (-L). It knows to send the HTTP Basic Authorization header only to the initial site (Companies House), and it happily retrieves the file.

 curl -u MYAPIKEY: -L "https://document-api.company-information.service.gov.uk/document/oeuCPq3-1oZUKkp85orZO-_DGD81rzO3Zubf1js-_KM/content" > foo.pdf

However, if your tool follows the redirect and also sends the HTTP Basic Authorization header to Amazon, you will get an error.

So if the tool / library you’re using tries to send the HTTP Basic Authorization header again, you may have to intercept the first response from Companies House (the 302), then request the URL given in its Location header without the Authorization header.
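Here is a minimal sketch of that two-step approach using only the Python standard library. Python is my assumption, and the function names (basic_auth_header, fetch_document) are made up for illustration. http.client never follows redirects on its own, so the first request can send the API key and read the Location header; the second request goes to that URL with no Authorization header at all. I haven't run this against the live service, so treat it as a starting point rather than a tested client.

```python
import base64
import http.client
import urllib.request
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def basic_auth_header(api_key):
    """Build the same header curl sends for `-u APIKEY:` (key as user, empty password)."""
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"


def fetch_document(content_url, api_key, accept="application/pdf"):
    """Hypothetical helper: fetch a .../content URL in two explicit steps."""
    parts = urlsplit(content_url)

    # Step 1: ask Companies House, with the API key. No redirect is followed here.
    conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    try:
        conn.request(
            "GET",
            parts.path,
            headers={"Authorization": basic_auth_header(api_key), "Accept": accept},
        )
        response = conn.getresponse()
        status = response.status
        location = response.getheader("Location")
        response.read()  # drain the (small) redirect body
    finally:
        conn.close()

    if status not in REDIRECT_CODES or not location:
        raise RuntimeError(f"expected a redirect, got HTTP {status}")

    # Step 2: fetch the URL from the Location header WITHOUT the Authorization header.
    with urllib.request.urlopen(location, timeout=30) as download:
        return download.read()
```

Usage would be something like data = fetch_document(url_from_links_document, "MYAPIKEYHERE"), then writing data to a file. Many HTTP libraries already drop the Authorization header when a redirect goes to a different host, in which case none of this is needed, so check what yours does first.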

There’s more information about the process in this thread including a link to another thread where I’ve tried to show how this all works.

Good luck