Document API ignoring requested document format

alex_g · November 12, 2019, 1:36pm

When I query the document API with the header ‘Accept: application/xhtml+xml’, it often returns a PDF, essentially ignoring the requested document type. My current workaround is below (in Python); however, it requires multiple attempts (set to 4, but it has taken up to 8 on occasion), and given the rate limiting this is very time-inefficient. Thus I have three questions:

Is there a better way of doing this?
Has anyone else encountered this?

(To our friends in the Companies House dev team) Can this be investigated? (It will reduce the server load if we don’t have to query it multiple times with the same request…)

 import time

 # Companies_House_Api, docs_url, api_key and doc_id are defined elsewhere.
 request_params = {'Accept': 'application/xhtml+xml'}
 max_attempts = 4
 n_api_attempts = 0
 wrong_content_type = True
 while wrong_content_type and n_api_attempts < max_attempts:
     n_api_attempts += 1

     doc_output = Companies_House_Api.query_document_api(docs_url, api_key, doc_id, request_params)

     if doc_output.status_code == 200:
         # Compare only the media type, ignoring any parameters such as charset.
         content_type = doc_output.headers.get('Content-Type', '')
         if content_type.split(';')[0].strip() == 'application/xhtml+xml':
             wrong_content_type = False

     if wrong_content_type and n_api_attempts < max_attempts:
         print('retrying...')
         time.sleep(1)

Note: this code works, so there is no problem with the querying function; you request exactly the same thing multiple times until you get the right answer.

Cheers

andy4 · February 7, 2025, 12:31pm

I’m also seeing this issue with the document API (randomly cycling between MIME types even though I’m passing the Accept parameter with “application/xhtml+xml”) but I can’t believe it hasn’t been commented on or reported by anyone else in the past 5 years!! Is there a fix for this issue that I’m not seeing here for some reason?

Thanks

voracityemail · February 10, 2025, 9:25am

I’ve not noticed that but we are mostly just downloading PDF anyway.

Don’t know the exact cause but I wonder if this is because you’re requesting a mime type which isn’t available for a given filing? And perhaps then CH (via AWS) is just sending back “the one they have” e.g. PDF? Most filings only have PDF anyway as far as I’m aware.

Apologies if you already do this but for issues we’ve had we’ve found that the API definitely seems to be designed to be “walked through” step-by-step e.g. almost as if working through the website. So e.g. from company (CompanyProfile) to Filing History (list) to Document Metadata to request for content (which of course then goes via the redirect to the host - still AWS I believe)…

… so we follow this, e.g. we always “follow the link” from the Filing History entry (links.document_metadata, which, as it says, points you to request the document metadata first). We then check which MIME types are listed (in resources - docs here) and only request from that list.
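A rough sketch of that walk in Python, in case it helps. It assumes the filing history item carries links.document_metadata and the metadata response carries links.document plus a resources map keyed by MIME type. The function names and the example.invalid URLs are made up for illustration, and no HTTP calls are made; the dicts stand in for parsed API responses.

```python
from typing import Optional


def metadata_url(filing_item: dict) -> Optional[str]:
    """Return links.document_metadata from a filing history item, if present."""
    return filing_item.get("links", {}).get("document_metadata")


def content_url(metadata: dict) -> Optional[str]:
    """Return links.document (the .../content URL) from document metadata."""
    return metadata.get("links", {}).get("document")


def available_types(metadata: dict) -> list:
    """List the MIME types the metadata says exist for this document."""
    return sorted(metadata.get("resources", {}))


# Trimmed stand-ins for parsed API responses (illustrative values only).
filing_item = {"links": {"document_metadata": "https://example.invalid/document/ABC"}}
metadata = {
    "links": {"document": "https://example.invalid/document/ABC/content"},
    "resources": {
        "application/pdf": {"content_length": 100},
        "application/xhtml+xml": {"content_length": 200},
    },
}

wanted = "application/xhtml+xml"
if wanted in available_types(metadata):
    print("request", content_url(metadata), "with Accept:", wanted)
else:
    print(wanted, "is not listed; available:", available_types(metadata))
```

The point of the check is simply to avoid asking for a type the metadata does not list.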

Aside from a given format not necessarily being there it’s possible that another XML mime type may be presented. Companies House themselves list application/xml in their documentation, but I imagine it’s possible they could even just offer “text/html” - sometimes used for iXBRL online - or “text/xml”?
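If other XML-ish types do turn up, one way to cope is to keep an ordered list of acceptable types and take the first one the metadata lists. This is only a sketch: pick_type and the preference order are my own assumptions, and I have not confirmed that Companies House serves anything other than the types seen in the metadata.

```python
def pick_type(resources: dict, preferences: list):
    """Return the first preferred MIME type present in resources, else None."""
    for mime in preferences:
        if mime in resources:
            return mime
    return None


# Preference order is an assumption: the iXBRL type first, then the other
# XML/HTML types mentioned above as possibilities.
XML_PREFERENCES = [
    "application/xhtml+xml",
    "application/xml",
    "text/html",
    "text/xml",
]

# Illustrative resources map, not taken from a real filing.
resources = {
    "application/pdf": {"content_length": 100},
    "application/xml": {"content_length": 200},
}
print(pick_type(resources, XML_PREFERENCES))
print(pick_type({"application/pdf": {}}, XML_PREFERENCES))
```

A result of None means only types outside the list (e.g. PDF) are available, so there is nothing worth retrying for.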

andy4 · February 10, 2025, 2:46pm

Thanks.

Whilst most document types are just PDF, 70-80% of accounts are available as PDF or iXBRL (the PDF being derived from a rendering of the iXBRL). I’m only interested in the iXBRL accounts, and I’m using the MIME type indicated by the document metadata.

I’ll try the other MIME types but I suspect the type being returned is random in these circumstances, judging by the behaviour I’m observing.

voracityemail · February 10, 2025, 6:03pm

This isn’t something we regularly do but I’m just trying a few of these manually (if it’s an issue, we need to know…)

Picking some random companies (looking on the CH website to locate instances where they say they have PDF and iXBRL - which might perhaps mean “only some”…) but I find:

I get the type I request (PDF or iXBRL).
If I re-request the data (same request), I get the same type (i.e. in the examples I tried it’s not switching between different types).

Very small sample size due to time, but “it works for me”…

curl -u MYAPIKEY: "https://api.company-information.service.gov.uk/company/14351587/filing-history/MzQxNzM2OTEwM2FkaXF6a2N4"

(use the “document_metadata” entry):

curl -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA"
{"company_number":"14351587","barcode":"XD0HJ94X","significant_date":"2023-09-30T00:00:00Z","significant_date_type":"made-up-date","category":"accounts","pages":1,"filename":"14351587_aa_2024-04-06","created_at":"2024-04-06T16:54:56.384465773Z","etag":"","links":{"self":"https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA","document":"https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA/content"},"resources":{"application/pdf":{"content_length":22603},"application/xhtml+xml":{"content_length":14917}}}

curl -L --header "Accept: application/xhtml+xml" -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA/content"

… consistently gives me iXBRL

curl -L --header "Accept: application/pdf" -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA/content"

… gives me PDF.
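For anyone calling from Python rather than curl, here is a sketch of the equivalent request headers using only the standard library. build_content_request is a made-up helper name. It only builds the request and does not send it; redirect handling is deliberately left out because HTTP clients differ in whether they forward credentials to the redirect target.

```python
import base64
import urllib.request


def build_content_request(url: str, api_key: str, accept: str) -> urllib.request.Request:
    """Build (but do not send) a GET request mirroring `curl -u KEY: --header "Accept: ..."`."""
    # curl's `-u KEY:` is HTTP Basic auth with the key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": accept},
    )


req = build_content_request(
    "https://document-api.company-information.service.gov.uk/document/U92CyiA9WBzWktpyIxWFmhq7J1NRFm06ntFzIz4U-sA/content",
    "MYAPIKEY",
    "application/xhtml+xml",
)
print(req.get_method(), req.get_header("Accept"))
```

Worth checking that Accept really goes out as a request header and not as a query parameter, since that is the part curl is getting right in the examples above.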

Another:

curl -u MYAPIKEY: "https://api.company-information.service.gov.uk/company/05388912/filing-history/MzQzNDk3MDg3OGFkaXF6a2N4"

curl -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/mHOKgQpQWfXabfh6GrO2_pUI-Y7i1JETwXoQSiXhIb4"
{"company_number":"05388912","barcode":"XDB9DYM1","significant_date":"2023-12-31T00:00:00Z","significant_date_type":"made-up-date","category":"accounts","pages":13,"filename":"05388912_aa_2024-09-09","created_at":"2024-09-09T15:00:44.037434452Z","etag":"","links":{"self":"https://document-api.company-information.service.gov.uk/document/mHOKgQpQWfXabfh6GrO2_pUI-Y7i1JETwXoQSiXhIb4","document":"https://document-api.company-information.service.gov.uk/document/mHOKgQpQWfXabfh6GrO2_pUI-Y7i1JETwXoQSiXhIb4/content"},"resources":{"application/pdf":{"content_length":174343},"application/xhtml+xml":{"content_length":195023}}}

curl -L --header "Accept: application/xhtml+xml" -u  MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/mHOKgQpQWfXabfh6GrO2_pUI-Y7i1JETwXoQSiXhIb4/content"

curl -L --header "Accept: application/pdf" -u  MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/mHOKgQpQWfXabfh6GrO2_pUI-Y7i1JETwXoQSiXhIb4/content"

That all works as expected e.g. correct types returned (and same data returned if I re-request the same thing).

Obvs. just a tiny sample - but do they work for you? If not, I guess the next direction would be do you get consistent results e.g. if you repeat a request do you get a different type? Then - how are you calling the API? If you’re consistently getting the wrong type do you have any examples?
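To answer the "do you get consistent results" question with numbers, something like the following could tally the Content-Type across repeated identical requests. It is a sketch only: tally_content_types is a hypothetical helper, and the fetch function here replays canned values instead of calling the API.

```python
from collections import Counter
from typing import Callable


def tally_content_types(fetch_content_type: Callable[[], str], attempts: int = 5) -> Counter:
    """Call fetch_content_type `attempts` times and count each media type seen."""
    seen = Counter()
    for _ in range(attempts):
        # Keep only the media type, dropping parameters such as "; charset=...".
        seen[fetch_content_type().split(";")[0].strip().lower()] += 1
    return seen


# Stand-in for a real request: replays a fixed sequence instead of calling the API.
fake_responses = iter(["application/pdf", "application/xhtml+xml", "application/pdf"])
counts = tally_content_types(lambda: next(fake_responses), attempts=3)
print(dict(counts))
print("consistent" if len(counts) == 1 else "inconsistent")
```

With a real fetch function you would want a pause between calls because of the rate limiting, and the document ID alongside the tally makes a useful example to post.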

We’ll keep tabs on this ourselves.
Good luck.

voracityemail · February 11, 2025, 8:47am

Another one which works as expected (FWIW showing getting the filing history record first, as in prev. examples, so you can see company number etc):


curl -u MYAPIKEY: "https://api.company-information.service.gov.uk/company/10795619/filing-history/MzQxMzMyMDM3NWFkaXF6a2N4"
{"transaction_id":"MzQxMzMyMDM3NWFkaXF6a2N4","barcode":"XCY5G9C0","type":"AA","date":"2024-03-04","category":"accounts","description":"accounts-with-accounts-type-micro-entity","description_values":{"made_up_date":"2023-11-30"},"pages":4,"action_date":"2023-11-30","links":{"self":"/company/10795619/filing-history/MzQxMzMyMDM3NWFkaXF6a2N4","document_metadata":"https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k"}}

curl -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k"
{"company_number":"10795619","barcode":"XCY5G9C0","significant_date":"2023-11-30T00:00:00Z","significant_date_type":"made-up-date","category":"accounts","pages":4,"filename":"10795619_aa_2024-03-04","created_at":"2024-03-04T16:43:38.565321925Z","etag":"","links":{"self":"https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k","document":"https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k/content"},"resources":{"application/pdf":{"content_length":34035},"application/xhtml+xml":{"content_length":38268}}}

curl -L --header "Accept: application/xhtml+xml" -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k/content" > ch_accept_test1.xml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 38268  100 38268    0     0   479k      0 --:--:-- --:--:-- --:--:--  479k

curl -L --header "Accept: application/pdf" -u MYAPIKEY: "https://document-api.company-information.service.gov.uk/document/_OyVjEQ_-j8DYWm7NyAcFjMrGKMxMkuH1O1FJwO5q1k/content" > ch_accept_test1.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 34035  100 34035    0     0   251k      0 --:--:-- --:--:-- --:--:--  251k
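Note that the byte counts curl reports (38268 and 34035) match the content_length values in the metadata. A sketch of checking that in Python, plus a quick look at the first bytes to see whether a PDF came back: the helper names are made up, the bodies below are dummy bytes of the right length, and I am assuming the downloaded size equals content_length, which held in the examples here but may not always.

```python
def looks_like_pdf(body: bytes) -> bool:
    """PDF files begin with the marker %PDF-."""
    return body.startswith(b"%PDF-")


def matches_metadata(body: bytes, mime: str, resources: dict) -> bool:
    """Check the body length against resources[mime]['content_length'] and
    that a PDF marker is present exactly when a PDF was requested."""
    expected = resources.get(mime, {}).get("content_length")
    if expected is None or len(body) != expected:
        return False
    return looks_like_pdf(body) == (mime == "application/pdf")


# Resources copied from the metadata above; the bodies are dummy bytes, not real filings.
resources = {
    "application/pdf": {"content_length": 34035},
    "application/xhtml+xml": {"content_length": 38268},
}
fake_pdf = b"%PDF-" + b"0" * (34035 - 5)
fake_xhtml = b"<" + b"x" * (38268 - 1)
print(matches_metadata(fake_pdf, "application/pdf", resources))
print(matches_metadata(fake_xhtml, "application/xhtml+xml", resources))
print(matches_metadata(fake_pdf, "application/xhtml+xml", resources))
```

This catches a PDF arriving when iXBRL was requested even if the Content-Type header is missing or misleading.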