[Solved] XBRL instead of PDF documents

bchazalet · March 19, 2019, 8:25am

Hi,

I’m successfully downloading annual accounts documents using the API. I’m after the PDF documents, but quite often I get an XBRL document instead. This seems to happen pretty randomly: sometimes I get the right PDF after a few retries, and sometimes not.

Does anyone have any idea why that is? Is there a way to force the download of the PDF version and not the XBRL?

Cheers,

bchazalet · March 18, 2019, 6:12pm

My guess is that if one doesn’t specify an Accept header, the response’s content-type isn’t deterministic.

voracityemail · March 18, 2019, 8:47pm

You’ve said what you need! The solution is to ask for what you want by setting the Accept header when requesting the document content.

Currently filings may be available in:

No data at all - you might see some variant of “unavailable” or “please contact us for this”.
PDF - the “standard”.
XBRL - only for a few filings and, as far as I’m aware, only when the company has filed in this format. CH don’t generate it themselves.

…so you got lucky a few times.

To see what’s available (per the document API docs), look at the “resources” member of the document metadata object, which lists the formats on offer. (The metadata endpoint can be found in the links in the filing history list / filing history item object.) Example:

For company 00197009, the filing history item https://api.companieshouse.gov.uk/company/00197009/filing-history/MzE5Mzc0OTc3MGFkaXF6a2N4

If you request the metadata from this with:
https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8

You get:

{
    "company_number": "00197009",
    ...
    "resources": {
        "application/pdf": {
            "content_length": 26343
        },
        "application/xhtml+xml": {
            "content_length": 20364
        }
    },
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8",
        "document": "https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8/content"
    }
}
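
As a rough illustration of using that “resources” member, here is a minimal Python sketch that picks which MIME type to ask for. The function name `pick_format` and its default preference order are hypothetical and not part of the Companies House API; the sample data is a trimmed copy of the metadata in this post.

```python
# Sketch only: pick_format and its default preference order are hypothetical
# helpers for illustration, not part of the Companies House API.

def pick_format(metadata, preferred=("application/pdf", "application/xhtml+xml")):
    """Return the first preferred MIME type listed under "resources", or None."""
    resources = metadata.get("resources") or {}
    for mime_type in preferred:
        if mime_type in resources:
            return mime_type
    return None


# Trimmed copy of the document metadata shown in this post.
metadata = {
    "company_number": "00197009",
    "resources": {
        "application/pdf": {"content_length": 26343},
        "application/xhtml+xml": {"content_length": 20364},
    },
}

print(pick_format(metadata))  # prints: application/pdf
```

Returning None when nothing matches lets the caller treat the “no data at all” case explicitly instead of requesting a format that isn’t listed.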

I don’t think there’s any “non-determinism”, unless you’ve repeatedly requested the same document endpoint and got different results. Getting different formats is more likely the result of randomly hitting accounts where a format other than PDF (e.g. XBRL) is also available. It may be that you always get XBRL if you don’t specify “Accept” with a type and both it and PDF are available, or it may be that you get them in the order “they come in”, which may or may not be fixed. Or there may be some other reason. Just set the “Accept” header.
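
For anyone wanting to see what “set the Accept header” might look like in practice, here is a minimal Python sketch using only the standard library. The helper name `build_content_request` and the key `my_api_key` are made up for illustration, and the code assumes the usual Companies House scheme of HTTP Basic auth with the API key as the username and an empty password. It only builds the request; it does not send it.

```python
import base64
import urllib.request


def build_content_request(content_url, api_key, mime_type="application/pdf"):
    """Build a GET request for document content that asks for one MIME type.

    Hypothetical helper for illustration, not part of any official client.
    """
    # Assumption: HTTP Basic auth, API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        content_url,
        headers={
            "Accept": mime_type,  # ask explicitly for the format you want
            "Authorization": f"Basic {token}",
        },
        method="GET",
    )


request = build_content_request(
    "https://document-api.companieshouse.gov.uk/document/"
    "T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8/content",
    api_key="my_api_key",  # placeholder, not a real key
)
print(request.get_header("Accept"))  # prints: application/pdf
```

Passing the result to `urllib.request.urlopen` would send it; redirect and error handling are deliberately left out of this sketch.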

bchazalet · March 19, 2019, 8:12am

Thanks for your response @voracityemail!

There’s definitely some non-determinism if the Accept headers are not set. But that’s OK, as I now understand that I can just get what I want by setting them.