How to get document_id?

alex_g · November 1, 2019, 12:26pm

I want to get a company’s documents using the API. I can find them in the filing history API, but the entries don’t have document IDs, which makes it impossible to use the document API to download them.

I tried both the transaction ID and the XXX from the document metadata URL “https://frontend-doc-api.companieshouse.gov.uk/document/XXX”, but both gave a 500 Internal Server Error.

Thus leading me to my question:

How do I obtain the document ID in order to find it in the document API?

or (assuming the XXX in the above is correct)
Why am I receiving 500 errors? Is this an internal problem? (I’ve tried over a number of days, as I initially thought it was a problem with the API. However, I’ve seen no mention of internal issues, and I would have thought any such issues would have been fixed after a few days.)

Thanks

mariya_garkavenko · January 23, 2020, 1:22pm

Hello! I am having the same problem. Have you found any solution?

alex_g · January 23, 2020, 1:48pm

Yeah, what they don’t tell you is that the authentication for the document API works in a completely different way from the rest, for no apparent reason. This is what you need (in Python):

DOCUMENT_API_URL = 'https://%s:@document-api.companieshouse.gov.uk/document/%s/content'

REQUEST_PARAMS_FOR_DOC_API = {'Accept': 'application/xhtml+xml'}

DOCUMENT_API_URL = DOCUMENT_API_URL % (api_key, document_id)

raw_response = requests.get(DOCUMENT_API_URL, params=REQUEST_PARAMS_FOR_DOC_API)

Note this assumes you actually want xhtml+xml documents. If you want PDF you give it something else in the params; I don’t know what, because I didn’t want PDFs. Also, fun fact: even when you ask for xhtml+xml documents, the API sometimes ignores you completely and gives you PDFs. So I just put in a while loop that tries 4 times to make sure the output is in the format I want, then abandons the attempt if they keep sending PDFs and tries again later.
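The retry idea described above could be sketched roughly as follows. This is only an illustration: `fetch_in_format` and its `fetch` argument are hypothetical names, not part of any Companies House client, and the sketch assumes the response object exposes a `headers` mapping with a `Content-Type` entry, as `requests` responses do.

```python
import time


def fetch_in_format(fetch, wanted="application/xhtml+xml", attempts=4, delay=0.0):
    """Call fetch() up to `attempts` times and return the first response whose
    Content-Type matches `wanted`, or None if none of them did.

    `fetch` is a hypothetical zero-argument callable that performs the request
    and returns an object with a `headers` mapping (e.g. a requests.Response).
    """
    for attempt in range(attempts):
        response = fetch()
        # Drop any "; charset=..." suffix before comparing media types.
        content_type = response.headers.get("Content-Type", "")
        media_type = content_type.split(";")[0].strip().lower()
        if media_type == wanted:
            return response
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before asking again
    return None
```

Here `fetch` could be a lambda wrapping whatever `requests.get(...)` call you already make. A `None` result corresponds to the "abandon the attempt and try again later" step.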

Good luck

mariya_garkavenko · January 23, 2020, 2:19pm

Thanks a lot!

LOL

voracityemail · January 24, 2020, 6:05pm

Hmm… I’ve not experienced this. What’s written above may be a typo, but I think you need to specify the type you want in the HTTP headers, not in the URI parameters. Python’s requests.get uses “headers” for HTTP headers, not “params” as written above, as per the Requests 2.25.1 Quickstart documentation.
At least, using the HTTP header works for me, and CH say so in the document library documentation (note “If the Content-Type is unsupported, a 406 error will be generated.”).

I think this should be (note - I’m not a pythonista):

DOCUMENT_API_URL = (exactly what you get from the filing history response links → document metadata)

REQUEST_HEADERS_FOR_DOC_API = {'Accept': 'application/xhtml+xml'} (or whatever type you want which is listed in the document metadata resource section)

raw_response = requests.get(DOCUMENT_API_URL, auth=('YOUR_API_KEY', ''), headers=REQUEST_HEADERS_FOR_DOC_API)

Note I’ve altered this so you’re not using the https://user:password@… syntax. As mentioned, e.g., in the URL Wikipedia article, this is deprecated (“Use of the format username:password in the userinfo subcomponent is deprecated for security reasons.”). The “auth” part uses a blank password, as that’s how CH have organised things.
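To make the blank-password point concrete: HTTP Basic auth sends the base64 encoding of “username:password”, so a blank password just means the API key followed by a colon. A minimal sketch of the header value this produces is below; `basic_auth_header` is a hypothetical helper for illustration only, since `requests` builds the header for you when you pass `auth=`.

```python
import base64


def basic_auth_header(api_key):
    """Build the Authorization header value for HTTP Basic auth, using the
    API key as the username and an empty password."""
    # Basic auth encodes "username:password"; the password here is blank.
    credentials = f"{api_key}:".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")
```

This is why the key must never be logged or shared in a URL: base64 is an encoding, not encryption, and anyone who sees the header can recover the key.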

So: first check which formats are actually provided by looking at the “resources” section returned by the document metadata endpoint, then set your HTTP “Accept” header accordingly. Note that the vast majority of documents are only filed in PDF format (“application/pdf”).
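Putting the steps together, a rough sketch might look like the following. It rests on several assumptions that you should check against real responses: that each filing history item has a `links` object with a `document_metadata` entry, that the metadata JSON has a `resources` object keyed by content type, and that the content lives at the metadata URL plus `/content` (the pattern used earlier in this thread). All three function names are made up for illustration.

```python
def metadata_url_from_filing(item):
    """Return the document metadata URL from one filing history item, or None.

    Assumes the item carries a `links` object with a `document_metadata`
    entry; a filing without an attached document will not have one.
    """
    return item.get("links", {}).get("document_metadata")


def pick_content_type(metadata, preferred=("application/xhtml+xml", "application/pdf")):
    """Return the first preferred content type listed under `resources`,
    or None if the document is not available in any of them."""
    available = metadata.get("resources", {})
    for content_type in preferred:
        if content_type in available:
            return content_type
    return None


def content_request(metadata_url, api_key, content_type):
    """Build the URL and keyword arguments for a requests.get() call."""
    url = metadata_url.rstrip("/") + "/content"
    kwargs = {
        "auth": (api_key, ""),               # API key as username, blank password
        "headers": {"Accept": content_type},  # format goes in a header, not params
    }
    return url, kwargs


# Usage, once you have fetched the metadata JSON with the same auth:
#   url, kwargs = content_request(metadata_url, "YOUR_API_KEY", content_type)
#   response = requests.get(url, **kwargs)
```

Choosing the content type from `resources` first means you only ask for a format the document actually has, which should avoid the 406 error mentioned above.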

There’s an overview of some of the pitfalls of getting documents here in my comments (I’d rather CH made theirs a bit clearer themselves but hope this helps).