Fetch Document 500ing

swoopfunding · March 26, 2021, 2:34pm

I get the document link by curling ‘https://api.companieshouse.gov.uk/company/Compa/filing-history?category=accounts’, passing my valid token in the header as “Authorization: {token}”.

In the list I can see the link to the document; I bet it is something like https://frontend-doc-api.company-information.service.gov.uk/document/{DocumentID}.

I then tried the following options:

curl https://frontend-doc-api.company-information.service.gov.uk/document/{DocumentID} -H "Authorization: <token>" -k -vvvvvvvvvvvvv

curl https://document-api.company-information.service.gov.uk/document/{DocumentID} -H "Authorization: <token>" -k -vvvvvvvvvvvvv

both responding with 500.

curl https://api.companieshouse.gov.uk/document/{DocumentID} -H "Authorization: <token>" -k -vvvvvvvvvvvvv

responds with 404

Please help

tonyb · March 27, 2021, 6:31pm

Hi. I’ve not retrieved documents yet, but are you base64 encoding the token and setting it to basic authentication? Only saying that because you just mentioned "Authorization: (token) "

swoopfunding · March 29, 2021, 12:25pm

The same token works for

https://api.companieshouse.gov.uk/company/{CompanyID}/filing-history?category=accounts

voracityemail · March 30, 2021, 10:04am

Welcome!

I’m assuming here you just want help getting the document metadata, i.e. not downloading the filing document itself. For information on the whole process please see other answers on this forum.

I’m not sure what you mean by the “token” part? All the endpoints simply take the API key (technically, http basic Authorization where the username is the API key and the password is blank). Since you’re already calling curl to get the filing history for a company you’ll have an API key, yes?

I find the simplest way to do this with curl is using the -u argument to pass the username and password. That means you put your unmodified API key from CH followed by a colon (nothing after the colon, because the password part is blank), then a space and the rest of your curl statement.
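For anyone scripting this rather than typing curl commands, here is a minimal Python sketch of what the -u form amounts to, using only the standard library. The function names basic_auth_value and build_request are made up for illustration, and MY_API_KEY_HERE is a placeholder. Nothing is sent over the network unless you uncomment the urlopen lines.

```python
import base64
import urllib.request


def basic_auth_value(api_key: str) -> str:
    """Return the Authorization header value that curl -u 'KEY:' would send."""
    # HTTP Basic encodes "username:password". The password is blank here,
    # so the string to encode is the API key followed by a colon.
    user_and_pass = f"{api_key}:".encode("ascii")
    return "Basic " + base64.b64encode(user_and_pass).decode("ascii")


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying the API key."""
    return urllib.request.Request(
        url, headers={"Authorization": basic_auth_value(api_key)}
    )


# Hypothetical key; the company number is the example one used in this thread.
req = build_request(
    "https://api.company-information.service.gov.uk/company/04253605/filing-history",
    "MY_API_KEY_HERE",
)
# To actually send it:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read()[:200])
```

The point of the sketch is only that the colon is part of what gets base64 encoded; the raw API key on its own is not a valid header value.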

Rolling your own header is not difficult but it seems to cause people a lot of confusion, so I’d go with the simplest method first. (The only reason for working directly with http headers in the CH API is if you haven’t got a library to do the basics for you; I’d recommend using one to save labor. Without one you may need to manage the rate-limiting system and / or manually follow the redirects in the document API to download document content without passing CH authorisation to Amazon. Those topics are covered elsewhere, e.g. on this forum, and probably won’t concern you if you’re manually accessing the system using curl.)

Another aside - I’d try to avoid using the curl -k argument - I’d certainly avoid this in a production environment. If curl can’t verify certificates this is really a prompt to update your certificate store.

…I bet it is something like…

No need to guess, it’s all (reasonably) well documented at the links below. First, filing history. Note you can get either a single item or a list of all items; I’m just listing the single item endpoint below:
https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/filing-history/filinghistoryitem-resource
…and the format of a single entry:
https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/resources/filinghistoryitem?v=latest
…and how to request document metadata information:
https://developer-specs.company-information.service.gov.uk/document-api/reference/document-metadata/fetch-a-documents-metadata
…and what it returns:
https://developer-specs.company-information.service.gov.uk/document-api/resources/documentmetadata?v=latest

Using an example company here (04253605) and my API key I got the following to work fine just now. I’ve left out your -k and -v options (disable certificate check and verbose) for clarity:

Examining a filing history entry:
curl -uMY_API_KEY: "https://api.company-information.service.gov.uk/company/04253605/filing-history/MzI4MDk0OTUwM2FkaXF6a2N4"

{
    "action_date": "2020-02-29",
    "category": "accounts",
    "date": "2020-10-19",
    "description": "accounts-with-accounts-type-dormant",
    "description_values": {
        "made_up_date": "2020-02-29"
    },
    "links": {
        "self": "/company/04253605/filing-history/MzI4MDk0OTUwM2FkaXF6a2N4",
        "document_metadata": "https://frontend-doc-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc"
    },
    "paper_filed": true,
    "type": "AA",
    "pages": 3,
    "barcode": "A9FKE6FU",
    "transaction_id": "MzI4MDk0OTUwM2FkaXF6a2N4"
}

Using the document metadata link to get document info:
curl -uMY_API_KEY_HERE: "https://frontend-doc-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc"

{
    "company_number": "04253605",
    "barcode": "A9FKE6FU",
    "significant_date": "2020-02-29T00:00:00Z",
    "significant_date_type": "made-up-date",
    "category": "accounts",
    "pages": 3,
    "created_at": "2020-10-21T04:46:11.10712573Z",
    "etag": "",
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc",
        "document": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc/content"
    },
    "resources": {
        "application/pdf": {
            "content_length": 45593
        }
    }
}

Using the alternative form of the document API endpoint (instead of “frontend…”):
curl -uMY_API_KEY_HERE: "https://document-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc"

{
    "company_number": "04253605",
    "barcode": "A9FKE6FU",
    "significant_date": "2020-02-29T00:00:00Z",
    "significant_date_type": "made-up-date",
    "category": "accounts",
    "pages": 3,
    "created_at": "2020-10-21T04:46:11.10712573Z",
    "etag": "",
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc",
        "document": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc/content"
    },
    "resources": {
        "application/pdf": {
            "content_length": 45593
        }
    }
}

Of course you can combine this with other things in curl e.g. curl -I to get the headers etc.

swoopfunding · March 30, 2021, 2:53pm

Thanks very much, curl with -u works, I will figure out how to send the same request in Postman. Thanks a lot for your help!

swoopfunding · March 30, 2021, 3:24pm

So my problem now is to get Document content, I get exactly the same response as in https://forum.aws.chdev.org/t/how-to-download-a-document-from-companieshouse-api-through-postman/1809

swoopfunding · March 30, 2021, 3:34pm

In the end I solved it this way (thanks both for your help):

curl https://document-api.company-information.service.gov.uk/document/{DocumentID} -H "Authorization: Basic {base64 encoded token}" -k -vvvvvvvvvvvvv

curl https://document-api.company-information.service.gov.uk/document/{DocumentID}/content -H "Authorization: Basic {base64 encoded token}" -k -vvvvvvvvvvvvv

Thanks!

turbaevsky · May 1, 2024, 1:53pm

I’m afraid the document api link does not work - could you double check it?

voracityemail · May 1, 2024, 2:18pm

You’ll have to be more specific - which one?

Companies House APIs seem to be designed for people to “follow the links” e.g. request a filing history (item), use the link there to request the document metadata, use the link there to request the document data.

I recommend doing that (again it seems they’ve designed their API essentially assuming people follow that kind of path - it’s clearer that way anyway). If there’s a specific issue post the URL you are requesting and the http response code and any body and someone might be able to help. (More information the better - so information on the environment you’re calling from might help also. Just don’t post your API key).
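To make the “follow the links” idea concrete, here is a small Python sketch that walks the links using trimmed copies of the two JSON responses quoted earlier in this thread. The helper name next_link is made up for illustration. In real use the two dictionaries would come from your authenticated API responses rather than from string literals.

```python
import json

# Trimmed copy of the filing history item shown earlier in the thread.
filing_item = json.loads("""
{
    "category": "accounts",
    "links": {
        "self": "/company/04253605/filing-history/MzI4MDk0OTUwM2FkaXF6a2N4",
        "document_metadata": "https://frontend-doc-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc"
    }
}
""")

# Trimmed copy of the document metadata response from the same example.
metadata = json.loads("""
{
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc",
        "document": "https://document-api.companieshouse.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc/content"
    },
    "resources": {
        "application/pdf": {"content_length": 45593}
    }
}
""")


def next_link(resource: dict, name: str):
    """Return the named link from a resource, or None if it is absent."""
    return resource.get("links", {}).get(name)


# Step 1: the filing history item points at the document metadata.
metadata_url = next_link(filing_item, "document_metadata")
# Step 2: the metadata points at the document content.
content_url = next_link(metadata, "document")
# The metadata also lists which content types are available.
available_types = list(metadata.get("resources", {}))

print(metadata_url)
print(content_url)
print(available_types)
```

Taking the URLs from the responses, rather than building them by hand, means a change of host name on the Companies House side is picked up without touching your code.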

turbaevsky · May 1, 2024, 4:20pm

Dear all,
Thank you VERY much for such a prompt reply.

I ran into the issue when getting document metadata as well as its content using JS fetch. I can reach the CH API through a proxy (for the dev environment), but the following OPTIONS and GET requests to AWS S3 return a CORS error. I used a request such as https://find-and-update.company-information.service.gov.uk/company/13062145/filing-history/MzQwNjY5MDE2MWFkaXF6a2N4/document?format=xhtml&download=1 because https://document-api.company-information.service.gov.uk returns error 500.

Regarding https://find-and-update.company-information.service.gov.uk, I must admit that such a request works fine with Python requests.

I hoped that the more ‘formal’ https://document-api.company-information.service.gov.uk would work better in JS, but it does not, even when trying curl as follows:

curl -k -vvvvv -H "Authorization: " https://document-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc

Trying 13.41.210.139:443…
Connected to document-api.company-information.service.gov.uk (13.41.210.139) port 443 (#0)
ALPN: offers h2,http/1.1
TLSv1.3 (OUT), TLS handshake, Client hello (1):
TLSv1.3 (IN), TLS handshake, Server hello (2):
TLSv1.2 (IN), TLS handshake, Certificate (11):
TLSv1.2 (IN), TLS handshake, Server key exchange (12):
TLSv1.2 (IN), TLS handshake, Server finished (14):
TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
TLSv1.2 (OUT), TLS handshake, Finished (20):
TLSv1.2 (IN), TLS handshake, Finished (20):
SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
ALPN: server did not agree on a protocol. Uses default.
Server certificate:
subject: C=GB; L=Cardiff; O=Companies House; CN=*.companieshouse.gov.uk
start date: Oct 5 00:00:00 2023 GMT
expire date: Oct 16 23:59:59 2024 GMT
issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust TLS RSA CA G1
SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
using HTTP/1.x

GET /document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc HTTP/1.1
Host: document-api.company-information.service.gov.uk
User-Agent: curl/7.88.1
Accept: */*
Authorization: …

< HTTP/1.1 500 Internal Server Error
< Date: Wed, 01 May 2024 16:19:05 GMT
< Server: nginx/1.22.1
< Content-Length: 0
< Connection: keep-alive
<

Connection #0 to host document-api.company-information.service.gov.uk left intact

Thank you in advance!

voracityemail · May 1, 2024, 5:03pm

Starting from the end of your post - I am guessing your curl example has an incorrect http Basic header.

A quick way to test this is instead of making a header yourself, try:

curl -k -vvvvv -u YOURAPIKEYHERE: https://document-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc

(Note that you need the “:” after your API key - this is because normally a username:password string goes here but Companies House have made it so the username part is the API key and the password is empty / blank).

This works correctly for me - I receive a http 200 and the body contains the expected JSON. (There is another difference - curl on my server supports more than http 1.1 but I suspect that this is not the source of problems here).

If you manually construct the Authorization header correctly, this also works. So take the username:password string above (the API key plus : on the end) and base64 encode it to give a string - let’s call it BASE64USERANDPASS. Now make the header string:

Authorization: Basic BASE64USERANDPASS

… and use curl:

curl -k -vvvvv -H "Authorization: Basic BASE64USERANDPASS" https://document-api.company-information.service.gov.uk/document/bFofQLDBGWrTBK02r1myESnrGJi0Uf7v1OTfQE7cbvc

This also works correctly for me.

HOWEVER if there is a problem with that header - so the BASE64USERANDPASS is not valid base64-encoded text, or you were missing the final “:”, or the capitalisation of the “Basic” is incorrect e.g. “basic” etc. … then you will get an error 500.
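If you are building the header by hand, a quick local check can catch the mistakes listed above before you send anything. This is a rough Python sketch, not anything Companies House provides: check_basic_header is a made-up name, and the rules it applies are simply the three failure cases described in this post (wrong capitalisation of “Basic”, invalid base64, missing final “:”).

```python
import base64


def check_basic_header(value: str) -> list:
    """Return a list of likely problems with an Authorization header value."""
    problems = []
    scheme, _, encoded = value.partition(" ")
    # The thread reports a 500 for e.g. "basic", so insist on exact case.
    if scheme != "Basic":
        problems.append("scheme should be exactly 'Basic'")
    try:
        # validate=True rejects characters outside the base64 alphabet
        # and incorrect padding instead of silently ignoring them.
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        problems.append("credentials are not valid base64 text")
        return problems
    if not decoded:
        problems.append("credentials are empty")
    elif not decoded.endswith(":"):
        problems.append("decoded credentials should end with ':' (blank password)")
    return problems


print(check_basic_header("Basic YWJjOg=="))   # "abc:" encoded, no problems
print(check_basic_header("basic YWJjOg=="))   # wrong capitalisation
print(check_basic_header("Basic YWJj"))       # "abc" encoded, colon missing
print(check_basic_header("Basic 1234-abcd"))  # raw key pasted in, not base64
```

An empty list only means the header is well formed; it says nothing about whether the key itself is valid or registered.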

There are several other things which could cause you problems:

you need to have registered a live application, not a test one
you need to be calling Companies House from an IP address or (for js) a host url you have registered with them.
Specifically for when you request the document data (e.g. …/document/…/content ) note that this involves at least one redirect and the first is away from Companies House to Amazon AWS (currently) where the data is stored. Please search the site for more information on that (I’ve a couple of posts with details of this).
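On the redirect point, here is a hedged Python sketch of one way to follow the redirect without handing your Companies House credentials to the storage host. It relies on the standard library only. As far as I know, urllib’s default redirect handling copies the request headers to the redirected request, so the handler below drops Authorization when the host changes. The names DropAuthOnHostChange and fetch_document are made up, and the default Accept value is an assumption based on the “application/pdf” entry in the metadata shown earlier.

```python
import urllib.request
from urllib.parse import urlsplit


class DropAuthOnHostChange(urllib.request.HTTPRedirectHandler):
    """Follow redirects, but do not forward Authorization to another host."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None:
            old_host = urlsplit(req.full_url).hostname
            new_host = urlsplit(newurl).hostname
            if old_host != new_host:
                # Credentials were meant for the original host only.
                new_req.remove_header("Authorization")
        return new_req


def fetch_document(content_url: str, auth_value: str,
                   accept: str = "application/pdf") -> bytes:
    """Download document content, following redirects without leaking auth."""
    opener = urllib.request.build_opener(DropAuthOnHostChange)
    req = urllib.request.Request(
        content_url,
        headers={"Authorization": auth_value, "Accept": accept},
    )
    with opener.open(req) as resp:
        return resp.read()


# Example call (not executed here; needs a real key and network access):
# pdf_bytes = fetch_document(
#     "https://document-api.company-information.service.gov.uk/document/{DocumentID}/content",
#     "Basic BASE64USERANDPASS",
# )
```

Many HTTP libraries already behave this way by default, so check what yours does before adding anything like this.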

Using the URLs that the Companies House web site itself uses such as https://find-and-update.company-information.service.gov.uk/company/13062145/... may not be a good idea as that may be changed without notice. It’s probably better to use the API as that is specifically designed for this.

Finally - once you’re sure everything else works - for help with CORS errors please search for that specifically on this forum. There is a lot of information available.

Good luck.

turbaevsky · May 1, 2024, 5:53pm

Thank you very much - curl works fine. Could you please clarify whether BASE64USERANDPASS is the same as my API_KEY? I am using ‘Authorization’: ‘Basic 33…’ header but got 500

voracityemail · May 1, 2024, 8:39pm

See: “So take the username:password string above (the API key plus : on the end) and base64 encode it to give a string - let’s call it BASE64USERANDPASS. …”

The username part is your API key. The password part is blank (empty string).

See e.g. the Wikipedia entry for http basic.

Plenty of JavaScript examples online, e.g. on Stack Overflow.

In many environments there is built-in support for http basic so you don’t have to build the headers yourself. That seems to be a common source of confusion and problems!