Document API - 401 Status Code Returned Using Python

I have followed the well-documented steps outlined here to retrieve a document, and everything worked right up until the last step.

I have used the Companies House public data API ‘filing-history’ to retrieve document IDs using Python; see the simplified code snippet for a single request below.

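import requests

url = 'https://api.company-information.service.gov.uk/company/07426533/filing-history'
# Read the API key, stripping any trailing newline left in the file
with open('api_key.txt', mode='r') as f:
    key = f.read().strip()
params = {'category': 'accounts'}
# The API key is sent as the Basic Auth username with an empty password
response = requests.get(url, auth=(key, ''), params=params).json()
# Link to the document metadata for the first filing in the list
document_id = response['items'][0]['links']['document_metadata']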

I can then use this to retrieve the document metadata using the following code with no problems.

import requests

url = 'https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE'
# Read the API key, stripping any trailing newline left in the file
with open('api_key.txt', mode='r') as f:
    api_key = f.read().strip()
metadata = requests.get(url, auth=(api_key, '')).json()

Inspecting the metadata object, I see the document is available in both PDF and XHTML/XML, it's 10 pages long, and the link to the document is the same as the one I used to retrieve the metadata but with ‘/content’ appended to the end.
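
For what it's worth, those two observations can be sketched in Python as below. The helper names content_url and available_types are made up for illustration, and the 'resources' key is an assumption about how the metadata JSON is laid out, so check it against the object you actually get back.

```python
def content_url(metadata_url):
    """Build the document content URL by appending '/content' to the metadata URL."""
    return metadata_url.rstrip('/') + '/content'


def available_types(metadata):
    """List the content types the metadata advertises.

    Assumes (not confirmed here) that the formats sit under a 'resources'
    mapping keyed by MIME type; returns an empty list if that key is absent.
    """
    return sorted(metadata.get('resources', {}))
```

Calling content_url on the metadata URL from the snippet above would give the same '/content' link that appears in the metadata object.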

I’ve read through some of the documentation on this site and I see that when requesting the document you should not include your API key (link here).

When sending off a request for the PDF I don’t retrieve any data; I simply get a 401 status code (unauthorised). When I do include my API key as usual, I receive an SSLError, as expected from reading the post above. I have included a code snippet below.

import requests

url = 'https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content'
response = requests.get(url, headers={'Accept': 'application/pdf'})

print(response.status_code)
>>401

How do I go about authorising this request for a document? Any help would be hugely appreciated!

Thanks,
Matt

Welcome - this is a good error report, so it's pretty clear what’s going on. You’re nearly there. The issue is just that the first port of call in your last request is still a Companies House server, so you will need to pass the API key there. This is the request in question:

url = 'https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content'
response = requests.get(url, headers={'Accept': 'application/pdf'})

If you don’t pass in the API key then you’ll get what you described, i.e. a 401, the same as calling any other part of Companies House without the API key.

However, what this request returns if you do send the key is an HTTP 302 redirect (to Amazon AWS). Many others had the problem that the tool they were using was “following” this as expected but continuing to send the Companies House API key as HTTP Basic Authentication, which then caused Amazon’s servers to complain (or did at that point).

I don’t know Python but there’s an answer to “how to check if a request redirects to a new URL” here:
https://www.adamsmith.haus/python/answers/how-to-check-if-a-request-redirects-to-a-new-url-in-python

So - unless you use a different library, it looks like you call your last endpoint with the API key, then look through the list of responses. Find the first redirect and then just request the URL it points to without passing in the API key. (You may not even need to do this. I don’t know, but Python - like curl as described below - may be able to follow the redirect(s) without passing the API key every time. I’d check what the Python library will do.)
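
A minimal sketch of that two-step approach, assuming the requests library from the question. The function name fetch_document and its session parameter are hypothetical, and I have not run this against the live service, so treat it as a starting point rather than a confirmed fix.

```python
def fetch_document(content_url, api_key, accept='application/pdf', session=None):
    """Fetch a document in two steps so the API key is only sent to the first host."""
    if session is None:
        # Imported lazily so the function can also be exercised with a stand-in session.
        import requests
        session = requests.Session()

    # Step 1: authenticated request to Companies House, without following the redirect.
    first = session.get(
        content_url,
        auth=(api_key, ''),
        headers={'Accept': accept},
        allow_redirects=False,
    )
    if first.status_code not in (301, 302, 303, 307, 308):
        # Not a redirect: raise on errors such as 401, otherwise return the body.
        first.raise_for_status()
        return first.content

    # Step 2: plain request to the URL in the Location header, with no API key.
    second = session.get(first.headers['Location'])
    second.raise_for_status()
    return second.content
```

Something like open('download.pdf', 'wb').write(fetch_document(url, api_key)) would then save the result, where url is the ‘/content’ link from the question.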

You can see / test this with curl. Your last request is effectively sending:

curl -I "https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content"

Not surprisingly Companies House won’t let you do that!

HTTP/1.1 404 Not Found
Content-Length: 19
Content-Type: text/plain; charset=utf-8
Date: Thu, 14 Apr 2022 13:12:15 GMT
Server: nginx/1.18.0
X-Content-Type-Options: nosniff
Connection: keep-alive

If you add the API key to that you’ll get a 302. Use curl -v instead of -I here, as otherwise you don’t see the redirect. (Some of the following is snipped, both within lines and as complete lines, for clarity - marked “…”.)

curl -v -u MY_API_KEY: "https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content"

< HTTP/1.1 302 Found
< Date: Thu, 14 Apr 2022 13:11:27 GMT
< Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/application-pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=...

Aside - with curl you can of course tell it to follow the redirects (the -L flag). This actually allows me to get the file with my version of curl, i.e. it only passes the authorization to the first host (Companies House):

(some of the following snipped both within lines and complete lines for clarity - marked “…”)

curl -v -L -u MY_API_KEY: "https://frontend-doc-api.company-information.service.gov.uk/document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content" > download.pdf
...

GET /document/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/content HTTP/1.1
Host: frontend-doc-api.company-information.service.gov.uk
Authorization: Basic {ENCODED API KEY HERE}

< HTTP/1.1 302 Found
< Date: Thu, 14 Apr 2022 13:23:37 GMT
< Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/application-pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=
Issue another request to this URL: 'https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/application-pdf?X-Amz-Algorithm=A
GET /document-api-images-live.ch.gov.uk/docs/iTf5l1sphFi4eBM-ndd7WGZclS11-L4FJdVSx7SN3xE/application-pdf?X-Amz-Algorithm=A
Host: s3.eu-west-2.amazonaws.com

< HTTP/1.1 200 OK
< Content-Type: application/pdf
< Server: AmazonS3
< Content-Length: 183106
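
If you would rather let the Python library follow the redirect itself, as curl -L does above, a small helper like this could show which hosts were visited. It relies on the history and url attributes of a requests response; the name redirect_chain is made up for illustration. As far as I know, requests drops the Authorization header when a redirect moves to a different host, but that is an assumption worth verifying against your installed version.

```python
from urllib.parse import urlsplit


def redirect_chain(response):
    """Return (status_code, host) for every hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [(hop.status_code, urlsplit(hop.url).hostname) for hop in hops]


# Example use (not executed here), with url and api_key as in the question:
#   response = requests.get(url, auth=(api_key, ''), headers={'Accept': 'application/pdf'})
#   print(redirect_chain(response))
```

A chain that starts with a 302 from the Companies House host and ends with a 200 from a different host would match the curl trace above.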

That should get this working for you.