Download documents from Companies House API

Hi,

I am trying to download a document using the Companies House API with the following steps:

  1. Get Filing History List
  2. Get Filing History Item
  3. Get document metadata from the URL returned in the response to step 2. This step never returns anything and keeps on retrying. I am using Power Automate to call the URLs.

After this step, I am not able to move forward as getting document metadata never returns.

Please suggest where I am going wrong.

Thanks

Welcome. Downloading documents is something that seems to cause people trouble.
I know nothing about “power automate” - presumably Microsoft Power Automate? So I can’t help you with the details of that but it sounds like you’ve at least managed to connect to Companies House.

It’s not clear exactly the URLs / data you were sending. It is sometimes helpful to know this when trying to understand an issue with a web service / REST API. You can always obscure some of the details!

Your problem may be due to the fact that the document metadata (and indeed document data) is on a different host (“document-api.company-information.service.gov.uk” vs. “api.company-information.service.gov.uk”). Also in the past you might see slightly different hosts listed in the returned data. They should still work fine. However I can’t tell what you actually sent so I don’t know.

A quick step through of what I guess you’re doing:

  1. Get Filing History List - you make a request to https://api.company-information.service.gov.uk/company/{company_number}/filing-history (substitute the appropriate {company_number}) as described in the Filing History List Docs.

  2. At this point you will have a Filing History Resource. This contains a list of items with the same data for each as you’d get from calling Filing History Item. (As far as I’m aware this is still the case). So there’s no need to make another call. Just find the item you want in the list of items, get the links.document_metadata and use that.
    It should be of the form:

https://document-api.company-information.service.gov.uk/document/{document_id}

(The document_id part will obviously vary.) You can then just request that per the documentation.
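To make step 2 concrete, here is a minimal Python sketch of pulling links.document_metadata out of an already-parsed Filing History List response. The function name and the placeholder IDs in the sample are made up for illustration, and the guard for items without a document link is a defensive assumption rather than documented behaviour:

```python
import json


def document_metadata_links(filing_history: dict) -> list[tuple[str, str]]:
    """Return (transaction_id, document_metadata URL) pairs from a filing history list."""
    found = []
    for item in filing_history.get("items", []):
        url = item.get("links", {}).get("document_metadata")
        if url:  # defensive: skip any item that carries no document link
            found.append((item.get("transaction_id", ""), url))
    return found


# Cut-down sample shaped like the response; all IDs are placeholders.
sample = json.loads("""
{
  "items": [
    {"transaction_id": "EXAMPLE_TRANSACTION_1",
     "links": {"self": "/company/00000000/filing-history/EXAMPLE_TRANSACTION_1",
               "document_metadata": "https://document-api.company-information.service.gov.uk/document/EXAMPLE_DOCUMENT_ID"}},
    {"transaction_id": "EXAMPLE_TRANSACTION_2",
     "links": {"self": "/company/00000000/filing-history/EXAMPLE_TRANSACTION_2"}}
  ]
}
""")

links = document_metadata_links(sample)
print(links)
```

Whatever URL comes back is requested as-is; there is no need to build it yourself.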

I’ve written a couple of posts on downloading documents with more or less detail. They should still be current:

Good luck.

Thank you so much for a quick response.
I am able to reach till getting the filing history.
But, when I try to hit the URL which I get in document metadata, it keeps on retrying but never returns.
What should be the document ID here? Will it be transaction ID?

I have just run the following to show you the entire process on one filing. (If you want more info on the full process please read the information I linked to, or just search on this forum using the “magnifying glass” icon at the top right).

I like to use curl (a simple command line tool) to manually test things. It’s simple, readily available and you can see exactly what is sent / returned. I think you should do this to check exactly what is happening and any errors / responses. If you see errors when doing the following this might be due to:

  • Using the wrong API settings (e.g. not using the live but using the “sandbox” environment, use of “localhost” without having used the fix to set an alias in your hosts file etc.). You can search this forum for more info on that.

  • Limitations / details you need to work around in your Microsoft automation tool. (I know nothing about that).

  • Network issues e.g. maybe you have a firewall blocking something

Those are up to you!

Starting from looking up the filing history for company SC327000 (I’ve limited this to 2 items to save space and I’ve snipped information from the response - marked “…”. I’ve also formatted the JSON return for clarity - it will be returned without the spacing / new lines obviously). You’ll need to enter your own API key.

curl -u YOUR_API_KEY: "https://api.companieshouse.gov.uk/company/SC327000/filing-history?start_index=0&items_per_page=2"

{
    "items": [
        {
            "category": "officers",
            "description": "appoint-person-director-company-with-name-date",
            "date": "2021-11-03",
            "transaction_id": "MzMxOTA3NDIxOGFkaXF6a2N4",
            "links": {
                "self": "/company/SC327000/filing-history/MzMxOTA3NDIxOGFkaXF6a2N4",
                "document_metadata": "https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI"
            },
            ...
        },
        {
            "category": "officers",
            "description": "termination-director-company-with-name-termination-date",
            "date": "2021-09-30",
            "transaction_id": "MzMxNTYyODMzMWFkaXF6a2N4",
            ...
            "links": {
                "self": "/company/SC327000/filing-history/MzMxNTYyODMzMWFkaXF6a2N4",
                "document_metadata": "https://frontend-doc-api.company-information.service.gov.uk/document/DF_f0_3zuroMElPj05Tv_nLWWNCKxk8dTe1Mo8nCn8k"
            },
            ...
        }
    ],
    "filing_history_status": "filing-history-available",
    "items_per_page": 2,
    "start_index": 0,
    "total_count": 252
}

Take whatever is in the document_metadata - and just request that. Here’s the first one:

curl -u YOUR_API_KEY: "https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI"

{
    "company_number": "SC327000",
    "barcode": "XAGI54RL",
    "category": "officers",
    "pages": 2,
    "filename": "SC327000_ap01_2021-11-03",
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI",
        "document": "https://document-api.companieshouse.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content"
    },
    "resources": {
        "application/pdf": {
            "content_length": 90224
        }
    }
    ...
}

This tells you there is a PDF version available, at the URL in the links.document member. Request that URL. If there is more than one entry in resources you can use the HTTP header “Accept” to specify which you want, e.g. Accept: application/pdf. In curl this would be --header "Accept: application/pdf". In the curl command further down I’ve also specified the -I flag to show the header information and -X GET (otherwise we get a complaint, because by default -I sends a HEAD request).
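As a rough Python sketch of the same idea: read the metadata, pick a content type that is actually listed under resources, and pair the links.document URL with a matching Accept header. The function name and the fall-back to the first listed type are my own choices for illustration, not anything Companies House specifies:

```python
def content_request(metadata: dict, preferred: str = "application/pdf") -> tuple[str, dict]:
    """Return (content URL, headers) for downloading a document described by metadata."""
    available = list(metadata.get("resources", {}))
    if not available:
        raise ValueError("metadata lists no downloadable resources")
    # Use the preferred type when offered, otherwise fall back to the first one listed.
    content_type = preferred if preferred in available else available[0]
    return metadata["links"]["document"], {"Accept": content_type}


# Trimmed copy of the metadata response shown above.
metadata = {
    "links": {
        "document": "https://document-api.companieshouse.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content"
    },
    "resources": {"application/pdf": {"content_length": 90224}},
}

url, headers = content_request(metadata)
print(url)
print(headers)
```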

The first thing you’ll get back is a redirect (302). Currently this will send you to Amazon AWS. To show this:

curl -u YOUR_API_KEY: -I --header "Accept: application/pdf" -X GET "https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content"

HTTP/1.1 302 Found
Date: Mon, 14 Feb 2022 09:17:57 GMT
Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/application-pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAWRGBDBV3DJKBSBZ6%2F20220214%2Feu-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220214T091757Z&X-Amz-Expires=60&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDwaCWV1LXdlc3QtMiJGMEQCICpmdATAsFtTe1exlaV2By%2B3EsjkoG8luFO82QlVCTgJAiAi%2BibCOQUnYsm1hsIQzlkTVFLq8T716mVACh8ytDbTZCqDBAiF%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAQaDDQ0OTIyOTAzMjgyMiIMqJbz5WsgE86qCTKkKtcDP%2BQbPoVEPEDSUso%2B0EzNdIGjdAFxISrm9WIWqvHvQd1FZNeWFE9e7oxI%2BUPExhJstQuaWjZ6BnnVUuf97zkjSc8Le%2FzlKozp%2Ba3rVX3k5jx8B19NrW84GNBLppBirsOkttj3MY74VTlL5NBqOFY9mrX6yAgYaS3oY3c1BLbq5EcoOYdVVuGqd1YtJr5HIuz0OexsyTZRC2NGQ4zZI0SygZuFHNcv6YNsN7PKLlPKMub9xcyfrtsJi%2BsZJEkmUWmTDQ4er92wZGALincsg%2BJ5%2Bsl2Marfbu%2FSNQN%2F6XbY3qoxaRTLZr93bP1JGqOZ0sLPRkJdMzG6672FPU3nQvirNSj15pd7Znp1XvPH9pd0DWBFWJV32AQrqDNOhgL%2B1A44TyfZZRfkcWaOEITd4UhIODRIEIeUvZRK3NKGJoQj73KnKhCbb%2F7SmXkeJC32ikDxlejYDtPlCikA2HAYoK%2Bhsm%2BOkqfnFfkVXZajwNCmKqOlH%2FbfMWSQeYjovzYRclaYYooUZJjC5rkBgC8ACfdP%2Fhbewq%2B%2BJnRiROECJBWIUbupid7j5TRNJt2SIY6eRpaQOaXxnkMkychRthoaLgljY1h3Bf8tSUpVLEYvEfmnXQx14YN6SQaoMKWep5AGOqYBcyEgdy9XTWJCaPjKHaDDnNY2Z95dV2gJHzkNvbUe8nPoEM2DtBG%2BAOoiLuXbDKSUas3wsaJByeDyFKGjNUongcLUkoqAQZslSDzIZBapGYNL6fImE2XFCHLUBfRvjp2XzYccoBu%2FqB%2B3iwNmtL7hPYqzK%2FWMoFIvHWXAhdHzFZ5F9xw%2Fqtrb55n8dVUqpzwmNGXNZtY%2B0IYADhy%2FslUeRXor1W3VFQ%3D%3D&X-Amz-SignedHeaders=host&response-content-disposition=inline%3Bfilename%3D"SC327000_ap01_2021-11-03.pdf"&X-Amz-Signature=7075c5755209fdf7bfe05c191972a6f24fe8b377103ba980b1f41ed2a7adbc97
Server: nginx/1.18.0
X-Ratelimit-Limit: 600
X-Ratelimit-Remaining: 598
X-Ratelimit-Reset: 1644830362
Content-Length: 0
Connection: keep-alive

The link above is time-limited e.g. if you try to use this one it will fail - you’ll need to follow this process and get it yourself.

When you make the request to Amazon you should not send your Companies House username and password (the API key part): a) it’s not appropriate since Amazon isn’t Companies House and b) it will likely cause Amazon to get confused and the communication to fail.
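If your tool follows redirects for you, one way to guard against this is to decide per URL whether credentials should be attached. A small Python sketch using only the standard library; the list of host suffixes is an assumption based on the hosts seen in this thread, and the function names are invented for illustration:

```python
from urllib.parse import urlsplit

# Assumption: these are the Companies House hosts seen in this thread.
COMPANIES_HOUSE_SUFFIXES = ("company-information.service.gov.uk", "companieshouse.gov.uk")


def is_companies_house(url: str) -> bool:
    """True when the URL's host is a Companies House host (or a subdomain of one)."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in COMPANIES_HOUSE_SUFFIXES)


def auth_for(url: str, api_key: str):
    """Return (username, password) for Companies House URLs, or None for any other host."""
    return (api_key, "") if is_companies_house(url) else None


print(auth_for("https://frontend-doc-api.company-information.service.gov.uk/document/abc/content", "YOUR_API_KEY"))
print(auth_for("https://s3.eu-west-2.amazonaws.com/some-bucket/some-doc", "YOUR_API_KEY"))
```

The first call returns the key with an empty password; the second returns None, so nothing is sent to Amazon.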

Amazon may also redirect you - I can’t remember. If you want curl to automatically follow all redirects use the -L flag.

curl -L "{your own version of the link above}" > test.pdf

That successfully downloaded the PDF for me.

Good luck.

Hi @voracityemail

I tried using the curl commands you suggested above and everything worked fine.
But, in Power Automate, whenever I hit the URL which I get in document metadata, it keeps on retrying and never returns back.

I am using various other Companies House actions and none of them gives this issue.

Only getting document metadata never returns.

Any thoughts on this?

Also, yes I am using MS Power Automate as you asked in your first response on this post.

Thanks
Neha

Great - so you’ve proved that you don’t have problems with the API key, network access, firewalls etc. (at least not for where you’re running curl from).

This isn’t anything to do with Companies House API, which you’ve now proved is working as expected.

Even better - so you know exactly where the problem is!

You need to be clear what exact url you are requesting, what data you’re passing and what response you’re getting back.

Is it failing to give you the document metadata itself (which is JSON) or failing to give you the document content (e.g. download the PDF)? The urls to request either have the same start but the content one ends in “/content”. In my experience that’s often where people have problems.
If it’s the content you’re having problems getting then make sure you’ve understood the notes above about not sending the Companies House API key to Amazon and redirects. Then see if you can get Power Automate to not follow redirects so you can step through the process. Again - ask around on Power Automate forums.

The “document API” urls (both metadata and content) have a different host to the rest of the API. Is that giving Power Automate a problem? For example:

https://document-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content

You say “keeps retrying” however there is no doubt a setting in Power Automate which controls what happens if you get an error. Or there will be another way to call something and get the http headers. You should be able to get the system to stop on the first error. Ask around on Power Automate forums or read their documentation.

So make sure you know which, if any, of the following urls give an issue. For any that fail, what error do you get? As above, play around with settings until you can get an error if it doesn’t give you one. Maybe Power Automate doesn’t like four dots in the URL? Ask on Power Automate forums!

Document metadata urls:
https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI
The same one but with their “documented” host:
https://document-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI

Actual document URLs:
https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content

Same with different host:

https://document-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI/content
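If it helps to generate those four URLs rather than type them, here is a small Python sketch (standard library only). The helper name with_host is made up for illustration; it just swaps the host and leaves the rest of the URL alone:

```python
from urllib.parse import urlsplit, urlunsplit

DOCUMENTED_HOST = "document-api.company-information.service.gov.uk"


def with_host(url: str, host: str = DOCUMENTED_HOST) -> str:
    """Return the same URL with only the host part replaced."""
    return urlunsplit(urlsplit(url)._replace(netloc=host))


# The document_metadata URL exactly as returned in the filing history above.
returned = "https://frontend-doc-api.company-information.service.gov.uk/document/mixpA6C5NJo7r0SZ2ypIiLCLM4XtZMnHZKMcHestIKI"

candidates = [
    returned,                          # metadata, host as returned
    with_host(returned),               # metadata, documented host
    returned + "/content",             # content, host as returned
    with_host(returned) + "/content",  # content, documented host
]
for candidate in candidates:
    print(candidate)
```

Try each one separately in your tool and note which one stops responding.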

Post what you find here when you’ve fixed it. It might help someone else.
Good luck.

Thank you for all your help @voracityemail
I was able to get the documents to download.
The issue was with the Authorization header I was using in the HTTP action of Power Automate.
It was expecting the API key to be Base64-encoded in the Authorization header.
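For anyone else building the header by hand: curl's -u YOUR_API_KEY: is HTTP Basic authentication, which means the header value is "Basic " followed by the Base64 encoding of the key plus a trailing colon (the password is empty). A minimal Python sketch; the function name and the example key are placeholders:

```python
import base64


def basic_auth_header(api_key: str) -> dict:
    """Build the Authorization header for an API key used as username with an empty password."""
    # The colon separates username and password; it is part of what gets encoded.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header("my_api_key")
print(header)
```

Leaving out the colon before encoding is an easy mistake to make, so it is worth checking if the header is still rejected.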

The next issue I am facing is that it returns only the first 25 records, whereas I want to download the incorporation certificate.
How can I change the items per page for this?

Thanks in advance! :slight_smile:

I’m glad you resolved your issue.

I recommend you use the “search” facility and read the documentation. Almost all the answers are available that way. However since I’ve already typed that… :wink:

Read the documentation here:
https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/filing-history/list

You can limit the filing history items returned by category which can save you searching through results. In this case there is one for “incorporation” so specify that.

The list of the possible category values is here, in the category member:
https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/resources/filinghistorylist?v=latest
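As a sketch of what the request URL looks like with a category filter, here is a small Python helper using only the standard library. The helper name is made up; the category value "incorporation" is the one mentioned above:

```python
from urllib.parse import urlencode

API_BASE = "https://api.company-information.service.gov.uk"


def filing_history_url(company_number: str, **params) -> str:
    """Build the Filing History List URL, adding any query parameters given."""
    url = f"{API_BASE}/company/{company_number}/filing-history"
    return f"{url}?{urlencode(params)}" if params else url


incorporation_url = filing_history_url("SC327000", category="incorporation")
print(incorporation_url)
```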

You can specify a number of items per page. However if you ask for large numbers of results (not sure the limit and it may vary… possibly 100?) Companies House won’t return all the information. You’ll need to “page through” the result set if you don’t get back all those you need. For more information on that see the following threads but a very brief summary:

  • Send them both start_index and items_per_page, not just one or the other.

  • start_index starts from zero and should be a multiple of items_per_page

  • There was a bug if you set items_per_page to 1 - I’m not sure this has been fixed so avoid that.
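The paging rules above can be sketched in Python like this. It only works out the query parameters for each page; the function name is invented, the page size of 100 is a guess at the limit as noted above, and total_count is the value from the first response (252 in the earlier example):

```python
def page_params(total_count: int, items_per_page: int):
    """Yield the start_index / items_per_page pair for each page of results."""
    if items_per_page < 2:
        # Avoids the reported problem with items_per_page=1.
        raise ValueError("use an items_per_page of 2 or more")
    # start_index begins at zero and is always a multiple of items_per_page.
    for start_index in range(0, total_count, items_per_page):
        yield {"start_index": start_index, "items_per_page": items_per_page}


pages = list(page_params(total_count=252, items_per_page=100))
for params in pages:
    print(params)
```

Each dictionary is sent as the query string of one Filing History List request, always with both parameters present.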

About the “single item” bug:

Good luck.

Thanks @voracityemail !
All up and running now for filing history.
I can download the documents including incorporation certificate.