Need to Page through advanced-search/companies, but fails after start_index 10000

I’ve been trying to figure out a good way to page through all company numbers. The best option I see is the Advanced Search API. However, it seems to throw an error once start_index reaches 10000. Is this expected behavior, or is it a bug?

command

curl -u 801932d9-9bdc-4428-b413-61ef40f0e791: "https://api.company-information.service.gov.uk/advanced-search/companies?start_index=10000&size=1"

result

{"timestamp":"2023-09-22T22:08:47.822+00:00","status":500,"error":"Internal Server Error","path":"/advanced-search/companies"}

Page through all company numbers?
The API is not intended for that sort of use.
We do have bulk products at Companies House that may be more suited to your needs.


Thanks for this - I’ve been wrestling with this for a little while (a note on the web pages saying that it will only let you look at the first 10,000 records would have saved me a couple of days of assuming it was me doing something wrong)…

However I’m a little confused as to what the intention of this method actually is, then. As you can’t access a company after the 10,000th in the list of whatever filter you are selecting using this method, what’s the use case it is designed for?

I’ve grabbed the bulk download file and put the 5.5 million or so records of active companies (i.e. not dissolved ones) into a table, but that is correct as at 1/3/24, and I’ve been asked by my colleagues in research (I work in a University) if I can get the balance up to today and maintain that. Looking for companies incorporated after 1/3/24 yields about 38,000 records…

So I’ll need to cycle through each date from 1st to today using that as a filter to grab the smaller number of daily records (hopefully within 10,000 each day) to get that full list.
Point being, the API will still deliver the same amount of data to me (so no ‘advantage’ to Companies House), but it just makes life harder and causes more API calls to be made.
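
For anyone attempting the same day-by-day approach, here is a rough Python sketch of the loop. It is only an illustration under assumptions: `fetch` is a hypothetical callable you would write yourself to perform the GET on /advanced-search/companies and return the decoded JSON, and the parameter names `incorporated_from` / `incorporated_to` and the `items` / `company_number` response fields are what I understand the Advanced Search API to use, so check them against the official reference. The 10,000 ceiling is the limit observed in this thread, not a documented constant.

```python
from datetime import date, timedelta

# Limit observed in this thread: start_index + size must stay within it.
MAX_WINDOW = 10_000


def daily_windows(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def collect_company_numbers(fetch, start, end, page_size=1000):
    """Query one incorporation date at a time and page within each day.

    `fetch(params)` is a hypothetical callable that performs the HTTP
    request and returns the decoded JSON body as a dict.
    """
    numbers = []
    for day in daily_windows(start, end):
        start_index = 0
        while start_index < MAX_WINDOW:
            # Never ask for a page that would cross the window limit.
            size = min(page_size, MAX_WINDOW - start_index)
            body = fetch({
                "incorporated_from": day.isoformat(),
                "incorporated_to": day.isoformat(),
                "start_index": start_index,
                "size": size,
            })
            items = body.get("items", [])
            numbers.extend(item["company_number"] for item in items)
            if len(items) < size:
                break  # short page: nothing more for this day
            start_index += len(items)
        # If a single day ever holds more than MAX_WINDOW companies, the
        # loop stops at the limit and that day's results are truncated.
    return numbers
```

Passing the HTTP call in as `fetch` keeps authentication, rate limiting and error handling (including however the API signals "no results") out of the paging logic.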

Just my musings having encountered this API for the first time recently.
Thanks


I’ve spent 2 days of work trying to understand why my pipeline was breaking, until I got to this same conclusion. Very strange…
I need to pull all companies with some given SIC codes, but I can’t? Why would there be a restriction on this? I can pull the bulk data, but I would like to have it more frequently than monthly, which is why I would use the advanced search endpoint.

I can’t comment on “why”, not being from Companies House (except that lots of people here clearly would otherwise just use the API to grab “all the data”, which presumably has a cost if it has to come through an API pipeline etc. - this API is free to the consumer…)

One possible way to do this - albeit you don’t get everything all at once - is to start using the Streaming API to get all updates (for companies, officers, PSCs etc). You can then apply those updates on top of the data available via the (intermittently updated) bulk data sets (again for companies, filings, PSCs and officers). If you needed it, presumably any extra info not in those could be requested via the Public Data API.
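
As a rough illustration of "bulk snapshot plus stream updates", here is a minimal Python sketch. It assumes the stream delivers one JSON event per line with `resource_id`, `data`, and an `event` object carrying `type` ("changed" or "deleted") and `timepoint`; that is my understanding of the Streaming API's event shape, so verify it against the official documentation. The `snapshot` dict keyed by company number is a hypothetical stand-in for whatever table you loaded the bulk file into.

```python
import json


def apply_company_events(snapshot, lines):
    """Apply newline-delimited JSON events to a {company_number: profile} dict.

    Returns the timepoint of the last event applied (or None), which can be
    stored and used to resume the stream later.
    """
    last_timepoint = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which a stream may send as keep-alives
        event = json.loads(line)
        number = event["resource_id"]
        if event["event"]["type"] == "deleted":
            snapshot.pop(number, None)
        else:
            # Treat any other event type as an upsert of the full record.
            snapshot[number] = event["data"]
        last_timepoint = event["event"]["timepoint"]
    return last_timepoint
```

Events have to be applied in the order they arrive, since a later event for the same company replaces the earlier record.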

More work for the consumer but OTOH the data is indeed “free beer” which no doubt people get some commercial value from.

And back in the day you had to request individual filings - via web site, no API - and pay for them! After some time a (chargeable) XML gateway was created - in fact you can still pay if you want and consume the data available via that. Horses for courses…

Good luck.


I believe the 10,000 limit in advanced search is due to a limitation of Elasticsearch, which is used by Companies House.
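
For context, Elasticsearch rejects any search where `from + size` exceeds the index setting `index.max_result_window`, which defaults to 10,000. Assuming Companies House has left that default in place and that start_index and size map directly onto `from` and `size` (both assumptions on my part), the arithmetic matches the failing request at the top of the thread:

```python
# Elasticsearch default for index.max_result_window.
MAX_RESULT_WINDOW = 10_000


def within_window(start_index, size, limit=MAX_RESULT_WINDOW):
    """Return True if a page of `size` results starting at `start_index` fits."""
    return start_index + size <= limit


print(within_window(9_999, 1))   # last single-result page that fits
print(within_window(10_000, 1))  # the request from the original post
```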
