500 error advanced-search using start_index

ryan.bentham · May 18, 2022, 10:41am

I’m trying to run this query:

https://api.company-information.service.gov.uk/advanced-search/companies?company_status=active&size=5000&start_index=10000

I assumed this would effectively provide the 3rd page of results. Previously I ran this with a start_index of 0 and 5000 with no issues.

Am I doing this wrong?

marcell.toth · May 30, 2022, 9:25am

Ran into the very same issue, thanks for starting the thread, @ryan.bentham. Tagging @MArkWilliams as I see him replying to many of the posts.

I am using the advanced search for checking daily dissolutions, running:
https://api.company-information.service.gov.uk/advanced-search/companies?dissolved_from=2022-01-04&dissolved_to=2022-01-04

From the response, I can see that 21408 companies were struck off the register that day. Now I execute the very same query with &size=5000 and &start_index=0, 5000, 10000, 15000, 20000 to get chunks of data. The first two cases, 0 and 5000, work as intended, but the remaining ones give “Failed with error code: 500 | Reason: Internal Server Error”.

Any help is much appreciated!

ryan.bentham · May 30, 2022, 9:43am

Hopefully someone can shed some light on this. I had to cancel the project I was working on as I just couldn’t seem to get the data I needed (unless we’re approaching it the wrong way).

voracityemail · May 30, 2022, 9:57am

I don’t know, but I suspect it depends what you actually want to do (e.g. what data you need and why). For their normal search there used to be much lower limits on a) how much data you could request and b) the maximum number of results. (Search this forum for details). I think that then threw an error (EDIT see post here):

Enquiries to Companies House got the reply that “this system is not designed for extracting large chunks of data but for more targetted searches”. They were very keen to spread the load over time and avoid anything that looked like scraping the dataset via the API. (Part of the reason for the streaming API I think). For use cases needing a lot of data like “I want to find all companies of type x” they have directed people to use the bulk data. (Some available here, some datasets e.g. officers available by making a request on one of the threads on this forum).

It seems like they’ve raised the limit - at least for the advanced search. However I’d be tempted to set the start_index to just below 5000 and creep over it in smaller increments, just to see if a hardwired limit was indeed the issue.

Could be a bug of course but seems an unlikely number.

marcell.toth · June 1, 2022, 10:15am

@voracityemail , thanks for the reply!

I went through these; the limit is 10K, and it is hardwired. If x is the query size and y is the start index, then x + y <= 10000 must hold.
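
To make that concrete, here is a minimal Python sketch of the constraint, assuming the 10,000 cap observed above holds for every query (it comes from trial and error in this thread, not from official documentation). `plan_pages` and its constants are hypothetical names. The function only works out which start_index/size pairs stay within the cap and makes no API calls.

```python
# Both limits are assumptions taken from this thread, not documented values.
MAX_WINDOW = 10_000  # observed cap on start_index + size
MAX_SIZE = 5_000     # largest page size used in this thread


def plan_pages(total_hits, size=MAX_SIZE, window=MAX_WINDOW):
    """Return (start_index, size) pairs that satisfy start_index + size <= window."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Results beyond the window cannot be reached by paging alone.
    reachable = min(total_hits, window)
    pages = []
    start = 0
    while start < reachable:
        pages.append((start, min(size, reachable - start)))
        start += size
    return pages


pages = plan_pages(21408)
unreachable = 21408 - sum(n for _, n in pages)
print(pages, unreachable)
```

For the 21408 dissolutions above, this yields only the pages starting at 0 and 5000, which leaves 11408 results that paging alone cannot reach.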

Currently, I am “hacking” my way around using incorporation dates to fragment the response size, but I think that’s hardly an acceptable solution.
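
A rough sketch of that workaround in Python, assuming the hit count can be read from each response: keep bisecting the incorporation date range until every piece is under the cap. `split_range` is a hypothetical helper, and `count_hits` is a hypothetical callable you would supply yourself (it should run the query for the given date range with the other filters held fixed and return the number of hits). No real API call is made here.

```python
from datetime import date, timedelta


def split_range(start, end, count_hits, limit=10_000):
    """Bisect [start, end] into (start, end, hits) ranges with hits <= limit.

    A single day that still exceeds the limit is returned as-is, because
    it cannot be split any further by date.
    """
    hits = count_hits(start, end)
    if hits <= limit or start == end:
        return [(start, end, hits)]
    mid = start + timedelta(days=(end - start).days // 2)
    left = split_range(start, mid, count_hits, limit)
    right = split_range(mid + timedelta(days=1), end, count_hits, limit)
    return left + right


# Stand-in counter for illustration only: pretends there are 100 hits per day.
def fake_count(start, end):
    return 100 * ((end - start).days + 1)


ranges = split_range(date(2022, 1, 1), date(2022, 1, 8), fake_count, limit=250)
for r in ranges:
    print(r)
```

Each resulting range can then be paged normally with size and start_index. The cost is at least one extra request per split, which matters given how keen Companies House is to limit load.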

MArkWilliams · June 1, 2022, 12:20pm

Have you considered using our free data snapshot product?
http://download.companieshouse.gov.uk/en_output.html

marcell.toth · June 2, 2022, 9:14am

Thank you, Mark, for replying.

Yes, I have already checked and downloaded all the free data products (companies, accounts, psc). I even went for a bonus round to get some of the historical snapshot files (companies, psc) from the National Archives. Regarding the opening question, I think this might be a solution, @ryan.bentham.

However, for dissolutions, I don’t see a clear-cut way to get them from the companies data product. Can you elaborate?