/advanced-search/companies responds with 500 after 10000 items

Hey folks!

thanks for the great service and API!

we’re trying to get a local copy of the UK companies by crawling the items from the advanced-search/companies endpoint, 5000 per page with something similar to:

https --auth MY_KEY: https://api.company-information.service.gov.uk/advanced-search/companies start_index==10000 size==5000

e.g.
https://capture.dropbox.com/qWEOsW4RepiyhA6w?src=ss

but the 3rd page always returns 500 Internal Server Error

we’re using a live api key

can you guide me if i’m not using the correct URL or if my assumptions are wrong for this endpoint?

why does this endpoint return only 10000 results?

p.s. as you can see we’re using the endpoint without any parameters other than size and start_index

thanks in advance
atanas

I don’t have the answer to your specific question, but in general this service isn’t designed for bulk downloads/crawling/scraping etc. I don’t have the link on me but I suspect it may be against the terms-of-use of the api. I’m not sure if the advanced search API is actually officially live yet either so there may be more documentation to come.

Companies House has some bulk data products and resources that may be a lot more appropriate for what you’re trying to do: Companies House

Thanks @ash for the prompt reply

i’ll give the bulk data product a shot, but it would still be great if i could scrape it through the API (the bulk data only updates monthly, and an update might need to trigger other API calls for officers etc., or some other logic around it)

thanks!
atanas

It’s better to choose a filter to batch on and reduce the size of each batch. A good approach would be to limit the incorporation date to a range, and narrow the range if more than 200 results are returned.

https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/search/advanced-company-search
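A minimal sketch of that date-range approach, in case it helps. It keeps halving an incorporation-date range until each piece matches no more than a chosen limit. The function names are made up for illustration. The query parameters incorporated_from, incorporated_to and size, and the hits field in the response, are assumptions to check against the reference linked above. The limit is whatever batch size you settle on.

```python
import base64
import json
import urllib.parse
import urllib.request
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple

API_URL = "https://api.company-information.service.gov.uk/advanced-search/companies"


def split_date_ranges(
    start: date,
    end: date,
    count_hits: Callable[[date, date], int],
    limit: int,
) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (from, to) ranges that each match at most `limit` companies."""
    hits = count_hits(start, end)
    if hits == 0:
        return
    if hits <= limit or start == end:
        # A single day cannot be narrowed by date any further, so it is
        # yielded as-is even if it still exceeds the limit.
        yield (start, end)
        return
    # Halve the range and recurse into each half.
    mid = start + (end - start) // 2
    yield from split_date_ranges(start, mid, count_hits, limit)
    yield from split_date_ranges(mid + timedelta(days=1), end, count_hits, limit)


def count_hits_via_api(api_key: str, start: date, end: date) -> int:
    """Hypothetical helper: ask the API how many companies fall in a range.

    Assumes HTTP basic auth with the key as username and an empty password,
    and a JSON response carrying a total count in a "hits" field.
    """
    query = urllib.parse.urlencode({
        "incorporated_from": start.isoformat(),
        "incorporated_to": end.isoformat(),
        "size": 1,  # only the total is needed here, not the items
    })
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return int(json.load(response)["hits"])
```

To use it, pass something like functools.partial(count_hits_via_api, "MY_KEY") as count_hits, then fetch each yielded range separately. The counting calls are extra requests, so they count against the rate limit too.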

This is publicly-available data and its consumption should not be limited, though rate-limiting is understandable for resource constraint reasons.

No API should EVER return 500 Internal Server Error.

To get your own copy of our data please use the bulk products and the streaming API; that’s what they are for. The API is not for that purpose.

Hey, thanks a lot for the replies

i can see Mark Williams’s point, but my opinion is closer to David Bond’s. i believe that by design an API should never return 500, and that if it is used within the rate limits we should be able to scroll through the full database, not just a part of it

thanks
atanas

It’s not an error. I ran into the same issue and am now having to rebuild my logic to suit the API.

The issue here is that your query matches, let’s say, 20,000 results. The API request returns the first 5,000 of the 20,000, and your start_index applies to the 5,000 results that the API returns, not to the entire 20,000 matches.

You are basically telling the API to start your results from number 10,000 when only 5,000 results are returned, thus the 500 error.

What you should be doing is making sure that your query returns 5,000 results or fewer, and not using start_index, as it’s only the starting point within your returned results and not within the total results that match your parameters.

Hope this helped