/advanced-search/companies responds with 500 after 10000 items

Hey folks!

thanks for the great service and API!

we’re trying to get a local copy of the UK companies by crawling the items from the advanced-search/companies endpoint, 5000 per page with something similar to:

https --auth MY_KEY: https://api.company-information.service.gov.uk/advanced-search/companies start_index==10000 size==5000

e.g.
https://capture.dropbox.com/qWEOsW4RepiyhA6w?src=ss

but the 3rd page always returns 500 Internal Server Error

we’re using a live api key

can you guide me if i’m not using the correct URL or if my assumptions are wrong for this endpoint?

why does this endpoint return only 10000 results?

p.s. as you can see we’re using the endpoint without any other parameters except size and start_index

thanks in advance
atanas

I don’t have the answer to your specific question, but in general this service isn’t designed for bulk downloads/crawling/scraping etc. I don’t have the link on me but I suspect it may be against the terms-of-use of the api. I’m not sure if the advanced search API is actually officially live yet either so there may be more documentation to come.

Companies House has some bulk data products and resources that may be a lot more appropriate for what you’re trying to do: Companies House


Thanks @ash for the prompt reply

i’ll give the bulk data product a shot, but it would still be great if i could scrape it through the API (the bulk data only updates monthly, and an update could trigger other API calls for officers, etc., or some other logic around it)

thanks!
atanas

Better to choose a filter that splits the data into batches and to reduce the size of each batch. A good approach would be to limit the incorporation date to a range and shrink the range if more than 200 results are returned.

https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/search/advanced-company-search
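A rough sketch of that date-range idea in Python, in case it helps. This is not anything official, only one way it could be done. The names `split_date_ranges`, `count_hits` and `max_hits` are made up for illustration: `count_hits(start, end)` stands for whatever call you use to ask the search how many companies were incorporated in that inclusive range (that call is an assumption and is not shown here), and `max_hits` is whatever batch limit you settle on.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple


def split_date_ranges(
    start: date,
    end: date,
    count_hits: Callable[[date, date], int],  # hypothetical: returns the hit count for a range
    max_hits: int,                            # hypothetical: the batch limit you choose
) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (start, end) windows holding at most max_hits results."""
    hits = count_hits(start, end)
    if hits == 0:
        # Nothing was incorporated in this window, so skip it.
        return
    if hits <= max_hits or start == end:
        # Small enough, or a single day that cannot be halved any further.
        # An oversized single day would need some other filter to split it.
        yield (start, end)
        return
    # Too many results: halve the window and try each half separately.
    mid = start + (end - start) // 2
    yield from split_date_ranges(start, mid, count_hits, max_hits)
    yield from split_date_ranges(mid + timedelta(days=1), end, count_hits, max_hits)
```

Each window that comes out can then be fetched on its own with an incorporation-date filter. Note that every call to `count_hits` would be a request, so this still has to stay within the rate limits.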

This is publicly-available data and its consumption should not be limited, though rate-limiting is understandable for resource constraint reasons.

No API should EVER return 500 Internal Server Error.

To get your own copy of our data please use the bulk products and the streaming API; that's what they are for. The API is not for that purpose.

Hey, thanks a lot for the replies

i can see Mark Williams's point, but my opinion is closer to David Bond's. i believe that by design an API should never return 500, and if it is used within the rate limits we should be able to scroll through the full database, not just a part of it

thanks
atanas