Searching Using Curl returns only 20 results a Page

shaun_faulkner · November 9, 2017, 10:08am

When using Curl to search, I only get 20 results per page. How can I get more per page and/or iterate through the pages? I can’t see any clear explanation of how to index through.

Thanks in advance

Shaun

voracityemail · November 9, 2017, 5:07pm

Documentation is at https://developer.companieshouse.gov.uk/api/docs/search/search.html

You need something like (e.g. for “lloyds”):

curl -u{APIkey} "https://api.companieshouse.gov.uk/search?q=lloyds&items_per_page=30&start_index=0"

You obviously need your own API key - which should end in a colon - instead of {APIkey} here.
Check the “items_per_page” field in the response to ensure you did get back 30 results (or count 'em!).
To get the next 30:

curl -u{APIkey} "https://api.companieshouse.gov.uk/search?q=lloyds&items_per_page=30&start_index=30"

You’ll also probably be interested in the “total_results” field.
(By the way - if you are just interested in companies or officers, you can limit the search to these by using /search/companies or /search/officers instead of /search).
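As a rough illustration of the paging above: the start_index for each successive request can be worked out from “total_results”. This is only a sketch in Python (the helper name page_starts is made up for the example, it is not part of any CH library):

```python
def page_starts(total_results, items_per_page):
    """Return the start_index value to use for each page of a result set."""
    if items_per_page <= 0:
        raise ValueError("items_per_page must be positive")
    # 0, items_per_page, 2 * items_per_page, ... up to (not including) total_results
    return list(range(0, total_results, items_per_page))


# 75 results at 30 per page need three requests.
print(page_starts(75, 30))  # [0, 30, 60]
```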

The general syntax is as per a standard RESTful API:

curl -u{APIKEY} "{restURI}"

(Obviously see curl docs if you e.g. need to get the http header instead etc.)
Where (to spell it out):

{APIKEY} is your API username followed by a colon.
The format used by cURL is actually username:password but CH just give you a username and no password.
{restURI} - note you’ll want to enclose this in quotes for windows command line / unix shells.
This is:
The URI for the appropriate end point. The first part is either:
For main API - https://api.companieshouse.gov.uk/
For Document API - http://document-api.companieshouse.gov.uk/
So for search (all), you want https://api.companieshouse.gov.uk/search
Search takes the following query parameters:
search?q={term}&items_per_page={ipp}&start_index={start}
{term} is the string you’re searching for (obviously, if that includes URI “special characters” like “?”, “&” etc. these need percent encoding)
{ipp} is the number of items you want back (note - this doesn’t guarantee you’ll get as many as this).
{start} is the item in your search results to start with (zero-based I believe).
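If you build the request in code rather than at the command line, the same pieces can be put together roughly like this. It is a sketch using only the Python standard library; the function names are invented for the example, and it assumes curl's -u option is sending ordinary HTTP Basic authentication (its default) with the key as username and an empty password:

```python
import base64
from urllib.parse import urlencode

BASE_URL = "https://api.companieshouse.gov.uk"  # main API host quoted above


def build_search_url(term, items_per_page=20, start_index=0, path="/search"):
    """Build a search URL; urlencode() percent-encodes special characters in the term."""
    query = urlencode(
        {"q": term, "items_per_page": items_per_page, "start_index": start_index}
    )
    return f"{BASE_URL}{path}?{query}"


def basic_auth_header(api_key):
    """Authorization header value: API key as username, empty password."""
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"


# The "&" inside the search term is encoded, so it cannot be confused
# with the "&" that separates query parameters.
print(build_search_url("marks & spencer", items_per_page=30))
```

Doing the encoding in code also sidesteps the shell quoting issue, because the URL never passes through a command line.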

Many CH endpoints use start_index and items_per_page to step through potentially large data sets - a quick search on this forum will show you various ways to do this.

You’ll also find out about limits in some cases which are not currently documented in the main documentation.

shaun_faulkner · November 9, 2017, 8:10pm

Thanks for the responses guys, let me show you what I have done, because it still seems to return the same information:

curl -u myKey: https://api.companieshouse.gov.uk/search?q=searchTerm&items_per_page=20&start_index=0

Then using the number of items returned to define the start_index value I use the following:

curl -u myKey: https://api.companieshouse.gov.uk/search?q=searchTerm&items_per_page=20&start_index=20

I seem to get the same information?

Thanks in advance

shaun_faulkner · November 9, 2017, 9:03pm

Ah I got what I need now…

I used Python with urllib3 instead, together with the start_index, but the example shown above using curl did not work for me.

Thanks for the help both

voracityemail · November 10, 2017, 12:35pm

Good you’ve got it working.

When using curl, did you enclose the https://… part in quotes? Your example doesn’t show any.

(If you did so then the info below won’t apply. I don’t know what’s wrong but for help post exactly which command you issued to curl [without your API key details obviously] and the response you received.)

That might be the reason why you got the same information using a different “start_index”. The Windows command line and many Linux shells will split the command you gave above into 3, at each “&” character.

So if you run (using “tesco” as search term):

curl -u myKey: https://api.companieshouse.gov.uk/search?q=tesco&items_per_page=20&start_index=0

This is interpreted as 3 commands:

https://api.companieshouse.gov.uk/search?q=tesco
items_per_page=20
start_index=0

The first will happily give the first page of results (as if you’d set “start_index=0”).

When you then run:

curl -u myKey: https://api.companieshouse.gov.uk/search?q=tesco&items_per_page=20&start_index=20

…again the shell / command line will split this up and you’ll get the same command as you had before sent via curl:

https://api.companieshouse.gov.uk/search?q=tesco
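To see what actually reaches the server in each case, here is a small Python sketch. It does not run a real shell - it just assumes (as a simplification) that everything from the first unquoted “&” onwards is cut off before curl sees it. The function name params_sent is made up for the example:

```python
from urllib.parse import parse_qs, urlsplit


def params_sent(command_url, quoted):
    """Return the query parameters curl would send for a URL typed at a prompt.

    Simplification: an unquoted "&" ends the curl command, so only the text
    before the first "&" is passed to curl.
    """
    url = command_url if quoted else command_url.split("&", 1)[0]
    return parse_qs(urlsplit(url).query)


url = "https://api.companieshouse.gov.uk/search?q=tesco&items_per_page=20&start_index=20"
print(params_sent(url, quoted=False))  # {'q': ['tesco']}
print(params_sent(url, quoted=True))
# {'q': ['tesco'], 'items_per_page': ['20'], 'start_index': ['20']}
```

Without quotes the request is identical whatever start_index you typed, which matches getting the same page back every time.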


shaun_faulkner · November 12, 2017, 9:43am

Good man @voracityemail that was exactly the issue, I tried again using double quotes as you suggested and then the indexing worked.

Thanks for that, it might come in handy again. I am new to Curl, and in the end I resorted to using Python’s urllib3, but it’s still nice to know for future reference.

carmen_aguilar_garci · November 12, 2018, 9:49am

Hi!
I am also using the start_index and the items_per_page to iterate through the pages. However, I get an error when I set the start index to 901 or above (it works with 900 but not with 901). Is there a limit, or am I doing something wrong? And how can I get the rest of the results?
Thanks

MArkWilliams · November 12, 2018, 10:15am

Search is tuned to ‘find’ a specific company name, it is not intended to be used to get all company names, we have bulk products for that.
If your search returns too many results, you need to make your search term more specific.

carmen_aguilar_garci · November 12, 2018, 11:39am

Thanks for the answer, Mark.
I have checked the bulk products and found the company data product. However, the dataset includes only “live companies” and not those which have been dissolved recently. Is there any other product where I can get the complete database, to narrow down the companies that I am interested in?

Regarding making the term more specific, how can I do it? When I include more than one term in the search, it returns even more results (for both terms separately, and not only results which include both terms at the same time).

Thanks

MArkWilliams · November 12, 2018, 1:50pm

Can you provide a company name that you are searching for that you cannot find?

carmen_aguilar_garci · November 12, 2018, 2:57pm

So, for instance, this company “BITCOIN ALLIANCE LTD” is not in the dataset for “live companies” available for download, because it was dissolved in February.

Regarding the search terms, I try to make my search more specific but, for instance, if I write q=virtual+coin or q=%22+virtual+coin+%22 I get more results than using only “coin”.
Thanks

MArkWilliams · November 12, 2018, 3:19pm

When I search the API for BITCOIN ALLIANCE LTD, it is the first one returned, an exact match.
GET /search/companies?q=BITCOIN+ALLIANCE+LTD

There is no company called virtual coin, so you will not find it.

carmen_aguilar_garci · November 12, 2018, 4:04pm

Sorry, I didn’t explain well.

My aim is to get the companies that have been dissolved in the last two years. The type of companies that I am looking at are those related to bitcoins. There is no catalogue of which companies are related to that business, so I am searching in Companies House for those which have bitcoin, crypto, virtual coin… in the name (title), not for specific companies.

I know that Bitcoin Alliance LTD is in the API, but it is not in the bulk data available to download here: Companies House. That is why I am using the API to get the companies that can be related to bitcoin, and later filter by the “dissolved” ones. But I cannot get more than 1000 results. I tried to be more specific with the terms of the search, but it is giving me even more results.

Is it possible to produce a csv with the companies dissolved in the last two years, as there is one for the live companies? Or, is there an alternative way of getting this information?
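For reference, the “filter by dissolved” step described above could look something like the sketch below. It assumes each search item carries “company_status” and “date_of_cessation” (an ISO date string) - check these field names against real responses. The function name and the two sample records are made up for illustration:

```python
from datetime import date


def recently_dissolved(items, since):
    """Keep items marked dissolved whose cessation date is on or after `since`."""
    kept = []
    for item in items:
        ceased = item.get("date_of_cessation")  # assumed ISO format, e.g. "2018-02-13"
        if item.get("company_status") == "dissolved" and ceased:
            if date.fromisoformat(ceased) >= since:
                kept.append(item)
    return kept


# Made-up records, shaped like the assumed search items.
sample = [
    {"title": "EXAMPLE COIN A LTD", "company_status": "dissolved", "date_of_cessation": "2018-02-13"},
    {"title": "EXAMPLE COIN B LTD", "company_status": "active"},
]
print([c["title"] for c in recently_dissolved(sample, since=date(2016, 11, 12))])
```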

MArkWilliams · November 12, 2018, 6:06pm

The short answer is that the API search is not intended to do what you are trying to do.
There is no bulk product of dissolved companies either.

carmen_aguilar_garci · November 13, 2018, 9:50am

Okis,
Thanks for all the help!

MArkWilliams · November 13, 2018, 10:02am

There is another product called the DVD ROM product that has 20 years of dissolved companies data on it, but it is chargeable. Details can be found at “About our services - Companies House - GOV.UK”.

Sam76 · December 7, 2018, 9:09am

What is the maximum number of results that the CH API returns?

voracityemail · December 7, 2018, 10:19am

I’ll reply here so your main queries get a chance for Companies House / someone more knowledgeable to pick up.

maximum number of results that the CH API returns

Queries without an items_per_page but which still return a list (e.g. company insolvency information, company exemptions, company registers): all the data as far as I’m aware (presumably never going to be very extensive).

Assuming (for lists) by “maximum number of results” you mean “maximum number of items returned in one request”

I had thought CH deliberately didn’t make any promises here but I found this thread:
Data capped limit
… says “100” - and this is what I’d found experimentally e.g. anything above “items_per_page=100” just returns 100 results. You could always request 500 and see what you get…
if you don’t specify “items_per_page” the default seems to be 20.

If you mean “is there a maximum number of results I can get (with multiple api calls)”:

For searches e.g. search, search/companies, search/officers, search/disqualified-officers there’s a limit of 300. See:
Search company officers returns HTTP 416 when start_index over 300
For other calls, I don’t think there is. Example - very long filing history lists, companies with 2000 partners (OC305357) - if you’re prepared to page through results eventually you should get them all.
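Putting the points above together, a paging loop might look roughly like this in Python. It is a sketch, not CH's recommended method: fetch_page is a hypothetical function you would write yourself (e.g. with urllib3) that returns the decoded JSON for one request, and max_items is just a way to stop before whatever search limit applies to your endpoint:

```python
def fetch_all(fetch_page, items_per_page=100, max_items=None):
    """Collect items page by page using start_index / items_per_page."""
    items = []
    start_index = 0
    while max_items is None or start_index < max_items:
        page = fetch_page(start_index, items_per_page)
        batch = page.get("items", [])
        if not batch:
            break  # empty page: nothing more to read
        items.extend(batch)
        # Step by what was actually returned, since the server may cap the page size.
        start_index += len(batch)
        total = page.get("total_results")
        if total is not None and start_index >= total:
            break
    return items if max_items is None else items[:max_items]


# Stand-in for a real request: 250 fake items, at most 100 returned per call.
FAKE_ITEMS = list(range(250))


def fake_fetch_page(start_index, items_per_page):
    size = min(items_per_page, 100)
    return {
        "items": FAKE_ITEMS[start_index:start_index + size],
        "total_results": len(FAKE_ITEMS),
    }


print(len(fetch_all(fake_fetch_page, items_per_page=500)))  # 250
```

Stepping by the number of items actually returned (rather than the number requested) means the loop still works if you ask for 500 and only get 100 back.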

Looking at your questions in general you seem to be saying “could the API be changed so I can efficiently replicate the data and keep updated with changes on a notification basis?”

CH repeatedly state that this is not their remit when creating the API. Providing more granularity in the way of searching (nationality) may be down to your own implementation. However:

They provide the company, PSC datasets and accounts (if that interests you) as bulk data. You can also sign up on the forum for officer appointments as bulk data. It seems there’s no bulk dissolved companies data / disqualified officers although some people who post here offer these services themselves. Caveats - the company bulk data set is only updated monthly, the format doesn’t match the API and it doesn’t contain all data that you get e.g. with Company Profile in the API.
They have plans - trailed now for a couple of years, search the forum - for a “streaming API” which sounds like it would give you the required updates on changes.
Probably not useful for your needs but you can sign up to follow companies via email updates when they make a filing.

Aside from CH there are some services mentioned on the forum which may provide you with additional functionality - search around!

Edit: overview of bulk data products - see:

There’s also a DVD of ex-companies (for sale):

Found out about this in the following thread: