Hi everyone
I'm looking for advice based on your experience of migrating from the REST API (Companies House Public Data API) to the Streaming API.
To provide some context:
- the service has its own local database which contains a subset of companies: those that were requested ad hoc (by CRN) and cached for 30 days for future use, to minimise requests made to the API
- it doesn’t scale due to rate limiting imposed on the REST API
- I need to migrate to Streaming API whereby the cached data is updated as a result of events provided by streams (straightforward, at least in theory)
- before utilising the Streaming API, I need to make sure that the current database state is up to date, so that I don't miss any changes whose events are no longer available via the Streaming API
That last point is rather problematic. The only viable way I can think of is going over the cached data and, for anything that's older than 7 days, re-fetching the entire state, that is: the company profile itself, PSCs, officers, appointments, officer disqualifications, filings.
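As a rough sketch of that selection step, assuming the cache keeps a last-fetched timestamp per CRN (the `cache` mapping and the `stale_crns` name are hypothetical, just for illustration):

```python
from datetime import datetime, timedelta, timezone


def stale_crns(cache, now, max_age=timedelta(days=7)):
    """Return the CRNs whose cached copy was fetched more than max_age ago.

    `cache` is assumed to map CRN -> timezone-aware datetime of the last fetch.
    """
    cutoff = now - max_age
    return [crn for crn, fetched_at in cache.items() if fetched_at < cutoff]


# Example with made-up CRNs and timestamps.
now = datetime(2024, 1, 20, tzinfo=timezone.utc)
cache = {
    "00000001": datetime(2024, 1, 18, tzinfo=timezone.utc),  # 2 days old
    "00000002": datetime(2024, 1, 5, tzinfo=timezone.utc),   # 15 days old
}
print(stale_crns(cache, now))
```

The 7-day cut-off is my own safety margin under the ~10 days I believe the streams cover, not a documented limit.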
As you can see already, that's a lot of requests to make to the REST API. Out of 115,000 companies in the dataset, 70% of the cached records are older than 7 days, so there is no guarantee that changes made in the meantime are still available via the Streaming API (this assumption is based on several threads where I read that streams go back roughly 10 days).
This is still viable, but seems like a waste of time, effort and resources. Even in the best-case scenario, I'd need to make 80,000 (companies) * 8 requests per company (company profile, charges, filings, PSCs, officers [assuming 1 officer], appointments [1 officer = 1 request], disqualifications [1 officer = 1 request] and insolvency), which amounts to 640,000 requests (!). Considering that the rate limit for the API is 600 requests per 5 minutes, it'd take ~90 hours, almost 4 days. And that's the best case.
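For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope calculation. It assumes the limit is consumed perfectly (no retries, no idle time), and the function name is just illustrative:

```python
import math


def backfill_hours(companies, requests_per_company, limit=600, window_seconds=300):
    """Estimate the hours needed to backfill at `limit` requests per window."""
    total_requests = companies * requests_per_company
    # Each 5-minute window allows at most `limit` requests.
    windows = math.ceil(total_requests / limit)
    return windows * window_seconds / 3600


hours = backfill_hours(80_000, 8)
print(round(hours, 1), "hours,", round(hours / 24, 1), "days")
```

Any company with more than one officer pushes the number up, since appointments and disqualifications are one request per officer.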
I’m therefore looking for hints, or outright enlightenment, as to how to handle this in a more efficient, less time-consuming manner.
Things I’ve looked at so far are:
- Company data product for companies; it can’t be fully converted to match the API payload due to missing data points (iirc 2 or 3 are missing from the CSV)
- People with significant control (PSC) data product for PSCs; same story as the above
- Using company profile flags (`has_charges`, `has_insolvency_history`, `can_file` and potentially others) to avoid retrieving charges, insolvency, filings and others if the flag is false. I’m yet to confirm whether these flags match the assumptions I’m making here.
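To make the flag idea concrete, this is roughly what I have in mind. The flag-to-resource mapping is my unverified assumption (in particular, I'm not sure `can_file` says anything about existing filing history), and the resource names are just labels I made up for this sketch:

```python
# Resources fetched regardless of flags (labels are illustrative).
ALWAYS_FETCH = ("officers", "pscs")

# ASSUMPTION: a False flag means the resource is empty and can be skipped.
FLAG_GATED = {
    "has_charges": "charges",
    "has_insolvency_history": "insolvency",
    "can_file": "filings",
}


def resources_to_fetch(profile: dict) -> list[str]:
    """Decide which follow-up resources to request for a company profile."""
    resources = list(ALWAYS_FETCH)
    for flag, resource in FLAG_GATED.items():
        # Skip only on an explicit False; a missing flag is treated as "fetch".
        if profile.get(flag) is not False:
            resources.append(resource)
    return resources


profile = {"has_charges": False, "has_insolvency_history": False, "can_file": True}
print(resources_to_fetch(profile))
```

Treating a missing flag as "fetch" keeps this on the safe side: at worst I make a request I didn't need.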