Migrating to Streaming API

p.zajdler · April 25, 2024, 4:25pm

Hi everyone

I came here looking for advice based on your experiences of migrating from the REST API (Companies House Public Data API) to the Streaming API.

To provide some context:

the service has its own local database which contains a subset of companies - those that were requested ad hoc (by CRN) and cached for 30 days to minimise requests made to the API
it doesn’t scale due to rate limiting imposed on the REST API
I need to migrate to Streaming API whereby the cached data is updated as a result of events provided by streams (straightforward, at least in theory)
before utilising the Streaming API, I need to make sure that the current database state is up to date - otherwise I would miss any events that are no longer available via the streams

The latter is rather problematic. The only viable way I can think of is going over the cached data, and for anything that’s older than 7 days, re-fetching the entire state, that is: the company profile itself, PSCs, officers, appointments, officer disqualifications, filings.
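To make the selection step concrete, here is a minimal sketch of picking out the cached companies that are older than the 7-day cut-off. The `cache` mapping (CRN to last fetch time) and the function name are hypothetical; the real service presumably does this with a database query.

```python
from datetime import datetime, timedelta, timezone

# Assumed cut-off: anything cached before this may have missed events
# that the streams can no longer replay.
STALE_AFTER = timedelta(days=7)


def stale_company_numbers(cache, now):
    """Return the CRNs whose cached copy is older than STALE_AFTER.

    `cache` is a hypothetical mapping of CRN -> datetime of the last fetch.
    """
    cutoff = now - STALE_AFTER
    return sorted(crn for crn, fetched_at in cache.items() if fetched_at < cutoff)


now = datetime(2024, 4, 25, tzinfo=timezone.utc)
cache = {
    "00000001": now - timedelta(days=2),   # fresh enough, streams should cover it
    "00000002": now - timedelta(days=12),  # stale, needs a full re-fetch
    "00000003": now - timedelta(days=29),  # stale, needs a full re-fetch
}
print(stale_company_numbers(cache, now))
```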

As you can see already, that’s a lot of requests to make to the REST API. Out of 115 000 companies in the dataset, 70% of the cache is older than 7 days, so there is no guarantee that changes made in the meantime will still be available via the Streaming API (this assumption is based on several threads where I read that streams go back roughly 10 days).

This is still viable, but seems like a waste of time, effort and resources. Even in the best case scenario, I’d need to make 80 000 (companies) * 8 (requests: company profile, charges, filings, PSCs, officers [assuming 1 officer], appointments [1 officer = 1 request], disqualifications [1 officer = 1 request] and insolvency) = 640 000 requests (!). Considering that the rate limit for the API is 600 requests per 5 minutes, it’d take ~90 hours, or almost 4 days. In the best case scenario.
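For anyone wanting to redo the estimate with their own numbers, the arithmetic can be sketched as below. The function name is made up; the 600 requests per 5 minutes limit and the 80 000 * 8 figures are the ones quoted in this post, and the sketch assumes the limit is fully used with no retries or pagination.

```python
def backfill_estimate(companies, requests_per_company, limit=600, window_seconds=300):
    """Return (total requests, hours needed) when running flat out at the rate limit."""
    total_requests = companies * requests_per_company
    seconds = total_requests / limit * window_seconds
    return total_requests, seconds / 3600


total, hours = backfill_estimate(80_000, 8)
print(total, round(hours, 1))  # 640000 88.9
```

Every endpoint that can be skipped or sourced from a bulk file removes 80 000 requests, which is about 11 hours at this rate.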

I’m therefore looking for hints, or outright enlightenment, as to how to handle this in a more efficient, less time-consuming manner.

Things I’ve looked at so far are:

Company data product for companies; it can’t be fully converted to match the API payload due to missing data points (iirc 2 or 3 are missing from the CSV)
People with significant control (PSC) data product for PSCs; same story as the above
Using company profile flags (has_charges, has_insolvency_history, can_file and potentially others) to avoid retrieving charges, insolvency, filings and other resources if the corresponding flag is false. I’m yet to confirm whether these flags match the assumptions I’m making here.
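The flag idea could look roughly like the sketch below. It rests on the same unconfirmed assumption, namely that a false `has_charges` or `has_insolvency_history` means the matching resource is empty. The resource labels and the function name are hypothetical and are not API paths. A missing flag is treated as "fetch anyway" to stay on the safe side, and `can_file` is left out because it is not yet confirmed that it says anything about filing history.

```python
def resources_to_fetch(profile):
    """Decide which follow-up resources to request for one company profile.

    `profile` is the company profile JSON as a dict. Labels are illustrative.
    """
    resources = ["pscs", "officers", "filings"]  # always fetched in this sketch
    # Skip a resource only when the flag is present and explicitly false.
    if profile.get("has_charges", True):
        resources.append("charges")
    if profile.get("has_insolvency_history", True):
        resources.append("insolvency")
    return resources


print(resources_to_fetch({"has_charges": False, "has_insolvency_history": True}))
```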

ebrian101 · April 26, 2024, 9:30am

As far as I can tell, the PSC bulk data product contains identical information to that found in the streaming or REST API. Which fields do you think are missing?

p.zajdler · April 26, 2024, 1:09pm

Thanks for the reply!

I double checked because I thought that too for a long time and thought that perhaps I missed something.

I did miss something - the fact that JSON objects coming from the partial snapshots are incomplete. I had only checked a couple of partial snapshots, and didn’t get many properties. The only property missing from the full snapshot is $.date_of_birth.day, as far as I can see.

This definitely helps, because I can just as well default it to 1st without losing too much accuracy.
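A minimal sketch of that defaulting step, assuming the local schema expects a day and that each PSC record from the bulk file has been parsed into a dict. The function name is hypothetical.

```python
import copy


def with_default_birth_day(psc, default_day=1):
    """Return a copy of a PSC record with date_of_birth.day filled in if absent."""
    record = copy.deepcopy(psc)  # leave the caller's record untouched
    dob = record.get("date_of_birth")
    if isinstance(dob, dict) and "day" not in dob:
        dob["day"] = default_day
    return record


psc = {"name": "Example Person", "date_of_birth": {"month": 6, "year": 1975}}
print(with_default_birth_day(psc)["date_of_birth"])
```

Records without a date_of_birth at all (corporate PSCs, for example) pass through unchanged.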

Thanks for the help, that’s one problem solved, and 80 000 fewer requests to make.

@ebrian101 I also looked at your profile and found the link to CH Guide. Will definitely give it a go, perhaps I’ll find more useful tips!

ebrian101 · April 26, 2024, 1:22pm

Great!
Perhaps this is what you’re alluding to, but the precise date of birth (to the day) isn’t usually published. Only the month and year of birth. They suppress the day for privacy I suppose.
The PSC bulk data set looks like the best of the bunch, possibly because it was made in 2016 which is years after the others.
Unfortunately I haven’t updated CH Guide in some time, and there were some gaps in the original write-up. The most complete section is the bulk data loading section. If you have the inclination, do feel free to suggest changes to it on GitHub. I’m very happy to accept contributions if you think any essential information is missing.
Another tool I made which may be of use during your stream migration is https://companies.stream. It’s a simple dev tool for seeing if there is currently data coming through on the streams. Can help with debugging and seeing the format of events.
All the best!