Using Streaming API to keep data up to date

Hello,

I am working for a client who is interested in taking a snapshot of the company data here at Companies House and then using the streaming API to keep that data up to date.

The problem is that there is no way to know what timepoint the snapshot dataset is up to, so I can't tell the Streaming API where to resume from. Doesn't that mean there is potential for missing data updates?

Is there a better way to handle this?

Thanks,
Scott.

Snapshots for the streaming APIs are not yet available.
In the absence of streaming API snapshots, the best strategy is to:

  1. Determine which items in the downloaded database you want to keep up to date.
  2. Ascertain which streaming API endpoints update those fields.
  3. Run your app against those streaming API endpoints from the start of the month and cache all updates until you have downloaded the new database (see the sketch below).
  4. After putting your new database in place, apply the cached updates and then continue updating from the streaming endpoints.

PS: Mind the dates!
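
To make step 3 concrete, here is a minimal sketch of caching raw events from one stream. It assumes the company-profile stream at https://stream.companieshouse.gov.uk/companies, basic auth with the stream key as the username, and a placeholder cache_event function; none of those details come from this thread, so check the docs for the endpoints you actually need.

```python
import json
import requests

STREAM_URL = "https://stream.companieshouse.gov.uk/companies"  # assumed company-profile stream
API_KEY = "your-stream-api-key"  # sent as the basic-auth username, blank password

def cache_event(event: dict) -> None:
    """Placeholder: persist the raw event (and its timepoint) to a table,
    file or queue until the new bulk database has been loaded."""
    print(event.get("event", {}).get("timepoint"), event.get("resource_uri"))

with requests.get(STREAM_URL, auth=(API_KEY, ""), stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:  # heartbeat / keep-alive newline
            continue
        cache_event(json.loads(line))
```

The point is simply to hold on to the raw events (and their timepoints) somewhere durable until the new bulk database is in place, after which they can be replayed.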

Thanks for this. I was hoping there was a way other than caching, but that will do for now.

Maybe someone from CH will be able to add their input, but I've seen inconsistencies between what the monthly batch files contain and what the RESTful APIs return (of which streaming is a part). I have it in my head that two different systems are at play, so if you are going to use the bulk dataset it may be worth validating that the data you're interested in aligns…

Here’s my 2 cents, bearing in mind that CH have (had?) an intention to provide snapshots specifically for the streaming APIs.

Did you look at the full spectrum of the streaming APIs and still see inconsistencies with the RESTful data? The RESTful payload is the ‘complete / consolidated picture’, whereas the different streaming API endpoints report updates to sections of that picture in ‘real time’ (though I’m not so sure about that last part, as I’ve noticed batched data processing after midnight).

Since the bulk dataset is a consolidation of the RESTful API, if the full spectrum of the streaming APIs covers it then it should work with the bulk dataset. But yes, could someone from CH please throw some more light on this.

I’ve been building something along these lines. My SQL table has the Timepoint from the change stream as a column. Any record coming from the genuine change stream will have a non-zero Timepoint.

After syncing regularly from the change stream for a while, the next time a new monthly drop appears, I have another command that can import from the .csv, and it assumes Timepoint zero for those records. My update code requires the update to have a Timepoint that is greater than or equal to the timepoint of the existing record, so snapshots cannot overwrite change stream records.
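
For what it's worth, here is a minimal sketch of that guard using SQLite and a hypothetical companies table with company_number, name and timepoint columns; the names and example values are illustrative, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect("companies.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS companies (
           company_number TEXT PRIMARY KEY,
           name           TEXT,
           timepoint      INTEGER NOT NULL DEFAULT 0
       )"""
)

def upsert(company_number: str, name: str, timepoint: int) -> None:
    """Insert or update a row, but only when the incoming Timepoint is
    greater than or equal to the stored one, so a Timepoint-0 snapshot
    row can never overwrite a record that came from the change stream."""
    conn.execute(
        """INSERT INTO companies (company_number, name, timepoint)
           VALUES (?, ?, ?)
           ON CONFLICT(company_number) DO UPDATE SET
               name = excluded.name,
               timepoint = excluded.timepoint
           WHERE excluded.timepoint >= companies.timepoint""",
        (company_number, name, timepoint),
    )
    conn.commit()

upsert("01234567", "EXAMPLE LTD", 0)        # snapshot import (Timepoint zero)
upsert("01234567", "Example Ltd", 845123)   # change-stream update wins
upsert("01234567", "EXAMPLE LTD", 0)        # a later snapshot is ignored
```

The WHERE clause on the upsert is what stops a Timepoint-0 snapshot row from overwriting a record that already came from the change stream.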

There is a problem, which is that the snapshot CSV and the change stream JSON only overlap partially in their schemas. Some things are only in the CSV, other things are only in the change stream. Because I want to maintain up-to-date information via the change stream, I’m ignoring anything in the CSV that isn’t in the change stream.

This appears to be less of a problem with Persons With Significant Control, because the snapshot uses almost the same schema as the change stream (except with no Timepoints, so I take the same approach of setting those to zero).

For example, in the companies snapshot CSV there are several headers for information about mortgages, but these don’t appear in the change stream.

Meanwhile in the change stream there’s a whole bunch of stuff about foreign company registrations, which doesn’t appear in the snapshot CSV.

And I’m assuming these pair up:

json.DateOfCessation = csv.DissolutionDate
json.DateOfCreation = csv.IncorporationDate
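
Expressed as code, that assumed pairing might just be a small rename map applied while importing the CSV; the two entries follow the lines above, and anything further would need confirming.

```python
# Assumed pairing of bulk-CSV columns to the change-stream field names above.
CSV_TO_STREAM = {
    "DissolutionDate":   "DateOfCessation",
    "IncorporationDate": "DateOfCreation",
}

def csv_row_to_stream_fields(row: dict) -> dict:
    """Rename the CSV columns you care about to their change-stream names."""
    return {stream: row[csv] for csv, stream in CSV_TO_STREAM.items() if csv in row}
```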

I’ve done the opposite: I only update the fields from the stream that were also in the CSV. Any additional data in the stream I am ignoring for now.

Also, I wasn’t planning on loading the CSV file every month; I’ll just take the updates from the stream, and we might take a rebase every now and then. The dump arrives up to 5 days after the start of the month, and there is no way to tell the stream you want all updates since the start of the month. Also, the time it takes to load the new data dump means the data will be unusable for a period while it is loading, unless you load into a different db every month!

I am caching the event stream data in the database and processing it separately to update the company information. An option for the future is to keep this running while the new data is loaded and then, once the current dump is in place, only process the event data that came in for the current month.
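
A rough sketch of that replay step, assuming the companies table from the earlier upsert sketch plus a hypothetical stream_events cache table with payload, published_at and timepoint columns; the company_name / company_number field names inside the event data are also assumptions.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("companies.db")

def apply_event(event: dict) -> None:
    """Apply one cached company-profile event, guarded by Timepoint so
    re-running this after the bulk load is harmless (idempotent)."""
    data = event.get("data", {})
    timepoint = event.get("event", {}).get("timepoint", 0)
    conn.execute(
        """UPDATE companies
              SET name = ?, timepoint = ?
            WHERE company_number = ? AND timepoint <= ?""",
        (data.get("company_name"), timepoint, data.get("company_number"), timepoint),
    )

def replay_cached_events(since: datetime) -> None:
    """Once the monthly dump is loaded, re-apply only the cached events
    published since the start of the current month, in Timepoint order."""
    rows = conn.execute(
        "SELECT payload FROM stream_events WHERE published_at >= ? ORDER BY timepoint",
        (since.isoformat(),),
    )
    for (payload,) in rows:
        apply_event(json.loads(payload))
    conn.commit()

month_start = datetime.now(timezone.utc).replace(
    day=1, hour=0, minute=0, second=0, microsecond=0
)
replay_cached_events(month_start)
```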

Thanks,
Scott.

@scott1 Likewise, I’m assuming the snapshot will only be pulled in once to fill in the gaps, and it’s just stream updates from then on. My worry with favouring the CSV schema is that I would end up with some data that never gets updated, e.g. the mortgage fields must surely change in value over time, so whatever I store from the initial snapshot will become stale. Hence favouring the change stream’s fields, so my copy will gain content over time and stay accurate.

Much of the text in the CSV is uppercase, whereas in the change stream it’s mixed case!

Also, the CSV has CompanyCategory, which in a typical example is Private Limited Company, whereas the change stream has type with the value ltd in the next change for the same example company.

Fortunately I’m mostly interested in the name and address for a company number, which seem usable as long as they are treated case-insensitively.
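
In case it's useful, a tiny sketch of the sort of case-insensitive comparison that makes the uppercase CSV values and mixed-case stream values line up; casefold() is just one way to normalise.

```python
def same_text(a: str, b: str) -> bool:
    """Compare CSV (often uppercase) and change-stream (mixed case) text
    ignoring case and surrounding whitespace."""
    return a.strip().casefold() == b.strip().casefold()

print(same_text("ACME WIDGETS LIMITED", "Acme Widgets Limited"))  # True
```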

Yes, some data may become stale which is why we will probably rebase every now and then rather than monthly.

I have a lookup YAML file that maps type to CompanyCategory so that the company data stays consistent.
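
A minimal sketch of what such a lookup might look like, loaded with PyYAML; only the ltd entry comes from this thread, and the other values are assumptions to show the shape.

```python
import yaml  # PyYAML

# Hypothetical contents of the lookup file, e.g. company_category.yaml
LOOKUP_YAML = """
ltd: Private Limited Company
plc: Public Limited Company
llp: Limited Liability Partnership
"""

TYPE_TO_CATEGORY = yaml.safe_load(LOOKUP_YAML)

def to_company_category(stream_type: str) -> str:
    """Translate the change stream's short 'type' code into the
    CompanyCategory wording used by the bulk CSV."""
    return TYPE_TO_CATEGORY.get(stream_type, stream_type)

print(to_company_category("ltd"))  # Private Limited Company
```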