PSC bulk "generated_at" timestamp

I would like to make a seamless integration of data from the PSC bulk file and the PSC streaming API.

The only timestamp information in the PSC bulk file comes on the last line. For example, for the file available a few days ago, this was (note that I have adjusted the formatting slightly for readability):

"kind": "totals#persons-of-significant-control-snapshot",
"persons_of_significant_control_count": 9810297,
"statements_count": 628256,
"exemptions_count": 61,
"generated_at": "2022-07-03T03:41:30+01:00"

My question is: Can I rely on generated_at for comparison with the event.published_at data element from the streaming API? More specifically, is it guaranteed that the bulk file accounts for all changes/deletions made up to the generated_at moment and nothing else, so that if I begin consuming events whose event.published_at is immediately after that timestamp, I will have a seamless integration with no data loss?

I realize there may be a timezone issue based on the formatting of the timestamps, but I should be able to figure that out.
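The timezone issue is manageable as long as both timestamps are compared as timezone-aware instants rather than as strings. A minimal sketch, assuming the bulk file's generated_at carries a +01:00 (BST) offset and the streaming API's event.published_at is UTC with a trailing "Z" (the published_at value below is made up for illustration):

```python
from datetime import datetime

# generated_at from the bulk file trailer, with its +01:00 offset.
generated_at = datetime.fromisoformat("2022-07-03T03:41:30+01:00")

# Hypothetical event.published_at value. fromisoformat() only accepts a
# trailing "Z" from Python 3.11 onward, so normalise it to "+00:00" first.
published_at_raw = "2022-07-03T02:45:00Z"
published_at = datetime.fromisoformat(published_at_raw.replace("Z", "+00:00"))

# Both are timezone-aware, so comparison is offset-safe:
# 03:41:30+01:00 is 02:41:30 UTC, which is before 02:45:00 UTC.
print(published_at > generated_at)  # True
```

The key point is that aware datetimes compare by absolute instant, so the mixed +01:00/Z offsets never need to be reconciled by hand.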


The "generated_at": "2022-07-03T03:41:30+01:00" is the end time of the dump process. It is configured to start at 03:30hrs (although it is sometimes produced at other times of day) and generally takes approximately 10-15 minutes.
The data source is not locked, so there is a chance that updates could be applied during this 10-15 minute window.
Hope that helps.
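Given that the source is not locked during the dump, one way to avoid losing events is to replay the stream from a point safely before generated_at rather than exactly at it. A minimal sketch, assuming a 30-minute buffer (an arbitrary choice that comfortably covers the stated 10-15 minute window) and that re-applying events already reflected in the snapshot is harmless because updates are applied idempotently:

```python
from datetime import datetime, timedelta

# Assumed buffer: larger than the 10-15 minute dump window described above.
DUMP_WINDOW_BUFFER = timedelta(minutes=30)

def replay_start(generated_at_str: str) -> datetime:
    """Earliest event.published_at from which to replay the stream."""
    generated_at = datetime.fromisoformat(generated_at_str)
    return generated_at - DUMP_WINDOW_BUFFER

print(replay_start("2022-07-03T03:41:30+01:00").isoformat())
# 2022-07-03T03:11:30+01:00
```

Events between the replay start and generated_at may or may not already be in the snapshot, but overwriting with the same data is a no-op, so the overlap costs nothing.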

Thanks. Instead of trying for an exact match on timing, I started with a bulk file, took the streaming API timepoints starting from the prior day, and walked through them sequentially, matching against the bulk file on links.self: overwriting the bulk file data when a change event was specified and deleting the record when a delete event was specified. Any reason that wouldn't be a reasonable approach?
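The replay approach above can be sketched roughly as follows. This assumes the snapshot has been loaded into a dict keyed on each record's links.self path, that events arrive ordered by event.published_at starting the day before generated_at, and that messages follow the streaming API's event/data envelope; how snapshot and events are actually loaded is left out, as it depends on your setup.

```python
def apply_events(snapshot: dict, events) -> dict:
    """Replay change/delete events over a bulk snapshot keyed on links.self."""
    for msg in events:
        key = msg["data"]["links"]["self"]
        if msg["event"]["type"] == "deleted":
            snapshot.pop(key, None)   # delete; harmless if key already absent
        else:                         # change: overwrite or insert the record
            snapshot[key] = msg["data"]
    return snapshot

# Illustrative run with made-up records:
snap = {"/company/1/psc/a": {"name": "Old"}}
evts = [
    {"event": {"type": "changed"},
     "data": {"links": {"self": "/company/1/psc/a"}, "name": "New"}},
    {"event": {"type": "deleted"},
     "data": {"links": {"self": "/company/2/psc/b"}}},
]
apply_events(snap, evts)
```

Because each event fully overwrites (or removes) the record at its links.self key, replaying events that the snapshot already contains is safe, which is what makes starting from the prior day workable.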