Deduplication of API stream data

Hi, we are receiving multiple streaming events that relate to the same filing. Here is an example of two events for the same filing, received at two separate timepoints:

-[ RECORD 1 ]------+------
me_id | 17345
me_monitor_id | 823
me_external_id | b13f95e0-4b61-11ef-9515-e111192e6fb1
me_key | FILINGS–filing-history–MzQzMDI1NzExMmFkaXF6a2N4–2024-07-26T15:14:02.499Z
me_data | {"data": {"data": {"date": "2024-07-26", "type": "CS01", "links": {"self": "/company/09463445/filing-history/MzQzMDI1NzExMmFkaXF6a2N4"}, "barcode": "XD84US75", "category": "confirmation-statement", "description": "confirmation-statement-with-updates", "transaction_id": "MzQzMDI1NzExMmFkaXF6a2N4", "description_values": {"made_up_date": "2024-07-26"}}, "event": {"type": "changed", "timepoint": 164957821, "published_at": "2024-07-26T15:14:02.499Z"}, "resource_id": "MzQzMDI1NzExMmFkaXF6a2N4", "resource_uri": "/company/09463445/filing-history/MzQzMDI1NzExMmFkaXF6a2N4", "resource_kind": "filing-history"}, "stream": "FILINGS", "client_id": "67babcd0-2c93-11ee-958a-af36d72efa01", "reference": "bombpot–67babcd0-2c93-11ee-958a-af36d72efa01–f5a65c40-e15e-11ee-ae68-6350cb592c69", "monitor_id": "COMPANIES_HOUSE–67babcd0-2c93-11ee-958a-af36d72efa01–f5a65c40-e15e-11ee-ae68-6350cb592c69", "company_number": "09463445"}
me_acknowledged | f
me_notes |
me_communicated_at | 2024-07-27 00:00:05.071233+00
me_timepoint | 164957821
me_created | 2024-07-26 15:14:02.943322+00
me_last_modified | 2024-07-27 00:00:05.071233+00
-[ RECORD 2 ]------+------
me_id | 17346
me_monitor_id | 823
me_external_id | ddf50cf0-4b61-11ef-b07f-21e051e087da
me_key | FILINGS–filing-history–MzQzMDI1NzExMmFkaXF6a2N4–2024-07-26T16:15:17.000Z
me_data | {"data": {"data": {"date": "2024-07-26", "type": "CS01", "links": {"self": "/company/09463445/filing-history/MzQzMDI1NzExMmFkaXF6a2N4", "document_metadata": "https://frontend-doc-api.company-information.service.gov.uk/document/wF5pTegh-cgImDs34Tiq5GEASpKEhINxBNCDF3MsNQo"}, "pages": 4, "barcode": "XD84US75", "category": "confirmation-statement", "description": "confirmation-statement-with-updates", "transaction_id": "MzQzMDI1NzExMmFkaXF6a2N4", "description_values": {"made_up_date": "2024-07-26"}}, "event": {"type": "changed", "timepoint": 164957998, "published_at": "2024-07-26T16:15:17.000Z", "fields_changed": ["links.document_metadata"]}, "resource_id": "MzQzMDI1NzExMmFkaXF6a2N4", "resource_uri": "/company/09463445/filing-history/MzQzMDI1NzExMmFkaXF6a2N4", "resource_kind": "filing-history"}, "stream": "FILINGS", "client_id": "67babcd0-2c93-11ee-958a-af36d72efa01", "reference": "bombpot–67babcd0-2c93-11ee-958a-af36d72efa01–f5a65c40-e15e-11ee-ae68-6350cb592c69", "monitor_id": "COMPANIES_HOUSE–67babcd0-2c93-11ee-958a-af36d72efa01–f5a65c40-e15e-11ee-ae68-6350cb592c69", "company_number": "09463445"}
me_acknowledged | f
me_notes |
me_communicated_at | 2024-07-27 00:00:05.071233+00
me_timepoint | 164957998
me_created | 2024-07-26 15:15:17.954237+00
me_last_modified | 2024-07-27 00:00:05.071233+00

Can you confirm that each filing can be uniquely identified by its transaction ID, so that we could effectively disregard the second event?

The use case here is to alert customers to filing events that relate to a company they are monitoring. Currently, customers receive multiple notifications that relate to the same filing.

One of the events is produced when the document is filed (but before it’s fully processed), and another is sent once the document has been processed and is available for download from the Companies House Document API.
So the first event will be missing links.document_metadata, but it will be present in the second, hence the fields_changed value of the second event. Which of these events you use is up to you, as it depends on the use case.
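
To tell the two events apart in code, something like the following might work. It is only a sketch in Python: the function names are made up for illustration, and it assumes you pass in the stream event itself (the object with the `data` and `event` keys), shaped like the two records above.

```python
def has_document(stream_event: dict) -> bool:
    """True if the filing in this event links to document metadata."""
    links = stream_event.get("data", {}).get("links", {})
    return "document_metadata" in links


def document_just_added(stream_event: dict) -> bool:
    """True if this event reports links.document_metadata as a changed field."""
    changed = stream_event.get("event", {}).get("fields_changed", [])
    return "links.document_metadata" in changed
```

With the two records above, `has_document` would be false for the first event and true for the second.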

Great, thanks @ebrian101. Do you happen to know whether another event could follow these two? For example, I can imagine there might be an update to a historic filing, or a deletion.

My thinking is that we can use either of the two initial events for our purposes here, but I would not want to deduplicate away an event that might be useful and that arrives days or months after the initial filing event.

Of course, if no update to a processed filing is possible at all, then this is not something we need to take into consideration.

Thanks again for the insight.

I’m not aware of any events following a processed filing, but I can’t say with certainty that they never happen. Of course, either of the two initial events could be sent more than once, since Companies House does “at least once” delivery.
I’m not sure what their policy is on correcting errors. For example, if the initial events had the wrong date in description_values, would they send another event to correct it?
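
For the "at least once" part, a rough Python sketch of dropping exact redeliveries is below. It assumes a redelivered event carries the same timepoint as the original, which I have not confirmed. The `SeenEvents` class is hypothetical, and the in-memory set stands in for whatever persistent store you actually use (a unique constraint on the same columns would do the same job).

```python
class SeenEvents:
    """Remembers which stream events have already been handled."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, stream_event: dict) -> bool:
        # Assumption: a redelivery repeats resource_kind, resource_id and timepoint.
        key = (
            stream_event["resource_kind"],
            stream_event["resource_id"],
            stream_event["event"]["timepoint"],
        )
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

This only filters repeats of the same event. The filed and processed events have different timepoints, so both would still pass through.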

Thanks @ebrian101.

We’re opting to deduplicate based on the transaction ID and the date (excluding time) of the published_at datetime value. That way we won’t generate more than one notification for the same filing on the same day, but any future event would still create a notification…
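
A minimal Python sketch of that kind of key, for anyone doing something similar. The function name is illustrative, and it assumes the stream event has the shape shown in the records above and that the timestamp is always in the ISO 8601 "Z" form seen there.

```python
from datetime import datetime, timezone
from typing import Tuple


def dedup_key(stream_event: dict) -> Tuple[str, str]:
    """Build a (transaction_id, UTC date) key for a filing-history event."""
    transaction_id = stream_event["data"]["transaction_id"]
    published_at = stream_event["event"]["published_at"]
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    utc_date = published.astimezone(timezone.utc).date().isoformat()
    return (transaction_id, utc_date)
```

Both events above share a transaction ID and a UTC date, so they map to the same key. One caveat: a filed/processed pair that straddles midnight UTC would produce two different keys and therefore two notifications.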