Meaning of officer_id in appointmentList resource

Hi -

I know that officers are not truly uniquely identified, and that the same real-life entity might be linked to different officer_ids.

Preamble to the questions is:

We query the API with a certain appointment_link containing an officer_id, say:

umVXYzu2PmpPehTY22bsCgQdmHA.

Let’s store the returned JSON object in a variable called r.

r["name"] is COMWOOD SECRETARIAL LIMITED.

r["items"] contains all the actual appointments of COMWOOD SECRETARIAL LIMITED.

The addresses in r["items"] (one "address" dictionary per appointment) are not all the same - which can be quite normal, as the same entity can register at different addresses.
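To make that observation concrete, here is a minimal Python sketch that counts the distinct addresses inside r. The shape of r (a "name", plus an "items" list whose entries each carry an "address" dictionary) is my assumption based on the description above, and the sample values are made up for illustration.

```python
def distinct_addresses(r: dict) -> set:
    """Return the set of distinct addresses found across r["items"]."""
    # Each address dict is turned into a sorted tuple of (field, value)
    # pairs so that it is hashable and independent of key order.
    return {
        tuple(sorted(item.get("address", {}).items()))
        for item in r.get("items", [])
    }


# Hypothetical response, trimmed to the fields used here.
r = {
    "name": "COMWOOD SECRETARIAL LIMITED",
    "items": [
        {"address": {"address_line_1": "1 High Street", "locality": "London"}},
        {"address": {"locality": "London", "address_line_1": "1 High Street"}},
        {"address": {"address_line_1": "2 Low Road", "locality": "Leeds"}},
    ],
}

print(len(r["items"]), "appointments,", len(distinct_addresses(r)), "distinct addresses")
```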

My questions are:

  1. Just to confirm, even if the same entity has different addresses, since it is linked to the same officer_id, all appointments contained in r["items"] are relating to the same officer (officer_id), right?

  2. In the case where we have a set built by querying the API with 1000 appointment_links, if 1) is true, then we could apply the following logic¹:

    if officer_id is not the same
    and name is the same
    and address² is the same
    overwrite officer_id with the officer_id of the first record where this logic returns true

to have some sort of de-duplication, with the caveat that, of course, 2 officers with the exact same name could be registered at the exact same address - which is a false positive I am willing to accept.

endnotes:
¹ this would be done in a simple SQL update statement.
² address would be each appointment’s "address" dictionary (from r["items"]) merged into one string.
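For illustration, the logic in question 2 could be sketched in Python roughly as below, before translating it into the SQL update. The record layout (officer_id, name, address dictionary) and the helper names are assumptions for this sketch, not anything defined by the API. It is a single pass that keeps the first officer_id seen for each name + address pair, so it does not follow chains of matches across different addresses.

```python
def merge_address(address: dict) -> str:
    """Merge an address dictionary into one comparable string (endnote 2)."""
    # Sort the keys so the same address always yields the same string,
    # and skip empty fields.
    parts = [str(address[key]).strip() for key in sorted(address) if address[key]]
    return ", ".join(parts).upper()


def canonical_officer_ids(records: list) -> dict:
    """Map each officer_id to the first officer_id seen with the same name and address."""
    first_seen = {}  # (name, merged address) -> first officer_id seen
    mapping = {}     # officer_id -> officer_id to overwrite it with
    for rec in records:
        key = (rec["name"].strip().upper(), merge_address(rec["address"]))
        canonical = first_seen.setdefault(key, rec["officer_id"])
        # Keep the first decision made for an officer_id.
        mapping.setdefault(rec["officer_id"], canonical)
    return mapping
```

The resulting mapping holds one (old officer_id, new officer_id) pair per officer, which could then drive the update statement mentioned in endnote 1.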

At first glance 1 appears true, i.e. I think 1 officer_id → 1 entity (company, natural person, legal person). So “all appointments contained in r["items"] are relating to the same officer (officer_id)” seems correct.

As you point out, the opposite doesn’t hold, i.e. 1 entity may have more than one officer_id. I’m not certain what triggers CH to allocate different officer_ids - I thought it was a different address, as you say, but I’m not 100% sure on this. (Companies House normally gets 2 different addresses on registration of officers - a private address, which we almost never get access to, and the “service address”. This is often the company address for officers, but could be anything…) E.g. there would appear to be quite a few duplicates here:

Your point about “of course, 2 officers with the exact same name could be registered at the exact same address” - this may be slightly more possible in certain cases e.g. shell companies (often many at one address), or people sharing a “managed address”. No examples to hand, sorry.

Another gotcha to beware of - the data contains mis-classifications, such as companies listed as people and vice versa, e.g.

Finally, your de-duplication will need to be slightly smart to match names and addresses put in slightly different formats or with errors.
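As a rough idea of what “slightly smart” could mean, here is a minimal sketch using only the Python standard library: normalise case, punctuation and a few abbreviations, then compare with a similarity ratio. The abbreviation table, the function names and the 0.9 threshold are all assumptions you would want to tune against real data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; extend it to suit the data.
ABBREVIATIONS = {"LTD": "LIMITED", "RD": "ROAD", "AVE": "AVENUE"}


def normalise(text: str) -> str:
    """Upper-case, strip punctuation, collapse whitespace, expand abbreviations."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", text.upper())
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True if the normalised strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold
```

With this, "Comwood Secretarial Ltd." and "COMWOOD SECRETARIAL LIMITED" normalise to the same string, and a one-letter typo in a long name still passes the threshold. Comparing every pair of records does not scale to millions of rows, so in practice you would only compare records that already share something cheap, such as a postcode.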

Hey - Gosh, what an exhaustive answer - thank you so much!

I agree on

this may be slightly more possible in certain cases e.g. shell companies (often many at one address), or people sharing a “managed address”.

However, not de-duplicating on name+address would leave far more officers falsely identified as unique than de-duplicating with that logic would falsely merge.

Re smart code to find mistakes and slightly differently written addresses or names - openrefine is your friend! :smile:

Would you have any other recommendations otherwise?

Thanks!

No particular recommendations - feel free to share though. Don’t know if you use R but that’s got good support for this kind of thing. Google have APIs / libraries for processing address / geolocation etc - but the API gives you a limited quota. OpenRefine looks good for some batch-based tasks. It does seem you can query it from other processes on a server but this doesn’t seem to be its primary use case.

If you want all the officer data at once you can request this on the following thread:

Coming back to this after a while. I found that the best way to dedupe officer entries is to use the software dedupe.io

I fed it a list of 1.5 million appointments and trained it through the GUI by labelling 100 correct and 100 incorrect clusters. Then the software did the rest.

The power of this tool was that it further polished the clusters by checking for duplicate clusters, for which I used the rule we identified above: if two clusters contained records with the same officer_id, even if the address was different, then I merged them.
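For anyone wanting to apply the same merge rule outside the tool, it can be sketched as a small union-find in Python. This is not dedupe.io’s API, just an illustration of the rule; the input format (pairs of cluster id and officer_id) and the function name are assumptions.

```python
def merge_clusters_by_officer_id(records) -> dict:
    """Merge clusters that share an officer_id.

    `records` is an iterable of (cluster_id, officer_id) pairs.
    Returns a dict mapping each cluster_id to its merged cluster id.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative, for stable output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_cluster = {}  # officer_id -> first cluster it was seen in
    for cluster_id, officer_id in records:
        find(cluster_id)  # register the cluster even if it never merges
        seen = first_cluster.setdefault(officer_id, cluster_id)
        union(seen, cluster_id)

    return {cluster_id: find(cluster_id) for cluster_id in parent}
```

Because the merge is transitive, a cluster that shares one officer_id with a second cluster, which in turn shares a different officer_id with a third, ends up in the same merged cluster as both.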

Highly highly recommended.