Format and content of bulk data products

Hello,

We have received access to two bulk data products (company profiles, company director appointments).

I have two remaining questions:

  1. Is it possible to receive these products as JSON instead of the current fixed-width text format? At the moment we have to read each line (one line per entry) and extract the fields using pre-defined character positions, which significantly affects the performance of our servers. For example, reading and parsing 50,000 records takes about 30 minutes, whereas reading the same data from JSON would take only 5 minutes.

  2. I noticed various differences between the data in these bulk data products and the data on the Companies House (CH) website. For example, the profiles on the CH website appear accurate, whereas the bulk data we received seems to be “raw data” in which country names are spelled inconsistently. Is it indeed the case that we received the raw data, and that CH cleans the data before displaying it in its register?
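
To make question 1 concrete, below is a minimal sketch of the kind of fixed-position parsing we mean, followed by a conversion to JSON. The field names and character offsets are made up for illustration and are not the actual layout of the bulk products.

```python
import json

# Hypothetical layout: (field name, start offset, end offset).
# Illustrative only, not the real layout of the bulk data products.
FIELDS = [
    ("company_number", 0, 8),
    ("company_name", 8, 48),
    ("country", 48, 68),
]


def parse_line(line):
    """Slice one fixed-width record into a dict, trimming the padding."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def to_json_lines(lines):
    """Convert fixed-width records to one JSON string per non-empty line."""
    return [json.dumps(parse_line(line)) for line in lines if line.strip()]


# A made-up record padded to the hypothetical field widths above.
sample = "01234567" + "EXAMPLE LTD".ljust(40) + "UNITED KINGDOM".ljust(20)
print(to_json_lines([sample])[0])
```

With JSON delivered directly, the slicing step in `parse_line` would no longer be needed and each line could be read with `json.loads`.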

Thank you!

Best, Simon.