Content Format Type

pbajaj · November 1, 2022, 9:05am

Hi All,
I am checking the documents API response and see that the “resources” parameter defines the formats in which a document can be downloaded.
I did some analysis on roughly 2,000 filings and saw that the majority of them are in PDF format. Around a quarter of them have XML+XHTML present. So, my questions are:

1. Why do we have so few filings in non-PDF format? What parameters define the content format of the filings?
2. If a non-PDF version of a filing is not available now, will it be made available in the future? Do you have any process to convert PDF to a non-PDF format and make it available in the documents API?
3. I also came across the Companies House daily file download. How can we link an individual file from the zip to the response of the filings API or documents API if I want more metadata for the filing?

Thanks in Advance.
Pratik Bajaj

lgeorge · November 1, 2022, 9:29am

I think the answer to question 1 is that accounts are provided to Companies House in iXBRL format, so these would be tagged XML+XHTML. Accounts are filed less often than other types of information.

pbajaj · November 4, 2022, 5:57am

Thanks for the reply.
Do you have any idea how the response of the stream API and the content of the daily zip download are related? I just want to know whether we get the same content and metadata from the stream API as from the daily zip file published by Companies House.
P.S. We have a use case to download content on a daily basis. We want to make sure we are not missing anything that the CH APIs provide, and we also don’t want to process duplicates.
Thanks in advance.

ebrian101 · November 7, 2022, 3:39pm

The bulk downloads only contain the actual documents filed. They don’t contain metadata, such as the filing description enum. For that metadata you have to call the API.
You can work out the company number and filing date from the file names of the bulk downloaded files.
Alternatively, you can access the XBRL documents one by one using the REST API instead of the bulk downloads.
There are duplicates on the filings streaming API.
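
As a rough sketch of the file-name approach: the snippet below pulls a company number and a date out of a bulk file name. The name layout it assumes (prefix, batch, 8-character company number and a YYYYMMDD date, separated by underscores) and the names `FILENAME_RE`, `BulkFileInfo` and `parse_bulk_filename` are illustrative assumptions, not something confirmed in this thread, so check them against the files you actually download.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# ASSUMPTION: file names look like <prefix>_<batch>_<company number>_<YYYYMMDD>.html
# Adjust the pattern if the files in your ZIP are named differently.
FILENAME_RE = re.compile(
    r"^(?P<prefix>[^_]+)_(?P<batch>[^_]+)_"
    r"(?P<company>[A-Z0-9]{8})_(?P<date>\d{8})\.(?:html|xml)$"
)


class BulkFileInfo(NamedTuple):
    company_number: str
    file_date: date


def parse_bulk_filename(name: str) -> Optional[BulkFileInfo]:
    """Return the company number and date from a file name, or None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    try:
        file_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that are not a real calendar date.
        return None
    return BulkFileInfo(match.group("company"), file_date)
```

With the company number in hand you can then call the API for that company to fetch the metadata that the bulk file itself does not carry.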

pbajaj · November 16, 2022, 12:33pm

Thanks for the reply. Really appreciate it.
So, the filings we get in the stream API and in the daily bulk zip file are duplicates? If so, why don’t we have HTML/XML format present for all the filings in the stream API?
Also, do you have any idea what the time window is between a filing becoming available in the stream API and in the daily bulk zip file?
I just want to understand this flow so we don’t miss any content on our platform.

ebrian101 · November 23, 2022, 4:24pm

Not all of the filings on the streaming API are for accounts. The filings streaming API emits events for all filings, including confirmation statements etc, which don’t have an XML representation.
The bulk download ZIP file only contains accounts that were filed with iXBRL (a type of XML). That is why there isn’t XML present for all the events on the filing stream.
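
To illustrate, here is a minimal sketch of picking only the accounts events out of the filing stream while collapsing repeats. It assumes each event is a dict with a `resource_uri` and a `data` object carrying `category` and `transaction_id`; treat those field names, and the function name `latest_accounts_events`, as assumptions to verify against the events you actually receive.

```python
from typing import Any, Dict, Iterable


def latest_accounts_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep the most recent event per filing, for accounts filings only."""
    latest: Dict[str, Dict[str, Any]] = {}
    for event in events:
        data = event.get("data") or {}
        # ASSUMPTION: accounts filings are marked with category == "accounts".
        if data.get("category") != "accounts":
            continue
        # ASSUMPTION: resource_uri (or transaction_id) identifies one filing.
        key = event.get("resource_uri") or data.get("transaction_id")
        if key is None:
            continue  # nothing to de-duplicate on, so skip it
        latest[key] = event  # a later event for the same filing replaces the earlier one
    return latest
```

Keeping the last event per key, rather than the first, means a repeated or updated event for the same filing overwrites the earlier copy instead of being processed twice.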

As for the time window, there is usually a bit of a delay between Companies House receiving a filing and emitting it on the stream. This varies throughout the day depending on load. You can see what it is at any point in time by checking https://companies.stream, which shows the delay for each stream. At the time of writing it’s about 4 hours behind, meaning that filings from 4 hours ago are being emitted on the stream now.

It’s not necessary for you to use both products (streaming and ZIP files). I believe you can keep a complete record of all XBRL filings using either one. They both have all the accounts filed in XML.

pbajaj · December 28, 2022, 11:10am

Thank you.
This will surely help.

pbajaj · December 29, 2022, 6:27am

Hi @ebrian101 ,
I observed that for some filings in the stream API response the docDate is empty. Do you know the reason for this? And is that information available from any other API?
Do we get multiple versions of the same filing in the stream API? If so, which parameters in the filing stream response can help us identify a unique filing for a company? This would allow us to update the document if we find an updated version of the same filing.