Extracting data from scanned PDFs

Hi Companies House team and developers,

I’m interested in extracting quantitative data from scanned PDFs and want to make sure I fully understand the data Companies House has access to. Could you clarify whether my understanding of the following points is correct?

  • Most accounts (~80%) are now filed electronically with Companies House; however, many companies still file accounts as scanned PDFs
  • The only way to extract quantitative data from accounts filed as scanned PDFs is to use an OCR tool
  • Companies that file their accounts as scanned PDFs may publish text-based versions on their own websites (for example, Shell PLC has an XHTML version, yet Companies House holds only a scanned PDF), but they have chosen not to share these documents with Companies House

Is there anything I’m missing? I’ve noticed a number of online services that report financial metrics, supposedly from Companies House data, including for companies that file accounts as scanned PDFs. I’m trying to assess whether most of them get this information by transcribing scanned PDFs or by some other means. I would love to hear your thoughts.

Additionally, thank you for maintaining such an excellent service!

A while ago I made a little table that may be of interest to you: Accounts bulk files | CH Guide
It shows the percentage of companies that filed accounts electronically each year since the Companies Act.
(Bear in mind it’s only a sample, so the percentages won’t be exact, but they should be representative.)

Thanks! Nice way of getting this updated statistic. Good to know most docs are now iXBRL.
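
For anyone landing on this thread later: the practical upside of iXBRL is that the figures are tagged in the document itself, so no OCR is needed for those filings. Below is a rough sketch of how the tagged numbers could be pulled out using only the Python standard library. It is not an official Companies House method. The sample document, the `uk:` taxonomy prefix and the `extract_numeric_facts` helper are all made up for illustration, and the number parsing assumes comma thousands separators and a dot decimal point, which real filings may not always follow.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

# Hypothetical, hand-written iXBRL fragment (not a real filing).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"
      xmlns:uk="http://example.com/hypothetical-taxonomy">
  <body>
    <p>Turnover: <ix:nonFraction name="uk:Turnover" contextRef="FY2023"
        unitRef="GBP" decimals="0" scale="3">1,234</ix:nonFraction></p>
    <p>Loss: <ix:nonFraction name="uk:ProfitLoss" contextRef="FY2023"
        unitRef="GBP" decimals="0" sign="-">567</ix:nonFraction></p>
  </body>
</html>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_numeric_facts(ixbrl_text):
    """Return (name, contextRef, value) for each ix:nonFraction element.

    Matching on the local tag name means both the 2008 and 2013
    inline XBRL namespaces are picked up.
    """
    root = ET.fromstring(ixbrl_text)
    facts = []
    for el in root.iter():
        if local_name(el.tag) != "nonFraction":
            continue
        # Assumption: commas are thousands separators, dot is the decimal point.
        raw = "".join(el.itertext()).strip().replace(",", "")
        try:
            value = Decimal(raw)
        except InvalidOperation:
            continue  # e.g. a dash used to display zero; skipped in this sketch
        # 'scale' is a power of ten applied to the displayed figure.
        value = value.scaleb(int(el.get("scale", "0")))
        # 'sign="-"' marks a negative value shown without a minus sign.
        if el.get("sign") == "-":
            value = -value
        facts.append((el.get("name"), el.get("contextRef"), value))
    return facts


if __name__ == "__main__":
    for name, context, value in extract_numeric_facts(SAMPLE):
        print(name, context, value)
```

Accounts filed as scanned PDFs carry no such tags, so OCR or manual transcription is still needed for those.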