Read .pdf from filing history for financial data with Python

bruno.trevelin · March 5, 2024, 10:08am

GM guys,

I am trying to read the .pdf files from the filing history with Python to extract financial data. As they are scanned images, this does not work properly: data gets lost or is read incorrectly. I am not sure what the best approach is here. Does anyone have a validated method or tool for reading these .pdfs? Or is there another way to get this financial data? What method do the other database providers use? The goal is to automate reading financial data for companies, such as revenue by year, which seems to exist only inside the .pdfs.

voracityemail · March 5, 2024, 2:10pm

This (rather long) thread may be of interest - it covers various topics around “getting financial data” from Companies House.

If you have decided to go down the OCR route for getting machine-readable data from PDFs there are all kinds of solutions. For example both Google and Microsoft provide Cloud computing solutions for OCR / document processing I believe? (We’ve used Google’s but not for Companies House data, for a different task).

bruno.trevelin · March 11, 2024, 4:19pm

Thank you,

I am now struggling to understand the contextRef for the financial data in .ixbrl files. Any hints?

Example:
<ix:nonFraction format="ixt2:numdotdecimal" contextRef="icur1" unitRef="GBP" name="uk-core:FixedAssets" decimals="2">564</ix:nonFraction>

ebrian101 · March 11, 2024, 4:54pm

The contextRef attribute specifies which period and entity the fact relates to. The contexts are all defined before the facts are listed, and they are assigned IDs which each fact references. For further reading, I suggest The XBRL Book which I found very helpful.

bruno.trevelin · March 12, 2024, 11:53am

Alright, can I find a dictionary of the contextRef values in this book? Or do I need to map the contextRef values myself? As far as I can see, companies use different naming and patterns here.

ebrian101 · March 13, 2024, 12:17pm

The context IDs are defined in each ixbrl document, for example:
<xbrli:context id="icur1"> and it will specify the entity identifier and period which that context refers to. The free online preview of that book explains how they are defined. There is no simple dictionary mapping for how they are formed since different software does it differently.
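Because each document defines its own contexts, the lookup can be built per file rather than from a fixed dictionary. Here is a minimal sketch of that idea using only Python's standard library. The SAMPLE document, the company number, the dcur1 context and the Turnover fact are invented for illustration, and the function names (local, read_contexts, read_facts) are hypothetical. It assumes the file parses as well-formed XHTML, matches elements by local name so it does not depend on a particular namespace version, and ignores dimensional segments, scale, sign and number-format attributes.

```python
import xml.etree.ElementTree as ET

# Invented sample: only the FixedAssets fact comes from the thread above.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"
      xmlns:xbrli="http://www.xbrl.org/2003/instance">
 <body>
  <ix:header><ix:resources>
   <xbrli:context id="icur1">
    <xbrli:entity><xbrli:identifier scheme="placeholder">00000000</xbrli:identifier></xbrli:entity>
    <xbrli:period><xbrli:instant>2023-12-31</xbrli:instant></xbrli:period>
   </xbrli:context>
   <xbrli:context id="dcur1">
    <xbrli:entity><xbrli:identifier scheme="placeholder">00000000</xbrli:identifier></xbrli:entity>
    <xbrli:period>
     <xbrli:startDate>2023-01-01</xbrli:startDate>
     <xbrli:endDate>2023-12-31</xbrli:endDate>
    </xbrli:period>
   </xbrli:context>
  </ix:resources></ix:header>
  <p><ix:nonFraction format="ixt2:numdotdecimal" contextRef="icur1" unitRef="GBP" name="uk-core:FixedAssets" decimals="2">564</ix:nonFraction></p>
  <p><ix:nonFraction contextRef="dcur1" unitRef="GBP" name="uk-core:Turnover" decimals="0">1200</ix:nonFraction></p>
 </body>
</html>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_contexts(root):
    """Map each context id to its entity identifier and period dates."""
    wanted = {"identifier": "entity", "instant": "instant",
              "startDate": "start", "endDate": "end"}
    contexts = {}
    for el in root.iter():
        if local(el.tag) != "context":
            continue
        info = {"entity": None, "instant": None, "start": None, "end": None}
        for child in el.iter():
            key = wanted.get(local(child.tag))
            if key:
                info[key] = (child.text or "").strip()
        contexts[el.get("id")] = info
    return contexts


def read_facts(root, contexts):
    """Return one row per numeric fact, joined to its context by contextRef."""
    rows = []
    for el in root.iter():
        if local(el.tag) != "nonFraction":
            continue
        row = {
            "name": el.get("name"),
            "unit": el.get("unitRef"),
            # Displayed text only; scale/sign/format are not applied here.
            "value": "".join(el.itertext()).strip(),
        }
        row.update(contexts.get(el.get("contextRef"), {}))
        rows.append(row)
    return rows


root = ET.fromstring(SAMPLE)
contexts = read_contexts(root)
rows = read_facts(root, contexts)
for row in rows:
    print(row)
```

Each row then carries its own entity and period columns, whatever IDs the producing software chose. Real filings are not always well-formed XML and often add segments or scenarios to a context, so a dedicated XBRL library may be a better fit beyond a quick check.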

bruno.trevelin · March 13, 2024, 3:29pm

Thanks for the reply, makes sense!

However, as far as I can see, there is no way to “automate” the translation of these contexts across many .xbrl files, for example. We would need to read and understand each file, then create a new column for the specific period and entity. Is that right?