Hi @Everyone
I have written a script in Node.js that fetches the data for more than 700 companies and writes it to a CSV file. That part is done.
But I'm not able to add accounts information to the CSV file, as the Companies House API provides accounts information as PDF files.
I have a Python script to read data from the PDFs, and I'm able to read data from a sample file. The problem is that I'm trying to get accounts information for more than 700 companies and add it to my existing CSV, and the accounts PDFs all differ from each other in layout and formatting. My program only works for documents with one specific layout and formatting.
Can anyone help me get bulk company accounts data into a CSV file? Or is there some way to find the Current Assets, Current Liabilities, Stocks, Cash in Hand, Debtors and Net Assets/Liabilities values of a company by company number?
Welcome!
It sounds like you want to get accounts information from Companies House. I’m not sure about the exact fields you were asking about:
Current Assets, Current Liabilities, Stocks, Cash in Hand, Debtors and Net Assets/Liabilities values
… but in general there are two ways to get the detail of accounts filed with Companies House:
a) The Companies House bulk accounts data - large files with XBRL format data with accounts filed in the current and previous years.
b) If - and only if - a company has electronically filed accounts in XBRL format - you can download that data rather than the PDF.
Note that you might not find data for all companies by either route - Companies House state for both (a) and (b) above:
data is only available for electronically filed accounts, which currently stands at about 75% of the 2.2 million accounts we expect to be filed each year.
On (b), this is done using the normal Document API to download a file. Once you get the document metadata you need to examine the resources member (resources.{content_type}) to see what types of data (MIME types) are available. For accounts, as well as the PDF MIME type (“application/pdf”) you may find XML ones, which should be the XBRL files: application/xml, application/xhtml+xml.
(I note Companies House also state you can find application/json and text/csv - but I have never encountered this and am not sure what these would be!)
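As a rough illustration of that check, here is a minimal Python sketch. It assumes the document metadata has already been fetched and parsed into a dict, and that the resources member is keyed by MIME type as described above. The function name, the preference order and the sample dict are my own, not anything from Companies House.

```python
# Preference order is an assumption: try the XHTML (inline XBRL) type first, then plain XML.
PREFERRED_TYPES = ["application/xhtml+xml", "application/xml"]


def pick_structured_type(metadata):
    """Return the first XML-style MIME type listed under 'resources', or None."""
    resources = metadata.get("resources", {})
    for content_type in PREFERRED_TYPES:
        if content_type in resources:
            return content_type
    return None  # only a PDF (or nothing) is available for this document


# Hypothetical metadata, shaped like the description above.
sample_metadata = {
    "resources": {
        "application/pdf": {},
        "application/xhtml+xml": {},
    }
}
print(pick_structured_type(sample_metadata))
```

If this returns None for a company, the accounts were probably filed on paper and only the PDF exists.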
For an example of getting one of these files instead of a PDF see this post:
https://forum.aws.chdev.org/t/solved-xbrl-insead-of-pdf-documents/2292/3
For more information on the process as a whole see this post:
https://forum.aws.chdev.org/t/fetch-a-document-api/978/4
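For what it's worth, the request for the non-PDF version might be built roughly like this. This is only a sketch: it assumes you already have the document's content URL from the metadata, that the API key is sent as the username in HTTP Basic auth with an empty password, and that the Accept header selects the type. The function name and the placeholder URL are mine; check the linked posts for the real details.

```python
import base64
import urllib.request


def build_content_request(content_url, api_key, content_type):
    """Build (but do not send) a request asking for one specific MIME type."""
    # Assumption: Basic auth with the API key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        content_url,
        headers={
            "Accept": content_type,  # e.g. application/xhtml+xml instead of application/pdf
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder URL and key, for illustration only.
req = build_content_request(
    "https://example.invalid/document/abc123/content",
    "my-key",
    "application/xhtml+xml",
)
print(req.get_header("Accept"))
```

Sending it would then be a matter of passing req to urllib.request.urlopen and saving the response body.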
Good luck.
I have already downloaded the PDFs and am trying to read these values from the documents, but the problem is that different accounts documents have different layouts and formatting. For example, in one document these values are on page 4, while in others they are on page 2 or some other page, and some details can be missing from a document altogether.
I haven’t tried the XBRL format yet, though. Let me try extracting the values from that format, then I’ll get back to you…
I suspect - although not trivial - it’s a rather easier task to use the XBRL. I believe that would be more reliable also. Things are likely to appear in different places and with different formatting as you said. Plus the PDF format is somewhat involved and different generators can lead to all kinds of different data in the file.
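To give a feel for why the XBRL route is easier, here is a minimal Python sketch that pulls tagged numbers out of an inline XBRL (XHTML) file. It assumes the file is well-formed XML and that figures are tagged with ix:nonFraction elements carrying name, contextRef, scale and sign attributes. The sample document and the concept names in it are made up for illustration; real names depend on the taxonomy the accounts were filed under.

```python
import xml.etree.ElementTree as ET


def extract_facts(ixbrl_text):
    """Return (concept name, contextRef, value) tuples for each numeric fact."""
    root = ET.fromstring(ixbrl_text)
    facts = []
    for el in root.iter():
        # Match on the local tag name so either iXBRL namespace version works.
        if not el.tag.endswith("}nonFraction"):
            continue
        raw = "".join(el.itertext()).replace(",", "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # e.g. a dash standing for zero; skipped in this sketch
        value *= 10 ** int(el.get("scale", "0"))  # scale="3" means thousands
        if el.get("sign") == "-":
            value = -value
        facts.append((el.get("name"), el.get("contextRef"), value))
    return facts


# Made-up sample; the concept names are illustrative only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Current assets <ix:nonFraction name="core:CurrentAssets" contextRef="cur" unitRef="GBP" decimals="0">12,345</ix:nonFraction></p>
    <p>Net liabilities <ix:nonFraction name="core:NetAssetsLiabilities" contextRef="cur" unitRef="GBP" decimals="0" sign="-">678</ix:nonFraction></p>
  </body>
</html>"""

facts = extract_facts(SAMPLE)
for name, context, value in facts:
    print(name, context, value)
```

Because each figure carries its own tag, it doesn't matter which page it was printed on. You would still need to use contextRef to tell current-year figures from prior-year ones, and some filings may not parse cleanly with a strict XML parser.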
Here’s a government guide to the requirements for filing accounts information:
There’s a general guide to XBRL from the UK Government here (has some more format-specific details):
Hope this helps.