PDF Downloaded through CHS API not getting opened

abhijeet.kate · February 16, 2023, 11:26am

Hi all,

I am able to download a PDF file using the CHS Document API by calling the correct endpoint with the document ID as input. However, the downloaded file is only 1 KB, whereas the actual file is more than 2 MB. The 1 KB file also cannot be opened; I tried both Adobe and a browser.
Please help to resolve if you can.

Thanks

voracityemail · February 16, 2023, 11:34am

First - did you check the contents of what you have downloaded? I suspect you’ll find it’s e.g. an error message.

The fetch documents API has some features which mean that some people find it difficult to get working on their first attempt. I’ve a post in the following thread which covers the whole process - hope this helps.

abhijeet.kate · February 17, 2023, 12:35pm

Thanks for responding Vora-
Yes, the file size is 1 KB, and when I open it in Notepad I can see only file metadata such as the date, type of document etc. However, the actual file is more than 2 MB and has different data.
I couldn’t see any error. It just seems the file is being compressed at the CHS end and is not being decompressed when it is received at our end. (That is my assumption - correct me if I am wrong.)

voracityemail · February 17, 2023, 8:07pm

Did you follow the steps as e.g. set out in my linked post? (Again - using curl for this may help because it’s really clear what is going on, you can get a more verbose output of the actual communication etc.)

Did you read the documentation on the Document API?

If you do that then you will be able to see which step is not working as you expect and why - e.g. “I requested the document metadata using url https://document-api.company-information.service.gov.uk/document/MzM1Mzk0MDk4MmFkaXF6a2N4 and I get a JSON response (response data), I parse that, ensure that the content type I want (e.g. PDF - application/pdf) is available, request the document using the link in the document metadata (link) but then I get back (some response)”

If you are getting back metadata it sounds like you have only got as far as requesting the entry from the Filing History List, or only requesting the document metadata. There are additional steps to get the actual document data as my comment notes.

If you’re certain you’re requesting the data in the correct way then you could try posting back here exactly what you’re doing, what document you’re trying to download (e.g. Companies House document ref / url), the URL you request to download the document (as the actual filings are hosted on Amazon servers) and the data you get back. Don’t post your own Companies House API key / username and password though!

abhijeet.kate · February 20, 2023, 12:01pm

Thanks for your response.
Please find below the URL used to download the document.
“https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY”
got this document ID from metadata column. i am able to download document but it just has metadata and file size is just 1 KB. please guide.
Also, just to let you know, we are not using curl here; we are calling those URLs from Blue Prism Studio (an RPA tool).

voracityemail · February 20, 2023, 12:40pm

As should be clear from the above - it sounds like you’re getting to the last step but one. Downloading the document metadata is downloading the metadata. After that you need to parse that data to get the correct URL to download the document content.

Please see my linked post - the steps:

(3) Request the actual document, specifying the mime type (e.g. “application/pdf”).
(4) CH send back a response which is a redirect (http 302) to the document. The documents are stored on Amazon servers ATM.
(5) Request this URI from Amazon, again passing the content type you want.
(6) Amazon send the actual document data.

Again this is why I recommend using curl initially - it’s easier to see what is happening.

You requested:

https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY

That should give you a JSON response, like shown below (I’ve snipped some parts marked “…”)

{
    "company_number": "00048839",
    ...
    "links": {
        "self": ...
        "document": "https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content"
    },
    "resources": {
        "application/pdf": {
            "content_length": 437669
        }
    }
}

You need to check the “resources” member for available data types (mime types), then request the URL given in “document”. I think the resource type defaults to the given one if there is only one, but it makes sense to always specify which one you want - especially if there is more than one type of resource. You do that - as the documentation says - by setting it in the http “Accept” header.
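As a rough illustration of that check, here is a minimal Python sketch. It assumes the metadata has already been fetched and parsed into a dict shaped like the response above; the function name `choose_content_request` and the trimmed sample are made up for this example and are not part of the Companies House API.

```python
import json


def choose_content_request(metadata: dict, wanted: str = "application/pdf") -> tuple[str, dict]:
    """Return (url, headers) for the document content request.

    Raises ValueError if the wanted mime type is not listed under "resources".
    """
    resources = metadata.get("resources", {})
    if wanted not in resources:
        raise ValueError(f"{wanted} not available; offered: {sorted(resources)}")
    # The content URL is the "document" link, i.e. the one ending in /content.
    url = metadata["links"]["document"]
    return url, {"Accept": wanted}


# Trimmed sample containing only the fields quoted in the response above.
sample = json.loads("""
{
  "company_number": "00048839",
  "links": {
    "document": "https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content"
  },
  "resources": {"application/pdf": {"content_length": 437669}}
}
""")

url, headers = choose_content_request(sample)
print(url)
print(headers)
```

Failing loudly when the type is missing makes it obvious whether the problem is the metadata step or the download step.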

So I’d request:

curl -u YOUR_API_KEY_HERE: -H "Accept: application/pdf" https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content

Note the “/content” at the end there - that tells Companies House you want the actual data.

You do not get back the actual data immediately. You get a redirect to Amazon’s AWS servers. So that’s an http 302:

< HTTP/1.1 302 Found
< Date: Mon, 20 Feb 2023 12:20:25 GMT
< Location: https://s3.eu-west-2.amazonaws.com/document-api-images-live.ch.gov.uk/docs/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/application-pdf?...

(Again some data snipped here “…” to save space)

Depending on your language / tool you may have to catch that redirect URL in your own code. This is because if your tool / code incorrectly sends the Companies House authorisation (e.g. API key) to Amazon (e.g. by automatically following the link), Amazon will send you an error. You will also need to request the link from Amazon quickly as it is time-limited, i.e. it will expire at some point.
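If your tool cannot do this for you, one way to handle the redirect by hand is sketched below in Python using only the standard library. This is an illustration under assumptions, not official client code: `NoRedirect`, `basic_auth_header` and `fetch_document` are names invented here, and the auth header simply mirrors what `curl -u YOUR_API_KEY_HERE:` sends (API key as username, empty password).

```python
import base64
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib following redirects so the 3xx response can be handled by hand."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # with no new request, urllib raises HTTPError for the 3xx


def basic_auth_header(api_key: str) -> str:
    # HTTP Basic auth with the key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return f"Basic {token}"


def fetch_document(content_url: str, api_key: str, mime_type: str = "application/pdf") -> bytes:
    """Request the /content URL, then fetch the redirect target without the API key."""
    opener = urllib.request.build_opener(NoRedirect)
    first = urllib.request.Request(
        content_url,
        headers={"Authorization": basic_auth_header(api_key), "Accept": mime_type},
    )
    try:
        with opener.open(first) as resp:
            return resp.read()  # no redirect: the server sent the data directly
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location")
        err.close()
        if err.code not in (301, 302, 303, 307, 308) or not location:
            raise
    # Second request goes to the storage host: same Accept, no Authorization header.
    target = urllib.parse.urljoin(content_url, location)
    second = urllib.request.Request(target, headers={"Accept": mime_type})
    with urllib.request.urlopen(second) as resp:
        return resp.read()
```

Calling `fetch_document(content_url, api_key)` with the “/content” URL should return the raw document bytes; do the second request straight away, since the redirect link expires.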

So in your example I did this and correctly downloaded a PDF file of 437669 bytes.

Good luck.

abhijeet.kate · February 24, 2023, 7:12am

Thanks for your response and suggestions.

This is what I got in response when I called the Document API with the /content endpoint.

%PDF-1.1
%��
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R

endobj
2 0 obj
<<
/CreationDate (D:20210612050113)
/ModDate (D:20210612050113)
/Producer (libtiff / tiff2pdf - 20150912)
/Creator (go-tiff2pdf)

endobj
3 0 obj
<<
/Type /Pages
/Kids [ 4 0 R 9 0 R 14 0 R 19 0 R 24 0 R 29 0 R 34 0 R 39 0 R
44 0 R ]
/Count 9

endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/MediaBox [0.0000 0.0000 595.0000 842.0000]
/Contents 5 0 R
/Resources <<
/XObject <<
/Im1 7 0 R >>
/ProcSet [ /ImageB ]

endobj
5 0 obj
<<
/Length 6 0 R

stream
q 595.0000 0.0000 0.0000 832.3821 0.0000 4.8089 cm /Im1 Do Q

endstream
endobj
6 0 obj
62
endobj
7 0 obj
<<
/Length 8 0 R
/Type /XObject
/Subtype /Image
/Name /Im1
/Width 2504
/Height 3503
/BitsPerComponent 1
/ColorSpace /DeviceGray
/Filter /CCITTFaxDecode /DecodeParms << /K -1 /Columns 2504 /Rows 3503>>

stream
�Y�Q��z��|��_��sT��1�d1D��mM�ק��
�*r��B�_��>7�ȱ�j�Y�N�|\V'gڝ�%�Z�(��+=��Ы�DF�v7�WC,��r��dh��MZ٭t�U-rܛF?�r��)��@ҕ��x��G�7XMSL��,�jJ��MT��ep��TM� �TT��(Zt�B�,�J�hXw��^�s�>��Ɗh-q��gN�w�Y�b��n�PA�w��YeK�Y��i ��h?��Q��-0$Y��5��q\<KY�5oWOڏ�e�rq�gg�p8��B��9Pd�r��B$@�Ú�u��~��A�%g��9aع �Qs0�A�3��A��A��k fr��"pM��a4a3 d0g�#L’�N�[�i�wi�A��i��i��z��SN&��×�w��O(�4M�R��˺��Ne��G�7�~�8rЯT�Z�;˷2D𛧧��P7A�A�|6��&�|eA��i�|e�tN�9i��/֛I�m’K�I�o�O��"��dh� ��I��gWN�zi’ik��}}��N�ΆjyOܘO4!��%��U�!��_�u�Jg�l �z3��!p>;T��O�T��/�խ%�It�˿w�˵#�� S��]�MݢP��-R�K e˴�6҈h��8��-

voracityemail · February 24, 2023, 11:40am

Congratulations. That looks like the start of a PDF, does it not?

abhijeet.kate · February 28, 2023, 8:46am

Yes, but the question remains the same: how do I get the actual PDF? I am not getting the AWS URL in the response.

Thanks

voracityemail · February 28, 2023, 9:59am

@abhijeet.kate - I’m a bit confused. You said initially “i am able to download document but it just has metadata and file size is just 1 KB.” You then clarified that you called the endpoint:

https://document-api.company-information.service.gov.uk/document/{the document id number}/content

… and you posted what looks like a text dump of the (binary) initial content of a PDF.
To me, that looks like you succeeded.
What are you saying is the issue?
That sounds like you are saying “I download a PDF but I am only getting the start of the file e.g. 1KB” - is that correct?
Did you try to open the downloaded file in a PDF editor and if so what happens?
Did you accidentally corrupt the output e.g. by printing it to screen then trying to copy-paste into a file to save it? You don’t want to display the bytes of a PDF as text, you need to save them as received then open in a PDF editor!
Is only part of the content actually being downloaded e.g. the output is truncated - so you get the first part of the PDF file and then it is cut short? Can you check if this is actually your software doing this or is it actually downloading the whole data but only showing you the first part (because you have the system set up to display the output in a text window)?
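To rule out the copy-paste and truncation cases, a small check like the following Python sketch may help. It assumes you already have the response body as bytes; `save_and_check_pdf` is a hypothetical helper, and the checks are only heuristics (a PDF normally starts with `%PDF-` and has an `%%EOF` marker near the end), with the expected size taken from “content_length” in the document metadata.

```python
from pathlib import Path
from typing import Optional


def save_and_check_pdf(data: bytes, path: str, expected_length: Optional[int] = None) -> list[str]:
    """Write the bytes unchanged and return a list of problems found (empty if none)."""
    # Binary write: no text decoding and no newline translation.
    Path(path).write_bytes(data)

    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("does not start with %PDF- (may be JSON metadata or an error message)")
    if b"%%EOF" not in data[-1024:]:
        problems.append("no %%EOF marker near the end (download may be truncated)")
    if expected_length is not None and len(data) != expected_length:
        problems.append(f"got {len(data)} bytes, metadata said {expected_length}")
    return problems
```

An empty list does not prove the file is valid, but any entry in it tells you which of the questions above to look at first.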

I would suggest you check your tool / environment (Blue Prism Studio?) to see if there are any settings there which are limiting the size of the download.

As I suggested before if you have issues:
a) I strongly recommend you follow the process through using some tool where you can do one step at a time (like curl)
This is so you can:
i) ensure that this actually works using the particular URLs you’re using! and
ii) if there is an issue you can see exactly where it occurs. Curl at least allows you to display extra information about every step (via the verbose flag).
Requesting the document/{the document id number}/content endpoint triggers at least two steps. See my previous posts! However, to repeat: the first is that the Companies House server sends back a redirect (to Amazon Web Services - the address is in the http header). You then download the document from there. (There is nothing to stop AWS also sending a redirect too of course!)
It looks like your tool is automatically following the redirect. That’s fine - and indeed you seem to be receiving a PDF. That’s why I’m confused as you did not say clearly what the problem is. If you do have problems you may need to stop it doing so (e.g. catch the redirect in the http header)
If none of the last paragraph makes sense then you may need to learn a bit about the http transport protocol (Mozilla has some helpful info, or Wikipedia of course).

b) If you still cannot make this work then, again, the more information the better.
I probably will not be able to help but someone may. So if I were you I would list:

the software / system / environment you are using e.g. “Using Blue prism studio”
how you’re using it e.g. “I make the request using the e.g. ResourcePC HTTP interface, with parameters x y z” (NOTE: do not list your actual API key here). This should include the Companies House / other URIs you’re requesting e.g. “https://document-api.company-information.service.gov.uk/document/vqCGbDqGHA8ckavsX1nssaN03JVhwAzF7i0p7fXEcuY/content” or whatever
the outcomes e.g. "The program downloads a file ‘foo-bar.pdf’ which is x bytes in size. When I try to open this with e.g. Adobe DC reader I get the error ‘…’ ", or “I see some text in a window (which looks like the start of a PDF file) but is only y bytes long. I tried pasting this into a text package and saving it but …”

Good luck.

abhijeet.kate · March 13, 2023, 8:06am

Hey, thanks for your guidance. I am now able to download the PDF file and it opens correctly. Thank you so much for your support, it helped me a lot.

Thanks