Many AR01 (annual returns) PDFs are internally 'bad' (corrupt)

We have noticed for some time that a large number of AR01s from Companies House cause problems when we extract the images from them.

A number of tools (including ImageMagick and Hipchat) end up showing an image of “noise” - just a mesh of black and white dots.

The PDF file appears to be corrupt. A PDF ends with a section called startxref, which gives a byte offset indicating where to find the actual xref table (as denoted by the xref keyword). In these files that offset is wrong. The 2015 Annual Return filing for Imperial Innovations Group PLC (company number 05796766, dated 26/04/2015) demonstrates this problem.

Here is the broken xref section from the file:

xref
0 59
0000000000 65535 f
0000000016 00000 n
0000000068 00000 n
0000000217 00000 n
0000000352 00000 n
0000000528 00000 n
0000000647 00000 n
0000000665 00000 n
0000000908 00000 n
0000000925 00000 n
0000001103 00000 n
0000001224 00000 n
0000001243 00000 n
0000001488 00000 n
0000001506 00000 n
0000001685 00000 n
0000001806 00000 n
0000001825 00000 n
0000002070 00000 n
0000002088 00000 n
0000002267 00000 n
0000002388 00000 n
0000002407 00000 n
0000002652 00000 n
0000002670 00000 n
0000002849 00000 n
0000002970 00000 n
0000002989 00000 n
0000003234 00000 n
0000003252 00000 n
0000003431 00000 n
0000003552 00000 n
0000003571 00000 n
0000003816 00000 n
0000003834 00000 n
0000004013 00000 n
0000004134 00000 n
0000004153 00000 n
0000004398 00000 n
0000004416 00000 n
0000004595 00000 n
0000004716 00000 n
0000004735 00000 n
0000004980 00000 n
0000004998 00000 n
0000005177 00000 n
0000005298 00000 n
0000005317 00000 n
0000005562 00000 n
0000005580 00000 n
0000005760 00000 n
0000005882 00000 n
0000005901 00000 n
0000006147 00000 n
0000006165 00000 n
0000006345 00000 n
0000006467 00000 n
0000006486 00000 n
0000006732 00000 n
trailer
<<
/Size 59
/Root 1 0 R
/Info 2 0 R
/ID[<515C14B84197369703F7B647538B1788><515C14B84197369703F7B647538B1788>]

startxref
6750

The correct offset in this file is actually around 13000 and not 6750 as stated above.
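
The mismatch can be checked mechanically. What follows is only a minimal Python sketch: the function name `check_startxref` is made up for illustration. It assumes a classic xref table (not a cross-reference stream). It also assumes the last standalone `xref` keyword in the file marks the real table, which could be fooled if those bytes happen to occur inside stream data.

```python
import re


def check_startxref(data: bytes) -> dict:
    """Compare the declared startxref offset with where 'xref' actually sits."""
    # A reader uses the last startxref entry in the file.
    matches = list(re.finditer(rb"startxref\s+(\d+)", data))
    if not matches:
        raise ValueError("no startxref found")
    declared = int(matches[-1].group(1))

    # A classic xref table begins with the keyword 'xref' at that offset.
    points_at_xref = data[declared:declared + 4] == b"xref"

    # Find the last standalone 'xref'; the lookbehind skips 'startxref'.
    found = [m.start() for m in re.finditer(rb"(?<!start)xref", data)]
    actual = found[-1] if found else None

    return {"declared": declared, "actual": actual, "ok": points_at_xref}
```

If the description above is right, running this over the Imperial Innovations file should report `declared` as 6750, `actual` somewhere near 13000, and `ok` as False.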

Running pdftk over the PDF to regenerate the xref offsets ‘fixes’ it, and the images can then be converted and extracted properly.
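
For anyone wanting to script that step, here is a rough Python sketch. It assumes pdftk is installed and on the PATH, and it uses the plain `pdftk <input> output <output>` form, which rewrites the file. The helper names `pdftk_repair_command` and `repair_pdf` are hypothetical, not part of pdftk.

```python
import shutil
import subprocess
from pathlib import Path


def pdftk_repair_command(src, dst):
    """Build the pdftk command line that rewrites src into dst."""
    return ["pdftk", str(src), "output", str(dst)]


def repair_pdf(src, dst):
    """Run pdftk over src and return the path of the rewritten copy."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(pdftk_repair_command(src, dst), check=True)
    return Path(dst)
```

Writing to a separate output file keeps the original download untouched, which is handy when comparing the two afterwards.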

Also, if you open this file in Acrobat Reader, it offers to save the file when you close it, even though you have not changed anything. It does this because Acrobat Reader automatically fixes the issue by rebuilding the correct indices.
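
Rebuilding the indices amounts, roughly, to rescanning the file for object headers and recording where each one starts. The Python sketch below illustrates that idea only; it is not a description of how Acrobat Reader actually does it. The function names are made up. It assumes newline-terminated lines and generation number 0, writes any missing object numbers as unlinked free entries, and could be misled by header-like bytes inside stream data.

```python
import re

# 'N G obj' at the start of a line marks the beginning of an indirect object.
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def scan_object_offsets(data: bytes) -> dict:
    """Map object number -> byte offset of its 'N G obj' header."""
    offsets = {}
    for m in OBJ_HEADER.finditer(data):
        # If an object number appears twice, the later definition wins.
        offsets[int(m.group(1))] = m.start()
    return offsets


def build_xref_table(offsets: dict) -> bytes:
    """Format a classic xref table from the scanned offsets."""
    size = max(offsets) + 1
    lines = [b"xref", b"0 %d" % size, b"0000000000 65535 f "]
    for num in range(1, size):
        if num in offsets:
            # 10-digit offset, 5-digit generation, 'n' for an in-use entry.
            lines.append(b"%010d 00000 n " % offsets[num])
        else:
            # Simplification: gaps become free entries without a free list.
            lines.append(b"0000000000 65535 f ")
    return b"\n".join(lines) + b"\n"
```

Comparing the scanned offsets with the entries in a file's existing xref table is also a quick way to see whether the table entries themselves are wrong, or only the startxref value.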

The creator of the PDF is go-tiff2pdf as shown by the following attribute:
/Creator (go-tiff2pdf)

This is a project hosted under the Companies House GitHub account.

Are you aware of this problem and do you have any plans to fix it / fix up bad PDFs?


I raised the same issue back in June (http://forum.aws.chdev.org/t/pdf-compliance-issue-inc-rendering-in-firefox/109).

Whatever Firefox uses to render PDFs is unable to deal with the corrupt xref. A message was added to beta search results warning about using Firefox. I just had a look and the documents I had issues with seem to be rendering in Firefox now (although the warning message is still there).

So I don’t know if they fixed something or if Firefox improved its handling of corrupt PDFs. If you’re still encountering issues with this, it might indicate the latter.

When I encountered the problem, it was all electronically filed documents that were affected.

Would it be possible to get some kind of response or update on this issue?

My concern, which I raised directly and got no response to, is that now the documents render in Firefox there will be less incentive to fix the corrupt documents. However, it is still a major issue for any PDF rendering engine that doesn’t automatically recreate these tables when they are corrupt.

Ash,

We have made some changes which we believe have corrected the issue for new images, and have performed some initial tests, but it would be useful if you could confirm that they are OK with your particular renderer. We have not performed any back data correction, and this will be low on the backlog because it’s not causing issues for the majority of our customers.

Thanks,

Mark.

Hi Mark,

‘for new images’

Should I interpret that as any PDF for any document/filing created after 3pm GMT on Monday 16th Nov 2015?

If you could give us some specific company names / filings or URLs we’ll check.

Regards,

Nick

Nick,

The fix has been in for around a month, as we wished to perform some checks internally before announcing. Some examples are:

"company_number" : "OC402937",
"category" : "incorporation",
"links" : {
    "self" : "/company/OC402937/filing-history/MzEzNTM1NDMwMWFkaXF6a2N4",
    "document_metadata" : "/document/xt5FXjpF5qxZRrj1dSCID9hcPZmsFAinrw6Ds9HV2SQ"
}

"company_number" : "01028948",
"date" : ISODate("2015-11-13T15:30:48.000Z"),
"category" : "capital",
"description" : "legacy",
"type" : "SH20",
"links" : {
    "self" : "/company/01028948/filing-history/MzEzNTEzNTE5OWFkaXF6a2N4",
    "document_metadata" : "/document/ctQFZMfS9TvKt6LN4dT6FW49EruBYogYtfuv_aUQ9N4"
}

"company_number" : "04692977",
"original_description" : "Final gazette: dissolved ex-liquidated",
"data" : {
    "date" : ISODate("2015-11-13T12:00:13.000Z"),
    "category" : "gazette",
    "links" : {
        "self" : "/company/04692977/filing-history/MzEzNTExODc5N2FkaXF6a2N4",
        "document_metadata" : "/document/M5wXUK4M1RDKJnFgpOGc5MTqY5TD7_Fb3rT5L4ZCbf0"
    },
    "type" : "GAZ2"
}

Thanks

Mark.

Some of the PDFs for the latest full accounts documents are still corrupt. For example, the latest accounts form for the Royal Bank of Scotland (document_id: 4qatZjlfmp7aPfa3u8y7d04o9c_MwCjuX8BQeBtzl5g) hits an XREF table error when I try to process it using ImageMagick.

Has the fix been implemented?
