PDF Compliance Issue (inc rendering in Firefox)

The following will be a big issue for us going forward: if it’s not resolved we’re going to need to buy new internal PDF components (and those aren’t cheap), plus do all the coding, across a lot of legacy systems, to switch from one to the other.

Ignore references to attachments as this is copied from an email:

On the developer forum someone commented that some files do not render in Firefox (Doc images do not render in Firefox). Chris Smith responded that this was a known issue related to Firefox’s PDF rendering and that it was being investigated.

However, I’ve been noticing some odd results with some files that suggest to me that this might be a bigger problem. For me it shows up as errors when trying to manipulate documents using a third-party PDF component (Aspose.pdf). It’s worth noting that this is not universal: I’ve tried some other third-party services that don’t have problems with these files.

It’s also worth noting that not all images have problems rendering in Firefox. Some are fine. I can’t say for certain but I think the distinction is between documents that have been filed electronically and documents that have been scanned.

For the purpose of this I’m looking at two documents from the filing history of XYZ LIMITED (02533344) (https://beta.companieshouse.gov.uk/company/02533344/filing-history)

The first is the first item in the filing history, an annual return submitted electronically on the first of June. I’m going to call this DOCUMENT E.

The second is the second item in the filing history, an MR04 Satisfaction of Charge submitted on paper in mid-May this year and scanned. I’m going to call this DOCUMENT S.

DOCUMENT E
Renders in Chrome
Renders in IE
Does not render in Firefox
Downloaded from beta API - Crashes Aspose.pdf
Downloaded from CH Direct - Does not crash Aspose.pdf

DOCUMENT S
Renders in Chrome
Renders in IE
Renders in Firefox
Downloaded from beta API - Does not crash Aspose.pdf
Downloaded from CH Direct - Does not crash Aspose.pdf

I think the last results are the key. The version of Document E from the API causes problems, the version from CH Direct does not. I’m calling these Eapi and Edirect and have attached them to this email.

There’s not much difference between them: a few bytes, and they have been produced by different versions of libtiff/tiff2pdf.

I ran them both (as well as the API version of Document S) through the PDF validator at pdf-tools.com. The results are meaningless to me as I don’t speak PDF, but it’s clear that neither the scanned API document nor the electronically filed CH Direct document has the issues that the electronically filed API document has.

Validating file “Sapi.pdf” [SCANNED DOCUMENT DOWNLOADED THROUGH API] for conformance level pdfa-1b
The key Metadata is required but missing.
A device-specific color space (DeviceGray) without an appropriate output intent is used.
The document does not conform to the requested standard.
The document contains device-specific color spaces.
The document’s meta data is either missing or inconsistent or corrupt.
Done.

Validating file “Edirect.pdf” [ELECTRONIC DOC DOWNLOADED THROUGH CH DIRECT] for conformance level pdfa-1b
The key Metadata is required but missing.
A device-specific color space (DeviceGray) without an appropriate output intent is used.
The document does not conform to the requested standard.
The document contains device-specific color spaces.
The document’s meta data is either missing or inconsistent or corrupt.
Done.

Validating file “Eapi.pdf” [ELECTRONIC DOC DOWNLOADED THROUGH API] for conformance level pdfa-1b
The ‘xref’ keyword was not found or the xref table is malformed.
The file trailer dictionary is missing or invalid.
The key Metadata is required but missing.
The “Length” key of the stream object is wrong.
The separator before ‘endstream’ must be an EOL. (6)
A device-specific color space (DeviceGray) without an appropriate output intent is used.
The “Length” key of the stream object is wrong.
The “Length” key of the stream object is wrong.
The “Length” key of the stream object is wrong.
The “Length” key of the stream object is wrong.
The “Length” key of the stream object is wrong.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document contains device-specific color spaces.
The document’s meta data is either missing or inconsistent or corrupt.
Done.

Based on all that my tentative conclusion is that there is an issue with PDF documents that are being generated by the new API system where the original form was submitted electronically.

This issue manifests itself as some kind of corruption/non-conformance in some PDF tools (including Firefox, Aspose.pdf and the pdf-tools.com validator) but not others (Chrome, IE, Adobe).

The same issue does not seem to affect documents generated by the new API when the original document is scanned, or documents downloaded through CH Direct.

Hope that helps troubleshoot it as this has potential to be a major issue for us.

Additional:

If you open the files in a text editor you can see that they’re nearly identical. However, there is one big difference. The files are made up of a series of objects that contain metadata and sometimes a stream. Following each stream object there is an object that contains only a single integer. They always seem to come in pairs:

17 0 obj
<<
/Length 18 0 R
/Type /XObject
/Subtype /Image
/Name /Im3
/Width 1650
/Height 2200
/BitsPerComponent 1
/ColorSpace /DeviceGray
/Filter /CCITTFaxDecode /DecodeParms << /K -1 /Columns 1650 /Rows 2200>>
STREAM DATA I’VE CUT OUT
endstream
endobj
18 0 obj
7474
endobj

I’m pretty sure the integer is the length of the stream.

However in the API version of the file this integer is always zero. This would make sense given the repeating “The “Length” key of the stream object is wrong.” error message when validating.
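A minimal sketch of how that check could be automated, assuming Python and the layout shown above (a "/Length N 0 R" reference that is resolved by a separate integer-only object). The function name check_stream_lengths is made up for illustration, and the regular expressions are a rough approximation, not a real PDF parser:

```python
import re

# An object whose whole body is one integer, e.g. "18 0 obj\n7474\nendobj".
INT_OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s+(\d+)\s+endobj")

# A dictionary with an indirect /Length, followed by its stream data.
# Assumption: the data is everything between the EOL after "stream" and
# the (optional) EOL before "endstream".
STREAM_RE = re.compile(
    rb"/Length\s+(\d+)\s+(\d+)\s+R\b.*?stream(?:\r\n|\n)(.*?)(?:\r\n|\n|\r)?endstream",
    re.DOTALL,
)


def check_stream_lengths(data: bytes):
    """Return a list of ((obj_num, gen), declared_length, actual_length)."""
    # Map every integer-only object to its value.
    ints = {(int(n), int(g)): int(v) for n, g, v in INT_OBJ_RE.findall(data)}
    results = []
    for m in STREAM_RE.finditer(data):
        ref = (int(m.group(1)), int(m.group(2)))
        declared = ints.get(ref)      # None if the length object is missing
        actual = len(m.group(3))      # bytes actually present in the stream
        results.append((ref, declared, actual))
    return results


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            for ref, declared, actual in check_stream_lengths(fh.read()):
                flag = "OK" if declared == actual else "MISMATCH"
                print(ref, declared, actual, flag)
```

If my reading above is right, running this against the API version of Document E should flag every stream as a mismatch with a declared length of 0, while the CH Direct version should show matching values.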

There are other issues as well, but it’s a start!

Also, if I run the file through the Repair PDF tool at PDF Tools Online, the repaired file will open in Firefox, work with Aspose.pdf, etc.

Some further investigation…

The problem documents (electronically filed and retrieved through the API) are not taking into account the length of the various stream objects in the document. I assume this is happening when the TIFF is converted to a PDF.

This manifests itself in three places:

  • The length value that appears in the object following each stream object (showing as 0 rather than the actual length)
  • The byte offsets of the objects listed in the xref table
  • The startxref value (the byte offset of the xref table itself)

The latter two values are probably derived from the first, so all the offsets are wrong because it thinks the streams are all zero length. You can actually go into the source of the PDF and manipulate these values to correct them. However, it’s a messy process and shouldn’t really be necessary.
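The offset problem can be checked in the same spirit. This is only a sketch under some assumptions: Python, a classic (uncompressed) xref table with a single subsection, and object headers that start at the beginning of a line. The function names are hypothetical:

```python
import re


def actual_offsets(data: bytes):
    """Byte offset at which each 'N G obj' header really starts."""
    pattern = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")
    return {int(m.group(1)): m.start() for m in pattern.finditer(data)}


def declared_offsets(data: bytes):
    """Position of the 'xref' keyword and the offsets its table declares."""
    head = re.search(rb"(?m)^xref\s+(\d+)\s+(\d+)\s+", data)
    if head is None:
        return None, {}
    first, count = int(head.group(1)), int(head.group(2))
    # Each entry looks like "0000000009 00000 n"; 'f' marks a free entry.
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", data[head.end():])[:count]
    table = {
        first + i: int(offset)
        for i, (offset, _gen, kind) in enumerate(entries)
        if kind == b"n"
    }
    return head.start(), table


def check_xref(data: bytes):
    """Return (wrong_entries, (declared_startxref, actual_xref_position))."""
    xref_pos, table = declared_offsets(data)
    actual = actual_offsets(data)
    # Keep only the objects whose declared offset differs from the real one.
    wrong = {
        num: (offset, actual.get(num))
        for num, offset in table.items()
        if actual.get(num) != offset
    }
    found = re.findall(rb"startxref\s+(\d+)", data)
    startxref = int(found[-1]) if found else None  # the last one wins
    return wrong, (startxref, xref_pos)
```

If the streams really are being counted as zero length, I would expect every object after the first stream to show a declared offset smaller than its actual one, and the declared startxref to fall short of the real position of the xref keyword by the same running total.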

I think most third-party programs that can open/render these documents are ignoring the corrupt xref table and locating the objects on the fly. This means, in theory, that these documents will take longer to open and render, so even in situations where the document can be used it’s not ideal.

Thank you for going to such effort. This information has been added to our own investigation, which is ongoing. Primarily we are trying to identify the cause to prevent further PDF problems, and intend to put in place a solution to resolve the current issue.

Thanks Chris. In the meantime I’ve figured out a workaround for Aspose (essentially, loading the document, saving it to a stream and then loading it again will fix the xref table and allow me to merge etc. without error), so even if it continues to be a known issue with regards to Firefox, we can still work with these files.

Does anyone have any feedback on this issue from Companies House? This seems like a relatively extensive issue, and I wanted to see if there was any suggestion of a fix being put in place by Companies House.