Document Images

dereka · July 9, 2015, 12:51pm

Ian, thanks for the replies.

When sending a GET request to https://document-api.companieshouse.gov.uk/document/pvvT0RkxQzCDgzOY86TAfuDu9zoVDquEJ5gLcTepWLo/content

We get the following response:

Status
400 Bad Request (loading time: 334)
Request headers
Accept: application/pdf
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36
CSP: active
Authorization: Basic
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Response headers
x-amz-request-id: E1418A026E390222
x-amz-id-2: NqPA0NwmyPqGYco+fYAXUXHyDvDZl251QOEdMa6gfxHasCT2WVikmYzBu4rfP28VwmlxI2+hoQw=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 09 Jul 2015 12:44:58 GMT
Connection: close
Server: AmazonS3

With body:

<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>InvalidArgument</Code>
  <Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
  <ArgumentName>Authorization</ArgumentName>
  <ArgumentValue>Basic </ArgumentValue>
  <RequestId>E1418A026E390222</RequestId>
  <HostId>NqPA0NwmyPqGYco+fYAXUXHyDvDZl251QOEdMa6gfxHasCT2WVikmYzBu4rfP28VwmlxI2+hoQw=</HostId>
</Error>

When I mentioned not wanting to make 2 requests, I was referring to the /content API. I would expect to be able to make one call to get the PDF, rather than having to capture the headers from the redirect and look for a Location header.

iankent · July 9, 2015, 1:03pm

This still looks like an authorization error: both the original Authorization header for the Document API and the Signature query string parameter are present in the request to Amazon S3.

Even with redirects disabled, you need to make sure you’re not reusing the original request to the Document API when requesting the actual resource from S3 - or at the least, remove the Authorization header from the original request before reusing it.

This is by design - instead of us proxying the content from S3 via the Document API, we return a redirect to the actual document. In most cases this works as expected - requesting the /content endpoint and following the redirect automatically will return the PDF content.

Unfortunately, this is dependent on your HTTP client library - some clients include the Authorization header on the redirected request while others don’t. The correct behaviour (in this instance) is to ensure the Authorization header is removed before following the redirect. If this isn’t how your HTTP client library works, then it means an additional step to follow the redirect manually.
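As a rough illustration of the header handling described above, the following Java sketch copies a set of request headers while dropping Authorization, so the copy can be used for the request to the redirect target. The class name RedirectHeaders and the method withoutAuthorization are made up for this example; they are not part of the Document API or of any HTTP client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectHeaders {

    /**
     * Returns a copy of the original request headers without the
     * Authorization header. Header names are compared case-insensitively.
     * (Hypothetical helper, for illustration only.)
     */
    static Map<String, String> withoutAuthorization(Map<String, String> original) {
        Map<String, String> copy = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : original.entrySet()) {
            if (!"Authorization".equalsIgnoreCase(header.getKey())) {
                copy.put(header.getKey(), header.getValue());
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> original = new LinkedHashMap<>();
        original.put("Accept", "application/pdf");
        original.put("Authorization", "Basic <credentials>"); // placeholder value
        // Only the Accept header survives for the redirected request.
        System.out.println(withoutAuthorization(original).keySet()); // prints [Accept]
    }
}
```

The signed URL returned in the redirect already carries its own authentication in the query string, which is why the second request should not send the original credentials as well.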

dereka · July 9, 2015, 1:04pm

So how do we obtain the redirect URL? It is not present in the response we are getting from the first call.

iankent · July 9, 2015, 1:10pm

It should be in the response from the Document API.

Looking at the earlier question - you sent a GET request to https://document-api.companieshouse.gov.uk/document/pvvT0RkxQzCDgzOY86TAfuDu9zoVDquEJ5gLcTepWLo/content, then got a 400 response with a Server: AmazonS3 header - this suggests the client has followed the redirect automatically (and sent the Authorization header with it).

If the redirect hasn’t been followed you should get a 302 response from the Document API, which should include the redirect as a Location header.
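A minimal sketch of that two-step flow in Java, using only the JDK's HttpURLConnection, might look like the following. It assumes Java 9 or later; the class and method names (DocumentContentFetcher, redirectTarget, requestLocation, download) are invented for this example, and the Authorization value is whatever your client already sends to the Document API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Optional;

public class DocumentContentFetcher {

    /** Returns the Location value if the status is a 3xx redirect that carries one. */
    static Optional<String> redirectTarget(int status, String locationHeader) {
        boolean isRedirect = status >= 300 && status < 400;
        if (isRedirect && locationHeader != null && !locationHeader.isEmpty()) {
            return Optional.of(locationHeader);
        }
        return Optional.empty();
    }

    /** Step 1: call the /content endpoint with credentials, without following the redirect. */
    static Optional<String> requestLocation(String contentUrl, String authorizationValue)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(contentUrl).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 302 instead of following it
        conn.setRequestProperty("Authorization", authorizationValue);
        conn.setRequestProperty("Accept", "application/pdf");
        try {
            return redirectTarget(conn.getResponseCode(), conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    /** Step 2: fetch the document from the redirect target with no Authorization header. */
    static byte[] download(String location) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(location).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would pass the /content URL to requestLocation and, if a Location value comes back, hand it to download. An empty result means the first response was not a redirect with a Location header, so it is worth checking the status code and the Server header of that response.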

dereka · July 9, 2015, 3:26pm

To disable the follow-redirects option using the Spring RestTemplate, we would need to switch to a different ClientHttpRequestFactory, and that then causes SSL handshake errors for us.

The current implementation makes the document part of the API almost incompatible with the Java Spring RestTemplate: http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/web/client/RestTemplate.html

Is there no way for us to make the call to your document content endpoint and receive the document in one call? I am not sure of the reason for, or benefit of, needing to make 2 separate calls when, as a consumer, I am not really concerned with the file hosting solution.

adrian_calder · November 18, 2015, 4:26pm

I have been asked by my dev team to find out how we can ‘match up’ documents (i.e. the filing history line item and image) that we have pulled into our web app through CH Direct or the XMLGW Output Service over the past few years with the new free document images available through the new API. Back in 2012 we were told to use the non-advertised unique identifier “Companies House ID”, so that is what currently distinguishes each of the thousands of docs in our web app. However, the “Companies House ID” does not appear to be available through the new API, only a unique identifier referred to as a ‘Transaction ID’. As a result, we can’t match the docs up at present, which is a key requirement for implementing our phased migration away from CHD / XMLGW to the new API.

mfairhurst · November 25, 2015, 10:21am

Adrian,

Unfortunately there is no data published that allows documents to be matched between the API and CHD or XMLGW output service and we never considered this as a requirement. I can add this to the backlog for consideration but it wouldn’t be a priority.

Thanks

Mark.