Document Images

iankent · June 4, 2015, 8:14am

Sorry about that - we’ve missed the link from the menu.

The Document API documentation is available here:

ash · June 8, 2015, 12:41pm

Should these calls be working or is it only available on the search page at present?

I haven’t been able to get either the Metadata or the Document call to work either in code or through the test pages (eg: https://developer.companieshouse.gov.uk/document/docs/document/id/fetchDocumentMeta.html)

For the latter it may be something to do with the fact that it is asking for authorisation as well as having the application dropdown. In code I’m just getting 404 errors.

I’m operating on the assumption that the document’s unique ID is the Transaction ID in the Filing History item. Is this correct?

Also the page about metadata ( https://developer.companieshouse.gov.uk/document/docs/document/id/fetchDocumentMeta.html) has a trailing “/metadata” in the GET URL in the example, but not elsewhere in the documentation.

iankent · June 9, 2015, 7:49am

These calls should be working, though our documentation may be a bit unclear.

The process to retrieve an image is:

1. Fetch the filing history resource for the company you want - e.g. https://api.companieshouse.gov.uk/company/00002065/filing-history
2. Get the links.document_metadata URL from a filing history resource, which is a fully qualified URL pointing to a specific image on the Document API
3. Request the Document metadata using the provided document_metadata URL - e.g. https://document-api.companieshouse.gov.uk/document/AnDp9GA5JHgybR-BeUMb4n4D9QhDP4_4N25uPi4aB5M
4. From the metadata, get an available resource type, e.g. application/pdf
5. Request the content by appending /content, e.g. https://document-api.companieshouse.gov.uk/document/AnDp9GA5JHgybR-BeUMb4n4D9QhDP4_4N25uPi4aB5M/content

application/pdf is returned as the default resource. If other resource types are returned in the metadata, these can be requested from the same /content URL by setting the Accept header, e.g. Accept: application/xhtml+xml.
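
The URL handling in the steps above can be sketched in Python roughly as follows. This is only an illustration, not an official client: the function names are hypothetical, and the "items" and "resources" field names are assumptions about the JSON layout (only links.document_metadata is named in this thread).

```python
from typing import List, Optional


def document_metadata_urls(filing_history: dict) -> List[str]:
    """Collect links.document_metadata from each filing history item.

    Assumes the items sit under an "items" key (an assumption);
    items without a document link are skipped.
    """
    urls = []
    for item in filing_history.get("items", []):
        url = item.get("links", {}).get("document_metadata")
        if url:
            urls.append(url)
    return urls


def content_url(document_metadata_url: str) -> str:
    """Append /content to a document_metadata URL."""
    return document_metadata_url.rstrip("/") + "/content"


def pick_content_type(metadata: dict, preferred: str = "application/pdf") -> Optional[str]:
    """Choose a value for the Accept header from the metadata.

    Assumes the available types are the keys of a "resources" object
    (an assumption). Returns None when nothing is listed.
    """
    resources = metadata.get("resources", {})
    if preferred in resources:
        return preferred
    return next(iter(resources), None)
```

The value returned by pick_content_type would be sent as the Accept header on the request to the URL built by content_url.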

The API key you use for https://api.companieshouse.gov.uk should also work for https://document-api.companieshouse.gov.uk, and will be rate limited independently.

‘Explore this API’ for Document API on the Developer site doesn’t currently work as expected, but the API endpoints and returned content should match the documentation.

hywel_bromby · June 9, 2015, 12:37pm

I’m afraid I cannot get anything but the following by following the steps outlined above.

If I use the URL returned in links.document_metadata (http://document-api-prod-gaigve6x7y.elasticbeanstalk.com/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content )
Or
If I construct the URL using your example (https://document-api.companieshouse.gov.uk/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content)

The remote server returned an error: (403) Forbidden

Using the documentation example section

GET document/gh438fghd09euthg8294ughehwieugh397/content HTTP/1.1
Host: undefined
Authorization: Basic bXlfYXBpX2tleTo=
Accept: application/pdf

all I get is

0 Error: Access is denied.

Do I need to register MY API key separately for this somehow?

iankent · June 9, 2015, 1:27pm

You won’t need to register the API key anywhere, if it works for the API it should also work for the Document API - but you’ll need to send the API key in the Authorization header using Basic Authentication.

An example with curl (note the colon after the API key):
curl -v -u your-api-key: https://document-api.companieshouse.gov.uk/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content
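
For anyone building the header by hand rather than using curl, the -u your-api-key: form corresponds to Basic Authentication with the API key as the username and an empty password. A minimal Python sketch (the function name is hypothetical):

```python
import base64


def basic_auth_header(api_key: str) -> str:
    """Build the Authorization header value for an API key.

    The key is the username and the password is empty, hence the
    trailing colon before base64 encoding.
    """
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return "Basic " + token
```

For the key my_api_key this gives Basic bXlfYXBpX2tleTo=, the same value shown in the documentation example quoted earlier in the thread.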

The “Explore this API” section is currently broken as it attempts to use an invalid hostname (“document”), which we’ll need to investigate, but using the API from code or command line should still work.

hywel_bromby · June 9, 2015, 1:38pm

I have been (successfully) using the rest of the API (with not a curl to be seen) for a while now.

Method(URL, Accept)

All I am doing is sending a new URL, and the relevant Accept statement. No luck yet.

Has anyone else had any luck getting it working?

iankent · June 9, 2015, 2:50pm

I can’t see any reason it wouldn’t work for the Document API if it works for other API endpoints - the authorisation mechanism is the same.

Could you confirm if the curl request against Document API works using your API key, or any alternative command line tool? Just so we can eliminate the API key being a problem - if you use something other than curl, could you provide the full command used (but without the API key) so I can try to reproduce the problem locally.

The full curl command I’m using is this:
curl -v -u api-key: http://document-api-prod-gaigve6x7y.elasticbeanstalk.com/document/reM3Rsb08Cze0jMeKyVjPuAamiNEuXYTVYFSLSudtS4/content
which should return a link to the image for an AR01 for Lloyds Bank PLC.

ash · June 9, 2015, 3:12pm

Thanks Ian, I managed to get both calls working based on that (“that” being your earlier post, not all the CURL stuff!)

stuart_boulton · June 17, 2015, 9:06am

Hi Ian,

I have been able to get the filing history and the document metadata easily, without using curl, but I am not able to access the document image, and I am getting a different error from any mentioned here:

Filing History:

GET https://api.companieshouse.gov.uk/company/FC013908/filing-history
Headers:
Authorization: Basic
Response:
200 Ok

Document Metadata:

GET https://document-api.companieshouse.gov.uk/document/IkkBE_KCgZM5V5jzLKKqtN-EQnJ2wbfSQEhjSq8HURg
Headers:
Authorization: Basic
Response: 200 Ok

Document Content:

GET https://document-api.companieshouse.gov.uk/document/IkkBE_KCgZM5V5jzLKKqtN-EQnJ2wbfSQEhjSq8HURg/content
Headers:
Authorization: Basic
Accept: “application/pdf”
Response: 400
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidArgument</Code>
<Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>Basic cDVDM1NTT1JhWlEwdzdLQ08xX2MtRVpIOGlGcEUwSFFkeWNLeHFSNzo=</ArgumentValue>
<RequestId>661E800659049616</RequestId>
<HostId>nLkGjNc8FotcfOy37bRM92AizPm0PlQp2iJhtEiy8XQhrb9jVmpHYI/oeQXNHoNJLw7jKoCr7TY=</HostId>
</Error>

iankent · June 19, 2015, 7:00am

This happens when an HTTP client library automatically follows the redirect and sends the original Authorization header with the redirected request.

The solution is to prevent the HTTP client from following redirects, retrieve the Location header from the response, then create a new request to that URL without the Authorization header.

For example, in .NET you can do something like this:

using System;
using System.Net;
using System.Text;

string docID = "...";
string apiKey = "...";
string url = "https://document-api.companieshouse.gov.uk/document/" + docID + "/content";

// First request: authenticate to the Document API, but do not follow the redirect automatically.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Headers["Authorization"] = "Basic " + Convert.ToBase64String(Encoding.Default.GetBytes(apiKey + ":"));
request.AllowAutoRedirect = false;

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// Second request: go to the URL from the Location header, with no Authorization header set.
HttpWebRequest docRequest = (HttpWebRequest)WebRequest.Create(response.Headers["Location"]);
// ...
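
The same approach can be sketched in Python using only the standard library. This is an illustration under assumptions rather than a tested client: the names NoRedirect, without_authorization and fetch_content are hypothetical, and the list of redirect status codes is a guess at what might be returned (the thread only mentions 302).

```python
import base64
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None leaves the 3xx unhandled, so urllib raises HTTPError.
        return None


def without_authorization(headers: dict) -> dict:
    """Copy the headers, dropping Authorization (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "authorization"}


def fetch_content(url: str, api_key: str, accept: str = "application/pdf") -> bytes:
    """Request /content with the API key, then fetch the redirect target without it."""
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    headers = {"Authorization": "Basic " + token, "Accept": accept}

    opener = urllib.request.build_opener(NoRedirect)
    try:
        # If no redirect is issued, the body comes straight back.
        with opener.open(urllib.request.Request(url, headers=headers)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code not in (301, 302, 303, 307, 308):
            raise
        location = err.headers["Location"]
        if not location:
            raise

    # Second request: the redirect target, with the Authorization header removed.
    second = urllib.request.Request(location, headers=without_authorization(headers))
    with opener.open(second) as response:
        return response.read()
```

The second request reuses the non-following opener, so a further redirect would surface as an error rather than being followed silently.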

dereka · July 9, 2015, 11:33am

Even when we disable following redirects we still get a 400 response. I am confused as to why calling the documented API causes a 400 at all, but even when we look at the headers for this response there is no Location header, only these:

x-amz-request-id: 9D0ACAD76BA8091A
x-amz-id-2: WigvCqlWrcBOgXIkkoepsjGVpc3gCLOGZCD2vNPVskCU39LfkK8m1ovbso4FXSENgVUYZoBqdz0=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 09 Jul 2015 11:16:27 GMT
Connection: close
Server: AmazonS3

We are using the Java Spring RestTemplate

dereka · July 9, 2015, 12:00pm

I also don’t want to have to make two requests. There should be a way of obtaining the document via the Document API using only the URL returned from the filing history call.

iankent · July 9, 2015, 12:26pm

Since this is a response from Amazon S3, a 400 response is likely to be authorisation related, but without seeing the full response from Amazon it’s difficult to say. Did the response body give any further information?

If you assume that a PDF image will always exist for every document, you could theoretically bypass the metadata call and request the /document/{document_id}/content resource directly (essentially taking the document_metadata link and appending /content).

However, there may be a small number of documents which either have no resource or which only have resources of a different content type, which would then return a 406 Not Acceptable response instead.
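
One way to combine the shortcut with a fallback is sketched below in Python. It is an illustration only: fetch_with_fallback is a hypothetical name, the "resources" field in the metadata is an assumption, and get stands for whatever HTTP helper you already have. It is assumed to take a URL and an optional Accept value, handle authentication and the redirect, and return the final status code and body.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical HTTP helper: (url, accept) -> (status code, body bytes).
Get = Callable[[str, Optional[str]], Tuple[int, bytes]]


def fetch_with_fallback(metadata_url: str, get: Get) -> Tuple[str, bytes]:
    """Try the PDF directly; on 406, consult the metadata for other types."""
    content = metadata_url.rstrip("/") + "/content"

    # Shortcut: skip the metadata call and ask for the PDF.
    status, body = get(content, "application/pdf")
    if status == 200:
        return "application/pdf", body
    if status != 406:
        raise RuntimeError(f"unexpected status {status} from {content}")

    # 406 Not Acceptable: see which resource types exist, if any.
    status, raw = get(metadata_url, None)
    if status != 200:
        raise RuntimeError(f"unexpected status {status} from {metadata_url}")
    resources = json.loads(raw).get("resources", {})  # assumed field name
    for content_type in resources:
        if content_type == "application/pdf":
            continue  # already tried above
        status, body = get(content, content_type)
        if status == 200:
            return content_type, body
    raise LookupError("no retrievable resource for " + metadata_url)
```

In the common case this costs one call to the Document API; the metadata is only requested for the small number of documents without a PDF.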

dereka · July 9, 2015, 12:51pm

Ian, thanks for the replies.

When sending a GET request to https://document-api.companieshouse.gov.uk/document/pvvT0RkxQzCDgzOY86TAfuDu9zoVDquEJ5gLcTepWLo/content

We get the following response:

Status
400 Bad Request
Request headers
Accept: application/pdf
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36
CSP: active
Authorization: Basic
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Response headers
x-amz-request-id: E1418A026E390222
x-amz-id-2: NqPA0NwmyPqGYco+fYAXUXHyDvDZl251QOEdMa6gfxHasCT2WVikmYzBu4rfP28VwmlxI2+hoQw=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 09 Jul 2015 12:44:58 GMT
Connection: close
Server: AmazonS3

With body:

<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidArgument</Code>
<Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>Basic</ArgumentValue>
<RequestId>E1418A026E390222</RequestId>
<HostId>NqPA0NwmyPqGYco+fYAXUXHyDvDZl251QOEdMa6gfxHasCT2WVikmYzBu4rfP28VwmlxI2+hoQw=</HostId>
</Error>

When I mentioned not wanting to make two requests, I was referring to the /content endpoint. I would expect to be able to make one call to get the PDF, rather than having to read the headers of the redirect response and look for a Location header.

iankent · July 9, 2015, 1:03pm

This still looks like an authorization error: both the original Authorization header for the Document API and the Signature query string parameter are present in the request to Amazon S3.

Even with redirects disabled, you need to make sure you’re not reusing the original request to the Document API when requesting the actual resource from S3 - or at the least, remove the Authorization header from the original request before reusing it.

This is by design - instead of us proxying the content from S3 via the Document API, we return a redirect to the actual document. In most cases this works as expected - requesting the /content endpoint and following the redirect automatically will return the PDF content.

Unfortunately, this is dependent on your HTTP client library - some clients include the Authorization header on the redirected request while others don’t. The correct behaviour (in this instance) is to ensure the Authorization header is removed before following the redirect. If this isn’t how your HTTP client library works, then it means an additional step to follow the redirect manually.

dereka · July 9, 2015, 1:04pm

So how do we obtain the redirect URL? It is not present in the response we are getting from the first call.

iankent · July 9, 2015, 1:10pm

It should be in the response from the Document API.

Looking at the earlier question - you sent a GET request to https://document-api.companieshouse.gov.uk/document/pvvT0RkxQzCDgzOY86TAfuDu9zoVDquEJ5gLcTepWLo/content, then got a 400 response with a Server: AmazonS3 header - this suggests the client has followed the redirect automatically (and sent the Authorization header with it).

If the redirect hasn’t been followed you should get a 302 response from the Document API, which should include the redirect as a Location header.
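
As a quick check on which of these two situations you are in, the status code and headers of the response can be inspected along these lines. This is a rough Python sketch based only on the responses posted in this thread; the function name and the returned labels are hypothetical.

```python
def diagnose(status: int, headers: dict) -> str:
    """Guess where a /content response came from, using status and headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 302 and "location" in lowered:
        # The Document API's own redirect: fetch the Location URL next,
        # without the Authorization header.
        return "redirect-not-followed"
    if lowered.get("server") == "AmazonS3":
        # The client has already followed the redirect to S3.
        return "redirect-followed"
    return "unknown"
```

With the headers posted above (400 with Server: AmazonS3 and no Location header) this reports that the redirect was already followed by the client.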

dereka · July 9, 2015, 3:26pm

To disable the follow redirects option using the Spring RestTemplate we would need to switch to a different ClientHttpRequestFactory, and that then causes us SSL handshake errors.

The current implementation makes the document part of the API almost incompatible with the Java Spring RestTemplate: http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/web/client/RestTemplate.html

Is there no way for us to make the call to your document content endpoint and receive the document in one call? I am not sure of the reason/benefit of needing to make 2 separate calls when as a consumer I am not really concerned with the file hosting solution?

adrian_calder · November 18, 2015, 4:26pm

I have been asked by my dev team to find out how we can match up documents (i.e. the filing history line item and image) that we have pulled into our web app through CH Direct or the XMLGW Output Service over the past few years with the free document images available through the new API. Back in 2012 we were told to use the non-advertised unique identifier “Companies House ID”, so that is what distinguishes each of the thousands of documents in our web app. However, the “Companies House ID” does not appear to be available through the new API, only a unique identifier referred to as a ‘Transaction ID’, so at present we can’t match the documents up. This is a key requirement for our phased migration away from CHD / XMLGW to the new API.

mfairhurst · November 25, 2015, 10:21am

Adrian,

Unfortunately there is no data published that allows documents to be matched between the API and CHD or XMLGW output service and we never considered this as a requirement. I can add this to the backlog for consideration but it wouldn’t be a priority.

Thanks

Mark.