Document Images

danny · June 9, 2015, 7:10am

Is there a method for ‘pulling’ document images that have been identified from the ‘filing history’ method?

csmith · February 26, 2015, 11:17am

Absolutely!
This is currently in development and will be scheduled for a future release.
We will be releasing further data sets and enhancements incrementally as we develop them, and we move out of beta.

danny · February 26, 2015, 11:17am

Thanks Chris.
That’s good to know.
Will you announce new features etc. in this forum or in the developer hub page(s)?

Danny

hywel_bromby · April 7, 2015, 3:21pm

Any idea when this will become available?

danny · May 14, 2015, 2:09pm

Hi,

Has there been any development on this?

Cheers

hywel_bromby · June 1, 2015, 10:55am

Hi,

Whilst the images seem to have appeared on the beta site, and are returned in the JSON of the API, I am currently not able to get the images from the API. Is this an error on my part, or is this not available yet?

hywel_bromby · June 2, 2015, 3:32pm

items[].links.document_metadata string
“Link to the document metadata associated with this filing history item. See the Document API documentation for more details.”

Where is the “Document API documentation” mentioned in the help?

iankent · June 4, 2015, 8:14am

Sorry about that - we’ve missed the link from the menu.

The Document API documentation is available here:

ash · June 8, 2015, 12:41pm

Should these calls be working or is it only available on the search page at present?

I haven’t been able to get either the Metadata or the Document call to work, in code or through the test pages (e.g. https://developer.companieshouse.gov.uk/document/docs/document/id/fetchDocumentMeta.html).

For the latter it may be something to do with the fact that it is asking for authorisation as well as having the application dropdown. In code I’m just getting 404 errors.

I’m operating on the assumption that the document’s unique ID is the Transaction ID in the Filing History item. Is this correct?

Also, the page about metadata (https://developer.companieshouse.gov.uk/document/docs/document/id/fetchDocumentMeta.html) has a trailing “/metadata” in the GET URL in the example, but not elsewhere in the documentation.

iankent · June 9, 2015, 7:49am

These calls should be working, though our documentation may be a bit unclear.

The process to retrieve an image is:

1. Fetch the filing history resource for the company you want, e.g. https://api.companieshouse.gov.uk/company/00002065/filing-history
2. Get the links.document_metadata URL from a filing history resource, which is a fully qualified URL pointing to a specific image on the Document API
3. Request the Document metadata using the provided document_metadata URL, e.g. https://document-api.companieshouse.gov.uk/document/AnDp9GA5JHgybR-BeUMb4n4D9QhDP4_4N25uPi4aB5M
4. From the metadata, get an available resource type, e.g. application/pdf
5. Request the content by appending /content, e.g. https://document-api.companieshouse.gov.uk/document/AnDp9GA5JHgybR-BeUMb4n4D9QhDP4_4N25uPi4aB5M/content
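
As a rough illustration of the steps above, here is a minimal Python sketch that pulls the document_metadata links out of an already-parsed filing history response and derives the matching /content URLs. The function names and the sample data are made up for this example; only the items[].links.document_metadata field and the /content suffix come from this thread.

```python
def document_metadata_links(filing_history):
    """Collect items[].links.document_metadata from a parsed filing history response."""
    found = []
    for item in filing_history.get("items", []):
        url = item.get("links", {}).get("document_metadata")
        if url:  # items without the link are skipped
            found.append(url)
    return found


def content_url(document_metadata_url):
    """Append /content to a document_metadata URL."""
    return document_metadata_url.rstrip("/") + "/content"


# Illustrative data only: a cut-down filing history with one image link.
sample = {
    "items": [
        {"links": {"document_metadata": "https://document-api.companieshouse.gov.uk/document/AnDp9GA5JHgybR-BeUMb4n4D9QhDP4_4N25uPi4aB5M"}},
        {"links": {}},
    ]
}
for link in document_metadata_links(sample):
    print(content_url(link))
```

Both requests (metadata and content) still need the API key; the sketch only covers the URL handling.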

application/pdf is returned as the default resource. If other resource types are returned in the metadata, these can be requested from the same /content URL by setting the Accept header, e.g. Accept: application/xhtml+xml.
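
To make the Accept header choice concrete, a small sketch follows. It assumes the metadata JSON exposes the available types as the keys of a "resources" object; that field name is an assumption rather than something confirmed in this thread, so check it against a real metadata response. choose_accept is a hypothetical helper.

```python
def choose_accept(metadata, preferred=("application/pdf", "application/xhtml+xml")):
    """Return the first preferred content type listed in the metadata, or None.

    Assumption: metadata["resources"] is an object keyed by content type.
    """
    available = metadata.get("resources", {})
    for content_type in preferred:
        if content_type in available:
            return content_type
    return None


# Illustrative metadata only; real responses carry more fields.
example_metadata = {"resources": {"application/xhtml+xml": {}}}
headers = {"Accept": choose_accept(example_metadata)}
print(headers)
```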

The API key you use for https://api.companieshouse.gov.uk should also work for https://document-api.companieshouse.gov.uk, and will be rate limited independently.

‘Explore this API’ for Document API on the Developer site doesn’t currently work as expected, but the API endpoints and returned content should match the documentation.

hywel_bromby · June 9, 2015, 12:37pm

I’m afraid that following the steps outlined above gets me nothing but the error below.

If I use the URL returned in links.document_metadata (http://document-api-prod-gaigve6x7y.elasticbeanstalk.com/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content)
Or
If I construct the URL using your example (https://document-api.companieshouse.gov.uk/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content)

The remote server returned an error: (403) Forbidden

Using the documentation example section

GET document/gh438fghd09euthg8294ughehwieugh397/content HTTP/1.1
Host: undefined
Authorization: Basic bXlfYXBpX2tleTo=
Accept: application/pdf

all I get is

0 Error: Access is denied.

Do I need to register MY API key separately for this somehow?

iankent · June 9, 2015, 1:27pm

You won’t need to register the API key anywhere, if it works for the API it should also work for the Document API - but you’ll need to send the API key in the Authorization header using Basic Authentication.

An example with curl (note the colon after the API key):
curl -v -u your-api-key: https://document-api.companieshouse.gov.uk/document/tcQKo6XWJ1_aaU3WjD8kTqoSqCV-KcnGjwJl91hy9no/content
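
For anyone not using curl, the same header can be built by hand. This is a minimal Python sketch of what the -u option does with the key and the trailing colon; basic_auth_header is a hypothetical helper name and "my_api_key" is a placeholder, not a real key.

```python
import base64


def basic_auth_header(api_key):
    """Build the Authorization header value: the key is the username, the password is empty."""
    credentials = (api_key + ":").encode("ascii")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("my_api_key"))  # prints: Basic bXlfYXBpX2tleTo=
```

The output for the placeholder key matches the Authorization value in the documentation example quoted earlier in this thread.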

The “Explore this API” section is currently broken as it attempts to use an invalid hostname (“document”), which we’ll need to investigate, but using the API from code or command line should still work.

hywel_bromby · June 9, 2015, 1:38pm

I have been (successfully) using the rest of the API (with not a curl to be seen) for a while now.

Method(URL, Accept)

All I am doing is sending a new URL, and the relevant Accept statement. No luck yet.

Has anyone else had any luck getting it working?

iankent · June 9, 2015, 2:50pm

I can’t see any reason it wouldn’t work for the Document API if it works for other API endpoints - the authorisation mechanism is the same.

Could you confirm whether the curl request against the Document API works using your API key, or with any alternative command line tool? That would let us rule out the API key as the problem. If you use something other than curl, could you provide the full command used (but without the API key) so I can try to reproduce the problem locally?

The full curl command I’m using is this:
curl -v -u api-key: http://document-api-prod-gaigve6x7y.elasticbeanstalk.com/document/reM3Rsb08Cze0jMeKyVjPuAamiNEuXYTVYFSLSudtS4/content
which should return a link to the image for an AR01 for Lloyds Bank PLC.

ash · June 9, 2015, 3:12pm

Thanks Ian, I managed to get both calls working based on that (“that” being your earlier post, not all the CURL stuff!)

stuart_boulton · June 17, 2015, 9:06am

Hi Ian,

I have been able to get the filing history and document metadata simply, without using any curl, but I am not able to access the document image and I am getting a different error from any mentioned here:

Filing History:

GET https://api.companieshouse.gov.uk/company/FC013908/filing-history
Headers:
Authorization: Basic
Response:
200 Ok

Document Metadata:

GET https://document-api.companieshouse.gov.uk/document/IkkBE_KCgZM5V5jzLKKqtN-EQnJ2wbfSQEhjSq8HURg
Headers:
Authorization: Basic
Response: 200 Ok

Document Content:

GET https://document-api.companieshouse.gov.uk/document/IkkBE_KCgZM5V5jzLKKqtN-EQnJ2wbfSQEhjSq8HURg/content
Headers:
Authorization: Basic
Accept: “application/pdf”
Response: 400
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>InvalidArgument</Code>
  <Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
  <ArgumentName>Authorization</ArgumentName>
  <ArgumentValue>Basic cDVDM1NTT1JhWlEwdzdLQ08xX2MtRVpIOGlGcEUwSFFkeWNLeHFSNzo=</ArgumentValue>
  <RequestId>661E800659049616</RequestId>
  <HostId>nLkGjNc8FotcfOy37bRM92AizPm0PlQp2iJhtEiy8XQhrb9jVmpHYI/oeQXNHoNJLw7jKoCr7TY=</HostId>
</Error>

iankent · June 19, 2015, 7:00am

This happens when an HTTP client library automatically follows the redirect and sends the original Authorization header with the redirected request.

The solution is to prevent the HTTP client from following redirects, retrieve the Location header from the response, then create a new request to that URL without the Authorization header.

For example, in .Net you can do something like this:

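using System;
using System.Net;
using System.Text;

string docID = "...";
string apiKey = "...";
string url = "https://document-api.companieshouse.gov.uk/document/" + docID + "/content";

// Send the API key as the Basic auth username with an empty password (note the trailing colon).
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Headers["Authorization"] = "Basic " + Convert.ToBase64String(Encoding.Default.GetBytes(apiKey + ":"));
// Stop the client following the redirect, so the Authorization header is not forwarded.
request.AllowAutoRedirect = false;

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// Request the redirect target in a new request that carries no Authorization header.
HttpWebRequest docRequest = (HttpWebRequest)WebRequest.Create(response.Headers["Location"]);
// ...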
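
The same two-request flow can be sketched in Python with only the standard library. This is an illustration under assumptions, not an official client: get_once and fetch_content are hypothetical names, the set of redirect status codes handled is a guess at what the API might return, and the get parameter exists only so the transport can be swapped out.

```python
import base64
import http.client
from urllib.parse import urlsplit

DOCUMENT_API = "https://document-api.companieshouse.gov.uk"


def get_once(url, headers):
    """Send one GET and return (status, Location header, body).

    http.client does not follow redirects by itself, so a 3xx comes back as-is.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=30)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target, headers=headers)
        response = conn.getresponse()
        return response.status, response.getheader("Location"), response.read()
    finally:
        conn.close()


def fetch_content(api_key, document_id, accept="application/pdf", get=get_once):
    """Request /content with the API key, then follow the redirect without it."""
    token = base64.b64encode((api_key + ":").encode("ascii")).decode("ascii")
    url = DOCUMENT_API + "/document/" + document_id + "/content"
    status, location, body = get(url, {"Authorization": "Basic " + token, "Accept": accept})
    if status in (301, 302, 303, 307, 308) and location:
        # New request to the redirect target, deliberately with no Authorization header.
        status, location, body = get(location, {})
    return status, body
```

Calling fetch_content("your-api-key", "document-id") would return the final status code and the response body; anything other than 200 should be treated as a failure by the caller.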

dereka · July 9, 2015, 11:33am

Even when we disable following the redirects, we still get a 400 response. I am confused as to why calling the documented API causes a 400 anyway, but even when we look at the headers for this response, there is no Location header, only these:

x-amz-request-id: 9D0ACAD76BA8091A
x-amz-id-2: WigvCqlWrcBOgXIkkoepsjGVpc3gCLOGZCD2vNPVskCU39LfkK8m1ovbso4FXSENgVUYZoBqdz0=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 09 Jul 2015 11:16:27 GMT
Connection: close
Server: AmazonS3

We are using the Java Spring RestTemplate

dereka · July 9, 2015, 12:00pm

I also don’t want to have to make 2 requests. There should be a way of obtaining the document via the Document API using the URL returned from the filing history call alone.

iankent · July 9, 2015, 12:26pm

Since this is a response from Amazon S3, a 400 response is likely to be authorisation related, but without seeing the full response from Amazon it’s difficult to say. Did the response body give any further information?

If you assume that a PDF image will always exist for every document, you could theoretically bypass the metadata call and request the /document/{document_id}/content resource directly (essentially taking the document_metadata link and appending /content).

However, there may be a small number of documents which either have no resource or which only have resources of a different content type, which would then return a 406 Not Acceptable response instead.
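
A minimal sketch of that shortcut, with the 406 case handled explicitly, might look like the following. pdf_or_none is a hypothetical helper, and send(url, accept) is an assumed transport callable returning (status, body) that has already dealt with authentication and the redirect.

```python
def pdf_or_none(document_metadata_url, send):
    """Request the PDF directly, skipping the metadata call.

    send(url, accept) -> (status, body) is a hypothetical transport callable.
    """
    status, body = send(document_metadata_url.rstrip("/") + "/content", "application/pdf")
    if status == 200:
        return body
    if status == 406:
        # No PDF for this document; the metadata call would list what is available.
        return None
    raise RuntimeError("unexpected status %d" % status)
```

A None result is the cue to fall back to the metadata request and pick a different content type.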