Documents not downloading, invalid ID

Hi, I am using the following PHP code:

$url5="https://document-api.companieshouse.gov.uk/document/".$itemnew['transaction_id']."/content";
$ch5 = curl_init();
curl_setopt($ch5, CURLOPT_URL,$url5);
curl_setopt($ch5, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch5, CURLOPT_USERPWD, $api5.':');
curl_setopt($ch5, CURLOPT_HTTPHEADER,'Accept:application/pdf');
$result5 = curl_exec($ch5);
curl_close($ch5);
$jsonObj5 = json_decode($result5, true);
var_dump($jsonObj5);

But the response I’m getting is invalid ID, any ideas?

array(2) { ["error"]=> string(19) "Invalid document ID" ["type"]=> string(10) "ch:service" } array(2) { ["error"]=> string(19) "Invalid document ID" ["type"]=> string(10) "ch:service" } array(2) { ["error"]=> string(19) "Invalid document ID" ["type"]=> string(10) "ch:service" } array(2) { ["error"]=> string(19) "Invalid document ID" ["type"]=> string(10) "ch:service" } array(2) { ["error"]=> string(19) "Invalid document ID" ["type"]=> string(10) "ch:service" }

@jack,

What transaction ID are you using to request from the document-api endpoint? The following post may be of help?

Thanks

@mfairhurst

I’m getting the transaction_id field directly from the filing history.

I’m not looking to download images, I simply want a link to download the PDF of each filing. I have tried using the document_metadata too and I get the same issue so I’m at a loss!

@jack,

I have tested using the settings below with success.

The doc_url is the document_metadata link returned from a filing history request. If I modify the link (for example, remove the 8 from the end) then the request fails with an invalid document ID error.

Are you confident that the document URL is being passed correctly? Could you hardcode the link and test initially to see if it works?
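
As a minimal sketch of where the document URL comes from: the ID in a document-api URL is taken from the document_metadata link of a filing history item, not from transaction_id. The helper name document_metadata_urls() is hypothetical, and the sample JSON is trimmed from a filing history response posted in this thread.

```php
<?php
// Hypothetical helper: collect the document_metadata links from a decoded
// filing history response. Items that have no document link are skipped.
function document_metadata_urls(array $filingHistory): array
{
    $urls = [];
    foreach ($filingHistory['items'] ?? [] as $item) {
        if (isset($item['links']['document_metadata'])) {
            $urls[] = $item['links']['document_metadata'];
        }
    }
    return $urls;
}

// Sample data: one item with a document link, one without (illustrative).
$json = '{"items":[
  {"category":"officers","links":{
    "self":"/company/00002065/filing-history/MzE0ODU4NjgxOWFkaXF6a2N4",
    "document_metadata":"https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM"}},
  {"category":"officers","links":{"self":"/company/00002065/filing-history/EXAMPLE"}}
]}';

$urls = document_metadata_urls(json_decode($json, true));
print_r($urls);
```

Appending /content to one of these URLs gives the address to request the PDF from.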

Thanks

@mfairhurst

$doc_url = "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8";
curl_setopt($curl, CURLOPT_URL, $doc_url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept:application/pdf'));
curl_setopt($curl, CURLOPT_USERPWD,"my_api_key");
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

When I output the results, I simply get NULL. Full code below:

$url5 = "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8";

		$ch5 = curl_init();
		curl_setopt($curl, CURLOPT_URL, $url5);
		curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
		curl_setopt($curl, CURLOPT_VERBOSE, true);
		curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept:application/pdf'));
		curl_setopt($curl, CURLOPT_USERPWD,$api5);
		curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
		
		$result5 = curl_exec($ch5);
		curl_close($ch5);
					
		$jsonObj5 = json_decode($result5);
		var_dump($jsonObj5);

Hi Jack.
Just wanted you to know that you cannot get a direct link to the document that you can store for later (or repeated) retrieval. You will always have to call the document API to get the one-time-use, short-lived URL for the document.

The sequence is:

  1. Call filing history
  2. Call document api for transaction you are interested in.
    2.1 document api returns a 302 Moved response with a Location: header pointing at the document
  3. Get document by following the Location: URL. This can only be done once, without repeating 2, and must be done within a short time window, before it expires.
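
The sequence above could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: fetch_document() is a hypothetical helper name, $metadataUrl is the document_metadata link taken from the filing history, and the error handling is deliberately minimal. The redirect is handled by hand so that the two requests stay visible as separate steps.

```php
<?php
// Hypothetical helper, not part of any Companies House library.
function fetch_document(string $metadataUrl, string $apiKey): string
{
    // Step 2: request the content, but do NOT follow the redirect.
    $ch = curl_init($metadataUrl . '/content');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_USERPWD, $apiKey . ':'); // key as user, empty password
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/pdf'));
    curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $location = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // step 2.1
    curl_close($ch);

    if ($status !== 302 || !$location) {
        throw new RuntimeException("Expected a 302 with a Location header, got HTTP $status");
    }

    // Step 3: fetch the one-time URL straight away, with no API key.
    $ch = curl_init($location);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $pdf    = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($pdf === false || $status !== 200) {
        throw new RuntimeException("Download failed with HTTP $status");
    }
    return $pdf;
}
```

A call such as file_put_contents('filing.pdf', fetch_document($metadataUrl, $apiKey)) would then save the PDF; to fetch it again, the whole function has to be called again.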

Hope that helps design your logic.
Chris

I’ve fixed your script (you were using $curl when calling curl_setopt, but were defining $ch5, which leaves $curl undefined, so your PHP code will fall over).

#!/usr/bin/php
<?php
$url5 = "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8";
$api5 = "PUT YOUR API KEY HERE";

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url5);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept:application/pdf'));
curl_setopt($curl, CURLOPT_USERPWD,$api5);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

$result5 = curl_exec($curl);
curl_close($curl);

$jsonObj5 = json_decode($result5);
var_dump($jsonObj5);
?>

This will return all the information you may want about the document. To get the one-use, limited-time URL of the actual PDF document, do this (note the /content in $url5):

#!/usr/bin/php
<?php
$url5 = "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8/content";
$api5 = "PUT YOUR API KEY HERE";

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url5);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept:application/pdf'));
curl_setopt($curl, CURLOPT_USERPWD,$api5);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

$result5 = curl_exec($curl);
curl_close($curl);

$jsonObj5 = json_decode($result5);
var_dump($jsonObj5);
?>

Hope that helps!
Chris

Hi there Chris!

Thank you for your post to this thread.

I am working on this exact same problem. The document url that I am testing on is this:
https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s

When I run your first script I get this response:
object(stdClass)#13 (10) { ["company_number"]=> string(8) "04189193" ["barcode"]=> string(8) "X25L10E1" ["significant_date"]=> string(20) "2013-03-28T00:00:00Z" ["significant_date_type"]=> string(0) "" ["category"]=> string(14) "annual-returns" ["pages"]=> int(6) ["created_at"]=> string(30) "2015-02-01T05:13:42.502030013Z" ["etag"]=> string(0) "" ["links"]=> object(stdClass)#14 (2) { ["self"]=> string(95) "https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s" ["document"]=> string(103) "https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s/content" } ["resources"]=> object(stdClass)#15 (1) { ["application/pdf"]=> object(stdClass)#16 (1) { ["content_length"]=> int(63926) } } }

HOWEVER, when I add /content to the url I get: NULL

Any ideas?

@mfairhurst
@csmith

@deskildsen

To explain what is happening, I will walk through the sequence highlighted earlier in this thread, with further explanation and code snippets where applicable.

The sequence

  1. Call filing history
  2. Call document api for transaction you are interested in.
    2.1 document api returns a 302 Moved response with a Location: header pointing at the document
  3. Get document by following the Location: URL. This can only be done once, without repeating 2, and must be done within a short time window, before it expires.

Sequence Details

1. Call filing history
Search for the transaction you are interested in via the filing history endpoint. Given that you have a document you are testing with, I am going to assume that this step needs no further explanation.

2. Call document api for the transaction you are interested in
The reason for performing this step is to retrieve the metadata related to the transaction. This defines information such as the company number and the category but, more importantly, the content types that may be available. Currently the default is PDF (application/pdf) but future types will include XBRL/iXBRL (application/xhtml+xml) for accounts filings, for example.

This is the step you performed with the example code from the earlier post, and the response you posted is its output.

2.1 Call the document API to retrieve the one-time-use URL to access the document.
This is achieved by calling the “document” link returned from the request above. In the example response provided, it is the URL with /content on the end:

["links"]=> object(stdClass)#14 (2) { ["self"]=> string(95) "https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s" ["document"]=> string(103) "https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s/content" }

This is the step that returns NULL for you. When this URL is requested, it responds with a 302 redirect whose Location header is the one-time-use URL for the document. The code you are executing has been configured to follow these redirects automatically via:

> curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

so the code automatically jumps to the next step.

3. Get the document using the URL from the location header returned from 2.1
Retrieve the document, a PDF image in this example, using the one-time-use URL.

Given that the example code automatically follows the redirect, it has fetched the image as binary data and stored it in $result5. The code then treats this as JSON: it performs a json_decode and dumps the result, which is the NULL output (json_decode returns NULL when its input is not valid JSON). If you change the script above to echo $result5, you will see the “image” data output.
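
A small sketch of that behaviour, which runs without calling the API. The helper looks_like_pdf() is a hypothetical name, and the $result5 value here is only a stand-in for the bytes curl would return (PDF files begin with the characters %PDF-).

```php
<?php
// Hypothetical helper: true when the body starts with the PDF signature.
function looks_like_pdf(string $body): bool
{
    return strncmp($body, '%PDF-', 5) === 0;
}

// Stand-in for a downloaded document; a real PDF is far longer.
$result5 = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n";

var_dump(json_decode($result5));    // NULL, because this is not JSON
var_dump(looks_like_pdf($result5)); // bool(true)
```

One possible design is to check the body this way before deciding whether to json_decode it (metadata and error responses) or to write it to a file with file_put_contents (document content).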

To explain further, the following is a snippet of code (developed to run in a browser) performing 2.1 and 3.

<html>
 <head>
  <title>PHP Test</title>
 </head>
 <body>

<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://document-api.companieshouse.gov.uk/document/dSD6iyVGq4Nx5jnj0IOlqti5a7veemHTJREfRO0Gm6s/content');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_USERPWD,"<<YOUR_API_KEY>>");
#curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

$response = curl_exec($curl);
$redirect = curl_getinfo($curl, CURLINFO_REDIRECT_URL);  #Retrieve the re-direct URL
curl_close($curl);

echo $response;

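// Escape the URL for HTML, since its query string contains & characters
echo "<a href='" . htmlspecialchars($redirect, ENT_QUOTES) . "'>Click for image</a>";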
?>

</body>
</html>

As you can see, the CURLOPT_FOLLOWLOCATION line has been commented out, which means that we do not follow the redirect returned when we call the document-api with /content appended. We can then take the URL from the returned header and use it as a link to open the PDF in a browser.
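
The snippet above reads the redirect target with CURLINFO_REDIRECT_URL. If only the raw header text is available (for example the output produced with CURLOPT_HEADER enabled), the same value could be picked out along these lines. This is a sketch: location_from_headers() is a hypothetical helper, and the sample response uses a placeholder example.com URL.

```php
<?php
// Hypothetical helper: return the Location header value from a raw header
// block, or null if there is none. Header names are matched
// case-insensitively.
function location_from_headers(string $rawHeaders): ?string
{
    if (preg_match('/^Location:[ \t]*(.*?)[ \t]*\r?$/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Illustrative 302 response headers (placeholder URL).
$raw = "HTTP/1.1 302 Found\r\n"
     . "Content-Length: 0\r\n"
     . "Location: https://example.com/docs/abc/application-pdf?Expires=1464797195\r\n"
     . "Server: nginx/1.8.0\r\n\r\n";

$redirect = location_from_headers($raw);
echo $redirect, "\n";
```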

Hope this provides some further clarification

Thanks,

@mfairhurst

@mfairhurst

Thanks for your very clear step-by-step explanation. However, after following it I must say that I am still experiencing the same problem as @deskildsen described when GETting /content (when I do not provide an Authorization header), and also the same auth error message many others have described (when I do provide an Authorization header).

I am using the Postman Chrome extension to test the CH API.

You can assume that every GET call (including /content) sends the same Authorization header, that looks something close to this:

Authorization: Basic XzbhkdWNjcm5uMVFxTjJLR2ztSnpzVi1EUUJSSTRFaG1pT251lTF9aNjo=

With that, these are the GET requests/headers, and responses/headers I have:


**req**:
GET https://api.companieshouse.gov.uk/company/00002065/filing-history

**resp**:
Status: 200
Access-Control-Expose-Headers →Location,www-authenticate
Cache-Control →no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Length →14443
Content-Type →application/json
Date →Tue, 31 May 2016 13:33:55 GMT
Pragma →no-cache
X-Ratelimit-Limit →600
X-Ratelimit-Remain →599
X-Ratelimit-Reset →1464701935

{
  "start_index": 0,
  "items_per_page": 25,
  "total_count": 2270,
  "filing_history_status": "filing-history-available",
  "items": [
    {
      "category": "officers",
...
      "links": {
        "self": "/company/00002065/filing-history/MzE0ODU4NjgxOWFkaXF6a2N4",
        "document_metadata": "https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM"
      },
...

**req**:
GET https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM

**resp**:
Status: 200
Access-Control-Allow-Origin →*
Connection →keep-alive
Content-Encoding →gzip
Content-Length →298
Content-Type →application/json
Date →Tue, 31 May 2016 13:48:39 GMT
Server →nginx/1.8.0
X-Ratelimit-Limit →600
X-Ratelimit-Remaining →599
X-Ratelimit-Reset →1464702819

{
  "company_number": "00002065",
  "barcode": "X575BVCQ",
  "significant_date": null,
  "significant_date_type": "",
  "category": "officers",
  "pages": 1,
  "created_at": "2016-05-16T10:04:12.6749853Z",
  "etag": "",
  "links": {
    "self": "https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM",
    "document": "https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM/content"
  },
  "resources": {
    "application/pdf": {
      "content_length": 14696
    }
  }
}

**req**:
GET https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM/content

Accept: application/pdf

**resp**:
Status: 400
Connection →close
Content-Type →application/xml
Date →Tue, 31 May 2016 13:52:29 GMT
Server →AmazonS3
Transfer-Encoding →chunked
x-amz-id-2 →Y6rWZj7ijRTZCOHUQryo0AEqsmdgOa+eRP5Ii1wteuMxKft273PfpoaVGmP7Qrbh8q5BixSv7uQ=
x-amz-request-id →D1DD913A4BE20D0F

<?xml version="1.0" encoding="UTF-8"?>
<Error>
    <Code>InvalidArgument</Code>
    <Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
    <ArgumentName>Authorization</ArgumentName>
    <ArgumentValue>Basic XzhkdWNjcm5uMVFxTjJLRzhkSnpzVi1EUUJSSTRFaG1pT25lTF9aNjo=</ArgumentValue>
    <RequestId>D1DD913A4BE20D0F</RequestId>
    <HostId>Y6rWZj7ijRTZCOHUQryo0AEqsmdgOa+eRP5Ii1wteuMxKft273PfpoaVGmP7Qrbh8q5BixSv7uQ=</HostId>
</Error>

Thanks for helping out!

@spam2steve

The issue is that the Postman extension is following the redirect and sending the original authorisation header with both requests. So when you call the endpoint: -

> https://document-api.companieshouse.gov.uk/document/Wnzo1HaxKGzsajuSk-VXqPcZP6MuQ58ht-cUR8HEDpM/content 

the initial response is a 302 redirect with a one-time-use URL to retrieve the document provided in the Location header. The request to the one-time-use URL does not require the same basic authentication details, as it contains an x-amz-security-token as part of the URL, hence the error message.

To resolve this, you need to stop the redirect from being followed automatically, capture the Location value from the response headers, and then call this URL without the basic authentication. Postman provides an extension called Interceptor; once installed, it allows a setting to be configured that switches off automatically following redirects.
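
In code, the same rule could be expressed as a small function that decides which headers go with which request. This is a sketch under assumptions: headers_for() is a hypothetical helper, and matching on the companieshouse.gov.uk host name is just one way of telling the two requests apart.

```php
<?php
// Hypothetical helper: choose request headers for a URL. The Basic
// credentials are only attached for Companies House hosts, never for the
// one-time-use URL returned in the Location header.
function headers_for(string $url, string $apiKey): array
{
    $host    = (string) parse_url($url, PHP_URL_HOST);
    $headers = ['Accept: application/pdf'];
    if (preg_match('/(^|\.)companieshouse\.gov\.uk$/i', $host)) {
        // API key as the username, empty password.
        $headers[] = 'Authorization: Basic ' . base64_encode($apiKey . ':');
    }
    return $headers;
}

print_r(headers_for('https://document-api.companieshouse.gov.uk/document/abc/content', 'my_api_key'));
print_r(headers_for('https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/abc/application-pdf?Signature=x', 'my_api_key'));
```

The returned array can be passed to CURLOPT_HTTPHEADER for each of the two requests.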

The following forum post provides further details

Hope this helps

@mfairhurst

@mfairhurst

For the official record, installing Postman’s interceptor and getting the 302 response’s Location did help!

The response headers from an auth’ed GET .../content endpoint looked like this:

Status →302
Connection →keep-alive
Content-Length →0
Content-Type →text/plain; charset=utf-8
Date →Wed, 01 Jun 2016 16:05:35 GMT
Location →https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/nE4dUaWK86_U15KYuSdz0aVTfKVPy1XSEn8vYe4RCn8/application-pdf?AWSAccessKeyId=ASIAJU7G663WH5GJBUWQ&Expires=1464797195&Signature=9a5BXkhuPmmavx%2BtyZ2ov0se2Fw%3D&x-amz-security-token=FQoDYXdzEDEaDAodJQ3cJWCs0BXo5CKZA7fN8FHXf0R%2BGnBppBX%2Bx26O62zVtgPtzeE30l%2FMMqi18fY2us0z6m43LFXdWBcPxNNf%2F%2BSyCQHD1DR3J3Y4DqRpdWZ5AevSg0cm9ikmjcxiVkgiaQjowce5ZysUrHutmd1iGVJamYbhFCdzZzOwwe1owLP0j2Yd94Ln2KaXopF2EZ%2FZDrZOlG%2BkRFr8FaUECllCJexvUbC1JC64L720wCZM078veWt7fo1KsDx8Hs%2BqjpDJtWhk78lyzWxQZJWYuwMQyAKNrhXMJCoPE1jqFpI7nq4AZ7suVgFPuVlZ60LqQSx1gJBCGG85ogxNdPV1pYCwQQ2qJOKn5JePoBc62vH39JhwLDau1gVwv8iCnQWV3C1i40LJqn3kO%2B4eQtIp%2Fk%2BA0sQC5LYayGTcckXvULTeAhhlO7OtcimkGOjN3T5hbc1NFG8HS78nQNSuxDHlFA%2Fug7%2FAu86X5OKq0klWaZEMJTIcKDArW7BKbcglm8yDexOKakm92d04VHFGvuMbGnb9BCqmTn%2B7X8uUDRI2dZ3aLNQQv%2B5s%2FuMokY28ugU%3D
Server →nginx/1.8.0
X-Ratelimit-Limit →600
X-Ratelimit-Remaining →595
X-Ratelimit-Reset →1464797147

Following the Location URL streamed the PDF. And look, there’s even the request’s Signature param right there (which was accurately complained about in the Error msg earlier)!

(Note that Postman didn’t download the PDF bytes correctly, but using curl -o resp.pdf <LocationUrlHere> did.)
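
For anyone wondering how short the time window is: the Expires parameter in the Location URL is a Unix timestamp, so it can be compared with the current time. A sketch, where seconds_until_expiry() is a hypothetical helper and the URL is shortened with placeholder values for the key and signature; the Expires value and the time are taken from the response headers above.

```php
<?php
// Hypothetical helper: seconds until a pre-signed URL expires, based on its
// Expires query parameter. Returns null if the parameter is absent.
function seconds_until_expiry(string $url, int $now): ?int
{
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    if (!isset($query['Expires']) || !is_string($query['Expires'])
        || !ctype_digit($query['Expires'])) {
        return null;
    }
    return (int) $query['Expires'] - $now;
}

// Shortened form of the Location URL above (placeholder key and signature).
$url = 'https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/'
     . 'nE4dUaWK86_U15KYuSdz0aVTfKVPy1XSEn8vYe4RCn8/application-pdf'
     . '?AWSAccessKeyId=EXAMPLE&Expires=1464797195&Signature=EXAMPLE';

// Time of the response: Wed, 01 Jun 2016 16:05:35 GMT
$now = gmmktime(16, 5, 35, 6, 1, 2016);

echo seconds_until_expiry($url, $now), "\n"; // prints 60
```

In a live script, time() would be passed as $now.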

Thanks a bunch.

Glad to hear the issue is resolved!

Thank you very much for this information. I got it working but I had trouble using curl in php and what I ended up doing was using curl from the command line (linux).

Here is my code:

$url = "https://document-api.companieshouse.gov.uk/document/" . $docid . "/content";
$requestBase = "curl -v -uAPI-KEY: ";
// escapeshellarg() stops characters in $docid from being interpreted by the shell
$requestCommand = $requestBase . ' -H "Accept: application/pdf"  ' . escapeshellarg($url) . ' 2>&1';
$response = shell_exec($requestCommand);
// Cut the redirect URL out of curl's verbose output
$pos = strpos($response, 'Location: ');
$redirectportion = substr($response, $pos+10);
$pos = strpos($redirectportion, '* Server');
$redirecturl = trim(substr($redirectportion, 0, $pos));
header("Location: $redirecturl", true);

Great news!

@mfairhurst