Can't access documents from AmazonS3 server

Hi there, I’m trying to fetch documents from the API using R. I appreciate the clarification of the process in this post. I’ve been following the steps above with partial success, but I still fail at the last step, which is getting access to the documents’ content:

  1. Find the document filing you’re interested in (e.g. make a filing history request for the company). Parse the response for the link to the document in the field “links” : { “document_metadata” : “link URI fragment here” }.

No problem:

library(httr)
library(jsonlite)
library(openssl)

### retrieving filing history ####
company_num = 'FC013908'
key = 'my_key'
fh_path = paste0('/company/', toupper(company_num), "/filing-history") # base R toupper(); str_to_upper() would need library(stringr)
fh_url <- modify_url("https://api.companieshouse.gov.uk/", path = fh_path)
fh_test <- GET(fh_url, authenticate(key, "")) #status_code = 200
fh_parsed <- jsonlite::fromJSON(content(fh_test, "text",encoding = "utf-8"), flatten = TRUE)
docs <- fh_parsed$items

Done.

  2. For a given document, request the document metadata via the CH Document API. Parse the response to get the document (mime) types available and the link to the actual document data (document URI fragment).

No problems here:

md_meta_url = docs$links.document_metadata[1]  
key_pass <- paste0(key,":")
decoded_auth <- paste0('Basic ', base64_encode(key_pass))

md_test <- GET(md_meta_url,
               add_headers(Authorization = decoded_auth)
               )
md_test #status_code = 200!
md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)

This way I can obtain the content URL:

cont_url = md_parsed$links$document

  3. Request the actual document, specifying the mime type (e.g. “application/pdf”).

I do it while NOT following the redirect and, as expected, I get the 302 status code with the location header:

accept = 'application/pdf'
cont_test <- GET(cont_url, 
           add_headers(Authorization = decoded_auth,
                       Accept = accept),
           config(followlocation = FALSE)
)

final_url <- cont_test$headers$location

> final_url
[1] "https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D"

However, when I try the final step:

  4. Request this URI from Amazon again passing the content type you want again.

I get a 400 error:

final_test <- GET(final_url,
                  add_headers(Authorization = decoded_auth,
                              Accept = accept
                              ))

> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D]
  Date: 2018-06-20 08:37
  Status: 400
  Content-Type: application/xml
  Size: 523 B
<BINARY BODY>

Needless to say, executing

browseURL(final_test$url)

returns an Access Denied error. I suspect it may have something to do with AWS authorization problems similar to those described here. Any ideas on how to solve this final hurdle? @voracityemail, can you help?

Thanks!

You’re almost there. As it says in the thread you mentioned:

So your final request:

final_test <- GET(final_url, 
             add_headers(Authorization = decoded_auth,
                         Accept = accept
                         ))

… should be like:

final_test <- GET(final_url,
             add_headers(Accept = accept
                        ))

(Apologies for formatting - I don’t speak R).

Amazon require authentication, but all their authentication is done via the contents of the “final_url”, i.e. the authentication is passed as parameters in the query string. So if you also include the HTTP header for authorisation from Companies House (the “Authorization = decoded_auth”), this will confuse the Amazon servers.
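To make that rule harder to get wrong, here is a minimal sketch in base R of a helper that picks the headers from the host of the URL being requested. The name `headers_for` and the host pattern are my own assumptions based on the URLs in this thread, not part of any Companies House or httr API.

```r
# Hypothetical helper: choose request headers based on the target host.
# Assumption: only hosts ending in companieshouse.gov.uk want the Basic auth header.
headers_for <- function(url, ch_auth, accept = "application/pdf") {
  # Extract the host part of the URL using base R only.
  host <- tolower(sub("^[a-z]+://([^/:?#]+).*$", "\\1", url, ignore.case = TRUE))
  headers <- c(Accept = accept)
  # Any other host (e.g. the Amazon S3 redirect target) gets no Authorization header.
  if (grepl("(^|\\.)companieshouse\\.gov\\.uk$", host)) {
    headers <- c(headers, Authorization = ch_auth)
  }
  headers
}

# Placeholder URL and credentials, for illustration only.
s3_headers <- headers_for("https://s3-eu-west-1.amazonaws.com/bucket/key?Signature=abc",
                          ch_auth = "Basic placeholder")
print(names(s3_headers))
```

With httr, the result could then be passed on as `GET(final_url, add_headers(.headers = headers_for(final_url, decoded_auth, accept)))`, so the same call works for both the Companies House request and the Amazon one.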

You can check this by looking at the response body you got back (this is what’s being returned with Content-Type: application/xml in your last example). It will be something like:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code>
<Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter,
Signature query string parameter or the Authorization header should be specified</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>Basic {your api key would be here}</ArgumentValue>
<RequestId>{blah}</RequestId>
<HostId>{long hostid string}</HostId></Error>
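If you want to read that error from R rather than by eye, a rough sketch follows. It uses a base R regular expression instead of a proper XML parser, the helper name `xml_field` is made up, and the sample body is a trimmed stand-in for the real response.

```r
# Hypothetical helper: return the text of the first <tag>...</tag> element,
# or NA if the tag is absent. A regex is enough for this flat error document;
# a real XML parser would be more robust.
xml_field <- function(xml, tag) {
  pattern <- paste0("(?s)<", tag, ">(.*?)</", tag, ">")
  m <- regmatches(xml, regexec(pattern, xml, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  # Collapse line breaks inside the element into single spaces.
  gsub("\\s+", " ", m[2])
}

# Trimmed sample of the kind of body S3 sends back with a 400.
error_body <- paste0(
  '<?xml version="1.0" encoding="UTF-8"?>\n',
  "<Error><Code>InvalidArgument</Code>\n",
  "<Message>Only one auth mechanism allowed</Message>\n",
  "<ArgumentName>Authorization</ArgumentName></Error>"
)

print(xml_field(error_body, "Code"))
print(xml_field(error_body, "ArgumentName"))
```

In the original example the body could be obtained as text with `content(final_test, "text", encoding = "UTF-8")` and passed to the helper in place of `error_body`.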

Again, you can check this: among all the parameters in the final Amazon URL you get in the HTTP redirect (302) from Companies House, you’ll see e.g.:

AWSAccessKeyId={their access key}&Expires={token expiry time}
&Signature={signature}&x-amz-security-token={very long security token}
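A small base R sketch for inspecting those parameters is below. The function names and the example URL are placeholders of mine. The one fact I am relying on is that in this style of signed URL `Expires` is a Unix timestamp in seconds, so the link is short-lived; an expired link is possibly also why opening an old one in a browser gives Access Denied.

```r
# Hypothetical helper: split a URL's query string into a named character vector.
parse_query <- function(url) {
  if (!grepl("?", url, fixed = TRUE)) return(character(0))
  query <- sub("^[^?]*\\?", "", url)
  pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- vapply(pairs,
                   function(p) unname(utils::URLdecode(paste(p[-1], collapse = "="))),
                   character(1))
  names(values) <- vapply(pairs, function(p) p[1], character(1))
  values
}

# Hypothetical helper: TRUE if the signed URL's Expires time has passed
# (NA if the URL carries no Expires parameter).
is_expired <- function(url, now = Sys.time()) {
  expires <- as.numeric(unname(parse_query(url)["Expires"]))
  as.numeric(now) > expires
}

# Placeholder URL in the same shape as the redirect target in this thread.
example_url <- paste0("https://s3-eu-west-1.amazonaws.com/bucket/key",
                      "?AWSAccessKeyId=EXAMPLE&Expires=1529483765&Signature=abc%3D")
params <- parse_query(example_url)
print(names(params))
print(as.POSIXct(as.numeric(params[["Expires"]]), origin = "1970-01-01", tz = "UTC"))
print(is_expired(example_url))
```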

So just omit the CH Authorisation header at the last step and you should be fine.

Here’s (yet another) plug for the free curl library / command line utility for diagnosing these issues / investigating REST interfaces. Although it’s old and there are more specialised tools (one I’ve used is SoapUI), it’s fast and simple. For info, the most useful curl switches here are:

  • Send username (and optionally password): -uUSERNAME:PASSWORD or --user U:P
  • Show http headers: -I or --head
  • Dump http headers to a file: -D filename or --dump-header filename
  • Add a header line (e.g. Accept: content-type): -H headerline or --header headerline
  • Automatically follow redirects: -L or --location

Don’t forget to quote URLs, header lines etc. In particular, unquoted URL characters like “&” will cause problems in most shells / command line environments…
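Since the original question is in R, here is one way to get the quoting right from there: a sketch that only builds the command line as a string, using base R's `shQuote()`, so you can paste it into a shell. The helper name `curl_command` is hypothetical, and it uses `--include` (print response headers together with the body), which is a real curl switch but not one of those listed above.

```r
# Hypothetical helper: assemble a safely quoted curl command line as a string.
# Nothing is executed here; the result is meant to be pasted into a POSIX shell.
curl_command <- function(url, accept = "application/pdf", user = NULL) {
  args <- c("curl", "--include",
            "--header", shQuote(paste("Accept:", accept), type = "sh"))
  # Only add credentials when asked to, e.g. for the Companies House request.
  if (!is.null(user)) {
    args <- c(args, "--user", shQuote(paste0(user, ":"), type = "sh"))
  }
  # Quoting the URL keeps characters like "&" away from the shell.
  paste(c(args, shQuote(url, type = "sh")), collapse = " ")
}

# Placeholder key and URL, for illustration only.
cat(curl_command("https://example.com/a?b=1&c=2", user = "my_key"), "\n")
```

Leaving `user` as `NULL` gives a command with no credentials, which is the form that matches the final Amazon request.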

It worked brilliantly, thankyouthankyouthankyou!

I’m struggling with the final part here too. I’ve disabled redirects, I’m getting the 302, and I’ve got the AWS location of the document. If I copy that whole URL string into my browser, I can get the document. However, when I set up a second (new) request to that long URL via Python Requests, including the Accept header and excluding the Authorization header, I get an SSL handshake error: '(Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))'. Do I need to split apart the long AWS URL string and pass some of it into a new Authorization string or something?

Could it have something to do with your Python environment? See this post as an example.

I haven’t resolved it yet, but yes I think you’re right, thanks for pointing me in the right direction!