Document metadata 500 error

alastair · July 10, 2018, 7:01pm

Hi,

I have been following the documentation and various guidance on this forum to try and get this data, but I keep getting a 500 Internal Server error when trying to download metadata for a document.

The URL is http://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8

In terms of parameters, I am only sending the API key with a colon on the end (I have tried without the colon - same issue) as per the documentation. However, if it was an Auth issue - I would have expected a 4xx error - but a 500 error suggests I have passed all the correct details but there is a server-side problem.

I am getting the IDs from the Filing History API call.

Now the only thing that I have noticed in the documentation (and it is subtle) is the example has a URL with /metadata on the end, however it isn’t specified in the documentation outside of the example.

I have read in other forum posts that people are having similar problems getting this working, and I have had zero issues getting any of the other API calls working, which suggests the document API has been developed separately.

Has anyone resolved this or come across similar problems? Am I using the wrong key - is there a separate one from the App key? Are there genuine server side problems going on for the past 48 hours?

Thanks

alastair · July 10, 2018, 8:29pm

I also took this outside of Python in case there was a side effect there, but I am running into the same issues using Postman - also passing
Accept: application/json and
Host: document-api.companieshouse.gov.uk
as per the example on the page.

voracityemail · July 11, 2018, 9:18am

(Just another punter here)

Oh - hang on - you’ve got “http://” in your example. Is that a typo? All CH API calls should use https.
Oddly, this works when I try it … but I’m surprised.

Quick answers:

I don’t think the example with “/metadata” is the way to go. As I understand it, you follow the link you provided for the metadata, or append “/content” to get the actual content. (Getting content wasn’t exhaustively documented last I looked, but search on the forum for the details of this if you need).
Same API key is used across the API - there isn’t a separate key for the documents part of this. (Just for completeness - and this is documented elsewhere - in the final step of requesting the PDF / XML version of a document - when you call the Amazon server you don’t provide the key.)
I suspect it may be how you’re doing this (unless there was a glitch last night).
I assume calls to the rest of the API work for you?

I tried your example link (using curl) and this responds correctly:

curl -uOUR_API_KEY_HERE: "http://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8"

…returns:

{ "company_number":"01777777", "barcode":"X42HHJRF", 
"significant_date":null, "significant_date_type":"","category":"mortgages",
"pages":26, "created_at":"2015-05-29T23:30:55.388825345Z", "etag":"",
"links": { "self":
"https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8",
"document":"https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8/content"
},"resources":{"application/pdf":{"content_length":669701}}}
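For anyone scripting this, here is a minimal Python sketch of reading that response. It parses an abbreviated copy of the JSON above and pulls out the content link and the available content types. The field names come from the response itself; the function name `summarise_metadata` is made up for illustration.

```python
import json

# Abbreviated copy of the metadata response shown above.
SAMPLE = """
{ "company_number": "01777777", "barcode": "X42HHJRF", "category": "mortgages",
  "pages": 26,
  "links": {
    "self": "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8",
    "document": "https://document-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8/content"
  },
  "resources": { "application/pdf": { "content_length": 669701 } } }
"""


def summarise_metadata(raw):
    """Return (content_url, {content_type: content_length}) from a metadata body."""
    meta = json.loads(raw)
    content_url = meta["links"]["document"]
    # "resources" maps each available content type to details such as its size.
    sizes = {ctype: info.get("content_length")
             for ctype, info in meta.get("resources", {}).items()}
    return content_url, sizes


if __name__ == "__main__":
    url, sizes = summarise_metadata(SAMPLE)
    print(url)
    print(sizes)
```

The content types listed under "resources" are presumably what you would ask for in the Accept header when requesting the "/content" link, though that part is not confirmed in this thread.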

As you point out if it was just the authentication details at fault you’d expect a 4xx response.

Anything else returned (e.g. in body) with the 500?

Any other companies you’ve tried?

It seems that you’re trying to look up this entry:
BRITISH AIRWAYS PLC. Company number 01777777
…so I presume you’re calling the endpoint “https://api.companieshouse.gov.uk/company/01777777/filing-history” (with your API key passed in the correct http header parameter).
When I do this, I see the entry you’re after the metadata for:

{
"type": "MR01",
"category": "mortgage",
"subcategory": "create",
"date": "2015-03-04",
"description_values":
{
"charge_number": "017777770805",
"charge_creation_date": "2015-02-26"
},
"action_date": "2015-02-26",
"description": "mortgage-create-with-deed-with-charge-number-charge-creation-date",
"links":
{
"document_metadata": "https://frontend-doc-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8",
"self": "/company/01777777/filing-history/MzExODU1NTE1OGFkaXF6a2N4"
},
"pages": 26,
"barcode": "X42HHJRF",
"transaction_id": "MzExODU1NTE1OGFkaXF6a2N4"
}

(Is this correct?) (You’ll notice that the link here for document metadata is slightly different, but following this one should work as should the one documented in the api).
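A rough Python sketch of that step, assuming a filing-history item shaped like the one above: it reads the "document_metadata" link and, as mentioned earlier, appends "/content" for the document itself. `document_urls` is a hypothetical helper name, and the handling of items with no document link is an assumption on my part.

```python
def document_urls(filing_item):
    """Return (metadata_url, content_url) for a filing-history item, or None."""
    metadata_url = filing_item.get("links", {}).get("document_metadata")
    if metadata_url is None:
        # Assumption: some filing-history items may have no document attached.
        return None
    return metadata_url, metadata_url.rstrip("/") + "/content"


if __name__ == "__main__":
    # Trimmed version of the filing-history item shown above.
    item = {
        "type": "MR01",
        "links": {
            "document_metadata": "https://frontend-doc-api.companieshouse.gov.uk/document/xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8",
            "self": "/company/01777777/filing-history/MzExODU1NTE1OGFkaXF6a2N4",
        },
    }
    print(document_urls(item))
```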

alastair · July 11, 2018, 11:41am

Hey,

Thanks for taking the time on the response.

So I went and tried what you suggested with curl and it worked first time, no problems at all. I then took the same command passed to curl (without looking at what I was doing last night to give myself a fresh perspective), put it through Postman and it is still returning a 500 error… it’s the oddest thing. I was trying to avoid having to break out Wireshark for it - as there must be something different in the way the two are making the HTTP call.

To answer some of your other points

I’m not using HTTPS because the documentation has the endpoint as HTTP
There isn’t any response body to give any extra info, there are some fairly uninformative header fields:

Age →0
Connection →keep-alive
Content-Length →0
Content-Type →text/plain; charset=utf-8
Date →Wed, 11 Jul 2018 11:34:51 GMT
Server →nginx/1.12.1
And in terms of the chosen document, it was just a document that someone else had returned in another forum post somewhere, so I chose that as an example as they eventually got it working - so there was no special significance to that particular doc.

I’ll do a bit more analysis with Wireshark - but would be interested to know if anyone has had any luck with Postman or Python with this particular issue.

Thanks again

voracityemail · July 11, 2018, 2:18pm

the endpoint as HTTP {not https}

Indeed you’re right, the documentation for both document endpoints says “http” (e.g. at https://developer.companieshouse.gov.uk/document/docs/document/id/fetchDocumentMeta.html). I had just assumed https everywhere because the rest of the API states this and the guidelines do:

The API can only be accessed over TLS. We recommend using TLS 1.2.

Unfortunately the documentation has not been corrected / updated since it was created (AFAIK) and has some “features” of its own (as of last cursory look a month or so ago). I was surprised that http worked since this rather waters down the “security” of having an API key e.g. you’re sending this in the clear.
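Given the point about sending the key in the clear, one defensive option is to rewrite any "http" document URL to "https" before attaching the key. Below is a small sketch using Python's standard urllib.parse. `force_https` is an illustrative name and not part of any CH client, and it assumes the document endpoint accepts https.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with its scheme switched to https, leaving the rest unchanged."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        # Refuse anything that is not a plain web URL rather than guess.
        raise ValueError(f"unexpected URL scheme in {url!r}")
    return urlunsplit(parts._replace(scheme="https"))


if __name__ == "__main__":
    print(force_https(
        "http://document-api.companieshouse.gov.uk/document/"
        "xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8"
    ))
```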

Don’t have time to investigate what’s happening with Python / Postman ATM plus I don’t use either. Do post if you find the solution as that’s always helpful.

alastair · July 11, 2018, 3:15pm

Ok, so I’ve found the Postman issue (I suspect Python will be similar) and it appears to be down to my own inexperience with this, but for the sake of anyone who comes across this in the future I’ll post what the issue was…

In Postman you can set each header field individually (including Authorization), but there is also a separate Authorization tab which handles the different types of Authorization (Basic, OAUTH, etc.). What it doesn’t warn you about is that an Authorization header field set directly in the Headers section is sent exactly as typed, which is not what I had expected.

So if you set an Authorization header field directly as “Basic 123XYZ:” then it literally passes that string in the header, whereas if you go to the Authorization tab and set “Basic” as the Type and 123XYZ: as the Username, then it does the proper Base64 encoding (not encryption) of the Username/Password before sending it as part of the GET request.

I only noticed this when I used the following command that helps debug the curl GET request

curl -vvv -u <APIKey>: "doc-api-url"

I noticed that the Authorization header field didn’t match what I was setting.
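To illustrate the difference, here is a small Python sketch of what HTTP Basic authentication expects in the header: the word "Basic" followed by the Base64 encoding of "username:password", where the API key is the username and the password is empty. The key "123XYZ" is a placeholder, not a real key, and `basic_auth_header` is just an illustrative name.

```python
import base64


def basic_auth_header(api_key, password=""):
    """Build the Authorization header value for HTTP Basic authentication."""
    credentials = f"{api_key}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


if __name__ == "__main__":
    # The literal string described above, sent as typed (not encoded):
    print("Basic 123XYZ:")
    # The encoded form that a Basic auth client actually sends:
    print(basic_auth_header("123XYZ"))
```

This is also why the key is written with a trailing colon: the colon separates the username (the key) from the empty password before the whole string is encoded.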

This is probably quite a rookie error, but I also feel that the “500 Internal Server error” response is misleading in this case, and should be changed to something more appropriate that indicates the request is not authorised.

Given my new found knowledge I expect the Python to be similar.
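For the Python side, a sketch using only the standard library is below. It builds the request with a properly encoded Authorization header instead of the raw key. Nothing here is taken from the CH documentation beyond the endpoint and the Accept header quoted earlier; `build_metadata_request` and `fetch_metadata` are hypothetical helper names, and the https scheme is used on the assumption that the endpoint accepts it.

```python
import base64
import json
import urllib.request

# Document URL from this thread (with https assumed).
DOC_URL = ("https://document-api.companieshouse.gov.uk/document/"
           "xtXcP0MpmXVAgJhKdDk2XywNgqIOX9MnUOzgPKeV7O8")


def build_metadata_request(url, api_key):
    """Prepare a GET request with the API key as Basic auth username, empty password."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })


def fetch_metadata(url, api_key):
    """Send the request and decode the JSON body (performs a real network call)."""
    with urllib.request.urlopen(build_metadata_request(url, api_key)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds and prints the request; call fetch_metadata() with a real key to send it.
    req = build_metadata_request(DOC_URL, "YOUR_API_KEY_HERE")
    print(req.get_method(), req.full_url)
    print(req.get_header("Authorization"))
```

If you use the third-party requests library instead, passing `auth=(api_key, "")` to `requests.get` should produce the same header without any manual encoding.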

voracityemail · July 12, 2018, 6:23pm

Yup, that comes under the “how HTTP Basic authentication works” heading, but as you go on to say, that’s often abstracted away by a library or package, so it may not be obvious how it’s handled. This has come up quite a few times on the forum; maybe it’s one for a FAQ, or for extra emphasis should CH update their documentation.

Good investigation anyway and you’re working now.