Does anyone know what character set is used to hold and return the data? A lot of the companies have what I assume are Unicode characters, so it would be useful to know what character set is used to store that data and send it in the JSON responses.
(I’m not from Companies House.) I have no idea what character encoding the data might be stored in. Probably more pertinent is what legal restrictions there are on, e.g., what can go in a company’s name.
Of course, that is also subject to the fact that this is not just a public record, but a record where the public actually fill in much of the data and Companies House primarily have a legal duty to record it! Whether by choice or by law, there doesn’t seem to be a great deal of validation or cross-checking applied.
If you really want to dig deep, there are the XML specifications, which may hint at limits on what data will be accepted or represented (XML was the main way of accessing things prior to the REST API).
Of course - the Companies House data set itself pre-dates digital computers…
In terms of what you get back, for all the JSON that is returned I’d expect them to stick to the spec (and I’m not aware we’ve received anything invalid in practice):