[Errno 54] Connection reset by peer

Hi,
We are trying to scrape company numbers from director IDs with the API, but keep getting the errors below after it runs for a while:

ConnectionResetError: [Errno 54] Connection reset by peer
ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

We created a status code dictionary in the code so that we could monitor any status code errors we may get. From the last attempt, we did not find any evidence that the problem was a rate limit problem (we have a 5-minute sleep which hopefully addresses this).

We’ve included our code below. Any advice would be great!

Our code:

import requests
import pandas as pd
import json
import time
from tqdm import tqdm
from datetime import datetime

# base_url, api_key and dir_ids_0 (the list of director IDs) are defined earlier
request_number = 0
co_list_a = []
dir_num_error = {}

for dir_id in tqdm(dir_ids_0):
    # Sleep for 5 minutes after every 600 requests to stay under the rate limit
    if request_number > 599:
        print("sleeping")
        time.sleep(300)
        request_number = 0
    response = requests.get(f"{base_url}{dir_id}", auth=(api_key, ''))
    request_number = request_number + 1
    # Record the status code of every request so rate-limit errors can be spotted later
    dir_num_error[dir_id] = {
        'status_code': response.status_code,
        'timestamp': str(datetime.now()),
        'request_number': request_number,
    }
    if response.status_code == 429:
        print("429_sleeping")
        time.sleep(300)
        request_number = 0
        continue
    if response.status_code != 200:
        continue

    # Collect the company links from the successful response
    data = json.loads(response.text)
    for item in data['items']:
        co_list_a.append(item['links']['company'])

Might not be relevant to your problem, but if you are looking for a list of all company numbers, instead of scraping you can download a massive CSV file of them from Companies House. If you are looking specifically for director appointments you can contact their customer care and they will send you a big file of all directors (officers).


We suggest that you try making the sleep period slightly longer to make sure you are not exceeding the rate limit.
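For example, a rough alternative (assuming the documented limit of 600 requests per 5-minute window) is to pace every request instead of sleeping in bulk: 300 s / 600 requests works out at 0.5 s per request, so a slightly larger fixed delay keeps you under the limit with some headroom:

import time

# 0.6 s per request => at most 500 requests in any 300 s window,
# comfortably under the 600-per-5-minutes limit
time.sleep(0.6)  # after each requests.get(...) call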


Have you tried encapsulating your code in a try/except block and logging the exception information that's being thrown? Since it looks like the connection is being terminated from the remote side, do you have any transparent proxy servers between yourself and the internet that might also be causing the connection to terminate?
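As a minimal sketch (reusing the base_url / api_key / dir_id names from your code), you could catch and record the error instead of letting it crash the whole run:

import logging
import requests

try:
    response = requests.get(f"{base_url}{dir_id}", auth=(api_key, ''))
except requests.exceptions.ConnectionError:
    # requests wraps ProtocolError / ConnectionResetError in ConnectionError;
    # log the full traceback here, then sleep and retry rather than abort
    logging.exception("request for director %s failed", dir_id)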

Does the connection terminate at random or after x records?

Thanks. I’m just waiting for a slightly increased rate limit to be approved and then will re-run with more of a buffer.

This is beyond my programming skills atm, and I'm not so sure on the proxy server question. I'm going to try running it from a different location and machine. Thanks for the help.

Update - apologies - I see you already requested the bulk data so you can probably ignore this post now! I’ll leave it here anyway.

I don’t know what exactly you want to achieve with:

trying to scrape company numbers from director IDs with the API

…but as @ebrian101 says, requesting the list of officers would be a simple way to achieve that - you could then simply parse it for the information you need. This is doubly the case if you're not entirely confident in the vagaries of REST APIs / HTTP.

To request that information, post on the following thread:

I don't know for sure if this error is due to you hitting the limit, but it would be sensible to write the code to take account of this. Although you could just try (at the simplest) making your wait period / number of requests more conservative, I think you'd be better off making use of a library to:
a) cover some of the detail for you (so you can concentrate on your task) and
b) properly implement the rate limit system.

I’m not a Python coder but it looks like you’re using Python there. I believe there are existing Python libraries which will cover that task, for example:

This one certainly appears to handle the rate limiting information from Companies House - if you look at the RateLimitAdapter class here:

Again, I haven't used this one personally, so I can't guarantee it still works or advise on how to get it to assist with your task.

Still want to do this yourself?

If you do want to implement this, it is not especially difficult, particularly if you only have one thread / single process making the requests. Companies House provides rate-limiting information via HTTP headers on each response, giving the start of the rate window and how many requests you have left. See their documentation. The following posts have details; I would double-check the data your code receives when trying this out as some are older threads:
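As a rough sketch of what that could look like (the header names X-Ratelimit-Remain and X-Ratelimit-Reset are my assumption here - verify them against the Companies House documentation and the headers your own responses actually contain):

import time
import requests

def rate_limited_get(url, api_key):
    response = requests.get(url, auth=(api_key, ''))
    remain = response.headers.get('X-Ratelimit-Remain')
    reset = response.headers.get('X-Ratelimit-Reset')
    if remain is not None and int(remain) <= 1 and reset is not None:
        # Assuming reset is a Unix timestamp for the start of the next window
        time.sleep(max(0, int(reset) - time.time()) + 1)
    return response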

Whatever you end up using, I notice that Companies House now has some test environments - it would make sense to test your code there so you don't get yourself blocked if it doesn't work as expected!

Good luck.

Thanks so much for the detailed reply. I'm hoping to be able to get the relevant bulk data. If that includes information on directors and all of their appointments then there'll be no need to scrape anything, as you say. If that info isn't available, though, I'm going to try a longer break.