[Errno 54] Connection reset by peer

tamills · July 19, 2021, 10:18am

Hi,
We are trying to scrape company numbers from director IDs with the API, but keep getting the errors below after the script has been running for a while:

ConnectionResetError: [Errno 54] Connection reset by peer
ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

We created a status code dictionary in the code so that we could monitor any status code errors we may get. From the last attempt, we found no evidence that the problem was a rate limit problem (we have a 5-minute sleep, which hopefully addresses this).

We’ve included our code below. Any advice would be great!

Our code:

import requests
import pandas as pd
import json
import time
from tqdm import tqdm
from datetime import datetime

# dir_ids_0, base_url and api_key are defined elsewhere (not shown).
request_number = 0
co_list_a = []
dir_num_error = {}

for item in tqdm(dir_ids_0):
    # Pause for 5 minutes after every 600 requests.
    if request_number > 599:
        print("sleeping")
        time.sleep(300)
        request_number = 0
    response = requests.get(f"{base_url}{item}", auth=(api_key, ""))
    request_number = request_number + 1
    # Record the outcome of every request so status codes can be reviewed later.
    dir_num_error[item] = {}
    dir_num_error[item]["status_code"] = response.status_code
    dir_num_error[item]["timestamp"] = str(datetime.now())
    dir_num_error[item]["request_number"] = request_number
    if response.status_code == 429:
        print("429_sleeping")
        time.sleep(300)
        request_number = 0
        continue  # note: this director ID is skipped, not retried
    if response.status_code != 200:
        continue

    json_search_result = response.text
    data = json.JSONDecoder().decode(json_search_result)
    for appointment in data["items"]:
        co_list_a.append(appointment["links"]["company"])

ebrian101 · July 20, 2021, 8:54pm

This might not be relevant to your problem, but if you are looking for a list of all company numbers, instead of scraping you can download a massive CSV file of them from Companies House. If you are looking specifically for director appointments, you can contact their customer care and they will send you a big file of all directors (officers).

lgeorge · July 21, 2021, 8:31am

We suggest that you try making the sleep period slightly longer to make sure you are not exceeding the rate limit.

dlewis2 · July 21, 2021, 9:18am

Have you tried wrapping your code in a try/except block and capturing the exception information that’s being thrown? Since it looks like the connection is being terminated from the remote side, do you have any transparent proxy servers between yourself and the internet that might be causing the connection to terminate?

Does the connection terminate at random or after x records?
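
For anyone unsure what that would look like in Python, here is a minimal sketch of catching the error, printing it, and retrying with a growing delay. The helper name call_with_retry and its parameters are made up for illustration. It catches OSError because both the built-in ConnectionResetError and requests.exceptions.ConnectionError derive from it. Treat it as a starting point rather than a tested fix for this particular problem.

```python
import time


def call_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a connection-level error, report it and retry.

    fetch is any zero-argument callable (hypothetical helper, not part of
    requests). The delay doubles after each failed attempt.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as exc:
            # Show exactly which exception was raised and on which attempt.
            print(f"attempt {attempt + 1} failed: {type(exc).__name__}: {exc}")
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Possible use inside the loop from the first post (names as in that code):
# response = call_with_retry(
#     lambda: requests.get(f"{base_url}{item}", auth=(api_key, ""))
# )
```

Logging the attempt number alongside the request count should also help answer whether the resets happen at random or after a fixed number of records.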

tamills · July 21, 2021, 9:41am

Thanks. I’m just waiting for a slightly increased rate limit to be approved and then will re-run with more of a buffer.

tamills · July 21, 2021, 9:43am

This is beyond my programming skills at the moment, and I’m not so sure about the proxy server question. I’m going to try running it from a different location and machine. Thanks for the help.

voracityemail · July 21, 2021, 10:52am

Update - apologies - I see you already requested the bulk data so you can probably ignore this post now! I’ll leave it here anyway.

I don’t know what exactly you want to achieve with:

trying to scrape company numbers from director IDs with the API

…but as @ebrian101 says, requesting the list of officers would be a simple way to achieve that - you could then simply parse it for the information you need. This is doubly the case if you’re not entirely confident with the vagaries of REST APIs / HTTP.

To request that information post on the following thread:

I don’t know for sure if this error is due to you hitting the limit, but it would be sensible to write the code to take account of this. Although you could (at the simplest) just make your wait period / number of requests more conservative, I think you’d be better off making use of a library to:
a) cover some of the detail for you (so you can concentrate on your task) and
b) properly implement the rate limit system.
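
As a rough illustration of the "more conservative" option, the sketch below spaces requests evenly instead of sending a burst and then sleeping for five minutes. The Throttle class is hypothetical, and the figure of 500 requests per 300 seconds is an assumed budget chosen to sit under the 600 used in the original code, not an official limit.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (illustrative helper only)."""

    def __init__(self, max_requests, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Possible use: create once, then call throttle.wait() before each request.
# throttle = Throttle(max_requests=500, per_seconds=300)
```

Even spacing keeps the request rate steady, which may also be gentler on any proxy or firewall sitting between the script and the API.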

I’m not a Python coder but it looks like you’re using Python there. I believe there are existing Python libraries which will cover that task, for example:

This one certainly appears to handle the rate limiting information from Companies House - if you look at the RateLimitAdapter class here:

github.com

JamesGardiner/chwrapper/blob/develop/chwrapper/services/base.py


Again, I haven’t used this one personally, so I can’t guarantee it works now or advise on how to use it for your task.

Still want to do this yourself?

If you do want to implement this yourself, it is not especially difficult, particularly if you only have one thread / single code process making the requests. Companies House provides rate limiting information via HTTP headers with each response, giving the “rate window start time period” and “how many requests you have left”. See their documentation. The following posts have details; I would double-check the data your code receives when trying this out, as some are older threads:
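
A minimal sketch of the header-based approach described above, in Python. The header names X-Ratelimit-Remain and X-Ratelimit-Reset, and the reading of the reset value as a Unix timestamp in seconds, are assumptions here; check them against the Companies House documentation and against what your responses actually contain. The function name seconds_to_wait is made up for illustration.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next request.

    headers is a mapping such as response.headers from requests.
    Header names below are assumptions; verify them against real responses.
    """
    now = time.time() if now is None else now
    remaining = headers.get("X-Ratelimit-Remain")
    reset_at = headers.get("X-Ratelimit-Reset")
    if remaining is None or reset_at is None:
        return 0.0  # no rate limit information on this response
    if int(remaining) > 0:
        return 0.0  # still have requests left in this window
    # Nothing left: wait until the window resets (never a negative time).
    return max(0.0, float(reset_at) - now)


# Possible use after each request in a single-threaded loop:
# time.sleep(seconds_to_wait(response.headers))
```

This replaces the fixed request counter with whatever the server itself reports, so the pause adapts if the limit or window changes.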

Whatever you end up using, I notice that Companies House now have some test environments - it would make sense to test your code there so you don’t get yourself blocked if it doesn’t work as expected!

Good luck.

tamills · July 22, 2021, 8:32am

Thanks so much for the detailed reply. I’m hoping to be able to get the relevant bulk data. If that includes information on directors and all of their appointments, then there’ll be no need to scrape anything, as you say. If that info isn’t available, though, I’m going to try a longer break.