I am attempting to collect data from various URLs, but I run into issues when executing browser.get(link). On occasion I receive the error [Errno 104] Connection reset by peer, and at other times it is [Errno 111] Connection refused. Interestingly, everything works perfectly on my Mac with a standard browser, which suggests that my scraper itself is functioning correctly.
I have experimented with several solutions, including waiting for page elements, implementing implicit waits, and using selenium-requests to ensure accurate request headers, yet I have not found success.
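Roughly, the waits I tried look like this (the selector below is just a placeholder, not the real one I use, and browser is the headless Firefox instance created earlier):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# browser is the headless Firefox WebDriver instance created earlier
browser.implicitly_wait(30)  # implicit wait applied to every element lookup
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-list'))  # placeholder selector
)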
Here are the URLs I am targeting:
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium, and the headless Firefox WebDriver on CentOS 6.5.
Additionally, I have multiple AJAX-heavy pages that I have been able to scrape without issues; examples include:
http://www.infibeam.com/deal-of-the-day.html
http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
Despite spending several days troubleshooting this problem, I have made no progress. Any assistance would be greatly appreciated.
Hey FlyingEagle,
It sounds like you're facing network-level issues with headless browsing on CentOS. Try these fixes:
- Pass options.add_argument('--no-sandbox') and options.add_argument('--disable-gpu') when initializing the WebDriver, if you have not already (note these are Chromium flags, so they mainly apply if you try Chromium as suggested below).
- Check if there’s any firewall or network policy blocking requests.
- Increase your WebDriver's timeout settings for connections (see the sketch after this list).
- Consider using a proxy or a VPN to see if the issue persists.
- Test switching to a different headless browser, like Chromium.
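For the timeout and proxy suggestions, here is a minimal sketch of a headless Firefox setup; the proxy host and port are placeholders, not values from your environment:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
# Optional: route traffic through a proxy to rule out IP-based blocking.
# The host and port below are placeholders.
# options.set_preference("network.proxy.type", 1)
# options.set_preference("network.proxy.http", "proxy.example.com")
# options.set_preference("network.proxy.http_port", 8080)
# options.set_preference("network.proxy.ssl", "proxy.example.com")
# options.set_preference("network.proxy.ssl_port", 8080)

browser = webdriver.Firefox(options=options)
browser.set_page_load_timeout(60)  # give slow pages more time before timing out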
Give these a try and let me know if you're still stuck.
Another way to handle network-related errors in Selenium is to verify the network conditions of the environment where your script is running. Here are some additional strategies to consider:
- Update Geckodriver and Selenium: Make sure both the Geckodriver and Selenium packages are up to date. Newer versions often contain bug fixes that could resolve network-related issues.
- Check Resource Limits: On CentOS, verify that the system's resource limits aren't too low, as these can impact performance. For instance, check the ulimit settings for open files and adjust them if necessary (a quick way to read them from Python is shown after this list).
- Use Retries with Exponential Backoff: Implement a retry mechanism with exponential backoff to handle transient connection resets. Here's a simple Python example:
import time

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
url = 'http://www.snapdeal.com/offers/deal-of-the-day'

for attempt in range(3):
    browser = None
    try:
        browser = webdriver.Firefox(options=options)
        browser.get(url)
        # Your scraping logic here
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(2 ** attempt)  # Exponential backoff: wait 1s, 2s, 4s
    finally:
        if browser is not None:
            browser.quit()  # Always release the browser, even after a failure
- Network Monitoring: Use network monitoring tools to check whether temporary interruptions or resource limits are affecting the requests. Tools like nload or wireshark might give insights.
- Inspect Server Response: Occasionally, websites might alter their behavior based on headless browsing. Use the browser developer tools to compare the network activity when the pages are accessed from your local browser versus the headless browser.
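As a quick illustration of the resource-limit check mentioned above, the open-file limit can also be read from Python itself (just a convenience sketch; ulimit -n in the shell reports the same soft limit):
import resource

# Soft and hard limits on the number of open file descriptors for this process;
# a very low soft limit can contribute to connection failures under load.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")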
By addressing these aspects, you may overcome the underlying causes of connection resets or refusals.