Issues with Selenium Headless WebDriver: [Errno 104] Connection Reset by Peer

I am trying to collect data from several URLs, but the call to browser.get(link) fails intermittently. Sometimes the error is [Errno 104] Connection reset by peer, and at other times it is [Errno 111] Connection refused. The same script works perfectly on my Mac with a standard browser, which suggests the scraping logic itself is correct.

I have experimented with several solutions, including waiting for page elements, implementing implicit waits, and using selenium-requests to ensure accurate request headers, yet I have not found success.

Here are the URLs I am targeting:

http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals

I am using Python, Selenium, and the headless Firefox WebDriver on CentOS 6.5.

I also have several AJAX-heavy pages that I can scrape without any issues, for example:

http://www.infibeam.com/deal-of-the-day.html
http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals

I have spent several days troubleshooting this without any progress. Any assistance would be greatly appreciated.

Hey FlyingEagle,

It sounds like you're facing network-level issues with headless browsing on CentOS. Try these fixes:

  • Test switching to a different headless browser, like Chromium. If you do, pass options.add_argument('--no-sandbox') and options.add_argument('--disable-gpu') when initializing the WebDriver; these are Chrome/Chromium flags and do not apply to Firefox.
  • Check if there’s any firewall or network policy blocking outbound requests from the server.
  • Increase your WebDriver's timeouts, for example with browser.set_page_load_timeout(60).
  • Route the traffic through a proxy or a VPN to see if the issue persists; if it goes away, the target sites may be rejecting your server's IP address.

Give these a try and let me know if you're still stuck.
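
To check the firewall point without Selenium in the loop, a plain TCP probe run from the CentOS box may help narrow things down. At the socket level, errno 111 (ECONNREFUSED) means nothing accepted the connection, while errno 104 (ECONNRESET) means the other end dropped a connection that already existed. This is only a rough sketch using the standard library; the function name `probe` and its return labels are made up for illustration. Keep in mind that the failing connection could also be the local one between Selenium and the browser driver, which this probe does not test.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Try a plain TCP connect and classify the outcome (labels are illustrative)."""
    try:
        # create_connection resolves the host name and opens a TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"  # errno 111 on Linux
    except ConnectionResetError:
        return "reset"    # errno 104 on Linux
    except OSError as exc:
        # DNS failures, unreachable networks, and so on
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

Calling `probe("www.snapdeal.com", 80)` and `probe("paytm.com", 443)` on both the server and the Mac gives a quick comparison. If the server reports "open" for both hosts, the network path is probably fine and the problem is more likely in the browser or driver setup.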

Another way to approach network-related errors in Selenium is to check the environment your script runs in. Here are some additional strategies to consider:

  • Update Geckodriver and Selenium: Make sure both the Geckodriver and Selenium packages are up to date. Newer versions often contain bug fixes that could resolve network-related issues.
  • Check Resource Limits: On CentOS, verify that the system's resource limits aren’t too low, as these can impact performance. For instance, check the ulimit settings for open files and adjust them if necessary.
  • Use Retries with Exponential Backoff: Implement a retry mechanism with exponential backoff to handle transient connection resets. Here's a simple Python example:
    
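    import time
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    
    options = Options()
    # Pass the flag directly; the options.headless setter is deprecated in recent Selenium 4 releases
    options.add_argument("-headless")
    
    url = 'http://www.snapdeal.com/offers/deal-of-the-day'
    max_attempts = 3
    
    for attempt in range(max_attempts):
        browser = None  # keeps the finally block safe if the browser fails to start
        try:
            browser = webdriver.Firefox(options=options)
            browser.get(url)
            # Your scraping logic here
            break
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1 s, then 2 s
        finally:
            # Always release the browser, but only if it was actually created
            if browser is not None:
                browser.quit()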
    
  • Network Monitoring: Use network monitoring tools to check if there are temporary interruptions or resource limits affecting the requests. Tools like nload or wireshark might give insights.
  • Inspect Server Response: Occasionally, websites might alter their behavior based on headless browsing. Use browser developer tools to compare the network activity when accessed from your local browser and headless browser.
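
For the resource limit point above, the open-file limit can also be inspected from inside the Python process that launches the browser, since child processes inherit it. The following is a minimal sketch using the standard-library `resource` module (Unix only); the helper names `open_file_limits` and `raise_soft_limit` are hypothetical, and whether a low limit is actually behind the connection errors is an assumption to verify.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit."""
    soft, hard = open_file_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return open_file_limits()
```

Printing `open_file_limits()` before starting the WebDriver shows the same soft value that `ulimit -n` reports in the launching shell. If it is small (1024 is a common default), calling something like `raise_soft_limit(4096)` before creating the browser is one thing to try.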
    

By addressing these aspects, you may overcome the underlying causes of connection resets or refusals.