I am attempting to collect data from various URLs, but I run into issues when executing browser.get(link). On occasion I receive the error [Errno 104] Connection reset by peer, and at other times it is [Errno 111] Connection refused. Interestingly, everything works perfectly on my Mac with a standard browser, which suggests that my scraper itself is functioning correctly.
I have experimented with several solutions, including waiting for page elements, implementing implicit waits, and using selenium-requests to ensure accurate request headers, yet I have not found success.
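Roughly, the waits I tried look like this (the selector below is just a placeholder, not the real one I use, and browser is the headless Firefox instance created earlier):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# browser is the headless Firefox WebDriver instance created earlier
browser.implicitly_wait(30)  # implicit wait applied to every element lookup
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-list'))  # placeholder selector
)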
Here are the URLs I am targeting:
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium, and the headless Firefox WebDriver on CentOS 6.5.
Additionally, I have multiple AJAX-heavy pages that I have been able to scrape without issues; examples include:
http://www.infibeam.com/deal-of-the-day.html
http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
Despite spending several days troubleshooting this problem, I have made no progress. Any assistance would be greatly appreciated.
Hey FlyingEagle,
It sounds like you're facing network-level issues with headless browsing on CentOS. Try these fixes:
- Pass options.add_argument('--no-sandbox') and options.add_argument('--disable-gpu') when initializing the WebDriver, if you have not already (note these are Chromium flags, so they mainly apply if you try Chromium as suggested below).
- Check if there’s any firewall or network policy blocking requests.
- Increase your WebDriver's timeout settings for connections (see the sketch after this list).
- Consider using a proxy or a VPN to see if the issue persists.
- Test switching to a different headless browser, like Chromium.
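For the timeout and proxy suggestions, here is a minimal sketch of a headless Firefox setup; the proxy host and port are placeholders, not values from your environment:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
# Optional: route traffic through a proxy to rule out IP-based blocking.
# The host and port below are placeholders.
# options.set_preference("network.proxy.type", 1)
# options.set_preference("network.proxy.http", "proxy.example.com")
# options.set_preference("network.proxy.http_port", 8080)
# options.set_preference("network.proxy.ssl", "proxy.example.com")
# options.set_preference("network.proxy.ssl_port", 8080)

browser = webdriver.Firefox(options=options)
browser.set_page_load_timeout(60)  # give slow pages more time before timing out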
Give these a try and let me know if you're still stuck.
Another way to handle network-related errors in Selenium is to verify the network conditions of the environment where your script is running. Here are some additional strategies to consider:
- Update Geckodriver and Selenium: Make sure both the Geckodriver and Selenium packages are up to date. Newer versions often contain bug fixes that could resolve network-related issues.
- Check Resource Limits: On CentOS, verify that the system's resource limits aren't too low, as these can impact performance. For instance, check the ulimit settings for open files and adjust them if necessary (a quick way to read them from Python is shown after this list).
- Use Retries with Exponential Backoff: Implement a retry mechanism with exponential backoff to handle transient connection resets. Here's a simple Python example:
import time

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
url = 'http://www.snapdeal.com/offers/deal-of-the-day'

for attempt in range(3):
    browser = None
    try:
        browser = webdriver.Firefox(options=options)
        browser.get(url)
        # Your scraping logic here
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(2 ** attempt)  # Exponential backoff: wait 1s, 2s, 4s
    finally:
        if browser is not None:
            browser.quit()  # Always release the browser, even after a failure
- Network Monitoring: Use network monitoring tools to check whether temporary interruptions or resource limits are affecting the requests. Tools like nload or wireshark might give insights.
- Inspect Server Response: Occasionally, websites might alter their behavior based on headless browsing. Use the browser developer tools to compare the network activity when the pages are accessed from your local browser versus the headless browser.
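As a quick illustration of the resource-limit check mentioned above, the open-file limit can also be read from Python itself (just a convenience sketch; ulimit -n in the shell reports the same soft limit):
import resource

# Soft and hard limits on the number of open file descriptors for this process;
# a very low soft limit can contribute to connection failures under load.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")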
By addressing these aspects, you may overcome the underlying causes of connection resets or refusals.