I’m having trouble with my web scraping setup and need some advice.
I’m using Python with Selenium and headless Firefox to extract information from shopping websites. The problem happens when I call browser.get(target_url). Most of the time I get these errors:
[Errno 104] Connection reset by peer
[Errno 111] Connection refused
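Here's a stripped-down sketch of the setup (Selenium 3-style API; target_url is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

target_url = "https://example.com/product/123"  # placeholder URL

options = Options()
options.add_argument("-headless")  # run Firefox without a display

browser = webdriver.Firefox(options=options)
try:
    browser.get(target_url)  # this call intermittently raises the errors above
finally:
    browser.quit()
```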
The weird thing is that sometimes it works perfectly fine. When I test the same code on my MacBook with a visible browser window, everything runs smoothly every time. So I don’t think there’s an issue with my scraping logic.
I’ve already tried several fixes like adding explicit waits for elements to load, setting implicit wait times, and using different request libraries to send proper headers. None of these approaches solved the problem.
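For reference, the waits I tried look roughly like this (the selector is a placeholder). Note that they all run after browser.get(), so they never even execute when the connection errors hit:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Implicit wait applied to all element lookups.
browser.implicitly_wait(10)

# Explicit wait for a specific element; the selector is a placeholder.
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-title"))
)
```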
My environment is CentOS 6.5 running Python, Selenium, and Firefox in headless mode via WebDriver.
What’s strange is that I successfully scrape many other JavaScript-heavy websites without any connection issues. This problem only affects certain e-commerce sites.
I’ve been stuck on this for weeks now. Has anyone encountered similar connection problems with headless browsers? Any suggestions would be really helpful.
It sounds like you're running into anti-bot measures on those e-commerce sites. They're quite good at detecting headless browsers, and that detection can produce exactly the connection errors you're seeing. I faced similar problems scraping retail websites. What worked for me was adjusting the browser's fingerprint: make sure the user agent string is realistic and matches your Firefox version, and keep JavaScript enabled. Random delays between requests also help evade detection, and it's worth rotating IPs, or at the very least clearing cookies and starting a fresh browser session periodically.

The fact that your scraping works some of the time suggests rate limiting rather than an outright block. Wrapping browser.get() in retry logic with exponential backoff should make the scraper more resilient to those connection errors.
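A rough sketch of the backoff idea (the exception types and delay values are assumptions to tune for your setup; depending on the Selenium version, a reset/refused connection may surface as a WebDriverException or as a raw OSError):

```python
import random
import time

from selenium.common.exceptions import WebDriverException

def get_with_backoff(browser, url, max_retries=5, base_delay=2.0):
    """Retry browser.get() with exponential backoff plus random jitter.

    The retry count and delays are illustrative starting points,
    not tuned values.
    """
    for attempt in range(max_retries):
        try:
            browser.get(url)
            return
        except (WebDriverException, OSError):
            if attempt == max_retries - 1:
                raise
            # Sleep base_delay * 2^attempt plus up to a second of jitter
            # so retries don't fall into a fixed rhythm the site can spot.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```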
Those headless connection errors are brutal. I’ve hit this exact problem building scraping systems.
It's not just anti-bot detection. CentOS 6.5 is ancient, and headless Firefox has known networking quirks on systems that old. Managing browser configs, proxies, and retry logic by hand is painful.
I gave up on Selenium for this years ago. Now I use Latenode for web scraping. It handles headless browser management and connection resets automatically.
Latenode runs browsers in a clean cloud environment, so no weird CentOS networking issues. It rotates browser fingerprints and handles retries without coding.
I scraped product data from major e-commerce sites using Latenode workflows recently. Zero connection errors, way more reliable than my old Selenium setup. You can schedule it too instead of babysitting scripts that randomly break.
Check it out: https://latenode.com
I’ve encountered similar issues when scraping e-commerce sites using headless Firefox. The connection errors you’re seeing often result from the websites blocking automated requests. The fact that everything works with a visible Firefox instance on your MacBook indicates that the headless mode likely gets flagged.
To improve your scraping success, adjust the browser's fingerprint. One caveat: --disable-blink-features=AutomationControlled is a Chromium/Blink flag and has no effect on Firefox; the Firefox-side equivalent is setting profile preferences. Also give the browser a realistic window size even in headless mode, since many sites examine viewport dimensions and may block requests that look unusual.
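For Firefox that might look like this, as a sketch (Selenium 3-era API; the user agent string is just an example, and whether dom.webdriver.enabled actually hides automation depends on the Firefox version):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")
# Give the headless window realistic dimensions; the default headless
# viewport can look suspicious to fingerprinting scripts.
options.add_argument("--width=1920")
options.add_argument("--height=1080")

profile = webdriver.FirefoxProfile()
# Match the UA string to a real desktop Firefox release (use one that
# corresponds to your installed Firefox version).
profile.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
)
# Best-effort: hides navigator.webdriver in some Firefox versions only.
profile.set_preference("dom.webdriver.enabled", False)

browser = webdriver.Firefox(firefox_profile=profile, options=options)
```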
Additionally, the older CentOS version you’re using might cause instability in network handling. Implementing connection pooling or a session manager could help maintain persistent connections instead of initiating a new one with each request.
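This applies to the requests-based part of your pipeline rather than to Selenium itself, but a pooled session would look something like this (the pool sizes, retry values, and URL are placeholders):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry transient failures and common rate-limit/server statuses,
# with a growing delay between attempts.
retries = Retry(
    total=3,
    backoff_factor=1.0,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=retries)
session.mount("https://", adapter)
session.mount("http://", adapter)

# The session reuses TCP connections across requests instead of
# opening a new one each time.
response = session.get("https://example.com/product/123", timeout=30)
```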
The sporadic connectivity suggests that you might be encountering rate limiting as well, so increasing the time intervals between your requests could enhance reliability.