How to use n8n to automate data extraction from sites that block automated requests

I’m working on a side project and need help with n8n workflow automation. There’s this website I want to extract data from, but it has protection against automated HTTP calls.

My goal is pretty straightforward - I want to navigate to the site, input some text into a search field, click submit, and capture the resulting HTML output. Previously I tried using headless browser solutions with Python scripts but had no luck getting past their blocking mechanisms.

Right now I have a workaround using an Ubuntu virtual machine with atbswp (a mouse and keyboard automation tool) that runs every 5 minutes. But this setup is unreliable and crashes frequently, requiring constant manual restarts.

Does anyone know if n8n has better capabilities for handling websites with anti-bot measures? Looking for recommendations on the best approach.

Your VM setup’s overkill. Try puppeteer-extra with the stealth plugin instead - it patches most of the fingerprints that give plain headless browsers away, and it works fine with n8n through the Execute Command node. Just randomize the timing between actions; sites will flag you fast if you’re hitting them every 5 minutes like clockwork.
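Roughly, the Execute Command node would run a script along these lines. This is a minimal sketch: the URL, the CSS selectors, and the delay ranges are placeholders you’d swap for your target site; the stealth setup follows puppeteer-extra’s documented `use(StealthPlugin())` pattern.

```typescript
// scrape.ts - minimal puppeteer-extra + stealth sketch.
// URL, selectors, and delay ranges below are placeholders.
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

puppeteer.use(StealthPlugin());

// Random pause so actions don't land on an exact, machine-like cadence.
const pause = (min: number, max: number) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com/search', { waitUntil: 'networkidle2' });
  await pause(1000, 3000);

  // Type the query with a per-keystroke delay instead of setting the value instantly.
  await page.type('#search-input', 'my query', { delay: 120 });
  await pause(500, 1500);

  // Submit and wait for the results page to finish loading.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('button[type="submit"]'),
  ]);

  // Print the resulting HTML so the Execute Command node can capture it from stdout.
  console.log(await page.content());
  await browser.close();
})();
```

From the Execute Command node you’d run the compiled script (e.g. `node scrape.js`) and read the HTML from stdout in the next node.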

I’ve encountered similar anti-bot issues before, and n8n’s HTTP Request node typically struggles with them, as it sends requests that are easily identified as automated. Its browser automation capabilities are quite limited as well.

For me, using Playwright with specific stealth configurations and residential proxies has been effective. It’s crucial to mimic human behavior by implementing random delays, simulating natural mouse movements, and using rotating user agents. Some websites also verify WebGL fingerprints and canvas rendering, which tend to pose a challenge for headless browsers.
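As a rough sketch of that setup, the snippet below launches Playwright’s Chromium through a proxy, rotates the user agent, and spaces actions out with random delays. The proxy endpoint, user-agent strings, URL, and selectors are all made-up placeholders, and the mouse movement is only a crude stand-in for genuinely human behavior.

```typescript
// playwright-stealth-sketch.ts - illustrative only; replace the placeholders
// (proxy, user agents, URL, selectors) with your own values.
import { chromium } from 'playwright';

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36',
];

const randomDelay = (min: number, max: number) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

(async () => {
  const browser = await chromium.launch({
    headless: true,
    // Residential proxy endpoint (placeholder credentials).
    proxy: { server: 'http://proxy.example.com:8000', username: 'user', password: 'pass' },
  });

  // Rotate the user agent per run instead of using the default headless one.
  const context = await browser.newContext({
    userAgent: USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
    viewport: { width: 1366, height: 768 },
  });

  const page = await context.newPage();
  await page.goto('https://example.com/search');
  await randomDelay(1500, 4000);

  // Glide the cursor toward the field, then type character by character.
  await page.mouse.move(400, 300, { steps: 25 });
  await page.locator('#search-input').pressSequentially('my query', { delay: 100 });
  await randomDelay(500, 1500);
  await page.click('button[type="submit"]');
  await page.waitForLoadState('networkidle');

  console.log(await page.content());
  await browser.close();
})();
```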

Additionally, browser extensions can be a viable alternative for scheduled tasks, since they interact with websites in a more organic way. They may take more setup, but they tend to be more reliable than virtual machines because they run inside a real browser.

n8n’s built-in nodes are pretty useless against anti-bot protection - they’re way too obvious. Your best bet is pairing n8n with something like ScrapingBee or Apify. These services handle all the tricky browser work externally and just feed clean data back to your n8n workflow via API. You could also use n8n’s webhooks to trigger browser automation scripts on separate servers that properly manage sessions and cookies. Bottom line: don’t make direct HTTP requests from n8n. Let specialized services do the heavy lifting on protected sites.
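For illustration, here’s roughly what the call to such a service looks like, using ScrapingBee-style parameters as an example. The endpoint and parameter names (`api_key`, `url`, `render_js`) are based on ScrapingBee’s public API as I recall it, so verify them against the provider’s docs; in practice you’d usually configure the same request directly in n8n’s HTTP Request node rather than in a script.

```typescript
// fetch-via-scraping-api.ts - sketch of offloading the protected request to a
// scraping API and getting rendered HTML back. Endpoint and parameter names
// are assumptions to be checked against the provider's documentation.
const API_KEY = process.env.SCRAPINGBEE_API_KEY ?? '';
const TARGET_URL = 'https://example.com/search?q=my+query'; // placeholder target

async function fetchRenderedHtml(): Promise<string> {
  const endpoint = new URL('https://app.scrapingbee.com/api/v1/');
  endpoint.searchParams.set('api_key', API_KEY);
  endpoint.searchParams.set('url', TARGET_URL);
  endpoint.searchParams.set('render_js', 'true'); // render the page in a real browser on their side

  const response = await fetch(endpoint.toString());
  if (!response.ok) {
    throw new Error(`Scraping API returned ${response.status}`);
  }
  return response.text();
}

fetchRenderedHtml().then((html) => console.log(html.length, 'bytes of HTML'));
```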