I’m working on a web scraping project that targets React-based websites. Because their content is rendered client-side, I need a real browser and am using Puppeteer. My current setup includes several features to avoid detection and improve performance.
I’m using the stealth plugin to bypass anti-bot protection, the adblocker plugin to prevent advertisements from loading, and a Tor proxy connection to rotate IP addresses and avoid getting blacklisted.
Here’s my current working code:
const puppeteer = require('puppeteer-extra');
const StealthModule = require('puppeteer-extra-plugin-stealth');
const AdBlockModule = require('puppeteer-extra-plugin-adblocker');

// Register the plugins before launching the browser
puppeteer.use(StealthModule());
puppeteer.use(AdBlockModule());

const torPort = 9050;
const targetUrl = 'https://www.redfin.com/CA/Los-Angeles/123-Main-St-90210/home/12345';

// await is only valid inside an async function in a CommonJS script
(async () => {
  const browserInstance = await puppeteer.launch({
    dumpio: false,
    headless: true,
    args: [
      `--proxy-server=socks5://127.0.0.1:${torPort}`,
      '--no-sandbox',
    ],
    ignoreHTTPSErrors: true,
  });
  try {
    const newPage = await browserInstance.newPage();
    await newPage.setViewport({ width: 1366, height: 768 });
    await newPage.goto(targetUrl, {
      waitUntil: 'domcontentloaded',
      timeout: 25000,
    });
    // Await the selector so a timeout reaches the surrounding try/catch
    await newPage.waitForSelector('.home-value').catch(() => {
      throw new Error('Could not locate property information');
    });
    console.log('Property value found successfully');
  } catch (error) {
    console.error(error.message);
  } finally {
    // Close the browser exactly once, on success or failure
    await browserInstance.close();
  }
})();
This setup works but has reliability issues. I discovered puppeteer-cluster, which seems well suited to managing multiple scraping tasks efficiently. The problem is that I can’t figure out how to combine my existing plugin configuration with the cluster approach.
I found the basic example in their documentation, but it doesn’t show how to integrate the Tor proxy, stealth mode, and ad blocking together. Can someone help me restructure this code to work with puppeteer-cluster while keeping all of these features?
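
For reference, here is a minimal sketch of the direction I think this should take. I have not tested it. puppeteer-cluster's Cluster.launch() accepts a `puppeteer` option, so the plugin-enabled puppeteer-extra instance can be passed in place of the default one, and the launch arguments (including the proxy) go in `puppeteerOptions`. The helper names `buildClusterOptions` and `scrapeAll` are hypothetical, and the concurrency, timeout, and retry values are only illustrative. Note the assumption that every worker shares the same Tor SOCKS port, so they will probably share an exit IP as well.

```javascript
// Sketch only: assumes puppeteer-cluster, puppeteer-extra and both plugins are
// installed, and that a Tor SOCKS proxy listens on 127.0.0.1:<torPort>.

// Hypothetical helper: builds the options object for Cluster.launch().
// The puppeteer instance and Cluster class are passed in as arguments.
function buildClusterOptions(puppeteer, Cluster, torPort = 9050) {
  return {
    puppeteer, // plugin-enabled puppeteer-extra instance
    concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito context per worker
    maxConcurrency: 2, // illustrative value
    timeout: 60000,    // per-task timeout in ms (illustrative)
    retryLimit: 2,     // retry a failed task up to twice (illustrative)
    retryDelay: 5000,  // wait 5 s before a retry (illustrative)
    puppeteerOptions: {
      headless: true,
      ignoreHTTPSErrors: true,
      args: [`--proxy-server=socks5://127.0.0.1:${torPort}`, '--no-sandbox'],
    },
  };
}

// Hypothetical helper: queues every URL and resolves when all tasks are done.
async function scrapeAll(urls, torPort = 9050) {
  // Required inside the function so buildClusterOptions can be used on its own
  const { Cluster } = require('puppeteer-cluster');
  const puppeteer = require('puppeteer-extra');
  const StealthModule = require('puppeteer-extra-plugin-stealth');
  const AdBlockModule = require('puppeteer-extra-plugin-adblocker');

  puppeteer.use(StealthModule());
  puppeteer.use(AdBlockModule());

  const cluster = await Cluster.launch(
    buildClusterOptions(puppeteer, Cluster, torPort)
  );

  // Log failed tasks; willRetry tells whether the cluster will requeue the job
  cluster.on('taskerror', (err, data, willRetry) => {
    console.error(`Failed ${data}: ${err.message}${willRetry ? ' (will retry)' : ''}`);
  });

  // The task runs once per queued URL, each time with a fresh page
  await cluster.task(async ({ page, data: url }) => {
    await page.setViewport({ width: 1366, height: 768 });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 25000 });
    await page.waitForSelector('.home-value');
    console.log(`Property value found on ${url}`);
  });

  for (const url of urls) {
    cluster.queue(url);
  }

  await cluster.idle();  // wait until the queue is empty
  await cluster.close(); // shut down all workers and the browser
}
```

It would be called as `scrapeAll([targetUrl]).catch(console.error);`. I'm unsure whether CONCURRENCY_CONTEXT is the right choice with the stealth plugin, or whether CONCURRENCY_BROWSER would be safer.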