How to integrate puppeteer-extra plugins with puppeteer-cluster for web scraping?

I’m working on a web scraping project that targets React-based websites, so I need to use Puppeteer. My current setup includes several important features to avoid detection and improve performance.

I’m using the stealth plugin to bypass anti-bot protection, the adblocker plugin to prevent advertisements from loading, and TOR proxy connections to rotate IP addresses and avoid getting blacklisted.

Here’s my current working code:

const puppeteer = require('puppeteer-extra');
const StealthModule = require('puppeteer-extra-plugin-stealth');
const AdBlockModule = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(StealthModule());
puppeteer.use(AdBlockModule());

const torPort = 9050;
const targetUrl = 'https://www.redfin.com/CA/Los-Angeles/123-Main-St-90210/home/12345';

// Wrap everything in an async function: top-level await is not available in CommonJS
(async () => {
    const browserInstance = await puppeteer.launch({
        dumpio: false,
        headless: true,
        args: [
            `--proxy-server=socks5://127.0.0.1:${torPort}`,
            '--no-sandbox',
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const newPage = await browserInstance.newPage();
        await newPage.setViewport({ width: 1366, height: 768 });
        await newPage.goto(targetUrl, {
            waitUntil: 'domcontentloaded',
            timeout: 25000,
        });

        // Await the selector so a timeout is caught by the surrounding try/catch
        await newPage.waitForSelector('.home-value');
        console.log('Property value found successfully');
    } catch (error) {
        console.error('Could not locate property information:', error.message);
    } finally {
        // Close the browser exactly once, whether the scrape succeeded or failed
        await browserInstance.close();
    }
})();

This setup works but has reliability issues. I discovered puppeteer-cluster, which seems perfect for managing multiple scraping tasks efficiently. The problem is that I can’t figure out how to combine my existing plugin configuration with the cluster approach.

I found the basic example in their documentation but it doesn’t show how to properly integrate TOR proxies, stealth mode, and ad blocking together. Can someone help me restructure this code to work with puppeteer-cluster while maintaining all these features?

Your problem is that puppeteer-cluster launches its own browsers, and unless you tell it otherwise it uses the regular puppeteer package with default settings, so everything has to be configured in Cluster.launch(). Here’s what worked for me: pass your configured puppeteer-extra instance through the puppeteer option, and put the browser arguments, including the TOR proxy flag, in puppeteerOptions rather than setting them per task. Watch out though: the stealth plugin can throw off timing in cluster mode, since multiple pages compete for resources. I had to bump up timeouts and add delays between tasks to avoid detection. Also, shut the cluster down properly (await cluster.idle(), then cluster.close()), or you’ll be left with hanging browser processes holding TOR connections.
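
A minimal sketch of that setup, not a definitive implementation. It assumes puppeteer-cluster, puppeteer-extra, and both plugins are installed, and that TOR is listening on port 9050 as in the question. buildClusterOptions and runCluster are hypothetical helper names, and the maxConcurrency, timeout, and sameDomainDelay values are placeholder guesses you would need to tune.

```javascript
// Sketch only: option values are illustrative assumptions, not tuned settings.

// Build the options object for Cluster.launch(). Kept as a pure function so the
// configuration can be inspected without starting a browser.
function buildClusterOptions({ puppeteer, concurrency, torPort = 9050 }) {
    return {
        puppeteer,              // the puppeteer-extra instance with plugins applied
        concurrency,            // e.g. Cluster.CONCURRENCY_CONTEXT
        maxConcurrency: 2,      // assumed value; every worker shares one TOR proxy here
        timeout: 60000,         // per-task timeout in ms (assumed value)
        sameDomainDelay: 2000,  // assumed pause (ms) between tasks on the same domain
        puppeteerOptions: {
            headless: true,
            ignoreHTTPSErrors: true,
            args: [`--proxy-server=socks5://127.0.0.1:${torPort}`, '--no-sandbox'],
        },
    };
}

async function runCluster(urls) {
    // Required lazily so the helper above works without these packages installed.
    const { Cluster } = require('puppeteer-cluster');
    const puppeteer = require('puppeteer-extra');
    puppeteer.use(require('puppeteer-extra-plugin-stealth')());
    puppeteer.use(require('puppeteer-extra-plugin-adblocker')());

    const cluster = await Cluster.launch(
        buildClusterOptions({ puppeteer, concurrency: Cluster.CONCURRENCY_CONTEXT })
    );

    // Log failures instead of letting them go unnoticed.
    cluster.on('taskerror', (err, data) => {
        console.error(`Failed to scrape ${data}: ${err.message}`);
    });

    await cluster.task(async ({ page, data: url }) => {
        await page.setViewport({ width: 1366, height: 768 });
        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 25000 });
        await page.waitForSelector('.home-value');
        console.log(`Property value found for ${url}`);
    });

    try {
        for (const url of urls) cluster.queue(url);
        await cluster.idle();   // wait until the queue is drained
    } finally {
        await cluster.close();  // always shut down the browser processes
    }
}
```

Calling runCluster with an array of property URLs queues one task per URL, and the finally block closes the cluster even if queueing or waiting throws.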

Had the same problem. By default the cluster launches browsers with the regular puppeteer package, so your plugins never get applied; you have to pass it your puppeteer-extra instance. Use Cluster.launch({ puppeteer: puppeteer }), where puppeteer is your configured puppeteer-extra with the plugins already registered. That fixed it for me, though I did run into memory leaks at first.

Had this exact problem 6 months ago. The main thing is that puppeteer-cluster creates its own browser instances, so you’ve got to configure everything at the cluster level, not per page. Here’s what worked for me: set up your plugins first, then pass the configured puppeteer-extra instance to the cluster along with your browser args in the launch options. TOR gets tricky because all workers use the same proxy, which is not great for IP rotation. I built a proxy rotation system so that different workers connect to different TOR circuits. Heads up: in my experience the adblocker plugin interferes with the cluster’s page pooling, and blocked requests sometimes hang forever. I added custom timeouts and restart workers periodically to work around it. Even so, the performance boost from clustering is huge, especially when scraping multiple properties at once.
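
One possible way to give each worker its own TOR circuit, sketched under assumptions and not the rotation system described above. It assumes your torrc defines one SocksPort line per port in the list (the port numbers are made up), since TOR isolates streams arriving on different SocksPorts by default. It relies on puppeteer-cluster’s perBrowserOptions, whose length must equal maxConcurrency and which, as far as I know, only takes effect with Cluster.CONCURRENCY_BROWSER. buildPerBrowserOptions and launchRotatingCluster are hypothetical helper names.

```javascript
// Sketch only: assumes torrc has one SocksPort entry for every port in torPorts.

// One set of launch options per worker, each pointing at a different SOCKS port.
function buildPerBrowserOptions(torPorts) {
    return torPorts.map((port) => ({
        headless: true,
        ignoreHTTPSErrors: true,
        args: [`--proxy-server=socks5://127.0.0.1:${port}`, '--no-sandbox'],
    }));
}

async function launchRotatingCluster(puppeteer, torPorts) {
    // Required lazily so the helper above works without the package installed.
    const { Cluster } = require('puppeteer-cluster');
    return Cluster.launch({
        puppeteer,                                 // configured puppeteer-extra instance
        concurrency: Cluster.CONCURRENCY_BROWSER,  // one browser process per worker
        maxConcurrency: torPorts.length,           // must match perBrowserOptions.length
        perBrowserOptions: buildPerBrowserOptions(torPorts),
        timeout: 60000,                            // assumed per-task timeout in ms
    });
}
```

For example, launchRotatingCluster(puppeteer, [9050, 9052, 9053]) would start three workers, each sending its traffic through a different local SOCKS port. One browser per worker uses more memory than sharing a single browser, so keep the port list short.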
