I’m using Node.js with Puppeteer for a scraping tool that concurrently processes five pages. After completing one page, it fetches the next URL from a queue to open in the same instance. However, the CPU usage consistently stays at 100%. What strategies can I implement to reduce CPU consumption in Puppeteer?
This setup is hosted on a DigitalOcean droplet featuring 4GB of RAM and 2 vCPUs. I’ve already tried launching Puppeteer with specific arguments to minimize resource usage, yet there has been no change:
puppeteer.launch({
args: ['--no-sandbox', '--disable-accelerated-2d-canvas', '--disable-gpu'],
headless: true,
});
Are there alternative arguments I can apply to lessen CPU demands? Additionally, I’ve disabled image loading in the following manner:
await page.setRequestInterception(true);
page.on('request', request => {
if (request.resourceType().toUpperCase() === 'IMAGE')
request.abort();
else
request.continue();
});
Firstly, consider reducing the number of concurrent pages. Try processing fewer pages at a time, like 2-3, and monitor CPU usage. Each page can be resource-heavy.
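One way to make the concurrency easy to tune is a small worker pool, so that moving from 5 to 2-3 is a one-argument change. This is only a sketch: `runQueue` and `handler` are hypothetical names, and `handler` stands in for whatever function drives a Puppeteer page for one URL.

```javascript
// Sketch: run `handler` over `urls` with at most `concurrency` tasks in flight.
// `handler` is a hypothetical async function, e.g. one that drives a Puppeteer page.
async function runQueue(urls, concurrency, handler) {
  const queue = [...urls]; // copy so the caller's array is left untouched
  const results = [];

  async function worker(workerId) {
    // Each worker keeps pulling the next URL until the queue is empty.
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        results.push({ url, value: await handler(url, workerId) });
      } catch (error) {
        // Record the failure and move on instead of stopping the whole run.
        results.push({ url, error });
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: workerCount }, (_, id) => worker(id)));
  return results;
}
```

With Puppeteer, `handler` could navigate a page that was opened once per worker (for example `pages[workerId]`, where `pages` is an array you create up front), so lowering `concurrency` directly lowers the number of live tabs.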
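Optimize your code further by blocking more network resource types. Block fonts and other unnecessary resources (only block stylesheets if your scraping doesn't depend on layout or visibility):
// Interception must be enabled before abort()/continue() can be called.
await page.setRequestInterception(true);
page.on('request', request => {
const resourceType = request.resourceType().toUpperCase();
if (['IMAGE', 'FONT', 'STYLESHEET'].includes(resourceType))
request.abort();
else
request.continue();
});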
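Additionally, add a short pause between navigations so the CPU gets a chance to idle. Older Puppeteer versions offer await page.waitForTimeout(200); for this, but that method is deprecated and has been removed in newer releases, so a plain setTimeout wrapped in a promise is the safer choice.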
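A version-independent pause can be written without any Puppeteer API. The sketch below assumes a hypothetical `visit` function that navigates to and scrapes a single URL; `sleep` and `visitWithPause` are illustrative names, and the 200 ms default simply mirrors the value above.

```javascript
// Promise-based pause that does not depend on any Puppeteer API.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: visit URLs one after another with a pause in between.
// `visit` stands in for whatever navigates to and scrapes a single URL.
async function visitWithPause(urls, visit, pauseMs = 200) {
  for (const url of urls) {
    await visit(url);
    await sleep(pauseMs); // give the CPU a moment before the next navigation
  }
}
```

The pause trades throughput for lower average load, so it is worth measuring whether it helps more than simply lowering the page count.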
Lastly, try running this on a more capable hardware setup if feasible, as scraping can be CPU-intensive.
Reducing CPU usage with Puppeteer requires a strategic approach to optimize resource handling. Consider these steps for improved efficiency:
1. Limit Concurrent Pages: Start by adjusting your concurrency. Operate with 2-3 pages instead of 5, which should decrease immediate CPU demand.
2. More Resource Blocking: Alongside images, block other heavy resources such as fonts, CSS, and JavaScript when your scraping doesn't need them. Be careful with scripts: pages that build their content with JavaScript will come back empty if you block them. A refined handler:
// Requires: await page.setRequestInterception(true);
page.on('request', request => {
const resourceType = request.resourceType().toUpperCase();
if (['IMAGE', 'FONT', 'STYLESHEET', 'SCRIPT'].includes(resourceType))
request.abort();
else
request.continue();
});
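3. Optimize Browser Context Reuse: Create one browser context and one page per worker up front, then keep navigating that page with page.goto() instead of opening and closing a tab for every URL. This can substantially lower resource consumption:
// Create these once and reuse them for every URL.
// Newer Puppeteer releases name this method createBrowserContext().
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();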
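4. Use Minimal Viewport: A smaller viewport can lessen rendering overhead. Puppeteer's default is already 800x600, so this only matters if you have configured something larger; otherwise try going below it if the pages still render what you need: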
await page.setViewport({ width: 800, height: 600 });
Implementing these steps should help you achieve better CPU efficiency while running your scraping tool on a DigitalOcean droplet. Keep monitoring the performance and adjust as necessary for optimal outcomes.
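Since blocking scripts or stylesheets can break some sites, it may help to keep the blocked types configurable per target. The following is a minimal sketch, not a definitive implementation: `blockResources` and `DEFAULT_BLOCKED` are hypothetical names, and `page` is assumed to be a Puppeteer Page.

```javascript
// Resource types to drop by default; Puppeteer reports them in lowercase.
const DEFAULT_BLOCKED = new Set(['image', 'font', 'media']);

// Sketch: enable interception on `page` and abort requests of blocked types.
// `page` is assumed to be a Puppeteer Page; `blocked` is a Set of type names.
async function blockResources(page, blocked = DEFAULT_BLOCKED) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    const action = blocked.has(request.resourceType())
      ? request.abort()
      : request.continue();
    // abort()/continue() return promises; ignore late failures
    // (for example when the page is closed mid-request).
    Promise.resolve(action).catch(() => {});
  });
}
```

For a static site you might call `blockResources(page, new Set(['image', 'font', 'media', 'stylesheet', 'script']))`, while a JavaScript-rendered site would keep the default set.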
Managing CPU usage effectively when using Puppeteer for web scraping can be critical, especially in constrained environments like your DigitalOcean droplet. Here are some additional strategies you might consider:
1. Check Site Isolation: Site isolation puts each site in its own renderer process, which can help prevent one tab from consuming all resources. It is unlikely to lessen CPU by itself and usually costs extra memory and processes, so measure before and after:
puppeteer.launch({
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
// Disables the same-origin policy; unrelated to CPU, drop it unless you need it.
'--disable-web-security',
'--disable-accelerated-2d-canvas',
'--disable-gpu',
// Chromium's switch for site isolation is --site-per-process.
'--site-per-process'
],
headless: true
});
2. Limit Request Payload: Cap how long a navigation may run and block heavy resource types. The timeout keeps a slow page from holding a tab, and its CPU, indefinitely, and the blocking reduces how much each page has to download and process:
page.setDefaultNavigationTimeout(30000); // milliseconds; this call is synchronous
await page.setRequestInterception(true);
page.on('request', request => {
const resourceType = request.resourceType().toUpperCase();
if (['IMAGE', 'FONT', 'STYLESHEET', 'SCRIPT', 'MEDIA'].includes(resourceType))
request.abort();
else
request.continue();
});
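3. Reduce Resource Priority: If applicable, look into experimental Chrome DevTools Protocol options for network resource prioritization. The method name Network.setResourcePriority is unverified and may not exist in the protocol's Network domain, so check the protocol reference before building on it.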
4. Adjust Concurrency Dynamically: Consider scaling the number of active pages with the CPU headroom that is currently available, or scheduling heavier runs for times of low CPU usage. Roughly (availableCPULevel, averageCPULoadPerPage, and maxPages are placeholders you would measure or choose yourself):
const maxConcurrentPages = Math.max(1, Math.min(Math.floor(availableCPULevel / averageCPULoadPerPage), maxPages));
While some of these steps might require experimentation to see what fits your workload best, refining your approach with these optimizations is a great path towards reducing CPU impact without sacrificing performance.
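The dynamic-concurrency idea can be sketched with Node's built-in `os` module. Everything here is an assumption for illustration: `pickConcurrency` and `currentConcurrency` are hypothetical names, and `loadPerPage = 0.5` is a guessed tuning value rather than a measured one.

```javascript
const os = require('node:os');

// Sketch: derive a page budget from the 1-minute load average.
// `maxPages` and `loadPerPage` are assumed tuning values, not measured ones.
function pickConcurrency({ load, cores, maxPages = 5, loadPerPage = 0.5 }) {
  const headroom = Math.max(0, cores - load); // idle CPU capacity, in cores
  const budget = Math.floor(headroom / loadPerPage);
  return Math.max(1, Math.min(maxPages, budget)); // always keep one page running
}

// Reads the current values from the OS (os.loadavg() is all zeros on Windows).
function currentConcurrency(maxPages = 5) {
  return pickConcurrency({
    load: os.loadavg()[0],
    cores: os.cpus().length,
    maxPages,
  });
}
```

A scraper could call `currentConcurrency()` before starting each batch and open only that many pages. The load average reacts slowly, so this smooths out spikes rather than responding to them instantly.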
To reduce CPU usage in your Puppeteer setup, consider the following:
1. Limit Concurrent Pages: Try processing fewer pages at once, like 2-3, to reduce immediate load.
2. Further Resource Blocking: Besides images, block other non-essential resources. Leave scripts unblocked if the site renders its content with JavaScript:
// Requires: await page.setRequestInterception(true);
page.on('request', request => {
const resourceType = request.resourceType().toUpperCase();
if (['IMAGE', 'FONT', 'STYLESHEET', 'SCRIPT'].includes(resourceType))
request.abort();
else
request.continue();
});
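3. Reuse Browser Context: Instead of opening new tabs for each URL, create a single browser context and page once and keep navigating them:
// Newer Puppeteer releases name this method createBrowserContext().
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();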
4. Smaller Viewport: Set a smaller viewport to lessen rendering load (800x600 is already Puppeteer's default, so this helps only if you had configured a larger one):
await page.setViewport({ width: 800, height: 600 });
These strategies should help manage CPU consumption effectively on your DigitalOcean droplet.