Node.js Puppeteer Screenshot Loop with Async/Await Issues

I’m working on a Node.js application that should extract URLs from a sitemap and capture screenshots using Puppeteer. However, I’m running into some async/await problems that I can’t figure out.

const bluebird = require('bluebird');
const puppeteer = require('puppeteer');
const SitemapParser = require('sitemapper');

async function processUrls(sitemapUrl, requestTimeout) {
  const urlList = await extractUrls(sitemapUrl, requestTimeout);
  await bluebird.each(urlList, async (url, idx) => {
    await captureScreenshot(url, idx);
  });
}

async function captureScreenshot(url, idx) {
  const browserInstance = await puppeteer.launch();
  console.log('processing', idx);
  const newPage = await browserInstance.newPage();
  await newPage.goto(url);
  const filePath = await 'images/' + idx + newPage.title() + '.png';
  await newPage.screenshot({path: filePath});
  browserInstance.close();
}

async function extractUrls(sitemapUrl, requestTimeout) {
  const parser = await new SitemapParser({
    url: sitemapUrl,
    timeout: requestTimeout
  });
  const response = await parser.fetch();
  console.log(response.sites.length);
  return response.sites;
}

processUrls('https://example.com/sitemap.xml', 10000)
  .catch(error => {
    console.error(error);
  });

I’m facing two main issues. First, the URL array length keeps changing between script runs, even though I’m using await. Second, the screenshot functionality is unreliable and sometimes creates duplicate files. I think there might be promise resolution problems but I’m not sure about the correct async loop pattern. Any suggestions would be helpful.

You’re launching a new browser for every screenshot - that’s killing your performance and making the whole run flaky. Launch the browser once at the start of processUrls, pass it into captureScreenshot, and close it when the loop’s done. That SitemapParser line looks off too: the constructor isn’t async, so drop the await and just use const parser = new SitemapParser({...}). The URL count changing between runs means either your sitemap is actually updating or you’re hitting timeouts - add error handling around parser.fetch() and bump up that timeout value. As for the broken filenames, you’re never awaiting newPage.title(), so the pending Promise gets stringified straight into the path.
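Roughly like this (untested sketch, keeping your function names; captureScreenshot now takes the shared browser as its first argument):

const bluebird = require('bluebird');
const puppeteer = require('puppeteer');
const SitemapParser = require('sitemapper');

async function processUrls(sitemapUrl, requestTimeout) {
  const urlList = await extractUrls(sitemapUrl, requestTimeout);
  // one browser for the whole run
  const browserInstance = await puppeteer.launch();
  try {
    await bluebird.each(urlList, async (url, idx) => {
      await captureScreenshot(browserInstance, url, idx);
    });
  } finally {
    await browserInstance.close();
  }
}

async function captureScreenshot(browserInstance, url, idx) {
  console.log('processing', idx);
  const newPage = await browserInstance.newPage();
  await newPage.goto(url);
  // title() returns a promise, so await it before building the path
  const filePath = 'images/' + idx + (await newPage.title()) + '.png';
  await newPage.screenshot({path: filePath});
  await newPage.close();
}

async function extractUrls(sitemapUrl, requestTimeout) {
  // the constructor is synchronous, no await needed
  const parser = new SitemapParser({url: sitemapUrl, timeout: requestTimeout});
  try {
    const response = await parser.fetch();
    console.log(response.sites.length);
    return response.sites;
  } catch (err) {
    console.error('sitemap fetch failed:', err);
    return [];
  }
}

The try/finally makes sure the shared browser still gets closed if one of the pages throws partway through the loop.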

I’ve hit the same thing scraping big sitemaps. You’re burning through memory by creating a new browser instance for every URL - that alone will make long runs unreliable. Create one browser instance and spawn multiple pages from it, or set up a page pool with a concurrency limit. Also, bluebird.each runs strictly one at a time, which is slow on a big sitemap - switch to bluebird.map with a concurrency option: await bluebird.map(urlList, captureScreenshot, {concurrency: 3}). For the changing URL counts, wrap the sitemap fetch in retry logic - network timeouts happen all the time. And sanitize your filenames: page titles contain special characters that will break your file paths. Something like title.replace(/[^a-z0-9]/gi, '_') does the job.
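Putting those pieces together, a rough sketch (same requires as your script; it assumes the single shared browser from the other answer, so captureScreenshot takes the browser as its first argument instead of launching its own):

async function processUrls(sitemapUrl, requestTimeout) {
  const urlList = await extractUrls(sitemapUrl, requestTimeout);
  const browser = await puppeteer.launch();
  // up to 3 pages in flight at once, all from the single shared browser
  await bluebird.map(urlList, (url, idx) => captureScreenshot(browser, url, idx), {concurrency: 3});
  await browser.close();
}

async function captureScreenshot(browser, url, idx) {
  const page = await browser.newPage();
  await page.goto(url);
  // strip anything that isn't alphanumeric so the title can't break the file path
  const safeTitle = (await page.title()).replace(/[^a-z0-9]/gi, '_');
  await page.screenshot({path: 'images/' + idx + '_' + safeTitle + '.png'});
  await page.close();
}

async function extractUrls(sitemapUrl, requestTimeout, attempts = 3) {
  const parser = new SitemapParser({url: sitemapUrl, timeout: requestTimeout});
  // retry a few times before giving up, since sitemap fetches time out a lot
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const response = await parser.fetch();
      return response.sites;
    } catch (err) {
      console.error('sitemap fetch attempt', attempt, 'failed:', err);
    }
  }
  return [];
}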

The problem is in your captureScreenshot function: you’re not awaiting newPage.title(), which returns a promise. Change that line to const filePath = 'images/' + idx + await newPage.title() + '.png'; and drop the await in front of the string concatenation - it does nothing there. That missing await is most likely what’s behind the broken and duplicate files.
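In context (only the filePath line changes, the rest is your function as posted):

async function captureScreenshot(url, idx) {
  const browserInstance = await puppeteer.launch();
  console.log('processing', idx);
  const newPage = await browserInstance.newPage();
  await newPage.goto(url);
  // await the title before concatenating, otherwise the pending Promise is stringified into the path
  const filePath = 'images/' + idx + await newPage.title() + '.png';
  await newPage.screenshot({path: filePath});
  browserInstance.close();
}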