Recursively scraping paginated content using Puppeteer

I’m trying to scrape a paginated list using Puppeteer. I want to get all the results, but I’m running into issues.

My current approach uses a for loop, but it’s giving me this error:

UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.evaluate): Cannot find context with specified id undefined

How can I fix this and turn my for loop into a recursive function? Here’s a simplified version of the recursive approach I’m aiming for:

async function scrapePage(pageNum) {
  const content = await page.content();
  console.log(`Got content from page ${pageNum}`);
  
  // Save content to file
  
  const nextButton = await page.$('button.next-page');
  if (nextButton) {
    await nextButton.click();
    await page.waitForNavigation();
    await scrapePage(pageNum + 1);
  }
}

await scrapePage(1);

Any tips on making this work correctly? I’m new to web scraping and could use some guidance. Thanks!

I’ve run into similar issues with Puppeteer, and your approach is on the right track. That "Cannot find context" error usually means the page’s JavaScript execution context was destroyed, typically because a navigation happened while your script was still talking to the old page. Two changes help: wait for a selector that only exists once the new content has loaded (page.waitForSelector()) rather than relying solely on page.waitForNavigation(), and wrap the scraping code in a try-catch block so a failure on one page doesn’t crash the whole run.

Below is a refined version of your function:

async function scrapePage(pageNum) {
  try {
    const content = await page.evaluate(() => document.body.innerHTML);
    console.log(`Scraped content from page ${pageNum}`);
    
    // Save content to file here
    
    const nextButton = await page.$('button.next-page');
    if (nextButton) {
      await nextButton.click();
      await page.waitForSelector('your-unique-page-element'); // replace with a selector that only exists on the freshly loaded page
      await scrapePage(pageNum + 1);
    }
  } catch (error) {
    console.error(`Error on page ${pageNum}:`, error);
  }
}

This should help resolve your issue and make the scraping process more robust.
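
For the "Save content to file" placeholder, one simple option (assuming one HTML file per page is enough for your case) is Node's built-in fs/promises module; the filename pattern here is just an example:

const fs = require('fs/promises');

// Write each page's HTML to its own file, e.g. page-1.html, page-2.html, ...
async function saveContent(pageNum, content) {
  await fs.writeFile(`page-${pageNum}.html`, content, 'utf8');
}

Then call await saveContent(pageNum, content); where the placeholder comment sits.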

Hey Mike, I’ve had similar issues. Try adding a delay between clicks using page.waitForTimeout(2000) before clicking next. Also, make sure you’re using page.evaluate() to grab content inside the page context. Good luck with your scraping project!
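
To make that concrete, here’s a rough sketch of what that suggestion looks like (the delay length and selector are placeholders, and I’ve used a plain Promise-based delay since page.waitForTimeout() is deprecated in recent Puppeteer releases):

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeWithDelay(pageNum) {
  await delay(2000); // let the page settle before reading it

  // Grab the rendered HTML from inside the page context
  const html = await page.evaluate(() => document.body.innerHTML);
  console.log(`Page ${pageNum}: got ${html.length} characters`);

  const nextButton = await page.$('button.next-page');
  if (nextButton) {
    await nextButton.click();
    return scrapeWithDelay(pageNum + 1);
  }
}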

Hey Mike, I’ve done quite a bit of web scraping with Puppeteer and I think I can help you out. Your approach is solid, but there are a few tweaks that could make it more reliable.

First off, instead of using page.content(), try page.evaluate(() => document.body.innerHTML). That way you’re reading the rendered DOM from inside the page’s own context.

Also, I’ve found that adding a small delay and checking for the existence of new content before proceeding can really help with stability. Here’s how I might modify your function:

async function scrapePage(pageNum) {
  await page.waitForTimeout(1000); // Give the page a moment to settle
  const content = await page.evaluate(() => document.body.innerHTML);
  console.log(`Scraped page ${pageNum}`);
  
  // Save content logic here
  
  const nextButton = await page.$('button.next-page');
  if (nextButton) {
    await nextButton.click();
    await page.waitForFunction(() => document.querySelector('some-unique-element-on-next-page')); // placeholder: pick something that only appears once the next page has rendered
    return scrapePage(pageNum + 1);
  }
}

This approach has worked well for me on various projects. Let me know if you need any clarification!
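
And in case it helps, here’s roughly how I wire the whole thing up so the page variable the function relies on actually exists (the URL, launch options, and selectors are placeholders you’d swap for your target site):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Open the first page of results before starting the recursive scrape
  await page.goto('https://example.com/results?page=1', { waitUntil: 'networkidle2' });

  async function scrapePage(pageNum) {
    const content = await page.evaluate(() => document.body.innerHTML);
    console.log(`Scraped page ${pageNum}`);
    // Save content to file here

    const nextButton = await page.$('button.next-page');
    if (nextButton) {
      await nextButton.click();
      await page.waitForSelector('some-unique-element-on-next-page'); // placeholder
      return scrapePage(pageNum + 1);
    }
  }

  await scrapePage(1);
  await browser.close();
})();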