I’m using Puppeteer with JavaScript for web scraping. A single anchor retrieval works, but my loop for multiple anchors produces errors. How can I iterate correctly?
I ran into a similar challenge recently. My approach was to reduce the number of page.evaluate calls by capturing all the relevant data in a single evaluate function, then working with that data in the Node.js context. This minimizes the back-and-forth between Node and the browser context and avoids issues with stale element handles. In my setup, using Array.from on the NodeList also helped ensure that the iteration was clean and error-free. That method saved me a lot of debugging time, and I found it to be a more efficient way to handle multiple elements.
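Here's a rough sketch of what I mean. Puppeteer is assumed to be installed, and the URL and `a` selector are placeholders for your actual target; the `summarizeAnchors` helper is just an illustration of doing the per-element work back in Node.

```javascript
// Pure helper: runs in Node after extraction, so it's easy to test and debug.
function summarizeAnchors(anchors) {
  return anchors.map(a => `${a.text.trim()} -> ${a.href}`);
}

async function scrapeAnchors(url) {
  const puppeteer = require('puppeteer'); // lazy require; assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // One evaluate call: Array.from turns the live NodeList into a plain array
    // of serializable objects that can cross back into the Node context.
    const anchors = await page.evaluate(() =>
      Array.from(document.querySelectorAll('a'), a => ({
        text: a.textContent,
        href: a.href,
      }))
    );
    // All further iteration happens in Node, with no more cross-context calls.
    return summarizeAnchors(anchors);
  } finally {
    await browser.close();
  }
}
```

The key point is that only plain data (not element handles) crosses the context boundary, so nothing can go stale mid-loop.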
I encountered a similar issue while scraping dynamic content. To avoid the overhead of each iteration making a separate page.evaluate call, I instead extracted all the relevant anchor texts in one go from within the page context. The data can then be iterated in Node's environment. This approach significantly reduced the number of cross-context calls, preventing errors due to asynchronous load issues. Once the array of texts is available, further operations on the data become less prone to timing problems and are easier to debug.
hey, try using page.$$eval after a proper waitForSelector so all anchors are loaded. i found that grabbing them in one go and then looping in node helps avoid async issues. hope it helps!
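something like this (puppeteer assumed installed, url and selector are placeholders, `dedupe` is just an example of post-processing in node):

```javascript
// Pure helper: drop duplicate links once the data is back in Node.
function dedupe(hrefs) {
  return [...new Set(hrefs)];
}

async function getLinks(url) {
  const puppeteer = require('puppeteer'); // lazy require; assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector('a'); // make sure anchors exist before grabbing them
    // $$eval runs the callback once over all matches and returns plain data.
    const hrefs = await page.$$eval('a', anchors => anchors.map(a => a.href));
    return dedupe(hrefs); // loop over the results in node, no async surprises
  } finally {
    await browser.close();
  }
}
```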