Extracting data from dynamically loaded websites using Node.js with headless browser automation

I’m working on extracting information from websites that load content dynamically using JavaScript. My approach involves using a headless browser automation tool.

The tricky part is timing the browser shutdown correctly. I need to wait until all the custom JavaScript execution completes before closing the browser instance. If I terminate it too early, the data extraction process gets interrupted and I lose all progress.

I’ve implemented an async generator pattern to manage the browser lifecycle and control the asynchronous operations. However, I’m not satisfied with the current implementation and suspect there might be a cleaner approach.

Here’s my current code structure:

module.exports = function createBrowserClient (browserAutomation) {
  const extractData = async (targetUrl, customScript) => {
    const requestHandler = initiateRequest(targetUrl)
    const { value: browserPage } = await requestHandler.next()

    if (browserPage) {
      const extractedData = await browserPage.evaluate(customScript)
      requestHandler.next() // resume the generator so its finally block can close the browser

      return extractedData
    }
  }

  // Async generator: the first next() navigates and yields the page,
  // a later next() resumes into the finally block to close the browser.
  async function * initiateRequest (websiteUrl) {
    const browserInstance = await browserAutomation.launch()
    const activePage = await browserInstance.newPage()

    const requestState = {
      request: { websiteUrl },
    }

    try {
      await activePage.goto(websiteUrl)
      yield activePage
    } catch (err) {
      throw new RequestError(err, requestState)
    } finally {
      yield browserInstance.close()
    }
  }

  return {
    extractData,
  }
}

Any suggestions for improving this pattern or alternative approaches?

Your async generator approach is way too complex for this. I’ve scraped SPAs before and found a much simpler pattern that works better. Wrap your browser operations in a dedicated function that handles everything internally. Don’t yield the page and rely on external cleanup; instead, pass both the URL and the extraction logic as parameters, and let the function handle browser creation, page operations, and cleanup itself.

For timing, ditch the arbitrary delays. Use page.waitForNetworkIdle() (Puppeteer) or page.waitForLoadState('networkidle') (Playwright), depending on which library you’re on, so all network requests have finished before you extract. No more guessing when dynamic content loads.

Also, reuse a single browser instance across multiple requests instead of launching a new one every time. You’ll cut overhead and see much better performance when processing multiple URLs.
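Here’s a rough sketch of that shape, assuming Puppeteer (adjust the calls if you’re on Playwright); createScraper, extractData and extractFn are just placeholder names, not anything from your code:

const puppeteer = require('puppeteer')

// One browser instance is launched up front and reused for every URL.
async function createScraper () {
  const browser = await puppeteer.launch()

  // extractFn is whatever page-level logic you want to run, passed in as a parameter.
  async function extractData (url, extractFn) {
    const page = await browser.newPage()
    try {
      await page.goto(url)
      await page.waitForNetworkIdle() // wait for network activity to settle, no fixed delays
      return await extractFn(page)
    } finally {
      await page.close() // page cleanup happens here no matter what
    }
  }

  // Close the shared browser once you're done with all URLs.
  async function close () {
    await browser.close()
  }

  return { extractData, close }
}

Then something like extractData('https://example.com', page => page.evaluate(() => document.title)) handles setup and teardown for you, and you only call close() once at the very end.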

I’ve encountered similar issues when dealing with web scraping, and I can relate to the complexity of using async generators for managing the browser lifecycle. One common mistake is intertwining page operations with browser cleanup; this can complicate error handling significantly.

Instead, I recommend creating a dedicated browser manager that controls the browser lifecycle more explicitly. You can encapsulate your data extraction logic within a try-catch-finally structure, ensuring cleanup occurs in the finally block regardless of success or failure.
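In code, that structure might look roughly like this (a sketch only, assuming Puppeteer; BrowserManager and its start/extract/stop methods are placeholder names, not an existing API):

const puppeteer = require('puppeteer')

// Explicit lifecycle: the caller decides when the browser starts and stops.
class BrowserManager {
  async start () {
    this.browser = await puppeteer.launch()
  }

  async extract (url, script) {
    const page = await this.browser.newPage()
    try {
      await page.goto(url)
      return await page.evaluate(script)
    } catch (err) {
      // Attach the request context before rethrowing so failures are traceable.
      throw new Error(`Extraction failed for ${url}: ${err.message}`)
    } finally {
      await page.close() // runs regardless of success or failure
    }
  }

  async stop () {
    await this.browser.close()
  }
}

The page-level cleanup lives in the finally block, while the browser itself stays open until you explicitly call stop().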

For timing problems, it’s more effective to use methods like page.waitForFunction() or page.waitForSelector() instead of relying on arbitrary timeouts. That way you wait for the specific elements or variables that confirm the dynamic content has fully loaded. Additionally, consider adding error boundaries and retry logic to handle network failures more gracefully, since headless browsers can be unpredictable with dynamically loaded content.
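As a rough sketch (assuming Puppeteer; withRetries and the '#results' selector are hypothetical, and page and url are assumed to already exist, e.g. from the manager above):

// Re-attempt flaky operations a few times before giving up.
async function withRetries (operation, attempts = 3) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation()
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}

// Inside an async function, with page and url already set up:
const data = await withRetries(async () => {
  await page.goto(url)
  await page.waitForSelector('#results') // wait for the element the dynamic JS actually renders
  return page.evaluate(() => document.querySelector('#results').textContent)
})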

Your generator pattern is overkill here. Just use a simple wrapper class to manage browser instances. The real problem is your cleanup: the requestHandler.next() after evaluate isn’t awaited, and if evaluate throws it never gets called at all, so the finally block and its browserInstance.close() aren’t guaranteed to run (and even when they do, the close is yielded rather than awaited). For timing, use page.waitForFunction(() => window.myCustomFlag === true) instead of guessing when the JS finishes.
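If your page script sets a flag when it’s done, the wait can be explicit rather than a guess (sketch assuming Puppeteer; window.myCustomFlag, targetUrl and customScript stand in for whatever your page and extraction code actually use):

// Inside an async function, with browser and page already created:
await page.goto(targetUrl)
// Block until the page's own code signals that it has finished loading data.
await page.waitForFunction(() => window.myCustomFlag === true, { timeout: 30000 })
const extractedData = await page.evaluate(customScript)
await browser.close() // safe to shut down now that the custom JS has finished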