Extracting Data from Dynamically Loaded Pages Using Node.js and a Headless Browser

I’m trying to extract data from a page that loads its content dynamically, and I’m using Puppeteer to drive a headless browser for this.

In the code below, Puppeteer is passed in as headlessBrowserClient.

The main difficulty is making sure the browser closes properly once the required data has been collected. If I close the browser before evaluateCustomFunction finishes executing, its result is lost.

evaluateCustomFunction runs in the context of the page, much like code typed into the console in Chrome’s Developer Tools.
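As a rough illustration, a function passed as evaluateCustomFunction could look like the sketch below. The extractItemTexts name and the .item selector are placeholders, not the real page's markup. Because the function runs inside the page, document refers to the page's DOM, and the return value has to be serializable to make it back to Node.

```javascript
// Hypothetical example of an evaluateCustomFunction.
// It is meant to run inside the page, where `document` is the page's DOM.
// The '.item' selector is a placeholder for whatever the real page uses.
function extractItemTexts() {
  return Array.from(
    document.querySelectorAll('.item'),
    (element) => element.textContent.trim()
  );
}
```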

To manage the network request and the asynchronous flow of the Puppeteer API, I wrap all the relevant logic in an async generator.

I suspect that my code may be poorly designed, but I’m struggling to find a more effective alternative.

Any suggestions?

module.exports = function createClient(headlessBrowserClient) {
  const fetchPageData = async (url, evaluateCustomFunction) => {
    const request = initiateRequest(url);
    const { value: page } = await request.next();
    
    if (page) {
      const content = await page.evaluate(evaluateCustomFunction);
      request.next();
      
      return content;
    }
  };  
  
  async function* initiateRequest(url) {
    const browserInstance = await headlessBrowserClient.launch();
    const pageInstance = await browserInstance.newPage();
    
    const requestDetails = { req: { url } };
    
    try {
      await pageInstance.goto(url);
      yield pageInstance;
    } catch (error) {
      throw new APIError(error, requestDetails);
    } finally {
      yield browserInstance.close();
    }
  }
  
  return {
    fetchPageData,
  };
}

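Hey DancingFox, the trouble is in how fetchPageData drives the initiateRequest generator. The second request.next() is never awaited, so fetchPageData can return before browserInstance.close() has finished, and a failure in close() becomes an unhandled rejection. If page.evaluate throws, request.next() is never called at all and the browser stays open. The yield inside finally also means that when goto fails, the first next() resolves with the yielded close() result instead of rejecting, so fetchPageData quietly returns undefined and the APIError never surfaces. The browser should be closed in an awaited finally block right where the evaluation happens.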
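To see how a yield inside finally behaves when the try block fails, here is a small sketch with no Puppeteer involved. The failingRequest and inspect names are made up for the demo: the thrown error stands in for a rejected goto(), and the 'cleanup' string stands in for the result of browserInstance.close().

```javascript
// Mirrors the shape of initiateRequest: throw in try, rethrow in catch, yield in finally.
async function* failingRequest() {
  try {
    throw new Error('navigation failed'); // stands in for a rejected goto()
  } catch (error) {
    throw new Error(`wrapped: ${error.message}`); // stands in for APIError
  } finally {
    yield 'cleanup'; // stands in for `yield browserInstance.close()`
  }
}

async function inspect() {
  const request = failingRequest();

  // The first next() does not reject: the generator pauses at the yield in finally.
  const first = await request.next();

  // The pending error is only rethrown when the generator is resumed again.
  let secondError = null;
  try {
    await request.next();
  } catch (error) {
    secondError = error;
  }

  return { first, secondError };
}

inspect().then(({ first, secondError }) => {
  console.log(first);
  console.log(secondError.message);
});
```

A caller that only calls next() once, as fetchPageData does on the failure path, never sees the wrapped error.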

Here's a refactored version:

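// APIError is assumed to be defined or imported elsewhere in this module.
module.exports = function createClient(headlessBrowserClient) {
  const fetchPageData = async (url, evaluateCustomFunction) => {
    const browserInstance = await headlessBrowserClient.launch();
    try {
      const pageInstance = await browserInstance.newPage();
      await pageInstance.goto(url);
      // Awaiting here means the evaluation has finished before finally runs.
      const content = await pageInstance.evaluate(evaluateCustomFunction);
      return content;
    } catch (error) {
      // Wrap navigation and evaluation failures with the request details.
      throw new APIError(error, { req: { url } });
    } finally {
      // Runs on success and on failure, so the browser is always closed.
      await browserInstance.close();
    }
  };

  return {
    fetchPageData,
  };
};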

This drops the generator entirely. Because close() is awaited in finally, the browser shuts down only after evaluate has resolved or thrown, and failures are still wrapped in APIError. Cheers!
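If you want to check the ordering without launching a real browser, here is a minimal sketch that swaps Puppeteer for a stub. The APIError class, createStubClient, and demo are stand-ins written for this example, and the stub runs the evaluated function locally instead of inside a page.

```javascript
// Hypothetical stand-in for the APIError class used in the thread.
class APIError extends Error {
  constructor(cause, details) {
    super(cause.message);
    this.name = 'APIError';
    this.details = details;
  }
}

// Same try/catch/finally shape as the refactored fetchPageData.
function createClient(headlessBrowserClient) {
  const fetchPageData = async (url, evaluateCustomFunction) => {
    const browserInstance = await headlessBrowserClient.launch();
    try {
      const pageInstance = await browserInstance.newPage();
      await pageInstance.goto(url);
      return await pageInstance.evaluate(evaluateCustomFunction);
    } catch (error) {
      throw new APIError(error, { req: { url } });
    } finally {
      await browserInstance.close();
    }
  };
  return { fetchPageData };
}

// Stub exposing only the four methods used above; it records the call order.
function createStubClient(calls) {
  return {
    launch: async () => ({
      newPage: async () => ({
        goto: async (url) => { calls.push(`goto ${url}`); },
        evaluate: async (fn) => { calls.push('evaluate'); return fn(); },
      }),
      close: async () => { calls.push('close'); },
    }),
  };
}

async function demo() {
  const calls = [];
  const client = createClient(createStubClient(calls));

  // Success path: the value comes back and close runs after evaluate.
  const title = await client.fetchPageData('https://example.com', () => 'stub title');

  // Failure path: the error is wrapped and close still runs.
  let caught = null;
  try {
    await client.fetchPageData('https://example.com', () => { throw new Error('boom'); });
  } catch (error) {
    caught = error;
  }

  return { calls, title, caught };
}

demo().then(({ calls }) => console.log(calls.join(' -> ')));
```

In both runs the recorded order ends with close after evaluate, which is the guarantee the generator version could not give.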