Extract only plain text using Puppeteer

I can retrieve the full HTML of a webpage with Puppeteer, but I want only the plain text content, without any HTML tags. How can I do this?

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const text = await page.evaluate(() => document.body.innerText); // Fetch only plain text
  console.log(text);
  await browser.close();
})();

To extract just the plain text from a webpage in Puppeteer, document.body.innerText works well, and your basic structure is already sound.
Here's a version that also waits for network activity to settle and handles errors:

const puppeteer = require('puppeteer');

(async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Extract the rendered text of the page body
    const text = await page.evaluate(() => document.body.innerText);
    console.log(text);
  } catch (error) {
    console.error('Error during extraction:', error);
  } finally {
    // Close the browser even if an earlier step threw
    if (browser) await browser.close();
  }
})();

### Key Points:

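  • Page Load: { waitUntil: 'networkidle2' } makes goto wait until there have been no more than two open network connections for at least 500 ms. This helps on sites that load content dynamically, although it does not guarantee that every late request has finished.
  • Error Handling: A try...catch block reports failures such as navigation timeouts instead of leaving an unhandled promise rejection, which makes debugging easier.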

This keeps the script simple while making it more tolerant of slow or failing page loads.
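
If you know which element holds the content you care about, waiting for that element can be more predictable than waiting for network idle. Below is a minimal sketch, not a definitive implementation: the helper name extractTextWhenReady and its selector and timeoutMs parameters are made up for illustration, while page.goto, page.waitForSelector, and page.$eval are standard Puppeteer Page methods.

```javascript
// Sketch: wait for a specific element, then read its rendered text.
// `page` is a Puppeteer Page; the function name and the `selector` and
// `timeoutMs` parameters are illustrative assumptions.
async function extractTextWhenReady(page, url, selector = 'body', timeoutMs = 10000) {
  // Return from goto as soon as the initial HTML has been parsed.
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // Resolves once the element exists in the DOM; rejects if the timeout passes first.
  await page.waitForSelector(selector, { timeout: timeoutMs });

  // $eval runs the callback in the page against the first matching element.
  return page.$eval(selector, (el) => el.innerText);
}
```

With a real page object you would call it as `const text = await extractTextWhenReady(page, 'https://example.com', 'h1');`.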

To extract plain text from a webpage using Puppeteer, reading document.body.innerText is a straightforward approach. Here is a version of the code that waits for the page to settle before reading the text:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // "networkidle2" waits for network activity to settle, which helps with single-page applications
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Extract plain text by reading the innerText property
  const plainText = await page.evaluate(() => document.body.innerText);
  console.log(plainText);

  await browser.close();
})();

Additional Considerations:

  • Page Load Timing: The { waitUntil: 'networkidle2' } option waits until no more than two network connections have been open for at least 500 ms, which suits sites that render content with JavaScript.
  • Error Handling: Though not shown here, wrapping the asynchronous calls in a try...catch block, and closing the browser in a finally clause, keeps a failed navigation from leaving a browser process running.
  • Performance: Waiting for network idle adds time to every navigation, so when processing many pages, consider reusing a single browser instance rather than launching one per page.
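
The error handling and performance points can be combined when several pages need processing. The following is a rough sketch under stated assumptions: extractAll is a hypothetical helper name, and it expects an already launched Puppeteer Browser so that one browser process serves every URL. It processes URLs one at a time, which is the simplest option rather than the fastest.

```javascript
// Sketch: extract text from several URLs with one shared browser instance.
// `browser` is a Puppeteer Browser; `extractAll` is a hypothetical helper name.
async function extractAll(browser, urls) {
  const results = [];
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'networkidle2' });
      const text = await page.$eval('body', (el) => el.innerText);
      results.push({ url, text });
    } catch (error) {
      // Record the failure and continue with the remaining URLs.
      results.push({ url, error: error.message });
    } finally {
      // Free the tab whether or not the navigation succeeded.
      await page.close();
    }
  }
  return results;
}
```

A caller would launch the browser once, pass it to `extractAll(browser, urls)`, and call `browser.close()` after the results come back.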

document.body.innerText returns the text as the browser renders it, without any HTML tags. Text hidden by CSS and the contents of script and style elements are left out.
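
To see how innerText differs from the related textContent property on a given page, both can be read in one call. This is an illustrative sketch only: readBodyText is a made-up helper name, and it takes a Puppeteer Page that has already navigated somewhere.

```javascript
// Sketch: read both text properties of <body> in a single round trip.
// `readBodyText` is a hypothetical helper; `page` is a Puppeteer Page.
async function readBodyText(page) {
  return page.$eval('body', (el) => ({
    // Rendered text: skips elements hidden by CSS.
    innerText: el.innerText,
    // Raw text of every descendant node, including <script> and <style> contents.
    textContent: el.textContent,
  }));
}
```

Comparing the two strings shows whether hidden markup or inline scripts would leak into the output if textContent were used instead.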

To extract plain text using Puppeteer, stick with document.body.innerText. Your code looks good, but make sure the page has finished loading before you extract content. Here's a version that waits for network activity to settle:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const text = await page.evaluate(() => document.body.innerText);
  console.log(text);
  await browser.close();
})();

Waiting for the network to settle makes it less likely that you read the text before dynamic content has rendered.