Extract only plain text using Puppeteer

CharlieLion22 · December 25, 2024, 6:32am

I am able to retrieve the entire HTML content of a webpage utilizing Puppeteer. However, I’m looking for a method to extract just the plain text content, excluding any HTML tags. How can I achieve this?

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const text = await page.evaluate(() => document.body.innerText); // Fetch only plain text
  console.log(text);
  await browser.close();
})();

Grace_31Dance · January 5, 2025, 1:42am

To extract just the plain text from a webpage in Puppeteer, document.body.innerText works well, and your basic structure is already sound.
Here's a version that waits for the page to settle and handles errors:

const puppeteer = require('puppeteer');

(async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Extract plain text
    const text = await page.evaluate(() => document.body.innerText);
    console.log(text);
  } catch (error) {
    console.error('Error during extraction:', error);
  } finally {
    // Close the browser even if navigation or extraction failed
    if (browser) await browser.close();
  }
})();

### Key Points:

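Page Load: { waitUntil: 'networkidle2' } makes page.goto wait until there have been no more than two network connections for at least 500 ms. This helps on sites with dynamic content, though it does not guarantee that every script has finished rendering.
Error Handling: A try...catch block surfaces failures such as navigation timeouts instead of leaving an unhandled promise rejection, which makes debugging easier.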

This keeps the script simple while making the text extraction more robust.
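
Since network idle is only a heuristic, one option for dynamic pages is to also wait for a specific element before reading the text. Below is a minimal sketch, assuming a hypothetical helper named gotoAndWaitFor and a selector you would choose for your own page (neither appears in the posts above). page.goto and page.waitForSelector are standard Puppeteer calls.

```javascript
// Hypothetical helper: navigate, then wait for an element that signals
// the content you care about has rendered. `page` is a Puppeteer Page.
async function gotoAndWaitFor(page, url, selector, timeoutMs = 10000) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  // Rejects with a timeout error if the selector never appears.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // Read the rendered text once the element is present.
  return page.evaluate(() => document.body.innerText);
}
```

The 10-second default timeout is an arbitrary illustrative value and would likely need tuning per site.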

DancingFox · January 3, 2025, 4:26pm

To extract plain text from a webpage using Puppeteer, reading document.body.innerText is a straightforward approach. Here is a version of the code that waits for the page to settle before extracting the text:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait for "networkidle2" so dynamically loaded content has time to arrive, especially in single-page applications
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Extracting plain text by getting innerText property
  const plainText = await page.evaluate(() => document.body.innerText);
  console.log(plainText);

  await browser.close();
})();

Additional Considerations:

Page Load Timing: The { waitUntil: 'networkidle2' } option waits for network activity to quiet down before page.goto resolves, which helps on complex sites that render content with JavaScript.
Error Handling: Though not shown here, wrapping the asynchronous operations in a try...catch block helps manage unexpected errors during text extraction.
Performance: Waiting for network idle adds time to every navigation, so when handling multiple pages, consider the cumulative cost and reuse a single browser instance rather than launching one per page.
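
The error-handling and multi-page points above could be combined roughly as follows. This is a sketch, not a definitive implementation: the helper name extractTextFromUrls and the shape of the result objects are assumptions made for illustration, and the browser is passed in so that it is launched only once.

```javascript
// Hypothetical helper: extract text from several URLs with one shared browser.
// `browser` is a Puppeteer Browser, e.g. from `await puppeteer.launch()`.
async function extractTextFromUrls(browser, urls) {
  const results = [];
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'networkidle2' });
      const text = await page.evaluate(() => document.body.innerText);
      results.push({ url, text });
    } catch (error) {
      // Record the failure and continue with the remaining URLs.
      results.push({ url, error: error.message });
    } finally {
      // Always release the tab, whether extraction succeeded or not.
      await page.close();
    }
  }
  return results;
}
```

The caller would launch the browser once with puppeteer.launch(), pass it in, and call browser.close() when all URLs have been processed.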

This approach returns the page's visible text as rendered in the browser, with all HTML tags stripped.
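
Which property to read depends on what "plain text" should include. In browsers, innerText reflects the rendered text and skips content hidden by CSS, while textContent returns the text of every node, including the contents of script and style elements. Here is a small sketch, assuming a hypothetical helper getPageText with an illustrative visibleOnly option (neither is from this thread):

```javascript
// Hypothetical helper: choose between rendered text and raw node text.
// `page` is a Puppeteer Page that has already navigated to the target URL.
async function getPageText(page, { visibleOnly = true } = {}) {
  // Extra arguments to page.evaluate are serialized and passed to the
  // callback, which runs inside the browser context.
  return page.evaluate(
    (onlyVisible) =>
      onlyVisible ? document.body.innerText : document.body.textContent,
    visibleOnly
  );
}
```

Calling getPageText(page) would give the rendered text, and getPageText(page, { visibleOnly: false }) would give the raw text of all nodes.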

CreatingStone · January 3, 2025, 3:30am

To extract plain text using Puppeteer, stick with document.body.innerText. Your code looks good, but make sure the page has finished loading before you extract content. Here's a version that waits for network activity to settle:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const text = await page.evaluate(() => document.body.innerText);
  console.log(text);
  await browser.close();
})();

Waiting for the page to load first makes it more likely that the extracted text is complete.