Hey everyone! I’m trying to figure out how to get just the plain text from a webpage using Puppeteer. Right now, I can grab the entire HTML code, but I want to strip out all the tags and just get the text content. Here’s what I’ve got so far:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const tab = await browser.newPage();
  await tab.goto('https://example.com');
  const pageContent = await tab.content();
  console.log(pageContent); // This gives me all the HTML
  await browser.close();
})();
This code works fine for getting the whole page, but how can I modify it to just extract the text? Any tips or tricks would be super helpful! Thanks in advance!
Hey there! You can try using the page.evaluate() method with document.body.innerText. It’ll grab all the visible text on the page. With the tab variable from your code, something like this:
const text = await tab.evaluate(() => document.body.innerText);
console.log(text);
Hope that helps! Let me know if you need anything else.
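One thing to watch for: depending on the page’s markup, the string you get back from innerText can contain runs of blank lines and stray spaces. Here’s a minimal sketch of a cleanup step, assuming you want one trimmed line per non-empty line of text. normalizeText is a hypothetical helper name, not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): tidy the string returned by
// document.body.innerText before logging or saving it.
function normalizeText(raw) {
  return raw
    .split(/\r?\n/)                                 // one entry per line
    .map(line => line.replace(/\s+/g, ' ').trim())  // collapse runs of whitespace
    .filter(line => line !== '')                    // drop blank lines
    .join('\n');
}
```

You’d call it right after the evaluate step, e.g. console.log(normalizeText(text)).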
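I’ve been using Puppeteer for a while now, and I’ve found that the innerText approach can be a bit unreliable, especially with dynamic content, since innerText depends on how the page is rendered and skips hidden elements. What’s worked well for me is combining a TreeWalker with textContent. Here’s a snippet that’s been pretty solid:
const text = await tab.evaluate(() => {
  // Visit text nodes only, so nested elements don't repeat their children's text
  const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
  const parts = [];
  while (walker.nextNode()) {
    const value = walker.currentNode.textContent.trim();
    if (value !== '') parts.push(value);
  }
  return parts.join(' ');
});
console.log(text);
This method visits every text node under the body, trims whitespace, filters out empty strings, and joins everything together with spaces. Reading textContent from every element instead (for example via querySelectorAll('body, body *')) would repeat the same text once per ancestor, which is why it walks text nodes. It’s a bit more verbose, but I’ve found it to be more consistent across different types of websites. Keep in mind that it also picks up hidden text and any script or style contents inside the body. Just remember to swap document.body for a document.querySelector() call if you need to target specific parts of the page.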
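If you only need one part of the page, here’s a rough sketch using page.$eval, which runs a function against the first element matching a selector. textOf is a made-up helper name, and the 'main' selector in the usage comment is only an assumption about the page’s markup. Note that $eval throws if nothing matches.

```javascript
// Hypothetical helper: return the trimmed text of the first element matching
// `selector`. `tab` is a Puppeteer Page, as in the question's code.
async function textOf(tab, selector) {
  // $eval runs the callback in the page against the first matching element
  // and throws if no element matches the selector.
  return tab.$eval(selector, el => el.textContent.trim());
}

// Usage inside the async function from the question (selector is an assumption):
// const text = await textOf(tab, 'main');
```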
To extract just the text content using Puppeteer, you can utilize the evaluate function. Here’s how you can modify your code:
const text = await tab.evaluate(() => {
  return document.body.innerText;
});
console.log(text);
This method runs the provided function in the context of the page and returns the result. The innerText property gives you the visible text content of the body element, effectively stripping out all HTML tags.
For more granular control, you might want to target specific elements or use textContent instead of innerText if you need to include hidden text. Remember to handle potential errors and close the browser properly.
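For the error handling and cleanup part, one possible shape is a try/finally, so the browser closes even if navigation or evaluation throws. This is only a sketch: extractText and the includeHidden option are hypothetical names, and the puppeteer module is passed in as an argument purely to keep the example self-contained.

```javascript
// Hypothetical helper: load `url` and return the body text.
// `puppeteer` is the module from require('puppeteer'), passed in by the caller.
async function extractText(puppeteer, url, { includeHidden = false } = {}) {
  const browser = await puppeteer.launch();
  try {
    const tab = await browser.newPage();
    await tab.goto(url);
    // The callback runs in the page; extra arguments to evaluate() are
    // serialized and passed through to it.
    return await tab.evaluate(
      hidden => (hidden ? document.body.textContent : document.body.innerText),
      includeHidden
    );
  } finally {
    // Runs on success and on error, so the browser process is not leaked.
    await browser.close();
  }
}

// Usage:
// extractText(require('puppeteer'), 'https://example.com')
//   .then(console.log)
//   .catch(err => console.error('Extraction failed:', err));
```

Passing { includeHidden: true } switches to textContent, which also returns text that isn’t rendered on the page.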