I’m working with Puppeteer to scrape web pages and I need help extracting just the visible text content. Right now I can pull the entire HTML source code, but I want to strip away all the HTML tags and get only the readable text that users would see on the page.
This code gives me the complete HTML structure, but I need a way to extract only the plain text without any HTML elements or formatting tags. What’s the best approach to achieve this with Puppeteer?
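For reference, a minimal sketch of the kind of setup being described, assuming the full HTML is currently pulled with page.content() (the URL and the networkidle0 wait are placeholders, not the actual code):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // page.content() returns the full serialized HTML, tags and all.
  const html = await page.content();
  console.log(html);

  await browser.close();
})();
```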
I use page.$eval() for this - works great. Try await page.$eval('*', el => el.innerText) to grab all visible text from the page. The big win with innerText is that it follows CSS styling, so hidden elements won’t show up (unlike textContent). I also target specific containers like await page.$eval('body', el => el.innerText). It handles whitespace better and shows exactly what users see. Just a heads up - some pages split their content across several top-level containers, so you might need to tweak the selector depending on how the page is structured. A minimal runnable sketch of this approach is below.
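(The URL and the networkidle0 wait here are placeholder assumptions, not part of the answer itself.)

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // innerText reflects rendering, so text hidden via CSS is excluded.
  const visibleText = await page.$eval('body', el => el.innerText);
  console.log(visibleText);

  await browser.close();
})();
```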
To extract the text content using Puppeteer, you can use the textContent property within the page.evaluate() method. Instead of using page.content(), try await page.evaluate(() => document.body.textContent). This will provide you with plain text, completely stripping away any HTML tags. If you need to be more specific, you can target elements like await page.evaluate(() => document.querySelector('main').textContent). Keep in mind that textContent retains whitespace and line breaks, and it also includes text from hidden elements and from script and style tags, so applying .trim() or a regex for further cleanup may be necessary. This approach is much simpler than manually parsing HTML and effectively handles nested elements.
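A minimal sketch of this approach with a whitespace cleanup step; the URL, the networkidle0 wait, and the particular regex are assumptions for illustration:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // textContent returns every text node, including text from hidden
  // elements and <script>/<style> tags, so collapse whitespace afterwards.
  const rawText = await page.evaluate(() => document.body.textContent);
  const cleaned = rawText.replace(/\s+/g, ' ').trim();
  console.log(cleaned);

  await browser.close();
})();
```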
You can also try page.evaluate(() => document.documentElement.innerText) to grab everything from the HTML root. This can work better than targeting the body on pages that render content outside the body tags. Plus, innerText automatically skips script and style elements, so you won’t get CSS or JS mixed into your text.
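A short sketch of this variant (again, the URL and networkidle0 wait are just placeholders):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Start from the <html> root; innerText only returns rendered text,
  // so <script> and <style> content is left out.
  const text = await page.evaluate(() => document.documentElement.innerText);
  console.log(text);

  await browser.close();
})();
```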