I’m trying to get metadata from websites using Puppeteer and Node.js. My code works fine for getting the title tag and text from paragraphs, but I’m stuck on how to extract the content from meta tags. Specifically, I want to get the text from the description meta tag. Here’s what I’ve got so far:
This approach selects every meta tag with name="description" and returns the content attribute of the first match, if one exists. It’s concise and returns nothing useful to trip over when the tag is missing, so you can check the result before using it.
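The snippet this answer refers to isn’t shown in the thread, so here is a minimal sketch of the idea as described, not the original code. It assumes `page` is an already-open Puppeteer page; the helper names `firstContent` and `getDescription` are made up for illustration.

```javascript
// Runs in the browser context: receives every element matched by the selector
// and returns the content attribute of the first one, or null if there is none.
function firstContent(elements) {
  const first = elements[0];
  return first ? first.getAttribute('content') : null;
}

// Hypothetical helper: `page` is assumed to be an open Puppeteer Page.
// page.$$eval runs firstContent in the page with all matching elements.
async function getDescription(page) {
  return page.$$eval('meta[name="description"]', firstContent);
}
```

After `await page.goto(url)`, calling `await getDescription(page)` should give you the description string, or null when the page has no such tag.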
For more robust metadata extraction, you could also look into the ‘metascraper’ library. It’s designed specifically for this task and can handle a wide variety of metadata formats across different websites.
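If you want to try that route, a rough sketch of combining it with Puppeteer could look like the following. It assumes the `metascraper` and `metascraper-description` packages are installed and that `page` is an open Puppeteer page; `scrapeDescription` is a hypothetical helper name, and the scraper instance is passed in rather than created inside the function.

```javascript
// Assumed setup (requires the packages to be installed):
//   const scraper = require('metascraper')([require('metascraper-description')()]);

// Hypothetical helper: hands the rendered HTML and URL of the page to a
// metascraper instance and returns the description it finds, or null.
async function scrapeDescription(page, scraper) {
  const html = await page.content(); // full HTML of the page as rendered
  const url = page.url();            // metascraper expects the page URL too
  const metadata = await scraper({ html, url });
  return metadata.description ?? null;
}
```

Because the HTML comes from `page.content()`, this also covers pages that inject their meta tags with JavaScript after load.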
I’ve encountered similar issues with scraping metadata using Puppeteer. In my experience, sometimes document.getElementsByTagName offers a more reliable way to access meta tags than document.querySelector. Here’s an alternative approach that worked for me:
// Runs inside the page: scan every <meta> tag for name="description".
const description = await page.evaluate(() => {
  const metaTags = document.getElementsByTagName('meta');
  for (let i = 0; i < metaTags.length; i++) {
    if (metaTags[i].getAttribute('name') === 'description') {
      return metaTags[i].getAttribute('content');
    }
  }
  // No description meta tag on this page.
  return null;
});
This method loops through all meta tags and returns the content of the first one whose name attribute is "description", or null if none matches. It has worked for me across a variety of site structures.
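If you need other meta tags as well (keywords, author, and so on), the same loop can take the name as a parameter. This is just a sketch under the same assumption that `page` is an open Puppeteer page; `findMetaContent` and `getMeta` are illustrative names, not part of Puppeteer.

```javascript
// Runs in the browser context: returns the content of the first <meta>
// whose name attribute equals `name`, or null if there is no such tag.
function findMetaContent(name) {
  const metaTags = document.getElementsByTagName('meta');
  for (const tag of metaTags) {
    if (tag.getAttribute('name') === name) {
      return tag.getAttribute('content');
    }
  }
  return null;
}

// Hypothetical helper: page.evaluate passes `name` through to the page function.
async function getMeta(page, name) {
  return page.evaluate(findMetaContent, name);
}
```

For example, `await getMeta(page, 'description')` does the same thing as the loop above, and `await getMeta(page, 'keywords')` reuses it for a different tag.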