Extracting tweet content with Puppeteer

I’m working on a project to scrape tweet info using Puppeteer. I’ve managed to get to the right page, but I’m stuck on extracting the tweet text.

When I run this in my browser console, it works fine:

$x(`//div[@data-testid="tweetText"]`)

I can even grab specific text like this:

$x(`//div[@data-testid="tweetText"]`)[0].childNodes[1].childNodes[0].wholeText

But when I try to use page.$$ in my Puppeteer code, I get nothing:

const [tweetData] = await page.$$(`xpath/.//div[@data-testid='tweetText']`);

Any ideas on how to make this work? I’d love to get the actual tweet text if possible. Thanks for any help!

I’ve faced similar challenges extracting tweet content with Puppeteer. One trick that worked for me was using a combination of page.waitForSelector() and page.evaluate(). Here’s what I did:

await page.waitForSelector('div[data-testid="tweetText"]');
const tweetText = await page.evaluate(() => {
    const tweetElement = document.querySelector('div[data-testid="tweetText"]');
    return tweetElement ? tweetElement.innerText : null;
});

This approach ensures the tweet element is loaded before trying to extract its content. The callback passed to evaluate runs inside the page itself, so you work with the plain DOM API and return only the string you need, rather than juggling element handles on the Node side. If you’re dealing with multiple tweets, you might want to use querySelectorAll and map the results. Hope this helps!
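For the multiple-tweet case, here is a minimal sketch of the querySelectorAll-and-map idea. The function name getAllTweetTexts and the TWEET_SELECTOR constant are made up for illustration, and page is assumed to be a Puppeteer Page that has already navigated to the tweet:

```javascript
// Selector taken from the thread; Twitter's markup may change over time.
const TWEET_SELECTOR = 'div[data-testid="tweetText"]';

// Hypothetical helper: `page` is assumed to be an already-navigated Puppeteer Page.
async function getAllTweetTexts(page) {
  // Wait until at least one tweet text node exists in the DOM.
  await page.waitForSelector(TWEET_SELECTOR);

  // The callback runs in the browser, so the selector is passed in as an
  // argument instead of being captured from the Node-side scope.
  return page.evaluate((selector) => {
    const nodes = document.querySelectorAll(selector);
    return Array.from(nodes, (node) => node.innerText.trim());
  }, TWEET_SELECTOR);
}
```

Calling `const texts = await getAllTweetTexts(page);` should give you an array with one string per tweet currently rendered. Tweets that only load after scrolling won’t be included.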

I’ve encountered similar issues with Puppeteer before. The problem might be related to timing. Twitter’s content loads dynamically, so the element might not be present when Puppeteer tries to find it. Try adding a wait before your selector:

await page.waitForXPath(`//div[@data-testid='tweetText']`);
const [tweetData] = await page.$x(`//div[@data-testid='tweetText']`);
const tweetText = await page.evaluate(el => el.textContent, tweetData);

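This approach waits for the element to appear before attempting to select it. Also, note that I’ve used $x instead of $$; that is the dedicated XPath method in older Puppeteer releases. Both $x and waitForXPath were removed in Puppeteer v22, so on newer versions keep the xpath/ prefix and use page.waitForSelector() and page.$$() instead. If you’re still having trouble, check whether any iframes or shadow DOM are involved.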
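If you’re on a newer Puppeteer where $x is gone, the same wait-then-read idea can be sketched with the xpath/ prefix from the question. The helper name getFirstTweetTextByXPath and the 10 second timeout are my own choices, not anything Puppeteer requires:

```javascript
// XPath from the question, using Puppeteer's "xpath/" selector prefix.
const TWEET_XPATH = "xpath/.//div[@data-testid='tweetText']";

// Hypothetical helper: `page` is assumed to be an already-navigated Puppeteer Page.
async function getFirstTweetTextByXPath(page) {
  // waitForSelector resolves with a handle to the first match once it appears,
  // or throws if nothing matches before the timeout (an arbitrary 10 s here).
  const handle = await page.waitForSelector(TWEET_XPATH, { timeout: 10000 });
  if (!handle) {
    return null;
  }
  // Read the text inside the page and return it as a plain string.
  return handle.evaluate((el) => el.textContent);
}
```

This would also explain the original problem: page.$$ doesn’t wait, so it returns an empty array if it runs before the tweet has rendered.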

hey, have u tried using page.evaluate() instead? sometimes it works better for dynamic content. like this:

const tweetText = await page.evaluate(() => {
    // Runs in the page context; returns null if no tweet text element exists yet.
    const tweetElement = document.querySelector('div[data-testid="tweetText"]');
    return tweetElement ? tweetElement.textContent : null;
});

might solve ur problem. good luck!