Extracting Twitter user stats without official API using Python and headless browser automation

miat · July 7, 2025, 7:15am

Hi everyone! I’m trying to collect follower and following counts from Twitter profiles using Python instead of the official Twitter API. I want to use web scraping techniques with a headless browser setup.

I’ve been looking into libraries like Selenium for browser automation and BeautifulSoup for HTML parsing. My goal is to extract user statistics from multiple Twitter accounts programmatically.

Here’s a basic example I’ve been working with:

import requests
from bs4 import BeautifulSoup

# Target profile URL
profile_url = 'https://twitter.com/username'

# Send GET request (returns only the initial HTML; no JavaScript is executed)
response = requests.get(profile_url, timeout=10)

# Parse HTML content
soup = BeautifulSoup(response.content, 'html.parser')

def stat_text(test_id):
    """Return the text of the first span inside the matching div, or None if absent."""
    container = soup.find('div', {'data-testid': test_id})
    span = container.find('span') if container else None
    return span.text if span else None

# Extract follower and following counts
follower_count = stat_text('followers')
following_count = stat_text('following')

print(f'Followers: {follower_count}')
print(f'Following: {following_count}')

The main challenge is setting this up in Google Colab with a headless browser configuration. Can anyone help me configure a headless browser environment that works reliably in Colab for this type of data extraction?

Gizmo_Funny · July 16, 2025, 11:19am

I’ve hit the same wall with Twitter scraping in Colab. BeautifulSoup won’t cut it here - Twitter loads everything with JavaScript, so you need a headless browser.

Skip regular Selenium and go straight to undetected-chromedriver (!pip install undetected-chromedriver). Set up Chrome with --no-sandbox, --disable-dev-shm-usage, and --disable-gpu. Don’t forget delays between requests and rotate your user agents or you’ll get flagged fast.

Fair warning: Twitter changes their HTML constantly. My selectors broke every few weeks when they tweaked the DOM. And they’re aggressive with rate limiting - even with proper headers and delays, scrape too hard and you’re blocked.
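
A minimal sketch of that setup, assuming undetected-chromedriver is installed and a Chrome binary is available in the Colab runtime. The helper names (CHROME_FLAGS, jittered_delay, make_driver, visit_profiles) and the user-agent strings are illustrative placeholders, not part of any library, and the --headless=new flag is my own addition since Colab has no display.

```python
import random
import time

# Flags from the post above, plus headless mode (an assumption for Colab).
CHROME_FLAGS = ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"]

# Placeholder user agents (assumption): replace with current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]


def jittered_delay(base=5.0, jitter=3.0):
    """Return a randomized pause in seconds between base and base + jitter."""
    return base + random.uniform(0.0, jitter)


def make_driver():
    """Create a Chrome driver with the flags above and one randomly chosen user agent."""
    # Imported lazily so the helpers above can be used without the package installed.
    import undetected_chromedriver as uc  # requires: pip install undetected-chromedriver

    options = uc.ChromeOptions()
    for flag in CHROME_FLAGS:
        options.add_argument(flag)
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return uc.Chrome(options=options)


def visit_profiles(urls, extract):
    """Load each URL, call extract(driver) on it, and pause a random interval between pages."""
    driver = make_driver()
    results = {}
    try:
        for url in urls:
            driver.get(url)
            results[url] = extract(driver)
            time.sleep(jittered_delay())
    finally:
        driver.quit()
    return results
```

Here the user agent is picked once per driver, so rotating it means creating a fresh driver. The extract callback is where your (frequently changing) selectors would go.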

sofiag · July 15, 2025, 12:31am

Twitter loads everything with JavaScript, so requests.get() only gets you the initial HTML without the stats. You need Selenium with headless Chrome in Colab. Install it by running !apt-get update, !apt-get install -y chromium-chromedriver, and !pip install selenium, each on its own line (the ! prefix only works at the start of a line, so chaining them with && !apt ... fails). Set ChromeOptions to --headless, --no-sandbox, and --disable-setuid-sandbox. Don’t use static delays to wait for the page - Twitter’s loading times are all over the place. Use WebDriverWait instead. Target the span elements in the profile stats section, but Twitter changes their UI constantly so you’ll be updating selectors a lot. If you’re scraping multiple accounts, rotate proxies. Twitter’s gotten way better at catching bots lately.
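
Roughly what that looks like, as a sketch rather than a tested recipe. It assumes Selenium 4 and a Chrome/chromedriver pair that Selenium can locate on its own. STAT_SELECTOR is a hypothetical selector that will likely need updating, and parse_count and fetch_stats are names I made up for illustration.

```python
import re
from decimal import Decimal

# Hypothetical selector (assumption): a span inside the profile link whose href ends in
# "/followers" or "/following". Expect to adjust this whenever the DOM changes.
STAT_SELECTOR = 'a[href$="/{stat}"] span'

_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert display text such as '1,234', '12.3K', or '1.2M' to an int."""
    match = re.fullmatch(r"\s*([\d,]*\.?\d+)\s*([KMB]?)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    number = Decimal(match.group(1).replace(",", ""))
    return int(number * _MULTIPLIERS[match.group(2).upper()])


def fetch_stats(username, timeout=15):
    """Open a profile in headless Chrome and wait until both stats are visible."""
    # Imported lazily so parse_count can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for flag in ("--headless", "--no-sandbox", "--disable-setuid-sandbox"):
        options.add_argument(flag)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(f"https://twitter.com/{username}")
        wait = WebDriverWait(driver, timeout)  # polls until the condition holds or times out
        stats = {}
        for stat in ("followers", "following"):
            locator = (By.CSS_SELECTOR, STAT_SELECTOR.format(stat=stat))
            element = wait.until(EC.visibility_of_element_located(locator))
            stats[stat] = parse_count(element.text)
        return stats
    finally:
        driver.quit()
```

The explicit wait raises a TimeoutException if the element never shows up, which is usually the first sign that the selector has gone stale or the page served a login wall instead.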

emcarter · July 14, 2025, 2:00am

Twitter scraping’s brutal now, but I’d try Puppeteer over Selenium. In my experience it’s faster and harder to detect. For Colab you don’t need npm: run !pip install pyppeteer, which is an unofficial Python port of Puppeteer and downloads its own Chromium on first launch. Just heads up - Twitter’s anti-bot game is insane right now, so you’ll hit tons of captchas and blocks.
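
A rough sketch of the pyppeteer route, assuming the package was installed with pip and can download or find a Chromium build. LAUNCH_ARGS, fetch_text, and fetch_text_sync are illustrative names, and whatever selector you pass in is your own guess at the current DOM.

```python
import asyncio

# Chromium flags commonly needed in containerized environments such as Colab (assumption).
LAUNCH_ARGS = ["--no-sandbox", "--disable-dev-shm-usage"]


async def fetch_text(url, selector, timeout_ms=15000):
    """Load url in headless Chromium and return the text of the first node matching selector."""
    # Imported lazily so this module loads even when pyppeteer is not installed.
    from pyppeteer import launch  # requires: pip install pyppeteer

    browser = await launch(headless=True, args=LAUNCH_ARGS)
    try:
        page = await browser.newPage()
        await page.goto(url)
        # Wait for the JavaScript-rendered node instead of sleeping a fixed time.
        await page.waitForSelector(selector, {"timeout": timeout_ms})
        return await page.querySelectorEval(selector, "el => el.textContent")
    finally:
        await browser.close()


def fetch_text_sync(url, selector):
    """Convenience wrapper for plain scripts, where no event loop is running yet."""
    return asyncio.run(fetch_text(url, selector))
```

In a notebook cell an event loop is already running, so call it as await fetch_text(url, selector) rather than through the sync wrapper.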
