I have a website that shows data in tables and charts. Lately I’m getting tons of automated requests that seem to come from Google Sheets users scraping my content.
The requests are coming from Google’s servers (I can tell from the IP addresses and user agent strings). I’m seeing anywhere from 2,500 to 10,000 hits daily from these automated scrapers.
Someone is probably using functions like IMPORTHTML in Google Sheets to pull data from my site. I don’t want this happening because I can’t control how my data gets used or if it’s properly credited.
What's Google's recommended way to stop this kind of scraping?
Right now I'm blocking any request whose user agent contains GoogleDocs or docs.google.com and returning a 403. I don't want to block by IP address, since that would mean blocking Google's servers, which seems risky.
Most of the traffic comes from this user agent: Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
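For reference, the rule I'm applying is roughly equivalent to this minimal sketch (shown as Python WSGI middleware purely for illustration; the response body text is just a placeholder):

```python
# Illustrative sketch only: reject requests whose User-Agent looks like a
# Google Sheets import. The substrings match the user agent shown above.

BLOCKED_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")

def block_sheets_imports(app):
    """Wrap a WSGI app and return 403 for Google Sheets import fetches."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(s in user_agent for s in BLOCKED_UA_SUBSTRINGS):
            body = b"Automated access via Google Sheets is not permitted."
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware
```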
Also wondering: can I figure out which specific Google Sheet is doing this? The requests don't carry any referrer info or cookies, so it's hard to trace them back to the source document or user.
Yeah, blocking the GoogleDocs user agent is spot on. I had the same scraping headaches last year and it worked great, and it didn't interfere with legitimate Google crawling either (Googlebot identifies itself with a different user agent).

You might want to add rate limiting based on request patterns too. These Google Sheets scrapers are super predictable: they hammer your site with rapid sequential requests that are dead giveaways. I also threw in a simple HTTP auth challenge for those user agents; it breaks the automation but still lets people get at the data manually if they really need to (rough sketch below).

Finding the actual sheet, though? Forget it. Google proxies everything through its servers, so you're basically blind on that front. Your best bet is watching request timing patterns and matching traffic spikes to figure out when new sheets start hitting you.
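Here's roughly what I mean by the rate limit plus auth challenge, again sketched as Python WSGI middleware just to show the idea. The window size, request cap, and realm name are made-up values you'd tune to your own traffic, and a real setup would actually verify the Basic auth credentials rather than just checking for the header:

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- tune to your own traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30
SHEETS_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")

# Request timestamps per client IP (in-memory, single-worker only).
_hits = defaultdict(deque)

def throttle_and_challenge(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        ip = environ.get("REMOTE_ADDR", "unknown")

        # 1) Challenge Google Sheets fetches with HTTP Basic auth. The Sheets
        #    fetcher can't answer it, so IMPORTHTML fails, but a person with
        #    credentials could still get through.
        if any(s in ua for s in SHEETS_UA_SUBSTRINGS):
            if "HTTP_AUTHORIZATION" not in environ:
                start_response("401 Unauthorized", [
                    ("WWW-Authenticate", 'Basic realm="manual access only"'),
                    ("Content-Type", "text/plain"),
                ])
                return [b"Authentication required for automated clients."]
            # NOTE: a real implementation must verify the credentials here.

        # 2) Simple sliding-window rate limit per client IP.
        now = time.time()
        window = _hits[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        window.append(now)
        if len(window) > MAX_REQUESTS_PER_WINDOW:
            start_response("429 Too Many Requests", [
                ("Retry-After", str(WINDOW_SECONDS)),
                ("Content-Type", "text/plain"),
            ])
            return [b"Rate limit exceeded."]

        return app(environ, start_response)
    return middleware
```

In practice you'd probably do the rate limiting at the web server or CDN layer rather than in application code, but the logic is the same.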
Had this same problem six months ago; my analytics dashboard got destroyed by IMPORTHTML requests. User agent blocking works, but I'd serve a CAPTCHA challenge for GoogleDocs requests instead of blocking them outright: legit researchers can still get your data manually while you kill the automated scraping.

Honeypot endpoints saved my ass. I set up fake data URLs that only scrapers would hit, and when GoogleDocs user agents access those, I block that IP range for a few hours. Way more precise than permanent blocks (rough sketch below).

You're right that identifying the source sheets is basically impossible through normal methods. But I found that different sheets have distinct request patterns: some hit every 5 minutes, others hourly. Tracking those patterns let me estimate how many sheets were scraping me, even though I couldn't ID the actual documents.
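Here's roughly what the honeypot setup looked like, sketched as Python WSGI middleware. The honeypot paths and block duration are made up, and this version blocks only the single requesting IP rather than a whole range, to keep the example simple:

```python
import time

# Hypothetical honeypot paths: link them nowhere a human would click, but
# where a naive table scrape would pick them up.
HONEYPOT_PATHS = {"/data/export-full.html", "/tables/all.csv"}
SHEETS_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")
BLOCK_SECONDS = 4 * 60 * 60  # block for a few hours, then forgive

_blocked_until = {}  # client IP -> unix time when the block expires

def honeypot_block(app):
    def middleware(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        ua = environ.get("HTTP_USER_AGENT", "")
        now = time.time()

        # Still inside a temporary block window? Refuse.
        if _blocked_until.get(ip, 0) > now:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Temporarily blocked."]

        # A Sheets fetch hitting a honeypot URL earns a temporary block.
        if (environ.get("PATH_INFO", "") in HONEYPOT_PATHS
                and any(s in ua for s in SHEETS_UA_SUBSTRINGS)):
            _blocked_until[ip] = now + BLOCK_SECONDS
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Temporarily blocked."]

        return app(environ, start_response)
    return middleware
```

Note the in-memory blocklist only works within a single worker process; anything multi-process would need shared storage for the block table.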
Yeah, blocking the GoogleDocs user agent is a good move. A robots.txt rule could also help, but it's not foolproof. As for tracing the specific sheet, it's tricky since Google hides that info. Keep an eye on your server logs for unusual patterns, though!