Web scraper with Puppeteer not discovering page links when content is served from S3 bucket

I’m building a web scraper that uses Puppeteer to crawl company documentation. The scraper works fine on regular websites but I’m having issues when the content is hosted on Amazon S3.

The problem: My scraper only finds links on a page when the URL ends with a trailing slash. Without it, the scraper loads the main page but doesn’t detect any other pages to crawl.

Working URL format:

https://my-bucket.s3.amazonaws.com/company/docs/main/

Not working formats:

https://my-bucket.s3.amazonaws.com/company/docs/main
https://my-bucket.s3.amazonaws.com/company/docs/main/index.html

When I use the URLs without trailing slash, the scraper visits the homepage successfully but then stops because it can’t find any internal links to follow. With the trailing slash, it discovers all the linked pages like /company/docs/main/getting-started/ etc.

I’m crawling multiple documentation sites that use different static site generators, all stored in the same S3 bucket. Is this a known S3 behavior or could it be related to how the static sites handle relative URLs? Any ideas on what might cause this inconsistency?

S3 serves static content differently from what most static site generators expect. I ran into this when migrating documentation to S3, and it comes down to trailing slashes. When you access /main instead of /main/, the browser resolves relative URLs against /company/docs/ rather than /company/docs/main/, so a relative link like ./getting-started/ points at the wrong path and your scraper never finds the pages it expects. Many static generators assume the trailing slash will be present for relative paths to resolve correctly.

To address this, you can ensure all entry URLs fed to Puppeteer include a trailing slash, have your scraper append the slash automatically when it detects an S3 URL, or adjust your scraper to collect both relative and absolute links. Any of these should resolve the link detection issue.
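You can see the resolution difference with Node's built-in URL class, and normalize entry URLs before handing them to Puppeteer. This is only a sketch; normalizeS3Url and its extension heuristic are illustrative, not something from your crawler:

// How the browser resolves ./getting-started/ against the two base URLs:
new URL('./getting-started/', 'https://my-bucket.s3.amazonaws.com/company/docs/main/').href;
// -> https://my-bucket.s3.amazonaws.com/company/docs/main/getting-started/

new URL('./getting-started/', 'https://my-bucket.s3.amazonaws.com/company/docs/main').href;
// -> https://my-bucket.s3.amazonaws.com/company/docs/getting-started/   (wrong directory)

// Append a trailing slash to bare "directory" URLs before crawling.
// Helper name and heuristic are illustrative only.
function normalizeS3Url(rawUrl: string): string {
  const url = new URL(rawUrl);
  const lastSegment = url.pathname.split('/').pop() ?? '';
  // Leave explicit files like index.html alone; only extensionless paths get a slash.
  if (lastSegment !== '' && !lastSegment.includes('.')) {
    url.pathname += '/';
  }
  return url.toString();
}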

had the same s3 hosting nightmare. without the trailing slash the browser resolves relative links against the parent directory, so every relative link on the page ends up pointing at the wrong place. just normalize your urls by adding slashes before feeding them to puppeteer. also try page.waitForNetworkIdle() since s3 can respond slower than regular servers.
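rough sketch of what worked for me - function name and the idleTime/timeout values are just placeholders, tweak for your crawler:

import puppeteer from 'puppeteer';

// add the trailing slash, then wait for the network to settle before collecting links
async function crawlEntry(rawUrl: string): Promise<string[]> {
  const entry = rawUrl.endsWith('/') || rawUrl.endsWith('.html') ? rawUrl : rawUrl + '/';

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(entry, { waitUntil: 'domcontentloaded' });
  await page.waitForNetworkIdle({ idleTime: 500, timeout: 15000 });

  // hrefs come back already resolved against the (now correct) base url
  const links = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );

  await browser.close();
  return links;
}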

Classic S3 static hosting issue. S3 handles URLs with and without trailing slashes differently. When you hit a path with a trailing slash, S3's website hosting treats it as a directory and serves the index.html from that folder, which is why your nav links render properly. Without the slash, S3 is probably serving a redirect or different content that doesn't have the full DOM structure Puppeteer needs. I've seen this before when buckets weren't serving index documents the way the site expected.

Check whether your static site generator creates different index.html files at different directory levels, and confirm the bucket's static website hosting has an index document configured (S3 applies one index document setting across the whole bucket, not per subdirectory). Also try adding a wait condition in Puppeteer so the DOM fully loads before scraping - S3-hosted content sometimes loads differently than it does from a regular web server.
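If you want to confirm which of these is happening, a quick diagnostic is to load both URL forms and compare what actually comes back. Hedged sketch only; the URLs are the ones from the question and everything else is illustrative:

import puppeteer from 'puppeteer';

// Load both URL forms and compare the final HTTP status and how many links each page exposes.
async function compareUrlForms(): Promise<void> {
  const urls = [
    'https://my-bucket.s3.amazonaws.com/company/docs/main/',
    'https://my-bucket.s3.amazonaws.com/company/docs/main',
  ];

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    const response = await page.goto(url, { waitUntil: 'networkidle0' });
    const linkCount = await page.$$eval('a[href]', (anchors) => anchors.length);
    console.log(url, '->', response?.status(), 'links found:', linkCount);
  }

  await browser.close();
}

If both forms return the same status and the same number of links, the problem is purely relative-URL resolution in your scraper; if the counts differ, the bucket really is serving different content for the two forms.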