I’m having trouble with my headless Chrome setup on AWS Lambda. When I try to scrape a specific shopping website, the page never fully loads even with long wait times.
My browser setup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def setup_browser(self):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920x1080')
    chrome_options.add_argument('--disable-web-security')
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
    binary_path = '/opt/headless-chromium'
    driver_path = '/opt/chromedriver'
    chrome_options.binary_location = binary_path
    chrome_service = Service(driver_path)
    browser = webdriver.Chrome(service=chrome_service, options=chrome_options)
    return browser
Loading the page:
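from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser.get('https://example-store.com/product/item-details')
# Wait up to 3 minutes for the document itself to finish loading
WebDriverWait(browser, 180).until(
    lambda browser: browser.execute_script('return document.readyState') == 'complete')
try:
    # Then wait for the product title to appear in the DOM
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, '//div[@class="product-title"]'))
    )
except TimeoutException:
    print("page content not loaded")
    browser.save_screenshot('/tmp/debug_screenshot.png')
    print(browser.execute_script("return document.body.innerHTML;"))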
The page appears to load but the main content never shows up. I’ve tried increasing wait times but it doesn’t help. What could be causing this issue with headless browsing on Lambda?
Looking at your configuration, I suspect the issue is related to network timeouts or user agent detection. Shopping websites often have sophisticated bot detection that can keep content from loading even when the page structure appears complete. Your user agent argument is malformed: it needs the --user-agent= prefix rather than the bare string. Try chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36') instead. Additionally, consider adding --disable-features=VizDisplayCompositor and --disable-extensions to reduce resource usage. I had similar problems with Lambda timeouts on heavy JavaScript sites and found that a retry mechanism with exponential backoff helped. Sometimes the first attempt fails because of cold start performance, but subsequent attempts succeed. You might also want to add --no-first-run and --disable-default-apps to speed up browser initialization within Lambda’s execution limits.
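For the retry part, here is a minimal sketch of what I mean by exponential backoff. The helper name retry_with_backoff and its parameters are my own, not part of Selenium, and the delays are only illustrative; you would tune them against your Lambda timeout.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the delay after each failure.

    `action` is any zero-argument callable (hypothetical example: a function
    that loads the page and waits for the product title).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # in real use, narrow this to TimeoutException
            last_error = error
            if attempt < attempts - 1:
                # Delay grows as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed, so surface the last error to the caller
    raise last_error
```

You would wrap your page load in a function and pass it in, for example retry_with_backoff(lambda: load_product_page(browser)), where load_product_page stands in for your own get-and-wait code. Keep the total of all delays plus wait times below the function timeout, otherwise Lambda kills the invocation before the last attempt finishes.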
Try adding --disable-background-timer-throttling and --disable-renderer-backgrounding to your options. Background work can get throttled on Lambda, which messes up JS execution on e-commerce sites. Also, that XPath selector might be too specific: @class="product-title" only matches when the class attribute is exactly that string, and shopping sites often use dynamic class names that change.
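A rough sketch of a more tolerant selector, assuming product-title is still one of the element's classes (if the name itself is generated, you'd need a different hook such as an id or data attribute). The helper name class_token_xpath is made up for this example.

```python
def class_token_xpath(tag, token):
    """Build an XPath matching `tag` elements whose class list contains `token`.

    Padding the class attribute with spaces makes this match a whole class
    name, so "product-title" does not accidentally match "product-title-old".
    """
    return (
        f'//{tag}[contains(concat(" ", normalize-space(@class), " "), " {token} ")]'
    )


# The selector you would hand to By.XPATH in place of the exact-match one
PRODUCT_TITLE_XPATH = class_token_xpath("div", "product-title")
```

A CSS selector like div.product-title with By.CSS_SELECTOR does the same whole-class match and is shorter, if you don't need XPath for anything else.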
I encountered similar issues when scraping e-commerce sites with Lambda. The problem is usually JavaScript-heavy content that needs rendering time beyond document.readyState completion. Most shopping sites load product details asynchronously after the initial page structure loads, so instead of waiting for document.readyState, try waiting for the outstanding network activity to finish. I had success with browser.execute_script('return jQuery.active == 0') if the site uses jQuery, or with a custom wait condition that checks for the absence of loading spinners or placeholder content. Another issue I faced was Lambda’s memory constraints limiting Chrome’s ability to execute complex JavaScript. Consider increasing your Lambda memory allocation to at least 1GB, since this also increases the CPU allocation, which helps rendering performance. Also, some sites detect headless browsers despite the automation flags you’ve set, so try adding --disable-blink-features=AutomationControlled to your options.
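The custom wait condition could look roughly like this. It is a sketch under assumptions: the .loading-spinner selector is a guess you'd replace with whatever the site actually renders while loading, and the names content_ready and wait_for_content are mine. If the page doesn't use jQuery, the jQuery check is simply skipped.

```python
# JavaScript run in the page: true once jQuery (if present) has no pending
# AJAX requests and no element matches the spinner selector passed as arguments[0].
READY_SCRIPT = """
var jqIdle = (typeof window.jQuery === 'undefined') || window.jQuery.active === 0;
var spinnerGone = document.querySelector(arguments[0]) === null;
return jqIdle && spinnerGone;
"""


def content_ready(spinner_selector):
    """Return a wait condition usable with WebDriverWait(...).until()."""
    def _check(driver):
        return bool(driver.execute_script(READY_SCRIPT, spinner_selector))
    return _check


def wait_for_content(browser, spinner_selector=".loading-spinner", timeout=30):
    """Block until the page looks idle, or raise TimeoutException."""
    # Imported here so content_ready() can be tried without Selenium installed
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(browser, timeout).until(content_ready(spinner_selector))
```

Call wait_for_content(browser) right after browser.get(...), then look up the product title. This only tells you the page has gone quiet; if the site is blocking headless Chrome, the content still won't be there and the saved screenshot is the better clue.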