Need help with web scraping tool development issues

Hey everyone!

I’m working on a web scraping project and running into some problems. I built a scraper with Playwright that can grab dynamic content, record network activity, and capture full-page screenshots in headless mode. It works pretty well, but modern sites have strong bot protection, so I had to build my own workarounds.

Before this I tried lots of different tools, including requests-html, Puppeteer, and crawl4ai. The crawl4ai library is actually really good, with features like bot detection bypass and batch processing, but I hit a bug with taking full-page screenshots.

So I wrote a custom hook for screenshots, but the dynamic content doesn’t finish loading before the capture fires. Here’s what I’m using:

import asyncio
import base64
from typing import Optional, Dict, Any
from playwright.async_api import Page, BrowserContext
import logging

logger = logging.getLogger(__name__)

class PageCapture:
    def __init__(self, 
                 capture_enabled: bool = True,
                 complete_page: bool = True,
                 image_format: str = "png",
                 image_quality: int = 90):
        
        self.capture_enabled = capture_enabled
        self.complete_page = complete_page
        self.image_format = image_format
        self.image_quality = image_quality
        self.captured_data: Optional[Dict[str, Any]] = None
        
    async def take_page_capture(self, 
                               browser_page: Page, 
                               browser_context: BrowserContext, 
                               target_url: str, 
                               page_response, 
                               **options):
        # No-op when capturing is disabled; always return the page so the
        # crawl pipeline can continue.
        if not self.capture_enabled:
            return browser_page
            
        logger.info(f"Starting page capture for: {target_url}")
        
        try:
            # Wait until there have been no network requests for 500 ms.
            await browser_page.wait_for_load_state("networkidle")
            
            # Normalize zoom/transform and the viewport meta tag so the
            # full-page screenshot isn't scaled. Note: a multi-statement
            # JS string must be wrapped in a function, otherwise
            # evaluate() rejects it as an invalid expression.
            await browser_page.evaluate("""
                () => {
                    document.body.style.zoom = '1';
                    document.body.style.transform = 'none';
                    
                    const viewportTag = document.querySelector('meta[name="viewport"]');
                    if (viewportTag) {
                        viewportTag.setAttribute('content', 'width=device-width, initial-scale=1.0');
                    }
                }
            """)
            
            # Fixed grace period for content that renders after networkidle.
            # This is the part that isn't reliable.
            await asyncio.sleep(2.5)
            
            capture_settings = {
                "full_page": self.complete_page,
                "type": self.image_format
            }
            
            # Playwright only accepts a quality setting for JPEG output.
            if self.image_format == "jpeg":
                capture_settings["quality"] = self.image_quality
            
            image_data = await browser_page.screenshot(**capture_settings)
            
            self.captured_data = {
                'raw_bytes': image_data,
                'encoded_data': base64.b64encode(image_data).decode('utf-8'),
                'source_url': target_url
            }
            
            logger.info(f"Capture completed! File size: {len(image_data)} bytes")
            
        except Exception as error:
            logger.error(f"Capture failed: {error}")
            self.captured_data = None
        
        return browser_page
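
For reference, here’s roughly how I wire the hook into crawl4ai. The set_hook() call, the "after_goto" hook name, and the strategy import are from my reading of the crawl4ai docs, so treat the exact names as my best guess rather than gospel:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def main():
    capture = PageCapture(complete_page=True, image_format="png")

    # after_goto should fire once navigation completes, with the same
    # (page, context, url, response) signature my hook method expects.
    strategy = AsyncPlaywrightCrawlerStrategy()
    strategy.set_hook("after_goto", capture.take_page_capture)

    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        await crawler.arun(url="https://example.com")
        if capture.captured_data:
            print(f"Captured {len(capture.captured_data['raw_bytes'])} bytes")

asyncio.run(main())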

I want to use crawl4ai’s batch processing feature to scrape thousands of pages quickly. The screenshot capture itself works; the real problem is making sure the dynamic content has fully loaded before the capture fires.
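
For the batch side, my plan looks something like this. I’m assuming arun_many() is the batch entry point and that each result carries success/html/error_message fields, based on the docs I’ve read; please correct me if I’ve got that wrong:

import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_batch(urls: list[str]):
    async with AsyncWebCrawler() as crawler:
        # arun_many() crawls the whole URL list and returns one result per URL.
        results = await crawler.arun_many(urls=urls)
        for result in results:
            if result.success:
                print(f"{result.url}: {len(result.html)} chars of HTML")
            else:
                print(f"{result.url} failed: {result.error_message}")

asyncio.run(scrape_batch([
    "https://example.com/page-1",
    "https://example.com/page-2",
]))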

Any ideas on how to fix this? Really need some guidance here. Thanks!

You’re using crawl4ai wrong. I had the same problem - crawl4ai has built-in screenshot hooks that work way better than custom ones. Use their ScreenshotHook instead of rolling your own, since it handles timing automatically. Also double-check your browser config - certain Chromium args can screw up dynamic loading.
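
Something like this is what I mean - going from memory here, on the version I used the built-in path was just a config flag, so verify the names against your install:

import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        screenshot=True,      # let crawl4ai handle the capture timing
        wait_for="css:body",  # optional wait condition before capture
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.screenshot:  # base64-encoded image data
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())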

Your wait conditions probably aren’t handling complex dynamic content well enough. I’ve had way better luck combining multiple wait strategies instead of relying on networkidle alone. After your current wait, add checks for content-specific signals, like when DOM mutations stop or the page’s JavaScript finishes executing. You can also track specific API calls by intercepting responses with Playwright’s route handlers. For crawl4ai batch processing, throw in a content validation step before capture - just check that the elements you expect are actually in the DOM before screenshotting, and if validation fails, run more wait cycles. I’ve hit similar issues where content loads through multiple async calls that completely ignore networkidle timing. Also check if there’s lazy loading happening - that’ll delay content rendering until way after the initial page load.
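
Here’s the rough shape of the mutation-quiet wait plus validation I mean - the helper names and thresholds are mine, not from any library:

from playwright.async_api import Page

async def wait_for_dom_quiet(page: Page, quiet_ms: int = 1000, timeout_ms: int = 15000):
    # Stamp the time of the last DOM mutation on the window object.
    await page.evaluate("""
        () => {
            window.__lastMutation = Date.now();
            new MutationObserver(() => {
                window.__lastMutation = Date.now();
            }).observe(document.body, { childList: true, subtree: true, attributes: true });
        }
    """)
    # Resolve once nothing has mutated for quiet_ms milliseconds.
    await page.wait_for_function(
        "quietMs => Date.now() - window.__lastMutation > quietMs",
        arg=quiet_ms,
        timeout=timeout_ms,
    )

async def content_present(page: Page, required_selectors: list[str]) -> bool:
    # Validation step: confirm the elements you expect are actually in
    # the DOM before you screenshot; if not, run another wait cycle.
    for selector in required_selectors:
        if await page.query_selector(selector) is None:
            return False
    return True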

Your timing’s off. 2.5 seconds after networkidle isn’t enough for most modern dynamic content - I’ve hit the same wall scraping SPAs. Ditch the fixed delays. Wait for specific elements to show up, or track when the actual network requests finish. Look for key DOM elements that signal the content’s loaded, or watch for the XHR responses that fill in your data. Try Playwright’s wait_for_function to catch when the JavaScript wraps up. I’ve had way better luck waiting for specific CSS selectors or checking window properties than trusting networkidle alone - that dynamic stuff often loads after the network goes quiet. For batch processing with crawl4ai, you’ll probably need custom timing per site, since different platforms load at different speeds. Quick sanity check: bump up your wait time and see if that fixes it first.
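
For example, instead of the fixed sleep, something like this - the selector and the item count are placeholders for whatever your target site actually renders:

from playwright.async_api import Page

async def wait_for_real_content(page: Page):
    # 1. Wait for a key element that only renders once data arrives.
    await page.wait_for_selector(".product-grid .card", timeout=20_000)

    # 2. Wait for an app-level "done" signal, e.g. a minimum number of
    #    rendered items, or a window flag the site's JS sets.
    await page.wait_for_function(
        "() => document.querySelectorAll('.product-grid .card').length >= 10",
        timeout=20_000,
    )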