Hey everyone!
I'm working on a web scraping project and running into some problems. I built a scraper with Playwright that grabs dynamic content, records network activity, and takes full-page screenshots in headless mode. It works pretty well, but modern sites have strong bot protection, so I had to build my own workarounds.
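Here's a stripped-down sketch of the setup, simplified from my real code (the URL is just a placeholder):

# Simplified version of my scraper core; real code adds retries/proxies.
import asyncio
from playwright.async_api import async_playwright

async def scrape(url: str) -> bytes:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)  # no visible browser
        page = await browser.new_page()

        # Record network activity as (method, url) pairs for later inspection
        requests_log = []
        page.on("request", lambda req: requests_log.append((req.method, req.url)))

        await page.goto(url, wait_until="networkidle")
        image = await page.screenshot(full_page=True)
        await browser.close()
        return image

asyncio.run(scrape("https://example.com"))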
Before settling on this I tried a bunch of other tools, like requests-html, Puppeteer, and crawl4ai. The crawl4ai library is actually really good, with features like bot-detection bypass and batch processing, but I hit a bug with its full-page screenshots.
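For reference, this is roughly how I was calling it. I'm going from memory here, so the exact parameter names might differ between crawl4ai versions:

# My recollection of the crawl4ai API -- may not match your version exactly.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            screenshot=True,  # this is where full-page capture broke for me
        )
        print(result.success, len(result.screenshot or ""))

asyncio.run(main())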
So I wrote a custom screenshot hook, but the dynamic content doesn't finish loading before the capture fires. Here's what I'm using:
import asyncio
import base64
import logging
from typing import Optional, Dict, Any

from playwright.async_api import Page, BrowserContext

logger = logging.getLogger(__name__)


class PageCapture:
    def __init__(self,
                 capture_enabled: bool = True,
                 complete_page: bool = True,
                 image_format: str = "png",
                 image_quality: int = 90):
        self.capture_enabled = capture_enabled
        self.complete_page = complete_page
        self.image_format = image_format
        self.image_quality = image_quality
        self.captured_data: Optional[Dict[str, Any]] = None

    async def take_page_capture(self,
                                browser_page: Page,
                                browser_context: BrowserContext,
                                target_url: str,
                                page_response,
                                **options):
        if not self.capture_enabled:
            return browser_page

        logger.info(f"Starting page capture for: {target_url}")
        try:
            # Wait until there are no in-flight network requests
            await browser_page.wait_for_load_state("networkidle")

            # Normalize zoom/scale so the full-page capture isn't clipped.
            # The script is wrapped in an arrow function so evaluate()
            # treats it as a function, not a single expression (a bare
            # multi-statement string throws a SyntaxError).
            await browser_page.evaluate("""
                () => {
                    document.body.style.zoom = '1';
                    document.body.style.transform = 'none';
                    const viewportTag = document.querySelector('meta[name="viewport"]');
                    if (viewportTag) {
                        viewportTag.setAttribute('content', 'width=device-width, initial-scale=1.0');
                    }
                }
            """)

            # Grace period for late JS-rendered content
            await asyncio.sleep(2.5)

            capture_settings = {
                "full_page": self.complete_page,
                "type": self.image_format,
            }
            # Playwright only accepts a quality setting for JPEG
            if self.image_format == "jpeg":
                capture_settings["quality"] = self.image_quality

            image_data = await browser_page.screenshot(**capture_settings)
            self.captured_data = {
                'raw_bytes': image_data,
                'encoded_data': base64.b64encode(image_data).decode('utf-8'),
                'source_url': target_url
            }
            logger.info(f"Capture completed! File size: {len(image_data)} bytes")
        except Exception as error:
            logger.error(f"Capture failed: {error}")
            self.captured_data = None

        return browser_page
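For context, this is roughly how I drive the hook in a plain Playwright run. run_one and the output filename are just for this example, not part of the class:

# Hypothetical driver around the hook above.
from playwright.async_api import async_playwright

async def run_one(url: str) -> None:
    capture = PageCapture(image_format="png")
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        response = await page.goto(url)
        await capture.take_page_capture(page, context, url, response)
        await browser.close()
    if capture.captured_data:
        with open("capture.png", "wb") as f:
            f.write(capture.captured_data['raw_bytes'])

asyncio.run(run_one("https://example.com"))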
I want to use crawl4ai's batch processing feature to scrape thousands of pages quickly. The screenshot part itself works, but the dynamic content needs to finish loading before the capture fires.
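What I'm considering (untested): replace the fixed sleep with a wait on a real DOM signal, and bound concurrency with a semaphore when batching. The "#content-loaded" selector, the 15-second timeout, and the limit of 10 below are all placeholders I made up:

import asyncio

# Instead of the fixed asyncio.sleep(2.5) in the hook, wait for a
# concrete element that only appears once the dynamic content renders:
#     await browser_page.wait_for_selector("#content-loaded", timeout=15000)

async def run_batch(urls: list[str], limit: int = 10) -> None:
    sem = asyncio.Semaphore(limit)  # cap concurrent browser pages

    async def worker(url: str) -> None:
        async with sem:
            await run_one(url)  # the driver from the earlier snippet

    await asyncio.gather(*(worker(u) for u in urls))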
Any ideas on how to fix this? Really need some guidance here. Thanks!