Managing headless browser processes in PHP for scheduled web scraping tasks

I’m working on building an automated web scraper that needs to run daily through a scheduled task. The scraper requires a headless browser to capture the fully rendered HTML content after JavaScript execution.

Currently I’m using a headless browser solution that works for the first page request but then becomes unresponsive. I can successfully retrieve content using cURL to communicate with the browser, but I’m struggling with process management.

Here’s my current approach:

```php
$browserCmd = "\"" . APP_PATH . "/tools/browser/runner.exe\" \"" . APP_PATH . "/scraper/config/settings.ini\" 2>&1 &";
$processId = shell_exec($browserCmd);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2500&format=raw');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

exec("kill -9 " . $processId); // This doesn't terminate the process
echo $content;
```

The main issue is that I cannot properly terminate the browser process after scraping. Should headless browsers be kept running continuously, or is there a proper way to start and stop them for each scraping session? What’s the best practice for managing browser processes in automated scraping scenarios?

The real problem is that shell_exec doesn't give you control over the spawned process tree. Your browser spawns child processes that survive even after you kill the parent. I switched to proc_open - it gives you proper pipes and process control. This works way better:

```php
$descriptors = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
    2 => array("pipe", "w")
);
$process = proc_open($browserCmd, $descriptors, $pipes);

// do your scraping

proc_terminate($process);
proc_close($process);
```

Or just configure your browser with a session timeout or idle limit. Most headless browsers have flags like --timeout or --max-idle-time that automatically shut them down after inactivity. Stops zombie processes from piling up during scheduled runs.
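For context, here's a rough sketch of how the whole daily run could look with proc_open, assuming the runner.exe/settings.ini command and the localhost:8080 endpoint from the question. The readiness polling and the 10-second limit are my own assumptions, not something the runner documents:

```php
// Sketch only: same command as the question, but without the trailing "2>&1 &" -
// proc_open already runs the process asynchronously, no shell backgrounding needed.
$browserCmd = "\"" . APP_PATH . "/tools/browser/runner.exe\" \"" . APP_PATH . "/scraper/config/settings.ini\"";

$descriptors = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
    2 => array("pipe", "w")
);
$process = proc_open($browserCmd, $descriptors, $pipes);
if (!is_resource($process)) {
    die("Could not start headless browser");
}

// Poll the browser's HTTP port until it accepts connections (give up after ~10 s).
$ready = false;
for ($i = 0; $i < 20 && !$ready; $i++) {
    $probe = @fsockopen('localhost', 8080, $errno, $errstr, 0.5);
    if ($probe !== false) {
        fclose($probe);
        $ready = true;
    } else {
        usleep(500000); // wait 0.5 s between attempts
    }
}

$content = '';
if ($ready) {
    $ch = curl_init('http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2500&format=raw');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);
}

// Shut the browser down once scraping is done.
proc_terminate($process);
proc_close($process);

echo $content;
```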

Your kill command isn't working because shell_exec with & returns whatever the command printed (nothing, once it's backgrounded), not a PID, so you're handing an empty string to kill. On a Unix-like shell you can append `echo $!` after the `&` to capture the PID yourself, or switch to something like Puppeteer, where your script launches and closes the browser itself so it exits cleanly after scraping.
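For what it's worth, the `$!` trick looks roughly like this. It's a sketch for a POSIX shell only (it won't help with the runner.exe/Windows setup in the question), and it assumes $browserCmd is the bare command without the trailing "2>&1 &":

```php
// Background the command and print its PID, so shell_exec returns something
// that kill can actually use.
$pid = (int) trim(shell_exec($browserCmd . ' > /dev/null 2>&1 & echo $!'));

// ... scrape via cURL as before ...

if ($pid > 0) {
    exec("kill -9 " . $pid); // real PID now, though grandchild processes can still survive
}
```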

I’ve hit this same issue before. Keep your browsers running continuously instead of starting/stopping them for each request - the launch overhead kills performance, especially for daily tasks.

Set up a browser pool. Spawn a few instances at startup and reuse them. Add health checks with periodic pings to make sure they’re still alive. When one dies, kill it and spin up a replacement.
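In case it helps, here's the rough shape of that pool idea in PHP. The --port option, the port list, and the plain TCP ping are all assumptions on my side; adapt them to whatever interface your browser binary actually exposes:

```php
// Hypothetical sketch: each instance is assumed to serve the same HTTP interface
// as in the question, just on its own port.
class BrowserPool
{
    /** @var array<int, resource> port => proc_open handle */
    private array $instances = [];

    public function __construct(private string $browserCmd, private array $ports) {}

    public function start(): void
    {
        foreach ($this->ports as $port) {
            $this->spawn($port);
        }
    }

    /** Periodic health check: ping each instance, replace any that stopped answering. */
    public function healthCheck(): void
    {
        foreach ($this->ports as $port) {
            $probe = @fsockopen('localhost', $port, $errno, $errstr, 1.0);
            if ($probe === false) {
                $this->kill($port);
                $this->spawn($port);
            } else {
                fclose($probe);
            }
        }
    }

    public function shutdown(): void
    {
        foreach (array_keys($this->instances) as $port) {
            $this->kill($port);
        }
    }

    private function spawn(int $port): void
    {
        // Assumes the browser accepts a --port option; adjust to your binary.
        $cmd = $this->browserCmd . ' --port=' . $port;
        $this->instances[$port] = proc_open($cmd, array(array("pipe", "r"), array("pipe", "w"), array("pipe", "w")), $pipes);
    }

    private function kill(int $port): void
    {
        if (isset($this->instances[$port]) && is_resource($this->instances[$port])) {
            proc_terminate($this->instances[$port]);
            proc_close($this->instances[$port]);
        }
        unset($this->instances[$port]);
    }
}
```

The scheduled task would call healthCheck() once at the start of each run and then round-robin its requests across the ports.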

What really worked for me was ditching the custom HTTP wrapper and driving Chrome directly with --remote-debugging-port. You get way better process control, and the Chrome DevTools Protocol is much more reliable than a generic HTTP endpoint sitting in front of the browser.
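A bare-bones version of that might look like the sketch below. The Chrome install path is an assumption (adjust for your OS), /json/version is part of Chrome's DevTools HTTP interface, and from there a CDP client library such as chrome-php/chrome can take over via the reported WebSocket URL:

```php
// Launch headless Chrome yourself and confirm the DevTools endpoint is up.
$chromeCmd = '"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe" '
    . '--headless --disable-gpu --remote-debugging-port=9222 about:blank';

$process = proc_open($chromeCmd, array(array("pipe", "r"), array("pipe", "w"), array("pipe", "w")), $pipes);

// Give Chrome a moment, then ask the DevTools HTTP endpoint for its version info.
sleep(2);
$ch = curl_init('http://localhost:9222/json/version');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($ch);
curl_close($ch);
$info = $raw ? json_decode($raw, true) : null;

if (!empty($info['webSocketDebuggerUrl'])) {
    // A CDP client can now drive the browser over webSocketDebuggerUrl.
    // When scraping is done, shut Chrome down from the same script:
    proc_terminate($process);
    proc_close($process);
}
```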
