Managing headless browser processes in PHP for scheduled web scraping tasks

I’m working on building an automated web scraper that needs to run daily through a scheduled task. The scraper requires a headless browser to capture the fully rendered HTML content after JavaScript execution.

Currently I’m using a headless browser solution that works for the first page request but then becomes unresponsive. I can successfully retrieve content using cURL to communicate with the browser, but I’m struggling with process management.

Here’s my current approach:

```php
$browserCmd = "\"" . APP_PATH . "/tools/browser/runner.exe\" \"" . APP_PATH . "/scraper/config/settings.ini\" 2>&1 &";
$processId = shell_exec($browserCmd);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2500&format=raw');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

exec("kill -9 " . $processId); // This doesn't terminate the process
echo $content;
```

The main issue is that I cannot properly terminate the browser process after scraping. Should headless browsers be kept running continuously, or is there a proper way to start and stop them for each scraping session? What’s the best practice for managing browser processes in automated scraping scenarios?

The real problem is that shell_exec doesn't give you control over the spawned process tree. Your browser spawns child processes that survive even after you kill the parent. I switched to proc_open - it gives you proper pipes and process control. This works way better:

```php
$descriptors = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
    2 => array("pipe", "w")
);
$process = proc_open($browserCmd, $descriptors, $pipes);

// do your scraping

proc_terminate($process);
proc_close($process);
```

Or just configure your browser with a session timeout or idle limit. Most headless browsers have flags like --timeout or --max-idle-time that automatically shut them down after inactivity. Stops zombie processes from piling up during scheduled runs.
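For context, here's a rough sketch of how the whole daily run could look with proc_open, assuming the runner.exe/settings.ini command and the localhost:8080 endpoint from the question. The readiness polling and the 10-second limit are my own assumptions, not something the runner documents:

```php
// Sketch only: same command as the question, but without the trailing "2>&1 &" -
// proc_open already runs the process asynchronously, no shell backgrounding needed.
$browserCmd = "\"" . APP_PATH . "/tools/browser/runner.exe\" \"" . APP_PATH . "/scraper/config/settings.ini\"";

$descriptors = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
    2 => array("pipe", "w")
);
$process = proc_open($browserCmd, $descriptors, $pipes);
if (!is_resource($process)) {
    die("Could not start headless browser");
}

// Poll the browser's HTTP port until it accepts connections (give up after ~10 s).
$ready = false;
for ($i = 0; $i < 20 && !$ready; $i++) {
    $probe = @fsockopen('localhost', 8080, $errno, $errstr, 0.5);
    if ($probe !== false) {
        fclose($probe);
        $ready = true;
    } else {
        usleep(500000); // wait 0.5 s between attempts
    }
}

$content = '';
if ($ready) {
    $ch = curl_init('http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2500&format=raw');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);
}

// Shut the browser down once scraping is done.
proc_terminate($process);
proc_close($process);

echo $content;
```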

Your kill command isn't working because shell_exec with & returns whatever the command printed (nothing, once it's backgrounded), not a PID, so you're handing an empty string to kill. On a Unix-like shell you can append `echo $!` after the `&` to capture the PID yourself, or switch to something like Puppeteer, where your script launches and closes the browser itself so it exits cleanly after scraping.
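For what it's worth, the `$!` trick looks roughly like this. It's a sketch for a POSIX shell only (it won't help with the runner.exe/Windows setup in the question), and it assumes $browserCmd is the bare command without the trailing "2>&1 &":

```php
// Background the command and print its PID, so shell_exec returns something
// that kill can actually use.
$pid = (int) trim(shell_exec($browserCmd . ' > /dev/null 2>&1 & echo $!'));

// ... scrape via cURL as before ...

if ($pid > 0) {
    exec("kill -9 " . $pid); // real PID now, though grandchild processes can still survive
}
```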

I’ve hit this same issue before. Keep your browsers running continuously instead of starting/stopping them for each request - the launch overhead kills performance, especially for daily tasks.

Set up a browser pool. Spawn a few instances at startup and reuse them. Add health checks with periodic pings to make sure they’re still alive. When one dies, kill it and spin up a replacement.
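In case it helps, here's the rough shape of that pool idea in PHP. The --port option, the port list, and the plain TCP ping are all assumptions on my side; adapt them to whatever interface your browser binary actually exposes:

```php
// Hypothetical sketch: each instance is assumed to serve the same HTTP interface
// as in the question, just on its own port.
class BrowserPool
{
    /** @var array<int, resource> port => proc_open handle */
    private array $instances = [];

    public function __construct(private string $browserCmd, private array $ports) {}

    public function start(): void
    {
        foreach ($this->ports as $port) {
            $this->spawn($port);
        }
    }

    /** Periodic health check: ping each instance, replace any that stopped answering. */
    public function healthCheck(): void
    {
        foreach ($this->ports as $port) {
            $probe = @fsockopen('localhost', $port, $errno, $errstr, 1.0);
            if ($probe === false) {
                $this->kill($port);
                $this->spawn($port);
            } else {
                fclose($probe);
            }
        }
    }

    public function shutdown(): void
    {
        foreach (array_keys($this->instances) as $port) {
            $this->kill($port);
        }
    }

    private function spawn(int $port): void
    {
        // Assumes the browser accepts a --port option; adjust to your binary.
        $cmd = $this->browserCmd . ' --port=' . $port;
        $this->instances[$port] = proc_open($cmd, array(array("pipe", "r"), array("pipe", "w"), array("pipe", "w")), $pipes);
    }

    private function kill(int $port): void
    {
        if (isset($this->instances[$port]) && is_resource($this->instances[$port])) {
            proc_terminate($this->instances[$port]);
            proc_close($this->instances[$port]);
        }
        unset($this->instances[$port]);
    }
}
```

The scheduled task would call healthCheck() once at the start of each run and then round-robin its requests across the ports.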

What really worked for me was ditching the custom HTTP wrapper and driving Chrome directly with --remote-debugging-port. You get way better process control, and the Chrome DevTools Protocol is much more reliable than a generic HTTP endpoint sitting in front of the browser.
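A bare-bones version of that might look like the sketch below. The Chrome install path is an assumption (adjust for your OS), /json/version is part of Chrome's DevTools HTTP interface, and from there a CDP client library such as chrome-php/chrome can take over via the reported WebSocket URL:

```php
// Launch headless Chrome yourself and confirm the DevTools endpoint is up.
$chromeCmd = '"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe" '
    . '--headless --disable-gpu --remote-debugging-port=9222 about:blank';

$process = proc_open($chromeCmd, array(array("pipe", "r"), array("pipe", "w"), array("pipe", "w")), $pipes);

// Give Chrome a moment, then ask the DevTools HTTP endpoint for its version info.
sleep(2);
$ch = curl_init('http://localhost:9222/json/version');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($ch);
curl_close($ch);
$info = $raw ? json_decode($raw, true) : null;

if (!empty($info['webSocketDebuggerUrl'])) {
    // A CDP client can now drive the browser over webSocketDebuggerUrl.
    // When scraping is done, shut Chrome down from the same script:
    proc_terminate($process);
    proc_close($process);
}
```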
