Managing headless browser processes in PHP for scheduled web scraping tasks

I’m working on setting up an automated web scraper that needs to run through cron jobs every day. The main challenge is that I need to capture content that gets loaded by JavaScript, so I have to use a headless browser to get the fully rendered DOM.

I’ve been experimenting with a headless browser solution but I’m running into issues with process management. The browser works fine for the first request but then it just hangs and becomes unresponsive. I can’t seem to properly terminate the process through PHP code.

Here’s what I’m currently trying:

$command = "\"" . APP_PATH . "/tools/browser/runner.exe\" \"" . APP_PATH . "/scraper/config/app.ini\" 2>&1 &";
$process = shell_exec($command);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://localhost:8080/?target=' . $target_url . '&wait=2000&format=raw');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);

exec("kill -9 " . $process); // not working properly
echo $content;

Should I be keeping the headless browser running all the time instead of starting and stopping it? What’s the best approach for managing these processes in a cron environment? Any suggestions for better JavaScript-enabled scraping methods in PHP would be really helpful too.

you should use proc_open for better control, shell_exec ain’t giving you the PID. and about the crashes, could be memory leaks, y’know? better keep the browser as a service instead of starting/stopping it all the time. good luck!

The issue you’re encountering is that shell_exec() returns the command’s output as a string, not the process ID (PID) — and because your command backgrounds itself with &, it returns almost immediately with nothing useful in it. That leaves your kill -9 with no valid target. A better method is proc_open, which gives you a process handle and, via proc_get_status(), the real PID, along with finer control over the process’s pipes and lifetime.
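A minimal sketch of that approach, reusing the paths from your question (the descriptor layout and log path are my own assumptions):

```php
<?php
// Sketch: launch the headless-browser runner with proc_open so we keep a
// handle on the process and can query its real PID for cleanup later.
// APP_PATH and runner.exe come from the question; the rest is assumed.
$command = '"' . APP_PATH . '/tools/browser/runner.exe" "'
         . APP_PATH . '/scraper/config/app.ini"';

$descriptors = [
    0 => ['pipe', 'r'],                      // stdin
    1 => ['pipe', 'w'],                      // stdout
    2 => ['file', '/tmp/browser.err', 'a'],  // stderr to a log file
];

$process = proc_open($command, $descriptors, $pipes);

if (is_resource($process)) {
    $status = proc_get_status($process);
    $pid    = $status['pid'];   // a real PID, unlike shell_exec()'s output

    // ... issue the curl request to the local browser endpoint here ...

    // Clean shutdown: close pipes first, then terminate the process.
    fclose($pipes[0]);
    fclose($pipes[1]);
    proc_terminate($process);   // sends SIGTERM by default
    proc_close($process);
}
```

Note that if PHP runs the command through a shell, the PID you get back can belong to the shell rather than the browser itself, so test this on your platform before relying on it.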

In my experience, keeping the headless browser running continuously as a service can mitigate issues of process hangs and memory leaks. This setup minimizes the startup overhead for each request and ensures that you’re using a consistent instance.
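One way to get that “always running” behavior without a full service manager is to have each cron run probe the endpoint and only launch the browser when nothing answers. A rough sketch (the port and endpoint mirror the question; the probe and nohup launch are assumptions):

```php
<?php
// Sketch: reuse a long-lived browser instance across cron runs by
// checking whether it already answers before spawning a new one.
function browserIsUp(string $url = 'http://localhost:8080/'): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2); // fail fast if it's down
    curl_exec($ch);
    $ok = curl_errno($ch) === 0;
    curl_close($ch);
    return $ok;
}

if (!browserIsUp()) {
    // Launch detached; nohup keeps it alive after the cron job exits.
    exec('nohup "' . APP_PATH . '/tools/browser/runner.exe" "'
        . APP_PATH . '/scraper/config/app.ini" > /dev/null 2>&1 &');
    sleep(2); // give it a moment to bind the port
}
```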

As a quick fix, employing a timeout via curl_setopt($ch, CURLOPT_TIMEOUT, 30) can help manage hanging requests. Also, you might want to track PIDs accurately for proper cleanup afterward. Consider leveraging libraries such as ReactPHP or Symfony’s Panther, which offer robust handling of these scenarios.
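Applied to the request from the question, the timeout fix looks something like this (the 5/30-second values are illustrative; note the target URL should also be urlencoded):

```php
<?php
// Sketch: timeouts on the local scraping request so a hung browser
// can't stall the whole cron job. URL parameters follow the question.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,
    'http://localhost:8080/?target=' . urlencode($target_url) . '&wait=2000&format=raw');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // give up quickly if the service is down
curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // cap the total request time

$content = curl_exec($ch);

if ($content === false) {
    error_log('Scrape failed: ' . curl_error($ch));
}
curl_close($ch);
```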

I had similar problems when building scrapers for client projects. The root cause is usually that shell_exec returns output, not the actual process ID, so your kill command targets nothing.

What worked for me was implementing a proper daemon approach. I created a lightweight HTTP server using ReactPHP that keeps ChromeDriver running continuously. The server accepts scraping requests via HTTP endpoints and maintains the browser session between calls.
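The skeleton of that daemon looks roughly like this with react/http installed via Composer. fetchRendered() is a hypothetical helper standing in for whatever forwards the URL to the persistent ChromeDriver session:

```php
<?php
// Sketch of the daemon idea with ReactPHP: a tiny always-on HTTP server
// that owns the browser session, so cron jobs just send it requests.
require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\ServerRequestInterface;
use React\Http\HttpServer;
use React\Http\Message\Response;
use React\Socket\SocketServer;

$server = new HttpServer(function (ServerRequestInterface $request) {
    $params = $request->getQueryParams();
    $target = $params['target'] ?? null;
    if ($target === null) {
        return Response::plaintext("missing ?target=\n")->withStatus(400);
    }
    // fetchRendered() is assumed: it would hand the URL to the
    // long-lived headless browser and return the rendered HTML.
    return Response::html(fetchRendered($target));
});

$server->listen(new SocketServer('127.0.0.1:8081'));
```

The cron jobs then become plain curl calls against 127.0.0.1:8081, and the browser never has to cold-start per request.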

For immediate fixes, try using pgrep to find the actual browser process before killing it. Something like exec("pkill -f runner.exe") might work better than your current approach.
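For example, a cleanup step that escalates from a polite SIGTERM to SIGKILL (the two-second grace period is arbitrary):

```php
<?php
// Sketch: find and kill the browser by name when no PID was captured.
// pgrep/pkill with -f match against the full command line; runner.exe
// is the binary from the question.
exec('pgrep -f runner.exe', $pids, $rc);

if ($rc === 0) {                    // pgrep found at least one match
    exec('pkill -f runner.exe');    // SIGTERM first: polite shutdown
    sleep(2);
    exec('pkill -9 -f runner.exe'); // force-kill any survivor
}
```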

Also consider adding proper error handling and timeouts. I learned the hard way that headless browsers can consume massive amounts of memory if pages don’t load correctly. Setting memory limits and implementing health checks saved me countless debugging hours.
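A simple memory health check can run from the same cron schedule. This is only a sketch; the 500 MB threshold is an arbitrary assumption, and the relaunch step is left out:

```php
<?php
// Sketch: periodic health check for the long-running browser. If its
// resident set size crosses a limit, kill it so it can be relaunched.
$pid = (int) trim((string) shell_exec('pgrep -f runner.exe'));

if ($pid > 0) {
    // ps prints RSS in kilobytes for the given PID
    $rssKb = (int) trim((string) shell_exec("ps -o rss= -p $pid"));
    if ($rssKb > 500 * 1024) {      // assumed 500 MB ceiling
        error_log("Browser at {$rssKb} KB RSS, restarting");
        exec("kill $pid");
        // ... relaunch the runner here ...
    }
}
```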

The daemon approach eliminated most reliability issues I was facing with cron-based scraping tasks.