Managing headless browser processes in PHP for automated web scraping via cron

I’m working on setting up an automated web scraper that needs to run every day through a cron job. Since I need to capture content that gets loaded by JavaScript, I decided to use a headless browser to get the fully rendered DOM.

I’ve been experimenting with a headless browser solution and can successfully scrape one page using cURL. However, I’m running into issues where the browser process hangs after the first request. The documentation is pretty sparse so I’m stuck.

My main question is about properly managing the browser process lifecycle in PHP. Should I be starting and stopping the headless browser for each request, or is it better to keep it running continuously? Starting and stopping seems cleaner but I’m not sure how to properly terminate the process.

Here’s what I’ve tried so far:

// Launch the headless browser runner in the background and (try to) capture its PID
$command = "\"" . BASE_PATH . "/tools/browser/runner.exe\" \"" . BASE_PATH . "/scraper/config/app.ini\" 2>&1 &";
$processId = shell_exec($command);

// Ask the browser's local HTTP endpoint for the rendered page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2000&format=raw');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

exec("kill -9 " . $processId); // This doesn't work properly
echo $content;

The process termination part isn’t working as expected. Is there a more reliable approach for handling headless browser processes in PHP? Or maybe there’s a better solution altogether for scraping JavaScript-heavy pages?

You’ve got a classic process management issue. shell_exec() returns the command’s standard output, not a process ID, and because you background the command with & (without echoing anything back), $processId ends up empty, so your kill -9 has nothing to act on. Even if you capture a PID by appending echo $! to the command, it can belong to a wrapper shell or launcher rather than the browser process itself.
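For completeness, that shell-level workaround usually looks something like the sketch below. It assumes a Unix-like shell and still has the limitation above: the PID you get back may not cover child processes the runner spawns.

// Background the runner, silence its output, and echo the backgrounded job's PID
$cmd = '"' . BASE_PATH . '/tools/browser/runner.exe" "'
     . BASE_PATH . '/scraper/config/app.ini" > /dev/null 2>&1 & echo $!';
$processId = (int) trim(shell_exec($cmd));

// This may kill only the direct child, not anything it spawned
if ($processId > 0) {
    exec('kill ' . $processId);
}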

I’ve hit this same issue with headless browsers in PHP. What worked for me was switching to proc_open() instead of shell_exec(). You get proper process handles that you can actually manage:

// stdin, stdout, stderr as pipes so the browser can't block on a missing terminal
$descriptorspec = array(
    0 => array("pipe", "r"),
    1 => array("pipe", "w"),
    2 => array("pipe", "w"),
);

$process = proc_open($command, $descriptorspec, $pipes);

if (is_resource($process)) {
    // Your cURL logic here

    // Close the pipes first, otherwise proc_close() can block waiting on them
    foreach ($pipes as $pipe) {
        fclose($pipe);
    }
    proc_terminate($process); // sends SIGTERM by default
    proc_close($process);
}
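And if you ever need the underlying PID anyway (say, to escalate to a hard kill when proc_terminate() isn't enough), proc_get_status() exposes it:

$status = proc_get_status($process);
$browserPid = $status['pid']; // PID of the process proc_open() launched

One caveat: on Unix, PHP runs a string command through sh -c, so this PID can be the wrapper shell rather than the browser itself; passing the command as an array (supported since PHP 7.4) avoids the extra shell.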

For your lifecycle question - starting and stopping per request is definitely safer for cron jobs. Persistent browser processes accumulate memory leaks and get unstable over time, and the startup overhead is usually negligible compared to the value of a reliable, fresh process on every run.
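Putting it together, a per-run wrapper could look something like the sketch below. It assumes your runner listens on localhost:8080 and accepts the same ?target=...&wait=...&format=raw query as in your snippet; the port, paths, and timeouts are placeholders taken from your example, not anything mandated by a particular tool.

function scrapeRendered($targetUrl)
{
    // Launch a fresh browser for this run only
    $command = '"' . BASE_PATH . '/tools/browser/runner.exe" "'
             . BASE_PATH . '/scraper/config/app.ini"';
    $process = proc_open($command, array(
        0 => array("pipe", "r"),
        1 => array("pipe", "w"),
        2 => array("pipe", "w"),
    ), $pipes);

    if (!is_resource($process)) {
        return false;
    }

    // Poll until the browser's local endpoint accepts connections (roughly 10s max)
    $ready = false;
    for ($i = 0; $i < 100; $i++) {
        $sock = @fsockopen('localhost', 8080, $errno, $errstr, 0.1);
        if ($sock) {
            fclose($sock);
            $ready = true;
            break;
        }
        usleep(100000); // 100 ms between attempts
    }

    $content = false;
    if ($ready) {
        $ch = curl_init('http://localhost:8080/?target=' . urlencode($targetUrl) . '&wait=2000&format=raw');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60); // don't let the cron run hang on one page
        $content = curl_exec($ch);
        curl_close($ch);
    }

    // Always tear down, even when the fetch failed
    foreach ($pipes as $pipe) {
        fclose($pipe);
    }
    proc_terminate($process);
    proc_close($process);

    return $content;
}

Your cron entry then just invokes the PHP script once a day, and each run gets a fresh browser that is torn down whether or not the fetch succeeded.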

Honestly, I'd just ditch the custom browser setup and use Puppeteer with Node.js instead. It's way less headache than fighting PHP process management. You can still call it from PHP using exec(), and Puppeteer handles the browser lifecycle itself, so you don't end up with hanging processes.
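Roughly like this, assuming you write a small Puppeteer wrapper (scrape.js here is hypothetical: it would take a URL argument, print the rendered HTML to stdout, and exit):

// scrape.js is a hypothetical Node/Puppeteer script, not something that ships with Puppeteer
$html = shell_exec(
    'node ' . escapeshellarg(BASE_PATH . '/tools/scrape.js')
    . ' ' . escapeshellarg($targetUrl) . ' 2>/dev/null'
);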

Been dealing with the same scraping issues. Zombie processes are usually why browsers hang - your current setup probably isn't killing the child processes when you terminate the parent. What fixed it for me was adding a timeout plus process group termination: don't just kill the main process, kill the entire process group so you catch all the spawned children (see the sketch below).

Also, run your cron job under a dedicated user account with limited resources so runaway processes can't eat your whole system. And check that your headless browser is actually configured correctly - some browsers still try to initialize display components even in headless mode, which makes them hang on servers without a proper display setup.
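For the process group part, one way to do it from PHP is to launch the runner through setsid so it becomes its own process group leader, then signal the negative group ID. This is a sketch that assumes a Linux host with setsid available (paths reused from the question):

// setsid makes the runner the leader of a new process group, so signalling
// the negative group ID takes down every child it spawned as well
$cmd = 'setsid "' . BASE_PATH . '/tools/browser/runner.exe" "'
     . BASE_PATH . '/scraper/config/app.ini" > /dev/null 2>&1 & echo $!';
$pgid = (int) trim(shell_exec($cmd));

// ... scraping happens here ...

if ($pgid > 0) {
    exec('kill -TERM -- -' . $pgid);                   // ask the whole group to exit
    sleep(5);                                          // grace period
    exec('kill -KILL -- -' . $pgid . ' 2>/dev/null');  // then force it
}

If the posix extension is loaded, posix_kill(-$pgid, 15) does the same thing without shelling out.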