How to capture streaming XHR data in real time with Puppeteer before the request finishes

I’m working on a Node.js project using Puppeteer to extract live data from a website. Rather than parsing the DOM directly, I monitor network requests and grab JSON responses for better structured information.

Most things work fine, but I’m stuck with Firestore streaming requests. When I watch these requests in Chrome DevTools, I can see data arriving in chunks like this:

23
[[456,["empty"]]]
23
[[457,["empty"]]]
89
[[458,[{
"dataUpdate": {
"token": "AbcDef123XyZ=",
"timestamp": "2025-02-15T10:15:22.445123Z"
}
}]]]
23
[[459,["empty"]]]

Each chunk arrives over multiple seconds, then eventually the connection closes and a fresh request starts. The issue is my code only receives all the data when the entire response finishes, not as each piece streams in.

I need to process these chunks as they arrive for real-time updates. Is there a method to access response data while it’s still streaming?

page.on('response', async (res) => {    
    if (res.request().resourceType() === 'xhr') {
        console.log('Stream URL:', res.url());
        const fullData = await res.text();
        console.log('Complete response: ', fullData);
    }
});

yeah, puppeteer’s streaming support is pretty weak. i used CDP’s Runtime.evaluate to inject JS that intercepts fetch/xhr at the browser level, so the data is captured inside the page before puppeteer’s response object has anything. for xhr, patch XMLHttpRequest.prototype.open (or send) to attach a readystatechange listener, catch readyState 3 (loading) and read responseText as it grows. one catch: responseText is only readable when responseType is '' or 'text'. it’s hacky but works great for realtime data without needing separate http clients.
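A minimal sketch of that readyState 3 hook, under a few assumptions: it uses page.evaluateOnNewDocument() and page.exposeFunction() rather than a raw Runtime.evaluate call, it only covers XHR (not fetch), and the names installXhrTap, attachXhrTap and onXhrChunk are made up for illustration.

```javascript
// Runs inside the page. It must be self-contained because Puppeteer
// serializes the function source before injecting it.
function installXhrTap() {
  const proto = globalThis.XMLHttpRequest.prototype;
  const originalOpen = proto.open;

  proto.open = function (method, url, ...rest) {
    let delivered = 0; // number of characters already reported for this request

    this.addEventListener('readystatechange', () => {
      // 3 = LOADING (fires repeatedly as data arrives), 4 = DONE
      if (this.readyState < 3) return;
      // responseText throws unless responseType is '' or 'text'
      if (this.responseType !== '' && this.responseType !== 'text') return;

      const text = this.responseText;
      if (text.length > delivered) {
        // onXhrChunk is a hypothetical binding created by attachXhrTap()
        globalThis.onXhrChunk(String(url), text.slice(delivered));
        delivered = text.length;
      }
    });

    return originalOpen.call(this, method, url, ...rest);
  };
}

// Node side: expose a callback to the page, then install the patch so it
// runs before any of the page's own scripts on every new document.
async function attachXhrTap(page, onChunk) {
  await page.exposeFunction('onXhrChunk', onChunk);
  await page.evaluateOnNewDocument(installXhrTap);
}
```

Call attachXhrTap(page, (url, text) => console.log(url, text)) before page.goto(), so the patch is in place when the first streaming request opens. Only the new part of responseText is forwarded each time, so the callback sees each chunk once.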

Puppeteer’s response object doesn’t support streaming - res.text() waits for the full response before returning anything. That’s why you only see results after the connection closes.

I hit this same issue building a real-time monitoring tool. My workaround was intercepting requests with page.setRequestInterception(true), then using a separate HTTP client to make the same request with streaming. You can grab the headers from the intercepted request and the cookies from the page, then use axios or the native http module with streaming enabled. Keep in mind that once interception is on, every request has to be continued or aborted, otherwise the page stalls.
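Roughly, that replay approach could look like the sketch below. It assumes Node 18+ and uses the built-in fetch instead of axios or the http module; mirrorStreams, replayAsStream, pumpBody and shouldMirror are illustrative names, not Puppeteer APIs. Note that the server sees the request twice, which may or may not be acceptable for session-based endpoints.

```javascript
// Headers the Node HTTP client should set itself rather than copy from the browser.
const SKIPPED_HEADERS = new Set(['host', 'connection', 'content-length']);

// Read a fetch() Response body chunk by chunk, decoding bytes to text as they arrive.
async function pumpBody(response, onChunk) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
  const tail = decoder.decode(); // flush any bytes held back by the decoder
  if (tail) onChunk(tail);
}

// Re-issue an intercepted Puppeteer request from Node and stream the reply.
async function replayAsStream(page, request, onChunk) {
  const headers = {};
  for (const [name, value] of Object.entries(request.headers())) {
    const lower = name.toLowerCase();
    if (!SKIPPED_HEADERS.has(lower) && !lower.startsWith(':')) headers[name] = value;
  }
  // Intercepted headers may not include cookies, so read them from the page.
  const cookies = await page.cookies(request.url());
  if (cookies.length > 0) {
    headers.cookie = cookies.map((c) => `${c.name}=${c.value}`).join('; ');
  }
  const response = await fetch(request.url(), {
    method: request.method(),
    headers,
    body: request.postData(), // undefined for GET requests
  });
  await pumpBody(response, onChunk);
}

// Mirror matching XHRs through Node while letting the browser's copy proceed.
async function mirrorStreams(page, shouldMirror, onChunk) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.resourceType() === 'xhr' && shouldMirror(request.url())) {
      replayAsStream(page, request, onChunk)
        .catch((err) => console.error('replay failed:', err.message));
    }
    request.continue(); // every intercepted request must be continued or aborted
  });
}
```

For example, mirrorStreams(page, (url) => url.includes('firestore'), (text) => process.stdout.write(text)) would print chunks as the duplicate connection receives them; the URL test here is only a guess at what identifies the streaming endpoint.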

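You could also go directly through the Chrome DevTools Protocol: open a session with page.createCDPSession() (page._client is a private field) and listen for Network.dataReceived events. By default those events only report how many bytes arrived; the experimental Network.streamResourceContent command, where the browser supports it, makes them carry the data itself. It’s lower-level CDP stuff that sits outside Puppeteer’s high-level response API though. The streaming interception gets messy with auth tokens and session management, but it works if you really need real-time processing.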
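Here is a hedged sketch of the CDP route. It assumes a Chrome build recent enough to support the experimental Network.streamResourceContent command; on older builds the call is expected to fail and no chunks will be delivered. tapStreamingResponses and decodeChunk are illustrative names.

```javascript
// Decode one base64 CDP payload to text. Assumes chunk boundaries do not split
// a multi-byte UTF-8 character, which holds for ASCII JSON like the sample
// in the question.
function decodeChunk(base64) {
  return Buffer.from(base64, 'base64').toString('utf8');
}

async function tapStreamingResponses(page, onChunk) {
  const client = await page.createCDPSession();
  await client.send('Network.enable');

  // Once response headers arrive, ask Chrome to stream that request's body.
  client.on('Network.responseReceived', async ({ requestId, type }) => {
    if (type !== 'XHR' && type !== 'Fetch') return;
    try {
      const { bufferedData } = await client.send(
        'Network.streamResourceContent',
        { requestId },
      );
      // bufferedData holds whatever arrived before streaming was enabled.
      if (bufferedData) onChunk(requestId, decodeChunk(bufferedData));
    } catch (err) {
      // Command unavailable in this Chrome, or the request already finished.
    }
  });

  // With streaming enabled, dataReceived events include a base64 `data` field;
  // without it they only carry byte counts and are skipped here.
  client.on('Network.dataReceived', ({ requestId, data }) => {
    if (data) onChunk(requestId, decodeChunk(data));
  });

  return client;
}
```

Calling tapStreamingResponses(page, (id, text) => console.log(id, text)) before navigation should log each chunk with its CDP requestId. Detach the returned session with client.detach() when done.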

Chrome DevTools Protocol is way more direct - use Network.getResponseBody with Network.responseReceived events. I’ve built data extraction tools this way when I needed streaming support.

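Just open a CDP session with page.createCDPSession() (page._client is a private field), listen for network events, then poll Network.getResponseBody with the requestId while the request is still loading. Network.loadingFinished is an event rather than a flag, so track whether it has fired for that requestId. Until it does, the call may return partial content or may fail, so wrap it in try/catch.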

Here’s another trick that worked great: use page.evaluateOnNewDocument() to monkey-patch XMLHttpRequest before requests fire. Attach a progress listener (assigning onprogress directly can be overwritten by the page’s own handler) to capture streaming data in window variables, then poll them with page.evaluate(). Everything stays in the browser context and you dodge authentication issues you’d hit with external HTTP clients.
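The buffer-and-poll variant might be sketched like this. The window variable __xhrChunks and the helpers installProgressBuffer, drainChunks and pollChunks are hypothetical names chosen for the example, and the 250 ms interval is an arbitrary default.

```javascript
// Runs inside the page: buffer newly received XHR text in a window variable.
function installProgressBuffer() {
  globalThis.__xhrChunks = [];
  const proto = globalThis.XMLHttpRequest.prototype;
  const originalSend = proto.send;

  proto.send = function (...args) {
    let delivered = 0; // characters already buffered for this request
    this.addEventListener('progress', () => {
      // responseText throws unless responseType is '' or 'text'
      if (this.responseType !== '' && this.responseType !== 'text') return;
      const text = this.responseText;
      if (text.length > delivered) {
        globalThis.__xhrChunks.push(text.slice(delivered));
        delivered = text.length;
      }
    });
    return originalSend.apply(this, args);
  };
}

// Node side: take everything buffered so far and empty the page-side buffer.
async function drainChunks(page) {
  return page.evaluate(() => (globalThis.__xhrChunks || []).splice(0));
}

// Poll until shouldStop() returns true, handing each chunk to onChunk.
async function pollChunks(page, onChunk, intervalMs = 250, shouldStop = () => false) {
  while (!shouldStop()) {
    for (const chunk of await drainChunks(page)) onChunk(chunk);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Install it with await page.evaluateOnNewDocument(installProgressBuffer) before page.goto(), then start pollChunks(page, handleChunk). Polling adds up to one interval of latency, and page.evaluate() can reject if the page navigates mid-call, so a real loop would likely want a try/catch around drainChunks.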
