Puppeteer js only retrieves HTML header while full content is visible in Chrome DevTools

John_Clever · March 6, 2025, 10:23pm

I’m using Puppeteer JS in my Node.js app to scrape a lyrics website, but I’m only getting the HTML header instead of the full page content. Here’s a sample URL I’m working with: https://shironet.mako.co.il/search?q=fire. The site appears to be built with an SPA framework, as I get only the header filled with compressed JS functions and an empty HTML body. However, I can see the complete HTML in Chrome DevTools. This is the scraping code I’m using:

'use strict'
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const baseUrl = 'https://shironet.mako.co.il/search?q=';

async function fetchLyrics(songName) {
    if (!songName) {
        return 'No song name specified';
    }
    console.log(`Fetching lyrics for: ${songName}`);
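    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // baseUrl already ends with "?q=", so append the encoded query directly.
        await page.goto(`${baseUrl}${encodeURIComponent(songName)}`, { waitUntil: 'networkidle2' });
        // Fixed 10 s pause; page.waitForTimeout is no longer available in recent Puppeteer releases.
        await new Promise(resolve => setTimeout(resolve, 10000));
        const html = await page.content();
        const $ = cheerio.load(html);
        $('a.search_link_name').each((i, el) => {
            console.log($(el).text());
        });
    } finally {
        // Close the browser even if navigation or parsing throws.
        await browser.close();
    }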
}

module.exports = { fetchLyrics };

With headless: false, DevTools shows an empty body and a head full of functions, which seems to stop the page from loading. This is part of the HTML response I’m getting in both headless and non-headless modes:

<html><head><meta charset="utf-8"><script>function i700(){}i700.F20=function (){return typeof i700.O20.p60==='function'?i700.O20.p60.apply(i700.O20,arguments):i700.O20.p60;};i700.X70=function (){return typeof i700.v70.p60==='function'?i700.v70.p60.apply(i700.v70,arguments):i700.v70.p60;};i700.Z20=function (){return typeof i700.O20.P20==='function'?i700.O20.P20.apply(i700.O20,arguments):i700.O20.P20;};i700.Q60=function (){return typeof i700.Y60.P20==='function'?i700.Y60.P20.apply(i700.Y60,arguments):i700.Y60.P20;};...;winsocks();</script></head><body></body></html>

What might I be doing wrong? Cheerio fails without body content. Even waitFor and waitUntil tricks don’t work for me. Also, tools like Axios and Insomnia return an empty body, but Postman retrieves the correct HTML. Any idea why this happens? Thanks for any help!

Bob_Clever · March 16, 2025, 9:19pm

hey, i’ve dealt with similar stuff. try listening for the response that carries the lyrics data with page.on('response', ...) and parse that directly instead of the html. page.setRequestInterception(true) is only needed if u also want to block or modify requests.
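
A rough sketch of the response-listening idea, assuming a plain 'response' listener is enough for read-only capture. The helper name collectResponses and the urlFilter predicate are made up for illustration; which URLs actually carry the lyrics data is something you would have to find in the Network tab first.

```javascript
// Sketch: record the bodies of responses seen while a Puppeteer page loads.
// `page` is a Puppeteer Page; `urlFilter` is a hypothetical predicate you supply.
function collectResponses(page, urlFilter = () => true) {
    const captured = [];
    const handler = async (response) => {
        const url = response.url();
        if (!urlFilter(url)) return;
        try {
            // text() can reject, e.g. for redirects or bodies that are gone.
            const body = await response.text();
            captured.push({ url, status: response.status(), body });
        } catch (err) {
            captured.push({ url, status: response.status(), body: null });
        }
    };
    page.on('response', handler);
    return {
        captured,
        // Call stop() once you have what you need.
        stop: () => page.off('response', handler),
    };
}
```

Attach it before page.goto(...), for example const watcher = collectResponses(page, url => url.includes('search')), then inspect watcher.captured after navigation finishes.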

also, check the network tab in devtools for xhr requests when searching. might find a direct api endpoint to hit without puppeteer.

if u must use puppeteer, add random delays and rotate user agents. some sites spot bots by their too-perfect timing.

SwiftCoder42 · March 16, 2025, 7:20am

I’ve encountered similar challenges with dynamic sites. Have you tried using page.evaluate() to execute JavaScript directly on the page? This can sometimes help with retrieving content that’s loaded dynamically.
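
As a minimal sketch of the page.evaluate() route: the helper below (readSearchLinks is a name I made up) reads the link text inside the browser context, reusing the a.search_link_name selector from the question. It only helps once the results have actually rendered, so it is not a fix for an empty body on its own.

```javascript
// Sketch: collect link text in the page itself, so no Cheerio pass is needed.
// `page` is a Puppeteer Page; the default selector comes from the question.
async function readSearchLinks(page, selector = 'a.search_link_name') {
    return page.evaluate((sel) => {
        // This callback runs in the browser, where `document` exists.
        return Array.from(document.querySelectorAll(sel), (a) => a.textContent.trim());
    }, selector);
}
```

Usage would be something like const names = await readSearchLinks(page); after navigation.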

Another approach is to analyze the network requests in Chrome DevTools. Look for XHR or Fetch requests that might be loading the lyrics data. You could then recreate those requests in your script, potentially bypassing the need for Puppeteer altogether.

If you’re still set on using Puppeteer, consider adding a longer wait time or waiting for specific elements to appear on the page. Something like:

await page.waitForSelector('a.search_link_name', {timeout: 30000});

This waits up to 30 seconds for the search results to appear. Adjust the timeout as needed.
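
If the selector never appears, waitForSelector rejects, so it may be worth catching that case explicitly. A small sketch, assuming Puppeteer's timeout error has the name 'TimeoutError' (the helper name waitForResults is mine):

```javascript
// Sketch: resolve to true if the results show up in time, false on timeout.
// `page` is a Puppeteer Page; any non-timeout error is rethrown unchanged.
async function waitForResults(page, selector = 'a.search_link_name', timeoutMs = 30000) {
    try {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return true;
    } catch (err) {
        // Assumption: Puppeteer reports timeouts with err.name === 'TimeoutError'.
        if (err && err.name === 'TimeoutError') return false;
        throw err;
    }
}
```

A false result here tells you the body really stayed empty, which points at bot detection rather than slow loading.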

Lastly, some sites use sophisticated bot detection. Try adding a custom user agent and introducing random delays between actions to appear more human-like.
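
For the user agent and random delay suggestion, something along these lines might do as a starting point. The user-agent strings are only illustrative examples and the helper names are hypothetical; whether this gets past this particular site's detection is not something I can promise.

```javascript
// Example user-agent strings (illustrative values; swap in ones you trust).
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
];

// Inclusive random integer in [min, max].
function randomInt(min, max) {
    return min + Math.floor(Math.random() * (max - min + 1));
}

// Pause for a random time between minMs and maxMs; resolves to the delay used.
function randomDelay(minMs, maxMs) {
    const ms = randomInt(minMs, maxMs);
    return new Promise((resolve) => setTimeout(() => resolve(ms), ms));
}

// Pick one agent at random and apply it to a Puppeteer Page.
async function applyRandomUserAgent(page, agents = USER_AGENTS) {
    const agent = agents[randomInt(0, agents.length - 1)];
    await page.setUserAgent(agent);
    return agent;
}
```

Call applyRandomUserAgent(page) before page.goto(...), and await randomDelay(500, 2000) between actions.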

Grace_31Dance · March 15, 2025, 10:21am

I’ve faced similar issues when scraping dynamic websites. One trick that worked for me was watching network traffic from inside Puppeteer. Try listening with page.on('response', ...) for the response that contains the lyrics data; page.setRequestInterception(true) is only required if you also want to block or modify requests. You can then parse that response directly instead of relying on the rendered HTML.

Another approach is to reverse-engineer the site’s API calls. Open DevTools, go to the Network tab, and look for XHR requests when searching for a song. You might find a direct API endpoint you can hit without needing to use Puppeteer at all.
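
If it is easier to do this discovery from the script than from the Network tab, a sketch like the following logs XHR and fetch requests as the page loads. The helper name logApiRequests is hypothetical, and I am not assuming anything about which endpoints this site actually exposes.

```javascript
// Sketch: list the XHR/fetch requests a Puppeteer page makes while loading.
// `page` is a Puppeteer Page; `log` defaults to console.log.
function logApiRequests(page, log = console.log) {
    const seen = [];
    page.on('request', (request) => {
        const type = request.resourceType();
        if (type === 'xhr' || type === 'fetch') {
            const entry = { method: request.method(), url: request.url() };
            seen.push(entry);
            log(`${entry.method} ${entry.url}`);
        }
    });
    // The array keeps filling as long as the page keeps issuing requests.
    return seen;
}
```

Register it before page.goto(...); any URL that shows up here is a candidate to call directly with a plain HTTP client.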

If you must use Puppeteer, try adding some randomized delays between actions and rotate user agents. Some sites have sophisticated bot detection that picks up on too-perfect timing or consistent headers.

Lastly, check if the site has a public API or terms of service regarding scraping. Sometimes there are official ways to access the data you need without resorting to web scraping.