Web scraper inconsistently fails to extract data from final pages

henryg · June 25, 2025, 2:08am

I’m building a web scraper with Puppeteer to collect news articles, but it fails inconsistently. The website has 10 pages with 10 articles each (100 total), yet the scraper stops at a different point on each run: sometimes it collects only 39 articles, other times around 90. I can’t figure out what’s causing it.

Here’s my current approach:

await browser.goto(targetUrl, { timeout: 60000 })

const scrapedData = []

await browser.waitForSelector('div.pagination-container', { timeout: 60000 })

let paginationButtons = await browser.$$('div.pagination-container')

await browser.waitForSelector('div.content-wrapper', { timeout: 60000 })

for(let index = 0; index < paginationButtons.length; index++){
    const currentButton = paginationButtons[index]

    if(index !== 0){
        await browser.evaluate((button) => {
            button.click()
        }, currentButton)

        await browser.waitForSelector('div.content-wrapper', { timeout: 60000 }).catch(error => {
            return
        })
    }

    paginationButtons = await browser.$$('div.pagination-container')

    let articleLinks = await browser.$$('div.content-wrapper > div > div div > div.thumbnail-section > div > a')

    for (const linkElement of articleLinks) {
        try {
            const articleUrl = await browser.evaluate((anchor) => anchor.href, linkElement)

            const newTab = await browserInstance.newPage()
            await newTab.goto(articleUrl, { waitUntil: 'load', timeout: 60000 })
            await newTab.waitForSelector('h1.article-title', { timeout: 60000 }).catch(error => {
                return
            })

            const headline = await newTab.$eval('h1.article-title', (elem) => elem.textContent.trim())
            const paragraphs = await newTab.$$eval('div.article-text p', (elements) =>
                elements.map((paragraph) => paragraph.textContent.replace(/\n/g, ' ').replace(/\s+/g, ' '))
            )
            scrapedData.push({ headline, text: paragraphs.join(' ') })
            await newTab.close()
        }
        catch (error) {
            console.log('Failed to scrape article:', error)
        }
    }
}

return { query: searchTerm, total: scrapedData.length, data: scrapedData }

What could be causing this inconsistent behavior and how can I make it scrape all pages reliably?

ClimbingLion · July 3, 2025, 3:12am

your pagination selector’s broken. you’re grabbing div.pagination-container elements and clicking them directly - but those are just containers, not the actual buttons. you need to target the clickable links inside those containers. also, your loop uses paginationButtons.length but you’re re-querying the buttons every iteration, which screws up the count when the DOM changes.
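
A rough sketch of that idea, not a drop-in fix. The selectors `div.pagination-container a` and `div.content-wrapper div.thumbnail-section a` are guesses at the markup, and `collectArticleUrls` is a made-up helper name. It also assumes there is exactly one link per page (no prev/next arrows) and that clicking swaps the article list in place rather than triggering a full navigation. The page count is read once up front, the links are re-queried before each click, and the wait checks that the first article link actually changed:

```javascript
// Hypothetical selectors: adjust both to the site's real markup.
const PAGE_LINK = 'div.pagination-container a';
const ARTICLE_LINK = 'div.content-wrapper div.thumbnail-section a';

// Visit every pagination link and collect the article URLs from each page.
// `page` is a Puppeteer Page (anything with $$, $$eval and waitForFunction works).
async function collectArticleUrls(page) {
  const urls = new Set();
  // Fix the page count once, before the DOM starts changing.
  const pageCount = (await page.$$(PAGE_LINK)).length;

  for (let i = 0; i < Math.max(pageCount, 1); i++) {
    if (i > 0) {
      const before = await page.$$eval(ARTICLE_LINK, (as) => as.map((a) => a.href));
      // Re-query the links each time: handles from an earlier render may be stale.
      const links = await page.$$(PAGE_LINK);
      if (!links[i]) break;
      await links[i].click();
      // Wait until the first article link differs from the previous page's.
      await page.waitForFunction(
        (sel, prev) => {
          const first = document.querySelector(sel);
          return first !== null && first.href !== prev;
        },
        { timeout: 60000 },
        ARTICLE_LINK,
        before[0] ?? ''
      );
    }
    const hrefs = await page.$$eval(ARTICLE_LINK, (as) => as.map((a) => a.href));
    hrefs.forEach((href) => urls.add(href));
  }
  return [...urls];
}
```

Collecting the URLs first and opening the articles afterwards also keeps the listing page from being disturbed while tabs open and close.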

mikechen · July 1, 2025, 4:27pm

Memory leaks from unclosed pages are probably killing your scraper. I had the same random failures until I figured out my error handling was leaving tabs open when article extraction failed. Every time you create a new tab but hit an exception, that page stays in memory eating resources. Eventually Chrome runs out of memory and starts failing unpredictably. Add a finally block around your article scraping so newTab.close() always runs, even on errors. Also consider limiting concurrent tabs - opening 100+ tabs will overwhelm most systems. Try processing articles in batches of 5-10 max. Getting different counts each run screams resource exhaustion rather than pagination issues.
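
Something along these lines, as a sketch rather than a tested fix. `scrapeArticle` and `scrapeInBatches` are hypothetical helper names, the batch size of 5 is arbitrary, and the selectors are copied from the question. The `finally` block closes the tab whether or not extraction throws, and `Promise.allSettled` keeps one failed article from aborting the rest of its batch:

```javascript
// Scrape one article in its own tab; the tab is closed on success and on failure.
// `browser` is a Puppeteer Browser (anything with newPage() works for this sketch).
async function scrapeArticle(browser, url) {
  const tab = await browser.newPage();
  try {
    await tab.goto(url, { waitUntil: 'load', timeout: 60000 });
    await tab.waitForSelector('h1.article-title', { timeout: 60000 });
    const headline = await tab.$eval('h1.article-title', (el) => el.textContent.trim());
    const paragraphs = await tab.$$eval('div.article-text p', (els) =>
      els.map((p) => p.textContent.replace(/\s+/g, ' ').trim())
    );
    return { headline, text: paragraphs.join(' ') };
  } finally {
    await tab.close(); // runs even when any step above throws
  }
}

// Process URLs in fixed-size batches so at most `batchSize` tabs are open at once.
async function scrapeInBatches(browser, urls, batchSize = 5) {
  const results = [];
  const failed = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    const settled = await Promise.allSettled(batch.map((u) => scrapeArticle(browser, u)));
    settled.forEach((outcome, j) => {
      if (outcome.status === 'fulfilled') results.push(outcome.value);
      else failed.push({ url: batch[j], reason: String(outcome.reason) });
    });
  }
  return { results, failed };
}
```

Returning the `failed` list alongside the results makes it obvious which articles went missing on a given run, instead of just seeing a lower total.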

Ethan99 · June 29, 2025, 5:52am

I’ve hit this same issue - it’s usually timing problems with dynamic content. Your main problem is that you catch the waitForSelector errors and then continue as if nothing happened. When the content wrapper fails to load, your article extraction just silently breaks. You need better error handling and longer waits between page navigations. Also throw in a small delay after clicking pagination buttons before waiting for selectors. The site might be rate limiting you, especially with multiple tabs open at once. Exponential backoff retry logic saved me when pages kept failing to load. Double-check your pagination button selectors too - sometimes the DOM structure changes between pages and breaks everything.
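
For reference, the retry wrapper I mean looks roughly like this. `withRetry` is just a name I picked, and the retry count and base delay are placeholders to tune. The point is that the wait is allowed to throw, so a page that never loads gets retried with a growing delay instead of being silently skipped:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async action with exponential backoff: baseDelayMs, then 2x, 4x, ...
// Rethrows the last error once `retries` attempts have failed.
async function withRetry(action, { retries = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await action(attempt);
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < retries - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Hypothetical usage inside a pagination loop (`button` and `page` come from your code):
// await withRetry(async () => {
//   await button.click();
//   await page.waitForSelector('div.content-wrapper', { timeout: 60000 });
// });
```

Note there is no `.catch` on the `waitForSelector` call: if every attempt fails, the error surfaces and you know which page was lost.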