I’m building a web scraper with Puppeteer to extract news articles, but I’m running into a strange problem. The target site has 10 pages with 10 articles each, so there should be 100 articles in total, yet the scraper keeps missing the later pages. Sometimes I only get 39 articles, other times around 90; it never reaches the full 100.
Here’s my current approach:
// Load the listing page and wait for the pagination and content containers to render
await page.goto(targetUrl, { timeout: 90000 })
const articles = []
await page.waitForSelector('div.pagination-container', { timeout: 90000 })
let navButtons = await page.$$('div.pagination-container')
await page.waitForSelector('div.content-wrapper', { timeout: 90000 })

for (let pageIndex = 0; pageIndex < navButtons.length; pageIndex++) {
  const currentButton = navButtons[pageIndex]

  // Page 1 is already loaded; for the rest, click the nav button and wait for the content wrapper
  if (pageIndex !== 0) {
    await page.evaluate((btn) => {
      btn.click()
    }, currentButton)
    await page.waitForSelector('div.content-wrapper', { timeout: 90000 }).catch(error => {
      return // swallow the timeout and keep going
    })
  }

  // Re-query the nav buttons, then collect all article links on the current page
  navButtons = await page.$$('div.pagination-container')
  const linkElements = await page.$$('div.content-wrapper > div > div div > div.thumbnail-area > div > a')

  for (const linkEl of linkElements) {
    try {
      // Open each article in its own tab and pull out the headline and body text
      const articleUrl = await page.evaluate((el) => el.href, linkEl)
      const newTab = await browser.newPage()
      await newTab.goto(articleUrl, { waitUntil: 'load', timeout: 90000 })
      await newTab.waitForSelector('h1.headline', { timeout: 90000 }).catch(err => {
        return // swallow the timeout and keep going
      })
      const headline = await newTab.$eval('h1.headline', (el) => el.textContent.trim())
      const content = await newTab.$$eval('div.story-content p', (paragraphs) =>
        paragraphs.map((p) => p.textContent.replace(/\n/g, ' ').replace(/\s+/g, ' '))
      )
      articles.push({ headline, content: content.join(' ') })
      await newTab.close()
    } catch (err) {
      console.log('Failed to process article:', err)
    }
  }
}

return { query: searchTerm, total: articles.length, data: articles }
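
In case it matters, page and browser come from a pretty standard Puppeteer setup that runs before this snippet (simplified here; the real options aren’t anything exotic):

const puppeteer = require('puppeteer')

// One browser instance; the pagination loop above drives a single tab (page),
// and each article is opened in its own extra tab via browser.newPage().
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()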
What could be causing this inconsistent behavior? Any suggestions to make it scrape all pages reliably?
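
One idea I’ve been toying with, though I’m not sure it’s the right fix, is to wait for the article links themselves to change after each click instead of waiting for div.content-wrapper (which already exists before the click). Something like the helper below, where linkSelector would just be the long article-link selector from the snippet above:

// Hypothetical helper: click a pagination button, then wait until the first
// article link's href differs from what it was before the click.
async function clickAndWaitForNewLinks(page, button, linkSelector) {
  const previousHref = await page.$eval(linkSelector, (a) => a.href)
  await page.evaluate((btn) => btn.click(), button)
  await page.waitForFunction(
    (selector, oldHref) => {
      const first = document.querySelector(selector)
      return first && first.href !== oldHref
    },
    { timeout: 90000 },
    linkSelector,
    previousHref
  )
}

Would something along those lines be more reliable, or is there a better pattern for scraping paginated results like this?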