I’m building a web scraper with Puppeteer to collect news articles, but I’m running into an inconsistent issue. The website has 10 pages with 10 articles each (100 total), but my scraper randomly stops before reaching the end. Sometimes it collects only 39 articles, other times around 90. The behavior is unpredictable, and I can’t figure out what’s causing it.
Here’s my current approach:
await browser.goto(targetUrl, { timeout: 60000 })
const scrapedData = []
await browser.waitForSelector('div.pagination-container', { timeout: 60000 })
let paginationButtons = await browser.$$('div.pagination-container')
await browser.waitForSelector('div.content-wrapper', { timeout: 60000 })
for (let index = 0; index < paginationButtons.length; index++) {
  const currentButton = paginationButtons[index]
  if (index !== 0) {
    await browser.evaluate((button) => {
      button.click()
    }, currentButton)
    await browser.waitForSelector('div.content-wrapper', { timeout: 60000 }).catch(error => {
      return
    })
  }
  paginationButtons = await browser.$$('div.pagination-container')
  let articleLinks = await browser.$$('div.content-wrapper > div > div div > div.thumbnail-section > div > a')
  for (const linkElement of articleLinks) {
    try {
      const articleUrl = await browser.evaluate((anchor) => anchor.href, linkElement)
      const newTab = await browserInstance.newPage()
      await newTab.goto(articleUrl, { waitUntil: 'load', timeout: 60000 })
      await newTab.waitForSelector('h1.article-title', { timeout: 60000 }).catch(error => {
        return
      })
      const headline = await newTab.$eval('h1.article-title', (elem) => elem.textContent.trim())
      const paragraphs = await newTab.$$eval('div.article-text p', (elements) =>
        elements.map((paragraph) => paragraph.textContent.replace(/\n/g, ' ').replace(/\s+/g, ' '))
      )
      scrapedData.push({ headline, text: paragraphs.join(' ') })
      await newTab.close()
    }
    catch (error) {
      console.log('Failed to scrape article:', error)
    }
  }
}
return { query: searchTerm, total: scrapedData.length, data: scrapedData }
What could be causing this inconsistent behavior and how can I make it scrape all pages reliably?
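For context, here is a hedged sketch of the direction that might make this more deterministic. It is a hypothesis, not a confirmed diagnosis. Three things in the code above look fragile: `waitForSelector('div.content-wrapper')` can resolve immediately after a click because the wrapper from the previous page is still in the DOM, element handles collected before a re-render can go stale, and a tab is never closed when scraping throws. The sketch collects URLs as plain strings, waits for the first link to change after each click, and closes every tab in `finally`. The names `collectArticleUrls`, `scrapeArticle`, `scrapeAll`, and the `clickPage` callback are hypothetical; `page` and `browser` are assumed to be a Puppeteer `Page` and `Browser`, and the wait assumes each results page starts with a different article.

```javascript
// Selector copied from the question; everything else here is illustrative.
const LINK_SELECTOR =
  'div.content-wrapper > div > div div > div.thumbnail-section > div > a';

// Collect article URLs as plain strings so nothing depends on element handles
// that may detach when the list re-renders.
// `clickPage(page, index)` is a hypothetical callback that clicks the
// pagination control for the given page index.
async function collectArticleUrls(page, pageCount, clickPage) {
  const urls = new Set();
  for (let index = 0; index < pageCount; index++) {
    if (index > 0) {
      const previousFirst = await page.$eval(LINK_SELECTOR, (a) => a.href);
      await clickPage(page, index);
      // Wait until the first link differs from the one seen before the click
      // (assumes consecutive pages do not start with the same article).
      await page.waitForFunction(
        (selector, previous) => {
          const first = document.querySelector(selector);
          return first !== null && first.href !== previous;
        },
        { timeout: 60000 },
        LINK_SELECTOR,
        previousFirst
      );
    }
    const hrefs = await page.$$eval(LINK_SELECTOR, (anchors) =>
      anchors.map((a) => a.href)
    );
    hrefs.forEach((href) => urls.add(href));
  }
  return [...urls];
}

// Scrape one article in its own tab; `finally` closes the tab even on failure.
async function scrapeArticle(browser, url) {
  const tab = await browser.newPage();
  try {
    await tab.goto(url, { waitUntil: 'load', timeout: 60000 });
    await tab.waitForSelector('h1.article-title', { timeout: 60000 });
    const headline = await tab.$eval('h1.article-title', (el) =>
      el.textContent.trim()
    );
    const paragraphs = await tab.$$eval('div.article-text p', (els) =>
      els.map((p) => p.textContent.replace(/\s+/g, ' ').trim())
    );
    return { headline, text: paragraphs.join(' ') };
  } finally {
    await tab.close();
  }
}

// Visit every URL and record failures instead of silently dropping them.
async function scrapeAll(browser, urls) {
  const data = [];
  const failed = [];
  for (const url of urls) {
    try {
      data.push(await scrapeArticle(browser, url));
    } catch (error) {
      failed.push({ url, message: error.message });
    }
  }
  return { total: data.length, data, failed };
}
```

With this shape, a run that returns fewer than 100 articles would at least report which URLs ended up in `failed`, which should help narrow down whether the loss happens during pagination or while loading individual articles.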