Handling session info for web scraping with Puppeteer

I’m trying to get info from immobilienscout24.de using Puppeteer. It seems like I need to keep session data to move between pages on the site. Here’s what I’ve got so far:

const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
const cheerio = require('cheerio')

async function scrapeWebsite() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox'],
    ignoreHTTPSErrors: true
  })
  const page = await browser.newPage()
  const targetSite = 'https://www.immobilienscout24.de'

  for (let i = 1; i <= 5; i++) {
    await page.goto(`${targetSite}/Suche/de/neubauwohnung-mieten?pagenumber=${i}`)
    await page.waitForSelector('.result-list__listing')

    const pageContent = await page.content()
    const $ = cheerio.load(pageContent)

    $('.result-list__listing').each((index, element) => {
      const linkPath = $(element).find('a.result-list-entry__brand-title-container').attr('href')
      if (!linkPath) return // some listings have no link; skip them to avoid a TypeError
      const fullLink = linkPath.includes('expose') ? `${targetSite}${linkPath}` : linkPath
      console.log(fullLink)
    })

    await page.waitForTimeout(5000)
  }

  await browser.close()
}

scrapeWebsite()

Sometimes pages don’t load, and the site seems to flag me as a bot. Any tips on managing sessions for more reliable scraping?

I have encountered similar challenges scraping real estate websites. Your approach is solid, but a few adjustments can make it more reliable. First, add error handling with retries for failed page loads, combined with randomized wait times to mimic human browsing; a rough sketch of that part follows below. It also helps to rotate user agents and IP addresses, and to keep session cookies and headers consistent across navigations. Finally, plan for CAPTCHAs and respect the site’s robots.txt.
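
To make the retry-plus-random-delay idea concrete, here is a minimal sketch; gotoWithRetry is just a name I made up, and the selector and timeouts are assumptions you would tune for your case:

async function gotoWithRetry(page, url, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 })
      await page.waitForSelector('.result-list__listing', { timeout: 15000 })
      return true // page loaded and the listing container is present
    } catch (err) {
      console.warn(`Attempt ${attempt} failed for ${url}: ${err.message}`)
      // back off for a random 3-7 seconds before retrying
      await new Promise(resolve => setTimeout(resolve, 3000 + Math.random() * 4000))
    }
  }
  return false // caller can skip this page or stop the run
}

In your loop you would call const ok = await gotoWithRetry(page, url) and simply continue (or bail out) when it returns false.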

Having dealt with similar issues, I can share some insights that might help. The approach that worked best for me was a custom session management layer: a small helper that persists cookies and headers across requests, which significantly improved my success rate. A minimal sketch is below.
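
Here is a rough sketch of the cookie part, assuming you persist to a local JSON file (the file name and helper names are my own, not from any library):

const fs = require('fs/promises')
const COOKIE_FILE = 'is24-cookies.json' // assumed path; any writable location works

// Save the session cookies after a successful navigation
async function saveSession(page) {
  const cookies = await page.cookies()
  await fs.writeFile(COOKIE_FILE, JSON.stringify(cookies, null, 2))
}

// Restore cookies before the first navigation; start clean if nothing is saved yet
async function restoreSession(page) {
  try {
    const cookies = JSON.parse(await fs.readFile(COOKIE_FILE, 'utf8'))
    await page.setCookie(...cookies)
  } catch {
    // no saved session yet
  }
}

Call restoreSession(page) right after browser.newPage() and saveSession(page) after each page that loads successfully. Headers such as a fixed Accept-Language can be pinned with page.setExtraHTTPHeaders().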

Another trick I found useful was introducing dynamic delays between requests. Instead of a fixed 5-second wait, I used a random delay between 3 and 7 seconds, which made the scraping pattern less predictable and more human-like. Something like the helper below works.
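
A tiny sketch of that helper:

// Sleep for a random 3-7 seconds; a drop-in replacement for the fixed 5-second wait
const randomDelay = (min = 3000, max = 7000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)))

Inside the loop, swap await page.waitForTimeout(5000) for await randomDelay().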

For handling bot detection, I’ve had success with a combination of proxy rotation and user-agent switching. There are npm packages that integrate with Puppeteer for this, but you can cover the basics with Puppeteer itself via the --proxy-server launch flag and page.setUserAgent(); see the sketch below.
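
The proxy URLs and user-agent strings here are placeholders you would replace with your own; --proxy-server is a standard Chromium flag, and setUserAgent/authenticate are built-in Puppeteer methods:

const puppeteer = require('puppeteer-extra')

// Placeholder pools: substitute real proxies and user agents
const PROXIES = ['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000']
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
]
const pick = list => list[Math.floor(Math.random() * list.length)]

async function launchWithRotation() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', `--proxy-server=${pick(PROXIES)}`]
  })
  const page = await browser.newPage()
  // await page.authenticate({ username: 'user', password: 'pass' }) // if the proxy needs credentials
  await page.setUserAgent(pick(USER_AGENTS))
  return { browser, page }
}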

Lastly, I’d recommend implementing solid logging. It helps immensely in pinpointing exactly where the scraping fails, so you can fine-tune your approach based on concrete data rather than guesswork; even something as simple as the snippet below is enough to start.
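
A minimal sketch ('scrape.log' is just an assumed file name):

const fs = require('fs')

// Write one timestamped line per event to the console and a local file
function log(level, message) {
  const line = `${new Date().toISOString()} [${level}] ${message}`
  console.log(line)
  fs.appendFileSync('scrape.log', line + '\n')
}

Log the page number and error message around every goto and waitForSelector call, and the failure patterns usually become obvious quickly.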

I feel your pain; I’ve done some scraping on similar sites before. One thing that helped me was using a proxy rotator; it switches up your IP address so the site doesn’t catch on as quickly. Also, try adding some randomness to your delays between requests, for example by using Math.random() to vary them a bit. Good luck!