Hardware specs needed for Puppeteer deployment

My web scraping script crashed my API server when I deployed it to production. I think I need a dedicated server to handle the workload. What kind of hardware configuration would work for processing around 1000 requests per hour?

What I tried
I set up a basic DigitalOcean droplet and tested by running 5 concurrent instances.

Current setup:

  • Puppeteer version: latest
  • Platform / OS version: Ubuntu 16
  • Node.js version: 10

Sample code I’m using

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function fetchPageData(targetUrl) {
  console.log("fetchPageData");
  // Every call launches a full Chromium process of its own.
  const browserInstance = await puppeteer.launch({
    headless: true,
    args: [`--window-size=${1920},${1080}`, '--no-sandbox', '--disable-setuid-sandbox']
  });
  const newPage = await browserInstance.newPage();
  // timeout: 0 disables the navigation timeout entirely.
  await newPage.goto(targetUrl, { waitUntil: 'load', timeout: 0 });

  // Debug counter: logs 1..10 once per second, then clears itself.
  let counter = 1;
  const intervalTimer = setInterval(() => {
    console.log(counter++);
    if (counter > 10) clearInterval(intervalTimer);
  }, 1000);

  // Hard 10-second wait before grabbing the HTML (waitFor is deprecated
  // in newer Puppeteer versions).
  await newPage.waitFor(10000);
  return await newPage.content();
  // Note: browserInstance is never closed, so each call leaks a Chromium process.
}

async function parseResults(htmlContent, sourceUrl) {
  console.log("parseResults");
  const $ = cheerio.load(htmlContent);

  let properties = [];
  let totalPages; // never assigned below, so `pages` is returned as undefined
  // The record count is text like "1.234"; strip the thousands separators
  // before converting to a number.
  const foundCount = parseInt($(".js-title .js-total-records").text().trim().replace(/\./g, ""), 10);

  if (!foundCount || foundCount < 8) {
    return {
      success: false,
      error: "Fewer than 8 samples found",
      foundItems: foundCount,
      url: sourceUrl
    };
  }

  $(".results-list > div").each(function () {
    if (properties.length < foundCount) {
      let salePrice = $(this).find(".property-card__values .property-card__price").text();
      const propertyLink = $(this).find(".property-card__header a").attr("href");
      // Strip the currency symbol and thousands separators, e.g. "R$ 1.250.000" -> "1250000".
      salePrice = salePrice.replace("R$", "").trim().replace(/\./g, "");
      if (Number(salePrice)) { // skip cards without a numeric price
        properties.push({
          price: parseInt(salePrice, 10),
          url: "https://www.vivareal.com.br" + propertyLink
        });
      }
    }
  });

  return {
    success: true,
    data: properties,
    samplesFound: properties.length,
    pages: totalPages,
    totalFound: foundCount,
    sourceUrl: sourceUrl
  };
}

You might be killing your server by opening a new browser for each call. Try reusing the same instance and just opening new tabs instead. And yes, Ubuntu 16.04 is quite old; it's best to upgrade.
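A rough sketch of what I mean, assuming a single shared browser (the getSharedBrowser / fetchWithSharedBrowser names are just made up for illustration):

const puppeteer = require('puppeteer');

let sharedBrowser = null;

// Launch Chromium once, lazily, and hand back the same instance afterwards.
async function getSharedBrowser() {
  if (!sharedBrowser) {
    sharedBrowser = await puppeteer.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
  }
  return sharedBrowser;
}

async function fetchWithSharedBrowser(targetUrl) {
  const browser = await getSharedBrowser();
  const page = await browser.newPage(); // a tab, not a whole new browser
  try {
    await page.goto(targetUrl, { waitUntil: 'load', timeout: 60000 });
    return await page.content();
  } finally {
    await page.close(); // release the tab even if goto throws
  }
}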

For 1000 requests per hour, you're looking at roughly 17 requests per minute. I'd start with 4GB RAM and 2 CPU cores, but your code has bigger problems than hardware specs. You're spinning up a new browser for every single request, which is insanely resource heavy; each Puppeteer browser easily eats 100-200MB. Launch one browser at startup instead, then create a new page per request and close it when you're done. Ditch that 10-second wait too; use waitForSelector with a proper selector. I've run similar setups, and with proper browser reuse a 2GB droplet handles way more than what you're doing.
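For the wait, a sketch of the waitForSelector approach; I'm assuming the .js-total-records selector from your parsing code is the element you actually need, and that a shared browser instance is passed in:

// Sketch: wait for the element you need instead of sleeping 10 seconds.
async function fetchPageData(targetUrl, browser) {
  const page = await browser.newPage();
  try {
    await page.goto(targetUrl, { waitUntil: 'domcontentloaded', timeout: 60000 });
    // Resolves as soon as the results counter appears in the DOM, or
    // rejects after 30s instead of hanging forever on slow pages.
    await page.waitForSelector('.js-total-records', { timeout: 30000 });
    return await page.content();
  } finally {
    await page.close();
  }
}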

More hardware won't solve this; you've got a fundamental architecture problem. 1000 requests per hour should run fine on basic specs, but right now you're leaking memory: you're not reusing browsers (as others said), and you never close the ones you launch. The setInterval timer and the 10-second hard wait just make each request hold those resources longer. Set up a proper queue system with Bull or Agenda to handle concurrent requests, and add request pooling and error handling for failed page loads. Start with 4GB RAM and 2 vCPUs, but fix your code first; each request currently eats far more resources than it should, and better hardware won't change that.
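As a sketch of the queue idea with Bull (assuming Redis running on its default localhost port, and a fetch/parse pair like the ones above; the queue name, concurrency, and job shape are made up for illustration):

const Queue = require('bull');

// Connects to Redis at localhost:6379 by default.
const scrapeQueue = new Queue('scrape');

// Process at most 3 jobs at a time, whatever the arrival rate is.
scrapeQueue.process(3, async (job) => {
  const html = await fetchWithSharedBrowser(job.data.url); // from the sketch above
  return parseResults(html, job.data.url);
});

// Enqueue with retries and backoff so failed page loads get another shot.
function enqueueScrape(url) {
  return scrapeQueue.add(
    { url },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } }
  );
}

scrapeQueue.on('failed', (job, err) => {
  console.error(`scrape failed for ${job.data.url}:`, err.message);
});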
