Optimizing a web scraper built with Puppeteer

const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

const fetchAnimeData = async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    const pageNum = req.query.page || 1;
    await page.goto(`https://example-anime-site.com/browse?page=${pageNum}`);

    // Pull one record per card; optional chaining keeps a card with a
    // missing element from throwing inside the page context.
    const seriesData = await page.evaluate(() => {
      const cards = document.querySelectorAll('.anime-card');
      return Array.from(cards).map(card => ({
        title: card.querySelector('.title')?.textContent ?? null,
        episodes: card.querySelector('.episode-count')?.textContent ?? null,
        image: card.querySelector('img')?.src ?? null,
        link: card.querySelector('a')?.href ?? null
      }));
    });

    res.json(seriesData);
  } catch (err) {
    res.status(500).json({ error: 'Data scraping failed' });
  } finally {
    // Close in finally: closing only on the success path leaks a
    // Chromium process on every failed request.
    if (browser) await browser.close();
  }
};

app.get('/anime', fetchAnimeData);

I’m building a fun side project to grab anime info from a website. Cheerio didn’t work out, but Puppeteer did the trick. The problem is that my API takes 10-15 seconds to respond, which is way too slow. I’ve tried tweaking the code, but I can’t get the response time down. Any ideas on how to speed things up? Help!

Having worked on similar projects, I can suggest a few optimizations that might help. First, consider caching the scraped data. If the anime information doesn’t change frequently, you could store it in a database or even in memory, refreshing it periodically rather than scraping on every request.
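A minimal in-memory sketch of that idea, where scrapeAnimePage stands in for the Puppeteer logic from the question and the ten-minute TTL is an arbitrary choice:

// In-memory cache sketch: one entry per page number, refreshed lazily
// once it is older than TTL_MS. scrapeAnimePage is assumed to wrap the
// Puppeteer scrape and resolve to the parsed array.
const TTL_MS = 10 * 60 * 1000;
const cache = new Map(); // pageNum -> { data, fetchedAt }

const getAnimeData = async (pageNum) => {
  const hit = cache.get(pageNum);
  if (hit && Date.now() - hit.fetchedAt < TTL_MS) {
    return hit.data; // fast path: no browser work at all
  }
  const data = await scrapeAnimePage(pageNum); // slow path: real scrape
  cache.set(pageNum, { data, fetchedAt: Date.now() });
  return data;
};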

Another effective approach is to implement pagination on your API. Instead of fetching all data at once, retrieve a smaller subset per request. This can significantly reduce response times.
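A rough sketch, reusing the getAnimeData helper above and two hypothetical limit/offset query parameters:

// Serve a slice of the scraped list so each response stays small.
// limit and offset are made-up params; limit is capped at 50.
app.get('/anime', async (req, res) => {
  const pageNum = req.query.page || 1;
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, 50);
  const offset = parseInt(req.query.offset, 10) || 0;
  const all = await getAnimeData(pageNum);
  res.json(all.slice(offset, offset + limit));
});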

You might also want to explore Playwright as an alternative to Puppeteer. In my experience it often performs faster, and its built-in auto-waiting is more reliable.
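The port is mostly mechanical; a minimal Playwright version of the same launch-and-scrape flow might look like this:

const { chromium } = require('playwright');

// Same shape as the Puppeteer handler: launch, navigate, evaluate.
const scrapeWithPlaywright = async (pageNum) => {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(`https://example-anime-site.com/browse?page=${pageNum}`);
    return await page.evaluate(() =>
      Array.from(document.querySelectorAll('.anime-card')).map(card => ({
        title: card.querySelector('.title')?.textContent ?? null,
        link: card.querySelector('a')?.href ?? null
      }))
    );
  } finally {
    await browser.close();
  }
};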

Lastly, if possible, look into using the site’s API directly instead of scraping. Many anime sites offer APIs that are much faster and more reliable than web scraping solutions.
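If such an endpoint exists, the browser layer disappears entirely. The URL below is purely hypothetical; the site’s docs or its network traffic would reveal the real one:

// Built-in fetch (Node 18+). A plain HTTP request is far cheaper than
// driving a headless browser; the endpoint here is a made-up example.
const fetchFromApi = async (pageNum) => {
  const resp = await fetch(`https://example-anime-site.com/api/browse?page=${pageNum}`);
  if (!resp.ok) throw new Error(`API responded with ${resp.status}`);
  return resp.json();
};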

Yo, have you tried puppeteer-core? It skips the bundled Chromium download and connects to a Chrome install you already have, which keeps things lighter than full Puppeteer. Also, you could cache results for a bit so you don’t have to scrape every single time; something like Redis works well for storing the data temporarily. Those tricks helped me speed up my scraper a ton!
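Something like this with ioredis is what I mean; it assumes a local Redis, and the key scheme and ten-minute expiry are just examples:

const Redis = require('ioredis');
const redis = new Redis(); // assumes Redis running locally on 6379

// Cache the scraped JSON under a per-page key with a 10-minute expiry.
const getCachedAnime = async (pageNum) => {
  const key = `anime:page:${pageNum}`; // made-up key scheme
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const data = await scrapeAnimePage(pageNum); // your Puppeteer scrape
  await redis.set(key, JSON.stringify(data), 'EX', 600);
  return data;
};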

I’ve faced similar challenges with Puppeteer scraping projects. One major optimization that worked wonders for me was implementing a browser pool. Instead of launching a new browser instance for each request, maintain a pool of reusable browsers. This significantly cuts down on startup time.
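A hand-rolled sketch of the idea; POOL_SIZE is something you’d tune to your machine, and a library like generic-pool adds proper checkouts and health checks on top:

const puppeteer = require('puppeteer');

// Tiny round-robin pool: launch a few browsers once at startup, then
// reuse them. Launching Chromium is usually the most expensive step,
// so each request only pays for newPage(), not a full launch.
const POOL_SIZE = 3;
const pool = [];

const initPool = async () => {
  for (let i = 0; i < POOL_SIZE; i++) {
    pool.push(await puppeteer.launch({ headless: true }));
  }
};

let next = 0;
const getBrowser = () => pool[next++ % pool.length];

// In the handler: const page = await getBrowser().newPage();
// Close the page when done, never the pooled browser itself.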

Another trick is to use page.setRequestInterception(true) to block unnecessary resources like images, fonts, and stylesheets. This can dramatically speed up page loads.
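In Puppeteer that looks like the snippet below, added right after newPage() and before page.goto():

await page.setRequestInterception(true);
page.on('request', (request) => {
  // Abort heavy assets the scraper never reads; let the document,
  // scripts, and XHR through so the data still loads.
  const blocked = ['image', 'font', 'stylesheet', 'media'];
  if (blocked.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});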

If you’re fetching data from multiple pages, consider running requests in parallel using Promise.all(). This can really boost performance when scraping large datasets.
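For example, scraping several listing pages concurrently from one shared browser, one tab per page:

// Open one tab per page number and scrape them all concurrently.
const scrapeMany = async (browser, pageNums) => {
  return Promise.all(
    pageNums.map(async (n) => {
      const page = await browser.newPage();
      try {
        await page.goto(`https://example-anime-site.com/browse?page=${n}`);
        return await page.evaluate(() =>
          Array.from(document.querySelectorAll('.anime-card'))
            .map(card => card.querySelector('.title')?.textContent ?? null)
        );
      } finally {
        await page.close(); // free the tab even if this page failed
      }
    })
  );
};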

Lastly, if the target site can handle it, raise the number of pages you scrape concurrently. Just be mindful of rate limiting so you don’t get blocked.
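One dependency-free way to keep that tunable is a small semaphore-style helper, where MAX_CONCURRENT is the knob you’d adjust:

// Keep at most MAX_CONCURRENT scrapes in flight at once so a large
// batch doesn't hammer the target site in a single burst.
const MAX_CONCURRENT = 3;

const runLimited = async (tasks) => {
  const results = [];
  const executing = new Set();
  for (const task of tasks) {
    const p = task().finally(() => executing.delete(p));
    executing.add(p);
    results.push(p);
    if (executing.size >= MAX_CONCURRENT) {
      await Promise.race(executing); // wait for a free slot
    }
  }
  return Promise.all(results);
};

// Usage: runLimited(pageNums.map(n => () => scrapeAnimePage(n)));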

These tweaks helped me reduce scraping times from minutes to seconds. Hope they help you too!