Trouble with infinite scrolling in web scraping

I’m trying to scrape a website that has infinite scrolling in a specific div. The div has the id left_container_scroll and contains several a tags I need to grab. I can’t figure out how to make the scrolling work properly in my code.

Here’s what I’ve tried:

const containerSelector = '#left_container_scroll';

// This part seems to be the issue
const container = await page.evaluate(selector => {
  return document.querySelector(selector);
}, containerSelector);

let lastHeight = await page.evaluate('container.scrollHeight');
await page.evaluate('window.scrollTo(0, container.scrollHeight)');
await page.waitForFunction(`container.scrollHeight > ${lastHeight}`);

But I keep getting an error saying ‘container is not defined’. Any ideas on how to fix this and make the infinite scrolling work for web scraping?

hey, have u tried using the Intersection Observer API? it’s pretty sweet for infinite scrolling. here’s a quick example:

const observer = new IntersectionObserver(entries => {
  if (entries[0].isIntersecting) {
    loadMoreContent();
  }
}, { root: document.querySelector('#left_container_scroll') });

observer.observe(document.querySelector('#scroll-trigger'));

just add a trigger element at the bottom of ur content and it’ll do the trick. good luck!

I’ve dealt with infinite scrolling in web scraping before, and it can be tricky. One approach that’s worked well for me is using Selenium WebDriver instead of Puppeteer. With Selenium, you can simulate scrolling more easily:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)  # url is the address of the page you want to scrape

container = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'left_container_scroll'))
)

last_height = driver.execute_script('return arguments[0].scrollHeight;', container)
while True:
    driver.execute_script('arguments[0].scrollTo(0, arguments[0].scrollHeight);', container)
    time.sleep(2)
    new_height = driver.execute_script('return arguments[0].scrollHeight;', container)
    if new_height == last_height:
        break
    last_height = new_height

links = container.find_elements(By.TAG_NAME, 'a')

This method has been reliable for me across various sites with infinite scrolling. Just remember to adjust the sleep time based on the site’s loading speed.

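I’ve encountered similar issues with infinite scrolling before. The problem in your code is that container only exists as a variable in your Node.js script, while the strings you pass to page.evaluate and page.waitForFunction run inside the browser page, where no such variable is defined. Returning the element from page.evaluate doesn’t help either, because the return value has to be serializable and a DOM node isn’t. Here’s a modified approach that should work: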

const containerSelector = '#left_container_scroll';

while (true) {
  const previousHeight = await page.evaluate(`document.querySelector('${containerSelector}').scrollHeight`);
  await page.evaluate(`document.querySelector('${containerSelector}').scrollTo(0, document.querySelector('${containerSelector}').scrollHeight)`);
  await new Promise(resolve => setTimeout(resolve, 2000)); // Wait for content to load (page.waitForTimeout was removed in newer Puppeteer versions)
  const currentHeight = await page.evaluate(`document.querySelector('${containerSelector}').scrollHeight`);
  if (currentHeight === previousHeight) {
    break; // No more content to load
  }
}

// Now you can extract the 'a' tags
const links = await page.evaluate(`Array.from(document.querySelectorAll('${containerSelector} a')).map(a => a.href)`);

This approach keeps scrolling until no new content is loaded. Remember to adjust the timeout as needed based on the website’s loading speed.
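If you’d rather not build the scripts by string interpolation or rely on a fixed delay, here is a rough sketch of the same loop that passes the selector into the page as an argument and waits with page.waitForFunction until the container actually grows. The names scrollToEnd and timeoutMs are made up for illustration, and it assumes page is a Puppeteer Page and that "no growth within the timeout" means the list is finished, which may not hold on a slow site.

```javascript
// Sketch only: scroll a container until its scrollHeight stops growing.
// `scrollToEnd` and `timeoutMs` are illustrative names, not Puppeteer APIs.
async function scrollToEnd(page, selector, timeoutMs = 5000) {
  while (true) {
    // Runs in the browser: remember the current height, then jump to the bottom.
    const previousHeight = await page.evaluate((sel) => {
      const el = document.querySelector(sel);
      const height = el.scrollHeight;
      el.scrollTop = height;
      return height;
    }, selector);

    try {
      // Resolves as soon as the container is taller than before.
      await page.waitForFunction(
        (sel, height) => document.querySelector(sel).scrollHeight > height,
        { timeout: timeoutMs },
        selector,
        previousHeight
      );
    } catch (err) {
      // Assumption: any failure here (normally a timeout) means nothing more loaded.
      break;
    }
  }
}
```

You would call it as await scrollToEnd(page, '#left_container_scroll') and then extract the links as before. Catching every error is deliberately crude; in real code you may want to rethrow anything that isn’t a timeout.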

When dealing with infinite scrolling inside a specific div, a few things can go wrong:

  1. Scrolling the wrong element – Make sure you’re targeting the scrollTop of #left_container_scroll, not document.body or window.
  2. Scroll not triggering loading – Some sites rely on scroll event listeners or an IntersectionObserver to load new content. If your scrolling is too fast or unnatural (e.g., instantly setting scrollTop = scrollHeight), the site might not register it.
  3. Content loads with delay – After each scroll, you often need to wait a second or two before checking for new content. Otherwise, your script may think nothing new was added and stop early.
  4. Loop exit condition – You’ll want to keep scrolling until the number of <a> tags in the container stops increasing. Comparing counts before and after each scroll is a common approach.

In short, make sure you’re:

  • Targeting the correct scrollable element
  • Waiting a bit after each scroll for new content to load
  • Comparing the number of loaded links to detect when it’s done
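Putting those points together, a minimal sketch might look like the following. It assumes page is a Puppeteer Page; collectLinks, stepPx, pauseMs and maxIdleRounds are hypothetical names and defaults you would tune for the site. It scrolls the container itself in small steps, pauses after each step, and stops once the link count has stayed the same for a few rounds while the container is already at the bottom.

```javascript
// Sketch only: step-scroll a container and stop when the <a> count stops growing.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function collectLinks(page, selector, { stepPx = 500, pauseMs = 1000, maxIdleRounds = 3 } = {}) {
  let lastCount = 0;
  let idleRounds = 0;

  while (idleRounds < maxIdleRounds) {
    // Scroll the container (not window) by a modest step.
    await page.evaluate((sel, step) => {
      document.querySelector(sel).scrollTop += step;
    }, selector, stepPx);

    // Give the site time to fetch and render the next batch.
    await sleep(pauseMs);

    const { count, atBottom } = await page.evaluate((sel) => {
      const el = document.querySelector(sel);
      return {
        count: el.querySelectorAll('a').length,
        atBottom: el.scrollTop + el.clientHeight >= el.scrollHeight - 1,
      };
    }, selector);

    if (count > lastCount) {
      lastCount = count;
      idleRounds = 0; // new links appeared, keep going
    } else if (atBottom) {
      idleRounds += 1; // at the bottom and nothing new arrived
    }
  }

  // Collect the href of every link inside the container.
  return page.evaluate(
    (sel) => Array.from(document.querySelectorAll(`${sel} a`), (a) => a.href),
    selector
  );
}
```

Requiring several idle rounds at the bottom, rather than stopping at the first unchanged count, is one way to avoid quitting early when a batch is slow to load.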