Extracting all quotes from Notion page

ameliat · March 28, 2025, 7:10am

I’m trying to scrape quotes from a Notion page using Selenium. My script opens the page and clicks on the first book, but it only gets 4 out of 25 quotes. Adding a fixed delay works, but it’s not efficient. How can I make sure all quotes are loaded before scraping?

Here’s a simplified version of my code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

class NotionScraper:
    def __init__(self):
        self.driver = webdriver.Chrome()

    def get_page(self):
        self.driver.get('notion_page_url')

    def click_first_book(self):
        books = WebDriverWait(self.driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, 'book-item')))
        books[0].click()

    def extract_quotes(self):
        quote_elements = WebDriverWait(self.driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote-text')))
        return [q.text for q in quote_elements]

scraper = NotionScraper()
scraper.get_page()
scraper.click_first_book()
quotes = scraper.extract_quotes()
print(f'Found {len(quotes)} quotes')

Any ideas on how to make sure all quotes are loaded before scraping?

amelial · April 5, 2025, 1:57pm

hey there, i’ve dealt with similar stuff before. have u tried using a wait condition that checks for a specific element that appears when all quotes are loaded? like maybe a ‘no more quotes’ message or something. u could do something like:

wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'end-of-quotes')))

that might work better than just waiting a fixed time. good luck!
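If the page has no such end marker, a related option is to wait on the number of quotes instead. Here is a minimal sketch, assuming the expected total is known in advance (25 in the original question). `element_count_at_least` is a hypothetical helper, not a Selenium built-in; it is written as a callable so it can be passed to `WebDriverWait.until`, which calls it with the driver until it returns something truthy:

```python
class element_count_at_least:
    """Custom wait condition (hypothetical helper, not part of Selenium).

    Returns the matching elements once at least `minimum` are present,
    otherwise False so that WebDriverWait keeps polling.
    """

    def __init__(self, locator, minimum):
        self.locator = locator  # e.g. (By.CLASS_NAME, 'quote-text')
        self.minimum = minimum

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        return elements if len(elements) >= self.minimum else False
```

Usage would look like `quotes = WebDriverWait(driver, 30).until(element_count_at_least((By.CLASS_NAME, 'quote-text'), 25))`. If fewer than 25 quotes ever load, the wait raises `TimeoutException`, so this only fits cases where the total really is known.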

Nate_91Surf · April 5, 2025, 9:24am

I’ve faced similar issues when scraping dynamic content. The problem is likely that Notion loads quotes lazily as you scroll. Here’s what worked for me:

After clicking the book, I used a scroll-and-wait strategy: scroll the page step by step and give new content a chance to load. Instead of relying on a fixed delay, you check dynamically whether more quotes have appeared. I keep scrolling in a loop until the number of quotes stops increasing.

Here’s a rough idea of the code:

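from selenium.common.exceptions import TimeoutException

def scroll_and_extract(self):
    last_count = 0
    while True:
        self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        try:
            # Wait until more quotes exist than on the previous pass
            WebDriverWait(self.driver, 10).until(
                lambda d: len(d.find_elements(By.CLASS_NAME, 'quote-text')) > last_count
            )
        except TimeoutException:
            # No new quotes within 10 seconds: assume everything is loaded
            break
        last_count = len(self.driver.find_elements(By.CLASS_NAME, 'quote-text'))
    quotes = self.driver.find_elements(By.CLASS_NAME, 'quote-text')
    return [q.text for q in quotes]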

This approach should be more reliable and efficient than using fixed delays. I hope this helps!
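One caveat with `window.scrollTo`: it only helps if the window itself is what scrolls. If the quotes sit inside a nested scrollable container (an assumption worth checking in the browser's dev tools, not something confirmed in this thread), scrolling the last loaded quote into view is a possible alternative. In this sketch `scroll_to_last` is a hypothetical helper, and the default locator mirrors the `quote-text` class used above:

```python
def scroll_to_last(driver, locator=('class name', 'quote-text')):
    """Scroll the last matching element into view; return the element count.

    ('class name', ...) is the string value behind By.CLASS_NAME.
    """
    elements = driver.find_elements(*locator)
    if elements:
        # scrollIntoView scrolls whichever ancestor container is scrollable
        driver.execute_script('arguments[0].scrollIntoView();', elements[-1])
    return len(elements)
```

It could replace the `execute_script('window.scrollTo(...)')` line inside the loop, with the returned count used as the new baseline.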

charlottew · April 2, 2025, 8:46pm

I’ve encountered this issue with Notion’s dynamic loading before. One effective approach is to implement a scroll-and-wait strategy combined with a check for new content. Here’s a method that’s worked well for me:

def extract_all_quotes(self):
    last_quote_count = 0
    while True:
        self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # Brief pause to allow content to load
        quotes = self.driver.find_elements(By.CLASS_NAME, 'quote-text')
        if len(quotes) == last_quote_count:
            break
        last_quote_count = len(quotes)
    return [q.text for q in quotes]

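This method keeps scrolling and checking for new quotes until a pass adds none. Because it re-checks after every scroll, it adapts to varying load times better than a single fixed delay, though the 2-second pause may need to be longer on slow connections, since a batch that arrives late would end the loop early. Remember to import the time module if you use this approach.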
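A possible refinement of the loops above is to stop only after the count has stayed unchanged for several checks in a row, and to cap the total number of rounds so the loop always terminates. The sketch below is written against plain callables so it can be tried without a browser; `scroll_until_stable`, `stable_checks`, and `max_rounds` are illustrative names, not anything from Selenium.

```python
import time


def scroll_until_stable(scroll, count, pause=1.0, stable_checks=2, max_rounds=50):
    """Call scroll() repeatedly until count() stops growing.

    scroll: callable that scrolls the page once
    count:  callable returning the number of quotes currently loaded
    Stops after `stable_checks` consecutive rounds with no growth,
    or after `max_rounds` rounds. Returns the last observed count.
    """
    last = count()
    unchanged = 0
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give lazily loaded content time to arrive
        current = count()
        if current > last:
            last = current
            unchanged = 0
        else:
            unchanged += 1
            if unchanged >= stable_checks:
                break
    return last
```

With Selenium, `scroll` could be `lambda: driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')` and `count` could be `lambda: len(driver.find_elements(By.CLASS_NAME, 'quote-text'))`. Requiring two unchanged checks trades a little extra waiting for less risk of stopping on one slow batch.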