Extracting Notion Content via Web Scraping

I built a bot to access a Notion page and interact with its content, but it isn’t waiting for all quotes to load after selecting the first book entry. It only captures 4 out of 25 quotes unless I force a long delay using a sleep function, which clearly isn’t the optimal solution. How can I ensure that the bot waits until all quotes are displayed before proceeding?

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

class NotionDataCollector:
    def __init__(self):
        self.browser = None

    def initiate_browser(self):
        chrome_opts = Options()
        chrome_opts.add_argument('--disable-extensions')
        self.browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_opts)

    def load_notion_page(self):
        self.browser.get('https://example.com/notion-content')

    def click_first_entry(self):
        entries = WebDriverWait(self.browser, 10).until(
            EC.presence_of_all_elements_located((By.XPATH, '//div[@class="entry-item"]'))
        )
        if entries:
            entries[0].click()

    def retrieve_all_quotes(self):
        quote_nodes = WebDriverWait(self.browser, 10).until(
            EC.presence_of_all_elements_located((By.XPATH, '//blockquote[@class="quote-container"]'))
        )
        quotes = [node.text for node in quote_nodes]
        print('Quotes count:', len(quotes))
        return quotes

collector = NotionDataCollector()
collector.initiate_browser()
collector.load_notion_page()
collector.click_first_entry()
collector.retrieve_all_quotes()

In my experience, waiting for dynamically loaded elements can be tricky when dealing with web pages like Notion. I solved a similar problem by waiting for a change in the DOM rather than a fixed time interval. I switched to using a combination of expected conditions like visibility and an additional check for the number of added elements compared to a previous state. Leveraging JavaScript’s MutationObserver via Selenium’s execute_script also helped ensure that all elements were fully loaded before proceeding to scrape the data.
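As a sketch of the count-comparison idea above, a custom expected-condition callable can remember the element count from the previous poll and only succeed once the count stops growing. The class name `CountStabilized` is illustrative; the locator is taken from the question's code, and the callable itself needs nothing from Selenium beyond the `driver` object it is handed:

```python
class CountStabilized:
    """Expected-condition callable for WebDriverWait: returns the matched
    elements once their count is unchanged between successive polls."""

    def __init__(self, locator):
        self.locator = locator
        self.previous = -1  # count seen on the previous poll

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if elements and len(elements) == self.previous:
            return elements  # count stable since last poll: done
        self.previous = len(elements)
        return False         # still growing (or empty): keep waiting

# Usage inside retrieve_all_quotes (assumes the imports from the question):
# quote_nodes = WebDriverWait(self.browser, 10, poll_frequency=0.5).until(
#     CountStabilized((By.XPATH, '//blockquote[@class="quote-container"]')))
```

`WebDriverWait.until` accepts any callable taking the driver, so this drops in where `EC.presence_of_all_elements_located` was used; the difference is that the built-in condition returns as soon as one element exists, while this one waits for the count to settle.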

I used a loop that checks whether the quote count has stabilized before continuing. It polls the page periodically and only moves on once no new quotes are added. This dynamic check avoids relying on a fixed sleep time and works much more smoothly.
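The polling loop described here can be written as a small helper that takes any zero-argument count function, so the stabilization logic is not tied to Selenium. The function name and parameters are illustrative:

```python
import time

def wait_until_stable(count_fn, poll_interval=0.5, stable_polls=2, timeout=15):
    """Poll count_fn until its value is unchanged for `stable_polls`
    consecutive polls; raise TimeoutError if it never settles."""
    deadline = time.monotonic() + timeout
    last, streak = -1, 0
    while time.monotonic() < deadline:
        current = count_fn()
        if current == last:
            streak += 1
            if streak >= stable_polls:
                return current  # count has stabilized
        else:
            last, streak = current, 0  # count changed: reset the streak
        time.sleep(poll_interval)
    raise TimeoutError('element count never stabilized')

# With the question's code, count_fn could be something like:
# lambda: len(self.browser.find_elements(
#     By.XPATH, '//blockquote[@class="quote-container"]'))
```

Requiring two or more identical consecutive polls guards against sampling the page mid-render, and the timeout keeps a broken page from hanging the bot forever.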