Extracting data from database rows in Notion workspace using Selenium automation

I’m trying to build an automated data extraction tool for Notion database entries using Selenium. My goal is to collect information from pages linked within database rows.

Here’s my current workflow:

  1. Navigate to the database view
  2. Iterate through each database row
  3. Hover over entries to reveal action buttons
  4. Click the preview button to open page content
  5. Extract text data from the opened page
  6. Continue to next entry

The main issue I’m facing is that Selenium only detects around 26-28 visible rows out of 47 total entries in the database. Even after scrolling to load more content, my script can’t locate the remaining rows.

Here’s my function for processing individual entries:

def process_database_entry(browser: webdriver.Chrome, entry_index: int) -> str:
    """
    Processes a single database entry and extracts its content.
    """
    
    print(f"Working on entry {entry_index}...")
    
    entry_selector = f"//*[@id='notion-app']/div/div[1]/div/div[1]/main/div/div/div[3]/div[2]/div/div/div/div[3]/div[2]/div[{entry_index}]/div/div[1]/div/div[2]/div/div"
    print(f"Finding entry {entry_index}...")
    
    try:
        entry_item = WebDriverWait(browser, 15).until(
            EC.presence_of_element_located((By.XPATH, entry_selector))
        )
        print(f"Entry {entry_index} found successfully.")
    except Exception as error:
        print(f"Failed to find entry {entry_index}: {error}")
        return ""
    
    # scroll container for entries beyond the 16th position
    if entry_index > 16:
        for attempt in range(10):  # try scrolling up to 10 times, 40px each
            try:
                scroll_database_view(browser, entry_item, 40)
                print(f"Scrolled toward entry {entry_index}.")
                break  # exit the loop once the scroll succeeds
            except Exception as error:
                print(f"Scrolling to bring entry {entry_index} into view: {error}")
    
    # hover over entry after scrolling
    move_to_element(browser, entry_item)
    
    # find and click the preview button
    print(f"Looking for preview button on entry {entry_index}...")
    
    try:
        preview_btn = WebDriverWait(browser, 15).until(
            EC.element_to_be_clickable(
                (By.XPATH, "//div[@aria-label='Open in side peek']"))
        )
        print(f"Clicking preview button for entry {entry_index}...")
        preview_btn.click()
    except Exception as error:
        print(f"Preview button not found for entry {entry_index}, error: {error}")
        return ""
    
    time.sleep(4)
    
    # get text from the preview pane
    print(f"Getting content from preview pane for entry {entry_index}...")
    try:
        preview_content = WebDriverWait(browser, 15).until(
            EC.presence_of_element_located(
                (By.CLASS_NAME, "notion-page-content"))
        )
        extracted_text = preview_content.text
        print(f"Content extracted for entry {entry_index}.")
        return extracted_text
    except Exception as error:
        print(f"Failed to extract content from entry {entry_index}: {error}")
        return ""

And here’s my function to count total entries:

def count_database_entries(browser: webdriver.Chrome, database_selector: str) -> int:
    """
    Counts the total number of entries in the Notion database.
    """
    
    print("Counting total database entries...")
    entry_elements = browser.find_elements(By.XPATH, database_selector)
    entry_count = len(entry_elements)
    print(f"Found {entry_count} entries in database")
    return entry_count

The problem seems to be that my script can’t locate entries that aren’t initially visible. I need this to work for much larger databases with hundreds of entries. Any suggestions on how to handle lazy-loaded content in Notion databases?

XPath targeting by row index is what’s killing you here. Notion shuffles DOM elements around when it does virtual scrolling - row 30 suddenly becomes row 5. Grab all visible rows using class selectors instead, process what’s on screen, scroll down, repeat. Way more reliable than assuming rows stay put.
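A rough sketch of that loop. Note that `.notion-collection-item` is a guess at the row-container class - inspect your database view’s DOM and swap in whatever class Notion currently renders - and this keys the seen-set on row text, which only works if rows don’t repeat verbatim:

```python
import time

# Assumed row-container class; check the live DOM and adjust.
ROW_SELECTOR = ".notion-collection-item"


def harvest_visible_rows(browser, seen: set) -> list:
    """Grab whatever rows are rendered right now, skipping ones already handled."""
    fresh = []
    # "css selector" is the string By.CSS_SELECTOR expands to in Selenium
    for row in browser.find_elements("css selector", ROW_SELECTOR):
        text = row.text
        if text and text not in seen:
            seen.add(text)
            fresh.append(text)
    return fresh


def extract_all_rows(browser, pause: float = 1.0, max_idle: int = 3) -> list:
    """Process what's on screen, scroll a viewport, repeat until nothing new shows up."""
    seen: set = set()
    results: list = []
    idle = 0
    while idle < max_idle:
        fresh = harvest_visible_rows(browser, seen)
        results.extend(fresh)
        idle = 0 if fresh else idle + 1
        # scroll one viewport so Notion renders the next virtual batch
        browser.execute_script("window.scrollBy(0, window.innerHeight);")
        time.sleep(pause)
    return results
```

The idle counter gives Notion a few scrolls’ worth of grace before deciding the database is exhausted.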

Selenium with Notion is a nightmare - you’re constantly fighting dynamic loading and DOM changes instead of actually getting work done.

I had this same issue extracting data from our team’s huge Notion databases. Wasted weeks trying to make Selenium work with all the scrolling and element changes. Never got it fully stable.

Switched to Latenode and used Notion’s API instead of scraping the UI. Problem solved. No more lazy loading headaches or missing rows.

Latenode lets you grab entire databases through API endpoints, then process each page’s content the same way. Much more reliable than hovering over buttons and clicking preview panes.

Best part? Set it to run automatically on a schedule, so the extraction becomes hands-off instead of something you run by hand every time.

For hundreds of entries, API access is way faster too. No waiting for pages to load or elements to become clickable.
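If you want to try the API route in plain Python first, the relevant call is the database query endpoint, which pages through results with a cursor, so lazy loading stops being a factor entirely. A rough sketch with requests - the token and database ID are placeholders you’d fill in from your own integration settings:

```python
from typing import Optional

import requests

NOTION_TOKEN = "secret_..."        # placeholder: your integration token
DATABASE_ID = "your-database-id"   # placeholder

HEADERS = {
    "Authorization": f"Bearer {NOTION_TOKEN}",
    "Notion-Version": "2022-06-28",  # pin an API version
    "Content-Type": "application/json",
}


def next_payload(data: dict) -> Optional[dict]:
    """Build the follow-up request body from a query response, or None when done."""
    if data.get("has_more"):
        return {"start_cursor": data["next_cursor"]}
    return None


def fetch_all_pages(database_id: str) -> list:
    """Query the database endpoint, following the cursor until every row is returned."""
    url = f"https://api.notion.com/v1/databases/{database_id}/query"
    pages, payload = [], {}
    while payload is not None:
        resp = requests.post(url, headers=HEADERS, json=payload, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        pages.extend(data["results"])
        payload = next_payload(data)
    return pages
```

Every row comes back whether it was ever rendered on screen or not, which is exactly the failure mode the Selenium approach fights.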

Notion’s lazy loading really messes with Selenium. Try infinite scroll - just keep scrolling down until nothing new loads, then count your entries. Also, check if Notion virtualization is in play - you might need to trigger visibility changes for XPath to find them.
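“Keep scrolling until nothing new loads” can be made concrete by comparing scroll height between passes. One caveat: Notion often scrolls an inner container (something like `.notion-scroller`) rather than the document body, so you may need to point the script at that element instead - treat this body-scroll version as the simplest case:

```python
import time


def scroll_until_exhausted(browser, pause: float = 1.0, max_stalls: int = 2) -> int:
    """Scroll to the bottom until the page height stops growing; return scroll count."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    stalls = 0
    scrolls = 0
    while stalls < max_stalls:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give Notion time to render the next batch
        scrolls += 1
        height = browser.execute_script("return document.body.scrollHeight")
        if height == last_height:
            stalls += 1  # no growth; a couple of stalls in a row means we're done
        else:
            stalls = 0
            last_height = height
    return scrolls
```

After this returns you can run your count_database_entries pass - though rows may still be virtualized, so pair it with a visible-rows harvest rather than fixed indices.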

Had the same problem scraping Notion databases with Selenium. Notion uses virtual scrolling - it only renders rows you can see, so your xpath targeting specific row numbers breaks when the DOM changes during scrolling. Don’t use fixed xpath indices. Instead, grab all visible rows with a generic selector (like the row container class), process them, then scroll for more. You need proper scroll detection too - scroll bit by bit and wait for new content to load before moving on. I check page height or count visible elements after each scroll to know when new stuff has loaded. For big databases, track what you’ve already processed since virtual scrolling can make old rows pop up again.
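For the “track what you’ve already processed” part, keying on something more stable than row text avoids false duplicates when two rows hold identical text. Notion blocks generally expose a data-block-id attribute - worth confirming in your own DOM before relying on it. A sketch of the bookkeeping:

```python
def row_key(row) -> str:
    """Prefer Notion's data-block-id attribute (assumed present) over raw text."""
    return row.get_attribute("data-block-id") or row.text


def select_unprocessed(rows, seen: set) -> list:
    """Filter out rows already handled on a previous pass; record the rest in `seen`."""
    fresh = []
    for row in rows:
        key = row_key(row)
        if key and key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh
```

Feed it the elements from each find_elements pass: rows that virtual scrolling re-renders come back with the same block id, so they get skipped automatically.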

Virtual scrolling in Notion databases is a pain. I’ve had better luck with a dynamic approach instead of fixed XPath indices. Here’s what works for me: scroll continuously while watching for new elements to load. Skip positional indices and target the actual data cells using data attributes or consistent class patterns. Notion batches its loading, so pause between scrolls to let everything catch up. I scroll in small chunks and use element count checks plus DOM mutation observers to make sure content actually rendered before trying to click anything.
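The element-count check can be done with plain polling, no MutationObserver required. A sketch of the small-chunk scroll, again with `.notion-collection-item` standing in for whatever row class your view actually uses:

```python
import time


def wait_for_count_change(browser, css_selector: str, old_count: int,
                          timeout: float = 5.0, poll: float = 0.25) -> int:
    """Poll until the number of matched elements differs from old_count, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # "css selector" is the string By.CSS_SELECTOR expands to in Selenium
        count = len(browser.find_elements("css selector", css_selector))
        if count != old_count:
            return count
        time.sleep(poll)
    return old_count  # timed out: nothing new rendered


def scroll_in_chunks(browser, css_selector: str = ".notion-collection-item",
                     step_px: int = 200, max_steps: int = 50,
                     wait_timeout: float = 2.0) -> int:
    """Scroll a little at a time, pausing until newly rendered rows show up."""
    count = len(browser.find_elements("css selector", css_selector))
    for _ in range(max_steps):
        browser.execute_script(f"window.scrollBy(0, {step_px});")
        new_count = wait_for_count_change(browser, css_selector, count,
                                          timeout=wait_timeout)
        if new_count == count:
            break  # nothing new after a small scroll; likely at the end
        count = new_count
    return count
```

The small step size matters: one big jump can skip past Notion’s render window, while 200px chunks keep the virtualized rows flowing in batch by batch.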