Extracting data from Notion table cells using Selenium: How to access all rows?

I’m trying to create a dataset from Notion pages stored in a table on a Notion site. My current script can:

  1. Open the site
  2. Go through table rows
  3. Hover to reveal the ‘Open’ button
  4. Click ‘Open’ to access the Notion page
  5. Get the page content
  6. Move to the next cell and repeat

The problem is that my script only detects 26-28 of the 47 rows, even after scrolling. I've tried adding a function to scroll the hidden elements after the 16th row into view, but it isn't working as expected.

Here’s a simplified version of my code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_cell_content(driver, row_num):
    # Locate the target row's cell inside the Notion app container
    cell_xpath = f'//div[@id="notion-app"]//div[{row_num}]//div[@class="cell-content"]'
    cell = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, cell_xpath)))

    # Rows past the 16th sit outside the rendered viewport, so scroll them into view first
    if row_num > 16:
        scroll_container(driver, cell, 40)  # helper defined elsewhere

    # Hover to reveal the 'Open' button, then open the row's page in the side peek
    hover_over(driver, cell)  # helper defined elsewhere
    open_button = driver.find_element(By.XPATH, '//div[@aria-label="Open in side peek"]')
    open_button.click()

    # Wait for the page body to render and return its text
    content = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'notion-page-content')))
    return content.text

# Main issue: can't detect all rows
total_rows = len(driver.find_elements(By.XPATH, '//div[@class="table-row"]'))
print(f'Detected rows: {total_rows}')  # Only shows 26-28 rows

Any ideas on how to access all 47 rows? I need a solution that can handle tables with 400+ rows too. Thanks!

Hey, have you tried using JavaScript to scroll? Something like this might work:

// scroll the table container to the bottom so Notion renders more rows
var tableContainer = document.querySelector('.notion-table-view');
tableContainer.scrollTop = tableContainer.scrollHeight;

Run that in a loop with some waits in between. It should force Notion to load all the rows, and then you can grab them all at once. Good luck!
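
In Python terms, a rough wrapper for that idea might look like this (just a sketch; the '.notion-table-view' selector comes from the snippet above and may need adjusting for your page):

import time

def scroll_table_to_bottom(driver, rounds=20, pause=1.0):
    # Repeatedly jump the table container to its bottom so Notion keeps rendering rows
    for _ in range(rounds):
        driver.execute_script(
            "var c = document.querySelector('.notion-table-view');"
            "if (c) { c.scrollTop = c.scrollHeight; }"
        )
        time.sleep(pause)  # give the next batch of rows time to render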

I encountered a similar situation while scraping data from dynamically loaded tables. The issue usually arises from lazy loading, where the page only renders elements that are in view to optimize performance.

In my experience, a better approach has been to gradually scroll through the table container, forcing it to render additional rows. For example, you could implement a loop that executes JavaScript to scroll down a bit and then waits for a short period. This will give Notion time to load the next batch of rows.

Once scrolling stops producing new content, you can safely assume all rows have been loaded and use driver.find_elements to grab them.
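
A minimal sketch of that loop, assuming the '.notion-table-view' container and the 'table-row' class from the question (both selectors may differ on your page):

import time
from selenium.webdriver.common.by import By

def load_all_rows(driver, pause=1.5, stable_rounds=3):
    row_xpath = '//div[@class="table-row"]'  # row selector taken from the question
    last_count, stable = 0, 0
    # Keep scrolling one viewport at a time until the rendered row count stops changing
    while stable < stable_rounds:
        driver.execute_script(
            "var c = document.querySelector('.notion-table-view');"
            "if (c) { c.scrollTop += c.clientHeight; }"
        )
        time.sleep(pause)  # give Notion time to render the next batch
        count = len(driver.find_elements(By.XPATH, row_xpath))
        stable = stable + 1 if count == last_count else 0
        last_count = count
    return driver.find_elements(By.XPATH, row_xpath)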

This method might require some adjustments depending on the page’s behavior, especially with much larger datasets, but it should provide a baseline solution.

Have you considered using the Notion API instead of Selenium? It's designed specifically for interacting with Notion data and could be more reliable for your use case. The API lets you query a database (table) and page through all of its rows with a simple cursor, so you never have to fight lazy loading or rendering in the browser.
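
For reference, a paginated database query against the official API looks roughly like this (the token and database ID below are placeholders, and your integration has to be shared with the database first):

import requests

NOTION_TOKEN = "integration-token-placeholder"  # hypothetical: your integration token
DATABASE_ID = "database-id-placeholder"         # hypothetical: the table's database ID

def fetch_all_rows():
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID}/query"
    headers = {
        "Authorization": f"Bearer {NOTION_TOKEN}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    }
    rows, payload = [], {"page_size": 100}  # the API returns at most 100 results per request
    while True:
        resp = requests.post(url, headers=headers, json=payload)
        resp.raise_for_status()
        data = resp.json()
        rows.extend(data["results"])  # each result is one row (a page object)
        if not data.get("has_more"):
            return rows
        payload["start_cursor"] = data["next_cursor"]  # resume from the cursor on the next request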

If you must stick with Selenium, try implementing a recursive scroll function. Start by scrolling to the bottom of the visible table, wait for new content to load, then check if more rows appeared. Repeat this process until no new rows are loaded. This approach should work for tables of any size, including those with 400+ rows.

Remember to add appropriate waits between scrolls to allow time for content to render. You might also want to implement error handling to catch any potential timeout issues during the scraping process.
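
To illustrate the error-handling point, one way to guard the per-row step (reusing the get_cell_content function from the question) is a small retry wrapper:

from selenium.common.exceptions import TimeoutException

def get_cell_content_safe(driver, row_num, retries=2):
    # Retry a couple of times before skipping a row that keeps timing out
    for attempt in range(retries + 1):
        try:
            return get_cell_content(driver, row_num)  # function from the question
        except TimeoutException:
            if attempt == retries:
                print(f"Row {row_num}: timed out after {retries + 1} attempts, skipping")
                return None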