How can I extract data from Notion pages within a large table using Selenium?

I’m trying to create a dataset from Notion pages in a big table. My current script can:

  1. Open the site
  2. Go through table rows
  3. Hover to show the ‘Open’ button
  4. Click ‘Open’ to view the page
  5. Get the page content
  6. Move to the next row and repeat

But I’m stuck: the script only finds 26–28 of the 47 rows, and even after scrolling it can’t see any more. I’ve tried to work around this with a function that scrolls once it passes the 16th row, but it isn’t working reliably.

Here’s my main issue: the script isn’t finding all the rows in the first place. I’ve written a function to count rows, but it isn’t catching them all.

I could do the 47 entries by hand, but I need this to work for a table with 400 rows. Any ideas on how to fix this? I’m really stuck and could use some help!

I’ve faced a similar challenge with Notion tables and Selenium. The key is dealing with Notion’s dynamic loading. Here’s what worked for me:

Instead of relying on Selenium to find all rows at once, implement a ‘scroll and wait’ approach: scroll the page in small increments, then pause to let new rows load. You can do this by executing JavaScript to scroll and then using WebDriverWait to check for new elements.
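Here’s a minimal sketch of that pattern in Python, assuming a hypothetical ROW_SELECTOR for Notion’s row elements (inspect your table to find the real one):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException

    # Hypothetical selector; inspect your table to find the real row element.
    ROW_SELECTOR = "div.notion-collection-item"

    def scroll_and_wait(driver, step_px=400, timeout=5):
        """Scroll one increment, then wait up to `timeout` seconds for new rows."""
        before = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        try:
            WebDriverWait(driver, timeout).until(
                lambda d: len(d.find_elements(By.CSS_SELECTOR, ROW_SELECTOR)) > before
            )
            return True   # new rows appeared
        except TimeoutException:
            return False  # nothing new loaded within the timeout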

Also, consider using Notion’s API if possible. It’s more reliable for large datasets and avoids the complexities of web scraping. If you must use Selenium, try targeting the table’s container element and scrolling within it, rather than the whole page.
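For the container approach, something along these lines should work; the ‘notion-scroller’ class is a guess on my part, so check the actual DOM:

    from selenium.webdriver.common.by import By

    def scroll_table_container(driver):
        """Scroll Notion's inner table container by one viewport height.

        'div.notion-scroller' is a guess; inspect the DOM for the real class.
        """
        container = driver.find_element(By.CSS_SELECTOR, "div.notion-scroller")
        driver.execute_script(
            "arguments[0].scrollTop += arguments[0].clientHeight;", container
        )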

Lastly, implement error handling and retries. Notion’s UI can be finicky, so having your script retry failed actions can improve reliability significantly.
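A simple retry wrapper, for illustration (the exception list is just the ones I’ve seen most often with flaky UIs):

    import time
    from selenium.common.exceptions import (
        ElementClickInterceptedException,
        StaleElementReferenceException,
    )

    def with_retries(action, attempts=3, delay=1.0):
        """Run `action`, retrying on exceptions that a flaky UI tends to raise."""
        for attempt in range(attempts):
            try:
                return action()
            except (StaleElementReferenceException, ElementClickInterceptedException):
                if attempt == attempts - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)

Then wrap the flaky steps, e.g. with_retries(lambda: open_button.click()).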

Try the Notion API instead of Selenium. I’ve faced similar issues with dynamic loading in large tables; the API simplifies data extraction and avoids the scrolling hassles entirely. A properly scoped integration token (API key) should do the trick.
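For reference, a bare-bones sketch of querying a database through the official API (the token and database ID are placeholders, and the database has to be shared with your integration first):

    import requests

    NOTION_TOKEN = "secret_..."  # placeholder: your integration token
    DATABASE_ID = "..."          # placeholder: the table's database ID

    def query_all_rows(database_id=DATABASE_ID):
        """Page through a Notion database; the API caps each response at 100 rows."""
        url = f"https://api.notion.com/v1/databases/{database_id}/query"
        headers = {
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        }
        rows, payload = [], {}
        while True:
            resp = requests.post(url, headers=headers, json=payload)
            resp.raise_for_status()
            data = resp.json()
            rows.extend(data["results"])
            if not data.get("has_more"):
                return rows
            payload = {"start_cursor": data["next_cursor"]}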

Have you considered using a headless browser like Playwright or Puppeteer? They tend to handle dynamic content better than Selenium. I’ve had success with them on similar projects.
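For example, a rough Playwright (Python) sketch; the URL and selector here are placeholders:

    from playwright.sync_api import sync_playwright

    # Rough sketch: Playwright's locators auto-wait, which helps with
    # Notion's lazy-loaded rows. The URL and selector are placeholders.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.notion.so/<your-table>")
        rows = page.locator("div.notion-collection-item")
        rows.first.wait_for()  # blocks until at least one row is rendered
        print(rows.count(), "rows currently in the DOM")
        browser.close()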

For your specific issue, try implementing a function that repeatedly scrolls and checks for new rows (recursion works, but a simple loop is easier). Something like this, with a sketch after the list:

  1. Get initial row count
  2. Scroll down
  3. Wait for the page to settle (use a timeout)
  4. Get a new row count
  5. If new rows are found, repeat from step 2
  6. If no new rows are detected after multiple attempts, assume all have been loaded
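Here’s one way to sketch that loop in Python/Selenium, again with a hypothetical ROW_SELECTOR:

    import time
    from selenium.webdriver.common.by import By

    ROW_SELECTOR = "div.notion-collection-item"  # hypothetical; inspect your table

    def load_all_rows(driver, max_stale_attempts=3, settle_seconds=2):
        """Scroll until the row count stops growing for several passes in a row."""
        count = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
        stale = 0
        while stale < max_stale_attempts:
            driver.execute_script("window.scrollBy(0, window.innerHeight);")
            time.sleep(settle_seconds)  # let lazy-loaded rows render
            new_count = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
            if new_count > count:
                count, stale = new_count, 0  # progress, so reset the stale counter
            else:
                stale += 1  # no new rows this pass
        return count

One caveat: if Notion virtualizes rows (removing off-screen ones from the DOM), the count may plateau even though more data exists, in which case you’d collect row data as you scroll rather than counting at the end.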

This approach has worked well for me with infinite scroll implementations. Remember to add appropriate waits and error handling.

Also, double-check your row-detection method: hidden rows or non-standard elements can throw off the count.
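A quick way to sanity-check it (same hypothetical selector as above):

    from selenium.webdriver.common.by import By

    ROW_SELECTOR = "div.notion-collection-item"  # hypothetical selector

    def count_visible_rows(driver):
        """Report how many matched elements are actually displayed."""
        rows = driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR)
        visible = [r for r in rows if r.is_displayed()]
        print(f"matched {len(rows)} elements, {len(visible)} visible")
        return len(visible)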