How can I extract data from Notion pages within a large table using Selenium?

I’m trying to create a dataset from Notion pages in a big table. My current script can:

  1. Open the site
  2. Go through table rows
  3. Hover to show the ‘Open’ button
  4. Click ‘Open’ to view the page
  5. Get the page content
  6. Move to the next row and repeat

But I’m stuck: the script only finds 26–28 of the 47 rows, and even after scrolling it can’t see any more. I’ve tried to work around this with a function that scrolls once it passes the 16th row, but it isn’t working reliably.

Here’s my main issue: the script isn’t finding all the rows in the first place. I’ve written a function to count rows, but it isn’t catching them all.

I could do the 47 entries by hand, but I need this to work for a table with 400 rows. Any ideas on how to fix this? I’m really stuck and could use some help!

I’ve faced a similar challenge with Notion tables and Selenium. The key is dealing with Notion’s dynamic loading. Here’s what worked for me:

Instead of relying on Selenium to find all rows at once, implement a ‘scroll and wait’ approach: scroll the page in small increments, then pause to let new rows load. You can do this by executing JavaScript to scroll and then using WebDriverWait to check for new elements.
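Here’s a minimal sketch of that pattern in Python, assuming a hypothetical ROW_SELECTOR for Notion’s row elements (inspect your table to find the real one):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException

    # Hypothetical selector; inspect your table to find the real row element.
    ROW_SELECTOR = "div.notion-collection-item"

    def scroll_and_wait(driver, step_px=400, timeout=5):
        """Scroll one increment, then wait up to `timeout` seconds for new rows."""
        before = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        try:
            WebDriverWait(driver, timeout).until(
                lambda d: len(d.find_elements(By.CSS_SELECTOR, ROW_SELECTOR)) > before
            )
            return True   # new rows appeared
        except TimeoutException:
            return False  # nothing new loaded within the timeout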

Also, consider using Notion’s API if possible. It’s more reliable for large datasets and avoids the complexities of web scraping. If you must use Selenium, try targeting the table’s container element and scrolling within it, rather than the whole page.
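For the container approach, something along these lines should work; the ‘notion-scroller’ class is a guess on my part, so check the actual DOM:

    from selenium.webdriver.common.by import By

    def scroll_table_container(driver):
        """Scroll Notion's inner table container by one viewport height.

        'div.notion-scroller' is a guess; inspect the DOM for the real class.
        """
        container = driver.find_element(By.CSS_SELECTOR, "div.notion-scroller")
        driver.execute_script(
            "arguments[0].scrollTop += arguments[0].clientHeight;", container
        )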

Lastly, implement error handling and retries. Notion’s UI can be finicky, so having your script retry failed actions can improve reliability significantly.
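A simple retry wrapper, for illustration (the exception list is just the ones I’ve seen most often with flaky UIs):

    import time
    from selenium.common.exceptions import (
        ElementClickInterceptedException,
        StaleElementReferenceException,
    )

    def with_retries(action, attempts=3, delay=1.0):
        """Run `action`, retrying on exceptions that a flaky UI tends to raise."""
        for attempt in range(attempts):
            try:
                return action()
            except (StaleElementReferenceException, ElementClickInterceptedException):
                if attempt == attempts - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)

Then wrap the flaky steps, e.g. with_retries(lambda: open_button.click()).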

Try the Notion API instead of Selenium. I’ve faced similar issues with dynamic loading in large tables; the API simplifies data extraction and avoids the scrolling hassles entirely. A properly scoped integration token (API key) should do the trick.
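For reference, a bare-bones sketch of querying a database through the official API (the token and database ID are placeholders, and the database has to be shared with your integration first):

    import requests

    NOTION_TOKEN = "secret_..."  # placeholder: your integration token
    DATABASE_ID = "..."          # placeholder: the table's database ID

    def query_all_rows(database_id=DATABASE_ID):
        """Page through a Notion database; the API caps each response at 100 rows."""
        url = f"https://api.notion.com/v1/databases/{database_id}/query"
        headers = {
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        }
        rows, payload = [], {}
        while True:
            resp = requests.post(url, headers=headers, json=payload)
            resp.raise_for_status()
            data = resp.json()
            rows.extend(data["results"])
            if not data.get("has_more"):
                return rows
            payload = {"start_cursor": data["next_cursor"]}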

Have you considered using a headless browser like Playwright or Puppeteer? They tend to handle dynamic content better than Selenium. I’ve had success with them on similar projects.
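For example, a rough Playwright (Python) sketch; the URL and selector here are placeholders:

    from playwright.sync_api import sync_playwright

    # Rough sketch: Playwright's locators auto-wait, which helps with
    # Notion's lazy-loaded rows. The URL and selector are placeholders.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.notion.so/<your-table>")
        rows = page.locator("div.notion-collection-item")
        rows.first.wait_for()  # blocks until at least one row is rendered
        print(rows.count(), "rows currently in the DOM")
        browser.close()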

For your specific issue, try implementing a function that repeatedly scrolls and checks for new rows (recursion works, but a simple loop is easier). Something like this, with a sketch after the list:

  1. Get initial row count
  2. Scroll down
  3. Wait for the page to settle (use a timeout)
  4. Get a new row count
  5. If new rows are found, repeat from step 2
  6. If no new rows are detected after multiple attempts, assume all have been loaded
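Here’s one way to sketch that loop in Python/Selenium, again with a hypothetical ROW_SELECTOR:

    import time
    from selenium.webdriver.common.by import By

    ROW_SELECTOR = "div.notion-collection-item"  # hypothetical; inspect your table

    def load_all_rows(driver, max_stale_attempts=3, settle_seconds=2):
        """Scroll until the row count stops growing for several passes in a row."""
        count = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
        stale = 0
        while stale < max_stale_attempts:
            driver.execute_script("window.scrollBy(0, window.innerHeight);")
            time.sleep(settle_seconds)  # let lazy-loaded rows render
            new_count = len(driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR))
            if new_count > count:
                count, stale = new_count, 0  # progress, so reset the stale counter
            else:
                stale += 1  # no new rows this pass
        return count

One caveat: if Notion virtualizes rows (removing off-screen ones from the DOM), the count may plateau even though more data exists, in which case you’d collect row data as you scroll rather than counting at the end.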

This approach has worked well for me with infinite scroll implementations. Remember to add appropriate waits and error handling.

Also, double-check your row-detection method: hidden rows or non-standard elements can throw off the count.
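A quick way to sanity-check it (same hypothetical selector as above):

    from selenium.webdriver.common.by import By

    ROW_SELECTOR = "div.notion-collection-item"  # hypothetical selector

    def count_visible_rows(driver):
        """Report how many matched elements are actually displayed."""
        rows = driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR)
        visible = [r for r in rows if r.is_displayed()]
        print(f"matched {len(rows)} elements, {len(visible)} visible")
        return len(visible)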