I’m trying to extract data from a website that has a really tricky pagination system. The first few pages work normally - they just add pageNum={number} to the URL. But after page 5, something weird happens. The site starts adding these long encoded strings to each page URL along with the page number.
I looked through the page source and network tab but can’t figure out how these tokens are generated. The current page token is buried somewhere in script tags but I can’t find the logic for generating the next one.
Selenium approach:
I tried using selenium to automate clicking through pages, but the site shows a captcha verification when using webdriver.
Does anyone know how to handle this kind of dynamic pagination? I’m open to solutions in Python (requests, selenium, beautifulsoup) or JavaScript (puppeteer, playwright). Any help would be great!
Those encoded strings look like base64 tokens, but they’re probably session-based rather than algorithmically generated. Had a similar issue last year with an e-commerce site - each token contained state about the pagination position and the user session.

What worked for me: monitor localStorage and sessionStorage during page transitions, since a lot of sites stash pagination state there before generating the next token. I used the requests-html library - it renders JavaScript but is much lighter than full browser automation. After scraping each page, I’d pull the next token from hidden input fields or JavaScript variables that only get populated after the page loads.

The other thing that mattered was keeping session cookies intact. These tokens often depend on session data that gets wiped if you don’t persist cookies between requests, so use a requests Session object and keep all the cookies from your earlier pages. Sometimes token generation depends on cookie values that build up as you navigate through the pages.
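Rough sketch of that approach with requests-html (the URL, query parameter names, and the "next_token" field name are placeholders - inspect the rendered DOM on your site to find the real ones):

```python
from requests_html import HTMLSession

session = HTMLSession()  # subclasses requests.Session, so cookies persist between requests

page = 1
url = "https://example.com/items?pageNum=1"  # placeholder URL

for _ in range(20):
    r = session.get(url)
    r.html.render(timeout=30)  # run the page's JavaScript so token fields get populated

    # ... extract your data from r.html here ...

    # "next_token" is a guess at the field name - check the rendered page source
    token_el = r.html.find('input[name="next_token"]', first=True)
    if token_el is None or not token_el.attrs.get("value"):
        break  # no token found; assume this was the last page

    page += 1
    url = f"https://example.com/items?pageNum={page}&token={token_el.attrs['value']}"
```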
Had the same issue about six months back. Turns out the tokens weren’t in the initial server response - they were generated by JavaScript after the DOM loaded. Playwright with a stealth plugin saved me: I set up response listeners to intercept the XHR calls that fire when a pagination button is clicked, because that’s where the fresh tokens come from.

Here’s the thing - don’t waste time reverse engineering the token algorithm. Just mimic exactly what the browser does. Load each page, wait for everything to render, then grab the next token from hidden form fields or JavaScript variables in the global scope. Sometimes they’re buried in window objects or data attributes that only show up after scripts run.

For the captchas, rotate user agents more often and add random delays. Most sites trigger captchas based on request patterns, not just automation detection.
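Minimal Playwright sketch of that pattern (the URL, the "pagination" substring in the response filter, and the window variable name are all assumptions - swap in whatever your network tab shows; the stealth plugin is left out for brevity):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    page = context.new_page()

    # log pagination XHR responses so you can see where the next token comes from
    def on_response(response):
        if "pagination" in response.url:  # substring is a guess; match your real endpoint
            print(response.status, response.url)

    page.on("response", on_response)

    page.goto("https://example.com/items?pageNum=1")  # placeholder URL
    page.wait_for_load_state("networkidle")

    # the global variable name is hypothetical - grep the page's scripts for the token
    next_token = page.evaluate("() => window.__nextPageToken || null")

    # alternatively, read it from a data attribute that only exists after scripts run:
    # next_token = page.get_attribute("a.next-page", "data-token")

    print("next token:", next_token)
    browser.close()
```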
Those tokens definitely change server-side between sessions - had the same issue on a project with a similar setup. You need to start fresh each time and pull the tokens from the HTML response after each page loads. Don’t try to predict them - just parse the current page source for the next token before making your next request. They’re usually hidden in a script tag or a form field.
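If the token sits in a script tag rather than a form field, a regex over the raw HTML is usually enough (the "nextPageToken" variable name is just an example - search the page source for the token string you see in the next page’s URL and match whatever surrounds it):

```python
import re
import requests

session = requests.Session()  # keep cookies in case the token is tied to the session
html = session.get("https://example.com/items?pageNum=6").text  # placeholder URL

# capture the quoted value assigned to the (assumed) nextPageToken variable
match = re.search(r'nextPageToken\s*[:=]\s*["\']([^"\']+)["\']', html)
next_token = match.group(1) if match else None
print(next_token)
```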
I’ve hit this exact problem before - you’re overthinking it. Yeah, those tokens are client-side generated, but reverse engineering them isn’t worth the headache.
Just automate the whole flow instead. I built a scraper for a similar site with dynamic tokens AND rate limiting. Skip the Selenium headaches and complex Playwright scripts - I used Latenode for a workflow that:
Loads the first page normally
Grabs data from current page
Finds next page button (doesn’t matter what token it has)
Clicks through like a human
Waits for load, repeats
Latenode handles browser automation, dodges captchas, and adds random delays between requests. No token parsing or JavaScript debugging needed.
Mine runs on a schedule and has scraped thousands of pages over several months without breaking. 10 minutes to set up vs hours debugging tokens.
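If you’d rather keep that same click-through loop in code instead of a hosted workflow, it’s only a few lines in Playwright (the URL, the next-button selector, and the delay range are placeholders):

```python
import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/items")  # placeholder URL

    while True:
        page.wait_for_load_state("networkidle")

        # ... scrape the current page here, e.g. page.query_selector_all("div.item") ...

        # find the next-page button - whatever token its link carries doesn't matter
        next_button = page.query_selector("a.next-page")  # selector is a guess
        if next_button is None:
            break  # no next button, we're done

        time.sleep(random.uniform(2, 6))  # human-ish delay between pages
        next_button.click()

    browser.close()
```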