I’m trying to scrape data from a real estate website, but I’m stuck on the pagination. It’s not your usual setup. For the first 5 pages it’s simple: you just append page={number} to the URL. After page 5, though, it gets tricky: the site adds a unique code to each page’s URL.
Here’s what I’ve noticed:
- The code is different for each page
- Most of it stays the same, just a few characters change
- I can’t find this code in the page source
I’ve tried using Python with requests and BeautifulSoup, but no luck. Selenium runs into a captcha problem. I haven’t tried JavaScript yet, but I’m worried about the captcha there too.
Has anyone cracked this kind of pagination before? Any tips on how to get around the captcha or find that mystery code? I’m open to solutions in Python or JavaScript.
Here’s a bit of the Python code I’ve tried:
import requests

headers = {
    # A real browser user-agent string, to get past the most basic bot filters
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
}

params = {
    'branchen': '3302469|3302464|3302249|3303516|3301609|3300129',
    'sorte': '|',
    'modul': 'direct',
    'page': '7',
    'query': 'someRandomLookingCode',  # the per-page code I can't reproduce
}

response = requests.get('https://www.example-real-estate-site.com/search',
                        headers=headers, params=params)
response.raise_for_status()
Any ideas? Thanks!
In my experience, such pagination challenges usually come from server-side mechanisms designed to prevent scraping. When faced with a unique code that changes per page, I’ve found that using a headless browser such as Puppeteer or Playwright can simulate genuine user behavior better than static requests. This approach not only captures the exact requests and responses but can also help you identify where the code is generated. Additionally, incorporating techniques like variable delays and user-agent rotation can minimize triggering captchas. Services like 2captcha may be helpful if you do encounter captchas during the process.
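To build on this: once a headless browser (or the Network tab in dev tools) is capturing the page-6+ requests, pulling the code out of the captured URLs is plain query-string parsing. A minimal sketch, assuming the parameter is named query as in the original snippet; the URLs below are made-up examples of the shape the OP describes:

```python
from urllib.parse import urlparse, parse_qs

def extract_query_codes(urls, param='query'):
    """Pull the pagination code out of a list of captured request URLs."""
    codes = []
    for url in urls:
        qs = parse_qs(urlparse(url).query)
        if param in qs:
            codes.append(qs[param][0])
    return codes

# Hypothetical captured URLs; in practice these would come from the
# browser's Network tab or a headless browser's request listener.
captured = [
    'https://www.example-real-estate-site.com/search?page=6&query=abc123xyz',
    'https://www.example-real-estate-site.com/search?page=7&query=abc124xyz',
]
print(extract_query_codes(captured))  # ['abc123xyz', 'abc124xyz']
```

Comparing consecutive codes this way is also a quick check on the OP's observation that only a few characters change between pages.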
yo, i’ve run into this before. sounds like they’re using some fancy js to keep scrapers out. have you tried using a headless browser like puppeteer? it can handle that dynamic stuff better than requests. for the captcha, maybe try using a proxy service or rotating IPs. just remember to space out your requests so you don’t look like a bot!
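On the spacing point: a small wrapper that sleeps a random, jittered amount before each request avoids hitting the server on a fixed rhythm, which is one of the easier bot signals to trip. A rough sketch; the base and jitter values are arbitrary and should be tuned for the target site:

```python
import random
import time

def jittered_delay(base=2.0, jitter=3.0):
    """Return a randomized wait time so requests don't arrive on a fixed beat."""
    return base + random.uniform(0, jitter)

def fetch_politely(session, url, **kwargs):
    """Sleep a random amount before each request to look less bot-like."""
    time.sleep(jittered_delay())
    return session.get(url, **kwargs)
```

Used with a requests.Session, this also keeps cookies across pages, which some sites check for.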
Having dealt with similar issues, I can suggest a few strategies. First, try reverse-engineering the site’s JavaScript. The unique code is likely generated client-side, so inspecting network requests in the browser’s dev tools might reveal its origin. If that fails, consider using a headless browser like Puppeteer. It can execute JavaScript and handle dynamic content, potentially bypassing the pagination trick. For captchas, implement IP rotation or use a captcha-solving service as a last resort. Remember to respect the site’s robots.txt and implement delays between requests to avoid being flagged as a bot. Good luck with your scraping project!
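On the robots.txt point: Python's standard library can check a URL against the site's rules before you fetch it. A minimal sketch using a made-up robots.txt body and the example domain from the question:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, agent='my-scraper'):
    """Check a URL against robots.txt rules (fetched separately as text)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Hypothetical robots.txt content, purely for illustration
robots = """User-agent: *
Disallow: /admin/
"""

print(allowed(robots, 'https://www.example-real-estate-site.com/search?page=7'))  # True
print(allowed(robots, 'https://www.example-real-estate-site.com/admin/'))         # False
```

In a real scraper you'd fetch https://site/robots.txt once at startup and run every candidate URL through a check like this.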