Struggling to scrape job links from a dynamic website using web crawler

I’m having trouble extracting job listing links from a website using a web crawler. I’ve tried various settings but can’t get the desired results. Here’s my code:

import asyncio
from web_crawler import AsyncCrawler, BrowserSettings, CrawlConfig, CachePolicy

async def crawl_job_site():
    browser_options = BrowserSettings(
        visible_mode=True,
        text_only=False
    )
    
    # JS snippet meant to wait, then scroll to the bottom of the page
    # (note: defined here but never actually passed into the crawl call)
    scroll_and_wait = '''
    await new Promise(r => setTimeout(r, 5000));
    window.scrollTo(0, document.body.scrollHeight);
    '''
    
    crawl_options = CrawlConfig(
        full_page_scan=True,
        load_delay=2.5,
        wait_condition='js:() => window.pageLoaded === true',
        target_element='main',
        cache_policy=CachePolicy.IGNORE,
        remove_popups=True,
        ignore_external=True,
        ignore_social=True
    )

    async with AsyncCrawler(settings=browser_options) as spider:
        outcome = await spider.crawl(
            'https://example-jobs.com/listings?page=1&radius=30&unit=km&country=de#',
            options=crawl_options
        )

        if outcome.ok:
            print(f'[SUCCESS] Crawled: {outcome.url}')
            print(f'Internal links found: {len(outcome.links["internal"])}')
            print(f'External links found: {len(outcome.links["external"])}')

            for link in outcome.links['internal']:
                print(f'Internal Link: {link["url"]} - {link["anchor"]}')
        else:
            print(f'[FAILED] {outcome.error}')

asyncio.run(crawl_job_site())

I’ve tried different browser and crawler settings, but I only ever get back a single link (the privacy policy) instead of the job listings. Any ideas what I’m doing wrong or how to fix this?

The issue is most likely in how the website loads its content. Job listings are often injected into the page asynchronously, so even with a waiting period the content may not exist yet when your crawler checks for it. Consider extending the delay after scrolling, or adding explicit waits that only pass once the dynamic content has actually loaded. Another improvement is to switch from the generic `main` target element to a selector that corresponds directly to the job-listings container. Finally, simulate a real browser session with proper headers and check whether the site uses anti-bot measures you need to work around.
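As a rough sketch of the narrower-target idea, reusing the same `CrawlConfig` fields from the question — the `.job-results` selector and the wait condition are assumptions, so replace them with selectors taken from the site's actual DOM:

```python
# Sketch only: '.job-results' is a hypothetical container selector.
crawl_options = CrawlConfig(
    full_page_scan=True,
    load_delay=8.0,  # longer delay for slow async rendering
    # Wait until at least one listing link exists, not just a page-load flag
    wait_condition='js:() => document.querySelectorAll(".job-results a").length > 0',
    target_element='.job-results',  # narrower than the generic <main>
    cache_policy=CachePolicy.IGNORE,
    remove_popups=True
)
```

The key change is that the wait condition checks for the listings themselves rather than a generic `pageLoaded` flag, which can flip to true before the AJAX content arrives.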

I’ve encountered similar issues when scraping dynamic job boards. One thing that’s worked for me is inspecting the network requests in the browser’s dev tools. Often, these sites load data via AJAX calls. By identifying and replicating these requests in your script, you can bypass the need for a full browser simulation.
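For example, if dev tools show the listings coming from a JSON endpoint, you can skip the browser entirely. The endpoint URL and the `results`/`url` field names below are purely hypothetical — take the real ones from the Network tab:

```python
import json
import urllib.request

def extract_job_links(payload: dict) -> list[str]:
    """Pull listing URLs out of a decoded JSON payload.
    The 'results'/'url' field names are assumptions about the API shape."""
    return [item["url"] for item in payload.get("results", []) if "url" in item]

def fetch_job_links(endpoint: str) -> list[str]:
    # A browser-like User-Agent helps get past basic anti-bot filtering.
    req = urllib.request.Request(endpoint, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return extract_job_links(json.load(resp))

if __name__ == "__main__":
    # Hypothetical endpoint spotted in dev tools
    print(fetch_job_links("https://example-jobs.com/api/listings?page=1"))
```

Replicating the AJAX call directly is usually faster and far less flaky than driving a full browser.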

Another approach that’s been effective is using a headless browser like Playwright. It handles JavaScript execution well and has built-in wait functions for dynamic content. You might want to try something like:

await page.wait_for_selector('.job-listing-container')

This waits for a specific element to appear before proceeding. Also, consider implementing exponential backoff for retries if the initial load fails. These techniques have significantly improved my scraping success rate for tricky sites.
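A minimal Python sketch combining both ideas — Playwright's explicit wait plus exponential backoff. The `.job-listing-container` selector is the hypothetical one from above, and I'm using the sync API for brevity:

```python
import time

def backoff_delays(retries: int, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def scrape_with_retries(url: str, retries: int = 4) -> list[str]:
    # Imported here so the backoff helper stays usable without Playwright.
    from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for delay in backoff_delays(retries):
            try:
                page.goto(url)
                # Block until the listings container actually renders.
                page.wait_for_selector(".job-listing-container", timeout=10_000)
                links = page.eval_on_selector_all(
                    ".job-listing-container a", "els => els.map(e => e.href)"
                )
                browser.close()
                return links
            except PWTimeout:
                time.sleep(delay)  # back off before the next attempt
        browser.close()
        return []
```

Each failed attempt doubles the wait before retrying, which gives slow pages a chance to recover without hammering the server.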

hey, have you tried Selenium instead? it’s pretty good for dynamic sites. also check whether the job listings are inside iframes or something. the site probably loads stuff with JavaScript, so try waiting for specific elements to appear before scraping. good luck!
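If you go the Selenium route, here's a rough sketch of the explicit-wait idea — the `.job-card` class is made up, so grab the real selector from dev tools:

```python
def job_link_selector(card_class: str) -> str:
    """CSS selector for anchors inside listing cards (card_class is assumed)."""
    return f".{card_class} a"

def scrape_with_selenium(url: str) -> list[str]:
    # Imported inside so the selector helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Explicit wait: block up to 15s until at least one card appears.
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".job-card"))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, job_link_selector("job-card"))
        return [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()
```

An explicit wait like this is much more reliable than a fixed sleep, since it returns as soon as the element shows up and only fails after the full timeout.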