Issue: I’m trying to scrape distance info from hotel listings but can’t get it right. The data I need is the number of kilometers or miles from a reference point, like 25.9.
I’ve attempted a few approaches:
// This returns an empty list
const distances = await page.$$eval('div[class^="cmp-property-card"] p[class="hotel-address"] span.hotel-address-text + span', els => els.map(el => el.textContent.trim()))
// This also returns an empty list
const distances = await page.$$eval('div[class^="cmp-property-card"] p[class="hotel-address"] span:nth-child(0) + span', els => els.map(el => el.textContent.trim()))
When I try to grab the text, I either get too much unrelated content or nothing at all. Any ideas on how to target just the distance value?
I’ve faced similar challenges scraping hotel data before. Here’s what worked for me:
Try using a more specific selector that targets the distance element directly. Something like:
const distances = await page.$$eval('span[data-testid="distance-label"]', els => els.map(el => el.textContent.trim()));
If that doesn’t work, you might need to use a combination of selectors and text parsing. First, grab the entire address block, then extract the distance:
const addressBlocks = await page.$$eval('.hotel-address', els => els.map(el => el.textContent));
const distances = addressBlocks.map(block => {
  const match = block.match(/(\d+(\.\d+)?)\s*(km|mi)/);
  return match ? match[0] : null;
});
This approach is more flexible and can handle variations in the HTML structure. Remember to adjust the regular expression based on the exact format of the distance information on your target site.
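To make the "adjust the regular expression" part concrete, here is a rough sketch of a more tolerant parser. The helper name parseDistance and the list of unit spellings are my own assumptions, not something taken from your target site, so treat it as a starting point:

```javascript
// Hypothetical helper: pulls a distance and its unit out of free-form address text.
// The accepted unit spellings are assumptions; extend the list to match your site.
function parseDistance(text) {
  // Accepts "25.9 km", "25,9 km", "3 miles", "0.4mi"; matching is case-insensitive.
  const match = text.match(/(\d+(?:[.,]\d+)?)\s*(km|kilometers?|kilometres?|mi|miles?)\b/i);
  if (!match) return null;
  // Normalize a decimal comma to a dot before converting to a number.
  const value = parseFloat(match[1].replace(',', '.'));
  // Collapse the unit spellings to either 'km' or 'mi'.
  const unit = match[2].toLowerCase().startsWith('k') ? 'km' : 'mi';
  return { value, unit };
}

// parseDistance('12 Main St, 25.9 km from center') -> { value: 25.9, unit: 'km' }
```

You could then run it over the address text you already scraped, for example addressBlocks.map(parseDistance), and filter out the null entries.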
Also, make sure you’re waiting for the content to load properly before scraping. You might need to use page.waitForSelector() or similar methods to ensure the elements are present before attempting to extract data.
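A minimal sketch of the waiting step, assuming page is a Puppeteer Page. The function name scrapeAddressText, the default selector, and the 10-second timeout are placeholders I picked. As far as I know, Puppeteer's timeout errors carry the name TimeoutError, which is what the check below relies on:

```javascript
// Sketch: wait for the address elements to render, then read their text.
// `page` is assumed to be a Puppeteer Page; selector and timeout are placeholders.
async function scrapeAddressText(page, selector = '.hotel-address', timeout = 10000) {
  try {
    // Resolves once at least one matching element is in the DOM.
    await page.waitForSelector(selector, { timeout });
  } catch (err) {
    // Assumption: a timeout surfaces as an error named 'TimeoutError'.
    // Treat that as "no results" and let any other failure propagate.
    if (err && err.name === 'TimeoutError') return [];
    throw err;
  }
  return page.$$eval(selector, els => els.map(el => el.textContent.trim()));
}
```

If this returns an empty array even though you can see the listings in the browser, the selector is probably wrong rather than the timing.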
Having worked on similar projects, I can offer a suggestion that might help. Sometimes, the distance information is dynamically loaded or hidden within nested elements. Try this approach:
const distances = await page.evaluate(() => {
  const cards = document.querySelectorAll('div[class^="cmp-property-card"]');
  return Array.from(cards).map(card => {
    const distanceEl = card.querySelector('span[aria-label*="distance"]');
    // match() returns null when the text has no number, so guard before indexing
    const match = distanceEl ? distanceEl.textContent.match(/\d+(\.\d+)?/) : null;
    return match ? match[0] : null;
  });
});
This method uses page.evaluate to run JavaScript directly in the browser context, which can be more effective for complex DOM structures. It looks for elements with an aria-label containing ‘distance’ and extracts the numeric value. Adjust the selectors as needed for your specific case.
Remember to implement proper error handling, and if you’re scraping at scale, run the browser in headless mode for better performance.
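For the error handling part, one common pattern is to wrap flaky steps in a retry helper. This is only a sketch: withRetries is a hypothetical helper of mine, not part of Puppeteer, and the attempt count and delay are arbitrary defaults:

```javascript
// Hypothetical helper: runs an async task and retries it after a fixed delay on failure.
async function withRetries(task, attempts = 3, delayMs = 1000) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Pause before the next attempt, but not after the final one.
      if (i < attempts) await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

You would call it as withRetries(() => page.goto(url)) or around the page.evaluate call, so a single slow or failed page does not abort the whole run.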
hey mate, i’ve dealt with this before. try using xpath instead of css selectors. something like:
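// note: page.$x isn't available in recent Puppeteer releases; there, page.$$('xpath/...') takes the same expression
const distances = await page.$x('//span[contains(@class, "distance")]');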
const distanceTexts = await Promise.all(distances.map(el => el.evaluate(node => node.textContent)));
this might grab what ur looking for. good luck!