I’m trying to scrape a website with Selenium and save the data to Google Sheets. Right now, the text of every <td> element comes out as one blob. The date and rank details live under td > div > div, in span and b elements, and they blend together when extracted.
I want to split them between cells. For example, one cell might have the date in the format 11-27, followed by the day like Wed in another, and rank (like 291st) in yet another. If I extract a new set, it should fill subsequent cells - e.g., A2, A3, A4 for the first entry and B2, B3, B4 for the next.
I’ve attempted using Python’s split() function and BeautifulSoup, but without success. Below is a simplified code snippet of my method:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://example.com')
data = driver.find_elements(By.TAG_NAME, 'td')
for item in data:
    print(item.text)  # This prints all the text in one line
# How can I separate the date, day, and rank correctly?
Any suggestions would be helpful!
Having worked extensively with Selenium for web scraping, I can offer some insights. Instead of grabbing every td with find_elements, target each field with a CSS selector via find_element(By.CSS_SELECTOR, ...) for more precise targeting (the old find_element_by_css_selector helpers were removed in Selenium 4). This approach lets you extract each data point individually.
Here’s a snippet that might help:
date = driver.find_element(By.CSS_SELECTOR, 'td > div > div > span:nth-child(1)').text
day = driver.find_element(By.CSS_SELECTOR, 'td > div > div > span:nth-child(2)').text
rank = driver.find_element(By.CSS_SELECTOR, 'td > div > div > b').text
print(f'{date}, {day}, {rank}')
This method gives you granular control over each data point. You can then use the Google Sheets API to write each piece of information to separate cells. Remember to implement error handling and consider using WebDriverWait for dynamic content loading.
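To lay entries out the way the question describes (A2–A4 for the first entry, B2–B4 for the next), you first need the target cell addresses. Here is a minimal sketch for computing them before handing values to the Sheets API; the function name is mine, not part of any library, and it assumes fewer than 26 entries (single-letter columns):

```python
def entry_to_cells(entry_index, start_row=2, fields=3):
    """Map the nth scraped entry to a column of cell addresses,
    e.g. entry 0 -> ['A2', 'A3', 'A4'], entry 1 -> ['B2', 'B3', 'B4'].
    Sketch only: assumes fewer than 26 entries (single-letter columns)."""
    col = chr(ord('A') + entry_index)
    return [f'{col}{start_row + i}' for i in range(fields)]
```

With gspread, for example, each address could then be written with worksheet.update_acell(cell, value); check the gspread docs for the exact call in your version.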
hey, i’ve dealt with similar stuff before. try WebDriverWait with expected_conditions to make sure the elements are there before scraping. then grab each piece separately (note find_elements_by_css_selector is gone in Selenium 4, use find_elements with By.CSS_SELECTOR):
driver.find_elements(By.CSS_SELECTOR, 'td > div > div > span:nth-child(1)')
that should help you split things up better. good luck!
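if the child elements are hard to target one by one, another option is to split the combined cell text after the fact. rough sketch, assuming the blob looks like '11-27 Wed 291st' (the pattern is a guess from the question's example, not the real page):

```python
import re

def split_cell(text):
    """Pull date (e.g. 11-27), day (e.g. Wed), and rank (e.g. 291st)
    out of one blob of cell text. Pattern is a guess from the example."""
    m = re.search(r'(\d{1,2}-\d{1,2})\s+(\w{3})\s+(\d+(?:st|nd|rd|th))', text)
    return m.groups() if m else (None, None, None)
```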
I’ve faced similar challenges when scraping complex data structures. Here’s what worked for me:
Instead of grabbing all td elements at once, try targeting specific elements within each cell. You can use XPath or CSS selectors to pinpoint the exact data you need.
For example:
dates = driver.find_elements(By.XPATH, '//td/div/div/span[1]')
days = driver.find_elements(By.XPATH, '//td/div/div/span[2]')
ranks = driver.find_elements(By.XPATH, '//td/div/div/b')
for date, day, rank in zip(dates, days, ranks):
    print(f'{date.text}, {day.text}, {rank.text}')
This approach should give you more control over the data extraction process. You can then use the Google Sheets API to write each piece of data to the correct cell.
Remember to handle potential exceptions and empty fields. Also, consider using a wait strategy to ensure the page is fully loaded before scraping.
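Along those lines, a small helper keeps a missing element from crashing the whole loop. This is a generic sketch, not part of Selenium; you pass it any zero-argument extractor:

```python
def safe_text(extract, default=''):
    """Call extract() and return its result stripped;
    fall back to default when the element is missing or the text is empty."""
    try:
        value = extract()
        return value.strip() if value else default
    except Exception:
        return default
```

Used with Selenium it would look like: rank = safe_text(lambda: driver.find_element(By.XPATH, '//td/div/div/b').text, default='n/a').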