I’m stuck trying to pull image links from Shopify product pages saved in a text file. The HTML structure is pretty similar across different Shopify sites, but I can’t get the image URLs out. I only need the first image link for each product.
Here’s a sample of the HTML I’m working with:
<div class="product-grid__item">
  <a class="product-grid__link" href="/products/silver-pendant-necklace">
    <div class="product-grid__image-wrapper">
      <div class="product-grid__image product-grid__image--square">
        <img alt="Silver Pendant Necklace - Jewel Co" class="lazy-load product-grid__image-main"
             data-src="//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"
             data-widths="[360, 540, 720, 900, 1080]"/>
      </div>
    </div>
    <div class="product-grid__info">
      <div class="product-grid__title">Silver Pendant Necklace</div>
      <div class="product-grid__price"><span class="money">$49.99 USD</span></div>
    </div>
  </a>
</div>
My code works for other fields but not for images. I get a ‘NoneType’ error when trying to get the ‘data-src’ attribute. Any ideas on how to fix this? Thanks!
hey mate, i had a similar problem. try using regex to extract the url. something like this might work:
import re

# html_content is the saved page source as a string
pattern = r'data-src="(.*?)"'
match = re.search(pattern, html_content)
if match:
    image_url = match.group(1)
    # Swap the {width} placeholder for a real size
    print(image_url.replace('{width}', '540'))
this should grab that pesky url for ya. good luck!
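One catch: re.search stops at the first match in the whole file, so with several products saved together you only get one URL back. Here is a rough sketch of a variant that collects every data-src instead. The function name extract_image_urls and the https: prefix for protocol-relative URLs are my own additions, not something from the original post, and it assumes each product has exactly one img with a data-src.

```python
import re

# Capture whatever sits between the quotes of a data-src attribute
DATA_SRC_RE = re.compile(r'data-src="([^"]+)"')

def extract_image_urls(html, width=540):
    """Return every data-src URL found in html, with {width} filled in.

    Hypothetical helper for illustration only.
    """
    urls = []
    for raw in DATA_SRC_RE.findall(html):
        url = raw.replace("{width}", str(width))
        # Shopify CDN links in the sample are protocol-relative ("//cdn...")
        if url.startswith("//"):
            url = "https:" + url
        urls.append(url)
    return urls

sample = '<img data-src="//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"/>'
print(extract_image_urls(sample))
```

If a product can carry more than one image, this returns all of them, so a per-product parse is the safer route for "first image only".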
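I’ve encountered similar issues when scraping Shopify sites. The problem likely stems from how the images are lazy-loaded: the URL lives in ‘data-src’ rather than ‘src’, and a ‘NoneType’ error means the lookup for the tag itself came back empty. Rather than going for the attribute straight away, work down to it step by step: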
- Locate the ‘product-grid__image-wrapper’ div
- Find the nested img tag within it
- Extract the ‘data-src’ attribute
Here’s a Python snippet that might work:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
image_wrapper = soup.find('div', class_='product-grid__image-wrapper')
if image_wrapper:
    img_tag = image_wrapper.find('img', class_='lazy-load')
    if img_tag and 'data-src' in img_tag.attrs:
        image_url = img_tag['data-src']
        # Process the URL to get the desired size
        image_url = image_url.replace('{width}', '540')  # Or any other width
        print(image_url)
This should handle the lazy-loading structure and give you the image URL. Remember to adjust the width as needed.
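On picking the width: the img tag in the question also lists the available sizes in data-widths, so you could choose from those instead of hard-coding 540. A minimal sketch, assuming data-widths is always a non-empty JSON list of integers as in the sample. The helper names pick_width and sized_url are made up for this example.

```python
import json

def pick_width(data_widths, target):
    """Smallest listed width that is >= target, else the largest one listed."""
    widths = sorted(json.loads(data_widths))
    for w in widths:
        if w >= target:
            return w
    return widths[-1]

def sized_url(data_src, data_widths, target=540):
    """Fill the {width} placeholder with a width taken from data-widths."""
    return data_src.replace("{width}", str(pick_width(data_widths, target)))

# Attribute values copied from the sample HTML in the question
data_src = "//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"
data_widths = "[360, 540, 720, 900, 1080]"
print(sized_url(data_src, data_widths, target=600))
```

With BeautifulSoup you would pass in img_tag['data-src'] and img_tag['data-widths'].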
I’ve dealt with this issue before when working on a Shopify-related project. The trick is to use a combination of BeautifulSoup and regular expressions. Here’s what worked for me:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_content, 'html.parser')
# Match the img tag that carries both classes
img_tag = soup.select_one('img.lazy-load.product-grid__image-main')
if img_tag and 'data-src' in img_tag.attrs:
    data_src = img_tag['data-src']
    # Drop everything from the {width} placeholder onward (including ?v=...)
    base_url = re.sub(r'\{width\}.*$', '', data_src)
    # Assumes the image is a .jpg
    image_url = base_url + '540x.jpg'
    print(image_url)
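This approach is a bit more robust because the CSS selector matches the img tag by its classes wherever it sits in the markup, so it doesn’t depend on the wrapper divs. The regex then strips everything from the ‘{width}’ placeholder onward, which also drops the ‘?v=’ query string and assumes a .jpg file. The ‘540x’ size is arbitrary; you can adjust it based on your needs.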
Remember to handle cases where the img tag might not be found to avoid errors. Hope this helps!
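For the missing-tag case, something along these lines might do: loop over each product block and record None when no image turns up, rather than letting the script crash. This is only a sketch. The function name first_image_urls, the img[data-src] selector, the https: prefix, and the second "/products/no-image" entry in the test HTML are all my own assumptions for illustration.

```python
from bs4 import BeautifulSoup

def first_image_urls(html, width=540):
    """Map each product href to its first image URL, or None if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for item in soup.select("div.product-grid__item"):
        link = item.select_one("a.product-grid__link")
        href = link.get("href") if link else None
        # select_one returns the first match in document order, or None
        img = item.select_one("img[data-src]")
        if img is None:
            results[href] = None
            continue
        url = img["data-src"].replace("{width}", str(width))
        # Turn protocol-relative links ("//cdn...") into full https URLs
        results[href] = "https:" + url if url.startswith("//") else url
    return results

# First product mirrors the question's markup; the second is a made-up
# product with no image, to exercise the None branch.
html = '''
<div class="product-grid__item">
  <a class="product-grid__link" href="/products/silver-pendant-necklace">
    <img class="lazy-load product-grid__image-main"
         data-src="//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"/>
  </a>
</div>
<div class="product-grid__item">
  <a class="product-grid__link" href="/products/no-image"></a>
</div>
'''
for href, url in first_image_urls(html).items():
    print(href, url)
```

Products without a link would all land under the None key here, so key on something else if that can happen in your files.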