Extracting image URLs from Shopify product pages

JackHero77 · May 11, 2025, 2:28am

I’m stuck trying to pull image links from Shopify product pages saved in a text file. The HTML structure is pretty similar across different Shopify sites but I can’t get the image URLs. I only need the first image link for each product.

Here’s a sample of the HTML I’m working with:

<div class="product-grid__item">
  <a class="product-grid__link" href="/products/silver-pendant-necklace">
    <div class="product-grid__image-wrapper">
      <div class="product-grid__image product-grid__image--square">
        <img alt="Silver Pendant Necklace - Jewel Co" class="lazy-load product-grid__image-main"
             data-src="//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"
             data-widths="[360, 540, 720, 900, 1080]"/>
      </div>
    </div>
    <div class="product-grid__info">
      <div class="product-grid__title">Silver Pendant Necklace</div>
      <div class="product-grid__price"><span class="money">$49.99 USD</span></div>
    </div>
  </a>
</div>

My code works for other fields but not for images. I get a ‘NoneType’ error when trying to get the ‘data-src’ attribute. Any ideas on how to fix this? Thanks!

Sophia63 · May 16, 2025, 5:35am

hey mate, i had a similar problem. try using regex to extract the url. something like this might work:

import re

pattern = r'data-src="(.*?)"'
match = re.search(pattern, html_content)
if match:
    image_url = match.group(1)
    print(image_url.replace('{width}', '540'))

this should grab that pesky url for ya. good luck!
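One caveat if the text file holds several products: re.search stops at the first match in the whole string. A minimal sketch of collecting every match instead, assuming the file has already been read into one string. The helper name image_urls and its width default are made up for illustration, and the pattern will also pick up any non-product images that carry a data-src attribute:

```python
import re

# Matches the value of every data-src="..." attribute in the text
DATA_SRC = re.compile(r'data-src="([^"]+)"')

def image_urls(html_content, width=540):
    """Return every data-src URL found, with {width} filled in."""
    return [
        url.replace("{width}", str(width))
        for url in DATA_SRC.findall(html_content)
    ]
```

If the file has one image per product, as in the sample above, the list lines up with the products in page order.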

JackWolf69 · May 16, 2025, 3:15am

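I’ve encountered similar issues when scraping Shopify sites. The problem likely stems from how the images are lazy-loaded: the URL sits in ‘data-src’ rather than ‘src’, and a ‘NoneType’ error usually means the lookup that was supposed to return the img tag found nothing. Instead of reading ‘data-src’ in a single step, walk down to the tag and check each result along the way: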

1. Locate the ‘product-grid__image-wrapper’ div
2. Find the nested img tag within it
3. Extract the ‘data-src’ attribute

Here’s a Python snippet that might work:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
image_wrapper = soup.find('div', class_='product-grid__image-wrapper')
if image_wrapper:
    img_tag = image_wrapper.find('img', class_='lazy-load')
    if img_tag and 'data-src' in img_tag.attrs:
        image_url = img_tag['data-src']
        # Process the URL to get the desired size
        image_url = image_url.replace('{width}', '540')  # Or any other width
        print(image_url)

This should handle the lazy-loading structure and give you the image URL. Remember to adjust the width as needed.
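The URL in the sample starts with ‘//’ (protocol-relative), so it will not download as-is outside a browser. A small sketch of a cleanup step, assuming https is the right scheme for the CDN. The function name normalize_image_url and its parameters are hypothetical:

```python
def normalize_image_url(data_src, width=540, scheme="https"):
    """Fill in the {width} placeholder and make a //host/path URL absolute."""
    url = data_src.replace("{width}", str(width))
    if url.startswith("//"):
        # Protocol-relative URL: prepend a scheme so it works outside a browser
        url = f"{scheme}:{url}"
    return url

sample = "//cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_{width}x.jpg?v=1234567890"
print(normalize_image_url(sample))
# → https://cdn.shopify.com/s/files/1/2345/6789/products/silver_pendant_540x.jpg?v=1234567890
```

URLs that already carry a scheme pass through unchanged apart from the width substitution.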

Hermione_Book · May 15, 2025, 4:37pm

I’ve dealt with this issue before when working on a Shopify-related project. The trick is to use a combination of BeautifulSoup and regular expressions. Here’s what worked for me:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_content, 'html.parser')
img_tag = soup.select_one('img.lazy-load.product-grid__image-main')

if img_tag and 'data-src' in img_tag.attrs:
    data_src = img_tag['data-src']
    # Drop everything from the {width} placeholder onward (including ?v=...)
    base_url = re.sub(r'\{width\}.*$', '', data_src)
    image_url = base_url + '540x.jpg'
    print(image_url)

This approach depends less on the surrounding markup: it selects the img tag directly with a CSS selector instead of going through the wrapper div, then uses regex to rebuild the URL. Note that it hard-codes the ‘.jpg’ extension and drops the ‘?v=’ query string, so it only fits JPEG images. The ‘540x’ size is arbitrary - you can adjust it based on your needs.

Remember to handle cases where the img tag might not be found to avoid errors. Hope this helps!
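For the "first image per product" part, here is a rough sketch that records None when a product has no img tag instead of raising. To keep it dependency-free it swaps BeautifulSoup for the standard library's html.parser, and it assumes every product sits in a div with the class ‘product-grid__item’ as in the sample. The names FirstImageParser and first_image_per_product are made up:

```python
from html.parser import HTMLParser

class FirstImageParser(HTMLParser):
    """Collect the first data-src seen inside each product-grid__item div."""

    def __init__(self):
        super().__init__()
        self.urls = []    # one entry per product; None if no image was found
        self._depth = 0   # div nesting depth inside the current product

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._depth:
                self._depth += 1          # nested div inside a product
            elif "product-grid__item" in classes:
                self._depth = 1           # a new product starts here
                self.urls.append(None)
        elif tag == "img" and self._depth and self.urls[-1] is None:
            # Only the first img per product is kept
            self.urls[-1] = attrs.get("data-src")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1              # reaches 0 when the product closes

def first_image_per_product(html_content):
    parser = FirstImageParser()
    parser.feed(html_content)
    parser.close()
    return parser.urls
```

The returned list has one entry per product in page order, so a None tells you which product was missing its image rather than crashing the whole run.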