Extracting article publication dates from web pages using Google Sheets

I’m working on a project to get publication dates for news articles using Google Sheets. It’s tricky because some websites don’t show the date clearly. But I found that the info is often hidden in the page source.

Here’s what I’ve tried so far:

=ImportXml(A1, "//meta[@itemprop='datePublished']/@content")

This works for some sites, but not all. The problem is that different websites use different formats. For example, some use:

<meta property="article:published_time" content="2018-10-16T00:03:35+0100" />

When I try my formula on these, I get an error saying “Imported content is empty”.

I noticed that most of these tags have the word “published” in them. Is there a way to make my formula find any tag with “published” in it? That way, it could work for more websites.

Any help would be awesome! I’m stuck and not sure how to make this work for all the different formats out there.

Having worked on similar projects, I can suggest an alternative approach that might be more robust. Instead of relying on specific meta tags, you could use a broader XPath query combined with regex. Try this formula:

=REGEXEXTRACT(IMPORTXML(A1,“//meta[contains(@property, ‘published’) or contains(@itemprop, ‘datePublished’)]/@content”), “\d{4}-\d{2}-\d{2}”)

This searches for meta tags with ‘published’ in the property attribute or ‘datePublished’ in the itemprop attribute, then extracts the date. It’s more flexible and should work across various site structures. If this doesn’t catch all cases, you might need to expand the XPath query further or use multiple formulas for different date formats.

I’ve dealt with a similar challenge in my work, and I can share what worked for me. Instead of relying solely on ImportXml, I found success using a combination of ImportXml and RegexExtract. Here’s the formula I use:

=REGEXEXTRACT(IMPORTXML(A1,“//text()”), “(\d{4}-\d{2}-\d{2})”)

This approach looks for any date in the YYYY-MM-DD format within the entire page content. It’s more flexible and works across various sites, regardless of their specific meta tag structure.

For sites with different date formats, you might need to adjust the regex pattern. But this method has been quite reliable in my experience, catching dates from both visible content and hidden metadata.

Remember to handle errors gracefully, as some pages might not have a date that matches your pattern. You could wrap this in an IFERROR function to manage those cases.

hey alexj, i feel ur pain! ive wrestled with this too. have u tried using regex with importxml? something like:

=REGEXEXTRACT(IMPORTXML(A1, “//text()”), “\d{4}-\d{2}-\d{2}”)

this looks for dates anywhere on the page. it’s not perfect but might catch more than just meta tags. good luck!