JavaScript regex capture groups not working as expected for HTML parsing

emmat83 · June 12, 2025, 3:24am

I’m trying to extract URL values from anchor tags using regular expressions in JavaScript but getting weird results.

Here’s my test case:

var htmlContent = 'Sample text here\n<p><a href="https://example.com/page1">First Link</a></p>\n<section><a href="https://example.com/page2">Second Link</a></section>';

console.log((/href="([^"]*)"/gmi).exec(htmlContent));
console.log(htmlContent.match(/href="([^"]*)"/gmi));

The regex pattern href="([^"]*)" should capture the URL part inside the quotes. When I test this pattern online it works fine and shows the captured groups properly.

But in JavaScript the behavior is different. The exec method only returns the first match with its capture group, while the match method returns all matches but without the capture groups.

How can I get all the URLs extracted properly? What’s the correct way to handle this in JavaScript?

Tom42Gamer · June 22, 2025, 3:35am

Had this exact problem scraping product links from e-commerce sites. JavaScript’s regex handling with the global flag is honestly confusing compared to other languages. I switched to matchAll() like Sophia said, but if you’re on an older JS engine that doesn’t support it, there’s a workaround. Use replace() as a hack: it calls a function for each match and passes the capture groups as parameters. Declare var urls = []; first, then call htmlContent.replace(/href="([^"]*)"/gmi, function(match, url) { urls.push(url); return match; }) and ignore the string it returns. It’s hacky but works in every browser. The pattern itself is fine, which is why it looks right in online testers; what differs is how exec() and match() report their results when the global flag is set.
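For anyone who wants to try the replace() workaround, here is a minimal sketch using the sample string from the question. The urls array is just a name picked for this example, and the callback returns the match unchanged only so the throwaway result string stays sensible.

```javascript
// Sample input copied from the original question.
var htmlContent = 'Sample text here\n<p><a href="https://example.com/page1">First Link</a></p>\n<section><a href="https://example.com/page2">Second Link</a></section>';

// Collector array (name chosen for this sketch); it must exist before replace() runs.
var urls = [];

// replace() invokes the callback once per match. The first argument is the
// full match, the second is capture group 1 (the URL between the quotes).
htmlContent.replace(/href="([^"]*)"/gmi, function (match, url) {
  urls.push(url);
  return match; // keep the text unchanged; the returned string is ignored anyway
});

console.log(urls);
// [ 'https://example.com/page1', 'https://example.com/page2' ]
```

Since strings are immutable, htmlContent itself is never modified; replace() only builds a new string, which this sketch discards.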

sofiag · June 19, 2025, 3:28pm

Yeah, this trips up a lot of people with JavaScript regex. The problem is exec() and match() handle the global flag totally differently. With exec() and the global flag, you only get one match per call, but the regex remembers where it left off in its lastIndex property. You have to call it over and over in a loop to grab everything. match() with the global flag grabs all matches at once but throws away your capture groups. Easiest fix? Use a while loop with exec(). If the regex has been used before, reset it first by setting lastIndex to 0, then loop until exec() returns null. Each round gives you the full match plus capture groups. I’ve done this tons of times parsing HTML attributes and it works like a charm. Just make sure you store your regex in a variable instead of writing the literal inside the loop condition. An inline literal creates a fresh regex on every iteration, so lastIndex starts at 0 each time, exec() keeps returning the first match, and the loop never ends.
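A rough sketch of that loop, again with the sample string from the question. The names hrefPattern, urls, and match are just picked for this example.

```javascript
// Sample input copied from the original question.
var htmlContent = 'Sample text here\n<p><a href="https://example.com/page1">First Link</a></p>\n<section><a href="https://example.com/page2">Second Link</a></section>';

// Keep the regex in a variable so its lastIndex state survives between calls.
var hrefPattern = /href="([^"]*)"/gmi;

// Only needed if the regex was used earlier; harmless on a fresh one.
hrefPattern.lastIndex = 0;

var urls = [];
var match;

// With the g flag, each exec() call resumes at lastIndex and returns null
// once there are no more matches (which also resets lastIndex to 0).
while ((match = hrefPattern.exec(htmlContent)) !== null) {
  urls.push(match[1]); // match[0] is the full match, match[1] is the URL
}

console.log(urls);
// [ 'https://example.com/page1', 'https://example.com/page2' ]
```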

Sophia63 · June 18, 2025, 6:31pm

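yeah, exec() only returns one match per call. for all matches, try matchAll() - way easier! just use Array.from(htmlContent.matchAll(/href="([^"]*)"/gmi)) to get every match with its capture groups at once, then read index 1 of each match for the URL. the regex needs the g flag or matchAll() throws a TypeError. much simpler than looping with exec().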
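here's roughly what that looks like end to end, assuming an environment with matchAll() support (ES2020 or later). matches and urls are just names for this example.

```javascript
// Sample input copied from the original question.
const htmlContent = 'Sample text here\n<p><a href="https://example.com/page1">First Link</a></p>\n<section><a href="https://example.com/page2">Second Link</a></section>';

// matchAll() returns an iterator of match arrays; the g flag is required.
const matches = Array.from(htmlContent.matchAll(/href="([^"]*)"/gmi));

// Each match array holds the full match at index 0 and capture group 1 at index 1.
const urls = matches.map((m) => m[1]);

console.log(urls);
// [ 'https://example.com/page1', 'https://example.com/page2' ]
```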