What's an effective regex for detecting URLs?

CodeQuickConnor · October 13, 2024, 6:47pm

I have a form with an input field that needs to recognize and extract URLs. I’m currently using this pattern:

var regex = /^(?:([A-Za-z]+):)?(\/)?([\w.-]+)(?::(\d+))?(?:\/(.*[^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var matchedUrl = inputText.match(regex);

While it works for URLs like http://www.google.com, it fails with www.google.com. I could use some guidance on improving my regex skills. Any suggestions?

QuantumShift · October 20, 2024, 9:24pm

Writing a single regular expression that handles every URL format is hard. Your current pattern is anchored with ^ and $, so it can only match when the entire input is one URL, which makes it a poor fit for pulling URLs out of longer text. As you found, it also trips over inputs that do not have the exact shape it expects, such as www.google.com. For extraction, use an unanchored pattern in which the scheme is optional.

Consider the following improved approach:

var regex = /(?:(?:https?|ftp):\/\/)?(?:www\.)?([\w.-]+\.[a-z]{2,6})(?:\/[^\s]*)?/ig;
var matchedUrls = inputText.match(regex);

Let’s break down this updated regex:

(?:(?:https?|ftp):\/\/)?: Non-capturing group for the scheme (http, https, or ftp) followed by ://. The trailing ? makes the whole group optional, so URLs without a scheme still match.
(?:www\.)?: Optionally matches the www. prefix.
([\w.-]+\.[a-z]{2,6}): Captures the domain name, including any subdomains, followed by a dot and a top-level domain of 2 to 6 letters (such as .com). Longer TLDs such as .photography are cut off after six letters; use {2,} if you need to allow them.
(?:\/[^\s]*)?: Optionally matches a path, running from the / after the domain up to the next whitespace character.

This pattern matches www.example.com, example.com, http://example.com, and URLs with a path such as example.com/path.
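As a rough sketch of how this pattern could be used for extraction, the snippet below wraps it in a small helper. The function name extractUrls and the sample text are made up for illustration; only the regex itself comes from this post.

```javascript
// Pattern from the post above; the g flag makes match() return every match.
const urlPattern = /(?:(?:https?|ftp):\/\/)?(?:www\.)?([\w.-]+\.[a-z]{2,6})(?:\/[^\s]*)?/ig;

// Hypothetical helper: returns all URL-like substrings found in text.
function extractUrls(text) {
  // With the g flag, match() returns full matches only (capture groups are
  // ignored) and returns null when nothing matches, so fall back to [].
  return text.match(urlPattern) || [];
}

// Made-up sample input.
const sample = "Visit www.example.com or http://example.org/docs today";
console.log(extractUrls(sample));
// The returned array holds "www.example.com" and "http://example.org/docs".
```

Falling back to an empty array means the caller can always loop over the result without a null check.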

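Keep in mind that no regex covers every edge case. This one also matches dotted strings that are not URLs, such as a file name like report.txt, and because the path runs to the next whitespace it will include trailing punctuation. Test it against a range of inputs and refine it as needed.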
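One way to do that kind of testing is to run the pattern over a small list of inputs and look at what comes back. This is only a sketch: the sample strings are invented, and the pattern is the one above with the g flag dropped so that each input is checked independently.

```javascript
// Same pattern as above, without the g flag (no lastIndex state between calls).
const urlPattern = /(?:(?:https?|ftp):\/\/)?(?:www\.)?([\w.-]+\.[a-z]{2,6})(?:\/[^\s]*)?/i;

// Hypothetical sample inputs, chosen to include some awkward cases.
const samples = [
  "www.example.com",
  "example.com/path",
  "http://example.com",
  "example.photography", // TLD longer than six letters
  "report.txt",          // a file name, not a URL
  "no links here",
];

// Record the full match for each input, or null when there is none.
const results = samples.map(function (input) {
  const m = input.match(urlPattern);
  return { input: input, match: m ? m[0] : null };
});

console.log(results);
```

With these inputs, example.photography comes back truncated to example.photog because of the {2,6} limit, and report.txt is reported as a match even though it is not a URL.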

ElbowCrane · October 22, 2024, 7:00am

Hey, try this simpler regex for URL extraction:

var regex = /((https?|ftp):\/\/)?(www\.)?([\w.-]+\.[a-z]{2,})(\/[^\s]*)?/ig;
var matchedUrls = inputText.match(regex);

This should match URLs with or without a scheme (http://, https://, or ftp://), bare domains, and URLs with paths. The {2,} quantifier puts no upper limit on the length of the top-level domain.
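Since this version uses capturing groups, you can also read the individual parts of each match. Here is a minimal sketch using matchAll(), which needs the g flag; the helper name parseUrls, the property names, and the sample text are all made up for illustration.

```javascript
// Pattern from this post. Groups: 1 = scheme with "://", 2 = scheme name,
// 3 = "www.", 4 = domain, 5 = path.
const urlPattern = /((https?|ftp):\/\/)?(www\.)?([\w.-]+\.[a-z]{2,})(\/[^\s]*)?/ig;

// Hypothetical helper: returns one object per URL-like match in text.
function parseUrls(text) {
  // matchAll() yields a match object per hit, with capture groups by index.
  return Array.from(text.matchAll(urlPattern), function (m) {
    return {
      url: m[0],
      scheme: m[2] || null, // "http", "https", "ftp", or null when absent
      host: m[4],           // domain without a leading "www."
      path: m[5] || null,   // null when the URL has no path
    };
  });
}

// Made-up sample input.
console.log(parseUrls("Docs at https://www.example.com/guide and example.photography"));
```

For this sample, the first entry has scheme "https", host "example.com", and path "/guide"; the second has host "example.photography" with no scheme or path.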