How can I automatically extract country codes from domain names in spreadsheet software?

I’m using a spreadsheet that has a variety of domain names, and I’m trying to find a way to automatically pull out the country codes if they exist. The country codes are not consistently placed within the URLs, which makes it hard to do this by hand.

Here are some example domains and what I need:

I have tried text splitting functions, but due to the varying locations of the country codes, this method does not work well. Some codes show up at the start, some at the end, and others might not have a code at all.

Is there a formula or function that can automatically recognize and extract these country codes from the domain names? I would appreciate a solution that I can use in Google Sheets or Excel without needing to manually check each one.

u could try using the RIGHT formula with a country code lookup table! Just grab the last 2-3 chars from each domain and match those. its simpler than regex, works for most like .uk, .fr, or .ru. not perfect but itll cover 80% of cases.

The Problem:

You’re trying to extract country codes from a list of domain names in a spreadsheet, but the codes aren’t consistently located within the URLs, making manual extraction impractical. Spreadsheet formulas are proving insufficient due to the variable positions of the country codes.

:thinking: Understanding the “Why” (The Root Cause):

Extracting country codes from inconsistently formatted domain names is challenging for spreadsheet formulas because they rely on structured, predictable data. Regular expressions (regex) offer a more flexible approach to pattern matching, handling variations in code placement. Spreadsheet functions like LEFT, RIGHT, MID, and FIND can work for simpler cases, but a more robust and scalable solution involves automated processes that can handle exceptions and complex patterns more effectively. Simple formulas struggle with the inherent variability in domain name structure.

:gear: Step-by-Step Guide:

The most efficient solution involves automating the process using a workflow tool that can handle pattern matching and data extraction. This approach surpasses the limitations of spreadsheet formulas, providing a more robust and scalable solution. Here’s how you can implement this:

Step 1: Choose an Automation Tool:

Select a visual automation tool that allows you to connect to spreadsheets, process data using logic, and write results back. This could be a tool like the one mentioned in the original post (https://latenode.com), or other similar automation platforms. Many offer free trials or community editions.

Step 2: Create the Workflow:

  1. Connect to Spreadsheet: Establish a connection between your chosen automation tool and your Google Sheet (or Excel file) containing the list of domain names.
  2. Data Extraction: Use the tool’s functionalities to extract data from the spreadsheet. This would generally involve a loop to process each domain name individually.
  3. Pattern Matching: Implement a pattern-matching function, possibly using regular expressions. This function needs to identify the country codes in the domain names, despite their varying locations. You’ll likely need a set of rules or a lookup table. For example, the regex \.([a-z]{2})$ might match two-letter codes at the end of the domain (e.g., .uk, .fr). More complex rules can be used to find codes like .co.uk or .com.br.
  4. Handle Exceptions: Account for edge cases such as domains without country codes or domains with unexpected formats. Define a default output (e.g., “NA”) for situations where no country code is found.
  5. Write Results: Use the tool’s functionality to write the extracted country codes back into your spreadsheet.

Step 3: Test and Refine:

After setting up your workflow, run it on a small subset of your data to test its accuracy. Adjust your pattern matching rules or exception handling as needed. Once satisfied, you can then process your entire dataset.

:mag: Common Pitfalls & What to Check Next:

  • Regex Accuracy: Ensure that your regular expressions correctly identify the country codes in diverse domain name formats. Consider testing your regex against sample data to ensure that it correctly identifies the target patterns and avoids false positives or negatives.
  • Exception Handling: Test your handling of edge cases and unexpected domain name formats to ensure your workflow gracefully handles all input and avoids errors.
  • Automation Tool Limitations: Familiarize yourself with the limitations and capabilities of your chosen automation tool. The specific functions, libraries, and data handling features will affect how you design your workflow.

:speech_balloon: Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!

Regex works great for this. I’ve done similar domain parsing when analyzing traffic data across different regions. Country codes follow pretty predictable patterns even when they move around. Most codes show up as either the final TLD (.fr, .uk, .ru) or second-to-last in compound domains (.co.uk, .com.br, .com.cn). You can build a formula that checks both spots. I used REGEXEXTRACT with nested IF statements in Google Sheets. For domains ending in known two-letter country codes, grab the last segment. For compound cases like .co.uk or .com.br, target the middle segment. The formula gets messy but handles most edge cases. Main issue is keeping an accurate list of valid country codes vs generic TLDs like .com or .org. I cross-referenced against ISO country code lists to avoid false positives.

I ran into this exact issue when analyzing website portfolios across different markets. Country codes don’t follow consistent patterns, so you can’t just use one extraction rule. Here’s what worked: I built a lookup table in Excel with all valid country codes first. Then I used MID, RIGHT, and FIND functions to check different positions in each domain, scanning right to left. The formula checks if the last segment matches a country code, then looks for compound patterns. So bbc.co.uk pulls ‘uk’ from after the ‘co.’ part, and sina.com.cn grabs ‘cn’ from the end. Generic domains like cnn.com get flagged as no country code since ‘com’ isn’t in my reference table. The trick is using actual ISO country codes in your lookup table instead of assuming any two letters work. Prevents false matches with generic extensions like ‘.tv’ or ‘.me’. Takes time to set up initially, but once it’s running, it’ll handle hundreds of domains without issues.

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.