I’m working on a text matching feature where I need to check if one string contains another, but I’m running into issues with Unicode characters that look the same but have different code points.
For example, I have these two strings:
Pattern to find: A
Text to search: Аpple
The first character in “Apple” looks like a regular Latin “A” but it’s actually a Cyrillic character with code point 1040, while the search pattern “A” is Latin with code point 65.
let searchTerm = 'A';
let targetText = 'Аpple'; // First char is Cyrillic А
console.log(searchTerm.codePointAt(0)); // 65 (Latin A)
console.log(targetText.codePointAt(0)); // 1040 (Cyrillic А)
console.log(targetText.includes(searchTerm)); // false
The regular includes() method fails because these are technically different characters, even though they appear identical. I believe these are called homoglyphs. Is there a native JavaScript solution to handle this type of character matching without creating a manual mapping table?
Start with String.normalize() - normalize both strings to the same form (NFC or NFD) before comparing. This handles canonically equivalent sequences, such as a precomposed é versus e followed by a combining accent. It won’t fix your Cyrillic А example though, since Cyrillic А and Latin A are distinct characters and normalization never maps one to the other.
I hit this same problem in a document search system. What worked was a simple character substitution function that runs before comparison. Map problematic characters to their common equivalents - Cyrillic А becomes Latin A, Greek Α becomes Latin A, etc.
You don’t need perfect homoglyph detection. Just focus on characters that show up in your actual data. I ran analytics on search failures and built the mapping piece by piece. Started with maybe 20 character pairs, which covered 90% of real cases. For implementation, just use String.replace() with your character map before the includes() check. Fast, reliable, and you control what gets matched.
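A minimal sketch of that approach, assuming a tiny starter map. The names HOMOGLYPHS, fold and includesFolded are made up for illustration, and the six pairs are only examples, not a complete list:

```javascript
// Hypothetical starter map: confusable character -> Latin equivalent.
// Extend it with whatever actually shows up in your own data.
const HOMOGLYPHS = {
  '\u0410': 'A', // Cyrillic А
  '\u0391': 'A', // Greek Α
  '\u0430': 'a', // Cyrillic а
  '\u0415': 'E', // Cyrillic Е
  '\u041E': 'O', // Cyrillic О
  '\u0440': 'p', // Cyrillic р
};

// One character class built from the map keys.
const HOMOGLYPH_RE = new RegExp(`[${Object.keys(HOMOGLYPHS).join('')}]`, 'gu');

function fold(str) {
  // NFC first so composed and decomposed forms agree, then swap lookalikes.
  return str.normalize('NFC').replace(HOMOGLYPH_RE, (ch) => HOMOGLYPHS[ch]);
}

function includesFolded(text, term) {
  return fold(text).includes(fold(term));
}

// normalize() fixes composed vs decomposed forms...
console.log('\u00E9' === 'e\u0301');                      // false
console.log('\u00E9' === 'e\u0301'.normalize('NFC'));     // true
// ...but not cross-script lookalikes; the map handles those.
console.log('\u0410pple'.normalize('NFC').includes('A')); // false
console.log(includesFolded('\u0410pple', 'A'));           // true
```

Both sides go through the same fold(), so it doesn't matter whether the lookalike is in the search term or in the text.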
Been dealing with this exact headache for years. Manual character mapping works but becomes a maintenance nightmare at scale.
The real problem isn’t solving it once - it’s keeping your solution updated when users find new ways to break text matching with creative homoglyph combinations.
You need a system that handles preprocessing automatically and adapts over time. Keep your JavaScript clean while the heavy lifting happens elsewhere.
Set up Latenode for text processing workflows. Feed it your search terms and target text - it handles Unicode normalization, homoglyph detection, and character mapping, then returns clean strings your JavaScript can actually use.
Best part? Update homoglyph rules without touching your main code. Plus you get logging to see which character combos are breaking things in production.
You’ve got a homoglyph problem - super common with internationalization. Manual mapping gets messy quick.
I’ve hit this in production before. Best fix? Build a preprocessing pipeline that handles Unicode normalization and homoglyph detection before you even start searching.
Here’s what works:
1. Grab your input strings
2. Run Unicode normalization (NFD/NFC)
3. Apply homoglyph mapping with a solid database
4. Get back normalized strings you can actually compare
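Those steps can be sketched in plain JavaScript as a list of small string-to-string functions. Everything here is an assumption for illustration: the step names, the four-entry CONFUSABLES map standing in for a real database, and the choice of NFC:

```javascript
// Stand-in for a real homoglyph database (illustrative entries only).
const CONFUSABLES = new Map([
  ['\u0410', 'A'], // Cyrillic А
  ['\u0391', 'A'], // Greek Α
  ['\u0421', 'C'], // Cyrillic С
  ['\u0435', 'e'], // Cyrillic е
]);

// Step: Unicode normalization.
const normalizeUnicode = (s) => s.normalize('NFC');

// Step: homoglyph mapping. for...of walks code points, so characters
// outside the BMP are not split into surrogate halves.
const mapHomoglyphs = (s) => {
  let out = '';
  for (const ch of s) out += CONFUSABLES.get(ch) ?? ch;
  return out;
};

// The pipeline is just an ordered list; add or reorder steps here.
const STEPS = [normalizeUnicode, mapHomoglyphs];

const preprocess = (input) => STEPS.reduce((acc, step) => step(acc), input);

function containsTerm(text, term) {
  return preprocess(text).includes(preprocess(term));
}

console.log(preprocess('\u0410pple'));        // "Apple"
console.log(containsTerm('\u0410pple', 'A')); // true
console.log(containsTerm('\u0410pple', 'B')); // false
```

Keeping each step separate means a case-folding or whitespace step can be added later without touching the search code.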
The pain point is keeping your homoglyph database current and handling edge cases across languages. Building from scratch means wrestling with Unicode complexity, performance issues, and constantly updating as new homoglyphs pop up.
Skip the manual coding - use Latenode to automate it. Set up a workflow that processes your text through normalization steps, handles homoglyph mapping, and sends clean results back to your JavaScript app.
Keeps your main code clean while giving you enterprise-level text processing without the maintenance nightmare.
Hit this same problem building search for a multilingual app last year. The Unicode normalization idea works, but try JavaScript’s built-in localeCompare() first. Use string1.localeCompare(string2, undefined, { sensitivity: 'base' }) - it ignores case and accents. Keep in mind it compares whole strings, so it tells you whether two strings match, not whether one contains the other. It also won’t catch Cyrillic-Latin homoglyphs since they’re different scripts.
What worked best for me was Intl.Collator with numeric and ignorePunctuation options, plus a small homoglyph map for common cases like Cyrillic А and Greek Α mapping to Latin A. You don’t need a huge database - just map the top 50-100 confusable characters you’ll actually see. Cache the normalized versions and performance is fine. Way easier to maintain than adding external dependencies for this.
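Roughly what that combination could look like, as a sketch. The 'en' locale, the two-entry map, the unbounded Map cache, and the names foldCached and equalsLoosely are all assumptions for the example. It does whole-string matching only, since the collator has no substring search:

```javascript
// 'base' sensitivity treats case and accent differences as equal.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

console.log(collator.compare('a', '\u00C1') === 0); // true: "a" vs "Á"
console.log(collator.compare('A', '\u0410') === 0); // false: Latin A vs Cyrillic А

// Hypothetical small map for the cross-script cases the collator does not cover.
const CONFUSABLE_TO_LATIN = new Map([
  ['\u0410', 'A'], // Cyrillic А
  ['\u0391', 'A'], // Greek Α
]);

// Simple unbounded cache of folded strings; cap or evict in a real app.
const cache = new Map();

function foldCached(str) {
  let folded = cache.get(str);
  if (folded === undefined) {
    folded = Array.from(str, (ch) => CONFUSABLE_TO_LATIN.get(ch) ?? ch).join('');
    cache.set(str, folded);
  }
  return folded;
}

// Fold the lookalikes first, then let the collator handle case and accents.
function equalsLoosely(a, b) {
  return collator.compare(foldCached(a), foldCached(b)) === 0;
}

console.log(equalsLoosely('\u0410pple', 'APPLE')); // true
console.log(equalsLoosely('\u0410pple', 'Maple')); // false
```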
just use regex with the unicode flag and normalize first. text.normalize('NFD').replace(/[\u0410]/gu, 'A') works for the cyrillic А. i did something similar for user search - manually replaced the worst offenders. way simpler than intl collator and works fine for most cases.