JavaScript regex to break string into individual characters while preserving HTML tags

Harry47 · June 21, 2025, 6:29pm

I’m trying to parse a string that has HTML markup mixed with regular text. I want to separate it so each character becomes its own element, but keep the HTML tags intact as complete units.

let text = 'hel<em class="highlight">lo</em> world';
console.log("parsing result: " + JSON.stringify(text.split(/(<[^>]*>)|/)));

This outputs:

parsing result: ["h",null,"e",null,"l","<em class=\"highlight\">","l",null,"o","</em>"," ",null,"w",null,"o",null,"r",null,"l",null,"d"]

After removing null values I get the desired output:

final: ["h","e","l","<em class=\"highlight\">","l","o","</em>"," ","w","o","r","l","d"]

Is there a better regex pattern that can handle HTML elements as complete tokens while splitting everything else character by character without generating null values?

SoaringEagle · July 2, 2025, 11:33am

Try a lookahead assertion that only splits outside of tags: text.split(/(?![^<>]*>)/). Might work better, but your current approach is solid too! Just add .filter(Boolean) to remove the nulls. Works for me.
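For reference, here is the filter(Boolean) route applied to the sample string from the question. It is a minimal sketch, and the variable names are only illustrative. Besides the undefined entries, filter(Boolean) also drops the empty strings that split() emits before a leading tag or between two adjacent tags.

```javascript
// Sample input from the question.
const text = 'hel<em class="highlight">lo</em> world';

// The capture group keeps tags in the output. Positions where the empty
// alternative matched contribute undefined, which filter(Boolean) removes.
const tokens = text.split(/(<[^>]*>)|/).filter(Boolean);

console.log(JSON.stringify(tokens));

// Adjacent tags produce empty strings, which are removed as well.
const adjacent = '<b><i>x'.split(/(<[^>]*>)|/).filter(Boolean);
console.log(JSON.stringify(adjacent)); // ["<b>","<i>","x"]
```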

alexlee · July 1, 2025, 12:59pm

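You’re getting null values because your regex combines a capturing group with an empty alternative. When (<[^>]*>) matches an HTML tag, split() adds the captured tag to the result. But when the empty second alternative matches at a boundary between characters, the group doesn’t participate, so split() adds undefined for it, and JSON.stringify prints that as null.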
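A quick way to check this in a browser console or Node, using the sample string from the question (the variable names are only illustrative):

```javascript
const text = 'hel<em class="highlight">lo</em> world';
const parts = text.split(/(<[^>]*>)|/);

// The array holds undefined, not null, wherever the group did not participate.
console.log(parts.includes(undefined)); // true
console.log(parts.includes(null));      // false

// JSON.stringify is what turns undefined array entries into null.
console.log(JSON.stringify([undefined])); // [null]
```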

Just use match() instead of split(). Try this:

let result = text.match(/<[^>]*>|./g);

This matches either complete HTML tags or any single character (except line breaks, unless you add the s flag). match() only returns actual matches - no null values from unused capture groups. I’ve used this approach in several projects and it’s way cleaner than filtering afterwards or dealing with complex lookahead assertions.
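Wrapped in a small helper, the same idea might look like the sketch below. The name tokenize and the extra s and u flags are assumptions added here, not part of the one-liner above: s lets . match line breaks, and u keeps characters outside the Basic Multilingual Plane (such as emoji) in one piece.

```javascript
// Hypothetical helper name; the pattern is the one-liner above plus s and u flags.
function tokenize(html) {
  // The tag alternative comes first, so "<...>" is consumed whole
  // before "." gets a chance to match the "<" on its own.
  // match() with the g flag returns null when nothing matches (empty input),
  // so fall back to an empty array.
  return html.match(/<[^>]*>|./gsu) ?? [];
}

console.log(JSON.stringify(tokenize('hel<em class="highlight">lo</em> world')));
console.log(JSON.stringify(tokenize(''))); // []
```

This is still a tokenizer for simple markup, not an HTML parser: a > inside an attribute value or a comment would end the tag token early.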

Tom42Gamer · June 28, 2025, 6:52pm

The problem is how split() works with capturing groups. It adds the captured parts to the result array, but also throws in undefined entries (shown as null by JSON.stringify) where the capture group didn’t match. alexlee’s right - match() is way cleaner. But if you’re stuck with split(), you could try: text.split(/(?<!<[^>]*)(?=.)/).filter(s => s !== ''). Fair warning though - this gets ugly fast, and lookbehind assertions don’t work in older browsers. I’ve hit this same issue parsing HTML before. match() scales much better when you need to handle self-closing tags or nested elements down the road.
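One way to read the "scales better" point: with a match-based pattern, each token can be labeled as it is found. The sketch below is only an illustration. The helper name classify and the { type, value } shape are assumptions, and it remains a rough tokenizer rather than a real HTML parser (comments or a > inside an attribute value would break it).

```javascript
// Hypothetical helper; the { type, value } token shape is an assumption.
function classify(html) {
  // Group 1 captures a whole tag; otherwise "." matches one character.
  const pattern = /(<[^>]*>)|./gsu;
  return Array.from(html.matchAll(pattern), (m) => {
    const tag = m[1]; // undefined when "." matched instead of a tag
    if (tag === undefined) return { type: 'char', value: m[0] };
    if (tag.startsWith('</')) return { type: 'close', value: tag };
    if (tag.endsWith('/>')) return { type: 'self-closing', value: tag };
    return { type: 'open', value: tag };
  });
}

const types = classify('a<br/><b>c</b>').map((t) => t.type);
console.log(JSON.stringify(types));
// ["char","self-closing","open","char","close"]
```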