JavaScript regex to break string into individual characters while preserving HTML tags

I’m trying to parse a string that has HTML markup mixed with regular text. I want to separate it so each character becomes its own element, but keep the HTML tags intact as complete units.

let text = 'hel<em class="highlight">lo</em> world';
console.log("parsing result: " + JSON.stringify(text.split(/(<[^>]*>)|/)));

This outputs:

parsing result: ["h",null,"e",null,"l","<em class=\"highlight\">","l",null,"o","</em>"," ",null,"w",null,"o",null,"r",null,"l",null,"d"]

After removing null values I get the desired output:

final: ["h","e","l","<em class=\"highlight\">","l","o","</em>"," ","w","o","r","l","d"]

Is there a better regex pattern that can handle HTML elements as complete tokens while splitting everything else character by character without generating null values?

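Try a negative lookahead: text.split(/(?![^<]*>)/). It splits at every position that isn't inside a tag, so it should give you the tokens directly with no nulls (it does assume every > in the text closes a tag). Your current approach is solid too, though! Just add .filter(Boolean) to remove the nulls - works for me.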

You’re getting null values because your regex combines a capturing group with an empty alternative. When (<[^>]*>) matches an HTML tag, the capture is filled in. But when the empty second alternative matches between two regular characters, the group didn’t take part in the match, so split() inserts undefined for it, and JSON.stringify prints that as null.
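To see this in isolation, here is a small sketch on a shorter input (the string 'ab<i>c' is just an illustrative example, not from the question). It shows that the extra entries are really undefined, and only become null once the array is serialized:

```javascript
// Shorter, made-up input so the result is easy to follow.
const sample = 'ab<i>c';

// Same pattern as in the question: a capturing group OR an empty match.
const parts = sample.split(/(<[^>]*>)|/);

// The split between "a" and "b" used the empty alternative,
// so the capture group contributes undefined at index 1.
console.log(parts[1] === undefined); // true

// JSON.stringify serializes undefined array entries as null.
console.log(JSON.stringify(parts)); // ["a",null,"b","<i>","c"]
```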

Just use match() instead of split(). Try this:

let result = text.match(/<[^>]*>|./g);

This matches either complete HTML tags or any single character. match() only returns actual matches - no null values from unused capture groups. I’ve used this approach in several projects and it’s way cleaner than filtering afterwards or dealing with complex lookahead assertions.
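A slightly fuller sketch of the same idea, in case it helps. The helper name tokenize is made up for this example, and the s and u flags are optional additions that the one-liner above doesn't have: without s, the . skips line breaks, and without u, characters outside the Basic Multilingual Plane (many emoji, for instance) get split into two halves.

```javascript
// tokenize is a hypothetical helper name, not something from the thread.
function tokenize(input) {
  // g: collect every match
  // s: let "." match line breaks as well
  // u: match by code point, so a surrogate pair stays one token
  // match() returns null when nothing matches (e.g. empty input),
  // so fall back to an empty array.
  return input.match(/<[^>]*>|./gsu) ?? [];
}

const text = 'hel<em class="highlight">lo</em> world';
console.log(JSON.stringify(tokenize(text)));
// ["h","e","l","<em class=\"highlight\">","l","o","</em>"," ","w","o","r","l","d"]
```

A stray < with no closing > simply falls through to the single-character branch, so it comes out as its own token.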

The problem is how split() works with capturing groups. It adds the captured parts to the result array, but also inserts undefined (shown as null by JSON.stringify) wherever the capture group didn’t take part in the match. Alexlee’s right - match() is way cleaner.

But if you’re stuck with split(), you could try: text.split(/(?=<[^>]*>)|(?<=<[^>]*>)|(?![^<]*>)/).filter(s => s !== ''). Fair warning though - this gets ugly fast, and lookbehind assertions don’t work in all browsers.

I’ve hit this same issue parsing HTML before. match() scales much better when you need to handle self-closing tags or nested elements down the road.
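On the self-closing and nested tags point, here is a rough sketch of how the match() pattern could grow. The names TOKEN_RE and tokenizeHtml are invented for this example. The one real change is that quoted attribute values are allowed to contain >, which the simpler <[^>]*> pattern would cut short. It's still a regex, not an HTML parser, so comments, CDATA and malformed markup are out of scope.

```javascript
// TOKEN_RE and tokenizeHtml are hypothetical names for this sketch.
// A tag is "<", then any mix of quoted attribute values and other
// characters that are neither quotes nor ">", then ">".
// Anything else is matched one character at a time.
const TOKEN_RE = /<(?:"[^"]*"|'[^']*'|[^'">])*>|./gs;

function tokenizeHtml(input) {
  // match() with a global regex returns null when nothing matches.
  return input.match(TOKEN_RE) ?? [];
}

// Self-closing and nested tags need no special handling:
console.log(JSON.stringify(tokenizeHtml('<p>a<br/>b</p>')));
// ["<p>","a","<br/>","b","</p>"]

// A ">" inside a quoted attribute no longer ends the tag early:
console.log(JSON.stringify(tokenizeHtml('<a title="x > y">z</a>')));
// ["<a title=\"x > y\">","z","</a>"]
```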