JavaScript: Breaking down strings with HTML tags into individual characters and elements

I need help with parsing a string that has HTML markup mixed in. I want to break it down so that regular text gets split into individual letters, but HTML tags stay intact as complete elements.

let text = 'hel<em class="highlight">lo wo</em>rld';
console.log("parsing result: " + JSON.stringify(text.split(/(<[^>]*>)|/)));

This approach gives me:

["h",null,"e",null,"l","<em class=\"highlight\">","l",null,"o",null," ",null,"w",null,"o","</em>","r",null,"l",null,"d"]

After removing the null values, I get the desired output:

["h","e","l","<em class=\"highlight\">","l","o"," ","w","o","</em>","r","l","d"]

Is there a better regex pattern that can handle this parsing without creating those null entries that I have to clean up later?

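Give this regex a shot: text.split(/(?![^<>]*>)/). It splits at every position that is not inside a tag, and since it has no capture group there are no nulls to clean up. Careful with patterns that add a bare "split everywhere" alternative such as (?!$): that also matches inside the tag and chops the tag into single characters. The lookahead version assumes the plain text has no stray < or > in it. Honestly, though, your current way works too. Just filter out the nulls and move on. Parsing HTML with regex can be a mess anyway!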
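
A minimal sketch of a lookahead-only split, for anyone who wants to try it. The helper name splitOutsideTags is made up for illustration, and the pattern assumes well-formed tags and no literal < or > in the plain text.

```javascript
// Hypothetical helper: split at every position that is NOT inside a tag.
// (?![^<>]*>) fails when the next angle bracket ahead is a ">",
// which is only the case while we are between "<" and ">".
function splitOutsideTags(text) {
  return text.split(/(?![^<>]*>)/);
}

const lookaheadSample = 'hel<em class="highlight">lo wo</em>rld';
const lookaheadResult = splitOutsideTags(lookaheadSample);
console.log(JSON.stringify(lookaheadResult));
// ["h","e","l","<em class=\"highlight\">","l","o"," ","w","o","</em>","r","l","d"]
```

Because the match is zero-width and nothing is captured, split() has no extra values to add to the array.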

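Your regex produces nulls because split() adds the capture group's value to the result on every match, and the group (<[^>]*>) is undefined whenever the empty alternative matched instead; JSON.stringify prints undefined array entries as null. Quick fix: keep your pattern and add a filter, text.split(/(<[^>]*>)|/).filter(Boolean), which strips the undefined entries and any empty strings in one pass. (Don't drop the trailing | though, or the text between tags stays in whole chunks.) Or try [...text.matchAll(/(<[^>]*>)|(.)/g)].map(m => m[1] || m[2]) if you want more control over matching. I hit this same issue building a text parser for a markdown editor and matchAll handled complex nested tags way better.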
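
Both options side by side, as a rough sketch. The variable names are placeholders, and the s flag on the second pattern is an optional extra so that "." also matches newlines.

```javascript
const input = 'hel<em class="highlight">lo wo</em>rld';

// Option 1: same split pattern as in the question (note the trailing "|"),
// followed by filter(Boolean) to drop undefined entries and empty strings.
const viaSplit = input.split(/(<[^>]*>)|/).filter(Boolean);

// Option 2: matchAll with two groups. Each match sets exactly one of them:
// group 1 for a whole tag, group 2 for a single character.
const viaMatchAll = [...input.matchAll(/(<[^>]*>)|(.)/gs)].map(m => m[1] || m[2]);

console.log(JSON.stringify(viaSplit));
console.log(JSON.stringify(viaMatchAll));
// Both print:
// ["h","e","l","<em class=\"highlight\">","l","o"," ","w","o","</em>","r","l","d"]
```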

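split() copies the value of each capture group into the result, and with that alternation the group is empty on every match where the blank branch wins, which is why you’re getting nulls. Use text.match(/(<[^>]*>)|./g) instead of split. This matches complete HTML tags or individual characters, and with the g flag match() returns only the full matches, so the capture group never shows up in the output (the parentheses are optional here). You get clean results like ["h","e","l","<em class=\"highlight\">","l","o"," ","w","o","</em>","r","l","d"] right away. One thing to watch: match() returns null rather than an empty array when nothing matches, for example on an empty string. I’ve done this for text highlighting features - way cleaner than split plus cleanup.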
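
Wrapped up as a small function, the match approach could look like the sketch below. The name tokenize is hypothetical, the capture group is dropped because match() with the g flag ignores it anyway, and the s flag is an added assumption so newlines are kept as tokens too.

```javascript
// Hypothetical helper: whole tags stay intact, everything else is one character per entry.
function tokenize(text) {
  // Try a complete tag first; otherwise take any single character.
  // match() returns null when there is no match, so fall back to [].
  return text.match(/<[^>]*>|./gs) ?? [];
}

console.log(JSON.stringify(tokenize('hel<em class="highlight">lo wo</em>rld')));
// ["h","e","l","<em class=\"highlight\">","l","o"," ","w","o","</em>","r","l","d"]

console.log(JSON.stringify(tokenize("")));
// []
```

An unclosed "<" is simply treated as an ordinary character, since the tag branch only matches when a closing ">" is found.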