Using regex patterns to transform HTML list elements into Jira formatting

TomDream42 · August 5, 2025, 6:01pm

I have a limited tool that can only do regex replacements using Boost library functions. No HTML parsing libraries are available. I need to convert HTML content to Jira markup and most formatting works fine with simple regex patterns.

The challenge is with mixed list types. Here’s what I’m working with:

<div>Ordered items:</div>
<ol>
    <li>First entry</li>
    <li>Second entry</li>
    <li>Third entry</li>
</ol>
<div>Unordered items:</div>
<ul>
    <li>Alpha point</li>
    <li>Beta point</li>
    <li>Gamma point</li>
    <li>Delta point</li>
</ul>

Target Jira output should be:

Ordered items:
    # First entry
    # Second entry
    # Third entry
Unordered items:
    * Alpha point
    * Beta point
    * Gamma point
    * Delta point

I can handle single list types easily, but I’m struggling with documents that contain both ordered and unordered lists. Can this conversion be achieved using only regex replacement patterns?

nina.k · August 19, 2025, 3:06am

Regex gets messy here because you need context - is each <li> inside <ol> or <ul>? You’d need multiple passes and it becomes fragile.

I hit this exact problem when our team needed to convert docs to wiki format. Tried regex first - total nightmare to maintain.

Automation saved us. Set up a flow that takes HTML input, parses it properly (no regex hacks), and spits out clean Jira markup. You can handle complex stuff like nested lists, mixed content, and weird edge cases without writing hundreds of regex patterns.

For your case, build a simple automation that:

Identifies list contexts properly
Maps <ol><li> to # format
Maps <ul><li> to * format
Handles spacing and structure
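
A rough sketch of what those four steps could look like without any regex, assuming flat, well-formed markup with attribute-free tags as in the sample above. The function name lists_to_jira is made up for illustration, and this is a small hand-written tag scanner, not a real HTML parser and not how any particular automation tool does it:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: tracks which list is currently open so each <li>
// gets the right marker. All other tags are dropped, their text is kept.
std::string lists_to_jira(const std::string& html) {
    std::string out;
    std::vector<char> open_lists;  // '#' for <ol>, '*' for <ul>, innermost last
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {
            out += html[i++];
            continue;
        }
        const std::size_t end = html.find('>', i);
        if (end == std::string::npos) {  // stray '<': keep the rest verbatim
            out.append(html, i, std::string::npos);
            break;
        }
        const std::string tag = html.substr(i + 1, end - i - 1);
        i = end + 1;
        const bool opens = (tag == "ol" || tag == "ul");
        const bool closes = (tag == "/ol" || tag == "/ul");
        if (opens) {
            open_lists.push_back(tag == "ol" ? '#' : '*');
        } else if (closes && !open_lists.empty()) {
            open_lists.pop_back();
        } else if (tag == "li" && !open_lists.empty()) {
            // One marker per open list, so nested items come out as "#*", "**", ...
            out.append(open_lists.begin(), open_lists.end());
            out += ' ';
        }
        // List wrapper tags sit on their own line in the sample; drop that line break.
        if ((opens || closes) && i < html.size() && html[i] == '\n') ++i;
    }
    return out;
}
```

Tags with attributes (for example <ol class="x">) and HTML entities are not handled here; a real parser would cover those.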

This scales when you need more HTML elements or different output formats. Plus you get validation, logging, and error handling.

Check out Latenode for building transformation flows like this. Way cleaner than wrestling with regex: https://latenode.com

wanderingWeasel · August 18, 2025, 2:14pm

Skip the regex headache and automate this conversion instead.

I hit the same wall with legacy docs. Regex works initially but turns into a nightmare once you add nested lists, broken HTML, or need different formats down the road.

Automated workflows handle HTML parsing way better. Feed it raw HTML, it reads the document structure properly (no regex guessing), and spits out clean Jira markup every time.

Here’s how it works:

Parse HTML input (zero regex)
Walk the DOM structure
Convert ol/li to numbered format
Convert ul/li to bullets
Handle spacing and indentation right

This scales when requirements change. Need nested lists? Done. Different markup? Just tweak the output mapping. Error handling’s built in.

Latenode makes these document workflows dead simple. Way cleaner than babysitting fragile regex: https://latenode.com

JollyMusic3 · August 16, 2025, 8:39am

Regex for this gets tricky because you need to preserve context. The main issue is telling <li> elements apart based on their parent containers. I’ve dealt with this converting old documentation - capturing groups that include the parent list type work best.

Don’t process <li> tags alone. Instead, grab the whole list structure: <ol>\s*((\s*<li>[^<]*</li>\s*)+)\s*</ol> for ordered lists. Then inside that captured group, replace each <li>([^<]*)</li> with # $1. Same thing for unordered lists but use * markers.

The key insight? Treat each complete list as one unit, not individual items. This stops different list types from interfering with each other and keeps context intact. You’ll need nested replacements or some intermediate steps, but it’s totally doable with Boost regex if you structure the patterns right.
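
A minimal sketch of the "whole list as one unit" idea, using std::regex as a stand-in on the assumption that Boost.Regex is driven the same way (boost::regex, boost::smatch, boost::regex_search and boost::regex_replace mirror these names). The helper convert_lists is hypothetical, and the outer pattern is a slightly simplified variant of the one above: it drops the doubled \s* inside the repeated group and keeps each item's indentation inside the capture.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites every <tag>...</tag> list block, replacing
// each of its items with "<marker> <item text>".
std::string convert_lists(const std::string& html, const std::string& tag,
                          const std::string& marker) {
    // Outer pattern: one whole list, with the run of <li> items in group 1.
    const std::regex list_re("<" + tag + R"(>\n?((?:\s*<li>[^<]*</li>)+)\s*</)" + tag + ">");
    // Inner pattern: one item, with its text in group 1.
    const std::regex item_re(R"(<li>([^<]*)</li>)");

    std::string out;
    std::string::const_iterator pos = html.cbegin();
    std::smatch m;
    while (std::regex_search(pos, html.cend(), m, list_re)) {
        out.append(pos, m[0].first);  // text before this list, unchanged
        // The inner replacement only ever sees items of this one list.
        out += std::regex_replace(m[1].str(), item_re, marker + " $1");
        pos = m[0].second;            // continue after the closing tag
    }
    out.append(pos, html.cend());     // whatever follows the last list
    return out;
}
```

Calling convert_lists(convert_lists(html, "ol", "#"), "ul", "*") handles both list types in turn. Note that this needs a loop around the matches, so it only fits tools that allow more than a plain pattern-and-replacement pair.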

Tom42Gamer · August 15, 2025, 10:12am

Yes, this is possible with regex, but it takes at least two passes to get right; I’ve implemented it successfully. First target the ordered list: remove the <ol> tags and convert its items by replacing <li>([^<]+)</li> with # $1. Then process the unordered list the same way, using * instead. The critical part is the sequencing, because a bare <li> pattern matches items in both list types. I found it helpful to mark list boundaries with temporary placeholders and process each list type independently before cleaning up the output. It’s more work than handling a single list type, but it’s manageable with Boost regex as long as you watch for whitespace and quirks in the HTML formatting.
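
One way the multi-pass idea might look as a plain ordered list of pattern/replacement pairs, which is the shape a replace-only tool needs. This sketch swaps the temporary placeholders for a lookahead that checks whether an item is followed only by further items and then </ol>. It uses std::regex as a stand-in for Boost.Regex (an assumption; the patterns stick to syntax both support), the name html_lists_to_jira is made up, and it assumes flat lists with no tags inside <li>.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Applies the rules in order, the way a replace-only tool would.
std::string html_lists_to_jira(std::string text) {
    static const std::vector<std::pair<std::regex, std::string>> rules = {
        // Pass 1: an <li> followed only by more <li> items and then </ol>
        // belongs to an ordered list.
        {std::regex(R"(<li>([^<]*)</li>(?=(?:\s*<li>[^<]*</li>)*\s*</ol>))"), "# $1"},
        // Pass 2: any <li> still left must belong to a <ul>.
        {std::regex(R"(<li>([^<]*)</li>)"), "* $1"},
        // Pass 3: drop the list wrapper tags together with their line break.
        {std::regex(R"([ \t]*</?[ou]l>[ \t]*\n?)"), ""},
        // Pass 4: unwrap the <div> captions.
        {std::regex(R"(<div>([^<]*)</div>)"), "$1"},
    };
    for (const auto& rule : rules) {
        text = std::regex_replace(text, rule.first, rule.second);
    }
    return text;
}
```

The order matters: pass 2 is only safe because pass 1 has already consumed every ordered item, and the wrapper tags must survive until pass 3 so the lookahead in pass 1 can still see </ol>.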

alexr1990 · August 14, 2025, 12:51pm

this is doable but pretty hacky with just regex. I’d grab the ol content first with (?<=<ol>)(.*?)(?=</ol>), then replace the li tags inside that match. Problem is boost regex can be weird with lookbehinds depending on your version. another option - mark each list type with temp tokens, process them separately, then clean up the tokens after. did this on a similar project and it worked, but maintaining it sucked.