How to implement collapsible quoted content feature in email thread viewer

I’m building a web app that shows email conversations in a threaded view. The emails come from different email programs and can be either plain text or HTML format.

Since most users reply at the top of emails, I want to hide the repeated quoted content like Gmail does with their “show quoted text” link.

Figuring out what part is the new reply versus the old quoted stuff is tricky. I use "> " markers when I quote text, so I made a regex to find those lines and wrap them in a div that JavaScript can hide or show.

But then I saw that Outlook doesn’t use "> " markers. Instead it just adds a header block with info like From, Subject, Date above the quoted part. The quoted text itself stays normal. I can detect this pattern and hide everything after it.

I also checked Thunderbird and it uses "> " for plain text emails but uses blockquote tags for HTML emails. I haven’t tested Apple Mail, Lotus Notes, or tons of other email clients yet.

Am I going to need a different regex pattern for every email client? Or is there a better approach I’m not seeing?

Would love any advice, code examples, or suggestions for existing libraries that solve this problem!

I built something similar last year and quickly learned that handling every email client’s quirks individually is a nightmare. Skip the pattern matching - I went with a heuristic approach that looks for duplicate content within the email thread. I compare the current email against previous messages using a similarity algorithm. When content matches above a certain threshold (85% worked well for me), I mark it as quoted material. This catches most cases whether it’s Gmail’s format, Outlook’s headers, or Thunderbird’s blockquotes. The key insight: quoted content is about repetition, not formatting. You could combine this with your existing regex patterns as a fallback. Start with format detection for common cases, then use content similarity for everything else. It’s not perfect but handles edge cases way better than trying to enumerate every possible email client format.

Honestly, machine learning might be overkill here, but I trained a simple classifier on email samples to identify quote boundaries. Worked surprisingly well after I fed it a few hundred examples from different clients. The model caught patterns I never would’ve noticed manually - subtle spacing changes and signature blocks that mark where quotes start. Way less maintenance than managing dozens of regex patterns.

Try using Message-ID and References headers if you can access them. Most email clients keep these intact even when they mess up the quoted formatting. You can match these against your thread structure to figure out what’s a response vs original content. I built this into a corporate email system and it worked way better than just analyzing content. Build the message relationship tree first, then use that context for quote detection. Combine it with basic patterns like “On [date] wrote:” that show up across different clients and you’ll get solid coverage. Regex works, but it’s better as one signal among many rather than your main approach.