How to implement email quote detection and hiding functionality in web mail archive

I’m building a web app that shows email conversations in a threaded view. The emails come from different email programs and can be either plain text or HTML format.

Since most users reply at the top of emails, I want to hide the repeated quoted content similar to what Gmail does with their “show quoted text” feature.

Figuring out which part is the original message versus the reply is tricky. I use "> " markers when I quote text in replies. I made a regex pattern to find these lines and wrap them in a div so JavaScript can hide or show the quoted section.

Then I saw that Outlook works differently. It doesn’t use "> " by default. Instead it adds header info above the quoted message with details like From, Subject, and Date. The original message stays unchanged. I can detect this pattern and hide everything after it, assuming it’s a top-posted reply.

I also checked Thunderbird. It uses "> " for plain text emails but uses <blockquote> tags for HTML emails. I haven’t tested Apple Mail, Lotus Notes, or other email clients yet.

Am I going to need a separate regex pattern for each email client? Or is there a better approach I’m not seeing?

Looking for advice, code examples, or existing libraries that handle this!

I’ve dealt with this exact problem building a corporate email archive system. Don’t rely on just one approach - combine multiple heuristics for way better results. Here’s what I learned: quote detection isn’t just about finding markup patterns. You need to understand email structure. I built a scoring system that weighs different signals together - header patterns like “On [date] wrote:”, indentation shifts, line prefixes, and HTML font styling. The game-changer was realizing most email clients preserve structural elements even without standard quote markers. Take Outlook - it might not add "> " but it keeps consistent spacing and usually includes attribution lines. I also check for duplicate content across threads to catch cases where pattern matching breaks. This approach scales much better than client-specific regex since it adapts to variations within the same client.

Machine learning works surprisingly well here. I built something using basic pattern recognition that looks at structural elements instead of client-specific stuff. The trick is focusing on metadata patterns - timestamps, email addresses, signature blocks - that show up consistently across different clients. Train a simple classifier on emails from your target clients and it’ll catch quote boundaries pretty accurately. Start with features like line length variance, email headers, and indentation patterns. This scales way better than maintaining regex patterns for every client, especially when people switch email programs or use weird configurations.

i totally get it, it can be a nightmare! quotequail’s pretty good for this. using a library will save you from creating a mess with regex patterns for each client. trust me, keep it simple!