Email content separation algorithms similar to Gmail's quoted text detection

I’m working on an email application and need help with identifying quoted content from previous messages. Different email clients handle reply formatting in various ways:

  • Many add > symbols before each quoted line
  • Others place new content above or below the original message
  • Some mobile clients like webOS just append the old email without any formatting

Gmail has this awesome feature where it automatically hides the quoted portions and shows a “show quoted text” link. I’m wondering if there are any open source libraries or documented algorithms that can do similar text comparison and separation? I need something that can reliably distinguish between new content and previously sent messages regardless of how the email client formatted the reply.

I ran into this same issue building our email system last year. The real pain isn’t just finding quoted content - it’s dealing with how inconsistent different email clients and languages are. We went with a multi-layered approach mixing pattern matching and content analysis. First step: scan for reply indicators like “On [date] wrote:” or “From:” headers, then check indentation and line prefixes. The hardest part? Telling forwards from replies and handling HTML vs plain text. We kept flagging signature blocks as quoted content too. What really helped was message threading - if you’ve got conversation history, match text chunks against previous messages to catch exact duplicates.
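
For anyone who wants a starting point, the first layer described above (reply headers plus line prefixes) can be sketched roughly like this. It is a minimal illustration and not our production code: the patterns in `REPLY_HEADER` and the function name `split_reply` are made up for the example, it only covers English, and it assumes the new text sits above the quoted text.

```python
import re
from typing import Tuple

# Hypothetical patterns for illustration; real mail needs many more,
# including per-language variants of "On ... wrote:".
REPLY_HEADER = re.compile(
    r"^\s*(On\s.+\swrote:|-{2,}\s*Original Message\s*-{2,}|From:\s.+)\s*$",
    re.IGNORECASE,
)
QUOTE_PREFIX = re.compile(r"^\s*>")


def split_reply(body: str) -> Tuple[str, str]:
    """Split a plain-text body into (new_text, quoted_text).

    The split happens at the first line that looks like a reply header
    or starts with '>'. Everything from that line on is treated as quoted.
    """
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if REPLY_HEADER.match(line) or QUOTE_PREFIX.match(line):
            new_text = "\n".join(lines[:i]).rstrip()
            quoted_text = "\n".join(lines[i:])
            return new_text, quoted_text
    # No marker found: treat the whole body as new content.
    return body.rstrip(), ""


if __name__ == "__main__":
    sample = "Sounds good.\n\nOn Mon, Jan 1, 2024, Bob wrote:\n> Lunch?\n"
    new, quoted = split_reply(sample)
    print(repr(new))
    print(repr(quoted))
```

Cutting at the first marker breaks on inline replies and bottom-posting, and the `From:` pattern will misfire on a body that happens to contain such a line, which is why the later layers (indentation checks, thread matching) are still needed.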

Hey, have you looked at Mailgun’s talon library on GitHub? It might be just what you need. It handles a lot of the quirky formats that mobile clients produce, and honestly, it’s way better than struggling with regex on your own.

As far as I can tell, Gmail uses a mix of techniques that work pretty well together. They probably have machine learning models trained on tons of email data, but most of the work seems to happen through header parsing and content fingerprinting. The algorithm hunts for patterns like indentation shifts, reply prefixes across languages, and timestamps. Here’s the key part: strip the HTML markup before comparing text. Most solutions mess this up by trying to match formatted HTML directly. Gmail also tracks conversation history, so it can catch quoted text even when forwarding chains scramble the formatting. If you’re building something like this, focus on preprocessing first. Clean up whitespace and strip client-specific junk before running your detection algorithms.
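
To make the preprocessing and fingerprinting idea concrete, here is a rough sketch using only the Python standard library. It is not Gmail's algorithm, just one way to do it under my own assumptions: the helper names (`html_to_text`, `fingerprint`, `mark_quoted`) and the list of block-level tags are invented for the example, and matching is done per line against one previous message from the thread.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, inserting line breaks at common block-level tags."""

    # Assumed set of tags that should start a new line; extend as needed.
    BLOCK_TAGS = {"br", "p", "div", "blockquote", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip markup so comparison runs on text, not on formatted HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def fingerprint(line):
    """Hash a line after dropping quote markers, case, and whitespace differences.

    Returns None for lines that are empty after cleaning.
    """
    cleaned = re.sub(r"^[>\s]+", "", line)
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    if not cleaned:
        return None
    return hashlib.sha1(cleaned.encode("utf-8")).hexdigest()


def mark_quoted(reply_text, previous_text):
    """Return (line, is_quoted) pairs; a line counts as quoted if its
    fingerprint also occurs in previous_text."""
    seen = {fingerprint(line) for line in previous_text.splitlines()}
    seen.discard(None)
    return [(line, fingerprint(line) in seen) for line in reply_text.splitlines()]


if __name__ == "__main__":
    previous = "Are we   still on for lunch?\nBob"
    reply = html_to_text(
        "<div>Yes, noon works.</div>"
        "<blockquote>Are we still on for lunch?<br>Bob</blockquote>"
    )
    for line, quoted in mark_quoted(reply, previous):
        print(quoted, repr(line))
```

Line-level matching will flag short lines such as "Thanks" that legitimately appear in both messages, so in practice you would probably want to require a run of consecutive matching lines before hiding anything.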