Algorithm to identify quoted content in email replies

I’m working on an email application and need help with detecting quoted text in replies. Different email clients handle quoting in various ways. Some add > symbols before each quoted line, others place new content above or below the original message, and some clients just paste the old email without any special formatting.

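def detect_quoted_content(email_body, previous_email):
    """Return the lines of email_body that carry a '>' quote marker."""
    quoted_lines = []
    # splitlines() copes with both \n and \r\n line endings
    for line in email_body.splitlines():
        # Some clients indent the marker, so ignore leading whitespace
        if line.lstrip().startswith('>'):
            quoted_lines.append(line)
    # previous_email is not used yet: only '>'-prefixed quoting is detected,
    # which is exactly the limitation I'm asking about
    return quoted_lines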

Google’s email service handles this really well by automatically hiding the quoted portions and showing a link to expand them. I want to build something similar that can recognize these repeated sections regardless of how they’re formatted. Has anyone implemented or found libraries that can do this kind of text comparison for email threads?

I ran into a similar problem while building a customer support system a couple of years ago. Pattern matching alone is unreliable because formatting is so inconsistent across email clients. My solution was to normalize whitespace and strip timestamps and sender details, then slide a window over the text, comparing chunks of the current message against previous emails in the thread. If consecutive sentences matched at roughly 70%, I treated them as quoted material. Inline replies are harder, since users write new text in between quoted passages, but keeping the conversation history and computing edit distances let me separate new content from repeated quotes.
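A rough sketch of the normalize-then-slide idea, in case it helps. It is not the code from that system: the function names, the two-sentence window, the naive sentence splitter, and the use of difflib's ratio as the "70% match" measure are all assumptions made for illustration, and stripping timestamps and sender details is left out.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Drop leading '>' markers, collapse whitespace, and lowercase."""
    lines = [re.sub(r'^\s*(>\s*)+', '', line) for line in text.splitlines()]
    return re.sub(r'\s+', ' ', ' '.join(lines)).strip().lower()


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def find_quoted_sentences(reply, previous, window=2, threshold=0.7):
    """Return normalized reply sentences that closely match the previous email."""
    reply_sents = split_sentences(normalize(reply))
    prev_sents = split_sentences(normalize(previous))
    if not reply_sents or not prev_sents:
        return []

    # Every run of `window` consecutive sentences from the previous message
    prev_windows = [
        ' '.join(prev_sents[i:i + window])
        for i in range(max(1, len(prev_sents) - window + 1))
    ]

    quoted = set()
    for i in range(max(1, len(reply_sents) - window + 1)):
        chunk = ' '.join(reply_sents[i:i + window])
        # Mark the whole window as quoted if it resembles any previous window
        if any(SequenceMatcher(None, chunk, w).ratio() >= threshold
               for w in prev_windows):
            quoted.update(range(i, min(i + window, len(reply_sents))))
    return [reply_sents[i] for i in sorted(quoted)]
```

The result contains normalized (lowercased, unmarked) sentences, so mapping them back to positions in the original body would need extra bookkeeping. A larger window reduces false positives on short common phrases but makes inline replies harder to catch.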

Prefix matching alone is too simplistic for the range of reply formats out there. I ran into the same problem while building an internal email system, and ended up running several detection strategies together. I used Python’s difflib for fuzzy matching between the current email body and previous messages; sections with a similarity ratio above 0.8 are treated as quoted content. I also looked for signature patterns and reply headers like “On [date], [person] wrote:”, which help separate new content from quoted text even when formatting cues are absent. Outlook and other corporate clients can strip all formatting, so you have to analyze the conversation flow and match sentence structure instead of relying on markers that aren’t there.
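To make the two strategies concrete, here is a minimal sketch that combines a reply-header check with difflib's SequenceMatcher. The header regex only covers the English “On ..., ... wrote:” shape, and the helper names and paragraph-level comparison are assumptions for illustration, not the original implementation.

```python
import re
from difflib import SequenceMatcher

# Assumed header shape: "On <date>, <person> wrote:"; real clients vary widely
REPLY_HEADER = re.compile(r'^\s*On\s.+\swrote:\s*$', re.IGNORECASE)


def split_at_reply_header(body):
    """Return (new_text, quoted_text), split at the first reply header if any."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if REPLY_HEADER.match(line):
            new_text = '\n'.join(lines[:i]).strip()
            quoted_text = '\n'.join(lines[i + 1:]).strip()
            return new_text, quoted_text
    return body.strip(), ''


def is_quoted(paragraph, previous_body, threshold=0.8):
    """True if paragraph closely matches a paragraph of the previous message."""
    candidate = ' '.join(paragraph.split())  # collapse whitespace
    for prev in re.split(r'\n\s*\n', previous_body):
        prev = ' '.join(prev.split())
        if prev and SequenceMatcher(None, candidate, prev).ratio() > threshold:
            return True
    return False
```

In practice the header check would run first because it is cheap, with the fuzzy comparison as a fallback for messages where the client stripped the header or the formatting.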

been there! regex works okay for basic stuff, but gmail’s parser is a pain to replicate. i tried levenshtein distance to match text blocks - it caught most duplicates even when formatting got messed up. check out the mailparser library, it’ll save you a huge headache.
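for reference, a plain-Python sketch of levenshtein distance turned into a similarity score between two text blocks. the function names and the normalization by the longer length are my own choices for illustration; a dedicated library would be faster on long emails.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

run it on whitespace-normalized blocks, otherwise re-wrapped lines count as edits and drag the score down.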