Pattern matching to eliminate Gmail reply history

I'm trying to clean up email threads by keeping only the latest reply. Here's an example of what I'm dealing with:

My new message

On Monday, July 5, 2021 at 2:15 PM John Doe <[email protected]> wrote:
> Previous message content here
>
>

I want to get rid of everything starting from the 'On' line. Right now I'm using this pattern:

\wrote.*$

But it only removes stuff after 'wrote:'. How can I adjust it to catch the entire reply history starting from 'On'? I'd appreciate any help with this!

hey there! i’ve dealt with this before. try using this regex pattern:

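On\s.*?wrote:.*$

it should catch everything from ‘On’ onwards. make sure to use the dot-all flag (s) when applying it, so the dot also matches newlines. with only the multiline flag (m), the match stops at the end of the ‘wrote:’ line. hope this helps ya out!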
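
here's a rough Python sketch of that idea, in case it helps. it assumes the quoted history always starts at a line beginning with "On" and that the attribution ends with "wrote:". the names REPLY_HEADER and strip_history, and the example.com address, are just placeholders i made up for illustration.

```python
import re

# Assumption: the history starts at a line beginning with "On" and the
# attribution ends with "wrote:". DOTALL lets "." cross newlines, MULTILINE
# lets "^" match at the start of any line.
REPLY_HEADER = re.compile(r"^On\s.*?wrote:.*", re.MULTILINE | re.DOTALL)

def strip_history(text: str) -> str:
    """Drop everything from the first 'On ... wrote:' line to the end."""
    return REPLY_HEADER.sub("", text, count=1).rstrip()

# Placeholder message modeled on the example in the question.
sample = (
    "My new message\n"
    "\n"
    "On Monday, July 5, 2021 at 2:15 PM John Doe <john@example.com> wrote:\n"
    "> Previous message content here\n"
    ">\n"
)
print(strip_history(sample))
```

one thing to watch: because the dot crosses newlines, a line in your own message that starts with "On" (like "On second thought...") would also trigger the match, so you may want a stricter pattern if that's likely.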

I’ve found a solution that works well for this scenario. Instead of relying solely on regex, consider using a combination of string manipulation and pattern matching. First, locate the index of ‘On’ followed by a day and date. Then, use that index to slice the string, keeping only the content before it. This approach is more robust and handles variations in reply formats.

Here’s a Python snippet that demonstrates this:

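import re

def clean_email(text):
    # Find the first line that starts with "On <Word>," (e.g. "On Monday,").
    match = re.search(r'^On [A-Z][a-z]+,', text, re.MULTILINE)
    if match:
        # Keep only the text before the attribution line.
        return text[:match.start()].strip()
    return text

# Usage (original_email_text holds the raw message body)
cleaned_text = clean_email(original_email_text)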

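This method handles any attribution line that starts with ‘On’ followed by a capitalized word and a comma, such as ‘On Monday,’ or ‘On Mon,’. Formats without that comma (for example ‘On 5 July 2021’) need a broader pattern.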
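
A possible stricter variant, sketched under assumptions rather than tested against real mailboxes: it only cuts at a line that starts with "On" and whose attribution ends with "wrote:", allowing the header to wrap onto one extra line. The names ATTRIBUTION and clean_email_strict are hypothetical, and the two-line limit is my own assumption about how clients wrap long headers.

```python
import re

# Assumption: the attribution starts a line with "On ", ends with "wrote:",
# and spans at most two physical lines (some clients wrap long headers).
ATTRIBUTION = re.compile(r"^On [^\n]*(?:\n[^\n]*)?wrote:[ \t]*$", re.MULTILINE)

def clean_email_strict(text: str) -> str:
    """Keep only the text before the first 'On ... wrote:' attribution."""
    match = ATTRIBUTION.search(text)
    if match:
        return text[:match.start()].rstrip()
    return text

# A body line that merely starts with "On" is left alone.
message = (
    "On second thought, skip it\n"
    "\n"
    "On Mon, Jul 5, 2021 at 2:15 PM John Doe\n"
    "<john@example.com> wrote:\n"
    "> old\n"
)
print(clean_email_strict(message))
```

Requiring "wrote:" at the end of the line trades some flexibility for fewer false positives, so sentences in the new message that happen to begin with "On" are not mistaken for the start of the history.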

I’ve had similar issues with email threads getting cluttered. What worked for me was using a regex pattern like this:

(?ms)^On\s+.*?(?:wrote|sent):.*$

The (?s) flag enables dot-all mode, allowing the dot to match newlines, and (?m) lets ^ match at the start of any line rather than only at the start of the message. This pattern catches everything from ‘On’ through the end of the email, including variations like ‘sent:’ instead of ‘wrote:’.

One caveat - make sure to test thoroughly with different email formats. Some clients use slightly different reply structures. You might need to tweak the pattern for edge cases.
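
One way to do that testing is a small table of sample messages and expected results. The sketch below assumes the pattern (?ms)^On\s+.*?(?:wrote|sent):.*$ and uses made-up samples; CASES, strip_quoted, and run_cases are hypothetical names, and real messages from the clients you care about should replace the samples.

```python
import re

# Assumed pattern: (?m) lets ^ match at any line start, (?s) lets . span newlines.
PATTERN = re.compile(r"(?ms)^On\s+.*?(?:wrote|sent):.*$")

# Hypothetical samples: (label, raw message, expected cleaned text).
CASES = [
    ("single-line header",
     "Thanks!\n\nOn Mon, Jul 5, 2021 at 2:15 PM John Doe wrote:\n> Hi\n",
     "Thanks!"),
    ("wrapped header",
     "Sure.\n\nOn Mon, Jul 5, 2021 at 2:15 PM John Doe\n<john@example.com> wrote:\n> Hi\n",
     "Sure."),
    ("'sent:' variant",
     "Ok\n\nOn Jul 5, 2021, John Doe sent:\n> Hi\n",
     "Ok"),
    ("no history",
     "Just a note.",
     "Just a note."),
]

def strip_quoted(text: str) -> str:
    """Remove the first quoted-history block, if any."""
    return PATTERN.sub("", text, count=1).rstrip()

def run_cases():
    """Return (label, actual) for every case whose output differs from expected."""
    failures = []
    for label, raw, expected in CASES:
        actual = strip_quoted(raw)
        if actual != expected:
            failures.append((label, actual))
    return failures

print(run_cases())
```

Adding a new row each time a client produces an unfamiliar reply header makes it easy to see whether a tweak to the pattern breaks a format that used to work.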

Also, consider using a library like email-reply-parser if you’re doing this at scale. It’s more robust than regex alone for complex threading scenarios.