I’m working on an email project and need help with conversation threading. You know how Gmail groups related emails together and hides the quoted parts from previous messages?
Here’s what I want to do:
Say I have an original email like this:
Hi Sarah,
Can we meet for coffee tomorrow?
Thanks,
Mike
Then someone replies:
Mike,
Sure, let's meet at 3pm.
Best regards,
Sarah Johnson
On Thursday, March 15, 2024 at 2:15 PM, Mike Davis wrote:
> Hi Sarah,
> Can we meet for coffee tomorrow?
> Thanks,
> Mike
I want my system to figure out that the second email is replying to the first one. Plus I need it to detect where the quoted text starts (like the “On Thursday, March 15…” part) so I can hide it from users.
The tricky part is that different email clients format replies differently. Some use “>” symbols, others have different date formats, and HTML emails make it even more complex.
Has anyone found a good open source solution for this? Maybe there’s a library specifically for email threading or an open source email client that does this well?
Any suggestions would be really helpful!
Hey, you could check out mailparser or email-reply-parser. They both do a decent job with quoted text in emails from different clients. I’ve had success with mailparser. It handles most replies pretty well, but watch out for the odd edge cases; you might need your own regex for those.
I’m using email-reply-parser for the same thing right now. It’s built for finding quoted content and works pretty well with most email clients. For threading though, don’t try parsing the content - check the email headers instead. Look at the In-Reply-To and References fields, which carry the Message-ID of the message being answered and of the earlier messages in the conversation. Most email clients set these correctly even when the formatting’s all over the place. Email-reply-parser catches about 80% of quoted text in my experience. The stuff it misses is usually weird corporate signatures or oddball client setups. Heads up: HTML emails sometimes stick quoted content in div blocks instead of normal quote markers, so you’ll need different handling for HTML vs plain text.
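To make the header approach concrete, here is a rough sketch using only Python's standard `email` module. The function names `parent_id` and `build_threads` are made up for illustration and don't come from any of the libraries mentioned in this thread. It assumes every message carries a Message-ID, and that the last entry in References is the direct parent when In-Reply-To is missing.

```python
from email import message_from_string


def parent_id(msg):
    """Return the Message-ID this message replies to, or None."""
    in_reply_to = msg.get("In-Reply-To", "").split()
    if in_reply_to:
        return in_reply_to[0]
    # Assumption: the last References entry is the direct parent.
    refs = msg.get("References", "").split()
    return refs[-1] if refs else None


def build_threads(messages):
    """Group Message-IDs into threads keyed by the root Message-ID."""
    by_id = {m["Message-ID"].strip(): m for m in messages if m["Message-ID"]}
    parent = {mid: parent_id(m) for mid, m in by_id.items()}

    def root(mid):
        # Walk up the parent chain; stop at an unknown parent or a cycle.
        seen = set()
        while parent.get(mid) in by_id and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = {}
    for mid in by_id:
        threads.setdefault(root(mid), []).append(mid)
    return threads


original = message_from_string(
    "Message-ID: <a@example.com>\nSubject: Coffee\n\nCan we meet tomorrow?"
)
reply = message_from_string(
    "Message-ID: <b@example.com>\nIn-Reply-To: <a@example.com>\n"
    "References: <a@example.com>\nSubject: Re: Coffee\n\nSure, 3pm."
)
print(build_threads([original, reply]))
```

This only groups messages that are already in hand. If the parent message is missing from your store, the reply ends up as the root of its own thread, so a real system would probably want placeholder handling for that case.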
Had this exact problem two years back building an internal email tool. Tried a bunch of options and settled on lamson with custom parsing. Lamson’s decent at threading - it checks message IDs and reference headers instead of just subject lines, which works way better. Had to write extra code for quoted content though since every email client does it differently. Outlook’s got different patterns than Thunderbird, and web clients throw in their own weirdness. Main thing I learned: don’t rely on just one detection method, combine several. Also check out Mailman’s source code - their conversation threading has been around forever and actually works.
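For the "combine several methods" advice, something like the sketch below might be a starting point for plain-text bodies. It is not what lamson, Mailman, or email-reply-parser do internally. The detector names and regexes are my own assumptions, and the patterns only cover English "On ... wrote:" lines, the "-----Original Message-----" separator, and a trailing block of ">" lines. Each detector returns a line index or None, and the earliest hit is taken as the start of the quoted part.

```python
import re

# Attribution line, e.g. "On Thursday, March 15, 2024 at 2:15 PM, Mike Davis wrote:"
ATTRIBUTION = re.compile(r"^On\s.+\swrote:\s*$", re.IGNORECASE)
# Separator commonly seen in Outlook-style plain-text replies
ORIGINAL_MSG = re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE)
# Any line that starts with a ">" quote marker
QUOTED = re.compile(r"^\s*>")


def first_match(pattern, lines):
    """Index of the first line matching pattern, or None."""
    for i, line in enumerate(lines):
        if pattern.match(line):
            return i
    return None


def find_attribution(lines):
    return first_match(ATTRIBUTION, lines)


def find_separator(lines):
    return first_match(ORIGINAL_MSG, lines)


def find_quote_block(lines):
    """Index where the trailing run of '>' lines (blank lines allowed) begins."""
    start = None
    for i in range(len(lines) - 1, -1, -1):
        if QUOTED.match(lines[i]):
            start = i
        elif lines[i].strip():
            break  # hit real reply text, stop scanning upward
    return start


DETECTORS = (find_attribution, find_separator, find_quote_block)


def split_reply(body):
    """Return (visible_text, quoted_text) using the earliest detector hit."""
    lines = body.splitlines()
    hits = [h for h in (d(lines) for d in DETECTORS) if h is not None]
    cut = min(hits) if hits else len(lines)
    return "\n".join(lines[:cut]).rstrip(), "\n".join(lines[cut:])


body = (
    "Mike,\nSure, let's meet at 3pm.\nBest regards,\nSarah Johnson\n"
    "On Thursday, March 15, 2024 at 2:15 PM, Mike Davis wrote:\n"
    "> Hi Sarah,\n> Can we meet for coffee tomorrow?\n> Thanks,\n> Mike"
)
visible, quoted = split_reply(body)
print(visible)
```

On the example from the question this keeps Sarah's four lines and hides everything from the "On Thursday..." line down. Known gaps: an attribution line that a client wraps across two lines won't match, a normal sentence that happens to look like "On ... wrote:" will, and inline replies (text interleaved between quoted lines) aren't handled at all. New client-specific patterns can be added as extra functions in DETECTORS.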