Converting Gmail HTML emails to plain text using Python and IMAP4_SSL

I’m working on a project where we need to fetch emails from our Gmail account using Python and IMAP4_SSL. We’ve managed to retrieve the email bodies, but they’re in HTML format. Our goal is to convert these HTML emails into plain text.

Here’s a snippet of what we’ve tried so far:

import imaplib
import email

def fetch_emails():
    mail = imaplib.IMAP4_SSL('imap.gmail.com')
    mail.login('your_email@gmail.com', 'your_password')
    mail.select('inbox')
    
    _, message_numbers = mail.search(None, 'ALL')
    for num in message_numbers[0].split():
        _, msg = mail.fetch(num, '(RFC822)')
        email_body = email.message_from_bytes(msg[0][1])
        print(email_body)  # Prints the whole raw message: headers plus the HTML body

    mail.close()
    mail.logout()

fetch_emails()

Does anyone know how to convert the HTML content to plain text efficiently? Any tips or libraries that could help would be greatly appreciated!

hey, i’ve been doing something similar recently. try using the lxml library - it’s super fast and works great for parsing html. here’s a quick example:

from lxml import html

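# ... your existing code ...

# fromstring() needs the markup as a str, not the Message object, so pass
# the decoded text/html part (called html_content here)
tree = html.fromstring(html_content)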
plain_text = tree.text_content()
print(plain_text)

it’s pretty simple and gets the job done. hope this helps!
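
One thing to watch with the snippet above: the parser wants the HTML as a string, so the text/html part has to be pulled out of the message first. Here is a rough sketch of how that might look using only the standard library. The name extract_html is made up for illustration, and falling back to UTF-8 when a part declares no charset is an assumption on my side.

```python
import email
from email.message import EmailMessage


def extract_html(msg):
    """Return the first text/html part of msg as a str, or None if there is none."""
    for part in msg.walk():
        if part.get_content_type() == 'text/html':
            payload = part.get_payload(decode=True)  # bytes after base64/QP decoding
            # Assumption: fall back to UTF-8 when the part declares no charset
            charset = part.get_content_charset() or 'utf-8'
            return payload.decode(charset, errors='replace')
    return None


# Build a small multipart/alternative message to try it on
demo = EmailMessage()
demo['Subject'] = 'Demo'
demo.set_content('Hello in plain text')
demo.add_alternative('<p>Hello in <b>HTML</b></p>', subtype='html')

# Round-trip through bytes, the same shape imaplib's fetch() hands back
parsed = email.message_from_bytes(demo.as_bytes())
html_content = extract_html(parsed)
print(html_content)
```

The returned string can then be handed to whichever HTML-to-text converter you prefer.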

I’ve dealt with a similar issue in the past, and I found that using the BeautifulSoup library along with the html2text module works wonders for converting HTML emails to plain text. Here’s how you can modify your code:

First, install the required libraries:
pip install beautifulsoup4 html2text

Then, update your script like this:

import imaplib
import email
from bs4 import BeautifulSoup
import html2text

def fetch_emails():
    # Your existing code here...
    for num in message_numbers[0].split():
        _, msg = mail.fetch(num, '(RFC822)')
        email_body = email.message_from_bytes(msg[0][1])
        
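        if email_body.is_multipart():
            for part in email_body.walk():
                if part.get_content_type() == 'text/html':
                    # Decode with the part's declared charset, not always UTF-8
                    charset = part.get_content_charset() or 'utf-8'
                    html_content = part.get_payload(decode=True).decode(charset, errors='replace')
                    # BeautifulSoup tidies the markup before html2text converts it
                    soup = BeautifulSoup(html_content, 'html.parser')
                    plain_text = html2text.html2text(str(soup))
                    print(plain_text)
                    break
        else:
            # Single-part message: the payload may be plain text or HTML
            charset = email_body.get_content_charset() or 'utf-8'
            print(email_body.get_payload(decode=True).decode(charset, errors='replace'))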

    # Rest of your code...

This approach has worked reliably for me across a range of HTML email formats. html2text produces Markdown-style output, so headings, links, and lists keep most of their structure without any extra processing.
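
It may also be worth checking whether the message already carries a text/plain alternative before converting anything, since many HTML emails are sent as multipart/alternative with both versions. Below is a minimal sketch of that idea, assuming the message is parsed with policy.default so that get_body() is available. The name best_body is hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def best_body(raw_bytes):
    """Return (content_type, text) for the preferred body part, or (None, '')."""
    # policy.default yields EmailMessage objects, which provide get_body()
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer a ready-made plain-text part; fall back to the HTML part
    part = msg.get_body(preferencelist=('plain', 'html'))
    if part is None:
        return None, ''
    # get_content() decodes the transfer encoding and charset for text parts
    return part.get_content_type(), part.get_content()


# Demo message with both alternatives
demo = EmailMessage()
demo.set_content('Plain version')
demo.add_alternative('<p>HTML version</p>', subtype='html')

kind, text = best_body(demo.as_bytes())
print(kind, repr(text))
```

If the returned type is text/html, the text still needs an HTML-to-text pass such as html2text; if it is text/plain, it can be used as-is.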

I’ve found that the standard library is enough for simple cases: email.parser to parse the message and the html module to decode entities. Here’s an approach that’s worked for me:

import imaplib
import email
from email.parser import BytesParser
from email.policy import default
import html

def fetch_and_convert_emails():
    mail = imaplib.IMAP4_SSL('imap.gmail.com')
    mail.login('your_email@gmail.com', 'your_password')
    mail.select('inbox')

    _, message_numbers = mail.search(None, 'ALL')
    for num in message_numbers[0].split():
        _, msg_data = mail.fetch(num, '(RFC822)')
        email_body = BytesParser(policy=default).parsebytes(msg_data[0][1])
        
        if email_body.is_multipart():
            for part in email_body.walk():
                if part.get_content_type() == 'text/html':
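                    # With policy=default, get_content() decodes using the part's charset
                    html_content = part.get_content()
                    # Only <br> and <p> are handled; any other tags are left in place
                    text = html_content.replace('<br>', '\n').replace('<p>', '\n\n')
                    # Decode entities last so an escaped "&lt;br&gt;" stays literal text
                    print(html.unescape(text))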
                    break
        else:
            print(email_body.get_payload(decode=True).decode())

    mail.close()
    mail.logout()

fetch_and_convert_emails()

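This method decodes HTML entities and turns <br> and <p> tags into line breaks. It doesn’t strip any other tags, so it’s only suitable for very simple HTML; anything more complex needs a real parser.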
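
If you want to stay with the standard library but also drop the remaining tags, html.parser can do the stripping. The following is only a sketch under my own assumptions: the class name TextExtractor, the helper html_to_text, and the choice of which tags count as block-level are all made up for illustration and will not cover every email layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script>/<style> and breaking on block tags."""

    # Assumption: a small, hand-picked set of tags that should start a new line
    BLOCK_TAGS = {'p', 'div', 'br', 'tr', 'li', 'h1', 'h2', 'h3'}
    SKIP_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse whitespace within each line and drop empty lines
    lines = (' '.join(line.split()) for line in ''.join(parser.chunks).splitlines())
    return '\n'.join(line for line in lines if line)


sample = ('<style>p {color: red}</style><p>Tom &amp; Jerry</p>'
          '<div>Line <b>two</b><br>Line three</div>')
print(html_to_text(sample))
```

In the script above, html_to_text(html_content) could take the place of the unescape and replace calls.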