Extracting supplier and item information from Gmail messages

Hey everyone, I need some help with a project I’m working on. I’m trying to pull supplier and item details from emails in Gmail and put them into an Excel file. I want to get things like when the email was sent, what it’s about, the item name, how many there are, the price, and information about the supplier.

I’ve been using regex to extract the data, but it’s not consistent. When the emails have a structured format, I get partial info, but with unstructured emails, I end up with unexpected values like random numbers or letters.

I even tried using machine learning with spaCy, but that didn’t fix the issue. Any suggestions on how to improve this process would be greatly appreciated. Thanks for your help!

From my experience, a hybrid approach combining rule-based extraction and machine learning tends to yield the best results for email parsing tasks. I’d suggest implementing a custom entity extractor using libraries like NLTK or spaCy, tailored specifically to your email format and domain. This can help identify key information more accurately.

Additionally, consider using text classification algorithms to categorize emails before extraction. This can help apply different parsing strategies based on email type. For unstructured emails, you might want to explore more advanced NLP techniques like semantic similarity or contextual embeddings to improve accuracy.
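A minimal sketch of that "classify first, then pick a parser" idea, using only the standard library. The label names, the two-category split, and the function names are all assumptions for illustration; a trained text classifier could replace `classify_email` without changing the rest.

```python
import re
from typing import Callable, Dict

# Labels a structured email is assumed to use (hypothetical; adjust to your suppliers).
LABELS = r"(item|qty|quantity|price|supplier)"
LABELLED_LINE = re.compile(
    rf"^[ \t]*{LABELS}[ \t]*:[ \t]*(.+?)[ \t]*$", re.IGNORECASE | re.MULTILINE
)


def classify_email(body: str) -> str:
    """Call an email 'structured' if it has at least two 'Label: value' lines."""
    return "structured" if len(LABELLED_LINE.findall(body)) >= 2 else "unstructured"


def parse_structured(body: str) -> Dict[str, str]:
    """Read each 'Label: value' line into a dict keyed by the lowercased label."""
    return {label.lower(): value for label, value in LABELLED_LINE.findall(body)}


def parse_unstructured(body: str) -> Dict[str, str]:
    """Placeholder: keep the text for an NLP step or manual review."""
    return {"needs_review": body.strip()}


# One parsing strategy per email category.
PARSERS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "structured": parse_structured,
    "unstructured": parse_unstructured,
}


def extract(body: str) -> Dict[str, str]:
    """Route the email to the parser that matches its category."""
    return PARSERS[classify_email(body)](body)
```

The point of the dispatch table is that a new email type only needs a new category and a new parser, instead of another branch inside one large regex.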

Don’t forget to implement a robust error handling and logging system. This will help you identify patterns in problematic emails and iteratively refine your extraction process over time. It’s an ongoing process, but with persistence, you can significantly improve your results.
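As a rough sketch of what that could look like with Python's standard `logging` module: wrap whatever extractor you use, log crashes and incomplete records with the message ID, and review the log to spot recurring problem formats. The required field names and the `safe_extract` helper are assumptions, not part of any library.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("email_extraction")

# Fields a record must contain to be written to the spreadsheet (assumed names).
REQUIRED_FIELDS = ("item", "quantity", "price")


def safe_extract(
    message_id: str,
    body: str,
    extractor: Callable[[str], Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Run an extractor, logging failures instead of letting them stop the batch."""
    try:
        record = extractor(body)
    except Exception:
        # logger.exception records the traceback along with the message ID.
        logger.exception("Extractor crashed on message %s", message_id)
        return None

    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        logger.warning(
            "Message %s is missing fields: %s", message_id, ", ".join(missing)
        )
        return None
    return record
```

Counting the warnings per missing field after a run gives a quick picture of which part of the extraction needs work next.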

Have you tried using OpenAI’s GPT API? It’s pretty good at understanding context and extracting info. You could feed it your emails and ask for specific details. It might cost a bit, but it could save you tons of time. Just an idea!
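A hedged sketch of how that could be wired up without tying the code to one vendor: build a prompt that asks for JSON, then validate the reply. The `complete` argument is a hypothetical function you would write yourself to send a prompt to whichever model API you use and return its text reply; the field names are assumptions too.

```python
import json
from typing import Callable, Dict, Optional

# Keys we ask the model to return (assumed names).
FIELDS = ["item_name", "quantity", "unit_price", "supplier_name"]


def build_prompt(email_body: str) -> str:
    """Ask for a single JSON object so the reply can be parsed mechanically."""
    return (
        "Extract the following fields from the email below and reply with "
        "a single JSON object using exactly these keys: "
        + ", ".join(FIELDS)
        + ". Use null for anything the email does not state.\n\n"
        + "Email:\n"
        + email_body
    )


def extract_with_llm(
    email_body: str,
    complete: Callable[[str], str],  # hypothetical wrapper around your model API
) -> Optional[Dict[str, object]]:
    """Return the requested fields, or None if the reply is not usable JSON."""
    reply = complete(build_prompt(email_body))
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the keys we asked for; absent keys become None.
    return {key: data.get(key) for key in FIELDS}
```

Treating a malformed reply as `None` rather than guessing keeps bad values out of the spreadsheet, which was the original complaint about the regex approach.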

I’ve tackled a similar challenge in my work, and I found that a combination of approaches worked best. Regex can be useful for structured emails, but for unstructured ones, I’d recommend looking into Natural Language Processing (NLP) techniques beyond just spaCy.

One effective method I’ve used is to implement a Named Entity Recognition (NER) model specifically trained on your domain. This can help identify entities like product names, quantities, and prices more accurately. Additionally, you might want to consider using a pre-trained language model like BERT or GPT and fine-tuning it on your specific email dataset.

Another approach that yielded good results for me was using a rule-based system in conjunction with machine learning. You can define a set of rules for common patterns in your emails and fall back to ML predictions when those rules don’t apply.
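Roughly, the rules-first, ML-second flow could look like the sketch below. The two regex rules are examples only, and `ml_fallback` is a hypothetical callable standing in for whatever model you train; recording which path produced each value makes it easier to audit the results later.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Example rules for common patterns (assumed; extend to match your own emails).
RULES = {
    "quantity": re.compile(r"\b(?:qty|quantity)\s*[:=]?\s*(\d+)\b", re.IGNORECASE),
    "price": re.compile(
        r"\b(?:price|cost)\s*[:=]?\s*[$€£]?\s*(\d+(?:\.\d{1,2})?)", re.IGNORECASE
    ),
}

# Hypothetical model interface: (field name, email text) -> value or None.
MlFallback = Callable[[str, str], Optional[str]]


def extract_field(
    name: str, text: str, ml_fallback: MlFallback
) -> Tuple[Optional[str], str]:
    """Try the rule first; ask the model only when the rule does not match."""
    match = RULES[name].search(text)
    if match:
        return match.group(1), "rule"
    return ml_fallback(name, text), "ml"


def extract_all(
    text: str, ml_fallback: MlFallback
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {field: (value, source)} for every field that has a rule."""
    return {name: extract_field(name, text, ml_fallback) for name in RULES}
```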

Lastly, don’t underestimate the power of data cleaning and preprocessing. Spending time on normalizing your email content before extraction can significantly improve your results. It’s a bit of extra work upfront, but it pays off in the long run.
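For illustration, a small normalization pass using only the standard library might look like this. Which steps you need depends on your emails; stripping HTML tags with a regex and dropping lines that start with ">" (quoted replies) are simplifying assumptions, and `normalize_email_body` is just a name chosen for the sketch.

```python
import html
import re
import unicodedata


def normalize_email_body(raw: str) -> str:
    """Clean an email body before running any extraction on it."""
    # Crude tag removal; a real HTML parser is safer for complex markup.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities such as &nbsp; and &amp;.
    text = html.unescape(text)
    # Fold compatibility characters (non-breaking spaces, full-width digits).
    text = unicodedata.normalize("NFKC", text)

    lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Skip quoted reply lines so old values are not extracted again.
        if line.startswith(">"):
            continue
        lines.append(line)

    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running this before the regex or NER step means patterns only have to handle one spacing and encoding convention, which removes a common source of the stray numbers and letters mentioned in the question.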