I’m learning Python and LangChain after working mostly with C#. I want to use AI to analyze my email data.
My current setup works like this:
- Pull emails using IMAP connection
- Convert emails to JSON format with JSONLoader
- Build vector database using FAISS and OpenAIEmbeddings
- Run similarity search to find 3 most relevant emails
- Send results to GPT-3.5 Turbo along with my original question
My prompt template looks like:
User question: {user_input}
Relevant email data:
{search_results}
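For context, here is a minimal sketch of how that template gets filled, using plain `str.format`; the `build_prompt` helper and the sample emails are illustrative, not part of my actual pipeline:

```python
# Illustrative: fill the two-slot prompt template with str.format.
TEMPLATE = """User question: {user_input}
Relevant email data:
{search_results}"""

def build_prompt(user_input: str, search_results: list) -> str:
    """Join the retrieved emails and fill both slots of the template."""
    return TEMPLATE.format(
        user_input=user_input,
        search_results="\n---\n".join(search_results),
    )

prompt = build_prompt(
    "When did I last get an email from alice?",
    ["From: alice\nDate: 2024-01-05", "From: alice\nDate: 2023-12-20"],
)
print(prompt)
```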
This works great for specific questions like “When did I last receive an email from [email protected]?” The system finds the right email and gives me the exact timestamp.
But I’m stuck with counting questions like “How many emails came from [email protected]?” The similarity search only returns the 3 most similar vectors, so I can’t get accurate counts. And even if I increase it to return 100+ emails, I can’t send that much data to the language model.
What’s the right approach for handling numerical queries that need to examine larger datasets?
Vector similarity search just wasn’t built for aggregation queries. I hit this same wall building a document analysis tool. Here’s what actually worked: use a two-tier setup.

Keep your vector database for semantic searches, but add a simple relational database (SQLite’s fine) for structured email metadata: sender, recipient, timestamps, subject categories, whatever you need. When you process emails, pull this structured data while you’re creating embeddings anyway.

Then build a basic query parser that spots counting patterns like ‘how many’ and routes those to SQL instead of vector search. Your example ‘How many emails came from [email protected]?’ just becomes SELECT COUNT(*) FROM emails WHERE sender = '[email protected]'. The language model works with that count result instead of trying to crunch through hundreds of email vectors. Scales way better than forcing vector databases to do something they’re not designed for.
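A minimal sketch of that second tier using Python’s built-in sqlite3 module; the schema, column names, and sample rows are all illustrative, not a fixed design:

```python
import sqlite3

# Illustrative schema: countable metadata stored alongside the vector store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (sender TEXT, recipient TEXT, ts TEXT, subject TEXT)"
)

# In the real pipeline you'd insert these rows while creating embeddings.
rows = [
    ("alice@example.com", "me@example.com", "2024-01-05T09:00", "Invoice"),
    ("alice@example.com", "me@example.com", "2024-01-12T14:30", "Re: Invoice"),
    ("bob@example.com",   "me@example.com", "2024-01-13T08:15", "Lunch?"),
]
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", rows)

# A counting question becomes a plain SQL aggregate, not a vector search.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE sender = ?", ("alice@example.com",)
).fetchone()
print(count)  # → 2
```

The parameterized query (`?` placeholder) also keeps whatever sender string the user typed from being interpolated into the SQL directly.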
Hit this exact same issue building a customer support ticket analyzer. You’re basically trying to use semantic search for stats queries, and it’s a mismatch.

Here’s what worked for me: I added query intent detection upfront. Built a simple classifier that sorts queries into ‘retrieval’ (find specific stuff) vs. ‘analytical’ (counting, averages, etc.). For analytical queries, I skip the vector DB completely and go straight to the raw email data with basic filters.

The trick is keeping email metadata separate from embeddings when you first process everything. Pull out sender addresses, timestamps, subject patterns, anything you might want to count later, into a lightweight structure. Now your counting queries become simple filters instead of similarity searches. Cut my response times to under 200 ms, and it’s way more reliable than forcing FAISS to do aggregation it wasn’t built for.
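The intent classifier doesn’t need to be anything fancy; a keyword match covers most counting questions. A sketch, where the keyword list and function name are just placeholders you’d tune for your own queries:

```python
import re

# Hypothetical keyword-based intent detector: 'analytical' queries skip
# the vector store, 'retrieval' queries go through similarity search.
ANALYTICAL = re.compile(r"\b(how many|count|total|average|avg|sum)\b", re.I)

def classify_query(query: str) -> str:
    """Return 'analytical' for counting/stats queries, else 'retrieval'."""
    return "analytical" if ANALYTICAL.search(query) else "retrieval"

print(classify_query("How many emails came from alice?"))   # → analytical
print(classify_query("When did I last hear from alice?"))   # → retrieval
```

If keyword matching starts misrouting queries, you can later swap this function for a small LLM call or a trained classifier without touching the rest of the pipeline.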
you might want to try using a hybrid approach. for counting, it’s better to skip the vector search and just use raw email data with something like pandas or sqlite. keep vector search for relevant queries, but add a query classifier to handle counting separately.
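for the raw-data counting path, you don’t even need pandas or sqlite to start; a `collections.Counter` over the metadata you already extracted does it. the records below are made up for illustration:

```python
from collections import Counter

# Raw metadata pulled during ingestion (illustrative records).
emails = [
    {"sender": "alice@example.com", "subject": "Invoice"},
    {"sender": "bob@example.com",   "subject": "Lunch?"},
    {"sender": "alice@example.com", "subject": "Re: Invoice"},
]

# Counting becomes a dictionary lookup, no vector search involved.
sender_counts = Counter(e["sender"] for e in emails)
print(sender_counts["alice@example.com"])  # → 2
```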
Had this exact problem building a support ticket analyzer a few years back. Your approach is fine - you’re just making one system do two different jobs.
Vector search rocks at finding similar stuff but sucks at counting. Found out the hard way when our CEO asked “how many billing tickets this month?” and my RAG system gave total garbage.
What fixed it: think like a database person, not just ML. When you’re pulling emails through IMAP, grab the countable data right away. Sender info, timestamps, basic keywords or categories.
Dump this lightweight stuff somewhere fast - Python dict, SQLite, even JSON if you don’t have crazy volume. You need instant access to data you can actually count.
Add a simple router before your current setup. Query has “how many,” “total,” or “count”? Send it to your metadata store. Everything else hits vector search.
Now counting queries are just lookups instead of trying to make FAISS do math. You keep all the semantic search power where it matters.
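The router can be a few lines sitting in front of the existing pipeline. A sketch with stub backends; `route`, the keyword tuple, and the lambdas are hypothetical stand-ins for your metadata store and FAISS search:

```python
# Hypothetical router in front of the existing RAG pipeline: counting
# queries hit the metadata store, everything else falls through to
# the vector search.
COUNT_WORDS = ("how many", "total", "count")

def route(query, metadata_lookup, vector_search):
    """Dispatch a query to the metadata store or the vector store."""
    q = query.lower()
    if any(w in q for w in COUNT_WORDS):
        return metadata_lookup(query)
    return vector_search(query)

# Stub backends just to show the dispatch.
answer = route(
    "How many emails came from alice?",
    metadata_lookup=lambda q: "metadata store",
    vector_search=lambda q: "vector search",
)
print(answer)  # → metadata store
```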
Took us maybe a day once I stopped overthinking it.
Bob’s hybrid approach works, but there’s a cleaner solution.
I’ve tackled similar email analysis problems at work. The trick is preprocessing your data to pull out metadata before it goes into the vector database.
I create a counting layer that tracks sender counts, date ranges, and common patterns. Then I use automation to route queries - counting questions hit the metadata layer, content questions go to vector search.
No need for separate databases or complex classification logic. Set up workflows that detect query types and route them automatically.
The game changer? Automate the entire email ingestion process. New emails get processed for content and metadata, then stored so both similarity search and counting are lightning fast.
I’ve built these systems using automation platforms - way more reliable than handling everything in code. Better performance and no manual query management.
Check out Latenode for workflow automation: https://latenode.com