I need to develop a robust RAG solution that works with CSV datasets for production use.
Two methods I’m considering:
1. Vector embeddings → store in a vector database → process user questions → run similarity/hybrid search → send to LLM → get response
2. Load CSV into dataframes → have the LLM generate Python code from user input → execute queries on the dataframe → pass results to the LLM for interpretation → return the final answer
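For concreteness, here's a minimal sketch of the second approach with the LLM call stubbed out (`fake_llm_generate_code` is a placeholder — a real system would prompt a model with the dataframe schema and the question):

```python
import pandas as pd

def fake_llm_generate_code(question: str) -> str:
    # Stand-in for the LLM: a real system would send the schema + question
    # to a model and get back pandas code.
    return "df.groupby('region')['sales'].sum().to_dict()"

def answer(question: str, df: pd.DataFrame):
    code = fake_llm_generate_code(question)
    # Evaluate against a namespace that exposes only the dataframe.
    # (Not safe on its own -- validation/sandboxing discussed below.)
    return eval(code, {"df": df})

df = pd.DataFrame({"region": ["east", "west", "east"], "sales": [10, 20, 5]})
print(answer("Total sales per region?", df))  # {'east': 15, 'west': 20}
```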
The vector approach produces unclear responses since I’m treating structured data like unstructured text. The dataframe method works much better but I’m worried about performance at scale. What would you recommend for handling this properly?
definitely agree on that! treating structured data as if it were unstructured text is a big no-no. validation and caching will also help a lot — most of the cost is regenerating code the model has already written, so once you optimize those you'll see a real drop in latency.
Been there, done that - multiple times. You’re right about vector embeddings being useless for CSV data. Learned that lesson the hard way on a customer analytics project.
I’d go with your second approach but tweak it:
Ditch pandas for DuckDB. It’s way faster on large datasets and won’t blow up your memory.
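   A rough sketch of the DuckDB swap — it can scan a CSV directly with `read_csv_auto` instead of materializing everything in pandas (the tiny `sales.csv` written here is just a stand-in for a large dataset):

   ```python
   import duckdb

   # Write a tiny CSV so the example is self-contained.
   with open("sales.csv", "w") as f:
       f.write("region,sales\neast,10\nwest,20\neast,5\n")

   con = duckdb.connect()  # in-memory DuckDB
   # read_csv_auto streams the file and infers the schema; nothing is
   # loaded into pandas, so memory stays bounded even on huge files.
   rows = con.execute(
       "SELECT region, SUM(sales) AS total "
       "FROM read_csv_auto('sales.csv') GROUP BY region ORDER BY region"
   ).fetchall()
   print(rows)  # [('east', 15), ('west', 20)]
   ```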
Cache the SQL/Python patterns. Most business questions are repetitive, so you won’t regenerate the same queries over and over.
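   One way to do this is to key a cache on the normalized question so rephrasings with different casing/whitespace hit the same entry (`expensive_llm_call` is a stub standing in for the real model call):

   ```python
   from functools import lru_cache

   def normalize(question: str) -> str:
       # Collapse whitespace and casing so near-duplicates share a cache key.
       return " ".join(question.lower().split())

   @lru_cache(maxsize=1024)
   def cached_sql(normalized_question: str) -> str:
       return expensive_llm_call(normalized_question)

   calls = 0
   def expensive_llm_call(q: str) -> str:
       global calls
       calls += 1  # count how often we actually hit the "LLM"
       return f"-- SQL for: {q}"

   cached_sql(normalize("Total sales  per region?"))
   cached_sql(normalize("total sales per region?"))  # cache hit, no LLM call
   print(calls)  # 1
   ```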
Add validation before running any code. Whitelist operations so you don’t execute something dangerous.
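   A minimal whitelist check can walk the AST of the generated code and reject imports or any call outside an approved set (the allowed-operation list here is illustrative, not exhaustive):

   ```python
   import ast

   # Illustrative whitelist of read-style dataframe operations.
   ALLOWED_CALLS = {"groupby", "sum", "mean", "head", "sort_values", "to_dict"}

   def validate(code: str) -> bool:
       try:
           tree = ast.parse(code)
       except SyntaxError:
           return False
       for node in ast.walk(tree):
           if isinstance(node, (ast.Import, ast.ImportFrom)):
               return False  # no imports in generated code
           if isinstance(node, ast.Call):
               fn = node.func
               name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", "")
               if name not in ALLOWED_CALLS:
                   return False  # unknown call -> reject
       return True

   print(validate("df.groupby('region')['sales'].sum()"))   # True
   print(validate("__import__('os').system('rm -rf /')"))   # False
   ```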
Partition massive datasets. Only load what you need based on the initial query.
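   The partitioning idea, roughly: store one file per partition key and resolve which file to scan before touching any data (the `parts/` layout and per-region split are hypothetical — pick whatever key your questions filter on):

   ```python
   import os
   import duckdb

   # Hypothetical layout: one CSV per region, written here so the
   # sketch is self-contained.
   os.makedirs("parts", exist_ok=True)
   with open("parts/east.csv", "w") as f:
       f.write("region,sales\neast,10\neast,5\n")
   with open("parts/west.csv", "w") as f:
       f.write("region,sales\nwest,20\n")

   def query_partition(region: str):
       # Scan only the partition the question actually touches.
       path = f"parts/{region}.csv"
       return duckdb.sql(
           f"SELECT SUM(sales) FROM read_csv_auto('{path}')"
       ).fetchone()[0]

   print(query_partition("east"))  # 15
   ```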
We handle 50M+ row datasets this way and get sub-10-second response times on most queries. The trick is never loading everything into memory.
One last thing - always sandbox your code execution. Use containers or restricted Python environments so generated code can’t touch anything it shouldn’t.
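A bare-minimum version of this (containers and OS-level resource limits are the real answer; this just shows the shape) is to run the generated code in a separate Python process with isolated mode and a timeout:

```python
import subprocess
import sys

GENERATED = "print(sum([10, 20, 5]))"  # stand-in for LLM-generated code

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    # -I runs Python in isolated mode (no user site-packages, no env vars);
    # the timeout kills runaway code. Production should add containers,
    # network isolation, and memory limits on top of this.
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

print(run_sandboxed(GENERATED))  # 35
```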