I need help understanding the modern technology choices for creating a complete Retrieval-Augmented Generation system. I want to know what works best at different stages of the process:
Data splitting: What methods or libraries should I use to break documents into smaller pieces?
Vector creation: Which models are giving the best results for creating embeddings right now?
Storage and search: What’s the most effective way to save and find these vectors (databases, search systems, etc.)?
Result refinement: Are there good models or tools for improving the initial search results?
Complete workflow: Any platforms that connect all these pieces smoothly?
I’m looking for current best practices and real-world advice from people who have built these systems. What technologies are you using and what has worked well for you?
Great questions! Sentence-transformers is great for embeddings, really easy to use. For storage, Pinecone is nice, but watch those costs. Also, have you tried Weaviate? It's pretty cool for search!
I’ve worked with tons of RAG setups this past year. OpenAI’s text-embedding-ada-002 beats almost everything else for vectors, but you’ll pay for it. Skip fixed-size document splits - semantic chunking with LangChain keeps way more context intact. Chroma’s been rock solid for smaller projects and doesn’t need the headache of setting up dedicated vector databases. Want better results? Add a re-ranking step with cross-encoders like ms-marco-MiniLM - made a huge difference for me. Here’s what really matters: nail your chunk overlap and metadata tagging upfront. Trust me, it’ll save you from debugging hell later.
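To make the re-ranking step concrete: in practice you'd score each (query, passage) pair with a cross-encoder such as one of the ms-marco-MiniLM models (via sentence-transformers), but the pipeline shape is the same either way. Here's a minimal sketch where a toy word-overlap scorer stands in for the cross-encoder so the example runs on its own:

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-order first-stage retrieval hits by a stronger relevance score.

    In a real pipeline score_fn would be a cross-encoder (e.g. an
    ms-marco-MiniLM model) scoring each (query, passage) pair; here a
    toy lexical scorer stands in so the example is self-contained.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def word_overlap(query, passage):
    # Stand-in scorer: fraction of query words present in the passage.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

hits = [
    "Reranking improves retrieval quality",
    "Bananas are yellow",
    "Cross-encoders score query passage pairs for retrieval",
]
top = rerank("how does reranking improve retrieval", hits, word_overlap, top_k=2)
```

Swapping `word_overlap` for a real cross-encoder's `predict` is a one-line change; the surrounding logic stays identical.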
Been running production RAG systems for two years - learned some hard lessons.

For chunking, recursive character splitting still works great if you tune it right. I use 500-800 tokens with 50-100 overlap depending on content type. Everyone's obsessed with semantic chunking, but the overhead usually isn't worth it unless you're dealing with really complex docs.

On vectors, the BGE models are solid - especially bge-large-en-v1.5. It often matches OpenAI performance at a fraction of the cost when self-hosted.

For storage, I've been using PostgreSQL with pgvector. Familiar SQL interface, and it handles millions of vectors without killing your budget.

The real game changer? Hybrid search. Combine dense vectors with BM25 sparse retrieval. Most people skip this, but it catches edge cases pure vector search misses. I implemented it with Elasticsearch and saw 15-20% better retrieval quality across different domains.
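One common way to merge the dense and BM25 result lists in a hybrid setup is reciprocal rank fusion (RRF). To be clear, this is my illustration of the merge step, not necessarily the exact method the poster used with Elasticsearch; the document IDs below are made up:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists (e.g. dense-vector hits and BM25 hits).

    Each document earns 1 / (k + rank) per list it appears in, so
    documents ranked well by either retriever float to the top of the
    fused list. k=60 is the commonly used default damping constant.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # ranking from vector search
sparse = ["doc_b", "doc_d", "doc_a"]  # ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
# doc_b wins: it ranks highly in both lists
```

The appeal of RRF is that it needs no score normalization - it only looks at ranks, so the incomparable score scales of cosine similarity and BM25 never have to be reconciled.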
Just finished migrating our company’s RAG pipeline last month, so this timing’s perfect.
For data splitting, I ditched the usual suspects and went with Unstructured.io. It handles PDFs, tables, and weird formatting way better than basic text splitters. Worth the extra setup time.
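The idea behind element-aware tools like Unstructured.io is to partition on a document's structural boundaries before any size-based chunking. This toy stdlib sketch shows that idea on markdown-style headings only - it is not the library's API, and the real thing additionally handles PDFs, tables, and messy layouts:

```python
import re

def split_by_headings(markdown_text):
    """Partition a document at heading boundaries, keeping each section's
    heading attached to its body. A toy stand-in for element-aware
    parsers like Unstructured.io, restricted to markdown headings.
    """
    sections = []
    current = []
    for line in markdown_text.splitlines():
        # Flush the running section whenever a new heading starts.
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections

doc = "# Intro\nSome text.\n## Details\nMore text.\nEven more."
parts = split_by_headings(doc)
```

Splitting on structure first means a later size-based chunker never glues the tail of one section onto the head of an unrelated one.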
Vector models - everyone’s talking about OpenAI and BGE, but I’ve been testing Cohere’s embed-english-v3.0 lately. Surprisingly good results and their pricing works better for our volume.
Storage wise, I’m using Qdrant instead of the typical choices. The filtering capabilities are insane and it scales horizontally without the usual vector DB headaches. Plus their Python client is clean.
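Qdrant applies payload filters server-side during the vector search itself. As a concept sketch, here is a toy in-memory version of filter-then-score - the point structure and the `lang` field are invented for illustration, not Qdrant's actual client API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filtered_search(query_vec, points, must, top_k=2):
    """Score only points whose payload matches every filter condition,
    mimicking in memory what Qdrant's payload filters do server-side.
    """
    matches = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in must.items())
    ]
    matches.sort(key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return matches[:top_k]

points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"lang": "de"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"lang": "en"}},
]
hits = filtered_search([1.0, 0.0], points, must={"lang": "en"})
# point 2 is the closest vector overall but is excluded by the filter
```

Doing the filter inside the search (rather than over-fetching and filtering afterwards) is exactly what makes this fast at scale.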
Here’s something nobody mentioned - implement query expansion before you hit the vector search. I use a small LLM to generate 2-3 variations of the user query, then merge those results. Bumped our hit rate by about 25%.
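The query-expansion flow described above can be sketched in a few lines. The LLM call and the vector store are replaced here with hard-coded stand-ins so the shape of the merge logic is visible:

```python
def expanded_search(query, generate_variants, search_fn, per_query=3):
    """Run retrieval once per query variant and merge the hits,
    deduplicating while preserving first-seen order.

    generate_variants stands in for the small LLM the poster uses to
    rephrase the user query; search_fn stands in for vector search.
    """
    seen, merged = set(), []
    for q in [query] + generate_variants(query):
        for doc in search_fn(q)[:per_query]:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-ins for the LLM and the index:
variants = lambda q: [q + " tutorial", q + " best practices"]
index = {
    "rag": ["d1", "d2"],
    "rag tutorial": ["d2", "d3"],
    "rag best practices": ["d4"],
}
results = expanded_search("rag", variants, lambda q: index.get(q, []))
```

Note the original query always searches first, so its hits keep priority in the merged ordering before the variants' extras are appended.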
For reranking, Cohere’s rerank model works great, but if budget’s tight, just implement MMR (Maximum Marginal Relevance) to reduce duplicate results. Simple algorithm, big impact.
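MMR really is a simple algorithm - it greedily picks documents that are relevant to the query but dissimilar to those already chosen. A self-contained sketch with precomputed similarity scores (the numbers are illustrative):

```python
def mmr(query_sim, doc_sims, lambda_=0.7, top_k=3):
    """Maximum Marginal Relevance over precomputed similarities.

    query_sim: query-document similarity per document.
    doc_sims:  doc_sims[i][j] is similarity between documents i and j.
    lambda_ trades off relevance (1.0) against diversity (0.0).
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < top_k:
        def score(i):
            # Penalize similarity to the closest already-selected doc.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sim[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR picks 0, then skips to 2.
query_sim = [0.9, 0.85, 0.5]
doc_sims = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
order = mmr(query_sim, doc_sims, lambda_=0.7, top_k=2)
# → [0, 2]: doc 1 is demoted for being redundant with doc 0
```

Tuning `lambda_` is the whole game: near 1.0 it degenerates to plain relevance sorting, lower values push harder for variety in the final context window.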
Complete workflow? I built our own orchestration with Prefect. Most platforms are either too rigid or too expensive. The custom route gives you control over every step.