I’m working on an AI chatbot using RAG with LLAMA3 and I’m noticing some performance issues. The retrieval part seems to be taking much longer than just calling the LLM directly.
For testing, I used a single webpage with around 1000 words. Here are my timing results:
- Single ~1000-word webpage: retrieval 2.3 s, LLM 2.1 s
- Scaled up to a 1.4 MB ChromaDB: retrieval 20+ s, LLM still 3-4 s
Am I doing something wrong with my setup or is this typical behavior for RAG systems? Any suggestions for optimization would be helpful.
Yeah, totally normal! Retrieval can be slow with larger datasets. Maybe try a faster embedding model, or some indexing tweaks like HNSW if your setup allows it. Also, pre-compute your document embeddings instead of generating them as you go - that will definitely help!
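For the pre-computed embeddings part, a minimal sketch with Chroma’s Python client and sentence-transformers might look like this - the model name, HNSW metadata keys, and paths are assumptions, so adjust them to whatever your stack actually uses:

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")

# HNSW settings are passed through collection metadata in Chroma's Python client
collection = client.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine", "hnsw:construction_ef": 200, "hnsw:M": 16},
)

# Embed the document chunks once at ingest time, not on every request
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["first chunk of the page ...", "second chunk of the page ..."]
embeddings = model.encode(chunks, batch_size=64).tolist()

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)

# At query time only the question gets embedded, with the same model
query_embedding = model.encode(["user question here"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=2)
```

The point is that `collection.add` runs once when you ingest the page, so each query only pays for one small encode plus the index lookup.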
ChromaDB tanks hard once you move past toy examples. That jump from 2.3 to 20+ seconds at 1.4 MB? Totally expected. Similarity search gets brutal as your collection grows without proper indexing.
I’ve built a few document retrieval systems and the issue’s usually ChromaDB’s brute force approach. With thousands of chunks, it’s comparing your query against every single vector. Switch to Pinecone or Weaviate - they’ve got optimized indexing baked in.
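If you go that route, querying a managed index is only a few lines. This is a rough sketch against the v3+ Pinecone Python SDK; it assumes you’ve already created an index whose dimension matches your embedding model, and the index name and model are placeholders:

```python
import os

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("rag-chunks")  # assumes an existing index with dimension 384

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

# Ingest: keep the chunk text in metadata so the prompt can be rebuilt later
chunks = ["first chunk ...", "second chunk ..."]
index.upsert(vectors=[
    {"id": f"chunk-{i}", "values": model.encode(text).tolist(), "metadata": {"text": text}}
    for i, text in enumerate(chunks)
])

# Query: the approximate nearest neighbor search runs server-side
result = index.query(
    vector=model.encode("user question here").tolist(),
    top_k=2,
    include_metadata=True,
)
context = [match.metadata["text"] for match in result.matches]
```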
Your embedding model’s probably making it worse too. LLAMA embeddings are thorough but heavy. I switched to all-MiniLM-L6-v2 for queries and speed went way up with barely any hit to quality. You don’t need perfect embeddings for retrieval, just good enough to pull relevant context.
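If you stay on ChromaDB, swapping the embedding model is mostly a one-line change through its built-in sentence-transformers wrapper - just note you have to re-ingest the collection, since embeddings from different models can’t be mixed. Collection and path names here are made up:

```python
import chromadb
from chromadb.utils import embedding_functions

# Lighter embedding model for retrieval; the LLM still gets the raw chunk text
minilm = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs_minilm", embedding_function=minilm)

# Both documents and queries are now embedded with the small model
collection.add(ids=["chunk-0"], documents=["some chunk of the page ..."])
results = collection.query(query_texts=["user question here"], n_results=1)
```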
Your timing looks pretty normal for RAG setups, but you can definitely speed things up. That retrieval bottleneck is super common - ChromaDB’s default settings aren’t built for production. I’ve built similar systems and the main issue is usually the collection config. Switch to persistent storage instead of in-memory operations.
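Concretely, that’s the difference between Chroma’s ephemeral client and the persistent one - a short sketch, assuming the current Python client:

```python
import chromadb

# Ephemeral client: everything lives in memory and is rebuilt (and re-embedded)
# every time the process starts
# client = chromadb.Client()

# Persistent client: the index is written to disk once and reloaded on later runs
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs")
```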
One thing that cut my retrieval times from 15+ seconds down to under 3 was using approximate nearest neighbor search with a lower precision threshold. You’ll lose some accuracy but the speed boost is worth it. Also try different document chunking - smaller chunks with overlap actually improve both speed and relevance.
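In Chroma terms the precision knob is roughly the `hnsw:search_ef` metadata setting (lower means faster but coarser lookups - treat the exact key as an assumption and check your client version), and an overlapping chunker can be as simple as this; the sizes are illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Lower search_ef trades recall for speed on the approximate nearest neighbor search
fast_collection = client.get_or_create_collection(
    name="docs_fast",
    metadata={"hnsw:search_ef": 10},
)

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```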
Embedding generation is another big bottleneck. If you’re generating embeddings on-the-fly for queries, switch to a lighter model or use sentence transformers with GPU acceleration. That’ll cut processing time significantly. Your LLAMA3 times look fine, so I’d focus on fixing the retrieval pipeline first.
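For the embedding side, a small sentence-transformers model on GPU with batched encoding is usually the quickest win - the `device="cuda"` bit assumes an NVIDIA GPU, so drop it to fall back to CPU:

```python
from sentence_transformers import SentenceTransformer

# Small model, moved to the GPU if you have one
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# Encode in batches rather than one call per string
texts = ["chunk one ...", "chunk two ...", "chunk three ..."]
embeddings = model.encode(texts, batch_size=64, convert_to_numpy=True)
print(embeddings.shape)  # (3, 384) for this model
```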
Yeah, those numbers are totally normal for RAG setups, especially when you’re running everything sequentially. The real killer usually isn’t any single step - it’s the overhead of embedding generation, vector search, and document merging all happening one after another.
I ran into this exact problem last year building a customer support bot. Instead of tweaking each piece separately, I threw the whole RAG pipeline onto Latenode and set it up with parallel workflows.
The game changer was caching embeddings, running similarity searches in parallel batches, and pre-warming the LLM context while retrieval was still going. What took 20+ seconds on big datasets now finishes in under 5.
You also get built-in monitoring, so you can see your actual bottlenecks instead of guessing - way easier than managing all those timing calls yourself.
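If you’d rather stay in plain Python than move to a workflow platform, the same two ideas - cache query embeddings and run the similarity searches in parallel - can be sketched with `functools.lru_cache` and a thread pool; the model, collection, and queries here are placeholders, not anyone’s production setup:

```python
import functools
from concurrent.futures import ThreadPoolExecutor

import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="docs")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

@functools.lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple[float, ...]:
    # Cache query embeddings so repeated or identical questions skip the encoder
    return tuple(embedder.encode(query).tolist())

def retrieve(query: str, n_results: int = 4) -> list[str]:
    result = collection.query(
        query_embeddings=[list(embed_query(query))],
        n_results=n_results,
    )
    return result["documents"][0]

# Run several retrievals at once, e.g. one per sub-question or per collection
queries = ["What is the refund policy?", "How long does shipping take?"]
with ThreadPoolExecutor(max_workers=4) as pool:
    contexts = list(pool.map(retrieve, queries))
```

Pre-warming the model (loading it, or streaming the system prompt) can be kicked off in the same pool while the retrieval futures are still running.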