RAG performance issues - retrieval takes longer than LLM inference

I’m working on a chatbot that uses a RAG architecture with the LLAMA3 model. I noticed that the retrieval step using ChromaDB is significantly slower than the actual LLM generation phase.

Here are my timing results when processing a single webpage containing approximately 1000 words:

# Performance measurements
Retrieval phase duration: 2.245 seconds
LLM generation duration: 2.118 seconds

My implementation looks like this:

import time

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Build the vector store from pre-chunked documents (document_chunks is prepared earlier)
embedding_model = OllamaEmbeddings(model="llama3")
vector_db = Chroma.from_documents(documents=document_chunks, embedding=embedding_model)
doc_retriever = vector_db.as_retriever()

user_query = "What is COCONut?"

# Measure retrieval time (query embedding + vector search + context assembly)
start_time = time.time()
relevant_docs = doc_retriever.invoke(user_query)
combined_context = merge_documents(relevant_docs)  # my helper that joins the retrieved chunks
retrieval_time = time.time() - start_time
print(f"Retrieval duration: {retrieval_time}")

# Measure LLM generation time
start_time = time.time()
response = ollama_model(user_query, combined_context)  # my helper that calls LLAMA3 for generation
llm_time = time.time() - start_time
print(f"LLM duration: {llm_time}")

The problem gets worse as my database grows. With a ChromaDB size of 1.4 MB, retrieval takes over 20 seconds while LLM inference still only needs 3-4 seconds.

Is this normal behavior for RAG systems or am I doing something wrong in my setup?

That’s not how it should be! Maybe you’re re-embedding docs too often? Try embedding once, save it, and reuse the vector database. Also, double-check that you’re using the same embedding model for both indexing and querying. That could speed things up a lot!
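For example, here’s a minimal sketch of that idea using LangChain’s Chroma wrapper - the persist_directory path is just a placeholder, and I’m assuming the same document_chunks and embedding model as in your question:

# One-time indexing step: embed the chunks and write the index to disk
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OllamaEmbeddings(model="llama3")
vector_db = Chroma.from_documents(
    documents=document_chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
# (older chromadb versions may also need an explicit vector_db.persist() call)

# At query time, reload the persisted index instead of calling from_documents again
vector_db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_model,
)
doc_retriever = vector_db.as_retriever()

That way only the query gets embedded at request time; the document embeddings are computed once.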

Your bottleneck is definitely Ollama’s LLAMA3 for embeddings. I hit the same wall last year on a similar build. You’re running LLM inference twice: once for embeddings during retrieval, then again for text generation. Ollama embeddings are convenient but painfully slow for production. I switched to sentence-transformers and dropped retrieval from 15+ seconds to under 500ms. Also check whether ChromaDB is rebuilding the index on each query - that would explain why it gets worse as your database grows. Use a dedicated embedding model for semantic search instead of forcing a chat model to do embeddings.
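If you want to try that swap, here’s a rough sketch using LangChain’s HuggingFaceEmbeddings wrapper around sentence-transformers - the all-MiniLM-L6-v2 model is just an example, and you’ll need to rebuild the index because documents and queries must be embedded with the same model:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Dedicated embedding model for indexing and querying; LLAMA3 is then used only for generation
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector_db = Chroma.from_documents(documents=document_chunks, embedding=embedding_model)
doc_retriever = vector_db.as_retriever()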

Been there, done that. You’re running embeddings on every retrieval instead of doing it once and storing them properly.

I hit this same issue building a knowledge base for our internal docs. The fix? Automate your embedding pipeline so it only runs when content changes.

Set up monitoring for your document sources, process new content through your embedding model, and auto-update ChromaDB. Your retrieval becomes a fast vector lookup instead of re-embedding everything.
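As a rough illustration of the “only embed what changed” part (the content-hash IDs and the vector_db / document_chunks names are assumptions carried over from the question, not any specific tool):

import hashlib

def chunk_id(chunk):
    # Deterministic ID derived from the chunk text, so re-runs can detect duplicates
    return hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()

existing_ids = set(vector_db.get()["ids"])  # IDs already stored in ChromaDB
new_chunks = [c for c in document_chunks if chunk_id(c) not in existing_ids]

if new_chunks:
    # Embed and store only the new or changed chunks
    vector_db.add_documents(new_chunks, ids=[chunk_id(c) for c in new_chunks])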

For ChromaDB performance as it grows, you need proper indexing or maybe switch to a more scalable vector database. But the real win is separating embedding generation from retrieval completely.
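Two cheap knobs on the Chroma side, sketched here as assumptions rather than a definitive config: cap how many chunks the retriever returns, and set the HNSW distance metric through the collection metadata:

vector_db = Chroma.from_documents(
    documents=document_chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
    collection_metadata={"hnsw:space": "cosine"},  # assumes cosine distance suits your embedding model
)
doc_retriever = vector_db.as_retriever(search_kwargs={"k": 3})  # return only the top 3 chunks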

I built this whole flow with automation - document monitoring, chunking, embedding, database updates, and performance monitoring. Dropped retrieval times from 15+ seconds to under 200ms.

Treat this as a pipeline problem, not code optimization. Automate the heavy lifting so retrieval stays fast regardless of database size.

Check out Latenode for building automated pipelines like this: https://latenode.com