Performance Issues: RAG Implementation Running Slower Than Direct LLM Calls

I’m working on an AI chatbot that uses RAG with llama3. I noticed that my RAG setup with ChromaDB performs much slower than direct LLM queries.

Here’s what I’m seeing with a basic webpage containing around 1000 words:

import time

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# llama3 is used for both query embedding and generation
embedding_model = OllamaEmbeddings(model="llama3")
vector_db = Chroma.from_documents(documents=text_chunks, embedding=embedding_model)
doc_retriever = vector_db.as_retriever()
user_query = "What is DataMining?"

# Time the retrieval step (query embedding + similarity search)
start_time = time.time()
relevant_docs = doc_retriever.invoke(user_query)
context_data = merge_documents(relevant_docs)  # helper that joins page_content
end_time = time.time()
print(f"Retrieval time: {end_time - start_time}")

# Time the generation step
start_time = time.time()
response = ollama_model(user_query, context_data)  # wrapper around the llama3 call
end_time = time.time()
print(f"LLM processing time: {end_time - start_time}")

The results show:

Retrieval time: 2.3451234567890123
LLM processing time: 2.0987654321098765

When my ChromaDB grows to about 1.4M in size, retrieval takes over 20 seconds while the LLM still processes everything in 3-4 seconds. Am I doing something wrong with my implementation, or is this normal behavior for RAG systems?

ChromaDB just isn’t built for that scale. With 1.4M records you’re going to run into memory issues and slow retrieval no matter which embedding model you use. Try adding a batch_size limit and setting k=5 for fewer results. Also, avoid loading the whole DB into memory on every query - it kills performance.
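To illustrate what limiting k actually changes, here’s a rough sketch of the top-k step, using a brute-force cosine similarity scan as a stand-in for what the vector store does internally (all names and data here are made up):

```python
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, stored, k=5):
    # Score every stored vector, then keep only the k best matches.
    # Lowering k shrinks the context passed to the LLM, though the
    # similarity scan itself still touches every record.
    scored = ((cosine(query_vec, vec), doc) for doc, vec in stored.items())
    return heapq.nlargest(k, scored)

stored = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
print(top_k([1.0, 0.0], stored, k=2))
```

With a LangChain retriever, the equivalent would be something like `vector_db.as_retriever(search_kwargs={"k": 5})`. Note that a smaller k mainly trims what gets merged and sent downstream; the scan over stored vectors still happens.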

Been there, done that. RAG at scale is a nightmare when you’re juggling all the pieces yourself.

It’s not just about picking the right embedding model. You’ve got a complex pipeline where every query hits multiple steps - embed the query, search the vectors, fetch documents, merge context, then run the LLM. Each step adds latency.

I ditched managing it all in code and moved my entire RAG workflow to Latenode. Huge performance boost. The platform handles orchestration way better and lets you tune each step without rebuilding from scratch.

You can run parallel embedding generation, set up smart caching, even pre-compute embeddings for frequent queries. The workflow automation cuts out most of that overhead you’re hitting.

Built-in monitoring and scaling too. No more wondering why retrieval suddenly takes 20 seconds.

Check it out: https://latenode.com

Yeah, this is super common with RAG setups, especially when you’re using llama3 for both embedding and generation. I hit the same wall once I moved past toy examples.

Your bottleneck is probably the embedding step during retrieval. Every time you query ChromaDB, it has to embed your query with llama3, then run a similarity search across all stored embeddings. With 1.4M records, that gets expensive fast. I saw huge improvements switching to a lighter embedding model like sentence-transformers for retrieval while keeping llama3 for generation. It’s also worth implementing proper indexing in ChromaDB and batch processing if you can. That 20-second retrieval time screams embedding bottleneck, not vector search.

One more thing - ChromaDB might not be cut out for your scale. For production workloads, consider Pinecone or Weaviate, though they bring their own complexity.
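You can confirm which phase is the bottleneck by timing query embedding and vector search separately. A minimal sketch, with toy stand-in functions where your real calls (e.g. the embedding model’s embed call and the vector store’s search) would go:

```python
import time

def time_retrieval(embed_query, vector_search, query):
    """Time query embedding and vector search separately.

    embed_query and vector_search are placeholders for the real
    calls; swap in your embedding model and vector store here.
    """
    t0 = time.perf_counter()
    query_vec = embed_query(query)
    t_embed = time.perf_counter() - t0

    t0 = time.perf_counter()
    docs = vector_search(query_vec)
    t_search = time.perf_counter() - t0
    return docs, t_embed, t_search

# Toy stand-ins so the sketch runs on its own
docs, t_embed, t_search = time_retrieval(
    embed_query=lambda q: [0.1, 0.2],
    vector_search=lambda v: ["doc_a", "doc_b"],
    query="What is DataMining?",
)
print(f"embed: {t_embed:.4f}s  search: {t_search:.4f}s")
```

If t_embed dominates, swapping the retrieval embedding model is the fix; if t_search dominates, look at the vector store and index instead.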

Your implementation looks solid but you’re hitting the classic RAG scaling problem. Been there multiple times.

It’s not just the embedding model. You’re running everything sequentially and ChromaDB can’t handle your data size well. Three fixes that worked for me:

First, go async with retrieval. You’re waiting on each step serially when you could overlap work - for example, serving multiple queries concurrently instead of one at a time.

Second, fix your chunking. Most people chunk way too small and end up with huge vector databases that crawl during similarity search. I use 500-800 token chunks with 50 token overlap - much faster retrieval.
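For what that chunking looks like, here’s a rough word-based sketch (word counts are only a proxy for tokens; a real implementation would count with the model’s tokenizer, e.g. via LangChain’s text splitters):

```python
def chunk_words(words, chunk_size=600, overlap=50):
    """Split a list of words into overlapping chunks.

    Larger chunks mean fewer vectors in the store, which speeds up
    similarity search at a given corpus size.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1500)]
chunks = chunk_words(words, chunk_size=600, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

Here a 1500-word document yields 3 chunks instead of the dozens you’d get with tiny chunk sizes, and each consecutive pair shares a 50-word overlap so context isn’t cut mid-thought.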

Third, cache frequent queries. Simple Redis setup cut my retrieval time by 60% in production.
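A minimal in-memory version of that cache, to show the idea (a production setup would use Redis with a TTL instead of a plain dict):

```python
cache = {}
miss_count = 0

def cached_retrieve(query, retrieve):
    # retrieve is a placeholder for the real retriever call.
    global miss_count
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key not in cache:
        miss_count += 1
        cache[key] = retrieve(query)  # only hit the vector store on a miss
    return cache[key]

docs1 = cached_retrieve("What is DataMining?", lambda q: ["doc_a"])
docs2 = cached_retrieve("what is datamining?", lambda q: ["doc_a"])
print(docs1, docs2, miss_count)  # second call is a cache hit
```

Normalizing the key means trivially different phrasings of the same question share one cache entry, which is where most of the win comes from for frequent queries.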

With 1.4M records, ChromaDB’s probably your bottleneck. I switched to Qdrant at similar scale and saw major improvements. Way better memory management and faster similarity search.

Also try splitting your vector database by topic or date if possible. Smaller search spaces = faster retrieval regardless of embedding model.
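A sketch of that partitioning idea, with hypothetical per-topic stores and a stub router (a real system might classify the query, or use metadata filtering such as Chroma’s `search_kwargs={"filter": ...}` instead):

```python
# Hypothetical per-topic stores; each would be its own collection/index.
stores = {
    "mining": ["doc_mining_1", "doc_mining_2"],
    "ml": ["doc_ml_1"],
}

def route_topic(query):
    # Stub router for illustration only
    return "mining" if "mining" in query.lower() else "ml"

def retrieve(query):
    topic = route_topic(query)
    # Search only the matching partition instead of every document
    return stores[topic]

print(retrieve("What is data mining?"))
```

Each query now scans only one partition, so retrieval cost grows with the partition size rather than the whole corpus.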