Need help choosing fast and accurate RAG model

Hey everyone, I’m seeking advice on which LLM is the best fit for RAG tasks. My current setup includes two Xeon processors, 64GB of RAM, and dual RTX 3060 graphics cards with 12GB each. I’m using Ollama within Docker on Windows alongside OpenWebUI.

I’m in the process of developing a system that can efficiently search through essential company documents like procedures, guides, and SOPs. It’s crucial that the model provides accurate answers quickly as this project serves as a demonstration for my organization.

Up to this point, I’ve tried several models including Llama 3.1, Qwen, and Mistral. Unfortunately, I haven’t found one that excels in both speed and accuracy. The models with high precision tend to lag, while the fast ones often yield errors. For example, just yesterday, one model returned a phone number that only had 9 digits instead of the required 10.

As I’m relatively new to this field, I’m primarily using the default settings. If adjusting the configurations in Ollama or OpenWebUI could improve performance, I’d greatly appreciate any guidance on what to modify.

Could anyone suggest models that effectively balance speed and accuracy? Thank you!

Yeah, this sounds familiar. Had the same issues building a RAG system for our knowledge base last year. Your hardware's solid for this.

Phi-3 Medium worked surprisingly well for document retrieval on dual-GPU setups like yours - it's great at keeping context accurate for structured docs. Also try Mistral 7B Instruct v0.3; the newer version handles factual extraction way better.

For your config, bump num_ctx in Ollama to 8192 or higher to prevent truncation of long documents, and set repeat_penalty around 1.1 to fix those formatting errors.

The phone number issue? That's usually incomplete chunks during retrieval. Increase chunk overlap to 100-150 tokens so you get complete info blocks instead of fragments. Speed vs accuracy becomes less of a problem when your retrieval feeds complete, relevant context.

Test with a small batch of your most critical documents first, then scale up.
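To show what I mean about those Ollama options, here's a rough sketch of the request body you'd POST to Ollama's /api/generate endpoint (the model name and exact values are just examples, not a recommendation - you can also bake these into a Modelfile with PARAMETER lines instead):

```python
import json

def build_ollama_request(prompt: str) -> dict:
    # Illustrative payload for Ollama's /api/generate endpoint.
    return {
        "model": "llama3.1:8b",    # whatever model you've pulled
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 8192,         # bigger context window so long docs aren't truncated
            "repeat_penalty": 1.1,   # mild penalty to cut down repeated/garbled output
        },
    }

payload = build_ollama_request("Summarize the onboarding SOP.")
print(json.dumps(payload, indent=2))
```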

I’ve been running similar RAG setups in production for a while, and your hardware should handle this fine. Those accuracy issues? They’re probably coming from your retrieval pipeline, not the model.

Try Llama 3.1 8B, but fix your chunking strategy and embeddings first. That phone number problem screams incomplete information blocks. Chunk by semantic boundaries, not character counts.
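By "chunk by semantic boundaries" I mean something like this - an untested sketch that splits on paragraph breaks instead of raw character counts, and carries trailing paragraphs forward as overlap so things like phone numbers never get cut in half (the size limits here are made-up numbers, tune them for your docs):

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1500, overlap_paras: int = 1) -> list[str]:
    # Split on blank lines so phone numbers, steps, etc. stay inside one chunk.
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for p in paras:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:]  # carry trailing paragraphs as overlap
            size = sum(len(x) for x in current)
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Overlap at the paragraph level means a fact that sits near a chunk boundary shows up in both neighboring chunks, so retrieval can't hand the model half of it.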

In Ollama, drop the temperature to 0.1 or 0.2 for consistent outputs. Bump up your context window too. For OpenWebUI, increase top_k retrieval so you get more relevant chunks before the LLM processes them.

What helped me tons was preprocessing documents so critical stuff like phone numbers, codes, and procedures don’t get mangled during chunking. Add some custom parsing for structured data.
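As one concrete example of that custom parsing, here's a rough preprocessing pass (the regex is a sketch for common US-style formats, not something I've battle-tested) that normalizes phone numbers to a single canonical 10-digit layout before chunking, so the model always sees the same pattern:

```python
import re

# Matches formats like (555) 123 4567, 555.123.4567, 555-123-4567.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")

def normalize_phones(text: str) -> str:
    # Rewrite every match into a canonical XXX-XXX-XXXX form.
    return PHONE_RE.sub(r"\1-\2-\3", text)
```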

Still not fast enough? Run two models in parallel - a quick one for simple queries, a better one for complex stuff. Your dual 3060s can handle it.
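The routing can be dead simple to start with. Here's a toy heuristic (model names are just examples of what you might pull in Ollama, and the keyword list is something you'd tune on your own queries): short factual lookups go to the small fast model, longer or reasoning-heavy questions go to the bigger one.

```python
FAST_MODEL = "phi3:medium"       # example: quick model on GPU 0
ACCURATE_MODEL = "llama3.1:8b"   # example: stronger model on GPU 1

REASONING_HINTS = ("why", "compare", "explain", "difference", "summarize")

def pick_model(query: str) -> str:
    # Long queries or reasoning keywords -> the more accurate model.
    q = query.lower()
    if len(q.split()) > 20 or any(hint in q for hint in REASONING_HINTS):
        return ACCURATE_MODEL
    return FAST_MODEL
```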

This video breaks down RAG fundamentals really well. The embedding and chunking sections will help with your accuracy problems.

Your model choice isn’t the problem. The problem is that you’re managing every stage of the RAG pipeline by hand when the whole thing should be automated.

I’ve built tons of document search systems. What actually works? Connect everything through automation instead of juggling separate tools manually.

Build a workflow that auto-processes documents, creates proper embeddings, and routes queries to the right model based on complexity. Those phone number issues? They happen because your documents aren’t getting preprocessed correctly before chunking.

For your hardware, go with Llama 3.1 8B but automate the prompt engineering. Create different prompt templates for different document types - SOPs need different handling than general procedures.
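By per-document-type templates I mean something like this sketch (the type names and template wording are made up for illustration - the point is just keying a template off the document's category):

```python
# Hypothetical per-document-type prompt templates.
TEMPLATES = {
    "sop": (
        "You are answering from a Standard Operating Procedure. "
        "Quote step numbers exactly and do not paraphrase procedure steps.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "guide": (
        "You are answering from an internal guide. Answer concisely and "
        "name the section the answer came from.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}

def build_prompt(doc_type: str, context: str, question: str) -> str:
    # Fall back to the generic guide template for unknown document types.
    template = TEMPLATES.get(doc_type, TEMPLATES["guide"])
    return template.format(context=context, question=question)
```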

Speed vs accuracy tradeoff? Gone when you build proper routing logic. Simple factual stuff goes to faster models, complex reasoning goes to better ones. All automatic.

Your dual 3060s are fine, but you’re wasting time configuring Docker containers and managing different interfaces. Connect everything through one automation platform that handles the orchestration.

I use this approach for all our internal tools now. Documents get processed automatically, embeddings update when files change, queries get routed intelligently. Zero manual config.