One thing that’s been nagging at me is how to actually evaluate whether I’m choosing the right models for different parts of a RAG pipeline. Like, I can pick an OpenAI model or Claude or whatever else is available, but how do I know if my retriever is actually finding good documents or if my generator is actually using them well?
What I’ve been doing is building basic test cases with known good answers, running them through different model combinations, and checking the outputs. But I feel like I’m doing this somewhat blindly.
The advantage of having access to 400+ models in one place is that you can actually run these comparisons without setting up separate subscriptions and billing relationships. You just swap nodes and re-run. That’s genuinely useful for experimentation.
But there’s a flip side—when you have that many options, how do you systematically compare them without spending forever testing permutations? I’ve been thinking about this as retrieval quality (does it find relevant info?) and generation quality (does it use that info well), but beyond eyeballing outputs, I’m not sure what metrics actually matter.
How are people actually approaching this? Are you measuring retrieval accuracy separately from generation quality, or is there a simpler way to think about this?
The advantage of having 400+ models accessible is exactly what you're describing: you can test without setup friction. That matters more than most people realize.
For evaluation, separate retrieval from generation. Test your retriever with a set of queries and manually check whether it surfaces relevant documents. Once that's solid, test generation quality by feeding the generator context you've already verified is good. This isolates what works and what doesn't.
You don’t need complex metrics initially. Honest evaluation of outputs against your use case beats elaborate scoring systems that don’t match reality.
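To make the "check if it surfaces relevant documents" step repeatable, a plain hit-rate@k loop is usually enough to start. Here's a minimal sketch; the toy keyword retriever, document set, and benchmark queries below are all placeholders for your own, and the point is the evaluation loop, not the retrieval itself:

```python
# Toy corpus standing in for your real document store.
DOCS = {
    "policy_refunds": "Refunds are available within 30 days of purchase",
    "docs_api_keys": "Rotate an API key from the account settings page",
    "faq_shipping": "Standard shipping takes three to five business days",
}

def retrieve(query, k=2):
    """Toy retriever: rank docs by word overlap with the query.
    Swap in your actual retriever here."""
    q_words = set(query.lower().split())
    scored = sorted(
        DOCS,
        key=lambda doc_id: len(q_words & set(DOCS[doc_id].lower().split())),
        reverse=True,
    )
    return scored[:k]

# Benchmark: each query paired with the doc known to contain the answer.
BENCHMARK = [
    {"query": "how many days do refunds take", "relevant": "policy_refunds"},
    {"query": "rotate my api key", "relevant": "docs_api_keys"},
]

def hit_rate_at_k(benchmark, k=2):
    """Fraction of queries whose known-relevant doc appears in the top k."""
    hits = sum(case["relevant"] in retrieve(case["query"], k=k)
               for case in benchmark)
    return hits / len(benchmark)

print(hit_rate_at_k(BENCHMARK))
```

The nice property is that this number is about the retriever alone, so you can compare embedding models on it without any generator in the loop.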
In practice, I’ve found that separating these concerns is essential. The retriever’s job is narrow—find relevant documents. The generator’s job is to use those documents coherently.
What worked for me was building a small benchmark set of queries with known good context. Run those through your retriever, check if it pulls the right documents. That’s independent of which LLM you use in generation.
Then for generation quality, you feed it the good context and see if it produces useful answers. This way you’re testing each component’s actual capability rather than mixing signal from both.
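For the generation side, one cheap check is: given verified-good context, does the answer contain the facts you expect? Here's a sketch of that grading step; the `generate` stub below just echoes the best-matching context sentence, and you'd replace it with your actual model call:

```python
def generate(question, context):
    """Stub generator: return the context sentence that best matches
    the question. Replace with a real LLM call in practice."""
    q_words = set(question.lower().split())
    sentences = context.split(". ")
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

def grade_answer(answer, required_facts):
    """Pass only if every expected fact appears in the answer."""
    return all(fact.lower() in answer.lower() for fact in required_facts)

# Context we've already verified is good, plus the facts a correct
# answer must contain.
context = "Refunds are available within 30 days. Shipping takes 3-5 business days."
ok = grade_answer(
    generate("how many days for refunds", context),
    required_facts=["30 days"],
)
print(ok)
```

Substring matching on required facts is crude, but it's transparent and catches the common failure mode where the generator ignores the context entirely.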
The key insight is that model testing for RAG requires you to isolate variables. If you’re testing a new retriever and a new generator simultaneously, you have no idea which one is causing problems.
I’d recommend creating a small dataset of representative queries and expected outputs. Evaluate retrieval by checking whether returned documents actually contain the information needed. Evaluate generation by checking whether it answers correctly given good context.
Having multiple models available makes this faster because you’re not constrained by a single provider. You can run several retriever models and several generator models relatively quickly to find good pairs.