Building a RAG system where cost doesn't spiral out of control—what actually matters?

I’ve been sketching out a RAG system for handling internal questions across multiple data sources, and I immediately ran into cost anxiety. Every guide talks about API costs: embedding models, LLM calls for retrieval ranking, LLM calls for generation. It adds up fast if you’re not careful.

The obvious instinct is to optimize model size—use smaller models for retrieval, save the expensive ones for generation. But I’m not sure that actually works in practice. A weak retrieval step means worse context for generation, which might force you to use a much larger model to compensate. So you might save money on retrieval and lose it on generation.

I’ve also been reading about token usage. Every document chunk you retrieve gets tokenized and sent to the LLM as context. If you’re retrieving 10 documents at 1000 tokens each for every user question, that’s 10,000 tokens per query. Multiply that by your question volume and the costs become pretty visible.
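The arithmetic above is easy to sketch as a back-of-envelope cost model. The price and query volume below are hypothetical placeholders, not real vendor rates:

```python
# Back-of-envelope model for the input-token cost of retrieved context.
# price_per_1k_tokens is a hypothetical placeholder; plug in your
# provider's actual input-token rate.

def context_cost_per_query(docs_retrieved, tokens_per_doc, price_per_1k_tokens):
    """Dollar cost of the retrieved context tokens for one query."""
    context_tokens = docs_retrieved * tokens_per_doc
    return context_tokens / 1000 * price_per_1k_tokens

# The 10 docs x 1000 tokens example, at an assumed $0.01 / 1K input tokens:
per_query = context_cost_per_query(10, 1000, 0.01)

# Multiplied by an assumed 50K questions per month:
monthly = per_query * 50_000
```

At those assumed numbers that is $0.10 of context per query and $5,000 a month before you've even counted output tokens, which is exactly why the retrieval count matters.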

So the real question is: what actually drives costs in a RAG system, and where do people usually mess up? Is it over-retrieving? Is it using the wrong model combinations? Is it not being selective about what documents get included?

The cost game shifts entirely when you work with a unified subscription model. Instead of paying per API call across five different vendors, you’re paying one price for access to 400+ models. That fundamentally changes your optimization strategy.

Instead of “use the cheapest model,” you can ask “use the right model for this task.” For retrieval, you might use a smaller, specialized model that’s optimized for semantic matching. For generation, you use whatever produces the best answer for your use case. All at effectively the same unit cost.

The real cost killer isn’t model choice—it’s over-retrieving and over-processing. If you’re pulling 20 documents when 5 would do, that’s waste. Latenode’s visual builder lets you add filtering, ranking, and validation steps without complex code. You can set up a workflow that retrieves documents, ranks them by relevance, filters low-scoring ones, and only sends the top results to generation. That’s where you actually cut costs.
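That retrieve-rank-filter-top-k shape is simple enough to sketch in a few lines. The scores and thresholds here are illustrative; in a real pipeline the score would come from your retriever's similarity metric or a cross-encoder reranker:

```python
# Minimal sketch of a rank -> filter -> top-k context selection step.
# Doc scores, the 0.5 floor, and top_k=3 are illustrative assumptions.

def select_context(docs, min_score=0.5, top_k=5):
    """Keep only the highest-scoring docs above a relevance floor."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    return [d for d in ranked if d["score"] >= min_score][:top_k]

docs = [{"id": i, "score": s}
        for i, s in enumerate([0.9, 0.2, 0.7, 0.6, 0.4, 0.8])]

# Six candidates in, three high-scoring docs out:
context = select_context(docs, min_score=0.5, top_k=3)
```

The point is that the filter runs on cheap scores before any generation tokens are spent, so the expensive call only ever sees the survivors.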

I spent months building a RAG system thinking model choice was the lever. Turns out it wasn’t.

What actually drove our costs down was being methodical about retrieval quality. We started by measuring how many documents we were pulling per query and how many were actually relevant to the final answer. We were retrieving about 15 documents and using content from only 3 of them. So we tuned the retrieval step to be more selective.
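The "15 retrieved, 3 used" measurement is worth automating. A rough sketch, assuming you log which retrieved documents actually get cited in each final answer (the log field names here are made up for illustration):

```python
# Sketch of measuring retrieval "utilization": of the docs retrieved,
# what fraction were actually used in the final answer. The log schema
# (retrieved_ids / cited_ids) is a hypothetical example.

def retrieval_utilization(logs):
    retrieved = sum(len(q["retrieved_ids"]) for q in logs)
    used = sum(len(set(q["retrieved_ids"]) & set(q["cited_ids"]))
               for q in logs)
    return used / retrieved if retrieved else 0.0

logs = [
    {"retrieved_ids": [1, 2, 3, 4, 5], "cited_ids": [2, 5]},
    {"retrieved_ids": [6, 7, 8, 9, 10], "cited_ids": [6]},
]

# 3 of 10 retrieved docs were actually cited:
ratio = retrieval_utilization(logs)
```

A persistently low ratio is the signal to tighten retrieval, exactly as described above.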

The second thing was prompt engineering. If you craft your generation prompt carefully—“answer only based on provided context, be concise”—you reduce the tokens needed in responses. We cut our output token count by about 30% with prompt changes alone.
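A minimal version of that constrained prompt might look like the following. The wording is ours, not the poster's exact prompt:

```python
# Illustrative generation prompt that restricts the model to the
# provided context and asks for concise output. Wording is an
# assumption; tune it for your own domain.

def build_prompt(question, context_chunks):
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so. Be concise.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("How do I reset my VPN password?",
                      ["chunk about VPN access", "chunk about password policy"])
```

The "be concise" and "only the provided context" constraints are what trim output tokens and discourage the model from padding answers with unsourced filler.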

Third, we batch processed questions when we could. Instead of individual API calls, we processed groups of similar questions together. That’s boring engineering but it cuts costs significantly.
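A sketch of that grouping step, with "similar" approximated by a trivial keyword check. A real system would cluster on embeddings; the keyword function here is just a stand-in:

```python
# Sketch of grouping similar questions into batches before calling
# the API. key_fn decides similarity; the keyword check below is a
# toy stand-in for embedding-based clustering.

from collections import defaultdict

def group_questions(questions, key_fn):
    groups = defaultdict(list)
    for q in questions:
        groups[key_fn(q)].append(q)
    return list(groups.values())

questions = ["reset my password", "password expired", "update billing info"]

# Two batches: the two password questions together, billing alone.
batches = group_questions(questions, key_fn=lambda q: "password" in q)
```

Each batch can then share one context-retrieval pass and one generation call instead of one per question.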

Cost control in RAG systems has three leverage points: retrieval efficiency, context quality, and generation optimization. Most teams ignore the first two and only optimize models.

Retrieval efficiency means retrieving fewer documents without losing relevance. This requires understanding your embedding model’s behavior—what kinds of documents it matches well, where it fails. You can also implement pre-filtering before semantic search, which eliminates obvious non-candidates cheaply.
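Pre-filtering can be as simple as narrowing on metadata before any embedding search runs. A sketch, with invented field names (`source`, `year`) standing in for whatever metadata your corpus actually has:

```python
# Sketch of cheap metadata pre-filtering ahead of the (expensive)
# semantic search: narrow by source and recency first, embed-search
# only the survivors. Field names are illustrative assumptions.

def prefilter(corpus, source=None, after_year=None):
    docs = corpus
    if source is not None:
        docs = [d for d in docs if d["source"] == source]
    if after_year is not None:
        docs = [d for d in docs if d["year"] >= after_year]
    return docs

corpus = [
    {"id": 1, "source": "wiki", "year": 2021},
    {"id": 2, "source": "hr",   "year": 2023},
    {"id": 3, "source": "hr",   "year": 2019},
]

# Only recent HR docs survive to the semantic-search stage:
candidates = prefilter(corpus, source="hr", after_year=2020)
```

The filter costs essentially nothing, and every document it eliminates is one you never embed, rank, or pay to stuff into a context window.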

Context quality ties to chunking strategy. Large chunks use more tokens but give better context. Small chunks waste tokens on repetitive information. Finding the right chunk size for your domain reduces token overhead.

Generation optimization includes prompt tuning and post-processing. If your prompts are bloated or your model is generating verbose responses, that multiplies cost across hundreds or thousands of queries.
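The chunk-size tradeoff is easy to see with a toy splitter. The sizes and overlap below are arbitrary examples, not recommendations:

```python
# Toy illustration of the chunk-size tradeoff: the same 1000-word
# passage split at two sizes. Overlap tokens are duplicated in every
# chunk, so smaller chunks pay the overlap tax more often.

def chunk(words, size, overlap=0):
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)]

words = ["w"] * 1000  # stand-in for a tokenized document

small = chunk(words, size=100, overlap=20)  # 13 chunks, overlap repeated 12x
large = chunk(words, size=400, overlap=20)  # 3 chunks, overlap repeated 2x
```

Retrieving five small chunks versus two large ones can carry very different token totals for similar coverage, which is why chunk size is worth measuring per domain rather than copied from a tutorial.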

Over-retrieval kills costs. Rank and filter docs before sending to generation. Test different chunk sizes. Optimize generation prompts for conciseness. That’s 80% of savings.

Track tokens per query. Measure which documents actually help answers. Cut retrieval count ruthlessly.
