I had a workflow that needed extraction, classification, and reasoning. With access to many models under one subscription, I stopped thinking in terms of ‘one model to rule them all’ and started matching models to tasks. For extraction I used a smaller, cheaper model; for final reasoning I picked a more capable one. I also built a simple routing layer: quick inexpensive checks first, then escalate to a better model only when confidence was low.
I’m interested in how others architect model routing and cost tradeoffs in long-running automations. Any patterns that worked for you?
we used a confidence threshold. light models do the first pass. if confidence < 0.8 we escalate. that saved a lot on model cost while keeping accuracy high for the important decisions.
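A minimal sketch of that threshold pattern. `cheap_model` and `strong_model` are hypothetical stand-ins for real model calls; each is assumed to return a result plus a confidence score:

```python
# Confidence-threshold escalation: a light model does the first pass,
# and we escalate to a stronger model only when confidence is low.

CONFIDENCE_THRESHOLD = 0.8

def cheap_model(task: str) -> tuple[str, float]:
    # placeholder for a light first-pass model call
    return f"cheap:{task}", 0.6

def strong_model(task: str) -> tuple[str, float]:
    # placeholder for a more capable, more expensive model call
    return f"strong:{task}", 0.95

def route(task: str) -> str:
    result, confidence = cheap_model(task)
    if confidence < CONFIDENCE_THRESHOLD:
        # low confidence: escalate and pay for the stronger model
        result, confidence = strong_model(task)
    return result
```

Most calls never touch the expensive model, so spend tracks the fraction of genuinely ambiguous inputs rather than total volume.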
also useful: cache intermediate model outputs for retry, and batch similar tasks to amortize per-call overhead. that reduces costs on long runs where many similar items are processed.
In one deployment I implemented a tiered model-routing strategy. The pipeline first runs a fast model for parsing and field extraction; if the extraction contains low-confidence items, we call a mid-tier model specialized in correction, and only the top-tier model handles ambiguous policy decisions. The routing logic records cost and latency metrics per call, and every week a small optimizer suggests moving specific payload types up or down a tier based on error rates and spend. This feedback loop keeps accuracy high while controlling costs.

For long-running processes we also added an emergency budget cap that temporarily switches all non-critical calls to the cheapest model when daily spend hits a threshold. The cap prevents runaway bills without stopping the workflow entirely.
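The budget-cap part of this can be sketched as follows; tier names, per-call costs, and the budget figure are all illustrative assumptions, not the deployment's actual numbers:

```python
# Spend guardrail: once daily spend hits the cap, non-critical calls
# are demoted to the cheapest tier instead of halting the workflow.

TIERS = {"fast": 0.001, "mid": 0.01, "top": 0.10}  # cost per call, hypothetical

class Router:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.spend = 0.0

    def pick_tier(self, requested: str, critical: bool) -> str:
        if self.spend >= self.daily_budget and not critical:
            return "fast"  # over budget: demote non-critical calls
        return requested

    def call(self, requested: str, critical: bool = False) -> str:
        tier = self.pick_tier(requested, critical)
        self.spend += TIERS[tier]
        # ... invoke the model for `tier` here ...
        return tier
```

Critical calls still reach the requested tier, which is what keeps the cap from silently degrading the decisions that actually matter.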
Design model routing as a separate, measurable layer with clear metrics: cost per call, latency, and downstream error impact. Start with parsers on cheap models, validators on a mid-tier model, and decision-making on top-tier models. Implement fallback rules and a spend guardrail; with those in place you can run long processes and adjust routing based on observed performance and budget constraints.
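As a closing sketch, the "separate, measurable layer" could look like the following, where the tier assignments and costs are hypothetical and `fn` stands in for the real model invocation:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    tier: str
    cost: float
    latency_s: float

@dataclass
class RoutingLayer:
    assignments: dict                 # task type -> tier, e.g. {"parse": "cheap"}
    costs: dict                       # tier -> cost per call
    records: list = field(default_factory=list)

    def call(self, task_type: str, fn):
        tier = self.assignments[task_type]
        start = time.perf_counter()
        result = fn(tier)             # the actual model invocation
        latency = time.perf_counter() - start
        # record cost and latency per call so tier assignments
        # can be revisited against observed spend and errors
        self.records.append(CallRecord(tier, self.costs[tier], latency))
        return result

    def total_spend(self) -> float:
        return sum(r.cost for r in self.records)
```

Because every call leaves a record, the weekly "move this payload type up or down a tier" decision becomes a query over `records` rather than guesswork.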