so we just got access to this massive model library—hundreds of llms to choose from. openai, claude, deepseek, all the open source stuff. and now i’m stuck in this weird paralysis.
for headless browser automation, do i need a particularly powerful model? my gut says no—the logic is pretty straightforward. navigate here, fill this form, extract that table. it’s deterministic. you don’t need gpt-4 or claude opus for that.
but then i wonder if a smaller, faster model might miss edge cases. what if the page layout is slightly different than expected? does a bigger model handle that better? or is it overkill?
i’ve been experimenting a bit. tried a cheaper model for test scenarios and it worked fine. tried a bigger one for the same task and got basically the same results, except slower and more expensive. that’s not scientific, but it suggests bigger isn’t always better.
the thing that throws me off is that different models have different “personalities”—some are conservative and verbose, others are terse and direct. for something like browser automation where you need structured output (coordinates, text content, css selectors), does that matter? do i want the verbose model that explains reasoning or the terse one?
i haven’t found a good mental model for this yet. my current approach is just trial and error, which feels stupid when there are hundreds of models to try.
what’s your actual decision-making process? are you optimizing for speed, cost, accuracy, or something else? and how much does model choice actually matter for browser tasks?
you’re overthinking this in a way that’s actually pretty common. browser automation doesn’t need reasoning power. it needs reliability and structured output handling. that’s a totally different axis than general intelligence.
for this specific task, what matters is model consistency, not model size. you want the model that reliably returns the right selector or identifier in the same format. that could be a small model if it’s well-aligned for that task.
where model selection actually matters for browser work is output consistency and instruction following. a verbose model isn't inherently better or worse; the question is whether it actually follows your schema.
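a rough sketch of what checking schema-following can look like, assuming you've asked the model to reply with bare JSON containing a `selector` and a `value` key (both key names are made up for illustration):

```python
import json

# Hypothetical schema: we ask the model to answer with JSON like
# {"selector": "...", "value": "..."} and nothing else.
REQUIRED_KEYS = {"selector", "value"}

def follows_schema(raw_response: str) -> bool:
    """True only if the response is bare JSON with exactly the keys
    we asked for -- this is the property worth benchmarking."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS

print(follows_schema('{"selector": "#price", "value": "42.00"}'))  # True
print(follows_schema('Sure! The selector is `#price`.'))           # False
```

run that check across a pile of responses and the "consistency" question becomes a number instead of a vibe.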
what you should be optimizing for: pick a reliable model, test it on 10-15 real pages from your target sites, track success rate and cost. that’s more useful than theoretical comparison. once you’ve proven it works, stick with it. switching models constantly just introduces instability.
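the test loop doesn't need to be fancy. a rough sketch, where `run_task` is a stand-in for whatever call drives your headless browser plus model (the fake version at the bottom just makes the sketch runnable):

```python
# Tiny evaluation loop: record pass/fail and spend per page,
# then report success rate and total cost.
def evaluate(pages, run_task):
    results = []
    for url in pages:
        outcome = run_task(url)  # expected: {"ok": bool, "cost_usd": float}
        results.append(outcome)
    successes = sum(r["ok"] for r in results)
    total_cost = sum(r["cost_usd"] for r in results)
    return {
        "success_rate": successes / len(results),
        "total_cost_usd": round(total_cost, 4),
    }

# Fake run_task so the sketch runs end to end without a browser.
fake = lambda url: {"ok": "fail" not in url, "cost_usd": 0.002}
print(evaluate(["https://a.example", "https://b.example/fail"], fake))
```

swap the fake for your real automation call and run it against real pages from your target sites; the output is the comparison table you actually want.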
the massive model library is useful for different problems—content generation, analysis, reasoning tasks. for browser automation, you’re solving a narrower problem. pick a model that works and move on.
the paralysis is real, but it points to something important: you’re conflating model capability (general intelligence) with model fitness (specific task performance). those aren’t correlated for deterministic tasks.
browser automation doesn’t require reasoning. it requires pattern matching and instruction following. a smaller, well-trained model often beats a larger general-purpose model on specific tasks because it’s optimized for that type of work.
what you should actually measure: success rate on your actual pages, latency, cost per successful extraction. don’t A/B test on 3 pages. test on 30-50 real pages from your target sites. that gives you real signal instead of noise.
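cost per successful extraction is the metric people get wrong: failed calls still cost money, so divide total spend by successes, not by attempts. a rough sketch over an assumed per-run log format (the records here are made-up numbers):

```python
import statistics

# Assumed log format: one record per attempted extraction.
runs = [
    {"ok": True,  "latency_s": 1.2, "cost_usd": 0.003},
    {"ok": True,  "latency_s": 0.9, "cost_usd": 0.003},
    {"ok": False, "latency_s": 2.5, "cost_usd": 0.004},
    {"ok": True,  "latency_s": 1.1, "cost_usd": 0.003},
]

successes = [r for r in runs if r["ok"]]
total_cost = sum(r["cost_usd"] for r in runs)

# Failed attempts still billed tokens, so the denominator is successes.
cost_per_success = total_cost / len(successes)
median_latency = statistics.median(r["latency_s"] for r in runs)

print(f"success rate: {len(successes) / len(runs):.0%}")  # 75%
print(f"cost/success: ${cost_per_success:.4f}")
print(f"median latency: {median_latency:.1f}s")
```

at 30-50 pages these numbers stabilize enough to compare two models honestly; at 3 pages they're noise.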
the “personality” thing you mentioned—verbose vs terse—matters only if it affects instruction following accuracy. if both return correct selectors, personality is irrelevant. if one returns formatted explanations and the other doesn’t, that’s just noise you need to filter out.
once you’ve established baseline performance on a model, that’s your north star. switching costs you stability without a clear benefit.
You’re facing the classic model selection problem without clear success metrics. For browser automation, define your success criteria first: accuracy of extracted data, latency, cost per task. Then test a representative sample of models against those criteria on your actual use cases. You’ll likely find that mid-tier models perform identically to large models at a fraction of the cost. The “personality” difference matters only if it affects structured output quality. Most of your experimentation should focus on consistency across diverse page layouts, not on general capability.
Model selection for deterministic workflows follows different principles than selection for reasoning tasks. Browser automation requires instruction following fidelity and consistent output formatting, not general reasoning capacity. Your intuition that smaller models suffice is likely correct. Empirical testing on representative samples—measuring precision, latency, and cost across diverse page structures—is the only reliable selection method. Model “personality” (verbosity, reasoning style) is irrelevant if output correctness is equivalent.