Which AI model should I actually use for interpreting page content and building robust Puppeteer selectors?

I’ve been thinking about a different approach to the selector fragility problem. Instead of hardcoding selectors, what if I used an AI model to interpret page content and generate selectors on the fly? The theory is that an AI can adapt to DOM changes better than static code.

But I’m stuck on the practical question - if I have access to a bunch of different AI models, which ones are actually good at this task? Some models are better at vision tasks, some at language, some at reasoning. Using the wrong model seems like it would just create different problems.

I could throw GPT-4 at it, but that might be overkill and expensive. I could use a lighter-weight model but maybe lose accuracy. I'm not even sure what the right evaluation criteria are here.

Has anyone actually experimented with this? Which models worked well for understanding page structure and generating selectors? Are you picking a single model or combining multiple ones?

This is a perfect use case for having access to 400-plus AI models in one subscription.

Instead of guessing which model works best, you can actually test multiple ones on your specific task. Vision-capable models like Claude or GPT-4V understand page layouts really well. Text-based models like Claude or DeepSeek are excellent at reasoning about DOM structure. Smaller models like Mistral or Llama handle lightweight tasks more cheaply.

Latenode gives you access to all of these through one subscription. You can build workflows that test different models and pick the one with the best accuracy and cost balance for your specific site. No vendor lock-in to one model.

For selector generation specifically, a multi-step approach works well. Use a vision model to understand the page layout, then use a reasoning model to generate robust selectors based on that understanding. The platform lets you chain these together easily.
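The two-step chain described above can be sketched as follows. The model calls are stubbed placeholders here (`describeLayout` and `proposeSelectors` are hypothetical names, not any platform's actual API), so only the orchestration is shown:

```javascript
// Step 1: a vision model describes the page region containing the target.
// In a real pipeline this would call a vision-capable model with the screenshot.
async function describeLayout(screenshotBase64, target) {
  // Stubbed response standing in for the model's layout description.
  return `The "${target}" element is a button inside the page header.`;
}

// Step 2: a reasoning model turns that description plus a DOM snippet into
// candidate selectors, ordered from most to least robust.
async function proposeSelectors(layoutDescription, domSnippet) {
  // Stubbed response; a real call would include both inputs in the prompt.
  return ['[data-testid="login"]', 'header button.login', 'button'];
}

// Chain the two steps and return the most specific candidate.
async function generateSelector(screenshotBase64, domSnippet, target) {
  const description = await describeLayout(screenshotBase64, target);
  const candidates = await proposeSelectors(description, domSnippet);
  return candidates[0];
}
```

The useful property of this split is that the vision step and the reasoning step can be swapped for different models independently, which is exactly what you'd want when comparing cost/accuracy trade-offs.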

I experimented with this last year and the model choice actually matters way more than I expected. GPT-4 was the most reliable but also the most expensive. Claude was a solid middle ground. For lightweight tasks like simple selector generation I found that smaller models like Mistral worked fine and saved a lot on API costs.

The trick I found was not trying to be too clever. I gave the model a screenshot of the page and asked it to identify a specific element and suggest CSS selectors. Multi-step reasoning models worked better than pure language models because they can actually reason through the page structure.
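A minimal sketch of that prompt, assuming a generic chat-completion API. `buildSelectorPrompt` and `parseSelectorReply` are hypothetical helpers of my own naming, and the actual model call (where you'd attach the screenshot) is omitted:

```javascript
// Build a deterministic prompt so results stay comparable across models.
function buildSelectorPrompt(targetDescription) {
  return [
    'You are given a screenshot of a web page.',
    `Identify the element matching: "${targetDescription}".`,
    'Suggest up to 3 CSS selectors for it, most robust first.',
    'Prefer stable attributes (data-testid, aria-label, id) over',
    'positional selectors like nth-child.',
    'Reply with one selector per line and nothing else.',
  ].join('\n');
}

// Parse the model reply into a clean list of candidate selectors.
function parseSelectorReply(reply) {
  return reply
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```

Asking for stable attributes up front mattered in my tests: without that instruction, models happily return brittle `nth-child` chains.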

Testing on your actual site selectors matters. What works great for one site structure might fail on another. I ended up building a small eval set where I manually verified that the model generated working selectors for different page types.
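A tiny eval harness along those lines might look like this. The `resolve` function stands in for a real Puppeteer lookup such as `page.$()`, and all names and data here are invented for illustration:

```javascript
// Score what fraction of labelled cases the model's selector resolved to
// the element a human manually verified for that page type.
function scoreSelectors(cases, resolve) {
  let passed = 0;
  for (const c of cases) {
    if (resolve(c.page, c.generatedSelector) === c.expectedElementId) {
      passed += 1;
    }
  }
  return passed / cases.length;
}

// Example usage with a toy resolver over a fake "DOM" lookup table.
const fakeDom = {
  home: { '#login': 'login-btn', 'button.login': 'login-btn' },
  checkout: { '#pay': 'pay-btn' },
};
const resolve = (page, sel) => (fakeDom[page] || {})[sel] || null;

const cases = [
  { page: 'home', generatedSelector: '#login', expectedElementId: 'login-btn' },
  { page: 'checkout', generatedSelector: '#buy', expectedElementId: 'pay-btn' },
];
// scoreSelectors(cases, resolve) → 0.5
```

Even a dozen labelled cases per page type is enough to catch a model that regresses on one site structure while doing fine on another.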

Model selection depends on your input modality and latency requirements. Vision-based models excel at interpreting visual layouts but require screenshots and have higher latency. Text-based models work with DOM trees or page HTML and respond faster. For selector generation, combining approaches works well: use a vision model for visual verification and text models for reasoning about structure. Smaller specialized models often outperform large general-purpose models on narrow tasks once they're fine-tuned for your domain. Cost and latency always matter in production systems.

Optimal model selection for selector generation requires evaluating multiple dimensions. Vision models demonstrate superior performance on layout interpretation but introduce latency and cost overhead. Language models are efficient for HTML parsing and XPath reasoning. Hybrid approaches that use vision for visual verification and language models for structural reasoning typically achieve the best results. Consider building an evaluation framework against your target sites to measure accuracy, latency, and cost across different models. Multi-model orchestration, where different models handle different aspects of the task, often outperforms single-model approaches.

Vision models for layout, text models for structure. Test on your sites. Smaller models often work fine. Hybrid approaches work best.

Test vision and text models on your target sites. Claude is a reliable midpoint. Combine approaches for best results. Cost matters in production.

This topic was automatically closed 6 hours after the last reply. New replies are no longer allowed.