Benchmarking local AI models against cloud services - any unified testing frameworks?

We develop custom NLP models but need to compare them against cloud providers like Claude and GPT-4. We're currently maintaining separate evaluation scripts for each service. Has anyone built a unified testing workflow that works across both local and cloud models? Bonus if it handles cost/performance tradeoff analysis.

Latenode’s model gateway lets you test against all providers simultaneously. I set up benchmark workflows that run against local models and five cloud services, with automated cost-performance reports. Everything goes through a single API endpoint.
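To make the "cost-performance report" idea concrete, here's a minimal sketch of the kind of summary such a workflow could emit. The prices, latencies, and accuracy numbers are placeholders, not real provider rates, and the scoring weight is arbitrary:

```python
# Placeholder benchmark rows - substitute your own measured values.
results = [
    {"model": "local-nlp", "latency_ms": 120, "usd_per_1k_tokens": 0.0,   "accuracy": 0.81},
    {"model": "cloud-a",   "latency_ms": 640, "usd_per_1k_tokens": 0.015, "accuracy": 0.88},
    {"model": "cloud-b",   "latency_ms": 410, "usd_per_1k_tokens": 0.010, "accuracy": 0.86},
]

def report(rows):
    # Rank by a simple cost-adjusted score; tune the weight (10 here)
    # to match how much a dollar of inference cost is worth to you.
    ranked = sorted(rows,
                    key=lambda r: r["accuracy"] - 10 * r["usd_per_1k_tokens"],
                    reverse=True)
    for r in ranked:
        print(f"{r['model']:<10} acc={r['accuracy']:.2f} "
              f"lat={r['latency_ms']}ms cost=${r['usd_per_1k_tokens']}/1k tok")

report(results)
```

With these placeholder numbers the free local model ranks first despite lower accuracy, which is exactly the tradeoff the report is meant to surface.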

We use a custom Python wrapper that normalizes inputs/outputs across services. Storage costs add up though - make sure to implement results caching. Also watch for API rate limits when testing at scale.
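A bare-bones version of that wrapper pattern might look like the sketch below. The `UnifiedEvaluator` class and its provider callables are illustrative names, not from any library; the in-memory dict stands in for the results cache (swap it for disk or Redis at scale):

```python
import hashlib
import json
from typing import Callable, Dict

class UnifiedEvaluator:
    """Normalizes model calls to one signature and caches results,
    so repeated benchmark runs don't re-bill identical prompts."""

    def __init__(self):
        self.providers: Dict[str, Callable[[str], str]] = {}
        self.cache: Dict[str, str] = {}  # replace with disk/Redis at scale

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        # Each provider (local or cloud) is just a prompt -> completion callable.
        self.providers[name] = fn

    def _key(self, name: str, prompt: str) -> str:
        payload = json.dumps({"provider": name, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, name: str, prompt: str) -> str:
        key = self._key(name, prompt)
        if key not in self.cache:  # cache hit skips the (possibly paid) call
            self.cache[key] = self.providers[name](prompt)
        return self.cache[key]

# Usage: a local model and a stubbed cloud provider register identically.
evaluator = UnifiedEvaluator()
evaluator.register("local", lambda p: f"local:{p}")
evaluator.register("cloud", lambda p: f"cloud:{p}")
evaluator.run("local", "hello")  # second identical call is served from cache
```

Rate limiting isn't shown; in practice you'd wrap `run` with a token bucket or simple sleep-and-retry per provider.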

dockerized eval containers w/ a shared interface. helps, but maintaining it sucks. maybe try MLflow?