How do I include custom tools in OpenAI evaluations to match my production setup?

I’m struggling with setting up evaluations that match my live environment. My production system uses custom tools with function schemas that are crucial for proper responses. When I try to run evaluations without these tools, the results don’t make sense because the model expects to call these functions.

My current production code:

const result = await openai.chat.completions.create({
    model: "gpt-4o",
    // The system prompt goes in the messages array, not a top-level param.
    messages: [{ role: "system", content: prompt }, ...messages],
    // `functions` is deprecated; wrap each schema as
    // { type: "function", function: { ... } } and pass it via `tools`.
    tools: customFunctions,
    // `store: true` is the actual logging flag (there is no `save_logs`).
    store: true
});

I want to create evaluations that use the exact same configuration as my production environment. Testing the base model alone isn’t useful since my real application relies heavily on tool calling.

I know about the completion logging feature for checking previous runs, but right now I just need to run evaluations with my tools included and use my own test dataset.

Can I set up evaluations in the dashboard that include my custom tools?

OpenAI’s dashboard doesn’t support custom tools in evaluations, which is a massive pain when you’re testing real production scenarios.

I ran into this exact problem last year with our customer support chatbot. Production used multiple custom tools - ticket lookup, knowledge base search, escalation routing. Testing without them was useless.

I built my own evaluation pipeline outside OpenAI’s dashboard. Created a script that hits the OpenAI API with our exact production config and runs test datasets through it.

But managing evaluation infrastructure gets messy quick. Storing results, comparing model versions, handling tool orchestration - it adds up fast. I automated everything with Latenode.

My Latenode setup:

  • Pulls test cases from datasets
  • Calls OpenAI with production tool config
  • Runs evaluation metrics
  • Stores and compares results over time
  • Alerts when performance tanks

The visual workflow builder handles all the API calls and data processing without writing tons of boilerplate.

You can mirror your production setup exactly and get way better results than OpenAI’s dashboard.

Dashboard evaluations don’t support custom functions yet. Hit this same wall testing our e-commerce recommendation system - it needed product lookup and inventory tools to work properly.

I ended up building a separate eval script that mirrors production exactly. Same OpenAI API calls, same function schemas, same tool configs - just running against test data programmatically. My setup’s pretty basic: a Node.js script that loops through test cases, hits the API with the full functions array, then checks responses against what I expect. Way more realistic, since the model can actually use its tools.

For tracking, I just dump results to JSON files with timestamps and model versions. Not fancy like the dashboard, but much more accurate when your app lives or dies by function calling. The extra work’s worth it.

yeah, dashboard evals are pretty useless without function support. i just mock up a quick test harness using the same OpenAI API calls as prod but with my test cases. takes about 30 minutes to set up and it’s way more reliable than fighting dashboard limitations.

That dashboard limitation sucks, but don’t waste time building eval scripts from scratch.

I used to burn hours maintaining custom evaluation code. Then I realized the real problem isn’t running tests - it’s managing the whole evaluation mess.

Your production setup needs constant testing as you iterate. But hand-coding evaluation pipelines means you’re always fixing infrastructure instead of improving your app.

I automated everything. My workflow grabs test datasets, runs OpenAI calls with the same function schemas as production, tests multiple criteria, and tracks performance over time.

Key insight: treat evaluations like production systems. You need reliability, monitoring, and easy iteration.

I built reusable workflows that handle API orchestration, data processing, and result analysis automatically. When I update production tools, evaluations update with zero code changes.

This scales way better than manual scripting. Run evaluations on schedule, compare model versions, get alerts when tools start failing.
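Under the hood, the compare-and-alert step boils down to something like this (record shape and threshold are illustrative, not a fixed format):

```javascript
// Compare a baseline eval run against a candidate and flag a regression
// when the pass rate drops by more than a threshold.
function detectRegression(baseline, candidate, threshold = 0.05) {
  const drop = baseline.passRate - candidate.passRate;
  return {
    regressed: drop > threshold,
    drop: Number(drop.toFixed(4)),
  };
}

const baseline = { model: "gpt-4o", passRate: 0.92 };
const candidate = { model: "gpt-4o-mini", passRate: 0.81 };
console.log(detectRegression(baseline, candidate));
// { regressed: true, drop: 0.11 }
```

Whether this runs in a script or a workflow node, the point is the same: every eval run produces a comparable record, so alerting is just a diff.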

The visual workflow builder lets you chain API calls, data transformations, and custom logic without boilerplate code.

You’re right - evaluations without custom tools are useless for production. The dashboard limitation sucks, but I’ve got a workaround I’ve used for months.

I build a local evaluation setup that mirrors my prod environment exactly. Same function definitions, same schemas - everything identical. Keep your eval config in a separate file that imports the exact tool definitions you use in production. For tracking, I dump everything to structured logs with model version, timestamps, and performance metrics. This actually gives you way more control than their dashboard, since you can customize criteria for your specific tools.

Biggest mistake I see: test datasets that don’t actually trigger tool usage - then you’re still not testing real behavior. Create test cases specifically to hit different tool combinations. Running locally also makes debugging easier: you can see exactly which functions get called and what parameters they receive.
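To make the "separate file that imports the exact tool definitions" idea concrete, here’s a rough sketch - the file layout, tool schema, and function names are illustrative:

```javascript
// tools.js (conceptually) - the single source of truth that both the
// production request builder and the eval harness import from.
const productionTools = [
  {
    type: "function",
    function: {
      name: "lookup_product",
      description: "Fetch product details by SKU",
      parameters: {
        type: "object",
        properties: { sku: { type: "string" } },
        required: ["sku"],
      },
    },
  },
];

// The eval config mirrors the production request options exactly; the only
// difference between prod and eval is the dataset fed through it.
function buildEvalConfig(messages) {
  return {
    model: "gpt-4o",
    messages,
    tools: productionTools,
  };
}

// Structured log record per eval run: model version, timestamp, metrics.
function makeRunRecord(modelVersion, results) {
  return {
    modelVersion,
    timestamp: new Date().toISOString(),
    passRate: results.filter((r) => r.pass).length / results.length,
    results,
  };
}

const record = makeRunRecord("gpt-4o-2024-08-06", [
  { id: 1, pass: true },
  { id: 2, pass: false },
]);
console.log(record.passRate); // 0.5
```

Because the eval config imports `productionTools` rather than copying it, updating a schema in production updates every eval automatically - zero drift.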

Yeah, the dashboard limitation sucks, but here’s what I did instead. I ditched the idea of building custom evaluation infrastructure and just used pytest to structure my tests properly.

I built a test suite that loads my production function schemas and runs tests against different scenarios. Each test checks both the function calls and the final responses. What’s great is you get proper test organization, parametrized testing, and built-in assertions: I can run specific test categories, generate coverage reports, and hook into CI/CD without any hassle. My setup uses the exact same functions.json file as production, so there’s zero drift.

For tracking results over time, I just dump test results into a simple database table. Way cleaner than managing random scripts, and it gives you proper testing patterns that actually scale.