How do I include custom functions in OpenAI evaluations to match my live setup?

I have a live system that uses custom functions with specific schemas. When I try to test responses, the evaluation doesn’t work right because it’s missing these functions.

My live code looks like this:

import OpenAI from "openai";

const openai = new OpenAI(); // picks up OPENAI_API_KEY from the environment

// messages and myFunctions (the custom function schemas) come from the rest of the app
const result = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: messages,
    functions: myFunctions,
    temperature: 0.7
});

I want to run tests using the same functions and my own test questions. I know about the completion logs feature but I need something different.

Can I set up evaluations in the dashboard that include my custom functions? The evaluation needs to match my production setup exactly or the results won’t be useful.

The OpenAI dashboard evaluations don't support custom functions directly - you'll need to build your own evaluation pipeline. I ran into the same issue last year and ended up creating a separate testing framework that mirrors my production environment.

What I did was create a simple evaluation script that uses the same openai.chat.completions.create() call with your custom functions, then runs your test cases against it. You can measure whatever metrics matter to you - accuracy, function-calling correctness, response quality, etc.

For systematic testing, I built a JSON file of test cases containing the input messages and expected outcomes. The script iterates through these, makes API calls with your functions included, and compares results. It's more work than using the dashboard, but it gives you complete control over the evaluation process and ensures your testing environment matches production exactly.
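
A minimal sketch of what that script can look like, assuming a test-cases.json file with messages, expectedFunction, and expectedArgs fields (those names, functions.js, and eval.mjs are just illustrative, not anything OpenAI provides):

// eval.mjs - minimal sketch of the evaluation script, run as an ES module: node eval.mjs
import fs from "node:fs";
import OpenAI from "openai";
import { myFunctions } from "./functions.js"; // the same schemas used in production

const openai = new OpenAI();
const testCases = JSON.parse(fs.readFileSync("test-cases.json", "utf8"));

let passed = 0;
for (const tc of testCases) {
  const result = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: tc.messages,
    functions: myFunctions,
    temperature: 0.7, // keep the same settings as the live call
  });

  const call = result.choices[0].message.function_call; // undefined if the model answered in text
  const args = call ? JSON.parse(call.arguments) : null;

  // Crude comparison: right function picked and arguments match the expectation.
  // JSON.stringify equality assumes consistent key order; use a real deep-equal in practice.
  const ok =
    call?.name === tc.expectedFunction &&
    JSON.stringify(args) === JSON.stringify(tc.expectedArgs);

  if (ok) passed++;
  else console.log(`FAIL ${tc.id}:`, call?.name, call?.arguments);
}

console.log(`${passed}/${testCases.length} test cases passed`);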

Yeah, the dashboard won't handle custom functions, so you've got to roll your own testing setup. I just use a simple Node script that loads my function definitions and runs test cases against them. Make sure to test both the function selection AND the parameter extraction - that's where most bugs happen in my experience.
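
A rough sketch of that two-part check (checkCall and the shape of the expected object are made up for this example):

// message is result.choices[0].message from a chat.completions.create call
function checkCall(message, expected) {
  const call = message.function_call;
  // 1) function selection
  if (!call || call.name !== expected.name) {
    return { ok: false, reason: `wrong function: ${call ? call.name : "none"}` };
  }
  // 2) parameter extraction (arguments arrive as a JSON string)
  const args = JSON.parse(call.arguments);
  for (const [key, value] of Object.entries(expected.args)) {
    if (args[key] !== value) {
      // strict equality only covers primitives; use a deep compare for nested params
      return { ok: false, reason: `bad value for ${key}: ${JSON.stringify(args[key])}` };
    }
  }
  return { ok: true };
}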

I faced this exact problem when transitioning from development to production testing. The key insight is that you need to replicate your entire function-calling context during evaluation, not just the model responses.

What worked for me was setting up a local evaluation environment that mimics your production API structure completely. I created a wrapper function that takes your test prompts, applies the same function schemas, and captures both the function calls and their execution results. The critical part is validating that the model selects the right functions with correct parameters, not just measuring response quality.

I also found it helpful to log the complete conversation flow including function returns, since that affects subsequent model behavior. Consider using a simple database or file system to store evaluation runs so you can track performance changes over time as you modify your functions or prompts.
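
A sketch of that kind of wrapper, assuming the legacy functions-style setup from the question (runEvalCase, handlers, and the eval-runs directory are names invented for this example, not part of the OpenAI SDK):

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// tc: a test case with id and messages; handlers: a map of function name -> implementation
async function runEvalCase(tc, functions, handlers) {
  const messages = [...tc.messages];

  // First pass: which function (if any) does the model select, and with what parameters?
  const first = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    functions,
    temperature: 0.7,
  });
  const call = first.choices[0].message.function_call;

  if (call) {
    // Execute the function the way production does, then feed the result back
    // so the follow-up response reflects the real conversation flow.
    const output = await handlers[call.name](JSON.parse(call.arguments));
    messages.push({ role: "assistant", content: null, function_call: call });
    messages.push({ role: "function", name: call.name, content: JSON.stringify(output) });

    const second = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
      functions,
      temperature: 0.7,
    });
    messages.push(second.choices[0].message);
  } else {
    messages.push(first.choices[0].message);
  }

  // Persist the full conversation so runs can be compared as prompts and functions change.
  fs.mkdirSync("eval-runs", { recursive: true });
  fs.writeFileSync(`eval-runs/${tc.id}-${Date.now()}.json`, JSON.stringify(messages, null, 2));
  return messages;
}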