LangSmith Evaluation Tests Pass When They Should Fail

I’m having trouble with my LangSmith evaluation setup. I created test cases with wrong reference answers on purpose to see if the evaluator would catch them, but all my tests keep passing when they should fail.

The problem is that the evaluator seems to compare the AI responses against some internal correctness standard instead of using my reference outputs. I wanted it to fail because my reference answers are intentionally wrong.

Here’s my test setup:

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
import openai
from langsmith.wrappers import wrap_openai

client = Client()

# Create test dataset
test_dataset = "math_validation_tests"
dataset = client.create_dataset(test_dataset, description="Testing evaluation behavior")
client.create_examples(
    inputs=[
        {"prompt": "What is 8 - 3?"},
        {"prompt": "Calculate 4 + 4"},
        {"prompt": "What is the square root of 16?"}
    ],
    outputs=[
        {"result": "8 - 3 equals 7"},  # Wrong answer
        {"result": "4 + 4 equals 9"},  # Wrong answer  
        {"result": "Square root of 16 is 5"}  # Wrong answer
    ],
    dataset_id=dataset.id,
)

# AI function
ai_client = wrap_openai(openai.Client())

def run_prediction(input_data: dict) -> dict:
    messages = [{"role": "user", "content": input_data["prompt"]}]
    result = ai_client.chat.completions.create(messages=messages, model="gpt-3.5-turbo")
    # Return the response text so the evaluator receives a string, not the raw completion object
    return {"prediction": result.choices[0].message.content}

# Run evaluation
test_results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"})],
    metadata={"test_version": "1.0.0"},
)

Even though my reference outputs have wrong math answers, the evaluator gives passing scores because it recognizes that GPT’s responses are mathematically correct. I expected it to compare against my reference outputs and fail the tests since they don’t match. How can I make it use my reference answers for comparison instead of evaluating absolute correctness?

Yeah, that's basically how the labeled_criteria evaluator behaves with the correctness criterion. It's an LLM-as-judge check: the judge grades whether the response is factually accurate, so even though your reference answers are in the dataset, the judge's own knowledge wins out and a correct GPT answer gets a passing score. You need an evaluator that actually scores against your reference data. I'd build a custom evaluator that directly compares the AI output to your reference answers using string matching or similarity scoring - a simple function that checks whether the prediction matches your intentionally wrong reference will do exactly what you're testing for.
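Something like this works as a rough sketch - the matches_reference name, the "matches_reference" key, and the 0.9 similarity cutoff are placeholders I made up, so tune them to your data:

from difflib import SequenceMatcher

def matches_reference(run, example):
    # Compare the model's prediction to the dataset's reference output
    prediction = (run.outputs or {}).get("prediction", "")
    reference = (example.outputs or {}).get("result", "")
    # Fuzzy string similarity in [0, 1]; 1.0 means the strings are identical
    similarity = SequenceMatcher(None, prediction.lower(), reference.lower()).ratio()
    return {"key": "matches_reference", "score": 1 if similarity >= 0.9 else 0}

Drop that into the evaluators list in place of the labeled_criteria one and your intentionally wrong references will start failing the run the way you wanted.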

Your problem comes from how the correctness criterion actually works. I ran into the same thing debugging evaluation workflows - with correctness, the LLM judge scores responses against what it believes is factually true, so your intentionally wrong reference answers don't change the outcome. You need a reference-based evaluator instead. Try exact_match - it actually uses your dataset outputs for comparison. Or go with a semantic similarity evaluator that compares predictions directly to your reference data without any fact-checking. This is intentional behavior: the correctness criterion is built to catch factual errors regardless of what's in your reference set, which is exactly why it keeps passing even with wrong reference answers.
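If I'm remembering the API right, the swap is just the evaluators argument - treat the evaluator names here as a sketch and double-check them against the LangChain evaluator list before relying on them:

# Reference-based evaluators instead of the correctness judge
test_results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[
        LangChainStringEvaluator("exact_match"),         # strict string comparison against your reference
        LangChainStringEvaluator("embedding_distance"),  # semantic closeness to your reference, no fact-checking
    ],
    metadata={"test_version": "1.0.0"},
)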

Your problem is that labeled_criteria with correctness isn't the reference comparison you expect - the LLM judge checks whether the math is actually right, regardless of what you put in the reference. Switch to exact_match or build a custom evaluator that directly compares predictions to your reference answers. Try LangChainStringEvaluator("exact_match") or a simple string-similarity evaluator (see the sketch below). The correctness criterion is designed to be objective about factual accuracy, so it will keep passing your tests even when your reference data is wrong - that's why they aren't failing like you expect.
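For the string-similarity route, LangChain also ships a string_distance evaluator you can wrap the same way - a sketch, assuming the default edit-distance metric is good enough for short math answers:

evaluators=[
    # Scores by edit distance between the prediction and your reference answer
    LangChainStringEvaluator("string_distance"),
]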

This is a super common issue with evaluation pipelines. The labeled_criteria evaluator with correctness doesn’t work like you’d expect.

I hit this same problem last year building validation tests for our ML pipeline. With the correctness criterion the LLM judge effectively does its own fact-checking - it leans on its own knowledge rather than on whatever you put in the outputs field, so it's grading objective truth, not agreement with your reference outputs.

You need either an exact match evaluator or a custom one. Here’s what fixed it for me:

# Replace your evaluator with this
evaluators=[LangChainStringEvaluator("exact_match")]

Or build a simple custom evaluator that actually uses your reference data:

def reference_comparison_evaluator(run, example):
    # The model's output and the dataset's (intentionally wrong) reference answer
    prediction = run.outputs["prediction"]
    reference = example.outputs["result"]
    # Score 1 only when the prediction matches the reference exactly
    return {"key": "matches_reference", "score": 1 if prediction == reference else 0}

The correctness evaluator works great for objective evaluation, but when you’re testing evaluation logic itself, you need something that respects your reference answers even when they’re intentionally wrong.
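To plug the custom one in, you can pass the function straight into the evaluators list - evaluate() accepts plain (run, example) callables - reusing the run_prediction function and dataset name from your question:

test_results = evaluate(
    run_prediction,
    data=test_dataset,
    # The custom callable scores against your reference data instead of objective correctness
    evaluators=[reference_comparison_evaluator],
    metadata={"test_version": "1.0.0"},
)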

Yeah, this is annoying but expected. With the correctness criterion the judge effectively does its own fact-checking rather than deferring to your reference outputs - bad reference data gets overridden by what the model knows to be true, which is why your wrong math answers aren't failing the tests. Switch to the exact_match evaluator and the score will come straight from a comparison against your references.