LangSmith Evaluation Tests Pass When They Should Fail

I’m having trouble with my LangSmith evaluation setup. I created test cases with wrong reference answers on purpose to see if the evaluation would catch them and fail the tests. But instead, all my tests keep passing.

The problem is that the evaluator seems to compare the AI responses against some internal knowledge instead of using my reference outputs. I want it to compare the GPT responses with my provided reference answers and fail when they don’t match.

Here’s my test setup:

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
import openai
from langsmith.wrappers import wrap_openai

client = Client()

# Create test dataset
test_dataset = "math_questions_test"
dataset = client.create_dataset(test_dataset, description="Math evaluation tests")
client.create_examples(
    inputs=[
        {"prompt": "What is 8 + 7?"},
        {"prompt": "Calculate 6 x 3"},
        {"prompt": "What's the square root of 16?"}
    ],
    outputs=[
        {"result": "8 + 7 equals 20"},  # Wrong answer intentionally
        {"result": "6 x 3 equals 12"},   # Wrong answer intentionally  
        {"result": "Square root of 16 is 5"}  # Wrong answer intentionally
    ],
    dataset_id=dataset.id,
)

# AI system function
openai_client = wrap_openai(openai.Client())

def run_prediction(input_data: dict) -> dict:
    user_messages = [{"role": "user", "content": input_data["prompt"]}]
    ai_response = openai_client.chat.completions.create(messages=user_messages, model="gpt-3.5-turbo")
    return {"result": ai_response.choices[0].message.content}  # return the response text, not the raw completion object

# Run evaluation
results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"})],
    metadata={"test_version": "1.0.0"},
)

The evaluator keeps giving scores of 1 because it recognizes that GPT gave the right mathematical answers, even though my reference outputs are intentionally wrong. How can I make it actually compare against my reference answers instead of using its own correctness logic?

You’re encountering this issue because the LangChainStringEvaluator with the “correctness” criterion assesses factual accuracy rather than direct reference matching. It utilizes its own knowledge to evaluate the correctness of math answers, which is why your intentionally incorrect reference answers are not resulting in test failures.

Consider switching to LangChainStringEvaluator("labeled_score_string"), as this option compares outputs directly with your reference answers. Alternatively, you might create a custom evaluator for precise string matching or similarity evaluations.
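
To make that concrete, here's roughly what the swap could look like, reusing the setup from the question. The criteria wording and the normalize_by key are assumptions based on LangChain's score-string evaluator config, so verify them against your installed version:

reference_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        # Criterion text is my own wording; it asks the judge to grade against the reference output
        "criteria": {
            "reference_match": "Does the submission state the same answer as the reference output?"
        },
        "normalize_by": 10,  # assumed: maps the judge's 1-10 score into the 0-1 range
    },
)

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[reference_evaluator],
)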

I faced the same challenge while testing my pipeline last month. The correctness evaluator functions correctly but is unsuitable for reference matching scenarios. After I switched to labeled_score_string, my tests began to fail appropriately when the outputs did not align with my references.

Hit this exact issue building automated testing for ML pipelines last year. The “correctness” evaluator runs its own math validation and completely ignores your reference data.

Instead of fighting LangSmith’s evaluator quirks, I’d automate the whole evaluation process. Built a workflow that handles test dataset creation, runs evaluations with multiple comparison methods, and generates reports automatically.

You can set up exact string matching, semantic similarity, or any custom comparison logic you want. Works with whatever evaluation framework you’re using - LangSmith, custom setup, doesn’t matter.
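
As a rough sketch of stacking comparison methods in one evaluate() call: the function names below are mine, and the difflib ratio is just a cheap lexical stand-in for real semantic similarity (you'd swap in an embedding comparison for that).

from difflib import SequenceMatcher

def exact_match(run, example):
    # Hard pass/fail: the prediction must equal the dataset reference exactly
    predicted = (run.outputs or {}).get("result", "")
    reference = (example.outputs or {}).get("result", "")
    return {"key": "exact_match", "score": 1.0 if predicted.strip() == reference.strip() else 0.0}

def fuzzy_match(run, example):
    # Lexical similarity in [0, 1]; placeholder for a proper semantic comparison
    predicted = (run.outputs or {}).get("result", "")
    reference = (example.outputs or {}).get("result", "")
    return {"key": "fuzzy_match", "score": SequenceMatcher(None, predicted.lower(), reference.lower()).ratio()}

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[exact_match, fuzzy_match],
)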

Running daily regression tests on our LLM outputs now. When I need tests to fail on purpose (like yours), they actually fail. Want semantic comparison? That works too. No more fighting built-in evaluator assumptions.

For your immediate problem - yeah, switch to direct comparison. Long term though, automating your entire eval pipeline saves tons of headaches.

The problem is you’re using the “correctness” evaluator, which checks against ground truth instead of your reference outputs. This evaluator has built-in math reasoning, so it knows 8+7=15, not 20 - doesn’t matter what your reference says. You need the “qa” evaluator or a custom one that does direct comparison. Same thing happened to me when building regression tests for my chatbot - kept getting high scores even with garbage reference answers. Turns out the evaluator was doing semantic understanding instead of just matching outputs. Switch to LangChainStringEvaluator(“qa”) or write a simple custom evaluator that compares actual output text against your reference. That’ll give you the failing behavior you want when testing your pipeline.

You’re using the wrong evaluator. The “correctness” criterion doesn’t compare against your reference outputs - it just checks if the math is right using its own logic. That’s why your bad reference outputs aren’t failing the tests. I hit this same issue when building validation tests for our deployment pipeline. You want the “similarity” evaluator or just write a custom function that does direct text comparison. The similarity evaluator actually uses your reference outputs as the baseline, which sounds like what you need. If you need strict matching, write a simple custom evaluator that compares the actual output strings against your reference data. You’ll have full control over the comparison logic and your intentionally wrong references will fail like they should.
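
To make “full control” concrete, here’s a hypothetical evaluator (the name numeric_match is mine, not a LangSmith API) that compares only the final number in each string, so a reference like “8 + 7 equals 20” fails against a model answer of 15 no matter how it’s phrased:

import re

def numeric_match(run, example):
    # Pull the last number out of each string and compare those, ignoring the surrounding wording
    predicted_nums = re.findall(r"-?\d+(?:\.\d+)?", (run.outputs or {}).get("result", ""))
    reference_nums = re.findall(r"-?\d+(?:\.\d+)?", (example.outputs or {}).get("result", ""))
    same = bool(predicted_nums) and bool(reference_nums) and predicted_nums[-1] == reference_nums[-1]
    return {"key": "numeric_match", "score": 1.0 if same else 0.0}

Pass it in the evaluators list exactly like any other evaluator.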

Hit this same issue 6 months back building test suites for our LLM pipeline. The “correctness” evaluator is basically a mini LLM doing its own reasoning - it doesn’t compare against your reference at all.

You need string similarity evaluation, not correctness. Switch your evaluator to:

evaluators=[LangChainStringEvaluator("string_distance")]

For exact matching, use a custom evaluator:

def reference_match_evaluator(run, example):
    # Compare what your target function returned against the dataset's reference output
    predicted = run.outputs.get("result", "")
    reference = example.outputs.get("result", "")
    return {"key": "reference_match", "score": 1.0 if predicted.strip() == reference.strip() else 0.0}

String distance actually uses your reference outputs for comparison instead of doing its own math validation.
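
For completeness, both of those plug into the same evaluate() call from your question - a sketch reusing the names from your snippet:

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[
        LangChainStringEvaluator("string_distance"),  # distance-based score against the reference output
        reference_match_evaluator,                    # strict exact match defined above
    ],
    metadata={"test_version": "1.0.0"},
)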

After switching to reference-based evaluation, my broken test cases finally failed like they should. Way better for regression testing.

Yeah, this is a common gotcha with LangSmith evals. The “correctness” evaluator is basically an LLM judge that knows math, so it completely ignores your reference outputs. Try using a custom evaluator instead - something like:

def my_eval(run, example):
    return {"score": 1 if run.outputs["result"] == example.outputs["result"] else 0}

Way simpler and actually compares against what you put in the dataset.