Why are my LangSmith evaluations passing when they should fail?

I’m having trouble with my LangSmith evaluation setup. I created test cases with wrong reference answers on purpose to see if the evaluator would catch them. But instead of failing like I expected, all my tests are passing.

The problem seems to be that the evaluator is checking if the AI model’s responses are correct instead of comparing them to my reference outputs. Here’s my test setup:

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
import openai
from langsmith.wrappers import wrap_openai

client = Client()

# Create test dataset
test_dataset = "math_evaluation_dataset"
dataset = client.create_dataset(test_dataset, description="Testing evaluation logic")

client.create_examples(
    inputs=[
        {"prompt": "Calculate 15 - 8"},
        {"prompt": "What is 3 * 4?"},
        {"prompt": "Tell me the square root of 16"}
    ],
    outputs=[
        {"result": "15 - 8 equals 10"},  # Wrong on purpose
        {"result": "3 * 4 equals 15"},   # Wrong on purpose  
        {"result": "Square root of 16 is 5"}  # Wrong on purpose
    ],
    dataset_id=dataset.id,
)

# Setup AI function
openai_client = wrap_openai(openai.Client())

def run_prediction(input_data: dict) -> dict:
    user_messages = [{"role": "user", "content": input_data["prompt"]}]
    ai_response = openai_client.chat.completions.create(
        messages=user_messages, 
        model="gpt-3.5-turbo"
    )
    return {"result": ai_response}

# Run evaluation
test_results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"})],
    metadata={"test_version": "1.0.0"},
)

I expected the evaluator to compare the AI responses against my intentionally wrong reference answers and fail the tests. Instead, it’s validating that the AI gave mathematically correct answers and ignoring my reference outputs completely. How can I make it actually use my reference answers for comparison?

You’re hitting a common gotcha with LangChain evaluators. The labeled_criteria evaluator with ‘correctness’ doesn’t match strings against your reference outputs - it uses an LLM judge to decide whether the response is mathematically correct, rather than comparing it to your dataset. Switch to LangChainStringEvaluator('labeled_score_string') or write a custom evaluator that actually compares the AI response to your reference answer. I hit this exact issue testing my pipeline last month. The docs really should do a better job of explaining the difference between criteria-based evaluation and reference comparison.
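
If you go the labeled_score_string route, the config looks roughly like this (a sketch - the criteria wording and the normalize_by value are illustrative, and the rest of your evaluate() call stays the same):

scoring_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "How closely does the submission match the reference answer?"
        },
        "normalize_by": 10,  # map the judge's 1-10 score onto 0-1
    },
)

test_results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[scoring_evaluator],
)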

Hit this same problem building evaluation pipelines for our ML team. It’s way more complex than just choosing the right evaluator.

Manual setups like this are a pain to maintain. Want to add test cases? Modify evaluators? Try different validation scenarios? You’re back to writing code and debugging Python.

I automated the whole workflow instead. Now I can:

  • Set up multiple evaluation types (string matching, semantic comparison, custom logic) without code
  • Run evaluations on schedules or triggers
  • Compare results across models automatically
  • Scale to hundreds of test cases without performance hits

The automation switches between evaluator types based on what I’m testing. For your case, it’d route to exact string matching for reference comparisons, then flip to semantic evaluation for other scenarios.

Best part? I modify evaluation logic through a visual interface instead of debugging Python every time requirements change. Saves me about 20 hours monthly on evaluation maintenance.

This scales much better than manual LangSmith when you’re running production ML systems.

The issue arises from using labeled_criteria with ‘correctness’, which performs semantic evaluation rather than direct comparison with your reference answers. It judges the factual accuracy of the AI response, so your deliberately incorrect reference outputs are ignored. For reference-based evaluation, consider switching to LangChainStringEvaluator('qa'), which grades predictions against the provided ground truth. Alternatively, you could implement a custom evaluator that compares the AI output to your dataset’s reference field. I ran into the same thing in my own testing: criteria-based evaluators focus on content quality rather than validating against specific dataset entries.
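
Swapping it in is a one-line change to your existing call (a sketch, assuming the rest of your setup stays as-is):

test_results = evaluate(
    run_prediction,
    data=test_dataset,
    # 'qa' grades each prediction against the example's reference output
    evaluators=[LangChainStringEvaluator("qa")],
    metadata={"test_version": "1.0.0"},
)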

Yeah, labeled_criteria doesn’t do string matching - it’s semantic evaluation. Basically asks GPT to judge if your output is factually correct instead of comparing against your references. Use LangChainStringEvaluator('exact_match') or LangChainStringEvaluator('embedding_distance') if you want actual comparison to your dataset outputs.
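
Something like this (a sketch - I’m assuming LangChainStringEvaluator accepts these evaluator names and that the embedding evaluator’s dependencies are installed):

exact_match_eval = LangChainStringEvaluator("exact_match")        # literal string comparison against the reference
embedding_eval = LangChainStringEvaluator("embedding_distance")   # semantic distance from the reference

test_results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[exact_match_eval, embedding_eval],
)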

Been there! Your evaluator’s acting like a smart teacher instead of a strict grader. The labeled_criteria with correctness checks if answers make logical sense, not whether they match your reference data exactly.

I hit this same issue building test suites. The evaluator ignores your wrong reference answers because it validates mathematical accuracy, not literal comparisons.

Quick fix - swap to LangChainStringEvaluator('qa') or build a simple custom evaluator for actual string matching:

def custom_reference_evaluator(run, example):
    # Pull the model output from the run and the reference answer from the dataset example
    prediction = run.outputs['result']
    reference = example.outputs['result']
    # Score 1.0 only on an exact match with the reference, otherwise 0.0
    return {"score": 1.0 if prediction == reference else 0.0}

This’ll give you the failing tests you want. LangSmith has solid docs on custom evaluators if you need to go beyond exact matching.

Once you switch, your intentionally wrong reference answers will finally cause proper test failures.