I’m having trouble with my LangSmith evaluation setup. I created test cases with wrong reference answers on purpose to see if the system would catch them and mark tests as failed. But all my tests keep passing even though the reference outputs I provided are clearly incorrect.
The problem seems to be that the evaluator is comparing the GPT model responses against what’s actually correct, not against my reference outputs from the dataset. Here’s my test setup:
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
import openai
from langsmith.wrappers import wrap_openai

client = Client()

# Create test dataset
test_dataset = "math_correctness_test"
dataset = client.create_dataset(test_dataset, description="Math evaluation tests")

client.create_examples(
    inputs=[
        {"problem": "What is 8 - 3?"},
        {"problem": "Calculate 4 + 4"},
        {"problem": "What is the approximate value of Pi?"}
    ],
    outputs=[
        {"result": "8 - 3 equals 7"},  # Wrong on purpose
        {"result": "4 + 4 equals 9"},  # Wrong on purpose
        {"result": "Pi is approximately 2"}  # Wrong on purpose
    ],
    dataset_id=dataset.id
)
# Setup model function
openai_client = wrap_openai(openai.Client())

def run_prediction(input_data: dict) -> dict:
    prompt = [{"role": "user", "content": input_data["problem"]}]
    result = openai_client.chat.completions.create(
        messages=prompt,
        model="gpt-3.5-turbo"
    )
    # Return the message text, not the raw completion object
    return {"prediction": result.choices[0].message.content}
# Run evaluation
results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[LangChainStringEvaluator(
        "labeled_criteria",
        config={"criteria": "correctness"}
    )]
)
The evaluator keeps giving scores of 1 and marking everything as correct because it sees that GPT gave the right mathematical answers, even though those answers don’t match my intentionally wrong reference outputs. How can I make it actually compare against my dataset outputs instead of just checking if the answer is factually correct?
Hit this same problem last year building eval pipelines. The “correctness” evaluator does factual validation and completely ignores your reference data.
Ditch that evaluator. Use “exact_match” or build something custom:
from langsmith.evaluation import EvaluationResult

def reference_comparison_evaluator(run, example):
    prediction = run.outputs["prediction"]
    reference = example.outputs["result"]
    # Simple string comparison
    match = prediction.strip().lower() == reference.strip().lower()
    return EvaluationResult(
        key="reference_match",
        score=1 if match else 0
    )
Use this instead of the LangChain one. It’ll actually check against your dataset outputs instead of doing its own validation.
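For completeness, this is roughly how it wires into the same evaluate() call from the question (reusing evaluate, run_prediction, and test_dataset from that snippet):

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[reference_comparison_evaluator]  # custom function above, called with (run, example)
)

LangSmith passes each run and its paired example into the function, so the score comes purely from your dataset outputs.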
What’s the point of reference comparison if the evaluator thinks it knows better than your data?
The problem is LangChain’s labeled_criteria evaluator doesn’t actually compare against your reference data - it just validates if the answer makes sense semantically. I ran into this same issue when testing model degradation with intentionally wrong ground truth. Your evaluator is acting like an independent fact-checker, ignoring your dataset outputs completely. This breaks controlled testing with known-bad references. Ditch the labeled_criteria setup and use the string_distance evaluator instead, or build a custom one that directly compares model output to example.outputs. The correctness criteria totally bypasses your reference data, which is why wrong answers keep passing. You need evaluators that treat your dataset as the source of truth, even when it’s factually incorrect.
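A rough sketch of the string_distance route, still through LangChainStringEvaluator - the prepare_data mapping is my guess at the key names from the original snippet ("prediction" on the run, "result" on the example), so adjust if yours differ:

from langsmith.evaluation import evaluate, LangChainStringEvaluator

string_distance_evaluator = LangChainStringEvaluator(
    "string_distance",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["prediction"],   # model output
        "reference": example.outputs["result"],    # your (intentionally wrong) reference
        "input": example.inputs["problem"],
    }
)

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[string_distance_evaluator]
)

Keep in mind the score here is a distance, so lower generally means closer to your reference; you’d typically threshold it rather than expect a clean 0/1.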
Others got the main issue right, but here’s how I’d actually fix this without fighting evaluator configs.
Had the same headaches trying to make evaluation frameworks work exactly right. Wasted tons of time debugging evaluator logic and custom comparison stuff.
I just automate everything with Latenode now. Build a workflow that grabs your test cases, runs them through your model, then does exact comparison against reference outputs. No wondering what the evaluator’s gonna check.
Set up nodes for data extraction, model calls, and custom comparison logic. You control exactly how matching works - exact string, similarity thresholds, whatever. Easy to add logging and alerts when tests randomly fail too.
Way cleaner than dealing with LangSmith’s evaluator assumptions. Build once, run automatically, actually trust your results.
Your issue stems from using the “correctness” evaluator, which assesses whether the answers are semantically valid rather than whether they strictly match your reference outputs. The “labeled_criteria” evaluator prioritizes factual accuracy, which is why it overlooks your intentionally incorrect references. To resolve this, switch to an evaluator designed for exact matching, or develop a custom evaluator that compares strings directly, so the score reflects agreement with your reference answers rather than factual correctness.
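In case it’s useful, a minimal sketch of the exact-match version (assuming the "prediction"/"result" keys from the question’s code):

from langsmith.evaluation import evaluate, LangChainStringEvaluator

exact_match_evaluator = LangChainStringEvaluator(
    "exact_match",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["prediction"],
        "reference": example.outputs["result"],
    }
)

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[exact_match_evaluator]
)

Exact match is strict about wording, so even a factually correct answer phrased differently from the reference scores 0 - which is exactly what you want when the dataset is the source of truth.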
yeah, the labeled_criteria evaluator isn’t even looking at your reference outputs - it’s only checking if GPT’s math is right. try a string similarity evaluator or build a custom one that actually compares against your dataset outputs instead of just checking factual accuracy.
The problem is that labeled_criteria with correctness acts like a fact-checker, not a comparison tool. It’s checking math accuracy against its own ground truth instead of comparing to your provided outputs. I’ve hit this same issue testing evaluation frameworks. Switch to the exact_match string evaluator - it’ll do a literal string comparison against your reference data. Or try the embedding_distance evaluator if you want semantic similarity instead of exact matches. Your current setup basically tells the evaluator to ignore your reference answers and validate correctness on its own. To test evaluation logic with intentionally wrong references, you need evaluators that actually use your dataset outputs as the baseline, not ones that apply their own validation rules.
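And if you do want the semantic-similarity option, here’s an embedding_distance sketch (same assumed key names as the question’s code; I believe it defaults to OpenAI embeddings, so those credentials need to be configured):

from langsmith.evaluation import evaluate, LangChainStringEvaluator

embedding_evaluator = LangChainStringEvaluator(
    "embedding_distance",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["prediction"],
        "reference": example.outputs["result"],
    }
)

results = evaluate(
    run_prediction,
    data=test_dataset,
    evaluators=[embedding_evaluator]
)

The score is a distance (smaller means closer to your reference), and either way it’s graded against your dataset outputs rather than the model’s own notion of correctness.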