Finding all available default evaluation criteria in LangSmith

I’m working with LangSmith to assess my language model outputs, and I’m aware that it includes various built-in evaluation criteria. I can implement them in my evaluation setup like this:

from langchain.smith import RunEvalConfig, run_on_dataset

config = RunEvalConfig(
    evaluators=[RunEvalConfig.Criteria("helpfulness")]
)

run_on_dataset(
    dataset_name=my_dataset,
    llm_or_chain_factory=my_model,
    evaluation=config,
    client=smith_client
)

The documentation mentions that several predefined criteria are available, but it doesn’t provide a comprehensive list, and I couldn’t locate the complete set of criterion names anywhere. Can anyone guide me on how to retrieve the full list of these default evaluators instead of having to create custom ones?

Been there with LangSmith evaluations. The built-in criteria cover “helpfulness”, “conciseness”, “relevance”, “coherence”, “harmfulness”, “maliciousness”, “controversiality”, “misogyny”, “criminality”, and “insensitivity”.

You can pull the full set programmatically from the Criteria enum in langchain’s evaluation module (see the snippet below), but managing all these evaluations manually gets messy fast.
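A minimal sketch, assuming a langchain version where the Criteria enum is exported from langchain.evaluation:

from langchain.evaluation import Criteria

# each member's value is the string you pass to RunEvalConfig.Criteria(...)
print([criterion.value for criterion in Criteria])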

Hit the same wall last month setting up model evaluations for our content pipeline. Instead of fighting LangSmith’s limitations, I built an automated workflow using Latenode that pulls from multiple evaluation sources, runs comparisons, and generates reports.

The workflow cycles through different criteria automatically, logs results to our database, and triggers retraining when scores drop below thresholds. Takes 10 minutes to set up versus hours of manual config.

Latenode handles the API calls, data processing, and result aggregation without code complexity. Way cleaner than managing evaluation configs manually.

Check it out: https://latenode.com

Had this same problem a few months ago setting up evaluation pipelines. Here’s the fix: import the criteria dictionary directly from langchain’s evaluation module. It’s named _SUPPORTED_CRITERIA (technically private, so the exact location can shift between versions):

from langchain.evaluation.criteria.eval_chain import _SUPPORTED_CRITERIA

Then just print its keys (they’re members of the Criteria enum, so .value gives the plain string name) and you’ll get the full list of available criteria - including “detail” and “depth” plus all the others mentioned above.

Way better than digging through docs or guessing names. Plus the dictionary has the actual prompt descriptions for each criterion, so you can see exactly what they evaluate.
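Rough sketch of the dump I use (again, _SUPPORTED_CRITERIA is private and keyed by the Criteria enum in the versions I’ve touched, so treat it as version-dependent):

from langchain.evaluation.criteria.eval_chain import _SUPPORTED_CRITERIA

# keys are Criteria enum members, values are the prompt descriptions
for criterion, description in _SUPPORTED_CRITERIA.items():
    name = getattr(criterion, "value", criterion)  # tolerate enum keys or plain strings
    print(f"{name}: {description}")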

This saved me tons of time when building evaluation matrices that cycle through multiple criteria programmatically instead of hardcoding them.
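For example, something like this (a sketch; the criteria names are the built-in ones listed above):

from langchain.smith import RunEvalConfig

criteria_to_run = ["helpfulness", "conciseness", "relevance", "coherence"]

# one Criteria evaluator per name, built programmatically instead of hardcoded
config = RunEvalConfig(
    evaluators=[RunEvalConfig.Criteria(name) for name in criteria_to_run]
)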

you can also check what evaluator types exist (not the criteria names themselves, but the built-in evaluator categories like criteria, labeled_criteria, qa, embedding_distance). try from langchain.evaluation import EvaluatorType then check EvaluatorType.__members__ - shows all the default options without digging through docs. saved me tons of time when i needed to quickly see what was there.
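rough sketch (in a python shell):

from langchain.evaluation import EvaluatorType

# prints the built-in evaluator types (criteria, labeled_criteria, qa, embedding_distance, ...)
for name, member in EvaluatorType.__members__.items():
    print(name, "->", member.value)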

Try inspecting the RunEvalConfig class directly. Run RunEvalConfig.Criteria.__annotations__ or use inspect.signature() on the Criteria config class to see what parameters it accepts. I found this while debugging eval failures in production. It takes both plain string names and custom criteria (a dict mapping a name to a description), which the docs don’t make clear.

For everything available, check the _SUPPORTED_CRITERIA mapping in langchain.evaluation.criteria.eval_chain (the same private dict mentioned above) - you’ll get the full mapping, including hidden gems like “correctness” and “controversiality”.

One heads up from my experience: test your criteria combinations first. Some evaluators overlap when run together, especially safety ones like “harmfulness” and “maliciousness”, and that can skew your scores if you don’t weight them carefully.
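Roughly what that inspection looks like (a sketch):

import inspect
from langchain.smith import RunEvalConfig

# constructor signature and declared fields of the Criteria eval config
print(inspect.signature(RunEvalConfig.Criteria))
print(RunEvalConfig.Criteria.__annotations__)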

Quick python trick - run this in a Python REPL:

from langchain.smith import RunEvalConfig
# nested evaluator config classes are CapWords; this filters out the regular attributes
print([name for name in dir(RunEvalConfig) if name[:1].isupper()])

This dumps all the built-in evaluator config types, including the labeled criteria variant (LabeledCriteria), which is great for production evals where you have reference outputs.

Or if you want raw criteria definitions without importing a bunch of modules:

from langchain.evaluation import load_evaluator

# builds a default ChatOpenAI under the hood, so an OpenAI API key needs to be configured
evaluator = load_evaluator("criteria", criteria="helpfulness")
print(evaluator.criterion_name)
print(evaluator.prompt.partial_variables.get("criteria"))  # the criterion description baked into the prompt

I keep a reference script with all criteria mapped out since switching between them during A/B testing gets old fast. Way better than looking stuff up manually.
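Mine is basically this (a sketch, assuming the Criteria enum import shown earlier in the thread):

from langchain.evaluation import Criteria
from langchain.smith import RunEvalConfig

# criterion name -> ready-to-use eval config, so each A/B variant just looks up its key
CRITERIA_CONFIGS = {
    criterion.value: RunEvalConfig(evaluators=[RunEvalConfig.Criteria(criterion.value)])
    for criterion in Criteria
}

config = CRITERIA_CONFIGS["conciseness"]  # swap the key per A/B variant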

Heads up - some criteria work better for certain task types. “Conciseness” can be way too strict for creative tasks. Found that out the hard way during content generation testing.
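If that bites, a custom criterion dict is accepted anywhere a built-in name is, so you can loosen the wording yourself - quick sketch (the name and description here are made up):

from langchain.smith import RunEvalConfig

# hypothetical custom criterion: a {name: description} dict works in place of a built-in name
relaxed_conciseness = {
    "creative_conciseness": "Is the submission free of filler and repetition, while still allowing the stylistic flourishes the brief calls for?"
}
config = RunEvalConfig(evaluators=[RunEvalConfig.Criteria(relaxed_conciseness)])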