I’m working with LangSmith and trying to figure out what evaluation options I have. I know there are multiple built-in evaluators that come with LangSmith, but I can’t seem to find a complete list anywhere in the documentation.
Here’s what I’m currently doing to set up evaluation:
I know “accuracy” works, but what other criteria are available? Is there a way to programmatically get all the built-in evaluator names that I can use?
I get the frustration with LangSmith’s docs. Been there when setting up evaluation workflows.
Built-in criteria include “helpfulness”, “harmfulness”, “coherence”, “relevance”, “correctness”, and “conciseness” - not just “accuracy”. On top of the criteria there are separate evaluator types, like “qa” for question-answering.
To grab them programmatically:
from langchain.evaluation import Criteria, EvaluatorType
print([c.value for c in Criteria])       # criteria names: "helpfulness", "harmfulness", "conciseness", ...
print([e.value for e in EvaluatorType])  # evaluator types: "qa", "criteria", "embedding_distance", ...
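Once you’ve picked a criterion, wiring it up looks roughly like this (just a sketch - the criteria evaluator grades with an LLM, so the default setup needs an OPENAI_API_KEY, and the example strings are placeholders):
from langchain.evaluation import load_evaluator

# Load a criteria evaluator for one built-in criterion (default grading LLM needs OPENAI_API_KEY)
evaluator = load_evaluator("criteria", criteria="conciseness")

result = evaluator.evaluate_strings(
    input="What's 2 + 2?",
    prediction="Well, if you carefully add two and two together, you end up with the number four.",
)
print(result)  # e.g. {"reasoning": "...", "value": "N", "score": 0}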
LangSmith’s evaluation setup gets messy fast, especially with multiple evaluators or custom processes.
I switched to Latenode for my entire evaluation pipeline. Set up workflows that auto-run different evaluators, collect results, and trigger actions based on scores. Connects to basically any tool.
The visual workflow builder makes chaining evaluation steps dead simple - no boilerplate code. Way cleaner than managing Python scripts.
Here’s what others missed - you can inspect available criteria directly through the LangSmith client.
Hit this exact issue last year setting up automated evaluations for our chatbot. Use this:
from langsmith import Client
client = Client()  # picks up LANGSMITH_API_KEY / LANGCHAIN_API_KEY from your environment
evaluator_info = client.list_evaluation_templates()
Gives you way more than names. You get descriptions and parameter requirements for each evaluator.
Beyond the obvious “accuracy” and “relevance”, there are hidden gems. “Conciseness” is perfect for API responses where token cost matters. “Coherence” catches when your model starts rambling.
The safety evaluators (“harmfulness”, “maliciousness”) saved us during compliance reviews. Management loves seeing those scores in reports.
Learned this the hard way - some evaluators need specific input formats. “qa” expects a question and a reference answer alongside the prediction, not just raw text. Always test with your actual data structure first.
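For example, here’s roughly the shape “qa” wants (a sketch only - it grades with a default OpenAI LLM, so an API key has to be set, and the question/answer strings are placeholders):
from langchain.evaluation import load_evaluator

# "qa" grades a prediction against a reference answer, so it needs all three fields
qa_evaluator = load_evaluator("qa")

result = qa_evaluator.evaluate_strings(
    input="What is the capital of France?",   # the question
    prediction="The capital is Paris.",       # the model output being graded
    reference="Paris",                        # the ground-truth answer
)
print(result)  # {"value": "CORRECT", "score": 1, ...}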
Also check client.get_evaluator_config(evaluator_name) to see what parameters each accepts. Some let you customize thresholds and scoring ranges.
You can check what evaluators are available by running dir(EvaluationConfig.Criteria) - it’ll list all the available criteria. From what I’ve seen with LangSmith, you get “accuracy”, “relevance”, “coherence”, “harmfulness”, “helpfulness”, “controversiality”, “misogyny”, “criminality”, “insensitivity”, and “maliciousness”. The safety ones like “harmfulness” and “maliciousness” are really handy for production. Need something custom? Just subclass StringEvaluator if the built-ins don’t cut it. There’s also a list_evaluators() method in the LangSmith client that’s barely documented - try it if you’re on the latest version.
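If you do go the subclassing route, a minimal StringEvaluator looks something like this (rough sketch - the keyword check is just a stand-in for whatever logic you actually need):
from langchain.evaluation import StringEvaluator

class KeywordEvaluator(StringEvaluator):
    """Toy custom evaluator: scores 1 if the prediction mentions a required keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    @property
    def evaluation_name(self) -> str:
        return "keyword_presence"

    def _evaluate_strings(self, *, prediction, reference=None, input=None, **kwargs):
        found = self.keyword.lower() in prediction.lower()
        return {"score": int(found), "value": "Y" if found else "N"}

evaluator = KeywordEvaluator("refund")
print(evaluator.evaluate_strings(prediction="We can issue a refund within 30 days."))
# {"score": 1, "value": "Y"}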
The problem isn’t just finding evaluator names. LangSmith becomes a complete nightmare when you need to scale past basic testing.
Yeah, you can dig through source code or use those client methods people mentioned. But try managing multiple eval runs, comparing results across model versions, or setting up automated workflows - that’s where LangSmith craps out.
I hit this exact wall evaluating language models for our product team. Started with LangSmith’s built-in evaluators but quickly ran into customization and automation limits.
Ended up switching to Latenode and building a full evaluation automation system. Now I’ve got workflows that pull datasets, run multiple evaluators simultaneously, aggregate scores, and ping me when models tank. The visual builder makes chaining different evaluation steps ridiculously easy.
Best part? You can hook up any evaluation tool, not just LangSmith’s stuff. Mix whatever works for your setup. No more fighting with Python scripts and client configs.
All that time I used to waste on pipeline maintenance? Now I spend it actually improving models instead of dealing with infrastructure headaches.
Yeah, the LangSmith evaluator docs are terrible. Hit the same wall when I was building eval pipelines at my last job. I ended up just reading the source code. Check out langchain.evaluation.schema - there’s a full EvaluatorType enum with everything available. You’ve got the basic ones like “accuracy” and “relevance”, but also weird stuff like “embedding_distance”, “string_distance”, and “trajectory” for complex use cases. Try the get_supported_evaluators() method in the LangSmith Python client too (might need a newer version though). Also discovered that custom criteria take any string you throw at them - you can literally describe your eval criteria in plain English. Definitely test these with your actual data first. They behave completely differently depending on what domain you’re working in.
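The plain-English criteria trick, at least the way I’ve used it, goes through the “criteria” evaluator with a custom dict (sketch only - the criterion name and description are made up, and the default grading LLM needs an OPENAI_API_KEY):
from langchain.evaluation import load_evaluator

# Custom criterion: any name plus a plain-English description of what to check
evaluator = load_evaluator(
    "criteria",
    criteria={"actionability": "Does the response give the user concrete next steps?"},
)

result = evaluator.evaluate_strings(
    input="My internet is down, what should I do?",
    prediction="Restart the router, wait 30 seconds, then check the cable connections.",
)
print(result)  # {"reasoning": "...", "value": "Y", "score": 1}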
Try importing CRITERIA_TYPE_TO_DATA_TYPE from langchain.evaluation - it’s a dict that maps each criterion to its expected data format. Saved my butt when I kept getting weird evaluation errors because I was passing the wrong data types to evaluators like “embedding_distance”.
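On that note, “embedding_distance” is one where the expected inputs trip people up - it compares the prediction against a reference string. Rough sketch (it uses OpenAI embeddings by default, so OPENAI_API_KEY has to be set, and the example strings are placeholders):
from langchain.evaluation import load_evaluator

# embedding_distance needs a reference string to compare against, not just a prediction
evaluator = load_evaluator("embedding_distance")

result = evaluator.evaluate_strings(
    prediction="The invoice was sent on Monday.",
    reference="We emailed the invoice Monday morning.",
)
print(result)  # {"score": <cosine distance - lower means more similar>}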