I’m using LangSmith to assess outputs and I’m curious about the built-in evaluators available for use. Although the documentation mentions several built-in options, it doesn’t provide a comprehensive list.
While I know “validity” is an option, I would like to learn about any additional criteria that may be available. Can anyone guide me on how to access this information or direct me to the relevant documentation?
quick tip - run dir() on EvaluateSetup.Criteria to see what methods and attributes are available. LangSmith often ships more criteria than the docs list. also worth checking their GitHub issues - people post undocumented criteria they’ve found there all the time.
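To make that concrete, here’s a minimal sketch. I’m assuming LangChain’s Criteria enum as the criteria class (the off-the-shelf criteria evaluators are built on it) - swap in EvaluateSetup.Criteria or whatever class your install actually exposes:

# Minimal introspection sketch, assuming langchain is installed alongside langsmith.
from langchain.evaluation import Criteria

# dir() dumps every attribute; dropping dunders leaves just the criteria members.
print([name for name in dir(Criteria) if not name.startswith("_")])

# The enum values are the lowercase strings the evaluators actually accept.
print([c.value for c in Criteria])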
Through trial and error I’ve found that LangSmith has several built-in criteria besides “validity”. The ones that work for me: “helpfulness”, “harmlessness”, “relevance”, “coherence”, and “conciseness”. “Correctness” is solid for factual content too.
There’s no easy programmatic way to get the full list that I know of. I check the LangSmith web interface under the evaluation settings - the dropdown menus sometimes show what’s available. You can also try pulling up the EvaluateSetup.Criteria class docs in your IDE or calling Python’s help() on it, but that’s hit-or-miss.
I’d test these criteria names in your setup to see what’s currently supported. The options seem to grow with updates.
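If it helps, here’s roughly how I smoke-test one of those names. This goes through LangChain’s off-the-shelf criteria evaluator, which is my assumption about what the hosted evaluators wrap - it needs langchain, langchain-openai, and an OPENAI_API_KEY in the environment, and the model choice is arbitrary:

# Rough smoke test for a single criterion name.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # any chat model works here
evaluator = load_evaluator("criteria", criteria="conciseness", llm=llm)

# evaluate_strings runs the grader once and returns reasoning, value (Y/N), and score (1/0).
result = evaluator.evaluate_strings(
    prediction="LangSmith ships several built-in criteria evaluators.",
    input="Does LangSmith have built-in evaluators?",
)
print(result)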
I’ve run into this with LangSmith evaluations too. Use inspect.getmembers() on the Criteria class instead of just dir() - it shows what’s actually available, including the value behind each name. Some criteria are version- and region-specific, so what works in one instance might not work in another, and the evaluation framework gets updated faster than the docs do.

Here’s what worked for me: open the web interface and watch the network requests in your browser’s dev tools. You’ll see exactly which criteria names get sent to the API endpoints. I found several criteria this way that aren’t documented anywhere, including some domain-specific ones.

Just a heads up - some criteria need extra config parameters beyond the name string, so you’ll still need to test each one to figure out how it actually works.
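To make the inspect.getmembers() suggestion concrete, a small sketch - again assuming LangChain’s Criteria enum stands in for whatever criteria class your version exposes:

# getmembers() gives (name, value) pairs, so you see the string each member
# maps to rather than just the attribute names dir() returns.
import inspect
from langchain.evaluation import Criteria

for name, member in inspect.getmembers(Criteria):
    if isinstance(member, Criteria):
        print(f"{name} -> {member.value}")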
Check LangSmith’s GitHub repo directly. I ran into the same documentation gaps, and this is what worked for me: go to the evaluation modules and find the criteria validation files - they have hardcoded lists of all the supported criteria names that you can’t get through the regular API. I found things like “consistency”, “clarity”, and “bias” that way, none of which were in the official docs.

The source code is always more up-to-date than the documentation since it reflects what’s actually built. You can also grep for criteria-related constants - try searching for “VALID_CRITERIA” or “SUPPORTED_EVALUATORS”. This trick also showed me that some criteria only work in certain deployments.
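If you’d rather script the grep over a local clone, something like this works - the repo path is hypothetical and the search patterns are just the ones suggested above, not constants I’ve confirmed exist in the source:

# Hypothetical source grep over a local clone of the SDK repo.
from pathlib import Path

repo = Path("langsmith-sdk")  # wherever you cloned it
patterns = ("VALID_CRITERIA", "SUPPORTED_EVALUATORS")

for path in repo.rglob("*.py"):
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if any(p in line for p in patterns):
            print(f"{path}:{lineno}: {line.strip()}")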
Been there with LangSmith’s evaluation criteria hunt. Manual testing works but gets tedious fast across multiple projects.
I built a simple automation that queries the available criteria systematically. Instead of guessing or checking dropdowns manually, I created a workflow that tests each potential criteria name against the API and logs what’s actually supported.
Best part? You can run it whenever LangSmith updates and auto-sync results to your docs or internal tools. No more hunting through interfaces or outdated documentation.
I used Latenode since it handles API calls, error handling, and result processing without tons of boilerplate code. Set it to run weekly and notify your team when new criteria drop.
Same pattern works for discovering other “hidden” API features where docs lag behind actual capabilities.
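For what it’s worth, the diff-and-notify part doesn’t need any particular platform. Here’s a plain-Python sketch where the “probe” is just LangChain’s Criteria enum (an assumption - swap in whichever discovery method you trust) and known_criteria.json is a made-up cache file:

# Compare the current criteria list with the last run and flag anything new.
import json
from pathlib import Path
from langchain.evaluation import Criteria

STATE_FILE = Path("known_criteria.json")  # hypothetical cache from the previous run

current = sorted(c.value for c in Criteria)
previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []

new = sorted(set(current) - set(previous))
if new:
    print(f"New criteria since last run: {new}")  # hook your notifier in here
STATE_FILE.write_text(json.dumps(current, indent=2))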
The LangSmith Python client has an undocumented method that helped me with this exact issue last month.
Import the client and check the schema definitions:
from langsmith import Client
client = Client()
# Look at the evaluation schema
print(client._get_evaluation_criteria_schema())
If that method doesn’t exist in your version, try introspecting the criteria validation logic instead. LangSmith validates criteria names server-side, so invalid ones throw specific errors that list valid options.
I wrote a quick script that intentionally passes garbage criteria names and parses the error responses. The API returns helpful errors like “Invalid criteria ‘xyz’. Valid options are: [list]”.
This caught criteria like “factuality”, “completeness”, and “fluency” that I hadn’t seen documented anywhere. Way more reliable than guessing or checking UI dropdowns that might vary by environment.
Just wrap your evaluation setup in try/except and log the error details when testing new criteria names.
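A minimal version of that probe loop, under a couple of assumptions: I’m using LangChain’s load_evaluator as a stand-in for your own evaluation setup (it needs langchain-openai and an OPENAI_API_KEY), the candidate names are just ones mentioned in this thread, and whether the error text actually lists valid options depends on your versions:

# Probe candidate criteria names and log whatever error comes back.
import logging
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("criteria-probe")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # needs OPENAI_API_KEY set

for name in ["validity", "factuality", "completeness", "fluency", "definitely-not-real"]:
    try:
        load_evaluator("criteria", criteria=name, llm=llm)
        log.info("accepted: %s", name)
    except Exception as err:
        # Keep the full message - if valid options are enumerated anywhere,
        # they show up here.
        log.warning("rejected: %s (%s)", name, err)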