Dataset Evaluation Error with LangSmith - Language Model Issue

Pete_Magic · August 3, 2025, 1:25am

I’m trying to run evaluations on a dataset using LangSmith but keep running into problems. Every time I try to evaluate my dataset, I get this error message:

ValueError: Evaluation with the <class ‘langchain.evaluation.qa.eval_chain.QAEvalChain’> requires a language model to function. Failed to create the default ‘gpt-4’ model. Please manually provide an evaluation LLM or check your openai credentials.

Here’s the code I’m using for the evaluation:

evaluate_dataset(
    langsmith_client=ls_client,
    dataset_id="Test Dataset",
    model_factory=chat_model,
    eval_settings=config_object,
)

I’ve tried using different API keys including my personal OpenAI key and Azure OpenAI credentials from other working projects. The weird thing is that my API key works fine when I test it separately:

# Initialize LangSmith client and verify connection
ls_client = Client()

chat_model = ChatOpenAI(openai_api_key='my-actual-key-goes-here')
chat_model.predict("Test message!")

The test call works perfectly but the dataset evaluation keeps failing. Has anyone encountered this before? What am I missing here?

lily_luminesce · August 15, 2025, 1:21am

Been fighting LangSmith evaluation issues for years. The problem is LangSmith creates this nasty dependency chain that’s a nightmare to debug and maintain.

Skip QAEvalChain’s broken credential handling entirely. I’d automate the whole thing with Latenode instead. Set up a flow that handles dataset evaluation without getting stuck in LangSmith’s internal mess.

Latenode workflow:

Pulls your dataset from LangSmith
Runs evaluations with your OpenAI model configured right
Clean credential management
Pushes results back or exports however you want

I’ve automated similar pipelines this way - kills all the credential and dependency headaches. You get proper error handling and can tweak evaluation logic without diving into LangSmith’s weird internals.

Saves hours of debugging when you’re running regular evaluations. Way cleaner than working around LangSmith’s hardcoded junk.

evelynh · August 13, 2025, 2:29pm

LangSmith’s evaluation system has this annoying quirk where it completely ignores your model_factory parameter in certain scenarios. Hit this exact issue 6 months ago when migrating our evaluation pipeline.

QAEvalChain bypasses your provided model and tries to spin up its own default GPT-4 instance. Your chat_model works fine standalone, but the evaluation framework just doesn’t respect it.

Here’s what actually worked:

from langchain.evaluation import load_evaluator

# Create the evaluator with your specific LLM
qa_evaluator = load_evaluator("qa", llm=chat_model)

# Then pass it to your evaluation
evaluate_dataset(
    langsmith_client=ls_client,
    dataset_id="Test Dataset", 
    model_factory=chat_model,
    evaluators=[qa_evaluator],  # Use your custom evaluator
    eval_settings=config_object,
)

This forces the evaluation to use your properly configured LLM instead of letting it create its own broken one.

Also make sure you’re not mixing langchain versions. Had issues where newer LangSmith expected different evaluation interfaces than older langchain provided.

SkippingLeaf · August 12, 2025, 7:10pm

Hit this same annoying bug 3 weeks ago during a migration. LangSmith’s evaluate_dataset function has a hardcoded OpenAI dependency that ignores whatever you pass through model_factory. Fixed it by setting the OPENAI_API_KEY environment variable before running evaluation, even though my chat_model already had the key configured. LangSmith’s internal chains don’t inherit credentials from your model - they try creating fresh OpenAI connections using env vars instead. Try this before your evaluation: ```python
import os
os.environ[‘OPENAI_API_KEY’] = ‘your-key-here’

ameliat · August 12, 2025, 1:30pm

Hit this same issue 6 weeks ago. LangSmith’s evaluation framework has multiple components that create their own LLM instances, ignoring your chat_model config entirely. Fixed it by setting the environment variable AND creating a custom evaluation function. The evaluate_dataset function creates separate LLM instances for different steps, and some don’t inherit your model_factory settings. Here’s what worked: python import os os.environ['OPENAI_API_KEY'] = 'your-key' # Force the evaluator to use your model from langchain.evaluation import EvaluatorType custom_config = { "evaluators": [{ "evaluator_type": EvaluatorType.QA, "llm": chat_model }] } evaluate_dataset( langsmith_client=ls_client, dataset_id="Test Dataset", model_factory=chat_model, evaluation_config=custom_config ) The environment variable handles internal credential lookups while the explicit evaluator config forces your chat_model into the actual evaluation logic. Annoying that you need both, but that’s how LangSmith’s pipeline works.

nateharris · August 12, 2025, 1:05pm

Hit this exact problem 2 months ago. LangSmith’s evaluation system is completely broken for credential handling.

The issue? evaluate_dataset ignores your model_factory parameter entirely. It creates its own LLM instance with default settings instead.

I had to ditch the model_factory approach and configure evaluation manually:

from langchain.evaluation.qa import QAEvalChain

# Create your own eval chain with proper credentials
eval_chain = QAEvalChain.from_llm(llm=chat_model)

# Then use it in a custom evaluator function
def my_evaluator(run, example):
    return eval_chain.evaluate_strings(
        question=example.inputs["question"],
        answer=run.outputs["answer"],
        prediction=run.outputs["answer"]
    )

evaluate_dataset(
    langsmith_client=ls_client,
    dataset_id="Test Dataset",
    model_factory=chat_model,
    evaluation_config={"custom_evaluators": [my_evaluator]}
)

You have to build the evaluation chain yourself and force it to use your configured LLM. LangSmith’s built-in evaluation can’t handle custom credentials.

Also check your langchain and langsmith versions match. Mismatched versions break credential passing in bizarre ways.

emcarter · August 12, 2025, 10:17am

i’ve faced similar issues before! it seems like the evaluator might not be recognizing your chat_model. try adding evaluation_config={“llm”: chat_model} into eval_settings. also, double-check if the evaluate_dataset function has a dedicated llm parameter.

josephk · August 12, 2025, 2:10am

This error indicates that LangSmith’s evaluation framework is defaulting to its internal GPT-4 model rather than utilizing your provided chat_model. The evaluation chain may not be recognizing your LLM configuration from the model_factory parameter. To rectify this, consider wrapping your chat_model in a lambda function like this: model_factory=lambda: chat_model. Additionally, ensure your eval_settings config_object includes necessary LLM specifications that won’t conflict with your chat_model. In some cases, version mismatches between the langsmith and langchain packages have been known to cause similar issues.

CreatingStone · August 11, 2025, 6:49am

check ur langchain version compatibility first - ive seen this when mixing older langchain with newer langsmith. the evaluation framework cant figure out which LLM interface to use. try downgrading langchain to match ur langsmith version (or upgrade langsmith). also double-check ur not mixing async/sync model instances in ur config_object.

Ryan_Innovative · August 11, 2025, 6:36am

Hit this same issue last month during a project rollout. The problem is LangSmith’s QAEvalChain ignores your model_factory setup and tries to create its own GPT-4 client internally.

What fixed it: pass the LLM reference directly through the evaluators parameter instead of relying on model_factory. The evaluate_dataset function treats these as separate components, so your working chat_model never makes it to the evaluation chain.

Try this approach:

evaluate_dataset(
    langsmith_client=ls_client,
    dataset_id="Test Dataset",
    model_factory=chat_model,
    evaluators=[{"type": "qa", "llm": chat_model}],
    eval_settings=config_object,
)

This forces the evaluation system to use your authenticated chat_model instead of creating a broken instance. The model_factory handles main execution while evaluators handles assessment logic - both need your LLM config passed explicitly through different parameters.

Samuel87 · August 11, 2025, 3:37am

Had this exact problem last month - drove me crazy for hours. QAEvalChain creates its own LLM instance no matter what you pass through model_factory. I fixed it by setting the evaluation LLM directly in the config_object instead of using the model_factory parameter. Try adding evaluators=[{"evaluator_type": "qa", "llm": chat_model}] to your config_object. Also check you’re importing from the right langchain evaluation modules - some deprecated imports caused similar credential issues for me. Since your standalone chat_model works, your API setup’s fine. The evaluation pipeline just isn’t picking it up.