I’m trying to run evaluations on a dataset using LangSmith but keep running into problems. Every time I try to evaluate my dataset, I get this error message:
ValueError: Evaluation with the <class 'langchain.evaluation.qa.eval_chain.QAEvalChain'> requires a language model to function. Failed to create the default 'gpt-4' model. Please manually provide an evaluation LLM or check your openai credentials.
I’ve tried different API keys including personal OpenAI keys and Azure OpenAI keys from other working projects. The weird thing is that my model works fine when I test it separately:
from langsmith import Client
from langchain.chat_models import ChatOpenAI

# Initialize the LangSmith client and verify the model works on its own
my_client = Client()
my_model = ChatOpenAI(openai_api_key='(actual key goes here)')
my_model.predict("Test message!")
The test call works perfectly but the dataset evaluation still fails. Has anyone encountered this before? Really need some guidance here!
Your model_factory might be the actual model instead of a factory function. LangSmith needs a callable that returns the model, not the model itself. Try model_factory=lambda: my_model instead of passing my_model directly.
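Something like this, assuming your call has roughly the shape used elsewhere in this thread (evaluate_dataset and its parameter names are taken from the thread, not from an API I can verify):

# Hypothetical call shape; the only change is wrapping the model in a callable.
evaluate_dataset(
    langsmith_client=my_client,
    dataset_id="Test Dataset",
    model_factory=lambda: my_model,  # factory that returns the model
    eval_settings=config_settings,
)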
Been there - debugging this evaluation chain manually every time is a nightmare.
It’s not just your model config. LangSmith’s evaluation pipeline breaks constantly with credential mismatches or model init failures. I got tired of fighting these API headaches, so I automated everything.
Built a flow that handles dataset evaluation start to finish. It initializes models, runs evaluations automatically, and retries failed calls. No more debugging credentials or guessing which parameters go where.
Works with any model provider too. Switch between OpenAI, Azure, whatever - no code changes needed. You can schedule regular runs and get alerts when stuff breaks.
Saved me tons of time on LangSmith’s quirks. Evaluations run themselves while I actually work on improving models.
I’ve hit this exact error before - it’s a config issue with how LangSmith handles model initialization during evaluation. Your model works fine in isolation, but evaluate_dataset creates its own context that doesn’t inherit your model config properly. What fixed it for me was making sure OpenAI credentials are available at the environment level, not just passed as parameters. Set your API key as an environment variable before running evaluation: os.environ['OPENAI_API_KEY'] = 'your_key_here'. Also, explicitly include the model specification in your eval_settings config. The evaluation framework sometimes ignores the model_factory parameter and tries to spin up its own default GPT-4 model, which is why you’re getting that failed model creation error.
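Roughly like this (a sketch; the 'model' key in eval_settings is a guess based on my setup, so check your config schema):

import os

# Make the key available at the environment level so any model the evaluation
# framework creates internally can pick it up, not just the one you construct.
os.environ['OPENAI_API_KEY'] = 'your_key_here'

# Hypothetical eval_settings entry spelling out the model explicitly.
config_settings = {'model': 'gpt-4'}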
I ran into a similar issue recently when using LangSmith for evaluation. Your model works correctly on its own, but evaluate_dataset doesn't seem to pick it up as expected. Make sure you specify the model in both model_factory and eval_settings; specifically, adding evaluation_llm=my_model to your eval_settings should solve the problem. It's redundant, but it seems to be required for the evaluation to use your model.
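For example (treating eval_settings as a plain dict is an assumption on my part; adjust to however you build your config):

# 'evaluation_llm' is the setting that worked for me, alongside model_factory.
config_settings = {'evaluation_llm': my_model}

Then pass config_settings as the eval_settings argument as usual.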
This happens when QAEvalChain can’t reach the language model during evaluation, even though your model runs fine on its own. The problem is that evaluate_dataset builds its own evaluation chain internally and needs the LLM passed through a specific parameter. Try explicitly passing your model as the evaluator LLM:
from langchain.evaluation.qa import QAEvalChain

evaluate_dataset(
    langsmith_client=my_client,
    dataset_id="Test Dataset",
    model_factory=lambda: my_model,  # pass a callable, not the model itself
    eval_settings=config_settings,
    evaluators=[QAEvalChain.from_llm(my_model)],  # explicit evaluator LLM
)
Also check that your OPENAI_API_KEY environment variable is set correctly - LangSmith might be trying to spin up its own default model instead of using yours. I ran into this exact issue last month and the explicit evaluator parameter fixed it completely.
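A quick sanity check before the run, to confirm the key is actually visible to the process doing the evaluation (plain Python, nothing LangSmith-specific):

import os

# Fail fast if the key isn't set in this environment.
assert os.environ.get('OPENAI_API_KEY'), 'OPENAI_API_KEY is not set'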