I’m trying to run evaluations on my dataset using LangSmith but keep hitting a roadblock. Every time I try to execute the evaluation, I get this error message:
ValueError: Evaluation with the <class 'langchain.evaluation.qa.eval_chain.QAEvalChain'> requires a language model to function. Failed to create the default 'gpt-4' model. Please manually provide an evaluation LLM or check your openai credentials.
I’ve tried different API keys including personal OpenAI keys and Azure OpenAI keys from other working projects, but the same error appears each time. The credentials definitely work since my test calls succeed. What could be causing this evaluation failure?
LangSmith's evaluation chain creates its own default language model instead of reusing yours. Your credentials work fine for direct calls, but the evaluation framework tries to construct a separate default 'gpt-4' client, and that construction is what fails.
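A minimal sketch of the distinction, assuming a ChatOpenAI client and reusing the my_llm name that comes up later in this thread (illustrative, not the framework's exact internals):

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# your client, built from the credentials you already know work
my_llm = ChatOpenAI(model="gpt-4")

# if no LLM is supplied, the evaluation framework tries to build its own
# default gpt-4 client from default credentials -- that is where the
# ValueError comes from. Handing QAEvalChain your authenticated client
# sidesteps the default construction entirely.
eval_chain = QAEvalChain.from_llm(my_llm)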
I’ve hit this exact issue with evaluation pipelines. Manual credential passing gets messy and breaks easily across services.
I switched to Latenode for evaluation workflows and it solved everything. You set up the whole pipeline there - it handles API connections automatically and initializes models without credential hassles.
Built a similar evaluation system with Latenode that processes datasets, runs evaluations, and manages API calls seamlessly. No more fighting credential issues between different parts of the chain.
The automation covers data ingestion, model calls, and results processing. Way cleaner than debugging credential problems across multiple libraries.
Had the exact same headache last month. The problem is evaluate_dataset completely ignores your model_factory parameter when it comes to the evaluation chain - it only uses it for your main model calls.
The evaluator spins up its own QAEvalChain and tries to create a separate GPT-4 instance using default credentials. Your ChatOpenAI works fine, but the evaluation framework just does its own thing.
Quick fix: throw an evaluators parameter into your eval_settings config with your working LLM instance:
from langchain.evaluation.qa import QAEvalChain

# my_llm is your already-authenticated LLM client (e.g. ChatOpenAI)
config_settings = {
    "evaluators": [QAEvalChain.from_llm(my_llm)],
    # your other settings
}
This forces evaluation to use your authenticated model instead of trying to spawn a default one. Wasted way too many hours on this before I realized the evaluator and main model are completely separate instances.
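For completeness, here is roughly how that config gets wired into the evaluation call, reusing the evaluate_dataset / model_factory names from this thread (treat the exact signature as illustrative and adapt it to whatever your pipeline actually calls):

# hypothetical wiring, following the names used above in this thread
results = evaluate_dataset(
    dataset_name="my-dataset",         # your LangSmith dataset (placeholder name)
    model_factory=lambda: my_llm,      # model used for the main calls
    eval_settings=config_settings,     # evaluators now reuse my_llm too
)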
QAEvalChain tries to create its own evaluator model rather than using your LLM instance. Your ChatOpenAI client works fine, but the evaluation framework doesn't inherit those credentials; it builds its own model instead. I hit this exact issue on a recent project. The fix is to pass your LLM instance directly to the evaluation config rather than letting it default to GPT-4: in eval_settings, specify the evaluator model explicitly by passing your working my_llm instance as the evaluator in config_settings. The framework then uses your authenticated model instead of creating a default one, which resolved the mismatch I had between working LLM calls and failing evaluation runs.
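If your pipeline sits on the langchain.smith helpers rather than a custom wrapper, the same idea looks roughly like this. This is a sketch assuming RunEvalConfig and run_on_dataset from the LangChain versions that ship QAEvalChain, and "my-dataset" is a placeholder; double-check the exact signature and field names against your installed version:

from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

my_llm = ChatOpenAI(model="gpt-4")  # your authenticated client

eval_config = RunEvalConfig(
    evaluators=["qa"],   # built-in QA evaluator
    eval_llm=my_llm,     # grade with your model, not a freshly created default
)

run_on_dataset(
    client=Client(),
    dataset_name="my-dataset",            # illustrative name
    llm_or_chain_factory=lambda: my_llm,  # model under evaluation
    evaluation=eval_config,
)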
maybe try configuring your llm directly in the eval chain? sometimes setting OPENAI_API_KEY as an env var helps too, so the library can find it easily. good luck!
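For reference, a minimal way to set that env var from Python before anything constructs a client (the key value is obviously a placeholder):

import os

# set before creating any ChatOpenAI / evaluator instances so the default
# client construction can pick it up
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, use your real key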