Streamlit session state variables not accessible in LangSmith evaluation function

I’m trying to build a model evaluation system with LangSmith but running into issues when my evaluation function tries to access Streamlit session state variables. The problem happens specifically when the evaluation runs.

My Setup:

  • Created a virtual environment and installed required packages (streamlit, langsmith, langchain, openai, python-dotenv)
  • Set up API keys in environment file
  • Running the app with streamlit run command

The Issue:
When my evaluation function tries to update the conversation_log variable stored in session state, it throws an error. The evaluation seems to run in a different context where Streamlit variables aren’t available.

Code Example:

import streamlit as st
from langsmith import Client, evaluate
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator
import openai

# Initialize conversation log
if 'conversation_log' not in st.session_state:
    st.session_state.conversation_log = []

# Setup client and dataset
client = Client()
testset_name = "Response Testing Dataset"
testset = client.create_dataset(testset_name)

# Create test examples
client.create_examples(
    inputs=[
        {"query": "Explain machine learning"},
        {"query": "What is Python?"},
        {"query": "Define artificial intelligence"},
    ],
    outputs=[
        {"response": "ML is a subset of AI that learns from data"},
        {"response": "Python is a programming language"},
        {"response": "AI mimics human intelligence in machines"},
    ],
    dataset_id=testset.id,
)

# Evaluation prompt
GRADING_TEMPLATE = """Grade this answer as an expert teacher.
Question: {query}
Expected: {response}
Student Answer: {result}
Return CORRECT or INCORRECT:"""

# Setup evaluator
eval_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
response_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_model})

# My application function
def generate_response(query):
    client = openai.Client()
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer briefly and accurately."},
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content
    return result

def evaluation_wrapper(inputs):
    answer = generate_response(inputs["query"])
    # This line causes the error - can't access streamlit session state
    st.session_state.conversation_log.append(answer)
    return {"result": answer}

# Run evaluation
test_results = evaluate(
    evaluation_wrapper,
    data=testset_name,
    evaluators=[response_evaluator],
    experiment_prefix="test-run",
)

Expected Result: The conversation_log should update with each evaluation.
Actual Result: Error occurs because session state is not accessible during evaluation.

Has anyone encountered this issue before? How can I make Streamlit session variables available to the LangSmith evaluation function?

The evaluation context is completely cut off from Streamlit’s web session - this bit me on a similar project a few months back. LangSmith creates its own execution environment with zero access to your browser session variables.

I ended up building a simple callback mechanism. Write a custom wrapper that captures everything in a plain Python list during evaluation, then push all that data back to session state in one go when evaluate() finishes.

# Before evaluation: module-level buffer instead of session state
evaluation_results = []

def evaluation_wrapper(inputs):
    answer = generate_response(inputs["query"])
    evaluation_results.append(answer)  # Store in module-level list instead
    return {"result": answer}

test_results = evaluate(...)

# After evaluation completes
st.session_state.conversation_log.extend(evaluation_results)

This keeps your evaluation logic clean and avoids session state dependencies. Plus you get better error handling since evaluation won’t crash if session state gets weird.

A tutorial on Streamlit session state patterns really helped when I was figuring this out.

Basically treat evaluations as pure functions that return data, then handle UI updates separately.
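
For the “handle UI updates separately” part, a minimal sketch of what the display step could look like once the buffered results have been merged back into session state (the layout here is just one option, not part of the original code):

# Runs in the normal Streamlit script flow, after the bulk extend()
st.subheader("Evaluation responses")
for i, answer in enumerate(st.session_state.conversation_log, start=1):
    st.markdown(f"**Response {i}:** {answer}")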

been there, langsmith evaluations run in isolation, so streamlit’s session_state isn’t available. here’s what works: create a regular python list outside your evaluation_wrapper and append to it inside the wrapper, then update session_state once evaluate() finishes. try results_buffer = [] before running the evaluation, append everything there instead of session_state, then st.session_state.conversation_log.extend(results_buffer) at the end. keeps things cleaner anyway.

You’re hitting a core Streamlit limitation: LangSmith runs evaluations outside your Streamlit script’s run context, so session state simply isn’t reachable from there. Had this exact problem last month.

Easiest fix: pass the conversation log into your evaluation function instead of trying to grab it from session state. Modify evaluation_wrapper to accept the log as an extra parameter, collect results in that object during evaluation, then bulk-update session state when it’s done. Alternatively, use a module-level variable or class attribute to store conversation data during evaluation and sync it back to session state afterward. Bottom line: evaluations should be stateless anyway - you don’t want UI dependencies messing with your model testing.
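
A minimal sketch of the parameter-passing idea, reusing generate_response, testset_name, and response_evaluator from the question; make_wrapper and local_log are hypothetical names, and the small factory binds the log so the function handed to evaluate() still takes only the inputs dict:

import streamlit as st
from langsmith import evaluate

def make_wrapper(log):
    # Bind the log as a parameter here instead of reading session state in the wrapper
    def evaluation_wrapper(inputs):
        answer = generate_response(inputs["query"])
        log.append(answer)  # Plain list, safe to touch outside the Streamlit context
        return {"result": answer}
    return evaluation_wrapper

local_log = []
test_results = evaluate(
    make_wrapper(local_log),
    data=testset_name,
    evaluators=[response_evaluator],
    experiment_prefix="test-run-param",
)

# Back in the Streamlit script: one bulk update after evaluate() returns
st.session_state.conversation_log.extend(local_log)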

Yeah, this happens because LangSmith evaluations run completely outside your Streamlit process. Ran into the exact same thing building evaluation pipelines last year.

Don’t fight the session state limitations - just automate the whole thing. Use Latenode to handle your LangSmith evaluations as a separate pipeline that runs independently from your Streamlit app.

Build a Latenode workflow that triggers your evaluation function, grabs all the conversation logs automatically, and dumps results in a database or file. Your Streamlit app just pulls that data when it needs to show results.

Way cleaner approach. Your evaluation logic stays pure (no UI dependencies) and you can schedule evaluations on triggers, webhooks, or timers. Bonus: automatic retry logic when evaluations fail.

I’ve used this pattern for several model evaluation systems. Latenode handles the orchestration between LangSmith, data storage, and your Streamlit frontend. No more context switching headaches.

Your Streamlit app becomes just a dashboard showing evaluation results while the actual testing runs as a robust automated pipeline in the background.

The problem is LangSmith’s evaluation framework runs in a separate thread pool where Streamlit’s session context doesn’t exist. Hit this while building automated testing for conversational AI systems. Don’t fight the architecture - use a thread-safe data structure to bridge the gap. Create a queue.Queue() or collections.deque() at module level before your evaluation starts. Your evaluation wrapper pushes results there, then drain the queue back into session state once evaluate() returns.

from collections import deque

conversation_queue = deque()

def evaluation_wrapper(inputs):
    answer = generate_response(inputs["query"])
    conversation_queue.append(answer)
    return {"result": answer}

# After evaluation
while conversation_queue:
    st.session_state.conversation_log.append(conversation_queue.popleft())

This handles concurrent access properly and avoids file system overhead. Accept that evaluation runs in isolation, so design for eventual consistency rather than real-time session updates.

Yeah, this is a classic Streamlit + LangSmith headache. The issue is that LangSmith’s evaluate function runs in its own context, totally separate from your Streamlit session state. Session state only exists in your web session - it’s not accessible during background processes or external calls.

Here’s what worked for me: I ditched session state for evaluation tracking and used file-based logging instead. Created a temp JSON file to store conversation logs during evaluation, then loaded those results back into session state when it finished. You could also just use a regular list variable outside session state in your evaluation wrapper, then merge those results back afterward.

Another option is separating evaluation from your Streamlit interface completely. Run evaluation as a standalone process and save results to a file or database that your Streamlit app reads from. This actually makes your architecture cleaner since evaluation shouldn’t depend on UI state anyway.
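
A minimal sketch of the file-based handoff, again reusing generate_response, testset_name, and response_evaluator from the question; eval_log.json is a hypothetical scratch file, and the Streamlit app simply loads it on its next rerun:

import json
from pathlib import Path

import streamlit as st
from langsmith import evaluate

LOG_PATH = Path("eval_log.json")  # Hypothetical scratch file for the handoff

# Evaluation side (could also run as a standalone script)
collected = []

def evaluation_wrapper(inputs):
    answer = generate_response(inputs["query"])
    collected.append({"query": inputs["query"], "result": answer})
    return {"result": answer}

test_results = evaluate(
    evaluation_wrapper,
    data=testset_name,
    evaluators=[response_evaluator],
    experiment_prefix="test-run-file",
)
LOG_PATH.write_text(json.dumps(collected, indent=2))

# Streamlit side: merge the file back into session state
if LOG_PATH.exists():
    st.session_state.conversation_log.extend(json.loads(LOG_PATH.read_text()))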

I hit this exact problem building evaluation pipelines for our company chatbot. Here’s what’s happening: LangSmith’s evaluation engine spawns worker processes that can’t access your Streamlit session context. I’ve found the closure pattern works best. Set up a mutable object outside your evaluation function, then capture it in the wrapper scope:

conversation_buffer = []

def evaluation_wrapper(inputs):
    answer = generate_response(inputs["query"])
    conversation_buffer.append(answer)  # Direct access to outer scope
    return {"result": answer}

# After evaluation completes
st.session_state.conversation_log.extend(conversation_buffer)

This skips file I/O overhead and keeps everything in memory during evaluation. The trick is treating evaluation as a batch operation - accumulate results, then sync back to session state when the whole process finishes. Way more reliable than trying to bridge the process boundary on every single evaluation call.