Streamlit session state not visible in LangSmith evaluation function

I’m attempting to create an evaluation framework with LangSmith, but I’m facing challenges when my evaluation logic tries to interact with Streamlit session state variables.

Here’s a summary of the steps I’ve taken:

  1. Launched a basic Streamlit application integrated with LangSmith
  2. Developed a dataset for evaluation containing test questions and responses
  3. Programmed an evaluation function designed to update session state during the evaluation process
  4. Executed the evaluation using LangSmith’s ‘evaluate’ function

What I expected: st.session_state.message_log to be updated while the evaluation function runs.

What I encountered: an error indicating that the session state variable can’t be accessed from inside the evaluation function.

Here’s a streamlined version of my code:

import streamlit as st
from langsmith import Client, evaluate
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator
import openai

load_dotenv()

# Initialize message log
if 'message_log' not in st.session_state:
    st.session_state.message_log = []

client = Client()

# Setup evaluation dataset
test_dataset = "Assessment Dataset"
dataset = client.create_dataset(test_dataset)
client.create_examples(
    inputs=[
        {"query": "Explain machine learning"},
        {"query": "What is artificial intelligence"},
        {"query": "Define neural networks"},
    ],
    outputs=[
        {"response": "ML is a subset of AI that learns from data"},
        {"response": "AI simulates human intelligence in machines"},
        {"response": "Neural networks mimic brain structure for computation"},
    ],
    dataset_id=dataset.id,
)

# Evaluation prompt
GRADING_TEMPLATE = """Grade this answer as an expert teacher.
Question: {query}
Correct answer: {response}
Student answer: {result}
Respond with PASS or FAIL:"""

grading_prompt = PromptTemplate(
    input_variables=["query", "response", "result"], 
    template=GRADING_TEMPLATE
)

grading_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
grader = LangChainStringEvaluator("qa", config={"llm": grading_model, "prompt": grading_prompt})

openai_client = openai.Client()

def generate_answer(query):
    return openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer briefly and accurately."},
            {"role": "user", "content": query}
        ],
    ).choices[0].message.content

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    st.session_state.message_log.append(result)  # This line causes the error
    return {"result": result}

# Run evaluation
test_results = evaluate(
    evaluation_function,
    data=test_dataset,
    evaluators=[grader],
    experiment_prefix="test-run",
)

The error occurs when attempting to access st.session_state.message_log within the evaluation function. What can I do to resolve this issue?

Yeah, session state isolation is a pain. I’ve hit this while debugging evaluation setups - the context separation always catches people off guard.

Skip the shared memory and temp file headaches. Just automate everything instead. Build a workflow that runs your LangSmith evaluations, grabs the results, and sends them where you need them.

Set up triggers for starting evaluations, auto data collection while they run, and real-time updates back to Streamlit via webhooks or database writes.
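Rough sketch of the “push results out” piece - the webhook URL below is just a placeholder for whatever service or database feeds your Streamlit app, requests is an extra dependency, and generate_answer comes from the question:

import requests  # extra dependency: pip install requests

RESULTS_WEBHOOK = "https://example.com/eval-results"  # placeholder endpoint, not part of the original setup

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    # Push each result to the workflow instead of touching Streamlit session state;
    # the receiving service stores it wherever the Streamlit app can read it later.
    requests.post(
        RESULTS_WEBHOOK,
        json={"query": inputs["query"], "result": result},
        timeout=10,
    )
    return {"result": result}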

Your evaluation logic stays clean, no hacky session state workarounds, and you get better tracking. Bonus: you can schedule regular evaluations or compare results across runs.

The automation handles the messy coordination so your code stays simple.

Been there. LangSmith’s evaluate function runs your target in its own workers, outside the Streamlit script context, so it’s completely cut off from your Streamlit session state.

Hit this same issue building evaluation dashboards. Fixed it with a shared data structure both contexts can reach.

Use a global list or dictionary instead of session state for tracking results:

# Replace session state with a global variable
EVALUATION_LOG = []

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    EVALUATION_LOG.append(result)  # Module-level list is visible from the evaluation workers, unlike session state
    return {"result": result}

Once evaluation finishes, transfer the data back to session state:

# After evaluate() finishes
st.session_state.message_log.extend(EVALUATION_LOG)
EVALUATION_LOG.clear()  # Clean up

Keeps evaluation tracking simple without extra dependencies. The module-level list stays reachable from the evaluation workers, while session state stays in your Streamlit context.

Just clear the global list between runs so you don’t pile up old data.

There’s a cleaner way to handle this. Skip the session state headaches and file-based workarounds - just automate the whole evaluation pipeline.

I’ve hit similar evaluation tracking issues. The problem is you’re mixing two execution contexts that hate each other.

Here’s my approach: build an automated workflow that handles everything - runs evaluations, captures data, feeds results back to Streamlit seamlessly.

Set up triggers for your LangSmith evaluations, auto-log results to a database or webhook, then update your Streamlit interface in real time. Throw in notifications when evaluations finish.
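Here’s a rough sketch of what the Streamlit side can look like, assuming the workflow drops results into a JSON file - the filename and list format are just examples, not anything LangSmith produces for you:

import json
from pathlib import Path
import streamlit as st

RESULTS_FILE = Path("eval_results.json")  # placeholder for wherever your workflow logs results

if st.button("Refresh evaluation results"):
    if RESULTS_FILE.exists():
        # Assumes the workflow writes a JSON list of result strings
        st.session_state.message_log = json.loads(RESULTS_FILE.read_text())
    else:
        st.info("No evaluation results yet.")

for entry in st.session_state.get("message_log", []):
    st.write(entry)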

This kills the session state problem entirely. Evaluations run independently, data gets captured right, Streamlit stays responsive.

Bonus: you can schedule regular evaluations, compare results over time, and trigger evaluations when data changes.

The automation handles all the messy coordination so you don’t have to.

The problem is LangSmith’s evaluate function runs in a separate execution context where Streamlit’s session state doesn’t exist. LangSmith can’t access session state when processing evaluations. I hit the same issue while tracking evaluation metrics in a Streamlit dashboard.

Here’s what worked for me: create a logging system outside session state. Write your results to a local file or database during evaluation, then read them back into Streamlit afterward.

Even better - completely separate your evaluation logic from the Streamlit interface. Run LangSmith evaluations as standalone scripts, save results, then load them into Streamlit for visualization. This works way better for longer evaluations since you won’t lose everything if your Streamlit session times out.
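A minimal sketch of the file-logging route - the filename is arbitrary, and generate_answer, test_dataset, and grader are the ones from the question:

import json

LOG_FILE = "evaluation_log.jsonl"  # arbitrary local file used as the bridge

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    # Plain file I/O works fine outside Streamlit's script context
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps({"query": inputs["query"], "result": result}) + "\n")
    return {"result": result}

test_results = evaluate(
    evaluation_function,
    data=test_dataset,
    evaluators=[grader],
    experiment_prefix="test-run",
)

# Back in the Streamlit script: read the log into session state
with open(LOG_FILE) as f:
    st.session_state.message_log = [json.loads(line)["result"] for line in f]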

This happens because LangSmith’s evaluation framework runs outside your Streamlit script context. Session state only exists inside Streamlit’s own script runs - LangSmith’s evaluation workers can’t access it. I hit this exact issue last month building evaluation workflows.

Easiest fix? Use a class to maintain state during evaluations, then transfer everything to session state when you’re done. Here’s what works:

class EvaluationTracker:
    def __init__(self):
        self.results = []
    
    def evaluate_with_tracking(self, inputs):
        result = generate_answer(inputs["query"])
        self.results.append(result)
        return {"result": result}

tracker = EvaluationTracker()
evaluate(tracker.evaluate_with_tracking, data=test_dataset, evaluators=[grader])
st.session_state.message_log.extend(tracker.results)

Keeps everything contained and predictable. No messy file system stuff or global variables to clean up.

langsmith runs evals async in isolated workers, so you can’t access streamlit session state from there. easy workaround: use tempfile or sqlite as a bridge. set up a temp database before the eval kicks off, have your evaluation function write results to it, then pull everything back into session state once evaluate() finishes. I’ve been using this approach for my eval pipelines and it beats dealing with global variable cleanup.
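rough sketch of the sqlite version - the db file and table names are arbitrary, and generate_answer, test_dataset, and grader come from the question:

import sqlite3

DB_PATH = "eval_bridge.db"  # arbitrary local sqlite file used as the bridge

# create the bridge table before the eval kicks off
with sqlite3.connect(DB_PATH) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS results (query TEXT, result TEXT)")

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    # each call opens its own connection; the with-block commits the insert
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("INSERT INTO results VALUES (?, ?)", (inputs["query"], result))
    return {"result": result}

evaluate(evaluation_function, data=test_dataset, evaluators=[grader])

# pull everything back into session state once evaluate() finishes
with sqlite3.connect(DB_PATH) as conn:
    st.session_state.message_log = [row[0] for row in conn.execute("SELECT result FROM results")]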

Had this same issue building evaluation pipelines. LangSmith’s evaluate function runs in isolated worker processes that can’t access your Streamlit session context. Session state only exists in the main Streamlit thread.

Fixed it with Python’s multiprocessing.Manager - creates shared data both contexts can access:

from multiprocessing import Manager

manager = Manager()
shared_log = manager.list()

def evaluation_function(inputs):
    result = generate_answer(inputs["query"])
    shared_log.append(result)
    return {"result": result}

evaluate(evaluation_function, data=test_dataset, evaluators=[grader])

# After evaluation completes, convert the proxy list back to plain Python
st.session_state.message_log.extend(list(shared_log))

This handles process isolation properly and survives the evaluation workflow without needing external files or databases. Manager creates proxy objects that work across process boundaries - exactly what you need. Just convert back to regular Python types when moving to session state since proxy objects can get weird in Streamlit.