Optimizing Memory and State Management in FastAPI Chatbot with Langchain

I built a chatbot using FastAPI with Langchain and deployed it on Render’s free tier. Right now I’m using threading to handle multiple users but I’m not sure if this is the right approach. The bot streams responses and works fine but I’m concerned about some potential problems.

Memory Issues: Each user gets their own thread which seems like it will eat up too much memory and CPU when more people start using it.

State Storage: All conversation states live in memory right now. When the app restarts everything gets wiped out. I know I could use a database but then I need to figure out how to clean up old conversations and manage the data properly.

Is threading the best way to handle multiple users? What’s the standard way to manage conversation states without keeping them forever? This is my first time building something like this so any advice would be helpful.

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, JSONResponse
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph
from vector_db import DocumentRetriever
from fastapi.middleware.cors import CORSMiddleware
import os

api_app = FastAPI()
api_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

graph = StateGraph(state_schema=MessagesState)
llm_client = ChatNVIDIA(model="meta/llama-3.1-70b-instruct")
doc_retriever = DocumentRetriever()

async def process_request(state: MessagesState, config):
    recent_messages = state['messages'][-15:]  # Keep last 15 messages
    query = recent_messages[-1].content
    
    docs = doc_retriever.search_documents(query, top_k=2)
    context = "\n".join([doc.page_content for doc in docs])
    
    # bot_type is read from config but never used; wire it into the prompt or drop it
    bot_config = config["configurable"].get("bot_type")
    system_msg = f"You are a helpful assistant. Context: {context}"

    full_messages = [SystemMessage(content=system_msg)] + recent_messages

    try:
        result = await llm_client.ainvoke(full_messages)
        return {"messages": [result]}
    except Exception as err:
        raise RuntimeError(f"Processing failed: {err}") from err

graph.add_node("processor", process_request)
graph.add_edge(START, "processor")

storage = MemorySaver()
bot_app = graph.compile(checkpointer=storage)

@api_app.post("/message")
async def handle_message(request: Request):
    payload = await request.json()
    text = payload.get("text", "")
    session_id = payload.get("session_id", "")
    
    if not text:
        return JSONResponse(content={"error": "Text required"}, status_code=400)
    if not session_id:
        return JSONResponse(content={"error": "Session ID required"}, status_code=400)
    
    async def generate_response():
        settings = {"configurable": {"thread_id": session_id}}
        input_data = {"messages": [HumanMessage(content=text)]}
        
        try:
            async for chunk, meta in bot_app.astream(input_data, config=settings, stream_mode="messages"):
                yield chunk.content
        except Exception as err:
            yield f"Error: {str(err)}"
    
    return StreamingResponse(generate_response(), media_type="text/plain")

You’re overcomplicating this. Skip the threading and memory management headaches - just automate the whole thing.

I’ve dealt with chatbot scaling before. Best move? Offload this complexity to an automation platform. Set up your FastAPI endpoints to trigger workflows that handle conversation state, user sessions, and data cleanup automatically.

No more worrying about memory limits or restart issues. Everything runs in managed workflows with built-in state persistence. Want conversation analytics, user routing, or response caching? Add them without touching your core code.

For cleanup, create scheduled workflows that purge old conversations after X days. Done.

Your FastAPI stays lean - it just triggers workflows and returns responses. The heavy lifting happens elsewhere with proper resource management.

Here’s how to automate this setup: https://latenode.com

the threading thing works fine with langgraph’s MemorySaver. memory usage isn’t too bad since langchain takes care of the tough parts. for keeping state, try redis with a TTL policy to auto-delete old convos after 24 hrs. this way, no extra cleanup needed. should be fine on render’s free tier till you get more users.
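In case it helps, here's a minimal sketch of that Redis-with-TTL idea (assuming the redis-py client; the key scheme and helper names are just made up for illustration):

```python
import json

SESSION_TTL = 24 * 60 * 60  # auto-delete conversations after 24 hours

def session_key(session_id: str) -> str:
    return f"chat:session:{session_id}"

def save_session(r, session_id: str, messages: list) -> None:
    # SETEX writes the value and (re)sets the TTL in one call, so every
    # save pushes expiry another 24h out; idle sessions just vanish.
    r.setex(session_key(session_id), SESSION_TTL, json.dumps(messages))

def load_session(r, session_id: str) -> list:
    raw = r.get(session_key(session_id))
    return json.loads(raw) if raw else []
```

With redis-py you'd pass in `r = redis.Redis(decode_responses=True)`. The nice part is there's no cleanup job at all - Redis evicts expired keys on its own.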

Had the same problem when I deployed my FastAPI chatbot. The setup with Langgraph’s MemorySaver actually works fine for moderate traffic - your endpoints are async, so each request runs as a lightweight coroutine on the event loop, not a real OS thread, and that overhead is small.

I went with SQLite and a daily cleanup job. Way simpler than Redis if you’re broke. Just store session_id, messages, and timestamp. Index the timestamp column and delete conversations older than 7 days. Run cleanup in the background with APScheduler.

Learned this the hard way - truncating to 15 messages is smart for costs, but store the full conversation in your database while only sending recent context to the LLM. Users want their chat history to stick around even if the AI doesn’t remember everything. This setup’s handled thousands of conversations without breaking on similar hosting.
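To make that concrete, here's a rough sketch of the SQLite side (schema and function names are my own, not a standard): store every message, index the timestamp, send only recent context to the LLM, and have a daily job - APScheduler, as mentioned - call the purge.

```python
import json
import sqlite3
import time

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""CREATE TABLE IF NOT EXISTS conversations (
        session_id TEXT,
        message    TEXT,   -- one JSON-encoded message per row
        created_at REAL)""")
    # index the timestamp so the cleanup DELETE stays fast
    conn.execute("CREATE INDEX IF NOT EXISTS idx_created ON conversations(created_at)")

def store_message(conn, session_id: str, role: str, content: str) -> None:
    conn.execute("INSERT INTO conversations VALUES (?, ?, ?)",
                 (session_id, json.dumps({"role": role, "content": content}),
                  time.time()))

def recent_context(conn, session_id: str, limit: int = 15) -> list:
    # full history stays in the table; only the last `limit` messages
    # go to the LLM (rowid preserves insertion order)
    rows = conn.execute(
        "SELECT message FROM conversations WHERE session_id = ? "
        "ORDER BY rowid DESC LIMIT ?", (session_id, limit)).fetchall()
    return [json.loads(r[0]) for r in reversed(rows)]

def purge_old(conn, max_age_days: int = 7) -> int:
    # the scheduled cleanup job calls this once a day
    cutoff = time.time() - max_age_days * 86400
    cur = conn.execute("DELETE FROM conversations WHERE created_at < ?", (cutoff,))
    return cur.rowcount
```

`recent_context` is what you'd feed into the graph in place of the in-memory truncation, so the 15-message window for the LLM and the full stored history stay decoupled.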

Your approach works, but there’s a better way. Don’t worry about threading overhead - the real killer is conversation state bloat. I’ve watched FastAPI bots crash when MemorySaver hoards too much session data.

Try a hybrid setup: keep active sessions in memory, then dump inactive ones to disk after 30 minutes. PostgreSQL works great on Render’s free tier - just make a sessions table with a jsonb column for message history. Only load conversations back when users actually return.

For cleanup, keep it simple. Run a cron job that nukes conversations older than 30 days. Most people never look at old chats anyway. Your 15-message window is perfect for LLM context, but store everything so users don’t lose their history.

One thing that bit me - don’t cache vector search results between requests. That’s where document retrieval systems really leak memory.