I need guidance on managing concurrent user sessions for a voice AI system I’m developing.
Current Setup:
Server: FastAPI with WebSocket connections
Speech Processing: Gemini Live API for speech-to-text and text-to-speech conversion
AI Framework: LangChain with dual-agent setup:
Primary handler (StreamAgent) manages live audio and basic function calls
Secondary processor (CoreAgent), invoked as a function call by the primary handler; includes database tools and maintains chat history via ConversationBufferMemory
Scaling Concerns with Session State
Right now, each WebSocket connection spawns a fresh CoreAgent instance. This cleanly isolates user-specific state such as language preferences and conversation memory.
Main question: Is creating individual agent instances for each user connection a viable approach for production deployment?
I’m worried about RAM consumption when dealing with hundreds of simultaneous connections, where every user maintains their own agent in memory.
What alternative patterns should I consider?
Would using external storage like Redis for session data with stateless agent workers be more efficient?
What’s the recommended approach for maintaining separate conversation histories per user in WebSocket-driven LangChain applications?
I want to ensure solid architecture before going live. Any insights would be helpful!
Been dealing with this for years. Individual agents will destroy your system once you get real traffic.
Here’s what I learned building voice systems at scale: ditch persistent agents completely. Go event-driven.
Each WebSocket message triggers a workflow that loads user context, runs your dual agent setup, saves state back. No agents sitting in memory between requests.
Your FastAPI server becomes just a message router. Real processing happens in scalable workflows that run anywhere.
I’ve seen this pattern handle heavy concurrent load without issues. Each voice interaction triggers a workflow run instead of hitting a persistent object.
Your Gemini Live API calls and LangChain processing get distributed automatically. User context loads fresh each time but stays consistent.
No more managing Redis sessions by hand or babysitting agent lifecycles - the workflow engine owns the entire pipeline from speech input to response. This is the kind of problem workflow automation handles better than traditional architectures.
You’re right to worry about RAM usage. Creating an agent for every connection will kill your performance once you hit more than a few dozen users. I’ve built similar systems before, and going stateless with external storage is the right move. But don’t just use Redis - try a hybrid approach: cache the conversation state you need frequently in Redis, but store full chat histories in PostgreSQL.

For your LangChain setup, you’ll want to make CoreAgent stateless by moving ConversationBufferMemory outside it. Load the conversation history when you need it, process the request, then save the updated state back. This makes your agents request-based instead of tied to connections.

Here’s what worked for me: build a session manager that handles conversation state separately from WebSocket connections. User connects? Load their context. They disconnect or time out? Save any changes.

Think of your agents as processing units, not stateful objects. Each voice interaction should load context, run through your dual-agent pipeline, then clean up. Your memory usage stays consistent no matter how many users you have online.
You’re spot on about RAM consumption. Per-connection agents break down around 100-200 concurrent users in my experience. But there’s a middle ground that works better than going fully stateless - agent pooling with session affinity.

Keep a fixed pool of CoreAgent instances and assign them to connections dynamically. When a WebSocket connects, grab an available agent and load the user’s conversation state into its ConversationBufferMemory. On disconnect, save the memory and toss the agent back in the pool. This gives you predictable memory usage while keeping stateful agents during active sessions. Your pool size becomes the memory limit, not connection count.

For persistence, use PostgreSQL for conversation histories and Redis just for active session metadata. LangChain’s conversation memory serializes to JSON easily for database storage.

The trick is separating connection management from agent lifecycle. Your WebSocket handlers become lightweight proxies that route to pooled agents, and session state gets loaded/saved as needed. I’ve used this approach successfully with voice apps where maintaining conversational context throughout a session really improves response quality.
the agent pooling idea seems overcomplicated. just serialize your langchain conversation memory to postgres after each message. keep websocket handlers simple and rebuild agent context from the database when you need it. works great for voice apps since users naturally pause between interactions.
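something like this, with sqlite standing in for postgres (the schema and helper names are just illustrative - LangChain message lists can be serialized to rows like this and rebuilt on demand):

```python
import sqlite3

# sqlite in place of Postgres; one row per conversation turn.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE turns (user_id TEXT, role TEXT, content TEXT)")

def append_turn(user_id: str, role: str, content: str) -> None:
    """Persist one turn right after each message, then forget it."""
    db.execute("INSERT INTO turns VALUES (?, ?, ?)", (user_id, role, content))

def load_history(user_id: str) -> list:
    """Rebuild agent context from the database when the next message arrives."""
    rows = db.execute(
        "SELECT role, content FROM turns WHERE user_id = ?", (user_id,)
    ).fetchall()
    return [{"role": r, "content": c} for r, c in rows]

append_turn("u1", "user", "hi")
append_turn("u1", "assistant", "hello!")
history = load_history("u1")
```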
I’ve hit this exact scaling problem building voice AI systems for thousands of users.
Don’t create individual agent instances per connection - it’ll destroy your RAM. Found this out the hard way when 500+ users crashed our server during launch.
Redis works, but manually managing all that state gets ugly quick. You’re juggling serialized conversation histories, connection drops, and agent lifecycles.
What actually works in production:
Ditch persistent agents. Use workflow automation for your voice AI pipeline instead. Trigger stateless workflows for each interaction and store conversation context externally - load it when needed.
This scales horizontally without memory headaches. Each voice interaction becomes a workflow execution that runs anywhere in your infrastructure.
I’ve seen this handle 10k+ concurrent sessions easily. Treat each user interaction as an event triggering the right workflow, not a persistent agent instance.
For your FastAPI WebSocket setup, keep connections lightweight and let workflows handle the heavy Gemini Live API and LangChain processing.
the agent-per-connection approach will destroy you at scale. hit 200 concurrent users and watched memory explode. switched to worker agents pulling sessions from a queue instead of keeping everything in memory - problem solved. gemini live api calls became way smoother since connections weren’t getting tied up during processing.
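rough sketch of that worker-queue setup: websocket handlers just enqueue jobs, a small fixed set of workers pulls them off and does the heavy processing. everything here (the `process` stub, worker count) is illustrative, not a real pipeline.

```python
import asyncio

results: list[str] = []

async def process(user_id: str, text: str) -> None:
    """stand-in for the Gemini Live + LangChain processing step."""
    await asyncio.sleep(0)  # simulate async I/O
    results.append(f"{user_id}: {text}")

async def worker(queue: asyncio.Queue) -> None:
    """pulls jobs off the queue forever; connections never block on this."""
    while True:
        user_id, text = await queue.get()
        await process(user_id, text)
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(3)]
    for i in range(5):  # pretend these came from websocket handlers
        queue.put_nowait((f"user-{i}", "hi"))
    await queue.join()  # wait until every queued job is processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())
```

the worker count caps how much processing runs at once, which is why memory stops exploding with connection count.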