I’m working with LangChain and trying out LangSmith for monitoring. I built a ReAct agent with LangGraph that uses a system prompt and custom tool functions. When I compare token usage between LangSmith and OpenAI’s native playground, there’s a huge difference.
OpenAI playground shows 1713 total tokens (1608 input + 105 output) for the same conversation. But LangSmith only reports 267 tokens (234 input + 33 output). That’s a massive gap.
I tested the output separately with OpenAI’s tokenizer and confirmed the 33 output tokens are correct. For the basic input without functions, both platforms show 115 tokens. The missing ~1374 input tokens (1608 vs. 234) seem to come from the function definitions and tool results that LangChain adds internally.
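For reference, here’s roughly how I did that check with tiktoken. This is just a sketch: I’m assuming the o200k_base encoding used by the gpt-4o family, and the output string below is a placeholder for the agent’s actual reply.

```python
# Sanity-check the output token count locally with tiktoken
# (assumes the o200k_base encoding used by the gpt-4o family; the string is a placeholder).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
output_text = "<the agent's final reply>"
print(len(enc.encode(output_text)))  # matched the 33 output tokens in my case
```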
It looks like LangSmith only tracks tokens from the basic message history, not the additional context like tool definitions or internal prompt modifications that LangChain injects.
Update: found the solution! You need to set stream_usage=True in your model configuration. Without it, the streamed responses don’t include OpenAI’s usage data, so LangSmith ends up counting only the visible message content, which is why the numbers were so low.
For ChatOpenAI models: `ChatOpenAI(model="gpt-4o-mini", stream_usage=True)`
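Here’s a rough sketch of how it fits into a LangGraph ReAct agent. The `get_weather` tool and the query are just placeholders, and the exact `create_react_agent` arguments may vary by langgraph version:

```python
# Sketch only: placeholder tool and query; create_react_agent signature may differ across langgraph versions.
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Placeholder tool: return a canned weather report for a city."""
    return f"It is sunny in {city}."

# stream_usage=True asks OpenAI to include usage stats in streamed responses,
# so the token counts LangSmith records come from the API instead of an estimate.
llm = ChatOpenAI(model="gpt-4o-mini", stream_usage=True)

agent = create_react_agent(llm, [get_weather])

result = agent.invoke({"messages": [("user", "What's the weather in Paris?")]})

# Each AIMessage now carries usage_metadata with the API-reported counts.
for msg in result["messages"]:
    usage = getattr(msg, "usage_metadata", None)
    if usage:
        print(usage)  # {'input_tokens': ..., 'output_tokens': ..., 'total_tokens': ...}
```

Since the counts now come from the API itself, they include the tokens for tool/function definitions that LangChain attaches to the request, so LangSmith’s numbers should line up with what the playground reports.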
Good catch on the stream_usage param! Hit the same issue last month - drove me crazy. Token discrepancies mess up cost tracking big time. FYI, this happens with Anthropic’s Claude models too if anyone’s using those.
Had this exact issue two weeks ago with my multi-agent setup. Token reporting was so off I thought my code was broken. What’s annoying is they don’t mention this parameter anywhere in the basic LangChain docs for token tracking. Spent hours debugging, thinking my custom tools were somehow dodging the monitoring. They should just enable stream_usage by default for LangSmith - accurate token counting isn’t optional in production. Thanks for posting the fix, would’ve saved me hours of headaches.
Hit this exact issue during deployment and it wrecked my cost calculations. The discrepancy gets way worse with complex function schemas - verbose tool descriptions, lots of parameters, nested tool calls - since every invocation adds token overhead that LangSmith wasn’t counting. In some cases it was under-reporting by almost 70%. What really bugs me is that this parameter isn’t even documented in the LangSmith integration guide. stream_usage does add a bit of latency, but it’s absolutely worth it for accurate production monitoring.