Token counting discrepancy between Langsmith and OpenAI tokenizer

I’m working with a langgraph setup and noticed something weird with how tokens are being counted. When I compare the token count in Langsmith versus what I get from OpenAI’s tokenizer or my own Python script, the numbers are way different.

What I’m seeing:

  • Langsmith shows: 2,067 tokens
  • My Python code shows: 11,185 tokens
  • OpenAI’s online tokenizer also shows a much higher count

My Python test code:

import tiktoken

def count_text_tokens(text_input: str) -> int:
    encoder = tiktoken.get_encoding("cl100k_base")
    token_count = len(encoder.encode(text_input))
    return token_count

sample_text = """ Content copied from langsmith trace """
print(count_text_tokens(sample_text))

This outputs 11185 tokens, which matches what I see in OpenAI’s tokenizer.

My questions:

  1. What’s causing this huge difference in token counting between Langsmith and OpenAI?
  2. How can I configure Langsmith to use the proper token counting method? I’m using this simple code: result = ChatOpenAI().invoke("Hello!")

Any help would be appreciated!

Had a similar issue debugging my langgraph workflows - turned out to be message formatting that wasn’t obvious. When you use ChatOpenAI().invoke(), the library automatically wraps your text in message structures with role definitions and metadata. Your raw Python tokenization doesn’t account for this wrapper content.

Try inspecting what’s actually being sent to the API. Add debug logging before your invoke call:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
# _convert_input is a private helper, so the exact name can vary by
# langchain version - but it shows how invoke() wraps a plain string
messages = llm._convert_input("your text").to_messages()
print(messages)

This shows you exactly what gets tokenized. The chat format adds tokens for role markers and message separators that the API counts but your manual script misses.

Could also be that your langgraph setup includes streaming or partial responses that get tokenized incrementally. Langsmith might show tokens for just one chunk while your script counts the full assembled text. Check if your workflow has retry logic or multiple API calls happening behind the scenes - that’d explain why the counts are so different.

I hit this exact problem a few months ago and it drove me crazy for days. The difference you’re seeing is because Langsmith counts tokens at different stages of your request/response cycle, not just your input text.

When you use ChatOpenAI().invoke(), Langsmith tracks the entire conversation - system messages, function definitions if you’re using tools, and the API response. Your Python script only counts the raw text you’re feeding it.

Check the full trace details in Langsmith to verify this. Look at the “inputs” and “outputs” sections. You’ll probably find extra content like system prompts or tool schemas that aren’t in your sample text. That 2,067 tokens might be just one piece of the total interaction.

Different tokenizer versions can cause small variations, but nothing as big as what you’re seeing. The massive difference means you’re comparing apples to oranges - different content is being tokenized.

This token counting mess usually happens because Langsmith’s internal model config doesn’t match what you’re testing with.

Hit the same issue last year - production costs were way off from my local estimates. Langsmith wasn't using cl100k_base for counting; it defaulted to a different encoding based on the model version.

Try this:

llm = ChatOpenAI(model="gpt-3.5-turbo")  # or whatever model
print(llm.get_num_tokens("your text here"))

This uses the same tokenizer as your actual ChatOpenAI instance. If the numbers still don't match Langsmith, check the model parameter in your ChatOpenAI constructor.

Also - got any custom message formatting or templates in your langgraph? Those add invisible tokens that manual counting misses.

That 5x difference is way too big for just system prompts or metadata. Something’s fundamentally different between the tokenizers.

Langsmith is probably caching token counts from earlier runs. I've hit this bug before - it shows stale numbers that won't refresh until you clear the session or restart. Run your invoke call in a clean environment and see if the count updates. Also make sure you're checking the right trace - the UI sometimes mixes up multiple runs.

Been there way too many times. Token discrepancies are debugging nightmares, especially when you’re optimizing costs across multiple langgraph workflows.

Yeah, it’s the execution stages thing everyone’s talking about - Langsmith tracks them separately. But manually checking traces and comparing counts? Total time sink.

I got burned on token estimates for a big project and automated the whole thing. Built workflows that monitor token counting across all my langgraph chains in real time. Pulls from both Langsmith and OpenAI APIs, compares automatically, flags discrepancies instantly.

The automation grabs exactly what’s being tokenized at each stage - system prompts, tool definitions, message formatting, all of it. When counts don’t match, I get detailed breakdowns showing exactly where the difference is.

No more copying content into test scripts or digging through trace logs. It runs continuously and alerts me when token patterns change unexpectedly. Saved me tons of debugging time.

Build something similar - you’ll catch these instantly instead of after the fact.

Been dealing with token counting headaches for years - manual debugging gets old fast.

That discrepancy? Langsmith’s counting tokens from one part of your chain while Python counts everything. Different langgraph stages = different token counts.

I stopped doing manual checks and automated the whole thing. Built monitoring that tracks token usage across all workflow stages, pulls from both Langsmith and OpenAI APIs, compares counts, and alerts me when something’s off.

It also logs exactly what content gets tokenized at each step. Now I can instantly see if system prompts or tool definitions are in one count but not the other. Saves hours of detective work weekly.

Build something similar for your langgraph setup - you’ll get real-time visibility into where tokens go. Way better than copying content and running test scripts.

The token count difference means you're seeing different parts of your langgraph execution. Langsmith tracks tokens for individual nodes, not the whole workflow. I hit this same issue debugging my chains.

Here's what's happening: ChatOpenAI().invoke() creates multiple tokenization events - preprocessing, the API call, and response processing. Langsmith tracks each one separately. Your 2,067 count is probably just one node, while your Python script counts everything.

Go to the specific trace in Langsmith and expand all execution steps. You'll see multiple token counts that should add up closer to your 11,185.

Also check if your langgraph setup does any text preprocessing. Sometimes content gets modified before hitting the LLM, which messes up manual comparisons. The raw trace data shows exactly what text actually reached the tokenizer vs. what you sent.
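If you want to sanity-check the "counts should add up" idea, a tiny helper like this can reconcile per-node numbers you read off an expanded trace against your manual total. The function and the node names are mine for illustration, not a Langsmith API:

```python
# Hypothetical helper: sum per-node token counts read off an expanded
# Langsmith trace and compare against a manually computed total.
def reconcile(node_counts, manual_total, tolerance=0.05):
    """Return (trace_total, whether it's within tolerance of manual_total)."""
    trace_total = sum(node_counts.values())
    gap = abs(trace_total - manual_total)
    return trace_total, gap <= tolerance * manual_total

# Example numbers, not from a real trace
nodes = {"preprocess": 1200, "llm_call": 2067, "postprocess": 7918}
total, ok = reconcile(nodes, 11185)
print(total, ok)
```

If the reconciled total still falls far short of your manual count, the missing tokens are in content your script tokenizes but the trace never sent (or vice versa), and that's where to dig.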