Token counting discrepancy between LangSmith and OpenAI tokenizer

I’m working with a LangGraph setup and noticed something strange. The token counts I see in LangSmith don’t match what I get from OpenAI’s tokenizer or when I count tokens using Python.

What I’m seeing:

  • LangSmith shows: 2,067 tokens
  • My Python script shows: 11,185 tokens
  • OpenAI’s online tool shows similar numbers to Python

My token counting script:

import tiktoken

def calculate_tokens(text: str) -> int:
    encoder = tiktoken.get_encoding("cl100k_base")
    token_count = len(encoder.encode(text))
    return token_count

sample_text = """ Content copied from langsmith trace """
print(calculate_tokens(sample_text))

Questions I have:

  1. What causes this huge difference in token counting between LangSmith and OpenAI?
  2. Is there a way to configure LangSmith to use the correct token counting method? I’m using this simple code: result = ChatOpenAI().invoke("Hello!")

The difference is really significant and I need to understand why this happens.

Token counting gets messy because you’re dealing with multiple abstraction layers that each add overhead.

LangSmith shows one view, your Python script shows another - neither gives you the full picture of what’s happening in production. You need visibility into the entire pipeline.

I hit this same nightmare trying to optimize costs across different LLM calls. Manual debugging with logs and comparing outputs gets old fast.

What worked: automated monitoring that tracks tokens at every step. Build workflows that capture exact API payloads, run tiktoken calculations automatically, and compare everything in real time.

Skip the manual log checking and debug scripts. Automate the whole token tracking process. Set up triggers that run token counting on API calls, store results, and alert you when things don’t match.
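If you want to prototype that idea in plain Python before wiring it into an automation platform, here’s a rough sketch using a LangChain callback handler. The TokenAudit class name is something I made up, and it assumes a non-streaming ChatOpenAI call where the response reports token_usage:

import tiktoken
from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI

enc = tiktoken.get_encoding("cl100k_base")

class TokenAudit(BaseCallbackHandler):
    """Count what we send with tiktoken, then compare against what the API reports."""

    def on_chat_model_start(self, serialized, messages, **kwargs):
        # messages is a list of message batches; count only the visible text content
        self.sent = sum(
            len(enc.encode(m.content))
            for batch in messages
            for m in batch
            if isinstance(m.content, str)
        )

    def on_llm_end(self, response, **kwargs):
        usage = (response.llm_output or {}).get("token_usage", {})
        print("tiktoken over message content:", self.sent)
        print("API prompt_tokens:            ", usage.get("prompt_tokens"))
        print("API total_tokens:             ", usage.get("total_tokens"))

result = ChatOpenAI(callbacks=[TokenAudit()]).invoke("Hello!")

From there you can store the two numbers wherever you like and alert on any mismatch.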

You’ll catch token count issues immediately instead of finding them when your bills are wrong. You can also auto-test different message formats to see exactly where extra tokens come from.

Automation saves tons of debugging time and gives you consistent monitoring across all LLM integrations.

Check out how to build this automated monitoring: https://latenode.com

This happens because LangSmith shows different token counts depending on where you’re looking. The main view might only display completion or input tokens, while your tiktoken script counts everything. I’ve hit this exact problem - the summary shows one number, but drill down and you’ll see totally different counts. The issue is that LangSmith aggregates tokens across multiple API calls or splits them by message type in ways that aren’t clear from the UI. Click into individual trace steps instead of the top-level summary; each step shows its own token counts, and together they should add up to your expected total. Also check whether streaming or retries are enabled - they can mix multiple token counts together in the display.

I’ve hit this exact problem way too many times. You’re comparing two different things.

LangSmith traces show you sanitized text that strips out the actual message structure. But ChatOpenAI().invoke() wraps everything in the chat-completions format with roles and message metadata.

Here’s what actually gets sent:

[
  {"role": "system", "content": "..."}, 
  {"role": "user", "content": "your text here"}
]

Each message adds tokens for its role label and the chat-format wrapper on top of the content you see, and LangChain chains or agents can sneak in system messages you didn’t set yourself.
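To see how much that wrapper costs, you can count chat tokens the way OpenAI’s cookbook suggests: a few overhead tokens per message plus a fixed priming overhead. The helper name below is mine, and the constants are the published values for cl100k_base chat models, so treat it as an approximation:

import tiktoken

def count_chat_tokens(messages, encoding_name="cl100k_base"):
    """Approximate prompt tokens for a chat request (OpenAI cookbook formula)."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens_per_message = 3  # every message carries a few tokens of format overhead
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(enc.encode(value))
            if key == "name":
                total += 1
    total += 3  # every reply is primed with <|start|>assistant<|message|>
    return total

messages = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "your text here"},
]
print(count_chat_tokens(messages))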

To debug this, grab the actual API payload before it goes to OpenAI:

import logging
# Setting a logger level alone prints nothing - configure a handler too
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("openai").setLevel(logging.DEBUG)

Then tokenize that exact payload, not just what you see in traces. The big difference makes sense when you count all those wrapper tokens.

LangSmith’s UI sucks at showing what actually hits the API.

That’s probably because LangSmith is only showing input tokens while your tiktoken script counts the whole conversation. I hit this same issue when debugging billing costs. LangSmith’s trace view often just shows the user message tokens, but the actual API call includes system prompts, message formatting, and metadata - that’s a ton of extra overhead. Look for the raw view or API tab in the trace to see exactly what got sent, not just the visible content. The formatting differences between chat messages and raw text can add hundreds or thousands of tokens depending on your setup. Your cl100k_base encoding is right, but you need to tokenize the complete structured message format that actually hits the API.
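If you’d rather script it than click through the UI, the LangSmith Python client can pull the recorded run so you tokenize exactly what was stored. A rough sketch, assuming your API key is set and the token fields are populated on the run object; the run id is a placeholder for your own LLM step:

import json
import tiktoken
from langsmith import Client

enc = tiktoken.get_encoding("cl100k_base")
client = Client()  # needs LANGSMITH_API_KEY set in the environment

run = client.read_run("<your-llm-run-id>")            # the LLM step, not the parent trace
recorded_input = json.dumps(run.inputs, default=str)  # what LangSmith stored as the input
print("tiktoken over recorded inputs:", len(enc.encode(recorded_input)))
print("LangSmith prompt/total tokens:", run.prompt_tokens, run.total_tokens)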

Had this exact problem a few months ago and wasted way too much time on it. LangSmith and your Python script are counting tokens from different parts of the conversation. When you use ChatOpenAI().invoke(), LangSmith tracks everything at the API level - system messages, formatting overhead, maybe conversation history you’re not counting manually. Your script is probably only counting the raw text you pulled out, but LangSmith sees the full payload hitting OpenAI’s API. Log the actual messages going to the API with LangChain’s verbose or debug mode, and check whether system prompts or extra context are getting added automatically. The cl100k_base encoding is right, but you need to tokenize the exact same content that actually gets sent to OpenAI, not just what you see in the trace.
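For the debug-mode route, a quick sketch of what I mean:

from langchain.globals import set_debug
from langchain_openai import ChatOpenAI

set_debug(True)  # prints every component's inputs/outputs, including the exact chat messages
result = ChatOpenAI().invoke("Hello!")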

Check if LangSmith is only showing prompt tokens instead of the total (prompt + completion) - the UI might display just one part while your tiktoken script counts everything. Also make sure you’re using the right encoding for your model: cl100k_base is correct for gpt-4 and gpt-3.5-turbo, but newer models like gpt-4o use a different tokenizer (o200k_base).
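An easy way to avoid guessing the encoding is to let tiktoken resolve it from the model name:

import tiktoken

# Resolve the encoding from the model name instead of hard-coding it
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.name)                    # cl100k_base for gpt-3.5-turbo / gpt-4
print(len(enc.encode("Hello!")))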

Check if you’re using streaming mode - it messes up token counts big time. LangSmith sometimes splits streamed responses awkwardly, leading to partial counts. Also make sure you’re comparing against the same model version; token accounting can differ slightly across model snapshots even with the same encoding.
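If streaming is the culprit, you can still get the provider’s own usage numbers. A small sketch, assuming your langchain-openai version supports the stream_usage flag, which attaches usage metadata to the final chunk:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", stream_usage=True)

usage = None
for chunk in llm.stream("Hello!"):
    if chunk.usage_metadata:  # only the final chunk carries the usage numbers
        usage = chunk.usage_metadata

print(usage)  # e.g. {'input_tokens': ..., 'output_tokens': ..., 'total_tokens': ...}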