Tokenization
Models don't read characters or words; they read tokens. Understanding tokens explains your bill, your context limits, and a whole class of "weird" model behavior.
Overview
Before an LLM sees your text, a tokenizer chops it into tokens: chunks that are often word-pieces (roughly ¾ of a word on average in English). The model works entirely in these units. Everything downstream, from cost and context limits to some surprising failures, is measured in tokens, so this small concept has outsized practical importance.
Learning Objectives
By the end of this page you will be able to:
- Explain what a token is and why models use subword units.
- Estimate token counts and therefore cost.
- Diagnose token-related issues (truncation, context limits, odd character handling).
Theory
Why not just use words or characters?
- Characters are too fine: sequences get very long, and the model wastes capacity learning to spell.
- Words are too coarse: the vocabulary would be enormous and couldn't handle new or misspelled words.
Subword tokenization is the sweet spot. Common words become single tokens; rare words split into pieces. This keeps the vocabulary manageable (typically ~50k to 200k tokens) while handling any input.
flowchart LR
A["'Tokenization is fun'"] --> B[Tokenizer]
B --> C["['Token', 'ization', ' is', ' fun']"]
C --> D["[15496, 1634, 318, 1257]"]
D --> E[Model]
Note two things in that example:
- "Tokenization" splits into "Token" + "ization": a rare-ish word broken into pieces.
- The leading space is part of the token (" is", " fun"). Tokenizers usually attach spaces to the following word, which is why spacing can subtly affect behavior.
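To make the splitting idea concrete, here is a toy sketch that reproduces the example above. It uses greedy longest-match against a tiny made-up vocabulary (`TOY_VOCAB` and `toy_tokenize` are hypothetical names for illustration). Real tokenizers learn their vocabulary from data and use more involved algorithms, so treat this as an intuition aid rather than how any production tokenizer works.

```python
def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown text falls back to single characters."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical four-entry vocabulary; note the leading spaces on " is" and " fun".
TOY_VOCAB = {"Token", "ization", " is", " fun"}

pieces = toy_tokenize("Tokenization is fun", TOY_VOCAB)
print(pieces)  # ['Token', 'ization', ' is', ' fun']
```

Because of the single-character fallback, the toy version never fails on unseen input; it just produces more, smaller pieces, which mirrors how rare words cost more tokens.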
Rough rules of thumb (English)
- 1 token ≈ 4 characters ≈ ¾ of a word.
- 100 tokens ≈ 75 words.
- A dense page of text ≈ 500 to 800 tokens.
These are approximations; code, numbers, and other languages tokenize differently (often less efficiently).
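The rules of thumb above can be turned into a quick back-of-the-envelope estimator. This is a minimal sketch with hypothetical helper names; it only applies the 4-characters and 75-words-per-100-tokens ratios, so expect it to be off for code, numbers, and non-English text.

```python
import math


def estimate_tokens_from_chars(text: str) -> int:
    """Rule of thumb: about 4 characters per token (English prose)."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rule of thumb: about 100 tokens per 75 words (English prose)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-word_count * 100 // 75)


print(estimate_tokens_from_chars("Tokenization is fun!"))  # 20 characters -> 5
print(estimate_tokens_from_words(75))                      # 100
```

For anything that affects billing or hard limits, prefer a real token count from your provider's tokenizer over this estimate.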
Tokens = money and limits
Two of the most practical facts in AI engineering:
- You are billed per token, for both input (your prompt) and output (the response), usually at different rates.
- The context window is measured in tokens: the model can only "see" a fixed number at once (prompt + response combined). Exceed it and content is dropped or the call errors.
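Since input and output are billed at different rates, a cost estimate is a two-term sum. The sketch below assumes prices quoted per million tokens; the function name and the example prices (3.00 and 15.00) are placeholders made up for illustration, not any provider's real rates.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost = input tokens at the input rate + output tokens at the output rate."""
    total = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# Placeholder prices (USD per million tokens); check your provider's price list.
cost = estimate_cost_usd(
    input_tokens=10_000,
    output_tokens=2_000,
    input_price_per_mtok=3.00,
    output_price_per_mtok=15.00,
)
print(f"Estimated cost: ${cost:.4f}")
```

Multiply by your expected request volume to see how quickly long prompts add up.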
flowchart TB
subgraph "Context window (e.g. 200k tokens)"
S[System prompt] --- H[Chat history] --- R[Retrieved docs] --- U[User message] --- O[Room for output]
end
If you cram huge documents into the prompt, you pay for every token and risk crowding out room for the answer. This is a core reason RAG exists: retrieve only the relevant chunks instead of stuffing everything in.
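One way to keep an eye on this is a simple budget check before each call. The sketch below is illustrative only: the function name, the part names, and every token count are invented, and in practice the counts would come from a real tokenizer or a token-counting endpoint.

```python
def output_headroom(context_window: int, prompt_parts: dict[str, int]) -> int:
    """Tokens left for the response after all prompt parts are accounted for."""
    return context_window - sum(prompt_parts.values())


# Made-up token counts for each part of the prompt.
parts = {
    "system_prompt": 1_000,
    "chat_history": 20_000,
    "retrieved_docs": 150_000,
    "user_message": 500,
}

headroom = output_headroom(200_000, parts)
print(f"Room left for output: {headroom} tokens")  # 28500

# Hypothetical policy: insist on a minimum reserve for the answer.
MIN_OUTPUT_RESERVE = 4_000
if headroom < MIN_OUTPUT_RESERVE:
    print("Prompt too large: trim history or retrieve fewer chunks.")
```

A negative headroom means the prompt alone already exceeds the window, which is the signal to trim history or retrieve fewer chunks.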
Why tokenization causes "weird" behavior
- Counting letters ("how many 'r's in strawberry?") is hard because the model sees tokens, not letters.
- Arithmetic on long numbers is unreliable because digits get grouped into tokens inconsistently.
- Non-English and code often use more tokens per idea, costing more and filling context faster.
These make more sense once you remember: the model never sees the raw characters you do.
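Ordinary code does see raw characters and digits, which is why exact character or number tasks are better handed to a tool. A small sketch with arbitrary example values:

```python
# Code operates on characters directly, so letter counting is exact.
word = "strawberry"
r_count = word.count("r")
print(f"'r' appears {r_count} times in {word!r}")  # 3

# Python integers are arbitrary precision, so long-number arithmetic is exact too.
a = 123456789012345678901234567890
b = 987654321098765432109876543210
total = a + b
print(total)
```

In an application, this usually means giving the model a calculator or code-execution tool and letting it call that instead of answering from memory.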
Practical Example
Count tokens before you send a request so you can predict cost and stay within limits:
# Anthropic exposes a token-counting endpoint so you can measure before sending.
from anthropic import Anthropic

client = Anthropic()
result = client.messages.count_tokens(
    model="claude-sonnet-5",
    messages=[{"role": "user", "content": "Tokenization is fun!"}],
)
print(f"Input tokens: {result.input_tokens}")
For OpenAI models, the tiktoken library counts locally:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Tokenization is fun!")
print(tokens) # e.g. [2438, 2065, 382, 2523, 0]
print(f"{len(tokens)} tokens")
See it yourself
Paste text into a visual tokenizer (e.g. the OpenAI tokenizer playground) and watch words split into colored token chunks. It builds intuition fast.
Best Practices
- ✅ Estimate tokens before sending large prompts; it's your cost and limit budget.
- ✅ Prefer retrieving relevant chunks (RAG) over stuffing whole documents.
- ✅ Leave headroom in the context window for the output.
- ✅ For exact character/number tasks, use a tool instead of trusting the model.
Common Mistakes
- ❌ Assuming "context window" means characters or words; it's tokens.
- ❌ Forgetting output tokens count toward both cost and the window.
- ❌ Blaming the model for miscounting letters; that's a tokenization artifact.
- ❌ Assuming token counts transfer across providers; each has its own tokenizer.
Exercises
- Count the tokens in a paragraph of English, the same paragraph translated to another language, and a code snippet of similar length. Compare them: which is most token-hungry?
- Estimate the cost of a 2,000-word input + 500-word output using your provider's per-token prices.
- Take a 300,000-token document and a 200k-token context window. How would you fit it? (Hint: this is the motivation for chunking.)
References
- tiktoken: OpenAI's tokenizer library
- Anthropic: Token counting
- Hugging Face: Summary of tokenizers
- Next in Bee: The Transformer