Tokens are the currency of LLMs. Understanding how tokenization works and what context windows mean is essential for building with AI.
LLMs don't read text character by character or word by word — they read tokens. A token is a chunk of text, typically 3-4 characters or about ¾ of a word.
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"unbelievable" → ["un", "believ", "able"] = 3 tokens
"ChatGPT" → ["Chat", "G", "PT"] = 3 tokens
"a" → ["a"] = 1 token
Rule of thumb: 1 token ≈ 4 characters ≈ ¾ of a word. 100 tokens ≈ 75 words.
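The rule of thumb can be turned into a quick estimator when no tokenizer is at hand. This is a minimal sketch; the function names are hypothetical, and the result is only an approximation, since real token counts depend on the tokenizer and the language of the text.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb: 1 token is about 4 characters
WORDS_PER_TOKEN = 0.75   # rule of thumb: 1 token is about 3/4 of a word


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count as character count / 4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate the token count as word count / 0.75, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(75))               # 100
print(estimate_tokens_from_chars("Hello, world!"))  # 4 (13 characters / 4, rounded up)
```

An estimate like this is fine for rough budgeting; use the model's actual tokenizer when exact counts matter.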
Everything in an LLM is measured in tokens, and pricing is no exception. Different models charge different rates per token:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
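Output tokens cost more: in the table above, 4 to 5 times the input rate. Generating text is more expensive than reading it because output tokens are produced one at a time, while input tokens are processed in parallel.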
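To see how the two rates combine, here is a small sketch that prices a single request using the figures from the table above. The `PRICES` dictionary and `request_cost` function are illustrative names, and the 2,000 input / 500 output token counts are made-up example values.

```python
# USD per 1M tokens, copied from the table above: (input rate, output rate)
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request, billing input and output separately."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical request: 2,000 input tokens and 500 output tokens
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.6f}")
```

With GPT-4o's 4:1 price ratio, the 500 output tokens in this example cost as much as the 2,000 input tokens, so capping response length can matter as much as trimming the prompt.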
The context window is the maximum number of tokens an LLM can process at once — both input and output combined.
Everything — system prompt, conversation history, retrieved documents, user message, and the model's response — must fit within the context window.
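Because the response shares the window with the input, it helps to reserve space for the output first and then check that the input parts fit in what remains. The sketch below assumes a hypothetical `ContextBudget` helper, and the token counts are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    context_window: int      # model limit in tokens (input + output)
    max_output_tokens: int   # space reserved for the model's response

    def input_budget(self) -> int:
        """Tokens left for the input once the response space is reserved."""
        return self.context_window - self.max_output_tokens

    def fits(self, *component_tokens: int) -> bool:
        """Check whether all input components together fit in the input budget."""
        return sum(component_tokens) <= self.input_budget()


budget = ContextBudget(context_window=128_000, max_output_tokens=4_000)

# Components: system prompt, conversation history, retrieved documents, user message
print(budget.input_budget())                   # 124000
print(budget.fits(500, 20_000, 100_000, 300))  # True  (120,800 <= 124,000)
print(budget.fits(500, 20_000, 110_000, 300))  # False (130,800 > 124,000)
```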
| Model | Context Window | Approx. pages of text |
|---|---|---|
| GPT-3.5 | 16,384 tokens | ~12 pages |
| GPT-4o | 128,000 tokens | ~96 pages |
| Claude 3.5 Sonnet | 200,000 tokens | ~150 pages |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750 pages |
| Llama 3.1 | 128,000 tokens | ~96 pages |
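The page estimates in the table are consistent with the ¾-word-per-token rule of thumb combined with roughly 1,000 words per page. That page size is an assumption inferred from the figures, not something the table states, and a typical printed page often holds fewer words. A short sketch of the arithmetic:

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb: 1 token is about 3/4 of a word
WORDS_PER_PAGE = 1_000   # assumption inferred from the table's page figures


def approx_pages(context_tokens: int) -> int:
    """Convert a context window size in tokens to an approximate page count."""
    return round(context_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE)


for name, window in [("GPT-3.5", 16_384), ("GPT-4o", 128_000),
                     ("Claude 3.5 Sonnet", 200_000), ("Gemini 1.5 Pro", 1_000_000)]:
    print(f"{name}: ~{approx_pages(window)} pages")
```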
Most production systems combine strategies: they summarize old conversation history and use RAG to retrieve only the relevant documents, rather than stuffing everything into the context window.
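As a starting point for keeping history within a token budget, the sketch below uses the simplest technique available: dropping the oldest messages (a sliding window). It is not summarization or RAG, which would preserve more information. The function names are hypothetical, and the token counts rely on the rough 4-characters-per-token estimate instead of a real tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate using the 4-characters-per-token rule of thumb."""
    return math.ceil(len(text) / 4)


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break                        # this message and all older ones are dropped
        kept.append(message)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]   # 10 estimated tokens each
print(len(trim_history(history, 25)))       # 2 (the oldest message is dropped)
```

A summarizing variant would replace the dropped messages with a short summary instead of discarding them outright.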
Always count tokens before making an API call to avoid errors and control costs:
```python
import tiktoken  # OpenAI's tokenizer library


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Return the number of tokens `text` uses under the given OpenAI model's encoding."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


# Check before sending
prompt = "Explain quantum computing in simple terms"
tokens = count_tokens(prompt)
print(f"Prompt uses {tokens} tokens")

# Estimate cost
cost = (tokens / 1_000_000) * 2.50  # GPT-4o input rate in USD per 1M tokens
print(f"Estimated cost: ${cost:.6f}")
```