Tokens and Context Windows

What is a Token?

LLMs don't read text character by character or word by word; they read tokens. A token is a chunk of text produced by the model's tokenizer. In English text a token averages about 4 characters, or roughly ¾ of a word.

"Hello, world!"  →  ["Hello", ",", " world", "!"]       = 4 tokens
"unbelievable"   →  ["un", "believ", "able"]             = 3 tokens
"ChatGPT"        →  ["Chat", "G", "PT"]                  = 3 tokens
"a"              →  ["a"]                                 = 1 token

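Rule of thumb (English text): 1 token ≈ 4 characters ≈ ¾ of a word, so 100 tokens ≈ 75 words. Exact splits depend on the model's tokenizer, so treat the examples above as illustrative.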
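
The rule of thumb can be turned into a quick estimator for cases where no tokenizer is at hand. The following is a minimal sketch, not a real tokenizer: the class and method names (TokenEstimate, estimateTokensFromChars, estimateWords) are hypothetical, and the result is only an approximation for English text.

```java
// Rough token estimate based on the rule of thumb (1 token ≈ 4 characters ≈ ¾ of a word).
// All names here are hypothetical; a real tokenizer will give different counts.
final class TokenEstimate {
    static final double CHARS_PER_TOKEN = 4.0;   // rule-of-thumb assumption
    static final double WORDS_PER_TOKEN = 0.75;  // rule-of-thumb assumption

    // Estimates tokens from the character count (UTF-16 code units), rounding up.
    static int estimateTokensFromChars(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    // Estimates how many English words a given number of tokens corresponds to.
    static int estimateWords(int tokens) {
        return (int) Math.round(tokens * WORDS_PER_TOKEN);
    }

    public static void main(String[] args) {
        String text = "Explain quantum computing in simple terms";
        System.out.println("Estimated tokens: " + estimateTokensFromChars(text));
        System.out.println("Words in 100 tokens: " + estimateWords(100));
    }
}
```

An estimate like this is good enough for budgeting, but anything that depends on an exact limit should use the model's actual tokenizer.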

Why Tokenization Matters

Everything in an LLM is measured in tokens:

Cost — you pay per token (input + output)
Speed — latency scales with token count
Limits — context window is measured in tokens

Token Costs

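Different models charge different rates per token. Prices change often, so treat the figures below as a snapshot and check each provider's pricing page: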

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-4o	$2.50	$10.00
GPT-4o mini	$0.15	$0.60
Claude 3.5 Sonnet	$3.00	$15.00
Claude 3 Haiku	$0.25	$1.25
Gemini 1.5 Flash	$0.075	$0.30

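Output tokens cost more than input tokens (4 to 5 times more in the table above). Generating text is more expensive than reading it, because the model produces output one token at a time while input tokens can be processed in parallel.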
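
To see how the per-million rates translate into the cost of a single request, here is a small sketch of the arithmetic. The class and method names (CostEstimate, costUsd) and the request size are hypothetical; the prices are the GPT-4o figures from the table and may be out of date.

```java
// Sketch of per-request cost arithmetic. Names and the example request are hypothetical.
final class CostEstimate {
    // Prices are in USD per 1M tokens, as listed in the pricing table.
    static double costUsd(long inputTokens, long outputTokens,
                          double inputPricePerMillion, double outputPricePerMillion) {
        double inputCost = inputTokens / 1_000_000.0 * inputPricePerMillion;
        double outputCost = outputTokens / 1_000_000.0 * outputPricePerMillion;
        return inputCost + outputCost;
    }

    public static void main(String[] args) {
        // Assumed request: 2,000 input tokens and 500 output tokens at GPT-4o rates.
        double cost = costUsd(2_000, 500, 2.50, 10.00);
        System.out.printf("Estimated cost: $%.4f%n", cost);
    }
}
```

In this example the 500 output tokens cost as much as the 2,000 input tokens, which is why limiting response length is often the quickest way to reduce spend.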

What is a Context Window?

The context window is the maximum number of tokens an LLM can process at once — both input and output combined.

Everything — system prompt, conversation history, retrieved documents, user message, and the model's response — must fit within the context window.
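
Because input and output share the same window, it helps to think of the window as a budget. The sketch below checks whether a request leaves enough room for the response. The names (ContextBudget, remainingForOutput, fits) and all token counts are made up for illustration.

```java
// Sketch of a context-window budget check. Names and numbers are hypothetical.
final class ContextBudget {
    // Tokens left for the model's response after all input parts are counted.
    static int remainingForOutput(int contextWindow, int... inputParts) {
        int used = 0;
        for (int part : inputParts) {
            used += part;
        }
        return contextWindow - used;
    }

    // True if the input parts plus the reserved output tokens fit in the window.
    static boolean fits(int contextWindow, int maxOutputTokens, int... inputParts) {
        return remainingForOutput(contextWindow, inputParts) >= maxOutputTokens;
    }

    public static void main(String[] args) {
        int window = 128_000;        // e.g. a 128K-token model
        int systemPrompt = 500;      // assumed token counts for each part
        int history = 20_000;
        int retrievedDocs = 100_000;
        int userMessage = 200;
        int maxOutput = 4_096;       // tokens reserved for the response

        System.out.println("Remaining for output: "
                + remainingForOutput(window, systemPrompt, history, retrievedDocs, userMessage));
        System.out.println("Fits: "
                + fits(window, maxOutput, systemPrompt, history, retrievedDocs, userMessage));
    }
}
```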

Context Window Sizes

Model	Context Window	Approx. pages of text
GPT-3.5	16,384 tokens	~12 pages
GPT-4o	128,000 tokens	~96 pages
Claude 3.5 Sonnet	200,000 tokens	~150 pages
Gemini 1.5 Pro	1,000,000 tokens	~750 pages
Llama 3.1	128,000 tokens	~96 pages
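
The page figures in the table follow from the rule of thumb: tokens × 0.75 gives words, and the table appears to assume roughly 1,000 words per page. That page size is an inference from the numbers, not a stated standard, and the names below (PageEstimate, approxPages) are hypothetical.

```java
// Sketch of the tokens-to-pages arithmetic. WORDS_PER_PAGE is an assumption
// inferred from the table (128,000 tokens ≈ 96,000 words ≈ 96 pages).
final class PageEstimate {
    static final double WORDS_PER_TOKEN = 0.75;    // rule of thumb
    static final double WORDS_PER_PAGE = 1_000.0;  // assumed page size

    static long approxPages(int tokens) {
        return Math.round(tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE);
    }

    public static void main(String[] args) {
        System.out.println("128,000 tokens ≈ " + approxPages(128_000) + " pages");
        System.out.println("200,000 tokens ≈ " + approxPages(200_000) + " pages");
    }
}
```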

What Happens When You Exceed the Context Window?

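If a request exceeds the context window, the API typically rejects it with an error, so something has to be left out. The usual options are to truncate (drop the oldest messages), summarize (compress older history into a shorter form), or use RAG (retrieve only the relevant documents).

Most production systems use a combination: summarize old conversation history and use RAG to retrieve only relevant documents rather than stuffing everything in.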
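
The simplest way to stay under the limit is truncation: keep the most recent messages and drop the oldest ones once the budget is used up. The following is a minimal sketch of that idea. The names (HistoryTruncation, keepRecent) are hypothetical, and the token counter is passed in as a function so that a real tokenizer can be swapped in for the rough stand-in used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Sketch of history truncation. Names are hypothetical; a real system would
// usually also pin the system prompt and count per-message overhead.
final class HistoryTruncation {
    // Walks from newest to oldest, keeping messages until the budget is exhausted.
    // Returns the kept messages in their original (chronological) order.
    static List<String> keepRecent(List<String> messages, int tokenBudget,
                                   ToIntFunction<String> tokenCounter) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (int i = messages.size() - 1; i >= 0; i--) {
            int cost = tokenCounter.applyAsInt(messages.get(i));
            if (used + cost > tokenBudget) {
                break; // this message and everything older is dropped
            }
            kept.add(0, messages.get(i));
            used += cost;
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in counter based on the 4-characters-per-token rule of thumb.
        ToIntFunction<String> roughCounter = s -> (int) Math.ceil(s.length() / 4.0);
        List<String> history = List.of(
                "What is a token?",
                "A token is a chunk of text that the model reads.",
                "And what is a context window?");
        System.out.println(keepRecent(history, 20, roughCounter));
    }
}
```

Truncation is cheap but loses information outright, which is why it is often paired with summarization of the dropped messages.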

Counting Tokens Before Sending

Always count tokens before making an API call to avoid errors and control costs:

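import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class TokenCounter {
    // Reuse a single registry instead of creating one per call.
    private static final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

    // Returns the number of tokens the given model's tokenizer produces for the text.
    public static int countTokens(String text, ModelType modelType) {
        Encoding encoding = registry.getEncodingForModel(modelType);
        return encoding.countTokens(text);
    }

    public static void main(String[] args) {
        // Check before sending
        String prompt = "Explain quantum computing in simple terms";
        // ModelType.GPT_4 is only an example; use the entry that matches your model.
        int tokens = countTokens(prompt, ModelType.GPT_4);
        System.out.println("Prompt tokens: " + tokens);
    }
}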

Key Takeaways

1 token ≈ 4 characters ≈ ¾ of a word
You pay per token — both input and output
Context window = maximum tokens the model can see at once
Output tokens cost more than input tokens
Always count tokens before sending to avoid surprises
When context is too large: truncate, summarize, or use RAG