LLMs power ChatGPT, Claude, and Gemini. Here's a clear explanation of how they actually work — from tokenization to generation.
A Large Language Model is a neural network trained on massive amounts of text to predict the next token. That's it. Everything else — answering questions, writing code, reasoning — emerges from doing this one thing at enormous scale.
Input: "The capital of France is"
Output: "Paris" (highest probability next token)
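To make "highest probability next token" concrete, here is a minimal sketch of how raw model scores (logits) turn into probabilities via softmax. The scores below are made up for illustration and are not taken from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores a model might assign after "The capital of France is"
logits = {" Paris": 9.1, " Lyon": 5.3, " a": 4.0, " London": 2.2}

probs = softmax(logits)
next_token = max(probs, key=probs.get)  # greedy choice: the most likely token
print(repr(next_token), round(probs[next_token], 3))
```

A real model produces one score for every token in its vocabulary, not just four, but the final step looks the same.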
Text is broken into tokens — not words, but subword units.
"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable" → ["un", "believ", "able"]
"ChatGPT" → ["Chat", "G", "PT"]
GPT-4 has a vocabulary of ~100,000 tokens. Each token is mapped to a unique ID, then converted to a high-dimensional vector (embedding).
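The splitting and ID lookup can be sketched with a toy tokenizer. This one uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification: production tokenizers learn their vocabulary from data (for example with byte-pair encoding), and the vocabulary and IDs here are hypothetical.

```python
# Hypothetical toy vocabulary; real vocabularies are learned and far larger
VOCAB = {"un": 0, "believ": 1, "able": 2, "Chat": 3, "G": 4, "PT": 5}

def tokenize(text, vocab):
    """Split text into the longest known pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens

tokens = tokenize("unbelievable", VOCAB)
ids = [VOCAB[t] for t in tokens]  # each token maps to a unique integer ID
print(tokens, ids)
```

The IDs, not the strings, are what the model actually receives.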
Each token becomes a vector — a list of numbers that captures meaning.
# Simplified — real embeddings are 768-12288 dimensions
"king" → [0.2, 0.8, -0.1, 0.5, ...]
"queen" → [0.2, 0.7, -0.1, 0.6, ...]
"apple" → [-0.3, 0.1, 0.9, -0.2, ...]Similar words have similar vectors. This is why LLMs understand that "king" and "queen" are related.
The transformer processes all tokens in parallel using attention: each token can look at other tokens to understand context. In GPT-style models, a token attends only to itself and the tokens before it.
A modern LLM stacks many transformer blocks — GPT-3 has 96 layers, each refining the representation.
Attention answers: "which other tokens should I pay attention to when processing this token?"
Sentence: "The animal didn't cross the street because it was too tired"
When processing "it":
- Attends strongly to "animal" (it = the animal)
- Attends weakly to "street"
Without attention, "it" is ambiguous
With attention, the model knows "it" refers to "animal"
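The "it" example can be sketched with scaled dot-product attention, the scoring rule transformers use. The 2-dimensional vectors below are invented so that the query for "it" points toward the key for "animal"; in a real model these vectors are learned, and there are many attention heads per layer.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Score each key by its dot product with the query, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return output, weights

# Made-up vectors for three tokens of the sentence
names = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys              # reuse keys as values to keep the sketch small
query_it = [2.0, 0.2]      # hypothetical query for "it"

output, weights = attend(query_it, keys, values)
for name, w in zip(names, weights):
    print(f"{name}: {w:.2f}")
```

The weights sum to 1, and the largest one lands on "animal", which is the behavior described above.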
The model is trained on hundreds of billions of tokens from the internet, books, and code.
Task: Predict the next token.
Input: "The quick brown fox"
Target: "jumps"
Input: "The quick brown fox jumps"
Target: "over"
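Every position in a text yields one training example, so a single sentence produces many input/target pairs. Here is a small sketch of that, along with the usual loss (cross-entropy, the negative log of the probability given to the correct token). The token split and the probabilities are illustrative assumptions.

```python
import math

def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(prob_of_target):
    """Loss for one prediction: small when the correct token got high probability."""
    return -math.log(prob_of_target)

tokens = ["The", " quick", " brown", " fox", " jumps", " over"]
pairs = next_token_pairs(tokens)
for context, target in pairs[-2:]:
    print(repr("".join(context)), "->", repr(target))

# Hypothetical probabilities the model assigned to the correct token
confident_loss = cross_entropy(0.9)
unsure_loss = cross_entropy(0.1)
print(round(confident_loss, 3), round(unsure_loss, 3))
```

Training nudges the model's weights so that this loss goes down on average across the whole dataset.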
This sounds simple but to predict well, the model must learn:
Scale: GPT-3 was trained on about 300 billion tokens. OpenAI has not disclosed the figure for GPT-4; unofficial estimates put it at around 13 trillion tokens.
A pre-trained model just predicts text, so it might complete "How do I make a bomb?" with instructions. RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model on human preference judgments so that its answers become more helpful and safer.
LLMs generate text one token at a time, feeding each output back as input:
Prompt: "Write a haiku about coding"
Step 1: → "Lines"
Step 2: "Lines" → "of"
Step 3: "Lines of" → "code"
Step 4: "Lines of code" → "flow"
...
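The loop behind those steps is short. In this sketch a lookup table stands in for the model, purely as an assumption for illustration: a real LLM computes the next token from the prompt plus everything generated so far, and stops when it emits an end-of-sequence token.

```python
# Hypothetical stand-in for a model: text generated so far -> next token
TOY_MODEL = {
    "": "Lines",
    "Lines": " of",
    "Lines of": " code",
    "Lines of code": " flow",
}

def generate(model, max_tokens):
    """Append one token at a time, feeding the growing text back in."""
    text = ""
    for _ in range(max_tokens):
        next_token = model.get(text)
        if next_token is None:  # toy stop condition in place of an end-of-sequence token
            break
        text += next_token
    return text

haiku_start = generate(TOY_MODEL, max_tokens=10)
print(haiku_start)
```

Because each step depends on the previous output, generation cannot be fully parallelized the way training can.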
Temperature controls randomness:
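A common way to implement this is to divide the logits by the temperature before applying softmax, then sample from the result. The sketch below assumes that approach and uses made-up logits; the temperature must be greater than zero.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index after rescaling logits by 1/temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

logits = [2.0, 1.0, 0.5]   # hypothetical scores for three candidate tokens
rng = random.Random(0)     # seeded so the sketch is repeatable

_, cold = sample_with_temperature(logits, 0.2, rng)  # low temperature: sharper
_, hot = sample_with_temperature(logits, 2.0, rng)   # high temperature: flatter
print([round(p, 2) for p in cold])
print([round(p, 2) for p in hot])
```

Low temperatures concentrate probability on the top token, so output is more predictable; high temperatures flatten the distribution, so output is more varied.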
The context window is how many tokens the model can "see" at once.
| Model | Context window |
|---|---|
| GPT-3.5 | 4,096 tokens (~3,000 words) |
| GPT-4 Turbo | 128,000 tokens (~96,000 words) |
| Claude 3 | 200,000 tokens (~150,000 words) |
| Gemini 1.5 | 1,000,000 tokens |
Everything outside the context window is forgotten — LLMs have no persistent memory by default.
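One simple way an application can handle an overlong conversation is to keep only the most recent tokens that fit. This is a sketch of that single strategy, with a hypothetical helper name; real systems may instead summarize or retrieve older content.

```python
def fit_to_context(token_ids, context_window):
    """Keep only the most recent tokens that fit in the window."""
    if context_window <= 0:
        return []
    return token_ids[-context_window:]

conversation = list(range(10))  # stand-in for 10 token IDs
visible = fit_to_context(conversation, context_window=4)
print(visible)  # the model never sees the 6 dropped tokens
```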
LLMs don't "know" facts — they predict plausible text. When asked about something outside their training data, they generate a plausible-sounding answer that may be wrong.
Q: "What did Einstein say about quantum computing?"
A: [Generates a plausible-sounding quote that Einstein never said]
Solutions: RAG (retrieval augmented generation), grounding with tools, citations.
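The idea behind RAG can be sketched in a few lines: find relevant text first, then put it in the prompt so the model answers from it instead of from memory. The word-overlap scoring, the function names, and the prompt wording below are all assumptions for illustration; real systems typically retrieve with embedding similarity over a document index.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(doc):
        return len(question_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_prompt(question, documents):
    """Place the retrieved text in front of the question."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "Paris is the capital of France.",
    "The transformer uses attention.",
]
prompt = build_prompt("What is the capital of France?", documents)
print(prompt)
```

The resulting prompt is what gets sent to the LLM, which gives it a source to quote rather than a gap to fill.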