How Large Language Models Work

What is an LLM?

A Large Language Model is a neural network trained on massive amounts of text to predict the next token. That's it. Everything else — answering questions, writing code, reasoning — emerges from doing this one thing at enormous scale.

Input:  "The capital of France is"
Output: "Paris"  (highest probability next token)

Step 1 — Tokenization

Text is broken into tokens: not necessarily whole words, but subword units. The exact split depends on the tokenizer, so the examples below are illustrative.

"Hello, world!" → ["Hello", ",", " world", "!"]
"unbelievable"  → ["un", "believ", "able"]
"ChatGPT"       → ["Chat", "G", "PT"]

GPT-4 has a vocabulary of ~100,000 tokens. Each token is mapped to a unique ID, then converted to a high-dimensional vector (embedding).
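To make the text-to-tokens-to-IDs mapping concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary, the IDs, and the function names are all hypothetical, chosen only to reproduce the examples above. Real tokenizers (GPT models use byte-pair encoding) learn their vocabulary from data and split text with learned merge rules rather than a hand-written table.

```python
# Hypothetical toy vocabulary: pieces and IDs are made up for illustration.
VOCAB = {"un": 0, "believ": 1, "able": 2, "Hello": 3, ",": 4, " world": 5, "!": 6}

def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest piece found in vocab."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return tokens

def encode(text, vocab):
    """Map text to the list of token IDs the model would actually receive."""
    return [vocab[token] for token in tokenize(text, vocab)]

print(tokenize("unbelievable", VOCAB))  # ['un', 'believ', 'able']
print(encode("Hello, world!", VOCAB))   # [3, 4, 5, 6]
```

The model never sees the strings themselves, only the integer IDs, which are then looked up in an embedding table.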

Step 2 — Embeddings

Each token becomes a vector — a list of numbers that captures meaning.

# Simplified: real embeddings have roughly 768 to 12,288 dimensions
"king"   → [0.2, 0.8, -0.1, 0.5, ...]
"queen"  → [0.2, 0.7, -0.1, 0.6, ...]
"apple"  → [-0.3, 0.1, 0.9, -0.2, ...]

Words that appear in similar contexts end up with similar vectors. This is how an LLM represents that "king" and "queen" are related.
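"Similar vectors" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below applies it to the first four numbers of the toy vectors above. Those values are illustrative, not taken from a real model, and the function name is our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = unrelated, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors from the example above (illustrative values only).
king = [0.2, 0.8, -0.1, 0.5]
queen = [0.2, 0.7, -0.1, 0.6]
apple = [-0.3, 0.1, 0.9, -0.2]

print(round(cosine_similarity(king, queen), 2))  # 0.99
print(round(cosine_similarity(king, apple), 2))  # -0.18
```

With these toy numbers, "king" and "queen" point in almost the same direction, while "king" and "apple" do not.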

Step 3 — The Transformer

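The transformer processes all tokens in parallel using attention, which lets each token draw on other tokens to understand context. In GPT-style (decoder-only) models, a token attends only to itself and the tokens before it, so predicting the next token never relies on text that has not been generated yet.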

A modern LLM stacks many transformer blocks — GPT-3 has 96 layers, each refining the representation.

Step 4 — Attention (Simplified)

Attention answers: "which other tokens should I pay attention to when processing this token?"

Sentence: "The animal didn't cross the street because it was too tired"

When processing "it":
- Attends strongly to "animal" (it = the animal)
- Attends weakly to "street"

Without attention, "it" is ambiguous.
With attention, the model can link "it" to "animal".
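The weighting described above can be sketched as scaled dot-product attention: score each pair of tokens with a dot product, turn the scores into weights with a softmax, and mix the value vectors using those weights. This is a minimal sketch under simplifying assumptions. The 2-dimensional vectors are hypothetical, the same vectors serve as queries, keys, and values (a real transformer first applies learned projection matrices), and there is a single attention head. The `causal` flag is also our own addition; it hides later tokens, as GPT-style models do.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention; returns (outputs, weights) with one row per token."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            if causal and j > i:
                # Later tokens are masked out: exp(-inf) is 0, so their weight is 0.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        weights = softmax(scores)
        # Each output is the weighted average of the value vectors.
        output = [sum(w * value[c] for w, value in zip(weights, values))
                  for c in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: "it" is deliberately placed close to "animal".
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = attention(vectors, vectors, vectors)
for token, weight in zip(tokens, weights[2]):
    print(f"'it' attends to {token!r} with weight {weight:.2f}")
```

In a trained model nobody places "it" near "animal" by hand; the projections that produce queries and keys are learned during pre-training.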

Step 5 — Pre-training

The model is trained on hundreds of billions to trillions of tokens drawn from the internet, books, and code.

Task: Predict the next token.

Input:  "The quick brown fox"
Target: "jumps"

Input:  "The quick brown fox jumps"
Target: "over"

This sounds simple, but to predict well the model must learn:

- Grammar and syntax
- Facts about the world
- Reasoning patterns
- Code structure
- And much more

Scale: GPT-3 was trained on about 300 billion tokens. OpenAI has not published the figure for GPT-4; unofficial estimates put it at around 13 trillion.
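The next-token objective can be sketched in a few lines: every prefix of a text becomes a training input, the token that follows is the target, and the loss is the negative log of the probability the model gave to that target (cross-entropy). The function names and the two probability tables below are hypothetical. A real training run computes this loss over huge batches and adjusts the network's weights by gradient descent, which is not shown here.

```python
import math

def next_token_pairs(tokens):
    """Each prefix of the sequence is an input; the token that follows it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one prediction: negative log of the probability given to the true token."""
    return -math.log(predicted_probs[target])

tokens = ["The", " quick", " brown", " fox", " jumps", " over"]
pairs = next_token_pairs(tokens)
print(pairs[3])  # (['The', ' quick', ' brown', ' fox'], ' jumps')

# Hypothetical model outputs for the input "The quick brown fox".
good_guess = {" jumps": 0.6, " runs": 0.3, " sleeps": 0.1}
bad_guess = {" jumps": 0.05, " runs": 0.15, " sleeps": 0.8}

print(round(cross_entropy(good_guess, " jumps"), 2))  # 0.51
print(round(cross_entropy(bad_guess, " jumps"), 2))   # 3.0
```

Training pushes the loss down, which means pushing probability toward the token that actually came next.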

Step 6 — RLHF (Making it Helpful)

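A pre-trained model only continues text: given "How do I make a bomb?", it may simply produce instructions, because that is a plausible continuation. RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model on human preference judgments: people rank candidate responses, a reward model learns those rankings, and the LLM is then optimized to produce responses the reward model scores highly. This makes it more helpful and safer, though not perfectly so.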
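One common ingredient of RLHF is a reward model trained on pairs of responses where a human marked one as better. A widely used objective for that step is the pairwise logistic loss, sketched below. This is an assumption about a typical setup rather than a description of any particular vendor's pipeline, and the reward scores are made-up numbers. The later step, where the LLM itself is optimized against the reward model, is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the human-preferred response scores higher than the rejected one,
    large when the ranking is reversed.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(x)) is algebraically equal to log(1 + exp(-x)).
    return math.log(1.0 + math.exp(-diff))

# Hypothetical reward scores for (preferred response, rejected response).
print(round(preference_loss(2.0, -1.0), 3))  # 0.049  reward model agrees with the human
print(round(preference_loss(0.0, 0.0), 3))   # 0.693  reward model is undecided
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049  reward model disagrees
```

Minimizing this loss over many human-labeled pairs teaches the reward model to score responses the way people ranked them.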

How Generation Works

LLMs generate text one token at a time, feeding each output back as input:

Prompt: "Write a haiku about coding"

Step 1: → "Lines"
Step 2: "Lines" → "of"
Step 3: "Lines of" → "code"
Step 4: "Lines of code" → "flow"
...
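The loop above can be sketched as follows. The "model" here is a hypothetical lookup table keyed on the last token only, standing in for a real LLM, which would condition on the entire sequence. The `<start>` and `<end>` markers and all probabilities are invented for illustration.

```python
# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"Lines": 0.7, "Code": 0.3},
    "Lines": {" of": 0.9, " in": 0.1},
    " of": {" code": 0.8, " text": 0.2},
    " code": {" flow": 0.6, " run": 0.4},
    " flow": {"<end>": 1.0},
}

def toy_model(tokens):
    """Return next-token probabilities given the tokens so far (uses only the last one)."""
    return NEXT_TOKEN_PROBS[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive decoding: predict one token, append it, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)
        next_token = max(probs, key=probs.get)  # greedy: take the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

print("".join(generate(["<start>"])[1:]))  # Lines of code flow
```

Because each step depends on the previous output, generation is sequential even though the transformer processes the existing tokens in parallel.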

Temperature controls randomness:

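Temperature = 0 → always pick the most likely token (greedy, effectively deterministic)
Temperature = 1 → sample from the model's probabilities unchanged (more varied)
Temperature > 1 → flatten the probabilities, so unlikely tokens are picked more often (chaotic)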
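Mechanically, temperature divides the model's raw scores (logits) before the softmax that turns them into probabilities. The sketch below shows the effect on three hypothetical logits. The function names are our own, and temperature 0 is handled as a special case because dividing by zero is undefined.

```python
import math
import random

def apply_temperature(logits, temperature):
    """softmax(logits / temperature): low values sharpen the distribution, high values flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Pick the next token; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits for three candidate next tokens.
tokens = ["Paris", "Lyon", "London"]
logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 2) for p in apply_temperature(logits, t)])
# 0.5 [0.87, 0.12, 0.02]
# 1.0 [0.67, 0.24, 0.09]
# 2.0 [0.51, 0.31, 0.19]

print(sample_token(tokens, logits, 0))  # Paris
```

At temperature 0.5 the top token takes most of the probability; at 2.0 the three candidates are much closer together.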

Context Window

The context window is how many tokens the model can "see" at once.

Model	Context window
GPT-3.5	4,096 tokens (~3,000 words)
GPT-4 Turbo	128,000 tokens (~96,000 words)
Claude 3	200,000 tokens (~150,000 words)
Gemini 1.5	1,000,000 tokens (~750,000 words)

Everything outside the context window is forgotten — LLMs have no persistent memory by default.
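In practice, an application has to decide what to send when a conversation grows past the limit. One simple policy, sketched below, keeps only the most recent tokens and leaves room for the reply. The function and parameter names are hypothetical. Real systems may instead summarize older turns or retrieve only the relevant ones.

```python
def fit_to_context(tokens, context_window, reserve_for_output=0):
    """Keep only the most recent tokens that fit, leaving room for the model's reply."""
    budget = context_window - reserve_for_output
    if budget <= 0:
        raise ValueError("context window is too small for the requested output")
    # Negative slicing keeps the tail; older tokens are dropped and the model never sees them.
    return tokens[-budget:]

history = list(range(10))  # stand-in for 10 token IDs, oldest first

print(fit_to_context(history, context_window=4))                        # [6, 7, 8, 9]
print(fit_to_context(history, context_window=6, reserve_for_output=2))  # [6, 7, 8, 9]
print(fit_to_context([1, 2, 3], context_window=8))                      # [1, 2, 3]
```

Anything dropped this way is simply absent from the model's input on the next call.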

Why LLMs Hallucinate

LLMs don't look facts up; they predict plausible text. When a question concerns something absent from or rare in their training data, they can still generate a fluent, confident answer that is wrong.

Q: "What did Einstein say about quantum computing?"
A: [Generates a plausible-sounding quote that Einstein never said]

Mitigations: RAG (retrieval-augmented generation), grounding with tools, citations. These reduce hallucination but do not eliminate it.
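The idea behind RAG is to fetch relevant text first and place it in the prompt, so the model can answer from supplied context instead of from memory. The sketch below is deliberately simplified. Retrieval is plain word overlap (real systems typically compare embedding vectors), the three documents are facts quoted from this article, the function names are hypothetical, and the final call to an LLM is left out.

```python
import re

# A tiny stand-in document store; the entries are facts stated in this article.
DOCUMENTS = [
    "GPT-3 has 96 transformer layers.",
    "GPT-3 was trained on 300 billion tokens.",
    "Claude 3 has a context window of 200,000 tokens.",
]

def words(text):
    """Lowercased set of alphanumeric words, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = words(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & words(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved text in front of the question before sending it to a model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return ("Answer using only the context below. "
            "If the context does not contain the answer, say you don't know.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

print(build_prompt("How many layers does GPT-3 have?", DOCUMENTS))
```

The prompt now carries the supporting sentence, and the same retrieved text can be shown to the user as a citation.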

Key Takeaways

- LLMs predict the next token; everything else emerges from scale
- Tokenization → Embeddings → Transformer (Attention) → Output probabilities
- Pre-training on massive text teaches world knowledge
- RLHF makes the model helpful and safe
- Context window = working memory; everything outside is forgotten
- Hallucination happens because LLMs predict plausible text, not verified facts