Controlling LLM Output — Temperature and Sampling

How LLMs Generate Text

LLMs don't pick the "correct" next word — they generate a probability distribution over all possible next tokens, then sample from it.

Input: "The capital of France is"

Token probabilities:
"Paris"    → 94.2%
"Lyon"     → 2.1%
"a"        → 1.8%
"the"      → 0.9%
...

Temperature and top-p control how you sample from this distribution.
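
To make the sampling step concrete, here is a minimal sketch that draws tokens from the example distribution above using only the standard library. The `<other>` entry is an assumption: it lumps the probability mass elided by the "..." into one placeholder so the weights sum to 1. The name `sample_token` is also hypothetical.

```python
import random

# Toy next-token distribution from the example above. The leftover mass is
# lumped into a single "<other>" placeholder (an assumption for illustration).
probs = {"Paris": 0.942, "Lyon": 0.021, "a": 0.018, "the": 0.009}
probs["<other>"] = 1.0 - sum(probs.values())


def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draws
counts = {}
for _ in range(1000):
    token = sample_token(probs, rng)
    counts[token] = counts.get(token, 0) + 1

# "Paris" dominates, but less likely tokens still show up occasionally
print(counts)
```

Most draws return "Paris", yet the low-probability tokens are never impossible. That residual randomness is what temperature and top-p adjust.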

Temperature

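Temperature rescales the distribution before sampling: the model's raw scores (logits) are divided by the temperature before they are converted to probabilities. Values below 1 sharpen the distribution toward the most likely tokens; values above 1 flatten it. The OpenAI API accepts values from 0 to 2; other providers may use a narrower range (Anthropic's API, for example, accepts 0 to 1).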

Temperature	Behaviour	Use when
0	Always picks highest probability token	Facts, code, structured data
0.3 - 0.7	Mostly predictable, slight variation	Summaries, Q&A, analysis
0.7 - 1.0	Balanced creativity	General chat, writing assistance
1.0 - 2.0	High creativity, less coherent	Brainstorming, creative writing
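
The effect in the table can be sketched with a few lines of Python. This is an illustration of the usual softmax-with-temperature formulation, not any provider's actual implementation; the logits are made-up values, and treating temperature 0 as "pick the argmax" is an assumption about how the limit case is handled.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the argmax)
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens
logits = [4.0, 1.0, 0.5, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

As the temperature rises, the top token's share shrinks and the tail tokens gain probability, which is why high settings read as more varied and less coherent.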

from openai import OpenAI
client = OpenAI()
 
# Near-deterministic: always favours the most likely token, so answers rarely vary
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
 
# Creative: the answer is likely to differ on each call
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1.2,
    messages=[{"role": "user", "content": "Write a poem about code"}]
)

Top-p Sampling (Nucleus Sampling)

Top-p limits sampling to the smallest set of tokens whose cumulative probability reaches or exceeds p. The probabilities of the kept tokens are then renormalized before sampling.

All tokens sorted by probability:
"Paris"    → 94.2%  ← cumulative: 94.2%
"Lyon"     → 2.1%   ← cumulative: 96.3%
"a"        → 1.8%   ← cumulative: 98.1%  ← top-p=0.98 stops here
"the"      → 0.9%   ← excluded
...

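With top_p=0.9, sampling is restricted to the most likely tokens that together cover 90% of the probability mass. In the example above, "Paris" alone already exceeds 90%, so it would be the only candidate.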
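
The cutoff rule can be sketched as a small filter over the example distribution. This is an illustrative version, not a library API: `top_p_filter` is a hypothetical name, and the `<other>` entry is an assumed placeholder for the tokens elided by the "...".

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # nucleus is complete; everything after this is excluded
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Example distribution; "<other>" stands in for the remaining tokens (assumption)
probs = {"Paris": 0.942, "Lyon": 0.021, "a": 0.018, "the": 0.009, "<other>": 0.010}

nucleus_098 = top_p_filter(probs, 0.98)
nucleus_090 = top_p_filter(probs, 0.90)
print(list(nucleus_098))  # ['Paris', 'Lyon', 'a']
print(list(nucleus_090))  # ['Paris']
```

When the model is already confident, even a fairly high top-p leaves very few candidates. The filter only widens the pool when the distribution is flatter.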

Top-p	Behaviour
0.1	Very conservative — only top tokens
0.9	Balanced: trims the unlikely tail, a common choice when filtering is wanted
1.0	All tokens considered, no filtering (the OpenAI API default)

Temperature vs Top-p — Which to Use?

Use temperature when you want to control the overall creativity level.

Use top-p when you want to prevent very unlikely tokens from appearing.

In practice: Most APIs let you set both. The recommendation from OpenAI and Anthropic is to alter one, not both — changing both simultaneously makes behaviour hard to predict.

# Recommended: set one, leave the other at default
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,   # set this
    top_p=1.0,         # leave at default
    messages=[...]
)

Practical Settings by Use Case

Use Case	Temperature	Top-p
Code generation	0 - 0.2	1.0
Factual Q&A	0 - 0.3	1.0
Summarization	0.3 - 0.5	1.0
Chatbot	0.7	1.0
Creative writing	1.0 - 1.5	0.9
Brainstorming	1.2 - 1.5	0.95
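
One way to keep these settings in a codebase is a small lookup table. The sketch below is only a suggestion: `SAMPLING_PRESETS` and `sampling_params` are hypothetical names, and each value is one point picked from the ranges in the table above rather than a tuned recommendation.

```python
# Hypothetical presets; each value is one point chosen from the ranges above
SAMPLING_PRESETS = {
    "code_generation": {"temperature": 0.0, "top_p": 1.0},
    "factual_qa": {"temperature": 0.2, "top_p": 1.0},
    "summarization": {"temperature": 0.4, "top_p": 1.0},
    "chatbot": {"temperature": 0.7, "top_p": 1.0},
    "creative_writing": {"temperature": 1.2, "top_p": 0.9},
    "brainstorming": {"temperature": 1.3, "top_p": 0.95},
}


def sampling_params(use_case, default="chatbot"):
    """Return a copy of the preset for use_case, falling back to the default."""
    preset = SAMPLING_PRESETS.get(use_case, SAMPLING_PRESETS[default])
    return dict(preset)  # copy so callers cannot mutate the shared table


params = sampling_params("summarization")
print(params)  # {'temperature': 0.4, 'top_p': 1.0}
```

The returned dict can be unpacked into a request as keyword arguments (for example `**params` in a `client.chat.completions.create` call), which keeps the sampling choices in one place.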

Key Takeaway

LLMs sample from a probability distribution; temperature and top-p control how that sampling is done
Temperature 0 = greedy, always picks the most likely token (near-deterministic in practice)
High temperature = more creative, less predictable
Top-p limits the pool of tokens to sample from
For production: use low temperature (0-0.3) for facts and code, higher (0.7 and above) for chat and creative tasks
Change one parameter at a time — not both