Every LLM API exposes a set of parameters that control how the model generates text. Most people use the defaults and never touch them — but understanding what these controls do gives you meaningful power over output quality, creativity, and cost.
Temperature
Temperature is the single most useful parameter to understand.
LLMs generate text by assigning probabilities to every possible next token. At each step, the model has to pick which token comes next. Temperature controls how it makes that choice:
Temperature = 0: Always pick the highest-probability token. Deterministic, consistent, conservative. Same prompt → same output every time.
Temperature = 1: Pick tokens according to their actual probabilities. Some variation, but still coherent. The model "as designed."
Temperature > 1: Amplify lower-probability tokens. More surprising, more creative, often incoherent beyond ~1.5.
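The mechanics can be sketched in a few lines (a toy illustration of the math, not any vendor's actual sampling code): dividing the logits by the temperature before the softmax sharpens or flattens the distribution.

```python
# Toy sketch: how temperature reshapes a next-token distribution.
# Logit values here are made up for illustration.
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]                   # hypothetical scores for three tokens
low = apply_temperature(logits, 0.2)       # sharpens: top token dominates
high = apply_temperature(logits, 1.5)      # flattens: tail tokens gain mass
```

Note that temperature = 0 is undefined under this formula (division by zero); implementations treat it as a plain argmax over the logits.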
Practical guidance
| Task | Temperature |
|---|---|
| Factual Q&A, data extraction | 0.0–0.2 |
| Code generation | 0.0–0.3 |
| Professional writing | 0.3–0.6 |
| Conversational assistants | 0.5–0.8 |
| Creative writing, brainstorming | 0.7–1.0 |
| Poetry, experimental content | 0.9–1.2 |
Rule of thumb: If you need consistency and accuracy, go lower. If you need variety and creativity, go higher.
Top-P (Nucleus Sampling)
Top-p limits which tokens the model can choose from at each step.
With top_p=0.9, the model considers only the tokens whose cumulative probability reaches 90% — throwing out the long tail of low-probability tokens. This prevents truly bizarre choices while still allowing natural variation.
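The truncation step can be sketched on a toy distribution (token names and probabilities here are invented for illustration): sort tokens by probability, keep the smallest set whose cumulative mass reaches `p`, and renormalize.

```python
# Toy sketch of nucleus (top-p) truncation.
def top_p_filter(probs, p):
    """probs: {token: probability}. Returns the renormalized nucleus."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break                          # stop once the mass threshold is met
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.9))            # drops the low-probability tail
```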
Top-p vs. temperature:
- Temperature reshapes the probability distribution (makes peaks higher or flatter)
- Top-p truncates it (removes the low-probability tail)
For most use cases, temperature alone is sufficient. If your outputs occasionally include bizarre word choices even at moderate temperatures, lowering top-p (to 0.7–0.85) can help.
What Anthropic recommends: Don't change both temperature and top-p at the same time. Pick one to tune.
Max Tokens
Max tokens (called max_tokens in the Anthropic API, max_completion_tokens in OpenAI) sets the maximum length of the model's output.
It does not affect how much input the model reads — only how long its response can be.
Common mistakes:
- Setting max_tokens too low → response gets cut off mid-sentence
- Setting max_tokens much higher than needed → wastes money on buffer
- Confusing max_tokens with the context window → they're different things
Practical approach: Estimate the longest output you'd realistically need and set max_tokens to ~120% of that. For most conversational tasks, 1,024–2,048 is plenty. For long documents or detailed code, 4,096–8,192 makes sense.
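That sizing rule can be turned into a rough helper, using the common "~4 characters per token" heuristic for English text (an approximation, not a real tokenizer):

```python
# Rough sketch of the ~120% sizing rule; the chars/4 estimate is a
# heuristic for English, not a substitute for actual token counting.
def suggest_max_tokens(longest_expected_output: str, buffer: float = 1.2) -> int:
    estimated_tokens = len(longest_expected_output) / 4
    return max(1, int(estimated_tokens * buffer))
```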
Stop Sequences
Stop sequences tell the model to stop generating text when it produces a specific string or token.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    stop_sequences=["###", "END", "\n\nUser:"],
    messages=[...]
)
```
Common uses:
- Structured generation: Stop at a delimiter to extract just the section you want
- Multi-turn control: Stop at "User:" to prevent the model from role-playing both sides of a conversation
- Format enforcement: Stop after the JSON closes by using "}" as a stop sequence
Stop sequences are especially useful when you're parsing model output programmatically and need a reliable termination point.
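One parsing detail worth knowing: APIs generally omit the matched stop sequence from the returned text (Anthropic's does), so when using "}" as a stop sequence you typically need to re-append the brace before parsing. A minimal sketch:

```python
# Sketch: recovering JSON when "}" was consumed as the stop sequence.
import json

def parse_stopped_json(text, stop_sequence="}"):
    """Re-attach the stop sequence generation halted on, then parse."""
    return json.loads(text + stop_sequence)

raw = '{"name": "Ada", "role": "engineer"'   # brace consumed by the stop
record = parse_stopped_json(raw)
```

Note the caveat: "}" halts generation at the first closing brace, so this approach only suits flat, non-nested objects.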
Top-K
Less commonly exposed than temperature and top-p, top-K limits the model to choosing from the K most likely tokens at each step.
With top_k=40, only the 40 highest-probability tokens are eligible at each generation step.
Most APIs don't expose top-K or default it to a sensible value. If you can set it: lower values produce more predictable output; higher values allow more diversity. It's most useful for very constrained generation tasks where you want the model to stay within a tight vocabulary.
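The filtering itself is even simpler than top-p and can be sketched the same way (toy tokens and probabilities, for illustration): keep the k highest-probability tokens and renormalize.

```python
# Toy sketch of top-k truncation.
def top_k_filter(probs, k):
    """probs: {token: probability}. Keep the k most likely, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

probs = {"red": 0.4, "blue": 0.3, "green": 0.2, "mauve": 0.1}
print(top_k_filter(probs, 2))          # only the two most likely survive
```

Unlike top-p, the cutoff is a fixed count rather than a probability mass, so it doesn't adapt to how peaked or flat the distribution is at each step.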
The Parameters at a Glance
| Parameter | What it controls | Increase for | Decrease for |
|---|---|---|---|
| temperature | Randomness of token selection | More creativity | More consistency |
| top_p | How many tokens are candidates | More variety | Fewer odd choices |
| max_tokens | Maximum output length | Longer responses | Shorter, cheaper |
| Stop sequences | Where generation halts | N/A | Precise cutoff points |
| top_k | How many tokens are eligible | More variety | More predictable |
Putting It Together
For a code generation task where you need reliable, consistent output:
`temperature=0.1, top_p=0.95, max_tokens=4096`
For a brainstorming task where you want varied ideas:
`temperature=0.9, top_p=1.0, max_tokens=1024`
For a data extraction task where the output must match a schema:
`temperature=0.0, max_tokens=512, stop_sequences=["}"]`
The key insight: temperature and top-p are about how the model chooses tokens; max_tokens is about how long it goes; stop sequences are about where it stops. These are independent controls, and combining them lets you tune the output precisely for your use case.