Every LLM API exposes a set of parameters that control how the model generates text. Most people use the defaults and never touch them — but understanding what these controls do gives you meaningful power over output quality, creativity, and cost.
Temperature
Temperature is the single most useful parameter to understand.
LLMs generate text by assigning probabilities to every possible next token. At each step, the model has to pick which token comes next. Temperature controls how it makes that choice:
Temperature = 0: Always pick the highest-probability token. Deterministic, consistent, conservative. Same prompt → same output every time.
Temperature = 1: Pick tokens according to their actual probabilities. Some variation, but still coherent. The model "as designed."
Temperature > 1: Amplify lower-probability tokens. More surprising, more creative, often incoherent beyond ~1.5.
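The mechanics can be sketched in a few lines (a toy illustration of the math, not any vendor's actual sampling code): dividing the logits by the temperature before the softmax sharpens or flattens the distribution.

```python
# Toy sketch: how temperature reshapes a next-token distribution.
# Logit values here are made up for illustration.
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]                   # hypothetical scores for three tokens
low = apply_temperature(logits, 0.2)       # sharpens: top token dominates
high = apply_temperature(logits, 1.5)      # flattens: tail tokens gain mass
```

Note that temperature = 0 is undefined under this formula (division by zero); implementations treat it as a plain argmax over the logits.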
Practical guidance
| Task | Temperature |
|---|---|
| Factual Q&A, data extraction | 0.0–0.2 |
| Code generation | 0.0–0.3 |
| Professional writing | 0.3–0.6 |
| Conversational assistants | 0.5–0.8 |
| Creative writing, brainstorming | 0.7–1.0 |
| Poetry, experimental content | 0.9–1.2 |
Rule of thumb: If you need consistency and accuracy, go lower. If you need variety and creativity, go higher.
Top-P (Nucleus Sampling)
Top-p limits which tokens the model can choose from at each step.
With top_p=0.9, the model considers only the tokens whose cumulative probability reaches 90% — throwing out the long tail of low-probability tokens. This prevents truly bizarre choices while still allowing natural variation.
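The truncation step can be sketched on a toy distribution (token names and probabilities here are invented for illustration): sort tokens by probability, keep the smallest set whose cumulative mass reaches `p`, and renormalize.

```python
# Toy sketch of nucleus (top-p) truncation.
def top_p_filter(probs, p):
    """probs: {token: probability}. Returns the renormalized nucleus."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break                          # stop once the mass threshold is met
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.9))            # drops the low-probability tail
```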
Top-p vs. temperature:
- Temperature reshapes the probability distribution (makes peaks higher or flatter)
- Top-p truncates it (removes the low-probability tail)
For most use cases, temperature alone is sufficient. If your outputs occasionally include bizarre word choices even at moderate temperatures, lowering top-p (to 0.7–0.85) can help.
What Anthropic recommends: Don't change both temperature and top-p at the same time. Pick one to tune.
Max Tokens
Max tokens (called max_tokens in the Anthropic API, max_completion_tokens in OpenAI) sets the maximum length of the model's output.
It does not affect how much input the model reads — only how long its response can be.
Common mistakes:
- Setting max_tokens too low → response gets cut off mid-sentence
- Setting max_tokens much higher than needed → wastes money on buffer
- Confusing max_tokens with the context window → they're different things
Practical approach: Estimate the longest output you'd realistically need and set max_tokens to ~120% of that. For most conversational tasks, 1,024–2,048 is plenty. For long documents or detailed code, 4,096–8,192 makes sense.
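That sizing rule can be turned into a rough helper, using the common "~4 characters per token" heuristic for English text (an approximation, not a real tokenizer):

```python
# Rough sketch of the ~120% sizing rule; the chars/4 estimate is a
# heuristic for English, not a substitute for actual token counting.
def suggest_max_tokens(longest_expected_output: str, buffer: float = 1.2) -> int:
    estimated_tokens = len(longest_expected_output) / 4
    return max(1, int(estimated_tokens * buffer))
```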
Stop Sequences
Stop sequences tell the model to stop generating text when it produces a specific string or token.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    stop_sequences=["###", "END", "\n\nUser:"],
    messages=[...]
)
```
Common uses:
- Structured generation: Stop at a delimiter to extract just the section you want
- Multi-turn control: Stop at "User:" to prevent the model from role-playing both sides of a conversation
- Format enforcement: Stop after the JSON closes by using "}" as a stop sequence
Stop sequences are especially useful when you're parsing model output programmatically and need a reliable termination point.
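One parsing detail worth knowing: APIs generally omit the matched stop sequence from the returned text (Anthropic's does), so when using "}" as a stop sequence you typically need to re-append the brace before parsing. A minimal sketch:

```python
# Sketch: recovering JSON when "}" was consumed as the stop sequence.
import json

def parse_stopped_json(text, stop_sequence="}"):
    """Re-attach the stop sequence generation halted on, then parse."""
    return json.loads(text + stop_sequence)

raw = '{"name": "Ada", "role": "engineer"'   # brace consumed by the stop
record = parse_stopped_json(raw)
```

Note the caveat: "}" halts generation at the first closing brace, so this approach only suits flat, non-nested objects.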
Top-K
Less commonly exposed than temperature and top-p, top-K limits the model to choosing from the K most likely tokens at each step.
With top_k=40, only the 40 highest-probability tokens are eligible at each generation step.
Most APIs don't expose top-K or default it to a sensible value. If you can set it: lower values produce more predictable output; higher values allow more diversity. It's most useful for very constrained generation tasks where you want the model to stay within a tight vocabulary.
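The filtering itself is even simpler than top-p and can be sketched the same way (toy tokens and probabilities, for illustration): keep the k highest-probability tokens and renormalize.

```python
# Toy sketch of top-k truncation.
def top_k_filter(probs, k):
    """probs: {token: probability}. Keep the k most likely, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

probs = {"red": 0.4, "blue": 0.3, "green": 0.2, "mauve": 0.1}
print(top_k_filter(probs, 2))          # only the two most likely survive
```

Unlike top-p, the cutoff is a fixed count rather than a probability mass, so it doesn't adapt to how peaked or flat the distribution is at each step.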
The Parameters at a Glance
| Parameter | What it controls | Increase for | Decrease for |
|---|---|---|---|
| temperature | Randomness of token selection | More creativity | More consistency |
| top_p | How many tokens are candidates | More variety | Fewer odd choices |
| max_tokens | Maximum output length | Longer responses | Shorter, cheaper |
| Stop sequences | Where generation halts | N/A | Precise cutoff points |
| top_k | How many tokens are eligible | More variety | More predictable |
Putting It Together
For a code generation task where you need reliable, consistent output:
`temperature=0.1, top_p=0.95, max_tokens=4096`
For a brainstorming task where you want varied ideas:
`temperature=0.9, top_p=1.0, max_tokens=1024`
For a data extraction task where the output must match a schema:
`temperature=0.0, max_tokens=512, stop_sequences=["}"]`
The key insight: temperature and top-p are about how the model chooses tokens; max_tokens is about how long it goes; stop sequences are about where it stops. These are independent controls, and combining them lets you tune the output precisely for your use case.