Key takeaways:
- Every LLM call has these dials and most developers leave them at default. Choosing them deliberately is one of the highest-leverage technical adjustments in production prompting.
- Tune temperature or top-p, not both. Provider docs explicitly recommend this. The interaction is hard to reason about and tuning both simultaneously is a recipe for non-reproducible debugging.
- Reasoning models flatten the temperature lever. The internal deliberation determines accuracy; sampling temperature mostly affects surface phrasing. Do not lower temperature on a reasoning model and expect it to "get smarter."
- Frequency and presence penalties default to 0 for a reason — they help long open-ended generation and hurt structured output, code, and any text that legitimately repeats. Reach for them only for a specific repetition problem.
- Seed gives you best-effort reproducibility, not guaranteed reproducibility. OpenAI exposes it, Anthropic does not as of 2026, and even at temperature 0 hosted endpoints drift over time as infrastructure changes.
- The sampling parameters are not where prompt quality comes from. Structure (see RCAF) and validation (see Quality Rubric) matter more. Sampling is the last 10%, not the first 90%.
Every modern LLM exposes the same handful of dials at the API surface: temperature, top-p, top-k, frequency penalty, presence penalty, seed, stop sequences, and max tokens. The defaults are usually fine for chat, which is why most developers never touch them. Once you ship a prompt to production — extraction, classification, code, tool use, RAG, agents — the defaults stop being fine, and the right parameter for the job is rarely the default.
This is a reference guide. It defines each parameter, explains the math at the level you actually need, lists per-provider defaults where they are documented, names the interactions that bite you in production, and gives concrete temperature ranges per task. It is not a tutorial; it assumes you have already shipped a prompt that calls an LLM and now want to tune it.
The shape of the guide: parameters first, then the rule about not tuning temperature and top-p together, then per-task recommendations, then a per-provider defaults table, then the failure modes that show up in real systems. Bookmark the sections you reach for; skim the rest.
What sampling actually does
A language model takes the input tokens and produces, for the next token, a vector of unnormalized scores called logits — one score per token in its vocabulary, typically tens to hundreds of thousands of tokens. A softmax converts those logits to a probability distribution over the whole vocabulary. Generating the next token means picking one token from that distribution.
The simplest pick is greedy decoding — always take the highest-probability token. Greedy is deterministic and often boring. It also has a structural failure mode: when two tokens have nearly identical probabilities, greedy commits to one based on tie-breaking and never explores the other path, which can lead the model into low-quality completions it would have escaped with a tiny bit of randomness.
Sampling is the alternative. Instead of always picking the top token, sample from the distribution proportionally. Pure sampling is too random for most uses, which is why every modern API exposes parameters that reshape the distribution before sampling — making it sharper or flatter, cutting the long tail, or preventing repetition. Temperature, top-p, top-k, and the penalties are all distribution-reshaping operations. Stop sequences, max tokens, and seed control the loop and the randomness source. That is the entire shape of the surface area.
The mental model: sampling parameters do not change what the model knows or how it ranks tokens. They change which tokens the sampler is allowed to pick from and how likely each candidate is to win.
Temperature
Temperature is a scalar that divides the logits before the softmax. If T is temperature and z_i is the logit for token i, the sampler computes the probability as softmax(z_i / T). The math has three regimes:
- T = 1: pass-through. The model's native distribution is what gets sampled.
- T < 1: the distribution sharpens. High-probability tokens get more of the mass; low-probability tokens get less. As T → 0, the distribution collapses onto the top token and the sampler becomes greedy.
- T > 1: the distribution flattens. The top token loses some of its lead; lower-ranked tokens become more likely. As T → ∞, the distribution approaches uniform over the vocabulary.
Most APIs expose temperature as a value between 0 and 2.
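The three regimes are easy to see directly. A minimal, stdlib-only sketch of temperature-scaled softmax over a toy three-token vocabulary (the logit values are illustrative):

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. T < 1 sharpens, T > 1 flattens."""
    if temperature == 0:  # treat T = 0 as greedy: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 1.0))  # native distribution
print(apply_temperature(logits, 0.5))  # sharper: the top token gains mass
print(apply_temperature(logits, 2.0))  # flatter: the tail gains mass
```

Running this shows the top token's probability rising as T drops and falling as T rises, with the total always summing to 1 — exactly the three regimes above.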
A few practical points the math implies:
- Temperature 0 is not perfectly deterministic on hosted endpoints. Floating-point rounding in batched inference, occasional ties, and infrastructure non-determinism mean even temperature 0 can produce different outputs on the same input. It is usually deterministic enough to ship, not guaranteed deterministic.
- Above ~1.2, output quality degrades fast for most models. The flattened distribution starts including tokens that are syntactically wrong, off-topic, or hallucinated. The model has not gotten "more creative" — it has been forced to roll on dice it would normally avoid.
- Temperature does not unlock new ideas. The model's vocabulary and its conditional probabilities do not change. Higher temperature lets you sample further down the ranking; it cannot suggest tokens the model never assigned probability to.
The practical range for production work is 0 to roughly 1. Anything above 1 should be a deliberate choice for a specific task (brainstorming, multi-sample generation), not a default.
Top-p (nucleus sampling)
Top-p, also called nucleus sampling, takes the candidate tokens in descending order of probability and keeps the smallest prefix whose probabilities sum to at least p. Everything outside that nucleus is dropped to zero probability; the sampler then samples from the renormalized nucleus.
The legal range is 0 to 1.
- top-p = 1: no filtering. The whole distribution is in scope (subject to whatever temperature has done to it).
- top-p = 0.9: drop the long tail. The bottom 10% of cumulative probability mass cannot be sampled. This is a common default.
- top-p = 0.1: aggressive filtering. Only the very top of the distribution is in scope. Behavior approaches greedy.
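The truncation step can be sketched in a few lines; the probabilities below are toy values chosen to make the adaptivity visible:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose mass >= p,
    zero out the rest, and renormalize. Returns (index, prob) pairs kept."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, prob in ranked:
        nucleus.append((idx, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers p mass
            break
    total = sum(prob for _, prob in nucleus)
    return [(idx, prob / total) for idx, prob in nucleus]

# Confident distribution: the nucleus is a single token.
print(top_p_filter([0.90, 0.05, 0.03, 0.02], p=0.9))
# Uncertain distribution: the same p keeps all four candidates.
print(top_p_filter([0.30, 0.28, 0.22, 0.20], p=0.9))
```

The same p value keeps one token in the first case and four in the second — that is the adaptivity that makes top-p the preferred truncation method.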
The key property of top-p is that it is adaptive. When the model is highly confident, the nucleus is small (a few tokens carry all the mass); when the model is uncertain, the nucleus expands to include more candidates. That adaptivity is why top-p has largely replaced top-k as the default truncation method for hosted APIs.
Top-p interacts with temperature: temperature reshapes the distribution first, then top-p truncates the reshaped distribution. A high temperature combined with a low top-p can produce a sharper-than-default distribution (the temperature flattens, the top-p cuts the new long tail), which is occasionally useful but rarely needed.
Top-k
Top-k keeps only the k highest-probability tokens and zeros out the rest. Unlike top-p, it does not adapt to the model's confidence — it always keeps exactly k tokens.
- top-k = 1: greedy.
- top-k = 50: a common open-weights default.
- top-k = 0 or unset: no filtering on count.
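For contrast with the top-p sketch, a top-k filter over the same toy distributions — note that the pool size never adapts:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, zero the rest, renormalize."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:k] if k > 0 else ranked  # k = 0 / unset: no filtering
    total = sum(prob for _, prob in kept)
    return [(idx, prob / total) for idx, prob in kept]

# Always exactly k candidates, whether the distribution is skewed or flat.
print(top_k_filter([0.90, 0.05, 0.03, 0.02], k=2))
print(top_k_filter([0.30, 0.28, 0.22, 0.20], k=2))
```

With the skewed distribution, k=2 keeps a near-noise second candidate; with the flat one, k=2 cuts off two candidates that were nearly as likely as the winner. Both failure cases from the paragraph below fall out of the code directly.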
Top-k was the original truncation parameter in early language models, but most modern hosted APIs have moved to top-p as the primary truncation control. OpenAI's API does not expose top-k at all. Anthropic exposes it. Google exposes it. Most open-weights inference stacks (Hugging Face Transformers, vLLM, llama.cpp) expose it.
The case for top-k over top-p: it is simpler to reason about. You always know exactly how many candidates are in the pool. The case against: when the model's distribution is highly skewed, top-k can include tokens with vanishingly small probability that are essentially noise; when the model's distribution is flat, top-k can cut off useful candidates. Top-p handles both cases by tracking probability mass instead of count.
In 2026 production work, top-k is mostly tuned for open-weights deployments where you want fine control over the inference loop. For hosted APIs, leave it at default (or unexposed) and tune top-p or temperature instead.
Temperature vs top-p: pick one
Most provider documentation states this explicitly. OpenAI's docs say: "We generally recommend altering this or top_p but not both." Anthropic's documentation makes a similar recommendation. The reason is that the two parameters interact through the same softmax, and the joint effect is hard to predict.
The practical rule:
- Default workflow: leave top-p at its default (usually 0.9 or 1.0) and tune temperature.
- Tune top-p instead when you want to keep temperature at default for response style consistency but cut off the long tail more aggressively (e.g., to reduce occasional weird tokens in customer-facing output).
- Tune both only deliberately, with eval data showing the joint setting outperforms either alone. This is rare and almost always a sign you are over-fitting parameters to your eval set.
If you cannot articulate which problem each parameter is solving in your specific case, you are tuning both and you should stop. Pick one.
Frequency penalty and presence penalty
Both parameters penalize repetition; each is applied as a logit adjustment before sampling.
Frequency penalty subtracts a value proportional to how often a token has already appeared in the output. The more times the token has shown up, the harder it becomes to generate it again. Useful when a model is looping on a single phrase ("The product is great. It is great. It is also great.").
Presence penalty subtracts a fixed value once a token has appeared at all, regardless of count. Useful when you want to push the model toward new topics or new vocabulary, not just stop it from repeating the same word three times.
Both default to 0 in OpenAI's API. The legal range is roughly -2.0 to 2.0. Negative values encourage repetition (rarely useful but occasionally helpful for sticking to specific terminology).
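OpenAI documents the combined adjustment as roughly `logit -= count * frequency_penalty + (1 if seen else 0) * presence_penalty`. A sketch of that formula (token IDs and logit values are toy inputs):

```python
from collections import Counter

def penalize_logits(logits, generated_token_ids, freq_penalty, pres_penalty):
    """Adjust logits for tokens already generated, following OpenAI's
    documented formula: subtract count * frequency_penalty plus a flat
    presence_penalty for any token seen at least once."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * freq_penalty  # scales with repetition
        adjusted[token_id] -= pres_penalty          # flat, once seen at all
    return adjusted

# Token 2 has appeared three times; the frequency penalty hits it hardest.
logits = [1.0, 1.0, 1.0]
print(penalize_logits(logits, [2, 2, 2, 0], freq_penalty=0.5, pres_penalty=0.3))
```

Note that a keyword like `def` or a JSON key takes exactly the same hit as a looping filler phrase — the penalty has no notion of legitimate repetition, which is why the "where they hurt" list below exists.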
Where they help:
- Long open-ended generation (essays, stories) that drifts into self-repetition.
- Brainstorming where you want diverse suggestions, not three variants of the same idea.
Where they hurt:
- Code. Code legitimately repeats identifiers, function names, and keywords. Penalizing repetition in code generation suppresses the very tokens that make the code valid.
- Structured output. JSON keys, XML tags, and field names repeat across records. Penalizing them produces invalid output.
- Lists with proper nouns. "Chicago", "Chicago", "Chicago" might all be correct in a list of Chicago neighborhoods. The penalty does not know.
Default to 0. Only raise the penalty when you have a specific repetition problem you can show on eval data, and lower it again as soon as the problem is solved. Both penalties are easy to over-tune; you set them at 0.5 to fix one issue and accidentally break three others.
Seed and reproducibility
Reproducibility in LLMs is best-effort, not guaranteed. The seed parameter is the closest thing to a determinism control on hosted APIs.
- OpenAI exposes seed. The same seed plus the same input plus the same system_fingerprint (returned in the response) produces the same output most of the time. When OpenAI updates the model or backend, the system_fingerprint changes and reproducibility breaks.
- Anthropic does not expose a stable seed parameter as of early 2026. Reproducibility on Claude requires temperature 0 plus identical inputs and is best-effort.
- Google Gemini exposes seed in some endpoints; behavior varies.
- DeepSeek and Mistral vary by endpoint and version.
- Open-weights models running on your own hardware can be made fully deterministic by setting seed, temperature 0, and controlling the inference backend (single-batch, deterministic kernels). This is the only setting where strict determinism is achievable.
For evals: capture the full request payload, the response, the model name, and the system_fingerprint where available. Expect occasional drift even at temperature 0 with a fixed seed. Rerun eval cases periodically rather than trusting that yesterday's eval result still holds today.
For production: do not rely on reproducibility for correctness. Validate outputs structurally (schema check, regex, length) rather than asserting exact-match against a previous run.
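One way to sketch the eval-capture advice — the field names here are illustrative, not a provider schema, and the helper is a stand-in for whatever your eval harness records:

```python
import hashlib
import json

def make_eval_record(request_payload, response_text, model, system_fingerprint=None):
    """Capture everything needed to re-run and compare an eval case later.
    Hashing the canonicalized request makes drift detection cheap: same
    hash + different response = the endpoint changed under you."""
    canonical = json.dumps(request_payload, sort_keys=True)
    return {
        "request": request_payload,
        "request_hash": hashlib.sha256(canonical.encode()).hexdigest(),
        "response": response_text,
        "model": model,
        "system_fingerprint": system_fingerprint,  # None where unavailable
    }

record = make_eval_record(
    {"messages": [{"role": "user", "content": "Extract the date."}],
     "temperature": 0, "seed": 42},
    response_text='{"date": "2026-01-15"}',
    model="gpt-4.1",
    system_fingerprint="fp_example",
)
print(record["request_hash"][:12])
```

When a periodic rerun produces a different response for an identical request hash, you have evidence of drift rather than a bug in your prompt — which is the distinction the advice above is protecting.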
Stop sequences and max tokens
These are the boundary controls. They do not shape the distribution; they bound the loop.
Stop sequences are strings that, when generated, halt output immediately. The matched stop string itself is not included in the response. Stops are essential in three cases:
- Agent loops: when an agent emits a sentinel like </tool_call> or END_OF_PLAN, you stop generation and dispatch the parsed output. Without a stop sequence, the model may keep generating past the structure boundary and produce garbage you have to clean up.
- Format-bounded outputs: when generating a structured prefix (e.g., a JSON object that ends with } followed by a newline), a stop sequence can guarantee the model does not keep going and add commentary.
- Conversational role markers: in chat templating where the model might otherwise hallucinate the next user turn.
Most APIs accept multiple stop sequences (typically up to 4). Stops are exact-string matches and are case-sensitive on most providers.
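The semantics are easy to pin down with a client-side sketch that mirrors what providers do server-side (truncate at the earliest match, exclude the stop string itself):

```python
def apply_stops(text, stop_sequences):
    """Truncate at the earliest stop sequence. The matched stop string is
    excluded from the result, mirroring hosted-API behavior."""
    cut = len(text)
    for stop in stop_sequences:
        pos = text.find(stop)  # exact, case-sensitive match
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut]

raw = '{"tool": "search", "query": "flights"}</tool_call> trailing junk'
print(apply_stops(raw, ["</tool_call>", "END_OF_PLAN"]))
# → {"tool": "search", "query": "flights"}
```

A client-side version like this is also a useful safety net when calling a backend whose stop handling you do not fully trust — it costs nothing and catches the case where the stop string slipped through.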
Max tokens caps the number of tokens the model can generate in a single response. Two reasons to set it explicitly:
- Cost and latency control: an unbounded response can run thousands of tokens longer than needed, which costs money and adds latency.
- Safety bound in agent loops: if a stop sequence misses (typo, model drift), max_tokens is your fallback to prevent runaway generation.
In agent contexts, max_tokens should be set tight enough to bound a single turn but loose enough to accommodate the longest legitimate output. Stop sequences should be set to your structural sentinels. Both together are belt-and-suspenders, and you want both.
Per-task recommendations
Concrete temperature ranges by task. These are starting points; adjust based on eval data.
Deterministic extraction and classification — temperature 0. When the task has one right answer (extract the date, classify the sentiment, name the entity), you want the model's top-ranked token every time. Any randomness here is pure downside. Pair with structured decoding where the output format is enumerable.
Code generation — temperature 0 to 0.3. Code is graded by whether it compiles and runs. The highest-confidence completion is almost always the safest. A small amount of temperature (0.1-0.3) is occasionally useful for getting the model out of a stuck pattern, but the default should be 0.
Structured output / JSON — temperature 0 to 0.2. Schema-conforming output is a structural task, not a creative one. Combine with structured decoding when available. If you cannot use structured decoding, at minimum specify the schema in the prompt and validate the output before consuming it.
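A minimal sketch of the "validate before consuming" step, using only the standard library; the required-key check is illustrative and would be a full schema validation in practice:

```python
import json

def validate_output(raw, required_keys):
    """Structural validation of a model response before consuming it.
    Returns (parsed_dict, None) on success or (None, error_message) on failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(parsed, dict):
        return None, "expected a JSON object"
    missing = [k for k in required_keys if k not in parsed]
    if missing:
        return None, f"missing keys: {missing}"
    return parsed, None

ok, err = validate_output('{"date": "2026-01-15", "sentiment": "positive"}',
                          ["date", "sentiment"])
print(ok, err)
bad, err = validate_output('{"date": "2026-01-15"', ["date"])  # truncated JSON
print(bad, err)
```

The failure path matters as much as the success path: a truncated or malformed response should route to a retry or an error handler, never into downstream code that assumes the schema held.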
Tool use and function calling — temperature 0 to 0.2. Tool calls have to validate against a schema (right tool name, right argument names, well-typed values). Higher temperature introduces non-zero probability of wrong tool selection or hallucinated arguments. Combine with tool_choice (see the tool-choice glossary entry) to constrain the model to the right tool when you know which one should run.
RAG answering — temperature 0.2 to 0.5. RAG answers should be grounded in retrieved context, which argues for low temperature; they should also read naturally, which argues for some temperature. The middle range balances. Going too low produces stilted, copy-paste answers; going too high invites the model to drift away from the retrieved evidence.
Conversational chat — temperature 0.7 to 1.0. The default range that most chat APIs use. Coherent, varied, naturally phrased. This is what users expect from a chatbot.
Creative writing and brainstorming — temperature 0.8 to 1.2. High enough to suggest combinations the model would not produce at default, low enough to stay coherent. Above 1.2 quality degrades fast on most models. For brainstorming specifically, generating multiple samples at moderate temperature often beats a single sample at very high temperature.
Multi-sample self-consistency — temperature 0.6 to 1.0 with N samples. Self-consistency generates multiple samples and votes across them. The temperature has to be high enough to produce diversity (otherwise all samples are the same) but low enough that each sample is reasonable on its own. 0.7 with N=5 is a common starting point. The technique is most useful for math, multi-step reasoning, and any task where the right answer is verifiable but the path to it varies.
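The voting loop is a few lines. In this sketch, `sample_fn` is a stand-in for an LLM call made at moderate temperature that returns a final answer string:

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Generate n samples and majority-vote the final answers.
    Returns the winning answer and the agreement ratio."""
    answers = [sample_fn() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Stub: simulate a model that lands on the right answer 3 times out of 5.
samples = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistency(lambda: next(samples), n=5)
print(answer, agreement)  # → 42 0.6
```

The agreement ratio is worth keeping: low agreement across samples is itself a signal that the problem is hard for the model, which you can route to a stronger model or a human.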
Reasoning models are different
The o-series from OpenAI, Claude with extended thinking, Gemini Deep Think, and DeepSeek R1 are reasoning models — they spend tokens on internal deliberation before producing the final answer. The deliberation process changes how sampling parameters behave.
- OpenAI's reasoning models do not accept a temperature parameter at all. The API rejects the request if you try to set one. Reasoning effort is controlled via a separate parameter; the rest of sampling is opaque to the user.
- Claude with extended thinking still accepts temperature, but the reasoning portion of the output is largely invariant to it. Temperature mostly affects the surface phrasing of the final answer, not the internal chain.
- Gemini Deep Think and DeepSeek R1 behave similarly — the reasoning quality is set by the model's internal process, not by sampling.
The implication: do not lower temperature on a reasoning model expecting it to "get smarter." Accuracy on a reasoning model is set by the reasoning process and the difficulty of the problem, not by sampling. If a reasoning model is wrong, the fix is a better prompt, more reasoning effort, or a different model — not a lower temperature.
For full coverage of how to prompt reasoning models, see the AI Reasoning Models Prompting Complete Guide 2026. The short version: state the problem clearly, do not hand-hold the chain of thought (the model already does it), give the model room (do not over-constrain output format mid-reasoning), and validate the answer.
Per-provider defaults
Defaults change as providers ship model updates. The values below are documented or widely attested as of early 2026; check the provider's API reference for the version of the model you are calling. Where a value is not explicitly documented, the column shows "varies."
| Provider / model | Default temperature | Default top-p | Top-k exposed | Seed exposed | Notes |
|---|---|---|---|---|---|
| OpenAI GPT-4o, GPT-4.1 | 1.0 | 1.0 | No | Yes | Frequency and presence penalty exposed; stops up to 4. |
| OpenAI o-series (o3, o4-mini) | not accepted | not accepted | No | Yes | Sampling parameters not configurable; reasoning effort controls behavior. |
| Anthropic Claude Sonnet 4.6 | 1.0 | varies | Yes | No (no stable seed) | Top-k exposed; stop_sequences up to 4. |
| Anthropic Claude Opus 4.7 | 1.0 | varies | Yes | No (no stable seed) | Same surface as Sonnet. Extended thinking does not require parameter changes. |
| Google Gemini 2.5 Pro | varies | varies | Yes | varies | Defaults documented per endpoint; check the specific API surface. |
| Google Gemini 2.5 Flash | varies | varies | Yes | varies | Same surface as Pro. |
| DeepSeek V3 / R1 | varies | varies | varies | varies | OpenAI-compatible API surface in most clients; check provider docs for current defaults. |
| Mistral (hosted) | varies | varies | varies | varies | OpenAI-compatible surface; defaults differ by model. |
| Open-weights via Hugging Face / vLLM | varies | varies | Yes | Yes | Full control over the inference backend; deterministic settings achievable. |
Three things this table is not:
- It is not a substitute for the provider's API reference. Defaults shift across model versions; the row above is a starting point, not an authority.
- It does not capture every parameter. Providers expose dozens of additional knobs (logit_bias, response_format, tool_choice, parallel_tool_calls, repetition_penalty, mirostat for some open-weights, etc.). The columns above are the universally meaningful ones.
- It does not say which defaults are good. Default 1.0 temperature is fine for chat; it is wrong for extraction. The defaults exist for the most common use case (conversation), which is rarely the use case you are tuning for.
Common failure modes
The bugs that show up in production sampling configurations.
Tuning temperature on a reasoning model and expecting accuracy gains. Already covered. The reasoning process sets accuracy. Sampling is decoration.
Tuning both temperature and top-p simultaneously. The two interact through the same softmax. Tuning both at once means you cannot attribute behavior changes to either parameter individually. Pick one.
Frequency penalty too high on code or structured output. A frequency penalty above 0.5 starts suppressing legitimately repeated tokens — function names in code, field names in JSON, repeated entities in lists. The output looks subtly broken (renamed variables, missing fields, dropped entities) and the cause is hard to find unless you know to look at the penalty.
Forgetting stop sequences in agent loops. An agent that emits </tool_call> as its sentinel needs </tool_call> as a stop sequence. Without the stop, the model often keeps generating past the structure boundary, producing extra text the parser has to handle (or fail on). Always set stops at the structural boundaries of your agent's output contract. See the Agentic Prompt Stack for where these contracts live.
Assuming temperature 0 means perfectly deterministic. It does not, on any hosted endpoint, ever. Floating-point non-determinism in batched inference and silent infrastructure changes break determinism even at temperature 0 with a fixed seed. Build evals that tolerate occasional drift; do not assert exact-match across runs.
Not setting max_tokens. An unbounded response can run thousands of tokens longer than needed. The cost and latency hit is real and avoidable. Set max_tokens to the longest legitimate output for the task plus a margin.
Setting max_tokens too tight. The opposite failure: the model runs out of tokens mid-output, leaving you with a truncated JSON or half-finished response. The fix is not "always set max_tokens high" — it is "set max_tokens to the actual longest legitimate output for this prompt, measured on eval data, plus a margin."
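Both max_tokens failure modes are detectable before parsing: providers report why generation stopped. A sketch normalizing the two finish signals I'm aware of (OpenAI reports `finish_reason == "length"`, Anthropic reports `stop_reason == "max_tokens"`; verify against your provider's current response schema):

```python
def is_truncated(finish_reason):
    """True when the response hit the max_tokens cap rather than finishing
    naturally. Normalizes OpenAI's and Anthropic's signal values."""
    return finish_reason in ("length", "max_tokens")

# A truncated response should be retried with a higher cap (or flagged),
# never handed to a parser as if it were complete.
for reason in ("stop", "length", "max_tokens"):
    print(reason, "truncated" if is_truncated(reason) else "ok")
```

Checking the finish reason turns "half-finished JSON mysteriously fails to parse" into an explicit, loggable condition, which is where the measured-on-eval-data cap above gets its feedback loop.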
Reaching for sampling when the prompt is the problem. When a prompt produces wrong outputs, the first instinct is to tweak temperature. The right first instinct is to score the prompt against the SurePrompts Quality Rubric. A prompt that scores 18/35 on quality is not going to be fixed by temperature 0.3 versus 0.5. Fix the prompt; then tune.
What's next
The sampling parameters control how the model picks tokens from a distribution it computed. They do not control the format of the output once picked. For tasks where the output must conform to a schema — JSON, function arguments, enumerated values — the right tool is structured decoding, which constrains the sampler to only emit tokens that keep the partial output valid against a grammar or schema. Structured decoding gives you correctness guarantees that no temperature setting can match.
For agent contexts, the parameters in this guide combine with the patterns described in the Agentic Prompt Stack — temperature near 0, tight stop sequences, max_tokens bounded per step, tool_choice constrained when appropriate.
For modality-specific work, the sampling parameters often look different. Image, video, voice, and multimodal generation have their own samplers (CFG scale, denoising steps, classifier guidance) that overlap conceptually with temperature but are not the same parameter. The pillar guides cover those:
- AI Image Prompting Complete Guide 2026
- AI Video Prompting Complete Guide 2026
- AI Voice & Audio Prompting Complete Guide 2026
- AI Multimodal Prompting Complete Guide 2026
For provider-specific defaults and idioms:
- Claude Opus 4.7 Prompting Guide
- Claude 4 Prompting Guide
- Best DeepSeek Prompts 2026
- Advanced Prompt Engineering 2026: Claude, GPT-5, Gemini
Our position
- Tune temperature or top-p, not both. The interaction is unpredictable and tuning both simultaneously means you cannot debug either.
- For production prompts that need correctness (extraction, classification, code, tool use, structured output), default to temperature 0 to 0.3 and validate the output structurally.
- For conversational and creative work, default to 0.7 to 1.0 and leave top-p at its provider default.
- Frequency and presence penalty default to 0 for a reason. Raise them only for a specific repetition problem you can show on eval data; lower them again as soon as the problem is solved.
- Reasoning models flatten the temperature lever. The deliberation process determines accuracy; sampling is surface phrasing. Do not tune temperature on o3 expecting it to get smarter.
- Always set max_tokens. Always set stop sequences in agent loops. The boundary controls are not optional.
- Sampling is the last 10% of prompt quality, not the first 90%. Score your prompt with the SurePrompts Quality Rubric and structure it with RCAF before reaching for the sampler.
Related reading
- The SurePrompts Quality Rubric — score the prompt before tuning the sampler.
- RCAF Prompt Structure — the drafting skeleton that fixes most of what people blame on sampling.
- Agentic Prompt Stack — where stop sequences and max_tokens earn their keep.
- AI Reasoning Models Prompting Complete Guide 2026 — how prompting changes when temperature is no longer a meaningful lever.
- Advanced Prompt Engineering 2026: Claude, GPT-5, Gemini — provider-specific idioms across the major models.
- Claude Opus 4.7 Prompting Guide — Claude-specific defaults and patterns.
- Best DeepSeek Prompts 2026 — DeepSeek-specific defaults and patterns.
- Structured Decoding — when correctness matters more than sampling.
- Tool Choice — the parameter that pairs with low temperature for tool-use reliability.
- Self-Consistency — the multi-sample pattern that needs moderate temperature to work.
- Extended Thinking — Claude's reasoning mode and how it interacts with sampling.
- Chain of Thought — the reasoning pattern reasoning models internalize.
- Model Cascade — when sampling tuning is not enough and you need a different model in the loop.