Why the Same Prompt Gives You a Different Answer Every Time

Have you ever wondered why asking an AI the exact same question twice gives you two completely different answers? It is not a bug. It is a deliberate design decision baked into how language models generate text, and the mechanism behind it is more interesting than most people realize.

The Model Does Not Pick One Word

Here is the first thing to understand. When an AI model generates a response, it does not look at your prompt and decide: this is the next word. Instead, for every position in the output, it produces a probability score for every token in its entire vocabulary.

A vocabulary might contain 50,000 or more tokens. For each step of generation, the model assigns every one of those tokens a score representing how likely it is to come next given everything before it. The word "the" might score very high. The word "algorithm" might score moderately. The word "elephant" might score near zero.

After those raw scores (often called logits) are computed, a function called softmax converts them into a proper probability distribution where all values sum to one. Now the model has a list of every possible next token and the probability that each one should come next.
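The conversion from scores to probabilities can be sketched in a few lines. This is a minimal illustration, not any particular model's code: the three-token vocabulary and the score values are made up for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    # Subtracting the maximum avoids overflow in exp() and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a toy three-token vocabulary.
vocab = ["the", "algorithm", "elephant"]
scores = [4.0, 2.0, -3.0]

probs = softmax(scores)
for token, p in zip(vocab, probs):
    print(f"{token:>10}: {p:.4f}")
```

A real model does the same thing over tens of thousands of tokens at every generation step.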

The question is: what does it do with that list?

Temperature: Sharpening or Flattening the Distribution

This is where temperature comes in.

Temperature is a number that reshapes the probability distribution before a token is selected. In practice, the raw scores are divided by the temperature before softmax turns them into probabilities. A low temperature (below one) makes high-probability tokens even more dominant. A high temperature (above one) flattens the distribution and gives lower-probability tokens a more meaningful chance of being selected. Temperature does not change the ranking of tokens: the most likely token stays the most likely. It changes how much that ranking matters when making the final selection.
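A small sketch makes the effect concrete. It assumes the common formulation in which raw scores are divided by the temperature before softmax is applied; the function name and the score values are illustrative, not taken from any specific library.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Divide the scores by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this formula")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three tokens, most likely first.
scores = [4.0, 2.0, -3.0]

cold = softmax_with_temperature(scores, 0.5)     # sharper distribution
neutral = softmax_with_temperature(scores, 1.0)  # unchanged distribution
hot = softmax_with_temperature(scores, 2.0)      # flatter distribution

for name, dist in (("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in dist])
```

In all three cases the order of the tokens is the same. Only the gap between them changes.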

At a temperature of zero, the model always picks the single most probable token at every step, which is the same as greedy decoding. In principle the output becomes deterministic and repeatable: ask the same question a hundred times and you get the same answer every time. In practice, hosted systems can still show small variations caused by floating-point and hardware effects. This setting is useful for tasks where you want consistent, predictable output, such as code generation or data extraction.

At a temperature of one, the model samples according to the raw probabilities. Tokens the model thinks are likely get chosen often but not always. There is room for variety.

As the temperature rises above one, the distribution becomes flatter, and at very high values it approaches uniform. Tokens the model considers unlikely start to get a real chance of being chosen. The output becomes more creative, more surprising, and also more likely to drift into incoherence.
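The three regimes can be seen by drawing many tokens from one fixed set of scores. The sketch below is a toy experiment under stated assumptions: the vocabulary and scores are invented, temperature zero is special-cased as "take the top score" because dividing by zero is undefined, and a seeded random generator is used only to make the run repeatable.

```python
import math
import random
from collections import Counter

def sample_token(vocab, scores, temperature, rng):
    """Pick one token; temperature 0 is treated as greedy selection."""
    if temperature == 0:
        return vocab[scores.index(max(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(vocab, weights=weights, k=1)[0]

# Hypothetical toy vocabulary and raw scores.
vocab = ["the", "algorithm", "elephant"]
scores = [4.0, 2.0, -3.0]
rng = random.Random(0)

counts = {}
for temperature in (0, 1.0, 5.0):
    draws = Counter(sample_token(vocab, scores, temperature, rng) for _ in range(1000))
    counts[temperature] = draws
    print(f"T={temperature}: {dict(draws)}")
```

At zero every draw is the same token. At one the top token dominates but does not win every time. At five even the least likely token shows up regularly.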

Sampling Methods: How the Token Is Actually Chosen

Temperature adjusts the shape of the distribution. But there are also different strategies for how to sample from it once it has been adjusted.

Greedy decoding always picks the highest-probability token. It is fast and simple, but it tends to produce repetitive, flat text because it never explores alternatives.

Top-K sampling restricts the selection to only the K most probable tokens and samples from those. If K is 50, the model samples among the 50 most likely options in proportion to their probabilities, ignoring everything else. This cuts out unlikely nonsense while keeping meaningful variety.
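Top-K can be sketched as follows. This is a minimal illustration with an invented five-token vocabulary. The names `top_k_sample` and `softmax` are chosen for the example and are not a library API.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_sample(vocab, scores, k, rng):
    """Sample one token from the k most probable tokens."""
    probs = softmax(scores)
    # Indices of the k most probable tokens, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # The draw is weighted by probability, not uniform;
    # random.choices renormalizes the surviving weights.
    weights = [probs[i] for i in top]
    chosen = rng.choices(top, weights=weights, k=1)[0]
    return vocab[chosen]

# Hypothetical toy vocabulary and raw scores.
vocab = ["the", "a", "algorithm", "model", "elephant"]
scores = [4.0, 3.0, 2.0, 1.0, -3.0]
rng = random.Random(0)

seen = {top_k_sample(vocab, scores, k=2, rng=rng) for _ in range(500)}
print(sorted(seen))
```

With K set to 2, only the two highest-scoring tokens can ever appear, no matter how many draws are made. With K set to 1, the method reduces to greedy decoding.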

Top-P sampling, also called nucleus sampling, works differently. Instead of fixing the number of tokens considered, it takes the smallest group of tokens whose combined probability adds up to at least P. If P is 0.9, the model finds however many top tokens are needed to account for 90 percent of the probability mass and samples from those.

[Figure: Top-K vs Top-P, how each method cuts the distribution differently]
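The selection step of nucleus sampling can be sketched like this. The function name `top_p_filter` and both example distributions are made up for illustration. The sketch shows only which tokens survive the cut; the final draw would then be made among the survivors in proportion to their probabilities.

```python
def top_p_filter(probs, p):
    """Return indices of the smallest set of top tokens whose total probability is at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Two hypothetical distributions over four tokens.
peaked = [0.92, 0.05, 0.02, 0.01]  # the model is confident
flat = [0.30, 0.28, 0.22, 0.20]    # the model is unsure

print(top_p_filter(peaked, 0.9))  # → [0]
print(top_p_filter(flat, 0.9))    # → [0, 1, 2, 3]
```

The same P keeps one token when the model is confident and four when it is not. That adaptive size is the main difference from Top-K, which always keeps a fixed number.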

Most production systems combine these methods. A typical configuration might apply a temperature, then apply Top-P, then sample from the resulting narrowed distribution. Each layer of control shapes what the final output looks like.

[Figure: Greedy, Top-K, and Top-P side by side, same model, three different behaviours]
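A combined pipeline in the order described above (temperature, then Top-P, then a draw) might look like the sketch below. The function name, default values, vocabulary, and scores are all assumptions made for the example. Real systems differ in which steps they include and in what order.

```python
import math
import random

def sample_next_token(vocab, scores, temperature=0.8, top_p=0.9, rng=None):
    """Temperature, then Top-P, then one weighted draw."""
    rng = rng or random.Random()
    if temperature == 0:
        # Treated as greedy decoding, since dividing by zero is undefined.
        return vocab[scores.index(max(scores))]

    # Step 1: temperature-scaled softmax.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: keep the smallest set of top tokens covering at least top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Step 3: draw from the survivors, weighted by their probabilities.
    chosen = rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
    return vocab[chosen]

# Hypothetical toy vocabulary and raw scores.
vocab = ["the", "a", "algorithm", "model", "elephant"]
scores = [4.0, 3.0, 2.0, 1.0, -3.0]
rng = random.Random(42)

tokens = [sample_next_token(vocab, scores, rng=rng) for _ in range(200)]
print(sorted(set(tokens)))
```

With these particular scores and settings, the two top tokens already cover more than 90 percent of the probability, so nothing else can be drawn.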

Why This Matters Beyond Trivia

Understanding sampling accounts for things you have probably noticed but could not explain.

When you regenerate a response and get something different, that is sampling at work. When an AI writes creative fiction with unexpected turns, that is temperature giving lower-probability tokens a chance to surface. When an AI produces repetitive or looping text, that is often a side effect of greedy or near-greedy decoding getting stuck in high-probability loops. The feeling of an AI being creative or robotic owes a great deal to these sampling parameters, not just to the model itself.

The underlying model weights are fixed after training. The same model can feel wildly different depending entirely on how its output distribution is sampled at inference time.

The Tradeoff Developers Have to Make

There is no universally correct temperature. The right setting depends entirely on the task.

A system extracting structured data from invoices wants temperature zero. Variation is a defect. A system brainstorming marketing ideas wants a higher temperature. Variation is the point.

Sampling from the model is like taking a draw from a distribution. The temperature parameter controls how peaked or diffuse that distribution is.
— Andrej Karpathy, Neural Networks: Zero to Hero lecture series, 2023

This is why the same underlying model can feel like a completely different product across different applications. The model is the same. The sampling strategy is not.

The Simple Version

When an AI generates text, it produces a probability score for every possible next token. Temperature rescales the raw scores behind those probabilities to control how much variety enters the output. Sampling methods like Top-K and Top-P then determine which portion of that distribution the model is allowed to choose from.

The combination of these parameters is what decides whether the output feels precise or creative, consistent or surprising, focused or meandering. The model does not write your answer. It draws from a distribution that has been shaped by choices you may not even know were made.

Next time a response surprises you, that is sampling. Next time it bores you with something obvious, that is probably a temperature close to zero. The randomness is not an accident. It is the result of very deliberate mathematics.

Fact checked against published literature including Holtzman et al., The Curious Case of Neural Text Degeneration, 2020 on nucleus sampling, and Andrej Karpathy's publicly available lecture materials on language model inference.