Tags: few-shot prompting, example selection, context engineering, prompt engineering, in-context learning

Few-Shot Example Selection Guide (2026)

How to pick few-shot examples that actually help — similarity, diversity, ordering, and when dynamic selection beats fixed example sets.

SurePrompts Team
April 20, 2026
12 min read

TL;DR

Few-shot prompting is table stakes; example selection is where the wins come from. Pick examples by similarity, diversity, ordering, and task-shape fit — and use dynamic selection on tasks with example variation.

Everybody reaches for few-shot prompting once zero-shot stops cutting it. Paste three worked examples above the query, the model picks up the pattern, accuracy goes up. That part is easy. The part that actually moves numbers — and that most teams under-invest in — is which examples go in the prompt. This post, part of the context engineering pillar, covers how to select few-shot examples well: similarity, diversity, ordering, task-shape fit, and when dynamic per-query selection beats a fixed set.

Why Selection Matters

The model doesn't learn from few-shot examples the way a human learns from a textbook. It pattern-matches against what's in the context window. If your examples look like the current query — same domain, same format, same difficulty — the pattern transfers cleanly. If they don't, the model has to generalize across a gap, and the wider the gap the less reliable the transfer.

That makes example quality the headline variable. Three carefully chosen examples can outperform a dozen mediocre ones. A single misleading example — wrong format, off-topic, inconsistent with the task — can quietly drag down accuracy on every query it influences. Set size is a secondary lever; selection is where most of the signal lives.

For a refresher on mechanics, see the few-shot prompting guide and the glossary entry. This post assumes you know what a few-shot prompt is and want to pick better examples.

Similarity: Pick Examples Close to the Query

The first axis is semantic similarity. An example that looks and reads like the current query transfers better than one that doesn't.

What "close" means depends on the task. For classification, similar means similar input features — same product category, same customer segment, same phrasing style. For reasoning, similar means same reasoning shape — comparable steps, comparable constraints. For generation, it's similar length, tone, and output-format requirements.

The default infrastructure for similarity is embedding-based retrieval: embed candidate examples once, embed the incoming query at request time, return top-k nearest neighbors. Same pattern as retrieval-augmented prompting, applied to your curated example corpus rather than to documents.
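A minimal Python sketch of that retrieval step, assuming each pool entry is a dict carrying a precomputed `vec` embedding of its input (the embedding model itself is out of scope here, and the field names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, pool, k=3):
    """Return the k pool examples whose input embeddings sit closest to the query."""
    scored = [(cosine(query_vec, ex["vec"]), ex) for ex in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```

In production the linear scan would be a vector index, but the contract is the same: embed once at curation time, embed the query at request time, return nearest neighbors.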

Two guardrails. First, similarity is necessary but not sufficient — top matches often cluster together, which is the diversity problem below. Second, embedding similarity is a proxy; it can miss structural features (like "this input contains a negation") that matter to the task. For high-value tasks, layer rule-based filters on top — always include one example with a negation if the query has one.

Diversity: Cover the Output-Space Edges

Three highly similar examples are worse than three examples that together span the output space. If all three show long answers, the model will produce long answers even for queries that want a short one. If all three show positive-sentiment classifications, it biases toward positive.

Diversity is about what the example set collectively demonstrates. Goals to keep in mind when building or selecting:

  • Output-class coverage. For classification or multi-choice tasks, include at least one example per label you expect the model to emit — especially any minority class it might otherwise ignore.
  • Length variation. At least one short example and one long example, so the model doesn't anchor on a single output length.
  • Edge-case representation. At least one example that exercises the task's tricky rule — negations, nulls, unusual formats, the condition that distinguishes this task from a neighboring one.
  • Avoiding near-duplicates. If two candidate examples have cosine similarity above some threshold to each other, drop one. Similar examples waste budget without adding pattern information.

Diversity and similarity pull against each other — pure top-k picks near-duplicates, pure diversity wanders off-task. The practical answer is two-stage: retrieve top-N by similarity (say N = 10), then greedily pick k from that pool to maximize diversity, with each pick still clearing a minimum similarity bar.
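The two-stage idea can be sketched in a few lines of Python. This is a sketch under assumptions, not a library API: pool entries carry a precomputed `vec`, and the thresholds (`min_sim`, `dup_threshold`) are illustrative values you would tune on your own eval set.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_diverse(query_vec, pool, k=3, n=10, min_sim=0.3, dup_threshold=0.92):
    """Stage 1: top-n by similarity. Stage 2: greedy diversity pass."""
    # Stage 1: shortlist the n most similar candidates, best first.
    scored = sorted(
        ((cosine(query_vec, ex["vec"]), ex) for ex in pool),
        key=lambda pair: pair[0],
        reverse=True,
    )[:n]
    # Stage 2: keep candidates that clear the similarity bar and are not
    # near-duplicates of anything already picked.
    picked = []
    for sim, ex in scored:
        if len(picked) >= k:
            break
        if sim < min_sim:
            continue
        if any(cosine(ex["vec"], p["vec"]) > dup_threshold for p in picked):
            continue
        picked.append(ex)
    return picked
```

The greedy pass trades a little similarity for coverage: a near-duplicate of an already-picked example is skipped even if it outscores the next distinct candidate.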

Ordering: Recency Favors Examples Near the Query

Transformer attention isn't uniform across the context window. Content close to the query tends to exert more influence on the next token than content at the top of the prompt. This "recency bias" is a useful lever for few-shot selection.

Practical rules:

  • Strongest example last. Put the best-matching example immediately before the query — not first, not in the middle.
  • Canonical format last. If examples vary in how strictly they match the desired output format, put the most format-canonical one closest to the query so its shape is freshest when the model starts generating.
  • Ascending relevance. A safe default is weakest-to-strongest by similarity score. Early examples establish task shape; the last example models the exact pattern you want repeated.
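The ascending-relevance default is a one-liner once you have similarity scores. A Python sketch, again assuming each example dict carries a precomputed `vec`:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def order_for_prompt(picked, query_vec):
    """Weakest-to-strongest: the best-matching example lands last, right before the query."""
    return sorted(picked, key=lambda ex: cosine(query_vec, ex["vec"]))
```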

For why this works at prompt scope, see hierarchical context loading. Ordering is free — zero extra tokens. Most teams ignore it because default template order becomes invisible. Revisit it when accuracy is close but not quite there.

Task-Shape Fit: Examples Must Show the Exact Output

Few-shot examples teach the model two things at once: the task (what question is being asked) and the output shape (what the answer looks like). If the shape in your examples is wrong, the model faithfully copies the wrong shape.

Where this goes sideways:

  • Format mismatch. Instructions say "respond in JSON" but examples show prose with JSON-like fragments. The model produces malformed JSON.
  • Length mismatch. Instructions say "one sentence" but examples show three. The model follows the examples.
  • Tone mismatch. Instructions say "formal" but examples are chatty. Same outcome.
  • Label mismatch. Instructions list ["positive", "neutral", "negative"] but examples use ["POS", "NEG"]. The model emits a mix.

The fix is boring: examples must be exact, literal exemplars of the output you want. When instructions and examples disagree, examples tend to win.
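"Examples must be exact, literal exemplars" is checkable, so check it. A hypothetical validation sketch for the sentiment case above — the label set and the casing rule stand in for whatever your real output spec says:

```python
ALLOWED_LABELS = {"positive", "neutral", "negative"}  # the spec's label set (illustrative)

def validate_example(example):
    """Reject any pool entry whose output doesn't literally match the spec."""
    label = example["output"]
    if label != label.strip().lower():
        return False  # stray whitespace or casing would teach the wrong shape
    return label in ALLOWED_LABELS
```

Run it over the whole pool every time the output spec changes; that keeps instructions and examples from silently disagreeing.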

Dynamic vs Fixed Selection

A fixed example set is the same 3–8 examples appended to every prompt, chosen at design time. Dynamic selection pulls examples per query from a larger curated pool. Both have their place; here's how to choose.

  • Fixed examples. Use when tasks are homogeneous and outputs look similar across queries. Trade-offs: simplest to implement; works well for narrow, well-defined tasks; gets stuck when the task has genuine variety.
  • Dynamic selection. Use when tasks vary meaningfully in input domain, difficulty, or format, and the example corpus is large. Trade-offs: needs retrieval infrastructure and a curated example pool; pays back when query variation is real.

Heuristic: if you can write a single set of three examples that genuinely represents the full task, fixed is fine. If you keep adding examples to cover new cases and the prompt keeps growing, that's the signal to move to dynamic.

Dynamic is also the right home for tasks with arbitrary user input — support triage, code-completion prompts, classifier APIs. The "right" three examples for one query differ from the "right" three for another, and no fixed-set compromise beats per-query selection. For how dynamic selection fits the broader pipeline, see dynamic context assembly patterns.

Selection Infrastructure

The default stack for dynamic few-shot is plain: a curated example pool, an embedding index over inputs, and a selector function.

  • Curate a pool. Start with 50–200 high-quality examples per task. Quality beats quantity; wrong or inconsistent examples hurt more than they help.
  • Tag and validate. Tag each example with its label, any relevant structural features (length, complexity, negation presence), and provenance. Validate that every example's output matches the current output spec.
  • Embed the inputs. Run each example's input through an embedding model; store the vector. Re-embed when you change models.
  • Build a selector. Given a query: embed it, retrieve top-N by cosine similarity, apply diversity or rule-based filters, pick k, order them, insert into the prompt.
  • Log and evaluate. Log which examples fire on each request. When outputs go wrong, look at the selection — often the bug is that retrieved examples were off-task.
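The "tag and validate" step implies a record shape for pool entries. A Python sketch of one plausible schema — the field names and the two tagging rules are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class PoolExample:
    input: str
    output: str
    label: str                               # task label, e.g. "billing"
    tags: set = field(default_factory=set)   # structural features: "negation", "long", ...
    provenance: str = "unknown"              # where this example came from
    vec: list = None                         # embedding of `input`, filled at index time

def tag_example(ex):
    """Cheap rule-based structural tagging (illustrative rules only)."""
    text = ex.input.lower()
    if " not " in f" {text} " or "n't" in text:
        ex.tags.add("negation")
    if len(text.split()) > 40:
        ex.tags.add("long")
    return ex
```

Tags like these are what let the selector enforce rules such as "if the query contains a negation, include at least one negation example" on top of raw embedding similarity.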

The selector is the piece most teams under-invest in. Shipping top-k by similarity and calling it done is tempting; the gap between that and a selector that enforces diversity, format canonicality, and label coverage tends to be worth real accuracy gains.

Don't Let the Model Generate Its Own Few-Shot Examples

A recurring anti-pattern: asking an LLM to "come up with a few examples of the task, then solve it." On paper it saves curation work. In practice the model's own examples drift toward its own prior — they confirm whatever shape it was going to produce anyway. You get more-confident wrong answers, not more-right ones.

The version that works is using an LLM to draft candidate examples that humans review, correct, and add to a curated pool. LLM for ideation; humans validate before anything enters the production corpus. Never wire the model's same-turn generations into its own few-shot slot.

A Dynamic Selection Prompt Pattern (Hypothetical)

The following is a hypothetical shape to illustrate dynamic few-shot assembly end-to-end. The task is a toy customer-support intent classifier with labels billing, technical, account, other.

// Pseudocode for a dynamic few-shot selector

function selectFewShot(query, pool, k = 4) {
  // 1. Similarity: embedding top-N shortlist
  const candidates = embedding.topK(query, pool, /* N = */ 20);

  // 2. Diversity + label coverage
  const picked = [];
  const labelsSeen = new Set();
  for (const example of candidates) {
    if (picked.length >= k) break;
    const tooSimilarToPicked = picked.some(
      p => cosine(p.vec, example.vec) > 0.92
    );
    if (tooSimilarToPicked) continue;
    // Reserve the final slot for an unseen label, so the set
    // still covers labels even when top matches cluster.
    if (!labelsSeen.has(example.label) || picked.length < k - 1) {
      picked.push(example);
      labelsSeen.add(example.label);
    }
  }

  // 3. Order weakest-to-strongest by similarity to the query
  picked.sort((a, b) => a.similarity - b.similarity);
  return picked;
}

// Final assembled prompt
function buildPrompt(query) {
  const examples = selectFewShot(query, examplePool, 4);
  const exampleBlock = examples
    .map(ex => `Input: ${ex.input}\nLabel: ${ex.label}`)
    .join("\n\n");
  return `
You classify customer-support messages into one of:
billing, technical, account, other.

Respond with only the label, lowercase, nothing else.

${exampleBlock}

Input: ${query}
Label:
`.trim();
}

Read it as a shape, not a library recommendation. The pattern is what matters: similarity retrieval, diversity filter with label coverage, weakest-to-strongest ordering, strict output-shape modeling in every example.

Common Anti-Patterns

  • Top-k and done. Retrieving by cosine similarity and shipping whatever comes back. Produces near-duplicate sets, misses label coverage, biases toward whichever cluster the query landed near.
  • Too many examples. Stuffing 10–15 "to be safe." Costs budget, dilutes the pattern signal, over-represents common classes. Three to five good examples beat ten mediocre ones.
  • Unchecked example quality. Letting any input/output pair into the pool without validating outputs match the current spec. One wrong example quietly drops accuracy across every query it gets retrieved for.
  • Instruction–example disagreement. Instructions say one thing, examples show another. Examples win. Keep them in sync every time the output spec changes.
  • Fixed ordering. Never reshuffling by query. The last example is the most influential slot; leaving it to alphabetical order is wasted leverage.
  • LLM-generated inline examples. The model's own same-turn "examples" confirm its priors. Use LLMs to draft candidates for humans to curate, not as inline few-shot slots.

FAQ

How many examples should I use?

Three to five is the usual sweet spot. Fewer than three and you can't show diversity; more than five tends to yield diminishing returns while eating token budget. Push higher only when you've measured that it helps on your eval set.

Should I always use dynamic selection?

No. If your task is homogeneous — every query looks similar enough that a single set of three examples genuinely covers it — fixed is simpler, cacheable, and fine. Move to dynamic when the prompt keeps growing to cover new query shapes, or when eval accuracy plateaus despite adding fixed examples.

How do I pick between two candidate examples that look equally relevant?

Prefer the one that exercises a distinct feature the other doesn't — a different label, a different length, a different edge case. The example set's job is collective coverage, not that each individual example be the single best match.

What about zero-shot for the same task?

If zero-shot is working, don't add few-shot just because you can. Few-shot costs budget and adds a selection surface to maintain. For the comparison in depth, see zero-shot vs few-shot prompting.

Can I mix few-shot examples with retrieved documents?

Yes. Keep them in distinct sections with clear headings, so the model knows which is "how to format the answer" (examples) and which is "evidence to reason over" (retrieved docs). Put examples in the stable prefix or near the top of dynamic content; put retrieved docs closest to the query where their evidence is freshest.
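One way to keep the sections distinct is to assemble them with explicit headings in a fixed order. A hypothetical Python sketch — the heading strings and section order here are one reasonable layout, not a requirement:

```python
def build_mixed_prompt(examples_block, docs_block, query):
    """Distinct, labeled sections: examples set the format, docs carry the evidence."""
    return (
        "## Format examples\n"
        f"{examples_block}\n\n"
        "## Retrieved documents\n"
        f"{docs_block}\n\n"
        "## Question\n"
        f"{query}\nAnswer:"
    )
```

The examples block is stable across requests (cache-friendly), while the documents block changes per query and sits closest to the question.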

Wrap-Up

Few-shot is easy to set up and easy to leave under-tuned. The selection layer is where the wins come from. Similarity gets on-topic examples; diversity keeps them from collapsing into near-duplicates; ordering puts the strongest pattern closest to the query; task-shape fit makes sure the examples teach the exact output you want. Dynamic selection beats fixed sets whenever query variation is real.

Treat the example pool like code: curate it, validate every entry against the current spec, log which examples fire on which queries, trim the pool regularly. That's the unglamorous work separating "we added few-shot" from "few-shot moved our eval numbers."

For the pillar, context engineering. For assembly patterns that include example slots, dynamic context assembly. For retrieval patterns sharing the same infrastructure, retrieval-augmented prompting. For ordering at whole-prompt scope, hierarchical context loading. For the term itself, few-shot prompting.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
