Needle in a Haystack
Needle in a haystack (NIAH) is a long-context evaluation pattern that measures whether a model can retrieve a specific fact (the needle) planted at an arbitrary position inside a long passage of irrelevant text (the haystack). The evaluator varies two dimensions — the total context length and the depth at which the needle is inserted — and asks a question that can only be answered from the needle. Recovery rates are reported as a grid, which exposes where retrieval degrades: near-perfect at the start and end of the context, often worse in the middle, sometimes collapsing outright at the longest lengths. Popular in 2023–2024 as a first check on long-context claims, the benchmark is useful but limited: it tests literal recall of a planted fact, not multi-hop reasoning, aggregation across the corpus, or robustness to adversarial distractors. Modern long-context evals add those dimensions rather than replacing NIAH entirely.
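A minimal harness makes the pattern concrete. The sketch below is illustrative, not any benchmark's official implementation: `build_haystack`, `run_grid`, and the stub model are hypothetical names, tokens are approximated by whitespace-split words, and scoring is a simple substring check.

```python
import random
import re

def build_haystack(filler_sentences, needle, total_words, depth_pct, seed=0):
    """Build ~total_words words of filler with `needle` inserted at depth_pct%."""
    rng = random.Random(seed)
    words = []
    while len(words) < total_words:
        words.extend(rng.choice(filler_sentences).split())
    words = words[:total_words]
    cut = int(len(words) * depth_pct / 100)  # needle depth as a word offset
    return " ".join(words[:cut] + needle.split() + words[cut:])

def run_grid(answer_fn, filler_sentences, needle, question, expected, lengths, depths):
    """Return {(length, depth): 1.0 if the answer contains `expected`, else 0.0}."""
    grid = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(filler_sentences, needle, n, d) + "\n\nQ: " + question
            grid[(n, d)] = 1.0 if expected in answer_fn(prompt) else 0.0
    return grid

# Stub "model" that retrieves perfectly via regex; swap in a real model call.
def stub_model(prompt):
    m = re.search(r"secret code is (\d+)", prompt)
    return m.group(1) if m else "unknown"

filler = ["The quick brown fox jumps over the lazy dog.",
          "Lorem ipsum dolor sit amet, consectetur adipiscing elit."]
grid = run_grid(stub_model, filler, "The secret code is 7421.",
                "What is the secret code?", "7421",
                lengths=[1_000, 5_000], depths=[5, 50, 95])
```

With a real model in place of the stub, the returned dictionary is exactly the length-by-depth grid that NIAH reports.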
Example
A vendor claims a 1M-token context window. An evaluator runs needle-in-a-haystack at 100K, 500K, and 950K tokens, inserting the needle at 5%, 25%, 50%, 75%, and 95% depth. The result grid shows near-perfect recovery at the start and end at every length, a visible drop in the middle at 500K, and a larger drop in the middle at 950K — the classic U-shape. The takeaway is not that long context is broken but that "how long" and "how reliable at which depth" are different questions, and the nominal context window is a ceiling, not an operating range.
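Tabulating such a grid makes the U-shape easy to read at a glance. The recovery rates below are synthetic, chosen only to mirror the pattern in the example above, and `render_grid` is a hypothetical helper, not part of any benchmark tooling.

```python
def render_grid(grid, lengths, depths):
    """Render a {(length, depth): recovery_rate} dict as an aligned text table."""
    header = "length".ljust(8) + "".join(f"{d:>6}%" for d in depths)
    rows = [header]
    for n in lengths:
        cells = "".join(f"{grid[(n, d)]:>7.2f}" for d in depths)
        rows.append(f"{n // 1000}K".ljust(8) + cells)
    return "\n".join(rows)

lengths = [100_000, 500_000, 950_000]
depths = [5, 25, 50, 75, 95]
# Synthetic rates mirroring the example: edges hold up, the middle sags with length.
synthetic = {
    (100_000, 5): 1.00, (100_000, 25): 0.99, (100_000, 50): 0.98,
    (100_000, 75): 0.99, (100_000, 95): 1.00,
    (500_000, 5): 0.99, (500_000, 25): 0.95, (500_000, 50): 0.82,
    (500_000, 75): 0.94, (500_000, 95): 0.99,
    (950_000, 5): 0.98, (950_000, 25): 0.90, (950_000, 50): 0.61,
    (950_000, 75): 0.89, (950_000, 95): 0.98,
}
print(render_grid(synthetic, lengths, depths))
```

Reading down the 50% column shows the middle-of-context drop growing with length, while the 5% and 95% columns stay near 1.0 — the "ceiling, not an operating range" point in table form.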