RULER (Long-Context Benchmark)
RULER is a long-context evaluation suite that goes beyond simple needle-in-a-haystack retrieval. Its tasks also cover aggregation, long-context question answering, and multi-hop variable tracking, measuring where a model's effective context window actually ends.
Models with a stated 1M-token context window often see recall accuracy drop well before that limit. RULER reveals the gap between a model's nominal context window and its effective context window — the depth at which retrieval and reasoning quality remains acceptable. Different tasks within RULER stress different long-context capabilities, so "good at RULER" is a multidimensional claim, not a single number.
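The gap between nominal and effective context can be probed by embedding a known fact at varying depths in filler text and checking whether the model still retrieves it. A minimal sketch of that setup (the `FILLER` text and `build_haystack` helper are illustrative, not RULER's actual generators):

```python
# Illustrative needle-in-a-haystack probe, the simplest of the task
# families RULER builds on. All names here are hypothetical.
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def build_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Embed `needle` at a relative depth (0.0 = start, 1.0 = end)
    inside n_filler repetitions of filler text."""
    before = int(n_filler * depth)
    return FILLER * before + needle + " " + FILLER * (n_filler - before)

# Sweeping n_filler (context length) and depth, then scoring the model's
# answer to "What is the secret code?", traces where recall degrades.
prompt = build_haystack("The secret code is 7421.", n_filler=100, depth=0.5)
```

In practice you would repeat this sweep at many context lengths and depths; the effective context window is roughly the largest length at which accuracy stays above your chosen threshold.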
Example
A model can pass a needle-in-a-haystack test at 500k tokens and still fail RULER's multi-hop variable-tracking task at the same depth, because the harder tasks require holding multiple facts in working memory simultaneously.
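A variable-tracking task of this kind can be sketched as a shuffled chain of assignments scattered through the context; answering "which variables hold the target value" forces the model to follow every hop rather than spot a single string. The generator below is a hypothetical illustration, not RULER's exact task specification:

```python
import random

def make_variable_chain(n_hops: int, seed: int = 0):
    """Build a chain VAR0 = <value>, VAR1 = VAR0, ... and shuffle the
    statements. Variable names and statement format are illustrative."""
    rng = random.Random(seed)
    value = str(rng.randint(10000, 99999))
    names = [f"VAR{i}" for i in range(n_hops)]
    statements = [f"{names[0]} = {value}."]
    for i in range(1, n_hops):
        statements.append(f"{names[i]} = {names[i-1]}.")
    rng.shuffle(statements)  # scatter the hops through the context
    # Ground truth: every variable in the chain resolves to `value`.
    return statements, value, set(names)
```

Interleaving these statements with filler (and with distractor chains holding other values) yields a task where each added hop is one more fact the model must hold simultaneously, which is why accuracy falls faster here than on single-needle retrieval.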