RULER (Long-Context Benchmark)
RULER is a long-context evaluation suite that goes beyond simple needle-in-a-haystack retrieval. Its tasks also cover aggregation, long-context question answering, and multi-hop variable tracking, measuring where a model's effective context window actually ends.
Models with a stated 1M-token context window often see recall accuracy drop well before that limit. RULER reveals the gap between a model's nominal context window and its effective context window — the depth at which retrieval and reasoning quality remains acceptable. Different tasks within RULER stress different long-context capabilities, so "good at RULER" is a multidimensional claim, not a single number.
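The gap between nominal and effective context can be probed by embedding a known fact at varying depths in filler text and checking whether the model still retrieves it. A minimal sketch of that setup (the `FILLER` text and `build_haystack` helper are illustrative, not RULER's actual generators):

```python
# Illustrative needle-in-a-haystack probe, the simplest of the task
# families RULER builds on. All names here are hypothetical.
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def build_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Embed `needle` at a relative depth (0.0 = start, 1.0 = end)
    inside n_filler repetitions of filler text."""
    before = int(n_filler * depth)
    return FILLER * before + needle + " " + FILLER * (n_filler - before)

# Sweeping n_filler (context length) and depth, then scoring the model's
# answer to "What is the secret code?", traces where recall degrades.
prompt = build_haystack("The secret code is 7421.", n_filler=100, depth=0.5)
```

In practice you would repeat this sweep at many context lengths and depths; the effective context window is roughly the largest length at which accuracy stays above your chosen threshold.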
Example
A model can pass a needle-in-a-haystack test at 500k tokens and still fail RULER's multi-hop variable-tracking task at the same depth, because the harder tasks require holding multiple facts in working memory simultaneously.
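A variable-tracking task of this kind can be sketched as a shuffled chain of assignments scattered through the context; answering "which variables hold the target value" forces the model to follow every hop rather than spot a single string. The generator below is a hypothetical illustration, not RULER's exact task specification:

```python
import random

def make_variable_chain(n_hops: int, seed: int = 0):
    """Build a chain VAR0 = <value>, VAR1 = VAR0, ... and shuffle the
    statements. Variable names and statement format are illustrative."""
    rng = random.Random(seed)
    value = str(rng.randint(10000, 99999))
    names = [f"VAR{i}" for i in range(n_hops)]
    statements = [f"{names[0]} = {value}."]
    for i in range(1, n_hops):
        statements.append(f"{names[i]} = {names[i-1]}.")
    rng.shuffle(statements)  # scatter the hops through the context
    # Ground truth: every variable in the chain resolves to `value`.
    return statements, value, set(names)
```

Interleaving these statements with filler (and with distractor chains holding other values) yields a task where each added hop is one more fact the model must hold simultaneously, which is why accuracy falls faster here than on single-needle retrieval.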