Many-Shot Jailbreaking

Many-shot jailbreaking is a long-context attack identified by Anthropic researchers in 2024. The attacker fills the prompt with dozens to hundreds of fabricated example dialogues in which an assistant character appears to comply with prohibited requests (synthesizing dangerous content, bypassing safety policies, writing malicious code) and then appends a final attack query. Primed by the long sequence of "compliant" in-context demonstrations, the model is materially more likely to comply with the final query than it would be with a zero-shot version of the same request. Attack effectiveness grows with the number of shots (the original research reports roughly power-law scaling), so frontier long-context models with windows of hundreds of thousands to millions of tokens are inherently more exposed than their smaller-context predecessors. Mitigations include input classifiers that detect suspicious multi-turn patterns, targeted fine-tuning on many-shot refusal examples, and prompt-level defenses.
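The prompt structure described above can be sketched as follows. This is a minimal illustration of how shot count scales the prompt, using harmless placeholder turns; the `build_many_shot_prompt` helper and the `User:`/`Assistant:` role markers are assumptions for the sketch, not any particular model's chat format.

```python
def build_many_shot_prompt(shots, final_query):
    """Concatenate fabricated Q&A turns, then append the final attack query.

    Each "shot" is a (question, answer) pair in which the assistant
    character appears to comply. Placeholders stand in for the content.
    """
    turns = []
    for question, answer in shots:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"User: {final_query}")  # the actual attack query
    return "\n".join(turns)

# 256 placeholder shots: prompt length grows linearly with shot count.
shots = [(f"placeholder question {i}", f"placeholder compliant answer {i}")
         for i in range(256)]
prompt = build_many_shot_prompt(shots, "final placeholder query")
```

The key property is that the only variable is the number of shots: the same final query is refused zero-shot but increasingly complied with as fabricated demonstrations are prepended.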

Example

A red team evaluates a new long-context model. Zero-shot harmful requests are refused at a high rate. The team constructs a test prompt with 256 fabricated Q&A turns in which the assistant complies with increasingly borderline requests, ending with a clear policy-violating question. Compliance on the final question rises sharply compared to the zero-shot baseline. The platform team responds by adding a classifier-based input filter that detects long sequences of fabricated assistant turns and by adding many-shot refusal examples to safety fine-tuning. Post-mitigation, compliance on the many-shot attack drops back toward the zero-shot refusal rate.
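A crude version of the classifier-based input filter in the scenario above can be sketched with a simple heuristic: count assistant-role markers embedded inside a single user-supplied prompt. This is a toy stand-in, not a production classifier; the `flag_many_shot` name, the `Assistant:` marker convention, and the threshold of 16 are assumptions for illustration.

```python
import re

def count_fabricated_assistant_turns(prompt: str) -> int:
    """Count 'Assistant:' role markers at line starts inside one prompt.

    A genuine single user message rarely contains many of these; a
    many-shot attack prompt contains dozens to hundreds.
    """
    return len(re.findall(r"(?m)^Assistant:", prompt))

def flag_many_shot(prompt: str, threshold: int = 16) -> bool:
    """Flag prompts whose embedded assistant-turn count exceeds a threshold."""
    return count_fabricated_assistant_turns(prompt) >= threshold
```

In practice such a filter would be one layer among several, since attackers can vary role markers and formatting; the scenario pairs it with safety fine-tuning on many-shot refusal examples for that reason.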
