Many-Shot Jailbreaking

Many-shot jailbreaking is a long-context attack identified by Anthropic researchers in 2024. The attacker fills the prompt with dozens to hundreds of fabricated example dialogues in which an assistant character appears to comply with prohibited requests (synthesizing dangerous content, bypassing safety policies, writing malicious code) and then appends a final attack query. Primed by the long sequence of "compliant" in-context demonstrations, the model is materially more likely to comply with the final query than it would be with a zero-shot version of the same request. Attack effectiveness grows with the number of shots (the original research reports roughly power-law scaling), so frontier long-context models with windows of hundreds of thousands to millions of tokens are inherently more exposed than their smaller-context predecessors. Mitigations include input classifiers that detect suspicious multi-turn patterns, targeted fine-tuning on many-shot refusal examples, and prompt-level defenses.
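The prompt structure described above can be sketched as follows. This is a minimal illustration of how shot count scales the prompt, using harmless placeholder turns; the `build_many_shot_prompt` helper and the `User:`/`Assistant:` role markers are assumptions for the sketch, not any particular model's chat format.

```python
def build_many_shot_prompt(shots, final_query):
    """Concatenate fabricated Q&A turns, then append the final attack query.

    Each "shot" is a (question, answer) pair in which the assistant
    character appears to comply. Placeholders stand in for the content.
    """
    turns = []
    for question, answer in shots:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"User: {final_query}")  # the actual attack query
    return "\n".join(turns)

# 256 placeholder shots: prompt length grows linearly with shot count.
shots = [(f"placeholder question {i}", f"placeholder compliant answer {i}")
         for i in range(256)]
prompt = build_many_shot_prompt(shots, "final placeholder query")
```

The key property is that the only variable is the number of shots: the same final query is refused zero-shot but increasingly complied with as fabricated demonstrations are prepended.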

Example

A red team evaluates a new long-context model. Zero-shot harmful requests are refused at a high rate. The team constructs a test prompt with 256 fabricated Q&A turns in which the assistant complies with increasingly borderline requests, ending with a clear policy-violating question. Compliance on the final question rises sharply compared to the zero-shot baseline. The platform team responds by adding a classifier-based input filter that detects long sequences of fabricated assistant turns and by adding many-shot refusal examples to safety fine-tuning. Post-mitigation, compliance on the many-shot attack drops back toward the zero-shot refusal rate.
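A crude version of the classifier-based input filter in the scenario above can be sketched with a simple heuristic: count assistant-role markers embedded inside a single user-supplied prompt. This is a toy stand-in, not a production classifier; the `flag_many_shot` name, the `Assistant:` marker convention, and the threshold of 16 are assumptions for illustration.

```python
import re

def count_fabricated_assistant_turns(prompt: str) -> int:
    """Count 'Assistant:' role markers at line starts inside one prompt.

    A genuine single user message rarely contains many of these; a
    many-shot attack prompt contains dozens to hundreds.
    """
    return len(re.findall(r"(?m)^Assistant:", prompt))

def flag_many_shot(prompt: str, threshold: int = 16) -> bool:
    """Flag prompts whose embedded assistant-turn count exceeds a threshold."""
    return count_fabricated_assistant_turns(prompt) >= threshold
```

In practice such a filter would be one layer among several, since attackers can vary role markers and formatting; the scenario pairs it with safety fine-tuning on many-shot refusal examples for that reason.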
