Skip to main content

Model Cascade

A model cascade is a routing pattern in which each request is first attempted by a cheaper, smaller model and only escalated to a stronger, more expensive model when the small model's confidence is low or its output fails a validation check. The escalation signal can come from the model's own self-assessment, a logprob-based confidence score, a lightweight classifier, or a downstream validator (schema check, tool-call success, judge score). Cascades cut cost on easy requests while preserving quality on hard ones, and they degrade gracefully: a partially correct small-model output is often better than a hard error. The design choice is where to set the escalation threshold — too aggressive and you lose the cost savings, too lax and quality drops.

Example

A support-ticket classifier sends every incoming ticket to a small model first. If the small model's top-class probability is above 0.85, the result is used as-is; otherwise the ticket is re-sent to a frontier model. On a month of traffic, 72% of tickets clear the threshold and are handled by the small model; the remaining 28% go to the frontier model. Blended cost drops by roughly half with no measurable accuracy regression on the held-out eval set.

Put this into practice

Build polished, copy-ready prompts in under 60 seconds with SurePrompts.

Try SurePrompts