Question 1

What is Model Cascade?

Accepted Answer

A model cascade is a routing pattern in which each request is first attempted by a cheaper, smaller model and only escalated to a stronger, more expensive model when the small model's confidence is low or its output fails a validation check.

Question 2

How does Model Cascade work?

Accepted Answer

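A cascade orders its models from cheapest to most expensive and, after each attempt, checks an escalation signal to decide whether to accept the output or pass the request to the next model. That signal can come from the model's own self-assessment, a logprob-based confidence score, a lightweight classifier, or a downstream validator (for example a schema check, tool-call success, or a judge score).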
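As a rough illustration, the control flow can be sketched in Python as below. This is a minimal sketch, not a reference implementation: `Tier`, `run_cascade`, `is_json_object`, and the two stand-in model functions are hypothetical names invented for the example, and the validator shown (a check that the output parses as a JSON object) is just one of the possible escalation signals listed above.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Tier:
    name: str
    generate: Callable[[str], str]      # hypothetical model call: prompt -> output
    accept: Callable[[str, str], bool]  # escalation check: (prompt, output) -> keep it?


def is_json_object(_prompt: str, output: str) -> bool:
    """Downstream validator: accept only output that parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def run_cascade(prompt: str, tiers: Sequence[Tier]) -> Tuple[str, str]:
    """Try tiers cheapest-first and return (tier_name, output).

    The last tier is the fallback, so its output is returned unchecked.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for tier in tiers[:-1]:
        output = tier.generate(prompt)
        if tier.accept(prompt, output):
            return tier.name, output
    last = tiers[-1]
    return last.name, last.generate(prompt)


# Stand-in models (assumptions for the demo, not real API calls).
def small_model(prompt: str) -> str:
    return '{"label": "billing"}' if "invoice" in prompt else "not sure"


def large_model(prompt: str) -> str:
    return '{"label": "other"}'


TIERS = [
    Tier("small", small_model, is_json_object),
    Tier("large", large_model, is_json_object),
]
```

With these stand-ins, `run_cascade("invoice overdue", TIERS)` is answered by the small tier, while a prompt the small model cannot format correctly falls through to the large one. Swapping `is_json_object` for a confidence check or a judge score changes the signal without touching the loop.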

Question 3

Can you give an example of Model Cascade?

Accepted Answer

A support-ticket classifier sends every incoming ticket to a small model first. If the small model's top-class probability is above 0.85, the result is used as-is; otherwise the ticket is re-sent to a frontier model. On a month of traffic, 72% of tickets clear the threshold and are handled by the small model; the remaining 28% go to the frontier model. Blended cost drops by roughly half with no measurable accuracy regression on the held-out eval set.
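The routing rule and the cost arithmetic in this example can be sketched as follows. The 0.85 threshold and the 28% escalation rate come from the example above; everything else is an assumption made for illustration, including the function names, the shape of the small model's output (a dict of class probabilities), and the per-ticket costs. The cost formula assumes every ticket pays for the small model and escalated tickets additionally pay for the frontier model.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 0.85  # top-class probability cutoff from the example


def route_ticket(
    text: str,
    small_model: Callable[[str], Dict[str, float]],  # hypothetical: text -> class probabilities
    frontier_model: Callable[[str], str],            # hypothetical: text -> class label
    threshold: float = THRESHOLD,
) -> Tuple[str, str]:
    """Return (label, handled_by) for one ticket."""
    probs = small_model(text)
    label, top_prob = max(probs.items(), key=lambda kv: kv[1])
    if top_prob > threshold:
        return label, "small"
    # Confidence too low: re-send the ticket to the frontier model.
    return frontier_model(text), "frontier"


def blended_cost_ratio(small_cost: float, frontier_cost: float, escalation_rate: float) -> float:
    """Cascade cost per ticket relative to sending every ticket to the frontier model."""
    cascade_cost = small_cost + escalation_rate * frontier_cost
    return cascade_cost / frontier_cost


# Stand-in models (assumptions for the demo, not real classifiers).
def demo_small(text: str) -> Dict[str, float]:
    if "refund" in text:
        return {"billing": 0.93, "technical": 0.07}
    return {"billing": 0.55, "technical": 0.45}


def demo_frontier(text: str) -> str:
    return "technical"


# Assumed prices: the small model costs one fifth of the frontier model per ticket.
ratio = blended_cost_ratio(small_cost=0.2, frontier_cost=1.0, escalation_rate=0.28)
```

Under the assumed one-to-five price ratio, `ratio` comes out at about 0.48, which is in line with the "roughly half" figure. The actual saving depends on the real price gap between the two models, which the example does not state.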
