
Semantic Router: Embedding-Based Routing Without Calling an LLM

A semantic router classifies incoming queries by comparing embeddings against a small set of labeled reference utterances per route. Faster, cheaper, and more deterministic than asking an LLM to route — this walkthrough shows how to build one and when to fall back to an LLM.

SurePrompts Team
April 22, 2026
12 min read

TL;DR

Semantic routers classify queries by embedding similarity to labeled reference utterances — faster, cheaper, and more deterministic than LLM routing. This walkthrough builds one end-to-end on a hypothetical support-desk bot and shows when to fall back to an LLM.


Key takeaways:

  • LLM-based routing is the wrong tool for the common case. For well-defined, mostly-separable routes, you are paying LLM latency and tokens to do what a vector similarity check does in milliseconds.
  • The core building block is an embedding model plus a set of reference utterances per route. Setup cost is one-time and tiny.
  • Threshold selection is the load-bearing hyperparameter. Without it, a semantic router confidently misroutes ambiguous traffic. With it, ambiguity becomes an explicit signal.
  • Hybrid is the default: semantic router as the fast path, LLM as the fallback for below-threshold queries. Treat the LLM as an expensive specialist, not a default router.
  • The failure modes are specific and predictable: stale reference utterances, collapsing multi-intent queries, centroid smearing, and embedding-space drift when you swap models.
  • Routing is its own layer in the Agentic Prompt Stack. Treat it that way — observable, versioned, independently evaluated.

Why LLM routing is often overkill

The default pattern in most agent stacks looks like this: user query arrives, a system prompt lists the available routes (billing, technical-support, sales, legal, whatever), and the LLM is asked to output the route name. Sometimes via function calling, sometimes via a constrained JSON schema, sometimes via raw text that gets regex-parsed downstream.

This works. It also costs you 500-1500ms of latency and several hundred tokens of input per query just to decide which prompt stack to load. At scale, that is a meaningful line item. More importantly: you are using a reasoning model to do a classification task that, for well-defined routes, does not require reasoning.

The three things LLM routing gives you that matter:

  • It handles implicit context ("I was charged twice last month, can you help?" routes to billing even though "billing" appears nowhere in the query).
  • It handles ambiguous queries ("my account is broken" — which route?) by applying soft reasoning.
  • It adapts to new routes without retraining, just by updating the system prompt.

The first one — implicit context — is exactly what embeddings are built for. A decent embedding model places "I was charged twice" near other billing queries in latent space without needing reasoning, because the training data already taught it that "charged," "refund," "invoice," and "subscription" cluster together.

The second — ambiguous queries — is real. That is the case for LLM fallback, not for LLM-as-default.

The third — adapting to new routes — is also real, but the cost is trivial: add some reference utterances, re-embed them, ship. No training, no fine-tuning.

So the reframe is: semantic router for the confident 80%, LLM for the ambiguous tail. That is a model cascade — cheap model first, expensive model as fallback — applied specifically to the routing problem.

How a semantic router works

The mechanics are deliberately boring, which is part of the point.

Step 1: define your routes. Concrete, mutually-exclusive, and named with the destination (the prompt stack, the tool, the queue). For a support bot: billing, technical_support, sales. Not general_inquiry — a generic bucket is a symptom that routes are not carved at the right joints.

Step 2: collect reference utterances per route. Real user queries if you have them, hand-written examples if you do not. Aim for 10-20 per route to start. Diversity matters more than quantity; you want to span the way users actually phrase requests for this route, not just the way a product manager would.

Step 3: embed the reference utterances. Run them through an embedding model once, store the vectors. This is the one-time setup cost. Choose an embedding model that is consistent across queries and reference utterances — do not embed references with one model and queries with another.

Step 4: for each incoming query, embed it and compute similarity to each route. Two main strategies:

  • Max over utterances: for each route, compute semantic similarity between the query and every reference utterance in that route, and take the max. The route with the highest max wins.
  • Centroid: for each route, average the reference embeddings into a single centroid vector, and compute similarity between the query and each centroid.

Start with max-over-utterances. It is more robust to routes with internally diverse utterances. Centroid is cheaper at scale but collapses route diversity; billing queries about refunds, pricing, and plan changes span real semantic distance, and the centroid sits in the middle of that cloud, far from all of them.

Step 5: threshold. If the winning similarity is above a tuned threshold, route. Otherwise, hand off to an LLM fallback. This step is what separates a semantic router from a brittle classifier.

The whole thing is bi-encoder retrieval, structurally identical to the first-pass retrieval stage in most RAG pipelines. The same pattern powers semantic search; the only difference is that instead of retrieving documents, you are retrieving a route label.
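The five steps above can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: the `embed` function here is a toy bag-of-words stand-in so the example is self-contained — swap in a real embedding model (and its vectors) in practice. Route names and utterances are abbreviated from the worked example below.

```python
# Minimal sketch of a semantic router using the max-over-utterances strategy.
# `embed` is a toy bag-of-words stand-in for a real embedding model, used
# only to keep this example self-contained and runnable.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

ROUTES = {
    "billing": ["I was charged twice this month",
                "Can I get a refund for yesterday's invoice?"],
    "technical_support": ["SSO login is failing with a 500 error",
                          "My webhook isn't firing"],
}

# One-time setup: embed every reference utterance, keyed by route.
REF_VECTORS = {r: [embed(u) for u in utts] for r, utts in ROUTES.items()}

def route(query: str, threshold: float = 0.3):
    q = embed(query)
    # Max over utterances: best single match per route, then best route overall.
    scores = {r: max(cosine(q, v) for v in vecs)
              for r, vecs in REF_VECTORS.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores  # below threshold: hand off to the LLM fallback
    return best, scores
```

The centroid variant replaces the inner `max` with a single comparison against the averaged reference vectors per route; everything else stays the same.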

Worked example: a hypothetical support-desk bot

Hypothetical scenario, not a shipped product. A B2B SaaS company wants to route incoming support messages to one of three queues: billing, technical_support, or sales.

Reference utterances, abbreviated:

billing — "I was charged twice this month," "Can I get a refund for yesterday's invoice?", "How do I change my payment method?", "My card was declined," "Why is my bill higher than last month?"

technical_support — "SSO login is failing with a 500 error," "My webhook isn't firing," "The dashboard won't load after the update," "API returns 429 even though I'm under the rate limit," "Export to CSV is hanging."

sales — "What's included in the Enterprise plan?", "Can we get a demo for our team?", "How does per-seat pricing work for 500+ users?", "Do you offer annual discounts?", "I'm comparing you to [competitor] — what's different?"

Embed these 15 utterances once, store the vectors keyed by route. Setup done.

Now route a few queries through. All numbers below are illustrative — actual similarity scores depend on the embedding model and the reference utterance set.

Query: "My credit card got rejected when I tried to renew."

Similarities: billing ~0.83, technical_support ~0.41, sales ~0.38.

Confident billing. Route.

Query: "The API keeps timing out when I hit the /users endpoint."

Similarities: billing ~0.32, technical_support ~0.79, sales ~0.35.

Confident technical_support. Route.

Query: "What does the Pro plan cost and how do I cancel?"

Similarities: billing ~0.62, technical_support ~0.39, sales ~0.64.

Split between billing and sales, both below a reasonable confidence threshold (say 0.75). This is exactly the case the LLM fallback exists for. Hand it off.

Query: "I want to love your product but something's wrong."

Similarities: billing ~0.34, technical_support ~0.41, sales ~0.36.

All low. Fallback.

That fourth query illustrates why the threshold matters. Without it, the router would "confidently" pick technical_support at 0.41 — a score that, in absolute terms, says nothing is really matching. With the threshold, the router says I don't know and delegates, which is the correct behavior.

Tuning the threshold and handling ambiguous queries

The threshold is the load-bearing hyperparameter in the whole system, and it cannot be chosen a priori.

The procedure:

  • Assemble a held-out labeled set — 100-200 real user queries, each tagged with the correct route. If you have production logs, use those. If you do not, hand-label queries from early testing.
  • Run the router against the held-out set, record the winning similarity and predicted route per query.
  • Sweep the threshold from low to high. At each value, compute:
    - Precision on routed queries (of the queries we routed, how many went to the right place).
    - Recall (of queries that should have been routed, how many we actually routed instead of falling back).
    - Fallback rate (fraction of queries handed to the LLM).
  • Pick the threshold where precision crosses your tolerance. If you cannot tolerate more than 5% misroutes, find the threshold where precision hits 0.95 and accept whatever recall and fallback rate that gives you.
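A sketch of that sweep, assuming you have already run the router over the held-out set and recorded, per query, the winning similarity, the predicted route, and the gold label. The record shape and field names here are illustrative.

```python
# Threshold sweep over a labeled held-out set. Each record carries the
# router's winning similarity ("score"), its predicted route ("pred"),
# and the hand-labeled correct route ("gold").
def sweep(records, thresholds):
    results = []
    total = len(records)
    for t in thresholds:
        routed = [r for r in records if r["score"] >= t]
        correct = sum(1 for r in routed if r["pred"] == r["gold"])
        results.append({
            "threshold": t,
            # Of the queries we routed, how many went to the right place.
            "precision": correct / len(routed) if routed else 1.0,
            # Of all queries, how many we actually routed instead of falling back.
            "recall": len(routed) / total,
            # Complement of recall: fraction handed to the LLM.
            "fallback_rate": 1 - len(routed) / total,
        })
    return results
```

Pick the lowest threshold whose precision clears your tolerance, then read off the recall and fallback rate you are accepting at that point.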

Re-tune the threshold whenever you change routes, reference utterances, or the embedding model. Embedding models place the same text at different absolute similarity values, and a threshold tuned for text-embedding-3-small will not work for a newer model without re-sweeping.

For ambiguous queries specifically — where two routes are close — consider a margin check in addition to the threshold. If the top route scores 0.76 and the second scores 0.74, routing to the top is a coin flip dressed up as a decision. Fall back when the margin between first and second is below, say, 0.05. Margin failures are where most ambiguous-query errors concentrate.
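The combined check is small enough to state directly. A sketch, with the 0.75 threshold and 0.05 margin from the text as placeholder defaults:

```python
def confident(scores: dict, threshold: float = 0.75, margin: float = 0.05) -> bool:
    # scores: route -> similarity. Require both an absolute floor (threshold)
    # and a gap between the top two routes (margin) before trusting the pick.
    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    gap = top - ranked[1] if len(ranked) > 1 else top
    return top >= threshold and gap >= margin
```

A query scoring 0.76 vs 0.74 fails the margin check and falls back, even though it clears the threshold.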

Hybrid: semantic router as the fast path, LLM as fallback

The production shape is a two-stage cascade.

Stage 1: semantic router. Embed the query, compute similarity against each route, apply threshold and margin checks. If the query passes both, route directly and record the confidence.

Stage 2: LLM fallback. If the router falls back, invoke an LLM with the query, the route definitions, and (optionally) the router's top-K scores as a hint. The LLM output is the final routing decision. Log everything — the fallback triggered, the LLM's choice, and why.
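The two stages wire together as a thin dispatcher. In this sketch, `semantic_route`, `llm_route`, and `log` are hypothetical callables you supply — your router, your LLM call, and your logging sink:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Decision:
    route: str
    router_confidence: float
    stage: str  # "semantic" or "llm_fallback"

def cascade(query: str,
            semantic_route: Callable[[str], Tuple[Optional[str], float]],
            llm_route: Callable[[str], str],
            log: Callable[[dict], None]) -> Decision:
    # Stage 1: fast path. semantic_route returns (route or None, confidence).
    route, conf = semantic_route(query)
    if route is not None:
        log({"stage": "semantic", "route": route, "confidence": conf})
        return Decision(route, conf, "semantic")
    # Stage 2: the router abstained; the LLM makes the final call.
    choice = llm_route(query)
    log({"stage": "llm_fallback", "route": choice, "router_confidence": conf})
    return Decision(choice, conf, "llm_fallback")
```

The logged events are the raw material for the feedback loop described below: every `llm_fallback` record is a candidate reference utterance once its route is confirmed.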

The LLM fallback is also your feedback loop. Every time it fires, you have a labeled example: what route did the LLM (or the downstream human agent) ultimately assign? Feed the confirmed-correct ones back as new reference utterances for that route. Over time, the semantic router's coverage expands and the fallback rate drops.

This fits cleanly into the Agentic Prompt Stack pattern: routing is its own layer, with its own observability and its own evaluation. It also maps to the Level 3-4 disciplines in the Context Engineering Maturity Model — contextual routing decisions become measurable events, not implicit behavior buried in a system prompt.

If you are already running RAG, you likely have the infrastructure: an embedding service, a vector store, a similarity API. A semantic router is that same stack pointed at a tiny corpus of reference utterances instead of a documents corpus.

The open-source semantic-router library by Aurelio Labs implements this pattern directly, with threshold tuning and LLM fallback built in. Whether you adopt it or roll your own is a question of operational preference — the pattern is the part that matters.

Failure modes

Four anti-patterns that bite teams running semantic routers.

Stale reference utterances. The routes drift — new products launch, naming changes, user language evolves — but the reference set was frozen in month one. Routing quality degrades silently because the held-out eval set also came from month one. Re-sample reference utterances from recent production traffic every quarter, and re-run the threshold sweep.

Collapsing multi-intent queries. A user writes "I was charged twice AND my SSO is broken." The router picks one route, the user's other need is dropped. A semantic router is not built to handle compound queries cleanly — that is an LLM task. Detect this case (multiple routes above threshold, not just one) and fall back.
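Detecting the compound case is cheap once you have per-route scores. A sketch, reusing the illustrative 0.75 threshold:

```python
def is_multi_intent(scores: dict, threshold: float = 0.75) -> bool:
    # A compound query lights up more than one route at once; picking a
    # single winner would silently drop the user's other need.
    return sum(1 for s in scores.values() if s >= threshold) > 1
```

When this fires, fall back to the LLM (or split the query) rather than routing to the single top score.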

Centroid smearing on diverse routes. You used centroid matching because it was faster, but one of your routes covers semantically broad territory. The centroid ends up in a region of latent space that is actually closer to a sibling route's utterances than to its own. Switch to max-over-utterances or multi-centroid.

Embedding-model swap without re-calibration. A provider releases a better embedding model; you swap it in; the threshold you tuned six weeks ago is now wrong. Absolute similarity scales differ between models. Any model change requires re-embedding the reference set and re-sweeping the threshold. There is no shortcut.

Our position

  • Default to semantic routing for well-defined routes. Three to ten mostly-separable destinations is the exact shape a semantic router was built for. LLM routing is the wrong default here — you are paying reasoning tokens to do a classification job.
  • Make the LLM the fallback, not the default. A small, observable fraction of queries is genuinely ambiguous; that is the budget the LLM earns. If your fallback rate is 50%+ after tuning, the problem is your routes or reference utterances, not the routing strategy.
  • Threshold and margin are both required. Threshold alone lets close-call queries through as confident decisions. Margin alone lets low-absolute-similarity queries through as "best of bad options." Check both, fall back on either failing.
  • Treat routing as its own layer, with its own eval. Label a held-out set of 100-200 queries, track precision, recall, and fallback rate, and version both the reference set and the threshold with the code that loads them. The same discipline the SurePrompts Quality Rubric demands of prompts applies to routing: observable, repeatable, diffable.
  • Integrate into the prompt stack, not around it. Routing output selects which prompt, tools, and retrievers load for the rest of the turn. That selection belongs alongside the prompt structure itself — which is exactly what frameworks like RCAF and the Agentic Prompt Stack formalize. Route → role-aware prompt → tool-bounded action is the pattern, and the router's output feeds the other two.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
