Prefix-Tuning
Prefix-tuning is a parameter-efficient fine-tuning method in which a small set of continuous, trainable vectors (the "prefix") is prepended to the attention keys and values at every transformer layer while the underlying model weights stay frozen. Only the prefix parameters are updated during training, giving task-specific adaptation at a tiny fraction of the cost and storage of full fine-tuning.
It is related to prompt-tuning, which trains soft prompts only at the input-embedding layer; prefix-tuning operates at every layer and is typically more expressive, at the price of more trainable parameters. Both methods let a single frozen base model serve many tasks by swapping lightweight task-specific prefixes.
Origin: Introduced by Li & Liang in "Prefix-Tuning: Optimizing Continuous Prompts for Generation" (ACL 2021), originally demonstrated on GPT-2 for table-to-text generation and on BART for summarization.
How it works
1. A small matrix of trainable parameters (the prefix) is created for each transformer layer, typically tens to a few hundred "virtual tokens" long.
2. During training the base model weights stay frozen; gradients flow only into the prefix matrices, which the attention mechanism treats as additional keys and values at every layer.
3. Different tasks get different prefixes, all stored as lightweight checkpoints (megabytes, not gigabytes), while a single base model held in memory serves them all.
4. At inference a router selects the right prefix for the request, prepends its activations at every layer, and runs a normal forward pass through the frozen base.
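The steps above can be sketched for a single attention layer. This is a toy illustration in pure Python, not the authors' implementation: it uses one head and one layer, the prefix values are hard-coded rather than trained, and the names `PREFIXES`, `attend_with_prefix`, and `run_layer` are hypothetical. It only shows how prefix keys and values are concatenated in front of the ones the frozen model computes, and how a different prefix can be swapped in per task.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def attend_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Prepend the prefix keys/values to the ones the frozen layer produced."""
    return attend(queries, prefix_keys + keys, prefix_values + values)

# Hypothetical per-task prefixes for one layer: 2 virtual tokens of width 2.
# In real prefix-tuning these numbers are learned; here they are made up.
PREFIXES = {
    "summarize": {"keys": [[1.0, 0.0], [0.0, 1.0]],
                  "values": [[0.5, 0.5], [-0.5, 0.5]]},
    "classify": {"keys": [[-1.0, 0.0], [0.0, -1.0]],
                 "values": [[1.0, 0.0], [0.0, 1.0]]},
}

# Stand-ins for what the frozen layer computes from 3 real input tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def run_layer(task):
    """Select the prefix for a task and run attention with it."""
    p = PREFIXES[task]
    return attend_with_prefix(queries, keys, values, p["keys"], p["values"])

base_out = attend(queries, keys, values)   # no prefix
summ_out = run_layer("summarize")          # same frozen inputs, task prefix A
clas_out = run_layer("classify")           # same frozen inputs, task prefix B
```

The output still has one row per real token; the prefix only changes what each token can attend to. In an actual model the same concatenation happens in every layer and head, and only the prefix entries receive gradients.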
Example
A company adapts a frozen 13B base model to three internal tasks (contract summarization, ticket classification, and release-note drafting) by training a separate 1M-parameter prefix per task. Each task prefix is a few megabytes on disk, versus tens of gigabytes for a full fine-tune of the same model. At inference, a router loads the right prefix for each request and all three tasks share the same base-model weights in memory.
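The storage figures in this example follow from simple arithmetic. The sketch below assumes 2 bytes per parameter (fp16 or bf16), which is an assumption and not something the example states; the helper name `checkpoint_megabytes` is hypothetical.

```python
def checkpoint_megabytes(num_params, bytes_per_param=2):
    """Approximate checkpoint size in MB (10**6 bytes).

    Assumes every parameter is stored at the same width; 2 bytes
    corresponds to fp16/bf16 and is an assumption of this sketch.
    """
    return num_params * bytes_per_param / 1e6

prefix_params = 1_000_000        # one task prefix, from the example
base_params = 13_000_000_000     # the frozen 13B base model

prefix_mb = checkpoint_megabytes(prefix_params)   # per-task prefix
base_mb = checkpoint_megabytes(base_params)       # one full copy of the model

three_prefixes_mb = 3 * prefix_mb                 # prefix-tuning, three tasks
three_full_finetunes_mb = 3 * base_mb             # full fine-tuning, three tasks

fraction_trainable = prefix_params / base_params  # share of weights trained
```

Under these assumptions each prefix is about 2 MB and a full copy of the model about 26 GB, so three prefixes add only a few megabytes on top of the one shared base model, while three full fine-tunes would need three model-sized checkpoints.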
Not to be confused with
- Prompt-tuning: Trains soft prompts only at the input-embedding layer (one set of vectors, not one per layer). Cheaper but generally less expressive than prefix-tuning, especially on smaller base models.
- LoRA: Adds low-rank update matrices to existing weight matrices (most often the attention projections) rather than adding extra virtual tokens. LoRA changes the effective weights; prefix-tuning leaves the weights untouched and extends the context that attention sees.
- Full fine-tuning: Updates every weight in the model, requiring full-size checkpoints per task and substantially more compute. Prefix-tuning trades a little quality for a roughly 100–1000× reduction in trainable parameters.
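To make the comparison concrete, the trainable-parameter counts can be estimated with back-of-the-envelope formulas. The model shape below (40 layers, hidden size 5120), the prefix length of 20, and the LoRA rank of 8 applied to two projection matrices per layer are all hypothetical choices for illustration. The prefix-tuning count ignores any reparameterization network used only during training.

```python
def prompt_tuning_params(prefix_len, d_model):
    """Soft prompt at the embedding layer only: one vector per virtual token."""
    return prefix_len * d_model

def prefix_tuning_params(prefix_len, d_model, n_layers):
    """One key vector and one value vector per virtual token, per layer."""
    return 2 * n_layers * prefix_len * d_model

def lora_params(rank, d_model, n_layers, matrices_per_layer=2):
    """Two low-rank factors (d_model x rank and rank x d_model) per adapted matrix."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# Hypothetical model shape and hyperparameters, chosen only for illustration.
N_LAYERS, D_MODEL = 40, 5120
PREFIX_LEN, LORA_RANK = 20, 8

counts = {
    "prompt-tuning": prompt_tuning_params(PREFIX_LEN, D_MODEL),
    "prefix-tuning": prefix_tuning_params(PREFIX_LEN, D_MODEL, N_LAYERS),
    "LoRA": lora_params(LORA_RANK, D_MODEL, N_LAYERS),
}

for method, n in counts.items():
    print(f"{method}: {n:,} trainable parameters")
```

With these settings prompt-tuning trains the fewest parameters, while prefix-tuning and LoRA land in the same order of magnitude; the exact ranking depends on the prefix length and rank chosen.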