Prefix-Tuning
Prefix-tuning is a parameter-efficient fine-tuning method in which a small set of continuous, trainable vectors (the "prefix") is prepended to the attention keys and values at every transformer layer while the underlying model weights stay frozen. Only the prefix parameters are updated during training, so each task is adapted at a small fraction of the compute and storage cost of full fine-tuning.
It is related to prompt-tuning, which trains soft prompts only at the input-embedding layer; prefix-tuning operates at every layer and is typically more expressive, at the price of more trainable parameters. Both methods let a single frozen base model serve many tasks by swapping lightweight task-specific prefixes.
Origin: Introduced by Li & Liang in "Prefix-Tuning: Optimizing Continuous Prompts for Generation" (ACL 2021), originally demonstrated on GPT-2 and BART for table-to-text and summarization.
How it works
1. A small matrix of trainable parameters (the prefix) is created for each transformer layer, typically tens to a few hundred "virtual tokens" long.
2. During training the base model weights stay frozen; gradients flow only into the prefix matrices, which the attention mechanism treats as additional keys and values at every layer.
3. Different tasks get different prefixes, each stored as a lightweight checkpoint (megabytes, not gigabytes), while a single base model held in memory serves them all.
4. At inference a routing step selects the right prefix for the request, prepends its activations at every layer, and runs a normal forward pass through the frozen base.
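The steps above can be sketched for a single attention head. This is a minimal illustration in plain Python, not the method's reference implementation: the function names (`attend`, `attend_with_prefix`) and the toy numbers are assumptions made for the example, and causal masking, multiple heads, and batching are left out.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for one head, using lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

def attend_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Prefix-tuning view of one layer: trainable prefix keys/values are
    concatenated in front of the keys/values produced by the frozen model."""
    return attend(queries, prefix_keys + keys, prefix_values + values)

# Toy data (hypothetical): two real tokens, one virtual prefix token.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
prefix_keys = [[0.5, 0.5]]        # would be trained
prefix_values = [[10.0, 10.0]]    # would be trained

plain = attend(queries, keys, values)
prefixed = attend_with_prefix(queries, keys, values, prefix_keys, prefix_values)
```

The output still has one vector per real token; the prefix only adds positions that the queries can attend to, which is why the frozen projection weights never need to change.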
Example
A company adapts a frozen 13B-parameter base model to three internal tasks (contract summarization, ticket classification, and release-note drafting) by training a separate 1M-parameter prefix per task. Each task prefix is a few megabytes on disk, versus tens of gigabytes for a fully fine-tuned copy of the model. At inference, a routing step loads the right prefix for each request, and all three tasks share the same base-model weights in memory.
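The storage figures in this example can be checked with back-of-the-envelope arithmetic. The sketch below assumes one key vector and one value vector per virtual token per layer, 2 bytes per parameter (16-bit weights), and a hypothetical 13B-class shape of 40 layers with hidden size 5120; none of these numbers come from the example itself.

```python
def prefix_param_count(num_layers, hidden_size, prefix_len):
    # One key vector and one value vector per virtual token, per layer.
    return num_layers * 2 * prefix_len * hidden_size

def checkpoint_megabytes(num_params, bytes_per_param=2):
    # Size on disk in decimal megabytes, assuming 16-bit parameters by default.
    return num_params * bytes_per_param / 1e6

# Hypothetical shape for a 13B-class model with a 10-token prefix.
prefix_params = prefix_param_count(num_layers=40, hidden_size=5120, prefix_len=10)
prefix_mb = checkpoint_megabytes(prefix_params)
full_mb = checkpoint_megabytes(13_000_000_000)

print(prefix_params)  # 4096000
print(prefix_mb)      # about 8.2 MB per task prefix
print(full_mb)        # 26000.0 MB, i.e. about 26 GB per full copy
```

Under these assumptions, three task prefixes add roughly 25 MB on top of one shared base model, whereas three fully fine-tuned copies would need roughly 78 GB.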
Not to be confused with
- Prompt-tuning: Trains soft prompts only at the input-embedding layer (one set of vectors, not one per layer). Cheaper but generally less expressive than prefix-tuning, especially on smaller base models.
- LoRA: Adds low-rank update matrices to the model's weight matrices (commonly the attention projections) rather than adding extra virtual tokens. LoRA changes the effective weights; prefix-tuning leaves the weights untouched and gives attention extra positions to read.
- Full fine-tuning: Updates every weight in the model, requiring a full-size checkpoint per task and substantially more compute. Prefix-tuning typically cuts the number of trainable parameters by 100–1000×, sometimes at the cost of some task quality.
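To make the LoRA contrast above concrete, here is a minimal sketch of a low-rank weight update in plain Python. The helper names (`matmul`, `lora_effective_weight`, `lora_trainable_params`) and the tiny matrices are illustrative assumptions, not any library's API.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A). W stays frozen; only B and A are trained."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def lora_trainable_params(d_out, d_in, rank):
    # B is (d_out x rank) and A is (rank x d_in).
    return rank * (d_in + d_out)

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
b = [[1.0], [0.0]]             # trainable, 2x1 (rank 1)
a = [[0.0, 2.0]]               # trainable, 1x2 (rank 1)

print(lora_effective_weight(w, b, a))       # [[1.0, 2.0], [0.0, 1.0]]
print(lora_trainable_params(5120, 5120, 8)) # 81920
```

Here the sequence length is unchanged and the projection itself is altered, whereas prefix-tuning keeps every projection as it was and lengthens the set of keys and values that attention sees.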