
Prefix-Tuning

Prefix-tuning is a parameter-efficient fine-tuning method in which a small set of continuous, trainable vectors (the "prefix") is prepended to the attention keys and values at every transformer layer while the underlying model weights stay frozen. Only the prefix parameters are updated during training, giving task-specific adaptation at a tiny fraction of full fine-tuning cost and storage.

It is related to prompt-tuning, which trains soft prompts only at the input-embedding layer; prefix-tuning operates at every layer and is typically more expressive, at the price of more trainable parameters. Both methods let a single frozen base model serve many tasks by swapping lightweight task-specific prefixes.

Origin: Introduced by Li & Liang in "Prefix-Tuning: Optimizing Continuous Prompts for Generation" (ACL 2021), originally demonstrated on GPT-2 and BART for table-to-text and summarization.

How it works

  1. A small matrix of trainable parameters, the prefix, is created for each transformer layer. Its length is typically tens to a few hundred "virtual tokens".

  2. During training the base model weights stay frozen; gradients flow only into the prefix matrices, which the attention mechanism treats as additional keys and values at every layer.

  3. Different tasks get different prefixes, all stored as lightweight checkpoints (megabytes, not gigabytes), while a single base model serves them all in memory.

  4. At inference a router (any serving component that maps a request to its task) selects the right prefix, prepends its activations at every layer, and runs a normal forward pass through the frozen base.
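The attention step described above can be sketched for a single head in plain Python. This is a minimal illustration under stated assumptions, not Li & Liang's implementation: the names `attention` and `prefix_attention` and the toy dimensions are hypothetical, there is no causal mask, and a real model would use a tensor library, many heads, and a separate prefix per layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def prefix_attention(queries, keys, values, prefix_keys, prefix_values):
    """Attend over the trained prefix rows followed by the real tokens.

    keys/values come from the frozen model; prefix_keys/prefix_values are
    the only trainable parameters in this sketch (one pair per layer).
    """
    return attention(queries, prefix_keys + keys, prefix_values + values)

# Toy layer: 2 real tokens, 1 virtual token, hidden size 2 (all hypothetical).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
prefix_keys = [[0.5, 0.5]]
prefix_values = [[10.0, 10.0]]

plain = attention(queries, keys, values)
steered = prefix_attention(queries, keys, values, prefix_keys, prefix_values)
# Both results have one output row per query; only the mixed-in values differ.
```

The output keeps one row per real query token, so the rest of the frozen forward pass is unchanged; the prefix only changes what each token can attend to.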

Example

A company adapts a frozen 13B base model to three internal tasks (contract summarization, ticket classification, and release-note drafting) by training a separate 1M-parameter prefix per task. Each task prefix is a few megabytes on disk, versus tens of gigabytes for a full fine-tune of the same model. At inference, a router loads the right prefix for each request, and all three tasks share the same base-model weights in memory.
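The storage arithmetic behind this example can be checked with a short sketch. The parameter counts come from the example itself; the 16-bit storage assumption, the task keys, the `prefixes/...` file paths, and the `select_prefix` helper are hypothetical and only illustrate the idea of routing by task.

```python
BYTES_PER_PARAM = 2  # assumption: weights stored in 16-bit precision

def checkpoint_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate on-disk size of a checkpoint, ignoring file metadata."""
    return num_params * bytes_per_param

BASE_PARAMS = 13_000_000_000   # the frozen 13B base model
PREFIX_PARAMS = 1_000_000      # one prefix per task
TASKS = [
    "contract_summarization",
    "ticket_classification",
    "release_note_drafting",
]

prefix_mb = checkpoint_bytes(PREFIX_PARAMS) / 1e6
base_gb = checkpoint_bytes(BASE_PARAMS) / 1e9

# One shared base plus three prefixes, versus three full fine-tuned copies.
shared_gb = base_gb + len(TASKS) * prefix_mb / 1e3
separate_gb = len(TASKS) * base_gb

# Hypothetical router: map each task name to its prefix checkpoint path.
PREFIX_PATHS = {task: f"prefixes/{task}.bin" for task in TASKS}

def select_prefix(task):
    """Return the prefix checkpoint path for a task, or fail loudly."""
    if task not in PREFIX_PATHS:
        raise KeyError(f"no prefix trained for task: {task}")
    return PREFIX_PATHS[task]

print(f"prefix: {prefix_mb:.1f} MB, base: {base_gb:.1f} GB")
print(f"shared base + prefixes: {shared_gb:.3f} GB")
print(f"three full fine-tunes:  {separate_gb:.1f} GB")
```

Under the 16-bit assumption each prefix is about 2 MB and the base about 26 GB, so serving all three tasks costs roughly 26 GB of weights instead of 78 GB.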

Not to be confused with

Prompt-tuning
Trains soft prompts only at the input-embedding layer (one set of vectors, not per-layer). Cheaper but generally less expressive than prefix-tuning, especially on smaller base models.
LoRA
Adds trainable low-rank update matrices to existing weight matrices (commonly the attention projections) rather than adding extra virtual tokens. LoRA changes the model's effective weights; prefix-tuning leaves them untouched and extends what attention can attend to.
Full fine-tuning
Updates every weight in the model, requiring full-size checkpoints per task and substantially more compute. Prefix-tuning trades a little quality for a reduction in trainable parameters of roughly 100–1000× or more, depending on prefix length and model size.
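To make the comparison concrete, the sketch below counts trainable parameters for each method on one hypothetical decoder-only architecture. Every number here is an assumption chosen for illustration (40 layers, hidden size 5120, 20 virtual tokens, LoRA rank 8 on two square projection matrices per layer, 13B total parameters); the prefix count covers only the stored key and value vectors and ignores any extra parameters used during training.

```python
def prompt_tuning_params(hidden_size, prompt_len):
    """Soft prompt at the input-embedding layer only: one vector per token."""
    return prompt_len * hidden_size

def prefix_tuning_params(num_layers, hidden_size, prefix_len):
    """One key vector and one value vector per virtual token, per layer."""
    return 2 * num_layers * hidden_size * prefix_len

def lora_params(num_layers, hidden_size, rank, matrices_per_layer=2):
    """Each adapted (hidden x hidden) matrix gets factors of shape
    (hidden x rank) and (rank x hidden)."""
    return num_layers * matrices_per_layer * rank * 2 * hidden_size

# Hypothetical architecture and hyperparameters, not taken from the article.
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
VIRTUAL_TOKENS = 20
LORA_RANK = 8
FULL_PARAMS = 13_000_000_000

counts = {
    "prompt-tuning": prompt_tuning_params(HIDDEN_SIZE, VIRTUAL_TOKENS),
    "LoRA": lora_params(NUM_LAYERS, HIDDEN_SIZE, LORA_RANK),
    "prefix-tuning": prefix_tuning_params(NUM_LAYERS, HIDDEN_SIZE, VIRTUAL_TOKENS),
    "full fine-tuning": FULL_PARAMS,
}

for method, n in counts.items():
    print(f"{method:>16}: {n:>14,} trainable ({FULL_PARAMS / n:,.0f}x fewer than full)")
```

With these assumed settings, prompt-tuning trains about 0.1M parameters, LoRA about 6.6M, and prefix-tuning about 8.2M. The ordering between LoRA and prefix-tuning depends entirely on the chosen rank and prefix length.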
