
Direct Preference Optimization (DPO)

Direct preference optimization (DPO) is a training technique that aligns AI models with human preferences by learning directly from pairs of preferred and rejected outputs, without needing a separate reward model. Unlike RLHF, which trains a reward model as an intermediate step, DPO reduces alignment to a single supervised learning objective, making it faster and more stable to train.
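That objective rewards the policy for assigning relatively more probability to the preferred response than to the rejected one, measured against a frozen reference model. Below is a minimal PyTorch sketch of the DPO loss; the function and argument names and the beta value are illustrative, not any specific library's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs (illustrative sketch).

    Each argument is a 1-D tensor of summed token log-probabilities,
    one entry per (prompt, response) pair.
    """
    # Log-ratio of the policy vs. the frozen reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push the margin between preferred and rejected responses apart
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Here beta controls how far the policy may drift from the reference model: larger values enforce the preference margin more aggressively, smaller values keep the policy closer to its starting point.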

Example

During DPO training, the model sees pairs of responses to the same prompt: one that human evaluators preferred and one they rejected. For example, given the prompt "Explain gravity simply," the preferred response uses a clear analogy while the rejected one uses dense jargon. Over many such pairs, the model learns to favor the preferred style and generalizes it to new prompts.
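As a concrete illustration, one such preference pair might be stored as a record like the one below; the prompt/chosen/rejected field names follow a common dataset convention but are an assumption here, not a fixed standard.

```python
# A hypothetical preference pair for the "Explain gravity simply" prompt.
# Field names are illustrative, not tied to a specific library or dataset.
preference_pair = {
    "prompt": "Explain gravity simply",
    "chosen": "Gravity is like a heavy ball on a trampoline: it bends "
              "the surface, so nearby objects roll toward it.",
    "rejected": "Gravity is the curvature of the pseudo-Riemannian "
                "spacetime manifold induced by the stress-energy tensor.",
}
```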
