Direct Preference Optimization (DPO)
Direct preference optimization (DPO) is a training technique that aligns AI models with human preferences by learning directly from pairs of preferred and rejected outputs, without training a separate reward model. Unlike RLHF, which fits a reward model as an intermediate step and then optimizes against it with reinforcement learning, DPO collapses alignment into a single supervised learning objective, making training simpler and more stable.
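The objective can be sketched for a single preference pair. This is a minimal illustration in plain Python, not a full training loop: it assumes you already have the total log-probability each model assigns to the chosen and rejected responses, and the default beta value of 0.1 is just a common choice, not prescribed here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are total log-probabilities of each full response under the
    policy being trained and a frozen reference model. beta controls how
    far the policy may drift from the reference.
    """
    # Implicit reward of each response: beta-scaled log-ratio vs. the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # makes the chosen response more likely relative to the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the reward is defined by the policy itself relative to the reference model, minimizing this loss with ordinary gradient descent replaces the reward-model-plus-RL pipeline of RLHF.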
Example
During DPO training, the model sees pairs of responses to the same prompt: one that human evaluators preferred and one they rejected. For example, given "Explain gravity simply," the preferred response uses a clear analogy while the rejected one uses dense jargon. Over many such pairs, the model learns to generalize the preferred style to new prompts.
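A training example like the one above is typically stored as a prompt with a chosen and a rejected response. The field names and response texts below are illustrative; exact schemas vary by library.

```python
# One DPO preference pair (hypothetical texts; field names vary by library).
preference_pair = {
    "prompt": "Explain gravity simply",
    "chosen": "Gravity is like an invisible hand that pulls things "
              "toward each other; the heavier the object, the harder it pulls.",
    "rejected": "Gravitation denotes the curvature-induced geodesic "
                "convergence of mass-energy distributions in spacetime.",
}
```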