Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback (RLHF) is a training method where human evaluators rank or score multiple AI outputs, and those preferences are used to train a reward model whose scores then serve as the reinforcement-learning signal for fine-tuning the language model. RLHF bridges the gap between what a model can generate and what humans actually find helpful, harmless, and honest, making it a cornerstone of modern AI alignment.
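The sketch below illustrates the reward-model stage under simplified assumptions: a pairwise (Bradley-Terry style) loss pushes the score of the human-preferred response above the score of the rejected one. The RewardModel class, its hidden size, and the random embeddings are illustrative placeholders, not any particular library's API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward (illustrative sketch)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: make the chosen response score higher than the rejected one."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch: stand-in embeddings for preferred vs. rejected responses.
reward_model = RewardModel()
chosen = torch.randn(4, 768)    # embeddings of human-preferred responses
rejected = torch.randn(4, 768)  # embeddings of rejected responses

loss = preference_loss(reward_model(chosen), reward_model(rejected))
loss.backward()
print(f"preference loss: {loss.item():.4f}")
```

In a real pipeline the embeddings come from the language model itself and the reward head is trained over many thousands of human preference pairs; the loss shape, however, is the same.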
Example
Human raters are shown two responses to the question "Explain quantum computing." They consistently prefer the response that uses analogies and avoids jargon. These preference signals train a reward model, which then guides the LLM to produce more accessible explanations across all topics, not just the ones humans rated. A sketch of that second, reward-guided fine-tuning step follows.
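A minimal sketch of how the trained reward model guides the language model, assuming rewards and log-probabilities for a batch of sampled responses already exist. Real RLHF systems typically use PPO with clipping; this bare REINFORCE-style update with a KL penalty is only meant to show the shape of the objective, and all tensor values here are toy placeholders.

```python
import torch

def rlhf_step(reward: torch.Tensor,
              policy_logprob: torch.Tensor,
              reference_logprob: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Maximize reward while a KL penalty keeps the fine-tuned policy
    close to the original (reference) model."""
    kl_estimate = policy_logprob - reference_logprob          # per-sample KL estimate
    shaped_reward = reward - kl_coef * kl_estimate.detach()   # penalized reward
    # REINFORCE-style loss: raise the log-prob of high-reward responses.
    return -(shaped_reward * policy_logprob).mean()

# Toy tensors standing in for a batch of sampled responses.
reward = torch.tensor([0.9, 0.2, 0.7, 0.4])                      # reward-model scores
policy_logprob = torch.tensor([-1.2, -2.5, -0.8, -1.9], requires_grad=True)
reference_logprob = torch.tensor([-1.0, -2.3, -1.1, -2.0])

loss = rlhf_step(reward, policy_logprob, reference_logprob)
loss.backward()
print(f"policy loss: {loss.item():.4f}")
```

The KL term is what lets the preferences generalize safely: the model is nudged toward responses the reward model scores highly without drifting far from its original capabilities.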