Glossary — Agentic AI

What is RLHF?


Reinforcement Learning from Human Feedback (RLHF) is a training technique that aligns LLM outputs with human preferences by training a reward model on human comparisons, then optimizing the LLM against that reward.

WHY IT MATTERS

RLHF is a key technique behind modern assistant-style LLMs. Raw pre-trained models are powerful but uncontrolled. RLHF aligns the model with human expectations of helpful, harmless, and honest behavior.

The process: human raters compare model outputs and indicate preferences. These preferences train a reward model. The LLM is then fine-tuned using reinforcement learning to maximize that reward — learning to produce outputs humans prefer.
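The reward-model step above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the standard objective (Bradley-Terry) pushes the score of the human-preferred output above the rejected one, and the scalar scores here stand in for a real neural reward model.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    # Training minimizes this, so the preferred output's score is pushed
    # above the rejected output's score.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Preferred output already scored higher -> small loss:
print(round(reward_model_loss(2.0, -1.0), 4))   # 0.0486
# Ranking inverted -> large loss, driving the scores apart:
print(round(reward_model_loss(-1.0, 2.0), 4))   # 3.0486
```

Once the reward model is trained, an RL algorithm (commonly PPO) fine-tunes the LLM to maximize this learned reward, usually with a penalty that keeps it close to the original model.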

RLHF (together with supervised instruction tuning) accounts for much of the dramatic difference between base models (unpredictable) and assistant models (helpful, instruction-following). It's also a major reason different models have different 'personalities.'

FREQUENTLY ASKED QUESTIONS

What's the difference between RLHF and DPO?
RLHF trains a separate reward model then optimizes against it. DPO skips the reward model, directly training from preference pairs. DPO is simpler and increasingly popular.
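The DPO objective can be sketched directly, again as a minimal illustration with scalar log-probabilities standing in for real model outputs. The implicit "reward" is the log-probability ratio between the policy being trained and a frozen reference model, so no separate reward model or RL loop is needed.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # DPO loss: -log sigmoid(beta * (policy/reference log-ratio margin)).
    # The margin compares how much more the policy favors the chosen
    # response (relative to the reference model) than the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# As the policy shifts probability toward the chosen response, loss falls:
before = dpo_loss(-5.0, -5.0, -5.0, -5.0)
after = dpo_loss(-3.0, -6.0, -5.0, -5.0)
print(after < before)  # True
```

The `beta` parameter controls how far the policy is allowed to drift from the reference model; names here are illustrative.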
Does RLHF make models safe?
It improves safety but doesn't guarantee it. Aligned models can still be jailbroken and make mistakes. RLHF is a training-time measure, not a runtime guarantee.
Can RLHF train agents to follow spending rules?
In theory, but it's too coarse for precise financial rules. Hard-coded policy enforcement is more reliable than learned preferences for financial constraints.

BUILD WITH POLICYLAYER

Non-custodial spending controls for AI agents. Setup in 5 minutes.

Get Started