nxted

RLHF (reinforcement learning from human feedback)

RLHF improves an AI model by training it against human judgements of its outputs: humans rank or rate responses, a reward model learns those preferences, and the model is optimised toward them. The quality of the result depends heavily on who provides the feedback and how it is measured.

The modern recipe was popularised by OpenAI’s InstructGPT. For high-risk domains, domain-expert evaluation catches errors that generalist crowds miss.