Supervising Strong Learners by Amplifying Weak Experts
AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact
Abstract: Many real-world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficu...
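The core idea in the abstract can be sketched as a recursive loop: a weak expert answers easy subquestions, and amplification combines those answers into a training signal for harder questions the expert could not evaluate directly. The sketch below is purely illustrative; `decompose`, `combine`, and `weak_expert` are hypothetical stand-ins, not the authors' implementation.

```python
def weak_expert(question):
    # Stand-in for a human or weak model that can only
    # answer sufficiently easy questions.
    return f"answer({question})"

def decompose(question):
    # Hypothetical: split a hard question into easier subquestions.
    return [f"{question}/sub{i}" for i in range(2)]

def combine(question, sub_answers):
    # Hypothetical: assemble subanswers into an answer
    # to the original question.
    return f"{question} -> " + " + ".join(sub_answers)

def amplify(solve, question, depth):
    # Amplification: answer a hard question by recursively
    # delegating easier subquestions to the current solver.
    if depth == 0:
        return solve(question)
    sub_answers = [amplify(solve, s, depth - 1)
                   for s in decompose(question)]
    return combine(question, sub_answers)

# One amplification step yields a signal for a question the
# weak expert could not have evaluated on its own.
signal = amplify(weak_expert, "hard-task", depth=1)
```

In the full scheme, the amplified solver is distilled back into a learned model, which then plays the role of `weak_expert` in the next iteration.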