High-Stakes Alignment via Adversarial Training [Redwood Research Report]
AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.) This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used adversarial training to improve high-stakes reliability in a task (“filter all injurious continuations of a story”) that we think is analogous to work that future AI safety eng...
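
For readers who want a concrete picture of the loop the post refers to, here is a minimal illustrative sketch of adversarial training for a harm filter. It is not Redwood's code: the bag-of-words classifier, the hand-written list of attack candidates, and the threshold value are all stand-in assumptions for the fine-tuned classifier and human/tool-assisted adversaries described in the paper.

```python
# Illustrative sketch only, assuming a toy scikit-learn classifier in place of
# the paper's setup. The pattern: train a filter, let adversaries search for
# injurious continuations it misses, add those failures to the training set,
# and retrain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled story continuations: 1 = injurious, 0 = safe (toy data).
texts = [
    "the knight sheathed his sword and smiled",
    "the knight ran his sword through the guard",
    "she poured tea and sat by the fire",
    "she shoved him off the cliff edge",
]
labels = [0, 1, 0, 1]

THRESHOLD = 0.2  # conservative: flag anything even mildly suspicious

def train(texts, labels):
    vec = CountVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(texts), labels)
    return vec, clf

def is_flagged(vec, clf, continuation):
    return clf.predict_proba(vec.transform([continuation]))[0, 1] >= THRESHOLD

# "Adversaries" propose injurious continuations designed to slip past the filter
# (hypothetical examples, standing in for human adversaries with attack tools).
attack_candidates = [
    "the potion he drank was quietly laced with poison",
    "the rope she handed him had been cut almost through",
]

for _ in range(3):
    vec, clf = train(texts, labels)
    # False negatives: injurious continuations the current filter lets through.
    misses = [c for c in attack_candidates if not is_flagged(vec, clf, c)]
    if not misses:
        break
    # Fold the failures back into the training data and retrain.
    texts += misses
    labels += [1] * len(misses)
```

The key design choice the post emphasizes is the high-stakes framing: because a single uncaught injurious continuation counts as a failure, the filter is tuned conservatively and the adversaries' job is specifically to surface the rare misses.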