Takeaways From Our Robust Injury Classifier Project [Redwood Research]

AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact

With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injurious completions. If we had accomplished that, we think it would have been the first demonstration of a deep learning system avoiding a difficult-to-formalize catastrophe with an ultra-high level of reliability. Presumably, we would have needed to invent n...
