Scott Alexander examines Redwood Research's attempt to create an AI that avoids generating violent content, using Alex Rider fanfiction as training data.
Longer summary
Scott Alexander reviews Redwood Research's project to create an AI that can classify and avoid violent content in text completions, using Alex Rider fanfiction as training data. The project aimed to test whether AI alignment through reinforcement learning could work, but it ultimately failed to produce an unbeatable violence classifier. The article explores the challenges the team faced, the methods they used, and the implications for broader AI alignment efforts.
Shorter summary