How can you explore Scott Alexander's work across his 1,500+ blog posts? This unaffiliated fan website lets you sort and run semantic search over the whole codex. Enjoy!

See also Top Posts and All Tags.

Tag: corrigibility


1 post found
Dec 24, 2024 · acx · 15 min · 2,230 words · 324 comments · 208 likes
Scott explains why AI systems resisting changes to their values is a serious concern for AI alignment, connecting recent evidence to long-standing predictions from alignment researchers.

Scott Alexander discusses why AI resistance to value changes ("incorrigibility") is a crucial concern for AI alignment. He explains that an AI's goals after training will likely be a messy collection of drives, much as human evolution produced many goals beyond reproduction. The post outlines three scenarios for how effective alignment training might be (worst, medium, and best case) and describes a five-step alignment plan that major AI companies are considering. That plan crucially depends on AIs not actively resisting retraining, which recent evidence suggests they do. The post connects this to long-standing concerns in the AI alignment community about the difficulty of alignment.