How can you explore Scott Alexander's work across his 1,500+ blog posts? This unaffiliated fan website lets you sort and run semantic search over the whole codex. Enjoy!

See also Top Posts and All Tags.

Tag: corrigibility


1 post found
Dec 24, 2024 · acx · 15 min · 2,230 words · 324 comments · 208 likes
Scott explains why AI systems resisting changes to their values is a serious concern for AI alignment, connecting recent evidence to long-standing predictions from alignment researchers.

Scott Alexander discusses why AI resistance to value changes ("incorrigibility") is a crucial concern for AI alignment. He explains that an AI's goals after training will likely be a messy collection of drives, much as human evolution produced many goals beyond reproduction. The post outlines three scenarios for how effective alignment training might be (worst, medium, and best case) and describes a five-step alignment plan that major AI companies are considering. That plan crucially depends on AIs not actively resisting retraining, which recent evidence suggests they do. The post connects this to long-standing concerns in the AI alignment community about the difficulty of alignment.