How to avoid getting lost reading Scott Alexander and his 1500+ blog posts? This unaffiliated fan website lets you sort and search through the whole codex. Enjoy!

See also Top Posts and All Tags.

Minutes:
Blog:
Year:
Show all filters
1 posts found
Jan 09, 2024
acx
23 min 2,913 words 365 comments 200 likes podcast
Scott reviews two papers on honest AI: one on manipulating AI honesty vectors, another on detecting AI lies through unrelated questions. Longer summary
Scott Alexander discusses two recent papers on creating honest AI and detecting AI lies. The first paper by Hendrycks et al. introduces 'representation engineering', a method to identify and manipulate vectors in AI models representing concepts like honesty, morality, and power-seeking. This allows for lie detection and potentially controlling AI behavior. The second paper by Brauner et al. presents a technique to detect lies in black-box AI systems by asking seemingly unrelated questions. Scott explores the implications of these methods for AI safety and scam detection, noting their current usefulness but potential limitations against future superintelligent AI. Shorter summary