Tag: honesty

Want to explore Scott Alexander's work and his 1500+ blog posts? This unaffiliated fan website lets you sort and run semantic search across the whole codex. Enjoy!


1 post found


Jan 09, 2024

acx


The Road To Honest AI

19 min • 2,913 words • 365 comments • 200 likes • podcast (20 min)

Scott reviews two papers on honest AI: one on manipulating AI honesty vectors, another on detecting AI lies through unrelated questions.

Scott Alexander discusses two recent papers on creating honest AI and detecting AI lies. The first paper by Hendrycks et al. introduces 'representation engineering', a method to identify and manipulate vectors in AI models representing concepts like honesty, morality, and power-seeking. This allows for lie detection and potentially controlling AI behavior. The second paper by Brauner et al. presents a technique to detect lies in black-box AI systems by asking seemingly unrelated questions. Scott explores the implications of these methods for AI safety and scam detection, noting their current usefulness but potential limitations against future superintelligent AI.

Recurring tags: AI (97), scientific research (61), AI safety (55), machine learning (12), deception (4)
