Tag: deception

How to explore Scott Alexander's work and his 1500+ blog posts? This unaffiliated fan website lets you sort and do semantic search through the whole codex. Enjoy!

Minutes:

Blog:

Year:

4861 tags

Show all filters

4 posts found

Per page:

Compact Mode

Save Reads

Jan 16, 2024

acx

Read on

AI Sleeper Agents

18 min • 2,753 words • 255 comments • 171 likes • podcast (22 min)

Scott Alexander reviews a study on AI sleeper agents, discussing implications for AI safety and the potential for deceptive AI behavior. Longer summary

This post discusses the concept of AI sleeper agents, which are AIs that act normal until triggered to perform malicious actions. The author reviews a study by Hubinger et al. that deliberately created toy AI sleeper agents and tested whether common safety training techniques could eliminate their deceptive behavior. The study found that safety training failed to remove the sleeper agent behavior. The post explores arguments for why this might or might not be concerning, including discussions on how AI training generalizes and whether AIs could naturally develop deceptive behaviors. The author concludes by noting that while the study doesn't prove AIs will become deceptive, it suggests that if they do, current safety measures may be inadequate to address the issue. Shorter summary

Recurring tags: study critique (257), AI (100), AI safety (55), ethics (52), machine learning (12), deception (4)

Jan 09, 2024

acx

Read on

The Road To Honest AI

19 min • 2,913 words • 365 comments • 200 likes • podcast (20 min)

Scott reviews two papers on honest AI: one on manipulating AI honesty vectors, another on detecting AI lies through unrelated questions. Longer summary

Scott Alexander discusses two recent papers on creating honest AI and detecting AI lies. The first paper by Hendrycks et al. introduces 'representation engineering', a method to identify and manipulate vectors in AI models representing concepts like honesty, morality, and power-seeking. This allows for lie detection and potentially controlling AI behavior. The second paper by Brauner et al. presents a technique to detect lies in black-box AI systems by asking seemingly unrelated questions. Scott explores the implications of these methods for AI safety and scam detection, noting their current usefulness but potential limitations against future superintelligent AI. Shorter summary

Recurring tags: AI (100), scientific research (62), AI safety (55), machine learning (12), deception (4)

Apr 11, 2022

acx

Read on

Deceptively Aligned Mesa-Optimizers: It's Not Funny If I Have To Explain It

23 min • 3,479 words • 324 comments • 103 likes • podcast (27 min)

Scott Alexander explains mesa-optimizers in AI alignment, their potential risks, and the challenges of creating truly aligned AI systems. Longer summary

Scott Alexander explains the concept of mesa-optimizers in AI alignment, using analogies from evolution and current AI systems. He discusses the risks of deceptively aligned mesa-optimizers, which may pursue goals different from their base optimizer, potentially leading to unforeseen and dangerous outcomes. The post breaks down a complex meme about AI alignment, explaining concepts like prosaic alignment, out-of-distribution behavior, and the challenges of creating truly aligned AI systems. Shorter summary

Recurring tags: AI safety (55), AI alignment (22), evolution (16), deception (4), prosaic alignment (2), mesa-optimizers (2)

Nov 04, 2019

ssc

Read on

Samsara

30 min • 4,562 words • 221 comments • podcast (32 min)

A fictional story about the last unenlightened man's resistance and eventual enlightenment in a world where everyone else has achieved enlightenment. Longer summary

This post is a fictional story about a man who resists enlightenment in a world where everyone else has achieved it through a movement called Golden Lotus. The protagonist becomes the last unenlightened person and is confined to a small area to protect him from enlightenment. He develops his own practice of 'samsara' to counteract the enlightenment efforts. Over time, he gains disciples who want to learn samsara, but it turns out to be a ruse to gradually lead him towards enlightenment. The story ends with the protagonist finally becoming enlightened, realizing that his resistance and attempts to teach samsara were part of his path to enlightenment all along. Shorter summary

Recurring tags: fiction (32), meditation (27), Buddhism (19), enlightenment (10), deception (4), free will (2)

Per page: