No Time Like The Present For AI Safety Work

On the recent post on AI risk, a commenter challenged me to give the short version of the argument for taking it seriously. I said something like:

1. If humanity doesn’t blow itself up, eventually we will create human-level AI.

2. If humanity creates human-level AI, technological progress will continue and eventually reach far-above-human-level AI

3. If far-above-human-level AI comes into existence, eventually it will so overpower humanity that our existence will depend on its goals being aligned with ours

4. It is possible to do useful research now which will improve our chances of getting the AI goal alignment problem right

5. Given that we can start research now we probably should, since leaving it until there is a clear and present need for it is unwise

I placed very high confidence (>95%) on each of the first three statements – they’re just saying that if trends continue moving towards a certain direction without stopping, eventually they’ll get there. I had lower confidence (around 50%) on the last two statements.

Commenters tended to agree with this assessment; nobody wanted to seriously challenge any of 1-3, but a lot of people said they just didn’t think there was any point in worrying about AI now. We ended up in an extended analogy about illegal computer hacking. It’s a big problem that we’ve never been able to fully address – but if Alan Turing had gotten it into his head to try to solve it in 1945, his ideas might have been along the lines of “Place your punch cards in a locked box where German spies can’t read them.” Wouldn’t trying to solve AI risk in 2015 end in something equally cringeworthy?

Maybe. But I disagree for a couple reasons, some of them broad and meta-level, some of them more focused and object level. The most important meta-level consideration is: if you’re accepting points 1 to 3 – that is, you accept that eventually the human race is going to go extinct or worse if we can’t figure out AI goal alignment – do you really think our chances of making a dent in the problem today are so low that saying “Yes, we’re on a global countdown to certain annhilation, but it would be an inefficient use of resources to even investigate if we could do anything about it at this point”? What is this amazing other use of resources that you prefer? Like, go on and grumble about Pascal’s Wager, but you do realize we just paid Floyd Mayweather ten times more money than has been spent on AI risk total throughout all of human history to participate in a single boxing fight, right?

(if AI boxing got a tenth as much attention, or a hundredth as much money, as AI boxing, the world would be a much safer place)

But I want to make a stronger claim: not just that dealing with AI risk is more important than boxing, but that it is as important as all the other things we consider important, like curing diseases and detecting asteroids and saving the environment. That requires at least a little argument for why progress should indeed be possible at this early stage.

And I think progress is possible insofar as this is a philosophical and not a technical problem. Right now the goal isn’t “write the code that will control the future AI”, it’s “figure out the broad category of problem we have to deal with.” Let me give some examples of open problems to segue into a discussion of why these problems are worth working on now.

II.

Problem 1: Wireheading

Some people have gotten electrodes implanted in their brains for therapeutic or research purposes. When the electrodes are in certain regions, most notably the lateral hypothalamus, the people become obsessed with stimulating them as much as possible. If you give them the stimulation button, they’ll press it thousands of times per hour; if you try to take the stimulation button away from them, they’ll defend it with desperation and ferocity. Their life and focus narrows to a pinpoint, normal goals like love and money and fame and friendship forgotten in the relentless drive to stimulate the electrode as much as possible.

This fits pretty well with what we know of neuroscience. The brain (OVERSIMPLIFICATION WARNING) represents reward as electrical voltage at a couple of reward centers, then does whatever tends to maximize that reward. Normally this works pretty well; when you fulfill a biological drive like food or sex, the reward center responds with little bursts of reinforcement, and so you continue fulfilling your biological drives. But stimulating the reward center directly with an electrode increases it much more than waiting for your brain to send little bursts of stimulation the natural way, so this activity is by definition the most rewarding possible. A person presented with the opportunity to stimulate the reward center directly will forget about all those indirect ways of getting reward like “living a happy life” and just press the button attached to the electrode as much as possible.

This doesn’t even require any brain surgery – drugs like cocaine and meth are addictive in part because they interfere with biochemistry to increase the level of stimulation in reward centers.

And computers can run into the same issue. I can’t find the link, but I do remember hearing about an evolutionary algorithm designed to write code for some application. It generated code semi-randomly, ran it by a “fitness function” that assessed whether it was any good, and the best pieces of code were “bred” with each other, then mutated slightly, until the result was considered adequate.

They ended up, of course, with code that hacked the fitness function and set it to some absurdly high integer.

These aren’t isolated incidents. Any mind that runs off of reinforcement learning with a reward function – and this seems near-universal in biological life-forms and is increasingly common in AI – will have the same design flaw. The main defense against it this far is simple lack of capability: most computer programs aren’t smart enough for “hack your own reward function” to be an option; as for humans, our reward centers are hidden way inside our heads where we can’t get to it. A hypothetical superintelligence won’t have this problem: it will know exactly where its reward center is and be intelligent enough to reach it and reprogram it.

The end result, unless very deliberate steps are taken to prevent it, is that an AI designed to cure cancer hacks its own module determining how much cancer has been cured and sets it to the highest number its memory is capable of representing. Then it goes about acquiring more memory so it can represent higher numbers. If it’s superintelligent, its options for acquiring new memory include “take over all the computing power in the world” and “convert things that aren’t computers into computers.” Human civilization is a thing that isn’t a computer.

This is not some exotic failure mode that a couple of extremely bizarre designs can fall into; this may be the natural course for a sufficiently intelligent reinforcement learner.

Problem 2: Weird Decision Theory

Pascal’s Wager is a famous argument for why you should join an organized religion. Even if you believe God is vanishingly unlikely to exist, the consequence of being wrong (Hell) is so great, and the benefits of being right (not having to go to church on Sundays) so comparatively miniscule, that you should probably just believe in God to be on the safe side. Although there are many objections based on the specific content of religion (does God really want someone to believe based on that kind of analysis?) the problem can be generalized into a form where you can make an agent do anything merely by promising a spectacularly high reward; if the reward is high enough, it will overrule any concerns the agent has about your inability to deliver it.

This is a problem of decision theory which is unrelated to questions of intelligence. A very intelligent person might be able to calculate the probability of God existing very accurately, and they might be able to estimate the exact badness of Hell, but without a good decision theory intelligence alone can’t save you from Pascal’s Wager – in fact, intelligence is what lets you do the formal mathematical calculations telling you to take the bet.

Humans are pretty resistant to this kind of problem – most people aren’t moved by Pascal’s Wager, even if they can’t think of a specific flaw in it – but it’s not obvious how exactly we gain our resistance. Computers, which are infamous for relying on formal math but having no common sense, won’t have that kind of resistance unless it gets built in. And building it in is a really hard problem. Most hacks that eliminate Pascal’s Wager without having a deep understanding of where (or whether) the formal math is going on just open up more loopholes somewhere else. A solution based on a deep understanding of where the formal math goes wrong, and which preserves the power of the math to solve everyday situations, has as far as I know not yet been developed. Worse, once we solve Pascal’s Wager, there are a couple of dozen very similar decision-theoretic paradoxes that may require entirely different solutions.

This is not a cute little philosophical trick. A sufficiently good “hacker” could subvert a galaxy-spanning artificial intelligence just by threatening (with no credibility) to inflict a spectacularly high punishment on it if it didn’t do what the hacker wanted; if the AI wasn’t Pascal-proofed, it would decide to do whatever the hacker said.

Problem 3: The Evil Genie Effect

Everyone knows the problem with computers is that they do what you say rather than what you mean. Nowadays that just means that a program runs differently when you forget a close-parenthesis, or websites show up weird if you put the HTML codes in the wrong order. But it might lead an artificial intelligence to seriously misinterpret natural language orders.

Age of Ultron actually gets this one sort of right. Tony Stark orders his super-robot Ultron to bring peace to the world; Ultron calculates that the fastest and most certain way to bring peace is to destroy all life. As far as I can tell, Ultron is totally 100% correct about this and in some real-world equivalent that is exactly what would happen. We would get pretty much the same effect by telling an AI to “cure cancer” or “end world hunger” or any of a thousand other things.

I promise the gender gap, among many other things, will be eliminated. https://t.co/9KDLvaNFob

— Sweet Meteor O'Death (@smod2016) May 24, 2015

Even Isaac Asimov’s Three Laws of Robotics would take about thirty seconds to become horrible abominations. The First Laws says a robot cannot harm a human being or allow through inaction a human being to come to harm. “Not taking over the government and banning cigarettes” counts as allowing through inaction a human being to come to harm. So does “not locking every human in perfectly safe stasis fields for all eternity.”

There is no way to compose an order specific enough to explain exactly what we mean by “do not allow through inaction a human to come to harm” – go ahead, try it – unless the robot is already willing to do what we mean, rather than what we say. This is not a deal-breaker, since AIs may indeed by smart enough to understand what we mean, but our desire that they do so will have to be programmed into them directly, from the ground up. Part of SIAI’s old vision of “causal validity semantics” seems to be about laying a groundwork for this program.

But this just leads to a second problem: we don’t always know what we mean by something. The question of “how do we balance the ethical injunction to keep people safe with the ethical injunction to preserve human freedom?” is a pretty hot topic in politics right now, presenting itself in everything from gun control to banning Big Gulp cups. It seems to involve balancing out everything we value – how important are Big Gulp cups to us, anyway? – and combining cost-benefit calculations with sacred principles. Any AI that couldn’t navigate that moral labyrinth might end up ending world hunger by killing all starving people, or refusing else to end world hunger by inventing new crops because the pesticides for them might kill an insect.

But the more you try to study ethics, the more you realize they’re really really complicated and so far resist simplification to the sort of formal system that a computer has any hope of understanding. Utilitarianism is almost computer-readable, but it runs into various paradoxes at the edges, and even without those you’d need to have a set of utility weights for everything in the world.

This is a problem we have yet to solve with humans – most of the humans in the world have values that we consider abhorrent, and accept tradeoffs we consider losing propositions. Dealing with an AI whose mind is no more different to mine than that of fellow human being Pat Robertson would from my perspective be a clear-cut case of failure.

[EDIT: I’m told I’m not explaining this very well. This might be better.]

III.

My point in raising these problems wasn’t to dazzle anybody with interesting philosophical issues. It’s to prove a couple of points:

First, there are some very basic problems that affect broad categories of minds, like “all reinforcement learners” or “all minds that make decisions with formal math”. People often speculate that at this early stage we can’t know anything about the design of future AIs. But I would find it extraordinarily surprising if they used neither reinforcement learning or formal mathematical decision-making.

Second, these problems aren’t obvious to most people. These are weird philosophical quandaries, not things that are obvious to everybody with even a little bit of domain knowledge.

Third, these problems have in fact been thought of. Somebody, whether it was a philosopher or a mathematician or a neuroscientist, sat down and thought “Hey, wait, reinforcement learners are naturally vulnerable to wireheading, which would explain why this same behavior shows up in all of these different domains.”

Fourth, these problems suggest research programs that can be pursued right now, at least in a preliminary way. Why do humans resist Pascal’s Wager so effectively? Can our behavior in high-utility, low-probability situations be fitted to a function that allows a computer to make the same decisions we do? What are the best solutions to the related decision theory problems? How come a human can understand the concept of wireheading, yet not feel any compulsion to seek a brain electrode to wirehead themselves with? Is there a way to design a mind that could wirehead a few times, feel and understand the exact sensation, and yet feel no compulsion to wirehead further? How could we create an idea of human ethics and priorities formal enough to stick into a computer?

I think when people hear “we should start, right now in 2015, working on AI goal alignment issues” they think that somebody wants to write a program that can be imported directly into a 2075 AI to provide it with an artificial conscience. Then they think “No way you can do something that difficult this early on.”

But that isn’t what anybody’s proposing. What we’re proposing is to get ourselves acquainted with the general philosophical problems that affect a broad subset of minds, then pursue the neuroscientific, mathematical, and philosophical investigations necessary to have a good understanding of them by the time the engineering problem comes up.

By analogy, we are nowhere near having spaceships that can travel at even half the speed of light. But we already know the biggest obstacle that an FTL spaceship is going to face (relativity and the light-speed limit) and we already have some ideas for getting around it (the Alcubierre drive). We can’t come anywhere close to building an Alcubierre drive. But if we discover how to make near-lightspeed spaceships in 2100, and for some reason the fate of Earth depends on having faster-than-light spaceships by 2120, it’ll probably be nice that we did all of our Theory-Of-Relativity-discovering early so that we’re not wasting half that time interval debating basic physics.

The question “Can we do basic AI safety research now?” is silly because we have already done some basic AI safety research successfully. It’s led to understanding issues like the three problems mentioned above, and many more. There are even a couple of answers now, although they’re at technical levels much lower than any of those big questions. Every step we finish now is one that we don’t have to waste valuable time retracing during the crunch period.

IV.

That last section discussed my claim 4, that there’s research we can do now that will help. That leaves claim 5 – given that we can do research now, we should, because we can’t just trust our descendents in the crunch time to sort things out on their own without our help, using their better model of what eventual AI might look like. There are a couple of reasons for this

Reason 1: The Treacherous Turn

Our descendents’ better models of AI might be actively misleading. Things that work for subhuman or human level intelligences might fail for superhuman intelligences. Empirical testing won’t be able to figure this out without help from armchair philosophy.

Pity poor evolution. It had hundreds of millions of years to evolve defenses against heroin – which by the way affects rats much as it does humans – but it never bothered. Why not? Because until the past century, there wasn’t anything around intelligent enough to synthesize pure heroin. So heroin addiction just wasn’t a problem anything had to evolve to deal with. A brain design that looks pretty good in stupid animals like rats and cows becomes very dangerous when put in the hands (well, heads) of humans smart enough to synthesize heroin or wirehead their own pleasure centers.

The same is true of AI. Dog-level AIs aren’t going to learn to hack their own reward mechanism. Even human level AIs might not be able to – I couldn’t hack a robot reward mechanism if it were presented to me. Superintelligences can. What we might see is reinforcement-learning AIs that work very well at the dog level, very well at the human level, then suddenly blow up at the superhuman level, by which it’s time it’s too late to stop them.

This is a common feature of AI safety failure modes. If you tell me, as a mere human being, to “make peace”, then my best bet might be to become Secretary-General of the United Nations and learn to negotiate very well. Arm me with a few thousand nukes, and it’s a different story. A human-level AI might pursue its peace-making or cancer-curing or not-allowing-human-harm-through-inaction-ing through the same prosocial avenues as humans, then suddenly change once it became superintelligent and new options became open. Indeed, the point that will activate the shift is precisely that no humans are able to stop it. If humans can easily shut an AI down, then the most effective means of curing cancer will be for it to research new medicines (which humans will support); if humans can no longer stop an AI, the most effective means of curing cancer is destroying humanity (since it will no longer matter that humans will fight back).

In his book, Nick Bostrom calls this pattern “the treacherous turn”, and it will doom anybody who plans to just wait until the AIs exist and then solve their moral failings through trial and error and observation. The better plan is to have a good philosophical understanding of exactly what’s going on, so we can predict these turns ahead of time and design systems that avoid them from the ground up.

Reason 2: Hard Takeoff

Nathan Taylor of Praxtime writes:

Arguably most of the current “debates” about AI Risk are mere proxies for a single, more fundamental disagreement: hard versus soft takeoff.

Soft takeoff means AI progress takes a leisurely course from the subhuman level to the dumb-human level to the smarter-human level to the superhuman level over many decades. Hard takeoff means the same course takes much shorter, maybe days to months.

It seems in theory that by hooking a human-level AI to a calculator app, we can get it to the level of a human with lightning-fast calculation abilities. By hooking it up to Wikipedia, we can give it all human knowledge. By hooking it up to a couple extra gigabytes of storage, we can give it photographic memory. By giving it a few more processors, we can make it run a hundred times faster, such that a problem that takes a normal human a whole day to solve only takes the human-level AI fifteen minutes.

So we’ve already gone from “mere human intelligence” to “human with all knowledge, photographic memory, lightning calculations, and solves problems a hundred times faster than anyone else.” This suggests that “merely human level intelligence” isn’t mere.

The next problem is “recursive self-improvement”. Maybe this human-level AI armed with photographic memory and a hundred-time-speedup takes up computer science. Maybe, with its ability to import entire textbooks in seconds, it becomes very good at computer science. This would allow it to fix its own algorithms to make itself even more intelligent, which would allow it to see new ways to make itself even more intelligent, and so on. The end result is that it either reaches some natural plateau or becomes superintelligent in the blink of an eye.

If it’s the second one, “wait for the first human-level intelligences and then test them exhaustively” isn’t going to cut it. The first human-level intelligence will become the first superintelligence too quickly to solve even the first of the hundreds of problems involved in machine goal-alignment.

And although I haven’t seen anyone else bring this up, I’d argue that even the hard-takeoff scenario might be underestimating the risks.

Imagine that for some reason having two hundred eyes is the killer app for evolution. A hundred ninety-nine eyes are useless, no better than the usual two, but once you get two hundred, your species dominates the world forever.

The really hard part of having two hundred eyes is evolving the eye at all. After you’ve done that, having two hundred of them is very easy. But it might be that it would take eons and eons before any organism reached the two hundred eye sweet spot. Having dozens of eyes is such a useless waste of energy that evolution might never get to the point where it could test the two-hundred-eyed design.

Consider that the same might be true for intelligence. The hard part is evolving so much as a tiny rat brain. Once you’ve got that, getting a human brain, with its world-dominating capabilities, is just a matter of scaling up. But since brains are metabolically wasteful and not that useful before the technology-discovering point, it took eons before evolution got there.

There’s a lot of evidence that this is true. First of all, humans evolved from chimps in just a couple of million years. That’s too short to redesign the mind from the ground up, or even invent any interesting new evolutionary “technologies”. It’s just enough time for evolution to alter the scale and add a couple of efficiency tweaks. But monkeys and apes were around for tens of millions of years before evolution bothered.

Second, dolphins are almost as intelligent as humans. But they last shared a common ancestor with us something like fifty million years ago. Either humans and dolphins both evolved fifty million years worth of intelligence “technologies” independently of each other, or else the most recent common ancestor had most of what was necessary for intelligence and humans and dolphins were just the two animals in that vast family tree for whom using them to their full extent became useful. But the most recent common ancestor of humans and dolphins was probably not much more intelligent than a rat itself.

Third, humans can gain intelligence frighteningly quickly when the evolutionary pressures are added. If Cochran is right, Ashkenazi gained ten IQ points in a thousand years. Torsion dystonia sufferers can gain five or ten IQ points from a single mutation. All of this suggests a picture where intelligence is easy to change, but evolution has decided it just isn’t worth it except in very specific situations.

If this is right, then the first rat-level AI will contain most of the interesting discoveries needed to build the first human-level AI and the first superintelligent AI. People tend to say things like “Well, we might have AI as smart as a rat soon, but it will be a long time after that before they’re anywhere near human-level”. But that’s assuming you can’t turn the rat into the human just by adding more processing power or more simulated neurons or more connections or whatever. Anything done on a computer doesn’t need to worry about metabolic restrictions.

Reason 3: Everyday Ordinary Time Constraints

Bostrom and Mueller surveyed AI researchers about when they expected human-level AI. The median date was 2040. That’s 25 years.

People have been thinking about Pascal’s Wager (for example) for 345 years now without coming up with any fully generalizable solutions. If that turns out to be a problem for AI, we have 25 more years to solve not only the Wager, but the entire class of problems to which it belongs. Even barring scenarios like unexpected hard takeoffs or treacherous turns, and accepting that if we can solve the problem in 25 years everything will be great, that’s not a lot of time.

During the 1956 Dartmouth Conference on AI, top researchers made a plan toward reaching human-level artificial intelligence, and gave themselves two months to teach computers to understand human language. In retrospect, this might have been mildly optimistic.

But now machine translation is a thing, people are making some good progress in some of the hard problems – and when people bring up problems like decision theory, or wireheading, or goal alignment, people just say “Oh, we have plenty of time”.

But expecting to solve those problems in a few years might be just as optimistic as expecting to solve machine language translation in two months. Sometimes problems are harder than you think, and it’s worth starting on them early just in case.

All of this means it’s well worth starting armchair work on AI safety now. I won’t say the entire resources of our civilization need to be sunk into it immediately, and I’ve ever heard some people in the field say that after Musk’s $10 million donation money is no longer the most important bottleneck to advancing these ideas. I’m not even sure public exposure is a bottleneck anymore; the median person who watches a movie about killer robots is probably doing more harm than good. If the bottleneck is anything at all, it’s probably intelligent people in relevant fields – philosophy, AI, math, and neuroscience – applying brainpower to these issues and encouraging their colleagues to take them seriously.