How Does Recent AI Progress Affect The Bostromian Paradigm?

[content note: I seriously know nothing about this and it’s all random uninformed speculation]

AI risk discussions are dominated by the Bostromian paradigm of AIs as highly strategic agents that try to maximize certain programmed goals. This paradigm was developed in the early 2000s, before a recent spurt of advances in machine learning. Do these advances require any changes to the way we approach these topics?

The latest progress has been concentrated in neural networks – “cells” arranged in layers that represent the potential for ascending levels of abstract categorization. For example, a neural network working on image recognition might have a low-level layer that scans the image and resolves it into edges, a medium-level layer that scans the set of edges and resolves it into shapes, and a highest-level layer that scans the shapes and resolves them into subjects and themes. With enough training, the network “learns” how best to map each level onto the level above it, ending up with profound insight into the high-level features of a scene.
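To make the layer idea concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the “image” is a single row of pixels, the weights are wired by hand rather than learned, and the names (edge_layer, shape_layer, forward) are made up. A real network would learn its weights from training data and would have far more cells per layer.

```python
def relu(x):
    """Pass positive signals through, silence negative ones."""
    return max(0.0, x)


def layer(inputs, weights, biases):
    """One layer of cells: each cell takes a weighted sum of the layer below."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def edge_layer(n):
    """Low-level layer (hand-wired, hypothetical): rising and falling edge detectors."""
    rising, falling = [], []
    for i in range(n - 1):
        r = [0.0] * n
        r[i], r[i + 1] = -1.0, 1.0   # fires when pixel i+1 is brighter than pixel i
        f = [0.0] * n
        f[i], f[i + 1] = 1.0, -1.0   # fires when pixel i+1 is darker than pixel i
        rising.append(r)
        falling.append(f)
    weights = rising + falling
    return weights, [0.0] * len(weights)


def shape_layer(n):
    """Mid-level layer (hand-wired, hypothetical): reads edges, reports shapes."""
    m = n - 1
    bar = [1.0] * (2 * m)            # with bias -1: fires once total edge activity exceeds one edge
    step = [1.0] * m + [-1.0] * m    # fires when rising edges outweigh falling edges
    return [bar, step], [-1.0, 0.0]


def forward(pixels):
    """Run a row of pixels up the hierarchy: pixels -> edges -> shapes."""
    edges = layer(pixels, *edge_layer(len(pixels)))
    shapes = layer(edges, *shape_layer(len(pixels)))
    return edges, shapes


_, shapes = forward([0, 1, 1, 0, 0])
print(shapes)  # [bar activation, step activation]
```

The point of the sketch is only that the shape layer never looks at pixels. It reads the edge layer's output, and a third layer for subjects and themes would read the shape layer's output in the same way.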

These networks are a lot like the human brain, and in fact some of the early researchers got important insights from neuroscience. The brain certainly uses cells, the cells are arranged in layers, and the brain categorizes things in hierarchies that move from simple things like edges or sounds to complicated things like objects or sentences.

In particular, these networks are like the brain’s sensory cortices, and they’re starting to equal or beat human sensory cortices at important tasks like recognizing speech and faces.

(I think this is scarier than most people give it credit for. It’s no big deal when computers beat humans at chess – human brains haven’t been evolving specific chess modules. But face recognition is an adaptive skill localized in a specific brain area that underwent a lot of evolutionary work, and modern AI still beats it.)

The sensory tasks where AIs excel tend to involve abstraction, categorization, and compression: the thing where you take images of black dogs, white dogs, big dogs, little dogs, ugly dogs, cute dogs, et cetera and are able to generalize them into “dog”. Or to take a more interesting example: a new AI classifies images as pornographic or safe-for-work. Its structure naturally gives it an abstract understanding of pornographicness that allows it to “imagine” what the most pornographic possible images would look like (trigger warning: artificially intelligent computer generating the most pornographic possible images). This kind of classification/categorization/generalization ability is a major advance and eerily reminiscent of human abilities.
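One common way to get a classifier to “imagine” its ideal input is activation maximization: hold the trained weights fixed and nudge the input, step by step, in whichever direction raises the classifier's score. I don't know whether the system described above works exactly this way, so treat the following as a sketch of the general trick only. The four-pixel “image”, the weights, and the function names are all made up, and the classifier is a single logistic unit rather than a deep network.

```python
import math


def score(image, weights, bias):
    """Logistic classifier: how strongly does this image belong to the category?"""
    z = sum(w * p for w, p in zip(weights, image)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def imagine(weights, bias, steps=200, lr=0.5):
    """Gradient ascent on the input: start from grey and climb the score."""
    image = [0.5] * len(weights)
    for _ in range(steps):
        s = score(image, weights, bias)
        # For a logistic unit, d(score)/d(pixel) = s * (1 - s) * weight.
        image = [
            min(1.0, max(0.0, p + lr * s * (1.0 - s) * w))  # keep pixels in [0, 1]
            for p, w in zip(image, weights)
        ]
    return image


# Hypothetical weights: pixels 0 and 2 count for the category, 1 and 3 against it.
weights, bias = [2.0, -1.0, 0.5, -3.0], 0.0
dream = imagine(weights, bias)
print(dream, score(dream, weights, bias))
```

With one linear unit the result is boring: every pixel the classifier likes gets turned all the way up and every pixel it dislikes gets turned all the way down. The eerie images come from running the same procedure through many layers of abstraction.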

But how far is this from building an AGI or human-level AI or superintelligence or whatever else you want to call it?

II.

Consider two opposite perspectives:

The engineer’s perspective: Categorization ability is just one tool out of many. When people invented automated theorem-provers, that was pretty cool – it meant computers could now assess new mathematics. But for AGI, you still need something that wants to prove theorems, something (someone?) that can do something with the theorems it proves. The theorem-prover is a tool for the AI to use, not the core “consciousness” of the AI itself. The same will be true of these new neural nets and deep learning programs. They can recognize dogs, and that’s cool. But AGI is still about creating some kind of program that wants to recognize dogs, and which can do something interesting with the dogs once it recognizes them. And that will probably require something different from either a theorem-prover or a neural-net-categorizer. A paperclip maximizer might use a neural net to recognize paperclips, but its desire to maximize them will still come from some novel architecture we don’t know much about yet which probably looks more like normal programming.

The biologist’s perspective: The whole brain runs on more or less similar cells doing more or less similar things, and evolved in a series of tiny evolutionary steps. If we’ve figured out how one part of the brain works, that’s a pretty big clue as to how other parts of the brain work. The human motivation system is in brain structures not so different from the human perception-association-categorization system, and they probably evolved from a common root. If researchers are discovering that the easiest way to make perception-association-categorization systems is neural nets reminiscent of the brain, then they’ll probably find that those neural nets are pretty easy to alter slightly to make a motivational system reminiscent of the brain. This would look less like strategic/agenty goal maximization, which the brain is terrible at, and more like the sort of vague mishmash of desires which humans have.

The exact evolutionary history behind the biologist’s perspective is complicated. There’s a split between some sensory processing centers (like the visual cortex) and some motivational/emotional centers (like the hypothalamus) pretty early in vertebrates and maybe even before. But in other cases the systems are all messed up. Some parts of the cortex interact with the hypothalamus and are considered part of the limbic system. Some parts of the really primitive lizard brain handle sensation (like the colliculi). It looks like sensation/perception-related areas and emotion/motivation-related areas are mixed throughout every level of the brain. Most important, the frontal lobe, which we tend to interpret as the seat of truly human intelligence and executive planning and “the will”, probably evolved from sensation/perception-related areas in fish, since it looks like sensation/perception-related areas are just about all the cortex that fish had. And all of this evolved from the same couple of hundred neurons in worms, which were already responsible for interpreting the sensations picked up by the worm’s little bristle thingies.

The point is, neither evolution nor anatomy suggests that the brain enforces a deep conceptual separation between perception, motivation, and cognition. Instead, the same sort of systems which handle perception in some areas are – with a few tweaks – able to handle cognition and motivation in others.

In fact, there are some deep connections between all three domains. The same factors that make a grey figure on dark ground look white can make an okay choice compared to worse choices look good. The same top-down processing that screws up PARIS IN THE THE SPRINGTIME is responsible for confirmation bias. In general the mapping between cognitive biases and perceptual illusions is fruitful enough that it’s hard for me to believe that cognition and sensation/perception aren’t handled in really similar ways, with motivation probably also involved.

So if we have something that can equal human sensory cortices – not just in the coincidental way where a sports car can equal a cheetah, but because we’re genuinely doing the same thing human sensory cortices do for the same reasons – then we might already be further than we think towards understanding human intelligence and motivation.

III.

A quick sketch of two ways this might play out in real life.

First, categorization/classification/generalization/abstraction seems to be a big part of how people develop a moral sense, and maybe a big part of what morality is.

Everyone remembers the whole thing about mental categories, right? The thing where you have a category “bird”, and you can’t give a necessary-and-sufficient explicit definition of what you mean by that, but you know a sparrow is definitely a bird, and an ostrich is weird but probably still a bird, and there are edge cases like Archaeopteryx where you’re not quite sure if they’re birds or not and there’s probably no fact of the matter either way? Cluster-structures in thingspace? Weird border disputes? That thing?

And you remember how we get these categories, right? A little bit of training data, your mother pointing at a sparrow and saying “bird”, then maybe at a raven and saying “bird”, then maybe learning ad hoc that a bat isn’t a bird, and your brain’s brilliant hyperadvanced categorization/classification/generalization/abstraction system picking it up from there? And then maybe after several thousand years of this Darwin comes along and tells you what birds actually are, and it’s good to know, but you were doing just fine way before that?
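Here is a crude sketch of that process, assuming (hypothetically) that each animal is just three made-up features and that a category is just the average of the examples you've been shown. This is a nearest-prototype classifier, which is my stand-in for illustration and not a claim about what the brain does. The animals and feature values are invented.

```python
def centroid(points):
    """Average the examples of a category into a single prototype."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared distance between two points in (toy) thingspace."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def learn(examples):
    """examples: list of (features, label). Returns label -> prototype."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}


def classify(prototypes, features):
    """Pick the category whose prototype is nearest."""
    return min(prototypes, key=lambda label: sq_dist(prototypes[label], features))


def birdness(prototypes, features):
    """Positive = nearer the bird prototype; near zero = border dispute."""
    return sq_dist(prototypes["not bird"], features) - sq_dist(prototypes["bird"], features)


# Hypothetical features: (flies, has feathers, lays eggs)
training = [
    ((1, 1, 1), "bird"),      # sparrow: mother points and says "bird"
    ((1, 1, 1), "bird"),      # raven
    ((1, 0, 0), "not bird"),  # bat: learned ad hoc
    ((0, 0, 0), "not bird"),  # dog
]
prototypes = learn(training)
ostrich = (0, 1, 1)
print(classify(prototypes, ostrich), birdness(prototypes, ostrich))
```

Nobody ever told this thing about ostriches. It still files them under “bird”, and it does so with less confidence than it has about sparrows, which is roughly the weird-but-probably-still-a-bird behavior described above.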

We learn morality in a very similar way. When we hit someone, our mother/father/teacher/priest/rabbi/shaman says “That’s bad”; when we share, “that’s good”. From all this training data, the categorization/classification/generalization/abstraction system eventually feels like it has a pretty good idea of what morality is, although often we can’t verbalize an explicit definition any better than we can verbalize an explicit definition of “bird” (“it’s an animal that can fly…wait, no, bats…um, that has feathers…uh, do all birds have feathers? Bah, of course they don’t if you pluck them, that wasn’t what I meant…”). Just as Darwin was able to give an explicit definition of “bird” which conclusively settled some edge cases like bats, so philosophers have tried to give explicit definitions of “morality” which settle edge cases like abortion and trolley-related mishaps.

An AI based around a categorization/classification/generalization/abstraction system might learn morality in the same way. Its programmers give it a bunch of training data – maybe the Bible (this is a joke, please do not train an AI on the Bible) – and the AI gains a “moral sense” that it can use to classify novel data.

The classic Bostromian objection to this kind of scheme is that the AI might draw the wrong conclusion. For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.

To this I can only respond that we humans don’t work this way. I’m not sure why. It seems to either be a quirk of our categorization/classification/generalization/abstraction system, or a genuine moral/structure-of-thingspace-related truth about how well forced-heroin clusters with other things we consider good vs. bad. A fruitful topic for AI goal alignment research might be to understand exactly how this sort of thing works and whether there are certain values of classification-related parameters that will make classifiers more vs. less like humans on these kinds of cases.

Second, even if we can’t get this 100% right, there might be a saving grace: I don’t see these kinds of systems as paperclip maximizers. The human utility function seems to be a set of complicated things generalizing/abstracting from a few biologically programmed imperatives (food, sex, lack of pain) and an ability to learn other goals from society and your moral system.

Categorization/classification/generalization/abstraction is certainly involved in reinforcement learning. You say “BARK!” and a dog barks, and you give it a treat. The dog needs to be able to figure out, on the fly, whether the treat was for barking when you said “BARK!”, for barking whenever you speak, for barking in general, for being next to you, or just completely random. This is a problem of categorization and abstraction – going from training data (“the human did or didn’t reward me at this specific time”) to general principles (“when the human says bark, I bark”).
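The dog's problem can be sketched as scoring candidate rules against the reward history. To be clear about the assumptions: this is not a reinforcement learning algorithm, just the categorization step inside one, and the episodes, rule names, and functions below are all invented for illustration. The “completely random” hypothesis is left out, since it predicts nothing that could be scored.

```python
# Each episode: (what the human said, did I bark, was I next to the human, did I get a treat)
EPISODES = [
    ("BARK", True, True, True),
    ("SIT", True, True, False),
    (None, True, True, False),
    ("BARK", False, True, False),
    ("BARK", True, False, True),
]

# Candidate general principles: each predicts whether a treat should follow.
HYPOTHESES = {
    "bark when told BARK": lambda heard, barked, near: heard == "BARK" and barked,
    "bark whenever the human speaks": lambda heard, barked, near: heard is not None and barked,
    "bark in general": lambda heard, barked, near: barked,
    "be next to the human": lambda heard, barked, near: near,
}


def accuracy(rule, episodes):
    """Fraction of episodes where the rule's prediction matches the actual reward."""
    hits = sum(bool(rule(h, b, n)) == treat for h, b, n, treat in episodes)
    return hits / len(episodes)


def best_rule(hypotheses, episodes):
    """Pick the principle that best explains the training data so far."""
    return max(hypotheses, key=lambda name: accuracy(hypotheses[name], episodes))


for name, rule in HYPOTHESES.items():
    print(name, accuracy(rule, EPISODES))
print("dog concludes:", best_rule(HYPOTHESES, EPISODES))
```

A real dog doesn't get handed a tidy list of hypotheses, of course. Coming up with the candidate principles in the first place is the hard part, and that is exactly the abstraction ability the categorizer is supposed to supply.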

I don’t really understand how the human motivational system works. Dopamine and the idea of incentive salience seem to be involved in a fundamental way that seems linked to perception. But I am kind of hopeful that it’s something that’s not too hard to do if you already have a working categorizer, and that it’s a foundation to build agents that want things without being psychopathic maniacs. Humans can want sex without being insane sex maximizers who copulate with everything around until they explode. An AI that wanted paperclips, but which was built on a human incentive system that gave paperclips the same kind of position as sex, might be a good paperclip producer without being insane enough to subordinate every other goal and moral rule to its paperclip-lust.

Tomorrow 10/31 is the last day of MIRI’s yearly fundraiser, and as usual I think it is a good cause well worth your donation. But its basic assumption is that AIs will be very computer-like: entities of pure code and logic that will reflect on themselves using mathematical tools. I can also imagine futures where AIs aren’t much more purely-logical than we are, and the tools we need to keep them human-friendly are very different. I support MIRI’s efforts to deal with the one case, but I’m hoping there will be some efforts in the other direction as well.

EDIT: Nick points out some of MIRI’s work along these lines.
EDIT2: Comment by Eliezer