Why take existential AI risk seriously
Humans are powerful. We are so powerful on earth that we routinely wipe out entire species by accident: not because we don’t care (although we are also fairly callous), but mostly because we don’t adjust our habits and lives to account for other species’ needs. By some estimates, around 50,000 species go extinct every year because of the way we are shaping the global environment to our own needs and wants.
There are other animals that are stronger, faster, hardier, more agile and more dexterous than we are. The only reason we are the ones controlling the planet is that we are the most intelligent.
There is no reason to believe this advantage of intelligence stops at our level — we just happened to be at the top of the curve 200,000 years ago when evolution arrived at our current form. There is a spectrum of intelligence among humans, and it’s unlikely that this spectrum ends at the smartest human currently alive.
GPT-4 is better than me at coding, Spanish, medical knowledge and at writing essays. Is it more “intelligent” than I am? Perhaps, but it’s not obvious in either direction. That’s one of the reasons people are calling it “early AGI”.
This level of machine intelligence arrived much earlier than most people (including me) expected. And it’s unlikely to be the end. Imagine the level of machine intelligence we will have in a few years, let alone in a hundred! It’s going to be wild.
But what happens when we have an AI that is obviously more intelligent than we are? What will an AI model be like when the difference in intelligence between it and a human is as large as the difference between a human and a chimpanzee? This is what is referred to as “superintelligence”: something vastly more intelligent than humans.
There are chimps living in zoos right now. We captured them from the wild using tranquilizers we invented and put them inside huge prison buildings we constructed with technology, for reasons that are completely opaque to the chimpanzees. The power differential between us is so vast that the chimpanzees cannot even begin to grasp how we trapped them, let alone why we wanted to.
Imagine that we had a model at the same level of intelligence as an average human, IQ of 100. One worry is that the system will start learning how to improve its own code, starting the exponential take-off scenario that people ominously call “the singularity”. However, even without self-improvement, it is quite easy to increase the intelligence of an artificial agent in other ways.
One way is to increase computational speed by adding more hardware for the model to run on. Another is to spin up several copies of the model and let them communicate with each other. Giving it access to functional APIs and knowledge databases would also massively increase its capabilities.
A hivemind of 100 models of average human intelligence, running at 5x speed with millisecond access to the entire internet, would in combination likely be significantly more capable than any individual human.
When you throw in self-improvement or AI-assisted human-guided improvement, the speed at which artificial agents can become more intelligent only increases.
A model that is slightly more intelligent than a human is probably not going to be particularly potent at gathering power. We already have human-level geniuses today — some of them do very well and accumulate a lot of power via the private or public sector, but others don’t have much effect on the world.
However, the range of human intelligence is relatively narrow. The difference between the smartest human alive and the average person is not that big, all things considered. They can still communicate with each other, understand each other’s motivations and, most of the time, learn from each other (no one is better than everyone at everything).
But when we consider the full range of intelligence that we see in nature, things change. Bonobos and crows are clearly intelligent: they can pull off quite impressive feats of problem-solving and even learn the basics of language. But there is a fundamental disconnect between our species and theirs, and our relationship is fundamentally unequal. There is virtually no intellectual skill at which a bonobo can outperform an ordinary human, particularly once you add our ability to create tools and technology to complement our minds and bodies.
Imagine creating an artificial being that is to us like we are to bonobos, or even like we are to ants. It is going to be able to do things that we cannot begin to understand, for reasons we cannot begin to understand. It will be able to manipulate us in ways that we cannot detect or even reason about.
As a species, we have shaped planet earth to our own needs. In doing so, we have changed the appearance of almost every part of the earth, and even the climate has begun to shift because of our actions. Roughly a third of the planet’s mammal biomass is human bodies and around 60% is our livestock; only about 4% is wild mammals. Every day, our changes to the earth’s habitats drive roughly 100 species extinct. Before humans turned up, the natural background rate was about one extinction every three days.
Most of the time, we hold no malice towards the species that we drive to extinction. Most people actually find it quite sad and would rather we didn’t. Most people also happen to care more about other things. As a species, we prioritize eating juicy burgers over saving the rainforest and having convenience and abundance over conserving the natural environment for other animals.
Once a superintelligence comes along, we better hope that it cares more about us than we do about other animals. Whatever goal or motives it has, it will be very good at pursuing them, better than we are at pursuing ours. If those goals ever end up in conflict or even slightly out of alignment, it will win every time, in ways and for reasons we will not be able to understand. That’s what superintelligence means.
In the realm of supervised/unsupervised learning, we train neural networks to do a task using some loss function, backpropagation and a bunch of training data. We end up with neural networks that can do some processing of data pretty well. But they don’t seem to end up with a “goal” in the sense that we (or animals) have goals.
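To make that concrete, here is a minimal sketch of what such training amounts to: a single-weight toy model with plain gradient descent standing in for backpropagation through a full network. Everything here is illustrative, not any particular framework’s API.

```python
# Supervised learning in miniature: fit y = w * x by minimizing squared error.
# Gradient descent on one weight stands in for backprop through a real network.

def train(data, lr=0.1, steps=100):
    """Fit y = w * x by minimizing mean squared error over (x, y) pairs."""
    w = 0.0
    for _ in range(steps):
        # Gradient of L = mean((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # step downhill on the loss surface
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # noiseless y = 2x
w = train(data)
print(w)  # converges toward 2.0
```

Note that nothing in this process gives the model a “goal” in any agent-like sense; it is just a parameter being pushed downhill on a loss surface.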
Recently, we have started trying to give LLMs goals by describing what we want in words, adding some method for them to perform actions, and a control loop so they can assess the outcomes of those actions (e.g. AutoGPT and similar agent frameworks). As part of this, we are already testing the waters on hiveminds by allowing them to spin up copies of themselves. However, these models are ultimately predicting the next token of text, and it is very hard to reason about the extent to which they actually adopt the goal we have given them.
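A caricature of such a control loop might look like the sketch below. The names and the scripted “model” are purely illustrative (not any real framework’s API): a goal stated in words, a way to act, and a loop that feeds outcomes back to the model.

```python
# A toy AutoGPT-style control loop. The "llm" is a scripted stub so the
# example runs deterministically; a real system would call a language model.

def run_agent(goal, llm, act, max_steps=10):
    """Repeatedly ask the model for an action, execute it, record the outcome."""
    history = []
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nHistory so far: {history}\nNext action?"
        action = llm(prompt)
        if action == "DONE":        # the model claims the goal is achieved
            break
        observation = act(action)   # perform the action in the outside world
        history.append((action, observation))
    return history

# Scripted stand-ins for a real model and real tools, so the loop is runnable:
script = iter(["search: AI safety", "summarize results", "DONE"])
llm = lambda prompt: next(script)
act = lambda action: f"ok: executed {action!r}"

for step in run_agent("write a report on AI safety", llm, act):
    print(step)
```

The point of the sketch is that the “goal” only ever enters as text in a prompt; whether the model genuinely pursues it is exactly the open question discussed above.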
Goals are more readily observable in the realm of reinforcement learning, where we let models act in some environment (e.g. playing StarCraft) and reward or penalise them based on the outcomes of their actions (e.g. winning or losing). Fundamentally, training neural networks requires a loss function described mathematically: the reward you give the neural network needs to be a number. With reinforcement learning, we can see what happens when we train models to optimize a certain number by acting in an environment. It turns out that writing a mathematical function that describes what you “want”, and getting a model to pursue exactly that, is very, very hard!
There is a long list of documented cases of reinforcement learning models “reward hacking”: finding a way to maximize their reward while doing something that is not what the designers of the reward function intended.
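Here is a stylized toy example of how that happens (entirely made up for illustration, not taken from that list). The designer wants the agent to deliver a package, and adds a per-step “carrying” reward as a progress signal, plus a bonus for delivery:

```python
# A toy illustration of reward misspecification. Intended behaviour: deliver
# the package (+10 bonus). Proxy signal: +1 per step while holding it.

def episode_return(policy, horizon=50):
    """Run a fixed-horizon episode and sum the rewards the policy collects."""
    holding, delivered, total = True, False, 0
    for t in range(horizon):
        action = policy(t, delivered)
        if action == "deliver" and holding:
            holding, delivered = False, True
            total += 10          # intended: big bonus for finishing the task
        if holding:
            total += 1           # proxy reward: progress signal for carrying
    return total

intended = lambda t, done: "deliver" if t == 5 else "carry"  # delivers promptly
hacker   = lambda t, done: "carry"                           # never delivers

print(episode_return(intended))  # carry briefly, then deliver for the bonus
print(episode_return(hacker))    # hold the package forever and out-score it
```

Over a 50-step episode, the “hacker” policy that never delivers collects more total reward than the policy that completes the task, so an optimizer pointed at this reward learns the wrong behaviour. The reward function was a reasonable-looking proxy, and it was still mis-specified.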
An analogy can be found in image classification. Before neural networks, we could not write a mathematical equation describing what cats look like in general. But when we trained neural networks on thousands of images, they gradually learned to detect cats very reliably (sometimes even better than humans!).
However, you can reliably trick these models by crafting adversarial pixel perturbations that barely change the appearance of an image but make the model classify it incorrectly with near-100% confidence. We thought we had taught these models to recognise cats, but we had actually taught them something subtly different. We don’t really understand what we taught them: the function they actually learned lies buried in millions of inscrutable weights and biases.
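The mechanism can be sketched with a toy linear classifier (a deliberate simplification; real attacks like FGSM target deep networks, but the idea is the same: nudge every input dimension slightly in the direction the model is sensitive to):

```python
# A stylized sign-based adversarial perturbation against a toy linear
# "cat detector". All weights and pixel values here are made up.

def predict(w, x):
    """Linear classifier: positive score means 'cat', negative means 'not cat'."""
    return sum(wi * xi for wi, xi in zip(w, x))

def adversarial(w, x, eps):
    """Shift each pixel by at most eps in the direction that lowers the score."""
    sign = lambda v: 1.0 if v > 0 else -1.0
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.5 if i % 2 == 0 else -0.5 for i in range(20)]  # learned weights (toy)
x = [0.2 if wi > 0 else 0.1 for wi in w]              # "image" scored as cat
x_adv = adversarial(w, x, eps=0.06)                   # tiny per-pixel change

print(predict(w, x))      # positive: classified as "cat"
print(predict(w, x_adv))  # negative: near-identical image, flipped label
```

Because each pixel moves by only 0.06 while the score shifts by eps times the sum of all the weight magnitudes, high-dimensional inputs let many imperceptible nudges add up to a flipped classification, which is why this works so well against real image models.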
If we try to align a superintelligent AI in this way to human values, we don’t yet know how to avoid this situation. We may train a model on millions of philosophy papers, ethical dilemmas and real world human conversations. But in the end, it may learn something that is different from what we really want, in a subtle and inscrutable way. When you then apply the full optimization power of a superintelligence to this goal, the subtle deviation from human values that it has learned may be enough to end us.
In general, we do not know how to solve this problem, nor even the — in theory simpler — problem of defining what our values should be.
I am optimistic that this problem is tractable, researchable and solvable. But it will require time, resources and, above all, being taken seriously. Even if the probability of AI-caused human extinction is very low, the sheer awfulness of that outcome is enough to warrant spending significant resources on figuring this out and driving the probability as close to zero as possible. To do this, we will need international cooperation, regulation and research. We will likely need to tackle it the way we have handled other global threats, including nuclear weapons and energy, bioweapons and climate change. Let’s get to work.
If you have an opinion based on what you’ve read, or if you think you’ve figured out a solution, feel free to tweet it at me @magnushambleton. Alternatively, if you are building the next big thing (perhaps in AI safety or interpretability?) and you have a tie to the New Nordics, let us know: we are the most founder-friendly early-stage VC in the region.
We have also hosted a series of panel debates on this topic, with technical AI experts from a broad range of backgrounds. On the 8th of June, we will be holding one in Copenhagen. If you want to join the discussion, sign up here.