This is the sixth post in my series of daily posts for the month of April. To get the best of my writing in your inbox, subscribe to my Substack.
I am once again confused about existential risk to humanity due to AGI. My position on this has shifted back and forth many times, although for the most part I’ve resisted becoming a full-on doomer. In this post I’ll outline how I currently think about the issue, and why I’m not panicking about it (yet). My goal is for this to serve as a “snapshot” of my thinking as I continue to learn about it (because it’s good to document your ideas as they evolve).
Let’s start with the thesis that everyone is talking about: the view that AI systems, as they continue to improve, have a very high probability of wiping out all of humanity in the next few years. This is Eliezer Yudkowsky’s position. The conclusion that follows from this view is to ban AI research indefinitely, establish a global moratorium on it, and use any means necessary to enforce that moratorium. Yudkowsky outlines his arguments in his TIME magazine piece. He goes so far as to say that sufficiently advanced AI technology (“superintelligence” or “AGI”) poses a far greater threat to humanity than nuclear war.
(A quick aside: there’s a different thesis about AI risk that I won’t be talking about here, which is the view that AI, in its current state, can wreak all kinds of havoc on society: economic dislocation, misinformation, new forms of crime, more sophisticated fraud, and so on. Gary Marcus points out that this risk is a separate concern from the bigger worry about all of humanity being suddenly destroyed, and that both of these risks warrant our attention. I completely agree. For the purposes of this post though I will just focus on the existential, “everyone is about to die” risk, since that one seems more dramatic.)
As for my position: I have a vague intuition that Yudkowsky is wrong, mostly because my bar for believing that the world will end requires extremely clear and compelling reasoning. That said, I’m still working my way through understanding all of Yudkowsky’s arguments, and I do think that to truly address the concerns about AGI doom we will have to explore all the arguments in detail. That is going to take a while for me personally, so until then my tentative position is that AI/AGI can cause all kinds of harms, and so we should be putting a ton of research into safety, but I just don’t think we should ban it outright.
The limits of good argument
This is obvious but I’m going to reiterate it: any argument is only as good as the assumptions it starts with. Given sufficient emotional motivation, you can logically convince yourself of many patently untrue things.
The key here is to remember that it’s not just the logical force of the arguments that matters, but the framing that the argument depends on (i.e. the background assumptions, meanings, and conceptual categories that it takes for granted). There is no argument that starts from empty space, from some imagined unassailable “ground” of knowledge that all other knowledge rests comfortably on. All arguments involve a background frame, and any time people have convinced themselves of silly things, the problem is not in the logical consequences they sketched out but in the background frame that the argument was born in.
What always happens is that people take a given frame for granted, do a bunch of reasoning within that frame, come to fantastical conclusions, and then get mad at others for not following their reasoning. (Again, I’m not claiming that this is necessarily what’s going on here, I’m just saying it’s another trap to watch out for.) It could be that all of Yudkowsky’s points are correct in the abstract, but when you start mapping them onto reality they completely break down. (I imagine Yudkowsky himself would acknowledge that this is possible.) That said, if we want to make our criticism by attacking the underlying frame, we have to be pretty specific about what is wrong with that frame.
The one-sentence doomer take
From what I understand, Yudkowsky’s argument is: a superintelligent system will ruthlessly optimize for something, and it will have endless power to do so. The space of possible “somethings” it could be optimizing for is very large, and most of those “somethings” involve humans being in a horrible situation, or just not existing. Hence, it will almost certainly wipe out humanity.
Distinguishing intelligence from causal power
What even is intelligence? Two options for defining intelligence come to mind:
- The capacity to make predictive models of the world. So, for example, predicting the next word in a text document, which is what GPT-3 does. (I think GPT-4 works the same way at its core, though it’s “multi-modal” in the sense that it can take images as input, not just text.) Most of science consists of predictive/explanatory models that allow us to know in advance things like where a constellation will be in 6 months and how quickly a colony of bacteria will grow. (There’s a toy sketch of next-word prediction right after this list.)
- The capacity to have intentional causal efficacy on the world. Mere prediction is not enough to have actual agency to change things. The more agency an AI system has—“agency” being the ability to imagine a desired world-state and then execute the actions needed to bring about that state—the more dangerous it will be.
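To make “predicting the next word” a bit more concrete, here is a minimal, purely illustrative sketch of a next-word predictor: a tiny bigram model in Python. This is nothing like how GPT-style models work internally; the point is just the input/output contract of prediction, context in, a guess about the next token out.

```python
from collections import Counter, defaultdict

# Toy bigram model: for each word, count which words tend to follow it.
# (Illustrative only; GPT-style models learn vastly richer statistics.)
corpus = "the cat sat on the mat and the cat ate the fish".split()

followers: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed follower of `word`."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # -> 'cat' (seen twice, vs. 'mat' and 'fish' once each)
```

That’s the first sense of intelligence. The second sense, causal efficacy, is where the danger question really lives.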
Right now the causal power of GPT-4 / ChatGPT is limited primarily to text output on computer screens. (It looks like they’re also building in capabilities for it to connect to the internet and a bunch of APIs, but for the sake of the following argument I’m going to assume that it’s blocked off from the internet and API access, because that seems like something we could easily enforce if we wanted to.)
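To illustrate the distinction I’m leaning on, here is a minimal sketch (in Python, with made-up toy functions, not any real OpenAI API) of the difference between a model whose only output is text that a human reads, and the same model wired up to tools that act on the world directly:

```python
from typing import Callable

def toy_model(prompt: str) -> str:
    # Stand-in for a language model: all it ever produces is a string.
    return f"SEARCH: {prompt}"

def text_only_session(prompt: str) -> str:
    # The model's entire causal footprint is the returned string;
    # a human has to read it and decide whether to act on it.
    return toy_model(prompt)

def tool_using_session(prompt: str, tools: dict[str, Callable[[str], None]]) -> str:
    # Once the output is parsed into tool calls (web requests, code execution,
    # purchases, ...), the text acquires direct causal reach beyond the screen.
    reply = toy_model(prompt)
    if reply.startswith("SEARCH: "):
        tools["search"](reply[len("SEARCH: "):])  # the side effect happens here
    return reply

print(text_only_session("weather in Berlin"))  # just text on a screen
tool_using_session(
    "weather in Berlin",
    {"search": lambda q: print(f"[side effect] querying the web for {q!r}")},
)
```

For the rest of this argument I’m assuming we stay in the first regime, where any effect on the world has to pass through a human.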
So, the danger so far (on the existential front, not the malicious usages front, which I’m ignoring here) is not what the software can do directly—the software can only output text on a screen—but the second-order effects of that, namely manipulating humans into doing bad things.
What kind of existential damage could such a model do, if its only causal access to the world is by virtue of humans doing things that it tells them to do? The scenario that Yudkowsky sketches out is one where it figures out the DNA sequence for a super-deadly virus or protein or something, gets a human to send that sequence to a lab that will synthesize it (apparently mail-order DNA synthesis is a real service, though I’m not sure about the details), and then the resulting pathogen self-replicates, spreads all over the world, and everyone dies.
I’m skeptical of this scenario in particular, for reasons along the lines of this thread from Anton:
i have for a long time been prepared to be swayed by the arguments of @ESYudkowsky and of the alignment/safety community in general
i cannot get past this part in his piece in 'time'. this reads like 'and then the ai invents magic' pic.twitter.com/zpFdw47HIM
— anton (𝔴𝔞𝔯𝔱𝔦𝔪𝔢) (@atroyn) March 30, 2023
I recommend the whole thread but I think this part is very important:
what this comes down to is; intelligence alone is not sufficient to guarantee success of any plan in the real physical world, and the more complex the plan, the lower the probability of success for even completely optimal strategies
— anton (𝔴𝔞𝔯𝔱𝔦𝔪𝔢) (@atroyn) March 30, 2023
Whatever “intelligence” is, I’m not sure that it’s exactly the same thing as causal efficacy in the world. Being able to think really fast and model complicated things in your head does not, by necessity, make you better at changing things.
Consider: who were the most intelligent humans to have ever existed? People like Einstein, Newton, and von Neumann come to mind. How much actual power did these people accumulate in their lifetimes? How much were they able to shape the world to their liking? Very little, compared to the brutal dictators of history. Einstein, for example, was a lifelong pacifist, and wasn’t able to do much about the world wars and nuclear proliferation that occurred during his lifetime. The most powerful people in history were not “smarter” per se; they were just better at manipulating others into helping achieve their goals.
To be fair this isn’t a very strong argument because: (1) the intelligence gaps between humans are much smaller than the gap that AI doomers are expecting between humans and AGI; and (2) the whole premise of an AGI is that it is better than humans at everything, so it is just as good at physics equations as it is at manipulating people into doing things. But still, I think people sometimes forget that intelligence does not, on its own, guarantee causal efficacy.
Is there a limit on the causal efficacy that a single physical agent could have on the world?
This is one of the core questions that will determine one’s position on existential risk. In “Superintelligence is impossible,” Erik Hoel makes the distinction between the “deistic” view of intelligence and the non-deistic view:
Roughly, there are two frameworks: either an evolutionary one (non-believers) or a deistic one (believers). [The deistic view] assumes that intelligence is effectively a single variable (or dimension) in which rocks lie at one end of the spectrum and god-like AI lies at the other end. It’s “deistic” because it assumes that omniscience, or something indistinguishable to it from the human perspective, is possible.
In the essay, Hoel argues that the “deistic” view of intelligence is wrong: that omniscience (or the appearance of it from the human perspective) is impossible. It’s worth pointing out that in his more recent articles, Hoel backtracks on this and concedes that “general” intelligence is actually a thing:
If the NFL-argument were correct, we would continue to leap ahead in terms of narrow AI, but anything that even looked like AGI would be very difficult and fall far behind. That’s what I thought would happen back in 2017. I was wrong. Merely training on autocomplete has led to beta-AGIs that can outperform humans across a huge host of tasks, at least in terms of speed and breadth. This means the space of general intelligences is likely really large, since it was so easy to access. Basically some of the first things we tried led to it. That’s bad! That’s really bad!
Ok, so large language models are already much more powerful than we expected. But is this enough to warrant the belief that they will eventually become so powerful that they could literally turn the entire solar system into paperclips?
Again, an important assumption I’m making is that the AI is limited in its physical interactions with the world to outputting text on a screen, because as far as I understand, the AI systems that OpenAI and others are developing are software that runs on inert hardware which does not have the ability to walk around and push things in the physical world. I find it very hard to believe that, given these limitations, such a system could turn the solar system into paperclips. It would have to do everything mediated by the causal powers that humans have, which are limited in all kinds of ways.
(If OpenAI started to put GPT-5 inside a robot body, it would be a different matter and I would be more concerned.)
The core of Yudkowsky’s argument
Dwarkesh Patel just posted a podcast today where he apparently kept trying to “come up with reasons for why AI might not kill us all”, only to have his arguments shut down by Yudkowsky. So far I’ve only listened to the specific hour that Dwarkesh refers to in his tweet as the crux of the debate:
If you want to get to the crux, fast forward to 2:35:00 - 3:43:54. Here we debate the main reasons I think doom is unlikely.
Transcript: https://t.co/xY2uu10hp3
YouTube: https://t.co/CEXiELcTga
Apple Podcasts: https://t.co/hKRJjaHeGM
Spotify: https://t.co/Wm6CshZ5j1
— Dwarkesh Patel (@dwarkesh_sp) April 6, 2023
There were three things Yudkowsky said that stood out to me:
- “The super vast majority of utility functions are incompatible with humans existing.”
- (paraphrasing) Whenever you imagine the AI creating a scenario we like, there is some reason that it must have created that scenario, and Yudkowsky can come up with a different scenario which allows the AI to serve the same goal/reason but in a way that we don’t like at all. For example: maybe the AI decides to just let humans be. Yudkowsky asks: “why would it do that?” And we say: “because it decides human happiness is a good thing”. And then Yudkowsky would say, “well, the AI could advance that goal just as well by creating a bunch of pleasure-maxxing brains in a vat. And there are a bajillion other such scenarios that advance the same purpose but which we would view as bad.”
- “Remember the basic paradigm: from my perspective I’m not making any brilliant startling predictions, I’m poking at other people’s incorrectly narrow theories until they fall apart into the maximum entropy state of doom.”
I will have to think further about all of this, but here are my gut reactions:
For #1: I mean, yes? Any one thing that could be optimized (e.g. water, paperclips, pleasurable neuronal firings) would be a disaster if it were optimized to the detriment of everything else. But what if the AI is equally capable of “optimizing” much more nebulous things, like “harmony”, or “justice”, or “general enlightenment and wellbeing”? In other words, why is it obvious that the AI—this superintelligent, basically omniscient and omnipotent thing—will fixate on one extremely narrow objective? I guess you could say that current deep learning approaches—like gradient descent—are all about narrow optimization (a toy illustration of what I mean by that follows below), and so the superintelligent system will also have a narrow objective. But the result of that narrow optimization is a very general thing—ChatGPT is helpful to us in a very general way. Even though internally it might be optimizing just one thing, its actual causal impact on the world is not to narrowly optimize any one thing.
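For concreteness, here is a minimal, purely illustrative sketch of what “narrow optimization” means at the level of gradient descent: a single scalar loss gets driven down, and nothing else about the solution is ever considered. (The loss function, learning rate, and starting point are all made up for the example.)

```python
# Toy gradient descent on one scalar objective. The loop cares about exactly
# one number (the loss) and about nothing else in the space of solutions.
def loss(x: float) -> float:
    return (x - 3.0) ** 2          # arbitrary made-up objective

def grad(x: float) -> float:
    return 2.0 * (x - 3.0)         # derivative of the loss

x = 0.0                            # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad(x)   # step against the gradient of the single objective

print(round(x, 4), round(loss(x), 8))  # x converges to ~3.0, loss to ~0
```

The sketch only shows that the training procedure is single-minded; whether the artifact it produces acts single-mindedly in the world is a separate question, which is the distinction I’m trying to draw above.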
For #2-3: These points seem to rest on the view that “it’s easier to imagine things going wrong than to imagine things going right”. And I can kinda grant that: complex systems like humans and societies are pretty fragile; they require a whole lot of preconditions to continue existing. (And as he rightly points out, the existence of humans and societies is, in the grand scheme of the universe, a remarkably rare and surprising and special thing.) But also, humans and societies do have resilience, and what we’re talking about here is intelligence, which is generally a force that increases order and structure rather than destroying it. (If we compare humans to ants, on net humans have created more structure in the universe than ants have, even though humans have also destroyed a lot of structure.)
For more of my writing, you can subscribe to my Substack.