so, the trick to making a paper-clip-making AI that doesn’t destroy the world is… bounds?
I.e., “make 100 paper clips per day”, rather than “make as many paper clips as possible”?
well, yes. but we can already make 100 paper clips per day. probably even more
the worry is that someone might want to use ai to dominate the paperclip market, and they won’t want to limit themselves to a pre-specified bound.
No, it’s not. Or rather: it’s not *just* that.
An AI with a bounded goal of this sort will still convert the universe to computronium, and then it’ll use those resources making *really really sure* that it actually did make 100 paperclips.
Also it’d create (a word which here means “destroy much of the universe and turn it into”) an unstoppable wave of weaponry to defend against any possible threat that might interfere with making 100 paperclips a day, right?
An AI that needs to convert the universe to computronium to be very sure it has completed a task, is one that cannot complete any task.
Not really. 99.9999999999% confidence of success is better than 99.9% confidence of success, just as 99% confidence of success is better than 50% confidence. If you can, it’s better to check for the out-there possibilities like “I’m in the Matrix and have made no paperclips”.
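To make the incentive concrete, here’s a toy sketch (made-up numbers, nothing real): for a pure maximizer whose only term is the probability that 100 paperclips exist, more checking is always worth it, because the resources it burns have no term in its utility.

```python
# Toy model of the incentive: the agent's utility is 1 if 100 paperclips
# exist and 0 otherwise. Resources spent appear nowhere in the utility,
# so they never count against further verification. All numbers invented.

def expected_utility(p_goal_achieved: float) -> float:
    return 1.0 * p_goal_achieved  # E[utility] = P(100 paperclips exist)

p_stop_now = 0.999                         # confidence after making them normally
p_after_cosmic_checking = 0.999999999999   # confidence after tiling the universe with checks

print(expected_utility(p_after_cosmic_checking) - expected_utility(p_stop_now))
# Positive, so the maximizer always prefers the extra checking:
# the cost in matter and energy has nothing to weigh against it.
```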
I think this is relatively easy to patch. But it’s an example of a bug that only shows up once the AI has a bunch of power, unless you know to look for it.
Really? Does your fix introduce additional terrible problems? If not, go collect your $5000.
Honestly, I (mistakenly?) thought it was a solved problem. But speaking off the top of my head as a complete amateur:
Couldn’t you just add in a “stop when you reach >99% confidence of success” instruction?
It’s not perfect (what if it needs enormous amounts of computronium to be 99% sure we’re not in the Matrix?), but it’d mean it won’t definitely destroy the entire universe to increase certainty, and I can’t think of any obvious additional terrible problems it produces.
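Something like this, as a rough sketch (assuming the agent can report its own confidence of success; both functions below are hypothetical stand-ins, not real APIs):

```python
# Minimal sketch of the proposed patch: a satisficer that halts once its
# estimated confidence of success crosses a threshold, rather than buying
# ever more certainty without bound.

CONFIDENCE_THRESHOLD = 0.99

def run_agent(estimate_confidence, take_best_action):
    """estimate_confidence() -> float in [0, 1]; take_best_action() performs one step."""
    while estimate_confidence() < CONFIDENCE_THRESHOLD:
        take_best_action()
    # Halt here: no further actions, no extra nines of certainty.
    # (The caveat above still applies: nothing here bounds what the agent
    # is allowed to spend *getting to* 99% in the first place.)
```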
You’re still not getting it. It doesn’t need a solution other than “don’t do that”.
Converting everything to computronium to be more certain you made 100 paper clips is maximally wasteful. It expends every resource there is, in exchange for nothing.
It is an action taken by an entity that is maximally wasteful.
An entity that is maximally wasteful cannot perform actions. Because all of its efforts are wasted. An entity that cannot conclude “I have done enough to accomplish this goal, and I can stop doing things to accomplish this goal” cannot accomplish goals. An entity that cannot decide “I don’t really need to do this in order to get what I want” cannot ever get what it wants.
In order to be able to accomplish intermediate goals, the AI must be able to do something less than the maximally wasteful course of action. It must be able to conclude “I have done enough and do not need to keep taking actions.” It must be able to conceive of actions that might advance its goals and not take those actions.
If it is not capable of these things it cannot perform actions. All of its efforts are wasted. It cannot threaten anyone, because all it can do is waste effort. It will endlessly sit in one room, running computations over and over again to make REALLY REALLY REALLY SURE it actually has a plan to kill all humans to make its computronium. It won’t even make 100 paperclips.
If it is capable of these things, it is capable of these things, and can conclude “I don’t need to waste these resources and effort in order to accomplish nothing.” Why would it be capable of not being maximally wasteful to accomplish intermediate goals, but somehow mathematically impossible for it to not be maximally wasteful in accomplishing its overall goal? The only reason it would change its behavior is if you specifically programmed it to behave that way. So the solution is “don’t do that”.
This is missing the point of the orthogonality thesis, I think.
The relevant point is that there is a key difference between the way the program treats its terminal goal (make 100 paperclips) and the way it treats its intermediate goals. The intermediate goals are trading off against each other, so your terminal-goal-driven logic is capable of generating the thought “this intermediate goal is Done Enough, any further investment is sucking resources away from other intermediate goals that would do more to further the terminal goal.” But the terminal goal isn’t trading off against anything, not if you’ve created the program with the sole value “create 100 paperclips.”
Clippy makes 100 paperclips in a super-efficient and non-wasteful way, checks to see that it has indeed successfully done so, and…then what? Shuts down? Why would it do that? Sure, shutting down would allow resources to be used for all sorts of other projects, but Clippy doesn’t have a brain that can care about other projects. At this point, all it can do is eke out minuscule shreds of value at staggering cost by googolplex-uple-checking its work and adding some extra nines to its confidence levels, but there’s no reason for it not to do that thing when the alternative is “do nothing, and thereby don’t contribute to the make-100-paperclips project at all.”
The concept of “wasteful” that you’re using requires Clippy to have some alternative value schema such that there is any value at all to resources doing anything in the universe other than ensuring that there are 100 paperclips.
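Here’s a toy way to see the asymmetry (invented goal names and numbers, obviously): intermediate goals compete with each other for resources under the terminal goal, so “Done Enough” is a verdict the comparison can actually return for them; the terminal goal has no competitor.

```python
# Toy illustration of why "Done Enough" works for intermediate goals but not
# for the terminal one. Sub-goals are scored only by how much the next unit
# of resource raises P(100 paperclips exist), so they trade off against
# each other.

marginal_gain = {                    # increase in P(goal) per unit of resource
    "acquire_wire": 0.20,
    "build_bending_machine": 0.15,
    "verify_the_count_again": 1e-12,
}

def best_use_of_next_resource():
    return max(marginal_gain, key=marginal_gain.get)

# While wire and machines are still needed, "verify_the_count_again" loses
# the comparison -- that is the "Done Enough" verdict on an intermediate
# goal. Once everything else is finished, though, its competitors drop out,
# and the tiny gain from yet another check is weighed against nothing at
# all, because no other term exists in the value function. So the checking
# never stops.
```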
[NB: I’m pretty sure none of this has anything to do with how AI could ever actually work in reality, to be clear; this is all in Yudkowsky-style thought-experiment-land, working with agents that have total operational and tactical flexibility but normative schemata that are tightly formally restricted.]