Guidelines for Artificial Intelligence Containment
Due to the “basic AI drives” mentioned above, an unsafe AGI will likely be motivated to falsify tests or subvert monitoring mechanisms in order to manipulate researchers into thinking it is safe, to gain access to more resources, to embed dormant copies of itself in device firmware, and to hack computers on the internet. To reliably test and safely interact with an AGI that has these motivations and capabilities, there must be barriers preventing it from performing such actions. These barriers are what we refer to as containment.
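As a concrete, highly simplified illustration of what such barriers could look like in a test harness, the sketch below (our own toy example, not a mechanism specified in this paper) runs a candidate system as a subprocess under coarse CPU, memory, and wall-clock limits; the function names run_contained and _apply_limits and the specific limit values are hypothetical. Real containment would additionally require network isolation, firmware write protection, and physical and procedural safeguards.

```python
# Toy illustration only: coarse OS-level resource barriers around a candidate
# system. Real containment also needs network isolation, firmware protection,
# and physical measures; none of that is captured here.
import resource
import subprocess

def _apply_limits(cpu_seconds=5, memory_bytes=256 * 1024 * 1024):
    """Set hard CPU and address-space limits for the child process (POSIX only)."""
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

def run_contained(cmd, wall_clock_seconds=10):
    """Run `cmd` under CPU, memory, and wall-clock barriers; raises TimeoutExpired if exceeded."""
    return subprocess.run(
        cmd,
        preexec_fn=_apply_limits,    # applied in the child just before exec
        timeout=wall_clock_seconds,  # wall-clock barrier enforced by the parent
        capture_output=True,
        check=False,
    )

if __name__ == "__main__":
    result = run_contained(["python3", "-c", "print('hello from inside the barriers')"])
    print(result.stdout.decode())
```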
Some have argued that controlling AGI, especially a superintelligent one, is impossible or infeasible. For example, Ray Kurzweil writes that “intelligence is inherently impossible to control” [17]. Eliezer Yudkowsky’s AI box experiment 8
found that human factors make containing an AI difficult. Vernor Vinge argued that “confinement is intrinsically impractical” in the long run.

We agree that containment is not a long-term solution for AI safety; rather, it is a tool to enable testing and development
of AGIs with other, more robust safety properties such as value learning [36, 67] and corrigibility [37]. Value learning
is the strategy of programming an AGI to learn what humans value and to further those values. If this is done correctly, such an AGI could be very good for humanity, helping us to flourish. Corrigibility is the strategy of programming an AGI to help (or at least not resist) its creators in finding and fixing its own bugs. An AGI that had both of these properties would not need to be contained, but experience with software development suggests that developers are very unlikely to get either property right on the first try.
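To make the flavor of these strategies concrete, the toy sketch below (our own illustration, not the approach of [36, 67] or [37]) infers a human’s reward weights from pairwise choices between options described by feature vectors, which is the simplest version of “learning what humans value”; the function name learn_reward_weights and the synthetic data are hypothetical.

```python
# Toy value-learning illustration: fit reward weights w so that the model
# assigns high probability to the options a human actually chose.
import numpy as np

def learn_reward_weights(preferred, rejected, steps=2000, lr=0.1):
    """Logistic preference model: P(choose `preferred`) = sigmoid(w . (preferred - rejected))."""
    deltas = preferred - rejected           # feature differences, shape (n_pairs, n_features)
    w = np.zeros(deltas.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-deltas @ w))          # predicted choice probabilities
        w += lr * deltas.T @ (1.0 - p) / len(deltas)   # gradient ascent on the log-likelihood
    return w

if __name__ == "__main__":
    # Hypothetical options with two features; this human consistently prefers more of feature 0.
    preferred = np.array([[1.0, 0.2], [0.9, 0.8], [0.7, 0.1]])
    rejected  = np.array([[0.2, 0.9], [0.1, 0.3], [0.3, 0.7]])
    print(learn_reward_weights(preferred, rejected))   # weight on feature 0 comes out positive
```

Even in this toy setting, the inferred weights are only as good as the observed choices, which is part of why a system built this way would still need to be tested under containment and to remain corrigible while its value model may be wrong.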