Jun 29, 2022
I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.
I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain failure mode is going to exhibit that failure mode.
This might seem rather obvious when stated explicitly, but this is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels when I see anyone arguing that AGI won't necessarily be dangerous. An AGI that isn't intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate even every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.