DRYM - a principle of sys admin (Don't repeat your mistakes)
Mistakes happen. We accept that, and we clean up our own mess. But we do try to learn from our mistakes, and bake the lessons into our processes so they don't happen again.
1&1 still have a control panel that makes it hard to tell which server is which, so we've moved to Linode, who have a far superior control panel (and are cheaper too). The key feature for this problem is that Linode lets us give each server our own name in the control panel, so it is obvious which server we are modifying.
But the key problem was that our backups weren't functional. Servers do die for reasons outside our control, and we need to be able to restore from backup when the original server is not available. If we're going to use encrypted backups, we need a copy of the backup key stored off the server. A checklist would get us part of the way, but checklists sometimes get skipped when you're under time pressure. Scripts are better, because they won't skip steps. Better still is an automated check that notifies us if we haven't copied the key to the correct place. Overall, the fewer manual steps required of us, the better.
In addition, "a backup isn't done until you've tested the restore" - so we need to regularly test that the restore actually works.
Around this time we started using Puppet to manage the configuration of our servers, so we've written a Puppet script that:
- copies the backup scripts into place
- also copies restore scripts onto the server - including a script to do a test restore of a single file
- sets up cron jobs to run the backup scripts nightly and the test restore weekly
- sets up a Nagios check to ensure the test restore has run successfully within the last week
- sets up a Nagios check to ensure that the file containing the backup key exists on another server
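As an illustration of the "test restore has happened in the last week" check, here is a minimal sketch in Python. It assumes the weekly test restore touches a marker file when it succeeds; the marker path, the eight-day threshold and the function names are all hypothetical and not taken from our actual scripts. The exit codes are the standard Nagios plugin ones.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical marker file, touched by the weekly test restore on success.
MARKER = "/var/lib/backup/last-test-restore"
# One week plus a day of slack, so a slightly late cron run does not alert.
MAX_AGE_SECONDS = 8 * 24 * 60 * 60


def check_freshness(path, max_age, now=None):
    """Return (exit_code, message) describing how old the marker file is."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No marker at all means the test restore has never succeeded.
        return CRITICAL, "CRITICAL: %s is missing" % path
    age_days = (now - mtime) / 86400.0
    if now - mtime > max_age:
        return CRITICAL, "CRITICAL: last test restore was %.1f days ago" % age_days
    return OK, "OK: last test restore was %.1f days ago" % age_days


def main():
    """Print the status line and return the exit code for Nagios."""
    code, message = check_freshness(MARKER, MAX_AGE_SECONDS)
    print(message)
    return code
```

A plugin wrapper would simply call `sys.exit(main())`, so Nagios sees both the status line and the exit code.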
We still have one manual step left, though. Once Puppet has run, we have to run a script on the server ourselves to set up the backups. This script:
- generates the GPG keys to use for the backup
- generates the settings file used by the various backup and restore scripts
- creates a zip file containing the gpg keys, the settings file and restore scripts
- copies the zip file to a server in our office
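The bundling step could look something like the sketch below. It only covers creating the zip file; key generation and the scp copy are left out. The function name and the idea of creating the archive with owner-only permissions are assumptions for illustration, not a description of our real script.

```python
import os
import zipfile


def build_bundle(files, bundle_path):
    """Write the given files into a zip archive and return its path."""
    # Ask for 0600 permissions at creation time, so key material is never
    # world-readable. (If the file already exists, its old mode is kept.)
    fd = os.open(bundle_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as raw:
        with zipfile.ZipFile(raw, "w", zipfile.ZIP_DEFLATED) as bundle:
            for path in files:
                # Store each file under its base name so the archive unpacks flat.
                bundle.write(path, arcname=os.path.basename(path))
    return bundle_path
```

Called with the key files, the settings file and the restore scripts, this yields the single file that then gets copied off the server.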
The last step is why this script is run manually - we copy the zip file using scp, and that requires us to log in to the server. We haven't come up with a satisfactory way of copying the file into place without user involvement that doesn't reduce our security. But if the manual script isn't run, we get two sets of notifications. Firstly, the Nagios checks will complain - the zip file will not appear on the server in our office, and the test restore won't be able to run. In addition, the nightly backup cron job will still run, and when it finds that the setup script hasn't been run, it will generate an email to root on that server - which then goes to the sys admin team.
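The nightly job's guard can be sketched as follows. This assumes the setup script's settings file is what the backup looks for, and relies on cron mailing any output of a job to its owner. The settings path and function names are hypothetical.

```python
import os
import sys

# Hypothetical location of the settings file written by the setup script.
SETTINGS_FILE = "/etc/backup/settings.conf"


def setup_has_run(settings_path):
    """The setup script is assumed to leave a non-empty settings file behind."""
    return os.path.isfile(settings_path) and os.path.getsize(settings_path) > 0


def nightly_backup(settings_path, run_backup, err=sys.stderr):
    """Run the backup if setup is complete; otherwise complain and return 1."""
    if not setup_has_run(settings_path):
        # Anything a cron job prints is mailed to the job's owner, so this
        # message is what ends up in root's mailbox.
        print("backup skipped: the setup script has not been run "
              "(no settings file at %s)" % settings_path, file=err)
        return 1
    run_backup()
    return 0
```

The point of the design is that a forgotten manual step produces noise every night until someone fixes it, rather than failing silently.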
The final piece of the solution is that we manually run a full restore once per year for each server that is backed up this way. This is a manual step, but we have a script (in the zip file) that does most of the steps for you, so the amount of person time required to do it is about half an hour (though the elapsed time is a bit longer).
Of course the system isn't perfect. If we find the time, we may revisit this solution. It would be good to generate the GPG key and copy it to the central server without manual intervention - maybe we could set up a relatively insecure FTP server that machines could log in to. A script could monitor the upload directory and copy files put into it to a more secure directory, renaming each file to ensure we don't overwrite a good file with a bad one.
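The monitoring script is only an idea at this point, but a rough sketch of it might look like this. Every name here is hypothetical, and it assumes the script is run periodically (say from cron) rather than watching the directory continuously. Each file gets a timestamp in its name, plus a counter if needed, so an existing file is never overwritten.

```python
import os
import shutil
import time


def collect_uploads(incoming_dir, secure_dir, now=None):
    """Move uploaded files into secure_dir under unique, timestamped names."""
    # now=None means "use the current time"; a fixed value helps with testing.
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    moved = []
    for name in sorted(os.listdir(incoming_dir)):
        source = os.path.join(incoming_dir, name)
        # Skip directories and symlinks: the upload area is not trusted.
        if os.path.islink(source) or not os.path.isfile(source):
            continue
        base, ext = os.path.splitext(name)
        target = os.path.join(secure_dir, "%s.%s%s" % (base, stamp, ext))
        counter = 1
        # Never overwrite: add a counter until the name is unused.
        while os.path.exists(target):
            target = os.path.join(
                secure_dir, "%s.%s-%d%s" % (base, stamp, counter, ext))
            counter += 1
        shutil.move(source, target)
        os.chmod(target, 0o600)
        moved.append(target)
    return moved
```

Because old copies are kept, a bad upload (accidental or malicious) sits alongside the good one instead of replacing it.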
A second improvement would be to be absolutely sure that the key stored on the office server is the same one that the server uses. At present the Nagios check ensures that a zip file with the correct name exists, but it doesn't look inside the zip file, and it doesn't know which key is in use on the server.
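One way this could work, sketched below, is to compare a SHA-256 digest of the key inside the zip file with a digest of the key file on the server. This is an untested idea rather than something we run: it assumes the server can report its digest to the check somehow, and the member name and function names are hypothetical.

```python
import hashlib
import zipfile


def sha256_of_bytes(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def bundled_key_digest(bundle_path, member):
    """Digest of one member (the key file) inside the zip on the office server."""
    with zipfile.ZipFile(bundle_path) as bundle:
        return sha256_of_bytes(bundle.read(member))


def live_key_digest(key_path):
    """Digest of the key file as it exists on the backed-up server."""
    with open(key_path, "rb") as handle:
        return sha256_of_bytes(handle.read())


def keys_match(bundle_path, member, expected_digest):
    """True if the bundled key has the digest the server reported."""
    return bundled_key_digest(bundle_path, member) == expected_digest
```

Only the digest has to travel between machines, so the check never needs to move the key itself.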