Saving Apache from unpredictable certain death

By Chris Wilson on 23 August 2013

A few months ago we had a serious problem at the weekend: many of our production servers went down at 4am on Sunday.

We investigated, and found out that I'd made a Puppet configuration change that installed mod_wsgi on all of our servers that also had Munin installed.

Unfortunately there is a previously unknown bug in mod_wsgi: if the module is loaded into a running server by a reload command, rather than by a full restart, the server starts crashing on every request. Apparently this is a common problem with Apache modules.

It didn't start crashing immediately for us, because the RPM package that we used to install mod_wsgi doesn't actually restart the server after installing it. I think that's another bug. In some ways it was lucky that it doesn't, because otherwise many of our servers would have gone down in the middle of a working day. But at 4am on a Sunday morning, it took us a while to connect the crashes to a Puppet change that had been made earlier in the week.

When something like this happens, I like to make at least two changes to our systems that would prevent it from happening again, or catch it and notify us immediately if it did. In this case, I opened the bug report above. But it remains open, and so does our internal ticket, 4 months later. So I decided to take more immediate defensive action to protect us.

First I added the missing Puppet rules to restart Apache when mod_wsgi is installed for the first time:

class python26 {
  case $operatingsystem {
    'CentOS': {
      case $operatingsystemrelease {
        /^6\./: {
          package { 'python26_pkgs':
            name    => ['python', 'python-devel', 'python-tools', 'python-setuptools',
                        'mod_wsgi', 'python-virtualenv', 'python-pip'],
            ensure  => installed,
            require => Class['aptivate-repo'],
            # Restart Apache whenever this package set changes, so that a
            # newly installed mod_wsgi is never first loaded by a reload.
            # https://projects.aptivate.org/issues/3897
            notify  => Service['httpd'],
          }
        }
      }
    }
  }
}

However, this only protects us from one way that the bug can be triggered. If mod_wsgi is removed for any reason, or any other faulty module is added, then Apache will fail in the same way. So I wrote a Puppet class that checks whether the list of modules loaded in Apache has changed since the last Puppet run and, if it has, restarts Apache gracefully:

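class apache {
    # Check the list of modules loaded in Apache, and if it has changed
    # since the last run, then restart Apache.
    exec { 'restart_apache_if_modules_change':
        require   => Service[$apache],
        # The command strings rely on a shell for the redirect, "&&" and ";".
        provider  => shell,
        # Dump the current module list (written to stderr here, hence "2>")
        # and skip the command below when it matches the saved list.
        unless    => '/usr/sbin/httpd -M 2> /var/run/httpd.modules.new && diff -u /var/run/httpd.modules.prev /var/run/httpd.modules.new',
        # Log the difference, restart gracefully, then save the new list.
        command   => 'diff -u /var/run/httpd.modules.prev /var/run/httpd.modules.new; /etc/init.d/httpd graceful; cp /var/run/httpd.modules.new /var/run/httpd.modules.prev',
        logoutput => true,
    }
}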
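
The two one-line commands in that exec are dense, so here is roughly the same logic unrolled into a small shell sketch. It is only an illustration, not what runs on our servers: the function names and the STATE_DIR variable are made up for this example, while the httpd paths and file names are the ones used in the exec above.

```shell
#!/bin/sh
# Illustrative sketch only: function names and STATE_DIR are hypothetical.
STATE_DIR="${STATE_DIR:-/var/run}"

# Exit status 0 ("changed") when the new module list ($1) differs from the
# saved list ($2), or when no saved list exists yet (first run).
modules_changed() {
    [ -f "$2" ] || return 0
    ! diff -u "$2" "$1" >/dev/null
}

# Keep the new list ($1) as the baseline ($2) for the next run.
save_baseline() {
    cp "$1" "$2"
}

# Capture the module list, and if it changed, restart Apache gracefully.
check_and_restart() {
    new_list="$STATE_DIR/httpd.modules.new"
    prev_list="$STATE_DIR/httpd.modules.prev"
    # As in the exec above, the module list is captured from stderr.
    /usr/sbin/httpd -M 2> "$new_list" || return 1
    if modules_changed "$new_list" "$prev_list"; then
        /etc/init.d/httpd graceful
        save_baseline "$new_list" "$prev_list"
    fi
}
```

Calling check_and_restart from cron or by hand would behave much like the exec: the first run finds no saved list, so it restarts Apache once and records a baseline, and later runs do nothing until the module list changes.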

Now I'm pretty confident that this bug won't bite us again on any servers that we control with Puppet.