Unlike last time, this time I was involved. And it was partially my fault: My cluster configuration was not correct. Two nights ago, the service attempted a failover, which of course didn’t work. Result? Two applications down for six hours.

While the mistake was mine, I do not feel very guilty. Not only did the customer not allow us to run a cluster test when I last edited the cluster configuration (which is when the mistake crept in), but the outage ticket from the customer also referenced the wrong servers, and our own monitoring was not active because the monitoring team had ignored our request to enable it for over a month.

Of course the whole mess is not exactly fun. Today we’re rolling out the changes to fix all of the issues, and testing the fixes. And I think after that I will simply go home.
