Worked all day to clean up the mess of yesterday’s outage outage . No lunch break. Constant calls. Eventually a nice big change in which I fixed the problem.

And promptly found a second, hitherto unknown, problem. It’s based on some “broken” scripts RedHat uses. I say that in quotation marks because it works fine for what they do with it, it just is badly written and breaks when you throw a different, but just as legal, way to set things up at them.

I spent the rest of the day coordinating the work that needs to be done to fix those scripts. It’s not made easier by the fact that we do not have a cluster to test any of this, but must use the production environment for rtests – which runs business critical applications.

Just because someone wanted to save a couple thousand Euros in hardware and support services.

Lessons learned:

1) Always require a test environment that is identical to the production environment. If they don’t agree to this, refuse to set the project up.

2) Always, always, ALWAYS insist on cluster tests when brining new applications on-line, no matter how much the customer whines. If they do not agree to such a test, refuse to activate the application.

Unfortunately, both will just be seen as being unfriendly to the customer, and probably would get overruled by the higher-ups within minutes. Still, I should insist that they then put such orders in writing.

Advertisements