Operations Rules

Jon Pral has a great list here of 85 Operations rules to live by. Here are my top 5 from the list.

1. Know your bottlenecks and know how to spot them - every layer - know if you are blocking on disk, RAM, or CPU. It is usually that simple.

2. The value of a project manager, tech writer, and financial analyst in the ops organization should not be underestimated. They will more than pay for themselves.

3. Monitor EVERYTHING - alert only on what is actionable, record the rest for trend information.

4. Assign people to be point people for every bit of technology.

5. Do it right the first time. Rarely do you get the chance to go back and redo things. If you do, it comes at a very big cost to the company. Take the hit on the work the first go-round.
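Rule 3 is the one I find easiest to get wrong in practice, so here is a minimal sketch of the idea in Python. The `Check` class, the `route` function, and the metric names are all made up for illustration and do not come from the list or from any monitoring tool. The assumption is simple: every check gets recorded for trending, but only checks that have both a threshold and a written-down action can page someone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Check:
    """A single measurement (hypothetical structure, for illustration only)."""
    name: str
    value: float
    threshold: Optional[float] = None  # None means "trend only, never alert"
    action: Optional[str] = None       # what the on-call person should do


def route(checks: List[Check]) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Record every check for trending; alert only on actionable breaches."""
    alerts: List[str] = []
    trends: List[Tuple[str, float]] = []
    for c in checks:
        # Everything is recorded, whether or not it can alert.
        trends.append((c.name, c.value))
        # Alert only if there is a threshold, an action, and a breach.
        if c.threshold is not None and c.action and c.value > c.threshold:
            alerts.append(f"{c.name}={c.value} > {c.threshold}: {c.action}")
    return alerts, trends


if __name__ == "__main__":
    sample = [
        Check("disk_used_pct", 93.0, threshold=90.0, action="expand volume"),
        Check("cpu_idle_pct", 40.0),  # trend only
        Check("queue_depth", 5.0, threshold=100.0, action="add workers"),
    ]
    alerts, trends = route(sample)
    for line in alerts:
        print("ALERT", line)
    print(len(trends), "values recorded for trending")
```

The design choice worth noting is that a check with no `action` can never alert, which forces the question "what would I actually do about this?" before anything is allowed to wake someone up.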

What we can all learn from the Amazon outage.

I haven’t written about the Amazon storage service outage here until now, but I have been thinking a lot about what we can all learn from it. First, a few details: the Amazon S3 storage service had issues from 3:30am PT to 6:48am PT on 2/15. The issue manifested itself as a “large” increase in authenticated calls to the S3 service. The real problem is that the team didn’t know this was coming until it was too late. To resolve the problem, the Amazon team moved in additional capacity to handle the increase in authenticated requests.

I can certainly feel for the Amazon team; being caught off guard is NOT a good feeling. So what monitoring is missing from your environment? This should be an opportunity for all of us to think about the little service that everything relies on and that could cripple the environment. Monitoring, trending, and basic capacity planning are critical to the health of all our applications. We have been working much more closely with our engineering teams than ever before to instrument all parts of the applications supporting our sites via JMX. Call it what you want, and I don’t like the word, but it feels like a good time for a basic monitoring audit.
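To make "trending and basic capacity planning" a bit more concrete, here is a rough sketch in Python of the simplest version I can think of: fit a straight line to recent request rates and estimate how long until the line crosses known capacity. This is not how Amazon or anyone else does it; the function name `hours_until_capacity`, the hourly sampling interval, and the numbers are assumptions for illustration, and a straight-line fit would not have predicted a sudden spike like the one described above. It only catches steady growth.

```python
from typing import List, Optional


def hours_until_capacity(samples: List[float], capacity: float) -> Optional[float]:
    """Estimate hours until a linear trend reaches capacity.

    samples: request rates taken at equal one-hour intervals, oldest first
             (the hourly interval is an assumption of this sketch).
    Returns None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None

    # Ordinary least-squares fit of rate against sample index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    denom = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) / denom
    if slope <= 0:
        return None  # no growth, so no projected crossing

    intercept = mean_y - slope * mean_x
    # Index at which the fitted line hits capacity, measured from the last sample.
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    rates = [100.0, 110.0, 120.0, 130.0]  # made-up hourly request rates
    print(hours_until_capacity(rates, capacity=200.0))
```

Even something this crude, run against the one small service everything depends on, gives you a number to argue about before the outage instead of after it.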