I didn’t write about the Amazon storage service outage before now, but I have been thinking a lot about what we can all learn from it. First, a few details: the Amazon S3 storage service had issues from 3:30am PT to 6:48am PT on 2/15. The issue manifested itself as a “large” increase in authenticated calls to the S3 service. The real problem is that the team didn’t know this was coming until it was too late. To resolve the problem, the Amazon team brought additional capacity online to handle the increase in authenticated requests.
I can certainly feel for the Amazon team; being caught off guard is NOT a good feeling. So what monitoring is missing from your environment? This should be an opportunity for all of us to think about the little service that everything relies on and that could cripple the environment if it fell over. Monitoring, trending, and basic capacity planning are critical to the health of all our applications. We have been working much more closely with our engineering teams than ever before to instrument all parts of the applications supporting our sites via JMX. Call it what you want, and I don’t like the word, but it feels like a good time for a basic monitoring audit.
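To make "trending and basic capacity planning" a bit more concrete, here is a minimal sketch in Python. It fits a straight line to a series of hourly request rates and estimates how many hours are left before the trend reaches a capacity ceiling. The function name `hours_until_capacity`, the hourly sampling interval, and the numbers in the usage note are all assumptions for illustration; they are not taken from Amazon's incident or from our own tooling, and a real traffic pattern is rarely this linear.

```python
from typing import Optional, Sequence


def hours_until_capacity(rates: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate hours until a linear trend in `rates` reaches `capacity`.

    `rates` holds one sample per hour, oldest first (an assumed sampling
    interval). Returns 0.0 if the latest sample is already at or above
    capacity, and None if the fitted trend is flat or falling.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if rates[-1] >= capacity:
        return 0.0

    # Ordinary least-squares fit of rate against the hour index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so no projected crossing

    intercept = mean_y - slope * mean_x
    # Hour index where the fitted line crosses capacity, measured
    # relative to the most recent sample.
    crossing = (capacity - intercept) / slope
    return max(crossing - (n - 1), 0.0)
```

For example, with made-up samples of `[100, 110, 120, 130]` requests per second and a ceiling of `200`, the trend climbs 10 per hour and the function reports roughly 7 hours of headroom. Even something this crude, fed from metrics you already collect, is enough to turn a surprise into an alert.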