Entries Tagged 'Capacity'
March 18th, 2008 — Capacity, Events
So yesterday we were racking our brains trying to figure out why an app that was only seeing a 30% increase in page views was seeing a 300% increase in requests per second. Following our rules, we started with “why is the DB so slow?” but soon realized something else was going on. One of our engineers, while using Fiddler, noticed a bug in the Flash that made a call to /, the root of the app, on every mouse-over for no reason. The way the app was laid out, this would account for a huge number of unnecessary requests, somewhere in the neighborhood of 3000/sec at peak.
This got me thinking about what kind of QA would find this. Is it peer review, classic code review including the design portion, or should this be part of our role? We run our shop much like a startup, since it is primarily event driven, so we don’t have clearly defined classic development cycles. What this did show me is that designers are designers and developers are developers. While many can do both, sometimes it really is best to separate the functions.
In our org I believe we should have a technical QA team that works with the operations team to rip apart the final product from an engineering and technical production standpoint. I think this would provide the best level of accountability between the two teams and formalize the release without sacrificing the startup feel. Of course we would need to officially work this into the timeline, but it would leave the core teams focused on building the best possible products.
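The mismatch described above (requests per second up 300%, page views up only 30%) is the kind of thing a simple ratio check can surface. Here is a minimal sketch in Python, not what we actually ran. The function names, the sample numbers, and the list of request paths are all made up for illustration.

```python
from collections import Counter


def growth_pct(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) * 100.0 / before


def requests_per_page_view(requests_per_sec, page_views_per_sec):
    """How many HTTP requests each page view is generating on average."""
    return requests_per_sec / page_views_per_sec


def top_paths(request_paths, n=3):
    """Most frequently requested paths, e.g. pulled from an access log."""
    return Counter(request_paths).most_common(n)


# Hypothetical baseline vs. event numbers chosen to match the percentages above.
baseline_rps, event_rps = 1000, 4000   # +300% requests per second
baseline_pvs, event_pvs = 100, 130     # +30% page views per second

ratio_before = requests_per_page_view(baseline_rps, baseline_pvs)
ratio_after = requests_per_page_view(event_rps, event_pvs)

# If requests grow much faster than page views, something other than
# real traffic is generating them, so look at which paths are being hit.
if ratio_after > 1.5 * ratio_before:
    sample_paths = ["/", "/", "/video", "/", "/about"]  # made-up log sample
    print("requests per page view jumped:", ratio_before, "->", round(ratio_after, 1))
    print("top paths:", top_paths(sample_paths))
```

The 1.5x threshold is arbitrary. The point is only that trending the ratio, not just the raw request rate, would flag a bug like the stray call to / without anyone having to stumble on it.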
March 14th, 2008 — Capacity, Events
How can you know when something is about to go wrong if you can’t see it? We finally closed the loop today on some MSSQL trending we have been missing for a very long time. Being able to watch things like table scans/sec, batch reads & writes/sec, and transactions/sec is huge during an event. As much as we drill the importance of communicating changes into folks’ heads, it is still too easy for a simple change to have an unexpected impact on something like a DB. As I noted the other day, it is almost always the DB or the file system. While we have our share of issues that aren’t, we have chased our tails many times for lack of trending on the DB, and in the end it has been something stupid like a dropped index.
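For anyone wondering what trending a "/sec" number involves, here is a rough sketch in Python. It assumes the counter arrives as a running total that has to be turned into a rate between two polls, and it flags a rate that is well above a known baseline. The sample values, the baseline, and the 3x factor are invented for the example and are not our real numbers.

```python
def per_second_rates(samples):
    """Turn (timestamp_seconds, cumulative_count) samples into per-second rates.

    Samples must be in time order. Intervals with no elapsed time or a
    counter that went backwards (for example after a restart) are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


def exceeds_baseline(rate, baseline, factor=3.0):
    """True when a rate is more than `factor` times the normal baseline."""
    return rate > baseline * factor


# Hypothetical table-scan counter polled once a minute.
table_scans = [(0, 1000), (60, 1600), (120, 2200), (180, 14200)]
baseline_scans_per_sec = 10.0  # assumed normal level

for timestamp, rate in per_second_rates(table_scans):
    if exceeds_baseline(rate, baseline_scans_per_sec):
        print(f"t={timestamp}s: {rate:.0f} table scans/sec, check for a dropped index")
```

A jump like the last sample is the shape a dropped index tends to leave in a table scans graph, which is why having the history matters.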
March 11th, 2008 — Capacity, Events
“It is almost always the DB or the file system” is a loose quote from a SXSW panel on scalable web ventures. It hits way too close to home today not to write about. In our case it was both the DB and the file system. We were doing some last-minute load testing (you can never be too sure), and after almost the entire day, and roping in two other teams to dig into the DB and SAN, we realized the DB had not been set up right. All files associated with the specific database were on the same storage LUN. This cost us a 95% reduction in the throughput of our service. Splitting up the t-logs, data files, etc. onto different LUNs got us back to where we expected. The bottom line is we wouldn’t even have noticed this without trending everything possible in Cacti on our hosts, and while win2k3 disk trending has some hurdles for SAN-attached disk, it still pointed us in the right direction.
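The layout mistake above is easy to check for once you have a list of which file lives on which LUN. A small Python sketch follows. The file names and LUN labels are hypothetical, and how you collect the mapping depends on your storage setup.

```python
from collections import defaultdict


def shared_luns(file_to_lun):
    """Return {lun: [files]} for every LUN that holds more than one database file."""
    by_lun = defaultdict(list)
    for file_name, lun in file_to_lun.items():
        by_lun[lun].append(file_name)
    return {lun: sorted(files) for lun, files in by_lun.items() if len(files) > 1}


# Hypothetical layout resembling the problem: everything on one LUN.
bad_layout = {
    "app_data.mdf": "LUN0",
    "app_log.ldf": "LUN0",
    "tempdb.mdf": "LUN0",
}

# Hypothetical layout after splitting t-logs and data files apart.
good_layout = {
    "app_data.mdf": "LUN0",
    "app_log.ldf": "LUN1",
    "tempdb.mdf": "LUN2",
}

print(shared_luns(bad_layout))   # every file reported under LUN0
print(shared_luns(good_layout))  # nothing shared
```

Sharing a LUN is not always wrong, so treat the output as a list of things to look at before a load test, not as a verdict.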
March 11th, 2008 — Capacity, What?
In between avoiding any real specifics in an interview with GigaOM’s Stacey Higginbotham, Zuckerberg alludes to Facebook running on “tens of thousands” and approaching “hundreds of thousands” of hosts. According to comScore, Facebook supports up to 65 billion, yes that is billion with a B, page views a month, or more than 2 billion a day. Relatively speaking, that is a huge number of hosts to support the site. He does talk about how they use memcached extensively, but it sounds to me like some general re-architecture is in order. Or is this a VMware salesman’s dream?
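To put the numbers above in perspective, here is the back-of-the-envelope math as a short Python sketch. The 30-day month and the two host counts (10,000 and 100,000, the low ends of "tens of thousands" and "hundreds of thousands") are my assumptions. These are flat averages, so peak traffic would be higher, and a page view usually means more than one request.

```python
PAGE_VIEWS_PER_MONTH = 65_000_000_000  # comScore figure quoted above
DAYS_PER_MONTH = 30                    # assumption for the estimate
SECONDS_PER_DAY = 86_400

per_day = PAGE_VIEWS_PER_MONTH / DAYS_PER_MONTH
per_second = per_day / SECONDS_PER_DAY


def page_views_per_host_per_second(hosts):
    """Average page views each host would serve per second."""
    return per_second / hosts


print(f"{per_day:,.0f} page views per day")
print(f"{per_second:,.0f} page views per second on average")
for hosts in (10_000, 100_000):  # assumed host counts
    rate = page_views_per_host_per_second(hosts)
    print(f"{hosts:,} hosts: {rate:.2f} page views/sec per host")
```

Even at the low end of the host count, the average works out to only a few page views per second per host, which is what makes the host count look so large.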
March 2nd, 2008 — Capacity, Links
Jon Pral has a great list here of 85 Operations rules to live by. Here are my top 5 from the list.
1. Know your bottlenecks and know how to spot them – every layer – know if you are blocking on disk, RAM, or CPU. It is usually that simple.
2. The value of a project manager, tech writer, and financial analyst in the ops organization should not be underestimated. They will more than pay for themselves.
3. Monitor EVERYTHING – alert on actionable only, record other for trend information.
4. Assign people to be point people for every bit of technology.
5. Do it right the first time. Rarely do you get the chance to go back and redo things. If you do, it comes at a very big cost to the company. Take the hit on work, the first go round.
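Rule 3 above is the one I find easiest to get wrong in practice, so here is one way to picture it in Python. This is only a sketch of the idea. The metric names, the threshold, and the in-memory storage are invented for the example and do not describe any particular monitoring tool.

```python
def handle_sample(name, value, alert_thresholds, trend_store, alerts):
    """Record every sample for trending; alert only on metrics someone can act on."""
    # Always keep the sample so there is history to look back on.
    trend_store.setdefault(name, []).append(value)
    # Only metrics with a configured threshold are considered actionable.
    limit = alert_thresholds.get(name)
    if limit is not None and value > limit:
        alerts.append((name, value, limit))


# Hypothetical setup: disk queue length is actionable, cache hit rate is trend-only.
alert_thresholds = {"disk_queue_length": 8}
trend_store, alerts = {}, []

incoming = [
    ("disk_queue_length", 2),
    ("cache_hit_pct", 97),
    ("disk_queue_length", 12),
]
for name, value in incoming:
    handle_sample(name, value, alert_thresholds, trend_store, alerts)

print("alerts:", alerts)
print("trend data kept for:", sorted(trend_store))
```

Everything is recorded, but only the one sample over its threshold pages anyone.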
February 18th, 2008 — Capacity
I didn’t write about the Amazon storage service outage here before now, but I have been thinking a lot about what we can all learn from it. First, a few details: the Amazon S3 storage solution had issues from 3:30am PT to 6:48am PT on 2/15. The issue manifested itself as a “large” increase in authenticated calls to the S3 service. The real problem is the team didn’t know this was coming until it was too late. To resolve the problem, the Amazon team moved in additional capacity to handle the increase in authenticated requests.
I can certainly feel for the Amazon team; being caught off guard is NOT a good feeling. So what monitoring is missing from your environment? This should be an opportunity for all of us to think about the little service that everything relies on and that could cripple the environment. Monitoring, trending, and basic capacity planning are critical to the health of all our applications. We have been working much more closely with our engineering teams than ever before to instrument all parts of the applications supporting our sites via JMX. Call it what you want, and I don’t like the word, but it feels like a good time for a basic monitoring audit.
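As a starting point for the capacity planning part of such an audit, here is a minimal Python sketch that fits a straight line to recent usage and estimates how many more sample periods remain before a known capacity limit is hit. The usage numbers and the capacity figure are made up. A straight-line trend will not catch a sudden spike like the one Amazon saw, so this complements alerting and does not replace it.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_capacity(usage, capacity):
    """Estimated sample periods left before the trend line reaches `capacity`.

    Returns None when there are too few samples or usage is not growing.
    """
    if len(usage) < 2:
        return None
    xs = list(range(len(usage)))
    slope, intercept = linear_fit(xs, usage)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical weekly peak requests/sec against a hypothetical limit of 200.
weekly_peaks = [100, 120, 140, 160]
print(periods_until_capacity(weekly_peaks, capacity=200))  # weeks of headroom left
```

Run against every service that everything else depends on, even a crude estimate like this gives you a list of where to look first.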