Wow, this really is the life of an Ops guy.

http://thewebsiteisdown.com/

Amazon flood gates open.

Two huge new features were announced today for EC2. The first is Elastic IPs, which is basically the static IP solution everyone has been waiting for, but better! Elastic IP is a 1:1 NAT solution. What is so cool about this is that you can dynamically remap your static IP to different running instances, creating a poor man’s HA solution. The second feature is Availability Zones. This allows you to launch instances in isolated zones that Amazon describes as “distinct locations engineered to be insulated from failures in other zones.” The next step is allowing region-specific selection as well; currently you are limited to selecting a zone within the region defined for your account. This provides a huge increase in availability and will certainly make organizations take another hard look at what Amazon has to offer to extend or augment their existing facilities.
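To make the poor man’s HA idea concrete, here is a minimal sketch of the failover decision in Python. Everything in it is an assumption for illustration: the `fail_over` name, the instance IDs, and the two callables you pass in. In a real setup `is_healthy` would be your own health check and `remap` would wrap the EC2 call that associates an Elastic IP with an instance.

```python
from typing import Callable, Iterable, Optional


def fail_over(
    elastic_ip: str,
    candidates: Iterable[str],
    is_healthy: Callable[[str], bool],
    remap: Callable[[str, str], None],
) -> Optional[str]:
    """Point elastic_ip at the first healthy instance and return its ID.

    All names here are hypothetical; `remap` stands in for whatever
    actually re-associates the Elastic IP.
    """
    for instance_id in candidates:
        if is_healthy(instance_id):
            remap(elastic_ip, instance_id)
            return instance_id
    return None  # nothing healthy, so leave the current mapping alone


if __name__ == "__main__":
    # Fake health state and a fake mapping table, for demonstration only.
    health = {"i-primary": False, "i-standby": True}
    mapping = {}
    chosen = fail_over(
        "203.0.113.10",
        ["i-primary", "i-standby"],
        is_healthy=lambda instance_id: health[instance_id],
        remap=lambda ip, instance_id: mapping.__setitem__(ip, instance_id),
    )
    print(chosen, mapping)
```

Keeping the health check and the remap as plain callables means the decision logic can be tested without touching AWS at all.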

A Different Kind of QA

So yesterday we were racking our brains trying to figure out where a 300% increase in requests per second was coming from on an app seeing only a 30% increase in page views. We started with “why is the DB so slow,” following our rules, but soon realized something else was going on. One of our engineers, while using Fiddler, noticed an error in the Flash that on mouseover made a call to /, the root of the app, for no reason. The way the app was laid out, this would account for a huge number of unnecessary requests, somewhere in the neighborhood of 3000/sec at peak.
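One cheap way to spot this kind of thing without a proxy is to tally requests by path and see whether anything is out of proportion to page views. The sketch below is not what we ran; it assumes access log lines that contain a quoted request such as "GET / HTTP/1.1" (common/combined log style), and `top_paths` is a made-up helper name. A different log format would need a different regex.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: each log line contains a quoted request like "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD|PUT|DELETE) (\S+) HTTP/[\d.]+"')


def top_paths(lines: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most requested paths with their hit counts."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            # Drop the query string so /page?a=1 and /page?a=2 count together.
            path = match.group(1).split("?", 1)[0]
            counts[path] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [01/Jan/2008:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
        '10.0.0.1 - - [01/Jan/2008:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2008:00:00:02 +0000] "GET /page?id=7 HTTP/1.1" 200 2048',
        '10.0.0.1 - - [01/Jan/2008:00:00:02 +0000] "GET / HTTP/1.1" 200 512',
    ]
    for path, hits in top_paths(sample):
        print(path, hits)
```

If the root path dwarfs every real page in the tally, that is a strong hint something is calling it that shouldn’t be.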

This got me thinking: what kind of QA would find this? Is it peer review, classic code review including the design portion, or should this be part of our role? We run our shop much like a startup, as it is primarily event driven, so we don’t have the classic development cycles clearly defined. What this did show me is that designers are designers and developers are developers; while many can do both, sometimes it really is best to separate the functions.

In our org I believe we should have a technical QA team that works with the operations team, ripping the final product apart from an engineering and technical production standpoint. I think this would provide the best level of accountability on the two teams and formalize the release without sacrificing the startup feel. Of course we would need to officially work this into the timeline, but it would leave the core teams focusing on building the best possible products.

Importance of DB Trending

How can you know when something is about to go wrong if you can’t see it? We finally closed the loop today on some MSSQL trending we have been missing for a very long time. Being able to watch things like table scans/sec, batch reads & writes/sec, and transactions/sec is huge during an event. As much as we drill into folks’ heads the importance of communicating changes, it is still too easy for a simple change to have an unexpected impact on something like a DB. As I noted the other day, it is almost always the DB or the file system. While we have our share of issues that aren’t, many times we have chased our tail due to lack of trending on the DB, and in the end it has been something stupid like an index getting dropped.
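For anyone rolling their own poller, the core of per-second trending is simple: sample a running total twice and divide the difference by the elapsed time. The sketch below assumes the counters arrive as cumulative totals keyed by name; the function name and the sample counter names are illustrative, not the exact ones we graph. (Cacti users get this for free from RRDtool’s COUNTER data source type.)

```python
from typing import Dict


def per_second_rates(
    earlier: Dict[str, int],
    later: Dict[str, int],
    elapsed_seconds: float,
) -> Dict[str, float]:
    """Turn two samples of cumulative counters into per-second rates."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates: Dict[str, float] = {}
    for name, new_value in later.items():
        old_value = earlier.get(name)
        if old_value is None or new_value < old_value:
            # Counter is new or was reset (e.g. a service restart);
            # skip this interval rather than graph a bogus spike.
            continue
        rates[name] = (new_value - old_value) / elapsed_seconds
    return rates


if __name__ == "__main__":
    # Made-up samples taken 60 seconds apart.
    first = {"transactions": 1000, "full_scans": 50}
    second = {"transactions": 1600, "full_scans": 80}
    print(per_second_rates(first, second, 60))
```

A dropped index tends to show up here as a jump in the scan rate with no matching jump in transactions, which is exactly the kind of thing you can’t see without the trend.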

“If it isn’t your DB it’s your File System”

That is a loose quote from a SXSW panel on scalable web ventures. This just hits way too close to home today not to write about. In our case it was both the DB and the file system. We were doing some last-minute load testing (you can never be too sure), and after almost the entire day, and roping in two other teams to dig into the DB and SAN, we realized the DB had not been set up right. All files associated with the specific database were on the same storage LUN. This caused a 95% reduction in the throughput of our service. Splitting up the t-logs, data files, etc. onto different LUNs got us back to where we expected. The bottom line is we wouldn’t even have noticed this without trending everything possible in Cacti on our hosts, and while win2k3 disk trending has some hurdles for SAN-attached disk, it still pointed us in the right direction.

Seriously Facebook?

In between avoiding any real specifics in an interview with GigaOM’s Stacey Higginbotham, Zuckerberg alludes to Facebook running on “tens of thousands” and approaching “hundreds of thousands” of hosts. According to comScore, Facebook supports up to 65 billion, yes that is billion with a B, page views a month, or more than 2 billion a day. Relatively speaking, that is a huge number of hosts to support the site. He does talk about how they use memcached extensively, but it sounds to me like some general re-architecture is in order. Or is this a VMware salesman’s dream?
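Here is the back-of-the-envelope math behind that reaction, as a small Python sketch. The 65 billion figure is the comScore number above; the host counts of 10,000 and 100,000 are just my stand-ins for “tens of thousands” and “hundreds of thousands”, and a 30-day month is assumed. A page view is not one request, and plenty of those hosts are databases and caches rather than web servers, so treat this as a rough ratio and nothing more.

```python
SECONDS_PER_DAY = 86_400


def page_views_per_host_per_second(
    monthly_page_views: float,
    hosts: int,
    days_in_month: int = 30,
) -> float:
    """Average page views each host would serve per second."""
    per_second = monthly_page_views / (days_in_month * SECONDS_PER_DAY)
    return per_second / hosts


if __name__ == "__main__":
    monthly = 65e9  # comScore figure quoted in the post
    print("per day:", monthly / 30)
    # Host counts below are illustrative assumptions, not reported numbers.
    for hosts in (10_000, 100_000):
        rate = page_views_per_host_per_second(monthly, hosts)
        print(hosts, "hosts:", round(rate, 2), "page views/sec/host")
```

That works out to roughly 25,000 page views a second site-wide, or about 2.5 per host per second at 10,000 hosts and about 0.25 at 100,000.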

Digg Cashing in finally?

[Image: krose.gif]

Well, I woke up to another Digg sellout rumor, so it must be the first Friday of the month. What is interesting this time is that the players couldn’t get much bigger: Google and Microsoft, thanks to TechCrunch breaking the news. While this has to be the geekiest picture ever on the cover of a magazine, it is damn good marketing. Kevin Rose took a suggestion he made to the Slashdot founder over lunch and turned it into the largest social news site, one that so far can’t be outdone. The Digg stack is completely LAMP (High Scalability has the details) and was done on a shoestring budget of course, with the exception of support from Equinix, where Digg CEO and co-founder Adelson was the founder and CTO. Good idea, reasonable execution, and a solid Ops team, which started with one guy.

Operations Rules

Jon Pral has a great list here of 85 Operations rules to live by. Here are my top 5 from the list.

1. Know your bottlenecks and know how to spot them – every layer – know if you are blocking on disk, RAM, or CPU. It is usually that simple.

2. The value of a project manager, tech writer, and financial analyst in the ops organization should not be underestimated. They will more than pay for themselves.

3. Monitor EVERYTHING – alert on actionable only, record other for trend information.

4. Assign people to be point people for every bit of technology.

5. Do it right the first time. Rarely do you get the chance to go back and redo things. If you do, it comes at a very big cost to the company. Take the hit on work, the first go round.