This has been one hell of a week for the entertainment industry. Ed McMahon, Farrah Fawcett and Michael Jackson have all passed away. Whenever significant cultural events like this occur, there is an explosion in communication as people want to know what happened and discuss it with their peers. In the past this would have been limited to talking with your neighbours, family and friends, either in person or over a traditional POTS line. Fast forward to the 21st century and we now have real-time, bidirectional communication between virtually anyone, anywhere in the world.
When you have an unpredictable event, like the death of a societal icon or the launch of a new service with the potential for extremely rapid adoption (or at the very least high traffic due to curiosity alone), it is very difficult, if not practically impossible, to anticipate the real-world resources needed to support the inbound demand. This is clearly shown by the chart from Keynote Systems illustrating the availability and performance impact of this event on news websites.

Image from: http://www.datacenterknowledge.com/archives/2009/06/25/michael-jackson-news-slows-web-sites/
TMZ.com was the first news outlet to break the story of Michael Jackson’s death, and consequently their site collapsed outright under the unexpected workload. It’s hard to fault the IT team responsible for the service delivery. After all, no one knew MJ was going to pass away yesterday, and arguably there is no one in entertainment today who would have generated the level of public interest that he did. So where am I going with all of this? To the clouds!
If there was ever a real-world example of where a cloud solution would have played nicely into the delivery of a service that can be hit without warning by transient, high-intensity workloads, this is it. Even a properly architected high-volume application or service that is designed to handle large increases in transient load has a finite capacity. If TMZ.com had been able to automatically spin up cloud resources and shunt the new traffic load over to them during the media frenzy, ideally they would have stayed up through the peak of the traffic and delivered service quality and performance as good as their normal service levels. (For the shunting, I’m a big fan of F5 gear for application delivery networking.)
Now, they could have done this manually, I suppose. When they saw the traffic coming they could have provisioned some AWS instances, got their site and content up and running, and started routing traffic over through a change to their load balancers. That’ll work, but it’s manual and takes time to implement, and by the time they’re done their end users have already hit a dead site and gone to one of their competitors. So what to do? Automate!
With the 5.2 release of up.time that was launched on Wednesday (June 24th, 2009), up.time now has a full bi-directional integration with VMware Orchestrator. If you are a VMware shop, you get Orchestrator for free with vCenter Server. If you are not familiar with Orchestrator, you can check it out here. Essentially, Orchestrator is a policy-based workflow automation tool that you can use to build automated scenarios to perform, well, pretty much anything. Orchestrator has the concept of plugins, which give it the know-how to interact directly with specific vendor technologies. For example, the up.time plugin for Orchestrator lets you add elements, create, modify and delete groups and service groups, and perform other tasks from within Orchestrator. (Under the hood, this is enabled by a new set of web services in up.time 5.2.) So how does this play into the TMZ.com cloud scenario? Well, it goes something like this.
- up.time is monitoring the end user experience for the website as seen at the logical service address (www.mynewssite.com), using the HTTP service monitor or the WATM monitor.
- You can monitor the logical service for overall end user experience.
- You can monitor the individual web servers to identify whether any given server is being overloaded, and to determine whether that is expected behaviour or an issue such as a load balancer algorithm misconfiguration.
- You can configure whatever service monitors you need (database, business logic, logs, etc) to determine the ongoing health of the service you are delivering and use that to trigger the automated resolution.
- When your end users begin to suffer or servers start to indicate they are becoming overloaded, have up.time trigger an Orchestrator workflow to automatically head off any end user incident that may occur due to insufficient resources. That would look something like this:
- Using an action profile within up.time, trigger the Orchestrator workflow you have defined for automatically shunting workload to the cloud or scaling out internally onto idle resources. How you resolve it from a capacity perspective is really up to you. You could have different capacity scale-out workflows depending on where the performance bottleneck is: if your web servers are overloaded, shunt to the cloud; if your database is overloaded, add a new node to your cluster. In this scenario let’s scale out our web tier.
- up.time tells Orchestrator to trigger the ‘mywebsite cloud scaleout’ workflow. Orchestrator then manages the following:
- Provision and configure an AWS server (or many if you need to) with the appropriate OS and web content.
- Add the new AWS instances into up.time (via the up.time Orchestrator plugin, which is downloadable from our site)
- Add the instances to the appropriate up.time groups
- Add the instances to the appropriate up.time service groups so the new services are monitored and managed
- Update the load balancer virtual IP pools to include the new AWS instances and begin sending traffic
- We’re now sending traffic to our AWS cloud without anyone ever having had to do anything other than the initial Orchestrator configuration.
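The steps above can be sketched in Python to show the ordering. This is only an illustration, not the actual Orchestrator workflow or the up.time plugin API: `FakeCloud`, `FakeMonitoring`, `FakeLoadBalancer`, the group names and every method name are hypothetical stand-ins, implemented in memory so the sketch runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a cloud provisioning API (AWS in the post)."""
    counter: int = 0

    def provision_web_server(self) -> str:
        # Pretend to launch an instance with the OS and web content in place.
        self.counter += 1
        return f"aws-web-{self.counter}"


@dataclass
class FakeMonitoring:
    """Hypothetical stand-in for the monitoring system (up.time in the post)."""
    elements: set = field(default_factory=set)
    groups: dict = field(default_factory=dict)
    service_groups: dict = field(default_factory=dict)

    def add_element(self, host: str) -> None:
        self.elements.add(host)

    def add_to_group(self, group: str, host: str) -> None:
        self.groups.setdefault(group, set()).add(host)

    def add_to_service_group(self, group: str, host: str) -> None:
        self.service_groups.setdefault(group, set()).add(host)


@dataclass
class FakeLoadBalancer:
    """Hypothetical stand-in for a load balancer's virtual IP pool."""
    pool: list = field(default_factory=list)

    def add_member(self, host: str) -> None:
        self.pool.append(host)


def scale_out_web_tier(cloud, monitoring, lb, count=1):
    """Run the scale-out steps in the order listed above; return the new hosts."""
    new_hosts = []
    for _ in range(count):
        host = cloud.provision_web_server()                   # provision and configure
        monitoring.add_element(host)                          # add to monitoring
        monitoring.add_to_group("web-tier", host)             # add to a group
        monitoring.add_to_service_group("web-checks", host)   # attach service monitors
        lb.add_member(host)                                   # only now send traffic
        new_hosts.append(host)
    return new_hosts


if __name__ == "__main__":
    cloud, monitoring, lb = FakeCloud(), FakeMonitoring(), FakeLoadBalancer()
    lb.add_member("onprem-web-1")
    print(scale_out_web_tier(cloud, monitoring, lb, count=2))
    print(lb.pool)
```

The one design choice worth noting is the order: each instance is registered with monitoring before it joins the load balancer pool, so no server ever takes traffic unmonitored.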
I realize that technically the Orchestrator piece is not a three-clicks-and-you’re-in-Nirvana exercise. However, once it is implemented, you’ll be able to have your web properties auto scale based on inbound workload before there is ever a problem. Take it a step further and you can have up.time, via Orchestrator, deprovision the AWS resources when your site workload drops back to normal levels, so you close the loop on provision-deprovision and only pay for the AWS resources when you need them. Pretty cool eh? I think so. So with a little up-front configuration in Orchestrator and up.time you can implement Automated Incident Avoidance and keep your services running when they are faced with unforeseen transient workloads. With up.time and Orchestrator, this is only one example out of literally hundreds (dare I say thousands) of ways you can automate your infrastructure management to ensure you are operating at the highest possible levels of efficiency, from both a technology and a resource standpoint.
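To make the provision-deprovision loop a little more concrete, here is a minimal Python sketch of the decision logic. The response-time metric, the thresholds, the instance cap and the function names are all assumptions made for illustration; in the setup described here that decision would live in up.time service monitors and action profiles, not in a script.

```python
def decide_scaling(response_ms, cloud_instances,
                   scale_out_ms=2000, scale_in_ms=500, max_instances=10):
    """Return 'scale_out', 'scale_in' or 'hold' for one monitoring sample.

    The gap between the two thresholds keeps the system from flapping
    between provisioning and deprovisioning on every small change.
    """
    if response_ms >= scale_out_ms and cloud_instances < max_instances:
        return "scale_out"
    if response_ms <= scale_in_ms and cloud_instances > 0:
        return "scale_in"
    return "hold"


def simulate(samples):
    """Replay response-time samples and track the cloud instance count."""
    instances = 0
    history = []
    for ms in samples:
        action = decide_scaling(ms, instances)
        if action == "scale_out":
            instances += 1   # would trigger the scale-out workflow
        elif action == "scale_in":
            instances -= 1   # would trigger the deprovision workflow
        history.append((ms, action, instances))
    return history


if __name__ == "__main__":
    for row in simulate([300, 2500, 3000, 1200, 400, 400, 400]):
        print(row)
```

In this replay the instance count climbs during the spike, holds while response times sit between the two thresholds, and returns to zero once the site is back to normal, which is the "only pay for what you use" behaviour described above.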
