One of the definitions of insanity is doing the same thing over and over and expecting different results. So why does this affliction continue to haunt us in IT? Given the significant advances in technology, specifically in virtualization, all of which are supposed to make our lives easier and more efficient, why haven’t we whipped the beast of IT complexity? The majority of IT environments are still stuck in a world of break-fix, although perhaps we’re just getting into a faster break-fix mode.
In one recent Gartner article, “Server Virtualization for x86: A Benefits Impact Assessment,” there is a rather telling statement:
“From Gartner surveys and client interactions, we know that, operationally, virtualization appears to be a ‘wash,’ at best — and it actually creates additional costs (people, process development and tools) on a worst-case basis.”
So what are we doing wrong? One reason is that in daily operations, there isn’t an easy way to prioritize incoming incidents or identify recurring problems. I would categorize the recurring outages as “death by a thousand cuts.” This is further exacerbated on teams with a number of sysadmins, where the same problem can be perceived as a distinct problem by each sysadmin. Having multiple sysadmins solve the same problem over and over wastes resources.
Additionally, in VMware environments, the traditional metrics of guest CPU, memory, and I/O are not very useful anymore. They aren’t good indicators of how guests ‘get along’ during regular compute workloads. There is a whole new series of VMware-specific metrics that indicate VM guest contention from a compute, bandwidth, and memory usage point of view. Systems management tooling needs to understand these new factors to aid in managing a virtual infrastructure, something that traditional ‘Big 4’ tooling just can’t do. Putting ill-matched VM guests onto the same physical infrastructure is simply asking for incidents and accumulated outages over time.
The right approach should be to stop banging your head against the wall, rather than simply taking two aspirin every day and dealing with the pain. Instead of waiting for incidents to occur, it should be possible to avoid them proactively. VMware’s launch of vSphere in May also bundles a package called Orchestrator (it comes from their acquisition of Dunes a few years ago). This is fantastic news for SMBs (and enterprises too), as it means that any installation of VMware vSphere will have runbook automation capabilities. VMware’s Orchestrator offers a very simple drag-and-drop interface for creating (potentially complex) workflows to control your virtual infrastructure. The latest release of up.time integrates tightly with Orchestrator to add application-level monitoring and management capabilities, and it can trigger specific workflows when certain applications are about to exceed SLA objectives or will degrade unless corrective action is taken. The up.time API is also exposed through an Orchestrator plug-in, allowing bi-directional communication between Orchestrator and up.time (so you can dynamically add systems, or re-group them on the fly).
So rather than wait for an application to fail and trigger an incident, up.time can take corrective action in advance to completely avoid the incident. This starts to help us snap out of the break-fix routine that we’re all stuck in. Let’s take an example of incident avoidance and dynamic infrastructure:
Let’s say that you have an e-commerce application whose response time must stay under certain thresholds, and for which the number of concurrent user sessions is also a factor. Since up.time is already a micro-framework that can monitor your entire infrastructure (applications, databases, platforms, networks, etc.), it can trigger actions based on identified thresholds. If user sessions start to peak or response time begins to degrade, up.time can trigger an Orchestrator workflow to dynamically provision additional VM guests and bring them online into the e-commerce application. Also, since up.time understands the application, as workload drops over time (e.g. the user peak has dissipated), workflows can then be triggered to de-provision the extra VM guests to avoid sprawl.
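To make the scenario a bit more concrete, here is a minimal Python sketch of the kind of decision logic described above. It is not up.time’s or Orchestrator’s actual API: the `Thresholds` class, the `decide` function, and every number in it are hypothetical, chosen only for illustration. It returns “provision” when response time or sessions per guest crosses a limit, “deprovision” only once the load has fallen well below that limit (so guests aren’t added and removed in a loop), and “hold” otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits for the e-commerce example (illustrative values)."""
    max_response_ms: float = 800.0     # response-time objective
    max_sessions_per_guest: int = 200  # concurrent sessions one guest should carry
    scale_down_fraction: float = 0.5   # load must fall this far below the limits
    min_guests: int = 2                # never shrink below this
    max_guests: int = 10               # never grow beyond this


def decide(response_ms: float, sessions: int, guests: int,
           t: Thresholds = Thresholds()) -> str:
    """Return 'provision', 'deprovision', or 'hold' for one monitoring sample."""
    if guests < 1:
        raise ValueError("at least one guest must be running")

    per_guest = sessions / guests
    overloaded = (response_ms > t.max_response_ms
                  or per_guest > t.max_sessions_per_guest)
    if overloaded and guests < t.max_guests:
        return "provision"

    # Only shrink if the remaining guests would still be comfortably idle.
    if guests > max(t.min_guests, 1):
        per_guest_after = sessions / (guests - 1)
        quiet = (response_ms < t.max_response_ms * t.scale_down_fraction
                 and per_guest_after
                 < t.max_sessions_per_guest * t.scale_down_fraction)
        if quiet:
            return "deprovision"

    return "hold"


if __name__ == "__main__":
    print(decide(response_ms=950, sessions=300, guests=2))
    print(decide(response_ms=300, sessions=200, guests=4))
    print(decide(response_ms=600, sessions=400, guests=4))
```

In a real deployment, the “provision” and “deprovision” results would be what kicks off the corresponding workflow; how that hand-off is wired up depends on the product and is not shown here.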
There are many more exciting things in this release, but we’ll cover those in another blog post. I’m also going to cover the exciting capability of bridging private and public clouds with up.time. What about dynamically provisioning compute capacity in Terremark’s cloud or Amazon’s EC2 from the privacy of your own infrastructure, and then having these instances monitored under the global purview of up.time? We can do this; more info in the next blog post.