The up.time IT Systems Management Blog

Archive for the ‘virtualization vmware’ Category

Cloud computing and popular culture

Friday, June 26th, 2009

This has been one hell of a week for the entertainment industry.  Ed McMahon, Farrah Fawcett and Michael Jackson have all passed away.  Whenever significant cultural events like this occur there is an explosion in communication amongst people, wanting to know what happened and further discuss it amongst their peers.  In the past this would have been isolated to talking with your neighbours, family and friends either in person or over a traditional POTS line.  Fast forward to the 21st century and we now have real time bidirectional communication between virtually anyone anywhere in the world. 

When you have an unpredictable event like the death of a societal icon, or the launch of a new service that has the potential for extremely rapid adoption (or at the very least high traffic due to curiosity alone), it is very difficult, if not practically impossible, to anticipate the real world resources needed to support the inbound demand.  This is very clearly shown by the chart from Keynote Systems illustrating the availability and performance impact of this event on news websites.


Image from: http://www.datacenterknowledge.com/archives/2009/06/25/michael-jackson-news-slows-web-sites/

TMZ.com was the first news outlet to break the story of Michael Jackson’s death, and consequently their site collapsed outright from the unexpected workload.  It’s hard to fault the IT team responsible for service delivery; after all, no one knew MJ was going to pass away yesterday, and arguably there is no one in entertainment today who would have generated the same level of public interest.  So where am I going with all of this?  To the clouds!  If there was ever a real world example of where a cloud solution would have played nicely into the delivery of a service that can be hit by transient, high-intensity workloads without warning, this is it.  Even a properly architected high volume application or service that is designed to handle large increases in transient load has a finite capacity.  If TMZ.com had been able to automatically spin up cloud resources and shunt the new traffic load over to them during the media frenzy, ideally they would have stayed up through the peak of the traffic and provided service quality and performance as good as their normal service levels.  (For the shunting, I’m a big fan of F5 gear for application delivery networking.)

Now, they could have done this manually, I suppose: when they saw the traffic coming they could have provisioned some AWS instances, got their site/content up and running and started routing traffic over through a change to their load balancers.  That’ll work, but it’s also manual, it’s going to take time to get it all implemented, and by the time they’re done their end users have already hit a dead site and gone to one of their competitors.  So what to do?  Automate!

With the 5.2 release of up.time that was launched on Wednesday (June 24th, 2009), up.time now has a full bi-directional integration with VMware Orchestrator.  If you are a VMware shop, you get Orchestrator for free with vCenter Server.  If you are not familiar with Orchestrator, you can check it out here.  Essentially, Orchestrator is a policy based workflow automation tool that you can use to build automated scenarios to perform, well, pretty much anything.  Orchestrator has the concept of plugins that give it the know-how to interact directly with specific vendor technologies.  For example, the up.time plugin for Orchestrator lets you do things like add elements, create/modify/delete groups and service groups, and perform other tasks from within Orchestrator.  (Under the hood, this is enabled by a new set of web services in up.time 5.2.)  So how does this play into the TMZ.com cloud scenario?  Well, it goes something like this.

  1. up.time is monitoring the end user experience for the website as seen by the logical service address using the HTTP service or WATM monitor.  (www.mynewssite.com)
    1. You can monitor the logical service for overall end user experience.
    2. You can monitor the individual web servers to identify if any given server is being overloaded to determine if that is expected behaviour or an issue like a load balancer algorithm misconfiguration.
    3. You can configure whatever service monitors you need (database, business logic, logs, etc) to determine the ongoing health of the service you are delivering and use that to trigger the automated resolution.
  2. When your end users begin to suffer or servers start to indicate they are becoming overloaded, have up.time trigger an Orchestrator workflow to automatically avoid any end user incident that may occur due to insufficient resources.  That would look something like this:
    1. Using an action profile within up.time, trigger the Orchestrator workflow you have defined for automatically shunting workload to the cloud or scaling out internally onto idle resources.  How you resolve it from a capacity perspective is really up to you.  You could have different capacity scale out workflows depending on where the performance bottleneck is.  If your web servers are overloaded, shunt to the cloud; if your database is overloaded, add a new node to your cluster.  In this scenario let’s scale out our web tier.
    2. up.time tells Orchestrator to trigger the ‘mywebsite cloud scaleout’ workflow; Orchestrator then manages the following:
      1. Provision and configure an AWS server (or many if you need to) with the appropriate OS and web content.
      2. Add the new AWS instances into up.time (via the up.time Orchestrator plugin, it’s downloadable from our site)
        1. Add the instances to the appropriate up.time groups
        2. Add the instances to the appropriate up.time service groups so the new services are monitored and managed
      3. Update the load balancer virtual IP pools to include the new AWS instances and begin sending traffic
    3. We’re now sending traffic to our AWS cloud without anyone ever having had to do anything other than the initial Orchestrator configuration.
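To make the ordering of those steps concrete, here is a minimal sketch in Python. It is not the up.time or Orchestrator API: every name in it (`ScaleOutState`, `provision_instances`, `run_scaleout_workflow`, `should_trigger`, the instance names and the thresholds) is a hypothetical stand-in, and the "cloud" is just a list in memory. The point is only the sequence: detect the degradation, provision, register the new instances for monitoring, and only then send them traffic.

```python
from dataclasses import dataclass, field


@dataclass
class ScaleOutState:
    """In-memory stand-ins for the systems the workflow touches (all hypothetical)."""
    monitored_elements: list = field(default_factory=list)  # stand-in for the monitoring inventory
    service_groups: dict = field(default_factory=dict)      # service group name -> instance names
    load_balancer_pool: list = field(default_factory=list)  # stand-in for the virtual IP pool


def provision_instances(count, prefix="web-cloud"):
    # Hypothetical: a real workflow would call the cloud provider here.
    return [f"{prefix}-{i + 1}" for i in range(count)]


def should_trigger(response_time_ms, threshold_ms):
    """Action-profile style check: fire when the end user experience degrades."""
    return response_time_ms > threshold_ms


def run_scaleout_workflow(state, count, service_group):
    """Provision, register for monitoring, then add to the load balancer pool."""
    instances = provision_instances(count)
    for name in instances:
        state.monitored_elements.append(name)
        state.service_groups.setdefault(service_group, []).append(name)
    # Traffic is sent only after the new instances are being monitored.
    state.load_balancer_pool.extend(instances)
    return instances


state = ScaleOutState(load_balancer_pool=["web-1", "web-2"])
if should_trigger(response_time_ms=4200, threshold_ms=2000):
    run_scaleout_workflow(state, count=2, service_group="mywebsite")
print(state.load_balancer_pool)
```

Registering the instances before they join the pool is a design choice in this sketch: it means the new capacity is never serving traffic unmonitored.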

I realize that technically the Orchestrator piece is not a three-clicks-and-you’re-in-Nirvana exercise; however, once it is implemented you’ll be able to have your web properties auto scale based on inbound workload before there is ever a problem.  Take it a step further and you can have up.time, via Orchestrator, deprovision the AWS resources when your site workload drops back to normal levels, so you can close the loop on provision-deprovision and only pay for the AWS resources you use when you need them.  Pretty cool eh?  I think so.  So with a little up front configuration in Orchestrator and up.time you can implement Automated Incident Avoidance and keep your services running when they are faced with the potential of unforeseen transient workloads.  With up.time and Orchestrator, this is only one example out of literally hundreds (dare I say thousands) of ways you can automate your infrastructure management to ensure you are operating at the highest possible levels of efficiency from both a technology and a resource standpoint.

Virtual Appliances

Wednesday, June 24th, 2009

I love these things!

VMware’s Virtual Appliance Marketplace (VAM) is like a candy store for us geeks and nerds.  While not quite as robust as, say, the iPhone App Store, there are hundreds of ready-made appliances for hundreds of applications. Pick your solution, download it and run it on your favorite VMware virtualization platform.  Don’t like it?  Simply delete it, nothing to ‘uninstall’.

For those of you who don’t know, a Virtual Appliance is the modern day equivalent of a turn-key application.  The OS, application and any supporting tools are pre-installed and ready to power up.  They save you gobs of time, especially when evaluating a solution. The best part?  Batteries are included, and some assembly is NOT required!  In most cases you don’t need to provision any new virtual hardware, or ask your storage manager for more space on the SAN.  Don’t have a virtualization platform yet? You can download VMware Player, for free, and run the appliance on your desktop.

I know that virtual appliances aren’t that new; they’ve been around for a while now.  I know, “way to be late to the game, Mitchell!”  But it’s only recently that VMware has been pushing awareness through their VAM portal, and I’m particularly excited today.  Why?

up.time 5.2 has been appliancized!

The up.time Virtual Appliance is finally here and is a dream come true. Now instead of downloading up.time, making sure you meet all the system requirements, possibly installing a new OS and spending time simply readying yourself for our lightning-fast install, we’ve done it all for you!  Download the appliance and run it. It’s that simple.  Seriously, I fired up the appliance and was ready to play with up.time in about 3 minutes (excluding download time)!

This is a game changer.  No longer are you tied to a platform or hardware. You can truly be up and monitoring in minutes. Don’t like it? Go ahead and delete it.  Love it? Move it to a production virtualization platform, like ESX, and run with it.

We’re confident that you’ll love it.

Download the up.time virtual appliance today and try it free for 30 days.  We’d love to hear what you think (comments please!).

Stop the insanity

Monday, June 22nd, 2009

One of the definitions of insanity is doing the same thing over and over and expecting different results.  So why does this affliction continue to haunt us in IT?  Given the significant advances in technology, specifically in virtualization, all of which are supposed to make our lives easier and more efficient, why haven’t we whipped the beast of IT complexity?  The majority of IT environments are still stuck in a world of break-fix; perhaps we’re just getting into a faster break-fix mode.

In one recent Gartner article, “Server Virtualization for x86: A Benefits Impact Assessment,” there is a rather telling statement:

“From Gartner surveys and client interactions, we know that, operationally, virtualization appears to be a “wash,” at best — and it actually creates additional costs (people, process development and tools) on a worst-case basis. “

So what are we doing wrong?  One reason is that in daily operations, there isn’t an easy way to prioritize incoming incidents or identify recurring problems.  I would categorize the recurring outages as “death by a thousand cuts.”  This is further exacerbated on teams with a number of sysadmins, where the same problem can look like a distinct problem to each sysadmin.  Having multiple sysadmins solve the same problem over and over is a waste of resources.
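As a rough illustration of what spotting a recurring problem could look like, here is a small Python sketch that groups incidents by host and problem and shows how many different sysadmins ended up fixing the same thing. The incident log, its field names and the `recurring_problems` helper are all made up for this example; they are not an export format from any particular tool.

```python
from collections import Counter

# Hypothetical incident log; the field names are assumptions for this sketch.
incidents = [
    {"host": "web01", "problem": "disk full on /var", "fixed_by": "alice"},
    {"host": "db01", "problem": "replication lag", "fixed_by": "bob"},
    {"host": "web01", "problem": "disk full on /var", "fixed_by": "carol"},
    {"host": "web01", "problem": "disk full on /var", "fixed_by": "bob"},
    {"host": "db01", "problem": "replication lag", "fixed_by": "bob"},
    {"host": "app02", "problem": "JVM out of memory", "fixed_by": "alice"},
]


def recurring_problems(incidents, min_count=2):
    """Return problems seen at least min_count times, most frequent first."""
    counts = Counter((i["host"], i["problem"]) for i in incidents)
    fixers = {}
    for i in incidents:
        fixers.setdefault((i["host"], i["problem"]), set()).add(i["fixed_by"])
    return [
        {"host": host, "problem": problem, "count": n,
         "sysadmins": sorted(fixers[(host, problem)])}
        for (host, problem), n in counts.most_common()
        if n >= min_count
    ]


for row in recurring_problems(incidents):
    print(row)
```

In this made-up log the same full disk was fixed three times by three different people, which is exactly the kind of repeat work that is invisible when each sysadmin only sees their own tickets.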

Additionally, in VMware environments, the old traditional metrics of guest CPU, memory, and I/O are not very useful anymore.  They aren’t good indicators of how guests ‘get along’ during regular compute workloads.  There is a whole new series of VMware specific metrics that indicate VM guest contention from a compute, bandwidth, and memory usage point of view.  Systems management tooling needs to understand these new factors to aid in managing a virtual infrastructure, something that traditional ‘Big 4’ tooling just can’t do.  Putting ill-matching VM guests onto the same physical infrastructure is simply asking for incidents and accumulated outages over time.

The right approach should be to stop banging your head against the wall, rather than simply taking two aspirin every day and dealing with the pain.  Instead of waiting for incidents to occur, a more proactive way of avoiding them should be possible.  With VMware’s launch of vSphere in May, a package called Orchestrator is also bundled (this comes from their acquisition of Dunes a few years ago).  This is fantastic news for SMBs (and enterprises too) as it means that any installation of VMware vSphere will have runbook automation capabilities.  VMware’s Orchestrator is a very simple drag and drop interface for creating (potentially complex) workflows to control your virtual infrastructure.  The latest release of up.time integrates tightly with Orchestrator to add application-level monitoring and management capabilities, and can trigger specific workflows when certain applications are about to exceed SLA objectives or will degrade unless corrective action is taken.  Through an Orchestrator plug-in the up.time API is also exposed, allowing bi-directional communication between Orchestrator and up.time (so you can dynamically add systems, or re-group them on the fly).

So rather than wait for an application to fail and trigger an incident, up.time can take corrective action in advance to completely avoid the incident.  This starts to help us snap out of the break-fix routine that we’re all stuck in.  Let’s take an example of incident avoidance and dynamic infrastructure:

Let’s say that you have an e-commerce application where certain response time thresholds can’t be exceeded and the number of concurrent user sessions is also a factor.  Since up.time is already a micro-framework that can monitor your entire infrastructure (applications, databases, platforms, networks, etc.), it can trigger actions based on identified thresholds.  If user sessions start to peak or response time begins to degrade, up.time can trigger an Orchestrator workflow to dynamically provision additional VM guests and bring them online into the e-commerce application.  Also, since up.time understands the application, as workload drops over time (e.g. the user peak has dissipated), workflows can then be triggered to de-provision the extra VM guests to avoid sprawl.
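The provision and de-provision decision in that example can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not how up.time or Orchestrator implement it; the function name `desired_guest_count`, every threshold value and the sample data are assumptions. Note the gap between the scale-out and scale-in thresholds, which keeps the guest count from flapping when load hovers around a single value.

```python
def desired_guest_count(current, response_ms, sessions,
                        slow_ms=2000, busy_sessions=500,
                        fast_ms=800, quiet_sessions=200,
                        minimum=2, maximum=10):
    """Decide how many VM guests the application should have after one sample.

    All thresholds are hypothetical. Scale out when either signal is bad,
    scale in only when both signals are comfortably good.
    """
    if response_ms > slow_ms or sessions > busy_sessions:
        return min(current + 1, maximum)
    if response_ms < fast_ms and sessions < quiet_sessions:
        return max(current - 1, minimum)
    return current


# Made-up samples of (response time in ms, concurrent user sessions).
samples = [(900, 300), (2500, 650), (2200, 700), (1200, 400),
           (600, 150), (500, 100), (450, 90)]

guests = 2
history = []
for response_ms, sessions in samples:
    guests = desired_guest_count(guests, response_ms, sessions)
    history.append(guests)
print(history)
```

In a real setup the two branches would each kick off a workflow (provision or de-provision) rather than just change a number.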

There are many more exciting things in this release, but we’ll cover those in another blog post.  I’m also going to cover the exciting capability of bridging private and public cloud with up.time.  What about dynamically provisioning compute capability in Terremark’s cloud or Amazon’s EC2 from the privacy of your own infrastructure and then having these instances monitored under the global purview of up.time?  We can do this, more info next blog post.

VMware vSphere – Are you ready? We are!

Wednesday, June 3rd, 2009

Unless you live under a rock, you know that VMware recently released vSphere 4, the highly anticipated upgrade to its virtual infrastructure suite.  The number of feature upgrades and enhancements makes the new version somewhat hard to ignore.  But if you’re like me you tend to shy away from .0 releases.  I usually wait for the real world installations to sort out the bugs and let the developer issue a patch or point release. Let someone else be my guinea pig.  The last thing you want is for an upgrade to nuke your production system.

I am, however, happy to report that our experience with vSphere 4 has been relatively smooth so far.  While I’ve not taken the plunge and upgraded our production environment yet, our lab upgrade from 3.5 to the 4.0 beta, and subsequently to the general release, went off without a hitch.  This gives me the confidence to at least begin the planning stages of the production system upgrade.

Step one is to make sure our existing systems are at the latest version of Infrastructure 3.5 and fully patched. We start that in a week or so and I’ll keep you all abreast of the progress.  One thing I don’t have to worry about as we ready our production environment for vSphere is the up.time monitoring station, which is already waiting for us on the other side.  It’s just waiting for me to play catch up!

So, have you upgraded to vSphere yet?  Tell us about your experience with the process and about vSphere in general. Or even better, if you are monitoring your vSphere infrastructure with up.time we’d love to hear about your experience. You can visit the up.time website for more on vSphere Monitoring or VMware monitors.

VMware VMotion & DRS… Problem Solved

Wednesday, May 27th, 2009

I have been working with a large financial institution for the past few months and on Monday, they used one of the many virtualization reports available in up.time 5 to help solve an issue they were having with one of their VMware ESX clusters. I have been using this report quite a bit but wanted to highlight it on the blog. It’s called the VMware Instance Motion Report, and it tracks instances (virtual machines) and when they VMotion (move) around an ESX cluster, either manually or by such methods as DRS (Distributed Resource Scheduler).

The financial customer I was referring to had recently set up DRS on a new cluster. For those of you not familiar with DRS, it dynamically allocates resources to enforce resource management policies while balancing resource usage across multiple ESX hosts. One of the options when setting up DRS is how aggressive you want it to be (this is set across the entire cluster); there are five options, ranging from Level 1 (Conservative) to Level 5 (Aggressive). The person who had set up DRS on this new cluster had it set to Level 5, which was causing constant VMotioning between hosts. We were able to see this immediately in the instance motion report, which tracks individual virtual machines across multiple ESX hosts.

Problem Solved.

VMware Instance Motion Report
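If you wanted to look for the same pattern yourself from raw migration data, the idea is simply to count how often each VM moves within a time window. The Python sketch below assumes a hypothetical list of migration events (timestamp, VM, source host, destination host) and a made-up helper called `churning_vms`; it is not the report’s implementation, and the one hour window and limit of three moves are arbitrary.

```python
from datetime import datetime, timedelta

# Hypothetical migration events: (timestamp, VM name, source host, destination host).
events = [
    (datetime(2009, 5, 25, 9, 0), "vm-db01", "esx1", "esx2"),
    (datetime(2009, 5, 25, 9, 5), "vm-web01", "esx2", "esx3"),
    (datetime(2009, 5, 25, 9, 10), "vm-db01", "esx2", "esx1"),
    (datetime(2009, 5, 25, 9, 20), "vm-db01", "esx1", "esx3"),
    (datetime(2009, 5, 25, 9, 35), "vm-db01", "esx3", "esx2"),
    (datetime(2009, 5, 25, 9, 50), "vm-db01", "esx2", "esx1"),
    (datetime(2009, 5, 25, 14, 0), "vm-web01", "esx3", "esx1"),
]


def churning_vms(events, window, max_moves):
    """Return {vm: peak moves} for VMs that moved more than max_moves times in any window."""
    stamps_by_vm = {}
    for ts, vm, _src, _dst in sorted(events):
        stamps_by_vm.setdefault(vm, []).append(ts)

    flagged = {}
    for vm, stamps in stamps_by_vm.items():
        start = 0
        peak = 0
        for end, ts in enumerate(stamps):
            # Slide the window start forward until it spans at most `window`.
            while ts - stamps[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak > max_moves:
            flagged[vm] = peak
    return flagged


print(churning_vms(events, window=timedelta(hours=1), max_moves=3))
```

A VM that bounces between hosts several times an hour, like `vm-db01` in this made-up data, is a hint that the balancing policy is set too aggressively for the cluster.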


Another one bites the dust, Cittio gone

Tuesday, May 19th, 2009

Our marketspace is quickly shrinking these days, with the recent absorption of Hyperic and now the acquisition of Cittio by Nimsoft.  Read the 451 Group article (if you have access).  This is good news for us, as there are now even fewer players to choose from for a comprehensive IT systems management solution.  What was particularly telling about the acquisition is that Nimsoft is cherry-picking specific technology from the asset sale and is killing Cittio’s Zeppelin, a cloud infrastructure monitoring solution (note, the late Hyperic had one as well).  It seems as though vendors cannot make money on the public cloud and the public cloud is still hype.  In all honesty, this seems more like a PR acquisition by Nimsoft than a technology one; it makes them look like an acquirer (even though Cittio was disintegrating) in these peculiar economic times.

Having said this, I do believe that VMware has an opportunity, with the right kind of management tooling (plugged into their vSphere portfolio), to become a very powerful player in the x86 private cloud space.  They are not there yet, but over the next 9-18 months I suspect all of the true management issues around cloud will be addressed in a manner that will make private cloud computing a reality.

Alex

What to look for in a VM monitoring solution

Friday, May 8th, 2009

I was recently reading through some of the questions over at the “Official VMware Virtualization Group” on LinkedIn, and there was a question about what to look for in a VM monitoring solution, so I thought I would share my response here.

When looking for a VM monitoring solution, you’ll need to look at what you are currently doing from a monitoring standpoint today and decide if you are looking for a VM only point tool, or if you are looking for something broader, that will give you an end to end perspective on your VM environment.  There are great bespoke tools out there for performing very specific tasks, however when you end up with a number of point tools, troubleshooting, reporting and analysis of the environment can become much more difficult.

In an ideal world you want to be able to address the ‘3 Ms’, Monitor, Measure and Manage, with the solution you are putting in place.  Monitor in order to collect the required metrics from the host and guest level as well as any applications or services being supported by the guests.  Measure these metrics against specific goals or thresholds to ensure that everything is operating within the design parameters for the service.  Finally, manage the end to end delivery for the application services being provided to the customer by the virtual infrastructure.  This includes monitoring the end user experience for the apps themselves and the delivery of the services against stated goals within SLAs.

By monitoring, measuring and managing, you will be able to ensure the quality of the delivered service to your customers (internal or external) and have all the information required to effectively maintain the service into the future.  Through proactive alerting, capacity planning and a monitoring solution that provides an end to end single pane of glass across your infrastructure, you’ll be able to reduce your MTTR whenever you experience an outage and run your environment as efficiently as possible.

From a purely product standpoint, you should keep in mind how the monitoring tool or software is licensed.  Some vendors charge per monitored resource, some by CPU socket or core, some by physical host with guest VMs coming at no charge.  Licensing charges can quickly add up in a virtual environment, especially if you are being charged per resource.  With it being so easy to spin up more and more VMs, the associated licensing costs to monitor those VMs can grow in a hurry.  If you check out the VMware monitoring page at uptime software, you’ll be able to see what we bring to the table for end to end VM monitoring.
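To see how quickly the licensing models diverge as VM density grows, here is a back-of-the-envelope comparison in Python. The prices are invented purely for the arithmetic (they are not any vendor’s list prices, ours included), and the sketch assumes that under per-resource licensing each host and each guest counts as one resource.

```python
# Hypothetical prices, made up for illustration only.
PRICE_PER_RESOURCE = 100   # each host and each guest VM counts as one resource
PRICE_PER_SOCKET = 500
PRICE_PER_HOST = 1500      # guest VMs included at no charge


def per_resource_cost(hosts, vms_per_host):
    return hosts * (1 + vms_per_host) * PRICE_PER_RESOURCE


def per_socket_cost(hosts, sockets_per_host):
    return hosts * sockets_per_host * PRICE_PER_SOCKET


def per_host_cost(hosts):
    return hosts * PRICE_PER_HOST


hosts, sockets_per_host = 4, 2
for vms_per_host in (10, 20, 40):
    print(
        f"{vms_per_host:>2} VMs/host:",
        "per resource =", per_resource_cost(hosts, vms_per_host),
        "| per socket =", per_socket_cost(hosts, sockets_per_host),
        "| per host =", per_host_cost(hosts),
    )
```

With these assumed numbers the per-socket and per-host totals stay flat as guests are added, while the per-resource total roughly doubles every time the density doubles.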

VMware Lifecycle Management Question…

Tuesday, May 5th, 2009

I was recently over at the “Official VMware Virtualization Group” on LinkedIn, where a person was asking about VM Lifecycle Management and how they could prevent VM sprawl in their datacenter.

I have been working with VMware virtualization for the past few years and the topic of VM Lifecycle Management has become much more prevalent over the last 6 months. One of the most confusing things is that VMware named their provisioning system “VMware Lifecycle Manager” when it is really all about getting VMs into the infrastructure. It’s not so much about managing those VMs, preventing VM sprawl, performance monitoring of ESX/individual VMs, visibility, control and auditability.

VMware did acquire a company called Dunes Technologies which played in the Lifecycle Management game. VMware is now releasing the Dunes product as VMware vCenter Orchestrator. It really concentrates on drag and drop automation and orchestration. This in combination with vSphere will help solve some of the lifecycle management issues; however, there is still the need for a 3rd party to get total control.

Check out the VMware Monitoring page at uptime software and you will see how up.time 5 can greatly assist in getting control back in your virtual infrastructure: better virtual management and performance, deep workload profiling, VM density optimization reporting, tracking of VMotion, VM sprawl prevention and consolidation planning tools, all in one easy to install application that will help not only with your virtual machines but with your physical boxes as well.

Adapting to the Integrated Technology Stack: Next Generation IT Systems Management

Saturday, May 2nd, 2009

I read “The race for the integrated technology stack” from Enterprise Strategy Group this week. Some completely valid points are made about the transition that IT departments and tool vendors in the ITSM space are going to have to go through to add value to the ‘new’ integrated data center. Virtualization has already challenged many deeply entrenched paradigms, and many IT staff and software vendors have struggled to adapt.

Agility from a training and tooling point of view is going to be essential for companies to see success in their rapidly changing environments and ensure that they are able to maintain their IT SLAs with their users through this transition. As the integrated stack and adaptive infrastructure continue to gain share in large environments, I have to wonder how software vendors who are already unable to adapt to the rate of change in the data center will stay relevant.

I see more agile companies like uptime, who already have mature solutions in the virtual systems management & physical server monitoring space, being able to adapt faster and offer solutions that directly address challenges in the new data center well before the big 4 framework vendors are able to align their solutions with the modern day problem set.

Comment on The Wrong Cloud

Tuesday, April 28th, 2009

Maya Design recently published an article, accompanied by a 4 page whitepaper, on cloud computing, arguing that what is being worked on today is in fact the wrong approach to cloud computing.  I found the article and whitepaper echoing a lot of my own sentiment about the current state of cloud computing.  From my perspective, the internet is the only real example of “true” cloud computing.  Salesforce.com, Google, and others, while referred to as cloud services, are not cloud computing but SaaS, which I see as mutually exclusive.  To me cloud is the ability to run arbitrary workloads on ‘the’ cloud, with absolute interoperability.  The internet, being the communications cloud, is based on a standard communication mechanism (IP), allowing anyone who speaks IP to communicate over it.

This is not the case with cloud computing.  There are several offerings, APIs, VM target types, OSes, lions, tigers and bears, oh my!  As an industry, we’ve essentially rebranded what we’re already doing and called it cloud computing to make it sexy.  Larry Ellison put it perfectly.  Private cloud is not all that different than using ESX the way we have for years now, or by using grid technologies and application design networking to distribute workload across all of our infrastructure.  Don’t get me wrong, I think that there is a great opportunity for the concept of cloud computing, I just think that we’re taking the wrong approach to the fundamentals of cloud computing.