The Big Picture in IT Systems Management

Monday, March 24, 2008

WSJ Article

The virtualization battle is heating up, and up.time is well positioned to deal with all the latest entrants into the virtualization space. Rather than being a pure virtualization play, up.time is taking its core roots in heterogeneous platform management and marrying them with the capability to monitor virtual platforms.

For the WSJ article, click here: http://online.wsj.com/article/SB120398945599592373.html?mod=sphere_ts&mod=sphere_wd



Monday, March 17, 2008

VM Sprawl

Even though our product monitors VMware environments, we ourselves are big users of VMware internally for our QA and Development environments. This is great for us, as we use our tool internally to manage the QA and Dev systems. In fact, it was our early adoption of VMware a few years ago that led to the extension of up.time to monitor and manage VM environments.

Our biggest problems involved:
  • losing VMs in the physical environment (see the photo [next], and this is just a small snapshot of the QA environment)
  • temporary creation of VMs for QA testing and then forgetting to decommission them (thus consuming memory and disk storage [cpu not so much])
  • finding the right configuration of a VM image to create a QA environment (e.g. Oracle 10g on RedHat with WebLogic 9.x, etc.)
  • running out of space on the SAN pools b/c of storage sprawl
  • departments requesting additional physical hardware b/c we couldn't internally map out our resource usage across the VM physical systems.
  • occasional load testing killing a physical system (b/c caps weren't put on)
  • trying to identify dependencies between VMs for particular tests or development runs (e.g. system X has Oracle, system Y has WebSphere, system Z has up.time, etc.).
So, as you can see, there were a number of issues related to virtualization that we needed to address. Since up.time was already a great tool for collecting and analyzing performance data over time, we extended the tool to talk directly to VMware ESX (and Virtual Center) to extract physical and virtual configurations, detailed performance data, and VMotion information. Over time, we were able to graph VM instance performance information not just within a single physical VMware system, but across an entire VMware farm. We could identify which VMs were being migrated throughout the farm and how much compute, memory, and storage was being consumed.
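As a rough illustration of the farm-wide view described above (not up.time's actual implementation), the Python sketch below totals memory and storage per physical host and spots VMs that have moved between hosts. The record layout, host names, and VM names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: (timestamp, host, vm, mem_mb, disk_gb).
# A real collector would pull these from the virtualization platform.
samples = [
    (1, "esx-a", "qa-oracle-01", 2048, 40),
    (1, "esx-a", "qa-weblogic-01", 1024, 20),
    (1, "esx-b", "dev-uptime-01", 512, 10),
    (2, "esx-b", "qa-oracle-01", 2048, 40),  # same VM, now on a different host
    (2, "esx-a", "qa-weblogic-01", 1024, 20),
    (2, "esx-b", "dev-uptime-01", 512, 10),
]


def usage_by_host(records, timestamp):
    """Sum memory and disk consumed by all VMs on each host at one timestamp."""
    totals = defaultdict(lambda: {"mem_mb": 0, "disk_gb": 0})
    for ts, host, _vm, mem_mb, disk_gb in records:
        if ts == timestamp:
            totals[host]["mem_mb"] += mem_mb
            totals[host]["disk_gb"] += disk_gb
    return dict(totals)


def migrated_vms(records):
    """Return {vm: [hosts in order]} for VMs seen on more than one host."""
    hosts_seen = defaultdict(list)
    for _ts, host, vm, _mem, _disk in sorted(records):
        # Record a host only when it differs from the previous one for this VM.
        if not hosts_seen[vm] or hosts_seen[vm][-1] != host:
            hosts_seen[vm].append(host)
    return {vm: hosts for vm, hosts in hosts_seen.items() if len(hosts) > 1}


print(usage_by_host(samples, 2))
print(migrated_vms(samples))
```

Keeping the raw samples keyed by time is what makes both questions answerable from the same data: per-host totals are a slice at one timestamp, while migrations only show up when you compare slices.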

Additionally, because up.time is an active server monitoring tool, we were able to create monitors that trigger alerts when new virtual instances are provisioned. This way, if up.time didn't already know about an instance (it automatically inventories instances across a VMware farm), an alert would be generated.
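The idea behind such a monitor can be sketched in a few lines of Python: compare what is currently discovered on the farm against the known inventory and raise an alert for anything new. This is only a minimal sketch under assumed names (VmInstance, find_unknown_instances, and the sample hosts are all hypothetical), not how up.time itself is written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmInstance:
    """A discovered virtual machine and the physical host it runs on."""
    name: str
    host: str


def find_unknown_instances(discovered, known_names):
    """Return discovered instances whose names are not in the known inventory."""
    return [vm for vm in discovered if vm.name not in known_names]


def build_alerts(unknown):
    """Format one alert message per previously unseen instance."""
    return [
        f"ALERT: new VM '{vm.name}' provisioned on host '{vm.host}'"
        for vm in unknown
    ]


# Hypothetical inventory and discovery results.
known = {"qa-oracle-01", "qa-weblogic-01"}
discovered = [
    VmInstance("qa-oracle-01", "esx-a"),
    VmInstance("qa-weblogic-01", "esx-a"),
    VmInstance("tmp-loadtest-07", "esx-b"),  # created for a test, never registered
]

alerts = build_alerts(find_unknown_instances(discovered, known))
for line in alerts:
    print(line)
```

In practice the forgotten temporary QA instances from the list above are exactly the ones this kind of check would surface.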

There are a number of nice tools that already exist for provisioning management of instances (VMware Dunes, http://www.dunes.ch, and VMware Lab Manager). However, these tools don't actively monitor and profile a live VMware environment. This is where up.time excels.

We are continuing to develop VMware functionality within up.time and bringing the traditional up.time "ease-of-use" mantra to VMware sprawl management.



Systems Management Hairball

I've now been in this space for over 15 years, and while technology and tools have advanced at incredible rates, managing the technology is still a major pain. Here at uptime we've been diligently trying to make performance and availability monitoring as easy as possible, but it's a constant challenge. Sometimes I wish we could be like salesforce.com and become a SaaS (software as a service) provider. Why? With SaaS, there is only one hosted code base and one hosted database (well, it's more complicated, but you get the idea), and this is running on hardware that you can control. In our case, where our software is out in the wild at many customer environments, we are faced with huge versioning problems.

For example, we monitor Solaris, Linux (RedHat/SuSE), Windows, HP-UX, and AIX systems. We have to deal with code bases in our agents for all these platforms. Now add in platform-specific issues (such as architectures, 32/64-bitness, kernel changes, tech releases) and you are asking for an explosion in the number of combinations to support. Now factor in all of the applications and services that we monitor (such as Oracle, SQL Server, Exchange, WebLogic, WebSphere) and their corresponding version upgrades. This is on top of our monitoring station support and the platforms and databases that it uses. Now introduce VMware (and other virtualization technologies) into the equation and you're in for a world of hurt. What does this mean? It means that companies in the systems management space, like uptime, spend considerable resources on simple software hygiene, time that could be better spent innovating. Unfortunately, this problem isn't going away any time soon, so we are constantly looking at other methods to simplify.
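To put a rough number on how quickly this multiplies, here is a back-of-the-envelope Python sketch. The platform and application names come from the paragraph above; the two bitness values, the three versions per application, and the physical/virtual split are assumptions made purely for illustration.

```python
from math import prod

# Each key is one dimension of the support matrix.
dimensions = {
    "platform": ["Solaris", "Linux", "Windows", "HP-UX", "AIX"],
    "bitness": ["32-bit", "64-bit"],
    "application": ["Oracle", "SQL Server", "Exchange", "WebLogic", "WebSphere"],
    "app_version": ["v1", "v2", "v3"],  # assumed: three supported versions each
}


def combination_count(dims):
    """Upper bound on configurations: the product of every dimension's size."""
    return prod(len(values) for values in dims.values())


base = combination_count(dimensions)

# Adding one more dimension (physical vs. virtual) multiplies the whole matrix.
with_virtualization = combination_count(
    {**dimensions, "deployment": ["physical", "virtual"]}
)

print(base, with_virtualization)  # prints "150 300"
```

This is an upper bound, since not every pairing exists in the real world (Exchange does not run on Solaris, for instance), but it shows why each new dimension multiplies the testing burden rather than adding to it.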

So, can systems management software become SaaS-like? In my opinion, no. What IT Manager in his/her right mind is going to allow an external vendor access to their secure internal environment, especially when the monitoring software is going to cut to the core of their production servers? This is also why application vendors such as Oracle are starting to get into the management space: they know that intelligent management of the applications is where the money is.

Next up, virtual server sprawl.