The up.time IT Systems Management Blog

Archive for the ‘Uncategorized’ Category

See you at VMworld 2009

Thursday, August 27th, 2009

We’re looking forward to being at VMworld at the Moscone Center this year.  Come see us in Booth 2138, close to the theatre – we’ll have some of our top-notch system engineers on-site to show you some of our exciting new capabilities.

We have a very cool VMworld promo, as we are giving away more than $80k of up.time 5 IT Systems Management software at the booth.  For more details, check out:  http://www.uptimesoftware.com/VMworld-2009.php

uptime in the latest Gartner Magic Quadrant for IT Event Correlation and Analysis

Thursday, July 30th, 2009

I just wanted to make a quick post highlighting that uptime software is on this year’s Gartner Magic Quadrant for IT Event Correlation and Analysis, which was just recently released.  We’re happy to be on this list as a solid vendor with a great track record.

Also, a quick note that we will be at VMworld 2009 in San Francisco, so hope to see you there.

Alex

Cloud computing and popular culture

Friday, June 26th, 2009

This has been one hell of a week for the entertainment industry.  Ed McMahon, Farrah Fawcett and Michael Jackson have all passed away.  Whenever significant cultural events like this occur there is an explosion in communication amongst people, wanting to know what happened and further discuss it amongst their peers.  In the past this would have been isolated to talking with your neighbours, family and friends either in person or over a traditional POTS line.  Fast forward to the 21st century and we now have real time bidirectional communication between virtually anyone anywhere in the world. 

When you have an unpredictable event like the death of a societal icon, or the launch of a new service that has the potential for extremely rapid adoption or at the very least high traffic due to curiosity alone, it is very difficult, if not practically impossible, to anticipate the real world resources needed to support the inbound demand.  This is very clearly shown by the chart from Keynote Systems illustrating the availability and performance impact of this event on news websites.

[Chart: Keynote Systems news site index, showing the availability and performance impact on news websites]

Image from: http://www.datacenterknowledge.com/archives/2009/06/25/michael-jackson-news-slows-web-sites/

TMZ.com was the first news outlet to break the story of Michael Jackson’s death, and consequently their site collapsed outright from the unexpected workload.  It’s hard to fault the IT team responsible for the service delivery; after all, no one knew MJ was going to pass away yesterday, and arguably no one in entertainment today would have generated the same level of public interest.  So where am I going with all of this?  To the clouds!  If there was ever a real world example of where a cloud solution would have played nicely into the delivery of a service that can be hit by transient high-intensity workloads without warning, this is it.  Even a properly architected high volume application or service that is designed to handle large increases in transient load has a finite capacity.

If TMZ.com had the ability to automatically spin up cloud resources and shunt the new traffic load over to them during the media frenzy, ideally they would have been able to stay up during the peak of the traffic and provide service quality and performance as good as their normal service levels.  (For the shunting, I’m a big fan of f5 gear for application delivery networking.)  Now, they could have done this manually, I suppose: when they saw the traffic coming they could have provisioned some AWS instances, got their site/content up and running and started routing traffic over through a change to their load balancers.  That’ll work, but it’s also manual and takes time to implement, and by the time they’re done their end users have already hit a dead site and gone to one of their competitors.  So what to do?  Automate!

With the 5.2 release of up.time that was launched on Wednesday (June 24th, 2009), up.time now has a full bi-directional integration with VMware Orchestrator.  If you are a VMware shop, you get Orchestrator for free with vCenter Server.  If you are not familiar with Orchestrator, you can check it out here.  Essentially, Orchestrator is a policy based workflow automation tool that you can use to build automated scenarios to perform, well, pretty much anything.  Orchestrator has the concept of plugins that give it the know-how to interact directly with specific vendor technologies.  For example, the up.time plugin for Orchestrator lets you do things like add elements, create/modify/delete groups, service groups and other tasks from within Orchestrator.  (Under the hood, this is enabled by a new set of web services in up.time 5.2.)  So how does this play into the TMZ.com cloud scenario?  Well, it goes something like this.

  1. up.time is monitoring the end user experience for the website as seen by the logical service address using the HTTP service or WATM monitor.  (www.mynewssite.com)
    1. You can monitor the logical service for overall end user experience.
    2. You can monitor the individual web servers to identify if any given server is being overloaded to determine if that is expected behaviour or an issue like a load balancer algorithm misconfiguration.
    3. You can configure whatever service monitors you need (database, business logic, logs, etc) to determine the ongoing health of the service you are delivering and use that to trigger the automated resolution.
  2. When your end users begin to suffer or servers start to indicate they are becoming overloaded, have up.time trigger an Orchestrator workflow to automatically avoid any end user incident that may occur due to insufficient resources.  That would look something like this:
    1. Using an action profile within up.time, trigger the Orchestrator workflow you have defined for automatically shunting workload to the cloud or scaling out internally onto idle resources.  How you resolve it from a capacity perspective is really up to you.  You could have different capacity scale out workflows depending on where the performance bottleneck is.  If your webservers are overloaded, shunt to the cloud; if your database is overloaded, add a new node to your cluster.  In this scenario let’s scale out our web tier.
    2. up.time tells Orchestrator to trigger the ‘mywebsite cloud scaleout’ workflow; Orchestrator then manages the following:
      1. Provision and configure an AWS server (or many if you need to) with the appropriate OS and web content.
      2. Add the new AWS instances into up.time (via the up.time Orchestrator plugin, it’s downloadable from our site)
        1. Add the instances to the appropriate up.time groups
        2. Add the instances to the appropriate up.time service groups so the new services are monitored and managed
      3. Update the load balancer virtual IP pools to include the new AWS instances and begin sending traffic
    3. We’re now sending traffic to our AWS cloud without anyone ever having had to do anything other than the initial Orchestrator configuration.
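To make the ordering of those steps concrete, here is a minimal Python sketch of the scale-out flow.  It is only an illustration: it does not call up.time, Orchestrator or AWS, and every name in it (Inventory, provision_instances, cloud_scaleout_workflow, on_alert, the 2000 ms threshold) is a hypothetical stand-in rather than part of any real API.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    """In-memory stand-in for the monitoring groups and load balancer pool."""
    monitored_group: list = field(default_factory=list)
    service_group: list = field(default_factory=list)
    lb_pool: list = field(default_factory=list)


def provision_instances(count, start_index):
    """Hypothetical placeholder for provisioning cloud servers; returns names only."""
    return [f"aws-web-{i}" for i in range(start_index, start_index + count)]


def cloud_scaleout_workflow(inventory, count):
    """Provision, register for monitoring, then route traffic, in that order."""
    existing = sum(1 for name in inventory.lb_pool if name.startswith("aws-web-"))
    new_instances = provision_instances(count, start_index=existing + 1)
    for name in new_instances:
        inventory.monitored_group.append(name)  # add to the monitoring group
        inventory.service_group.append(name)    # add to the service group
    # Instances join the load balancer pool only once they are monitored.
    inventory.lb_pool.extend(new_instances)
    return new_instances


def on_alert(inventory, response_time_ms, threshold_ms=2000, count=2):
    """Stand-in for an action profile: run the workflow when the threshold is crossed."""
    if response_time_ms > threshold_ms:
        return cloud_scaleout_workflow(inventory, count)
    return []
```

Calling on_alert with a response time below the threshold leaves the pool untouched; a value above it adds the new instances to the monitoring groups first and to the load balancer pool last, so nothing takes traffic unmonitored.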

I realize that technically the Orchestrator piece is not a 3 click and you’re in Nirvana exercise; however, once it is implemented you’ll be able to have your web properties auto scale based on inbound workload before there is ever a problem.  Take it a step further and you can have up.time via Orchestrator deprovision the AWS resources when your site workload drops back to normal levels so you can close off the loop on provision-deprovision and only pay for the AWS resources you use when you need them.  Pretty cool, eh?  I think so.  So with a little up front configuration in Orchestrator and up.time you can implement Automated Incident Avoidance and keep your services running when they are faced with the potential of unforeseen transient workloads.  With up.time and Orchestrator, this is only one example out of literally hundreds (dare I say thousands) of ways you can automate your infrastructure management to ensure you are operating at the highest possible levels of efficiency both from a technology and a resource standpoint.

Virtual Appliances

Wednesday, June 24th, 2009

I love these things!

VMware’s Virtual Appliance Marketplace (VAM) is like a candy store for us geeks and nerds.  While not quite as robust as, say, the iPhone App store, there are hundreds of ready-made appliances for hundreds of applications. Pick your solution, download it and run it on your favorite VMware virtualization platform.  Don’t like it?  Simply delete it, nothing to ‘uninstall’.

For those of you who don’t know, a Virtual Appliance is the modern day equivalent of a turn-key application.  The OS, application and any supporting tools are pre-installed and ready to power up.  They save you gobs of time, especially when evaluating a solution. The best part?  Batteries are included, and some assembly is NOT required!  In most cases you don’t need to provision any new virtual hardware, or ask your storage manager for more space on the SAN.  Don’t have a virtualization platform yet? You can download VMware Player, for free, and run the appliance on your desktop.

I know that virtual appliances aren’t that new; they’ve been around for a while now.  I know, “way to be late to the game, Mitchell!”  But it’s only recently that VMware has been pushing awareness through their VAM portal, and I’m particularly excited today.  Why?

up.time 5.2 has been appliancized!

The up.time Virtual Appliance is finally here and is a dream come true. Now instead of downloading up.time, making sure you meet all the system requirements, possibly installing a new OS and spending time simply readying yourself for our lightning-fast install, we’ve done it all for you!  Download the appliance and run it. It’s that simple.  Seriously, I fired up the appliance and was ready to play with up.time in about 3 minutes (excluding download time)!

This is a game changer.  No longer are you tied to a platform, or hardware. You can truly be up and monitoring in minutes. Don’t like it? Go ahead and delete it.  Love it? Move it to a production virtualization platform, like ESX, and run with it.

We’re confident that you’ll love it.

Download the up.time virtual appliance today and try it free for 30 days.  We’d love to hear what you think (comments please!).

Stop the insanity

Monday, June 22nd, 2009

One of the definitions of insanity is doing the same thing over and over and expecting different results.  In this case, why does this affliction continue to haunt us in IT?  Given the significant advances in technology, specifically in virtualization, all of which are supposed to make our lives easier and more efficient; why haven’t we whipped the beast of IT complexity?  The majority of IT environments are still stuck in a world of break-fix, albeit, perhaps we’re just getting into a faster break-fix mode.

In one recent Gartner article, “Server Virtualization for x86: A Benefits Impact Assessment,” there is a rather telling statement:

“From Gartner surveys and client interactions, we know that, operationally, virtualization appears to be a “wash,” at best — and it actually creates additional costs (people, process development and tools) on a worst-case basis. “

So what are we doing wrong?  One reason is that in daily operations, there isn’t an easy way to prioritize incoming incidents or determine recurring problems.  I would categorize the recurring outages as “death by a thousand cuts.”  This is further exacerbated on teams with a number of sysadmins, where the same problem can be perceived as distinct problems to each sysadmin.  Resource inefficiency is created by having multiple sysadmins solve the same problem over-and-over.

Additionally, in VMware environments, the traditional metrics of guest CPU, memory, and I/O are not very useful anymore.  They aren’t good indicators of how guests ‘get along’ during regular compute workloads.  There is a whole new series of VMware specific metrics that indicate VM guest contention from a compute, bandwidth, and memory usage point of view.  Systems management tooling needs to understand these new factors to aid in managing a virtual infrastructure, something that traditional ‘Big 4’ tooling just can’t do.  Putting ill-matching VM guests onto the same physical infrastructure is simply asking for incidents and accumulated outages over time.

The right approach should be to stop banging your head against the wall, rather than simply taking two aspirin every day and dealing with the pain.  Instead of waiting for incidents to occur, a more proactive manner of avoiding them should be possible.   With VMware’s launch of vSphere in May a package called Orchestrator is also bundled (this is from their Dunes acquisition of a few years ago).  This is fantastic news for SMBs (and enterprises too) as it means that any installation of VMware vSphere will have runbook automation capabilities.  VMware’s Orchestrator is a very simple drag and drop interface to create (potentially complex) workflows to control your virtual infrastructure.  The latest release of up.time integrates tightly with Orchestrator to add application-level monitoring  and management capabilities and can trigger specific workflows when certain applications are about to exceed SLA objectives or will degrade unless corrective action is taken.   Through an Orchestrator plug-in the up.time API is also exposed, allowing bi-directional communication between Orchestrator and up.time (so you can dynamically add systems, or re-group them on the fly).

So rather than wait for an application to fail and trigger an incident, up.time can take corrective action in advance to completely avoid the incident.  This starts to help us snap out of the break-fix routine that we’re all stuck in.  Let’s take an example of incident avoidance and dynamic infrastructure:

Let’s say that you have an e-commerce application with response time thresholds that can’t be exceeded, and where the number of concurrent user sessions is also a factor.  Since up.time is already a micro-framework that can monitor your entire infrastructure (applications, databases, platforms, networks, etc.), it can trigger actions based on identified thresholds.  If user sessions start to peak or response time begins to degrade, up.time can trigger an Orchestrator workflow to dynamically provision additional VM guests and bring them online into the e-commerce application.  Also, since up.time understands the application, as workload drops over time (e.g. the user peak has dissipated), workflows can then be triggered to de-provision the extra VM guests to avoid sprawl.
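The provision and de-provision decision can be sketched as a small threshold check.  This is a rough Python illustration only, not up.time’s actual logic: the function names and all threshold values are assumptions, and the scale-in thresholds are deliberately set lower than the scale-out ones so the system doesn’t flap back and forth around a single limit.

```python
def scaling_decision(sessions, response_ms, extra_guests,
                     max_sessions=500, max_response_ms=1500,
                     low_sessions=200, low_response_ms=600):
    """Return 'provision', 'deprovision' or 'hold' for one monitoring sample.

    All thresholds are hypothetical example values.
    """
    # Scale out as soon as either metric crosses its upper threshold.
    if sessions > max_sessions or response_ms > max_response_ms:
        return "provision"
    # Scale in only if extra guests exist and both metrics are well below the limits.
    if extra_guests > 0 and sessions < low_sessions and response_ms < low_response_ms:
        return "deprovision"
    return "hold"


def replay(samples):
    """Apply the decision to a series of (sessions, response_ms) samples."""
    extra_guests = 0
    history = []
    for sessions, response_ms in samples:
        action = scaling_decision(sessions, response_ms, extra_guests)
        if action == "provision":
            extra_guests += 1
        elif action == "deprovision":
            extra_guests -= 1
        history.append((action, extra_guests))
    return history
```

Feeding replay a peak followed by a quiet period shows guests being added during the peak, held while load is in the middle band, and removed one at a time once both metrics settle.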

There are many more exciting things in this release, but we’ll cover those in another blog post.  I’m also going to cover the exciting capability of bridging private and public cloud with up.time.  What about dynamically provisioning compute capability in Terremark’s cloud or Amazon’s EC2 from the privacy of your own infrastructure and then having these instances monitored under the global purview of up.time?  We can do this, more info next blog post.

Do more with less. Virtualize and save – but plan carefully!

Wednesday, June 10th, 2009

Here’s some more work for you. Here’s some more responsibility. Here’s a shorter deadline. Now do it all with less money, less time, less resources, less, less, less!

It seems as though the more efficient we become, the more constrained we are. The current economic climate doesn’t help either.   Yes, this is the new norm.  So what can you do?

If you’re reading this you’ve probably invested time and money into a virtual infrastructure, or are considering it.  Great!  Virtualized computing environments squeeze every last drop of performance from hardware and, when properly budgeted, can save you thousands in the long run. But don’t expect a free lunch.

Physical to Virtual Consolidation

Consolidation of physical servers to virtual hosts allows you to break the 1 application to 1 server mold.  However the increased density in your server room might create hot spots, especially if you’ve decided on using a blade chassis.

That increased density means you’ll also be pushing your hardware harder. This will likely increase your power requirements, slightly.  Newer hardware is indeed more efficient, and technologies like VMware’s DRS Distributed Power Management allow you to move workloads around to less stressed hosts and power off unused resources. The net effect is a possible overall reduction in power usage, but peak times could actually require more.

An Up Front Expense?

Virtualization is a net new expense. Unless you are starting from scratch, you will need to invest in hardware, and software licensing.  I was recently asked to vet the cost of a 24 host, enterprise level virtual environment.  Assuming a requirement of 10 Terabytes of storage, and going with mid tier hardware I came up with an up-front ballpark cost of USD $225,000.  No small change.  Amortize your projected savings carefully. Is it worth the up-front investment? Luckily you can grow your virtual environment easily as required with little to no negative impact on the existing services.
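One simple way to frame the “is it worth it?” question is a payback period.  The Python sketch below is only illustrative: the USD $225,000 figure is the ballpark quoted above, while the monthly savings values are purely hypothetical inputs that you would replace with your own projections.

```python
import math


def payback_months(upfront_cost, monthly_savings):
    """Whole months needed for cumulative savings to cover the up-front cost.

    Returns None when there are no savings, since the cost is never recovered.
    """
    if monthly_savings <= 0:
        return None
    return math.ceil(upfront_cost / monthly_savings)


UPFRONT_COST = 225_000  # ballpark from the estimate above, in USD

# Hypothetical monthly savings scenarios (power, hardware refresh, admin time).
for savings in (5_000, 7_500, 10_000):
    print(f"${savings:,}/month -> payback in {payback_months(UPFRONT_COST, savings)} months")
```

If the payback period comes out longer than the useful life of the hardware, the projected savings don’t justify the investment on their own.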

Implement Standards

Virtualization has made provisioning services a snap. You’ve heard all the marketing buzz — reduced time to market, provision servers in seconds!, etc.  Suddenly that 10T of storage is GONE.  But how?

Sprawl. (Yes, up.time can help you with this!)

Back in the days before virtualization, if you needed more resources you had to justify the expense nine ways from Sunday.  When it finally arrived you’d spend a week staging it.  Then testing and finally implementing it, only to have it completely consumed a few months later!  When you planned your virtual infrastructure you WAY over provisioned it, didn’t you?  You thought ahead 3 years like you did when you bought a single server for that one application.  However now you’re planning for possibly hundreds of workloads.  Need another machine?  No problem, just clone it and wait a few minutes.  Ever have cash burn a hole in your pocket?  Budgets prevent us from blowing that spare cash.  It’s exactly the same in a virtual environment, except the spare cash is extra CPU cycles, storage and memory.  Whether you simply devise a set of rules for managing the virtual machine life cycle or implement tools to manage it, the only way you will realize long-term savings is to ensure you’re only using what you need.  Don’t run your VM environment like that TV salesperson’s famous oven: “Set it, and forget it!”

If you keep these things in mind when building and managing your VMware vSphere environment, or any other virtual infrastructure, you will absolutely be able to do more, with less. Of course <shameless plug>, up.time can solve your VMware monitoring needs with its deep VMware monitoring and reports.

EMC, VMware and Cisco

Friday, June 5th, 2009

Over at the EMC website, there is an article on a new partnership announced by the CEOs of VMware and EMC with regards to cloud computing and overall interoperability.  While I see this as a great thing for cloud computing and virtualization in general, what we really need to see are industry standards emerging with regards to all of the moving pieces that make up the cloud.  Before Ethernet emerged victorious as the go-to interconnect, there were several other technologies vying to be the connectivity standard.  When I worked for one of Canada’s major FIs we were a Token Ring shop.  Anyone deploying Token Ring today?  What is great about Ethernet’s victory is that you can with relative ease walk up to any Ethernet switch and swap Vendor A’s switch for Vendor B’s switch and things will work the same as before.  This allows you to take advantage of the new features Vendor B’s switch offers or the lower TCO, or whatever the reason you may have for switching from A to B.

It’s like dialtone: you don’t care what the telco does with their network as long as when you plug your phone into the RJ11 jack you get a dialtone.  I don’t want to have to care what is going on behind the scenes with my cloud provider as long as I have a standardized workload ‘shipping container’ that can be plugged into their cloud and it will process my workload against the SLA that I’m paying for.  While strides are being made to get us to this workload panacea, I think that the true sea change in computing will occur when we have standards that are formed, ratified and adopted by the virtualization industry as a whole.  Network vendors don’t each have ‘their version’ of the Ethernet specification.  With this standardized approach vendors will be able to focus on the value that they build on top of the specification, value that doesn’t break compatibility but enhances the overall delivery of their virtualization/cloud offering.

Killing the troll

Monday, June 1st, 2009

A colleague recently forwarded me a link to an article called Killing The Troll, about what causes people to resist downloading or purchasing online.  Essentially, we’ve been conditioned to distrust online experiences and/or vendors, and anything else that raises the fear of wasting money, being mocked, or feeling stupid.  Certainly in IT, we’ve all had this experience: “oh, it doesn’t do that, but it says it does on this brochure?”; “wait, that one-click integration is two weeks of professional services?”; “no, I want to see it on the screen, not in the PowerPoint”.

“So let’s declare war on the trolls. Be extraordinarily trustworthy. Show your value. Put your customers first. Keep your promises.”

One of the things that we continually struggle with here at uptime is that we don’t want to be compared to larger systems management vendors and their selling practices and experiences.  We continually get beaten up by strategists and analysts for saying that we’re easy to use and quick to deliver value.  There’s always a “but what’s your secret sauce?” or “what’s your mindblowing differentiator?”  This is systems management: it’s ugly, it’s messy, and it’s complicated.  Anything that can help distill the complexity and prove value is a differentiator.  We’ve had a number of customers tell us “you do what you say, and you do it well.”

We’ve really tried to stick to this mantra as we’ve grown.  Of course, we can’t please everybody (no, we’re not supporting Yellow Dog Linux, and no, our monitoring station isn’t going to run on Windows x64 on Itanium [that’s a market of one we don’t need]).  However, everybody in this organization is thinking about how to make life easier for IT Managers and sysadmins and how to prove our value quickly.

All this so that you don’t have to worry.

Alex

Something fun from NetApp

Thursday, May 28th, 2009

NetApp has just launched a redux of a flash game that was circulating the internet many months back and it’s rather fun.  http://www.netappdatadefense.com/  It’s neat to see viral marketing strategies like this coming from technology companies like NetApp.  I’ve always been a fan of NetApp storage products; they’ve done what I expected of them, and they play very well in today’s virtualized environments.  By leveraging the NetApp plugin and our ESX server monitoring, up.time provides a very effective platform for managing both your virtual server and storage resources.  There are some great tie-ins through the NetApp SDK that can be done with things like lifecycle and stage manager for automatically provisioning storage with your VM provisioning workflows.

Virtualization 2.0 ready monitoring

Thursday, May 21st, 2009

Back in February there was an article in Virtual Strategy Magazine on Virtualization 2.0 ready monitoring solutions.  While I agree with all of the points that are brought to light as things that a monitoring/management solution should be able to do, there are a few more things I would add to the discussion.

- Deep metrics into the physical and virtual infrastructure.  In order to effectively troubleshoot a problem beyond simply what is broken and down to the causal analysis across physical and virtual resources, having a rich set of granular metrics is very important.  In this day and age, you’d be hard pressed to find someone who said they wanted less data when solving a problem, in IT or otherwise.

- Distributed scalability.    While the article discusses scalability, this needs to be a distributed, loosely coupled type of scaling architecture.  If your scaling paradigm is isolated to within the datacenter walls and you have multiple sites, you are not going to be able to get a single consolidated view of your infrastructure and the services that it is supporting across those sites.  IT services delivered today are 24×7 follow the sun services with delivery of the service moving datacenters with operational hours, or being load balanced across sites based on where the end user is physically located.  Without a single view of the world that you can drill down into, it becomes very difficult to manage the delivery of those large scale distributed services.

- Service level insight.  As the article mentions, it is becoming more about the business services being delivered by the technology than the technology itself.  In order to effectively manage the services being delivered you need to be able to monitor the services themselves, create meaningful constructs out of them and measure the delivery of those services against some specified goal through an SLA.  Without this, how do you know whether or not you are actually performing up to the expectations of your customers for the services you are delivering to them?
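As a small illustration of measuring service delivery against an SLA goal, here is a Python sketch that computes availability from a series of pass/fail service checks.  It assumes a deliberately simple model (availability as the share of successful checks) and a hypothetical 99.9% target; neither the function names nor the target come from the article or from up.time.

```python
def availability_percent(check_results):
    """Percentage of successful checks; check_results is a list of booleans."""
    if not check_results:
        raise ValueError("at least one check result is required")
    return 100.0 * sum(check_results) / len(check_results)


def meets_sla(check_results, target_percent=99.9):
    """Compare measured availability with a hypothetical SLA target."""
    return availability_percent(check_results) >= target_percent


# Example: 1000 checks over the reporting period, 2 of which failed.
results = [True] * 998 + [False] * 2
print(round(availability_percent(results), 2), meets_sla(results))
```

The same comparison works for any service-level metric, such as the share of requests answered within a response time goal, as long as the goal is stated up front.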