The up.time IT Systems Management Blog

Archive for the ‘Cloud Virtualization’ Category

Large Scale Cloud Computing Adoption

Monday, October 19th, 2009

There is a very well written article over at ulitzer.com regarding the US Federal Government’s IT spend plan for FY11 and its investigation into leveraging cloud computing as a cost-cutting measure for federal IT spend.  It breaks the analysis down into 3 options:  public, hybrid and private cloud.  In the analysis, the public cloud comes out at a BCR (benefit/cost ratio) of 15.4, with the hybrid and private clouds coming out at 6.8 and 5.7 respectively.  I found these results rather surprising considering the scope of what the analysis entails.

We aren’t talking about migrating a few workloads to the cloud, but thousands and thousands of servers’ worth of federal workloads.  In defining the public cloud versus the hybrid/private solutions and their assumptions, the authors state that the public cloud option is a migration of ‘low-sensitivity’ data onto existing public clouds.  Given ever-increasing compliance requirements and the demand for data privacy and integrity, I would think that low-sensitivity workloads would not comprise the lion’s share of the workloads being examined, thereby tilting the tables toward the hybrid and/or private cloud offerings.

When migrating to the cloud, today’s organizations have many terabytes of data (or petabytes, in the case of the US Federal Government’s thousands of workloads) that must be moved in order to move the complete workload.  Moving and synchronizing petabytes of storage while maintaining service continuity through the migration is a non-trivial task.
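To put a rough number on that, here is a minimal back-of-the-envelope sketch in Python.  The link speeds and the assumption of a fully sustained transfer rate are mine for illustration, not figures from the article; real migrations would see lower effective throughput.

```python
PETABYTE = 10**15           # decimal petabyte, in bytes (assumed unit)
GIGABIT_PER_SEC = 10**9     # hypothetical dedicated link speed, in bits/second


def transfer_days(data_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to push data_bytes over a link at a sustained rate.

    efficiency is the fraction of the nominal link speed actually achieved.
    """
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for label, rate in [("1 Gbps", GIGABIT_PER_SEC),
                        ("10 Gbps", 10 * GIGABIT_PER_SEC)]:
        days = transfer_days(PETABYTE, rate)
        print(f"1 PB over {label}: {days:.1f} days")
```

Even at a perfectly sustained 1 Gbps, a single petabyte takes about three months to copy, and the source data keeps changing the whole time, which is where the synchronization problem comes from.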

While the analysis within the article is sound, I think that there are still significant hurdles to large scale public cloud adoption that are not taken into consideration to the extent that they deserve.  Everyone wants the public cloud computing model to be successful; after all, the benefits stand to be great.  I think that the public cloud, from a security and connectivity standpoint, is not quite there yet for large scale initiatives.  I think that the real successes will come from the creation and adoption of private clouds, with a slow, learned migration of workloads to the public cloud as we iron out all of the security, networking and compliance requirements.

Maybe it would make sense to have the public cloud providers offer their own hybrid approach where you deploy your own private cloud and they manage it for you.  You get to leverage the benefits of their processes and technologies developed for managing the public cloud, with the benefits that come with a private cloud.

SpringSource and VMware

Wednesday, August 12th, 2009

As most of you know already, VMware has acquired SpringSource for a quite remarkable $420MM. Along with this purchase comes Hyperic, a struggling open source systems management vendor that was recently force-merged with SpringSource by communal VCs.
Ultimately, this acquisition sets the stage for development and deployment on cloud computing platforms (PaaS); however, our interest lies in the monitoring, measurement, and management of applications running in the cloud. This is an area in which Hyperic conceivably will be used, but they will need lots of development effort to enhance their cloud offering (Amazon EC2 API calls to instantiate AMIs aren’t really what I would call ‘cloud leadership’, or ‘cool’).
I am also curious as to how enterprise customers are going to deal with having open source software managing their environments (there are still a huge number of holdouts in this area, which is why Hyperic was struggling).
In the next 12-18 months, this acquisition doesn’t help VMware compete against Microsoft’s SCOM in heterogeneous environments (physical, virtual, and multiplatform), which, in my opinion, poses a greater risk to enterprise adoption.

Alex

Cloud computing and popular culture

Friday, June 26th, 2009

This has been one hell of a week for the entertainment industry.  Ed McMahon, Farrah Fawcett and Michael Jackson have all passed away.  Whenever significant cultural events like this occur there is an explosion in communication amongst people, wanting to know what happened and further discuss it amongst their peers.  In the past this would have been isolated to talking with your neighbours, family and friends either in person or over a traditional POTS line.  Fast forward to the 21st century and we now have real time bidirectional communication between virtually anyone anywhere in the world. 

When you have an unpredictable event like the death of a societal icon, or the launch of a new service that has the potential for extremely rapid adoption or at the very least high traffic due to curiosity alone, it is very difficult, if not practically impossible, to anticipate the real world resources needed to support the inbound demand.  This is very clearly shown by the chart from Keynote Systems illustrating the availability and performance impact of this event on news websites.

[Chart: Keynote Systems news site index (news-site-index-470)]

Image from: http://www.datacenterknowledge.com/archives/2009/06/25/michael-jackson-news-slows-web-sites/

TMZ.com was the first news outlet to break the story of Michael Jackson’s death, and consequently their site collapsed outright from the unexpected workload.  It’s hard to fault the IT team responsible for the service’s delivery; after all, no one knew MJ was going to pass away yesterday, and arguably there is no one in entertainment today who would have generated the level of public interest that he did.  So where am I going with all of this?  To the clouds!  If there was ever a real world example of where a cloud solution would have played nicely into the delivery of a service that can be hit without warning by transient high-intensity workloads, this is it.  Even a properly architected high volume application or service that is designed to handle large increases in transient load has a finite capacity.  If TMZ.com had had the ability to automatically spin up cloud resources and shunt the new traffic load over to them during the media frenzy, ideally they would have been able to stay up during the peak of the traffic and provide service quality and performance as good as their normal service levels.  (For the shunting, I’m a big fan of f5 gear for ADN networking.)

Now, they could have done this manually, I suppose: when they saw the traffic coming they could have provisioned some AWS instances, got their site/content up and running and started routing traffic over through a change to their load balancers.  That’ll work, but it’s also manual and it’s going to take time to get it all implemented, and by the time they’re done their end users have already hit a dead site and gone to one of their competitors.  So what to do?  Automate!

With the 5.2 release of up.time that was launched on Wednesday (June 24th, 2009), up.time now has a full bi-directional integration with VMware Orchestrator.  If you are a VMware shop, you get Orchestrator for free with vCenter Server.  If you are not familiar with Orchestrator, you can check it out here.  Essentially, Orchestrator is a policy based workflow automation tool that you can use to build automated scenarios to perform, well, pretty much anything.  Orchestrator has the concept of plugins that give it the know-how to interact directly with specific vendor technologies.  For example, the up.time plugin for Orchestrator lets you do things like add elements, create/modify/delete groups and service groups, and perform other tasks from within Orchestrator.  (Under the hood, this is enabled by a new set of web services in up.time 5.2.)  So how does this play into the TMZ.com cloud scenario?  Well, it goes something like this.

  1. up.time is monitoring the end user experience for the website as seen by the logical service address using the HTTP service or WATM monitor.  (www.mynewssite.com)
    1. You can monitor the logical service for overall end user experience.
    2. You can monitor the individual web servers to identify if any given server is being overloaded to determine if that is expected behaviour or an issue like a load balancer algorithm misconfiguration.
    3. You can configure whatever service monitors you need (database, business logic, logs, etc) to determine the ongoing health of the service you are delivering and use that to trigger the automated resolution.
  2. When your end users begin to suffer or servers start to indicate they are becoming overloaded, have up.time trigger an Orchestrator workflow to automatically avoid any end user incident that may occur due to insufficient resources.  That would look something like this:
    1. Using an action profile within up.time, trigger the Orchestrator workflow you have defined for automatically shunting workload to the cloud or scaling out internally onto idle resources.  How you resolve it from a capacity perspective is really up to you.  You could have different capacity scale-out workflows depending on where the performance bottleneck is: if your web servers are overloaded, shunt to the cloud; if your database is overloaded, add a new node to your cluster.  In this scenario let’s scale out our web tier.
    2. up.time tells Orchestrator to trigger the ‘mywebsite cloud scaleout’ workflow.  Orchestrator then manages the following:
      1. Provision and configure an AWS server (or many if you need to) with the appropriate OS and web content.
      2. Add the new AWS instances into up.time (via the up.time Orchestrator plugin, which is downloadable from our site)
        1. Add the instances to the appropriate up.time groups
        2. Add the instances to the appropriate up.time service groups so the new services are monitored and managed
      3. Update the load balancer virtual IP pools to include the new AWS instances and begin sending traffic
    3. We’re now sending traffic to our AWS cloud without anyone ever having had to do anything other than the initial Orchestrator configuration.
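The sequence above can be sketched in a few lines of Python.  This is a toy model only: it does not call up.time, Orchestrator, AWS or any load balancer API, and every name in it (`Environment`, `provision_instance`, `scale_out_web_tier`, `on_alert`, the 2000 ms threshold) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """In-memory stand-in for the monitoring groups and load balancer pool."""
    monitored: list = field(default_factory=list)
    lb_pool: list = field(default_factory=list)
    next_id: int = 1


def provision_instance(env: Environment) -> str:
    """Placeholder for provisioning and configuring one cloud web server."""
    name = f"cloud-web-{env.next_id}"
    env.next_id += 1
    return name


def scale_out_web_tier(env: Environment, count: int) -> list:
    """Run the workflow steps in the order listed above for each new server."""
    added = []
    for _ in range(count):
        name = provision_instance(env)   # provision and configure the server
        env.monitored.append(name)       # add it to monitoring groups
        env.lb_pool.append(name)         # add it to the load balancer pool
        added.append(name)
    return added


def on_alert(env: Environment, response_ms: float,
             threshold_ms: float = 2000, batch: int = 2) -> list:
    """Action-profile stand-in: scale out only when response time degrades."""
    if response_ms > threshold_ms:
        return scale_out_web_tier(env, batch)
    return []


if __name__ == "__main__":
    env = Environment(lb_pool=["web-1"])
    print(on_alert(env, response_ms=500))    # healthy, nothing happens
    print(on_alert(env, response_ms=3500))   # degraded, two servers added
    print(env.lb_pool)
```

Note that each new server is registered for monitoring before it joins the load balancer pool, so it is never taking traffic unobserved.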

I realize that technically the Orchestrator piece is not a three-clicks-and-you’re-in-Nirvana exercise; however, once it is implemented you’ll be able to have your web properties auto scale based on inbound workload before there is ever a problem.  Take it a step further and you can have up.time, via Orchestrator, deprovision the AWS resources when your site workload drops back to normal levels, so you close the provision-deprovision loop and only pay for the AWS resources when you need them.  Pretty cool eh?  I think so.  So with a little up front configuration in Orchestrator and up.time you can implement Automated Incident Avoidance and keep your services running when they are faced with unforeseen transient workloads.  With up.time and Orchestrator, this is only one example out of literally hundreds (dare I say thousands) of ways you can automate your infrastructure management to ensure you are operating at the highest possible levels of efficiency from both a technology and a resource standpoint.
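One way to picture the provision-deprovision loop is a pair of high and low water marks, so that instances are added under load and released once things calm down.  The Python below is a minimal sketch under my own assumptions: the 80%/40% thresholds, the one-instance step and the cap of 10 are illustrative values, not up.time or Orchestrator settings.

```python
def desired_cloud_instances(current: int, load_per_server: float,
                            high: float = 0.80, low: float = 0.40,
                            max_cloud: int = 10) -> int:
    """Return the cloud instance count to use for the next interval.

    Add one instance above the high-water mark, remove one below the
    low-water mark, and hold steady in between to avoid flapping.
    """
    if load_per_server > high and current < max_cloud:
        return current + 1
    if load_per_server < low and current > 0:
        return current - 1
    return current


def simulate(loads: list) -> list:
    """Walk a series of load samples; return the instance count after each."""
    count, history = 0, []
    for load in loads:
        count = desired_cloud_instances(count, load)
        history.append(count)
    return history


if __name__ == "__main__":
    history = simulate([0.5, 0.9, 0.95, 0.7, 0.3, 0.2, 0.1])
    print(history)       # instance count per interval
    print(sum(history))  # instance-intervals you would be billed for
```

The gap between the two thresholds is the design choice that matters: with a single threshold, a load hovering near it would provision and deprovision on every sample.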

Cost of cloud computing, expensive!

Wednesday, January 28th, 2009

With a large number of initiatives around cloud computing, I was interested in determining if the current cost of moving something like a lab environment into an outsourced environment would be cost effective.  Now, I realize that current ‘cloud’ offerings are really geared to dealing with temporary spikes in compute load rather than moving an entire infrastructure out of a corporate data center, however, mirroring a lab environment is perhaps a plausible use of the cloud.

This demonstration was simply to determine the monthly cost of hosting a lab environment in Amazon’s EC2 and then comparing it to the fully loaded cost of having a lab environment in house. 

The services that I ran the experiment on were Amazon’s EC2 and their storage service (S3) for persistent data management.  EC2 allows you to provision various types of x86 servers of differing compute capabilities, and you are billed per instance hour.  There is no restriction on how compute intensive your instance is.  Their cost matrix for Linux instances and S3 storage can be viewed here and the Windows pricing is here.  The Windows pricing also includes options for SQL Server (and authentication services).
The experiments I ran were for five systems of various configurations running our application (up.time).  This included Linux running MySQL, Linux running Oracle, Windows running SQL Server and other combinations.  The databases were stored on Amazon’s EBS (Elastic Block Store) storage for persistence reasons.  The applications were run for two weeks under simulated load for monitoring 1,000 systems to get an idea of network and storage bandwidth.
After two weeks, the compute costs, I/O costs, and persistent storage costs were tallied and then scaled to mirror the monthly cost of a sample lab environment.
Amazon EC2 Costs for 300 lab instances.  There are 744 hours in a typical month (24*31).

| Instance Type | Num | Cost/Instance Hour | Compute Cost/Month |
|---|---|---|---|
| Windows | 100 | $0.125 | $9,300 |
| Windows + SQL Server | 50 | $1.100 | $40,920 |
| Linux | 150 | $0.100 | $11,160 |
| Windows (SQL/xlarge) | 2 | $2.400 | $3,571.20 |
| Total Cost Per Month | | | $64,951.20 |

| Storage | Unit Cost | Storage Cost/Month |
|---|---|---|
| 5.6T (usable) | $0.10 per GB/month | $573.44 |
| I/O 30B | $0.10 per 1MM I/Os | $300.00 |

| Network | Unit Cost | Network Cost/Month |
|---|---|---|
| I/O 20 GB | $0.10 per GB/month | $2.00 |

Total EC2 Cost/Month: $65,826.64
Total EC2 Cost/Year: $789,919.68
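For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the compute line items from instance count, hourly rate and 744 hours.  The storage, I/O and network charges are taken as printed rather than recomputed.  The line items sum to $65,826.64 a month, which is the monthly figure consistent with the $789,919.68 annual total.

```python
HOURS_PER_MONTH = 24 * 31  # 744, as stated above

# (instance type, count, cost per instance hour in USD) from the table
INSTANCES = [
    ("Windows", 100, 0.125),
    ("Windows + SQL Server", 50, 1.10),
    ("Linux", 150, 0.10),
    ("Windows (SQL/xlarge)", 2, 2.40),
]

# Monthly storage, I/O and network charges, taken as printed in the table
OTHER_MONTHLY = {"storage": 573.44, "io": 300.00, "network": 2.00}


def compute_cost_per_month(instances=INSTANCES, hours=HOURS_PER_MONTH) -> float:
    """Sum of count * hourly rate * hours across all instance types."""
    return sum(count * rate * hours for _, count, rate in instances)


def total_cost_per_month() -> float:
    """Compute cost plus the storage, I/O and network charges."""
    return compute_cost_per_month() + sum(OTHER_MONTHLY.values())


if __name__ == "__main__":
    print(f"Compute/month: ${compute_cost_per_month():,.2f}")
    print(f"Total/month:   ${total_cost_per_month():,.2f}")
    print(f"Total/year:    ${total_cost_per_month() * 12:,.2f}")
```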

Now, if I calculate actual lab costs that mirror this environment, here’s what we get (I’ve deliberately excluded our non-x86 platforms such as POWER and SPARC).  I’ve included the retail costs for Microsoft SQL Server and Oracle even though as an ISV we wouldn’t pay nearly as much.  The EC2 cost for Windows systems is considerably higher than for Linux because of the software licensing costs blended into the instance hour calculation.

In the cases of leasing hardware, the number is more or less a constant cost as new gear is purchased and older gear is bought out. For software costs, they’ve been amortized over three years.

| Gear | Number | Cost Per Month |
|---|---|---|
| Dell 1950 | 28 | |
| Dell 2950 | 2 | |
| HP DL585 | 2 | |
| 10TB iSCSI | 1 | $10,000 |
| Dell/HP/Equallogic Support | | $300 |
| HVAC/Power | | $1,000 |
| Floor Space | 500 sq ft at $24/sq ft/year | $1,000 |
| VMware ESX | 9 | $1,250 |
| Annual Support (VMware) | | $1,250 |
| Internet | | $1,200 |
| Network Infrastructure | | $556 |
| Total Infrastructure Cost/Month | | $16,556 |
| Software Cost | | |
| SQL Server 2008 | | $2,083 |
| Oracle 10g/11g | | $2,083 |
| Labour Cost/Month | | $4,166 |
| Total In-House Cost/Month | | $24,888.89 |
| Total In-House Annual Cost | | $298,666.67 |

I’m torn about including labour, as instance management overhead is the same in both scenarios; however, the actual network and compute infrastructure, when in-house, does require some amount of headcount.  In this case, I’ve added 0.5 of a resource (fully loaded cost).
So, the difference between an EC2 lab environment and an in-house environment is ($789,919.68 – $298,666.67) = $491,253.01.  This is quite a substantial difference for an always-on environment.
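As a quick cross-check, this Python sketch adds up the in-house line items and compares the two annual totals.  The monthly items are the rounded figures from the table, so their sum lands within a dollar of the printed $24,888.89 rather than exactly on it; the variable and function names are mine.

```python
# Monthly in-house figures (USD), rounded as listed in the table
IN_HOUSE_MONTHLY = {
    "infrastructure": 16_556.00,
    "sql_server": 2_083.00,
    "oracle": 2_083.00,
    "labour": 4_166.00,
}

EC2_ANNUAL = 789_919.68        # as printed
IN_HOUSE_ANNUAL = 298_666.67   # as printed


def annual_difference(ec2: float = EC2_ANNUAL,
                      in_house: float = IN_HOUSE_ANNUAL) -> float:
    """Extra annual cost of the always-on EC2 lab over the in-house lab."""
    return round(ec2 - in_house, 2)


def cost_ratio(ec2: float = EC2_ANNUAL,
               in_house: float = IN_HOUSE_ANNUAL) -> float:
    """How many times more expensive EC2 is than in-house."""
    return ec2 / in_house


if __name__ == "__main__":
    print(f"In-house/month (rounded items): ${sum(IN_HOUSE_MONTHLY.values()):,.2f}")
    print(f"Annual difference: ${annual_difference():,.2f}")
    print(f"EC2 / in-house: {cost_ratio():.2f}x")
```

Put as a ratio, the always-on EC2 lab comes out at roughly 2.6 times the in-house cost.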
I am curious as to how many enterprises have truly dynamic workloads that could take advantage of a cloud (either internal or external) to truly derive the cost benefits of cloud computing.
Certainly, at first blush, a straight migration of servers is a costly proposition.
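The flip side of that conclusion is a break-even question: how much of the month could the instances run before EC2 costs more than in-house?  The Python sketch below is a simplification under my own assumptions, namely that storage, I/O and network charges stay fixed at the printed monthly amounts and that compute cost scales linearly with the fraction of hours all instances are running.

```python
HOURS_PER_MONTH = 744
EC2_COMPUTE_MONTHLY = 64_951.20              # every instance running every hour
EC2_FIXED_MONTHLY = 573.44 + 300.00 + 2.00   # storage, I/O, network as printed
IN_HOUSE_MONTHLY = 24_888.89


def ec2_monthly_cost(fraction_running: float) -> float:
    """EC2 cost when instances run for the given fraction of the month."""
    return EC2_FIXED_MONTHLY + EC2_COMPUTE_MONTHLY * fraction_running


def break_even_fraction() -> float:
    """Fraction of the month at which EC2 cost equals in-house cost."""
    return (IN_HOUSE_MONTHLY - EC2_FIXED_MONTHLY) / EC2_COMPUTE_MONTHLY


if __name__ == "__main__":
    f = break_even_fraction()
    print(f"Break-even at {f:.1%} of the month "
          f"(about {f * HOURS_PER_MONTH:.0f} hours)")
```

Under those assumptions the break-even point is roughly 37 percent of the month, so a lab that only needs to be up for part of the day looks very different from an always-on one.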
—————————–
Quick update: We have just launched uptimeCloud, the simple way to manage cost in the cloud. This new SaaS product will provide real-time, dynamic cloud cost monitoring, cloud cost forecasting, and cloud capacity management. For more, visit http://www.uptimecloud.com