The up.time IT Systems Management Blog

Archive for the ‘Server Monitoring’ Category

Know When It’s Time to Upgrade your Server Monitoring Software

Tuesday, May 14th, 2013

As enterprises continue to expand their datacenters and computing networks, IT departments are faced with the task of managing these systems and their capacity demands. Unified server monitoring has become necessary for monitoring these ever-growing servers and datacenters and achieving greater transparency in the monitoring and reporting of system events.

However, a Goldilocks-like dilemma arises when the solution is either too big (ITSM) or too small (point tool solutions) to meet the IT department’s server monitoring and capacity management needs. In this post, we will identify the most common pitfalls of IT systems management solutions and show you how to find a server monitoring solution that’s just right.

  1. Monitoring the Wrong Data
    Do you use “agentless” or SNMP server monitoring? If so, you may be monitoring the wrong data. On the other hand, ITSM suites sometimes generate so much data that it is difficult to sort through it and pinpoint the important data points.

  2. No Clear View of the Data
    Do you find that, more often than not, you’re alerted to problems by end users before they’re identified by your server monitoring software? Does your root-cause analysis often fail to turn up the underlying causes of problems? If your answer to either of these questions is “yes,” you need a server monitoring solution that provides a transparent, high-level view of all of your systems, making it easier to pinpoint underperformance and troubleshoot proactively.

  3. Manual Server Monitoring
    Do you spend a significant amount of time each workweek on manual scripting and server monitoring tasks? Do you manually monitor server backup systems, or does your server monitoring program notify you when they are in use? If you are still monitoring servers manually, it’s time to enter the age of integrated server monitoring.

  4. Complexity and Hidden Costs
    Is the “shelfware” piling up at your IT department? Was your ITSM framework difficult to deploy? Have you spent more money on consulting for your ITSM framework than on the framework itself? If your ITSM framework is large, clunky and complicated, you may not be getting your money’s worth.

  5. Frankenstein Tools
    On the other side of the spectrum, IT departments that don’t want to invest in expensive ITSM frameworks may attempt to meet their server monitoring needs by cobbling together a collection of freeware and point tools. These lower-end tools often require labor-intensive custom scripting and rebuilding. It is also difficult, if not impossible, to integrate these tools into high-level dashboards, making server monitoring a manual and time-consuming task.

If your IT department regularly encounters the problems outlined in this post, it may be time to invest in new server monitoring software. In the age of mushrooming servers and cloud computing, it is imperative that you look for a unified server monitoring solution that combines IT monitoring and capacity management in a single dashboard view. To learn more about what you can do to streamline your server monitoring efforts, download the white paper The 12 Pitfalls of IT Systems Management.

Stop the IT “Blame Game” and Get a Single Source of Truth!

Friday, February 22nd, 2013

My landlord just kicked me out. Let me rephrase that. My landlord just politely asked me to leave his property before my lease is up. Thanks to him, my wife and I have been packing our stuff in preparation for our move. In all honesty, my wife has been doing most of the packing. I asked her to pack up all her belongings and the kitchenware, and I would do the rest. Somehow, that message got lost and she packed my things too. When I went looking for my shoes and couldn’t find them, I was upset. I reiterated that I had asked her to pack only her stuff, but she said I never told her that. Before I knew it, the blame game was in full effect. It was her word versus mine.

These kinds of things happen in IT infrastructure management too. When you have more than one tool to monitor your environment and more than one data source for capacity planning, how do you know which one to trust? IT departments usually justify running a variety of monitoring tools by saying that no single tool in their current set provides all the visibility they need. For example, some tools are strictly for network monitoring. Others might go really deep on Windows monitoring but be light on everything else. It gets worse when the metrics from each tool overlap: which one should you go with? Different tools gather metrics in different ways and at different time intervals. One tool might catch a spike while another may not. It is a full-time job just to consolidate data and close information gaps to make sense of it all.
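To see why two tools can disagree about the same server, here is a minimal sketch in Python. The CPU readings, the polling intervals and the threshold are all made-up values for illustration; they are not taken from any particular monitoring product.

```python
# Hypothetical per-second CPU readings: idle at 20% with a 10-second spike to 95%.
cpu_by_second = [20] * 300
for second in range(125, 135):
    cpu_by_second[second] = 95


def sample(series, interval_seconds):
    """Return the readings a tool would see when polling every interval_seconds."""
    return series[::interval_seconds]


def saw_spike(samples, threshold=90):
    """True if any polled reading is at or above the threshold."""
    return any(value >= threshold for value in samples)


tool_a = sample(cpu_by_second, 60)  # polls at seconds 0, 60, 120, 180, 240
tool_b = sample(cpu_by_second, 10)  # polls at seconds 0, 10, 20, ..., 290

print("60-second tool saw the spike:", saw_spike(tool_a))
print("10-second tool saw the spike:", saw_spike(tool_b))
```

Both tools are reading the same server, yet only the one that happens to poll during the spike reports it, which is how two dashboards end up telling two different stories.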

Here is where up.time is different. up.time provides unified monitoring for all the silos within an IT infrastructure so you can have a true ‘single pane of glass’. You don’t have to duct tape point tools together to make a homemade Swiss Army knife. up.time IS the Swiss Army knife and provides a unified and comprehensive view. It makes capacity planning a breeze because it provides a single data source, so you don’t have to try to make sense of all the differing metrics! You can eliminate the blame game (and headaches) in IT when you don’t have multiple tools telling you different things. You don’t have to go to war with the network team over whose data is right when you have a standard tool providing a single view of the truth. up.time is the solution that enables you to be the IT superstar. Download up.time and give it a spin today!

CES Roundup – What do Smartwatches have to do with ITSM?

Tuesday, January 15th, 2013

The CES (Consumer Electronics Show) was on last week and there was some pretty cool stuff being shown including flexible screens, auto-piloting cars, and many projects that started off on Kickstarter. One of these projects is the famous Pebble smartwatch.

Server Down Alert!

Smartwatches have always interested me, but until recently the technology was a big limiting factor on functionality. Now that the technology has caught up, we can start to see some cool things happening. Even while I’m waiting for my Kickstarter Pebble, I already have the MetaWatch Strata (another Kickstarter smartwatch) on my wrist. I can already get emails, texts, phone calls and calendar invites on my wrist, and it’s a pretty cool feeling. Mentioning that I’ve just received an SMS or email on my watch tends to make some people question whether I’ve been to the future and back.

So what do these smartwatches and other technologies have to do with IT systems management? Essentially they’re just more tools for us to get things done! Having a watch on my wrist that not only tells the time (yes, it does that too) but also vibrates means I don’t miss an important phone call from a big client, or the outage email/SMS alerting me that one of the server nodes is down.

This also means that we’ll have to make sure we don’t get overloaded with notifications, or else we’ll just start ignoring everything, which is also a problem. Finding that magic balance of getting just the right amount of information is something that up.time can help provide, for the IT side of things at least. If you’d like to read a little more about intelligent alerting, click here, or read a related post on avoiding the “sea of red” of false alerts here.

Manage Capacity and Avoid Downtime During the Holiday Season

Wednesday, December 19th, 2012

Coming down to the last month of the year, most people are thinking about spending time with their family and what kind of gifts they should get for their loved ones. While some people are still hesitant about purchasing goods online, the convenience of shopping in your PJs is definitely taking over. According to statistics, the top 3 spending days are Cyber Monday, Green Monday (who knew this existed?) and Free Shipping Day (another day I’ve never heard of). While shoppers love the ease and convenience, online retailers have to be absolutely sure they can handle the influx of shoppers.

The issue outlined above is essentially what capacity planning is all about. We have discussed capacity at length in a number of posts (see related posts below). But what if, even after doing your due diligence, you still don’t have enough capacity? How can you be agile enough to meet the demands of the screaming shoppers? If you are running VMware with vCenter Orchestrator, you can integrate it with up.time so that if up.time detects an outage or degraded performance, Orchestrator can spin up more instances to lighten the load on your servers. Even if you aren’t using VMware, you can still use Action Profiles to help automate the increase of capacity with up.time!

As the statistics show, those of you in retail will probably be smiling this time of the year.  However, if you are not properly managing your capacity, it could be a disaster.

Happy holidays everyone! :)

-Patrick

Related Capacity Planning posts:

Key to Capacity Planning is Knowledge
Never Run Out of Disk Space Again with Capacity Management
Cradle To The Grave – Virtualization Capacity Management

How to Avoid a “Sea of Red” – False Alerting in your IT Infrastructure

Thursday, October 25th, 2012

I like calling a flood of alerts a sea of red. Hearing that term immediately brings to mind an image of someone drowning and crying out for help. Either that, or someone getting all grumpy when the alerts come.

When evaluating a monitoring solution, don’t just look at the metrics you are able to collect, because getting metrics is the easy part. Anyone can write scripts or programs to get the numbers they want. Getting useful alerts is much more difficult, and much more valuable when you want to get a handle on your IT infrastructure. So what are useful alerts? Let’s look at a car alarm as an example. Have you ever heard an alarm going off when a bass-bumping car drives by? That’s one false alarm I don’t need. Similarly, if your monitoring solution is sending you false alerts, what’s the point of having a monitoring solution? When you get enough of these false alerts, you will start filtering them and stop reading them altogether. Sadly, I have seen system administrators who do that on a regular basis. They manage hundreds of servers, and the monitoring solution sends out hundreds of false-positive alerts per day.

So how can you avoid a sea of red?

There are a number of features in up.time that minimize alert noise in your environment. Let’s focus on two in this post:

  1. Flexible Alert Settings
    The genius who designed my condo put a smoke detector on the ceiling inches away from my bathroom door. So after I take a nice, long hot shower, if I forget to close the door, the steam trips the smoke detector every single time. If the smoke detector would just give it 1-2 minutes of leeway for the steam to dissipate, the firemen wouldn’t be knocking at my door when I have a towel wrapped around my waist. Similarly, how often do your servers briefly spike in CPU or memory usage and trip the alert thresholds? Those are probably alerts that you can do without. So what does up.time do differently? up.time allows you to configure rechecks for any given monitor. For example, if CPU usage passes the threshold, up.time doesn’t have to send out the alert right away. You can choose to have up.time recheck a few times to verify that CPU usage is still high. If, and only if, the usage remains high, an alert is sent out. How many times to recheck and how often is entirely up to you.

  2. Topological Dependencies
    If you can’t make a call on your cell phone, what is the first thing you would check? Of course you would check whether you have reception. Whether you can make a call or not depends on it. Similarly, in a networked environment, devices and servers depend on a number of things in order to work cohesively. The simplest example is a server depending on a switch for network connectivity. In a lot of monitoring solutions, if the switch is down, the result is a sea of red because the monitoring solution can’t reach any of the servers. This is especially true of siloed solutions where you have one piece monitoring the network, another monitoring the servers, and yet another monitoring the applications. With up.time’s unified systems monitoring, you can set up the dependencies so it won’t swarm you with alerts when outages happen. Instead of receiving all those server alerts, you will just get an alert on the switch so you can focus on fixing what’s important.
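The two ideas above can be sketched in a few lines of Python. This is only an illustration of the general techniques (rechecking before alerting, and suppressing alerts for anything whose parent is already down); the function names, thresholds and host names are hypothetical and do not describe how up.time implements them.

```python
from typing import Callable, Dict, List, Set


def should_alert(read_cpu: Callable[[], float], threshold: float, rechecks: int) -> bool:
    """Alert only if the first check and every recheck exceed the threshold.

    A real monitor would wait between rechecks; the delay is omitted here.
    """
    for _ in range(1 + rechecks):
        if read_cpu() <= threshold:
            return False  # usage dropped back, so treat it as a passing spike
    return True


def alerts_to_send(down: Set[str], parent_of: Dict[str, str]) -> List[str]:
    """Keep only the outages whose topological parent is not itself down."""
    return sorted(name for name in down if parent_of.get(name) not in down)


# A short spike clears on the first recheck; a sustained load does not.
spike = iter([97.0, 45.0, 40.0])
sustained = iter([97.0, 96.0, 98.0])
spike_alert = should_alert(lambda: next(spike), threshold=90.0, rechecks=2)
sustained_alert = should_alert(lambda: next(sustained), threshold=90.0, rechecks=2)
print("Alert on short spike:", spike_alert)
print("Alert on sustained load:", sustained_alert)

# Hypothetical topology: three servers that all hang off one switch.
parent_of = {"web-1": "switch-a", "web-2": "switch-a", "db-1": "switch-a"}
down = {"switch-a", "web-1", "web-2", "db-1"}
print("Alerts sent:", alerts_to_send(down, parent_of))
```

With the switch down, four unreachable devices collapse into a single alert on the switch itself, and the brief CPU spike never generates an alert at all.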

If you are one of the unfortunate sysadmins buried by alerts, it’s a choice whether you want to continue to work like that. You can choose to weed through hundreds of alerts each day or you can kiss the sea of red goodbye! Work smarter! Try up.time and see the difference!

Introducing the New Dashboarding API – Sneak Peek Pt. 2

Tuesday, October 23rd, 2012

It’s time for part 2 of my sneak peek at up.time 7.1’s new Dashboarding API. Please take a moment to review part 1 for some basics on the API including how to list elements, groups, and monitors.

The Status End Point

The second major piece of the API is the ability to list the current status of elements, groups and monitors. Add the ID of your target followed by /status, and the API produces a simple listing of all related monitor statuses. Here’s an example:

GET https://win-dleith:9997/api/v1/elements/14/status

Produces this example JSON output with details on both the element status and the status of each service monitor associated with this element.

{
   "id":14,
   "isMonitored":true,
   "lastCheckTime":"2012-10-22T15:16:44",
   "lastTransitionTime":"2012-10-22T12:14:56",
   "message":"",
   "monitorStatus":[
      {
         "acknowledgedComment":null,
         "elementId":14,
         "id":250,
         "isAcknowledged":false,
         "isHidden":true,
         "isHostCheck":false,
         "isMonitored":true,
         "lastCheckTime":"2012-10-22T15:17:30",
         "lastTransitionTime":"2012-10-22T12:17:31",
         "message":"All metrics collected successfully",
         "name":"Platform Performance Gatherer",
         "status":"OK"
      },
      ....
   ],
   "name":"vmh-rd3.rd.local",
   "powerState":"On",
   "status":"OK",
   "topologyParentStatus":[
      {
         "id":2,
         "isMonitored":true,
         "lastCheckTime":"2012-10-22T15:16:54",
         "lastTransitionTime":"2012-10-22T12:14:48",
         "message":"",
         "name":"rd-vc2",
         "powerState":null,
         "status":"OK"
      }
   ]
}

Notice that this status information includes the times of any recent checks, the current power state of virtual elements, and information about this element’s topological parent so that you can piece together topology views across your enterprise.
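As a rough illustration of how a dashboard script might consume this output, here is a minimal Python sketch. It parses a trimmed copy of the example response above rather than calling the API, because the post does not cover authentication; the `summarize` helper and the shape of its result are my own invention, not part of the API.

```python
import json

# Trimmed copy of the example response shown above.
sample_response = """
{
   "id": 14,
   "name": "vmh-rd3.rd.local",
   "status": "OK",
   "powerState": "On",
   "monitorStatus": [
      {"id": 250, "name": "Platform Performance Gatherer", "status": "OK",
       "isMonitored": true, "message": "All metrics collected successfully"}
   ],
   "topologyParentStatus": [
      {"id": 2, "name": "rd-vc2", "status": "OK"}
   ]
}
"""


def summarize(status):
    """Collect the element status, monitored checks that are not OK, and parent statuses."""
    problems = [
        monitor["name"]
        for monitor in status.get("monitorStatus", [])
        if monitor.get("isMonitored", True) and monitor.get("status") != "OK"
    ]
    parents = {
        parent["name"]: parent["status"]
        for parent in status.get("topologyParentStatus", [])
    }
    return {
        "element": status["name"],
        "status": status["status"],
        "problem_monitors": problems,
        "parents": parents,
    }


summary = summarize(json.loads(sample_response))
print(summary)
```

The same summary could feed a status tile or a topology view: the element’s own status, the monitors that need attention, and the state of whatever it depends on.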

Now let’s talk about some of the fun stuff. By piecing together inventory information as well as availability information, we can start to craft some very exciting interactive views of your environment. Here are some examples that are being released along with up.time 7.1 for you to use as a starting point for your dashboarding needs. You will be able to find these examples on The Grid or our new GitHub page.

Pin+Image – Written by Joel

A world map example indicating the status of key applications around the country. Each ‘pin’ highlights both the status of the application and any member service monitors. Hovering over a pin brings up more details on the element and allows you to drill down with a simple click. The background image and location of any status indicators is completely customizable. Build your view to suit whatever your NOC needs.

Incident Console – Written by Patrick

A highly interactive operations view for operations teams or administrators, combining data from several up.time monitoring stations around the country. This console can be extended to link into help desk, ticketing systems, or even configure one-click console/rdc access to help you quickly triage any ongoing problems.

Dynamic Topology View – Written By Alex

Drill down through your topology to easily see the root cause of any outages in your environment. With this heads-up view you don’t have to navigate to different pages to understand how the key components of your environment relate to each other. Upstream and downstream components are instantly clear based on your defined up.time topological dependencies.

We’re really looking forward to the up.time 7.1 launch and I will be speaking live at our sneak peek webinar today at 1pm ET. More information on the release will be in your inbox next week.

Auto-Discovery Improvements in up.time 7.1

Monday, October 1st, 2012

up.time 7.1 is introducing a few new changes to our existing automated discovery process that will allow you to scan and monitor your datacenter more efficiently than in previous versions. We have always strived to be one of the easiest monitoring tools on the market to set up, roll out and maintain. Our users have reported being able to roll out monitoring to as many as 400 servers a day using a combination of our agentless and agent based monitoring. With 7.1 we wanted to push this number even higher to save you even more time and effort. The result is a discovery process that is at minimum, 50% faster, and much easier to use with much less manual work and human error.

For those of you who haven’t tried up.time yet, we support the following auto-discovery methods:

  • Network Scan: Provide a subnet range and up.time will scan for any servers, network devices, IPs, etc… that can be added.
  • VMware vCenter Server Inventory Synchronization via up.time vSync: up.time will tap into your vCenter environment and automatically add all attached elements in a matter of seconds. up.time will also keep this inventory in sync, so newly discovered VMs & vSphere servers will have monitoring applied automatically.
  • IBM pSeries Discovery: Scan your HMC or VIO servers to discover all attached frames and LPARs

Using these methods it is very easy to populate up.time with everything in your environment. We also offer utilities that will bulk import systems from a text file if you happen to have a running list of every server in your environment.

With up.time 7.1 we are introducing these changes to the auto-discovery process:

  • Faster Network Discovery: In up.time 7, the discovery process on a subnet with 200 elements took on average 4m30s (270 seconds). In up.time 7.1, the same subnet now takes 1m45s (105 seconds). That’s roughly a 61% reduction in scan time, so discovering subnets will now take less than half the time.
  • Faster Bulk Addition of Elements: Previously, once the network had been discovered, it took about 20 seconds on average to add each discovered element. It was a manual process that could take quite a while in large environments. In up.time 7.1, any discovered element can be bulk added using standard credentials. This completely removes the manual overhead and reduces the average add time to about 3 seconds. After you enter credentials once, it will take about 10 minutes for up.time to automatically add all 200 elements while you go refill your coffee.
  • Wizard-Based Discovery: The process of setting up a discovery has been streamlined so that it’s easier to see where you are in the process and to understand how each discovery option works. We have also tied vSync discovery into the auto-discovery process to highlight its discovery capabilities.
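For anyone who wants to check the arithmetic behind these numbers, here is a small Python sketch. It uses only the averages quoted in the list above for a 200-element subnet; the variable names are mine.

```python
ELEMENTS = 200

scan_before_s = 4 * 60 + 30   # 270 seconds in up.time 7
scan_after_s = 1 * 60 + 45    # 105 seconds in up.time 7.1
scan_reduction_pct = (scan_before_s - scan_after_s) / scan_before_s * 100

add_before_min = ELEMENTS * 20 / 60   # about 20 seconds per element, added one by one
add_after_min = ELEMENTS * 3 / 60     # about 3 seconds per element, bulk added

print(f"Scan time reduction: {scan_reduction_pct:.1f}%")
print(f"Adding {ELEMENTS} elements: about {add_before_min:.0f} min before, "
      f"{add_after_min:.0f} min after")
```

On these figures the scan time drops by roughly 61 percent, and adding all 200 elements goes from a little over an hour of hands-on work to about ten unattended minutes.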

That’s all for now. More information will become available over the coming month as we approach the up.time 7.1 release. I will also be hosting a Sneak Peek Webinar to introduce up.time 7.1 to the world in late October. More details to follow.

Two Reasons Why SNMP Monitoring is Essential for your Datacenter

Friday, September 21st, 2012

HTTP, IP, RAM, CPU, MB,…there are tons of acronyms in the IT world.  Heck, the word IT is an acronym.  Some companies use acronyms as their names as well.  Sometimes acronyms can be intimidating.  Quite often, the reason why we shiver when we hear acronyms, is because we don’t know what they stand for.  For example, do you know what SNMP stands for?  A quick Google search yields Simple Network Management Protocol.  Well, that tells me it’s a network protocol.  It doesn’t sound very exciting.  So let me re-define it as:

SNMP is Necessary in Monitoring your Paradise/Prison

And by Paradise, I mean your datacenter.  Feel free to substitute Paradise with Prison if you aren’t proactively managing your environment ;)  But why is SNMP a must if you want to get a handle on your datacenter?  There are two main reasons:

 

  1. Visibility to Hardware Failures
    When you deal with computers long enough, you are bound to experience a few hardware failures. On enterprise-grade servers and devices, there are usually redundancies to increase availability. However, redundancy just means there are at least two of some components. There will come a day when all of them have failed. If you don’t fix the failures when they pop up, you are putting yourself at risk of a disaster. But of course, you can’t fix something if you don’t know about it. How do you know if there is a failure? That’s where SNMP comes in. Servers and devices in businesses frequently have the SNMP capability to send what’s called an SNMP trap to a centralized server. Here, the SNMP trap is just a message notifying someone about a hardware failure. Having the ability to receive such a message is essential if you want to be on the ball when it comes to failures.

  2. Visibility to Device-Specific Metrics
    Almost any device can support SNMP. If you really wanted to, you could even enable SNMP on a toaster! The flexibility of SNMP allows administrators to pull whatever metrics and statuses they want, as long as the OIDs (Object Identifiers, I know, another acronym…) are known. The types of metrics available will depend on what the vendor exposes. They can range from fan speed and the number of power supplies to the ambient temperature. Keeping an eye on these metrics will provide a more complete view of your environment.
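To make the OID idea concrete, here is a small Python sketch of checking polled values against health rules. It does not talk to a real device or use an SNMP library: the `polled` dictionary stands in for the values an SNMP GET would return. The two system OIDs are the standard MIB-II ones; the vendor OIDs, thresholds and readings are hypothetical placeholders.

```python
# Standard MIB-II OIDs (defined in RFC 1213).
SYS_DESCR = "1.3.6.1.2.1.1.1.0"
SYS_UPTIME = "1.3.6.1.2.1.1.3.0"

# Hypothetical vendor OIDs; real ones come from the vendor's MIB files.
FAN_SPEED_RPM = "1.3.6.1.4.1.99999.1.1.0"
AMBIENT_TEMP_C = "1.3.6.1.4.1.99999.1.2.0"
POWER_SUPPLIES_OK = "1.3.6.1.4.1.99999.1.3.0"

# Simulated poll result standing in for the values an SNMP GET would return.
polled = {
    SYS_DESCR: "example-switch",
    SYS_UPTIME: 8640000,  # sysUpTime is reported in hundredths of a second
    FAN_SPEED_RPM: 4200,
    AMBIENT_TEMP_C: 41,
    POWER_SUPPLIES_OK: 1,
}

# Each rule: (OID, human-readable label, check that returns True when healthy).
RULES = [
    (FAN_SPEED_RPM, "fan speed", lambda rpm: rpm >= 2000),
    (AMBIENT_TEMP_C, "ambient temperature", lambda celsius: celsius <= 35),
    (POWER_SUPPLIES_OK, "working power supplies", lambda count: count >= 2),
]


def failed_checks(values):
    """Return the labels of every rule whose polled value is missing or unhealthy."""
    failures = []
    for oid, label, is_healthy in RULES:
        value = values.get(oid)
        if value is None or not is_healthy(value):
            failures.append(label)
    return failures


uptime_days = polled[SYS_UPTIME] / 100 / 86400
print(f"{polled[SYS_DESCR]} up {uptime_days:.1f} days; failed checks: {failed_checks(polled)}")
```

In this made-up reading the device has lost a power supply and is running warm, which is exactly the kind of problem that never shows up in CPU or memory graphs.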

 

We have discussed the virtues of having a single pane of glass to give you a complete view of your IT infrastructure. The two reasons above are why SNMP needs to be a part of your monitoring strategy. up.time’s SNMP monitoring capabilities make it easy for you to get a handle on your environment. If you haven’t yet taken up.time out for a test drive, make sure you do!

 

- Patrick

 

The SLA Dashboard – The Key to Measuring Business Services

Wednesday, August 22nd, 2012

Last time we discussed how you can instantly identify the impact of your IT infrastructure on business services. Taking that one step further, let’s talk about how you can measure your business services. There are many metrics you can use to determine the performance of your business services. Let’s say you want to measure the throughput of your email system. up.time has Service Monitors for that. Or say you want to determine the number of requests your web server gets. There’s a Service Monitor for that. These are important metrics, but they are not necessarily essential.

The one vital metric is how often the business services are up.  And as you already know, a business service is not a single process on a single computer.  Whether the service is available or not depends on a number of components.  How do you tie all the components together and have a single number representing the health of your business service?  One way to gauge is by using the Service Level Agreement (SLA) dashboard in up.time.

SLA Dashboard

Why should you use the SLA dashboard? If you have SLA requirements, it’s a no-brainer to use this dashboard because it provides a real-time status of your SLA. It also lets you know if you are on thin ice, telling you exactly how many more minutes you have until your SLA will be breached. But what if you don’t have any SLA requirements? You should still set up SLAs for each of your business services, because up.time will then be able to quantify how well you are managing your IT infrastructure. up.time can determine the percentage of time a service is available over a given compliance period. If you are a system administrator, your performance is tied directly to whether services are accessible to the users. That bonus you wanted? You will have a much stronger argument for why you deserve it if you can show a concrete KPI of how well you are doing. Or if you are a CIO, having a magical number summarizing what you need to know will enable you to work more efficiently.
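The arithmetic behind “minutes until breach” is simple enough to sketch in Python. This is a generic illustration, assuming availability is measured against the whole compliance period; the 99.9% target, the 30-day period and the 30 minutes of downtime are hypothetical, and this is not necessarily how up.time calculates its own figures.

```python
def sla_report(target_pct, period_minutes, downtime_minutes):
    """Return availability so far and the downtime budget left in the compliance period."""
    allowed_downtime = period_minutes * (1 - target_pct / 100)
    availability_pct = (period_minutes - downtime_minutes) / period_minutes * 100
    return {
        "availability_pct": availability_pct,
        "allowed_downtime_min": allowed_downtime,
        "minutes_until_breach": allowed_downtime - downtime_minutes,
    }


# Hypothetical SLA: 99.9% availability over a 30-day compliance period,
# with 30 minutes of downtime recorded so far.
period = 30 * 24 * 60  # 43,200 minutes
report = sla_report(target_pct=99.9, period_minutes=period, downtime_minutes=30)

print(f"Availability: {report['availability_pct']:.3f}%")
print(f"Downtime budget: {report['allowed_downtime_min']:.1f} min")
print(f"Minutes until breach: {report['minutes_until_breach']:.1f}")
```

A 99.9% target over 30 days leaves a budget of only about 43 minutes of downtime, which is why knowing how many of those minutes remain matters so much.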

Whether you have formal, informal or non-existent SLAs on your business services, the up.time SLA dashboard is a must-have for getting a complete view of your IT infrastructure! Download up.time and take it for a spin!

-Patrick

From Dashboard to Deep Dive Diagnostics

Thursday, July 5th, 2012

“So something *WAS* wrong with the network but was that the root cause?”

In my last post, we looked at how you can use the brand new Network Dashboard in up.time 7 to look for network issues. Now that you have confirmed the network was to blame for the poor performance, is the case closed? How do you know if there aren’t other issues? Are you content with guessing and hoping the network was the sole reason for the degradation of performance? That is almost like your doctor saying he/she can give you a full body checkup just by looking at your pinky! Today’s IT infrastructure is not composed of only the network. So how can you dig deeper?

Resource Scan Dashboard

 

up.time enables users to get a complete view of their IT environment without navigating through a myriad of tools, or what we call “tool soup”. Jumping between different consoles is not only time-consuming, it also makes the lives of system administrators more difficult. Recovering from outages will take longer, which will ultimately affect your business. See more about tool soup here. How does up.time help? First, if you already have up.time’s intelligent alerting configured, the notifications you get should point you in the right direction. But let’s say you want to do some impromptu analysis. How do you do that?

 

Visibility Over Time

If it’s a performance-related issue, I would recommend looking at the Resource Scan Dashboard (see image above). It gives you a general sense of how your entire environment is doing. At the same time, with a single click, you can drill down to individual servers to get deeper information. Want to see why memory usage was so high in the middle of the night? up.time gives you visibility by letting you go back in time to identify the resource hog. Again, up.time’s single pane of glass allows you to gain a complete view of the infrastructure. If you haven’t already, download up.time to gain 20/20 vision into your complete IT environment so you won’t miss a beat!

 

– Patrick