When I talk to customers at the data center manager or VP of IT level, one rather unnerving topic almost always comes up: am I being lied to? Or, more politely rephrased, am I not being told everything that I should be? Before you think I’m bringing up something juicy, that’s not the case. I’m talking about how difficult it is for an IT manager to fully understand which components in their infrastructure are causing outages, and then to get up-front answers from their staff (and the tools they use) about the actual root causes.
Rarely do you see a network manager put up a hand and state, “Yeah, we misconfigured the router and throttled traffic for a few hours; sorry that it affected most of our applications,” or application developers confess, “We didn’t think that a small SQL change would kill the db ‘that’ much.” It’s almost always the inadvertent modifications that cause the greatest outages, or the greatest impact on an application service.
There are two problems in these common scenarios:
- Most IT managers have a really hard time getting metrics that span network, system, storage, and application layers and presenting them in a meaningful format.
- Transparency of metrics really is a cultural issue.
In the first case, infrastructure is divided into many departments: network, system, application, storage. Each administrator or infrastructure manager has their own familiar tools and pays attention only to the metrics relevant to the service they deliver. This is almost never done with the business user in mind.
Take email, for instance. Email is many things: an external link to the internet (or to disparate offices), gateway servers, virus and spam scanners, MTAs, message stores, directory lookups, web portals, etc. In one relatively “simple” application you’ve spanned all the infrastructure stacks: network, system, application, and storage. So, when a sales line-of-business user files a service ticket saying email is down, where are you going to look? What actually caused the problem? If there is a manager for each part of the infrastructure stack, you’ll probably get four different answers (from many different tools). This situation usually gets reduced to finger pointing and rarely results in a holistic look at the actual cause of the outage. Wait until you implement VMware or other kinds of virtual infrastructure; the dependencies between applications and systems are going to get very difficult to understand and monitor.
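To make the email example concrete, here is a minimal sketch of what a single cross-layer view might look like. Everything in it is an assumption for illustration: the `Component` class, the `failing_by_layer` and `service_is_up` helpers, and the component inventory with its health flags are hypothetical, and this is not how up.time or any other tool actually models a service.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One piece of a business service (hypothetical model)."""
    name: str
    layer: str      # "network", "system", "application", or "storage"
    healthy: bool


def service_is_up(components):
    """The service is up only if every component it depends on is healthy."""
    return all(c.healthy for c in components)


def failing_by_layer(components):
    """Group the names of unhealthy components by infrastructure layer."""
    failures = {}
    for c in components:
        if not c.healthy:
            failures.setdefault(c.layer, []).append(c.name)
    return failures


# Hypothetical inventory for the email service; health values are made up.
email_service = [
    Component("internet link", "network", True),
    Component("gateway server", "system", True),
    Component("spam scanner", "application", True),
    Component("MTA", "application", False),
    Component("message store", "storage", True),
    Component("directory lookup", "application", True),
    Component("web portal", "application", True),
]

print("email up?", service_is_up(email_service))
print("failing components:", failing_by_layer(email_service))
```

The point of the sketch is the shape of the answer: one question about the service ("is email up?") and one list of failing components grouped by layer, rather than four separate answers from four separate tools.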
This leads into the second problem of infrastructure monitoring: transparency. In discussions I’ve had, there is a very interesting dichotomy in how customers deal with this. A number of them are terrified of presenting the business with a view of how IT infrastructure is working (or not). They feel that offering a view into the infrastructure and its availability would completely undermine the department’s credibility. In most cases, I’ve found there isn’t a lot of credibility to begin with, and exposing relevant metrics sets the framework for constructive discussions around application availability.
We’ve seen some fantastic turnarounds in IT department credibility and transparency after the implementation of up.time. Not only does up.time span the infrastructure stack, it also presents relevant metrics both to administrators who need to manage the environment and to line-of-business users who simply wish to understand the availability of the applications they need.



