Welcome to the New Year, everyone! I hope you had a lot of delicious food and some good times over the holidays, or at the very least some good time off. I myself enjoyed it so much that my new year’s resolution is to hit the gym a lot more (and stop stuffing myself with chocolate as well).
With a new year comes great responsibility, and being in IT, we have to make sure we keep our users happy at all times. To keep the users’ expectations in check we have these fancy SLA documents and reports that are supposed to show how good a job we’re doing at keeping everything up and running all the time. Unfortunately there are a few problems with this: it can take forever (from days to weeks) to create the report, fudge factors may exist in the data, we have to get metrics from different systems/platforms/layers/tiers/(insert_buzz_word_here), and did I mention this takes forever?
So here are a few tips & tricks to avoid falling into a coma from boredom while still gathering all these important metrics for this report.
1. Defining Metrics – Before even creating the SLA document/report, make sure we’ve defined which metrics indicate availability and performance of the service we’re providing. We also have to make sure we can actively monitor those metrics with our monitoring solution. Luckily up.time can monitor pretty much any kind of metric, with thousands of default metrics collected out-of-the-box along with an extensive plug-in architecture to allow for a nearly unlimited number.
2. Baselining Current Service Level – Once we know what we need to look for, we need to get a baseline for how we’re currently performing before we commit to providing a level of service that we cannot possibly achieve. For this we create all the monitors in up.time to gather the metrics and set thresholds on them, so we get the stats.
3. Proactive SLA Management – Once we create the SLA in up.time with objectives (SLOs) that define the availability and performance targets, we get instant visibility on the SLA dashboard. We can also set it up to automatically alert us when a severity 1 (SEV1) issue occurs and starts to affect SLA performance. So not only do we get component-level alerting/self-healing, we also get alerting and self-healing capabilities at the SLA level.
4. Quick SLA Report – For those of you who are currently creating your SLA reports manually, this will be a big one. How quickly can you get that SLA report compiled and completed? Well, mine has been sitting in my inbox since this morning, so that’s pretty sweet, especially since I can now choose when to tell my boss that I’ve finished compiling the report.
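To make the baselining and SLO ideas in tips 2 and 3 a bit more concrete, here is a rough Python sketch of the arithmetic behind them. It is not up.time code and does not use any up.time API; the function names, the list of pass/fail check results, and the 99.9% target are all assumptions made up for illustration.

```python
def availability_pct(results):
    """Percentage of checks that passed (hypothetical helper).

    `results` is a list of booleans, one per monitoring check:
    True if the service was up and within threshold, False otherwise.
    """
    if not results:
        raise ValueError("need at least one check result to compute a baseline")
    passed = sum(1 for ok in results if ok)
    return 100.0 * passed / len(results)


def allowed_downtime_minutes(target_pct, period_days=30):
    """Minutes of downtime a target allows over the reporting period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_pct / 100.0)


def meets_slo(results, target_pct):
    """True if the measured baseline is at or above the proposed objective."""
    return availability_pct(results) >= target_pct


if __name__ == "__main__":
    # Assumed sample data: 1000 checks, 3 of which failed.
    baseline = [True] * 997 + [False] * 3
    target = 99.9  # assumed SLO target, in percent

    print(f"Baseline availability: {availability_pct(baseline):.2f}%")
    print(f"Downtime budget at {target}%: "
          f"{allowed_downtime_minutes(target):.1f} minutes per 30 days")
    print("Safe to commit to this target?", meets_slo(baseline, target))
```

The point of running the baseline first is visible here: with three failed checks out of a thousand, a 99.9% objective would already be out of reach, so it is better to find that out before the number goes into the SLA document.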
If any of this sounds interesting to you, we have a webinar coming up soon where you can come see for yourself. Also, for a quick overview of setting up and managing your SLAs with up.time, check out my latest video – click here to watch.