When it comes to capacity planning, it’s easy to be overwhelmed by the amount of raw data that you can potentially analyze in a data center environment. The question then becomes how you are going to take that data and make sense of it in a way that helps shape your decisions when it comes to adding or removing capacity.
When it comes to estimating capacity, you definitely want more depth than a simple metric like CPU utilization on a single server.
Here are some Do’s for capacity planning:
1) Ensure strong visibility across all of your different platforms (virtual, physical, and cloud) – Increasingly, applications are being hosted across multiple infrastructure silos. Ensure that when you do your capacity planning, your toolset accounts for all the different platforms you are trying to monitor. The key is being able to baseline and spot trends across all of your infrastructure as easily and rapidly as possible.
2) Get deeper visibility into the environment than just platform metrics – Platform metrics serve as a good first level of visibility into capacity: obviously, if a box is pegged at 100% CPU, it's probably not doing very well from a capacity perspective. Conversely, if the box is running at only 50% CPU or 50% memory, that doesn't necessarily guarantee that the middleware or web server applications running on that platform aren't at full capacity or starving for some other kind of resource. Ensure that you are also monitoring application-context capacity data. If your middleware is only capable of opening 500 database connections, you want to know that you're at 490, despite all other platform metrics being "OK" (see the first sketch after this list).
3) Schedule ongoing reporting to your various capacity and performance teams – Capacity information is extremely useful, but shouldn't be thought of as a "one-off". Capacity information should be made readily available to all teams, especially those doing physical consolidations, virtualization initiatives, or any team making decisions on the acquisition of new hardware. Since decisions around these initiatives are being made all the time, scheduled reporting ensures that all stakeholders have the latest information available at all times. Not only that, these stakeholders will continue to be informed even if you or parts of your team are on vacation.
4) Group your infrastructure in ways that reflect real business processes and application services – You want to be able to capacity plan against all kinds of permutations of servers, services, and shared infrastructure stacks. Make sure that the groupings of servers you report on reflect real business services, applications, and logical groupings (like application clusters). This way, you can rapidly report on aggregate capacity utilization and trends across these groupings without having to reinvent the wheel (the second sketch after this list shows the idea). Your tooling should let you report on and access this data for subsets of infrastructure in a very low number of clicks, without SQL queries, table joins, or any advanced data warehousing.
5) Ensure visibility into time-of-day outages for capacity and relate them to SLAs – Sometimes the business linkage between availability, capacity, and individual metric streams can get really blurry in the "fog of war". At the end of the day, you want to understand whether capacity issues are driving outages and affecting SLA performance, and whether "time of day" is a driving force in that kind of capacity issue. For instance, if 5,000 VDIs boot up in the morning and saturate your connection broker, you want to be able to correlate this time-of-day outage to the impact it has on the end-user experience for SLAs (the third sketch after this list shows one simple way to surface such a pattern).
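To make point 2 concrete, here is a minimal sketch in Python of an application-level capacity check: comparing open database connections against the pool's configured maximum instead of relying on CPU or memory alone. Everything here is illustrative; get_open_connection_count() is a hypothetical stand-in for however your middleware actually exposes its pool usage (JMX, an admin API, a status page), and the 500-connection limit matches the example above.

```python
# A minimal sketch of an application-level capacity check: compare open
# database connections against the pool's configured maximum instead of
# relying on CPU or memory alone.

MAX_POOL_SIZE = 500     # the middleware's configured connection limit
WARN_THRESHOLD = 0.95   # warn when 95% of the pool is in use

def get_open_connection_count() -> int:
    """Hypothetical helper: in reality, query your middleware's admin
    interface (JMX, a status endpoint, etc.) for open DB connections."""
    return 490  # illustrative stand-in matching the example above

def check_pool_capacity() -> None:
    open_conns = get_open_connection_count()
    usage = open_conns / MAX_POOL_SIZE
    if usage >= WARN_THRESHOLD:
        # 490/500 is a capacity problem even if every platform metric is "OK"
        print(f"WARNING: connection pool at {open_conns}/{MAX_POOL_SIZE} "
              f"({usage:.0%}) -- application-level capacity nearly exhausted")

check_pool_capacity()
```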
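And here is the grouping idea from point 4 as a sketch: roll per-host metrics up into per-service aggregates so you can report on a business service in one step. The service map and CPU samples are made-up placeholders; a real toolset would maintain these groupings and pull live metrics for you.

```python
# A minimal sketch of reporting capacity against business-service groupings
# instead of individual hosts.

from collections import defaultdict
from statistics import mean

# Map hosts to the business service / application cluster they belong to.
SERVICE_MAP = {
    "web01": "online-banking", "web02": "online-banking",
    "db01":  "online-banking",
    "app01": "payroll",        "app02": "payroll",
}

# Illustrative per-host CPU utilization samples (percent).
cpu_samples = {"web01": 72.0, "web02": 65.0, "db01": 88.0,
               "app01": 35.0, "app02": 41.0}

def aggregate_by_service(samples):
    """Roll per-host metrics up into per-service aggregates."""
    by_service = defaultdict(list)
    for host, cpu in samples.items():
        by_service[SERVICE_MAP[host]].append(cpu)
    return {svc: {"avg_cpu": mean(vals), "peak_cpu": max(vals)}
            for svc, vals in by_service.items()}

for service, stats in aggregate_by_service(cpu_samples).items():
    print(f"{service}: avg CPU {stats['avg_cpu']:.1f}%, "
          f"peak {stats['peak_cpu']:.1f}%")
```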
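Finally, for point 5, one simple way to surface a time-of-day pattern is to bucket capacity-saturation events by hour, as in this sketch (the event data is illustrative sample data, not real telemetry):

```python
# A minimal sketch of time-of-day analysis: bucket capacity-saturation
# events by hour to reveal patterns like a morning VDI "boot storm".

from collections import Counter
from datetime import datetime

# (timestamp, metric) pairs for moments the connection broker was saturated.
saturation_events = [
    (datetime(2012, 12, 10, 8, 55), "broker_sessions"),
    (datetime(2012, 12, 10, 9, 2),  "broker_sessions"),
    (datetime(2012, 12, 11, 8, 58), "broker_sessions"),
]

by_hour = Counter(ts.hour for ts, _ in saturation_events)
for hour, count in sorted(by_hour.items()):
    print(f"{hour:02d}:00-{hour:02d}:59  {count} saturation event(s)")
# A cluster of events around 08:00-09:59 points at the morning logon storm
# as the driver of the SLA impact, not a general lack of capacity.
```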
Here are some important Don'ts for capacity planning:
1) Ensure that you don't have a virtual buffet of different profiling/capacity tools – The last thing you want to do with capacity planning is attempt "screen-level integration" of metrics or performance, that is, copying and pasting metrics from 10 different applications and trying to normalize them in a spreadsheet or data warehouse. This makes it impossible to aggregate the data quickly and cleanly. It calls the methodology into question, it wastes valuable time, and it produces results that are more likely than not to be treated as unreliable by everyone involved.
2) Ensure that your capacity reporting isn't merely a "static snapshot" in time – Often people use capacity planning reports that are static tables with a static count, for instance the number of virtual instances on Dec 10th. This kind of reporting doesn't give you the insight required to understand the "capacity evolution" of the environment. You are much better off getting visibility into the number of virtual instances over time in a graph, or the virtual density over time, so that you can see where things are headed and how workloads might be improperly distributed across the infrastructure (see the sketch after this list).
3) Don't hesitate to use your capacity planning initiative as a catalyst – In my experience, capacity planners have lots of data that would be very valuable to the operations/server teams that actually fix the servers. Sometimes outages are related to a peak capacity issue. Traditionally, reports and dashboards from capacity planning tools are much too cumbersome for members of the ops teams to use. If you have the right tools, the same data can be easily displayed and used by all. Having readily available capacity data correlated to outages can be a catalyst for real-world discussions between two teams that sometimes find it hard to see "eye to eye".
4) Don't hide away your capacity data – Use dashboards and other reporting tools to make your capacity information available to everyone: your ops teams, your management. This is essential data, and some teams may not be aware that they need it. How useful would it be for your ops and NOC teams to see an aggregate capacity dashboard up on a big screen? Wouldn't it be great to spot obvious issues and spikes in capacity usage at a glance?
5) Don't necessarily try to run before you can walk – Time and time again, I see people approach their capacity planning initiatives from a "pie in the sky" perspective: they hope they can guesstimate workloads, do theoretical "what if" analysis, and immediately re-order their entire data center in one shot to maximize efficiency by 400%. The truth is that very few products promising all of the above are easy to roll out; most require massive amounts of integration and will cost you an arm and a leg. And that's without even considering that "what if" analysis is of little value to operations or server engineering teams. As a first step on the road to the panacea of capacity planning analysis, you need to consider the cost, the rollout time, and the overall impact of any tool you introduce to do capacity planning. Definitely make sure you are walking (getting visibility), then jogging (making this visibility and awareness clear across the organization), and then running (hitting higher-order capacity planning ideals with solid data). Most importantly, make sure you are choosing a toolset that will help you get on that on-ramp as quickly as possible.
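As a quick illustration of the trend-versus-snapshot point in Don't #2 above, the sketch below fits a straight line to weekly VM counts to show where the environment is headed. The counts are made-up sample data; the same idea applies to virtual density, memory, or any other capacity metric.

```python
# A minimal sketch of trend-based (rather than snapshot) capacity reporting:
# fit a least-squares line to weekly VM counts to project growth.

weekly_vm_counts = [410, 418, 431, 440, 452, 465]  # one count per week

def linear_trend(ys):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(ys)
    xs = range(n)
    x_bar, y_bar = (n - 1) / 2, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
    return slope, y_bar - slope * x_bar

slope, intercept = linear_trend(weekly_vm_counts)
weeks_ahead = len(weekly_vm_counts) - 1 + 12
print(f"Growing by ~{slope:.1f} VMs/week; "
      f"projected count in 12 weeks: {slope * weeks_ahead + intercept:.0f}")
```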
My colleague Joel recently recorded a capacity planning video with tips and best practices while demonstrating up.time's capacity planning capabilities. You can find it on our YouTube channel at http://www.youtube.com/uptimesoftware or on our website.