Nagios – Opinions and Feedback

I have been using Nagios (http://www.nagios.org/) for several years, and I realized the other day that I have never formally written down my opinions on the system.

In a nutshell, I love Nagios. It works fantastically. At the same time, I am incredibly surprised that it does not offer more business insight, reporting, or analysis. To that end, I want to spend this article talking about what I wish I could get out of Nagios.

Why do I say this? Nagios is a basic monitoring system at heart, and it performs this function extremely well. The community that has built up Nagios has done a fantastic job of making it reliable and robust. However, that same community of brilliant engineers has probably never gone through an MBA program and has probably never had to perform an in-depth trend analysis on failure scenarios. Their mandate is to detect, respond, and repair.

If you are a senior-level operations manager, or anyone who needs to prevent issues before they happen, then you will probably agree that there are a lot of slick UI components and analysis tools that Nagios lacks.

For example, it seems rather simple and straightforward to build a tool that would analyze the history of service groups or host groups and report on failure trends over a period of time. This could then be correlated with hardware profiles, IP address ranges, data center locations, and lots of other data to determine whether there is an issue beyond the scope of a single service or host. Maybe the root cause is a batch of servers containing a memory module that has experienced a higher-than-normal rate of failures. Maybe the root cause is a misconfiguration on a router or switch that is causing adverse service effects several hops away.
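As a rough illustration of that kind of correlation, here is a minimal Python sketch. It assumes the failure history has already been exported from Nagios and joined with inventory data; the record layout, the host names, the batch labels, and the `failure_rate_by_batch` helper are all hypothetical and not part of Nagios itself.

```python
from collections import Counter

# Hypothetical failure records: (host, hardware_batch, data_center).
# Assumed to come from exported alert history joined with inventory data.
failures = [
    ("web01", "batch-A", "dc-east"),
    ("web02", "batch-A", "dc-east"),
    ("db01", "batch-B", "dc-west"),
    ("web03", "batch-A", "dc-west"),
]

# Hypothetical inventory: how many hosts belong to each hardware batch.
hosts_per_batch = {"batch-A": 10, "batch-B": 40}


def failure_rate_by_batch(failures, hosts_per_batch):
    """Return the number of failures per host for each hardware batch."""
    counts = Counter(batch for _host, batch, _dc in failures)
    return {
        batch: counts[batch] / total
        for batch, total in hosts_per_batch.items()
    }


rates = failure_rate_by_batch(failures, hosts_per_batch)

# Print batches from the highest failure rate to the lowest.
for batch, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{batch}: {rate:.3f} failures per host")
```

Normalizing by the number of hosts in each batch matters here: a batch with few hosts and several failures stands out even when a larger batch has a similar raw count. The same grouping could be done by data center or IP range by swapping the key.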

Another valuable add-on would be assigning a dollar value (either a static value or a dynamic one supplied by another service check or plugin) to a host or service. Let’s say that when service X is down, it costs the company $1K per minute. This simple correlation could be used to prioritize and triage incidents so that responders know which systems are most critical and which services should be brought back online first. It can also help when issues are escalated through the support tiers, giving responders insight into how severe an issue truly is or isn’t. This add-on will undoubtedly spark opinions, as you would be sharing financial data. Maybe instead you assign a nominal point system, with 5 being a high-revenue service and 1 being a low-revenue service.
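To make the triage idea concrete, here is a small Python sketch under stated assumptions: the `Incident` class, the `triage` function, the service names, and the dollar figures are all invented for illustration. Nagios does not provide this; the cost values would have to be maintained somewhere alongside the service definitions.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    cost_per_minute: float  # hypothetical dollar cost while the service is down
    minutes_down: int

    @property
    def cost_so_far(self) -> float:
        """Estimated money lost since the outage began."""
        return self.cost_per_minute * self.minutes_down


def triage(incidents):
    """Order open incidents so the most expensive outage comes first."""
    return sorted(incidents, key=lambda inc: inc.cost_per_minute, reverse=True)


# Hypothetical open incidents.
incidents = [
    Incident("internal-wiki", 5.0, 30),
    Incident("checkout", 1000.0, 4),
    Incident("search", 200.0, 10),
]

for inc in triage(incidents):
    print(
        f"{inc.service}: ${inc.cost_per_minute:,.0f}/min, "
        f"${inc.cost_so_far:,.0f} lost so far"
    )
```

If sharing real financial data is a concern, the `cost_per_minute` field could hold a 1 to 5 revenue score instead, and the same sort would still apply.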

I don’t necessarily like reports and graphs that simply run off of the total number of alerts or failed services. If I have 1,000 alerts, that does not mean much if I have 500 service checks on a single node. That would mean that only 2 servers are affected. Rather, I see a lot of value in being able to establish a multi-level alerting structure in which alerts are classified on a 1-5 priority scale. This would allow a user to deploy Nagios for more elaborate preventative and early-warning applications.
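The arithmetic above can be sketched in a few lines of Python. The alert tuples, node names, and the way priorities are assigned are made up for the example; the point is only that counting distinct hosts and grouping by priority tells a different story than the raw alert total.

```python
from collections import Counter

# Hypothetical alerts: (host, check_name, priority on a 1-5 scale).
# Two nodes with 500 failing checks each, priorities assigned arbitrarily.
alerts = [
    (node, f"check{i}", i % 5 + 1)
    for node in ("node01", "node02")
    for i in range(500)
]


def summarize(alerts):
    """Return total alerts, number of affected hosts, and alerts per priority."""
    hosts = {host for host, _check, _priority in alerts}
    by_priority = Counter(priority for _host, _check, priority in alerts)
    return len(alerts), len(hosts), by_priority


total, affected_hosts, by_priority = summarize(alerts)

print(f"Total alerts: {total}")
print(f"Affected hosts: {affected_hosts}")
for priority in sorted(by_priority, reverse=True):
    print(f"Priority {priority}: {by_priority[priority]} alerts")
```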

So, without ranting and raving for too long, I should summarize what I am trying to get across. Nagios is a well-developed, mature open source project that offers the community an incredibly robust and powerful monitoring solution. However, it needs a good facelift from the MBA graduate community to get better analytics and high-level reporting.
