The normal definition of NMS is "Network Management System". This is nice and easy to say, but very hard to pin down to an exact specification. What constitutes a well-rounded NMS?
I believe it to consist of at least:
There are a large number of Free/Open Source Software and commercial systems that claim to be NMSes, but none come close to covering all this functionality. Typically, systems fall into either a Network Monitoring (Up/Down) or Network Reporting role, not both.
The types of systems available can be crudely categorized into three distinct generations:
Event correlation is the core functionality of an NMS. Without it, too many redundant or false positive alerts are generated, which makes the system ineffective.
Root Cause Analysis takes event correlation a step further. Rather than just dampening alerts from nodes downstream of an existing problem, it alerts only on the real cause of a problem, which significantly reduces the time needed for a fix.
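The difference between flooding the operator and alerting on the real cause can be sketched in a few lines. This is only an illustration, assuming the NMS already knows a tree-shaped topology in which every node has a single upstream parent; the node names, the `PARENT` map, and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each node maps to its upstream parent.
# None marks the node closest to the NMS. Assumed to be a tree (no loops).
PARENT: Dict[str, Optional[str]] = {
    "core-rtr": None,
    "dist-sw1": "core-rtr",
    "access-sw1": "dist-sw1",
    "web01": "access-sw1",
    "web02": "access-sw1",
    "dist-sw2": "core-rtr",
    "db01": "dist-sw2",
}


def root_causes(down: Set[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the down nodes that have no down node anywhere upstream."""
    causes = []
    for node in sorted(down):
        upstream = parent.get(node)
        masked = False
        # Walk towards the NMS; stop at the first upstream node that is down.
        while upstream is not None:
            if upstream in down:
                masked = True
                break
            upstream = parent.get(upstream)
        if not masked:
            causes.append(node)
    return causes


def suppressed(down: Set[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the down nodes whose alerts are hidden behind a root cause."""
    return sorted(down - set(root_causes(down, parent)))


down_now = {"access-sw1", "web01", "web02", "db01"}
print("alert on:", root_causes(down_now, PARENT))
print("suppress:", suppressed(down_now, PARENT))
```

With the sample data, the two web servers are reported as consequences of the access switch failure, while the database server, which sits behind a healthy path, is still alerted on in its own right. Real networks have redundant paths, so a production engine would need a graph rather than a tree.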
Currently, a typical NMS platform consists of two main systems: one doing the Up/Down monitoring, the other the reporting. This leads to extremely inefficient double polling of devices. Why ping a host to see if it's up when you've just gathered interface stats from it? Some systems can be integrated to help reduce this double polling, but only a single NMS solution will truly provide the most efficient use of the network.
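One way to picture a single poll serving both roles is the sketch below. It assumes a hypothetical `collect` callable that gathers interface counters and raises `OSError` (which includes `TimeoutError`) when the device does not answer, and a hypothetical `probe` callable standing in for a ping. Neither is a real library call; both are injected so the idea can be shown without touching the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class PollResult:
    host: str
    up: bool
    stats: Optional[Dict[str, int]]


def poll_once(
    host: str,
    collect: Callable[[str], Dict[str, int]],
    probe: Callable[[str], bool],
) -> PollResult:
    """Gather stats and derive up/down state from the same request."""
    try:
        stats = collect(host)
    except OSError:
        # Collection failed: only now spend an extra probe, to tell
        # "host down" apart from "management agent not answering".
        return PollResult(host, up=probe(host), stats=None)
    # A successful stats collection already proves the host is reachable,
    # so no separate ping is sent.
    return PollResult(host, up=True, stats=stats)
```

The reachability check costs nothing extra in the common case, and the second packet is sent only when the first request fails.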
The traditional NMS provides a network map so that operators can point and click through to any problems. Some systems have dropped this functionality, claiming that operators only really need to be told what the real problems are. These are typically Second Generation event correlation engines that just provide a list of problems for the operator.
However, no matter how advanced the logic is in an NMS, it cannot cover all problems, and providing a visual representation for operators to work with can provide major gains. The human brain works best with visual images rather than the written word. NMSes need a map!
Aside from alerting (via email, SMS, etc.), how should an NMS interface with the operators? There are two distinct camps: dedicated GUI and HTTP. A growing number of HTTP interfaces (typically with some Java thrown in) are being used.
While this type of interface may have its uses, it is not the best medium in an operational environment. A dedicated GUI is the only way to provide a fast, efficient, reliable mechanism for operators to interact with an NMS.
The M stands for Management, but what's being managed, exactly? Network problems, mainly. A single generic management interface is somewhat of a holy grail that some people have been chasing. Is it achievable?
How far should management be taken? Many vendors have proprietary management software for their systems to provide an alternative to the command line. Should an NMS allow full management of a device without having to resort to a CLI? Some things can be done easily by SNMP, but what interaction should an NMS have with a device's CLI? RANCID provides an easy change management system for routers, but it also shows the possibilities of integrating into an NMS functions that are typically done at the CLI level.
Think of being able to make mass changes (for example, SNMP community changes) via a few clicks on a GUI, rather than having to log in to thousands of devices manually.
I'll mention the commercial solutions as well, as they typically have far better Man-Machine Interfaces. This is a typical problem with F/OSS, as programmers don't usually make good UI engineers.
The beauty of F/OSS is that we have a huge, growing repository of code. So, do we start coding the "perfect" NMS from scratch, or use the tools already provided and just integrate the functionality we require?
Some commercial vendors make big claims about how their code is multi-threaded and "industrial strength". Producing good, clean, efficient code that will run on many platforms and is part of a large system is hard to do! Such a large, complicated system can also be extremely hard for new coders to get into. Keeping the functionality compartmentalized into small programs can ease these problems. This ties in well with using existing toolsets and concentrating on integration.
OSSIM is taking the integration approach, and it is well worth watching how well this works. Obviously, the double polling issue rears its head here, and would be a serious limiting factor in any large implementation. Although OSSIM comes from a security requirements background, it offers an example for the creation of a "proper" F/OSS NMS. How much work is involved in integrating Nagios with RRDtool? Could the cheops-ng GUI be used as the frontend for Nagios?
How do our original NMS requirements map to existing F/OSS projects?
|Up/Down Monitoring:|Nagios, Big Brother|
|Configuration Change Management:|RANCID|
Most of the functionality is covered across a number of projects. This only leaves Root Cause Analysis. Unfortunately, this is probably one of the hardest things to do.
First generation NMSes like Nagios and Big Brother rely on polling, via ICMP or an application-specific method (HTTP, FTP, etc.), to do their up/down monitoring. Unfortunately, this really isn't network management. It's just node polling, and has major disadvantages.
To poll means there is a polling interval. What is the state of your network during these intervals? Actively polling the network is also a major scalability problem. The larger your network, the more polling required. Active polling systems are fine for monitoring a handful of systems, but to manage a network, you have to look at other mechanisms.
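A rough back-of-the-envelope calculation shows how both problems grow. The model below is a simplification with assumed numbers: every node gets the same number of checks per interval, and a failure is only declared after a fixed number of retries. The function names and the sample figures are illustrative, not taken from any particular NMS.

```python
def polls_per_second(nodes: int, checks_per_node: int, interval_s: float) -> float:
    """Average request rate the poller has to sustain."""
    return nodes * checks_per_node / interval_s


def worst_case_detection_delay(
    interval_s: float, retries: int, retry_interval_s: float
) -> float:
    """Longest time a failure can go unreported.

    Assumes the node fails just after a successful poll, so a full interval
    passes before the first failed check, followed by the confirming retries.
    """
    return interval_s + retries * retry_interval_s


# Assumed figures: 5 checks per node, polled every 5 minutes.
for nodes in (50, 2000, 20000):
    rate = polls_per_second(nodes, checks_per_node=5, interval_s=300)
    print(f"{nodes:>6} nodes: {rate:8.1f} polls/s")

delay = worst_case_detection_delay(interval_s=300, retries=3, retry_interval_s=60)
print(f"worst-case detection delay: {delay:.0f} s")
```

Shortening the interval improves the detection delay but raises the request rate in direct proportion, which is the trade-off that pushes larger networks towards event-driven mechanisms.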
This is where systems such as OpenNMS and JFFNMS come in. These are real-time, event-driven systems. Events typically come from SNMP traps, but can come from other sources such as syslog. There is no polling interval as such in these systems. If a node goes down, the switch it is attached to generates an SNMP trap immediately. You now have true real-time network monitoring.
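The event-driven model amounts to updating a state table whenever an event arrives, instead of asking every node on a timer. The sketch below assumes the traps have already been received and decoded into simple tuples; the tuple layout, the source labels, and the `apply_events` function are hypothetical. `linkDown` and `linkUp` are the standard SNMP trap names, and any other event kind is ignored here.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (timestamp in seconds, source label, event kind); layout is an assumption.
Event = Tuple[float, str, str]

KIND_TO_STATE = {"linkDown": "down", "linkUp": "up"}


def apply_events(
    events: Iterable[Event], state: Optional[Dict[str, str]] = None
) -> Tuple[Dict[str, str], List[Tuple[float, str, str]]]:
    """Fold events into a state table and report only real state changes."""
    current = dict(state or {})
    changes = []
    for ts, source, kind in sorted(events):
        new_state = KIND_TO_STATE.get(kind)
        if new_state is None:
            continue  # Not a link event; a fuller system would route it elsewhere.
        if current.get(source) != new_state:
            current[source] = new_state
            changes.append((ts, source, new_state))
    return current, changes


events = [
    (10.0, "sw1:port7", "linkDown"),
    (10.0, "sw1:port7", "linkDown"),  # duplicate trap, no new change
    (42.5, "sw1:port7", "linkUp"),
    (50.0, "sw1", "coldStart"),       # ignored by this sketch
]
state, changes = apply_events(events)
print(state)
print(changes)
```

Traps are sent over UDP and can be lost, so event-driven systems usually keep a slow background poll as a safety net.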
Of course, SNMP traps are typically not generated on application failures. Most NMSes will resort back to polling to monitor applications.
It would be nice to see better support for enterprise/carrier-grade functionality in F/OSS NMSes, such as support for bulkstats, NetFlow, and RCA.
However, there is something I have not seen either F/OSS or commercial systems using: Host/Network sniffing. Having a local host-based sniffer or a dedicated sniffer on a mirrored switch port could yield enormous gains for NMSes:
Developing an NMS-centric pcap-based sniffer seems like the way forward. It could be easily integrated with current systems by being developed separately, and just generating SNMP traps when required.
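As an illustration of what such a sniffer might decide, the sketch below flags hosts that have stopped sending traffic, without ever polling them. It deliberately leaves out the capture itself and the trap generation: the input is assumed to be `(timestamp, source address)` pairs that a pcap-based capture loop would produce, and the function names, addresses, and threshold are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Observation = Tuple[float, str]  # (timestamp in seconds, source address)


def last_seen(observations: Iterable[Observation]) -> Dict[str, float]:
    """Record the most recent time each source address was seen on the wire."""
    seen: Dict[str, float] = {}
    for ts, src in observations:
        if ts > seen.get(src, float("-inf")):
            seen[src] = ts
    return seen


def silent_hosts(seen: Dict[str, float], now: float, max_silence_s: float) -> List[str]:
    """Return hosts that have been quiet for longer than the threshold."""
    return sorted(src for src, ts in seen.items() if now - ts > max_silence_s)


# Sample observations as a capture loop might deliver them (made-up data).
packets = [
    (100.0, "10.0.0.5"),
    (101.0, "10.0.0.9"),
    (130.0, "10.0.0.5"),
    (185.0, "10.0.0.5"),
]
seen = last_seen(packets)
for host in silent_hosts(seen, now=190.0, max_silence_s=60.0):
    # This is the point where a separately developed sniffer would raise a trap.
    print("no traffic seen recently from", host)
```

A quiet host is not necessarily a dead one, so a check like this is better treated as a hint for the correlation engine than as an alert in its own right.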