When Less is More

One of the things I’ve noticed in my years of deploying network management solutions is that people can get really excited when they go from having no visibility into their network to being able to see in great detail what’s happening, as when they deploy OpenNMS. The problem then changes from having no information to having too much.

Network geeks like myself tend to be loath to turn off certain alerts, but sometimes that can be the best thing for an organization.

When OpenNMS was started, workflow was based on events. Events appeared in the browser, events triggered notices, you could acknowledge events – pretty much everything was events. But events can be noisy, especially if you leverage the SNMP trap capability of many devices. This is why we implemented the alarms subsystem. Alarms can take many events and reduce them into a single alarm. Alarm processing can be automated to ensure that issues that are important are escalated, and issues that have been cleared can be removed. The alarms list is supposed to be a “to do” list for the NOC staff.
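To make the idea of reducing many events into a single alarm concrete, here is a minimal Python sketch. It is not OpenNMS code or configuration: the `ReducedAlarm` class, the `reduce_events` function, and the choice of grouping key (event type plus node) are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReducedAlarm:
    key: str                # hypothetical grouping key: event type plus node
    count: int = 0          # how many events were folded into this alarm
    last_message: str = ""  # text of the most recent event


def reduce_events(events):
    """Collapse (event_type, node, message) tuples into one alarm per key."""
    alarms = {}
    for event_type, node, message in events:
        key = f"{event_type}:{node}"
        alarm = alarms.setdefault(key, ReducedAlarm(key=key))
        alarm.count += 1
        alarm.last_message = message
    return alarms


# Three noisy events (made-up data) end up as two alarms.
events = [
    ("linkDown", "router1", "ge-0/0/1 down"),
    ("linkDown", "router1", "ge-0/0/1 still down"),
    ("linkDown", "router2", "eth0 down"),
]
alarms = reduce_events(events)
for alarm in alarms.values():
    print(alarm.key, alarm.count)
```

The point is that the NOC staff sees one line per problem, with a counter, instead of one line per trap.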

In order to make that happen, it is a good idea to consider each alarm in your system and ensure that it is “actionable”. Each alarm comes with two fields for tracking the resolution progress, and these can be used to document the actions taken to fix the alarm.

The “Sticky” memo field is used to annotate a particular instance of an alarm. For example, suppose there was a “link down” alarm due to a circuit being cut by a backhoe. The NOC engineer would be able to note that the repair was in process and maybe even include a case number. Once the issue is resolved the sticky memo goes away.

The “Journal” memo field is permanently associated with the alarm. This is for notes that could be useful the next time the alarm happens, such as “Contact Jim – he knows how to fix this”, etc.

Alarms can be acknowledged, which will remove them from the list of current issues. It is pretty easy to create an automation that can unacknowledge an alarm if it hasn’t been cleared in a particular amount of time. Thus you can automate “reminders” that the issue is still outstanding.
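The “reminder” logic can be sketched in a few lines of Python. Again, this is only an illustration of the idea and not how an OpenNMS automation is actually written: the `TrackedAlarm` class, the `unacknowledge_stale` function, and the four hour threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TrackedAlarm:
    description: str
    acked_at: Optional[datetime] = None  # None means "not acknowledged"
    cleared: bool = False


def unacknowledge_stale(alarms, now, max_age):
    """Un-acknowledge alarms acked more than max_age ago that are still not cleared."""
    reminded = []
    for alarm in alarms:
        if alarm.cleared or alarm.acked_at is None:
            continue  # nothing to remind anyone about
        if now - alarm.acked_at > max_age:
            alarm.acked_at = None  # puts it back on the list of current issues
            reminded.append(alarm)
    return reminded


# Made-up data: only the first alarm is both stale and still outstanding.
now = datetime(2024, 1, 1, 12, 0)
alarms = [
    TrackedAlarm("link down", acked_at=now - timedelta(hours=5)),
    TrackedAlarm("fan failure", acked_at=now - timedelta(minutes=30)),
    TrackedAlarm("disk full", acked_at=now - timedelta(hours=9), cleared=True),
]
reminded = unacknowledge_stale(alarms, now, timedelta(hours=4))
print([a.description for a in reminded])
```

Run on a schedule, something like this keeps an acknowledged alarm from quietly turning into “it always does that”.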

This doesn’t discount the value of events. In OpenNMS, events have become more like log messages. When an alarm happens on a particular node, that node’s page will reflect the events associated with it, which may shed some light onto the problem. But having too many events appear as alarms can overwhelm the NOC staff to the point that they stop using the system.

Unfortunately, the best way of dealing with network issues often involves trial and error. By limiting alarms it is possible to miss something important, but once that happens, alarms can be created to ensure it doesn’t happen again. The opposite, dumping too much information into alarms, will guarantee that alarms are ignored, greatly increasing the chance that something important will be missed.

I developed my alarm philosophy during my first network management deployment in the early 1990s. I was consulting for a cellular provider and installing HP OpenView Network Node Manager (version 2.2 I believe) and they had me working in the server room. Besides being a bit cold, the room had a large UPS in the corner that was constantly beeping.

Beep … Beep … Beep

I asked Avery, the guy I was working with, what was wrong, and he replied, “Oh, it always does that”. At that very moment I decided that if there is an alarm and you don’t do anything to resolve it, just turn it off.

Just remember that OpenNMS is a platform and thus you get to make a lot of the decisions on how best to get it to work with your organization. Consider that when deciding which events to turn into alarms, and then focus on using automations to ensure that the most important issues are treated as such.