Nifty Trick for Auto-Acknowledging Notices

OpenNMS has a built-in notification system that can act like a mini trouble-ticketing system. It’s triggered by events. Once an event creates a notice, it “walks” a destination path where various actions can occur (send an e-mail, send a page, call a phone number, etc.).

The path has various “targets” and there is an escalation delay between each one. If the notice goes unacknowledged, the next set of targets is triggered.

Each path also has an “initial delay”. This is some amount of time where no notices are sent. It is useful combined with a feature that allows certain events (specifically nodeUp, interfaceUp and nodeRegainedService) to automatically acknowledge corresponding “down” events.

I hate getting a page at 2am that something is down. What I hate more is when a minute later I get a second page that the problem has been resolved. To minimize alerting on these transient outages, I use an initial delay of two minutes (or more) for most of my paths. This allows OpenNMS to make multiple attempts to see if the problem is resolved before notifying me.
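
For illustration, a path with a two-minute initial delay might look something like this in destinationPaths.xml. The path name, the target names and the 15-minute escalation are made up for the example, and javaEmail is assumed to be one of the notification commands defined on your system.

```xml
<!-- Hypothetical path: names and intervals are illustrative only -->
<path name="Page-OnCall" initial-delay="2m">
  <!-- First target, notified only after the initial delay has passed -->
  <target>
    <name>OnCall</name>
    <command>javaEmail</command>
  </target>
  <!-- If the notice is still unacknowledged 15 minutes later, escalate -->
  <escalate delay="15m">
    <target>
      <name>Admin</name>
      <command>javaEmail</command>
    </target>
  </escalate>
</path>
```

If the matching “up” event arrives within those two minutes, the notice is acknowledged before the first target is ever contacted.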

Now, there are some issues with the notification system. First, it is triggered by events. This can be a pain for, say, a pesky trap that comes in every minute. Those can be reduced using the alarm sub-system within OpenNMS, so the plan is to eventually move the notification system to trigger on alarms. Since we already have a trouble ticket API built in to alarms, rather than modify notifd we plan to move that functionality to a separate product in the future.

The second is that this auto-acknowledgement feature only works if each event can be refined with a nodeid, interface or service name. If not, there is no way to differentiate similar events, and thus the system may acknowledge the wrong notices.
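
To make the matching concrete, the built-in feature is driven by auto-acknowledge entries in notifd-configuration.xml. A sketch of one such entry, assuming the stock nodeLostService/nodeRegainedService pair, looks roughly like this (check your own file for the exact entries in your version):

```xml
<!-- When the "uei" event arrives, acknowledge outstanding notices for the
     "acknowledge" event, but only where every <match> field is the same -->
<auto-acknowledge uei="uei.opennms.org/nodes/nodeRegainedService"
                  acknowledge="uei.opennms.org/nodes/nodeLostService">
  <match>nodeid</match>
  <match>interfaceid</match>
  <match>serviceid</match>
</auto-acknowledge>
```

An event that carries none of these fields, such as a trap identified only by its parameters, gives the match nothing to work with.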

I just finished working on a support ticket to implement this auto-acknowledgement functionality using alarms, and I thought I’d share my solution.

First, the events in question have to become alarms. This is done by adding an <alarm-data> tag to each event. For the “down” event it should look something like this:

<alarm-data
  reduction-key="%uei%:%dpname%:%parm[#1]%"
  alarm-type="1"
  auto-clean="false"/>

Note that I just chose a generic %parm[#1]% to use the first event parameter to uniquely identify this event, but it could be any parameter, or a combination of parameters, depending on the event in question.

The up would thus look like:

<alarm-data
  reduction-key="%uei%:%dpname%:%parm[#1]%"
  clear-key="[UEI of Down Event]:%dpname%:%parm[#1]%"
  alarm-type="2"
  auto-clean="false"/>

Note the addition of the “clear-key”. This must match the reduction key of the “down” event.
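
Put in context, a pair of event definitions might look roughly like the sketch below. The UEIs, labels, messages and severities are hypothetical and stand in for whatever your real events use; only the <alarm-data> tags come from the snippets above.

```xml
<!-- Hypothetical "down" event: the UEI and text are made up for illustration -->
<event>
  <uei>uei.example.org/traps/widgetDown</uei>
  <event-label>Example: widget down</event-label>
  <descr>Widget %parm[#1]% is down.</descr>
  <logmsg dest="logndisplay">Widget %parm[#1]% is down.</logmsg>
  <severity>Major</severity>
  <alarm-data
    reduction-key="%uei%:%dpname%:%parm[#1]%"
    alarm-type="1"
    auto-clean="false"/>
</event>

<!-- Hypothetical "up" event: its clear-key spells out the down event's UEI -->
<event>
  <uei>uei.example.org/traps/widgetUp</uei>
  <event-label>Example: widget up</event-label>
  <descr>Widget %parm[#1]% is back up.</descr>
  <logmsg dest="logndisplay">Widget %parm[#1]% is back up.</logmsg>
  <severity>Normal</severity>
  <alarm-data
    reduction-key="%uei%:%dpname%:%parm[#1]%"
    clear-key="uei.example.org/traps/widgetDown:%dpname%:%parm[#1]%"
    alarm-type="2"
    auto-clean="false"/>
</event>
```

Because both keys end in the same %dpname% and %parm[#1]% values, the clear-key of an “up” for a given widget expands to exactly the reduction key of that widget’s “down” alarm.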

Once these alarms are in place, it is pretty simple to create an automation.

Edit vacuumd-configuration.xml and add an automation:

<automation name="ackMyNotices"
            interval="60000"
            active="true"
            trigger-name="selectMyUpAlarms"
            action-name="ackMyDownNotices" />

with a trigger and an action. The trigger should fire when the “up” alarm is generated:

<trigger name="selectMyUpAlarms" operator=">=" row-count="1" >
  <statement>
    SELECT *, now() AS _ts
    FROM alarms
    WHERE eventuei = '[UEI of Up Event]'
  </statement>
</trigger>

So if there is at least one “up” alarm, we’re good.

Finally, perform the action:

<action name="ackMyDownNotices" >
  <statement>
    UPDATE notifications
      SET answeredby='auto-acknowledged', respondtime=now()
      WHERE eventid = (SELECT lasteventid FROM alarms WHERE reductionKey = ${clearKey})
      AND answeredby is null
      AND pagetime &lt; ${lastEventTime}
  </statement>
</action>

This will update the entry in the notifications table to show the notice as being “auto-acknowledged” if it isn’t acknowledged already. The tricky bit is the subselect, where the lasteventid of the down alarm is matched with the eventid used for the notice.

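Note that this works best with alarms that aren’t reduced, i.e. there aren’t multiple downs before the up event. When a down alarm is reduced, its lasteventid points at the most recent down event, so notices created by the earlier downs won’t match.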
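
For orientation, here is roughly where the three pieces sit in vacuumd-configuration.xml. This is only a skeleton, assuming the usual layout with <automations>, <triggers> and <actions> wrapper elements; the period value is an assumption, and your file will already contain other entries that should be left alone.

```xml
<VacuumdConfiguration period="86400000">
  <!-- Existing <statement> entries stay as they are -->

  <automations>
    <!-- <automation name="ackMyNotices" .../> goes here -->
  </automations>

  <triggers>
    <!-- <trigger name="selectMyUpAlarms" ...> goes here -->
  </triggers>

  <actions>
    <!-- <action name="ackMyDownNotices"> goes here -->
  </actions>
</VacuumdConfiguration>
```

The automation ties the other two together by name, so trigger-name and action-name have to match the name attributes of the trigger and action exactly.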