Ever Wonder What Your Support Dollar Buys?

One of the hardest parts of our business is justifying the purchase of support and services for our open source product. Shouldn’t it “just work”? If I have a bunch of smart people working for me, shouldn’t I be able to figure this out on my own?

The issue is that with a product as powerful and complex as OpenNMS, quite often it is something outside of the application that is causing the problem. For example, I am visiting a large telecom provider this week, and we spent part of my time figuring out a complex issue with syslogs. It had nothing to do with OpenNMS and everything to do with their various devices each sending in logs with different formats (sometimes two or more formats from the same type of equipment). I doubt any solution without the flexibility of OpenNMS would have been able to solve it, and the customer told me today, “We couldn’t have gotten this to work without you.”

Also, today Jeff was dealing with a support issue with another one of our clients. Their large provisioning import was never completing. He dug around and posted this reply:

Watching the logs with Provisiond turned up to DEBUG, I noticed a single pattern accounting for nearly all the messages in provisiond.log:

2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.1
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.2
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.3
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.4
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.5
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.6
2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SnmpWalker: Sending tracker pdu of size 6
2012-10-17 19:20:40,090 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]SnmpWalker: Received a tracker PDU of type RESPONSE from /172.22.66.210 of size 0, errorStatus = 0, errorStatusText = Success, errorIndex = 0

It’s the same six objects being requested over and over from ten hosts, all of which appear to be Eltek Valere power units. See there how the size of the RESPONSE PDU is listed as zero? That seemed odd, so I captured some of the SNMP traffic and loaded up the dump into Wireshark.

These devices are replying to our BULK-GETs with response PDUs containing no varbinds, but also indicating no errors, which is silly and seems to send our SnmpWalker class into an infinite loop. You can reproduce this problem using the Net-SNMP snmpbulkget utility:

snmpbulkget --verbose -v2c -c public -Cn6 -Cr1 172.22.66.106 .1.3.6.1.2.1.1.1 .1.3.6.1.2.1.1.2 .1.3.6.1.2.1.1.3 .1.3.6.1.2.1.1.4 .1.3.6.1.2.1.1.5 .1.3.6.1.2.1.1.6

Falling back to SNMPv1 and GET-NEXT seems to elicit a valid response. So I’ve done that for these ten nodes.
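To make the failure mode in Jeff's reply concrete, here is a minimal Python sketch of a walk loop that refuses to spin on a response reporting success but carrying no varbinds. This is not the OpenNMS SnmpWalker code; `Response`, `EmptyResponseError`, `walk`, and `send_request` are all hypothetical names invented for illustration, and the network layer is replaced by a plain function.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for a decoded SNMP response PDU."""
    error_status: int = 0
    varbinds: list = field(default_factory=list)  # (oid, value) pairs


class EmptyResponseError(Exception):
    """Raised when an agent reports success but returns no varbinds."""


def walk(send_request, start_oid, max_rounds=1000):
    """Collect varbinds by repeatedly asking for what follows the last OID seen.

    send_request is an assumed callable taking an OID string and returning
    a Response. max_rounds is a second safety net against endless loops.
    """
    results = []
    current = start_oid
    for _ in range(max_rounds):
        response = send_request(current)
        if response.error_status != 0:
            break  # the agent signalled an error or the end of the data
        if not response.varbinds:
            # Success with nothing in it: without this check the loop would
            # ask for the same OID again and never make progress.
            raise EmptyResponseError(f"no varbinds returned after {current}")
        results.extend(response.varbinds)
        current = response.varbinds[-1][0]
    return results


def broken_agent(oid):
    # Mimics the devices described above: errorStatus 0, zero varbinds.
    return Response(error_status=0, varbinds=[])


try:
    walk(broken_agent, ".1.3.6.1.2.1.1.1")
except EmptyResponseError as exc:
    print(f"gave up: {exc}")
```

The point of the sketch is only that an empty but "successful" response has to be treated as a stop condition; what the right reaction is (raise, log, fall back to another SNMP version) is a design decision.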

This is an example of why we at The OpenNMS Group only hire highly experienced people. How long would it have taken the average person to figure out that, out of 7000+ devices, these ten were the culprits, and then to come up with a workaround?
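For readers curious how the offending hosts could be picked out of a log like the one above, here is a small Python sketch that tallies, per host, the responses reporting size 0 with errorStatus 0. It is not what Jeff actually ran; the regular expression is based only on the log lines quoted in his reply, and the function name `count_empty_responses` is made up for this example.

```python
import re
from collections import Counter

# Matches the SnmpWalker "Received a tracker PDU" lines quoted in the reply.
RESPONSE_RE = re.compile(
    r"Received a tracker PDU of type RESPONSE from /(?P<host>[\d.]+) "
    r"of size (?P<size>\d+), errorStatus = (?P<status>\d+)"
)


def count_empty_responses(lines):
    """Count, per host, responses with zero varbinds and errorStatus 0."""
    counts = Counter()
    for line in lines:
        match = RESPONSE_RE.search(line)
        if match and match["size"] == "0" and match["status"] == "0":
            counts[match["host"]] += 1
    return counts


sample = [
    "2012-10-17 19:20:40,087 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]"
    "SnmpWalker: Sending tracker pdu of size 6",
    "2012-10-17 19:20:40,090 DEBUG [DefaultUDPTransportMapping_127.0.0.1/0]"
    "SnmpWalker: Received a tracker PDU of type RESPONSE from /172.22.66.210 "
    "of size 0, errorStatus = 0, errorStatusText = Success, errorIndex = 0",
]

# Hosts with the most empty "successful" responses come first.
print(count_empty_responses(sample).most_common())
```

Against a real provisiond.log you would pass the open file object instead of `sample`. Knowing to look for that pattern in the first place is, of course, the hard part.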

This customer has a bunch of talented people working for them, but they are focused on that client’s business and aren’t as expert on OpenNMS as we are. They might have figured this out, but it could have taken days. Beyond the salary cost of that time, the business would suffer because the solution wouldn’t be working. This case on its own probably justified the cost of support.

Now someone might say that a commercial product wouldn’t have suffered from this problem. I find that hard to believe, as any device that abuses the standard to this degree would cause problems for any application. Plus, commercial vendors view support as a cost center, not a revenue stream, and the chance that you would have gotten someone knowledgeable on the first try is slim. So you are back to wasting time, and time is money.