In one of my past lives, I was actually training to be a chemist. I remember pretty much only one thing from those days, and that was an exchange between a professor and myself.
The professor asked me “What does a thermometer measure?” I thought the answer was obvious: temperature. “The temperature of what?” he replied. Hmmm, whatever it is in? He answered, “A thermometer measures one thing – the temperature of the thermometer.”
The point he was trying to make involved techniques to ensure that the temperature of the thermometer and the temperature of what you wanted to measure were close if not the same. But I learned that any measuring, or in our case monitoring, system used is constrained to measure only that which is known to it.
Take ICMP response time in OpenNMS, for example. I was real happy my “pings” were on the order of milliseconds, and OpenNMS reported round-trip times on the order of milliseconds. Way to go OpenNMS.
However, it was pointed out to me that the times reported by OpenNMS didn’t necessarily match up with reality. I’d never really looked at it before. I was expecting a number on the order of a millisecond or less, and I got numbers on the order of milliseconds.
But take a command line tool like ping and use it on the OpenNMS system. Local RTT for me is around 50 microseconds. OpenNMS reports an average of 300 microseconds, with peaks of 4-6 milliseconds. And notice that when OpenNMS is first started or under load, the times get longer.
I spent two days trying to figure this out. The problem is that no Java API exists for ICMP, so you have to use an outside program. We use code written in C and accessed via JNI to perform “ping-like” functions.
But unlike other pollers, where we make a request and wait for the answer, with pings we send an ECHO_REQUEST packet, and then another process examines the ECHO_REPLY packets that get returned (as well as all other ICMP traffic, which the program discards). This ReplyReceiver process calls the C code to check for any new ECHO_REPLY packets.
Here’s where the problem lies. We send the system time out with the ECHO_REQUEST packet, and the ECHO_REPLY packet returns that time. When that packet is received, we take another sample of the system time, and the difference is the RTT. But because we don’t check immediately when the reply packet is received, there is a “lag” between when the packet is really received and when OpenNMS marks it as received. This difference can vary under system load, and is why the RTT for ICMP is so off.
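To make the lag concrete, here is a small sketch of the arithmetic. It is not the OpenNMS code: the class name, the method names, and the idea that the receiver looks for replies at a fixed interval are all assumptions made for illustration. The only thing taken from the description above is that the reply gets stamped when the receiver gets around to it, not when it arrives.

```java
// Hypothetical sketch, not OpenNMS code. Times are in microseconds on an
// arbitrary clock; the fixed polling interval is an assumption.
class PingLagSketch {

    // RTT you would get if the reply were stamped the instant it arrived.
    static long trueRttMicros(long sendMicros, long arrivalMicros) {
        return arrivalMicros - sendMicros;
    }

    // RTT recorded by a receiver that only checks for replies at multiples
    // of pollIntervalMicros. The reply is stamped at the first check at or
    // after its real arrival, so the wait for that check is added to the RTT.
    static long observedRttMicros(long sendMicros, long arrivalMicros,
                                  long pollIntervalMicros) {
        long checks = (arrivalMicros + pollIntervalMicros - 1) / pollIntervalMicros;
        long stampedMicros = checks * pollIntervalMicros;
        return stampedMicros - sendMicros;
    }

    public static void main(String[] args) {
        long send = 1_000;    // timestamp carried in the ECHO_REQUEST
        long arrival = 1_050; // when the ECHO_REPLY really came back
        System.out.println("true RTT: " + trueRttMicros(send, arrival) + " us");

        // A slower receiver (think: a loaded system) reports a longer RTT
        // for exactly the same packet.
        long[] intervals = {1, 100, 300, 5_000};
        for (long interval : intervals) {
            System.out.println("check every " + interval + " us -> observed RTT "
                    + observedRttMicros(send, arrival, interval) + " us");
        }
    }
}
```

The packet is the same in every row; only the receiver's schedule changes. That is the thermometer measuring itself.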
Note that this doesn’t affect other pollers, just ICMP.
I am not sure that this can be fixed. The way to fix it would be to send the ECHO_REQUEST packet and then wait for the reply, all within the same process. But you can probably see how this would cause problems on large networks. Packets get lost, retries have to be attempted, etc. We would have to write pretty much the whole ICMP management piece in C and then loosely tie it back to Java, instead of doing most of it in Java and using a bare minimum of C. Think this is trivial? Check out all of the OS-specific code in IcmpSocket.c and you’ll see what I am talking about.
Anyway, native ICMP support is not even an option until Java 1.5 (maybe). So I think for now I may remove ICMP response times from the default OpenNMS install. They can still be useful as a number that goes down when things are good and up when things are bad, but since it really isn’t measuring temperature, we shouldn’t call it a thermometer.