Choose the Right Thermometer

Okay, so I have a love/hate relationship with Centurylink. Centurylink provides a DSL circuit to my house. I love the fact that I have something resembling broadband with 10Mbps down and about 1Mbps up. Now that doesn’t even qualify as broadband according to the FCC, but it beats the heck out of the alternatives (and I am jealous of my friends with cable who have 100Mbps down or even 300Mbps).

The hate part comes from reliability, which lately has been crap. This post is actually focused on OpenNMS so I won’t go into all of my issues, but I’ve been struggling with long outages in my service.

The latest issue is a new one: packet loss. Usually the circuit is either up or completely down, but for the last three days I’ve been having issues with a large percentage of dropped packets. Of course I monitor my home network from the office OpenNMS instance, and this will usually manifest itself with multiple nodeLostService events around HTTP since I have a personal web server that I monitor.

The default ICMP monitor does not measure packet loss. As long as at least one ping reply makes it, ICMP is considered up, so the node itself remains up. OpenNMS does have a monitor for packet loss called Strafeping. It sends out 20 pings in a short amount of time and then measures how long they take to come back. So I added it to the node for my home and I saw something unusual: a consistent 19 out of 20 lost packets.

Strafeping Graph

Power cycling the DSL modem seems to correct the problem, and the command line ping was reporting no lost packets, so why was I seeing such packet loss from the monitor? Was Strafeping broken?

While it is always a possibility, I didn’t think that Strafeping was broken, but I did check a number of graphs for other circuits and they looked fine. Thus it had to be something else.

This brings up a touchy subject for me: false positives. Is OpenNMS reporting false problems?

It reminds me of an event happened when I was studying physics back in the late 1980s. I was working with some newly discovered ceramic material that exhibited superconductivity at relatively high temperatures (around 92K). That temperature can be reached using liquid nitrogen, which was relatively easy to source compared to cooler liquids like liquid helium.

I needed to measure the temperature of the ceramic, but mercury (used in most common thermometers) is a solid at those temperatures, so I went to my advisor for suggestions. His first question to me was “What does a thermometer measure?”

I thought it was a trick question, so I answered “temperature” (“thermo” meaning temperature and meter meaning “to measure”). He replied, “Okay, smart guy, the temperature of what?”

That was harder to answer exactly, so I said vague things like the ambient environment, whatever it was next to, etc. He interrupted me and said “No, a thermometer measures one thing: the temperature of the thermometer”.

This was an important lesson, even though it seems obvious. In the case of the ceramic it meant a lot of extra steps to make sure the thermometer we were using (which was based on changes in resistance) was as close to the temperature of the material as possible.

What does that have to do with OpenNMS? Well, OpenNMS is like that thermometer. It is up to us to make sure that the way we decide to use it for monitoring is as close to our criteria as possible. A “false positive” usually indicates a problem with the method versus the tool – OpenNMS is behaving exactly as it should but we need to match it better to what we expect.

In my case I found out the router I use was limited by default to responding 1 ping per second (to avoid DDoS attacks I assume), so last night when I upped that to allow 20 pings per second Strafeping started to work as expected (as you can see in the graph above).

This allowed me to detect when my DSL circuit packet loss started again today. A little after 14:00 the system detected high packet loss. When this happened before, power cycling the modem seemed to fix it, so I headed home to do just that.

While I was on the way, around 15:30, the packet loss seemed to improve, but as you can see from the graph the ping times were all over the place (the line is green but there is a lot of extra “smoke” around it indicating a variance in the response times). I proactively power cycled the modem and things settled down. The Centurylink agent agreed to send me a new modem.

The point of this post is to stress that you need to understand how your monitoring tools actually work and you can often correct issues that make a monitor unusable and turn it into to something useful. Choose the right thermometer.