SMB can cause capsd to stop

Just thought I’d make everyone aware that, on some networks, the default SMB configuration in the capsd configuration file can cause capsd to stop scanning.

Bug 725 has been opened on this.

There are two workarounds:

1) If you don’t care about SMB, just remove the smb-config from the capsd config file.

2) Create a dummy user on your domain and enter its username, password and domain in the smb-config.

OpenNMS Mailing List Etiquette

Well, the OpenNMS server has been listed as a source of Spam – again.

For those who follow the mailing lists, you may remember the last time this happened.

The OpenNMS server is hosted at Rackspace Managed Hosting. They are a really amazing provider of dedicated servers: you can run any O/S you want and have access to an incredible amount of bandwidth.

Unfortunately, some people abuse this freedom and use their servers for things such as Spam. That is against the Rackspace Fair Use Policy and such servers are shut down, but by then the damage has been done.

The damage occurs because Rackspace does not have an unlimited supply of IP addresses, so a legitimate customer may end up with an ex-Spammer’s address. If that address has been blacklisted, mail from the new server will be blocked.

I like Rackspace because they provide a needed service while allowing their clients a lot of freedom, limited only by the Fair Use Policy. With any freedom, however, there can be abuse, and Rackspace does its best to limit it.

There are also people out there who maintain lists of the IP addresses of Spam sources. Instead of trying to ensure that these lists are accurate, they have taken to blacklisting entire subnets. Unfortunately, OpenNMS has fallen victim to this more than once.

The latest was from RoadRunner. As of Monday, RoadRunner blocked mail from the OpenNMS server. Luckily, a) I know someone who works there, and b) RoadRunner is also a well-run business with a fairly straightforward process for getting your server whitelisted again. They have promised to whitelist our IP addresses, but as of this writing they haven’t done so yet.

Whew, sorry for the long story, but it does have a point. If you are having problems receiving mail from the discussion lists, check with your Internet provider or admin to see if we have been blacklisted. If you are subscribing, you should get a confirmation e-mail from us within a minute.

Ensure that mail from 209.61.155.240 is allowed.
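If you run your own mail server, one way to see whether an address is on a DNS-based blacklist (DNSBL) is to query the list directly. Here is a minimal sketch, assuming a DNSBL that follows the usual convention (RFC 5782): reverse the IPv4 octets, append the list’s zone, and treat any A record as “listed”. The zone name below is a placeholder, not a real blacklist; substitute the one your provider actually uses.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL lookup name: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL publishes an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True  # any answer (conventionally 127.0.0.x) means "listed"
    except socket.gaierror:
        # NXDOMAIN means "not listed"; note that a DNS failure also lands here.
        return False


if __name__ == "__main__":
    # "dnsbl.example.org" is a placeholder zone, not a real blacklist.
    print(dnsbl_query_name("209.61.155.240", "dnsbl.example.org"))
    # prints 240.155.61.209.dnsbl.example.org
```

Because a DNS outage and “not listed” look the same to this check, a negative answer is only a hint; your provider or admin remains the authority.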

If you are subscribed and suddenly stop receiving messages, log in to the list’s web management page (for example, the one for “discuss”) and see if your preferences have been set to “no mail”. Mailman will do this automatically if mail to your account bounces, perhaps because of a Spam filter.

While I am on the subject, here are a couple of other OpenNMS e-mail gotchas to watch out for:

1) Out of Office messages: if you subscribe to an OpenNMS list, do not set your mailer up to auto-respond with an “Out of Office” message. I don’t get to take vacation, so I don’t want to hear about yours (grin). Most auto-responders can look at the message headers and avoid responding to messages from lists, so please take the time to set yours up that way if you want to use one. Out of Office messages will get you removed from the list immediately, and you’ll need to re-subscribe.
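The header check an auto-responder needs is simple. Here is a minimal sketch in Python, assuming the responder can see the incoming message: it declines to reply when the message carries mailing-list headers such as List-Id (RFC 2919) or List-Post (RFC 2369), a Precedence of bulk, list or junk (a widely used convention rather than a standard), or an Auto-Submitted value other than “no” (RFC 3834). Real mailers have their own settings for this; `should_auto_reply` is an illustrative name.

```python
from email import message_from_string
from email.message import Message

# Precedence values that, by convention, mark mail that should not get an auto-reply.
BULK_PRECEDENCE = {"bulk", "list", "junk"}


def should_auto_reply(msg: Message) -> bool:
    """Return False for mail that looks like it came from a list or a robot."""
    # Mailing-list headers (RFC 2919 List-Id, RFC 2369 List-Post).
    if msg.get("List-Id") or msg.get("List-Post"):
        return False
    if (msg.get("Precedence") or "").strip().lower() in BULK_PRECEDENCE:
        return False
    # RFC 3834: any Auto-Submitted value other than "no" means machine-generated mail.
    auto = (msg.get("Auto-Submitted") or "no").strip().lower()
    return auto == "no"


if __name__ == "__main__":
    list_mail = message_from_string(
        "From: someone@example.org\nList-Id: <discuss.example.org>\nSubject: hi\n\nbody\n"
    )
    personal = message_from_string("From: someone@example.org\nSubject: hi\n\nbody\n")
    print(should_auto_reply(list_mail))  # False
    print(should_auto_reply(personal))   # True
```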

2) Reply To All: Many people are used to hitting “Reply to All” when responding to messages on the list, to ensure that the author gets a copy (the list is the default Reply-To address). Some mailers, however, will then address the reply to both the “@opennms.org” address and the “@lists.opennms.org” address. Both deliver to the same place, so the list gets hit with double messages. Please be aware of this and check your mailer if you use this method. Double posts are annoying, but they won’t get you kicked off the list (grin).

3) Getting Kicked off the List: There is only one address I have ever banned from sending mail to OpenNMS, and even then only for a week. This person posted numerous messages, one after the other, asking (well, demanding) help, was upset when no one responded within four hours on a Saturday, and so posted the questions again. OpenNMS is run by volunteers, so keep that in mind when posting questions. We are a community dedicated to producing the best Network Management System available, and doing it as open source, but most of us have day jobs. If you need immediate support, consider a commercial support contract from Sortova.

War, What Is It Good For?

Since I consider OpenNMS a global community, I just wanted to post a note expressing my personal hope that the current strife will come to a quick end with minimal loss of life. Unlike many Americans, I have spent time overseas in places like the Middle East and Asia, and I found the people there to be very friendly and kind.

I had other things to say, but I got this in an e-mail from SourceForge today, and it seems to express them better than I could:

Finally, in this time of global political uncertainty, it is good to be
reminded that Open Source software has no boundaries. The work you do
on your Open Source project benefits countless individuals and nations
in every corner of the globe. Regardless of location, faith, or race,
developers are collaborating together to create software that will not
only benefit themselves, but all of humanity. At a time when it’s hard
to see something positive in the news, remember in our own way, the
Open Source community is making a difference.

The Emperor's New Clothes

In one of my past lives, I was actually training to be a chemist. I remember pretty much only one thing from those days: an exchange between a professor and me.

The professor asked me, “What does a thermometer measure?” I thought the answer was obvious: temperature. “The temperature of what?” he replied. Hmmm, whatever it is in? He answered, “A thermometer measures one thing – the temperature of the thermometer.”

The point he was trying to make involved techniques to ensure that the temperature of the thermometer and the temperature of what you wanted to measure were close, if not the same. But what I learned is that any measuring (or, in our case, monitoring) system can only measure what is known to it.

Take ICMP response time in OpenNMS, for example. I was really happy that my “pings” were on the order of microseconds, and OpenNMS reported round trip times on the order of microseconds. Way to go, OpenNMS.

However, it was pointed out to me that the times reported by OpenNMS didn’t necessarily match reality. I’d never really looked closely before: I was expecting a number on the order of a millisecond or less, I got numbers on the order of milliseconds, and that seemed fine.

But run a command-line tool like ping on the OpenNMS system itself. Local RTT for me is around 50 microseconds, yet OpenNMS reports an average of 300 microseconds, with peaks of 4-6 milliseconds. Notice, too, that the times get longer when OpenNMS is first started or is under load.

I spent two days trying to figure this out. The root of the problem is that Java has no API for ICMP, so you have to go outside of Java. We use code written in C and accessed via JNI to perform “ping-like” functions.

But unlike the other pollers, which make a request and wait for the answer, the ICMP poller sends an ECHO_REQUEST packet, and a separate process examines the ECHO_REPLY packets that come back (along with all other ICMP traffic, which it discards). This ReplyReceiver process calls the C code to check for any new ECHO_REPLY packets.
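To make that packet flow concrete, here is a minimal sketch of building an ECHO_REQUEST (ICMP type 8) and of the receiver-side filtering that keeps only ECHO_REPLY (type 0) packets carrying our identifier. It works on plain bytes, assumes any IP header has already been stripped off, and follows the RFC 792 header layout and the RFC 1071 checksum. It is not the code in IcmpSocket.c; `build_icmp` and `echo_replies` are hypothetical names.

```python
import struct

ICMP_ECHO_REPLY = 0    # RFC 792 message type
ICMP_ECHO_REQUEST = 8  # RFC 792 message type
HEADER = struct.Struct("!BBHHH")  # type, code, checksum, identifier, sequence


def internet_checksum(data: bytes) -> int:
    """RFC 1071: ones'-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_icmp(icmp_type: int, ident: int, seq: int, payload: bytes) -> bytes:
    """Build an ICMP message (code 0) with a correct checksum."""
    unsummed = HEADER.pack(icmp_type, 0, 0, ident, seq) + payload
    csum = internet_checksum(unsummed)
    return HEADER.pack(icmp_type, 0, csum, ident, seq) + payload


def echo_replies(packets, ident: int):
    """Yield (seq, payload) for ECHO_REPLYs with our identifier; discard all other ICMP."""
    for pkt in packets:
        if len(pkt) < HEADER.size:
            continue  # too short to be an ICMP message
        icmp_type, _code, _csum, pkt_ident, seq = HEADER.unpack_from(pkt)
        if icmp_type == ICMP_ECHO_REPLY and pkt_ident == ident:
            yield seq, pkt[HEADER.size:]
```

A real receiver would also verify the checksum and match the sequence number against outstanding requests before computing an RTT.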

Here’s where the problem lies. We send the system time out with the ECHO_REQUEST packet, and the ECHO_REPLY packet returns that time. When the reply is received, we sample the system time again, and the difference is the RTT. But because we don’t check for replies the instant they arrive, there is a “lag” between when the packet is actually received and when OpenNMS marks it as received. That lag varies with system load, which is why the ICMP RTT is so far off.
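The effect of that lag shows up clearly in a small simulation. This is a minimal sketch, assuming a receiver that only looks for replies at fixed polling ticks; the tick model stands in for “we don’t check immediately” and is not a description of how ReplyReceiver is actually scheduled, and the 8-byte nanosecond timestamp layout is hypothetical. With a true RTT of 50 microseconds and a 1 ms polling interval, the reply arrives at 1.05 ms but is noticed at 2 ms, so the computed RTT comes out as 1 ms, twenty times the real value.

```python
import struct


def make_payload(sent_ns: int) -> bytes:
    """Hypothetical payload: the send time as a big-endian 64-bit nanosecond count."""
    return struct.pack("!Q", sent_ns)


def rtt_from_payload(payload: bytes, received_ns: int) -> int:
    """RTT = time we mark the reply as received minus the time echoed in the payload."""
    (sent_ns,) = struct.unpack_from("!Q", payload)
    return received_ns - sent_ns


def noticed_at(arrival_ns: int, poll_interval_ns: int) -> int:
    """When a receiver that checks only every poll_interval_ns first sees a reply."""
    return -(-arrival_ns // poll_interval_ns) * poll_interval_ns  # round up to next tick


if __name__ == "__main__":
    sent = 1_000_000         # request sent at t = 1 ms
    arrival = sent + 50_000  # reply really arrives 50 us later
    seen = noticed_at(arrival, poll_interval_ns=1_000_000)
    payload = make_payload(sent)
    print(rtt_from_payload(payload, arrival))  # true RTT: 50000 ns
    print(rtt_from_payload(payload, seen))     # reported RTT: 1000000 ns
```

Under load the receiver runs later still, so the reported value both grows and varies, even though the network path has not changed.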

Note that this doesn’t affect other pollers, just ICMP.

I am not sure that this can be fixed. The fix would be to send the ECHO_REQUEST packet and then wait for the reply, all within the same process, but you can probably see how this would cause problems on large networks: packets get lost, retries have to be attempted, and so on. We would have to write pretty much the whole ICMP management piece in C and then loosely tie it back to Java, instead of doing most of it in Java with a bare minimum of C. Think this is trivial? Check out all of the O/S-specific code in IcmpSocket.c and you’ll see what I am talking about.
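Here is a minimal sketch of that send-and-wait alternative, to show where the cost comes from. `send` and `wait_reply` are hypothetical hooks standing in for the raw-socket code, and the 0.8 s timeout and 2 retries are illustrative defaults, not OpenNMS settings. Because the caller blocks, an unreachable host ties it up for timeout × (retries + 1), here 0.8 s × 3 = 2.4 s, so polling thousands of interfaces this way needs many threads or a redesign around non-blocking I/O.

```python
import time
from typing import Callable, Optional


def ping_sync(
    send: Callable[[int], None],
    wait_reply: Callable[[int, float], Optional[int]],
    timeout_s: float = 0.8,
    retries: int = 2,
    clock: Callable[[], int] = time.monotonic_ns,
) -> Optional[int]:
    """Send an echo request, block for the matching reply, retry on timeout.

    Returns the RTT in nanoseconds, or None if every attempt timed out.
    wait_reply(seq, timeout_s) returns the receive time (same clock) or None.
    """
    for seq in range(retries + 1):  # fresh sequence number per attempt
        sent_ns = clock()
        send(seq)
        received_ns = wait_reply(seq, timeout_s)
        if received_ns is not None:
            return received_ns - sent_ns  # both timestamps taken in this process
    return None


class FakeNet:
    """Test double: drops the first `drops` requests, then answers after 50 us."""

    def __init__(self, drops: int) -> None:
        self.now_ns = 0
        self.drops = drops
        self.sent = []

    def clock(self) -> int:
        return self.now_ns

    def send(self, seq: int) -> None:
        self.sent.append(seq)

    def wait_reply(self, seq: int, timeout_s: float) -> Optional[int]:
        if self.drops > 0:
            self.drops -= 1
            self.now_ns += int(timeout_s * 1e9)  # we sat out the full timeout
            return None
        self.now_ns += 50_000
        return self.now_ns


if __name__ == "__main__":
    net = FakeNet(drops=1)
    print(ping_sync(net.send, net.wait_reply, clock=net.clock))  # 50000
    print(net.sent)  # [0, 1]
```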

Anyway, native ICMP support is not even an option until Java 1.5 (maybe). So for now I may remove ICMP response times from the default OpenNMS install. They can still be useful as a number that goes down when things are good and up when things are bad, but since it really isn’t measuring temperature, we shouldn’t call it a thermometer.

Thread Safe RRD

We solved an interesting problem today. A client was running OpenNMS in a fairly large environment (8000+ interfaces). It had run smoothly until about a month ago, when it began dying every 5-7 days.

The only logs of interest were in collectd.log. There would be some “too many open files” errors, followed by a Java “OutOfMemory” error, and then the system would die.

It turned out there was a bad .rrd file (the client had been doing some dump and restore work on the .rrd repository). Since RRD is written in C, we use a Java interface to communicate with it, and apparently that interface was not returning the proper error message when it hit the bad file.
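The general cure for this class of bug is to make every failure on a bad file surface as an error and to release the file handle on every path, so one corrupt file cannot slowly exhaust descriptors. Here is a minimal sketch of that pattern in Python; it is not OpenNMS’s Java RRD interface. `check_rrd`, `find_bad_rrds` and `RrdError` are hypothetical names, and the 4-byte header value is an assumption about the RRDtool file format (files are assumed here to begin with the cookie “RRD” plus a NUL byte), so treat it as illustrative.

```python
import tempfile
from pathlib import Path

RRD_COOKIE = b"RRD\x00"  # assumption: leading bytes of an RRDtool file


class RrdError(Exception):
    """Raised for a bad .rrd file instead of failing silently."""


def check_rrd(path: Path) -> None:
    # The with-block closes the file on every path, including the error path.
    with path.open("rb") as f:
        header = f.read(len(RRD_COOKIE))
        if header != RRD_COOKIE:
            raise RrdError(f"{path}: bad RRD header {header!r}")


def find_bad_rrds(paths) -> list:
    """Report every bad or unreadable file rather than stopping (or hiding) at the first."""
    bad = []
    for p in paths:
        try:
            check_rrd(p)
        except (RrdError, OSError) as exc:
            bad.append(str(exc))
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        good, broken = Path(d, "good.rrd"), Path(d, "broken.rrd")
        good.write_bytes(RRD_COOKIE + b"0003")
        broken.write_bytes(b"garbage")
        print(find_bad_rrds([good, broken]))  # one entry, for broken.rrd
```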

Time to implement the new thread safe RRD commands. (grin)
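One generic way to make a C library that is not thread-safe usable from many collector threads is to serialize access to each file. Here is a minimal sketch, assuming one lock per .rrd path is enough; this is a common pattern, not the thread-safe RRD commands mentioned above, and `RrdLocks` and `locked_update` are hypothetical names.

```python
import threading
from collections import defaultdict
from typing import Callable, TypeVar

T = TypeVar("T")


class RrdLocks:
    """Hands out one lock per .rrd path; the registry itself is guarded too."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def for_path(self, path: str) -> threading.Lock:
        with self._guard:  # creating a new entry must not race with another thread
            return self._locks[path]


def locked_update(locks: RrdLocks, path: str, update: Callable[[], T]) -> T:
    """Run update() while holding the lock for path, so writers to one file never overlap."""
    with locks.for_path(path):
        return update()
```

Threads writing to different files still run in parallel; only writers to the same file wait for each other. If the library keeps global state, a single process-wide lock would be needed instead.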