We solved an interesting problem today. A client was running OpenNMS in a fairly large environment (8000+ interfaces). It had been running smoothly until about a month ago, when it started dying every 5-7 days.
The only logs of interest were in collectd.log. There would be some “too many open files” errors, followed by a Java “OutOfMemory” error, and then the system would die.
It turns out there was a bad .rrd file (the client had been doing some dump and restore work on the .rrd repository). Since RRDtool is written in C, we use an interface in Java to communicate with it, and apparently it was not returning the proper error message when it hit the bad file.
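
For anyone who wants to hunt for a file like this before it bites, here is a rough sketch of a scanner. It is not part of OpenNMS, and the class and method names (RrdSanityCheck, looksLikeRrd, findSuspectFiles) are made up for illustration. It relies on the fact that RRDtool files start with the ASCII cookie "RRD", so it only catches files whose header is missing or mangled and says nothing about damage further in. It assumes Java 11 or later. Note that every file is opened in a try-with-resources block, so handles get closed even when a read fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not an OpenNMS class.
public class RrdSanityCheck {
    // RRDtool files begin with the ASCII cookie "RRD".
    private static final byte[] COOKIE = {'R', 'R', 'D'};

    /** Returns true if the file starts with the RRD cookie. */
    static boolean looksLikeRrd(Path file) {
        // try-with-resources closes the stream even if the read throws.
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(COOKIE.length);
            return Arrays.equals(head, COOKIE);
        } catch (IOException e) {
            // An unreadable file is treated as suspect.
            return false;
        }
    }

    /** Walks root and collects *.rrd files that fail the header check. */
    static List<Path> findSuspectFiles(Path root) throws IOException {
        List<Path> suspects = new ArrayList<>();
        // Files.walk holds directory handles, so it is closed as well.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".rrd"))
                 .filter(p -> !looksLikeRrd(p))
                 .forEach(suspects::add);
        }
        return suspects;
    }

    public static void main(String[] args) throws IOException {
        // The directory to scan is passed as the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path p : findSuspectFiles(root)) {
            System.out.println("suspect: " + p);
        }
    }
}
```

Point it at the top of the .rrd repository and it prints one line per file that does not look like an RRD. A file that passes can still be broken, so treat a clean run as a first pass only.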
Time to implement the new thread-safe RRD commands. (grin)