Agent Provocateur

I’ve been involved with the monitoring of computer networks for a long time, two decades actually, and I’m seeing an alarming trend. Every new monitoring application seems to be insisting on software agents. Basically, in order to get any value out of the application, you have to go out and install additional software on each server in your network.

Now there was a time when this was necessary. BMC Software made a lot of money with its PATROL series of agents, yet people hated them then as much as they hate agents now. Why? Well, first there was the cost, both in terms of licensing and in continuing to maintain them (upgrades, etc.). Next there was the fact that you had to add software to already overloaded systems. I can remember the first time the company I worked for back then deployed a PATROL agent on an Oracle database. When it was started up it took the database down as it slammed the system with requests. Which leads me to the final point, outside of security issues that arise with an increase in the number of applications running on a system, the moment the system experiences a problem the blame will fall on the agent.

Despite that, agents still seem to proliferate. In part I think it is political. Downloading and installing agents looks like useful work. “Hey, I’m busy monitoring the network with these here agents”. Also in part, it is laziness. I have never met a programmer who liked working on someone else’s code, so why not come up with a proprietary protocol and write agents to implement it?

But what bothers me the most is that it is so unnecessary. The information you need for monitoring, with the possible exception of Windows, is already there. Modern operating systems (again, with the exception of Windows) ship with an SNMP agent, usually based on Net-SNMP. This is a secure, powerful extensible agent that has been tried and tested for many years, and it is maintained directly on server itself. You can use SNMPv3 for secure communications, and the “extend” and “pass” directives to make it easy to customize.

Heck, even Windows ships with an extensible SNMP agent, and you can also access data via WMI and PowerShell.

But what about applications? Don’t you need an agent for that?

Not really. Modern applications tend to have an API, usually based on ReST, that can be queried by a management station for important information. Java applications support JMX, databases support ODBC, and when all that fails you can usually use good ol’ HTTP to query the application directly. And the best part is that the application itself can be written to guard against a monitoring query causing undue load on the system.

At OpenNMS we work with a lot of large customers, and they are loathe to install new software on all of their servers. Plus, many of our customers have devices that can’t support additional agents, such as routers and switches, and IoT devices such as thermostats and door locks. This is the main reason why the OpenNMS monitoring platform is, by design, agentless.

A critic might point out that OpenNMS does have an agent in the remote poller, as well as in the upcoming Minion feature set. True, but those act as “user agents”, giving OpenNMS a view into networks as if it was a user of those networks. The software is not installed on every server but instead it just needs the same access as a user would have. So, it can be installed on an existing system or on a small system purchased for that purpose, at a minimum just one for each network to be monitored.

While some new IT fields may require agents, most successful solutions try to avoid them. Even in newer fields such as IT automation, the best solutions are agentless. They are not necessary, and I strongly suggest that anyone who is asked to install an agent for monitoring question that requirement.

OpenNMS is Sweet Sixteen

It was sixteen years ago today that the first code for OpenNMS was published on Sourceforge. While the project was started in the summer of 1999, no one seems to remember the exact date, so we use March 30th to mark the birthday of the OpenNMS project.

OpenNMS Project Details

While I’ve been closely associated with OpenNMS for a very long time, I didn’t start it. It was started by Steve Giles, Luke Rindfuss and Brian Weaver. They were soon joined by Shane O’Donnell, and while none of them are associated with the project today, they are the reason it exists.

Their company was called Oculan, and I joined them in 2001. They built management appliances marketed as “purple boxes” based on OpenNMS and I was brought on to build a business around just the OpenNMS piece of the solution.

As far as I know, this is the only surviving picture of most of the original team, taken at the OpenNMS 1.0 Release party:

OpenNMS 1.0 Release Team

In 2002 Oculan decided to close source all future work on their product, thus ending their involvement with OpenNMS. I saw the potential, so I talked with Steve Giles and soon left the company to become the OpenNMS project maintainer. When it comes to writing code I am very poorly suited to the job, but my one true talent is getting great people to work with me, and judging by the quality of people involved in OpenNMS, it is almost a superpower.

I worked out of my house and helped maintain the community mainly through the #opennms IRC channel on freenode, and surprisingly the project managed not only to survive, but to grow. When I found out that Steve Giles was leaving Oculan, I applied to be their new CEO, which I’ve been told was the source of a lot of humor among the executives. The man they hired had a track record of snuffing out all potential from a number of startups, but he had the proper credentials that VCs seem to like so he got the job. I have to admit to a bit of schadenfreude when Oculan closed its doors in 2004.

But on a good note, if you look at the two guys in the above picture right next to the cake, Seth Leger and Ben Reed, they still work for OpenNMS today. We’re still here. In fact we have the greatest team I’ve every worked with in my life, and the OpenNMS project has grown tremendously in the last 18 months. This July we’ll have our eleventh (!) annual developers conference, Dev-Jam, which will bring together people dedicated to OpenNMS, both old and new, for a week of hacking and camaraderie.

Our goal is nothing short of making OpenNMS the de facto management platform of choice for everyone, and while we still have a long way to go, we keep getting closer. My heartfelt thanks go out to everyone who made OpenNMS possible, and I look forward to writing many more of these notes in the future.

OpenNMS Horizon 17.1.1 Released

Probably the last Horizon 17 version, 17.1.1, has been released. According to TWiO, the next release will be Horizon 18 at the end of the month, with Horizon 19 following at the end of May.

This release is mainly a maintenance release. It does contain one fix I used (NMS-8199), which allows for the state names in the Jira Trouble Ticketing plugin to be configured. This helps a lot if Jira is not in English.

If you are running Horizon 17, this should help it run a bit smoother.

Bug

  • [NMS-7936] – Chart Servlet Outages model exception
  • [NMS-8010] – Groups config rolled back after deleting a user in web UI
  • [NMS-8034] – Adding com.sun.management.jmxremote.authenticate=true on opennms.conf is ignored by the opennms script
  • [NMS-8048] – org.hibernate.exception.SQLGrammarException with ACLs on V17
  • [NMS-8075] – vacuumd-configuration.xml — Database error executing statement
  • [NMS-8113] – Overview about major releases in the release notes
  • [NMS-8153] – Can't modify the Foreign ID on the Requisitions UI when adding a new node
  • [NMS-8159] – When altering the SNMP Trap NBI config, the externally referenced mapping groups are persisted into the main file.
  • [NMS-8161] – Tooltips are not working on the new Requisitions UI
  • [NMS-8165] – OutageDao ACL support is broken causing web UI failures
  • [NMS-8177] – Install guide should use postgres admin for schema updates
  • [NMS-8199] – Allows state names to be configured in the JIRA Ticketer Plugin

Enhancement

  • [NMS-6404] – Allow send events through ReST
  • [NMS-8148] – Create pull request and contribution template to GitHub project

Task

  • [NMS-8151] – Remove all jersey artifacts from lib classpath

Speeding Up OpenNMS Requisition Imports

One thing that differentiates OpenNMS from other applications is the strong focus on tools for provisioning the system. If you want to monitor hundred of thousands of devices, to ultimately millions, the ordinary methods just don’t work.

Users of OpenNMS often create large requisitions from external database sources, and sometimes it can take awhile for the import to complete. Delays can happen if the Foreign Source used for the requisition has a large number of service detectors that won’t exist on most devices.

For example, the default Foreign Source for Horizon 17 has about 15 detectors. Of those, only about 4 will exist on networking equipment (ICMP, SSH, HTTP and HTTPS). When scanning, this can add a lot of time per interface. Assuming 2 retries and a 3 second timeout, that would be 9 seconds for each non-existent service. With just 1000 interfaces, that’s 99000 seconds (9 seconds x 11 services x 1000 interfaces) of time just spent waiting, which translates to 27.5 hours.

Now, granted, the importer has multiple threads so the actual wait time will be less, but you can see how this can impact the time needed to import a requisition. This can be reduced significantly by tuning service detection to the bare minimum needed and perhaps adding other services later on a per device basis without scanning.

OpenNMS Horizon 17.1 Released

As some of you may have noticed, Horizon 17.1 has been released. As Horizon 17 will form the basis for Meridian 2016, I’m extremely happy to see how much progress has been made on fixing issues.

Be sure to check out the Release Notes.

Horizon is our rapid release version, and its goal is to get all the cool new features out as soon as they are ready. In this case, right about release we discovered an annoying but easy to fix bug with provisioning, so if you plan to run Horizon 17.1 you should also apply this patch.

Have fun, and we hope you find OpenNMS useful.

Bug

  • [NMS-4108] – Bad suggestions in install guide
  • [NMS-7152] – Enlinkd Topology Plugin fails to create LLDP links for mismatched link port descriptions.
  • [NMS-7820] – snmp-graph.properties.d files with –vertical-label="verticalLabel" in config
  • [NMS-7866] – Incorrect host in Location header when creating resources via ReST
  • [NMS-7910] – config-tester is broken
  • [NMS-7953] – Opsboard and Opspanel use wrong logo
  • [NMS-7966] – Unable to generate eventconf if a MIB (improperly?) uses a TC to define a TC
  • [NMS-7988] – container features.xml still references jersey 1.18 when it should reference 1.19.
  • [NMS-8000] – Topology-UI shows CDP links not correct
  • [NMS-8014] – Backshift graphs show dates in UTC instead of the browser's timezone
  • [NMS-8017] – StrafePing: Unexpected exception while polling PollableService
  • [NMS-8023] – Grafana Box did not work anymore
  • [NMS-8026] – Constant Thread Locking on Enlinkd
  • [NMS-8029] – Wrong use of opennms.web.base-url
  • [NMS-8038] – NRTG with newts – get StringIndexOutOfBoundsException
  • [NMS-8051] – The newts script only works if cassandra runs on localhost
  • [NMS-8054] – AlarmPersisterIT test is empty
  • [NMS-8064] – The 'newts init' script does work when authentication is enabled in Cassandra
  • [NMS-8065] – ReST Regression in Alarms/Events
  • [NMS-8066] – Newts only uses a single thread when writing to Cassandra
  • [NMS-8073] – User Restriction Filters: mapping class for roles to groups does not work
  • [NMS-8074] – The "Remove From Focus" button intermittently fails
  • [NMS-8079] – The OnmsDaoContainer does not update its cache correctly, leading to a NumberFormatException
  • [NMS-8084] – File not found exception for interfaceSTP-box.jsp on SNMP interface page
  • [NMS-8097] – Installation Guide Debian Bug Version 17.0.0
  • [NMS-8100] – Unable to complete creation of scheduled reports
  • [NMS-8103] – NPE when persisting data with Newts
  • [NMS-8104] – init script checkXmlFiles() fails to pick up errors
  • [NMS-8106] – INFO-severity syslog-derived events end up unmatched
  • [NMS-8109] – Memory leak when using the BSFDetector
  • [NMS-8112] – init script "configtest" exit value is always 1
  • [NMS-8116] – Heat map Alarms/Categories do not show all categories
  • [NMS-8119] – WS-MAN has broken ForeignSourceConfigRestService and the requisitions UI doesn't work.
  • [NMS-8123] – Removing ops boards via the configuration UI does not update the table
  • [NMS-8126] – JNA ping code reuses buffer causing inconsistent reads of packet contents
  • [NMS-8133] – Synchronizing a requisition fails
  • [NMS-8147] – Add all the services declared on Collectd and Pollerd configuration as available services on /opennms/rest/foreignSourceConfig/services

Enhancement

  • [NMS-7123] – Expose SNMP4J 2.x noGetBulk and allowSnmpV2cInV1 capabilities
  • [NMS-7978] – Add threshold comments and whitespace changes to match how the OpenNMS web GUI generates XML files
  • [NMS-8005] – Add support for using NRTG via Ajax calls
  • [NMS-8024] – Add support for OSGi-based Ticketing plugins
  • [NMS-8028] – Add event definition for postfix syslog message TLS disabled
  • [NMS-8030] – Improve the SNMP data collection config parsing to give more flexibility to the users
  • [NMS-8042] – set Up severities for RADLAN-MIB.events.xml
  • [NMS-8068] – Add support for marshalling NorthboundAlarms to XML
  • [NMS-8071] – Event definition file for JUNIPER-IVE-MIB
  • [NMS-8120] – Fixed a paragraph in the "Automatic Discovery" provisioning chapter
  • [NMS-8156] – Upgrade Angular Backend for the Requisitions UI

Add a Weather Widget to OpenNMS Home Screen

I was recently at a client site where I met a man named Jeremy Ford. He’s sharp as a knife and even though, at the time, he was new to OpenNMS, he had already hacked a few neat things into the system (open source FTW).

Weathermap on OpenNMS Home Page

One of those was the addition of a weathermap to the OpenNMS home page. He has graciously put the code up on Github.

The code is a script that will generate a JSP file in the OpenNMS “includes” directory. All you have to do then is to add a reference to it in the main index.jsp file.

For those of you who don’t know or who have never poked around, under the $OPENNMS_HOME directory should be a directory called jetty-webapps. That is the web root directory for the Jetty servlet container that ships with OpenNMS.

Under that directory you’ll find a subdirectory for opennms. When you surf to http://[my OpenNMS Server]:8980/opennms that is the directory you are visiting. In it is an index.jsp file that serves as the main page.

If you are familiar with HTML, the JSP file is very similar. It can contain references to Java code, but a lot of it is straight HTML. The file is kept simple on purpose, with each of the three columns on the main page indicated by comments. The part you will need to change is the third column:

<!-- Right Column -->
        <div class="col-md-3" id="index-contentright">
                <!-- weather box -->
                <jsp:include page="/includes/weather.jsp" flush="false" />

Feel free to look around. If you ever wanted to rearrange the OpenNMS Home page, this is a good place to start.

Now, I used to like poking around with these files since they would update automatically, but later versions of OpenNMS (which contain later versions of Jetty) seem to require a restart. If you get an error, restart OpenNMS and see if it goes away.

Now the weather.jsp file gets generated by Jeremy’s python script. In order to get that to work you’ll need to do two things. The most important is to get an API key from Weather Underground. It is a pretty easy process, but be aware that you can only do 500 queries a day without paying. The second thing you’ll need to do is edit the three URLs in the script and change the location. It is currently set to “CA/San_Francisco” but I was able to change it to “NC/Pittsboro” and it “just worked”.

Finally, you’ll need to set the script up to run via cron. I’m not sure how frequently Weather Underground updates the data, but a 10 minute interval seems to work well. That’s only 144 queries a day, so you could easily double it and still be within your limit.

[IMPORTANT UPDATE: Jeremy pointed out that the script actually does three queries, not just one, so instead of doing 144 queries a day, it’s 432. Still leaves some room with 10 minute queries but you don’t want to increase the frequency too much.]

Thanks to Jeremy for taking the time to share this. Remember, once you get it working, if you upgrade OpenNMS you’ll need to edit index.jsp and add it back, but that should be the only change needed.

OpenNMS at Scale

So, yes, the gang from OpenNMS will be at the SCaLE conference this weekend (I will not be there, unfortunately, due to a self-imposed conference hiatus this year). It should be a great time, and we are happy to be a Gold Sponsor.

But this post is not about that. This is about how Horizon 17 and data collection can scale. You can come by the booth at SCaLE and learn more about it, but here is the overview.

When OpenNMS first started, we leveraged the great application RRDTool for storing performance data. When we discovered a java port called JRobin, OpenNMS was modified to support that storage strategy as well.

Using a Round Robin database has a number of advantages. First, it’s compact. Once the file containing the RRD database is created, it never grows. Second, we used RRDTool to also graph the data.

However, there were problems. Many users had a need to store the raw collected data. RRDTool uses consolidation functions to store a time-series average. But the biggest issue was that writing lots of files required really fast hard drives. The more data you wanted to store, the greater your investment in disk arrays. Ultimately, you would hit a wall, which would require you to either reduce your data collection or partition out the data across multiple systems.

No more. With Horizon 17 OpenNMS fully supports a time-series database called Newts. Newts is built on Cassandra, and even a small Cassandra cluster can handle tens of thousands of inserts a second. Need more performance? Just add more nodes. Works across geographically distributed systems as well, so you get built-in high availability (something that was very difficult with RRDTool).

Just before Christmas I got to visit a customer on the Eastern Shore of Maryland. You wouldn’t think that location would be a hotbed of technical excellence, but it is rare that I get to work with such a quick team.

They brought me up for a “Getting to Know You” project. This is a two day engagement where we get to kick the tires on OpenNMS to see if it is a good fit. They had been using Zenoss Core (the free version) and they hit a wall. The features they wanted were all in the “enterprise” paid version and the free version just wouldn’t meet their needs. OpenNMS did, and being truly open source it fit their philosophy (and budget) much better.

This was a fun trip for me because they had already done most of the work. They had OpenNMS installed and monitoring their network, and they just needed me to help out on some interesting use cases.

One of their issues was the need to store a lot of performance data, and since I was eager to play with the Newts integration we decided to test it out.

In order to enable Newts, first you need a Cassandra cluster. It turns out that ScyllaDB works as well (more on that a bit later). If you are looking at the Newts website you can ignore the instructions on installing it as it it built directly into OpenNMS.

Another thing built in to OpenNMS is a new graphing library called Backshift. Since OpenNMS relied on RRDTool for graphing, a new data visualization tool was needed. Backshift leverages the RRDTool graphing syntax so your pre-defined graphs will work automatically. Note that some options, such as CANVAS colors, have not been implemented yet.

To switch to newts, in the opennms.properties file you’ll find a section:

###### Time Series Strategy ####
# Use this property to set the strategy used to persist and retrieve time series metrics:
# Supported values are:
#   rrd (default)
#   newts

org.opennms.timeseries.strategy=newts

Note: “rrd” strategy can refer to either JRobin or RRDTool, with JRobin as the default. This is set in rrd-configuration.properties.

The next section determines what will render the graphs.

###### Graphing #####
# Use this property to set the graph rendering engine type.  If set to 'auto', attempt
# to choose the appropriate backend depending on org.opennms.timeseries.strategy above.
# Supported values are:
#   auto (default)
#   png
#   placeholder
#   backshift
org.opennms.web.graphs.engine=auto

If you are using Newts, the “auto” setting will utilize Backshift but here is where you could set Backshift as the renderer even if you want to use an RRD strategy. You should try it out. It’s cool.

Finally, we come to the settings for Newts:

###### Newts #####
# Use these properties to configure persistence using Newts
# Note that Newts must be enabled using the 'org.opennms.timeseries.strategy' property
# for these to take effect.
#
org.opennms.newts.config.hostname=10.110.4.30,10.110.4.32
#org.opennms.newts.config.keyspace=newts

There are a lot of settings and most of those are described in the documentation, but in this case I wanted to demonstrate that you can point OpenNMS to multiple Cassandra instances. You can also set different keyspace names which allows multiple instances of OpenNMS to talk to the same Cassandra cluster and not share data.

From the “fine” documentation, they also recommend that you store the data based on the foreign source by setting this variable:

org.opennms.rrd.storeByForeignSource=true

I would recommend this if you are using provisiond and requisitions. If you are currently doing auto-discovery, then it may be better to reference it by nodeid, which is the default.

I want to point out two other values that will need to be increased from the defaults: org.opennms.newts.config.ring_buffer_size and org.opennms.newts.config.cache.max_entries. For this system they were both set to 1048576. The ring buffer is especially important since should it fill up, samples will be discarded.

So, how did it go? Well, after fixing a bug with the ring buffer, everything went well. That bug is one reason that features like this aren’t immediately included in Meridian. Luckily we were working with a client who was willing to let us investigate and correct the issue. By the time it hits Meridian 2016, it will be completely ready for production.

If you enable the OpenNMS-JVM service on your OpenNMS node, the system will automatically collected Newts performance data (assuming Newts is enabled). OpenNMS will also collect performance data from the Cassandra cluster including both general Cassandra metrics as well as Newts specific ones.

This system is connected to a two node Cassandra cluster and managing 3.8K inserts/sec.

Newts Samples Inserted

If I’m doing the math correctly, since we collect values once every 300 seconds (5 minutes) by default, that’s 1.15 million data points, and the system isn’t even working hard.

OpenNMS will also collect on ring buffer information, and I took a screen shot to demonstrate Backshift, which displays the data point as you mouse over it.

Newts Ring Buffer

Horizon 17 ships with a load testing program. For this cluster:

[root@nms stress]# java -jar target/newts-stress-jar-with-dependencies.jar INSERT -B 16 -n 32 -r 100 -m 1 -H cluster
-- Meters ----------------------------------------------------------------------
org.opennms.newts.stress.InsertDispatcher.samples
             count = 10512100
         mean rate = 51989.68 events/second
     1-minute rate = 51906.38 events/second
     5-minute rate = 38806.02 events/second
    15-minute rate = 31232.98 events/second

so there is plenty of room to grow. Need something faster? Just add more nodes. Or, you can switch to ScyllaDB which is a port of Cassandra written in C. When run against a four node ScyllaDB cluster the results were:

[root@nms stress]# java -jar target/newts-stress-jar-with-dependencies.jar INSERT -B 16 -n 32 -r 100 -m 1 -H cluster
-- Meters ----------------------------------------------------------------------
org.opennms.newts.stress.InsertDispatcher.samples
             count = 10512100
         mean rate = 89073.32 events/second
     1-minute rate = 88048.48 events/second
     5-minute rate = 85217.92 events/second
    15-minute rate = 84110.52 events/second

Unfortunately I do not have statistics for a four node Cassandra cluster to compare it directly with ScyllaDB.

Of course the Newts data directly fits in with the OpenNMS Grafana integration.

Grafana Inserts per Second

Which brings me to one down side of this storage strategy. It’s fast, which means it isn’t compact. On this system the disk space is growing at about 4GB/day, which would be 1.5TB/year.

Grafana Disk Space

If you consider that the data is replicated across Cassandra nodes, you would need that amount of space on each one. Since the availability of multi-Terabyte drives is pretty common, this shouldn’t be a problem, but be sure to ask yourself if all the data you are collecting is really necessary. Just because you can collect the data doesn’t mean you should.

OpenNMS is finally to the point where the storing of performance data is no longer an issue. You are more likely to hit limits with the collector, which in part is going to be driven by the speed of the network. I’ve been in large data centers with hundreds of thousands of interfaces all with sub-millisecond latency. On that network, OpenNMS could collect on hundreds of millions of data points. On a network with lots of remote equipment, however, timeouts and delays will impact how much data OpenNMS could collect.

But with a little creativity, even that goes away. Think about it – with a common, decentralized data storage system like Cassandra, you could have multiple OpenNMS instances all talking to the same data store. If you have them share a common database, you can use collectd filters to spread data collection out over any number of machines. While this would take planning, it is doable today.

What about tomorrow? Well, Horizon 18 will introduce the OpenNMS Minion code. Minions will allow OpenNMS to scale horizontally and can be managed directly from OpenNMS – no configuration tricks needed. This will truly position OpenNMS for the Internet of Things.

UPDATE: Alejandro pointed out the following:

There is a property in OpenNMS for Newts that doesn’t appear in opennms.properties called heartbeat org.opennms.newts.query.heartbeat, which expects a duration in milliseconds. By default is 450000 (i.e. 1.5 x 5min) and it is used when no heartbeat is specified. Should generally be 1.5x your biggest collection interval.

If you were using 10 minute polls, set it to 15 min (1.5 x 10min), and then you will see graphs. To do this, add the following to opennms.properties and restart OpenNMS:

org.opennms.newts.query.heartbeat=900000

Avoiding the Sad Graph of Software Death

Seth recently sent me to an interesting article by Gregory Brown discussing a “death spiral” often faced by software projects when issues and feature requests start to out pace the ability to close them.

Sad Graph of Death

Now Seth is pretty much in charge of managing our Jira instance, which is key to managing the progress of OpenNMS software development. He decided to look at our record:

OpenNMS Issues Graph

[UPDATE: Logged into Jira to get a lot more issues on the graph]

Not bad, not bad at all.

A lot of our ability to keep up with issues comes from our project’s investment in using the tool. It is very easy to let things slide, resulting the the first graph above and causing a project to possibly declare “issue bankruptcy“. Since all of this information is public for OpenNMS, it is important to keep it up to date and while we never have enough time for all the things we need to do, we make time for this.

I think it speaks volumes for Seth and the rest team that OpenNMS issues are managed so well. In part it comes naturally from “the open source way” since projects should be as transparent as possible, and managing issues is a key part of that.

Annual LinuxQuestions Poll

Just a quick note that the annual LinuxQuestions “Member’s Choice” poll is out. While I don’t believe OpenNMS is known to many of the members of that site, if you feel like showing it a little love, please register and vote.

http://www.linuxquestions.org/questions/2015-linuxquestions-org-members-choice-awards-117/network-monitoring-application-of-the-year-4175562720/

Many thanks to Jeremy Garcia for maintaining that site and including OpenNMS.