OpenNMS Horizon 18 “Tardigrade” Is Now Available

I am extremely happy to announce the availability of Horizon 18, codenamed “Tardigrade”. Ben is responsible for naming our releases and he’s decided that the theme for Horizon 18 will be animals. The name “Tardigrade” was suggested in the IRC channel by Uberpenguin, and while they aren’t the prettiest things, Wikipedia describes them as “perhaps the most durable of known organisms” so in the context of OpenNMS that is appropriate.

OpenNMS Horizon 18

I am also happy to see the Horizon program working. When we split OpenNMS into Horizon and Meridian, the main reason was to drive faster development. Now instead of a new stable release every 18 months, we are getting them out every 3 to 4 months. And these are great releases – not just major releases in name only.

The first thing you’ll notice if you log in to Horizon 18 as a user in the admin role is that we’ve added a new “opt-in” feature that lets us know a little bit about how OpenNMS is being used by people. We hope that most of you will choose to send us this information, and in the spirit of the Open Source Way we’ve made all of the statistics available publicly.

OpenNMS Opt-In Screen

One of the key things we are looking for is the list of SNMP Object IDs. This will let us know what devices our users are monitoring so that we can improve support for them. Of course, this requires that your OpenNMS instance be able to reach the stats server on the Internet, and you can change your choice at any time on the Configuration admin page under “Data Choices”. It will only send this information once every 24 hours, so we don’t expect it to impact network traffic at all.

Once you’ve opted in, the next thing you’ll probably notice is new problem lists on the home page listing “services” and “applications”.

OpenNMS BSM Problem Lists

This relates to the major feature addition in Horizon 18: the Business Service Monitor (BSM).

OpenNMS BSM OpenDaylight

As people move from treating servers as pets to treating them like cattle, the emphasis has shifted to understanding how well applications and microservices are running as a whole instead of focusing on individual devices. The BSM allows you to configure these services and then leverage all the usual OpenNMS crunchy goodness as you would a legacy service like HTTP running on a particular box. The above screenshot comes from some prototype work Jesse has been doing with integrating OpenNMS with OpenDaylight. As you can see at a glance, while the ICMP service is down on a particular device, the overall Network Fabric is still functioning perfectly.

Another thing I’m extremely proud of is the increase in the quality of documentation. Ronny and the rest of the documentation team are doing a great job, and we’ve made it a requirement that new features aren’t complete without documentation. Please check out the release notes as an example. They contain a pretty comprehensive list of changes in 18.

A few I’d like to point out:

Horizon 17 is one of the most powerful and stable releases of OpenNMS ever, and we hope to continue that tradition with Horizon 18. Hats off to the team for such great work.

Here is a list of all the issues addressed in Horizon 18:

Release Notes – OpenNMS – Version 18.0.0

Bug

  • [NMS-3489] – "ADD NODE" produces "too much" config
  • [NMS-4845] – RrdUtils.createRRD log message is unclear
  • [NMS-5788] – model-importer.properties should be deprecated and removed
  • [NMS-5839] – Bring WaterfallExecutor logging on par with RunnableConsumerThreadPool
  • [NMS-5915] – The retry handler used with HttpClient is not going to do what we expect
  • [NMS-5970] – No HTML title on Topology Map
  • [NMS-6344] – provision.pl does not import requisitions with spaces in the name
  • [NMS-6549] – Eventd does not honor reloadDaemonConfig event
  • [NMS-6623] – Update JNA.jar library to support ARM based systems
  • [NMS-7263] – jaxb.properties not included in jar
  • [NMS-7471] – SNMP Plugin tests regularly failing
  • [NMS-7525] – ArrayOutOfBounds Exception in Topology Map when selecting bridge-port
  • [NMS-7582] – non RFC conform behaviour of SmtpMonitor
  • [NMS-7731] – Remote poller dies when trying to use the PageSequenceMonitor
  • [NMS-7763] – Bridge Data is not Collected on Cisco Nexus
  • [NMS-7792] – NPE in JmxRrdMigratorOffline
  • [NMS-7846] – Slow LinkdTopologyProvider/EnhancedLinkdTopologyProvider in bigger enviroments
  • [NMS-7871] – Enlinkd bridge discovery creates erroneous entries in the Bridge Forwarding Tables of unrelated switches when host is a kvm virtual host
  • [NMS-7872] – 303 See Other on requisitions response breaks the usage of the Requisitions ReST API
  • [NMS-7880] – Integration tests in org.opennms.core.test-api.karaf have incomplete dependencies
  • [NMS-7918] – Slow BridgeBridgeTopologie discovery with enlinkd.
  • [NMS-7922] – Null pointer exceptions with whitespace in requisition name
  • [NMS-7959] – Bouncycastle JARs break large-key crypto operations
  • [NMS-7967] – XML namespace locations are not set correctly for namespaces cm, and ext
  • [NMS-7975] – Rest API v2 returns http-404 (not found) for http-204 (no content) cases
  • [NMS-8003] – Topology-UI shows LLDP links not correct
  • [NMS-8018] – Vacuumd sends automation events before transaction is closed
  • [NMS-8056] – opennms-setup.karaf shouldn't try to start ActiveMQ
  • [NMS-8057] – Add the org.opennms.features.activemq.broker .xml and .cfg files to the Minion repo webapp
  • [NMS-8058] – Poll all interface w/o critical service is incorrect
  • [NMS-8072] – NullPointerException for NodeDiscoveryBridge
  • [NMS-8079] – The OnmsDaoContainer does not update its cache correctly, leading to a NumberFormatException
  • [NMS-8080] – VLAN name is not displayed
  • [NMS-8086] – Provisioning Requisitions with spaces in their name.
  • [NMS-8096] – JMX detector connection errors use wrong log level
  • [NMS-8098] – PageSequenceMonitor sometimes gives poor failure reasons
  • [NMS-8104] – init script checkXmlFiles() fails to pick up errors
  • [NMS-8116] – Heat map Alarms/Categories do not show all categories
  • [NMS-8118] – CXF returning 204 on NULL responses, rather than 404
  • [NMS-8125] – Memory leak when using Groovy + BSF
  • [NMS-8128] – NPE if provisioning requisition name has spaces
  • [NMS-8137] – OpenNMS incorrectly discovers VLANs
  • [NMS-8146] – "Show interfaces" link forgets the filters in some circumstances
  • [NMS-8167] – Cannot search by MAC address
  • [NMS-8168] – Vaadin Applications do not show OpenNMS favicon
  • [NMS-8189] – Wrong interface status color on node detail page
  • [NMS-8194] – Return an HTTP 303 for PUT/POST request on a ReST API is a bad practice
  • [NMS-8198] – Provisioning UI indication for changed nodes is too bright
  • [NMS-8208] – Upgrade maven-bundle-plugin to v3.0.1
  • [NMS-8214] – AlarmdIT.testPersistManyAlarmsAtOnce() test ordering issue?
  • [NMS-8215] – Chart servlet reloads Notifd config instead of Charts config
  • [NMS-8216] – Discovery config screen problems in latest code
  • [NMS-8221] – Operation "Refresh Now" and "Automatic Refresh" referesh the UI differently
  • [NMS-8224] – JasperReports measurements data-source step returning null
  • [NMS-8235] – Jaspersoft Studio cannot be used anymore to debug/create new reports
  • [NMS-8240] – Requisition synchronization is failing due to space in requisition name
  • [NMS-8248] – Many Rcsript (RScript) files in OPENNMS_DATA/tmp
  • [NMS-8257] – Test flapping: ForeignSourceRestServiceIT.testForeignSources()
  • [NMS-8272] – snmp4j does not process agent responses
  • [NMS-8273] – %post error when Minion host.key already exists
  • [NMS-8274] – All the defined Statsd's reports are being executed even if they are disabled.
  • [NMS-8277] – %post failure in opennms-minion-features-core: sed not found
  • [NMS-8293] – Config Tester Tool doesn't check some of the core configuration files
  • [NMS-8298] – Label of Vertex is too short in some cases
  • [NMS-8299] – Topology UI recenters even if Manual Layout is selected
  • [NMS-8300] – Center on Selection no longer works in STUI
  • [NMS-8301] – v2 Rest Services are deployed twice to the WEB-INF/lib directory
  • [NMS-8302] – Json deserialization throws "unknown property" exception due to usage of wrong Jax-rs Provider
  • [NMS-8304] – An error on threshd-configuration.xml breaks Collectd when reloading thresholds configuration
  • [NMS-8313] – Pan moving in Topology UI automatically recenters
  • [NMS-8314] – Weird zoom behavior in Topology UI using mouse wheel
  • [NMS-8320] – Ping is available for HTTP services
  • [NMS-8324] – Friendly name of an IP service is never shown in BSM
  • [NMS-8330] – Switching Topology Providers causes Exception
  • [NMS-8335] – Focal points are no longer persisted
  • [NMS-8337] – Non-existing resources or attributes break JasperReports when using the Measurements API
  • [NMS-8353] – Plugin Manager fails to load
  • [NMS-8361] – Incorrect documentation for org.opennms.newts.query.heartbeat
  • [NMS-8371] – The contents of the info panel should refresh when the vertices and edges are refreshed
  • [NMS-8373] – The placeholder {diffTime} is not supported by Backshift.
  • [NMS-8374] – The logic to find event definitions confuses the Event Translator when translating SNMP Traps
  • [NMS-8375] – License / copyright situation in release notes introduction needs simplifying
  • [NMS-8379] – Sluggish performance with Cassandra driver
  • [NMS-8383] – jmxconfiggenerator feature has unnecessary includes
  • [NMS-8386] – Requisitioning UI fails to load in modern browsers if used behind a proxy
  • [NMS-8388] – Document resources ReST service
  • [NMS-8389] – Heatmap is not showing
  • [NMS-8394] – NoSuchElement exception when loading the TopologyUI
  • [NMS-8395] – Logging improvements to Notifd
  • [NMS-8401] – There are errors on the graph definitions for OpenNMS JMX statistics
  • [NMS-8403] – Document styles of identifying nodes in resource IDs

Enhancement

  • [NMS-2504] – Create a better landing page for Configure Discovery aftermath
  • [NMS-4229] – Detect tables with Provisiond SNMP detector
  • [NMS-5077] – Allow other services to work with Path Outages other than ICMP
  • [NMS-5905] – Add ifAlias to bridge Link Interface Info
  • [NMS-5979] – Make the Provisioning Requisitions "Node Quick-Add" look pretty
  • [NMS-7123] – Expose SNMP4J 2.x noGetBulk and allowSnmpV2cInV1 capabilities
  • [NMS-7446] – Enhance Bridge Link Object Model
  • [NMS-7447] – Update BridgeTopology to use the new Object Model
  • [NMS-7448] – Update Bridge Topology Discovery Strategy
  • [NMS-7756] – Change icon for Dell PowerConnector switch
  • [NMS-7798] – Add Sonicwall Firewall Events
  • [NMS-7903] – Elasticsearch event and alarm forwarder
  • [NMS-7950] – Create an overview for the developers guide
  • [NMS-7965] – Add support for setting system properties via user supplied .properties files
  • [NMS-7976] – Merge OSGi Plugin Manager into Admin UI
  • [NMS-7980] – provide HTTPS Quicklaunch into node page
  • [NMS-8015] – Remove Dependencies on RXTX
  • [NMS-8041] – Refactor Enhanced Linkd Topology
  • [NMS-8044] – Provide link for Microsoft RDP connections
  • [NMS-8063] – Update asciidoc dependencies to latest 1.5.3
  • [NMS-8076] – Allow user to access local documentation from OpenNMS Jetty Webapp
  • [NMS-8077] – Add NetGear Prosafe Smart switch SNMP trap events and syslog events
  • [NMS-8092] – Add OpenWrt syslog and related event definitions
  • [NMS-8129] – Disallow restricted characters from foreign source and foreign ID
  • [NMS-8149] – Update asciidoctorj to 1.5.4 and asciidoctorjPdf to 1.5.0-alpha.11
  • [NMS-8152] – Collect and publish anonymous statistics to stats.opennms.org
  • [NMS-8160] – Remove Quick-Add node to avoid confusions and avoid breaking the ReST API
  • [NMS-8163] – Requisitions UI Enhancements
  • [NMS-8179] – ifIndex >= 2^31
  • [NMS-8182] – Add HTTPS as quick-link on the node page
  • [NMS-8205] – Generate events for alarm lifecycle changes
  • [NMS-8209] – Upgrade junit to v4.12
  • [NMS-8210] – Add support for calculating the derivative with a Measurements API Filter
  • [NMS-8211] – Add support for retrieving nodes with a filter expression via the ReST API
  • [NMS-8218] – External event source tweaks to admin guide
  • [NMS-8219] – Copyright bump on asciidoc docs
  • [NMS-8225] – Integrate the Minion container and packages into the mainline OpenNMS build
  • [NMS-8226] – Upgrade SNMP4J to version 2.4
  • [NMS-8238] – Topology providers should provide a description for display
  • [NMS-8251] – Parameterize product name in asciidoc docs
  • [NMS-8259] – Cleanup testdata in SnmpDetector tests
  • [NMS-8265] – SNMP collection systemDefs for Cisco ASA5525-X, ASA5515-X
  • [NMS-8266] – SNMP collection systemDefs for Juniper SRX210he2, SRX100h
  • [NMS-8267] – Create documentation for SNMP detector
  • [NMS-8271] – Enable correlation engines to register for all events
  • [NMS-8296] – Be able to re-order the policies on a requisition through the UI
  • [NMS-8334] – Implement org.opennms.timeseries.strategy=evaluate to facilitate the sizing process
  • [NMS-8336] – Set the required fields when not specified while adding events through ReST
  • [NMS-8349] – Update screenshots with 18 theme in user documentation
  • [NMS-8365] – Add metric counter for drop counts when the ring buffer is full
  • [NMS-8377] – Applying some organizational changes on the Requisitions UI (Grunt, JSHint, Dist)

Story

Task

  • [NMS-8236] – Move the "vaadin-extender-service" module to opennms code base

Agent Provocateur

I’ve been involved with the monitoring of computer networks for a long time, two decades actually, and I’m seeing an alarming trend. Every new monitoring application seems to be insisting on software agents. Basically, in order to get any value out of the application, you have to go out and install additional software on each server in your network.

Now there was a time when this was necessary. BMC Software made a lot of money with its PATROL series of agents, yet people hated them then as much as they hate agents now. Why? Well, first there was the cost, both in terms of licensing and in continuing to maintain them (upgrades, etc.). Next there was the fact that you had to add software to already overloaded systems. I can remember the first time the company I worked for back then deployed a PATROL agent on an Oracle database. When it was started up it took the database down as it slammed the system with requests. Which leads me to the final point. Quite apart from the security issues that arise with an increase in the number of applications running on a system, the moment the system experiences a problem the blame will fall on the agent.

Despite that, agents still seem to proliferate. In part I think it is political. Downloading and installing agents looks like useful work. “Hey, I’m busy monitoring the network with these here agents”. Also in part, it is laziness. I have never met a programmer who liked working on someone else’s code, so why not come up with a proprietary protocol and write agents to implement it?

But what bothers me the most is that it is so unnecessary. The information you need for monitoring, with the possible exception of Windows, is already there. Modern operating systems (again, with the exception of Windows) ship with an SNMP agent, usually based on Net-SNMP. This is a secure, powerful, extensible agent that has been tried and tested for many years, and it is maintained directly on the server itself. You can use SNMPv3 for secure communications, and the “extend” and “pass” directives make it easy to customize.

Heck, even Windows ships with an extensible SNMP agent, and you can also access data via WMI and PowerShell.

But what about applications? Don’t you need an agent for that?

Not really. Modern applications tend to have an API, usually based on ReST, that can be queried by a management station for important information. Java applications support JMX, databases support ODBC, and when all that fails you can usually use good ol’ HTTP to query the application directly. And the best part is that the application itself can be written to guard against a monitoring query causing undue load on the system.

At OpenNMS we work with a lot of large customers, and they are loath to install new software on all of their servers. Plus, many of our customers have devices that can’t support additional agents, such as routers and switches, and IoT devices such as thermostats and door locks. This is the main reason why the OpenNMS monitoring platform is, by design, agentless.

A critic might point out that OpenNMS does have an agent in the remote poller, as well as in the upcoming Minion feature set. True, but those act as “user agents”, giving OpenNMS a view into networks as if it was a user of those networks. The software is not installed on every server but instead it just needs the same access as a user would have. So, it can be installed on an existing system or on a small system purchased for that purpose, at a minimum just one for each network to be monitored.

While some new IT fields may require agents, most successful solutions try to avoid them. Even in newer fields such as IT automation, the best solutions are agentless. Agents are not necessary, and I strongly suggest that anyone who is asked to install an agent for monitoring question that requirement.

OpenNMS Horizon 17.1.1 Released

Probably the last Horizon 17 version, 17.1.1, has been released. According to TWiO, the next release will be Horizon 18 at the end of the month, with Horizon 19 following at the end of May.

This release is mainly a maintenance release. It does contain one fix I used (NMS-8199), which allows for the state names in the Jira Trouble Ticketing plugin to be configured. This helps a lot if Jira is not in English.

If you are running Horizon 17, this should help it run a bit smoother.

Bug

  • [NMS-7936] – Chart Servlet Outages model exception
  • [NMS-8010] – Groups config rolled back after deleting a user in web UI
  • [NMS-8034] – Adding com.sun.management.jmxremote.authenticate=true on opennms.conf is ignored by the opennms script
  • [NMS-8048] – org.hibernate.exception.SQLGrammarException with ACLs on V17
  • [NMS-8075] – vacuumd-configuration.xml — Database error executing statement
  • [NMS-8113] – Overview about major releases in the release notes
  • [NMS-8153] – Can't modify the Foreign ID on the Requisitions UI when adding a new node
  • [NMS-8159] – When altering the SNMP Trap NBI config, the externally referenced mapping groups are persisted into the main file.
  • [NMS-8161] – Tooltips are not working on the new Requisitions UI
  • [NMS-8165] – OutageDao ACL support is broken causing web UI failures
  • [NMS-8177] – Install guide should use postgres admin for schema updates
  • [NMS-8199] – Allows state names to be configured in the JIRA Ticketer Plugin

Enhancement

  • [NMS-6404] – Allow send events through ReST
  • [NMS-8148] – Create pull request and contribution template to GitHub project

Task

  • [NMS-8151] – Remove all jersey artifacts from lib classpath

Speeding Up OpenNMS Requisition Imports

One thing that differentiates OpenNMS from other applications is the strong focus on tools for provisioning the system. If you want to monitor hundreds of thousands of devices, and ultimately millions, the ordinary methods just don’t work.

Users of OpenNMS often create large requisitions from external database sources, and sometimes it can take a while for the import to complete. Delays can happen if the Foreign Source used for the requisition has a large number of service detectors for services that won’t exist on most devices.

For example, the default Foreign Source for Horizon 17 has about 15 detectors. Of those, only about 4 will exist on networking equipment (ICMP, SSH, HTTP and HTTPS). When scanning, this can add a lot of time per interface. Assuming 2 retries and a 3 second timeout, that would be 9 seconds for each non-existent service. With just 1000 interfaces, that’s 99000 seconds (9 seconds x 11 services x 1000 interfaces) of time just spent waiting, which translates to 27.5 hours.

Now, granted, the importer has multiple threads so the actual wait time will be less, but you can see how this can impact the time needed to import a requisition. This can be reduced significantly by tuning service detection to the bare minimum needed and perhaps adding other services later on a per device basis without scanning.
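The estimate above can be reproduced with a short sketch. The function name and the `threads` parameter are made up for illustration and are not part of OpenNMS. Dividing by the thread count assumes the work spreads perfectly across threads, which is an idealization.

```python
def detector_wait_seconds(interfaces, missing_services,
                          retries=2, timeout_s=3, threads=1):
    """Worst-case time spent waiting on detectors for services that do not exist.

    Hypothetical helper: each missing service costs one initial attempt plus
    `retries` further attempts, and every attempt runs to the full timeout.
    """
    attempts = 1 + retries
    per_service = attempts * timeout_s
    total = per_service * missing_services * interfaces
    # Idealized: assumes the scan parallelizes evenly across threads.
    return total / threads


# 15 default detectors, of which about 4 exist on networking equipment.
total = detector_wait_seconds(interfaces=1000, missing_services=15 - 4)
print(total, "seconds =", total / 3600, "hours")

# Trimming the Foreign Source so only 1 detector misses per interface.
trimmed = detector_wait_seconds(interfaces=1000, missing_services=1)
print(trimmed / 3600, "hours")
```

Under these assumptions, every detector removed from the Foreign Source saves 9 seconds of waiting per interface that lacks the service.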

OpenNMS Horizon 17.1 Released

As some of you may have noticed, Horizon 17.1 has been released. As Horizon 17 will form the basis for Meridian 2016, I’m extremely happy to see how much progress has been made on fixing issues.

Be sure to check out the Release Notes.

Horizon is our rapid release version, and its goal is to get all the cool new features out as soon as they are ready. In this case, right about release we discovered an annoying but easy to fix bug with provisioning, so if you plan to run Horizon 17.1 you should also apply this patch.

Have fun, and we hope you find OpenNMS useful.

Bug

  • [NMS-4108] – Bad suggestions in install guide
  • [NMS-7152] – Enlinkd Topology Plugin fails to create LLDP links for mismatched link port descriptions.
  • [NMS-7820] – snmp-graph.properties.d files with --vertical-label="verticalLabel" in config
  • [NMS-7866] – Incorrect host in Location header when creating resources via ReST
  • [NMS-7910] – config-tester is broken
  • [NMS-7953] – Opsboard and Opspanel use wrong logo
  • [NMS-7966] – Unable to generate eventconf if a MIB (improperly?) uses a TC to define a TC
  • [NMS-7988] – container features.xml still references jersey 1.18 when it should reference 1.19.
  • [NMS-8000] – Topology-UI shows CDP links not correct
  • [NMS-8014] – Backshift graphs show dates in UTC instead of the browser's timezone
  • [NMS-8017] – StrafePing: Unexpected exception while polling PollableService
  • [NMS-8023] – Grafana Box did not work anymore
  • [NMS-8026] – Constant Thread Locking on Enlinkd
  • [NMS-8029] – Wrong use of opennms.web.base-url
  • [NMS-8038] – NRTG with newts – get StringIndexOutOfBoundsException
  • [NMS-8051] – The newts script only works if cassandra runs on localhost
  • [NMS-8054] – AlarmPersisterIT test is empty
  • [NMS-8064] – The 'newts init' script does work when authentication is enabled in Cassandra
  • [NMS-8065] – ReST Regression in Alarms/Events
  • [NMS-8066] – Newts only uses a single thread when writing to Cassandra
  • [NMS-8073] – User Restriction Filters: mapping class for roles to groups does not work
  • [NMS-8074] – The "Remove From Focus" button intermittently fails
  • [NMS-8079] – The OnmsDaoContainer does not update its cache correctly, leading to a NumberFormatException
  • [NMS-8084] – File not found exception for interfaceSTP-box.jsp on SNMP interface page
  • [NMS-8097] – Installation Guide Debian Bug Version 17.0.0
  • [NMS-8100] – Unable to complete creation of scheduled reports
  • [NMS-8103] – NPE when persisting data with Newts
  • [NMS-8104] – init script checkXmlFiles() fails to pick up errors
  • [NMS-8106] – INFO-severity syslog-derived events end up unmatched
  • [NMS-8109] – Memory leak when using the BSFDetector
  • [NMS-8112] – init script "configtest" exit value is always 1
  • [NMS-8116] – Heat map Alarms/Categories do not show all categories
  • [NMS-8119] – WS-MAN has broken ForeignSourceConfigRestService and the requisitions UI doesn't work.
  • [NMS-8123] – Removing ops boards via the configuration UI does not update the table
  • [NMS-8126] – JNA ping code reuses buffer causing inconsistent reads of packet contents
  • [NMS-8133] – Synchronizing a requisition fails
  • [NMS-8147] – Add all the services declared on Collectd and Pollerd configuration as available services on /opennms/rest/foreignSourceConfig/services

Enhancement

  • [NMS-7123] – Expose SNMP4J 2.x noGetBulk and allowSnmpV2cInV1 capabilities
  • [NMS-7978] – Add threshold comments and whitespace changes to match how the OpenNMS web GUI generates XML files
  • [NMS-8005] – Add support for using NRTG via Ajax calls
  • [NMS-8024] – Add support for OSGi-based Ticketing plugins
  • [NMS-8028] – Add event definition for postfix syslog message TLS disabled
  • [NMS-8030] – Improve the SNMP data collection config parsing to give more flexibility to the users
  • [NMS-8042] – set Up severities for RADLAN-MIB.events.xml
  • [NMS-8068] – Add support for marshalling NorthboundAlarms to XML
  • [NMS-8071] – Event definition file for JUNIPER-IVE-MIB
  • [NMS-8120] – Fixed a paragraph in the "Automatic Discovery" provisioning chapter
  • [NMS-8156] – Upgrade Angular Backend for the Requisitions UI

Add a Weather Widget to OpenNMS Home Screen

I was recently at a client site where I met a man named Jeremy Ford. He’s sharp as a knife and even though, at the time, he was new to OpenNMS, he had already hacked a few neat things into the system (open source FTW).

Weathermap on OpenNMS Home Page

One of those was the addition of a weathermap to the OpenNMS home page. He has graciously put the code up on GitHub.

The code is a script that will generate a JSP file in the OpenNMS “includes” directory. All you have to do then is to add a reference to it in the main index.jsp file.

For those of you who don’t know or who have never poked around, under the $OPENNMS_HOME directory should be a directory called jetty-webapps. That is the web root directory for the Jetty servlet container that ships with OpenNMS.

Under that directory you’ll find a subdirectory for opennms. When you surf to http://[my OpenNMS Server]:8980/opennms that is the directory you are visiting. In it is an index.jsp file that serves as the main page.

If you are familiar with HTML, the JSP file is very similar. It can contain references to Java code, but a lot of it is straight HTML. The file is kept simple on purpose, with each of the three columns on the main page indicated by comments. The part you will need to change is the third column:

<!-- Right Column -->
        <div class="col-md-3" id="index-contentright">
                <!-- weather box -->
                <jsp:include page="/includes/weather.jsp" flush="false" />

Feel free to look around. If you ever wanted to rearrange the OpenNMS Home page, this is a good place to start.

Now, I used to like poking around with these files since they would update automatically, but later versions of OpenNMS (which contain later versions of Jetty) seem to require a restart. If you get an error, restart OpenNMS and see if it goes away.

Now the weather.jsp file gets generated by Jeremy’s Python script. In order to get that to work you’ll need to do two things. The most important is to get an API key from Weather Underground. It is a pretty easy process, but be aware that you can only do 500 queries a day without paying. The second thing you’ll need to do is edit the three URLs in the script and change the location. It is currently set to “CA/San_Francisco” but I was able to change it to “NC/Pittsboro” and it “just worked”.

Finally, you’ll need to set the script up to run via cron. I’m not sure how frequently Weather Underground updates the data, but a 10 minute interval seems to work well. That’s only 144 queries a day, so you could easily double it and still be within your limit.

[IMPORTANT UPDATE: Jeremy pointed out that the script actually does three queries, not just one, so instead of doing 144 queries a day, it’s 432. Still leaves some room with 10 minute queries but you don’t want to increase the frequency too much.]
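To see how much headroom a given interval leaves, here is a rough sketch of the query budget. The function names are invented for this example, and it assumes the runs are evenly spaced through the day and that each run makes exactly three queries, as Jeremy described.

```python
import math

MINUTES_PER_DAY = 24 * 60


def queries_per_day(interval_minutes, queries_per_run=3):
    """Queries made in one day when the script runs every `interval_minutes`."""
    runs = MINUTES_PER_DAY // interval_minutes
    return runs * queries_per_run


def shortest_safe_interval(daily_limit=500, queries_per_run=3):
    """Smallest whole-minute interval that stays within the daily limit."""
    max_runs = daily_limit // queries_per_run
    return math.ceil(MINUTES_PER_DAY / max_runs)


print(queries_per_day(10))        # three queries per run, every 10 minutes
print(queries_per_day(5))         # doubling the frequency
print(shortest_safe_interval())   # tightest interval under the free limit
```

With these numbers a 10 minute interval uses 432 of the 500 free queries, while a 5 minute interval would need 864, well over the limit.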

Thanks to Jeremy for taking the time to share this. Remember, once you get it working, if you upgrade OpenNMS you’ll need to edit index.jsp and add it back, but that should be the only change needed.

OpenNMS at Scale

So, yes, the gang from OpenNMS will be at the SCaLE conference this weekend (I will not be there, unfortunately, due to a self-imposed conference hiatus this year). It should be a great time, and we are happy to be a Gold Sponsor.

But this post is not about that. This is about how Horizon 17 and data collection can scale. You can come by the booth at SCaLE and learn more about it, but here is the overview.

When OpenNMS first started, we leveraged the great application RRDTool for storing performance data. When we discovered a Java port called JRobin, OpenNMS was modified to support that storage strategy as well.

Using a Round Robin database has a number of advantages. First, it’s compact. Once the file containing the RRD database is created, it never grows. Second, we used RRDTool to also graph the data.

However, there were problems. Many users had a need to store the raw collected data. RRDTool uses consolidation functions to store a time-series average. But the biggest issue was that writing lots of files required really fast hard drives. The more data you wanted to store, the greater your investment in disk arrays. Ultimately, you would hit a wall, which would require you to either reduce your data collection or partition out the data across multiple systems.
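The two properties described above, a fixed size and consolidation into averages, can be illustrated with a toy sketch. This is not RRDTool’s file format or API, just a simplified model; the class name and its parameters are made up for the example.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive that keeps the average of every `steps` samples."""

    def __init__(self, rows, steps):
        # A deque with maxlen silently drops the oldest row once it is full,
        # so the archive never grows beyond `rows` entries.
        self.rows = deque(maxlen=rows)
        self.steps = steps
        self._pending = []

    def update(self, value):
        """Buffer a raw sample and consolidate when enough have arrived."""
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending.clear()  # the raw samples are discarded here


archive = RoundRobinArchive(rows=3, steps=2)
for sample in [10, 20, 30, 50, 0, 100, 7, 9]:
    archive.update(sample)

print(list(archive.rows))  # only the three most recent averages remain
```

Eight raw samples went in, but only three averaged rows are kept, which is why the raw data cannot be recovered later.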

No more. With Horizon 17 OpenNMS fully supports a time-series database called Newts. Newts is built on Cassandra, and even a small Cassandra cluster can handle tens of thousands of inserts a second. Need more performance? Just add more nodes. Works across geographically distributed systems as well, so you get built-in high availability (something that was very difficult with RRDTool).

Just before Christmas I got to visit a customer on the Eastern Shore of Maryland. You wouldn’t think that location would be a hotbed of technical excellence, but it is rare that I get to work with such a quick team.

They brought me up for a “Getting to Know You” project. This is a two day engagement where we get to kick the tires on OpenNMS to see if it is a good fit. They had been using Zenoss Core (the free version) and they hit a wall. The features they wanted were all in the “enterprise” paid version and the free version just wouldn’t meet their needs. OpenNMS did, and being truly open source it fit their philosophy (and budget) much better.

This was a fun trip for me because they had already done most of the work. They had OpenNMS installed and monitoring their network, and they just needed me to help out on some interesting use cases.

One of their issues was the need to store a lot of performance data, and since I was eager to play with the Newts integration we decided to test it out.

In order to enable Newts, first you need a Cassandra cluster. It turns out that ScyllaDB works as well (more on that a bit later). If you are looking at the Newts website you can ignore the instructions on installing it as it is built directly into OpenNMS.

Another thing built in to OpenNMS is a new graphing library called Backshift. Since OpenNMS relied on RRDTool for graphing, a new data visualization tool was needed. Backshift leverages the RRDTool graphing syntax so your pre-defined graphs will work automatically. Note that some options, such as CANVAS colors, have not been implemented yet.

To switch to Newts, in the opennms.properties file you’ll find a section:

###### Time Series Strategy ####
# Use this property to set the strategy used to persist and retrieve time series metrics:
# Supported values are:
#   rrd (default)
#   newts

org.opennms.timeseries.strategy=newts

Note: “rrd” strategy can refer to either JRobin or RRDTool, with JRobin as the default. This is set in rrd-configuration.properties.

The next section determines what will render the graphs.

###### Graphing #####
# Use this property to set the graph rendering engine type.  If set to 'auto', attempt
# to choose the appropriate backend depending on org.opennms.timeseries.strategy above.
# Supported values are:
#   auto (default)
#   png
#   placeholder
#   backshift
org.opennms.web.graphs.engine=auto

If you are using Newts, the “auto” setting will utilize Backshift but here is where you could set Backshift as the renderer even if you want to use an RRD strategy. You should try it out. It’s cool.

Finally, we come to the settings for Newts:

###### Newts #####
# Use these properties to configure persistence using Newts
# Note that Newts must be enabled using the 'org.opennms.timeseries.strategy' property
# for these to take effect.
#
org.opennms.newts.config.hostname=10.110.4.30,10.110.4.32
#org.opennms.newts.config.keyspace=newts

There are a lot of settings and most of those are described in the documentation, but in this case I wanted to demonstrate that you can point OpenNMS to multiple Cassandra instances. You can also set different keyspace names which allows multiple instances of OpenNMS to talk to the same Cassandra cluster and not share data.
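If you want to double-check what a given opennms.properties will actually do, a few lines of Python can read the values back. This is only a minimal sketch: the parser handles plain key=value lines and "#" comments, not the full Java properties format, and the function names are made up for illustration. The fallback values ("rrd" for the strategy, "newts" for the keyspace) are assumptions taken from the comments in the snippets above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments.

    This is a simplified reader, not a full Java .properties parser.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def newts_summary(props):
    """Summarize the time series settings discussed in this post."""
    hostnames = props.get("org.opennms.newts.config.hostname", "")
    return {
        # 'rrd' is listed as the default strategy in opennms.properties
        "strategy": props.get("org.opennms.timeseries.strategy", "rrd"),
        # the hostname property takes a comma-separated list of Cassandra nodes
        "hosts": [h.strip() for h in hostnames.split(",") if h.strip()],
        # assumed default, based on the commented-out keyspace line
        "keyspace": props.get("org.opennms.newts.config.keyspace", "newts"),
    }


SAMPLE = """\
org.opennms.timeseries.strategy=newts
org.opennms.web.graphs.engine=auto
org.opennms.newts.config.hostname=10.110.4.30,10.110.4.32
#org.opennms.newts.config.keyspace=newts
"""

if __name__ == "__main__":
    print(newts_summary(parse_properties(SAMPLE)))
```

Pointing two OpenNMS instances at the same cluster with different keyspace values would show up here as the same host list with two different keyspace names.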

From the “fine” documentation, they also recommend that you store the data based on the foreign source by setting this variable:

org.opennms.rrd.storeByForeignSource=true

I would recommend this if you are using provisiond and requisitions. If you are currently doing auto-discovery, then it may be better to reference it by nodeid, which is the default.

I want to point out two other values that will need to be increased from the defaults: org.opennms.newts.config.ring_buffer_size and org.opennms.newts.config.cache.max_entries. For this system they were both set to 1048576. The ring buffer is especially important since should it fill up, samples will be discarded.

So, how did it go? Well, after fixing a bug with the ring buffer, everything went well. That bug is one reason that features like this aren’t immediately included in Meridian. Luckily we were working with a client who was willing to let us investigate and correct the issue. By the time it hits Meridian 2016, it will be completely ready for production.

If you enable the OpenNMS-JVM service on your OpenNMS node, the system will automatically collect Newts performance data (assuming Newts is enabled). OpenNMS will also collect performance data from the Cassandra cluster, including both general Cassandra metrics and Newts-specific ones.

This system is connected to a two node Cassandra cluster and managing 3.8K inserts/sec.

Newts Samples Inserted

If I’m doing the math correctly, since we collect values once every 300 seconds (5 minutes) by default, that’s 1.15 million data points, and the system isn’t even working hard.
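For anyone who wants to check that math, here is the arithmetic as a small Python sketch. It assumes the insert rate is exactly 3,800 per second (the graph value is rounded) and that every data point is collected once per 300 second interval, which lands at about 1.14 million, in line with the figure above.

```python
INSERTS_PER_SECOND = 3800          # rounded rate read off the graph
COLLECTION_INTERVAL_SECONDS = 300  # default 5 minute collection interval


def data_points_per_cycle(inserts_per_second, interval_seconds):
    """Distinct data points collected if each one is inserted once per interval."""
    return inserts_per_second * interval_seconds


if __name__ == "__main__":
    total = data_points_per_cycle(INSERTS_PER_SECOND, COLLECTION_INTERVAL_SECONDS)
    print(f"{total:,} data points per collection cycle")
```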

OpenNMS will also collect on ring buffer information, and I took a screen shot to demonstrate Backshift, which displays the data point as you mouse over it.

Newts Ring Buffer

Horizon 17 ships with a load testing program. For this cluster:

[root@nms stress]# java -jar target/newts-stress-jar-with-dependencies.jar INSERT -B 16 -n 32 -r 100 -m 1 -H cluster
-- Meters ----------------------------------------------------------------------
org.opennms.newts.stress.InsertDispatcher.samples
             count = 10512100
         mean rate = 51989.68 events/second
     1-minute rate = 51906.38 events/second
     5-minute rate = 38806.02 events/second
    15-minute rate = 31232.98 events/second

So there is plenty of room to grow. Need something faster? Just add more nodes. Or, you can switch to ScyllaDB, which is a reimplementation of Cassandra written in C++. When run against a four node ScyllaDB cluster the results were:

[root@nms stress]# java -jar target/newts-stress-jar-with-dependencies.jar INSERT -B 16 -n 32 -r 100 -m 1 -H cluster
-- Meters ----------------------------------------------------------------------
org.opennms.newts.stress.InsertDispatcher.samples
             count = 10512100
         mean rate = 89073.32 events/second
     1-minute rate = 88048.48 events/second
     5-minute rate = 85217.92 events/second
    15-minute rate = 84110.52 events/second

Unfortunately I do not have statistics for a four node Cassandra cluster to compare it directly with ScyllaDB.
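To put the two stress runs in perspective, this short Python sketch compares the mean rates reported above with each other and with the 3.8K inserts/sec the production system is handling. Keep in mind it compares a two node Cassandra cluster with a four node ScyllaDB cluster, so the ratio says nothing about per-node performance. The variable and function names are just for illustration.

```python
PRODUCTION_RATE = 3800.0        # inserts/sec on the live system (rounded)
CASSANDRA_2_NODE = 51989.68     # mean rate from the first stress run
SCYLLADB_4_NODE = 89073.32      # mean rate from the second stress run


def ratio(numerator, denominator):
    """How many times larger one rate is than another."""
    return numerator / denominator


if __name__ == "__main__":
    headroom = ratio(CASSANDRA_2_NODE, PRODUCTION_RATE)
    speedup = ratio(SCYLLADB_4_NODE, CASSANDRA_2_NODE)
    print(f"Cassandra stress rate vs. production load: {headroom:.1f}x")
    print(f"ScyllaDB (4 nodes) vs. Cassandra (2 nodes): {speedup:.2f}x")
```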

Of course the Newts data directly fits in with the OpenNMS Grafana integration.

Grafana Inserts per Second

Which brings me to one downside of this storage strategy. It’s fast, which means it isn’t compact. On this system the disk space is growing at about 4GB/day, which would be about 1.5TB/year.

Grafana Disk Space

If you consider that the data is replicated across Cassandra nodes, you would need that amount of space on each one. Since multi-terabyte drives are pretty common, this shouldn’t be a problem, but be sure to ask yourself if all the data you are collecting is really necessary. Just because you can collect the data doesn’t mean you should.
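The disk estimate is easy to rerun with your own numbers. The Python sketch below assumes a steady 4GB/day growth rate, a 365 day year and decimal units (1TB = 1000GB), and it takes at face value that each node holds a full copy of the data. The function name is made up for this example.

```python
def yearly_growth_tb(gb_per_day, days=365):
    """Project yearly disk growth in TB from a steady daily rate in GB."""
    return gb_per_day * days / 1000


if __name__ == "__main__":
    per_node = yearly_growth_tb(4)
    nodes = 2  # the cluster in this post
    print(f"Per node:  {per_node:.2f} TB/year")
    print(f"All nodes: {per_node * nodes:.2f} TB/year")
```

A real cluster will not grow this evenly, since retention settings and compaction both affect the on-disk size, so treat the result as a rough planning number.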

OpenNMS is finally to the point where the storing of performance data is no longer an issue. You are more likely to hit limits with the collector, which in part is going to be driven by the speed of the network. I’ve been in large data centers with hundreds of thousands of interfaces all with sub-millisecond latency. On that network, OpenNMS could collect on hundreds of millions of data points. On a network with lots of remote equipment, however, timeouts and delays will impact how much data OpenNMS could collect.

But with a little creativity, even that goes away. Think about it – with a common, decentralized data storage system like Cassandra, you could have multiple OpenNMS instances all talking to the same data store. If you have them share a common database, you can use collectd filters to spread data collection out over any number of machines. While this would take planning, it is doable today.

What about tomorrow? Well, Horizon 18 will introduce the OpenNMS Minion code. Minions will allow OpenNMS to scale horizontally and can be managed directly from OpenNMS – no configuration tricks needed. This will truly position OpenNMS for the Internet of Things.

UPDATE: Alejandro pointed out the following:

There is a property in OpenNMS for Newts that doesn’t appear in opennms.properties: org.opennms.newts.query.heartbeat, which expects a duration in milliseconds. By default it is 450000 (i.e. 1.5 x 5min) and it is used when no heartbeat is specified. It should generally be 1.5x your biggest collection interval.

If you were using 10 minute polls, set it to 15 min (1.5 x 10min), and then you will see graphs. To do this, add the following to opennms.properties and restart OpenNMS:

org.opennms.newts.query.heartbeat=900000
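The 1.5x rule is simple enough to work out by hand, but here it is as a Python sketch in case you have an unusual collection interval. The helper name is hypothetical; only the property name and the 1.5x guideline come from Alejandro’s note.

```python
def heartbeat_ms(collection_interval_minutes):
    """Return 1.5x the collection interval, expressed in milliseconds."""
    interval_ms = collection_interval_minutes * 60 * 1000
    return interval_ms * 3 // 2


if __name__ == "__main__":
    for minutes in (5, 10):
        print(f"org.opennms.newts.query.heartbeat={heartbeat_ms(minutes)}"
              f"  # for {minutes} minute collection")
```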

Avoiding the Sad Graph of Software Death

Seth recently sent me to an interesting article by Gregory Brown discussing a “death spiral” often faced by software projects when issues and feature requests start to outpace the ability to close them.

Sad Graph of Death

Now Seth is pretty much in charge of managing our Jira instance, which is key to managing the progress of OpenNMS software development. He decided to look at our record:

OpenNMS Issues Graph

[UPDATE: Logged into Jira to get a lot more issues on the graph]

Not bad, not bad at all.

A lot of our ability to keep up with issues comes from our project’s investment in using the tool. It is very easy to let things slide, resulting in the first graph above and possibly causing a project to declare “issue bankruptcy”. Since all of this information is public for OpenNMS, it is important to keep it up to date, and while we never have enough time for all the things we need to do, we make time for this.

I think it speaks volumes for Seth and the rest of the team that OpenNMS issues are managed so well. In part it comes naturally from “the open source way” since projects should be as transparent as possible, and managing issues is a key part of that.

Annual LinuxQuestions Poll

Just a quick note that the annual LinuxQuestions “Member’s Choice” poll is out. While I don’t believe OpenNMS is known to many of the members of that site, if you feel like showing it a little love, please register and vote.

http://www.linuxquestions.org/questions/2015-linuxquestions-org-members-choice-awards-117/network-monitoring-application-of-the-year-4175562720/

Many thanks to Jeremy Garcia for maintaining that site and including OpenNMS.