Archive for October, 2010
I’m a bit of a security nut. I encrypt my hard drives on laptops, I prefer using mail with GPG, I set up all my network services to work over SSL, etc.
But one area I’ve always been a bit slack on is updating software. Sure, I get graphical reminders on my desktop, so that’s a no-brainer, but what about the 30 or so servers we use at The OpenNMS Group?
Now, granted, most of the security issues that arise from the operating systems I use tend to be local exploits, and I trust anyone who has local access to any of my machines, but still, I thought it would be nice to have some sort of notice that one of my machines needed to update its software.
We use three main operating systems: CentOS, Debian and OS X. All three of them have a package management system, each with its own set of commands. I figured I could write a short script for each one that would tell me if there were packages that needed updating.
For CentOS, I used “yum check-update”, specifically:
# cat crons/update.sh
#!/bin/bash
yum check-update 2>/dev/null | grep -v Load | grep -v \* | grep -v '^$' \
  | wc -l > /root/var/updatestatus.txt
This should output the number of packages needing to be updated, and the “grep -v” statements remove some of the formatting lines. For those of you wondering about the “updatestatus.txt” bit, read on.
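If all you need is a yes/no answer, the exit status of “yum check-update” can be used instead of counting lines: it exits with 100 when updates are available, 0 when there are none, and 1 on error. The following is only a sketch of that idea, and the helper name “yum_status_to_flag” is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: translate a "yum check-update" exit status into the
# 0/1 flag stored in the status file.
yum_status_to_flag() {
  case "$1" in
    0)   echo 0 ;;    # no updates available
    100) echo 1 ;;    # updates available
    *)   return 1 ;;  # yum itself failed; print nothing
  esac
}

# Usage sketch (commented out so that nothing runs by accident):
#   yum check-update >/dev/null 2>&1
#   flag=$(yum_status_to_flag "$?") && echo "$flag" > /root/var/updatestatus.txt
```

Because nothing is written when yum fails, the previous status stays in place rather than being replaced by a misleading zero.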
For Debian, I used “apt-get upgrade”, but there were a few caveats:
# cat crons/update.sh
#!/bin/bash
# Refresh the package lists, then simulate (-s) an upgrade.
apt-get update > /dev/null 2>&1
# Anchor with ^ so that "10 upgraded, ..." is not mistaken for "0 upgraded, ...".
ISZERO=`apt-get -s upgrade \
  | grep '^0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded' | wc -l`
if [ "$ISZERO" -eq 1 ] ; then
  echo 0 > /root/var/updatestatus.txt
  exit 0
fi
echo 1 > /root/var/updatestatus.txt
The first thing I needed to do was run an “apt-get update” followed by an “apt-get -s upgrade” (the -s just simulates the upgrade, as I didn’t want to actually perform it without human review). Unlike the “yum” script above, I didn’t take the time to return the number of packages that should be upgraded; instead I just return a “1” if that number isn’t zero.
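Getting an actual count out of apt-get is not much more work, since the summary line of the simulated upgrade starts with the number of packages to be upgraded. Here is a minimal sketch, assuming English (C locale) output; the function name “count_apt_upgrades” is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: read "apt-get -s upgrade" output on stdin and print
# the number at the start of the "N upgraded, ..." summary line.
count_apt_upgrades() {
  sed -n 's/^\([0-9][0-9]*\) upgraded, .*/\1/p' | head -n 1
}

# Usage sketch (commented out so that nothing runs by accident):
#   LANG=C apt-get -s upgrade | count_apt_upgrades > /root/var/updatestatus.txt
```

If the summary line is missing, the function prints nothing, so a caller may want to treat empty output as an error rather than as zero.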
Finally, for OS X:
$ cat crons/update.sh
#!/bin/bash
ISZERO=`/usr/sbin/softwareupdate -l 2>/dev/null | grep \* | wc -l`
echo $ISZERO > /Users/admin/var/updatestatus.txt
Here I used the “softwareupdate” command to list the packages that need updating, removing formatting lines.
So now I had a single script on all three operating systems that would return a non-zero number when updates are available.
The question is how to run it. Some monitoring systems might use ssh. But that presents maintenance, security, and performance issues. You first have to configure keys for each of your systems so that the monitoring server can access the remote system to run the command. This in turn opens up a security issue as now there is a new possibility for bad guys to get local access to the machine (and remember, 99% of the security updates you install are related to local exploits). Finally, doing a lot of ssh commands at scale puts an incredible load on the monitoring server.
I prefer to use Net-SNMP. It’s included in all three operating systems, which means the configuration is basically the same, and it is secure and high performance.
In order to have Net-SNMP run my script I used the “extend” directive in the “/etc/snmp/snmpd.conf” file:
extend update /root/crons/update.sh
This will add a table entry called “update” to the “nsExtendTable”. The OID will include “117.112.100.97.116.101”, which are the decimal ASCII values for “u”, “p”, “d”, “a”, “t” and “e”.
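The index part of the OID is just the entry name encoded as its length followed by the decimal ASCII value of each character. A small sketch to compute it for any name (the function name “oid_index” is made up, and it assumes a plain ASCII name):

```shell
#!/bin/bash
# Hypothetical helper: print the OID index for an extend entry name, i.e.
# the string length followed by the decimal value of each byte.
oid_index() (
  # od prints the bytes as unsigned decimals; tr and sed turn the
  # whitespace-separated list into a dotted one.
  codes=$(printf '%s' "$1" | od -An -v -tu1 | tr -s ' \n' '.' | sed 's/^\.//; s/\.$//')
  echo "${#1}.${codes}"
)

oid_index update
```

The leading number is the length of the name, which is why a 6 shows up in front of the character codes in the numeric OIDs for this entry.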
Now, these commands can take some time to run, and I didn’t want someone issuing a lot of SNMP GET requests to be able to DoS my system, so I looked at this value:
NET-SNMP-EXTEND-MIB::nsExtendCacheTime."update" = INTEGER: 5
which in numbers is:
.1.3.6.1.4.1.8072.1.3.2.2.1.5.6.117.112.100.97.116.101 = INTEGER: 5
and decided to increase it. By default, Net-SNMP will cache commands for five seconds. I thought it’d be cooler to cache it for, like, five minutes.
Unfortunately, you can’t set the cache time in the configuration, so I set up a read-write user:
com2sec rwUser 127.0.0.1 WonceRmany
That user only works from localhost, and with it I could set the value:
snmpset -v1 -c WonceRmany localhost .1.3.6.1.4.1.8072.1.3.2.2.1.5.6.117.112.100.97.116.101 i 300
This works fine until you restart snmpd; then the value gets reset (which is why I’d love to have it in the config itself). I figured I could always edit the “snmpd” init script to run the set for me, but another problem arose.
These checks can take a long time to execute: 30 seconds to a minute or more. Running them via an SNMP GET will always cause the GET to time out. Plus, on some machines, running these commands requires special privileges, so instead of mucking around with Net-SNMP permissions, I opted for a simpler method.
I set up a cron to run every hour that would run this command and dump the output to /root/var/updatestatus.txt (you could put it anywhere you like, of course). Then I just had Net-SNMP check that file:
extend update /bin/cat /root/var/updatestatus.txt
An example cron for a CentOS machine:
10 * * * * /root/crons/update.sh
I decided on once an hour because I could stagger when the cron ran across my machines so I wouldn’t have a bunch happening at once.
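If you don’t want to pick the minutes by hand, one way to stagger the jobs is to derive the minute from the hostname. This is only a sketch of the idea, not what I actually run; the function name “cron_minute” and the hostname in the example are made up:

```shell
#!/bin/bash
# Hypothetical helper: map a hostname to a stable minute between 0 and 59
# by taking its CRC checksum modulo 60.
cron_minute() (
  sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( sum % 60 ))
)

# Example: print a crontab line for one (made-up) host.
echo "$(cron_minute web1.example.com) * * * * /root/crons/update.sh"
```

The same hostname always maps to the same minute, so the schedule stays put from one run to the next, although two hosts can still land on the same minute.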
This works surprisingly well, sometimes too well, in fact, since there can be several one- or two-package updates each week. It’s also funny to watch the updates come in, since each system seems to have a different preferred mirror and it can take hours before the different mirrors see the update.
The only downside to this method is that since the cron only runs once an hour, once the update is completed the service will remain down for up to an hour before coming back up. You can prevent this by manually running the update script after doing the update, if you so choose, or just dumping a zero into the update file.
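Whether the cron job or a manual reset writes the file, it can go through one small helper that writes to a temporary file and then renames it, so that a poll never reads a half-written file. A minimal sketch, assuming “mktemp” is available; the function name “write_status” is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: write a value to the status file via a temporary
# file in the same directory, then rename it into place.
write_status() (
  tmp=$(mktemp "$2.XXXXXX") || exit 1
  # mktemp creates the file with mode 0600; open it up so snmpd can read it
  # even when it does not run as root.
  echo "$1" > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$2"
)

# Usage sketch: reset the status by hand after updating a machine.
#   write_status 0 /root/var/updatestatus.txt
```

The rename only replaces the old file in a single step when the temporary file is on the same filesystem, which is why it is created next to the target.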
Bringing it all together, I now have the following:
NET-SNMP-EXTEND-MIB::nsExtendCommand."update" = STRING: /bin/cat
NET-SNMP-EXTEND-MIB::nsExtendArgs."update" = STRING: /root/var/updatestatus.txt
NET-SNMP-EXTEND-MIB::nsExtendInput."update" = STRING:
NET-SNMP-EXTEND-MIB::nsExtendCacheTime."update" = INTEGER: 5
NET-SNMP-EXTEND-MIB::nsExtendExecType."update" = INTEGER: exec(1)
NET-SNMP-EXTEND-MIB::nsExtendRunType."update" = INTEGER: run-on-read(1)
NET-SNMP-EXTEND-MIB::nsExtendStorage."update" = INTEGER: permanent(4)
NET-SNMP-EXTEND-MIB::nsExtendStatus."update" = INTEGER: active(1)
NET-SNMP-EXTEND-MIB::nsExtendOutput1Line."update" = STRING: 0
NET-SNMP-EXTEND-MIB::nsExtendOutputFull."update" = STRING: 0
NET-SNMP-EXTEND-MIB::nsExtendOutNumLines."update" = INTEGER: 1
NET-SNMP-EXTEND-MIB::nsExtendResult."update" = INTEGER: 0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."update".1 = STRING: 0
or with numbers:
.1.3.6.1.4.1.8072.1.3.2.2.1.2.6.117.112.100.97.116.101 = STRING: /bin/cat
.1.3.6.1.4.1.8072.1.3.2.2.1.3.6.117.112.100.97.116.101 = STRING: /root/var/updatestatus.txt
.1.3.6.1.4.1.8072.1.3.2.2.1.4.6.117.112.100.97.116.101 = STRING:
.1.3.6.1.4.1.8072.1.3.2.2.1.5.6.117.112.100.97.116.101 = INTEGER: 5
.1.3.6.1.4.1.8072.1.3.2.2.1.6.6.117.112.100.97.116.101 = INTEGER: exec(1)
.1.3.6.1.4.1.8072.1.3.2.2.1.7.6.117.112.100.97.116.101 = INTEGER: run-on-read(1)
.1.3.6.1.4.1.8072.1.3.2.2.1.20.6.117.112.100.97.116.101 = INTEGER: permanent(4)
.1.3.6.1.4.1.8072.1.3.2.2.1.21.6.117.112.100.97.116.101 = INTEGER: active(1)
.1.3.6.1.4.1.8072.1.3.2.3.1.1.6.117.112.100.97.116.101 = STRING: 0
.1.3.6.1.4.1.8072.1.3.2.3.1.2.6.117.112.100.97.116.101 = STRING: 0
.1.3.6.1.4.1.8072.1.3.2.3.1.3.6.117.112.100.97.116.101 = INTEGER: 1
.1.3.6.1.4.1.8072.1.3.2.3.1.4.6.117.112.100.97.116.101 = INTEGER: 0
.1.3.6.1.4.1.8072.1.3.2.4.1.2.6.117.112.100.97.116.101.1 = STRING: 0
So now I just configure the OpenNMS system to test for the existence of “.1.3.6.1.4.1.8072.1.3.2.4.1.2.6.117.112.100.97.116.101.1” in either capsd:
<protocol-plugin protocol="Update" class-name="org.opennms.netmgt.capsd.plugins.SnmpPlugin" scan="on">
  <property key="vbname" value=".1.3.6.1.4.1.8072.1.3.2.4.1.2.6.117.112.100.97.116.101.1" />
  <property key="timeout" value="3000" />
  <property key="retry" value="1" />
</protocol-plugin>
Or in the default foreign source in the provisioner:
And then I add it to the poller configuration:
<package name="custom">
  <filter>IPADDR != '0.0.0.0'</filter>
  <rrd step="300">
    <rra xmlns="">RRA:AVERAGE:0.5:1:2016</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:12:1488</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:288:366</rra>
    <rra xmlns="">RRA:MAX:0.5:288:366</rra>
    <rra xmlns="">RRA:MIN:0.5:288:366</rra>
  </rrd>
  <service name="Update" interval="300000" user-defined="true" status="on">
    <parameter key="retry" value="1"/>
    <parameter key="timeout" value="3000"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.8072.1.3.2.4.1.2.6.117.112.100.97.116.101.1"/>
    <parameter key="operator" value="&lt;"/>
    <parameter key="operand" value="1"/>
  </service>
  <downtime begin="0" end="300000" interval="30000"/>
  <downtime begin="300000" end="43200000" interval="300000"/>
  <downtime begin="43200000" interval="600000"/>
</package>
and don’t forget the “monitor” line at the bottom:
<monitor service="Update" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
Note that I stuck this monitor in its own package. This is a best practice when using OpenNMS. If you place all of your customizations into separate packages and leave the default configurations in place, upgrades become much easier.
I hope at least one of my three readers found this useful, and I hope it gives you some other ideas on integrating your IT information under one roof with OpenNMS.
Having started out in the telecom industry (back when it was different from datacom – yes, I know, I’m old) I was always a fan of Fluke test gear, so I find it kind of amusing that they’ve decided to pick on OpenNMS to promote their latest commercial network management tool, OptiView.
I didn’t even know Fluke was in the network management business, so I was surprised when someone sent me a link to their website in which they feature part of an OpenNMS screenshot as their “wrong way” example.
I’m pretty certain they just grabbed this image off of the web because the text of the page could read as an advertisement for OpenNMS. They obviously didn’t do their homework. I thought it would be a fun exercise to examine their claims in the context of our project.
Lack of Proper Perspective
In this paragraph they state “Central polling misses performance from the user’s perspective”. This is true, and it is why OpenNMS has a remote poller that performs synthetic transactions from the point of view of remote end users and integrates with most popular mapping software so that engineers can easily pinpoint problems. This is in use at nearly 3000 sites worldwide for Papa Johns Pizza – it would be interesting to know if Fluke has an install on that scale, and if so, how much it would cost.
A False Sense of Security
They lost me a little on this one, but they seem to be saying “our monitoring is better than your monitoring”. OpenNMS has multiple levels of monitors, from simple ping/port checks up to capturing the full user experience with the Page Sequence Monitor and the Mail Transport Monitor. When OpenNMS polls for service assurance, it is, for all practical purposes, a user of network services and it reports back what a user would experience.
Lack Troubleshooting and In-depth Analysis
This section states the need for root cause analysis and “packet-level ‘on-the-wire’ visibility”.
Well, as for root cause, OpenNMS duplicates the functionality of such classic management products as Netcool/Omnibus and Netcool/Impact, so I’m pretty certain it can address whatever it is OptiView claims to do.
As for packet-level inspection, this is one area that OpenNMS does not cover. One of the reasons is that with today’s large and distributed networks, it is not feasible to monitor every single packet on the network. What OpenNMS does do is indicate areas where there are problems, and then engineers can take their packet sniffer and investigate further. We often use Wireshark in diagnosing customer issues, once OpenNMS determines the part of the network needing attention.
Risks of an Incomplete Picture
This list of bullet points is pretty valid, but the assumption that tools like OpenNMS provide “an incomplete picture” is patently false. I tried to download their “NMS Risks & Shortcomings” white paper but got an error message “This area of the site is temporarily unavailable.” Heh.
This is typical FUD from a commercial company trying, and failing, to differentiate itself from other underpowered and overpriced commercial software tools.
But I must say I’m somewhat flattered by this, since our goal with the OpenNMS project is to make every decision about a network management solution include the question “Why aren’t you using OpenNMS?”
I’m hoping that everyone who might find this site asks themselves the same thing.