Borealis Monitoring

The monitoring system implemented for Borealis is a custom configured installation of Nagios Core, working with NRPE. Nagios monitoring behaves according to objects defined in configuration files, all of which have copies in SuperDARN Canada’s Nagios repository.

Nagios

Nagios core runs as a service under apache2. It is easy to install, but a little tricky to configure for specific purposes. The program executes external plugins that obtain information from the system, and then displays the output on locally hosted webpage. Locally, where and which plugins are executed is determined by host and service objects specified in configuration files. This is also done with monitoring on remote machines, with one exception.

The remote server runs plugins using an a service called NRPE (Nagios Remote Plugin Executor). This process runs on TCP port 5666 by default, and sends plugin output over the network to the Nagios service running on the central host. The central host accepts this output through a plugin called check_nrpe, usage specified in the commands.cfg config file. This remote host output is then displayed normally alongside the local services.

In our configuration, remote hosts send information on services continuously, allowing connections from hosts specified in their nrpe.cfg file. To operate properly, both the hostname of the remote host, and that of the central Nagios host, must be included on this line.

The last key difference between NRPE and Nagios Core is that commands to be executed on the remote host are defined in that host’s nrpe.cfg file. Whereas commands executed by Nagios Core are defined in the commands.cfg by default.

Installation

Detailed instructions for installing Nagios Core on several operating systems can be found on Nagios’ website.

After installing, simply replace the configuration files with those found in this repository.

Installation of NRPE is similarly simple. See our Nagios github page for detailed installation steps: https://github.com/SuperDARNCanada/Nagios/tree/develop. An example on how we use Nagios to monitor Borealis is shown here: https://github.com/SuperDARNCanada/Nagios/tree/develop

Downtime Monitoring

A useful metric for measuring the reliability of the system is to quantify the amount of downtime (or uptime), that is, the percentage of time that the radar is non-operating (or operating). A companion repository, borealis-data-utils, contains a script called borealis_gaps.py which reads through a directory of Borealis HDF5 files and reports all downtimes in a given date range. The script takes several command line options, allowing the user to specify the date range, type of file to search for, minimum downtime duration to report, number of processes to use, and output format of the report. Follow the links above for more details.