---
title: "Problem detection and alerting"
date: 2020-08-07T18:00:00+02:00
---
Everything is distributed, automated and runs in perfect harmony with a common goal: protect your data. But bad things
happen, and rarely when you expect them. This is why you need to watch service states and send a notification when
something goes wrong. Monitoring systems are well known in the enterprise world. For our use case, we don't need to
deploy a complex infrastructure to check a couple of hosts. For this reason, I chose the good old [Nagios
Core](https://www.nagios.org/projects/nagios-core/). It even provides a web interface for humans like us.
# How it works
There are two types of checks:
- **host**: checks whether a host is alive
- **service**: checks whether a service on a host is healthy
To check if a host is available, the simplest implementation is to use ping.
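The idea of a ping-based host check can be sketched in a few lines of shell. This is only an illustration, not the plugin Nagios actually runs (the monitoring-plugins package ships its own `check_ping`). The function name `host_alive` and the `PING_CMD` override are hypothetical, and the `-W` timeout flag assumes the Linux iputils version of `ping`.

```shell
# Hypothetical host check built on ping.
# PING_CMD is an assumed override so the ping binary can be swapped out.
host_alive() {
    host=$1
    # -c 1: send a single packet; -W 2: wait at most 2 seconds (Linux iputils syntax)
    if "${PING_CMD:-ping}" -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "PING OK - $host is alive"
        return 0    # exit code 0: the host is considered UP
    fi
    echo "PING CRITICAL - $host did not answer"
    return 2        # exit code 2: the host is considered DOWN
}
```

Calling `host_alive storage1` prints a one-line status and returns the code that a monitoring system would interpret.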
For services, there is a tool to execute remote plugins called
[NRPE](https://support.nagios.com/kb/article/nrpe-agent-and-plugin-explained-612.html)[^1]. It works with a client on
the monitoring host and an agent on the remote host that executes commands on demand. The return code defines the check
result.
Service states can be:
- **OK**: it works as expected
- **WARNING**: it works but we should take a look
- **CRITICAL**: it's broken
- **UNKNOWN**: something is wrong with the plugin configuration or communication
Plugins can define a warning and/or critical threshold to decide which state to report. For example, I would like to know
when disk space usage of a storage host reaches, say, 80% (warning) and 100% (critical). The warning gives me time to
free some space or order new hard drives before the situation becomes critical. And if I do nothing, a higher alert will
be sent when the disk becomes full.
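To make the link between thresholds, states and return codes concrete, here is a minimal sketch of the decision logic inside a plugin. It relies on the standard plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function name `usage_state` is hypothetical, and the sketch assumes integer percentages.

```shell
# Hypothetical helper: map a usage percentage to a Nagios state.
# Usage: usage_state USED_PERCENT WARN_THRESHOLD CRIT_THRESHOLD
usage_state() {
    used=$1
    warn=$2
    crit=$3
    # An empty or non-numeric value means the measurement itself failed
    case "$used" in
        ''|*[!0-9]*)
            echo "UNKNOWN - invalid usage value '$used'"
            return 3
            ;;
    esac
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL - disk usage is ${used}%"
        return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING - disk usage is ${used}%"
        return 1
    fi
    echo "OK - disk usage is ${used}%"
    return 0
}
```

A real plugin would measure the usage first (for example by parsing the output of `df -P`), call `usage_state "$used" 80 100`, and exit with the function's return code.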
# Installation
My monitoring host runs on Raspbian 10:
```
apt update
apt install nagios4 monitoring-plugins
```
Installed.
By default, the web interface was broken. I had to disable the following block in the */etc/nagios4/apache2.conf* file:
```
#
# ...
#
```
For security reasons, I enabled basic authentication (a.k.a. *htaccess*) in the *DirectoryMatch* block of the same file
and created an *admin* user:
```
AuthUserFile "/etc/nagios4/htdigest.users"
AuthType Basic
AuthName "Restricted Files"
AuthBasicProvider file
Require user admin
```
In the CGI configuration file */etc/nagios4/cgi.cfg*, the default user can be set to *admin* as the interface is now
protected by basic authentication:
```
default_user_name=admin
```
Now the web interface should be up and running at http://monitoring-ip/nagios4. For my own usage, I've set up a reverse
proxy (nginx) on the VPS host to expose this interface to a public endpoint so I can access it from anywhere with my
credentials.
# Configuration
A fresh installation applies sane defaults to make Nagios work out of the box. It even enables localhost monitoring.
However, I want to check this host like any other server in the infrastructure. The first thing I did was disable
the following include in the */etc/nagios4/nagios.cfg* file:
```
#cfg_file=/etc/nagios4/objects/localhost.cfg
```
I don't want to be spammed by my monitoring system. Servers may be slow and take time to respond. The Wi-Fi connection
of the monitoring system may hang for a while... until someone physically reboots the host. During this extended period
of time (multiple hours), my family and I may be asleep. I don't want to wake up to hundreds of notifications saying "Hey,
the monitoring system is DOWN!". One or two notifications are enough.
The following new templates can be defined in */etc/nagios4/conf.d/templates.cfg*:
```
define host {
name home-host
use generic-host
check_command check-host-alive
contact_groups admins
notification_options d,u,r
check_interval 5
retry_interval 5 ; retry every 5 minutes
max_check_attempts 12 ; alert at 1 hour (12x5 minutes)
notification_interval 720 ; resend notifications every 12 hours
register 0 ; template
}
define service {
name home-service
use generic-service
check_interval 5
retry_interval 5 ; retry every 5 minutes
max_check_attempts 12 ; alert at 1 hour (12x5 minutes)
notification_interval 720 ; 12 hours
register 0 ; template
}
```
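The comments in these templates can be turned into a quick back-of-the-envelope calculation. The sketch below is only an approximation: it assumes the default Nagios `interval_length` of 60 seconds (so intervals are in minutes) and ignores scheduling jitter and recovery notifications. The helper names `first_alert_delay` and `notification_count` are made up for illustration.

```shell
# Values copied from the home-host template (in minutes).
check_interval=5
retry_interval=5
max_check_attempts=12
notification_interval=720

# Hypothetical helper: worst-case minutes between the start of an outage
# and the first notification.
first_alert_delay() {
    # Up to one check_interval passes before the first failed check,
    # then (max_check_attempts - 1) retries are needed to confirm the problem.
    echo $(( check_interval + (max_check_attempts - 1) * retry_interval ))
}

# Hypothetical helper: approximate number of problem notifications
# sent during an outage lasting $1 minutes.
notification_count() {
    outage=$1
    first=$(first_alert_delay)
    if [ "$outage" -lt "$first" ]; then
        echo 0
    else
        echo $(( 1 + (outage - first) / notification_interval ))
    fi
}
```

With these values, `first_alert_delay` prints 60, and `notification_count 480` (an 8-hour night) prints 1.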
There are multiple components to define:
- **hosts** (*/etc/nagios4/conf.d/hosts.cfg*): every single host
- **hostgroups** (*/etc/nagios4/conf.d/hostgroups.cfg*): groups of hosts
- **services** (*/etc/nagios4/conf.d/services.cfg*): services that will be attached to hostgroups
For example, I need to know the ZFS usage of all storage servers:
- **hosts**: *storage1*, *storage2*, *storage3* with their IP addresses
- **hostgroups**: *storage-servers*, which groups *storage1*, *storage2* and *storage3*
- **services**: *zfs_capacity*, which is attached to *storage-servers*
Host definition:
```
define host {
use home-host
host_name storage1
alias storage1
address XX.XX.XX.XX
}
```
Hostgroup definition:
```
define hostgroup {
hostgroup_name storage-servers
alias Storage servers
members storage1,storage2,storage3
}
```
Service definition:
```
define service {
use home-service
hostgroup_name storage-servers
service_description zfs_capacity
check_command check_nrpe!check_zfs_capacity
}
```
On all storage servers, we also need to define an NRPE command:
```
command[check_zfs_capacity]=/usr/local/bin/sudo /usr/local/sbin/sanoid --monitor-capacity
```
ZFS usage is now monitored!
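When a new check misbehaves, it can help to run the same NRPE query by hand from the monitoring host and look at the return code. The wrapper below is a sketch under assumptions: the `check_nrpe` path is the usual Debian/Raspbian location and may differ on your system, and the `nrpe_state` function and `CHECK_NRPE` variable are hypothetical names.

```shell
# Assumed plugin location (Debian/Raspbian default); override CHECK_NRPE if it differs.
CHECK_NRPE=${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}

# Hypothetical helper: run one NRPE command on a host and prefix its
# output with the state name derived from the return code.
# Usage: nrpe_state HOST NRPE_COMMAND
nrpe_state() {
    host=$1
    nrpe_cmd=$2
    rc=0
    # -H: remote host to query; -c: name of the command defined on the agent
    output=$("$CHECK_NRPE" -H "$host" -c "$nrpe_cmd") || rc=$?
    case $rc in
        0) state=OK ;;
        1) state=WARNING ;;
        2) state=CRITICAL ;;
        *) state=UNKNOWN ;;
    esac
    echo "$state: $output"
    return "$rc"
}
```

For example, `nrpe_state storage1 check_zfs_capacity` runs the command defined above on *storage1* and shows which state Nagios would record.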
I repeated this process for every service I wanted to check and ended up with:
[![Monitoring services](/monitoring-services.png)](/monitoring-services.png)
A single host can be in multiple hostgroups. For my tests, I always added new features to *storage1* first. I created a
hostgroup for each new capability and added only *storage1* to it. That means *storage1* had the same services as
*storage2* and *storage3*, plus the new ones under test.
# Notifications
At work, we use [Opsgenie](https://www.atlassian.com/software/opsgenie) to define on-call schedules within a team. Of
course, I don't want to receive a push notification on my phone for my home servers. This is why I chose to be notified
by e-mail. In the past, I hosted some mailboxes at home but I didn't want to deal with spam and SPF records to prove
to the world that my service is legit. I have a couple of [domain names](https://www.gandi.net/en/domain) with
(limited) e-mail services included. For monitoring purposes, this is more than enough to do the job.
In Nagios, you can set the e-mail address in the contacts configuration file
*/etc/nagios4/objects/contacts.cfg*.
I followed this [great tutorial](https://www.linode.com/docs/email/postfix/postfix-smtp-debian7/) to configure
[postfix](http://www.postfix.org/) to send e-mails using the SMTP server of the provider. Secure and no more spam. I
have configured this new mailbox on my phone so I can be alerted asynchronously and smoothly when something goes
wrong.
[^1]: Nagios Remote Plugin Executor