self-hosting.riou.xyz/content/posts/problem-detection-and-alerting.md

---
title: "Problem detection and alerting"
date: 2020-08-07T18:00:00+02:00
---

Everything is distributed, automated and runs in perfect harmony with a common goal: protect your data. But bad things
happen, and rarely when you expect them. This is why you need to watch for services states and send a notification when
something goes wrong. Monitoring systems are well-known in the enterprise world. For our use case, we don't need to
deploy a complex infrastructure to check couple of hosts. For this reason, I choose to use the good old [Nagios
Core](https://www.nagios.org/projects/nagios-core/). It even provides a web interface for humans like us.

# How it works

There are two types of checks:
- **host**: check if host is alive or not
- **service**: check if service of a host is healthy or not

To check if a host is available, the simplest implementation is to use ping:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-host-check.svg" alt="Monitoring host check " style="width: 50%;"></p>
{{< /rawhtml >}}

For services, there is a tool to execute remote plugins called
[NRPE](https://support.nagios.com/kb/article/nrpe-agent-and-plugin-explained-612.html)[^1]. It works with a client on
the monitoring host and an agent on the remote host that executes commands on demand. The return code defines the check
result.

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-service-check.svg" alt="Monitoring service check " style="width: 65%;"></p>
{{< /rawhtml >}}

Services states can be:
- **OK**: it works as expected
- **WARNING**: it works but we should take a look
- **CRITICAL**: it's broken
- **UNKNOWN**: something is wrong with the plugin configuration or communication

Plugins can define a warning and/or critical threshold to manage the expected state. For example, I would like to know
when disk space usage of a storage host goes over, say, 80% (warning) and 100% (critical). I have time to take action to
free some space or order new hard drives before it becomes critical. And if I do nothing, a higher alert will be sent if
the disk becomes full.

# Installation

My monitoring host runs on Raspbian 10:
```
apt update
apt install nagios4 monitoring-plugins
```

Installed.

By default, the web interface was broken. I had to disable the following block in the */etc/nagios4/apache2.conf* file:

```
#   <Files "cmd.cgi">
#       ...
#    </Files>
```

For security reasons, I enabled a basic authentication (a.k.a *htaccess*) in the *DirectoryMatch* block of the same file
and created an *admin* user:

```
AuthUserFile "/etc/nagios4/htdigest.users"
AuthType Basic
AuthName "Restricted Files"
AuthBasicProvider file
Require user admin
```

In the CGI configuration file */etc/nagios4/cgi.cfg*, the default user can be set to *admin* as it is now protected by
basic security:

```
default_user_name=admin
```

Now the web interface should be up and running at http://monitoring-ip/nagios4. For my own usage, I've set up a reverse
proxy (nginx) on the VPS host to expose this interface to a public endpoint so I can access it from anywhere with my
credentials.

# Configuration

A fresh installation applies sane defaults to make Nagios work out-of-the-box. It even enables localhost monitoring.
Unfortunately, I want to check this host like any other server in the infrastructure. The first thing I do is to disable
the following include in */etc/nagios4/nagios.cfg* file:

```
#cfg_file=/etc/nagios4/objects/localhost.cfg
```

I don't want to be spammed by my monitoring system. Servers may be slower and take time to respond. The Wi-Fi connection
of the monitoring system may hang for a while... until someone reboots the host physically. During this extended period
of time (multiple hours), my family and I may sleep. I don't want to wake up with hundreds of notifications saying "Hey,
the monitoring system is DOWN!". One or two notifications is enough.

The following new templates can be defined in */etc/nagios4/conf.d/templates.cfg*:

```
define host {
    name                    home-host
    use                     generic-host
    check_command           check-host-alive
    contact_groups          admins
    notification_options    d,u,r
    check_interval          5
    retry_interval          5               ; retry every 5 minutes
    max_check_attempts      12              ; alert at 1 hour (12x5 minutes)
    notification_interval   720             ; resend notifications every 12 hours
    register                0               ; template
}

define service {
    name                    home-service
    use                     generic-service
    check_interval          5
    retry_interval          5               ; retry every 5 minutes
    max_check_attempts      12              ; alert at 1 hour (12x5 minutes)
    notification_interval   720             ; 12 hours
    register                0               ; template
}
```

There are multiple components to define:
- **hosts** (*/etc/nagios4/conf.d/hosts.cfg*): every single host
- **hostgroups** (*/etc/nagios4/conf.d/hostgroups.cfg*): groups of hosts
- **services** (*/etc/nagios4/conf.d/services.cfg*): services that will be attached to hostgroups

For example, I need to know ZFS usage of all storage servers:
- **hosts**: *storage1*, *storage2*, *storage3* with their IP addresses
- **hostgroups**: *storage-servers* that will regroup *storage1*, *storage2* and *storage3*
- **services**: *zfs_capacity* that will be attached to *storage-servers*

Host definition:

```
define host {
    use         home-host
    host_name   storage1
    alias       storage1
    address     XX.XX.XX.XX
}
```

Hostgroup definition:

```
define hostgroup {
    hostgroup_name  storage-servers
    alias           Storage servers
    members         storage1,storage2,storage3
}
```

Service definition:

```
define service {
    use                 home-service
    hostgroup_name      storage-servers
    service_description zfs_capacity
    check_command       check_nrpe!check_zfs_capacity
}
```

On all storage servers, we also need to define a NRPE command:

```
command[check_zfs_capacity]=/usr/local/bin/sudo /usr/local/sbin/sanoid --monitor-capacity
```

ZFS usage is now monitored!

I have repeated this process for all services I wanted to check to end up with:

[![Monitoring services](/monitoring-services.png)](/monitoring-services.png)

A single host can be in multiple hostgroups. For my tests, I always added features to *storage1*. I created a hostgroup
for each new capability and added only *storage1* to it. That means *storage1* had the same services as *storage2* and
*storage3*, and the new tested ones.

# Notifications

At work, we use [Opsgenie](https://www.atlassian.com/software/opsgenie) to define on call schedules within a team. Of
course, I don't want to receive a push notification on my phone for my home servers. This is why I choose to be notified
by e-mail. In the past, I hosted some e-mail boxes at home but I didn't want to deal with spam and SPF records to prove
to the world that my service is legit.  I have a couple of [domain names](https://www.gandi.net/en/domain) with
(limited) e-mail services included. For the monitoring purpose, this is more than enough to do the job.

On Nagios, you can set the e-mail address in the contacts configuration file
*/etc/nagios4/objects/contacts.cfg*.

I followed this [great tutorial](https://www.linode.com/docs/email/postfix/postfix-smtp-debian7/) to configure
[postfix](http://www.postfix.org/) to send e-mails using the SMTP server of the provider. Secure and no more spam. I
have configured this new e-mail box on my phone so I can be alerted asynchronously and smoothly when something wrong
happens.

[^1]: Nagios Remote Plugin Executor