Initial commit

Signed-off-by: Julien Riou <julien@riou.xyz>

content/posts/do-your-sensors-yourself.md

---
title: "Do your sensors yourself"
date: 2020-08-17T18:00:00+02:00
---

A big question I've asked myself during this project is: what is the best place to put my storage servers? There are multiple environmental variables to watch out for: **temperature**, **humidity** and **noise**. If components run too hot, they can be damaged in the long run. Of course, water and electricity are not friends. You can add a fan to move air out of the case and reduce both temperature and humidity, but the computer will become noisy. We need to measure those variables. Unfortunately, every system has a different set of built-in sensors, and not all of them are exposed to the operating system. So I decided to build my own sensors.

# Sensors hardware

I'm a newbie in electronics. I have never soldered anything. In the DIY[^1] world, there is an open-source micro-controller, the [Arduino Uno](https://store.arduino.cc/arduino-uno-rev3), that costs only a few bucks (20€). There are cheaper alternatives available, like the Elegoo Uno (11€). You'll need actual sensors like the [DHT22](https://www.waveshare.com/wiki/DHT22_Temperature-Humidity_Sensor) for temperature and humidity and the [KY-037](https://electropeak.com/learn/how-to-use-ky-037-sound-detection-sensor-with-arduino/) for capturing sound. To connect everything together, you'll need a [breadboard](https://en.wikipedia.org/wiki/Breadboard), [resistors](https://en.wikipedia.org/wiki/Resistor) and cables.

Components:
- [Elegoo Uno R3](https://www.amazon.fr/dp/B01N91PVIS/ref=cm_sw_r_tw_dp_x_8NtkFbHZ6X6K9)
- [DHT22 sensor](https://www.amazon.fr/dp/B07TTJNY1C/ref=cm_sw_r_tw_dp_x_QOtkFbBM2ZAAD)
- [KY-037 sensor](https://www.amazon.fr/dp/B07ZHGX5T6/ref=cm_sw_r_tw_dp_x_kPtkFbXRRK7ZP)
- [10k Ω resistor](https://www.amazon.fr/dp/B06XKQLPFV/ref=cm_sw_r_tw_dp_x_EPtkFbB24855X)
- [breadboard](https://www.amazon.fr/dp/B06XKZWCJB/ref=cm_sw_r_tw_dp_x_.PtkFb01X4WNW)
- [cables](https://www.amazon.fr/dp/B01JD5WCG2/ref=cm_sw_r_tw_dp_x_QQtkFbRA6PSG0)

In electronics, you build closed circuits going from the power supply ("+") to the ground ("-"). The Arduino board can be plugged into a USB port, which powers the board through the "5V" pin. The end of the circuit should return to the "GND" pin, which means "ground". The breadboard helps you extend the circuit and plug in more than one element (resistors and sensors at the same time). The top and bottom parts are connected horizontally; the central part connects elements vertically. Horizontal and vertical parts are isolated from each other. A resistor's role is to limit electrical current. It acts like a tap for distributing water: if too much water arrives at once, the glass fills too quickly and water spits everywhere. We'll put a resistor in front of the DHT22 to get valid values and prevent damage.

The circuit looks like this:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/sensors.svg" alt="Sensors circuit" style="width: 65%;"></p>
{{< /rawhtml >}}

The DHT22 sensor has three pins: **power**, **digital** and **ground** (not four as in the schematic). The KY-037 sensor has four pins: **analog**, **ground**, **power** and **digital** (not three as in the schematic). We'll use the analog pin to gather data from the sound sensor.

# Sensors software

The circuit is plugged into a computer via USB and is ready to be used. To read values, we need to compile low-level code and execute it on the board. For this purpose, you can install the [Arduino IDE](https://www.arduino.cc/en/Main/Software), which is available on multiple platforms. My personal computer runs on Ubuntu (no joke please) and I tried to use the packages from the repositories. However, they are too old to work. You should [install the IDE yourself](https://www.arduino.cc/en/Guide/Linux). I've added my own user to the "dialout" group to be able to use the serial interface to send compiled code to the board. The code itself is called a "sketch". You can find mine [here](https://github.com/jouir/arduino-sensors-toolkit/blob/master/sensors2serial.ino). Click on "Upload", job done.

# Multiplexing

Values are sent to the serial port, but only one program can read this interface at a time. Unfortunately, we would like to send those metrics to both the alerting and trending systems. Each has its own schedule, so they will try to access the interface at the same time. Moreover, programs reading the serial port have to wait at least four seconds for a value. In the IoT[^2] world, we often see [MQTT](https://en.wikipedia.org/wiki/MQTT), a queuing protocol. To solve this contention issue, I've developed a simple daemon called [serial2mqtt](https://github.com/jouir/arduino-sensors-toolkit/#serial2mqtt) that reads values from the serial interface and publishes them to an MQTT broker. I've installed [Mosquitto](https://mosquitto.org/) on the storage servers so the multiplexing happens locally.

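For the curious, the core loop boils down to a few lines. Here is a minimal sketch of the idea (not the actual serial2mqtt code), assuming the board prints lines like `temperature:23.5` and that pyserial and paho-mqtt are installed:

```
#!/usr/bin/env python3
import serial
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="serial2mqtt-sketch")
client.username_pw_set("user", "password")  # hypothetical credentials
client.connect("localhost", 1883)
client.loop_start()

# Device path, baud rate and line format are assumptions; adjust to your board.
with serial.Serial("/dev/ttyUSB0", 9600, timeout=10) as port:
    while True:
        line = port.readline().decode(errors="ignore").strip()
        if ":" not in line:
            continue
        metric, value = line.split(":", 1)
        client.publish("sensors/{}".format(metric), value)  # e.g. sensors/temperature
```
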
# Thresholds

What is the **critical temperature**? I [found](https://www.apc.com/us/en/faqs/FA157464/) that UPS batteries should not run in an environment above 25°C (warning) and must not go over 40°C (critical). This summer, I had multiple buzzer alerts on storage3, and the temperature was over 29°C every time.

What is the **critical humidity**? Humidity is the concentration of water in a volume of air. In tropical regions of the world, humidity often reaches 100% and computers still work there. The amount of water air can hold is proportional to temperature: the hotter it is, the more water the air can contain. The temperature inside a computer case is generally warmer than the ambient temperature. The danger is not the quantity of water in the air, it's when water condenses. A good rule of thumb is to avoid going over 80%, but 100% should not be a problem.

# Alerting

On Nagios, I use the [check-mqtt](https://github.com/jpmens/check-mqtt) script on the monitored storage host, behind an NRPE command:

```
# Sensors
command[check_ambient_temperature]=/usr/local/bin/python3.7 /usr/local/libexec/nagios/check-mqtt.py -m 10 --readonly -t sensors/temperature -H localhost -P 1883 -u nagios -p ***** -w "float(payload) > 25.0" -c "float(payload) > 40.0"
command[check_ambient_humidity]=/usr/local/bin/python3.7 /usr/local/libexec/nagios/check-mqtt.py -m 10 --readonly -t sensors/humidity -H localhost -P 1883 -u nagios -p ***** -w "float(payload) > 80.0" -c "float(payload) > 95.0"
```

[](/sensors-storage2-alert.png)

# Observability

Telegraf has a [mqtt_consumer](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mqtt_consumer) input plugin:

```
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]
  topics = [
    "sensors/humidity",
    "sensors/temperature",
    "sensors/sound"
  ]
  persistent_session = true
  client_id = "telegraf"
  data_format = "value"
  data_type = "float"
  username = "telegraf"
  password = "*****"
```

Grafana is now able to display environmental variables:

[](/sensors-storage1.png)
[](/sensors-storage2.png)
[](/sensors-storage3.png)

# In the end

I tried to measure noise, but I failed. The KY-037 sensor is designed to detect sound variations, like a loud noise over a short period of time. Measuring the ambient noise level would require a lot of conversions to get values in [decibels](https://en.wikipedia.org/wiki/Decibel). So I decided to ignore the values coming from this sensor and listen for noise myself.

I can put my storage servers in the attic, in a room or in the cellar. The attic is right under the roof, which is too hot in the summer (over 40°C). Rooms are occupied during the night, so noise is a problem. I am lucky to have a free room right now, but it's too hot during the summer (over 25°C). That leaves the cellar, where all the conditions are optimal, even humidity. Luckily, all the remote locations have a cellar, which is perfect!

[^1]: Do It Yourself
[^2]: Internet of Things

content/posts/geographic-distribution-with-sanoid-and-syncoid.md

---
title: "Geographic distribution with Sanoid and Syncoid"
date: 2020-08-03T18:00:00+02:00
---

Failures happen at multiple levels: a single disk can fail, as can multiple disks, a single server, multiple servers, a geographic region, a country, the world, the universe. The probability decreases with the number of simultaneous events, while costs and complexity increase with the number of failure scenarios you want to handle. It's up to you to find the right balance between all those variables.

For my own infrastructure at home, I was able to put storage servers in three different locations: two in Belgium (10 km from one another) and one in France. They all share the same data. Up to two storage servers can burn or be flooded entirely without data loss. There are also redundancy solutions at the host level, but I will not cover them in this article.

{{< rawhtml >}}
<script src="https://unpkg.com/leaflet@latest/dist/leaflet.js"></script>
<link href="https://unpkg.com/leaflet@latest/dist/leaflet.css" rel="stylesheet"/>
<div id="osm-map"></div>
<script type="text/javascript">
var element = document.getElementById('osm-map');
element.style = 'height:500px;';
var map = L.map(element);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '© <a href="http://osm.org/copyright">OpenStreetMap</a> contributors'
}).addTo(map);
var center = L.latLng('49.708', '2.516');
map.setView(center, 7);
L.marker(L.latLng('48.8566969', '2.3514616')).addTo(map); // storage france
L.marker(L.latLng('50.4549568', '3.9519580')).addTo(map); // storage belgium (x2)
</script>
<p><!-- space --></p>
{{< /rawhtml >}}

# Backup management

The storage layer relies on ZFS pools. There is a wonderful piece of free software called [Sanoid](https://github.com/jimsalterjrs/sanoid) to take snapshots of your datasets and manage their retention. Here is an example of the configuration on a storage host:

```
[zroot]
hourly = 0
daily = 0
monthly = 0
yearly = 0
autosnap = no
autoprune = no

[storage/xxx]
use_template = storage

[storage/yyy]
use_template = storage

[storage/zzz]
use_template = storage

[template_storage]
hourly = 0
daily = 31
monthly = 12
yearly = 10
autosnap = yes
autoprune = yes
```

Where *storage/xxx*, *storage/yyy* and *storage/zzz* are datasets exposed to my family's computers. With this configuration, I am able to keep 10 years of snapshots. This may change over time depending on disk space, performance or retention requirements. The *zroot* dataset has no snapshot or prune policy but is declared in the configuration for monitoring purposes.

Sanoid is compatible with FreeBSD but requires [system changes](https://github.com/jimsalterjrs/sanoid/blob/master/FREEBSD.readme). You'll need an "sh"-compatible shell for mbuffer to work. I've chosen to install and use "bash" because I'm familiar with it on GNU/Linux servers.

To automatically create and prune snapshots, I've created a cron job that runs every minute:

```
* * * * * /usr/local/sbin/sanoid --cron --verbose >> /var/log/sanoid.log
```

# Remote sync

Sanoid comes with a tool to sync local snapshots with a remote host, called [Syncoid](https://github.com/jimsalterjrs/sanoid#syncoid). It is similar to "rsync" but for ZFS snapshots. If the synchronization fails in the middle, Syncoid can **resume** the replication where it left off, without restarting from zero. It also supports **compression** on the wire, which is handy for low-bandwidth networks like mine. To be able to send datasets to a remote destination, I've set up direct SSH communication (via the VPN) with ed25519 keys.

Then cron jobs for automation:

```
0 2,6 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage2:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
0 3,7 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage3:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
```

Beware, I use the "root" user for this connection. This can be a **security flaw**. You should create a user with low privileges, possibly use "sudo" restricted to the command, and disable root login over SSH. The countermeasure I've implemented is to disable password authentication for the root user ("*PermitRootLogin without-password*" in the OpenSSH server's sshd_config file). I've also restricted SSH connections to the VPN and local networks only. No public network allowed.

# Local usage

Now ZFS snapshots are automatically created and replicated. How can we start using the service? *I want to send my data!* Every location has its own storage server. The idea is to send data to the local server over the local network, and let the Sanoid/Syncoid couple handle the rest over the VPN for data safety.

At the beginning, all my family members were using [Microsoft Windows](https://en.wikipedia.org/wiki/Microsoft_Windows) (10). To provide the most user-friendly experience, I thought it was a good idea to create a [CIFS](https://en.wikipedia.org/wiki/Server_Message_Block) share with [Samba](https://en.wikipedia.org/wiki/Samba_(software)). The authentication system was a pain to configure, but the network drive was recognized and it worked... for a while. Every single Samba update on the storage server broke the share. I've lost countless hours debugging this s\*\*t.

I started to show them alternatives to Windows. One day, my wife agreed to change. She opted for [Kubuntu](https://kubuntu.org/). Then my parents-in-law changed too. I was able to remove the Samba share and use [NFS](https://en.wikipedia.org/wiki/Network_File_System) instead. This changed my life. The network folder has never stopped working since the switch. For my personal use, I rely on [rsync](https://en.wikipedia.org/wiki/Rsync) and cron to **automatically** send my local folders.

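A hypothetical crontab entry for that nightly push (paths, host and schedule are made up for the example):

```
0 1 * * * /usr/bin/rsync -a --delete /home/julien/documents/ storage1:/storage/xxx/documents/ >> /var/log/rsync-documents.log 2>&1
```
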
The storage infrastructure looks like this (storage1 example):

{{< rawhtml >}}
<p style="text-align: center;"><img src="/geographic-distribution-diagram.svg" alt="Geographic distribution diagram" style="width: 50%;"></p>
{{< /rawhtml >}}

Syncoid is configured to replicate to the other nodes:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/geographic-distribution-diagram-2.svg" alt="Geographic distribution part 2" style="width: 50%;"></p>
{{< /rawhtml >}}

The most important rule is to **strictly forbid writes** on the **same dataset** in two **different locations** at the **same time**. This setup is not "[multi-master](https://en.wikipedia.org/wiki/Multi-master_replication)" compliant at all.

In the end, data management is fully automated. Data losses belong to the past.

---
title: "Hardware adventures and operating systems installation"
date: 2020-07-24T18:00:00+02:00
---

At the beginning of the project, the goal was to create a single storage server at my apartment. So I bought a [fancy case](https://www.ldlc.com/fr-be/fiche/PB00181814.html) with front racks to hot-swap disks, and I retrieved an [Intel NUC motherboard](https://www.intel.com/content/www/us/en/products/boards-kits/nuc/boards.html) from work. It had only two SATA ports to connect disks, which is not enough to plug in at least four disks: one for the system and three for the storage. I bought a [PCI RAID card](https://www.amazon.fr/gp/product/B0001Y7PU8) to add four slots. I connected two small SSDs for the system and four data disks, then installed FreeBSD without any issue. I had started to copy data to the storage space when a noisy alarm[^1] began to wake everybody up in the building. This was unbearable. I decided to buy a *micro ATX* motherboard with processor and memory to replace the Intel NUC board. Wrong. I had confused the [micro ATX](https://en.wikipedia.org/wiki/MicroATX) and [mini ITX](https://en.wikipedia.org/wiki/Mini-ITX) formats. The first one was too big to fit in the case. So I bought a classic ATX case with a cheap power supply and 3x2TB disks from work. **Storage1** was born.

At that point, I had a working storage server and some parts to build a second one. At the same time, my wife and I had a baby. My office at home became the newborn's bedroom. I paused this project for a year to focus on my family. Then we bought a house with plenty of space to handle life serenely.

During the move, I unpacked my very first computer, which I had assembled in 2008. The only missing piece was a physical slot to rack the fourth disk. I bought a [low-cost ATX case](https://www.amazon.fr/gp/product/B00LA7PC6Y/) and moved every part into it. I started before work on a Friday but didn't finish in time, so my home office was covered with computer parts all day long. When I finished work, I went back to the project, until a friendly neighbor called on me for help because his computer had crashed. Right before going to bed, I tried to connect the power button to the motherboard without instructions, and it didn't work. I finally found the answer on the web and made it work, at midnight. **Storage 2** was born.

It runs on quite old hardware (10+ years). I thought it would be easy to install FreeBSD on it, given that the system was created in the 90s[^2]. I tried to boot from USB but the stick was not recognized. I burnt a CD-ROM with version 12, the latest release at that time. The installer could not load because of a [Lua error](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234031) in the bootloader. In the comments and on forums, some people had managed to make version 11 work. I burnt a CD-ROM with version 11: same result. After having lost an afternoon of my time and two CD-ROMs, I went back into my comfort zone and installed a Debian 10 with success.

Recently, my family offered me the missing hardware pieces to finalize the third storage host: the big one with 4TB disks in the mini case, the one I had bought at the beginning of the project. In the end, it is not so practical. Disks are not fixed to the rack; they can move back and forth a few centimeters. Some disks were not recognized by the system because they were not fully connected. I pushed all of them with a screwdriver to ensure they were plugged into the SATA connector. For the price, I expected it to work out of the box. I was surprised to find four SATA ports on the motherboard where I expected five or six, so I removed one system disk. Goodbye dirty hack with adhesive tape to stick on the second SSD! Go join your friends in the stock. **Storage 3** was born.

Here is the detailed list of components:

| Host     | Component    | Reference |
| -------- | ------------ | --------- |
| storage1 | Case         | [Antec One](https://media.ldlc.com/ld/products/00/01/00/62/LD0001006251_2.jpg) |
|          | Power supply | [Antec Basiq Series VP350P](https://media.ldlc.com/ld/products/00/00/89/95/LD0000899597_2.jpg) |
|          | Motherboard  | [Gigabyte GA-B150M-DS3H](https://media.ldlc.com/ld/products/00/03/45/35/LD0003453579_2.jpg) |
|          | CPU          | [Intel Celeron G3900 (2.8 GHz)](https://media.ldlc.com/ld/products/00/01/47/39/LD0001473956_2_0001473966_0001571304_0001571323_0003614881.jpg) |
|          | RAM          | [G.Skill Aegis 4 Go (1 x 4 Go) DDR4 2133 MHz CL15](https://www.ldlc.com/fr-be/fiche/PB00202287.html) |
|          | System disks | [LDLC SSD F2 32 GB](https://media.ldlc.com/ld/products/00/03/42/11/LD0003421194_2_0003421246.jpg) (x2) |
|          | Data disks   | 2TB HDD 3.5" (x3) |
| storage2 | Case         | [Advance Grafit](https://www.amazon.fr/gp/product/B00LA7PC6Y/) |
|          | Power supply | No reference found |
|          | Motherboard  | Asus M2A-VM HDMI |
|          | CPU          | AMD Athlon 64 X2 5000+ Socket AM2 |
|          | RAM          | G.Skill Kit Extreme2 2 x 1 Go PC6400 PK (x2) |
|          | System disk  | Recycled 160GB HDD 3.5" |
|          | Data disks   | 1TB HDD 3.5" (x3) |
| storage3 | Case         | [In Win IW-MS04](https://www.ldlc.com/fr-be/fiche/PB00181814.html) |
|          | Motherboard  | [ASRock H310CM-ITX/AC](https://www.ldlc.com/fr-be/fiche/PB00275155.html) |
|          | CPU          | [Intel Celeron G4920 (3.2 GHz)](https://www.ldlc.com/fr-be/fiche/PB00247186.html) |
|          | RAM          | [G.Skill Aegis 4 Go (1 x 4 Go) DDR4 2133 MHz CL15](https://www.ldlc.com/fr-be/fiche/PB00202287.html) |
|          | System disk  | [LDLC SSD F2 32 GB](https://media.ldlc.com/ld/products/00/03/42/11/LD0003421194_2_0003421246.jpg) |
|          | Data disks   | 4TB HDD 3.5" (x3) |

Despite heterogeneous components, the storage servers have been running successfully for a while now.

[^1]: Later, I found out that the noise was coming from the disk backplane and not the motherboard. There is a buzzer that emits a sound sequence depending on the detected anomaly. At the apartment, and at my current house in the summer, the temperature in the room was too high (more than 29°C). I moved the host to a cold place. Problem solved.

[^2]: FreeBSD's [initial release](https://en.wikipedia.org/wiki/FreeBSD) was on November 1, 1993.

content/posts/increased-observability-with-the-TIG-stack.md

---
title: "Increased observability with the TIG stack"
date: 2020-08-10T18:00:00+02:00
---

[Observability](https://en.wikipedia.org/wiki/Observability) has become a buzzword lately. I must admit, this is one of the many reasons why I use it in the title. In reality, this article is about fetching measurements and creating beautiful graphs to feel like [detective Derrick](https://en.wikipedia.org/wiki/Derrick_(TV_series)), an *old* and wise detective solving cases by encouraging criminals to confess by themselves.

With the recent [gain in popularity](https://opensource.com/article/17/11/why-go-grows) of the [Go](https://golang.org/) programming language, we have seen a lot of new software coming into the database world: [CockroachDB](https://www.cockroachlabs.com/), [TiDB](https://pingcap.com/products/tidb), [Vitess](https://vitess.io/), etc. Among them, the **TIG stack** ([**T**elegraf](https://github.com/influxdata/telegraf), [**I**nfluxDB](https://github.com/influxdata/influxdb) and [**G**rafana](https://github.com/grafana/grafana)) has become a reference for gathering and displaying metrics.

The goal is to see the evolution of resource usage (memory, processor, storage space), power consumption and environmental variables (temperature, humidity) on every single host of the infrastructure.

# Telegraf

The first component of the stack is Telegraf, an agent that can fetch metrics from multiple sources ([input](https://github.com/influxdata/telegraf/tree/master/plugins/inputs)) and write them to multiple destinations ([output](https://github.com/influxdata/telegraf/tree/master/plugins/outputs)). There are tens of built-in plugins available! You can even gather a custom source of data with [exec](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec), as long as it emits an expected [format](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md).

I configured Telegraf to fetch and send metrics every minute (*interval* and *flush_interval* in the *agent* section are set to *"60s"*), which is enough for my personal usage. Most of the plugins I use are built-in: cpu, disk, diskio, kernel, mem, processes, system, zfs, net, smart, ping, etc.

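For reference, the relevant part of the agent section looks like this (a minimal excerpt; everything else is left at its default):

```
[agent]
  interval = "60s"
  flush_interval = "60s"
```
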
The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics (size, allocation, free space, etc.) on FreeBSD but not [on Linux](https://github.com/influxdata/telegraf/issues/2616). The issue is known, but the fix has not been merged upstream yet. So I developed a simple Python snippet to fill the gap on my only storage server running on Linux:

```
#!/usr/bin/python3

import subprocess


def parse_int(s):
    # the InfluxDB line protocol marks integers with a trailing "i"
    return str(int(s)) + 'i'


def parse_float_with_x(s):
    # zpool prints the dedup ratio as "1.00x"
    return float(s.replace('x', ''))


def parse_pct_int(s):
    return parse_int(s.replace('%', ''))


if __name__ == '__main__':
    measurement = 'zfs_pool'

    # -H removes headers, -p prints exact (parseable) values;
    # decode() makes the script work on Python 3 where check_output returns bytes
    pools = subprocess.check_output(['/usr/sbin/zpool', 'list', '-Hp']).decode().splitlines()
    for pool in pools:
        col = pool.split("\t")
        tags = {'pool': col[0], 'health': col[9]}
        fields = {}

        if tags['health'] == 'UNAVAIL':
            fields['size'] = 0
        else:
            fields['size'] = parse_int(col[1])
            fields['allocated'] = parse_int(col[2])
            fields['free'] = parse_int(col[3])
            fields['fragmentation'] = '0i' if col[6] == '-' else parse_pct_int(col[6])
            fields['capacity'] = parse_int(col[7])
            fields['dedupratio'] = parse_float_with_x(col[8])

        # emit one line of InfluxDB line protocol: measurement,tags fields
        tags = ','.join(['{}={}'.format(k, v) for k, v in tags.items()])
        fields = ','.join(['{}={}'.format(k, v) for k, v in fields.items()])
        print('{},{} {}'.format(measurement, tags, fields))
```

It is called by the following input:

```
[[inputs.exec]]
  commands = ['/opt/telegraf-plugins/zfs.py']
  data_format = "influx"
```

This exec plugin does exactly the same job as the zfs input running on FreeBSD.

All those metrics are sent to a single output, InfluxDB, hosted on the monitoring server.

# InfluxDB

Measurements can be stored in a time series database, which is designed to organize data around time. InfluxDB is a perfect fit for what we need. Of course, there are other time series databases. I've chosen this one because it is well documented, it fits my needs and I wanted to learn new things. [Installation](https://docs.influxdata.com/influxdb/v1.8/introduction/install/) is straightforward. I've enabled [HTTPS](https://docs.influxdata.com/influxdb/v1.8/administration/https_setup/) and [authentication](https://docs.influxdata.com/influxdb/v1.8/administration/authentication_and_authorization/#set-up-authentication). I use a simple setup with only one node in the *cluster*, no sharding and only one database. Even though Telegraf doesn't send that many metrics, I've created a default [retention policy](https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#retention-policy-management) to store two years of data, which is more than enough. A new default retention policy becomes the default route for all newly written points. Don't be afraid to see all the existing measurements vanish: nothing has been deleted. They just live under the previous policy and need to be [moved](https://community.influxdata.com/t/applying-retention-policies-to-existing-measurments/802). You should define a [backup](https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/) policy too.

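Creating such a policy could look like this (the database name "telegraf" is an assumption; 104 weeks approximates two years):

```
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 104w REPLICATION 1 DEFAULT
```
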
# Grafana

Now that we are able to gather and store metrics, we need to visualize them. This is the role of [Grafana](https://grafana.com/). During my career, I have played with [Graylog](https://docs.graylog.org/en/3.2/pages/dashboards.html), [Kibana](https://www.elastic.co/kibana) and Grafana. The last one is my favorite. It is generally blazing fast, even on a Raspberry Pi. The look and feel is amazing. The theme is dark by default, but I like the light one.

I have created four dashboards:
- **system**: load, processor, memory, system disk usage, disk I/O, network quality and bandwidth
- **storage**: ZFS pool allocation, capacity, fragmentation and uptime for each disk
- **power consumption**: kWh used per day, week, month and year, current UPS load, price per year (more details in a future post)
- **sensors**: ambient temperature, humidity and noise (more details in a future post)

Every single graph has a *$host* [variable](https://grafana.com/docs/grafana/latest/variables/templates-and-variables/) at the dashboard level to filter metrics per host. At the top of the screen, a dropdown menu is automatically created to select the host, based on an InfluxDB query.

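That variable query is presumably something along these lines (a sketch; the exact query depends on the dashboard):

```
SHOW TAG VALUES WITH KEY = "host"
```
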
And because a picture is worth a thousand words, here are some screenshots of my own graphs:

[](/grafana-system.png)
[](/grafana-storage.png)
[](/grafana-power-consumption.png)
[](/grafana-sensors.png)

# Infrastructure

To sum this up, the infrastructure looks like this:

*(infrastructure diagram)*

Whenever I want, I can sit back on a comfortable sofa, open a web browser and let the infrastructure speak for itself. Easy, right?

content/posts/infrastructure-overview.md

---
title: "Infrastructure overview"
date: 2020-07-20T18:30:00+02:00
---

The idea behind this infrastructure is to run on commodity servers. No need to buy big racks of expensive servers like we see in data centers; simple homemade computers will do the job. At work, I have access to cheap hard drives that were used in servers and are either out of warranty or no longer suitable for enterprise workloads. They generally cost half the market price. I have a mix of brand-new and re-used drives to reduce the risk of having two disks fail at the same time in the same host.

There are three components in the infrastructure:
* **storage** servers that hold the data
* a **monitoring** server that grabs metrics and sends alerts
* a **vps**[^1] server used to create a VPN[^2] and watch for monitoring server availability

{{< rawhtml >}}
<p style="text-align: center;"><img src="/infrastructure-overview.svg" alt="Infrastructure overview" style="width: 65%;"></p>
{{< /rawhtml >}}

# Storage

Every storage server is designed to be hosted in a different location. Each one could be unplugged from a location, plugged in somewhere else and work the same way as before. They only require an Internet access to contact the VPS and join the VPN.

The technology that holds the data is **[ZFS](https://en.wikipedia.org/wiki/ZFS)**. I am lucky enough to use it at work for production workloads and it makes life way easier. I am used to managing GNU/Linux servers ([Debian](https://www.debian.org/)) and I know that [FreeBSD](https://www.freebsd.org/) has built-in ZFS support, so I wanted to give it a try. I didn't choose [FreeNAS](https://www.freenas.org/) because I wanted to do everything by myself, to learn and to use only the features I needed.

The right balance I found to maximize available disk space while keeping data safe is to use **three disks** in a [RAID-Z](https://en.wikipedia.org/wiki/ZFS#RAID_(%22RaidZ%22)). A storage server can lose one disk at a time without breaking the service, while almost all the cumulative space remains available to use. Datasets are configured to use **lz4** compression because it saves disk space without putting too much pressure on the CPU.

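Creating such a pool takes two commands (device names are assumptions; adjust them to your system):

```
zpool create storage raidz ada1 ada2 ada3
zfs set compression=lz4 storage
```
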
| Host     | Disk capacity |
| -------- | ------------: |
| storage1 | 5.44T |
| storage2 | 2.72T |
| storage3 | 10.9T |

# Monitoring

Like any system administrator, I want to be alerted when something goes wrong in the infrastructure. I also want to browse the history with graphs to see trends. There was a [Raspberry Pi](https://www.raspberrypi.org/) waiting in a drawer to be used. It is now connected to the Wi-Fi network somewhere in the house, perfectly hidden, doing this job in the background.

# VPS

I am not a network engineer. Actually, this is not my job and I don't want it to be. There are numerous experts in the field who do this very well, and I am thankful to them. But a computer without network connectivity is not very useful. When self-hosting, you have to deal with your ISP modem settings, and there is no standard as far as I know. Mine has no fixed public IPv4 address. I tried to develop scripts to automatically update a subdomain name with the current public IP address and contact it from the outside. The name resolution worked, but the communication always failed.

To solve this problem, I [rent a VPS](https://www.ovhcloud.com/fr/vps/) hosted close to the storage locations, on which I have configured an [OpenVPN](https://openvpn.net/) server. This is a single point of failure and a *bottleneck*, because all the traffic goes through this server to reach the others. In practice, the Internet bandwidth at home is the real bottleneck, so the VPS should not be a problem. It also acts as the entry point from the outside world for metrics and monitoring websites.

[^1]: [Virtual Private Server](https://en.wikipedia.org/wiki/Virtual_private_server)

[^2]: [Virtual Private Network](https://en.wikipedia.org/wiki/Virtual_private_network)

content/posts/network-configuration-with-openvpn.md

---
title: "Network configuration with OpenVPN"
date: 2020-07-27T18:00:00+02:00
---

Networking is hard. Dealing with ISP modem settings is even harder. Mine doesn't have a static public IP address by default. If the modem reboots, it is likely to be assigned a new one. For regular people browsing the Internet, this is not a problem. But for hackers like us, it means we cannot use the IP address itself to reach the private network from the outside world. It becomes a problem when we try to join hosts across different networks.

For your information, this is the price my ISP would like me to pay for this "option":

*(screenshot: the ISP's price for a fixed IP address)*

This is insane!

My first idea was to deploy a script on each host that discovers the public IP address and registers an A record on a given subdomain name. This job could be run by a cron daemon. It would turn a dynamic IP address into a predictable name, much like the [no-ip](https://www.noip.com/) service. It worked: I was able to know the home public IP address.

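The discovery half of such a script fits in a few lines (a sketch; api.ipify.org is one of several public "what is my IP" services, and the DNS update depends entirely on the provider's API, so it is left out):

```
#!/usr/bin/env python3
from urllib.request import urlopen

# ask a public service for our current public IP address
public_ip = urlopen("https://api.ipify.org").read().decode().strip()
print(public_ip)
```
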
Then I started to use [port mapping](https://www.proximus.be/support/en/id_sfaqr_ports_mapping/personal/support/internet/internet-at-home/advanced-settings/internet-port-mapping-on-your-modem.html#/bbox3) to redirect a given port on my router to a host in the private network. By default, some protocols like SSH, HTTP and HTTPS are [not open](https://www.proximus.be/support/en/id_sfaqr_ports_unblock_secu/personal/support/internet/security-and-protection/internet-ports-and-security/open-internet-ports.html), even if you configure port mapping correctly. You have to go to the ISP website and lower your *security level* from high to low. At my apartment, I managed to reach some ports from the outside, but never at my current house. The major problem with this procedure is its **complexity** and the fact that it **highly depends on your ISP devices/settings**. I had to find a simpler solution.

Here comes [OpenVPN](https://openvpn.net/). It's open-source software that creates private networks on top of public ones. It uses encryption to secure the connection between hosts and keep your transport safe. The initial setup is quite long and complex, but you just have to follow this [great tutorial](https://www.digitalocean.com/community/tutorials/how-to-set-up-an-openvpn-server-on-debian-10) and it will work like a charm. The drawback is that you'll need a single point to act as the server. I chose to [rent a VPS](https://www.ovhcloud.com/fr/vps/) for a few euros per month. It has a fixed IP address and a decent bandwidth for our usage. It runs on Debian, but plenty of operating systems are available.

OpenVPN certificate management can be a bit confusing at first. I use my monitoring host as the CA[^1] to keep trust at home, and every host has its own client certificate. I've set up static IP addressing to always assign the same address to clients. I've enabled direct communication between clients because the storage servers will send snapshots to each other. I didn't configure clients to forward all their packets to the VPN server, because the goal here is not to hide behind it for privacy.

I have changed the following settings on the VPN server:

```
topology subnet ; declare a subnet like home
server 10.xx.xx.xx 255.xx.xx.xx ; with the range you like
client-to-client ; allow clients to talk to each other
client-config-dir /etc/openvpn/ccd ; static IP configuration per client
ifconfig-pool-persist /var/log/openvpn/ipp.txt ; IP lease settings
```

Example of an *ipp.txt* file:

```
storage1,10.xx.xx.xx
storage2,10.yy.yy.yy
storage3,10.zz.zz.zz
```

Example of a */etc/openvpn/ccd/storage1.user* file:

```
ifconfig-push 10.xx.xx.xx 255.xx.xx.xx
```

The network configuration declared in *client-config-dir* must match the one in *ipp.txt*.

The configuration generated by the *make_config.sh* script (see the tutorial mentioned above) can be written to:
* */etc/openvpn/client.conf* (Debian)
* */usr/local/etc/openvpn/openvpn.conf* (FreeBSD)

When the OpenVPN service is started, you should see the tun interface up and running, on FreeBSD:

```
tun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1500
	options=80000<LINKSTATE>
	inet6 fe80::xxxx:xxxx:xxxx:xxxx%tun0 prefixlen 64 scopeid 0x3
	inet 10.xx.xx.xx --> 10.xx.xx.xx netmask 0xffffff00
	groups: tun
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	Opened by PID 962
```

as well as on Debian:

```
3: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 100
    link/none
    inet 10.xx.xx.xx/xx brd 10.xx.xx.xx scope global tun0
       valid_lft forever preferred_lft forever
```

Et voilà! Every server is now part of a private network:

```
monitoring ~ # nmap -sn 10.xx.xx.xx/xx
Starting Nmap 7.70 ( https://nmap.org ) at 2020-07-13 17:28 CEST
Nmap scan report for vps (10.xx.xx.xx)
Host is up (0.018s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.032s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.24s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.22s latency).
Nmap scan report for 10.xx.xx.xx
Host is up.
Nmap done: xx IP addresses (5 hosts up) scanned in 13.11 seconds
```

[^1]: [Certificate Authority](https://en.wikipedia.org/wiki/Certificate_authority)

content/posts/power-consumption-and-failures-prevention.md

---
title: "Power consumption and failures prevention"
date: 2020-08-14T18:00:00+02:00
---

Providing a full storage service means having computers up 24x7. On one hand, if we power off the local storage server when we aren't using it, we'll have to find a way to respect the backup policy and synchronize with remote servers that could be down at that moment. On the other hand, if we leave the storage server up all the time, it will consume unnecessary resources and throw money down the drain. I am deeply convinced that a personal computer, which is idle most of the time, doesn't consume that much power. But how to verify it?

With [observability]({{< ref "posts/increased-observability-with-the-tig-stack" >}}), I thought it would be easy to gather power consumption via built-in sensors. I tried something I know, [lm_sensors](https://hwmon.wiki.kernel.org/lm_sensors), which is included in the Linux kernel and exposes CPU temperatures, fan speeds, power voltages, etc.

```
storage2 ~ # sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +30.0°C
Core0 Temp:  +22.0°C
Core1 Temp:  +30.0°C
Core1 Temp:  +16.0°C

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +40.0°C  (crit = +75.0°C)

atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:      +1.10 V  (min = +1.45 V, max = +1.75 V)
 +3.3 Voltage:      +3.39 V  (min = +3.00 V, max = +3.60 V)
 +5.0 Voltage:      +4.97 V  (min = +4.50 V, max = +5.50 V)
+12.0 Voltage:     +12.22 V  (min = +11.20 V, max = +13.20 V)
CPU FAN Speed:     3391 RPM  (min = 0 RPM, max = 1800 RPM)
CHASSIS FAN Speed:    0 RPM  (min = 0 RPM, max = 1800 RPM)
POWER FAN Speed:   1662 RPM  (min = 0 RPM, max = 1800 RPM)
CPU Temperature:    +26.0°C  (high = +90.0°C, crit = +125.0°C)
MB Temperature:     +37.0°C  (high = +70.0°C, crit = +125.0°C)
```

The ACPI interface returns some voltage measurements, but I doubt they can be used to derive the instantaneous consumption in watts (W) and extrapolate the consumption over time in kilowatt-hours (kWh). On laptops, such information can be computed from battery statistics. Unfortunately, all the computers in this infrastructure are desktops without batteries.

I needed to buy a product. A [lot](https://modernsurvivalblog.com/alternative-energy/kill-a-watt-meter/) [of](https://www.howtogeek.com/107854/the-how-to-geek-guide-to-measuring-your-energy-use/) [websites](https://www.pcmag.com/news/how-to-measure-home-power-usage) [talk](https://michaelbluejay.com/electricity/measure.html) about how to measure power consumption for computers and even for the whole house. The common recommendation is to use a [wattmeter](https://en.wikipedia.org/wiki/Wattmeter).

{{< rawhtml >}}
<p style="text-align: center;"><img src="/zaeel-wattmetre.jpg" alt="Wattmeter" style="width: 25%;"></p>
{{< /rawhtml >}}

It's an instrument plugged between the power outlet and your device to measure how much energy is consumed instantaneously (W) and over time (kWh), and even the total price if you have configured the kWh price. An LCD displays the results. A wattmeter is cheap; I've bought [this model](https://www.amazon.fr/dp/B07GN5NPDJ/ref=cm_sw_r_tw_dp_x_FMMiFb2911HN7), which does a good job. Sadly, we cannot easily extract the data shown on the LCD and load it into the metrics infrastructure. It also lacks precision for the price: we can enter only two digits after the decimal point, while the energy provider gives a price with five digits.

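The extrapolation itself is simple arithmetic; here is a quick sketch with made-up numbers (an average draw of 25 W and a price of 0.25 €/kWh):

```
avg_watts = 25.0         # made-up average draw in watts
price_per_kwh = 0.25     # made-up price in euros per kWh

kwh_per_year = avg_watts * 24 * 365 / 1000    # 219 kWh over a year
cost_per_year = kwh_per_year * price_per_kwh  # about 54.75 euros
print('{:.0f} kWh/year, {:.2f} EUR/year'.format(kwh_per_year, cost_per_year))
```
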
Speaking of the price, my [energy provider](https://www.engie.be/fr/) publishes a beautiful but incomprehensible [grid of prices](https://www.engie.be/fr/energie/electricite-gaz/prix-conditions). Prices depend on the pack of products, the region and the distributor, and they change over time. You can have one meter rate for the day and another for the night. Moreover, the price is displayed in cents, not euros! I had to call them to get a price estimate. Come on, we live in a digitized world: they should at least display the current price of the contract somewhere in the customer panel.

During my research, I found that an uninterruptible power supply (UPS) can be used to gather power consumption metrics. As a bonus, it protects against the power variations and interruptions that could harm computers. However, UPS units are quite expensive, ranging from 50€ to hundreds of euros. As a total newbie in this domain, I read this detailed [guide](https://www.materiel.net/guide-achat/g13-les-onduleurs-et-prises-parafoudre/1/) (FR) to gain some knowledge. I decided to buy an [APC Back-UPS Pro 550](https://www.apc.com/shop/be/en/products/APC-Power-Saving-Back-UPS-Pro-550/P-BR550GI).

{{< rawhtml >}}
<p style="text-align: center;"><img src="/apc-back-ups-pro-550.jpg" alt="UPS" style="width: 25%;"></p>
{{< /rawhtml >}}

It has a USB interface to control it with [apcupsd](https://en.wikipedia.org/wiki/Apcupsd) and display power information with the "apcaccess" binary. It's compatible with both Debian and FreeBSD, and there is even a [telegraf plugin](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/apcupsd) for it!

```
storage1 ~ # /usr/local/sbin/apcaccess
APC      : 001,036,0867
DATE     : 2020-07-30 15:56:46 +0200
HOSTNAME : storage1
VERSION  : 3.14.14 (31 May 2016) freebsd
UPSNAME  : storage1
CABLE    : USB Cable
DRIVER   : USB UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2020-07-26 18:28:21 +0200
MODEL    : Back-UPS RS 550G
STATUS   : ONLINE
LINEV    : 234.0 Volts
LOADPCT  : 10.0 Percent
BCHARGE  : 100.0 Percent
TIMELEFT : 37.5 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 0 Seconds
SENSE    : Medium
LOTRANS  : 176.0 Volts
HITRANS  : 282.0 Volts
ALARMDEL : No alarm
BATTV    : 13.7 Volts
LASTXFER : No transfers since turnon
NUMXFERS : 0
TONBATT  : 0 Seconds
CUMONBATT: 0 Seconds
XOFFBATT : N/A
SELFTEST : NO
STATFLAG : 0x05000008
SERIALNO : 4B1939P01928
BATTDATE : 2019-09-23
NOMINV   : 230 Volts
NOMBATTV : 12.0 Volts
NOMPOWER : 330 Watts
FIRMWARE : 857.L7 .I USB FW:L7
END APC  : 2020-07-30 15:57:32 +0200
```

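On the Telegraf side, the input configuration is tiny (a sketch, assuming apcupsd listens on its default local address):

```
[[inputs.apcupsd]]
  servers = ["tcp://127.0.0.1:3551"]
```
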
The 550 is the first model in the Back-UPS Pro range, so it only has [IEC C13 power plugs](https://en.wikipedia.org/wiki/IEC_60320#C13/C14_coupler), suitable for computers, and no [Euro/French plugs](https://en.wikipedia.org/wiki/AC_power_plugs_and_sockets#CEE_7.2F5_socket_and_CEE_7.2F6_plug_.28French.3B_Type_E.29) compatible with any electrical device. As I connect only a single computer to the UPS, this is the most economical solution.

Once the data had been fed into the observability platform, I was able to import this [beautiful dashboard](https://grafana.com/grafana/dashboards/10835) from the Grafana community. I've customized it to my own needs and here is the result:

[](/power-consumption-storage1.png)
[](/power-consumption-storage2.png)
[](/power-consumption-storage3.png)

You can download my dashboard [here](/grafana-power-consumption.json).

Our winner is *storage3*, which costs less than a kebab per year! The worst case is *storage2*, the old hardware, which consumes the equivalent of an incandescent light bulb. See, the power consumption is not so bad after all.

content/posts/problem-detection-and-alerting.md

---
title: "Problem detection and alerting"
date: 2020-08-07T18:00:00+02:00
---

Everything is distributed, automated and runs in perfect harmony with a common goal: protect your data. But bad things happen, and rarely when you expect them. This is why you need to watch service states and send a notification when something goes wrong. Monitoring systems are well known in the enterprise world. For our use case, we don't need to deploy a complex infrastructure to check a couple of hosts. For this reason, I chose the good old [Nagios Core](https://www.nagios.org/projects/nagios-core/). It even provides a web interface for humans like us.

# How it works

There are two types of checks:
- **host**: check whether a host is alive or not
- **service**: check whether a service on a host is healthy or not

To check if a host is available, the simplest implementation is to use ping:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-host-check.svg" alt="Monitoring host check" style="width: 50%;"></p>
{{< /rawhtml >}}

For services, there is a tool to execute remote plugins called [NRPE](https://support.nagios.com/kb/article/nrpe-agent-and-plugin-explained-612.html)[^1]. It works with a client on the monitoring host and an agent on the remote host that executes commands on demand. The return code defines the check result.

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-service-check.svg" alt="Monitoring service check" style="width: 65%;"></p>
{{< /rawhtml >}}

Service states can be:
- **OK**: it works as expected
- **WARNING**: it works but we should take a look
- **CRITICAL**: it's broken
- **UNKNOWN**: something is wrong with the plugin configuration or communication

Plugins can define a warning and/or a critical threshold to qualify the state. For example, I would like to know when the disk space usage of a storage host goes over, say, 80% (warning) and 100% (critical). That way I have time to free some space or order new hard drives before the situation becomes critical, and if I do nothing, a higher-level alert will be sent when the disk becomes full.

# Installation

My monitoring host runs on Raspbian 10:

```
apt update
apt install nagios4 monitoring-plugins
```

Installed.

By default, the web interface was broken. I had to disable the following block in the */etc/nagios4/apache2.conf* file:

```
# <Files "cmd.cgi">
#    ...
# </Files>
```

For security reasons, I enabled basic authentication (a.k.a. *htaccess*) in the *DirectoryMatch* block of the same file and created an *admin* user:

```
AuthUserFile "/etc/nagios4/htdigest.users"
AuthType Basic
AuthName "Restricted Files"
AuthBasicProvider file
Require user admin
```

In the CGI configuration file */etc/nagios4/cgi.cfg*, the default user can be set to *admin*, as the interface is now protected by basic authentication:

```
default_user_name=admin
```

Now the web interface should be up and running at http://monitoring-ip/nagios4. For my own usage, I've set up a reverse proxy (nginx) on the VPS host to expose this interface on a public endpoint, so I can access it from anywhere with my credentials.

# Configuration

A fresh installation applies sane defaults to make Nagios work out of the box. It even enables localhost monitoring. Unfortunately, I want to check this host like any other server in the infrastructure. The first thing I do is disable the following include in the */etc/nagios4/nagios.cfg* file:

```
#cfg_file=/etc/nagios4/objects/localhost.cfg
```

I don't want to be spammed by my monitoring system. Servers may be slow and take time to respond. The Wi-Fi connection of the monitoring system may hang for a while... until someone physically reboots the host. During such an extended period of time (multiple hours), my family and I may be asleep. I don't want to wake up to hundreds of notifications saying "Hey, the monitoring system is DOWN!". One or two notifications are enough.

The following new templates can be defined in */etc/nagios4/conf.d/templates.cfg*:

```
define host {
    name                  home-host
    use                   generic-host
    check_command         check-host-alive
    contact_groups        admins
    notification_options  d,u,r
    check_interval        5
    retry_interval        5    ; retry every 5 minutes
    max_check_attempts    12   ; alert at 1 hour (12x5 minutes)
    notification_interval 720  ; resend notifications every 12 hours
    register              0    ; template
}

define service {
    name                  home-service
    use                   generic-service
    check_interval        5
    retry_interval        5    ; retry every 5 minutes
    max_check_attempts    12   ; alert at 1 hour (12x5 minutes)
    notification_interval 720  ; 12 hours
    register              0    ; template
}
```

There are multiple components to define:
- **hosts** (*/etc/nagios4/conf.d/hosts.cfg*): every single host
- **hostgroups** (*/etc/nagios4/conf.d/hostgroups.cfg*): groups of hosts
- **services** (*/etc/nagios4/conf.d/services.cfg*): services that will be attached to hostgroups

For example, I need to know the ZFS usage of all storage servers:
- **hosts**: *storage1*, *storage2*, *storage3* with their IP addresses
- **hostgroups**: *storage-servers*, regrouping *storage1*, *storage2* and *storage3*
- **services**: *zfs_capacity*, attached to *storage-servers*


Host definition:

```
define host {
    use        home-host
    host_name  storage1
    alias      storage1
    address    XX.XX.XX.XX
}
```

Hostgroup definition:

```
define hostgroup {
    hostgroup_name  storage-servers
    alias           Storage servers
    members         storage1,storage2,storage3
}
```

Service definition:

```
define service {
    use                  home-service
    hostgroup_name       storage-servers
    service_description  zfs_capacity
    check_command        check_nrpe!check_zfs_capacity
}
```

On all storage servers, we also need to define an NRPE command:

```
command[check_zfs_capacity]=/usr/local/bin/sudo /usr/local/sbin/sanoid --monitor-capacity
```
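
Since the check runs *sanoid* through *sudo*, the NRPE user probably needs a passwordless sudoers entry. A minimal
sketch, assuming the daemon runs as a user named *nrpe* (the user name and file path are assumptions, adapt to your
system):

```
# /usr/local/etc/sudoers.d/nrpe -- allow the NRPE user to run this exact command only
nrpe ALL=(root) NOPASSWD: /usr/local/sbin/sanoid --monitor-capacity
```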

ZFS usage is now monitored!

I have repeated this process for all the services I wanted to check, ending up with:

[![Monitoring services overview](/monitoring-services.png)](/monitoring-services.png)

A single host can be in multiple hostgroups. For my tests, I always added features to *storage1* first: I created a
hostgroup for each new capability and added only *storage1* to it, as shown in the sketch below. That means *storage1*
had the same services as *storage2* and *storage3*, plus the newly tested ones.
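
A minimal sketch of such a throwaway hostgroup (the *beta-servers* name is made up for illustration):

```
define hostgroup {
    hostgroup_name  beta-servers
    alias           Servers testing a new capability
    members         storage1
}
```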

# Notifications

At work, we use [Opsgenie](https://www.atlassian.com/software/opsgenie) to define on-call schedules within a team. Of
course, I don't want to receive a push notification on my phone for my home servers, which is why I chose to be
notified by e-mail. In the past, I hosted some e-mail boxes at home but I didn't want to deal with spam and SPF records
to prove to the world that my service is legit. I have a couple of [domain names](https://www.gandi.net/en/domain) with
(limited) e-mail services included. For monitoring purposes, this is more than enough to do the job.

On Nagios, you can set the e-mail address in the contacts configuration file
*/etc/nagios4/objects/contacts.cfg*.
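
A minimal sketch of what that contact definition looks like (the address is a placeholder; the other fields follow the
Debian sample configuration, so double-check against your own file):

```
define contact {
    use           generic-contact         ; inherits notification commands and periods
    contact_name  nagiosadmin
    alias         Nagios Admin
    email         monitoring@example.com  ; placeholder address
}
```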

I followed this [great tutorial](https://www.linode.com/docs/email/postfix/postfix-smtp-debian7/) to configure
[postfix](http://www.postfix.org/) to send e-mails using the SMTP server of the provider. Secure, and no more spam. I
have configured this new e-mail box on my phone so I can be alerted asynchronously and smoothly when something goes
wrong.
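
The relevant part of the resulting */etc/postfix/main.cf* looks roughly like this (the relay host is a placeholder and
the exact parameters depend on your provider, so treat this as a sketch):

```
# relay all outgoing mail through the provider's SMTP server (placeholder host)
relayhost = [smtp.example.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
```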

[^1]: Nagios Remote Plugin Executor

47
content/posts/state-of-internet-bandwidth-in-Belgium.md
Normal file

@ -0,0 +1,47 @@
---
title: "State of Internet bandwidth in Belgium"
date: 2020-07-31T18:00:00+02:00
---

I was born and raised in a little city next to Paris, in **France**. In the early 2000s, unlimited "high-speed"
Internet access revolutionized communications: no need to monopolize the phone line with a 56Kbps modem anymore. Since
then, the bandwidth has kept increasing. We have seen the ADSL, ADSL2 and fiber technologies. We had "triple play"
offers where unlimited phone calls, TV and Internet were packed together. There were three major companies on the
market: [France Telecom/Orange](https://www.orange.fr), [Bouygues](https://www.bouyguestelecom.fr/) and
[Neuf/Cegetel/SFR](https://www.sfr.fr/) (depending on the year). Then [Free](https://www.free.fr) jumped into the
market and broke prices with revolutionary offers. Since that time, all French ISPs have offered "low prices" – between
30 and 50€/month – for "high-speed" access – hundreds of Mbps both down and up – thanks to the fiber deployment.

Then I moved to **Belgium** for personal reasons. My parents-in-law had chosen
[Belgacom/Proximus](https://www.proximus.be/en/personal/?) and were happy with it, so I followed their choice. This
ISP has deployed the VDSL technology, which can be "fast". My first apartment was very close to the DSLAM[^1] so my
bandwidth was good enough, 50Mbps/15Mbps. The price was noticeably higher for Internet and TV only, 50€/month. If we
had wanted a phone line, we would have had to add 20€ to the monthly bill and pay for each phone call! You can get
unlimited phone calls for [1.19€/month](https://www.ovhtelecom.fr/telephonie/voip/decouverte.xml) using VoIP, the very
same technology our ISPs use. There is also a limit to the monthly Internet volume we can consume: it was something
like 600GB/month when I subscribed, raised to 3TB now.

When I moved to my current house, I knew the bandwidth would drop. Proximus failed to organize my move on time. You
can reschedule the appointment yourself on the website, but if you go to a shop instead, they can't do anything because
it has been scheduled online. I canceled the first appointment online and they created a new one with an additional
two-week delay, one month after the move. So I subscribed to [Voo](https://www.voo.be/en), the *fastest Internet of
Belgium* as they say in their [commercials](https://www.youtube.com/watch?v=LKv6LtaXIf4). Same price, better speed,
120Mbps/10Mbps... for a week. Then I had three months of packet loss, 20% on average. It was unusable. The following
two months were stable with a bandwidth drop, 70Mbps/10Mbps. Then packet loss again, 80% on average this time!
Horrible. I re-subscribed to Proximus, with a 20Mbps/6Mbps bandwidth, but it has been stable since the change. All of
that for 60€/month.

I called Proximus to be notified when the fiber would come to my street so I could finally catch up with our neighbors'
speeds, kind of. They have no plan to install it. No date. Nothing. In the meantime, my father and my grandparents have
**gigabit** fiber installed at home for a lower price than mine. And even if Proximus deploys it, [upload bandwidth is
limited to 100Mbps](https://www.proximus.be/en/id_cr_fiber/personal/orphans/fiber-to-your-home.html) where it can be
[200Mbps](https://www.sfr.fr/offre-internet/fibre-optique) or even [600](https://www.free.fr/freebox/freebox-delta)
[Mbps](https://boutique.orange.fr/internet/offres-fibre/livebox-up) in France. As of today, the maximum bandwidth I
could get at home is the 400Mbps/20Mbps promised by Voo, with the stability we know.

Belgian ISPs, Proximus and Voo, when will you stop stealing from our pockets and start generalizing very high-speed
Internet access across this small country of ours? We are in the 2020s, not the 2000s.

[^1]: [Digital subscriber line access
multiplexer](https://en.wikipedia.org/wiki/Digital_subscriber_line_access_multiplexer), the closer you are, the faster
your bandwidth is.

24
content/posts/storage-servers-at-home.md
Normal file

@ -0,0 +1,24 @@
---
title: "Storage servers at home"
date: 2020-07-17T19:00:00+02:00
---

I was born in the 90s. I grew up with computers. Other generations call us "digital natives". I am lucky and proud to
work with computers every day, with a database specialization. People tend to generate lots of data. It might be
administrative papers (bills, contracts, paychecks), sentimental photo albums or whatever else, as long as it is
**their** data. At work, we pay attention to backing up every piece of data as though it were the most important thing
in the world. At home, it should be the same but, in fact, nobody really cares about it until the data is gone for
good.

My family members used to buy a single USB hard drive, copy their data to it from time to time, and consider it safe.
Safety highly depends on the frequency of the backups and, in practice, they didn't copy very often. When the drive
fails, they call me to the rescue, but I'm not a magician.

Another solution involves sending their data to "the cloud" because they have seen on TV that it will solve all of
their problems. Cloud providers can, intentionally or unintentionally, leak their data. To put it in physical terms,
I'm not sure my family would want to ship their storage cupboard to the United States for the sake of data safety. We
live in Belgium and France. There is no point in sending our data to the other side of the planet, in someone else's
hands.

So I decided to **self-host a set of storage servers at home** and offer this service to my own family. It has to be
simple, as my parents will be the main users. I am a full-time employee and a proud dad, so I only have a little bit of
time for service maintenance. It is an opportunity for me to learn and to share it with the world. Welcome to my
self-hosting project. I hope you will learn something too.