Initial commit

Signed-off-by: Julien Riou <julien@riou.xyz>

content/posts/do-your-sensors-yourself.md

---
title: "Do your sensors yourself"
date: 2020-08-17T18:00:00+02:00
---

A big question I've asked myself during this project is: what is the best place to put my storage servers? There are multiple environmental variables to watch out for: **temperature**, **humidity** and **noise**. If components run too hot, they can be damaged in the long run. Of course, water and electricity are not friends. You can add a fan to move air out of the case and reduce both temperature and humidity, but the computer will become noisy. We need to measure those variables. Unfortunately, every system has a different set of built-in sensors, and not all of them are exposed to the operating system. So I decided to build my own sensors.

# Sensors hardware

I'm a newbie in electronics. I have never soldered anything. In the DIY[^1] world, there is an open-source micro-controller, the [Arduino Uno](https://store.arduino.cc/arduino-uno-rev3), that costs only a few bucks (20€). There are cheaper alternatives available, like the Elegoo Uno (11€). You'll need actual sensors like the [DHT22](https://www.waveshare.com/wiki/DHT22_Temperature-Humidity_Sensor) for temperature and humidity and the [KY-037](https://electropeak.com/learn/how-to-use-ky-037-sound-detection-sensor-with-arduino/) for capturing sound. To connect everything together, you'll need a [breadboard](https://en.wikipedia.org/wiki/Breadboard), [resistors](https://en.wikipedia.org/wiki/Resistor) and cables.

Components:
- [Elegoo Uno R3](https://www.amazon.fr/dp/B01N91PVIS/ref=cm_sw_r_tw_dp_x_8NtkFbHZ6X6K9)
- [DHT22 sensor](https://www.amazon.fr/dp/B07TTJNY1C/ref=cm_sw_r_tw_dp_x_QOtkFbBM2ZAAD)
- [KY-037 sensor](https://www.amazon.fr/dp/B07ZHGX5T6/ref=cm_sw_r_tw_dp_x_kPtkFbXRRK7ZP)
- [10k Ω resistor](https://www.amazon.fr/dp/B06XKQLPFV/ref=cm_sw_r_tw_dp_x_EPtkFbB24855X)
- [breadboard](https://www.amazon.fr/dp/B06XKZWCJB/ref=cm_sw_r_tw_dp_x_.PtkFb01X4WNW)
- [cables](https://www.amazon.fr/dp/B01JD5WCG2/ref=cm_sw_r_tw_dp_x_QQtkFbRA6PSG0)

In electronics, you build closed circuits going from the power supply ("+") to the ground ("-"). The Arduino board can be plugged into a USB port, which powers the board through the "5V" pin. The end of the circuit should return to the "GND" pin, which means "ground". The breadboard helps you extend the circuit and plug in more than one element (resistors and sensors at the same time). The top and bottom parts are connected horizontally; the central part connects elements vertically. Horizontal and vertical parts are isolated from each other. A resistor's role is to limit electrical current. It acts like a tap for distributing water: if too much water arrives at once, the glass fills too quickly and water spits everywhere. We'll put a resistor in front of the DHT22 to get valid values and prevent damage.

The circuit looks like this:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/sensors.svg" alt="Sensors circuit" style="width: 65%;"></p>
{{< /rawhtml >}}

The DHT22 sensor has three pins: **power**, **digital** and **ground** (not four as in the schematic). The KY-037 sensor has four pins: **analog**, **ground**, **power** and **digital** (not three as in the schematic). We'll use the analog pin to gather data from the sound sensor.

# Sensors software

The circuit is plugged into a computer via USB and is ready to be used. To read values, we need to compile low-level code and execute it on the board. For this purpose, you can install the [Arduino IDE](https://www.arduino.cc/en/Main/Software), which is available on multiple platforms. My personal computer runs on Ubuntu (no joke please) and I tried to use the packages from the repositories. However, they are too old to work. You should [install the IDE yourself](https://www.arduino.cc/en/Guide/Linux). I've added my own user to the "dialout" group to be able to use the serial interface to send compiled code to the board. The code itself is called a "sketch". You can find mine [here](https://github.com/jouir/arduino-sensors-toolkit/blob/master/sensors2serial.ino). Click on "Upload", job done.

# Multiplexing

Values are sent to the serial port, but only one program can read this interface at a time. Unfortunately, we would like to send those metrics to both the alerting and trending systems. Each has its own schedule, so they will try to access the interface at the same time. Moreover, programs reading the serial port have to wait at least four seconds for a value. In the IoT[^2] world, we often see [MQTT](https://en.wikipedia.org/wiki/MQTT), a queuing protocol. To solve this contention issue, I've developed a simple daemon called [serial2mqtt](https://github.com/jouir/arduino-sensors-toolkit/#serial2mqtt) that reads values from the serial interface and publishes them to an MQTT broker. I've installed [Mosquitto](https://mosquitto.org/) on the storage servers so the multiplexing happens locally.

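For the curious, the core loop boils down to a few lines. Here is a minimal sketch of the idea (not the actual serial2mqtt code), assuming the board prints lines like `temperature:23.5` and that pyserial and paho-mqtt are installed:

```
#!/usr/bin/env python3
import serial
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="serial2mqtt-sketch")
client.username_pw_set("user", "password")  # hypothetical credentials
client.connect("localhost", 1883)
client.loop_start()

# Device path, baud rate and line format are assumptions; adjust to your board.
with serial.Serial("/dev/ttyUSB0", 9600, timeout=10) as port:
    while True:
        line = port.readline().decode(errors="ignore").strip()
        if ":" not in line:
            continue
        metric, value = line.split(":", 1)
        client.publish("sensors/{}".format(metric), value)  # e.g. sensors/temperature
```
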
# Thresholds

What is the **critical temperature**? I [found](https://www.apc.com/us/en/faqs/FA157464/) that UPS batteries should not run in an environment above 25°C (warning) and must not go over 40°C (critical). This summer, I had multiple buzzer alerts on storage3, and the temperature was over 29°C every time.

What is the **critical humidity**? Humidity is the concentration of water in a volume of air. In tropical regions of the world, humidity often reaches 100% and computers still work there. The amount of water air can hold is proportional to temperature: the hotter it is, the more water the air can contain. The temperature inside a computer case is generally warmer than the ambient temperature. The danger is not the quantity of water in the air, it's when water condenses. A good rule of thumb is to avoid going over 80%, but 100% should not be a problem.

# Alerting

On Nagios, I use the [check-mqtt](https://github.com/jpmens/check-mqtt) script on the monitored storage host, behind an NRPE command:

```
# Sensors
command[check_ambient_temperature]=/usr/local/bin/python3.7 /usr/local/libexec/nagios/check-mqtt.py -m 10 --readonly -t sensors/temperature -H localhost -P 1883 -u nagios -p ***** -w "float(payload) > 25.0" -c "float(payload) > 40.0"
command[check_ambient_humidity]=/usr/local/bin/python3.7 /usr/local/libexec/nagios/check-mqtt.py -m 10 --readonly -t sensors/humidity -H localhost -P 1883 -u nagios -p ***** -w "float(payload) > 80.0" -c "float(payload) > 95.0"
```

[](/sensors-storage2-alert.png)

# Observability

Telegraf has a [mqtt_consumer](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mqtt_consumer) input plugin:

```
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]
  topics = [
    "sensors/humidity",
    "sensors/temperature",
    "sensors/sound"
  ]
  persistent_session = true
  client_id = "telegraf"
  data_format = "value"
  data_type = "float"
  username = "telegraf"
  password = "*****"
```

Grafana is now able to display environmental variables:

[](/sensors-storage1.png)
[](/sensors-storage2.png)
[](/sensors-storage3.png)

# In the end

I tried to measure noise, but I failed. The KY-037 sensor is designed to detect sound variations, like a loud noise over a short period of time. Measuring the ambient noise level would require a lot of conversions to get values in [decibels](https://en.wikipedia.org/wiki/Decibel). So I decided to ignore the values coming from this sensor and listen for noise myself.

I can put my storage servers in the attic, in a room or in the cellar. The attic is right under the roof, which is too hot in the summer (over 40°C). Rooms are occupied during the night, so noise is a problem. I am lucky to have a free room right now, but it's too hot during the summer (over 25°C). That leaves the cellar, where all the conditions are optimal, even humidity. Luckily, all the remote locations have a cellar, which is perfect!

[^1]: Do It Yourself
[^2]: Internet of Things

content/posts/geographic-distribution-with-sanoid-and-syncoid.md

---
title: "Geographic distribution with Sanoid and Syncoid"
date: 2020-08-03T18:00:00+02:00
---

Failures happen at multiple levels: a single disk can fail, as can multiple disks, a single server, multiple servers, a geographic region, a country, the world, the universe. The probability decreases with the number of simultaneous events, while costs and complexity increase with the number of failure scenarios you want to handle. It's up to you to find the right balance between all those variables.

For my own infrastructure at home, I was able to put storage servers in three different locations: two in Belgium (10 km from one another) and one in France. They all share the same data. Up to two storage servers can burn or be flooded entirely without data loss. There are also redundancy solutions at the host level, but I will not cover them in this article.

{{< rawhtml >}}
<script src="https://unpkg.com/leaflet@latest/dist/leaflet.js"></script>
<link href="https://unpkg.com/leaflet@latest/dist/leaflet.css" rel="stylesheet"/>
<div id="osm-map"></div>
<script type="text/javascript">
var element = document.getElementById('osm-map');
element.style = 'height:500px;';
var map = L.map(element);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '© <a href="http://osm.org/copyright">OpenStreetMap</a> contributors'
}).addTo(map);
var center = L.latLng('49.708', '2.516');
map.setView(center, 7);
L.marker(L.latLng('48.8566969', '2.3514616')).addTo(map); // storage france
L.marker(L.latLng('50.4549568', '3.9519580')).addTo(map); // storage belgium (x2)
</script>
<p><!-- space --></p>
{{< /rawhtml >}}

# Backup management

The storage layer relies on ZFS pools. There is a wonderful piece of free software called [Sanoid](https://github.com/jimsalterjrs/sanoid) to take snapshots of your datasets and manage their retention. Here is an example of the configuration on a storage host:

```
[zroot]
hourly = 0
daily = 0
monthly = 0
yearly = 0
autosnap = no
autoprune = no

[storage/xxx]
use_template = storage

[storage/yyy]
use_template = storage

[storage/zzz]
use_template = storage

[template_storage]
hourly = 0
daily = 31
monthly = 12
yearly = 10
autosnap = yes
autoprune = yes
```

Where *storage/xxx*, *storage/yyy* and *storage/zzz* are datasets exposed to my family's computers. With this configuration, I am able to keep 10 years of snapshots. This may change over time depending on disk space, performance or retention requirements. The *zroot* dataset has no snapshot or prune policy but is declared in the configuration for monitoring purposes.

Sanoid is compatible with FreeBSD but requires [system changes](https://github.com/jimsalterjrs/sanoid/blob/master/FREEBSD.readme). You'll need an "sh"-compatible shell for mbuffer to work. I've chosen to install and use "bash" because I'm familiar with it on GNU/Linux servers.

To automatically create and prune snapshots, I've created a cron job that runs every minute:

```
* * * * * /usr/local/sbin/sanoid --cron --verbose >> /var/log/sanoid.log
```

# Remote sync

Sanoid comes with a tool to sync local snapshots with a remote host, called [Syncoid](https://github.com/jimsalterjrs/sanoid#syncoid). It is similar to "rsync" but for ZFS snapshots. If the synchronization fails in the middle, Syncoid can **resume** the replication where it left off, without restarting from zero. It also supports **compression** on the wire, which is handy for low-bandwidth networks like mine. To be able to send datasets to a remote destination, I've set up direct SSH communication (via the VPN) with ed25519 keys.

Then cron jobs for automation:

```
0 2,6 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage2:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
0 3,7 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage3:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
```

Beware, I use the "root" user for this connection. This can be a **security flaw**. You should create a user with low privileges, possibly use "sudo" restricted to the command, and disable root login over SSH. The countermeasure I've implemented is to disable password authentication for the root user ("*PermitRootLogin without-password*" in the OpenSSH server's sshd_config file). I've also restricted SSH connections to the VPN and local networks only. No public network allowed.

# Local usage

Now ZFS snapshots are automatically created and replicated. How can we start using the service? *I want to send my data!* Every location has its own storage server. The idea is to send data to the local server over the local network, and let the Sanoid/Syncoid couple handle the rest over the VPN for data safety.

At the beginning, all my family members were using [Microsoft Windows](https://en.wikipedia.org/wiki/Microsoft_Windows) (10). To provide the most user-friendly experience, I thought it was a good idea to create a [CIFS](https://en.wikipedia.org/wiki/Server_Message_Block) share with [Samba](https://en.wikipedia.org/wiki/Samba_(software)). The authentication system was a pain to configure, but the network drive was recognized and it worked... for a while. Every single Samba update on the storage server broke the share. I've lost countless hours debugging this s\*\*t.

I started to show them alternatives to Windows. One day, my wife agreed to change. She opted for [Kubuntu](https://kubuntu.org/). Then my parents-in-law changed too. I was able to remove the Samba share and use [NFS](https://en.wikipedia.org/wiki/Network_File_System) instead. This changed my life. The network folder has never stopped working since the switch. For my personal use, I rely on [rsync](https://en.wikipedia.org/wiki/Rsync) and cron to **automatically** send my local folders.

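A hypothetical crontab entry for that nightly push (paths, host and schedule are made up for the example):

```
0 1 * * * /usr/bin/rsync -a --delete /home/julien/documents/ storage1:/storage/xxx/documents/ >> /var/log/rsync-documents.log 2>&1
```
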
The storage infrastructure looks like this (storage1 example):

{{< rawhtml >}}
<p style="text-align: center;"><img src="/geographic-distribution-diagram.svg" alt="Geographic distribution diagram" style="width: 50%;"></p>
{{< /rawhtml >}}

Syncoid is configured to replicate to the other nodes:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/geographic-distribution-diagram-2.svg" alt="Geographic distribution part 2" style="width: 50%;"></p>
{{< /rawhtml >}}

The most important rule is to **strictly forbid writes** on the **same dataset** in two **different locations** at the **same time**. This setup is not "[multi-master](https://en.wikipedia.org/wiki/Multi-master_replication)" compliant at all.

In the end, data management is fully automated. Data losses belong to the past.

---
title: "Hardware adventures and operating systems installation"
date: 2020-07-24T18:00:00+02:00
---

At the beginning of the project, the goal was to create a single storage server at my apartment. So I bought a [fancy case](https://www.ldlc.com/fr-be/fiche/PB00181814.html) with front racks to hot-swap disks, and I retrieved an [Intel NUC motherboard](https://www.intel.com/content/www/us/en/products/boards-kits/nuc/boards.html) from work. It had only two SATA ports to connect disks, which is not enough to plug in at least four disks: one for the system and three for the storage. I bought a [PCI RAID card](https://www.amazon.fr/gp/product/B0001Y7PU8) to add four slots. I connected two small SSDs for the system and four data disks, then installed FreeBSD without any issue. I had started to copy data to the storage space when a noisy alarm[^1] began to wake everybody up in the building. This was unbearable. I decided to buy a *micro ATX* motherboard with processor and memory to replace the Intel NUC board. Wrong. I had confused the [micro ATX](https://en.wikipedia.org/wiki/MicroATX) and [mini ITX](https://en.wikipedia.org/wiki/Mini-ITX) formats. The first one was too big to fit in the case. So I bought a classic ATX case with a cheap power supply and 3x2TB disks from work. **Storage1** was born.

At that point, I had a working storage server and some parts to build a second one. At the same time, my wife and I had a baby. My office at home became the newborn's bedroom. I paused this project for a year to focus on my family. Then we bought a house with plenty of space to handle life serenely.

During the move, I unpacked my very first computer, which I had assembled in 2008. The only missing piece was a physical slot to rack the fourth disk. I bought a [low-cost ATX case](https://www.amazon.fr/gp/product/B00LA7PC6Y/) and moved every part into it. I started before work on a Friday but didn't finish in time, so my home office was covered with computer parts all day long. When I finished work, I went back to the project, until a friendly neighbor called on me for help because his computer had crashed. Right before going to bed, I tried to connect the power button to the motherboard without instructions, and it didn't work. I finally found the answer on the web and made it work, at midnight. **Storage 2** was born.

It runs on quite old hardware (10+ years). I thought it would be easy to install FreeBSD on it, given that the system was created in the 90s[^2]. I tried to boot from USB but the stick was not recognized. I burnt a CD-ROM with version 12, the latest release at that time. The installer could not load because of a [Lua error](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234031) in the bootloader. In the comments and on forums, some people had managed to make version 11 work. I burnt a CD-ROM with version 11: same result. After having lost an afternoon of my time and two CD-ROMs, I went back into my comfort zone and installed a Debian 10 with success.

Recently, my family offered me the missing hardware pieces to finalize the third storage host: the big one with 4TB disks in the mini case, the one I had bought at the beginning of the project. In the end, it is not so practical. Disks are not fixed to the rack; they can move back and forth a few centimeters. Some disks were not recognized by the system because they were not fully connected. I pushed all of them with a screwdriver to ensure they were plugged into the SATA connector. For the price, I expected it to work out of the box. I was surprised to find four SATA ports on the motherboard where I expected five or six, so I removed one system disk. Goodbye dirty hack with adhesive tape to stick on the second SSD! Go join your friends in the stock. **Storage 3** was born.

Here is the detailed list of components:

| Host     | Component    | Reference |
| -------- | ------------ | --------- |
| storage1 | Case         | [Antec One](https://media.ldlc.com/ld/products/00/01/00/62/LD0001006251_2.jpg) |
|          | Power supply | [Antec Basiq Series VP350P](https://media.ldlc.com/ld/products/00/00/89/95/LD0000899597_2.jpg) |
|          | Motherboard  | [Gigabyte GA-B150M-DS3H](https://media.ldlc.com/ld/products/00/03/45/35/LD0003453579_2.jpg) |
|          | CPU          | [Intel Celeron G3900 (2.8 GHz)](https://media.ldlc.com/ld/products/00/01/47/39/LD0001473956_2_0001473966_0001571304_0001571323_0003614881.jpg) |
|          | RAM          | [G.Skill Aegis 4 Go (1 x 4 Go) DDR4 2133 MHz CL15](https://www.ldlc.com/fr-be/fiche/PB00202287.html) |
|          | System disks | [LDLC SSD F2 32 GB](https://media.ldlc.com/ld/products/00/03/42/11/LD0003421194_2_0003421246.jpg) (x2) |
|          | Data disks   | 2TB HDD 3.5" (x3) |
| storage2 | Case         | [Advance Grafit](https://www.amazon.fr/gp/product/B00LA7PC6Y/) |
|          | Power supply | No reference found |
|          | Motherboard  | Asus M2A-VM HDMI |
|          | CPU          | AMD Athlon 64 X2 5000+ Socket AM2 |
|          | RAM          | G.Skill Kit Extreme2 2 x 1 Go PC6400 PK (x2) |
|          | System disk  | Recycled 160GB HDD 3.5" |
|          | Data disks   | 1TB HDD 3.5" (x3) |
| storage3 | Case         | [In Win IW-MS04](https://www.ldlc.com/fr-be/fiche/PB00181814.html) |
|          | Motherboard  | [ASRock H310CM-ITX/AC](https://www.ldlc.com/fr-be/fiche/PB00275155.html) |
|          | CPU          | [Intel Celeron G4920 (3.2 GHz)](https://www.ldlc.com/fr-be/fiche/PB00247186.html) |
|          | RAM          | [G.Skill Aegis 4 Go (1 x 4 Go) DDR4 2133 MHz CL15](https://www.ldlc.com/fr-be/fiche/PB00202287.html) |
|          | System disk  | [LDLC SSD F2 32 GB](https://media.ldlc.com/ld/products/00/03/42/11/LD0003421194_2_0003421246.jpg) |
|          | Data disks   | 4TB HDD 3.5" (x3) |

Despite heterogeneous components, the storage servers have been running successfully for a while now.

[^1]: Later, I found out that the noise was coming from the disk backplane and not the motherboard. There is a buzzer that emits a sound sequence depending on the detected anomaly. At the apartment, and at my current house in the summer, the temperature in the room was too high (more than 29°C). I moved the host to a cold place. Problem solved.

[^2]: FreeBSD's [initial release](https://en.wikipedia.org/wiki/FreeBSD) was on November 1, 1993.

content/posts/increased-observability-with-the-TIG-stack.md

---
title: "Increased observability with the TIG stack"
date: 2020-08-10T18:00:00+02:00
---

[Observability](https://en.wikipedia.org/wiki/Observability) has become a buzzword lately. I must admit, this is one of the many reasons why I use it in the title. In reality, this article is about fetching measurements and creating beautiful graphs to feel like [detective Derrick](https://en.wikipedia.org/wiki/Derrick_(TV_series)), an *old* and wise detective solving cases by encouraging criminals to confess by themselves.

With the recent [gain in popularity](https://opensource.com/article/17/11/why-go-grows) of the [Go](https://golang.org/) programming language, we have seen a lot of new software coming into the database world: [CockroachDB](https://www.cockroachlabs.com/), [TiDB](https://pingcap.com/products/tidb), [Vitess](https://vitess.io/), etc. Among them, the **TIG stack** ([**T**elegraf](https://github.com/influxdata/telegraf), [**I**nfluxDB](https://github.com/influxdata/influxdb) and [**G**rafana](https://github.com/grafana/grafana)) has become a reference for gathering and displaying metrics.

The goal is to see the evolution of resource usage (memory, processor, storage space), power consumption and environmental variables (temperature, humidity) on every single host of the infrastructure.

# Telegraf

The first component of the stack is Telegraf, an agent that can fetch metrics from multiple sources ([input](https://github.com/influxdata/telegraf/tree/master/plugins/inputs)) and write them to multiple destinations ([output](https://github.com/influxdata/telegraf/tree/master/plugins/outputs)). There are tens of built-in plugins available! You can even gather a custom source of data with [exec](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec), as long as it emits an expected [format](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md).

I configured Telegraf to fetch and send metrics every minute (*interval* and *flush_interval* in the *agent* section are set to *"60s"*), which is enough for my personal usage. Most of the plugins I use are built-in: cpu, disk, diskio, kernel, mem, processes, system, zfs, net, smart, ping, etc.

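For reference, the relevant part of the agent section looks like this (a minimal excerpt; everything else is left at its default):

```
[agent]
  interval = "60s"
  flush_interval = "60s"
```
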
The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics (size, allocation, free space, etc.) on FreeBSD but not [on Linux](https://github.com/influxdata/telegraf/issues/2616). The issue is known, but the fix has not been merged upstream yet. So I developed a simple Python snippet to fill the gap on my only storage server running on Linux:

```
#!/usr/bin/python3

import subprocess


def parse_int(s):
    # the InfluxDB line protocol marks integers with a trailing "i"
    return str(int(s)) + 'i'


def parse_float_with_x(s):
    # zpool prints the dedup ratio as "1.00x"
    return float(s.replace('x', ''))


def parse_pct_int(s):
    return parse_int(s.replace('%', ''))


if __name__ == '__main__':
    measurement = 'zfs_pool'

    # -H removes headers, -p prints exact (parseable) values;
    # decode() makes the script work on Python 3 where check_output returns bytes
    pools = subprocess.check_output(['/usr/sbin/zpool', 'list', '-Hp']).decode().splitlines()
    for pool in pools:
        col = pool.split("\t")
        tags = {'pool': col[0], 'health': col[9]}
        fields = {}

        if tags['health'] == 'UNAVAIL':
            fields['size'] = 0
        else:
            fields['size'] = parse_int(col[1])
            fields['allocated'] = parse_int(col[2])
            fields['free'] = parse_int(col[3])
            fields['fragmentation'] = '0i' if col[6] == '-' else parse_pct_int(col[6])
            fields['capacity'] = parse_int(col[7])
            fields['dedupratio'] = parse_float_with_x(col[8])

        # emit one line of InfluxDB line protocol: measurement,tags fields
        tags = ','.join(['{}={}'.format(k, v) for k, v in tags.items()])
        fields = ','.join(['{}={}'.format(k, v) for k, v in fields.items()])
        print('{},{} {}'.format(measurement, tags, fields))
```

It is called by the following input:

```
[[inputs.exec]]
  commands = ['/opt/telegraf-plugins/zfs.py']
  data_format = "influx"
```

This exec plugin does exactly the same job as the zfs input running on FreeBSD.

All those metrics are sent to a single output, InfluxDB, hosted on the monitoring server.

# InfluxDB

Measurements can be stored in a time series database, which is designed to organize data around time. InfluxDB is a perfect fit for what we need. Of course, there are other time series databases. I've chosen this one because it is well documented, it fits my needs and I wanted to learn new things. [Installation](https://docs.influxdata.com/influxdb/v1.8/introduction/install/) is straightforward. I've enabled [HTTPS](https://docs.influxdata.com/influxdb/v1.8/administration/https_setup/) and [authentication](https://docs.influxdata.com/influxdb/v1.8/administration/authentication_and_authorization/#set-up-authentication). I use a simple setup with only one node in the *cluster*, no sharding and only one database. Even though Telegraf doesn't send that many metrics, I've created a default [retention policy](https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#retention-policy-management) to store two years of data, which is more than enough. A new default retention policy becomes the default route for all newly written points. Don't be afraid to see all the existing measurements vanish: nothing has been deleted. They just live under the previous policy and need to be [moved](https://community.influxdata.com/t/applying-retention-policies-to-existing-measurments/802). You should define a [backup](https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/) policy too.

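Creating such a policy could look like this (the database name "telegraf" is an assumption; 104 weeks approximates two years):

```
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 104w REPLICATION 1 DEFAULT
```
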
# Grafana

Now that we are able to gather and store metrics, we need to visualize them. This is the role of [Grafana](https://grafana.com/). During my career, I have played with [Graylog](https://docs.graylog.org/en/3.2/pages/dashboards.html), [Kibana](https://www.elastic.co/kibana) and Grafana. The last one is my favorite. It is generally blazing fast, even on a Raspberry Pi. The look and feel is amazing. The theme is dark by default, but I like the light one.

I have created four dashboards:
- **system**: load, processor, memory, system disk usage, disk I/O, network quality and bandwidth
- **storage**: ZFS pool allocation, capacity, fragmentation and uptime for each disk
- **power consumption**: kWh used per day, week, month and year, current UPS load, price per year (more details in a future post)
- **sensors**: ambient temperature, humidity and noise (more details in a future post)

Every single graph has a *$host* [variable](https://grafana.com/docs/grafana/latest/variables/templates-and-variables/) at the dashboard level to filter metrics per host. At the top of the screen, a dropdown menu is automatically created to select the host, based on an InfluxDB query.

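That variable query is presumably something along these lines (a sketch; the exact query depends on the dashboard):

```
SHOW TAG VALUES WITH KEY = "host"
```
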
And because a picture is worth a thousand words, here are some screenshots of my own graphs:

[](/grafana-system.png)
[](/grafana-storage.png)
[](/grafana-power-consumption.png)
[](/grafana-sensors.png)

# Infrastructure

To sum this up, the infrastructure looks like this:

*(infrastructure diagram)*

Whenever I want, I can sit back on a comfortable sofa, open a web browser and let the infrastructure speak for itself. Easy, right?

content/posts/infrastructure-overview.md

---
title: "Infrastructure overview"
date: 2020-07-20T18:30:00+02:00
---

The idea behind this infrastructure is to run on commodity servers. No need to buy big racks of expensive servers like we see in data centers; simple homemade computers will do the job. At work, I have access to cheap hard drives that were used in servers and are either out of warranty or no longer suitable for enterprise workloads. They generally cost half the market price. I have a mix of brand-new and re-used drives to reduce the risk of having two disks fail at the same time in the same host.

There are three components in the infrastructure:
* **storage** servers that hold the data
* a **monitoring** server that grabs metrics and sends alerts
* a **vps**[^1] server used to create a VPN[^2] and watch for monitoring server availability

{{< rawhtml >}}
<p style="text-align: center;"><img src="/infrastructure-overview.svg" alt="Infrastructure overview" style="width: 65%;"></p>
{{< /rawhtml >}}

# Storage

Every storage server is designed to be hosted in a different location. Each one could be unplugged from a location, plugged in somewhere else and work the same way as before. They only require an Internet access to contact the VPS and join the VPN.

The technology that holds the data is **[ZFS](https://en.wikipedia.org/wiki/ZFS)**. I am lucky enough to use it at work for production workloads and it makes life way easier. I am used to managing GNU/Linux servers ([Debian](https://www.debian.org/)) and I know that [FreeBSD](https://www.freebsd.org/) has built-in ZFS support, so I wanted to give it a try. I didn't choose [FreeNAS](https://www.freenas.org/) because I wanted to do everything by myself, to learn and to use only the features I needed.

The right balance I found to maximize available disk space while keeping data safe is to use **three disks** in a [RAID-Z](https://en.wikipedia.org/wiki/ZFS#RAID_(%22RaidZ%22)). A storage server can lose one disk at a time without breaking the service, while almost all the cumulative space remains available to use. Datasets are configured to use **lz4** compression because it saves disk space without putting too much pressure on the CPU.

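Creating such a pool takes two commands (device names are assumptions; adjust them to your system):

```
zpool create storage raidz ada1 ada2 ada3
zfs set compression=lz4 storage
```
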
| Host     | Disk capacity |
| -------- | ------------: |
| storage1 | 5.44T |
| storage2 | 2.72T |
| storage3 | 10.9T |

# Monitoring

Like any system administrator, I want to be alerted when something goes wrong in the infrastructure. I also want to browse the history with graphs to see trends. There was a [Raspberry Pi](https://www.raspberrypi.org/) waiting in a drawer to be used. It is now connected to the Wi-Fi network somewhere in the house, perfectly hidden, doing this job in the background.

# VPS

I am not a network engineer. Actually, this is not my job and I don't want it to be. There are numerous experts in the field who do this very well, and I am thankful to them. But a computer without network connectivity is not very useful. When self-hosting, you have to deal with your ISP modem settings, and there is no standard as far as I know. Mine has no fixed public IPv4 address. I tried to develop scripts to automatically update a subdomain name with the current public IP address and contact it from the outside. The name resolution worked, but the communication always failed.

To solve this problem, I [rent a VPS](https://www.ovhcloud.com/fr/vps/) hosted close to the storage locations, on which I have configured an [OpenVPN](https://openvpn.net/) server. This is a single point of failure and a *bottleneck*, because all the traffic goes through this server to reach the others. In practice, the Internet bandwidth at home is the real bottleneck, so the VPS should not be a problem. It also acts as the entry point from the outside world for metrics and monitoring websites.

[^1]: [Virtual Private Server](https://en.wikipedia.org/wiki/Virtual_private_server)

[^2]: [Virtual Private Network](https://en.wikipedia.org/wiki/Virtual_private_network)

content/posts/network-configuration-with-openvpn.md

---
title: "Network configuration with OpenVPN"
date: 2020-07-27T18:00:00+02:00
---

Networking is hard. Dealing with ISP modem settings is even harder. Mine doesn't have a static public IP address by default. If the modem reboots, it is likely to be assigned a new one. For regular people browsing the Internet, this is not a problem. But for hackers like us, it means we cannot use the IP address itself to reach the private network from the outside world. It becomes a problem when we try to join hosts across different networks.

For your information, this is the price my ISP would like me to pay for this "option":

*(screenshot: the ISP's price for a fixed IP address)*

This is insane!

My first idea was to deploy a script on each host that discovers the public IP address and registers an A record on a given subdomain name. This job could be run by a cron daemon. It would turn a dynamic IP address into a predictable name, much like the [no-ip](https://www.noip.com/) service. It worked: I was able to know the home public IP address.

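The discovery half of such a script fits in a few lines (a sketch; api.ipify.org is one of several public "what is my IP" services, and the DNS update depends entirely on the provider's API, so it is left out):

```
#!/usr/bin/env python3
from urllib.request import urlopen

# ask a public service for our current public IP address
public_ip = urlopen("https://api.ipify.org").read().decode().strip()
print(public_ip)
```
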
Then I started to use [port mapping](https://www.proximus.be/support/en/id_sfaqr_ports_mapping/personal/support/internet/internet-at-home/advanced-settings/internet-port-mapping-on-your-modem.html#/bbox3) to redirect a given port on my router to a host in the private network. By default, some protocols like SSH, HTTP and HTTPS are [not open](https://www.proximus.be/support/en/id_sfaqr_ports_unblock_secu/personal/support/internet/security-and-protection/internet-ports-and-security/open-internet-ports.html), even if you configure port mapping correctly. You have to go to the ISP website and lower your *security level* from high to low. At my apartment, I managed to reach some ports from the outside, but never at my current house. The major problem with this procedure is its **complexity** and the fact that it **highly depends on your ISP devices/settings**. I had to find a simpler solution.

Here comes [OpenVPN](https://openvpn.net/). It's open-source software that creates private networks on top of public ones. It uses encryption to secure the connection between hosts and keep your transport safe. The initial setup is quite long and complex, but you just have to follow this [great tutorial](https://www.digitalocean.com/community/tutorials/how-to-set-up-an-openvpn-server-on-debian-10) and it will work like a charm. The drawback is that you'll need a single point to act as the server. I chose to [rent a VPS](https://www.ovhcloud.com/fr/vps/) for a few euros per month. It has a fixed IP address and a decent bandwidth for our usage. It runs on Debian, but plenty of operating systems are available.

OpenVPN certificate management can be a bit confusing at first. I use my monitoring host as the CA[^1] to keep trust at home, and every host has its own client certificate. I've set up static IP addressing to always assign the same address to clients. I've enabled direct communication between clients because the storage servers will send snapshots to each other. I didn't configure clients to forward all their packets to the VPN server, because the goal here is not to hide behind it for privacy.

I have changed the following settings on the VPN server:

```
topology subnet ; declare a subnet like home
server 10.xx.xx.xx 255.xx.xx.xx ; with the range you like
client-to-client ; allow clients to talk to each other
client-config-dir /etc/openvpn/ccd ; static IP configuration per client
ifconfig-pool-persist /var/log/openvpn/ipp.txt ; IP lease settings
```

Example of an *ipp.txt* file:

```
storage1,10.xx.xx.xx
storage2,10.yy.yy.yy
storage3,10.zz.zz.zz
```

Example of a */etc/openvpn/ccd/storage1.user* file:

```
ifconfig-push 10.xx.xx.xx 255.xx.xx.xx
```

The network configuration declared in *client-config-dir* must match the one in *ipp.txt*.

The configuration generated by the *make_config.sh* script (see the tutorial mentioned above) can be written to:
* */etc/openvpn/client.conf* (Debian)
* */usr/local/etc/openvpn/openvpn.conf* (FreeBSD)

When the OpenVPN service is started, you should see the tun interface up and running, on FreeBSD:

```
tun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1500
	options=80000<LINKSTATE>
	inet6 fe80::xxxx:xxxx:xxxx:xxxx%tun0 prefixlen 64 scopeid 0x3
	inet 10.xx.xx.xx --> 10.xx.xx.xx netmask 0xffffff00
	groups: tun
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	Opened by PID 962
```

as well as on Debian:

```
3: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 100
    link/none
    inet 10.xx.xx.xx/xx brd 10.xx.xx.xx scope global tun0
       valid_lft forever preferred_lft forever
```

Et voilà! Every server is now part of a private network:

```
monitoring ~ # nmap -sn 10.xx.xx.xx/xx
Starting Nmap 7.70 ( https://nmap.org ) at 2020-07-13 17:28 CEST
Nmap scan report for vps (10.xx.xx.xx)
Host is up (0.018s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.032s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.24s latency).
Nmap scan report for 10.xx.xx.xx
Host is up (0.22s latency).
Nmap scan report for 10.xx.xx.xx
Host is up.
Nmap done: xx IP addresses (5 hosts up) scanned in 13.11 seconds
```

[^1]: [Certificate Authority](https://en.wikipedia.org/wiki/Certificate_authority)

content/posts/power-consumption-and-failures-prevention.md

---
title: "Power consumption and failures prevention"
date: 2020-08-14T18:00:00+02:00
---

Providing a full storage service means having computers up 24x7. On one hand, if we power off the local storage server when we aren't using it, we'll have to find a way to respect the backup policy and synchronize with remote servers that could be down at that moment. On the other hand, if we leave the storage server up all the time, it will consume unnecessary resources and throw money down the drain. I am deeply convinced that a personal computer, which is idle most of the time, doesn't consume that much power. But how to verify it?

With [observability]({{< ref "posts/increased-observability-with-the-tig-stack" >}}), I thought it would be easy to gather power consumption via built-in sensors. I tried something I know, [lm_sensors](https://hwmon.wiki.kernel.org/lm_sensors), which is included in the Linux kernel and exposes CPU temperatures, fan speeds, power voltages, etc.

```
storage2 ~ # sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +30.0°C
Core0 Temp:  +22.0°C
Core1 Temp:  +30.0°C
Core1 Temp:  +16.0°C

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +40.0°C  (crit = +75.0°C)

atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:      +1.10 V  (min = +1.45 V, max = +1.75 V)
 +3.3 Voltage:      +3.39 V  (min = +3.00 V, max = +3.60 V)
 +5.0 Voltage:      +4.97 V  (min = +4.50 V, max = +5.50 V)
+12.0 Voltage:     +12.22 V  (min = +11.20 V, max = +13.20 V)
CPU FAN Speed:     3391 RPM  (min = 0 RPM, max = 1800 RPM)
CHASSIS FAN Speed:    0 RPM  (min = 0 RPM, max = 1800 RPM)
POWER FAN Speed:   1662 RPM  (min = 0 RPM, max = 1800 RPM)
CPU Temperature:    +26.0°C  (high = +90.0°C, crit = +125.0°C)
MB Temperature:     +37.0°C  (high = +70.0°C, crit = +125.0°C)
```

The ACPI interface returns some voltage measurements, but I doubt they can be used to derive the instantaneous consumption in watts (W) and extrapolate the consumption over time in kilowatt-hours (kWh). On laptops, such information can be computed from battery statistics. Unfortunately, all the computers in this infrastructure are desktops without batteries.

I needed to buy a product. A [lot](https://modernsurvivalblog.com/alternative-energy/kill-a-watt-meter/) [of](https://www.howtogeek.com/107854/the-how-to-geek-guide-to-measuring-your-energy-use/) [websites](https://www.pcmag.com/news/how-to-measure-home-power-usage) [talk](https://michaelbluejay.com/electricity/measure.html) about how to measure power consumption for computers and even for the whole house. The common recommendation is to use a [wattmeter](https://en.wikipedia.org/wiki/Wattmeter).

{{< rawhtml >}}
<p style="text-align: center;"><img src="/zaeel-wattmetre.jpg" alt="Wattmeter" style="width: 25%;"></p>
{{< /rawhtml >}}

It's an instrument plugged between the power outlet and your device to measure how much energy is consumed instantaneously (W) and over time (kWh), and even the total price if you have configured the kWh price. An LCD displays the results. A wattmeter is cheap; I've bought [this model](https://www.amazon.fr/dp/B07GN5NPDJ/ref=cm_sw_r_tw_dp_x_FMMiFb2911HN7), which does a good job. Sadly, we cannot easily extract the data shown on the LCD and load it into the metrics infrastructure. It also lacks precision for the price: we can enter only two digits after the decimal point, while the energy provider gives a price with five digits.

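The extrapolation itself is simple arithmetic; here is a quick sketch with made-up numbers (an average draw of 25 W and a price of 0.25 €/kWh):

```
avg_watts = 25.0         # made-up average draw in watts
price_per_kwh = 0.25     # made-up price in euros per kWh

kwh_per_year = avg_watts * 24 * 365 / 1000    # 219 kWh over a year
cost_per_year = kwh_per_year * price_per_kwh  # about 54.75 euros
print('{:.0f} kWh/year, {:.2f} EUR/year'.format(kwh_per_year, cost_per_year))
```
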
Speaking of the price, my [energy provider](https://www.engie.be/fr/) publishes a beautiful but incomprehensible [grid of prices](https://www.engie.be/fr/energie/electricite-gaz/prix-conditions). Prices depend on the pack of products, the region and the distributor, and they change over time. You can have one meter rate for the day and another for the night. Moreover, the price is displayed in cents, not euros! I had to call them to get a price estimate. Come on, we live in a digitized world: they should at least display the current price of the contract somewhere in the customer panel.

During my research, I found that an uninterruptible power supply (UPS) can be used to gather power consumption metrics. As a bonus, it protects against the power variations and interruptions that could harm computers. However, UPS units are quite expensive, ranging from 50€ to hundreds of euros. As a total newbie in this domain, I read this detailed [guide](https://www.materiel.net/guide-achat/g13-les-onduleurs-et-prises-parafoudre/1/) (FR) to gain some knowledge. I decided to buy an [APC Back-UPS Pro 550](https://www.apc.com/shop/be/en/products/APC-Power-Saving-Back-UPS-Pro-550/P-BR550GI).

{{< rawhtml >}}
<p style="text-align: center;"><img src="/apc-back-ups-pro-550.jpg" alt="UPS" style="width: 25%;"></p>
{{< /rawhtml >}}

It has a USB interface to control it with [apcupsd](https://en.wikipedia.org/wiki/Apcupsd) and display power information with the "apcaccess" binary. It's compatible with both Debian and FreeBSD, and there is even a [telegraf plugin](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/apcupsd) for it!

```
storage1 ~ # /usr/local/sbin/apcaccess
APC      : 001,036,0867
DATE     : 2020-07-30 15:56:46 +0200
HOSTNAME : storage1
VERSION  : 3.14.14 (31 May 2016) freebsd
UPSNAME  : storage1
CABLE    : USB Cable
DRIVER   : USB UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2020-07-26 18:28:21 +0200
MODEL    : Back-UPS RS 550G
STATUS   : ONLINE
LINEV    : 234.0 Volts
LOADPCT  : 10.0 Percent
BCHARGE  : 100.0 Percent
TIMELEFT : 37.5 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 0 Seconds
SENSE    : Medium
LOTRANS  : 176.0 Volts
HITRANS  : 282.0 Volts
ALARMDEL : No alarm
BATTV    : 13.7 Volts
LASTXFER : No transfers since turnon
NUMXFERS : 0
TONBATT  : 0 Seconds
CUMONBATT: 0 Seconds
XOFFBATT : N/A
SELFTEST : NO
STATFLAG : 0x05000008
SERIALNO : 4B1939P01928
BATTDATE : 2019-09-23
NOMINV   : 230 Volts
NOMBATTV : 12.0 Volts
NOMPOWER : 330 Watts
FIRMWARE : 857.L7 .I USB FW:L7
END APC  : 2020-07-30 15:57:32 +0200
```

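On the Telegraf side, the input configuration is tiny (a sketch, assuming apcupsd listens on its default local address):

```
[[inputs.apcupsd]]
  servers = ["tcp://127.0.0.1:3551"]
```
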
The 550 is the first model in the Back-UPS Pro range, so it only has [IEC C13 power plugs](https://en.wikipedia.org/wiki/IEC_60320#C13/C14_coupler), suitable for computers, and no [Euro/French plugs](https://en.wikipedia.org/wiki/AC_power_plugs_and_sockets#CEE_7.2F5_socket_and_CEE_7.2F6_plug_.28French.3B_Type_E.29) compatible with any electrical device. As I connect only a single computer to the UPS, this is the most economical solution.

Once the data had been fed into the observability platform, I was able to import this [beautiful dashboard](https://grafana.com/grafana/dashboards/10835) from the Grafana community. I've customized it to my own needs and here is the result:

[](/power-consumption-storage1.png)
[](/power-consumption-storage2.png)
[](/power-consumption-storage3.png)

You can download my dashboard [here](/grafana-power-consumption.json).

Our winner is *storage3*, which costs less than a kebab per year! The worst case is *storage2*, the old hardware, which consumes the equivalent of an incandescent light bulb. See, the power consumption is not so bad after all.

content/posts/problem-detection-and-alerting.md

---
title: "Problem detection and alerting"
date: 2020-08-07T18:00:00+02:00
---

Everything is distributed, automated and runs in perfect harmony with a common goal: protect your data. But bad things happen, and rarely when you expect them. This is why you need to watch service states and send a notification when something goes wrong. Monitoring systems are well known in the enterprise world. For our use case, we don't need to deploy a complex infrastructure to check a couple of hosts. For this reason, I chose the good old [Nagios Core](https://www.nagios.org/projects/nagios-core/). It even provides a web interface for humans like us.

# How it works

There are two types of checks:
- **host**: check whether a host is alive or not
- **service**: check whether a service on a host is healthy or not

To check if a host is available, the simplest implementation is to use ping:

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-host-check.svg" alt="Monitoring host check" style="width: 50%;"></p>
{{< /rawhtml >}}

For services, there is a tool to execute remote plugins called [NRPE](https://support.nagios.com/kb/article/nrpe-agent-and-plugin-explained-612.html)[^1]. It works with a client on the monitoring host and an agent on the remote host that executes commands on demand. The return code defines the check result.

{{< rawhtml >}}
<p style="text-align: center;"><img src="/monitoring-service-check.svg" alt="Monitoring service check" style="width: 65%;"></p>
{{< /rawhtml >}}

Service states can be:
- **OK**: it works as expected
- **WARNING**: it works but we should take a look
- **CRITICAL**: it's broken
- **UNKNOWN**: something is wrong with the plugin configuration or communication

Plugins can define a warning and/or a critical threshold to qualify the state. For example, I would like to know when the disk space usage of a storage host goes over, say, 80% (warning) and 100% (critical). That way I have time to free some space or order new hard drives before the situation becomes critical, and if I do nothing, a higher-level alert will be sent when the disk becomes full.

# Installation

My monitoring host runs on Raspbian 10:

```
apt update
apt install nagios4 monitoring-plugins
```

Installed.

By default, the web interface was broken. I had to disable the following block in the */etc/nagios4/apache2.conf* file:

```
# <Files "cmd.cgi">
#    ...
# </Files>
```

For security reasons, I enabled basic authentication (a.k.a. *htaccess*) in the *DirectoryMatch* block of the same file and created an *admin* user:

```
AuthUserFile "/etc/nagios4/htdigest.users"
AuthType Basic
AuthName "Restricted Files"
AuthBasicProvider file
Require user admin
```

In the CGI configuration file */etc/nagios4/cgi.cfg*, the default user can be set to *admin*, as the interface is now protected by basic authentication:

```
default_user_name=admin
```

Now the web interface should be up and running at http://monitoring-ip/nagios4. For my own usage, I've set up a reverse proxy (nginx) on the VPS host to expose this interface on a public endpoint, so I can access it from anywhere with my credentials.

# Configuration

A fresh installation applies sane defaults to make Nagios work out of the box. It even enables localhost monitoring. Unfortunately, I want to check this host like any other server in the infrastructure. The first thing I do is disable the following include in the */etc/nagios4/nagios.cfg* file:

```
#cfg_file=/etc/nagios4/objects/localhost.cfg
```

I don't want to be spammed by my monitoring system. Servers may be slow and take time to respond. The Wi-Fi connection of the monitoring system may hang for a while... until someone physically reboots the host. During such an extended period of time (multiple hours), my family and I may be asleep. I don't want to wake up to hundreds of notifications saying "Hey, the monitoring system is DOWN!". One or two notifications are enough.

The following new templates can be defined in */etc/nagios4/conf.d/templates.cfg*:

```
define host {
    name                  home-host
    use                   generic-host
    check_command         check-host-alive
    contact_groups        admins
    notification_options  d,u,r
    check_interval        5
    retry_interval        5    ; retry every 5 minutes
    max_check_attempts    12   ; alert at 1 hour (12x5 minutes)
    notification_interval 720  ; resend notifications every 12 hours
    register              0    ; template
}

define service {
    name                  home-service
    use                   generic-service
    check_interval        5
    retry_interval        5    ; retry every 5 minutes
    max_check_attempts    12   ; alert at 1 hour (12x5 minutes)
    notification_interval 720  ; 12 hours
    register              0    ; template
}
```

There are multiple components to define:
- **hosts** (*/etc/nagios4/conf.d/hosts.cfg*): every single host
- **hostgroups** (*/etc/nagios4/conf.d/hostgroups.cfg*): groups of hosts
- **services** (*/etc/nagios4/conf.d/services.cfg*): services that will be attached to hostgroups

For example, I need to know the ZFS usage of all storage servers:
- **hosts**: *storage1*, *storage2*, *storage3* with their IP addresses
- **hostgroups**: *storage-servers*, regrouping *storage1*, *storage2* and *storage3*
- **services**: *zfs_capacity*, attached to *storage-servers*


Host definition:

```
define host {
    use        home-host
    host_name  storage1
    alias      storage1
    address    XX.XX.XX.XX
}
```

Hostgroup definition:

```
define hostgroup {
    hostgroup_name  storage-servers
    alias           Storage servers
    members         storage1,storage2,storage3
}
```

Service definition:

```
define service {
    use                  home-service
    hostgroup_name       storage-servers
    service_description  zfs_capacity
    check_command        check_nrpe!check_zfs_capacity
}
```

On all storage servers, we also need to define an NRPE command:

```
command[check_zfs_capacity]=/usr/local/bin/sudo /usr/local/sbin/sanoid --monitor-capacity
```
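
Since the check runs *sanoid* through *sudo*, the NRPE user probably needs a passwordless sudoers entry. A minimal
sketch, assuming the daemon runs as a user named *nrpe* (the user name and file path are assumptions, adapt to your
system):

```
# /usr/local/etc/sudoers.d/nrpe -- allow the NRPE user to run this exact command only
nrpe ALL=(root) NOPASSWD: /usr/local/sbin/sanoid --monitor-capacity
```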

ZFS usage is now monitored!

I have repeated this process for all the services I wanted to check, ending up with:

[![Monitoring services overview](/monitoring-services.png)](/monitoring-services.png)

A single host can be in multiple hostgroups. For my tests, I always added features to *storage1* first: I created a
hostgroup for each new capability and added only *storage1* to it, as shown in the sketch below. That means *storage1*
had the same services as *storage2* and *storage3*, plus the newly tested ones.
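
A minimal sketch of such a throwaway hostgroup (the *beta-servers* name is made up for illustration):

```
define hostgroup {
    hostgroup_name  beta-servers
    alias           Servers testing a new capability
    members         storage1
}
```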

# Notifications

At work, we use [Opsgenie](https://www.atlassian.com/software/opsgenie) to define on-call schedules within a team. Of
course, I don't want to receive a push notification on my phone for my home servers, which is why I chose to be
notified by e-mail. In the past, I hosted some e-mail boxes at home but I didn't want to deal with spam and SPF records
to prove to the world that my service is legit. I have a couple of [domain names](https://www.gandi.net/en/domain) with
(limited) e-mail services included. For monitoring purposes, this is more than enough to do the job.

On Nagios, you can set the e-mail address in the contacts configuration file
*/etc/nagios4/objects/contacts.cfg*.
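
A minimal sketch of what that contact definition looks like (the address is a placeholder; the other fields follow the
Debian sample configuration, so double-check against your own file):

```
define contact {
    use           generic-contact         ; inherits notification commands and periods
    contact_name  nagiosadmin
    alias         Nagios Admin
    email         monitoring@example.com  ; placeholder address
}
```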

I followed this [great tutorial](https://www.linode.com/docs/email/postfix/postfix-smtp-debian7/) to configure
[postfix](http://www.postfix.org/) to send e-mails using the SMTP server of the provider. Secure, and no more spam. I
have configured this new e-mail box on my phone so I can be alerted asynchronously and smoothly when something goes
wrong.
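
The relevant part of the resulting */etc/postfix/main.cf* looks roughly like this (the relay host is a placeholder and
the exact parameters depend on your provider, so treat this as a sketch):

```
# relay all outgoing mail through the provider's SMTP server (placeholder host)
relayhost = [smtp.example.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
```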

[^1]: Nagios Remote Plugin Executor

47
content/posts/state-of-internet-bandwidth-in-Belgium.md
Normal file

@ -0,0 +1,47 @@
---
title: "State of Internet bandwidth in Belgium"
date: 2020-07-31T18:00:00+02:00
---

I was born and raised in a little city next to Paris, in **France**. In the early 2000s, unlimited "high-speed"
Internet access revolutionized communications: no need to monopolize the phone line with a 56Kbps modem anymore. Since
then, the bandwidth has kept increasing. We have seen the ADSL, ADSL2 and fiber technologies. We had "triple play"
offers where unlimited phone calls, TV and Internet were packed together. There were three major companies on the
market: [France Telecom/Orange](https://www.orange.fr), [Bouygues](https://www.bouyguestelecom.fr/) and
[Neuf/Cegetel/SFR](https://www.sfr.fr/) (depending on the year). Then [Free](https://www.free.fr) jumped into the
market and broke prices with revolutionary offers. Since that time, all French ISPs have offered "low prices" – between
30 and 50€/month – for "high-speed" access – hundreds of Mbps both down and up – thanks to the fiber deployment.

Then I moved to **Belgium** for personal reasons. My parents-in-law had chosen
[Belgacom/Proximus](https://www.proximus.be/en/personal/?) and were happy with it, so I followed their choice. This
ISP has deployed the VDSL technology, which can be "fast". My first apartment was very close to the DSLAM[^1] so my
bandwidth was good enough, 50Mbps/15Mbps. The price was noticeably higher for Internet and TV only, 50€/month. If we
had wanted a phone line, we would have had to add 20€ to the monthly bill and pay for each phone call! You can get
unlimited phone calls for [1.19€/month](https://www.ovhtelecom.fr/telephonie/voip/decouverte.xml) using VoIP, the very
same technology our ISPs use. There is also a limit to the monthly Internet volume we can consume: it was something
like 600GB/month when I subscribed, raised to 3TB now.

When I moved to my current house, I knew the bandwidth would drop. Proximus failed to organize my move on time. You
can reschedule the appointment yourself on the website, but if you go to a shop instead, they can't do anything because
it has been scheduled online. I canceled the first appointment online and they created a new one with an additional
two-week delay, one month after the move. So I subscribed to [Voo](https://www.voo.be/en), the *fastest Internet of
Belgium* as they say in their [commercials](https://www.youtube.com/watch?v=LKv6LtaXIf4). Same price, better speed,
120Mbps/10Mbps... for a week. Then I had three months of packet loss, 20% on average. It was unusable. The following
two months were stable with a bandwidth drop, 70Mbps/10Mbps. Then packet loss again, 80% on average this time!
Horrible. I re-subscribed to Proximus, with a 20Mbps/6Mbps bandwidth, but it has been stable since the change. All of
that for 60€/month.

I called Proximus to be notified when the fiber would come to my street so I could finally catch up with our neighbors'
speeds, kind of. They have no plan to install it. No date. Nothing. In the meantime, my father and my grandparents have
**gigabit** fiber installed at home for a lower price than mine. And even if Proximus deploys it, [upload bandwidth is
limited to 100Mbps](https://www.proximus.be/en/id_cr_fiber/personal/orphans/fiber-to-your-home.html) where it can be
[200Mbps](https://www.sfr.fr/offre-internet/fibre-optique) or even [600](https://www.free.fr/freebox/freebox-delta)
[Mbps](https://boutique.orange.fr/internet/offres-fibre/livebox-up) in France. As of today, the maximum bandwidth I
could get at home is the 400Mbps/20Mbps promised by Voo, with the stability we know.

Belgian ISPs, Proximus and Voo, when will you stop stealing from our pockets and start generalizing very high-speed
Internet access across this small country of ours? We are in the 2020s, not the 2000s.

[^1]: [Digital subscriber line access
multiplexer](https://en.wikipedia.org/wiki/Digital_subscriber_line_access_multiplexer), the closer you are, the faster
your bandwidth is.

24
content/posts/storage-servers-at-home.md
Normal file

@ -0,0 +1,24 @@
---
title: "Storage servers at home"
date: 2020-07-17T19:00:00+02:00
---

I was born in the 90s. I grew up with computers. Other generations call us "digital natives". I am lucky and proud to
work with computers every day, with a database specialization. People tend to generate lots of data. It might be
administrative papers (bills, contracts, paychecks), sentimental photo albums or whatever else, as long as it is
**their** data. At work, we pay attention to backing up every piece of data as though it were the most important thing
in the world. At home, it should be the same but, in fact, nobody really cares about it until the data is gone for
good.

My family members used to buy a single USB hard drive, copy their data to it from time to time, and consider it safe.
Safety highly depends on the frequency of the backups and, in practice, they didn't copy very often. When the drive
fails, they call me to the rescue, but I'm not a magician.

Another solution involves sending their data to "the cloud" because they have seen on TV that it will solve all of
their problems. Cloud providers can, intentionally or unintentionally, leak their data. To put it in physical terms,
I'm not sure my family would want to ship their storage cupboard to the United States for the sake of data safety. We
live in Belgium and France. There is no point in sending our data to the other side of the planet, in someone else's
hands.

So I decided to **self-host a set of storage servers at home** and offer this service to my own family. It has to be
simple, as my parents will be the main users. I am a full-time employee and a proud dad, so I only have a little bit of
time for service maintenance. It is an opportunity for me to learn and to share it with the world. Welcome to my
self-hosting project. I hope you will learn something too.