---
title: "Increased observability with the TIG stack"
date: 2020-08-10T18:00:00+02:00
---
[Observability](https://en.wikipedia.org/wiki/Observability) has become a buzzword lately. I must admit, this is one of
the many reasons why I use it in the title. In reality, this article will talk about fetching measurements and creating
beautiful graphs to feel like [detective Derrick](https://en.wikipedia.org/wiki/Derrick_(TV_series)), an *old* and wise
detective solving cases by encouraging criminals to confess by themselves.
With the [Go](https://golang.org/) programming language's recent [gain in
popularity](https://opensource.com/article/17/11/why-go-grows), we have seen a lot of new software coming into the
database world: [CockroachDB](https://www.cockroachlabs.com/), [TiDB](https://pingcap.com/products/tidb),
[Vitess](https://vitess.io/), etc. Among them, the **TIG stack**
([**T**elegraf](https://github.com/influxdata/telegraf), [**I**nfluxDB](https://github.com/influxdata/influxdb) and
[**G**rafana](https://github.com/grafana/grafana)) has become a reference to gather and display metrics.
The goal is to see the evolution of resource usage (memory, processor, storage space), power consumption and
environmental conditions (temperature, humidity) on every single host of the infrastructure.
# Telegraf
The first component of the stack is Telegraf, an agent that can fetch metrics from multiple sources
([input](https://github.com/influxdata/telegraf/tree/master/plugins/inputs)) and write them to multiple destinations
([output](https://github.com/influxdata/telegraf/tree/master/plugins/outputs)). There are dozens of built-in plugins
available! You can even gather data from a custom source with the
[exec](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec) input, as long as it produces one of the
expected [formats](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md).
I configured Telegraf to fetch and send metrics every minute (*interval* and *flush_interval* in the *agent* section
are set to *"60s"*), which is enough for my personal usage. Most of the plugins I use are built-in: cpu, disk, diskio,
kernel, mem, processes, system, zfs, net, smart, ping, etc.
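As a minimal sketch, the relevant *telegraf.conf* sections could look like this (only a few of the built-in inputs are
shown and their options are left at their defaults):
```
[agent]
  interval = "60s"
  flush_interval = "60s"

[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
```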
The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics like
size, allocation, free space, etc, on FreeBSD but not [on Linux](https://github.com/influxdata/telegraf/issues/2616).
The issue is known but a fix has not been merged upstream yet, so I have developed a simple Python snippet to fill the
gap on my only storage server running on Linux:
```
#!/usr/bin/python
import subprocess


def parse_int(s):
    # Integer fields must be suffixed with 'i' in the InfluxDB line protocol
    return str(int(s)) + 'i'


def parse_float_with_x(s):
    # The dedup ratio is reported by zpool as '1.00x'
    return float(s.replace('x', ''))


def parse_pct_int(s):
    # Fragmentation and capacity are reported as percentages
    return parse_int(s.replace('%', ''))


if __name__ == '__main__':
    measurement = 'zfs_pool'
    # -H removes headers, -p prints exact (parsable) values;
    # universal_newlines=True returns text instead of bytes (Python 3 compatibility)
    pools = subprocess.check_output(['/usr/sbin/zpool', 'list', '-Hp'],
                                    universal_newlines=True).splitlines()
    for pool in pools:
        col = pool.split("\t")
        tags = {'pool': col[0], 'health': col[9]}
        fields = {}
        if tags['health'] == 'UNAVAIL':
            fields['size'] = 0
        else:
            fields['size'] = parse_int(col[1])
            fields['allocated'] = parse_int(col[2])
            fields['free'] = parse_int(col[3])
            fields['fragmentation'] = '0i' if col[6] == '-' else parse_pct_int(col[6])
            fields['capacity'] = parse_int(col[7])
            fields['dedupratio'] = parse_float_with_x(col[8])
        # Emit one line per pool in the InfluxDB line protocol:
        # measurement,tag=value[,...] field=value[,...]
        tags = ','.join(['{}={}'.format(k, v) for k, v in tags.items()])
        fields = ','.join(['{}={}'.format(k, v) for k, v in fields.items()])
        print('{},{} {}'.format(measurement, tags, fields))
```
Called by the following input:
```
[[inputs.exec]]
  commands = ['/opt/telegraf-plugins/zfs.py']
  data_format = "influx"
```
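With *data_format = "influx"*, the script is expected to print the InfluxDB line protocol, one line per pool. For a
healthy pool the output looks something like this (the pool name and values are made up for illustration):
```
zfs_pool,pool=tank,health=ONLINE size=3985729650688i,allocated=1536229650688i,free=2449500000000i,fragmentation=4i,capacity=38i,dedupratio=1.0
```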
This exec plugin does exactly the same job as the zfs input running on FreeBSD.
All those metrics are sent to a single output, InfluxDB, hosted on the monitoring server.
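As a sketch, the matching output section could look like this (the URL, database name and credentials are placeholders,
not the actual setup):
```
[[outputs.influxdb]]
  urls = ["https://monitoring.example.com:8086"]
  database = "telegraf"
  username = "telegraf"
  password = "secret"
```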
# InfluxDB
Measurements can be stored in a time series database, which is designed to organize data around time. InfluxDB is a
perfect fit for what we need. Of course, there are other time series databases. I've chosen this one because it is
well documented, it fits my needs and I wanted to learn new things.
[Installation](https://docs.influxdata.com/influxdb/v1.8/introduction/install/) is straightforward. I've enabled
[HTTPS](https://docs.influxdata.com/influxdb/v1.8/administration/https_setup/) and
[authentication](https://docs.influxdata.com/influxdb/v1.8/administration/authentication_and_authorization/#set-up-authentication).
I use a simple setup with only one node in the *cluster*. No sharding. Only one database. Even if there are not that
many metrics sent by Telegraf, I've created a default [retention
policy](https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#retention-policy-management) to store
two years of data, which is more than enough. A new default retention policy becomes the destination for all new
points. Don't be afraid if all the existing measurements seem to have vanished: nothing has been deleted, they just
live under the previous policy and need to be
[moved](https://community.influxdata.com/t/applying-retention-policies-to-existing-measurments/802). You should define a
[backup](https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/) policy too.
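As a sketch, such a retention policy can be created with InfluxQL along these lines (the *telegraf* database name is an
assumption about the setup):
```
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 730d REPLICATION 1 DEFAULT
```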
# Grafana
Now that we are able to gather and store metrics, we need to visualize them. This is the role of
[Grafana](https://grafana.com/). During my career, I played with
[Graylog](https://docs.graylog.org/en/3.2/pages/dashboards.html), [Kibana](https://www.elastic.co/kibana) and Grafana.
The last one is my favorite. It is generally blazing fast! Even on a Raspberry Pi. The look and feel is amazing. The
theme is dark by default but I like the light one.
I have created four dashboards:
- **system**: load, processor, memory, system disk usage, disk i/o, network quality and bandwidth
- **storage**: ZFS pool allocation, capacity, fragmentation and uptime for each disk
- **power consumption**: kWh used per day, week, month, year, current UPS load, price per year (more details in a
future post)
- **sensors**: ambient temperature, humidity and noise (more details in a future post)
Every single graph uses a *$host* [variable](https://grafana.com/docs/grafana/latest/variables/templates-and-variables/)
defined at the dashboard level to filter metrics per host. At the top of the screen, a dropdown menu is automatically
created to select the host, populated by an InfluxDB query.
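As a sketch, the InfluxDB query behind the *$host* variable could be something like this (the *system* measurement is
an assumption; any measurement carrying a *host* tag would do):
```
SHOW TAG VALUES FROM "system" WITH KEY = "host"
```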
And because a picture is worth a thousand words, here are some screenshots of my own graphs:
[![System](/grafana-system.png)](/grafana-system.png)
[![Storage](/grafana-storage.png)](/grafana-storage.png)
[![Power consumption](/grafana-power-consumption.png)](/grafana-power-consumption.png)
[![Sensors](/grafana-sensors.png)](/grafana-sensors.png)
# Infrastructure
To sum this up, the infrastructure looks like this:
![TIG stack](/monitoring-tig.svg)
Whenever I want, I can sit back on a comfortable sofa, open a web browser and let the infrastructure speak for itself.
Easy, right?