--- title: "Increased observability with the TIG stack" date: 2020-08-10T18:00:00+02:00 --- [Observability](https://en.wikipedia.org/wiki/Observability) has become a buzzword lately. I must admit, this is one of the many reasons why I use it in the title. In reality, this article will talk about fetching measurements and creating beautiful graphs to feel like [detective Derrick](https://en.wikipedia.org/wiki/Derrick_(TV_series)), an *old* and wise detective solving cases by encouraging criminals to confess by themselves. With the recent [Go](https://golang.org/) programming language [gain of popularity](https://opensource.com/article/17/11/why-go-grows), we have seen a lot of new software coming into the database world: [CockroachDB](https://www.cockroachlabs.com/), [TiDB](https://pingcap.com/products/tidb), [Vitess](https://vitess.io/), etc. Among them, the **TIG stack** ([**T**elegraf](https://github.com/influxdata/telegraf), [**I**nfluxDB](https://github.com/influxdata/influxdb) and [**G**rafana](https://github.com/grafana/grafana)) has become a reference to gather and display metrics. The goal is to see the evolution of different resources usage (memory, processor, storage space), power consumption, environment variables (temperature, humidity), on every single host of the infrastructure. # Telegraf The first component of the stack is Telegraf, an agent that can fetch metrics from multiple sources ([input](https://github.com/influxdata/telegraf/tree/master/plugins/inputs)) and write them to multiple destinations ([output](https://github.com/influxdata/telegraf/tree/master/plugins/outputs)). There are tens of built-in plugins available! You can even gather a custom source of data with [exec](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec) with an expected [format](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md). I configured Telegraf to fetch and send metrics every minute (*interval* and *flush_interval* in the *agent* section is *"60s"*) which is enough for my personal usage. Most of the plugins I use are built-in: cpu, disk, diskio, kernel, mem, processes, system, zfs, net, smart, ping, etc. The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics like size, allocation, free space, etc, on FreeBSD but not [on Linux](https://github.com/influxdata/telegraf/issues/2616). The issue is known but has not been merged upstream yet. 
The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics like size, allocation, free space, etc. on FreeBSD, but not [on Linux](https://github.com/influxdata/telegraf/issues/2616). The issue is known, but a fix has not been merged upstream yet. So I have developed a simple Python snippet to fill the gap on my only storage server running Linux:

```
#!/usr/bin/python
import subprocess


def parse_int(s):
    # integers must carry a trailing 'i' in the InfluxDB line protocol
    return str(int(s)) + 'i'


def parse_float_with_x(s):
    # dedup ratios are reported as '1.00x'
    return float(s.replace('x', ''))


def parse_pct_int(s):
    # capacity and fragmentation are reported as percentages
    return parse_int(s.replace('%', ''))


if __name__ == '__main__':
    measurement = 'zfs_pool'
    pools = subprocess.check_output(
        ['/usr/sbin/zpool', 'list', '-Hp'],
        universal_newlines=True).splitlines()
    for pool in pools:
        col = pool.split("\t")
        tags = {'pool': col[0], 'health': col[9]}
        fields = {}
        if tags['health'] == 'UNAVAIL':
            fields['size'] = '0i'  # keep the integer type consistent
        else:
            fields['size'] = parse_int(col[1])
            fields['allocated'] = parse_int(col[2])
            fields['free'] = parse_int(col[3])
            fields['fragmentation'] = '0i' if col[6] == '-' else parse_pct_int(col[6])
            fields['capacity'] = parse_int(col[7])
            fields['dedupratio'] = parse_float_with_x(col[8])
        tags = ','.join(['{}={}'.format(k, v) for k, v in tags.items()])
        fields = ','.join(['{}={}'.format(k, v) for k, v in fields.items()])
        # one line of InfluxDB line protocol per pool
        print('{},{} {}'.format(measurement, tags, fields))
```

Called by the following input:

```
[[inputs.exec]]
  commands = ['/opt/telegraf-plugins/zfs.py']
  data_format = "influx"
```

This exec plugin does exactly the same job as the zfs input running on FreeBSD. All those metrics are sent to a single output, InfluxDB, hosted on the monitoring server.

# InfluxDB

Measurements can be stored in a time series database, which is designed to organize data around time. InfluxDB is a perfect fit for what we need. Of course, there are other time series databases. I've chosen this one because it is well documented, it fits my needs and I wanted to learn new things.

[Installation](https://docs.influxdata.com/influxdb/v1.8/introduction/install/) is straightforward. I've enabled [HTTPS](https://docs.influxdata.com/influxdb/v1.8/administration/https_setup/) and [authentication](https://docs.influxdata.com/influxdb/v1.8/administration/authentication_and_authorization/#set-up-authentication). I use a simple setup with only one node in the *cluster*. No sharding. Only one database.

Even though Telegraf does not send that many metrics, I've created a default [retention policy](https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#retention-policy-management) to store two years of data, which is more than enough. A new default retention policy becomes the default route for all your new points. Don't be afraid if all the existing measurements seem to have vanished: nothing has been deleted. They are simply stored under the previous policy and need to be [moved](https://community.influxdata.com/t/applying-retention-policies-to-existing-measurments/802).
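For illustration, here is a sketch of both operations in InfluxQL. The database name (*telegraf*), the policy names and the *cpu* measurement are assumptions; adapt them to your own setup:

```
-- create a new default retention policy keeping two years of data
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 104w REPLICATION 1 DEFAULT

-- copy an existing measurement from the previous policy into the new one
SELECT * INTO "telegraf"."two_years"."cpu" FROM "telegraf"."autogen"."cpu" GROUP BY *
```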
You should define a [backup](https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/) policy too.

# Grafana

Now that we are able to gather and store metrics, we need to visualize them. This is the role of [Grafana](https://grafana.com/). During my career, I have played with [Graylog](https://docs.graylog.org/en/3.2/pages/dashboards.html), [Kibana](https://www.elastic.co/kibana) and Grafana. The last one is my favorite. It is generally blazing fast, even on a Raspberry Pi! The look and feel is amazing. The theme is dark by default, but I like the light one.

I have created four dashboards:

- **system**: load, processor, memory, system disk usage, disk I/O, network quality and bandwidth
- **storage**: ZFS pool allocation, capacity, fragmentation and uptime for each disk
- **power consumption**: kWh used per day, week, month and year, current UPS load, price per year (more details in a future post)
- **sensors**: ambient temperature, humidity and noise (more details in a future post)

Every single graph has a *$host* [variable](https://grafana.com/docs/grafana/latest/variables/templates-and-variables/) at the dashboard level to be able to filter metrics per host. At the top of the screen, a dropdown menu is automatically created to select the host, based on an InfluxDB query (see the sketch at the end of this post).

And because a picture is worth a thousand words, here are some screenshots of my own graphs:

[![System](/grafana-system.png)](/grafana-system.png)
[![Storage](/grafana-storage.png)](/grafana-storage.png)
[![Power consumption](/grafana-power-consumption.png)](/grafana-power-consumption.png)
[![Sensors](/grafana-sensors.png)](/grafana-sensors.png)

# Infrastructure

To sum this up, the infrastructure looks like this:

![TIG stack](/monitoring-tig.svg)

Whenever I want, I can sit back on a comfortable sofa, open a web browser and let the infrastructure speak for itself. Easy, right?
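As promised, here is a sketch of the kind of InfluxQL query that can back the *$host* dropdown. The *system* measurement is an assumption; any measurement carrying the *host* tag works:

```
SHOW TAG VALUES FROM "system" WITH KEY = "host"
```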