---
title: "Increased observability with the TIG stack"
date: 2020-08-10T18:00:00+02:00
---

[Observability](https://en.wikipedia.org/wiki/Observability) has become a buzzword lately. I must admit, this is one of
the many reasons why I use it in the title. In reality, this article is about fetching measurements and creating
beautiful graphs to feel like [detective Derrick](https://en.wikipedia.org/wiki/Derrick_(TV_series)), an *old* and wise
detective who solves cases by encouraging criminals to confess by themselves.

With the recent [gain in popularity](https://opensource.com/article/17/11/why-go-grows) of the
[Go](https://golang.org/) programming language, we have seen a lot of new software arrive in the
database world: [CockroachDB](https://www.cockroachlabs.com/), [TiDB](https://pingcap.com/products/tidb),
[Vitess](https://vitess.io/), etc. Among them, the **TIG stack**
([**T**elegraf](https://github.com/influxdata/telegraf), [**I**nfluxDB](https://github.com/influxdata/influxdb) and
[**G**rafana](https://github.com/grafana/grafana)) has become a reference for gathering and displaying metrics.

The goal is to watch the evolution of resource usage (memory, processor, storage space), power consumption, and
environmental measurements (temperature, humidity) on every single host of the infrastructure.

# Telegraf

The first component of the stack is Telegraf, an agent that can fetch metrics from multiple sources
([inputs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs)) and write them to multiple destinations
([outputs](https://github.com/influxdata/telegraf/tree/master/plugins/outputs)). There are dozens of built-in plugins
available! You can even gather a custom source of data with
[exec](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec), as long as it emits an expected
[format](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md).

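The *influx* format expected by exec is the InfluxDB line protocol: a measurement name, comma-separated tags, then space-separated fields. A minimal sketch of how a custom script could build such a line (the measurement, tag and field names below are made up for illustration):

```python
def to_line_protocol(measurement, tags, fields):
    """Build one InfluxDB line protocol line:
    measurement,tag1=v1,tag2=v2 field1=v1,field2=v2"""
    # Sort keys so the output is deterministic.
    tag_str = ','.join('{}={}'.format(k, v) for k, v in sorted(tags.items()))
    field_str = ','.join('{}={}'.format(k, v) for k, v in sorted(fields.items()))
    return '{},{} {}'.format(measurement, tag_str, field_str)

# The trailing 'i' marks an integer field; bare numbers are floats.
print(to_line_protocol('room', {'location': 'office'},
                       {'temperature': 21.5, 'humidity': '40i'}))
# → room,location=office humidity=40i,temperature=21.5
```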
I configured Telegraf to fetch and send metrics every minute (*interval* and *flush_interval* in the *agent* section
are set to *"60s"*), which is enough for my personal usage. Most of the plugins I use are built-in: cpu, disk, diskio,
kernel, mem, processes, system, zfs, net, smart, ping, etc.

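The relevant part of *telegraf.conf* is just (a minimal sketch; everything else keeps the defaults):

```toml
[agent]
  interval = "60s"
  flush_interval = "60s"
```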
The [zfs](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zfs) plugin fetches ZFS pool statistics like
size, allocation, free space, etc., on FreeBSD but not [on Linux](https://github.com/influxdata/telegraf/issues/2616).
The issue is known but a fix has not been merged upstream yet. So I developed a simple Python snippet to fill the gap on
my only storage server running Linux:

```python
#!/usr/bin/python

import subprocess


def parse_int(s):
    # The InfluxDB line protocol marks integer fields with a trailing 'i'.
    return str(int(s)) + 'i'


def parse_float_with_x(s):
    # Dedup ratios are reported as e.g. '1.00x'.
    return float(s.replace('x', ''))


def parse_pct_int(s):
    return parse_int(s.replace('%', ''))


if __name__ == '__main__':
    measurement = 'zfs_pool'

    # -H removes headers, -p prints exact (parsable) values.
    # universal_newlines=True yields str instead of bytes on Python 3.
    pools = subprocess.check_output(['/usr/sbin/zpool', 'list', '-Hp'],
                                    universal_newlines=True).splitlines()
    for pool in pools:
        col = pool.split("\t")
        tags = {'pool': col[0], 'health': col[9]}
        fields = {}

        if tags['health'] == 'UNAVAIL':
            # '0i' keeps the field an integer, matching the 'i' suffix below.
            fields['size'] = '0i'
        else:
            fields['size'] = parse_int(col[1])
            fields['allocated'] = parse_int(col[2])
            fields['free'] = parse_int(col[3])
            fields['fragmentation'] = '0i' if col[6] == '-' else parse_pct_int(col[6])
            fields['capacity'] = parse_int(col[7])
            fields['dedupratio'] = parse_float_with_x(col[8])

        tags = ','.join(['{}={}'.format(k, v) for k, v in tags.items()])
        fields = ','.join(['{}={}'.format(k, v) for k, v in fields.items()])
        print('{},{} {}'.format(measurement, tags, fields))
```

It is called by the following input:

```toml
[[inputs.exec]]
  commands = ['/opt/telegraf-plugins/zfs.py']
  data_format = "influx"
```

This exec plugin does exactly the same job as the zfs input running on FreeBSD.

All those metrics are sent to a single output, InfluxDB, hosted on the monitoring server.

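The matching output section could look like this (a sketch; the URL, database name and credentials are placeholders):

```toml
[[outputs.influxdb]]
  urls = ["https://monitoring.example.com:8086"]
  database = "telegraf"
  username = "telegraf"
  password = "changeme"
```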
# InfluxDB

Measurements can be stored in a time series database, which is designed to organize data around time. InfluxDB is a
perfect fit for what we need. Of course, there are other time series databases; I've chosen this one because it is
well documented, it fits my needs and I wanted to learn new things.

[Installation](https://docs.influxdata.com/influxdb/v1.8/introduction/install/) is straightforward. I've enabled
[HTTPS](https://docs.influxdata.com/influxdb/v1.8/administration/https_setup/) and
[authentication](https://docs.influxdata.com/influxdb/v1.8/administration/authentication_and_authorization/#set-up-authentication).
I use a simple setup with only one node in the *cluster*. No sharding. Only one database. Even though there are not
that many metrics sent by Telegraf, I've created a default [retention
policy](https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#retention-policy-management) to store
two years of data, which is more than enough. A new default retention policy becomes the default route for all
your new points. Don't be afraid if all the existing measurements seem to vanish: nothing has been deleted; they are
just under the previous policy and need to be
[moved](https://community.influxdata.com/t/applying-retention-policies-to-existing-measurments/802). You should define a
[backup](https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/) policy too.

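As a sketch, creating such a default retention policy from the `influx` shell could look like this (the policy name is made up, and I am assuming the stock *telegraf* database):

```sql
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 104w REPLICATION 1 DEFAULT
```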
# Grafana

Now that we are able to gather and store metrics, we need to visualize them. This is the role of
[Grafana](https://grafana.com/). During my career, I have played with
[Graylog](https://docs.graylog.org/en/3.2/pages/dashboards.html), [Kibana](https://www.elastic.co/kibana) and Grafana.
The last one is my favorite. It is generally blazing fast! Even on a Raspberry Pi. The look and feel is amazing. The
theme is dark by default but I like the light one.

I have created four dashboards:

- **system**: load, processor, memory, system disk usage, disk I/O, network quality and bandwidth
- **storage**: ZFS pool allocation, capacity, fragmentation and uptime for each disk
- **power consumption**: kWh used per day, week, month and year, current UPS load, price per year (more details in a
  future post)
- **sensors**: ambient temperature, humidity and noise (more details in a future post)

Every single graph has a *$host* [variable](https://grafana.com/docs/grafana/latest/variables/templates-and-variables/)
defined at the dashboard level to filter metrics per host. At the top of the screen, a dropdown menu is automatically
created to select the host, based on an InfluxDB query.

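The variable query itself can be a simple InfluxQL statement; a sketch, assuming the *system* measurement produced by the built-in plugin:

```sql
SHOW TAG VALUES FROM "system" WITH KEY = "host"
```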
And because a picture is worth a thousand words, here are some screenshots of my own graphs:

[![System](/grafana-system.png)](/grafana-system.png)
[![Storage](/grafana-storage.png)](/grafana-storage.png)
[![Power consumption](/grafana-power-consumption.png)](/grafana-power-consumption.png)
[![Sensors](/grafana-sensors.png)](/grafana-sensors.png)

# Infrastructure

To sum this up, the infrastructure looks like this:

![TIG stack](/monitoring-tig.svg)

Whenever I want, I can sit back on a comfortable sofa, open a web browser and let the infrastructure speak for itself.
Easy, right?