InfluxDB and Home Assistant (Part 2)

In the long-awaited second part, we will learn how to manually create InfluxDB queries and use this knowledge to create Continuous Queries, which help us save disk space for long-term data. On top of that, we will install the telegraf agent to gather various system metrics from our network hosts and store them in InfluxDB. We will use Home Assistant 0.92.2 and the Hass.io addon InfluxDB 3.0.5 (InfluxDB 1.7.6).

Part one can be found here.

Manual InfluxDB queries

Let’s create a query to get data from an HA sensor. Open the InfluxDB addon UI, go to Explore, type our first query, and click the Submit Query button:

select * from home_assistant.autogen.temperature group by entity_id

We should get something like this:

[screenshot]

Using Visualization->Visualization Type, it is possible to choose between the Line Graph and Table representations of our data. When the Line Graph is selected, it makes sense to add a legend to the graph. This can be done in Visualization->Customize->Static Legend. To apply the change, it might be necessary to switch to another visualization type and then back to the Line Graph.

[screenshot]

Now it’s time to try an aggregation query. The idea is to decrease the number of points by calculating the mean value for a specified time interval (e.g. an hour) and using this calculated value, which represents the whole hour of data, for the new graph:

select mean(value) from home_assistant.autogen.temperature group by time(1h), entity_id

As one can see, the graph looks very similar to the previous one, but now we need significantly fewer points to draw it:

[screenshot]

Aggregation becomes even more important for frequently changing data, e.g. CPU load or network speed. There is a bunch of other functions you may use instead of mean in your queries; a couple of variations are shown below. Now we are ready to create our first Continuous Query, which will run automatically every specified period of time, calculate aggregated results and store them into a measurement with the required retention policy. If those terms are going over your head, I would recommend refreshing the InfluxDB terminology given in part one.
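As examples of other aggregation functions: max() keeps the hourly peaks instead of the averages, while count() simply tells how many raw points fall into each interval:

select max(value) from home_assistant.autogen.temperature group by time(1h), entity_id

select count(value) from home_assistant.autogen.temperature group by time(1d), entity_id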

Important note #1

If your query contains a date/time, the InfluxDB parser expects it to be in UTC by default, as the data points themselves are stored in UTC in InfluxDB. However, some clients (the InfluxDB Web UI in our case) will convert query results to local time using your browser settings. This may look confusing at first:

select value from home_assistant.year.temperature where time = '2019-05-21 21:00:00' group by entity_id 

[screenshot]

As we can see, InfluxDB added 3 hours to the query results as my time zone is UTC+3. You may specify your time zone by means of the tz() clause added at the end of your query:

select value from home_assistant.year.temperature where time = '2019-05-22 00:00:00' group by entity_id tz('Europe/Moscow')

[screenshot]

Now we have the same time zone both in the request and in the response. The list of available time zones can be found here.

Important note #2

It is essential to understand that data points with the same timestamp will be stored as one point if they have equal tags or no tags at all. More specifically, the point which came later will overwrite the previous one due to the time series database model.
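To see the overwrite in action, you can write two points with the same timestamp and no tags from the influx CLI (the demo measurement below is hypothetical; the CLI itself is covered in the “Nerd bonus” section at the end of this post):

> INSERT demo value=20 1558500000000000000
> INSERT demo value=25 1558500000000000000
> select * from demo
name: demo
time                 value
----                 -----
2019-05-22T04:40:00Z 25

Only one point survives, and it holds the value that was written last.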

Why it matters: this rule also applies to any query, including Continuous Queries. When we create a CQ on a measurement that holds more than one sensor, it is important to specify at least one tag in the group by clause, which helps InfluxDB distinguish values of one sensor from another. Otherwise, we may get a list of “random“ values without any information about the sensor they belong to:

>select mean(value) from home_assistant.autogen.temperature group by time(1d)
name: temperature
time                 mean
----                 ----
2019-05-19T00:00:00Z 17.75294117647059
2019-05-20T00:00:00Z 19.30921052631579
2019-05-21T00:00:00Z 21.379761904761903
2019-05-22T00:00:00Z 19.62985074626866

One can see that temperature values were grouped for each day and there is no way to tell which sensors they belong to. Most probably that’s not what you wanted to achieve. Now let’s add the entity_id tag to the group by clause of our query:

>select mean(value) from home_assistant.autogen.temperature group by time(1d), entity_id
name: temperature
tags: entity_id=bedroom_temperature
time                 mean
----                 ----
2019-05-19T00:00:00Z 24.661538461538463
2019-05-20T00:00:00Z 24.804347826086957
2019-05-21T00:00:00Z 25.37941176470588
2019-05-22T00:00:00Z 25.353846153846156

name: temperature
tags: entity_id=yandex_weather_apparent_temperature
time                 mean
----                 ----
2019-05-19T00:00:00Z 12
2019-05-20T00:00:00Z 15.620689655172415
2019-05-21T00:00:00Z 18.653846153846153
2019-05-22T00:00:00Z 16.041666666666668

name: temperature
tags: entity_id=yandex_weather_temperature
time                 mean
----                 ----
2019-05-19T00:00:00Z 14.818181818181818
2019-05-20T00:00:00Z 18.5
2019-05-21T00:00:00Z 18.666666666666668
2019-05-22T00:00:00Z 15.941176470588236

Now the results look better: they are grouped by sensor name (entity_id), and the values within each group are sorted by timestamp.

Create a Retention Policy

First of all, we have to create a Retention Policy which, among other things, defines how long InfluxDB will hold our data. The query below will create a year RP:

CREATE RETENTION POLICY "year" ON "home_assistant" DURATION 52w REPLICATION 1

You may ignore the message “Your query is syntactically correct but returned no results” as our query indeed returns no results; it just creates a new entity within InfluxDB.

Let’s create another RP which will hold data forever:

CREATE RETENTION POLICY "infinite" ON "home_assistant" DURATION INF REPLICATION 1

Make sure our RPs were created:

> show retention policies on home_assistant
name     duration  shardGroupDuration replicaN default
----     --------  ------------------ -------- -------
autogen  168h0m0s  168h0m0s           1        true
infinite 0s        168h0m0s           1        false
year     8736h0m0s 168h0m0s           1        false

More info on Retention Policies can be found in the official docs.
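For reference, an existing policy can be adjusted later with ALTER RETENTION POLICY; for example, to change how long the default autogen policy keeps raw data (a sketch, pick a duration that suits you):

ALTER RETENTION POLICY "autogen" ON "home_assistant" DURATION 4w REPLICATION 1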

Create a Continuous Query

A Continuous Query is an InfluxDB facility which runs a specific query periodically. What is more important, the result of this query may be stored in another measurement with its own retention policy. That is exactly how it works: we create a query that aggregates hourly data from a sensor and stores it in a new measurement with a very long retention period (retention policy). Needless to say, you may create as many CQs and measurements as you need. It is limited only by your imagination!

CREATE CONTINUOUS QUERY "cq_1h_temp10" ON "home_assistant" 
BEGIN
  SELECT mean("value") AS value
  INTO "year"."temperature"
  FROM "autogen"."temperature"
  GROUP BY time(1h), entity_id
  FILL(previous)
END

If no updates are received from a sensor during the last hour, the CQ will by default write an empty point. We use FILL(previous) to take the last known value from the sensor instead.
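fill() accepts other options as well: fill(none) drops empty intervals entirely, fill(0) writes zeroes, and fill(linear) interpolates between neighbouring points, for example:

SELECT mean("value") FROM "autogen"."temperature" GROUP BY time(1h), entity_id FILL(linear)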

Let’s make sure that CQ is created:

> show continuous queries
name: home_assistant
name          query
----          -----
cq_1h_temp10  CREATE CONTINUOUS QUERY cq_1h_temp10 ON home_assistant BEGIN SELECT mean(value) AS value INTO home_assistant.year.temperature FROM home_assistant.autogen.temperature GROUP BY time(1h), entity_id fill(previous) END

The job is done; now we should wait a few hours and check how the new measurement year.temperature fills up with our temperature data.

More on CQs can be found in the official documentation.

Display a graph upon a Continuous Query

This is easy: the data in a measurement produced by a CQ is not so different from the source data from our sensors. Move on to Grafana, select a dashboard or create a new one, and create a new panel (Add Panel->Add Query).

FROM field: [screenshot]

GROUP BY field: [screenshot]

ALIAS BY field: [screenshot]

The last one tells Grafana to use the entity_id tag from InfluxDB for legend labels, which is very convenient. Now we should see our graph, which can be embedded into a Home Assistant Lovelace card in the same way as described in the first part of this post.
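For reference, if you prefer Grafana’s raw query mode over the builder, the panel above corresponds roughly to the following (a sketch; adjust the retention policy and measurement names to yours):

SELECT mean("value") FROM "year"."temperature" WHERE $timeFilter GROUP BY time($__interval), "entity_id"

ALIAS BY: $tag_entity_id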

Trying Telegraf

telegraf is a tiny single executable with no dependencies and a very low memory and CPU footprint, like every program written in Go, including InfluxDB itself. Telegraf collects a huge list of metrics and delivers them to InfluxDB for storage and analysis.

Why

With telegraf we can gather important metrics from the machines on our home network, including those which are not supported by HA, e.g. memory and CPU usage of the Hass.io docker containers.

We can specify which metrics to collect in the configuration file. Unfortunately, there is no Hass.io addon for telegraf so far, so the only way to use it is with direct access to a host OS. You may choose between two options: a manual install from the Linux command line or a docker container. We will try the second one as it looks faster and more convenient.

telegraf in a docker container

In order to use telegraf we need either of the following:

  • hassio, installed on a generic Linux server according to this guide
  • hassos with the host access
  • any other PC or single board computer with Linux and docker installed
  • hassio addon portainer

Below is the guide which will work for options 1, 2 and 3 from the list above. Open the OS prompt and make sure that docker is up and running:

user@host:~$ docker ps

You should see a list of docker containers along with their status and other auxiliary information. Next, we need to create a folder for the telegraf configuration file:

user@host:~$ mkdir -p ~/docker/telegraf && cd ~/docker/telegraf

Pull the latest version of telegraf.conf:

user@host:~$ wget https://github.com/influxdata/telegraf/raw/master/etc/telegraf.conf

Create a telegraf user in InfluxDB in the same way we did in part one for the homeassistant user:

[screenshot]

Let’s also create a telegraf database (see part one) and add a retention policy called day which will keep our data for one day:

[screenshot]
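If you prefer typing over clicking, the same user, database and retention policy can be created with plain InfluxQL (the password below is a placeholder, choose your own):

CREATE USER "telegraf" WITH PASSWORD 'password'
CREATE DATABASE "telegraf"
GRANT ALL ON "telegraf" TO "telegraf"
CREATE RETENTION POLICY "day" ON "telegraf" DURATION 1d REPLICATION 1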

Short-term storage will keep frequently updated data like CPU load from eating all the disk space. When both the user and the DB are ready, we should edit the outputs.influxdb section of telegraf.conf as shown below:

[[outputs.influxdb]]
  urls = ["http://a0d7b954-influxdb:8086"] 
  database = "telegraf"
  retention_policy = "day"
  precision = "s"
  timeout = "5s"
  username = "telegraf"
  password = "password"

Now we should jump to the inputs section. Some metrics are already activated by default, like processor, mem and disk usage. You may leave them as they are or comment out the unnecessary ones. At the same time, we want to add docker container metrics, for example, to know if a container consumes too many resources. Uncomment the inputs.docker section and edit the following lines:

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  container_names = []
  timeout = "5s"

Run our docker container:

user@host:~$ docker run -d --name telegraf --restart=always --net="hassio" -v /etc/localtime:/etc/localtime:ro -v /home/user/docker/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro -v /var/run/docker.sock:/var/run/docker.sock telegraf:latest

The path /home/user/docker/telegraf/telegraf.conf should point to the location of your telegraf.conf. The given command assumes that the telegraf container runs on the same machine where Hass.io is located, therefore the --net="hassio" option lets telegraf access other containers by their names in the hassio internal docker network. If the container runs on another machine, one should use --net="host" and put the IP or host name of the InfluxDB machine into the urls param of telegraf.conf. Let’s check the telegraf logs:

user@host:~$ docker logs telegraf
2019-05-27T20:38:55Z I! Starting Telegraf 1.10.4
2019-05-27T20:38:55Z I! Using config file: /etc/telegraf/telegraf.conf
2019-05-27T20:38:55Z I! Loaded inputs: disk diskio kernel swap system cpu mem processes docker
2019-05-27T20:38:55Z I! Loaded aggregators:
2019-05-27T20:38:55Z I! Loaded processors:
2019-05-27T20:38:55Z I! Loaded outputs: influxdb
2019-05-27T20:38:55Z I! Tags enabled: host=c85cf92dfb59
2019-05-27T20:38:55Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"c85cf92dfb59", Flush Interval:10s

Everything is fine, we can see no errors, so now it is time to open InfluxDB and check whether data is flowing. In Explore we can see a new database telegraf and a set of new measurements holding the metrics of our host:

[screenshot]

Below is a sample query which can be used to build a graph of CPU usage per docker container:

SELECT usage_percent FROM "telegraf"."day"."docker_container_cpu" where time > :dashboardTime: group by container_name

Here we are using docker_container_cpu as the measurement and usage_percent as the field. This query can easily be modified by changing either the field or the measurement to other values from the Measurement & Tags and Fields columns:

[screenshot]
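For example, a per-container memory graph can be built the same way (assuming the standard measurement names of the telegraf docker input plugin):

SELECT usage_percent FROM "telegraf"."day"."docker_container_mem" WHERE time > :dashboardTime: GROUP BY container_name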

Now let’s try it in Grafana. First of all, we need to set up an additional data source for our telegraf DB in the same way we did in the first part for the home_assistant DB. Once the data source is ready, choose it from the Queries to combo box and build the following query:

[screenshot]

Go to the Visualization tab and choose the following options:

[screenshot]

If you plan to keep metrics collected by telegraf for a long time, you can create another CQ for them to reduce disk usage, in the same way we did before (a sketch is given below). Needless to say, the graph can be integrated into a Lovelace card as was demonstrated in part one.
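As a sketch (the week retention policy and the CQ name below are just examples, not something created earlier in this guide), downsampling per-container CPU data into 5-minute means could look like this:

CREATE RETENTION POLICY "week" ON "telegraf" DURATION 1w REPLICATION 1

CREATE CONTINUOUS QUERY "cq_5m_docker_cpu" ON "telegraf"
BEGIN
  SELECT mean("usage_percent") AS usage_percent
  INTO "week"."docker_container_cpu"
  FROM "day"."docker_container_cpu"
  GROUP BY time(5m), container_name
END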

Sources supported by telegraf

The full list of telegraf plugins is given here.

Nerd bonus: InfluxQL queries from the command line (CLI)

Most InfluxQL queries can be executed from the InfluxDB Web UI called Chronograf. Sometimes this is inconvenient, especially when you need to copy query results somewhere else.

Fortunately, you may run InfluxQL queries straight from the command line! The only trick is to run the influx command line utility from within the Hass.io docker container running InfluxDB. This can be achieved in two ways:

  1. Using OS host command line (when Hass.io was installed on a generic Linux server by a script)
  2. Using hassio portainer addon (when you don’t have access to host OS, e.g. hassos)

We will consider the first option. You may try the Portainer container console for the second one, but the web console is quite buggy: it does not support copy and paste and handles queries which do not fit into one line really badly, so it cannot be recommended. Still, you can try it to check whether this is what you need, as it’s way easier than obtaining host access on a hassos machine.

Run from the host command line

user@host:~$ docker exec -it addon_a0d7b954_influxdb influx -precision rfc3339
Connected to http://localhost:8086 version 1.7.2
InfluxDB shell version: 1.7.2
Enter an InfluxQL query
>

We should authenticate and choose a database to work with:

> auth
username: homeassistant
password:
> use home_assistant
Using database home_assistant

An important difference from Chronograf is that you don’t have to specify the DB in every query, as it was already selected by the use command. Let’s get the list of measurements in our DB:

> show measurements
name: measurements
name
----
binary
downsampled_temp
energy
humidity
temperature

The list of field names:

> show field keys
name: binary
fieldKey          fieldType
--------          ---------
No motion since   float
battery_level     float
device_class_str  string
friendly_name_str string
state             string
value             float

name: energy
fieldKey                fieldType
--------                ---------
friendly_name_str       string
icon_str                string
unit_of_measurement_str string
value                   float

name: humidity
fieldKey                fieldType
--------                ---------
battery_level           float
device_class_str        string
friendly_name_str       string
unit_of_measurement_str string
value                   float

name: temperature
fieldKey                fieldType
--------                ---------
battery_level           float
device_class_str        string
friendly_name_str       string
unit_of_measurement_str string
value                   float

Get the data from a measurement:

> select value, battery_level, entity_id from autogen.temperature WHERE time > now() - 3h
name: temperature
time                           value battery_level entity_id
----                           ----- ------------- ---------

2019-05-28T13:37:05.37566592Z  17                  yandex_weather_apparent_temperature
2019-05-28T14:37:06.288192Z    18                  yandex_weather_temperature
2019-05-28T15:37:05.333918976Z 17                  yandex_weather_temperature
2019-05-28T15:37:05.36456192Z  14                  yandex_weather_apparent_temperature

In the example above we get the values of the fields named value and battery_level, as well as the value of the entity_id tag. We also narrow our results to the last 3 hours of data.

Please note that the command line client does not automatically convert time to the local timezone as Chronograf does, so the timestamps are in UTC. As my time zone is UTC+3, the local time of the last record should be read as 18:37:05.

We can use another retention policy:

> select value, entity_id from year.temperature
name: temperature
time                 value              entity_id
----                 -----              ---------
2019-05-21T21:00:00Z 25.3               bedroom_temperature
2019-05-21T21:00:00Z 12                 yandex_weather_apparent_temperature
2019-05-21T21:00:00Z 13                 yandex_weather_temperature
2019-05-21T22:00:00Z 25.4               bedroom_temperature
2019-05-21T22:00:00Z 12.5               yandex_weather_apparent_temperature
2019-05-21T22:00:00Z 13.5               yandex_weather_temperature
2019-05-22T00:00:00Z 25.5               bedroom_temperature
2019-05-22T00:00:00Z 11                 yandex_weather_apparent_temperature
2019-05-22T00:00:00Z 12                 yandex_weather_temperature

Full specs of DB schema exploration commands, as well as InfluxQL syntax description, are given here.

Conclusion

In the second part, we learned how to create an arbitrary number of our own Continuous Queries as well as measurements to hold the aggregated data forever. We also managed to set up the telegraf agent and got insights into the performance characteristics of our hardware. If I ever come up with part 3, we will discuss how to set up centralized logging for most of the devices in your home network, use it to detect anomalies, and report errors via different communication channels. Stay tuned.