Introduction

With the growth of microservices architectures, the ability to monitor and respond to the environment is paramount. Once you have more than a handful of Kubernetes (K8s) pods running, a more scalable approach to monitoring is needed. This section covers details on:

•    Prometheus components.

          •     Prometheus server for scraping and storing performance metrics

          •     Client libraries for instrumenting applications

          •     Alertmanager for grouping, deduplicating, and routing fired alerts to channels such as email and Slack

          •     PromQL, the query language used to run SELECT- and GROUP BY-style queries over time-series data

•    Grafana components.

          •     Datasources for reading performance metrics

          •     Dashboards for displaying and analyzing performance metrics

•    TimescaleDB components.

          •     Extensions to PostgreSQL for long-term data persistence

          •     APIs for exposing time-oriented queries

Prometheus is not designed to be a long-term metrics store. In this setup, every metric scraped by Prometheus is first written to the local node and then forwarded to TimescaleDB via the remote_write setting in prometheus.yml (shown later in this section). Metrics are therefore backed up almost immediately, so the failure of a Prometheus node is far less painful.

Container Network

Create a bridge network.

docker network create -d bridge prometheus_timescale_network


Inspect the network.

docker network inspect prometheus_timescale_network

Returns.

{
        "Name": "prometheus_timescale_network",
        "Id": "5a151e35057af8c8ffd779f44806e9033e62e663857a731948a73d3ab736fcef",
        "Created": "2019-07-13T15:34:18.87603738Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }


PostgreSQL and TimescaleDB Extension

Deploy the timescale/pg_prometheus image, which bundles PostgreSQL with the pg_prometheus and TimescaleDB extensions.

docker run -d --name timescaledb -p 5432:5432 \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=<Password_String> \
timescale/pg_prometheus \
-c synchronous_commit=off


Check the logs.

docker logs b1e32966add09ba166565d5b51aa8fbee43f366dc9d0694a04d893975f8e3cce

Returns:

WELCOME TO
 _____ _                               _     ____________
|_   _(_)                             | |    |  _  \ ___ \
  | |  _ _ __ ___   ___  ___  ___ __ _| | ___| | | | |_/ /
  | | | |  _ ` _ \ / _ \/ __|/ __/ _` | |/ _ \ | | | ___ \
  | | | | | | | | |  __/\__ \ (_| (_| | |  __/ |/ /| |_/ /
  |_| |_|_| |_| |_|\___||___/\___\__,_|_|\___|___/ \____/
               Running version 1.2.0
For more information on TimescaleDB, please visit the following links:

 1. Getting started: https://docs.timescale.com/getting-started
 2. API reference documentation: https://docs.timescale.com/api
 3. How TimescaleDB is designed: https://docs.timescale.com/introduction/architecture

Note: TimescaleDB collects anonymous reports to better understand and assist our users.
For more information and how to disable, please see our docs https://docs.timescaledb.com/using-timescaledb/telemetry.

CREATE EXTENSION
2019-07-13 15:40:04.374 UTC [46] ERROR:  TimescaleDB background worker connected to template database, exiting

/usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/001_reenable_auth.sh
2019-07-13 15:40:04.374 UTC [30] LOG:  worker process: TimescaleDB Background Worker Scheduler (PID 46) exited with exit code 1

/usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/002_timescaledb_tune.sh
Using postgresql.conf at this path:
/var/lib/postgresql/data/postgresql.conf

Writing backup to:
/tmp/timescaledb_tune.backup201907131540

Recommendations based on 6.79 GB of available memory and 2 CPUs for PostgreSQL 10
Saving changes to: /var/lib/postgresql/data/postgresql.conf
shared_buffers = 1739MB
effective_cache_size = 5217MB
maintenance_work_mem = 890385kB
work_mem = 44519kB
timescaledb.max_background_workers = 4
max_worker_processes = 9
max_parallel_workers_per_gather = 1
max_parallel_workers = 2
wal_buffers = 16MB
min_wal_size = 4GB
max_wal_size = 8GB
default_statistics_target = 500
random_page_cost = 1.1
checkpoint_completion_target = 0.9
max_connections = 20
max_locks_per_transaction = 64
effective_io_concurrency = 200
timescaledb.last_tuned = '2019-07-13T15:40:04Z'
timescaledb.last_tuned_version = '0.3.0'

.
.
.
PostgreSQL init process complete; ready for start up.

2019-07-13 15:40:05.027 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2019-07-13 15:40:05.027 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2019-07-13 15:40:05.054 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-07-13 15:40:05.129 UTC [55] LOG:  database system was shut down at 2019-07-13 15:40:04 UTC
2019-07-13 15:40:05.161 UTC [1] LOG:  database system is ready to accept connections
2019-07-13 15:40:05.163 UTC [61] LOG:  TimescaleDB background worker launcher connected to shared catalogs
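
To confirm that the extensions were created, list them with psql inside the container (a quick check using the container name given above):

docker exec -it timescaledb psql -U postgres -d postgres -c "\dx"

The timescaledb extension should appear in the list; the pg_prometheus extension is initialized by the adapter deployed below if it is not already present.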


Open port 9201, which the Prometheus PostgreSQL adapter deployed next listens on.

az vm open-port --priority 1139 --resource-group Py4SAS --name Spark-node-vm  --port 9201


Deploy the Prometheus PostgreSQL adapter.

docker run --network prometheus_timescale_network --name prometheus_postgresql_adapter -d -p 9201:9201 \
timescale/prometheus-postgresql-adapter:master \
-pg.password=XXXXXXXXXX \
-pg.host=Py4SAS-vm \
-pg.prometheus-log-samples


Check the logs.

docker logs 45ef30828b42d7c2fb7a3742b57a6af2d154ec252587fd2fec5f1935b0f0e2a1

Returns.

level=info ts=2019-07-13T15:48:49.615288015Z caller=client.go:80 msg="host=Py4SAS-vm port=5432 user=postgres dbname=postgres password='XXXXXXXXXXX' sslmode=disable connect_timeout=10"
level=info ts=2019-07-13T15:48:49.831775128Z caller=client.go:146 storage=PostgreSQL msg="Initialized pg_prometheus extension"
level=info ts=2019-07-13T15:48:49.831815928Z caller=main.go:96 msg="Starting up..."
level=info ts=2019-07-13T15:48:49.831834028Z caller=main.go:240 msg=Listening addr=:9201
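
The "Initialized pg_prometheus extension" message indicates that the metrics schema now exists in Postgres. As a quick sketch, the relations can be listed from inside the database container (the metrics view and its backing tables are pg_prometheus defaults):

docker exec -it timescaledb psql -U postgres -d postgres -c "\d metrics*"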


Node Exporter

Deploy Node Exporter to collect metrics on host (VM) performance.

docker run --network prometheus_timescale_network --name node_exporter -p 9100:9100 quay.io/prometheus/node-exporter

Returns.

time="2019-07-13T15:56:44Z" level=info msg="Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)" source="node_exporter.go:156"
time="2019-07-13T15:56:44Z" level=info msg="Build context (go=go1.12.5, user=root@b50852a1acba, date=20190604-16:41:18)" source="node_exporter.go:157"
time="2019-07-13T15:56:44Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
time="2019-07-13T15:56:44Z" level=info msg=" - arp" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - bcache" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - bonding" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - conntrack" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - cpu" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - cpufreq" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - diskstats" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - edac" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - entropy" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - filefd" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - filesystem" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - hwmon" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - infiniband" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - ipvs" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - loadavg" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - mdadm" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - meminfo" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - netclass" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - netdev" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - netstat" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - nfs" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - nfsd" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - pressure" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - sockstat" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - stat" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - textfile" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - time" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - timex" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - uname" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - vmstat" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - xfs" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg=" - zfs" source="node_exporter.go:104"
time="2019-07-13T15:56:44Z" level=info msg="Listening on :9100" source="node_exporter.go:170"


Verify that system metrics are being exported by visiting http://localhost:9100/metrics. Prometheus scrapes this /metrics endpoint to collect the metrics.
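
The same check can be run from the command line. As a sketch, pull out the two metric names used later in the TimescaleDB query (names follow node_exporter 0.18 conventions):

curl -s http://localhost:9100/metrics | grep -E '^(node_load5|node_memory_Active_bytes) '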


Prometheus

Deploy Prometheus by creating a prometheus.yml config file that points to the Node Exporter instance created above.

Create the mm directory and change directory.

mkdir ~/mm && cd ~/mm


Edit the /etc/prometheus/prometheus.yml file, adding:

global:
 scrape_interval:     10s
 evaluation_interval: 10s
scrape_configs:
 - job_name: prometheus
   static_configs:
     - targets: ['node_exporter:9100']
remote_write:
 - url: "http://Py4SAS-vm:9201/write"
remote_read:
 - url: "http://Py4SAS-vm:9201/read"
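
Before starting the server, the file can be validated with promtool, which ships in the prom/prometheus image (a sketch; the bind mount mirrors the one used in the run command below):

docker run --rm \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
--entrypoint promtool \
prom/prometheus check config /etc/prometheus/prometheus.yml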


Deploy Prometheus.

docker run --network prometheus_timescale_network -p 9090:9090 \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus

Returns.

level=info ts=2019-07-13T16:16:46.524Z caller=main.go:293 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2019-07-13T16:16:46.524Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.0, branch=HEAD, revision=4ef66003d9855ed2b7a41e987b33828ec36db34d)"
level=info ts=2019-07-13T16:16:46.524Z caller=main.go:330 build_context="(go=go1.12.7, user=root@0dc27cf95f36, date=20190709-09:54:35)"
level=info ts=2019-07-13T16:16:46.524Z caller=main.go:331 host_details="(Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 6c05d6d4038c (none))"
level=info ts=2019-07-13T16:16:46.524Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-13T16:16:46.524Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-13T16:16:46.526Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2019-07-13T16:16:46.527Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-07-13T16:16:46.534Z caller=main.go:667 fs_type=XFS_SUPER_MAGIC
level=info ts=2019-07-13T16:16:46.534Z caller=main.go:668 msg="TSDB started"
level=info ts=2019-07-13T16:16:46.534Z caller=main.go:738 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-07-13T16:16:46.535Z caller=wal_watcher.go:149 component=remote queue=0:http://Py4SAS-vm:9201/write msg="starting WAL watcher" queue=0:http://Py4SAS-vm:9201/write
level=info ts=2019-07-13T16:16:46.537Z caller=main.go:766 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-07-13T16:16:46.537Z caller=main.go:621 msg="Server is ready to receive web requests."


Test the ability to query the collected metrics by pointing a browser to http://localhost:9090/graph. In the example, available node memory is graphed.
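
The same data is available from the Prometheus HTTP API. For example, an instant query for one of the memory metrics used later in the TimescaleDB example (a sketch against the query endpoint):

curl -s 'http://localhost:9090/api/v1/query?query=node_memory_Active_bytes'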


Query TimescaleDB directly.

docker exec -it timescaledb bash
psql postgres postgres

Enter the query:

SELECT time_bucket('5 minutes', time) AS five_min_bucket, name, avg(value)
FROM metrics
WHERE (name='node_load5' OR name='node_memory_Active_bytes') AND
      time > NOW() - interval '1 day'
GROUP BY five_min_bucket,name
ORDER BY five_min_bucket;

Returns.

    five_min_bucket     |           name           |        avg
------------------------+--------------------------+-------------------
 2019-07-13 16:15:00+00 | node_load5               |  1.84263157894737
 2019-07-13 16:15:00+00 | node_memory_Active_bytes |  1055776552.42105
 2019-07-13 16:20:00+00 | node_load5               |  1.62533333333333
 2019-07-13 16:20:00+00 | node_memory_Active_bytes |      1070225817.6
 2019-07-13 16:25:00+00 | node_load5               |  1.06166666666667
 2019-07-13 16:25:00+00 | node_memory_Active_bytes |  1084650018.13333
 2019-07-13 16:30:00+00 | node_load5               | 0.593333333333333
 2019-07-13 16:30:00+00 | node_memory_Active_bytes |  1093145668.26667
 2019-07-13 16:35:00+00 | node_load5               | 0.322333333333333
 2019-07-13 16:35:00+00 | node_memory_Active_bytes |  1108816145.06667
 2019-07-13 16:40:00+00 | node_load5               | 0.180666666666667
 2019-07-13 16:40:00+00 | node_memory_Active_bytes |      1125213388.8
 2019-07-13 16:45:00+00 | node_load5               |             0.126
 2019-07-13 16:45:00+00 | node_memory_Active_bytes |  1134488098.13333
 2019-07-13 16:50:00+00 | node_load5               | 0.133333333333333
 2019-07-13 16:50:00+00 | node_memory_Active_bytes |        1162440704
(16 rows)

The time_bucket() function in the SELECT statement is provided by the TimescaleDB extension; it is not available in stock PostgreSQL.
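
The metrics view also carries the original Prometheus labels. Assuming the default pg_prometheus schema, where the view exposes labels as a JSONB column, a query can be restricted to a single scrape target (the instance value below matches the node_exporter target configured earlier):

docker exec -i timescaledb psql -U postgres -d postgres <<'SQL'
SELECT time, name, value, labels->>'instance' AS instance
FROM metrics
WHERE name = 'node_load5'
  AND labels->>'instance' = 'node_exporter:9100'
ORDER BY time DESC
LIMIT 5;
SQL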


Grafana

Pull the latest Grafana image.

docker pull grafana/grafana

Returns.

Using default tag: latest
latest: Pulling from grafana/grafana
fc7181108d40: Pull complete
b3698e2f9d08: Pull complete
13ca9094bc72: Pull complete
8412253947ed: Pull complete
8efa19c8dbaf: Pull complete
afa031bd3fdb: Pull complete
Digest: sha256:a48dbcdd80f74465d98dbea58911fc450bc76cf2c89a5046b0551f7d260fe88a
Status: Downloaded newer image for grafana/grafana:latest


Run the Docker image, installing the Kubernetes plug-in via GF_INSTALL_PLUGINS. The PostgreSQL and Prometheus data source plug-ins ship with Grafana and need no installation.

docker run --network prometheus_timescale_network \
-d -p 3000:3000 \
--name=grafana \
-e "GF_INSTALL_PLUGINS=grafana-kubernetes-app" \
-e "GF_PATHS_LOGS=/var/log/grafana" \
-e "GF_PATHS_DATA=/var/lib/grafana" \
grafana/grafana:latest

Check the logs.

docker logs 41cb724ac1ba16a7ee3ecea7bece92ec0d2cc0febffae32e7526020cf586f56a

Returns.

Usage of grafana-server:
  -config string
        path to config file
  -convey-json
        When true, emits results in JSON blocks. Default: 'false'
  -convey-silent
        When true, all output from GoConvey is suppressed.
  -convey-story
        When true, emits story output, otherwise emits dot output. When not provided, this flag mirrors the value of the '-test.v' flag
  -homepath string
        path to grafana install/home path, defaults to working directory
  -packaging string
        describes the way Grafana was installed (default "unknown")
  -pidfile string
        path to pid file
  -profile
        Turn on pprof profiling
  -profile-port int
        Define custom port for profiling (default 6060)
  -test.bench regexp
        run only benchmarks matching regexp
  -test.benchmem
        print memory allocations for benchmarks
  -test.benchtime d
        run each benchmark for duration d (default 1s)
  -test.blockprofile file
        write a goroutine blocking profile to file
  -test.blockprofilerate rate
        set blocking profile rate (see runtime.SetBlockProfileRate) (default 1)
  -test.count n
        run tests and benchmarks n times (default 1)
  -test.coverprofile file
        write a coverage profile to file
  -test.cpu list
        comma-separated list of cpu counts to run each test with
  -test.cpuprofile file
        write a cpu profile to file
  -test.failfast
        do not start new tests after the first test failure
  -test.list regexp
        list tests, examples, and benchmarks matching regexp then exit
  -test.memprofile file
        write an allocation profile to file
  -test.memprofilerate rate
        set memory allocation profiling rate (see runtime.MemProfileRate)
  -test.mutexprofile string
        write a mutex contention profile to the named file after execution
  -test.mutexprofilefraction int
        if >= 0, calls runtime.SetMutexProfileFraction() (default 1)
  -test.outputdir dir
        write profiles to dir
  -test.parallel n
        run at most n tests in parallel (default 2)
  -test.run regexp
        run only tests and examples matching regexp
  -test.short
        run smaller test suite to save time
  -test.testlogfile file
        write test action log to file (for use only by cmd/go)
  -test.timeout d
        panic test binary after duration d (default 0, timeout disabled)
  -test.trace file
        write an execution trace to file
  -test.v
        verbose: print additional output
  -v    prints current version and exits
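
Once the container is running, Grafana's health endpoint provides a quick readiness check (Grafana listens on the mapped port 3000):

curl -s http://localhost:3000/api/health

A healthy instance reports its database status and version as JSON.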


Add a restricted user, grafanareader, to Postgres.

docker exec -it timescaledb /bin/bash

Log into the Postgres database.

psql postgres postgres


Execute.

CREATE USER grafanareader WITH PASSWORD '<Password_String>';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafanareader;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO grafanareader;

Returns.

CREATE ROLE
GRANT
GRANT
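
To verify the restricted account, connect over TCP as grafanareader and read from the metrics view (a sketch; substitute the password chosen above, which applies because host connections use password authentication):

docker exec -e PGPASSWORD='<Password_String>' timescaledb \
psql -h localhost -U grafanareader -d postgres -c "SELECT count(*) FROM metrics;"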


Open port 3000.

az vm open-port --priority 1213 --resource-group Py4SAS --name Spark-node-vm  --port 3000


Point a browser to http://localhost:3000.




Point Grafana to the Prometheus server listening on port 9090.
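
This can be done in the Grafana UI under Configuration > Data Sources, or scripted against Grafana's HTTP API. A minimal sketch, assuming the default admin/admin credentials and that the Prometheus server is reachable at the host name used earlier:

curl -s -u admin:admin -H "Content-Type: application/json" \
-X POST http://localhost:3000/api/datasources \
-d '{"name":"Prometheus","type":"prometheus","url":"http://Py4SAS-vm:9090","access":"proxy"}'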




Grafana dashboard rendering Node Exporter metrics.
