Observability

Device metrics

Edge devices can collect metrics both from all deployed workloads and from the device's own system. The collected metrics are sent to a metrics receiver. This document describes how to configure that functionality.

System metrics

Collection frequency

System metrics collection is enabled by default: the Flotta agent starts gathering metrics when the device starts, at a default interval of 60 seconds. The interval can be customized by setting the desired frequency (in seconds) in an EdgeDevice CR.

For instance, the following spec snippet instructs the device worker to collect system metrics every 5 minutes (300 seconds).

spec:
  metrics:
    system:
      interval: 300

Allow-lists

By default, the device worker collects only a narrow, pre-defined list of system metrics; users can modify the set of collected metrics with a system metrics allow-list.

Allow-list configuration comprises two elements:

  • a ConfigMap containing the list of metrics to be collected (exclusively)
  • a reference to that ConfigMap in the EdgeDevice system metrics configuration

Sample allow-list ConfigMap (note the metrics_list.yaml key):

apiVersion: v1
kind: ConfigMap
metadata:
  name: system-allow-list
  namespace: devices
data:
  metrics_list.yaml: |
    names:
      - node_disk_io_now
      - node_memory_Mapped_bytes
      - node_network_speed_bytes

A reference to the above ConfigMap in an EdgeDevice spec:

spec:
  metrics:
    system:
      allowList:
        name: system-allow-list
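Putting the pieces together, a complete EdgeDevice CR combining a custom interval and an allow-list might look as follows. This is a sketch: the apiVersion and device name are illustrative, so verify them against the CRDs installed in your cluster.

```yaml
apiVersion: management.project-flotta.io/v1alpha1  # illustrative; check your installed CRD
kind: EdgeDevice
metadata:
  name: my-edge-device        # hypothetical device name
  namespace: devices
spec:
  metrics:
    system:
      interval: 300           # collect system metrics every 5 minutes
      allowList:
        name: system-allow-list  # the ConfigMap shown above, in the same namespace
```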

Metrics receiver

Overview

The devices can be configured to write the metrics to a remote server. The client on the device uses the Prometheus Remote Write API (see also Prometheus Integrations). The device writes metrics until it reaches the end of the TSDB contents, then waits 5 minutes for more metrics to be collected before writing again.

Configuration

The feature is disabled by default. It is configured via EdgeDevice/EdgeDeviceSet CRs. Example with inline documentation and defaults:

spec:
  metrics:
    receiverConfiguration:
      caSecretName: receiver-tls # secret containing the CA cert under the 'ca.crt' key. Optional
      requestNumSamples: 10000 # maximum number of samples per request from device to receiver. Optional
      timeoutSeconds: 10 # timeout for requests to the receiver. Optional
      url: https://receiver:19291/api/v1/receive # the receiver's URL; the scheme selects HTTP or HTTPS. Set to empty to disable writing to the receiver
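When the receiver serves TLS with a certificate signed by a private CA, the secret referenced by caSecretName must carry that CA certificate under the ca.crt key. A minimal sketch of such a secret (the namespace and the certificate content are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: receiver-tls
  namespace: devices          # illustrative namespace
type: Opaque
data:
  ca.crt: LS0tLS1CRUdJTi...  # base64-encoded CA certificate (truncated placeholder)
```

Equivalently, the secret can be created from a PEM file with `kubectl create secret generic receiver-tls --from-file=ca.crt=ca.crt`.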

Example receiver

We prepared an example for deploying a Thanos receiver. The example includes deployments with and without TLS. The receiver listens on port 19291 for incoming writes. The deployment's pod also includes a container running a Thanos querier, which you can use to query the received metrics; it listens on port 9090.

Without TLS:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-receiver
  labels:
    app: thanos-receiver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-receiver
  template:
    metadata:
      labels:
        app: thanos-receiver
    spec:
      containers:
      - name: receive
        image: quay.io/thanos/thanos:v0.24.0
        command:
        - /bin/thanos
        - receive
        - --label
        - "receiver=\"0\""
      - name: query
        image: quay.io/thanos/thanos:v0.24.0
        command:
        - /bin/thanos
        - query
        - --http-address
        - 0.0.0.0:9090
        - --grpc-address
        - 0.0.0.0:11901
        - --endpoint
        - 127.0.0.1:10901
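To reach the receiver from devices and the querier from outside the pod, you will likely want a Service in front of the deployment. A minimal sketch (the service name and port names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-receiver
spec:
  selector:
    app: thanos-receiver    # matches the deployment's pod labels
  ports:
  - name: remote-write      # Prometheus Remote Write endpoint
    port: 19291
    targetPort: 19291
  - name: query             # Thanos querier HTTP endpoint
    port: 9090
    targetPort: 9090
```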

With TLS:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-receiver
  labels:
    app: thanos-receiver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-receiver
  template:
    metadata:
      labels:
        app: thanos-receiver
    spec:
      initContainers:
      - name: http-config
        image: fedora
        command: ["/bin/sh"]
        args: ["-c", "echo -e \"tls_server_config:\\n  cert_file: /etc/server-tls/tls.crt\\n  key_file: /etc/server-tls/tls.key\" > /etc/shared/http.config"]
        volumeMounts:
        - name: shared
          mountPath: /etc/shared
      containers:
      - name: receive
        image: quay.io/thanos/thanos:v0.24.0
        command:
        - /bin/thanos
        - receive
        - --label
        - "receiver=\"0\""
        - --remote-write.server-tls-cert
        - /etc/server-tls/tls.crt
        - --remote-write.server-tls-key
        - /etc/server-tls/tls.key
        volumeMounts:
        - name: server-tls
          mountPath: /etc/server-tls
      - name: query
        image: quay.io/thanos/thanos:v0.24.0
        command:
        - /bin/thanos
        - query
        - --http-address
        - 0.0.0.0:9090
        - --grpc-address
        - 0.0.0.0:11901
        - --endpoint
        - 127.0.0.1:10901
        - --http.config
        - /etc/shared/http.config
        volumeMounts:
        - name: server-tls
          mountPath: /etc/server-tls
        - name: shared
          mountPath: /etc/shared
      volumes:
      - name: server-tls
        secret:
          secretName: thanos-receiver-tls
      - name: shared
        emptyDir: {}
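The TLS variant mounts a thanos-receiver-tls secret with tls.crt and tls.key keys. For testing, one way to produce such a pair is a self-signed certificate; the CN and SAN below are illustrative, so substitute the DNS names your devices will actually use.

```shell
# Generate a self-signed key and certificate for the receiver (testing only)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout tls.key -out tls.crt -days 365 \
  -subj "/CN=thanos-receiver" \
  -addext "subjectAltName=DNS:thanos-receiver,DNS:thanos-receiver.default.svc"

# Store the pair in the secret mounted by the deployment:
# kubectl create secret tls thanos-receiver-tls --cert=tls.crt --key=tls.key
```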

Operator Metrics

To publish the operator's metrics to OpenShift monitoring, the following YAML needs to be applied:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
EOF

Grafana dashboard

To install Grafana in the flotta namespace, use the following script, which installs Grafana together with the Grafana dashboard.

export KUBECONFIG=your-kubeconfig-file
tools/deploy_grafana.sh -d contrib/metrics/flotta-dashboard.json

To import an additional Grafana dashboard into the existing Grafana in the flotta namespace, use the following script:

export KUBECONFIG=your-kubeconfig-file
tools/import_grafana_dashboards.sh -d <dashboard file>

In particular, it can be used to install the edge device health monitoring dashboard (flotta-operator/docs/metrics/flotta-devices-health.json):

tools/import_grafana_dashboards.sh -d contrib/metrics/flotta-devices-health.json

All these scripts are part of the flotta-operator GitHub repo.