Grafana Dashboards

Using and customising Grafana dashboards


Overview

Grafana is the primary visualisation platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.

Prerequisites

Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
A Zitadel user account is available for login
Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)
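
The pod check above can also be scripted. The following is a minimal sketch, assuming the output of kubectl get pods -l app.kubernetes.io/component=grafana -o json has been captured as a string. The function name and the sample data are hypothetical; only the status.phase and containerStatuses[].ready fields of the standard Kubernetes pod object are relied on.

```python
import json


def grafana_pods_ready(pod_list_json: str) -> bool:
    """Return True if at least one pod is listed and every pod is
    Running with all of its containers reporting ready."""
    pods = json.loads(pod_list_json).get("items", [])
    if not pods:
        return False
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") != "Running":
            return False
        containers = status.get("containerStatuses", [])
        if not containers or not all(c.get("ready") for c in containers):
            return False
    return True


# Trimmed, made-up sample of what the kubectl command might return.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "grafana-0"},
            "status": {
                "phase": "Running",
                "containerStatuses": [{"name": "grafana", "ready": True}],
            },
        }
    ]
})

print(grafana_pods_ready(sample))
```

An empty pod list is treated as "not ready", since it usually means the label selector matched nothing in the current namespace.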

Accessing Grafana

Grafana is accessible via the manager ingress:

URL: https://<manager-host>/grafana

To log in:

Navigate to https://<manager-host>/grafana
Click the “Login with Zitadel” button
Authenticate with your Zitadel account credentials

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.
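
As an illustration of the rule above, the sketch below checks whether a URL uses the first configured manager host. It assumes the entries of global.hosts.manager have already been reduced to a list of plain hostname strings (the actual schema of that key may differ), and the function name and example hostnames are hypothetical.

```python
from typing import Sequence
from urllib.parse import urlsplit


def uses_canonical_host(url: str, manager_hosts: Sequence[str]) -> bool:
    """Return True if the URL's hostname equals the first configured
    manager host. The comparison is case-insensitive."""
    if not manager_hosts:
        return False
    # urlsplit() lower-cases the hostname and returns None if there is none.
    hostname = urlsplit(url).hostname
    return hostname is not None and hostname == manager_hosts[0].lower()


# Hypothetical values standing in for global.hosts.manager.
hosts = ["manager.example.com", "manager-alt.example.com"]

print(uses_canonical_host("https://manager.example.com/grafana", hosts))
print(uses_canonical_host("https://192.0.2.10/grafana", hosts))
print(uses_canonical_host("https://manager-alt.example.com/grafana", hosts))
```

Only the first URL passes: the IP address and the second hostname are both rejected, mirroring the cases described above as causing redirect URI mismatches.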

For details on authentication and role configuration, see Grafana Authentication & Roles.

Standard Dashboards

Accessing Dashboards

After logging into Grafana:

Navigate to Dashboards in the left menu
Browse the folder structure to find the dashboard you need
Click on a dashboard to open it

Dashboards are organised into the following folders:

Alerting — alert state history and alerting system health
Billing — redirect counts for billing analytics
CDN Manager — ACD Manager API performance
Hardware — host-level CPU, memory, disk, and network telemetry
Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
Streaming — CDN routing, streamer performance, and QoE
Internal Debugging — low-level ACD Director diagnostics

Alerting

Active Alarms

A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.

Alert Statistics

Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.

vmalert

Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.

Billing

Billing Dashboard

Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.

CDN Manager

CDN Manager API

Health and performance dashboard for the ACD Manager REST API. Covers:

Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
Traffic: Request rate by pod, distribution across API endpoints
Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
Latency: P99 and average latency by endpoint, overall API response latency
Resources & Auth: Route validation API activity

Hardware

HW Metrics

Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.

An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.

Infrastructure

k3s Cluster Infrastructure

Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:

Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
Compute: CPU usage, memory usage, and load average per node
Network: Inbound and outbound bytes per node
Disk: Read/write throughput and I/O pressure per node
Longhorn PVC Disk Usage: Usage percentage per persistent volume
Workload Health: Pod restart counts and OOMKill occurrences

Kafka

Kafka broker health using JMX exporter metrics. Covers:

Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
Throughput: Bytes in/out and messages in by topic, replication bytes in/out
Internals: Request handler idle percentage, network processor idle percentage

Longhorn Storage

Persistent storage health for the Longhorn distributed block storage layer. Covers:

Overview: Total, healthy, degraded, and faulted volume counts; nodes down
Capacity: Total cluster capacity, used, and available storage
Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
Node & Disk: Disk usage percentage and available bytes per node, node condition checks

Redis

Redis instance health using redis-exporter metrics. Covers:

Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
Memory: Usage and fragmentation ratio
Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
Evictions & Persistence: Evictions, expirations, RDB unsaved changes
CPU & Connections: CPU usage and connection metrics
Command Analysis: Per-command breakdown

Streaming

Extended Monitoring

The primary operational dashboard for CDN routing activity. This is the home dashboard displayed after logging in to Grafana. Covers:

Latency Statistics: ACD router latency and CDN latency over time
Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
CDN Failovers and Retries: Failover events and retry rates by CDN
Host Selection: Endpoint request distribution
Session Statistics: Active session counts and session type breakdown
Client Responses: Client-facing HTTP status code distribution
Incoming Requests: Raw request volume
HTTPS Certificate Statistics: Certificate validity and expiry indicators
Warnings & Errors: Application-level warnings and errors over time
LUA Statistics: Lua exception counts and execution time
Configuration Change History: Timeline of routing configuration changes

Router Monitoring

External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.

QoE Monitoring

Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.

Streamer Statistics

Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage, memory, and CPU usage. Suitable for routine streamer health monitoring.

An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.

Internal Debugging

These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.

Debugging Information

Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.

ACD: Incoming Internet Connections

SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.

Performance Metrics

ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.

Prometheus: ACD

ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.

CDN Failures

CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.

ACD: CDN Latencies Detail

Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.

ACD: Router Latencies

ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside the ACD: CDN Latencies Detail dashboard to separate router processing time from CDN response time.

Prometheus/ACD: SubRunners

Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.

Advanced Dashboards

Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, HW Metrics and Streamer Statistics, providing deeper visibility for performance investigation and capacity planning.

Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.

Enabling Advanced Dashboards

Once you have your licence key, add the following to your values.yaml:

dashboards:
  advanced:
    licenceKey: "<your-licence-key>"

Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.

HW Metrics (Advanced)

Expanded hardware telemetry that extends the standard HW Metrics dashboard with greater depth and additional sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.

Streamer Statistics (Advanced)

Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:

OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
Detailed network error and drop counters from /proc/net/dev

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is applied automatically to all configured directors.
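
The mapping from configured name to Grafana DNS name can be sketched as follows. The function name is hypothetical, and the input is assumed to be the global.hosts.routers list already loaded into Python dictionaries.

```python
def director_dns_names(routers):
    """Map each configured router name to the DNS name shown in
    Grafana, which is the name followed by '.external'."""
    return {r["name"]: f'{r["name"]}.external' for r in routers}


# Mirrors the global.hosts.routers example above.
routers = [{"name": "my-router-1", "address": "192.0.2.1"}]

print(director_dns_names(routers))
```

This can be handy when building dashboard variable values or filters from the same configuration file that defines the directors.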

Customising Dashboards

Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.

The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:

In Grafana, navigate to Dashboards > New > New Dashboard
Add panels using the VictoriaMetrics or Prometheus datasource
Save the dashboard to a folder of your choice

Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.

Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.

Customising a Pre-provisioned Dashboard

To use one of the built-in dashboards as the starting point for a customised version:

Open the dashboard you want to customise
Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
In Grafana, navigate to Dashboards > New > Import
Upload the downloaded JSON file
On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
Click Import

You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.
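
If many dashboards need to be copied, the exported JSON can be prepared in a script before import. The sketch below is one possible approach, not the product's own tooling: it assumes the exported file follows Grafana's dashboard JSON model, with top-level title, id, and uid fields, and clears the identifiers so that Grafana can assign fresh ones on import. The function name and sample dashboard are hypothetical.

```python
import copy


def prepare_dashboard_copy(dashboard: dict, new_title: str) -> dict:
    """Return a copy of an exported dashboard that can be imported
    alongside the original without clashing with it."""
    clone = copy.deepcopy(dashboard)  # leave the caller's object untouched
    clone["title"] = new_title
    clone["id"] = None   # numeric database id; None lets Grafana create a new one
    clone["uid"] = None  # unique identifier; None lets Grafana generate one
    return clone


# Made-up, heavily trimmed export of a provisioned dashboard.
exported = {"id": 12, "uid": "abc123", "title": "HW Metrics", "panels": []}

custom = prepare_dashboard_copy(exported, "HW Metrics (custom)")
print(custom["title"], custom["id"], custom["uid"])
```

The result can be written out with json.dump and uploaded through the import screen described above, where the destination folder is still chosen manually.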

Troubleshooting

Dashboard Loading Issues

If dashboards fail to load:

Check Grafana pods:

kubectl get pods -l app.kubernetes.io/component=grafana

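Review Grafana logs. When a label selector is used, kubectl logs returns only the last 10 lines per pod by default; append --tail=<n> to see more: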

kubectl logs -l app.kubernetes.io/component=grafana

Verify datasource configuration in the Grafana UI
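
The first two checks can be collected in one pass, for example when attaching output to a support ticket. This is a minimal sketch that assumes kubectl is on the PATH and that the current context and namespace point at the manager cluster. The function names and the default of 200 log lines are arbitrary choices for illustration.

```python
import subprocess

SELECTOR = "app.kubernetes.io/component=grafana"


def grafana_diagnostic_commands(tail: int = 200):
    """Build the kubectl commands from the steps above as argument lists.
    --tail raises the per-pod log limit that applies with a selector."""
    return [
        ["kubectl", "get", "pods", "-l", SELECTOR],
        ["kubectl", "logs", "-l", SELECTOR, f"--tail={tail}"],
    ]


def run_diagnostics(tail: int = 200) -> dict:
    """Run each command and return its combined stdout and stderr,
    keyed by the command line. Raises FileNotFoundError without kubectl."""
    results = {}
    for cmd in grafana_diagnostic_commands(tail):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results


# Show the commands without running them.
for cmd in grafana_diagnostic_commands():
    print(" ".join(cmd))
```

Calling run_diagnostics() executes the commands; building them separately keeps the sketch inspectable on a machine without cluster access.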

For login and authentication issues, see Grafana Authentication & Roles.

Next Steps

Alerts & Alarms - Set up alerting and notifications
Operations Guide - Day-to-day operational procedures
Metrics & Monitoring Overview - Return to the monitoring overview