Grafana Dashboards
Overview
Grafana is the primary visualization platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.
Prerequisites
- Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
- A Zitadel user account is available for login
- Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)
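The pod check in the first prerequisite can be scripted. The following is a minimal sketch, not part of the product: pods_all_ready is a hypothetical helper name, and it assumes the default kubectl column layout (NAME, READY, STATUS, ...) and that kubectl is pointed at the namespace where Grafana runs.

```shell
# Reads `kubectl get pods --no-headers` output on stdin (columns: NAME READY STATUS ...).
# Succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready.
pods_all_ready() {
  awk '
    {
      seen = 1
      split($2, ready, "/")    # READY column, for example "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage (assumption: the current kubectl context and namespace are the right ones):
#   kubectl get pods -l app.kubernetes.io/component=grafana --no-headers | pods_all_ready \
#     && echo "Grafana is ready"
```

The helper only parses text, so it can be exercised against saved command output as well as a live cluster.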
Accessing Grafana
Grafana is accessible via the manager ingress:
URL: https://<manager-host>/grafana
To log in:
- Navigate to https://<manager-host>/grafana
- Click the “Login with Zitadel” button
- Authenticate with your Zitadel account credentials
Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.
For details on authentication and role configuration, see Grafana Authentication & Roles.
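As a guard against the IP-address pitfall described above, a small helper can build the Grafana URL and refuse anything that looks like an IPv4 address. This is a sketch under stated assumptions: grafana_url is a hypothetical name, manager.example.com is a placeholder, and the check cannot confirm that the hostname really is the first entry of global.hosts.manager.

```shell
# Prints the Grafana URL for a manager hostname.
# Fails if the argument is empty or consists only of digits and dots (IPv4-like).
grafana_url() {
  host="$1"
  case "$host" in
    '')
      echo "error: no hostname given" >&2
      return 1 ;;
    *[!0-9.]*)
      # Contains at least one character that is not a digit or dot: treat as a DNS name
      printf 'https://%s/grafana\n' "$host" ;;
    *)
      echo "error: '$host' looks like an IPv4 address; use the DNS name instead" >&2
      return 1 ;;
  esac
}

# Usage:
#   grafana_url manager.example.com
```

For a DNS name the function prints the URL in the form shown earlier in this section; for an IPv4-style argument it prints an error and returns a non-zero status.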
Standard Dashboards
Accessing Dashboards
After logging into Grafana:
- Navigate to Dashboards in the left menu
- Browse the folder structure to find the dashboard you need
- Click on a dashboard to open it
Dashboards are organised into the following folders:
- Alerting — alert state history and alerting system health
- Billing — redirect counts for billing analytics
- CDN Manager — ACD Manager API performance
- Hardware — host-level CPU, memory, disk, and network telemetry
- Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
- Streaming — CDN routing, streamer performance, and QoE
- Internal Debugging — low-level ACD Director diagnostics
Alerting
Active Alarms
A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.
Alert Statistics
Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.
vmalert
Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.
Billing
Billing Dashboard
Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.
CDN Manager
CDN Manager API
Health and performance dashboard for the ACD Manager REST API. Covers:
- Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
- Traffic: Request rate by pod, distribution across API endpoints
- Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
- Latency: P99 and average latency by endpoint, overall API response latency
- Resources & Auth: Route validation API activity
Hardware
HW Metrics
Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.
An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.
Infrastructure
k3s Cluster Infrastructure
Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:
- Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
- Compute: CPU usage, memory usage, and load average per node
- Network: Inbound and outbound bytes per node
- Disk: Read/write throughput and I/O pressure per node
- Longhorn PVC Disk Usage: Usage percentage per persistent volume
- Workload Health: Pod restart counts and OOMKill occurrences
Kafka
Kafka broker health using JMX exporter metrics. Covers:
- Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
- Throughput: Bytes in/out and messages in by topic, replication bytes in/out
- Internals: Request handler idle percentage, network processor idle percentage
Longhorn Storage
Persistent storage health for the Longhorn distributed block storage layer. Covers:
- Overview: Total, healthy, degraded, and faulted volume counts; nodes down
- Capacity: Total cluster capacity, used, and available storage
- Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
- Node & Disk: Disk usage percentage and available bytes per node, node condition checks
Redis
Redis instance health using redis-exporter metrics. Covers:
- Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
- Memory: Usage and fragmentation ratio
- Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
- Evictions & Persistence: Evictions, expirations, RDB unsaved changes
- CPU & Connections: CPU usage and connection metrics
- Command Analysis: Per-command breakdown
Streaming
Extended Monitoring
The primary operational dashboard for CDN routing activity. This is the home dashboard displayed on Grafana login. Covers:
- Latency Statistics: ACD router latency and CDN latency over time
- Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
- Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
- CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
- CDN Failovers and Retries: Failover events and retry rates by CDN
- Host Selection: Endpoint request distribution
- Session Statistics: Active session counts and session type breakdown
- Client Responses: Client-facing HTTP status code distribution
- Incoming Requests: Raw request volume
- HTTPS Certificate Statistics: Certificate validity and expiry indicators
- Warnings & Errors: Application-level warnings and errors over time
- LUA Statistics: Lua exception counts and execution time
- Configuration Change History: Timeline of routing configuration changes
Router Monitoring
External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.
QoE Monitoring
Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.
Streamer Statistics
Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage/memory/CPU. Suitable for routine streamer health monitoring.
An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.
Internal Debugging
These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.
Debugging Information
Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.
ACD: Incoming Internet Connections
SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.
Performance Metrics
ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.
Prometheus: ACD
ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.
CDN Failures
CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.
ACD: CDN Latencies Detail
Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.
ACD: Router Latencies
ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside CDN Latencies Detail to separate router processing time from CDN response time.
Prometheus/ACD: SubRunners
Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.
Advanced Dashboards
Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, providing deeper visibility for performance investigation and capacity planning.
Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.
Enabling Advanced Dashboards
Once you have your licence key, add the following to your values.yaml:
dashboards:
  advanced:
    licenceKey: "<your-licence-key>"
Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.
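Before running the upgrade, a quick text check can catch a values.yaml that is missing the key or still carries the placeholder. This is only a sketch: licence_key_set is a hypothetical helper, it assumes the value is double-quoted as in the snippet above, and it does not validate the key itself or the YAML nesting.

```shell
# Succeeds if the file has a licenceKey line whose quoted value is non-empty
# and is not an angle-bracket placeholder such as "<your-licence-key>".
# This is a plain text match; it does not check nesting under dashboards.advanced.
licence_key_set() {
  grep -Eq '^[[:space:]]*licenceKey:[[:space:]]*"[^"<]+"' "$1"
}

# Usage:
#   licence_key_set values.yaml || echo "licenceKey is missing or still a placeholder" >&2
```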
HW Metrics (Advanced)
Expanded hardware telemetry that extends the standard HW Metrics dashboard with more detail and additional sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.
Streamer Statistics (Advanced)
Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:
- OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
- Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
- Detailed network error and drop counters from /proc/net/dev
CDN Director Metrics
Director DNS Names in Grafana
CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:
global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1
The DNS name used in Grafana dashboards will be: my-router-1.external
This naming convention is automatically applied for all configured directors.
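The mapping from configured router names to the names shown in Grafana can be sketched as a one-line helper. director_dns_names is a hypothetical function for illustration only; it applies the ".external" suffix described above and nothing more.

```shell
# Prints the Grafana DNS name for each router name given as an argument,
# following the <name>.external convention.
director_dns_names() {
  for name in "$@"; do
    printf '%s.external\n' "$name"
  done
}

# Usage:
#   director_dns_names my-router-1
```

This can be handy when preparing a list of expected hosts to compare against the host selectors in a dashboard.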
Customising Dashboards
Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.
The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:
- In Grafana, navigate to Dashboards > New > New Dashboard
- Add panels using the VictoriaMetrics or Prometheus datasource
- Save the dashboard to a folder of your choice
Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.
Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.
Customising a Pre-provisioned Dashboard
If you want a modified version of one of the built-in dashboards as a starting point:
- Open the dashboard you want to customise
- Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
- In Grafana, navigate to Dashboards > New > Import
- Upload the downloaded JSON file
- On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
- Click Import
You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.
Troubleshooting
Dashboard Loading Issues
If dashboards fail to load:
- Check Grafana pods:
  kubectl get pods -l app.kubernetes.io/component=grafana
- Review Grafana logs:
  kubectl logs -l app.kubernetes.io/component=grafana
- Verify datasource configuration in Grafana UI
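The first two checks can be bundled into one function for quick triage. This is a minimal sketch, assuming kubectl targets the namespace where Grafana is deployed; grafana_diag is a hypothetical name, and the --tail=50 limit is an arbitrary choice rather than a product recommendation. The datasource check remains a manual step in the Grafana UI.

```shell
# Prints the Grafana pod list followed by the most recent log lines.
grafana_diag() {
  selector='app.kubernetes.io/component=grafana'

  echo "== Grafana pods =="
  kubectl get pods -l "$selector"

  echo "== Recent Grafana logs =="
  # --tail limits output per pod; raise it if the relevant error is older
  kubectl logs -l "$selector" --tail=50
}

# Usage:
#   grafana_diag > grafana-diag.txt
```

Redirecting the output to a file, as in the usage line, gives a snapshot that can be attached to a support request.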
For login and authentication issues, see Grafana Authentication & Roles.
Next Steps
- Alerts & Alarms - Set up alerting and notifications
- Operations Guide - Day-to-day operational procedures
- Metrics & Monitoring Overview - Return to the monitoring overview