Metrics & Monitoring Guide
Monitoring architecture and metrics collection
Overview
The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.
Architecture
Components
| Component | Purpose |
|---|---|
| Telegraf | Metrics collector running on each node, gathering system and application metrics |
| VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention) |
| VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention) |
| Grafana | Visualization and dashboard platform; deployed as two replicas for high availability |
| Alertmanager | Alert routing and notification management |
Metrics Flow
The following diagram illustrates how metrics flow through the monitoring stack:
flowchart TB
subgraph External["External Sources"]
Streamers[Streamers/External Clients]
end
subgraph Cluster["Kubernetes Cluster"]
Telegraf[Telegraf DaemonSet]
subgraph Applications["Application Components"]
Director[CDN Director]
Kafka[Kafka]
Redis[Redis]
Manager[ACD Manager]
Alertmanager[Alertmanager]
end
VMAgent[VictoriaMetrics Agent]
subgraph Storage["Storage"]
VMShort[VictoriaMetrics<br/>Short-term]
VMLong[VictoriaMetrics<br/>Long-term]
end
Grafana[Grafana<br/>2 replicas, HA]
PostgreSQL[(PostgreSQL)]
Zitadel[Zitadel]
end
Streamers -->|Push metrics| Telegraf
Telegraf -->|remote_write| VMShort
Telegraf -->|remote_write| VMLong
Director -->|Scrape| VMAgent
Kafka -->|Scrape| VMAgent
Redis -->|Scrape| VMAgent
Manager -->|Scrape| VMAgent
Alertmanager -->|Scrape| VMAgent
VMAgent -->|remote_write| VMShort
VMAgent -->|remote_write| VMLong
VMShort -->|Query| Grafana
VMLong -->|Query| Grafana
Grafana <-->|Shared state| PostgreSQL
Grafana -->|OAuth2 / OIDC| Zitadel
Metrics Flow Summary:
External metrics ingestion:
- External clients (streamers) push metrics to Telegraf
- Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
Internal metrics scraping:
- VictoriaMetrics Agent scrapes Prometheus endpoints from:
- CDN Director instances
- Kafka cluster
- Redis
- ACD Manager components
- Alertmanager
- VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
Data visualization:
- Grafana queries both VictoriaMetrics databases depending on the dashboard requirements
- Operational dashboards use short-term storage
- Billing and compliance dashboards use long-term storage
Metrics Collection
Application Metrics
Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.
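As a quick sanity check of an endpoint before VMAgent scrapes it, the samples in its response can be counted. The sketch below assumes the endpoint returns the Prometheus text exposition format; the function name count_samples is hypothetical, and the pod name and port in the usage comment are placeholders.

```shell
# Count metric samples in Prometheus text exposition format read from stdin.
# Comment lines (# HELP / # TYPE) and blank lines are not samples.
count_samples() {
  grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}

# Hypothetical usage against a pod (name and port are placeholders):
#   kubectl exec <pod-name> -- curl -s localhost:8080/metrics | count_samples
```

A count of 0 suggests the application is reachable but not exporting any metrics yet.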
System Metrics
Telegraf collects system-level metrics including:
- CPU usage
- Memory utilization
- Disk I/O
- Network statistics
- Process metrics
Kubernetes Metrics
Cluster metrics are collected including:
- Pod resource usage
- Node status
- Deployment status
- Persistent volume usage
Metrics Retention
VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:
acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3" # Retention period in months
Troubleshooting
Metrics Not Appearing
If metrics are not appearing in Grafana:
Check Telegraf pods:
kubectl get pods -l app.kubernetes.io/component=telegraf
Check Telegraf logs:
kubectl logs -l app.kubernetes.io/component=telegraf
Verify VictoriaMetrics is running:
kubectl get pods -l app.kubernetes.io/component=victoria-metrics
Check application metrics endpoints:
kubectl exec <pod-name> -- curl localhost:8080/metrics
For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.
Next Steps
After setting up monitoring:
- Grafana Authentication & Roles - Configure SSO and permissions before accessing Grafana
- Grafana Dashboards - Explore and customise dashboards
- Alerts & Alarms - Set up alerting and notifications
- Operations Guide - Day-to-day operational procedures
- Troubleshooting Guide - Resolve monitoring issues
- API Guide - Access metrics via API
1 - Grafana Authentication & Roles
Configuring Grafana authentication, roles, and permissions via Zitadel
Overview
Grafana authentication is delegated entirely to Zitadel via OAuth2/OIDC. Local username/password login is not available to end users. When a user logs into Grafana, they are redirected to Zitadel to authenticate, and their Grafana role is automatically determined by the Zitadel project roles assigned to their account.
The OIDC integration between Grafana and Zitadel is configured automatically at install time — no manual Zitadel application registration is required.
How It Works
During installation, an init container runs before Grafana starts and:
- Authenticates with Zitadel using a machine-account service key.
- Registers a Grafana OIDC application in the Zitadel project (or re-uses an existing one if already registered).
- Writes the resulting client_id and client_secret into a Kubernetes Secret, which Grafana picks up on startup.
This means the Grafana OIDC application in Zitadel is managed automatically and does not need to be created or modified manually.
Role Mapping
Grafana roles are mapped from Zitadel project roles using the following rule:
| Zitadel Project Role | Grafana Role |
|---|---|
| grafana_admin | Admin: full access, can manage users, datasources, and dashboards |
| (any other role, or no role) | Viewer: read-only access to dashboards |
Note: There is no Grafana Editor role mapped by default. All authenticated users who are not explicitly granted grafana_admin receive Viewer access. If you need an Editor tier, see Customising the Role Mapping.
The mapping is enforced on every login. If a user’s Zitadel role changes, the change takes effect the next time they log into Grafana.
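The rule in the table can be restated as a small function. This is only an illustration of the mapping logic, not the code Grafana runs; the function name map_grafana_role and the billing_reader role are hypothetical.

```shell
# Illustrative restatement of the default mapping rule.
# Arguments: the Zitadel project role keys present on the user's account.
map_grafana_role() {
  for role in "$@"; do
    if [ "$role" = "grafana_admin" ]; then
      echo "Admin"
      return 0
    fi
  done
  # Any other role, or no role at all, falls through to Viewer.
  echo "Viewer"
}

map_grafana_role grafana_admin billing_reader   # prints: Admin
map_grafana_role billing_reader                 # prints: Viewer
map_grafana_role                                # prints: Viewer
```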
Prerequisites
Accessing Grafana
Grafana is accessible at:
https://<manager-host>/grafana
Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.
To log in:
- Navigate to https://<manager-host>/grafana
- Click “Login with Zitadel”
- Authenticate with your Zitadel account credentials
Granting Admin Access
By default, all Zitadel users who log into Grafana receive Viewer access. To grant a user Admin access, assign them the grafana_admin project role in Zitadel.
Step 1: Ensure the grafana_admin Role Exists
- Log into the Zitadel Console at https://<manager-host>/ui/console
- Navigate to Projects and open the ZITADEL project
- Click the Roles tab
- Check whether a role named grafana_admin already exists
- If it does not exist, click New Role and create it:
  - Key: grafana_admin
  - Display Name: Grafana Admin (or any label you prefer)
- Click Save
Step 2: Assign the Role to a User
- In the Zitadel Console, navigate to Users and open the user you want to grant admin access to
- Click the Authorizations tab
- Click New Authorization
- Select the ZITADEL project
- Select the grafana_admin role
- Click Save
The user will have Grafana Admin access the next time they log in.
Revoking Admin Access
To demote a user back to Viewer, remove the grafana_admin authorization from their account:
- In the Zitadel Console, open the user’s Authorizations tab
- Find the grafana_admin authorization on the ZITADEL project
- Click the delete icon to remove it
The change takes effect on their next Grafana login.
Customising the Role Mapping
The role mapping expression is configured in values.yaml under grafana."grafana.ini".auth.generic_oauth.role_attribute_path. It uses JMESPath syntax evaluated against the OIDC token’s role claims.
The default expression is:
grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin' || 'Viewer'
Example: Adding an Editor Tier
To map a grafana_editor Zitadel role to Grafana’s Editor role, create the grafana_editor role in Zitadel (following the same steps as above) and extend the expression:
grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin'
        || contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_editor') && 'Editor'
        || 'Viewer'
Apply the change using the standard upgrade procedure in the Configuration Guide.
Blocking Unauthenticated Access
By default, role_attribute_strict is set to false, which means any authenticated Zitadel user can log into Grafana as a Viewer even if they have no explicit Grafana role assigned. To restrict Grafana access to only users who have been explicitly granted a role, set this to true:
grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_strict: true
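With role_attribute_strict: true, users for whom the role_attribute_path expression yields no role are denied access entirely. Note that the default expression ends with || 'Viewer' and therefore always yields a role, so that fallback must also be removed from the expression for strict mode to deny anyone.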
Managing Users in Grafana
User accounts in Grafana are created automatically on first login via Zitadel. There is no need to pre-create users in the Grafana UI.
To view and manage users who have logged in:
- Log into Grafana as an Admin
- Navigate to Administration > Users and access > Users
From here you can see each user’s current role, last login time, and authentication provider. Role changes should always be made via Zitadel (as described above) rather than directly in Grafana, as they will be overwritten on the user’s next login.
Break-Glass Admin Access
A local Grafana admin account is available as a break-glass fallback for situations where Zitadel is unavailable. This account is not accessible via the standard login page (which only shows the Zitadel SSO button).
To use the local admin account, navigate directly to:
https://<manager-host>/grafana/login
The default credentials are listed in the Glossary. Change the default password immediately after installation.
Security recommendation: The break-glass account should be used only for emergency access. Do not use it for routine administration.
Troubleshooting
OAuth2 Redirect URI Mismatch / CORS Errors
Grafana is registered in Zitadel with the redirect URI https://<manager-host>/grafana/login/generic_oauth, derived from the first entry of global.hosts.manager. Accessing Grafana via a different hostname or IP address will not match this URI and will cause the login to fail.
Resolution: Always access Grafana via the configured hostname. If the hostname has changed, re-run the helm upgrade to re-register the application with the updated URI.
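To see which URI a given hostname would need to match, the redirect URI described above can be rebuilt by hand. This is a minimal sketch; the function name grafana_redirect_uri and the hostname manager.example.com are hypothetical.

```shell
# Build the redirect URI from the manager hostname, following the pattern
# https://<manager-host>/grafana/login/generic_oauth given above.
grafana_redirect_uri() {
  printf 'https://%s/grafana/login/generic_oauth\n' "$1"
}

grafana_redirect_uri manager.example.com
# prints: https://manager.example.com/grafana/login/generic_oauth
```

If the hostname in the browser address bar differs from the one used here, the URI sent to Zitadel will not match the registered one.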
User Receives Viewer Instead of Admin
The grafana_admin role is not included in the user’s Zitadel token.
Resolution:
- Confirm the grafana_admin role exists on the ZITADEL project in the Zitadel Console
- Confirm the role is assigned to the user under their Authorizations tab
- Ask the user to log out of Grafana and log back in — role changes are applied on the next login, not the current session
Login Fails with “Role not found” or Access Denied
role_attribute_strict may be set to true and the user has no matching Zitadel role.
Resolution: Either assign the user an appropriate Zitadel project role, or set role_attribute_strict: false in values.yaml to allow all authenticated users Viewer access.
Admin Role Assigned in Zitadel but User Still Gets Viewer
The grafana_admin role is correctly assigned to the user in Zitadel, but Grafana still grants them Viewer access. This indicates that role claims are not being included in the Zitadel userinfo response.
Grafana determines roles by calling the Zitadel userinfo endpoint (/oidc/v1/userinfo) and evaluating the urn:zitadel:iam:org:project:roles claim. Zitadel only includes this claim when the Grafana OIDC application has Access Token Role Assertions enabled. If the claim is absent, the role_attribute_path expression always falls through to 'Viewer'.
To verify and fix:
- Log into the Zitadel Console at https://<manager-host>/ui/console
- Navigate to Projects > ZITADEL > Applications > Grafana
- Open the Token Settings tab
- Ensure Access Token Role Assertions is enabled
- Save the change
The fix takes effect on the user’s next login — no Grafana or Helm changes are required.
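One way to confirm whether the claim is present is to inspect a userinfo response directly. The sketch below assumes you already hold a valid access token for the affected user; the function name has_roles_claim and the ACCESS_TOKEN variable are hypothetical.

```shell
# Read a userinfo JSON document on stdin; succeed if it contains the
# Zitadel project roles claim that the role mapping depends on.
has_roles_claim() {
  grep -q '"urn:zitadel:iam:org:project:roles"'
}

# Hypothetical usage (ACCESS_TOKEN must be a valid token for the user):
#   curl -s -H "Authorization: Bearer $ACCESS_TOKEN" \
#     "https://<manager-host>/oidc/v1/userinfo" | has_roles_claim \
#     && echo "roles claim present" || echo "roles claim missing"
```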
Grafana OIDC App Not Registered in Zitadel
If the init container failed during installation, the Grafana OIDC application may not have been created in Zitadel.
Resolution: Check the init container logs for errors:
kubectl logs -l app.kubernetes.io/component=grafana --previous -c zitadel-oauth-setup
Common causes are Zitadel not being ready when the init container ran, or a machine-key permission issue. Re-running the helm upgrade will re-trigger the init container and attempt registration again.
Next Steps
- Grafana Dashboards - Using and customising dashboards
- Alerts & Alarms - Configure alerting and notifications
- Metrics & Monitoring Overview - Return to the monitoring overview
2 - Grafana Dashboards
Using and customising Grafana dashboards
Overview
Grafana is the primary visualization platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.
Prerequisites
- Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
- A Zitadel user account is available for login
- Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)
Accessing Grafana
Grafana is accessible via the manager ingress:
URL: https://<manager-host>/grafana
To log in:
- Navigate to https://<manager-host>/grafana
- Click the “Login with Zitadel” button
- Authenticate with your Zitadel account credentials
Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.
For details on authentication and role configuration, see Grafana Authentication & Roles.
Standard Dashboards
Accessing Dashboards
After logging into Grafana:
- Navigate to Dashboards in the left menu
- Browse the folder structure to find the dashboard you need
- Click on a dashboard to open it
Dashboards are organised into the following folders:
- Alerting — alert state history and alerting system health
- Billing — redirect counts for billing analytics
- CDN Manager — ACD Manager API performance
- Hardware — host-level CPU, memory, disk, and network telemetry
- Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
- Streaming — CDN routing, streamer performance, and QoE
- Internal Debugging — low-level ACD Director diagnostics
Alerting
Active Alarms
A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.
Alert Statistics
Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.
vmalert
Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.
Billing
Billing Dashboard
Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.
CDN Manager
CDN Manager API
Health and performance dashboard for the ACD Manager REST API. Covers:
- Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
- Traffic: Request rate by pod, distribution across API endpoints
- Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
- Latency: P99 and average latency by endpoint, overall API response latency
- Resources & Auth: Route validation API activity
Hardware
HW Metrics
Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.
An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.
Infrastructure
k3s Cluster Infrastructure
Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:
- Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
- Compute: CPU usage, memory usage, and load average per node
- Network: Inbound and outbound bytes per node
- Disk: Read/write throughput and I/O pressure per node
- Longhorn PVC Disk Usage: Usage percentage per persistent volume
- Workload Health: Pod restart counts and OOMKill occurrences
Kafka
Kafka broker health using JMX exporter metrics. Covers:
- Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
- Throughput: Bytes in/out and messages in by topic, replication bytes in/out
- Internals: Request handler idle percentage, network processor idle percentage
Longhorn Storage
Persistent storage health for the Longhorn distributed block storage layer. Covers:
- Overview: Total, healthy, degraded, and faulted volume counts; nodes down
- Capacity: Total cluster capacity, used, and available storage
- Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
- Node & Disk: Disk usage percentage and available bytes per node, node condition checks
Redis
Redis instance health using redis-exporter metrics. Covers:
- Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
- Memory: Usage and fragmentation ratio
- Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
- Evictions & Persistence: Evictions, expirations, RDB unsaved changes
- CPU & Connections: CPU usage and connection metrics
- Command Analysis: Per-command breakdown
Streaming
Extended Monitoring
The primary operational dashboard for CDN routing activity. This is the home dashboard displayed on Grafana login. Covers:
- Latency Statistics: ACD router latency and CDN latency over time
- Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
- Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
- CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
- CDN Failovers and Retries: Failover events and retry rates by CDN
- Host Selection: Endpoint request distribution
- Session Statistics: Active session counts and session type breakdown
- Client Responses: Client-facing HTTP status code distribution
- Incoming Requests: Raw request volume
- HTTPS Certificate Statistics: Certificate validity and expiry indicators
- Warnings & Errors: Application-level warnings and errors over time
- LUA Statistics: Lua exception counts and execution time
- Configuration Change History: Timeline of routing configuration changes
Router Monitoring
External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.
QoE Monitoring
Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.
Streamer Statistics
Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage/memory/CPU. Suitable for routine streamer health monitoring.
An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.
Internal Debugging
These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.
Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.
ACD: Incoming Internet Connections
SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.
ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.
Prometheus: ACD
ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.
CDN Failures
CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.
ACD: CDN Latencies Detail
Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.
ACD: Router Latencies
ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside CDN Latencies Detail to separate router processing time from CDN response time.
Prometheus/ACD: SubRunners
Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.
Advanced Dashboards
Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, providing deeper visibility for performance investigation and capacity planning.
Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.
Enabling Advanced Dashboards
Once you have your licence key, add the following to your values.yaml:
dashboards:
  advanced:
    licenceKey: "<your-licence-key>"
Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.
HW Metrics (Advanced)
Expanded hardware telemetry that supplements the standard HW Metrics dashboard with additional depth and additional sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.
Streamer Statistics (Advanced)
Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:
- OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
- Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
- Detailed network error and drop counters from /proc/net/dev
CDN Director Metrics
Director DNS Names in Grafana
CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:
global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1
The DNS name used in Grafana dashboards will be: my-router-1.external
This naming convention is automatically applied for all configured directors.
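The naming rule can be expressed as a one-line helper, which is handy when building dashboard queries or filters by hand. The function name director_dns_name is hypothetical.

```shell
# Derive the name shown in Grafana from a router's name field by
# appending the .external suffix described above.
director_dns_name() {
  printf '%s.external\n' "$1"
}

director_dns_name my-router-1   # prints: my-router-1.external
```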
Customising Dashboards
Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.
The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:
- In Grafana, navigate to Dashboards > New > New Dashboard
- Add panels using the VictoriaMetrics or Prometheus datasource
- Save the dashboard to a folder of your choice
Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.
Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.
Customising a Pre-provisioned Dashboard
If you want a modified version of one of the built-in dashboards as a starting point:
- Open the dashboard you want to customise
- Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
- In Grafana, navigate to Dashboards > New > Import
- Upload the downloaded JSON file
- On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
- Click Import
You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.
Troubleshooting
Dashboard Loading Issues
If dashboards fail to load:
Check Grafana pods:
kubectl get pods -l app.kubernetes.io/component=grafana
Review Grafana logs:
kubectl logs -l app.kubernetes.io/component=grafana
Verify datasource configuration in Grafana UI
For login and authentication issues, see Grafana Authentication & Roles.
Next Steps
- Alerts & Alarms - Set up alerting and notifications
- Operations Guide - Day-to-day operational procedures
- Metrics & Monitoring Overview - Return to the monitoring overview
3 - Alerts & Alarms
Configuring and managing alerts and alarms
Overview
The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.
This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.
Alert Severity Levels
| Severity | Meaning |
|---|---|
| critical | Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic. |
| warning | Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended. |
Alert Groups
Alerts are organised into the following groups, each evaluated on a 15-second interval.
infra-disk
Monitors disk space utilisation and I/O latency on cluster nodes.
StorageFillingUp
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Root filesystem usage exceeds 85% |
| Must persist for | 2 minutes |
What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.
Recommended actions:
- Identify the node from the host label in the alert.
- Log into the node and check disk usage:
df -h /
du -sh /var/log/* | sort -rh | head -20
- Clear old log files, unused container images, or temporary files:
# On the node
journalctl --vacuum-size=500M
crictl rmi --prune
- If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.
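To compare a node against the 85% threshold by hand, the usage figure can be read from df. This is a minimal sketch that does not reproduce the alert rule itself; the function names fs_usage_percent and exceeds_threshold are hypothetical.

```shell
# Print the usage percentage (integer, without the % sign) of the
# filesystem that holds the given path, using POSIX df output.
fs_usage_percent() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is strictly above the threshold.
exceeds_threshold() {
  [ "$1" -gt "$2" ]
}

# Usage on a node: compare the root filesystem with the 85% threshold.
#   exceeds_threshold "$(fs_usage_percent /)" 85 && echo "above threshold"
```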
HighDiskLatency
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Average disk write latency exceeds 100 ms |
| Must persist for | 2 minutes |
What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.
Recommended actions:
- Identify the affected disk from the name label in the alert.
- Check for I/O-intensive processes on the node:
- Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
- If the issue persists on a production node, review whether the storage hardware meets the System Requirements.
infra-compute
Monitors CPU and memory utilisation on cluster nodes.
CpuSaturation
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Total CPU usage exceeds 90% |
| Must persist for | 5 minutes |
What it means: A node is running at near-full CPU capacity. Sustained CPU saturation causes request latency increases across all workloads on that node and may result in pod throttling.
Recommended actions:
- Identify the saturated node from the host label in the alert.
- Check which pods are consuming CPU:
kubectl top pods --sort-by=cpu -A
- Check for runaway processes on the node:
- If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.
MemoryCriticallyLow
| Property | Value |
|---|---|
| Severity | critical |
| Condition | Available RAM falls below 10% |
| Must persist for | 2 minutes |
What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.
Recommended actions:
- Identify the affected node from the host label in the alert.
- Immediately check for memory-leaking or oversized pods:
kubectl top pods --sort-by=memory -A
- Identify and restart any pods showing abnormal memory consumption:
kubectl rollout restart deployment/<name>
- Check kernel OOM kill log for any processes already killed:
dmesg | grep -i "oom\|killed"
- Review memory resource limits and requests for affected deployments and adjust if necessary.
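To check a node against the 10% threshold directly, the available-memory percentage can be computed from /proc/meminfo. This sketch assumes that "available RAM" corresponds to MemAvailable divided by MemTotal, which the rule definition may or may not use; the function name mem_available_percent is hypothetical.

```shell
# Print available memory as an integer percentage of total memory,
# read from a meminfo-format file (defaults to /proc/meminfo).
# Assumption: "available" means the kernel's MemAvailable figure.
mem_available_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Usage on a node:
#   [ "$(mem_available_percent)" -lt 10 ] && echo "below the 10% threshold"
```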
SwapUsageDetected
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Swap usage exceeds 5% |
| Must persist for | 1 minute |
What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.
Recommended actions:
- Treat this as an early warning for the same conditions as MemoryCriticallyLow.
- Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
- Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.
infra-network
Monitors network interface errors and traffic anomalies on cluster nodes.
NetworkInterfaceErrors
| Property | Value |
|---|
| Severity | critical |
| Condition | Any non-zero rate of inbound or outbound packet errors on a network interface |
| Must persist for | 1 minute |
What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.
Recommended actions:
- Identify the affected host and interface from the host and interface labels in the alert.
- Check interface error counters on the node:
ip -s link show <interface>
ethtool -S <interface> | grep -i error
- Check for duplex/speed mismatches between the node NIC and the upstream switch:
ethtool <interface> | grep -E "Speed|Duplex"
- Escalate to the network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.
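As a complement to the per-interface commands above, this sketch scans every interface for non-zero error counters using the Linux sysfs statistics files. The helper name is hypothetical. The counters are cumulative since boot, so running it twice a minute apart shows whether errors are still increasing, which is closer to what a rate-based alert measures.

```shell
# Illustrative helper (hypothetical name): list interfaces with non-zero error counters.
# Output columns: interface, counter name, cumulative value.
iface_error_counters() {
  local sysfs_net="${1:-/sys/class/net}"
  local dir iface counter value
  for dir in "$sysfs_net"/*/statistics; do
    [ -d "$dir" ] || continue
    iface="$(basename "$(dirname "$dir")")"
    for counter in rx_errors tx_errors; do
      value="$(cat "$dir/$counter" 2>/dev/null || true)"
      if [ -n "$value" ] && [ "$value" -gt 0 ]; then
        printf '%s %s %s\n' "$iface" "$counter" "$value"
      fi
    done
  done
}

# Usage (on the affected node): iface_error_counters
```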
SuddenNetworkEgressDrop
| Property | Value |
|---|
| Severity | critical |
| Condition | Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s |
| Must persist for | 1 minute |
What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, a link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.
Recommended actions:
- Identify the affected node and interface from the alert labels.
- Verify the node’s network connectivity:
ping <gateway-ip>
traceroute <upstream-endpoint>
- Check for interface errors or link-down events:
ip link show
dmesg | grep -i "link\|eth\|nic"
- Verify that upstream routing and firewall rules have not changed.
- If the node is healthy and the traffic reduction is expected and understood (e.g. a CDN traffic shift), the alert can be silenced.
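The alert condition can be read as two checks combined: the baseline must be above 1 Mbit/s, and current egress must be below half of that baseline. The sketch below expresses only this arithmetic. The helper name is hypothetical, both inputs are assumed to be in bits per second, and neither the way the 2-minute baseline is computed nor the 1-minute persistence requirement is modelled.

```shell
# Illustrative helper (hypothetical name): evaluate the egress-drop condition once.
# Arguments: current egress and baseline egress, both in bits per second.
egress_drop_fires() {
  local current_bps="$1" baseline_bps="$2"
  awk -v cur="$current_bps" -v base="$baseline_bps" 'BEGIN {
    min_baseline = 1000000   # 1 Mbit/s floor: quiet links never trigger the alert
    if (base > min_baseline && cur < 0.5 * base) print "firing"; else print "ok"
  }'
}

# Usage: egress_drop_fires 40000000 100000000
```

The baseline floor is what prevents idle or lightly loaded interfaces from alerting on small absolute changes.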
SuddenNetworkIngressSpike
| Property | Value |
|---|
| Severity | warning |
| Condition | Ingress throughput exceeds twice the 5-minute baseline |
| Must persist for | 1 minute |
What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.
Recommended actions:
- Identify the affected node and interface from the alert labels.
- Review active connections and top talkers:
ss -s
netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
- Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
- If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.
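To confirm the current ingress rate directly on the node, one option is to sample the interface's received-bytes counter twice and convert the difference to bits per second. This is a minimal sketch that assumes a Linux node exposing sysfs statistics; the helper name is hypothetical and the interval must be a whole number of seconds greater than zero.

```shell
# Illustrative helper (hypothetical name): average ingress rate in bits per second.
# Arguments: interface name, sampling interval in whole seconds (default 5).
rx_rate_bps() {
  local iface="$1" interval="${2:-5}" sysfs_net="${3:-/sys/class/net}"
  local counter="$sysfs_net/$iface/statistics/rx_bytes"
  local before after
  before="$(cat "$counter")" || return 1
  sleep "$interval"
  after="$(cat "$counter")" || return 1
  # Bytes received during the interval, converted to bits per second.
  echo $(( (after - before) * 8 / interval ))
}

# Usage: rx_rate_bps eth0 5
```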
longhorn
Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.
Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).
LonghornVolumeDegraded
| Property | Value |
|---|
| Severity | warning |
| Condition | A Longhorn volume’s robustness state is Degraded |
| Must persist for | 2 minutes |
What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.
Recommended actions:
- Identify the affected volume from the volume label in the alert.
- Open the Longhorn UI and inspect the volume’s replica status.
- Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
- If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
- Investigate the health of the node that hosted the failed replica:
kubectl get nodes
kubectl describe node <node-name>
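If the Longhorn UI is not at hand, volume robustness can usually also be read from the Longhorn custom resources. The sketch below assumes Longhorn is installed in the `longhorn-system` namespace and that each volume resource exposes `status.robustness` and `status.state`; verify both against your Longhorn version. The helper name is hypothetical.

```shell
# Illustrative helper (hypothetical name): list Longhorn volumes that are not healthy.
# Output columns: volume name, robustness, attachment state.
longhorn_unhealthy_volumes() {
  local namespace="${1:-longhorn-system}"
  kubectl -n "$namespace" get volumes.longhorn.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.robustness}{" "}{.status.state}{"\n"}{end}' |
    awk 'NF && $2 != "healthy"'
}

# Usage: longhorn_unhealthy_volumes
```

Detached volumes may report a robustness other than healthy without being at risk, so read the state column alongside the robustness value.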
LonghornVolumeFaulted
| Property | Value |
|---|
| Severity | critical |
| Condition | A Longhorn volume’s robustness state is Faulted |
| Must persist for | 1 minute |
What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.
Recommended actions:
- Identify the affected volume from the volume label.
- Immediately check which pods are using the volume:
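kubectl get pvc -A | grep -i <volume-name>   # find the claim bound to the volume
kubectl describe pvc <pvc-name> -n <namespace>   # the "Used By" field lists the pods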
- Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
- Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
- Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.
LonghornNodeDown
| Property | Value |
|---|
| Severity | critical |
| Condition | A Longhorn node reports a non-ready state |
| Must persist for | 2 minutes |
What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.
Recommended actions:
- Identify the affected node from the node label in the alert.
- Check the node’s status in Kubernetes:
kubectl get nodes
kubectl describe node <node-name>
- Attempt to SSH to the node and check system health:
ssh root@<node-ip>
systemctl status k3s
- If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes — but only if the remaining healthy nodes have sufficient capacity.
LonghornDiskSpaceLow
| Property | Value |
|---|
| Severity | warning |
| Condition | Available Longhorn disk space on a node falls below 15% |
| Must persist for | 2 minutes |
What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.
Recommended actions:
- Identify the affected node and disk from the node and disk labels in the alert.
- Open the Longhorn UI and check which volumes have replicas on this disk.
- Check for snapshots or backups that can be cleaned up to reclaim space.
- If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
- Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.
Adding Custom Alert Rules
Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.
Example: Adding a Custom Alert
The following example adds an alert group that fires when a Kafka consumer lag exceeds a threshold:
victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."
Apply the change using the standard upgrade procedure in the Configuration Guide.
Rule Fields Reference
| Field | Required | Description |
|---|
| alert | Yes | Alert name. Must be unique within the group. |
| expr | Yes | PromQL expression. The alert becomes active for every time series the expression returns. |
| for | No | How long the condition must hold before the alert fires. If omitted, the alert fires on the first evaluation that returns a result. |
| labels.severity | Recommended | Set to critical or warning to match the built-in routing rules. |
| annotations.summary | Recommended | Short human-readable description. Supports Go template labels (e.g. {{ $labels.host }}). |
| annotations.description | Recommended | Detailed description with context for the on-call operator. |
Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.
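The same check can be scripted against the Alertmanager HTTP API. The sketch below assumes the v2 API is reachable under the same `/alertmanager` prefix as the UI, that `curl` and `jq` are installed, and that no additional authentication is required; adjust it if your deployment enforces login. The helper name is hypothetical.

```shell
# Illustrative helper (hypothetical name): list active, unsilenced alerts.
# Argument: Alertmanager base URL, e.g. https://<manager-host>/alertmanager
# Output columns (tab-separated): alert name, severity, state.
firing_alerts() {
  local base_url="$1"
  curl -fsS "${base_url%/}/api/v2/alerts?active=true&silenced=false&inhibited=false" |
    jq -r '.[] | [.labels.alertname, .labels.severity, .status.state] | @tsv'
}

# Usage: firing_alerts "https://<manager-host>/alertmanager"
```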
Configuring Alert Routes
By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.
Route Structure
The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:
alertmanager:
  config:
    route:
      receiver: 'null' # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s # Wait before sending first notification for a new group
      group_interval: 10s # Wait before sending updated notifications for a group
      repeat_interval: 1h # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
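To check which receiver a given label set would reach before deploying, Alertmanager's `amtool` CLI can evaluate a routing tree offline. This sketch assumes `amtool` is installed and that the contents of `alertmanager.config` have been saved as a standalone Alertmanager configuration file; the file name and the wrapper function are hypothetical.

```shell
# Illustrative wrapper (hypothetical name): print the receiver(s) matched by a label set.
# Arguments: path to a standalone Alertmanager config file, then label=value pairs.
which_receiver() {
  local config_file="$1"
  shift
  amtool config routes test --config.file="$config_file" "$@"
}

# Usage: which_receiver alertmanager.yml severity=critical
```

With the route structure above, a label set that matches no child route falls through to the top-level receiver.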
Notification Channels
Email
Email requires an SMTP server configured in the global section. Critical and warning receivers can be defined independently.
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
Slack
Requires an incoming webhook URL created in your Slack workspace.
alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
Telegram
Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.
alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
              {{ .Annotations.description }}
              {{ end }}
Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.
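The lookup described above can be scripted as follows. This is a sketch that assumes `curl` and `jq` are installed and that the relevant updates arrive as `message` (groups) or `channel_post` (channels) entries; the helper name is hypothetical. Because the bot token is part of the URL, avoid running it where shell history or process listings are shared.

```shell
# Illustrative helper (hypothetical name): list chat IDs seen in the bot's recent updates.
# Argument: the bot token issued by @BotFather.
# Output columns (tab-separated): chat ID, chat title or username.
telegram_chat_ids() {
  local bot_token="$1"
  curl -fsS "https://api.telegram.org/bot${bot_token}/getUpdates" |
    jq -r '.result[]
           | (.message // .channel_post)
           | select(. != null)
           | "\(.chat.id)\t\(.chat.title // .chat.username // "")"' |
    sort -u
}

# Usage: telegram_chat_ids "your-bot-token"
```

If the output is empty, send a new message in the channel or group and run it again, since only recent updates are returned.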
Combining Multiple Receivers
Routes and receivers can be combined to send different alert severities to different channels simultaneously. For example, to send critical alerts to both Slack and email, and warnings to email only:
alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.
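After the upgrade, one way to confirm that notifications actually arrive is to post a synthetic alert to Alertmanager and watch the configured channels. The sketch below assumes the v2 API is reachable under the `/alertmanager` prefix without extra authentication; the helper name, the alert name `RoutingTest`, and the `test-host` label value are all made up for illustration.

```shell
# Illustrative helper (hypothetical name): post a synthetic alert to exercise routing.
# Arguments: Alertmanager base URL, severity label value (default: warning).
send_test_alert() {
  local base_url="$1" severity="${2:-warning}"
  curl -fsS -X POST "${base_url%/}/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "[{\"labels\":{\"alertname\":\"RoutingTest\",\"severity\":\"${severity}\",\"host\":\"test-host\"},\"annotations\":{\"description\":\"Synthetic alert for verifying routing\"}}]"
}

# Usage: send_test_alert "https://<manager-host>/alertmanager" critical
```

Because no end time is supplied, the test alert resolves on its own once Alertmanager's resolve timeout elapses, which also exercises the send_resolved notifications.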
Silencing Alerts
Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.
Silences are managed via the Alertmanager UI, accessible at:
https://<manager-host>/alertmanager
Creating a Silence
- Navigate to the Alertmanager UI and click Silences in the top navigation.
- Click Create Silence.
- Set the Start and End times for the silence window.
- Add one or more matchers to scope which alerts are suppressed. For example:
  - alertname = StorageFillingUp (silence a specific alert)
  - severity = warning (silence all warnings)
  - host = node-01 (silence all alerts from a specific host)
- Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
- Click Create. The silence takes effect immediately.
Expiring a Silence
Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.
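For scripted maintenance windows, silences can also be created from the command line with Alertmanager's `amtool`. This is a sketch that assumes `amtool` is installed and can reach the Alertmanager URL without extra authentication; the wrapper name is hypothetical.

```shell
# Illustrative wrapper (hypothetical name): silence all alerts from one host.
# Arguments: Alertmanager URL, host label value, duration (e.g. 2h), comment.
silence_host() {
  local alertmanager_url="$1" host="$2" duration="$3" comment="$4"
  amtool silence add "host=${host}" \
    --alertmanager.url="$alertmanager_url" \
    --duration="$duration" \
    --author="${USER:-operator}" \
    --comment="$comment"
}

# Usage: silence_host "https://<manager-host>/alertmanager" node-01 2h "Planned disk expansion on node-01"
```

On success `amtool` prints the ID of the new silence, which can be passed to `amtool silence expire` to end it early.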
Next Steps
- Operations Guide - Day-to-day operational procedures
- Troubleshooting Guide - Resolve underlying issues surfaced by alerts
- Metrics & Monitoring Overview - Return to the monitoring overview