Metrics & Monitoring Guide

Monitoring architecture and metrics collection

Overview

The CDN Manager includes a comprehensive monitoring stack based on VictoriaMetrics for time-series data storage, Telegraf for metrics collection, and Grafana for visualization. This guide describes the monitoring architecture and how to access and use the monitoring capabilities.

| Guide | Description |
|---|---|
| Grafana Dashboards | Using and customising the built-in and advanced Grafana dashboards |
| Grafana Authentication & Roles | Configuring Grafana authentication, roles, and permissions |
| Alerts & Alarms | Configuring and managing alerts and alarms |

Architecture

Components

| Component | Purpose |
|---|---|
| Telegraf | Metrics collector running on each node, gathering system and application metrics |
| VictoriaMetrics Agent | Metrics scraper and forwarder; scrapes Prometheus endpoints and forwards to VictoriaMetrics |
| VictoriaMetrics (Short-term) | Time-series database for operational dashboards (30-90 day retention) |
| VictoriaMetrics (Long-term) | Time-series database for billing and compliance (1+ year retention) |
| Grafana | Visualization and dashboard platform; deployed as two replicas for high availability |
| Alertmanager | Alert routing and notification management |

Metrics Flow

The following diagram illustrates how metrics flow through the monitoring stack:

flowchart TB
    subgraph External["External Sources"]
        Streamers[Streamers/External Clients]
    end

    subgraph Cluster["Kubernetes Cluster"]
        Telegraf[Telegraf DaemonSet]

        subgraph Applications["Application Components"]
            Director[CDN Director]
            Kafka[Kafka]
            Redis[Redis]
            Manager[ACD Manager]
            Alertmanager[Alertmanager]
        end

        VMAgent[VictoriaMetrics Agent]

        subgraph Storage["Storage"]
            VMShort[VictoriaMetrics<br/>Short-term]
            VMLong[VictoriaMetrics<br/>Long-term]
        end

        Grafana[Grafana<br/>2 replicas, HA]
        PostgreSQL[(PostgreSQL)]
        Zitadel[Zitadel]
    end

    Streamers -->|Push metrics| Telegraf
    Telegraf -->|remote_write| VMShort
    Telegraf -->|remote_write| VMLong

    Director -->|Scrape| VMAgent
    Kafka -->|Scrape| VMAgent
    Redis -->|Scrape| VMAgent
    Manager -->|Scrape| VMAgent
    Alertmanager -->|Scrape| VMAgent

    VMAgent -->|remote_write| VMShort
    VMAgent -->|remote_write| VMLong

    VMShort -->|Query| Grafana
    VMLong -->|Query| Grafana

    Grafana <-->|Shared state| PostgreSQL
    Grafana -->|OAuth2 / OIDC| Zitadel

Metrics Flow Summary:

  1. External metrics ingestion:

    • External clients (streamers) push metrics to Telegraf
    • Telegraf forwards metrics via remote_write to both VictoriaMetrics instances
  2. Internal metrics scraping:

    • VictoriaMetrics Agent (VMAgent) scrapes Prometheus endpoints from:
      • CDN Director instances
      • Kafka cluster
      • Redis
      • ACD Manager components
      • Alertmanager
    • VMAgent forwards scraped metrics via remote_write to both VictoriaMetrics instances
  3. Data visualization:

    • Grafana queries either VictoriaMetrics database, depending on what each dashboard requires
    • Operational dashboards use short-term storage
    • Billing and compliance dashboards use long-term storage

Metrics Collection

Application Metrics

Applications expose metrics on Prometheus-compatible endpoints. VictoriaMetrics Agent (VMAgent) scrapes these endpoints and forwards metrics to VictoriaMetrics via remote_write.

System Metrics

Telegraf collects system-level metrics including:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network statistics
  • Process metrics

Kubernetes Metrics

The following cluster-level metrics are collected:

  • Pod resource usage
  • Node status
  • Deployment status
  • Persistent volume usage

Metrics Retention

VictoriaMetrics is configured with default retention policies. For custom retention settings, modify the VictoriaMetrics configuration in your values.yaml:

acd-metrics:
  victoria-metrics-single:
    retentionPeriod: "3"  # Retention period in months
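
To sanity-check a retention value before applying it, the sketch below converts a retentionPeriod string into an approximate number of days. It assumes the value is passed through to VictoriaMetrics unchanged, that a bare number means months (as the comment above states), and that the optional suffixes h, d, w, and y are accepted. The 31-day month is an approximation chosen for illustration, and the function name is hypothetical.

```python
# Approximate day counts per unit. The 31-day month is an assumption
# made for this illustration, not a value taken from this guide.
UNIT_DAYS = {"h": 1 / 24, "d": 1, "w": 7, "y": 365}
MONTH_DAYS = 31


def retention_days(value: str) -> float:
    """Convert a retentionPeriod string such as "3" or "90d" to days."""
    value = value.strip()
    if not value:
        raise ValueError("empty retention period")
    suffix = value[-1]
    if suffix in UNIT_DAYS:
        return float(value[:-1]) * UNIT_DAYS[suffix]
    # No suffix: the number is interpreted as months.
    return float(value) * MONTH_DAYS


if __name__ == "__main__":
    for candidate in ("3", "90d", "1y"):
        print(candidate, "is about", retention_days(candidate), "days")
```

Comparing the result with the retention ranges listed in the Components table gives a quick check that a value suits the short-term or long-term instance.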

Troubleshooting

Metrics Not Appearing

If metrics are not appearing in Grafana:

  1. Check Telegraf pods:

    kubectl get pods -l app.kubernetes.io/component=telegraf
    
  2. Check Telegraf logs:

    kubectl logs -l app.kubernetes.io/component=telegraf
    
  3. Verify VictoriaMetrics is running:

    kubectl get pods -l app.kubernetes.io/component=victoria-metrics
    
  4. Check application metrics endpoints:

    kubectl exec <pod-name> -- curl localhost:8080/metrics
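
If the curl command in step 4 returns output, it can help to confirm that the response contains actual samples rather than only comment lines. The sketch below is a deliberately simplified reader for the Prometheus text exposition format. It ignores edge cases such as escaped characters in label values, and the function name and sample payload are illustrative only.

```python
def summarise_metrics(text: str) -> dict[str, int]:
    """Count samples per metric name in Prometheus text-format output."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    sample = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_cpu_seconds_total 12.5
"""
    print(summarise_metrics(sample))
```

An empty result for a pod that should be exporting metrics suggests the problem lies with the application endpoint rather than with Telegraf or VictoriaMetrics.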
    

For dashboard and authentication issues, see the Grafana Dashboards and Grafana Authentication & Roles guides.

Next Steps

After setting up monitoring:

  1. Grafana Authentication & Roles - Configure SSO and permissions before accessing Grafana
  2. Grafana Dashboards - Explore and customise dashboards
  3. Alerts & Alarms - Set up alerting and notifications
  4. Operations Guide - Day-to-day operational procedures
  5. Troubleshooting Guide - Resolve monitoring issues
  6. API Guide - Access metrics via API

1 - Grafana Authentication & Roles

Configuring Grafana authentication, roles, and permissions via Zitadel

Overview

Grafana authentication is delegated entirely to Zitadel via OAuth2/OIDC. Local username/password login is not available to end users. When a user logs into Grafana, they are redirected to Zitadel to authenticate, and their Grafana role is automatically determined by the Zitadel project roles assigned to their account.

The OIDC integration between Grafana and Zitadel is configured automatically at install time — no manual Zitadel application registration is required.

How It Works

During installation, an init container runs before Grafana starts and:

  1. Authenticates with Zitadel using a machine-account service key.
  2. Registers a Grafana OIDC application in the Zitadel project (or re-uses an existing one if already registered).
  3. Writes the resulting client_id and client_secret into a Kubernetes Secret, which Grafana picks up on startup.

This means the Grafana OIDC application in Zitadel is managed automatically and does not need to be created or modified manually.

Role Mapping

Grafana roles are mapped from Zitadel project roles using the following rule:

| Zitadel Project Role | Grafana Role |
|---|---|
| grafana_admin | Admin: full access, can manage users, datasources, and dashboards |
| (any other role, or no role) | Viewer: read-only access to dashboards |

Note: There is no Grafana Editor role mapped by default. All authenticated users who are not explicitly granted grafana_admin receive Viewer access. If you need an Editor tier, see Customising the Role Mapping.

The mapping is enforced on every login. If a user’s Zitadel role changes, the change takes effect the next time they log into Grafana.

Prerequisites

Accessing Grafana

Grafana is accessible at:

https://<manager-host>/grafana

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

To log in:

  1. Navigate to https://<manager-host>/grafana
  2. Click “Login with Zitadel”
  3. Authenticate with your Zitadel account credentials

Granting Admin Access

By default, all Zitadel users who log into Grafana receive Viewer access. To grant a user Admin access, assign them the grafana_admin project role in Zitadel.

Step 1: Ensure the grafana_admin Role Exists

  1. Log into the Zitadel Console at https://<manager-host>/ui/console
  2. Navigate to Projects and open the ZITADEL project
  3. Click the Roles tab
  4. Check whether a role named grafana_admin already exists
  5. If it does not exist, click New Role and create it:
    • Key: grafana_admin
    • Display Name: Grafana Admin (or any label you prefer)
    • Click Save

Step 2: Assign the Role to a User

  1. In the Zitadel Console, navigate to Users and open the user you want to grant admin access to
  2. Click the Authorizations tab
  3. Click New Authorization
  4. Select the ZITADEL project
  5. Select the grafana_admin role
  6. Click Save

The user will have Grafana Admin access the next time they log in.

Revoking Admin Access

To demote a user back to Viewer, remove the grafana_admin authorization from their account:

  1. In the Zitadel Console, open the user’s Authorizations tab
  2. Find the grafana_admin authorization on the ZITADEL project
  3. Click the delete icon to remove it

The change takes effect on their next Grafana login.

Customising the Role Mapping

The role mapping expression is configured in values.yaml under grafana."grafana.ini".auth.generic_oauth.role_attribute_path. It uses JMESPath syntax evaluated against the OIDC token’s role claims.

The default expression is:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin' || 'Viewer'
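
The effect of the default expression can be illustrated in plain Python. The sketch below is not how Grafana evaluates JMESPath; it only mirrors the outcome described in this guide, assuming the roles claim is an object keyed by role name. The function name and the sample claim values are hypothetical.

```python
ROLES_CLAIM = "urn:zitadel:iam:org:project:roles"


def grafana_role(claims: dict) -> str:
    """Mirror the default mapping: grafana_admin -> Admin, else Viewer."""
    roles = claims.get(ROLES_CLAIM)
    # A missing or malformed claim falls through to Viewer, matching the
    # behaviour this guide describes for users without the role.
    if isinstance(roles, dict) and "grafana_admin" in roles:
        return "Admin"
    return "Viewer"


if __name__ == "__main__":
    # Illustrative claim payloads; the inner values are placeholders.
    admin = {ROLES_CLAIM: {"grafana_admin": {"1234": "example.org"}}}
    other = {ROLES_CLAIM: {"some_other_role": {"1234": "example.org"}}}
    print(grafana_role(admin))  # Admin
    print(grafana_role(other))  # Viewer
    print(grafana_role({}))     # Viewer
```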

Example: Adding an Editor Tier

To map a grafana_editor Zitadel role to Grafana’s Editor role, create the grafana_editor role in Zitadel (following the same steps as above) and extend the expression:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_path: >-
        contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_admin') && 'Admin'
        || contains(keys("urn:zitadel:iam:org:project:roles"), 'grafana_editor') && 'Editor'
        || 'Viewer'

Apply the change using the standard upgrade procedure in the Configuration Guide.

Blocking Users Without an Assigned Role

By default, role_attribute_strict is set to false, which means any authenticated Zitadel user can log into Grafana as a Viewer even if they have no explicit Grafana role assigned. To restrict Grafana access to only users who have been explicitly granted a role, set this to true:

grafana:
  "grafana.ini":
    auth.generic_oauth:
      role_attribute_strict: true

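With role_attribute_strict: true, users for whom the role_attribute_path expression yields no valid role will be denied access entirely. Note that the default expression ends in a || 'Viewer' fallback, so it always yields a role; for strict mode to reject anyone, that fallback would likely need to be removed from the expression.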

Managing Users in Grafana

User accounts in Grafana are created automatically on first login via Zitadel. There is no need to pre-create users in the Grafana UI.

To view and manage users who have logged in:

  1. Log into Grafana as an Admin
  2. Navigate to Administration > Users and access > Users

From here you can see each user’s current role, last login time, and authentication provider. Always change roles in Zitadel (as described above) rather than directly in Grafana: role changes made in Grafana are overwritten on the user’s next login.

Break-Glass Admin Access

A local Grafana admin account is available as a break-glass fallback for situations where Zitadel is unavailable. This account is not accessible via the standard login page (which only shows the Zitadel SSO button).

To use the local admin account, navigate directly to:

https://<manager-host>/grafana/login

The default credentials are listed in the Glossary. Change the default password immediately after installation.

Security recommendation: The break-glass account should be used only for emergency access. Do not use it for routine administration.

Troubleshooting

OAuth2 Redirect URI Mismatch / CORS Errors

Grafana is registered in Zitadel with the redirect URI https://<manager-host>/grafana/login/generic_oauth, derived from the first entry of global.hosts.manager. Accessing Grafana via a different hostname or IP address will not match this URI and will cause the login to fail.

Resolution: Always access Grafana via the configured hostname. If the hostname has changed, re-run the helm upgrade to re-register the application with the updated URI.
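
To check quickly whether a URL will match the registered redirect URI, the derivation described above can be sketched as follows. This assumes the entries of global.hosts.manager are plain hostname strings, which this guide does not confirm; the function names and sample hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def expected_redirect_uri(manager_hosts: list[str]) -> str:
    """Build the redirect URI from the first global.hosts.manager entry."""
    if not manager_hosts:
        raise ValueError("global.hosts.manager is empty")
    return f"https://{manager_hosts[0]}/grafana/login/generic_oauth"


def host_matches(browser_url: str, manager_hosts: list[str]) -> bool:
    """Return True if browser_url uses the hostname Grafana was registered with."""
    registered = urlsplit(expected_redirect_uri(manager_hosts)).hostname
    return urlsplit(browser_url).hostname == registered


if __name__ == "__main__":
    hosts = ["cdn-manager.example.com", "alt.example.com"]  # hypothetical values
    print(expected_redirect_uri(hosts))
    print(host_matches("https://cdn-manager.example.com/grafana", hosts))  # True
    print(host_matches("https://192.0.2.10/grafana", hosts))               # False
```

Only the first list entry is considered, so a login attempted through the second hostname in the example would also fail the check.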

User Receives Viewer Instead of Admin

The grafana_admin role is not included in the user’s Zitadel token.

Resolution:

  1. Confirm the grafana_admin role exists on the ZITADEL project in the Zitadel Console
  2. Confirm the role is assigned to the user under their Authorizations tab
  3. Ask the user to log out of Grafana and log back in — role changes are applied on the next login, not the current session

Login Fails with “Role not found” or Access Denied

role_attribute_strict may be set to true while the user has no matching Zitadel role.

Resolution: Either assign the user an appropriate Zitadel project role, or set role_attribute_strict: false in values.yaml to allow all authenticated users Viewer access.

Admin Role Assigned in Zitadel but User Still Gets Viewer

The grafana_admin role is correctly assigned to the user in Zitadel, but Grafana still grants them Viewer access. This indicates that role claims are not being included in the Zitadel userinfo response.

Grafana determines roles by calling the Zitadel userinfo endpoint (/oidc/v1/userinfo) and evaluating the urn:zitadel:iam:org:project:roles claim. Zitadel only includes this claim when the Grafana OIDC application has Access Token Role Assertions enabled. If the claim is absent, the role_attribute_path expression always falls through to 'Viewer'.

To verify and fix:

  1. Log into the Zitadel Console at https://<manager-host>/ui/console
  2. Navigate to Projects > ZITADEL > Applications > Grafana
  3. Open the Token Settings tab
  4. Ensure Access Token Role Assertions is enabled
  5. Save the change

The fix takes effect on the user’s next login — no Grafana or Helm changes are required.

Grafana OIDC App Not Registered in Zitadel

If the init container failed during installation, the Grafana OIDC application may not have been created in Zitadel.

Resolution: Check the init container logs for errors:

kubectl logs -l app.kubernetes.io/component=grafana --previous -c zitadel-oauth-setup

Common causes are Zitadel not being ready when the init container ran, or a machine-key permission issue. Re-running the helm upgrade will re-trigger the init container and attempt registration again.

Next Steps

  1. Grafana Dashboards - Using and customising dashboards
  2. Alerts & Alarms - Configure alerting and notifications
  3. Metrics & Monitoring Overview - Return to the monitoring overview

2 - Grafana Dashboards

Using and customising Grafana dashboards

Overview

Grafana is the primary visualization platform for the CDN Manager monitoring stack. It provides pre-built dashboards for cluster health, application performance, and billing analytics, and is accessible via the manager ingress.

Prerequisites

  • Grafana is deployed and running (verify with kubectl get pods -l app.kubernetes.io/component=grafana)
  • A Zitadel user account is available for login
  • Grafana is accessed via the correct DNS hostname (see Grafana Authentication & Roles)

Accessing Grafana

Grafana is accessible via the manager ingress:

URL: https://<manager-host>/grafana

To log in:

  1. Navigate to https://<manager-host>/grafana
  2. Click the “Login with Zitadel” button
  3. Authenticate with your Zitadel account credentials

Important: Grafana must be accessed using the DNS name specified in the first entry of global.hosts.manager in your configuration. Accessing Grafana via an IP address or an alternative hostname will cause OAuth2 redirect URI mismatches and CORS errors, preventing login from completing successfully.

For details on authentication and role configuration, see Grafana Authentication & Roles.

Standard Dashboards

Accessing Dashboards

After logging into Grafana:

  1. Navigate to Dashboards in the left menu
  2. Browse the folder structure to find the dashboard you need
  3. Click on a dashboard to open it

Dashboards are organised into the following folders:

  • Alerting — alert state history and alerting system health
  • Billing — redirect counts for billing analytics
  • CDN Manager — ACD Manager API performance
  • Hardware — host-level CPU, memory, disk, and network telemetry
  • Infrastructure — Kubernetes cluster, Kafka, Longhorn, and Redis health
  • Streaming — CDN routing, streamer performance, and QoE
  • Internal Debugging — low-level ACD Director diagnostics

Alerting

Active Alarms

A live view of all currently firing alerts. Shows the alert name, severity, affected host, and description. Use this as the first stop when investigating an active incident.

Alert Statistics

Historical view of alert firing activity over time. Shows which alert groups and individual rules have been firing, with timelines and trend charts. Useful for identifying recurring or flapping alerts.

vmalert

Operational health dashboard for the vmalert component itself. Covers evaluation rate, evaluation errors, alerting and recording rule counts, remote write throughput, and resource usage. Use this to verify the alerting pipeline is functioning correctly.


Billing

Billing Dashboard

Tracks redirect volumes for billing and usage analytics. Shows initial managed and unmanaged redirects, segment redirects, and endpoint redirects — both as totals and as ratios over time. Data is sourced from long-term VictoriaMetrics storage to support historical reporting.


CDN Manager

CDN Manager API

Health and performance dashboard for the ACD Manager REST API. Covers:

  • Overview: API health status, active pod count, total request volume, 5xx error rate, and average latency
  • Traffic: Request rate by pod, distribution across API endpoints
  • Errors: 5xx errors per endpoint, response code breakdown per endpoint, error rate by pod
  • Latency: P99 and average latency by endpoint, overall API response latency
  • Resources & Auth: Route validation API activity

Hardware

HW Metrics

Condensed host hardware overview covering CPU usage and load averages, memory utilisation, network interface throughput, swap usage, and root filesystem disk space. Suitable for day-to-day health checks across all cluster nodes.

An expanded HW Metrics (Advanced) dashboard is available as part of the Advanced Dashboards licence.


Infrastructure

k3s Cluster Infrastructure

Kubernetes cluster health overview using node-exporter and kube-state-metrics. Covers:

  • Cluster Overview: Node count, running pod count, OOMKilled containers, and overall cluster health status
  • Compute: CPU usage, memory usage, and load average per node
  • Network: Inbound and outbound bytes per node
  • Disk: Read/write throughput and I/O pressure per node
  • Longhorn PVC Disk Usage: Usage percentage per persistent volume
  • Workload Health: Pod restart counts and OOMKill occurrences

Kafka

Kafka broker health using JMX exporter metrics. Covers:

  • Cluster Health: Active controller, broker state, topic and partition counts, offline and under-replicated partitions, active and fenced broker counts, metadata log lag
  • Throughput: Bytes in/out and messages in by topic, replication bytes in/out
  • Internals: Request handler idle percentage, network processor idle percentage

Longhorn Storage

Persistent storage health for the Longhorn distributed block storage layer. Covers:

  • Overview: Total, healthy, degraded, and faulted volume counts; nodes down
  • Capacity: Total cluster capacity, used, and available storage
  • Volume Detail: Usage percentage per volume, actual size per volume, volume robustness state, volumes approaching capacity (>85%)
  • Node & Disk: Disk usage percentage and available bytes per node, node condition checks

Redis

Redis instance health using redis-exporter metrics. Covers:

  • Instance Health: Status, uptime, connected and blocked clients, slow log length, rejected connections
  • Memory: Usage and fragmentation ratio
  • Throughput & Keyspace: Commands processed, network I/O, keyspace hit rate, keys per database
  • Evictions & Persistence: Evictions, expirations, RDB unsaved changes
  • CPU & Connections: CPU usage and connection metrics
  • Command Analysis: Per-command breakdown

Streaming

Extended Monitoring

The primary operational dashboard for CDN routing activity. This is the home dashboard displayed on Grafana login. Covers:

  • Latency Statistics: ACD router latency and CDN latency over time
  • Redirects: Total redirect volume, status code breakdown, managed vs unmanaged ratio
  • Content Popularity: Top 10 requested content and top 10 most rapidly increasing popularity scores
  • CDN Selection: Redirect distribution across CDN endpoints, current and historical ratios
  • CDN Failovers and Retries: Failover events and retry rates by CDN
  • Host Selection: Endpoint request distribution
  • Session Statistics: Active session counts and session type breakdown
  • Client Responses: Client-facing HTTP status code distribution
  • Incoming Requests: Raw request volume
  • HTTPS Certificate Statistics: Certificate validity and expiry indicators
  • Warnings & Errors: Application-level warnings and errors over time
  • LUA Statistics: Lua exception counts and execution time
  • Configuration Change History: Timeline of routing configuration changes

Router Monitoring

External-facing view of ACD Director routing activity. Shows the number of initial routing decisions made, HTTP status code distribution, incoming HTTP/HTTPS request volumes, and selection input metrics. Useful for a high-level view of traffic hitting the directors.

QoE Monitoring

Quality of Experience scoring dashboard. Shows average QoE scores per host, per session group, per CDN, and per agent, as well as the initial CDN selection rate. Use this to identify CDN providers or content hosts that are delivering a degraded experience.

Streamer Statistics

Condensed view of streamer node performance, covering network ingress/egress throughput, TCP and HTTP connection counts, active session counts, HTTP request rates and response codes, response times (ingress and egress), and storage/memory/CPU. Suitable for routine streamer health monitoring.

An expanded Streamer Statistics (Advanced) dashboard is available as part of the Advanced Dashboards licence.


Internal Debugging

These dashboards expose low-level ACD Director internals and are primarily intended for advanced diagnostics and support investigations.

Debugging Information

Lua runtime statistics from the ACD Director: exception counts, active Lua context count, time spent in Lua execution, and router latency. Use when investigating unexpected Director behaviour or Lua errors.

ACD: Incoming Internet Connections

SSL-level connection statistics at the Director: SSL warnings and errors, valid and invalid HTTP and HTTPS request counts from external clients. Use when investigating TLS handshake failures or unexpected rejection rates.

Performance Metrics

ACD Director process-level resource usage: router CPU utilisation, router memory usage, and Lua memory consumption. Useful for identifying resource pressure on the Director process itself rather than the host.

Prometheus: ACD

ACD application metrics exposed via Prometheus: active and total session counts, session type breakdown, managed and unmanaged redirect counts, QoE corrections, manifest parse failures, initial endpoint request counts, HTTP request rates, and logged warnings and errors.

CDN Failures

CDN-level failure tracking: response code distribution from CDN backends, CDN-level failover events, host-level failovers, and host retry counts. Use when investigating CDN reliability issues or failover behaviour.

ACD: CDN Latencies Detail

Detailed CDN latency analysis with configurable percentile plots and a full latency histogram. Use when investigating tail latency issues on specific CDN backends.

ACD: Router Latencies

ACD Director routing latency distributions for both 2xx (successful) and 3xx (redirect) responses, visualised as heatmap buckets over time. Use alongside CDN Latencies Detail to separate router processing time from CDN response time.

Prometheus/ACD: SubRunners

Internal async processing queue depth and throughput metrics for the ACD Director subrunner system: client connection counts, low/medium/high priority queue depths (current and max), send/receive data block usage, wakeup counts, overload events, and autopause activations. Use when investigating Director throughput bottlenecks or queue backpressure.

Advanced Dashboards

Advanced dashboards are a paid add-on that unlocks more detailed variants of two standard dashboards, providing deeper visibility for performance investigation and capacity planning.

Licensing: Advanced dashboards require a separate licence key. To obtain a key for your deployment, contact your AgileTV account representative.

Enabling Advanced Dashboards

Once you have your licence key, add the following to your values.yaml:

dashboards:
  advanced:
    licenceKey: "<your-licence-key>"

Then apply the change by following the upgrade procedure in the Configuration Guide. The advanced dashboards will become available in Grafana automatically once the upgrade completes.

HW Metrics (Advanced)

Expanded hardware telemetry that builds on the standard HW Metrics dashboard with greater depth and extra sections: kernel metrics, TCP and UDP network stack statistics, per-interface error counters, disk IOPS, and metrics collection velocity. Use this when investigating performance anomalies surfaced by the standard dashboard or by alerts.

Streamer Statistics (Advanced)

Full streamer telemetry that supplements the standard Streamer Statistics dashboard. Includes all standard metrics plus:

  • OTT JCQ: Server group and node request rates and ratios, circuit breaker state (closed/open backends), global backend stats, pending requests per backend, and pop-out per backend
  • Account Records: Session counts, total traffic in/out, HTTP request rates (ingress/egress), cache hit ratio, backend request rates, and response times
  • Detailed network error and drop counters from /proc/net/dev

CDN Director Metrics

Director DNS Names in Grafana

CDN Director instances are identified in Grafana by their DNS name, which is derived from the name field in global.hosts.routers:

global:
  hosts:
    routers:
      - name: my-router-1
        address: 192.0.2.1

The DNS name used in Grafana dashboards will be: my-router-1.external

This naming convention is automatically applied for all configured directors.
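
As a small illustration of the convention, the sketch below derives the Grafana names from a parsed global.hosts.routers list. The function name and the second router entry are hypothetical.

```python
def director_dns_names(routers: list[dict]) -> list[str]:
    """Derive the Grafana display name for each global.hosts.routers entry."""
    return [f"{router['name']}.external" for router in routers]


if __name__ == "__main__":
    routers = [
        {"name": "my-router-1", "address": "192.0.2.1"},
        {"name": "my-router-2", "address": "192.0.2.2"},  # hypothetical second entry
    ]
    print(director_dns_names(routers))
    # ['my-router-1.external', 'my-router-2.external']
```

These are the values to expect in dashboard host selectors when filtering by Director.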

Customising Dashboards

Permissions: Creating and importing dashboards requires Grafana Admin access. See Grafana Authentication & Roles for details on granting admin rights.

The pre-provisioned dashboards are read-only and managed by the Helm chart — changes made to them in the Grafana UI will not persist across upgrades. To create persistent custom dashboards:

  1. In Grafana, navigate to Dashboards > New > New Dashboard
  2. Add panels using the VictoriaMetrics or Prometheus datasource
  3. Save the dashboard to a folder of your choice

Custom dashboards saved this way are stored in the Grafana PostgreSQL database and are unaffected by Helm upgrades.

Note: Do not save custom dashboards into the provisioned folders (Alerting, Billing, CDN Manager, etc.). Grafana marks these folders as provisioned and may behave unexpectedly if user dashboards are mixed in.

Customising a Pre-provisioned Dashboard

If you want a modified version of one of the built-in dashboards as a starting point:

  1. Open the dashboard you want to customise
  2. Click the Share icon (top toolbar) > Export > Save to file to download the dashboard JSON
  3. In Grafana, navigate to Dashboards > New > Import
  4. Upload the downloaded JSON file
  5. On the import screen, give the dashboard a new name to distinguish it from the original, and choose a destination folder outside the provisioned set
  6. Click Import

You now have an independently editable copy. The original provisioned dashboard remains unchanged and will continue to be updated by future Helm upgrades. Your copy is stored in PostgreSQL and persists across upgrades independently.

Troubleshooting

Dashboard Loading Issues

If dashboards fail to load:

  1. Check Grafana pods:

    kubectl get pods -l app.kubernetes.io/component=grafana
    
  2. Review Grafana logs:

    kubectl logs -l app.kubernetes.io/component=grafana
    
  3. Verify datasource configuration in Grafana UI

For login and authentication issues, see Grafana Authentication & Roles.

Next Steps

  1. Alerts & Alarms - Set up alerting and notifications
  2. Operations Guide - Day-to-day operational procedures
  3. Metrics & Monitoring Overview - Return to the monitoring overview

3 - Alerts & Alarms

Configuring and managing alerts and alarms

Overview

The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.

This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.

Alert Severity Levels

| Severity | Meaning |
|---|---|
| critical | Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic. |
| warning | Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended. |

Alert Groups

Alerts are organised into the following groups, each evaluated on a 15-second interval.
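
Each rule below lists a "Must persist for" duration in addition to its condition. The sketch below is a simplified model of how such a duration interacts with the 15-second evaluation interval, assuming vmalert follows Prometheus-style semantics in which an alert stays pending until its condition has held continuously for the full duration. The function name and sample values are illustrative.

```python
def alert_states(values, threshold, for_seconds, interval=15):
    """Return the alert state after each evaluation of a threshold rule."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, value in enumerate(values):
        now = index * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation at or below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Root filesystem at 90% for nine consecutive evaluations (85% threshold,
    # 2-minute persistence): eight pending evaluations, then firing.
    print(alert_states([90] * 9, threshold=85, for_seconds=120))
```

Under this model a brief spike that clears within the persistence window never reaches Alertmanager, which is why short-lived bursts do not produce notifications.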


infra-disk

Monitors disk space utilisation and I/O latency on cluster nodes.

StorageFillingUp

| Property | Value |
|---|---|
| Severity | warning |
| Condition | Root filesystem usage exceeds 85% |
| Must persist for | 2 minutes |

What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.

Recommended actions:

  1. Identify the node from the host label in the alert.
  2. Log into the node and check disk usage:
    df -h /
    du -sh /var/log/* | sort -rh | head -20
    
  3. Clear old log files, unused container images, or temporary files:
    # On the node
    journalctl --vacuum-size=500M
    crictl rmi --prune
    
  4. If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.
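
For a quick check on the node itself, the sketch below compares root filesystem usage with the 85% threshold. The exact formula used by the alert rule is not stated in this guide, so used space divided by total capacity is an assumption, and the function name is hypothetical.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # StorageFillingUp condition from the table above


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    percent = used_percent(usage.total, usage.used)
    status = "above" if percent > THRESHOLD_PERCENT else "at or below"
    print(f"/ is {percent:.1f}% full ({status} the {THRESHOLD_PERCENT:.0f}% threshold)")
```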

HighDiskLatency

| Property | Value |
|---|---|
| Severity | warning |
| Condition | Average disk write latency exceeds 100 ms |
| Must persist for | 2 minutes |

What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.

Recommended actions:

  1. Identify the affected disk from the name label in the alert.
  2. Check for I/O-intensive processes on the node:
    iostat -x 2 5
    iotop -o
    
  3. Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
  4. If the issue persists on a production node, review whether the storage hardware meets the System Requirements.

infra-compute

Monitors CPU and memory utilisation on cluster nodes.

CpuSaturation

| Property | Value |
|---|---|
| Severity | warning |
| Condition | Total CPU usage exceeds 90% |
| Must persist for | 5 minutes |

What it means: A node is running at near-full CPU capacity. Sustained CPU saturation causes request latency increases across all workloads on that node and may result in pod throttling.

Recommended actions:

  1. Identify the saturated node from the host label in the alert.
  2. Check which pods are consuming CPU:
    kubectl top pods --sort-by=cpu -A
    
  3. Check for runaway processes on the node:
    top -b -n 1 | head -20
    
  4. If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.

MemoryCriticallyLow

| Property | Value |
|---|---|
| Severity | critical |
| Condition | Available RAM falls below 10% |
| Must persist for | 2 minutes |

What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.

Recommended actions:

  1. Identify the affected node from the host label in the alert.
  2. Immediately check for memory-leaking or oversized pods:
    kubectl top pods --sort-by=memory -A
    
  3. Identify and restart any pods showing abnormal memory consumption:
    kubectl rollout restart deployment/<name>
    
  4. Check the kernel log for processes the OOM killer has already terminated:
    dmesg | grep -i "oom\|killed"
    
  5. Review memory resource limits and requests for affected deployments and adjust if necessary.

SwapUsageDetected

Property | Value
Severity | warning
Condition | Swap usage exceeds 5%
Must persist for | 1 minute

What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.

Recommended actions:

  1. Treat this as an early warning for the same conditions as MemoryCriticallyLow.
  2. Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
  3. Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.

infra-network

Monitors network interface errors and traffic anomalies on cluster nodes.

NetworkInterfaceErrors

Property | Value
Severity | critical
Condition | Any non-zero rate of inbound or outbound packet errors on a network interface
Must persist for | 1 minute

What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.

Recommended actions:

  1. Identify the affected host and interface from the host and interface labels in the alert.
  2. Check interface error counters on the node:
    ip -s link show <interface>
    ethtool -S <interface> | grep -i error
    
  3. Check for duplex/speed mismatches between the node NIC and the upstream switch:
    ethtool <interface> | grep -E "Speed|Duplex"
    
  4. Escalate to network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.

SuddenNetworkEgressDrop

Property | Value
Severity | critical
Condition | Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s
Must persist for | 1 minute

What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, a link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.

Recommended actions:

  1. Identify the affected node and interface from the alert labels.
  2. Verify the node’s network connectivity:
    ping <gateway-ip>
    traceroute <upstream-endpoint>
    
  3. Check for interface errors or link-down events:
    ip link show
    dmesg | grep -i "link\|eth\|nic"
    
  4. Verify that upstream routing and firewall rules have not changed.
  5. If the node is healthy and the drop is expected and understood (e.g. a planned CDN traffic shift), silence the alert. See Silencing Alerts.
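
The condition pairs a relative drop with an absolute floor, presumably so that near-idle interfaces do not trigger it. A sketch of how such a condition could be written as a custom rule in the format described under Adding Custom Alert Rules, assuming the Telegraf net metric net_bytes_sent (a byte counter) and hypothetical group and alert names. The baseline calculation in the shipped rule may differ.

```yaml
victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          - name: custom-network            # hypothetical group name
            rules:
              - alert: CustomEgressDrop     # hypothetical alert name
                # 125000 bytes/s equals 1 Mbit/s (1,000,000 bits / 8).
                # The subquery averages the 1-minute send rate over 2 minutes as the baseline.
                expr: |
                  rate(net_bytes_sent[1m])
                    < 0.5 * avg_over_time(rate(net_bytes_sent[1m])[2m:])
                  and
                  avg_over_time(rate(net_bytes_sent[1m])[2m:]) > 125000
                for: 1m
                labels:
                  severity: critical
                annotations:
                  summary: "Egress drop on {{ $labels.interface }} ({{ $labels.host }})"
```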

SuddenNetworkIngressSpike

Property | Value
Severity | warning
Condition | Ingress throughput exceeds twice the 5-minute baseline
Must persist for | 1 minute

What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.

Recommended actions:

  1. Identify the affected node and interface from the alert labels.
  2. Review active connections and top talkers:
    ss -s
    netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
    
  3. Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
  4. If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.

longhorn

Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.

Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).

LonghornVolumeDegraded

Property | Value
Severity | warning
Condition | A Longhorn volume’s robustness state is Degraded
Must persist for | 2 minutes

What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.

Recommended actions:

  1. Identify the affected volume from the volume label in the alert.
  2. Open the Longhorn UI and inspect the volume’s replica status.
  3. Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
  4. If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
  5. Investigate the health of the node that hosted the failed replica:
    kubectl get nodes
    kubectl describe node <node-name>
    

LonghornVolumeFaulted

Property | Value
Severity | critical
Condition | A Longhorn volume’s robustness state is Faulted
Must persist for | 1 minute

What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.

Recommended actions:

  1. Identify the affected volume from the volume label.
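  2. Immediately check which pods are using the volume. Pod listings do not include volume names, so first find the claim bound to the volume (the Longhorn volume name normally matches the PersistentVolume name), then list the pods that mount it:
    kubectl get pvc -A | grep -i <volume-name>
    kubectl describe pvc <pvc-name> -n <namespace> | grep -A3 "Used By"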
    
  3. Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
  4. Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
  5. Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.
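
Longhorn exposes volume robustness as a numeric gauge. According to the Longhorn metrics documentation, longhorn_volume_robustness takes the values 0 (unknown), 1 (healthy), 2 (degraded) and 3 (faulted), so the Degraded and Faulted conditions can be matched on the value. A minimal sketch in the format described under Adding Custom Alert Rules, with hypothetical group and alert names; these are not the shipped rules.

```yaml
victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          - name: custom-longhorn             # hypothetical group name
            rules:
              - alert: CustomVolumeDegraded   # hypothetical alert name
                expr: longhorn_volume_robustness == 2   # 2 = degraded
                for: 2m
                labels:
                  severity: warning
                annotations:
                  summary: "Longhorn volume {{ $labels.volume }} is degraded"
              - alert: CustomVolumeFaulted    # hypothetical alert name
                expr: longhorn_volume_robustness == 3   # 3 = faulted
                for: 1m
                labels:
                  severity: critical
                annotations:
                  summary: "Longhorn volume {{ $labels.volume }} is faulted"
```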

LonghornNodeDown

Property | Value
Severity | critical
Condition | A Longhorn node reports a non-ready state
Must persist for | 2 minutes

What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.

Recommended actions:

  1. Identify the affected node from the node label in the alert.
  2. Check the node’s status in Kubernetes:
    kubectl get nodes
    kubectl describe node <node-name>
    
  3. Attempt to SSH to the node and check system health:
    ssh root@<node-ip>
    systemctl status k3s
    
  4. If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes — but only if the remaining healthy nodes have sufficient capacity.

LonghornDiskSpaceLow

Property | Value
Severity | warning
Condition | Available Longhorn disk space on a node falls below 15%
Must persist for | 2 minutes

What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.

Recommended actions:

  1. Identify the affected node and disk from the node and disk labels in the alert.
  2. Open the Longhorn UI and check which volumes have replicas on this disk.
  3. Check for snapshots or backups that can be cleaned up to reclaim space.
  4. If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
  5. Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.
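
Available space is a ratio of two Longhorn gauges. The sketch below shows how an earlier or stricter threshold could be expressed as a custom rule in the format described under Adding Custom Alert Rules. It assumes the Longhorn metrics longhorn_disk_capacity_bytes and longhorn_disk_usage_bytes; the group and alert names are hypothetical, and the shipped rule may be calculated differently.

```yaml
victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          - name: custom-longhorn-capacity    # hypothetical group name
            rules:
              - alert: CustomLonghornDiskLow  # hypothetical alert name
                # Percentage of the disk still available: (capacity - usage) / capacity * 100.
                expr: |
                  (longhorn_disk_capacity_bytes - longhorn_disk_usage_bytes)
                    / longhorn_disk_capacity_bytes * 100 < 15
                for: 2m
                labels:
                  severity: warning
                annotations:
                  summary: "Low Longhorn disk space on {{ $labels.node }} ({{ $labels.disk }})"
```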

Adding Custom Alert Rules

Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.

Example: Adding a Custom Alert

The following example adds an alert group that fires when Kafka consumer lag exceeds a threshold:

victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."

Apply the change using the standard upgrade procedure in the Configuration Guide.

Rule Fields Reference

Field | Required | Description
alert | Yes | Alert name. Must be unique within the group.
expr | Yes | PromQL expression. The alert fires for every time series the expression returns, so include the threshold comparison in the expression itself.
for | No | How long the condition must hold before the alert fires. If omitted, the alert fires as soon as the condition is first met.
labels.severity | Recommended | Set to critical or warning to match the built-in routing rules.
annotations.summary | Recommended | Short human-readable description. Supports Go template labels (e.g. {{ $labels.host }}).
annotations.description | Recommended | Detailed description with context for the on-call operator.

Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.


Configuring Alert Routes

By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.

Route Structure

The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:

alertmanager:
  config:
    route:
      receiver: 'null'          # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s           # Wait before sending first notification for a new group
      group_interval: 10s       # Wait before sending updated notifications for a group
      repeat_interval: 1h       # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'

Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
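
Because the first match wins, more specific routes belong above more general ones. A sketch that sends one named alert to email and every other critical alert to Slack. The receiver names reuse those from the examples in this guide and must be defined under receivers.

```yaml
alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        # Specific route first: matched alerts stop here because continue is not set.
        - matchers:
            - alertname="LonghornVolumeFaulted"
          receiver: 'email-critical'
        # General route second: all remaining critical alerts.
        - matchers:
            - severity="critical"
          receiver: 'slack'
```

If the two routes were swapped, LonghornVolumeFaulted would match the severity route first and never reach the email receiver.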


Notification Channels

Email

Email requires an SMTP server to be configured globally. Critical and warning receivers can be defined independently.

alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
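
Many SMTP relays on port 587 also require authentication. Alertmanager's global section accepts smtp_auth_username and smtp_auth_password for this. A sketch with placeholder credentials; the account name and password are illustrative, and how secrets should be supplied to this deployment is not covered here.

```yaml
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
      smtp_auth_username: 'alertmanager@example.com'   # placeholder account
      smtp_auth_password: 'change-me'                  # placeholder; avoid committing real secrets
```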

Slack

Requires an incoming webhook URL created in your Slack workspace.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
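
In Alertmanager templates, CommonLabels contains only the labels shared by every alert in a notification group. When alerts are grouped by alertname alone, the Host line is empty whenever several hosts fire the same alert together. One option, sketched here, is to group by host as well, at the cost of more individual notifications:

```yaml
alertmanager:
  config:
    route:
      group_by: ['alertname', 'host']   # one notification group per alert name and host
```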

Telegram

Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
                {{ .Annotations.description }}
              {{ end }}

Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.


Combining Multiple Receivers

Routes and receivers can be combined to send different alert severities to different channels simultaneously. For example, critical alerts to both Slack and email, warnings to email only:

alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true        # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.


Silencing Alerts

Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.

Silences are managed via the Alertmanager UI, accessible at:

https://<manager-host>/alertmanager

Creating a Silence

  1. Navigate to the Alertmanager UI and click Silences in the top navigation.
  2. Click Create Silence.
  3. Set the Start and End times for the silence window.
  4. Add one or more matchers to scope which alerts are suppressed. For example:
    • alertname = StorageFillingUp — silence a specific alert
    • severity = warning — silence all warnings
    • host = node-01 — silence all alerts from a specific host
  5. Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
  6. Click Create. The silence takes effect immediately.

Expiring a Silence

Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.

Next Steps

  1. Operations Guide - Day-to-day operational procedures
  2. Troubleshooting Guide - Resolve underlying issues surfaced by alerts
  3. Metrics & Monitoring Overview - Return to the monitoring overview