Alerts & Alarms

Configuring and managing alerts and alarms

Overview

The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.

This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.

Alert Severity Levels

Severity	Meaning
critical	Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic.
warning	Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended.

Alert Groups

Alerts are organised into the following groups, each evaluated on a 15-second interval.

infra-disk — Disk space and I/O
infra-compute — CPU and memory
infra-network — Network errors and traffic anomalies
longhorn — Persistent storage health

infra-disk

Monitors disk space utilisation and I/O latency on cluster nodes.

StorageFillingUp

Property	Value
Severity	warning
Condition	Root filesystem usage exceeds 85%
Must persist for	2 minutes

What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.

Recommended actions:

Identify the node from the host label in the alert.

Log into the node and check disk usage:

df -h /
du -sh /var/log/* | sort -rh | head -20

Clear old log files, unused container images, or temporary files:

# On the node
journalctl --vacuum-size=500M
crictl rmi --prune

If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.
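The 85% condition can also be checked directly on a node. The following is a minimal sketch, not the rule that vmalert evaluates: the names `usage_percent`, `root_usage_percent`, and `THRESHOLD_PERCENT` are illustrative, and the built-in rule may derive its percentage slightly differently.

```python
import shutil

# Assumption: mirrors the 85% StorageFillingUp condition described above.
THRESHOLD_PERCENT = 85.0


def usage_percent(used, total):
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def root_usage_percent(path="/"):
    """Measure filesystem usage for the given mount point."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total)


if __name__ == "__main__":
    pct = root_usage_percent()
    state = "above" if pct > THRESHOLD_PERCENT else "at or below"
    print(f"/ is {pct:.1f}% full ({state} the {THRESHOLD_PERCENT:.0f}% threshold)")
```

Running this before and after a cleanup gives a quick confirmation that enough space was reclaimed for the alert to resolve.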

HighDiskLatency

Property	Value
Severity	warning
Condition	Average disk write latency exceeds 100 ms
Must persist for	2 minutes

What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.

Recommended actions:

Identify the affected disk from the name label in the alert.
Check for I/O-intensive processes on the node:
```
iostat -x 2 5
iotop -o
```
Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
If the issue persists on a production node, review whether the storage hardware meets the System Requirements.
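To make "average write latency" concrete, the sketch below derives it from two snapshots of `/proc/diskstats`, in the same way `iostat` computes write await: milliseconds spent writing divided by writes completed over the interval. This illustrates the calculation only. The function names are hypothetical, and the metric used by the built-in rule may be collected differently.

```python
def parse_diskstats(text):
    """Map device name -> (writes completed, milliseconds spent writing)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue
        # Layout: major minor name, then counters. Index 7 is writes
        # completed and index 10 is time spent writing in milliseconds.
        stats[fields[2]] = (int(fields[7]), int(fields[10]))
    return stats


def avg_write_latency_ms(before, after, device):
    """Average latency per write between two /proc/diskstats snapshots."""
    writes_before, ms_before = parse_diskstats(before)[device]
    writes_after, ms_after = parse_diskstats(after)[device]
    writes = writes_after - writes_before
    return (ms_after - ms_before) / writes if writes > 0 else 0.0


if __name__ == "__main__":
    # Illustrative snapshots; on a node, read /proc/diskstats twice instead.
    before = "   8       0 sda 1000 0 8000 500 2000 0 16000 3000 0 1500 3500"
    after = "   8       0 sda 1000 0 8000 500 2100 0 17000 15000 0 2500 15500"
    print(avg_write_latency_ms(before, after, "sda"))
```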

infra-compute

Monitors CPU and memory utilisation on cluster nodes.

CpuSaturation

Property	Value
Severity	warning
Condition	Total CPU usage exceeds 90%
Must persist for	5 minutes

What it means: A node is running at near-full CPU capacity. Sustained CPU saturation increases request latency for every workload on that node and may lead to CPU throttling of pods.

Recommended actions:

Identify the saturated node from the host label in the alert.
Check which pods are consuming CPU:
```
kubectl top pods --sort-by=cpu -A
```
Check for runaway processes on the node:
```
top -b -n 1 | head -20
```
If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.
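For reference, total CPU usage is typically derived from the aggregate `cpu` line of `/proc/stat`, sampled twice. The sketch below shows that calculation under one assumption that may not match the built-in rule: time spent in `idle` and `iowait` counts as unused. The function names are illustrative.

```python
def cpu_times(stat_text):
    """Return (total jiffies, idle jiffies) from the aggregate cpu line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # user nice system idle iowait irq softirq steal
            values = [int(v) for v in line.split()[1:9]]
            # Assumption: iowait is treated as idle time.
            idle = values[3] + values[4]
            return sum(values), idle
    raise ValueError("no aggregate cpu line found")


def cpu_usage_percent(before, after):
    """CPU usage between two /proc/stat snapshots, as a percentage."""
    total_before, idle_before = cpu_times(before)
    total_after, idle_after = cpu_times(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    busy = elapsed - (idle_after - idle_before)
    return 100.0 * busy / elapsed


if __name__ == "__main__":
    # Illustrative snapshots; on a node, read /proc/stat twice instead.
    before = "cpu  100 0 100 800 0 0 0 0 0 0"
    after = "cpu  1000 0 100 900 0 0 0 0 0 0"
    print(cpu_usage_percent(before, after))
```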

MemoryCriticallyLow

Property	Value
Severity	critical
Condition	Available RAM falls below 10%
Must persist for	2 minutes

What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.

Recommended actions:

Identify the affected node from the host label in the alert.
Immediately check for memory-leaking or oversized pods:
```
kubectl top pods --sort-by=memory -A
```
Identify and restart any pods showing abnormal memory consumption:
```
kubectl rollout restart deployment/<name>
```
Check the kernel log for processes the OOM killer has already terminated:
```
dmesg | grep -i "oom\|killed"
```
Review memory resource limits and requests for affected deployments and adjust if necessary.
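The "available RAM" percentage can be approximated on the node from `/proc/meminfo`. This is a minimal sketch assuming the condition corresponds to `MemAvailable` relative to `MemTotal`; the built-in rule may use a different metric, and the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def available_percent(meminfo_text):
    """Available memory as a percentage of total memory."""
    info = parse_meminfo(meminfo_text)
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__":
    # Illustrative sample; on a node, read /proc/meminfo instead.
    sample = (
        "MemTotal:       16000000 kB\n"
        "MemFree:          500000 kB\n"
        "MemAvailable:    1200000 kB\n"
    )
    print(available_percent(sample))
```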

SwapUsageDetected

Property	Value
Severity	warning
Condition	Swap usage exceeds 5%
Must persist for	1 minute

What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.

Recommended actions:

Treat this as an early warning for the same conditions as MemoryCriticallyLow.
Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.

infra-network

Monitors network interface errors and traffic anomalies on cluster nodes.

NetworkInterfaceErrors

Property	Value
Severity	critical
Condition	Any non-zero rate of inbound or outbound packet errors on a network interface
Must persist for	1 minute

What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.

Recommended actions:

Identify the affected host and interface from the host and interface labels in the alert.

Check interface error counters on the node:

ip -s link show <interface>
ethtool -S <interface> | grep -i error

Check for duplex/speed mismatches between the node NIC and the upstream switch:
```
ethtool <interface> | grep -E "Speed|Duplex"
```
Escalate to network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.

SuddenNetworkEgressDrop

Property	Value
Severity	critical
Condition	Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s
Must persist for	1 minute

What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, a link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.

Recommended actions:

Identify the affected node and interface from the alert labels.

Verify the node’s network connectivity:

ping <gateway-ip>
traceroute <upstream-endpoint>

Check for interface errors or link-down events:

ip link show
dmesg | grep -i "link\|eth\|nic"

Verify that upstream routing and firewall rules have not changed.
If the node is healthy and the drop is expected and understood (e.g. a planned CDN traffic shift), silence the alert for the duration of the event.
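The SuddenNetworkEgressDrop condition combines a relative drop with a minimum-traffic guard, so idle interfaces do not trigger it. The sketch below restates that logic in code. It is an illustration of the documented thresholds only, not the expression vmalert evaluates, and the names are hypothetical.

```python
# Assumptions taken from the condition described above.
MIN_BASELINE_BITS_PER_SEC = 1_000_000  # baseline must be above 1 Mbit/s
DROP_RATIO = 0.5                       # current traffic below 50% of baseline


def egress_dropped(current_bps, baseline_bps):
    """Return True when egress has fallen below half of an active baseline."""
    if baseline_bps <= MIN_BASELINE_BITS_PER_SEC:
        # Too little traffic for a drop to be meaningful.
        return False
    return current_bps < DROP_RATIO * baseline_bps


if __name__ == "__main__":
    print(egress_dropped(40_000_000, 100_000_000))  # large drop on a busy link
    print(egress_dropped(100, 500_000))             # near-idle link, ignored
```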

SuddenNetworkIngressSpike

Property	Value
Severity	warning
Condition	Ingress throughput exceeds twice the 5-minute baseline
Must persist for	1 minute

What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.

Recommended actions:

Identify the affected node and interface from the alert labels.

Review active connections and top talkers:

ss -s
netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.
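The `netstat` pipeline above splits addresses on the first colon, which truncates IPv6 addresses. As an alternative, the following sketch counts remote hosts from saved `netstat -an` output and splits on the last colon instead. It assumes Linux-style output in which the foreign address is the fifth column; `top_talkers` is an illustrative name.

```python
from collections import Counter


def top_talkers(netstat_output, limit=10):
    """Count connections per remote host in `netstat -an` style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Connection rows: proto recv-q send-q local-address foreign-address ...
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        host, sep, port = fields[4].rpartition(":")
        # Skip listening sockets, whose foreign port is "*".
        if sep and port != "*":
            counts[host] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Illustrative sample using documentation address ranges.
    sample = """\
tcp   0  0 10.0.0.5:443  203.0.113.7:51234   ESTABLISHED
tcp   0  0 10.0.0.5:443  203.0.113.7:51235   ESTABLISHED
tcp   0  0 10.0.0.5:443  198.51.100.2:40000  ESTABLISHED
tcp   0  0 0.0.0.0:22    0.0.0.0:*           LISTEN
"""
    for host, count in top_talkers(sample):
        print(count, host)
```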

longhorn

Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.

Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).

LonghornVolumeDegraded

Property	Value
Severity	warning
Condition	A Longhorn volume’s robustness state is `Degraded`
Must persist for	2 minutes

What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.

Recommended actions:

Identify the affected volume from the volume label in the alert.
Open the Longhorn UI and inspect the volume’s replica status.
Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
Investigate the health of the node that hosted the failed replica:
```
kubectl get nodes
kubectl describe node <node-name>
```

LonghornVolumeFaulted

Property	Value
Severity	critical
Condition	A Longhorn volume’s robustness state is `Faulted`
Must persist for	1 minute

What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.

Recommended actions:

Identify the affected volume from the volume label.

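Immediately check which pods are using the volume. Pod listings do not include volume names, so first find the PersistentVolumeClaim bound to the volume (this assumes the Longhorn volume name matches the PersistentVolume name, as it does for dynamically provisioned volumes), then inspect that claim:

kubectl get pvc -A | grep -i <volume-name>
kubectl describe pvc <pvc-name> -n <namespace>   # lists the pods mounting the claim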

Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.

LonghornNodeDown

Property	Value
Severity	critical
Condition	A Longhorn node reports a non-ready state
Must persist for	2 minutes

What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.

Recommended actions:

Identify the affected node from the node label in the alert.

Check the node’s status in Kubernetes:

kubectl get nodes
kubectl describe node <node-name>

Attempt to SSH to the node and check system health:
```
ssh root@<node-ip>
systemctl status k3s
```
If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes — but only if the remaining healthy nodes have sufficient capacity.

LonghornDiskSpaceLow

Property	Value
Severity	warning
Condition	Available Longhorn disk space on a node falls below 15%
Must persist for	2 minutes

What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.

Recommended actions:

Identify the affected node and disk from the node and disk labels in the alert.
Open the Longhorn UI and check which volumes have replicas on this disk.
Check for snapshots or backups that can be cleaned up to reclaim space.
If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.

Adding Custom Alert Rules

Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.

Example: Adding a Custom Alert

The following example adds an alert group that fires when a Kafka consumer lag exceeds a threshold:

victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."

Apply the change using the standard upgrade procedure in the Configuration Guide.

Rule Fields Reference

Field	Required	Description
`alert`	Yes	Alert name. Must be unique within the group.
`expr`	Yes	PromQL expression. An alert is raised for each time series the expression returns; an empty result raises none.
`for`	No	How long the condition must hold continuously before the alert fires. If omitted, the alert fires on the first evaluation in which the condition holds.
`labels.severity`	Recommended	Set to `critical` or `warning` to match the built-in routing rules.
`annotations.summary`	Recommended	Short human-readable description. Supports Go template labels (e.g. `{{ $labels.host }}`).
`annotations.description`	Recommended	Detailed description with context for the on-call operator.
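The interaction between the evaluation `interval` and the `for` duration is easiest to see in a small simulation. The sketch below is a simplified model of Prometheus-style semantics (an alert is pending until the condition has held for the `for` duration, then firing); vmalert's exact behaviour may differ in details, and `alert_states` is an illustrative name.

```python
def alert_states(condition_samples, interval_s, for_s):
    """Return the alert state after each evaluation.

    condition_samples: one boolean per evaluation (True = expr returned data).
    interval_s: seconds between evaluations.
    for_s: seconds the condition must hold before the alert fires.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, holds in enumerate(condition_samples):
        now = index * interval_s
        if not holds:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # 15-second interval with a 1-minute "for", as in SwapUsageDetected.
    print(alert_states([True] * 5, interval_s=15, for_s=60))
```

In this model, a rule evaluated every 15 seconds with `for: 1m` fires on the fifth consecutive evaluation in which the condition holds, and a single evaluation where it does not hold restarts the wait.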

Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.

Configuring Alert Routes

By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.

Route Structure

The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:

alertmanager:
  config:
    route:
      receiver: 'null'          # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s           # Wait before sending first notification for a new group
      group_interval: 10s       # Wait before sending updated notifications for a group
      repeat_interval: 1h       # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'

Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
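The effect of `continue: true` can be illustrated with a simplified model of route matching. This sketch is not Alertmanager's implementation: it handles only equality matchers on a flat list of child routes, and the `match` dictionary key is a hypothetical stand-in for `matchers`.

```python
def matching_receivers(labels, routes, default_receiver):
    """Return the receivers an alert with the given labels would reach."""
    receivers = []
    for route in routes:
        # Simplification: every matcher is an exact label comparison.
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route says to continue
    # Alerts that match no child route fall through to the top-level receiver.
    return receivers or [default_receiver]


if __name__ == "__main__":
    routes = [
        {"match": {"severity": "critical"}, "receiver": "slack", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "email-critical"},
        {"match": {"severity": "warning"}, "receiver": "email-warning"},
    ]
    print(matching_receivers({"severity": "critical"}, routes, "null"))
    print(matching_receivers({"severity": "info"}, routes, "null"))
```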

Notification Channels

Email

Email requires an SMTP server to be configured globally. Separate receivers can be defined for critical and warning alerts.

alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Slack

Requires an incoming webhook URL created in your Slack workspace.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}

Telegram

Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.

alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
                {{ .Annotations.description }}
              {{ end }}

Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.
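Reading the chat ID out of a long `getUpdates` response by eye is error-prone. The sketch below extracts chat IDs and titles from a saved response. It assumes the standard Bot API response shape, where group messages arrive under `message` and channel posts under `channel_post`; `chat_ids` is an illustrative name.

```python
import json


def chat_ids(get_updates_json):
    """Map chat ID -> chat title for every chat seen in a getUpdates response."""
    payload = json.loads(get_updates_json)
    chats = {}
    for update in payload.get("result", []):
        for key in ("message", "channel_post"):
            chat = update.get(key, {}).get("chat")
            if chat:
                chats[chat["id"]] = chat.get("title") or chat.get("username", "")
    return chats


if __name__ == "__main__":
    # Illustrative response body; the IDs and title are made up.
    sample = json.dumps({
        "ok": True,
        "result": [{
            "update_id": 1,
            "message": {
                "message_id": 10,
                "chat": {"id": -1001234567890, "title": "CDN Alerts", "type": "supergroup"},
                "text": "hello",
            },
        }],
    })
    print(chat_ids(sample))
```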

Combining Multiple Receivers

Routes and receivers can be combined to send different alert severities to different channels. For example, to send critical alerts to both Slack and email, and warnings to email only:

alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true        # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true

Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.

Silencing Alerts

Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.

Silences are managed via the Alertmanager UI, accessible at:

https://<manager-host>/alertmanager

Creating a Silence

Navigate to the Alertmanager UI and click Silences in the top navigation.
Click Create Silence.
Set the Start and End times for the silence window.
Add one or more matchers to scope which alerts are suppressed. For example:
- alertname = StorageFillingUp — silence a specific alert
- severity = warning — silence all warnings
- host = node-01 — silence all alerts from a specific host
Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
Click Create. The silence takes effect immediately.
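For recurring maintenance windows it can be convenient to script silences rather than click through the UI. Upstream Alertmanager accepts silences as JSON on its `/api/v2/silences` endpoint. The sketch below only builds that request body. Whether the API is reachable under `https://<manager-host>/alertmanager/api/v2/silences` in this product is an assumption to verify, and `build_silence` is an illustrative name.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, start=None):
    """Build a silence request body with exact-match label matchers."""
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    body = build_silence(
        {"alertname": "StorageFillingUp", "host": "node-01"},
        hours=2,
        created_by="ops",
        comment="Planned disk expansion on node-01",
    )
    # POST this JSON to the silences endpoint (URL is an assumption, see above).
    print(json.dumps(body, indent=2))
```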

Expiring a Silence

Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.

Next Steps

Operations Guide - Day-to-day operational procedures
Troubleshooting Guide - Resolve underlying issues surfaced by alerts
Metrics & Monitoring Overview - Return to the monitoring overview