Alerts & Alarms
Overview
The CDN Manager ships a set of pre-configured alerting rules evaluated by vmalert against VictoriaMetrics. When a rule fires, the alert is routed to Alertmanager, which handles deduplication, grouping, silencing, and delivery to configured notification channels.
This page documents every built-in alert rule, what it means, its severity, and the recommended operator action.
Alert Severity Levels
| Severity | Meaning |
|---|---|
| critical | Immediate action required. The condition poses a risk to data integrity, service availability, or active traffic. |
| warning | Investigate soon. The condition is not immediately harmful but will degrade into a critical state if left unattended. |
Alert Groups
Alerts are organised into the following groups, each evaluated on a 15-second interval.
- infra-disk — Disk space and I/O
- infra-compute — CPU and memory
- infra-network — Network errors and traffic anomalies
- longhorn — Persistent storage health
infra-disk
Monitors disk space utilisation and I/O latency on cluster nodes.
StorageFillingUp
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Root filesystem usage exceeds 85% |
| Must persist for | 2 minutes |
What it means: A node’s root filesystem is running low on space. If left unchecked this will progress to a full disk, which can cause pod evictions, write failures, and potential data loss.
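As a rough illustration of how a condition like this can be written in the Prometheus alerting rule format, the sketch below assumes Telegraf-style metrics, where disk_used_percent carries path and host labels. The metric name and the path selector are assumptions, and the rule shipped with the product may use a different expression.

```yaml
# Illustrative sketch only; not the shipped rule definition.
# Assumes a Telegraf-style gauge named disk_used_percent (an assumption).
- alert: StorageFillingUp
  expr: 'disk_used_percent{path="/"} > 85'   # root filesystem above 85% used
  for: 2m                                    # condition must persist for 2 minutes
  labels:
    severity: warning
  annotations:
    summary: "Root filesystem on {{ $labels.host }} is {{ $value }}% full"
```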
Recommended actions:
- Identify the node from the host label in the alert.
- Log into the node and check disk usage:
    df -h /
    du -sh /var/log/* | sort -rh | head -20
- Clear old log files, unused container images, or temporary files:
    # On the node
    journalctl --vacuum-size=500M
    crictl rmi --prune
- If disk usage is due to application data growth, consider expanding the volume or adjusting retention settings. See Metrics Retention.
HighDiskLatency
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Average disk write latency exceeds 100 ms |
| Must persist for | 2 minutes |
What it means: Disk write operations are taking longer than 100 ms on average. High disk latency can degrade database performance (PostgreSQL, VictoriaMetrics) and cause timeouts in write-heavy components.
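An average write latency can be derived by dividing the time spent on writes by the number of writes completed over the same window. The sketch below assumes Telegraf-style diskio counters (diskio_write_time in milliseconds and diskio_writes), which is an assumption about the metric source; the built-in rule may compute latency differently.

```yaml
# Illustrative sketch only; metric names are assumed, not confirmed.
- alert: HighDiskLatency
  # ms spent writing per second / writes per second = average ms per write
  expr: 'rate(diskio_write_time[2m]) / rate(diskio_writes[2m]) > 100'
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.name }} on {{ $labels.host }} averages {{ $value }} ms per write"
```

When a disk performs no writes in the window, the division yields no usable value and the comparison does not match, so idle disks do not trigger the rule.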
Recommended actions:
- Identify the affected disk from the name label in the alert.
- Check for I/O-intensive processes on the node:
    iostat -x 2 5
    iotop -o
- Check for Longhorn replica rebuilds or rebalancing activity, which can saturate disk I/O.
- If the issue persists on a production node, review whether the storage hardware meets the System Requirements.
infra-compute
Monitors CPU and memory utilisation on cluster nodes.
CpuSaturation
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Total CPU usage exceeds 90% |
| Must persist for | 5 minutes |
What it means: A node is running at near-full CPU capacity. Sustained CPU saturation causes request latency increases across all workloads on that node and may result in pod throttling.
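Total CPU usage is commonly expressed as 100 minus the idle percentage. The following sketch assumes a Telegraf-style cpu_usage_idle gauge with a cpu="cpu-total" series for the whole node; both the metric name and the label value are assumptions rather than the confirmed built-in expression.

```yaml
# Illustrative sketch only; metric and label names are assumed.
- alert: CpuSaturation
  expr: '100 - cpu_usage_idle{cpu="cpu-total"} > 90'   # busy percentage above 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "CPU usage on {{ $labels.host }} is {{ $value }}%"
```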
Recommended actions:
- Identify the saturated node from the host label in the alert.
- Check which pods are consuming CPU:
    kubectl top pods --sort-by=cpu -A
- Check for runaway processes on the node:
    top -b -n 1 | head -20
- If saturation is caused by a legitimate workload spike (e.g. CDN traffic burst), consider scaling the deployment or redistributing load across nodes.
MemoryCriticallyLow
| Property | Value |
|---|---|
| Severity | critical |
| Condition | Available RAM falls below 10% |
| Must persist for | 2 minutes |
What it means: The node has very little free memory remaining. The Linux OOM killer may begin terminating processes, which can cause abrupt pod restarts, data corruption in in-memory caches, and service unavailability.
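A threshold on available memory could look like the sketch below. It assumes a Telegraf-style mem_available_percent gauge, which is an assumption; the shipped rule may derive the percentage from other metrics.

```yaml
# Illustrative sketch only; the metric name is assumed.
- alert: MemoryCriticallyLow
  expr: 'mem_available_percent < 10'   # less than 10% of RAM available
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Only {{ $value }}% of memory is available on {{ $labels.host }}"
```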
Recommended actions:
- Identify the affected node from the host label in the alert.
- Immediately check for memory-leaking or oversized pods:
    kubectl top pods --sort-by=memory -A
- Identify and restart any pods showing abnormal memory consumption:
    kubectl rollout restart deployment/<name>
- Check kernel OOM kill log for any processes already killed:
    dmesg | grep -i "oom\|killed"
- Review memory resource limits and requests for affected deployments and adjust if necessary.
SwapUsageDetected
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Swap usage exceeds 5% |
| Must persist for | 1 minute |
What it means: The node is swapping memory to disk. Swap usage in a Kubernetes cluster is a strong indicator of memory pressure. It degrades performance significantly and may mask an underlying memory shortage that could escalate to a MemoryCriticallyLow event.
Recommended actions:
- Treat this as an early warning for the same conditions as MemoryCriticallyLow.
- Identify memory-intensive pods and investigate whether resource limits are configured appropriately.
- Swap should ideally never be active on a production Kubernetes node. If it persists, escalate to a memory capacity review.
infra-network
Monitors network interface errors and traffic anomalies on cluster nodes.
NetworkInterfaceErrors
| Property | Value |
|---|---|
| Severity | critical |
| Condition | Any non-zero rate of inbound or outbound packet errors on a network interface |
| Must persist for | 1 minute |
What it means: A network interface is dropping or corrupting packets. Even a low error rate can cause TCP retransmissions, increased latency, and connection failures — directly impacting CDN Director communication and external traffic delivery.
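"Any non-zero rate" translates naturally into a rate() over the error counters compared against zero. The sketch below assumes Telegraf-style net_err_in and net_err_out counters with host and interface labels; these names are assumptions and the built-in expression may differ.

```yaml
# Illustrative sketch only; counter names are assumed.
- alert: NetworkInterfaceErrors
  # Fires for an interface if either direction shows a growing error counter
  expr: 'rate(net_err_in[2m]) > 0 or rate(net_err_out[2m]) > 0'
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Packet errors on {{ $labels.interface }} ({{ $labels.host }})"
```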
Recommended actions:
- Identify the affected host and interface from the host and interface labels in the alert.
- Check interface error counters on the node:
    ip -s link show <interface>
    ethtool -S <interface> | grep -i error
- Check for duplex/speed mismatches between the node NIC and the upstream switch:
    ethtool <interface> | grep -E "Speed|Duplex"
- Escalate to network/hardware team if errors are persistent and cannot be attributed to a software configuration issue.
SuddenNetworkEgressDrop
| Property | Value |
|---|---|
| Severity | critical |
| Condition | Egress throughput drops to less than 50% of the 2-minute baseline, when baseline traffic is above 1 Mbit/s |
| Must persist for | 1 minute |
What it means: A significant, sudden reduction in outbound traffic has been detected. This typically indicates an upstream network failure, a link fault, or a routing issue. A CDN node that stops transmitting traffic is effectively out of service.
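One possible way to express a baseline comparison is to compare the most recent rate with the rate over the preceding window, and to require that the baseline itself is above the traffic floor. The sketch below is only one such formulation: the net_bytes_sent counter (Telegraf-style), the use of offset to define the baseline, and the window lengths are all assumptions, not the built-in rule.

```yaml
# Illustrative sketch only; metric name and baseline definition are assumed.
- alert: SuddenNetworkEgressDrop
  expr: |
    rate(net_bytes_sent[1m]) * 8
      < 0.5 * rate(net_bytes_sent[2m] offset 1m) * 8
    and
    rate(net_bytes_sent[2m] offset 1m) * 8 > 1e6
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Egress on {{ $labels.interface }} ({{ $labels.host }}) fell below half of its recent baseline"
```

Multiplying by 8 converts bytes per second to bits per second, so the second clause restricts the alert to interfaces whose baseline exceeded 1 Mbit/s.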
Recommended actions:
- Identify the affected node and interface from the alert labels.
- Verify the node’s network connectivity:
    ping <gateway-ip>
    traceroute <upstream-endpoint>
- Check for interface errors or link-down events:
    ip link show
    dmesg | grep -i "link\|eth\|nic"
- Verify that upstream routing and firewall rules have not changed.
- If the node is healthy and the traffic reduction is expected and understood (e.g. a CDN traffic shift), the alert can be silenced.
SuddenNetworkIngressSpike
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Ingress throughput exceeds twice the 5-minute baseline |
| Must persist for | 1 minute |
What it means: A sudden surge of inbound traffic has been detected. This may indicate a legitimate traffic event (e.g. a large stream audience spike), a DDoS attempt, or a misconfigured client sending unexpected volume.
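A spike condition can be sketched as the current ingress rate compared against twice the rate over the preceding five minutes. The net_bytes_recv counter name (Telegraf-style) and the use of offset to define the baseline are assumptions; the shipped rule may define the baseline differently.

```yaml
# Illustrative sketch only; metric name and baseline definition are assumed.
- alert: SuddenNetworkIngressSpike
  # Current 1-minute rate versus the 5 minutes before it
  expr: 'rate(net_bytes_recv[1m]) > 2 * rate(net_bytes_recv[5m] offset 1m)'
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Ingress on {{ $labels.interface }} ({{ $labels.host }}) is more than twice its recent baseline"
```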
Recommended actions:
- Identify the affected node and interface from the alert labels.
- Review active connections and top talkers:
    ss -s
    netstat -an | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
- Correlate with CDN Director metrics in Grafana to determine whether the spike is legitimate CDN traffic.
- If the spike is unexpected and sustained, consider rate-limiting or blocking the source at the network edge.
longhorn
Monitors the health of Longhorn distributed block storage, which backs persistent volumes for PostgreSQL, VictoriaMetrics, and other stateful components.
Note: Longhorn alert rules are always present in the alert configuration, but will not fire in environments where Longhorn is not installed (e.g. cloud deployments using external storage).
LonghornVolumeDegraded
| Property | Value |
|---|---|
| Severity | warning |
| Condition | A Longhorn volume’s robustness state is Degraded |
| Must persist for | 2 minutes |
What it means: A Longhorn volume has fewer healthy replicas than its configured replication factor. Data is not at immediate risk, but the volume has reduced redundancy. A single additional node or disk failure could result in data loss or volume unavailability.
Recommended actions:
- Identify the affected volume from the volume label in the alert.
- Open the Longhorn UI and inspect the volume’s replica status.
- Check whether a replica is in the process of rebuilding (this is normal after a node restart). Rebuilding may take several minutes depending on volume size.
- If a replica has failed and is not rebuilding, attempt to evict and re-schedule it via the Longhorn UI.
- Investigate the health of the node that hosted the failed replica:
    kubectl get nodes
    kubectl describe node <node-name>
LonghornVolumeFaulted
| Property | Value |
|---|---|
| Severity | critical |
| Condition | A Longhorn volume’s robustness state is Faulted |
| Must persist for | 1 minute |
What it means: A Longhorn volume has lost all healthy replicas and is no longer accessible. Any workload that depends on this volume (e.g. PostgreSQL, VictoriaMetrics) will be unable to write and may crash. Data may be at risk.
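Longhorn exports a longhorn_volume_robustness gauge whose documented values are 0 (unknown), 1 (healthy), 2 (degraded), and 3 (faulted). Assuming that metric is what the built-in rules evaluate, which this page does not confirm, the Degraded and Faulted conditions could be sketched as follows.

```yaml
# Illustrative sketch only; assumes the rules are built on longhorn_volume_robustness.
- alert: LonghornVolumeDegraded
  expr: 'longhorn_volume_robustness == 2'   # 2 = degraded
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Longhorn volume {{ $labels.volume }} is degraded"
- alert: LonghornVolumeFaulted
  expr: 'longhorn_volume_robustness == 3'   # 3 = faulted
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn volume {{ $labels.volume }} is faulted"
```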
Recommended actions:
- Identify the affected volume from the volume label.
- Immediately check which pods are using the volume:
    kubectl get pods -A -o wide | grep -i <volume-name>
- Open the Longhorn UI. Check whether any replicas are still present and whether they can be recovered.
- Do not delete faulted replicas without first attempting recovery — they may contain the only copy of the data.
- Contact AgileTV support if the volume cannot be recovered, providing Longhorn UI screenshots and node logs.
LonghornNodeDown
| Property | Value |
|---|---|
| Severity | critical |
| Condition | A Longhorn node reports a non-ready state |
| Must persist for | 2 minutes |
What it means: A storage node is unreachable or unhealthy from Longhorn’s perspective. All volumes with replicas on this node are at reduced redundancy. If more than one node goes down simultaneously, faulted volumes and data loss become a risk.
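Longhorn reports per-node conditions through the longhorn_node_status metric, where the condition="ready" series is 1 when the node is ready and 0 otherwise. The sketch below assumes the built-in rule is based on that series, which is an assumption.

```yaml
# Illustrative sketch only; assumes the rule is built on longhorn_node_status.
- alert: LonghornNodeDown
  expr: 'longhorn_node_status{condition="ready"} == 0'   # node not ready
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn node {{ $labels.node }} is not ready"
```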
Recommended actions:
- Identify the affected node from the node label in the alert.
- Check the node’s status in Kubernetes:
    kubectl get nodes
    kubectl describe node <node-name>
- Attempt to SSH to the node and check system health:
    ssh root@<node-ip>
    systemctl status k3s
- If the node has crashed and cannot be recovered quickly, consider evicting its Longhorn replicas to allow rebuilding on healthy nodes, but only if the remaining healthy nodes have sufficient capacity.
LonghornDiskSpaceLow
| Property | Value |
|---|---|
| Severity | warning |
| Condition | Available Longhorn disk space on a node falls below 15% |
| Must persist for | 2 minutes |
What it means: A node’s Longhorn-managed disk is running low on storage. When Longhorn disk space is exhausted, it cannot schedule new replicas or accommodate volume growth, which can lead to LonghornVolumeDegraded or LonghornVolumeFaulted conditions.
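Available space as a percentage can be computed from Longhorn's per-disk capacity and usage gauges. The sketch below assumes the rule uses longhorn_disk_capacity_bytes and longhorn_disk_usage_bytes and ignores reserved space; whether the built-in rule accounts for reservations is not stated here, so treat this as an approximation.

```yaml
# Illustrative sketch only; the exact built-in expression is not confirmed.
- alert: LonghornDiskSpaceLow
  # (capacity - used) / capacity, as a percentage
  expr: '(longhorn_disk_capacity_bytes - longhorn_disk_usage_bytes) / longhorn_disk_capacity_bytes * 100 < 15'
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.disk }} on {{ $labels.node }} has {{ $value }}% space available"
```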
Recommended actions:
- Identify the affected node and disk from the node and disk labels in the alert.
- Open the Longhorn UI and check which volumes have replicas on this disk.
- Check for snapshots or backups that can be cleaned up to reclaim space.
- If space cannot be reclaimed, consider adding a disk to the node or expanding the underlying block device.
- Review Metrics Retention settings — reducing VictoriaMetrics retention is often the fastest way to reclaim Longhorn disk space in a monitoring-heavy deployment.
Adding Custom Alert Rules
Additional alert rules can be defined by extending the victoria_metrics_alert.server.config.alerts.groups list in your values.yaml. Rules follow the Prometheus alerting rule format.
Example: Adding a Custom Alert
The following example adds an alert group that fires when Kafka consumer lag exceeds a threshold:
victoria_metrics_alert:
  server:
    config:
      alerts:
        groups:
          # ... existing groups are preserved alongside your additions ...
          - name: kafka
            interval: 15s
            rules:
              - alert: KafkaConsumerLagHigh
                expr: kafka_consumer_group_lag > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High consumer lag on {{ $labels.topic }}"
                  description: "Consumer group {{ $labels.group }} is {{ $value }} messages behind on topic {{ $labels.topic }}."
Apply the change using the standard upgrade procedure in the Configuration Guide.
Rule Fields Reference
| Field | Required | Description |
|---|---|---|
| alert | Yes | Alert name. Must be unique within the group. |
| expr | Yes | PromQL expression. The alert fires for each time series the expression returns, so use a comparison (e.g. > 10000) to filter out series that should not alert. |
| for | No | How long the condition must hold before the alert fires. If omitted, the alert fires on the first evaluation in which the condition holds. |
| labels.severity | Recommended | Set to critical or warning to match the built-in routing rules. |
| annotations.summary | Recommended | Short human-readable description. Supports Go template labels (e.g. {{ $labels.host }}). |
| annotations.description | Recommended | Detailed description with context for the on-call operator. |
Tip: Use the Alertmanager UI (https://<manager-host>/alertmanager) to verify that fired alerts are being received and routed correctly after adding new rules.
Configuring Alert Routes
By default, all alerts are routed to the built-in null receiver, which silently discards them. To receive alerts, configure one or more receivers and update the routing rules — all within the alertmanager.config section of your values.yaml.
Route Structure
The top-level route defines the default behaviour. Child routes under routes match alerts by label and direct them to specific receivers:
alertmanager:
  config:
    route:
      receiver: 'null'          # Default: discard unmatched alerts
      group_by: ['alertname']
      group_wait: 10s           # Wait before sending first notification for a new group
      group_interval: 10s       # Wait before sending updated notifications for a group
      repeat_interval: 1h       # Re-notify if an alert is still firing after this period
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
Routes are evaluated top-to-bottom. The first matching route wins unless continue: true is set on the route.
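Matchers are not limited to severity. As a minimal sketch, the fragment below sends every alert whose name starts with Longhorn to a separate receiver by using a regular-expression matcher. The receiver name storage-team and the webhook URL are hypothetical placeholders, not values defined by the product.

```yaml
alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - alertname=~"Longhorn.*"    # regex matchers are anchored to the full value
          receiver: 'storage-team'       # hypothetical receiver name
    receivers:
      - name: 'null'
      - name: 'storage-team'
        webhook_configs:
          - url: 'https://example.com/hooks/storage'   # placeholder URL
            send_resolved: true
```

Because the first matching route wins, a route like this should be placed above broader severity routes, or marked with continue: true if those routes should also receive the alert.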
Notification Channels
Email
Email requires an SMTP server to be configured globally. Separate critical and warning receivers can be defined independently.
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_require_tls: true
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
Slack
Requires an incoming webhook URL created in your Slack workspace.
alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
Telegram
Requires a Telegram bot token and a channel/group chat ID. Create a bot via @BotFather and add it to your alert channel before configuring.
alertmanager:
  config:
    route:
      routes:
        - matchers:
            - severity="critical"
          receiver: 'telegram'
    receivers:
      - name: 'null'
      - name: 'telegram'
        telegram_configs:
          - bot_token: 'your-bot-token'
            chat_id: -1234567890
            parse_mode: 'Markdown'
            send_resolved: true
            message: |
              *Alert:* {{ .CommonLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Host:* {{ .CommonLabels.host }}
              {{ range .Alerts }}
              {{ .Annotations.description }}
              {{ end }}
Finding your chat ID: Add your bot to the channel or group, send a message, then call https://api.telegram.org/bot<token>/getUpdates and read the chat.id from the response. Note that group and channel chat IDs are negative numbers.
Combining Multiple Receivers
Routes and receivers can be combined to send different alert severities to different channels simultaneously. For example, critical alerts to both Slack and email, warnings to email only:
alertmanager:
  config:
    route:
      receiver: 'null'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'slack'
          continue: true   # Continue matching so the next route also fires
        - matchers:
            - severity="critical"
          receiver: 'email-critical'
        - matchers:
            - severity="warning"
          receiver: 'email-warning'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#critical-alerts'
            send_resolved: true
      - name: 'email-critical'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'email-warning'
        email_configs:
          - to: 'alerts@example.com'
            send_resolved: true
Apply any receiver or routing changes using the standard upgrade procedure in the Configuration Guide.
Silencing Alerts
Silences suppress alert notifications for a defined time window without disabling the underlying alert rule. They are useful during planned maintenance, known incidents, or when investigating a non-urgent condition.
Silences are managed via the Alertmanager UI, accessible at:
https://<manager-host>/alertmanager
Creating a Silence
- Navigate to the Alertmanager UI and click Silences in the top navigation.
- Click Create Silence.
- Set the Start and End times for the silence window.
- Add one or more matchers to scope which alerts are suppressed. For example:
  - alertname = StorageFillingUp (silences a specific alert)
  - severity = warning (silences all warnings)
  - host = node-01 (silences all alerts from a specific host)
- Add a Comment describing the reason for the silence (e.g. Planned disk expansion on node-01).
- Click Create. The silence takes effect immediately.
Expiring a Silence
Silences expire automatically at the configured end time. To remove a silence early, navigate to Silences in the Alertmanager UI, locate the silence, and click Expire.
Next Steps
- Operations Guide - Day-to-day operational procedures
- Troubleshooting Guide - Resolve underlying issues surfaced by alerts
- Metrics & Monitoring Overview - Return to the monitoring overview