Alert Storm Playbook
Who this is for
Users who are receiving a flood of alerts — many at once, or the same alert repeating frequently — and want to silence noise while investigating the real cause.
What Causes an Alert Storm?
Alert storms typically happen when:
- A server goes offline → all monitors on that server fire simultaneously
- A service crashes → health checks fail repeatedly at short intervals
- A disk fills up → multiple disk-related monitors cross their thresholds
- A threshold is too sensitive → the alert triggers on normal traffic spikes
Immediate Response: Snooze Active Alerts
While investigating, snooze the alerts to stop notification noise:
- Go to Monitoring → Alerts.
- Select all active alerts from the affected server.
- Click Snooze and choose a duration (e.g., 1 hour).
Snoozed alerts do not send notifications but remain visible in the Alerts panel.
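The snooze behavior can be thought of as a filter on notifications, not on visibility. The Python sketch below is illustrative only: the `Alert` class, its fields, and the helper functions are hypothetical names chosen for this example and are not part of any CloudAIPilot API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions for illustration."""
    monitor: str
    server: str
    snoozed_until: Optional[datetime] = None


def snooze(alert: Alert, now: datetime, duration: timedelta) -> None:
    """Suppress notifications for `duration`, starting at `now`."""
    alert.snoozed_until = now + duration


def should_notify(alert: Alert, now: datetime) -> bool:
    """A snoozed alert sends nothing until its snooze period has expired."""
    return alert.snoozed_until is None or now >= alert.snoozed_until


def visible_alerts(alerts: List[Alert]) -> List[Alert]:
    """Snoozing never hides an alert: every alert remains listed."""
    return list(alerts)
```

Under this model, an alert snoozed for 1 hour at 12:00 stays listed the whole time and becomes eligible to notify again at 13:00.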
Step 1 — Identify the Root Cause
Multiple alerts from the same server often share a single root cause. Look for patterns:
- All alerts come from one server → the server itself is the problem (offline, out of disk, or out of memory).
- CPU and memory alerts fire together → a runaway application process.
- Disk alerts fire alongside a failed backup → the disk is full.
- Every monitor on the server fires at once → the server is unreachable (network or SSH down).
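The patterns above can be sketched as a small lookup over the set of monitor types firing on each server. This is a rough triage aid under stated assumptions, not how the product classifies alerts: the monitor type names (`cpu`, `memory`, `disk`, `backup`, `health`) and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_by_server(alerts: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the firing monitor types for each server from (server, type) pairs."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for server, monitor_type in alerts:
        grouped[server].add(monitor_type)
    return dict(grouped)


def probable_cause(firing: Set[str], configured: Set[str]) -> str:
    """Map the firing monitor types on one server to a likely root cause.

    `configured` is every monitor type set up on that server (an assumption:
    this sketch expects the caller to know it).
    """
    # Checked first: if everything fires at once, the server is likely unreachable.
    if firing and firing >= configured:
        return "server unreachable (network or SSH down)"
    if {"disk", "backup"} <= firing:
        return "disk is full"
    if {"cpu", "memory"} <= firing:
        return "runaway application process"
    return "unknown: inspect the server directly"
```

The order of the checks matters: the "everything fires" case is tested before the narrower combinations, because a fully unreachable server would otherwise also match the disk and CPU patterns.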
Step 2 — Check Server Health
- Go to Server detail → Monitoring tab.
- Check current CPU, memory, and disk gauges.
- If the metrics do not load, the server is probably unreachable. See KB-12-05: SSH Unreachable Playbook.
Step 3 — Fix the Root Cause
| Root cause | Fix |
|---|---|
| Server offline | Start the server from CloudAIPilot or from the cloud provider's console |
| Disk full | Delete old backups, logs, or temporary files. Resize disk if needed. |
| High CPU (runaway process) | Open Terminal on the server, run `top`, then identify and kill the runaway process |
| High memory (OOM) | Restart the memory-heavy service (`sudo systemctl restart <service-name>`) |
| Network issues | Check cloud provider for regional outages |
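For the disk-full case, it usually helps to see which files take the most space before deleting anything. The following Python sketch walks a directory tree and lists the largest files. The function name `largest_files` is hypothetical, and the directory to scan is something you would choose yourself.

```python
import os
from typing import List, Tuple


def largest_files(root: str, limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes: List[Tuple[int, str]] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reports the size of a symlink itself, not its target.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # File vanished or is unreadable; skip it.
    sizes.sort(reverse=True)
    return sizes[:limit]
```

For example, `largest_files("/var/log", limit=5)` would return the five largest files under `/var/log` (an example path, not one named by this playbook). Review the list before removing anything, since some large files may still be in use by a running service.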
Step 4 — Acknowledge and Resolve Alerts
After fixing:
- Go to Monitoring → Alerts.
- Click Acknowledge on each alert to confirm you are aware of it.
- Once the condition is resolved and metrics are back to normal, click Resolve.
Preventing Future Alert Storms
To reduce sensitivity and prevent cascading alerts:
- Adjust thresholds: if CPU at 80% is normal for your workload, raise the threshold to 90%.
- Add hysteresis: require the condition to persist for N minutes before firing, so that brief spikes do not trigger alerts.
- Group by server: configure notifications to send one summary per server per incident instead of one alert per monitor.
- Create runbooks: link alert rules to response runbooks so the team knows what to do.
See KB-06-04: Severity, Hysteresis, and Noise Reduction.
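To make two of the ideas above concrete (requiring a condition to persist before firing, and sending one summary per server), here is a minimal Python sketch. It only illustrates the logic and makes no claim about how the product implements these settings: `PersistenceGate` and `summarize_by_server` are hypothetical names, and the 90% threshold and 5-minute hold are example values.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


class PersistenceGate:
    """Fire only after a metric stays above `threshold` for at least `hold`."""

    def __init__(self, threshold: float, hold: timedelta) -> None:
        self.threshold = threshold
        self.hold = hold
        self._breach_started: Optional[datetime] = None

    def observe(self, value: float, at: datetime) -> bool:
        """Record one sample and return True if the alert should be firing."""
        if value <= self.threshold:
            self._breach_started = None  # Condition cleared; reset the timer.
            return False
        if self._breach_started is None:
            self._breach_started = at  # First sample of a new breach.
        return at - self._breach_started >= self.hold


def summarize_by_server(alerts: List[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (server, monitor) pairs into one summary line per server."""
    monitors: Dict[str, List[str]] = defaultdict(list)
    for server, monitor in alerts:
        monitors[server].append(monitor)
    return {
        server: f"{len(names)} alert(s) on {server}: {', '.join(sorted(names))}"
        for server, names in monitors.items()
    }
```

With `PersistenceGate(threshold=90.0, hold=timedelta(minutes=5))`, a CPU reading above 90% has to stay there for five minutes before `observe` returns `True`, and a single reading at or below the threshold resets the timer. The grouping function turns a storm of individual alerts into one line per affected server.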
Related Articles
- KB-12-05: SSH Unreachable Playbook
- KB-06-03: Create Alert Rules Effectively
- KB-06-04: Severity, Hysteresis, and Noise Reduction
- KB-06-05: Snooze, Acknowledge, Resolve Flows