Alert Storm Playbook

Who this is for

Users who are receiving a flood of alerts — many at once, or the same alert repeating frequently — and want to silence noise while investigating the real cause.


What Causes an Alert Storm?

Alert storms typically happen when:

  • A server goes offline → all monitors on that server fire simultaneously
  • A service crashes → health checks fail repeatedly at short intervals
  • A disk fills up → multiple disk-related monitors cross their thresholds
  • A threshold is too sensitive → the alert fires on normal traffic spikes

Immediate Response: Snooze Active Alerts

While investigating, snooze the alerts to stop notification noise:

  1. Go to Monitoring → Alerts.
  2. Select all active alerts from the affected server.
  3. Click Snooze and choose a duration (e.g., 1 hour).

Snoozed alerts do not send notifications but remain visible in the Alerts panel.


Step 1 — Identify the Root Cause

Multiple alerts from the same server often share a single root cause. Look for patterns:

  • All alerts come from one server → the server itself is the problem (offline, out of disk, OOM).
  • CPU and memory alerts fire together → a runaway application process.
  • Disk alerts fire alongside failed-backup alerts → the disk is full.
  • Every monitor fires at once → the server is unreachable (network or SSH down).
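
The patterns above can be sketched as a small triage helper. This is only an illustration, not part of the product: the monitor type names ("cpu", "memory", "disk", "backup") and the input shapes are assumptions, and the sketch assumes one monitor per type on each server.

```python
from collections import defaultdict


def guess_root_causes(alerts, monitors_per_server):
    """Map each server to a likely root cause, following the patterns above.

    alerts: iterable of (server, monitor_type) pairs for active alerts.
    monitors_per_server: dict of server -> number of monitors configured on it.
    Both shapes and the monitor type names are hypothetical.
    """
    by_server = defaultdict(set)
    for server, monitor_type in alerts:
        by_server[server].add(monitor_type)

    guesses = {}
    for server, kinds in by_server.items():
        total = monitors_per_server.get(server)
        if total and len(kinds) >= total:
            # Every monitor is firing: the server is probably unreachable.
            guesses[server] = "server unreachable (network or SSH down)"
        elif {"disk", "backup"} <= kinds:
            guesses[server] = "disk full"
        elif {"cpu", "memory"} <= kinds:
            guesses[server] = "runaway application process"
        else:
            guesses[server] = "unclear: inspect the server directly"
    return guesses
```

For example, passing [("web-1", "cpu"), ("web-1", "memory")] with {"web-1": 5} maps web-1 to "runaway application process". Treat the result as a starting hypothesis to confirm in Step 2, not a diagnosis.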

Step 2 — Check Server Health

  1. Go to Server detail → Monitoring tab.
  2. Check current CPU, memory, and disk gauges.
  3. If the metrics do not load, the server is likely unreachable; see KB-12-05: SSH Unreachable Playbook.
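
If the gauges do not load, a quick reachability check from your own machine can help confirm whether the SSH port answers at all. The sketch below uses only the Python standard library. It assumes SSH listens on port 22, and the function name is illustrative.

```python
import socket


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A False result only shows that the port did not accept a connection from where the check ran. A firewall rule or a local network problem produces the same result as a server that is down.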

Step 3 — Fix the Root Cause

  Root cause                  Fix
  --------------------------  ------------------------------------------------------------
  Server offline              Start the server from CloudAIPilot or the cloud provider console.
  Disk full                   Delete old backups, logs, or temporary files. Resize the disk if needed.
  High CPU (runaway process)  Open Terminal on the server, run top, then identify and kill the runaway process.
  High memory (OOM)           Restart the memory-heavy service (sudo systemctl restart <service-name>).
  Network issues              Check with the cloud provider for regional outages.
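
For the disk-full case, it helps to know how full the disk is and which files take the most space before deleting anything. The following is a minimal sketch using the Python standard library. The function names are illustrative, and the directory to scan (for example a log or backup folder) is your choice.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # do not count symlink targets
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:limit]
```

The sketch only reports sizes and deletes nothing, so review the list before removing any file.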

Step 4 — Acknowledge and Resolve Alerts

After fixing:

  1. Go to Monitoring → Alerts.
  2. Click Acknowledge on each alert to confirm you are aware of it.
  3. Once the condition is resolved and metrics are back to normal, click Resolve.

Preventing Future Alert Storms

To reduce sensitivity and prevent cascading alerts:

  1. Adjust thresholds: if CPU at 80% is normal for your workload, raise the threshold to 90%.
  2. Add hysteresis: require the condition to persist for N minutes before the alert fires, so that brief spikes do not trigger it.
  3. Group by server: configure notifications to send one summary per server per incident instead of one alert per monitor.
  4. Create runbooks: link alert rules to response runbooks so the team knows what to do.
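
To make items 2 and 3 concrete, here is a minimal sketch of both ideas. It is not how the product implements them. The class and function names are hypothetical, and the persistence check assumes one metric sample per minute, so N consecutive samples stand in for N minutes.

```python
from collections import Counter


class SustainedThreshold:
    """Fire only after the value stays above `threshold` for `required` consecutive samples."""

    def __init__(self, threshold, required):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def update(self, value):
        """Record one sample; return True while the alert condition is active."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any sample at or below the threshold resets the count
        return self.streak >= self.required


def summary_lines(alerts):
    """Collapse (server, monitor) alerts into one summary line per server."""
    counts = Counter(server for server, _monitor in alerts)
    return [f"{server}: {n} active alert(s)" for server, n in sorted(counts.items())]
```

With SustainedThreshold(90, 3), the samples 95, 70, 92, 93, 96, 97 produce False, False, False, False, True, True. The single spike at 95 never fires, and the alert becomes active only on the third consecutive sample above 90.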

See KB-06-04: Severity, Hysteresis, and Noise Reduction.

