Operational Runbook for Red Alerts
Who this is for
On-call engineers or team leads who receive a Critical (red) alert and need a structured response procedure.
What you will complete
A clear, step-by-step response to any Critical severity alert, from initial triage through resolution and post-incident notes.
Before you start responding
A Critical alert means something is failing or about to fail and requires immediate human attention. Your first priority is to understand the scope of the impact, not to jump straight to a fix.
The 90-second triage rule: Before taking any action, spend 90 seconds answering these three questions:
- What is failing? (Which server, which metric)
- Who is affected? (Which sites or apps run on this server)
- Is this getting worse or stabilizing? (Look at the metric trend over the past 10 minutes)
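The third triage question can be answered by eye from the chart, but if you have the raw readings, a rough comparison of the earlier and later halves of the 10-minute window gives the same answer. The following is a minimal sketch, not part of the product: the function name, the `tolerance` dead band, and the three labels are assumptions chosen for illustration.

```python
from statistics import mean


def classify_trend(samples, tolerance=1.0):
    """Classify a metric series as 'worsening', 'stabilizing', or 'recovering'.

    samples:   metric readings (for example CPU %), ordered oldest to newest,
               covering roughly the past 10 minutes.
    tolerance: hypothetical dead band in metric units; changes smaller than
               this are treated as flat.
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to compare halves")

    half = len(samples) // 2
    earlier = mean(samples[:half])   # average of the older readings
    recent = mean(samples[half:])    # average of the newer readings
    delta = recent - earlier

    if delta > tolerance:
        return "worsening"
    if delta < -tolerance:
        return "recovering"
    return "stabilizing"


if __name__ == "__main__":
    print(classify_trend([80, 85, 90, 96, 97, 98]))
```

Comparing two averages is deliberately crude. It is enough to decide whether you have time to diagnose or need to act quickly, which is all the 90-second triage asks for.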
Step-by-step response procedure
Step 1: Acknowledge and communicate (2 minutes)
- Go to Alerts → Events and click Acknowledge on the firing alert.
- Snooze it for 1 hour to suppress repeat notifications while you investigate.
- If your team has a status channel (Slack, etc.), post: "Investigating critical alert on [server-name] — [metric] at [value]. Will update in 10 minutes."
Step 2: Identify the scope (3 minutes)
- Go to Servers and check the status of the affected server.
- Identify which sites and apps run on this server.
- Check if those sites are reachable by visiting them in a browser.
- Note the exact start time of the alert. This is when the metric crossed the threshold, so the underlying issue began at or shortly before that time.
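If the server hosts many sites, checking each one in a browser can eat into the three minutes. The reachability check can be sketched with Python's standard library as below. This assumes you already have the list of site URLs; the function names and the 5-second timeout are illustrative choices, not product settings.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=5.0):
    """Return (reachable, detail) for a single URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"


def check_sites(urls, timeout=5.0):
    """Check each URL in turn and return {url: (reachable, detail)}."""
    return {url: check_site(url, timeout) for url in urls}


if __name__ == "__main__":
    # Hypothetical URL; replace with the sites hosted on the affected server.
    for url, (ok, detail) in check_sites(["https://example.com/"]).items():
        print(("OK  " if ok else "FAIL"), url, detail)
```

A successful response only shows that the site answers. It does not show that pages load quickly or render correctly, so a quick browser check of the most important site is still worthwhile.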
Step 3: Diagnose with AI Pilot (5 minutes)
- Go to AI Pilot.
- Send: "I have a critical [metric] alert on [server-name] that started at [time]. Please diagnose what is causing it."
- Read the AI's diagnosis. The AI will check metrics, logs, recent operations, and active processes.
- If the AI cannot diagnose via read access alone, it will tell you what additional information it needs.
Step 4: Decide on a response
Based on the diagnosis, choose one of:
Option A (AI Pilot fix): If the AI proposes a specific action (service restart, rollback, cleanup), review the approval card and click Allow if it is correct.
Option B (Manual fix): If you know the fix (e.g., clear disk space, restart a specific service), do it manually via SSH or the server management interface.
Option C (Escalate): If the issue is beyond what you can fix immediately (data corruption, hardware failure, provider incident), escalate to the appropriate team or contact cloud provider support.
Option D (Rollback): If the alert started immediately after a deployment, roll back the deployment before trying other fixes. See KB-12-16.
Step 5: Verify resolution (5 minutes)
- After taking action, watch the metric in Monitoring.
- The metric should begin recovering within 1–2 minutes.
- The alert should auto-resolve within 2–5 minutes of recovery.
- Ask AI Pilot: "Is [server-name] healthy now? Please confirm metrics are back to normal."
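One way to make "the metric is recovering" concrete is to require several consecutive readings below the critical threshold before you call the incident resolved. The sketch below assumes you can read the recent values off the Monitoring chart; the function name and the default of three consecutive readings are assumptions, not product behavior.

```python
def has_recovered(samples, critical_threshold, required_consecutive=3):
    """Return True if the newest readings are all below the critical threshold.

    samples:              metric readings ordered oldest to newest.
    critical_threshold:   the value above which the alert fires (e.g. 95 for CPU %).
    required_consecutive: hypothetical number of consecutive healthy readings
                          required before treating the metric as recovered.
    """
    if len(samples) < required_consecutive:
        return False  # not enough data yet to judge
    latest = samples[-required_consecutive:]
    return all(value < critical_threshold for value in latest)


if __name__ == "__main__":
    print(has_recovered([97, 96, 80, 70, 65], critical_threshold=95))
```

Requiring consecutive readings guards against declaring success on a single dip while the metric is still oscillating around the threshold.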
Step 6: Post-incident notes (10 minutes, after resolution)
- Go to the Alert Events log and record the full timeline.
- Ask AI Pilot: "Please summarize what happened on [server-name] during this incident and what preventive steps would reduce recurrence."
- Act on the AI's recommendations: adjust alert thresholds, create a goal trigger, scale the server, or add a monitoring rule.
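When recording the timeline, two numbers are usually worth writing down: how long the alert fired before someone acknowledged it, and how long it took to resolve. A minimal sketch of that arithmetic follows. The timestamp format and the field names are assumptions; copy the actual times from the Alert Events log in whatever format it shows.

```python
from datetime import datetime

# Assumed timestamp format for this example, e.g. "2024-01-01 10:00".
TIME_FORMAT = "%Y-%m-%d %H:%M"


def incident_durations(fired_at, acknowledged_at, resolved_at):
    """Return time-to-acknowledge and time-to-resolve as timedelta objects."""
    fired = datetime.strptime(fired_at, TIME_FORMAT)
    acknowledged = datetime.strptime(acknowledged_at, TIME_FORMAT)
    resolved = datetime.strptime(resolved_at, TIME_FORMAT)
    return {
        "time_to_acknowledge": acknowledged - fired,
        "time_to_resolve": resolved - fired,
    }


if __name__ == "__main__":
    durations = incident_durations(
        "2024-01-01 10:00", "2024-01-01 10:02", "2024-01-01 10:20"
    )
    for name, value in durations.items():
        print(name, value)
```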
Quick reference: critical alert response by metric
| Alert | Immediate priority | First check | Common fix |
|---|---|---|---|
| CPU Critical (>95%) | Sites may be slow/unresponsive | Recent deployments, runaway processes | Restart suspect service; rollback deployment |
| RAM Critical (>95%) | Risk of OOM kills | Memory leak pattern, PHP worker count | Restart PHP-FPM; reduce worker count |
| Disk Critical (>90%) | Risk of write failures | Log files, backup staging, uploads | Delete large log files; expand disk |
| Load Critical (>3× cores) | Server unresponsive under load | Process queue, I/O wait | Identify top process; reduce concurrent workers |
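The thresholds in the table can be expressed as a small check, which is handy when you are reading raw numbers over SSH rather than from the dashboard. This is a sketch using the table's default values; the function and parameter names are hypothetical, and your configured alert thresholds may differ.

```python
def critical_alerts(cpu_pct, ram_pct, disk_pct, load_avg, cores):
    """Return the names of metrics that exceed the critical thresholds above.

    cpu_pct, ram_pct, disk_pct: utilization as percentages (0 to 100).
    load_avg:                   system load average.
    cores:                      number of CPU cores on the server.
    """
    checks = {
        "CPU": cpu_pct > 95,
        "RAM": ram_pct > 95,
        "Disk": disk_pct > 90,
        "Load": load_avg > 3 * cores,  # load is critical above 3x the core count
    }
    return [name for name, breached in checks.items() if breached]


if __name__ == "__main__":
    print(critical_alerts(cpu_pct=97, ram_pct=60, disk_pct=91, load_avg=4.0, cores=4))
```

Note that load is judged relative to core count: a load average of 4.0 is well within limits on a 4-core server, where the critical line sits above 12.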
What success looks like
- Alert transitions from Firing to Resolved within 10–15 minutes of the fix.
- Affected sites are reachable and responding normally.
- Post-incident preventive action is identified and scheduled.