Operational Runbook for Red Alerts

Who this is for

On-call engineers or team leads who receive a Critical (red) alert and need a structured response procedure.

What you will complete

A clear, step-by-step response to any Critical severity alert, from initial triage through resolution and post-incident notes.


Before you start responding

A Critical alert means something is failing or about to fail and requires immediate human attention. Your first priority is to understand the scope of the impact, not to start fixing things right away.

The 90-second triage rule: Before taking any action, spend 90 seconds answering these three questions:

  1. What is failing? (Which server, which metric)
  2. Who is affected? (Which sites or apps run on this server)
  3. Is this getting worse or stabilizing? (Look at the metric trend over the past 10 minutes)
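
Question 3 can be answered by eye from the Monitoring graph, but the comparison is simple enough to sketch in code. The following is a minimal illustration, not part of the product: the function name `classify_trend`, the one-sample-per-minute input, and the `tolerance` value are all assumptions chosen for the example. It compares the average of the earlier half of the window with the average of the later half.

```python
from statistics import mean


def classify_trend(samples, tolerance=1.0):
    """Classify a metric trend over a short window.

    samples:   readings ordered oldest to newest, e.g. one per minute
               over the past 10 minutes (assumed sampling rate).
    tolerance: change, in the metric's own units, below which the
               trend is treated as flat (assumed value).
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to compare two halves")

    half = len(samples) // 2
    earlier = mean(samples[:half])   # average of the older readings
    recent = mean(samples[half:])    # average of the newer readings
    delta = recent - earlier

    if delta > tolerance:
        return "worsening"
    if delta < -tolerance:
        return "recovering"
    return "stabilizing"


# Hypothetical CPU readings (percent) from the last 10 minutes.
print(classify_trend([80, 82, 85, 88, 90, 93, 95, 96, 97, 98]))
```

A worsening trend argues for moving faster through the steps below; a stabilizing one leaves more room to diagnose before acting.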

Step-by-step response procedure

Step 1: Acknowledge and communicate (2 minutes)

  1. Go to Alerts → Events and click Acknowledge on the firing alert.
  2. Snooze it for 1 hour to suppress repeat notifications while you investigate.
  3. If your team has a status channel (Slack, etc.), post: "Investigating critical alert on [server-name] — [metric] at [value]. Will update in 10 minutes."

Step 2: Identify the scope (3 minutes)

  1. Go to Servers and check the status of the affected server.
  2. Identify which sites and apps run on this server.
  3. Check if those sites are reachable by visiting them in a browser.
  4. Note the exact start time of the alert. It approximates when the issue began and lets you correlate it with recent deployments or other changes.
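
If the server hosts many sites, the browser check in item 3 can be scripted. The sketch below uses only Python's standard library; the function name `check_site`, the 5-second timeout, and the injectable `opener` parameter (included so the function can be exercised without network access) are assumptions for illustration, and the site URLs you pass in are your own.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (reachable, detail) for a single URL.

    A site counts as reachable when it answers with an HTTP status
    below 400. Any HTTP error, connection failure, or timeout is
    reported as not reachable.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures, timeouts.
        return False, f"unreachable: {exc}"
```

Calling `check_site` once per hosted site gives a quick list of what is actually down, which is more reliable than inferring impact from the server status alone.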

Step 3: Diagnose with AI Pilot (5 minutes)

  1. Go to AI Pilot.
  2. Send: "I have a critical [metric] alert on [server-name] that started at [time]. Please diagnose what is causing it."
  3. Read the AI's diagnosis. The AI will check metrics, logs, recent operations, and active processes.
  4. If the AI cannot diagnose via read access alone, it will tell you what additional information it needs.

Step 4: Decide on a response

Based on the diagnosis, choose one of:

Option A — AI Pilot fix: If the AI proposes a specific action (service restart, rollback, cleanup), review the approval card and click Allow if it is correct.

Option B — Manual fix: If you know the fix (e.g., clear disk space, restart a specific service), do it manually via SSH or the server management interface.

Option C — Escalate: If the issue is beyond what you can fix immediately (data corruption, hardware failure, provider incident), escalate to the appropriate team or contact cloud provider support.

Option D — Rollback: If the alert started immediately after a deployment, roll back the deployment first. See KB-12-16.
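
The choice between the four options can be written down as a small decision function. This is only a sketch: the `Diagnosis` fields and the order in which the options are tested are assumptions, apart from checking for a recent deployment first, which follows Option D. Falling back to escalation when nothing else applies is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Hypothetical summary of what Step 3 found."""
    started_after_deployment: bool = False
    beyond_immediate_fix: bool = False  # data corruption, hardware failure, provider incident
    ai_proposed_action: bool = False
    known_manual_fix: bool = False


def choose_response(diagnosis):
    """Return the option letter (A, B, C, or D) to follow."""
    if diagnosis.started_after_deployment:
        return "D"  # roll back the deployment first
    if diagnosis.beyond_immediate_fix:
        return "C"  # escalate
    if diagnosis.ai_proposed_action:
        return "A"  # review the AI Pilot approval card
    if diagnosis.known_manual_fix:
        return "B"  # apply the fix manually
    return "C"      # no clear fix: escalate (assumed fallback)


print(choose_response(Diagnosis(ai_proposed_action=True)))
```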

Step 5: Verify resolution (5 minutes)

  1. After taking action, watch the metric in Monitoring.
  2. The metric should begin recovering within 1–2 minutes.
  3. The alert should auto-resolve within 2–5 minutes of recovery.
  4. Ask AI Pilot: "Is [server-name] healthy now? Please confirm metrics are back to normal."
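
One way to make "recovering" concrete is to require several consecutive readings below the critical threshold before you treat the fix as working. The sketch below assumes you have recent metric samples to hand; the name `has_recovered` and the default of three consecutive readings are illustrative choices, not documented product behavior.

```python
def has_recovered(samples, critical_threshold, required_consecutive=3):
    """Return True when the newest readings are all below the threshold.

    samples:              readings ordered oldest to newest.
    critical_threshold:   value above which the alert fires, e.g. 95 for CPU.
    required_consecutive: how many of the newest readings must be below
                          the threshold (assumed default).
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")

    recent = samples[-required_consecutive:]
    if len(recent) < required_consecutive:
        return False  # not enough data yet to judge
    return all(value < critical_threshold for value in recent)


# Hypothetical CPU readings (percent) taken after a service restart.
print(has_recovered([97, 96, 90, 80, 70], critical_threshold=95))
```

Requiring more than one reading avoids declaring success on a single dip that bounces back above the threshold.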

Step 6: Post-incident notes (10 minutes after resolution)

  1. Go to the Alert Events log and record the full timeline.
  2. Ask AI Pilot: "Please summarize what happened on [server-name] during this incident and what preventive steps would reduce recurrence."
  3. Act on the AI's recommendations: adjust alert thresholds, create a goal trigger, scale the server, or add a monitoring rule.

Quick reference: critical alert response by metric

| Alert | Immediate priority | First check | Common fix |
|---|---|---|---|
| CPU Critical (>95%) | Sites may be slow/unresponsive | Recent deployments, runaway processes | Restart suspect service; rollback deployment |
| RAM Critical (>95%) | Risk of OOM kills | Memory leak pattern, PHP worker count | Restart PHP-FPM; reduce worker count |
| Disk Critical (>90%) | Risk of write failures | Log files, backup staging, uploads | Delete large log files; expand disk |
| Load Critical (>3× cores) | Server unresponsive under load | Process queue, I/O wait | Identify top process; reduce concurrent workers |
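
The critical thresholds in this quick reference can be expressed as a small check, which is handy for testing a reading by hand or in a script. The function name and the metric keys (`cpu`, `ram`, `disk`, `load`) are assumptions for this sketch; only the threshold values come from the quick reference.

```python
def is_critical(metric, value, cpu_cores=1):
    """Return True when a reading crosses the Critical threshold.

    metric:    one of "cpu", "ram", "disk" (percent) or "load" (load average).
    value:     the current reading.
    cpu_cores: number of cores, used only for the load threshold.
    """
    if metric in ("cpu", "ram"):
        return value > 95
    if metric == "disk":
        return value > 90
    if metric == "load":
        return value > 3 * cpu_cores
    raise ValueError(f"unknown metric: {metric!r}")


# A load average of 13 on a 4-core server exceeds 3 x 4 = 12.
print(is_critical("load", 13, cpu_cores=4))
```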

What success looks like

  • Alert transitions from Firing to Resolved within 10–15 minutes of the fix.
  • Affected sites are reachable and responding normally.
  • Post-incident preventive action is identified and scheduled.

Related articles