Incident Response with AI Pilot

Who this is for

Anyone who has received a critical alert and wants to use AI Pilot to diagnose and respond to the incident faster.

What you will complete

You will use AI Pilot as the first responder for a firing alert, from initial diagnosis through proposing and executing a fix.

Before you begin

  • AI Pilot must be enabled with Read Access (at minimum) for diagnosis.
  • Write Access is required to execute a fix through AI Pilot.
  • Navigate to AI Pilot in the left sidebar.

AI Pilot as a first responder

When a critical alert fires, AI Pilot can:

  1. Read the alert context: which server, which metric, and when it started
  2. Pull the relevant logs and current metrics
  3. Diagnose the most likely cause
  4. Propose a remediation action for your approval

This does not replace human judgment: you always approve or deny the proposed action. It can, however, shorten the investigation from 15–30 minutes of manual log review to 2–3 minutes of AI-assisted diagnosis.


Step-by-step: incident response workflow

Step 1: Acknowledge the alert

  1. Go to Alerts → Events tab.
  2. Find the firing alert.
  3. Click Acknowledge to signal you are investigating.
  4. Optional: click Snooze for 1 hour to suppress repeat notifications while you investigate.

Step 2: Open AI Pilot

  1. Go to AI Pilot in the left sidebar.
  2. Describe the incident with the server name and metric. Example:

"I have a critical CPU alert firing on server-name. It has been above 95% for the past 10 minutes. Please diagnose what is causing it."

Step 3: Review the AI diagnosis

AI Pilot will:

  • Read the current CPU metrics and recent history
  • Check active processes and load average if SSH access is enabled
  • Review recent deployments and operations that may correlate
  • Identify the most likely cause (runaway process, deployment spike, resource exhaustion)
  • Explain the diagnosis in plain English
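
If you want to cross-check the diagnosis by hand, the process and load-average check can be approximated on the server itself. The sketch below is an illustration only and is not how AI Pilot works internally: `top_cpu_processes` and `snapshot` are hypothetical helper names, and it assumes a Linux host where `ps -eo pid,pcpu,comm` is available and Python 3.9 or later.

```python
import os
import subprocess


def top_cpu_processes(ps_output: str, limit: int = 5) -> list[tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` output and return the heaviest CPU consumers."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, %cpu, command (command may contain spaces)
        if len(parts) != 3:
            continue
        pid, pcpu, comm = parts
        try:
            rows.append((int(pid), float(pcpu), comm.strip()))
        except ValueError:
            continue  # ignore rows that do not parse as numbers
    rows.sort(key=lambda row: row[1], reverse=True)  # highest CPU first
    return rows[:limit]


def snapshot(limit: int = 5) -> dict:
    """Collect the load average and top CPU consumers on the local host (assumes Linux)."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return {
        "load_average": os.getloadavg(),  # 1, 5, and 15 minute averages
        "top_processes": top_cpu_processes(ps_output, limit),
    }
```

Calling `snapshot()` over SSH on the affected server gives a quick second opinion: if the process AI Pilot names is not near the top of `top_processes`, ask it to re-check before approving any action.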

Step 4: Approve the proposed fix (if applicable)

If the AI identifies a fixable cause, it will propose an action. Common proposals for CPU incidents:

  • Restart the specific service consuming excess CPU
  • Kill a specific runaway process
  • Trigger a deployment rollback if the spike started with a recent deployment

Review the approval card carefully. If the proposed fix looks correct, click Allow.
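
When the proposal is a deployment rollback, one sanity check before clicking Allow is whether the spike really began shortly after the deployment. The following is a minimal sketch of that check, not a product feature: `spike_follows_deployment` is a hypothetical helper, and the 15-minute window is an assumed value you should tune to your own release process.

```python
from datetime import datetime, timedelta


def spike_follows_deployment(
    spike_start: datetime,
    deployed_at: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed correlation window
) -> bool:
    """Return True when the spike began at or after the deployment and within `window`."""
    delay = spike_start - deployed_at
    return timedelta(0) <= delay <= window


deployed = datetime(2024, 1, 1, 12, 0)
print(spike_follows_deployment(datetime(2024, 1, 1, 12, 5), deployed))   # spike 5 min after deploy
print(spike_follows_deployment(datetime(2024, 1, 1, 11, 55), deployed))  # spike began before deploy
```

A spike that started before the deployment, or long after it, is weaker evidence that a rollback will help, so it is worth asking AI Pilot for alternative causes in that case.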

Step 5: Verify resolution

After the fix executes:

  1. Watch the CPU metric in the Monitoring dashboard; it should begin dropping.
  2. If the fix was effective, the alert should auto-resolve within 1–2 minutes.
  3. Ask AI Pilot: "Can you confirm server-name CPU is back to normal?"
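
A single low reading does not prove recovery, because CPU can dip briefly and climb again. As a rough sketch of what "back to normal" can mean, the check below requires several consecutive samples under the alert threshold. `is_recovered` is a hypothetical helper, the 95% threshold comes from the example alert above, and the three-sample requirement is an assumption.

```python
def is_recovered(samples: list[float], threshold: float = 95.0, required: int = 3) -> bool:
    """Return True when the last `required` samples are all below `threshold`."""
    if len(samples) < required:
        return False  # not enough data to judge yet
    return all(value < threshold for value in samples[-required:])


# CPU percentages, oldest first, e.g. read off the monitoring chart
print(is_recovered([98.0, 97.0, 60.0, 42.0, 38.0]))
print(is_recovered([98.0, 40.0, 96.0]))
```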

Step 6: Post-incident note

Ask AI Pilot: "What happened on server-name during this incident and how can we prevent it?" The AI will summarize the incident and suggest a monitoring rule, goal trigger, or configuration change to prevent recurrence.


What success looks like

  • The alert transitions from Firing to Resolved within minutes of the fix executing.
  • CPU (or the other affected metric) returns to its normal range in the monitoring chart.
  • The Activity Center shows the AI-initiated action as Completed.

Common errors and fixes

"AI Pilot says it cannot read logs for this server"
  • Cause: File and Log Access is disabled, or the server's per-server AI access is set to Read Only or No Access.
  • Fix: Enable File and Log Access under Settings → AI Agent → Agent Controls, and verify the server's per-server access level.

"AI Pilot diagnosed the issue but cannot propose a fix"
  • Cause: The required write operation category may be disabled (for example, Services and SSH operations are off).
  • Fix: Check Settings → AI Agent → Agent Controls → Fine-tune Write Actions → Server Operations → Services and SSH, and enable the category if it is off.

"The alert did not resolve after the AI executed the fix"
  • Cause: The AI fixed one cause, but the metric is still elevated due to a secondary issue.
  • Fix: Ask AI Pilot: "CPU on server-name is still high after restarting nginx. What else could be causing it?"

