Incident Response with AI Pilot

Published · Last updated: May 2026 · 4 min read

Who this is for

Anyone who has received a critical alert and wants to use AI Pilot to diagnose and respond to the incident faster.

What you will complete

Use AI Pilot as your first responder for a firing alert — from initial diagnosis through to proposing and executing a fix.

Before you begin

AI Pilot must be enabled with Read Access (at minimum) for diagnosis.
Write Access is required to execute a fix via AI Pilot.
Navigate to AI Pilot in the left sidebar.

AI Pilot as a first responder

When a critical alert fires, AI Pilot can:

Read the alert context — which server, which metric, when it started
Pull the relevant logs and current metrics
Diagnose the most likely cause
Propose a remediation action for your approval

This does not replace human judgment: you always approve or deny each proposed action. What it does is shorten the investigation from 15–30 minutes of manual log diving to 2–3 minutes of AI-assisted diagnosis.

Step-by-step: incident response workflow

Step 1: Acknowledge the alert

Go to Alerts → Events tab.
Find the firing alert.
Click Acknowledge to signal you are investigating.
Optional: click Snooze for 1 hour to suppress repeat notifications while you investigate.

Step 2: Open AI Pilot

Go to AI Pilot in the left sidebar.
Describe the incident with the server name and metric. Example:

"I have a critical CPU alert firing on server-name. It has been above 95% for the past 10 minutes. Please diagnose what is causing it."

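If you write incident prompts often, a small template keeps the server name, metric, threshold, and duration from being left out. The sketch below is only an illustration: `AlertContext` and `build_incident_prompt` are hypothetical names, not part of AI Pilot, and the output is plain text that you would paste into the AI Pilot chat yourself.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    """Hypothetical container for the details read off the firing alert."""
    server: str
    metric: str
    threshold_percent: int
    duration_minutes: int


def build_incident_prompt(alert: AlertContext) -> str:
    """Format the alert details into a single diagnostic request."""
    return (
        f"I have a critical {alert.metric} alert firing on {alert.server}. "
        f"It has been above {alert.threshold_percent}% for the past "
        f"{alert.duration_minutes} minutes. "
        "Please diagnose what is causing it."
    )


# Values taken from the example prompt in this step.
prompt = build_incident_prompt(AlertContext("server-name", "CPU", 95, 10))
print(prompt)
```

The point of the template is that the prompt always names the server and the metric, which is what Step 2 asks for.
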
Step 3: Review the AI diagnosis

AI Pilot will:

Read the current CPU metrics and recent history
Check active processes and load average if SSH access is enabled
Review recent deployments and operations that may correlate
Identify the most likely cause (runaway process, deployment spike, resource exhaustion)
Explain the diagnosis in plain English

Step 4: Approve the proposed fix (if applicable)

If the AI identifies a fixable cause, it will propose an action. Common proposals for CPU incidents:

Restart the specific service consuming excess CPU
Kill a specific runaway process
Trigger a deployment rollback if the spike started with a recent deployment

Review the approval card carefully. If the proposed fix looks correct, click Allow.
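When the proposal is a rollback, it helps to check the timing yourself before clicking Allow. The sketch below shows one way to reason about "the spike started with a recent deployment". It is not AI Pilot's internal logic: the function name, the 15-minute window, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def spike_follows_deployment(spike_start, deployments, window=timedelta(minutes=15)):
    """Return the latest deployment within `window` before the spike, or None.

    The 15-minute default is an illustrative assumption, not a product setting.
    """
    candidates = [
        deployed_at
        for deployed_at in deployments
        if timedelta(0) <= spike_start - deployed_at <= window
    ]
    return max(candidates) if candidates else None


# Hypothetical timestamps: one old deployment, one shortly before the spike.
deployments = [datetime(2026, 5, 4, 9, 0), datetime(2026, 5, 4, 13, 52)]
spike_start = datetime(2026, 5, 4, 14, 0)

match = spike_follows_deployment(spike_start, deployments)
print(match)
```

If no deployment falls inside the window, a rollback is less likely to be the right fix, and a service restart or process kill may deserve a closer look.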

Step 5: Verify resolution

After the fix executes:

Watch the CPU metric in the Monitoring dashboard — it should begin dropping.
If the fix was effective, the alert should auto-resolve within 1–2 minutes.
Ask AI Pilot: "Can you confirm server-name CPU is back to normal?"
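"Back to normal" is easier to judge against an explicit rule than against a single reading. The sketch below treats the metric as recovered only when several consecutive recent samples sit below the alert threshold. The function name, the three-sample requirement, and the sample values are assumptions for illustration; the 95% threshold comes from the example alert in Step 2, and your own alert rule may differ.

```python
def is_recovered(samples, threshold=95.0, required_consecutive=3):
    """True when the most recent `required_consecutive` samples are all below threshold."""
    if len(samples) < required_consecutive:
        return False  # not enough data yet to call it resolved
    return all(value < threshold for value in samples[-required_consecutive:])


# Hypothetical CPU readings (percent), oldest first, copied from the dashboard.
after_fix = [98, 97, 96, 80, 62, 41]
still_flapping = [98, 97, 80, 96, 41]

print(is_recovered(after_fix))
print(is_recovered(still_flapping))
```

Requiring several consecutive samples avoids declaring success on a brief dip while the underlying cause is still active.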

Step 6: Post-incident note

Ask AI Pilot: "What happened on server-name during this incident and how can we prevent it?" The AI will summarize the incident and suggest a monitoring rule, goal trigger, or configuration change to prevent recurrence.

What success looks like

Alert transitions from Firing to Resolved within minutes of the fix executing.
CPU (or other affected metric) returns to normal range in the monitoring chart.
The Activity Center shows the AI-initiated action as Completed.

Common errors and fixes

"AI Pilot says it cannot read logs for this server"
Cause: File and Log Access is disabled, or the server's per-server AI access is set to Read Only or No Access.
Fix: Enable File and Log Access under Settings → AI Agent → Agent Controls, and verify the server's per-server access level.

"AI Pilot diagnosed the issue but cannot propose a fix"
Cause: The required write operation category may be disabled (e.g., Services and SSH operations are off).
Fix: Check Settings → AI Agent → Agent Controls → Fine-tune Write Actions → Server Operations → Services and SSH.

"The alert did not resolve after the AI executed the fix"
Cause: The AI fixed one cause but the metric is still elevated due to a secondary issue.
Fix: Ask AI Pilot: "CPU on server-name is still high after restarting nginx. What else could be causing it?"