Incident Response with AI Pilot
Who this is for
Anyone who has received a critical alert and wants to use AI Pilot to diagnose and respond to the incident faster.
What you will complete
By the end of this guide, you will have used AI Pilot as the first responder for a firing alert, from initial diagnosis through proposing and executing a fix.
Before you begin
- AI Pilot must be enabled with Read Access (at minimum) for diagnosis.
- Write Access is required to execute a fix via AI Pilot.
- AI Pilot is available from the left sidebar.
AI Pilot as a first responder
When a critical alert fires, AI Pilot can:
- Read the alert context — which server, which metric, when it started
- Pull the relevant logs and current metrics
- Diagnose the most likely cause
- Propose a remediation action for your approval
This does not replace human judgment: you always approve or deny each proposed action. It can, however, shorten the investigation from 15–30 minutes of manual log diving to 2–3 minutes of AI-assisted diagnosis.
Step-by-step: incident response workflow
Step 1: Acknowledge the alert
- Go to Alerts → Events tab.
- Find the firing alert.
- Click Acknowledge to signal you are investigating.
- Optional: click Snooze for 1 hour to suppress repeat notifications while you investigate.
Step 2: Open AI Pilot
- Go to AI Pilot in the left sidebar.
- Describe the incident with the server name and metric. Example:
"I have a critical CPU alert firing on server-name. It has been above 95% for the past 10 minutes. Please diagnose what is causing it."
Step 3: Review the AI diagnosis
AI Pilot will:
- Read the current CPU metrics and recent history
- Check active processes and load average if SSH access is enabled
- Review recent deployments and operations that may correlate
- Identify the most likely cause (runaway process, deployment spike, resource exhaustion)
- Explain the diagnosis in plain English
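To sanity-check the diagnosis yourself, you can run the same kind of process check by hand. The sketch below is not how AI Pilot works internally; it is a minimal manual equivalent, assuming a Unix-like server where `ps -eo pid,pcpu,comm` is available. The function names `top_cpu_processes` and `read_ps` are hypothetical.

```python
import subprocess


def top_cpu_processes(ps_output: str, limit: int = 3) -> list[tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` output into (pid, cpu_percent, command)
    rows, sorted by CPU usage with the highest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed or empty lines
        pid, pcpu, command = parts
        rows.append((int(pid), float(pcpu), command.strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def read_ps() -> str:
    """Run ps on the current machine and return its raw text output.
    Assumption: you run this on the affected server, for example over SSH."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for pid, cpu, command in top_cpu_processes(read_ps()):
        print(f"{pid:>7} {cpu:>6.1f}% {command}")
```

If the process at the top of this list matches the one AI Pilot names, that is a useful independent confirmation before you approve a fix.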
Step 4: Approve the proposed fix (if applicable)
If the AI identifies a fixable cause, it will propose an action. Common proposals for CPU incidents:
- Restart the specific service consuming excess CPU
- Kill a specific runaway process
- Trigger a deployment rollback if the spike started with a recent deployment
Review the approval card carefully. If the proposed fix looks correct, click Allow.
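When the proposal is a deployment rollback, it helps to confirm that the spike really did start shortly after a deployment. The sketch below illustrates that correlation check under stated assumptions: the function name, the deployment names, and the 15-minute window are all hypothetical and are not taken from AI Pilot.

```python
from datetime import datetime, timedelta
from typing import Optional


def deployment_before_spike(
    spike_start: datetime,
    deployments: dict[str, datetime],
    window: timedelta = timedelta(minutes=15),  # assumed correlation window
) -> Optional[str]:
    """Return the name of the most recent deployment that finished within
    `window` before the spike started, or None if there is no such deployment."""
    candidates = {
        name: finished
        for name, finished in deployments.items()
        if timedelta(0) <= spike_start - finished <= window
    }
    if not candidates:
        return None
    # The latest finish time is the deployment closest to the spike.
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    spike = datetime(2024, 1, 1, 12, 0)
    history = {
        "release-41": datetime(2024, 1, 1, 9, 0),
        "release-42": datetime(2024, 1, 1, 11, 52),
    }
    print(deployment_before_spike(spike, history))
```

A result of None suggests the spike is unrelated to a recent deployment, in which case a rollback proposal deserves closer scrutiny before you click Allow.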
Step 5: Verify resolution
After the fix executes:
- Watch the CPU metric in the Monitoring dashboard — it should begin dropping.
- If the fix was effective, the alert should auto-resolve within 1–2 minutes.
- Ask AI Pilot: "Can you confirm server-name CPU is back to normal?"
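A single low reading does not prove the incident is over, because CPU can dip briefly and climb again. As a rough illustration of what "back to normal" can mean, the sketch below treats the metric as recovered only when several consecutive samples sit below the alert threshold. The function name and the choice of three samples are assumptions; the 95% default mirrors the example alert in Step 2.

```python
def is_recovered(samples: list[float], threshold: float = 95.0, required: int = 3) -> bool:
    """Return True when the most recent `required` samples are all below
    `threshold`. Samples are ordered oldest first."""
    if len(samples) < required:
        return False  # not enough data yet to call it resolved
    return all(value < threshold for value in samples[-required:])


if __name__ == "__main__":
    cpu_percent = [98.0, 97.0, 80.0, 60.0, 40.0]  # hypothetical readings
    print(is_recovered(cpu_percent))
```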
Step 6: Post-incident note
Ask AI Pilot: "What happened on server-name during this incident and how can we prevent it?" The AI will summarize the incident and suggest a monitoring rule, goal trigger, or configuration change to prevent recurrence.
What success looks like
- Alert transitions from Firing to Resolved within minutes of the fix executing.
- CPU (or other affected metric) returns to normal range in the monitoring chart.
- The Activity Center shows the AI-initiated action as Completed.
Common errors and fixes
"AI Pilot says it cannot read logs for this server"
Cause: File and Log Access is disabled, or the server's per-server AI access is set to Read Only or No Access.
Fix: Enable File and Log Access under Settings → AI Agent → Agent Controls, and verify the server's per-server access level.
"AI Pilot diagnosed the issue but cannot propose a fix"
Cause: The required write operation category may be disabled (e.g., Services and SSH operations are off).
Fix: Check Settings → AI Agent → Agent Controls → Fine-tune Write Actions → Server Operations → Services and SSH.
"The alert did not resolve after the AI executed the fix"
Cause: The AI fixed one cause but the metric is still elevated due to a secondary issue.
Fix: Ask AI Pilot: "CPU on server-name is still high after restarting nginx. What else could be causing it?"