Incident Escalation and Diagnostic Telemetry

Escalation Protocol

The CloudAIPilot orchestration engine is highly autonomous, but critical infrastructure failures or unexpected cloud provider behavior may occasionally require human intervention. If an operation fails and neither the AI Pilot nor the automated Troubleshooting Playbooks can resolve it, escalate the incident to the CloudAIPilot Site Reliability Engineering (SRE) support team.

To ensure rapid triage and resolution, the SRE team needs specific diagnostic and architectural context. This context never includes credentials or key material.

Required Telemetry for Triage

Vague reports (e.g., "The deployment failed") drastically increase time-to-resolution. Your escalation payload must include:

  1. Unique Resource Identifiers: The exact Server ID and Site/App ID (located in your browser's URL or the resource overview dashboard). This allows the SRE team to query the centralized audit logs securely.
  2. Activity Center Logs: If a background task failed, open the Activity Center, expand the failed operation, click View Log, and copy the exact script output or stack trace.
  3. Temporal Context: The exact time and timezone at which the anomaly occurred, so that our systems can correlate the failure with wider cloud provider outages.
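
As an illustration only, the three items above can be collected into one structured record before you open a ticket. The sketch below is not a CloudAIPilot API: the class name, field names, and sample IDs are all hypothetical. It mainly shows one way to make the timestamp unambiguous by requiring a UTC offset and emitting ISO 8601.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone, timedelta
import json


@dataclass
class EscalationPayload:
    # Hypothetical field names that mirror the three required items.
    server_id: str
    site_id: str
    activity_log: str
    occurred_at: str  # ISO 8601 timestamp including a UTC offset


def build_payload(server_id, site_id, activity_log, occurred_at):
    # Reject naive datetimes: without an offset the time is ambiguous.
    if occurred_at.tzinfo is None or occurred_at.utcoffset() is None:
        raise ValueError("occurred_at must carry timezone information")
    return EscalationPayload(
        server_id=server_id,
        site_id=site_id,
        activity_log=activity_log.strip(),
        occurred_at=occurred_at.isoformat(),
    )


if __name__ == "__main__":
    # Placeholder values; substitute the IDs and log text from your own account.
    when = datetime(2024, 5, 1, 14, 30, tzinfo=timezone(timedelta(hours=2)))
    payload = build_payload("srv-123", "app-456", "Traceback ...", when)
    print(json.dumps(asdict(payload), indent=2))
```

Rejecting timestamps without an offset is a deliberate choice here: a bare "14:30" cannot be correlated with provider outage windows, whereas "2024-05-01T14:30:00+02:00" can.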

Strict Security and Zero-Trust Mandates

CloudAIPilot adheres to a strict Zero-Trust security model. When interacting with support:

  • NEVER transmit database credentials, private SSH keys, or Cloud Provider Access Tokens via support tickets or email.
  • CloudAIPilot personnel will never request your raw cryptographic materials.
  • If deep debugging is required, you must explicitly grant "Support Access" via a secure, temporary authorization toggle in your Server Settings. This establishes a time-bounded, audited proxy connection for our engineers.
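
Log output copied from the Activity Center can contain secrets by accident. The sketch below shows one possible best-effort filter to run over log text before pasting it into a ticket. It is not a CloudAIPilot tool; the patterns and the redact function are assumptions chosen for illustration, and they will not catch every secret format, so review the output manually as well.

```python
import re

# Hypothetical patterns; extend them for the secret formats you actually use.
# Matches PEM-style private key blocks, including OpenSSH and RSA variants.
PEM_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)
# Matches key=value or key: value pairs whose key name suggests a secret.
KEY_VALUE = re.compile(
    r"([\w.-]*(?:password|passwd|secret|token|api_key)[\w.-]*)(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text):
    """Return text with private key blocks and secret-looking values masked."""
    text = PEM_BLOCK.sub("[REDACTED PRIVATE KEY]", text)
    # Keep the key name and separator so the log line stays readable.
    return KEY_VALUE.sub(r"\1\2[REDACTED]", text)


if __name__ == "__main__":
    print(redact("DB_PASSWORD=hunter2 connecting to db"))
```

The key name is kept and only the value is masked, so the SRE team can still see which setting was involved.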

Initial Self-Diagnosis

Before escalating, we recommend analyzing the local server telemetry:

  1. Navigate to the specific Server view.
  2. Access the Logs tab to view the live system stream.
  3. Review the logs for obvious upstream failures, such as "No space left on device", "Out of Memory" (OOM), or "Cloud provider API limit reached".
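
If you have saved a copy of the log stream, the review in step 3 can be partly automated. The following is a minimal sketch, assuming the three failure messages appear in the log roughly as written above. The labels, regular expressions, and find_failures function are hypothetical and not part of CloudAIPilot.

```python
import re

# The three signatures named in step 3; the regexes themselves are
# assumptions about how these messages appear in the log stream.
FAILURE_PATTERNS = {
    "disk_full": re.compile(r"No space left on device", re.IGNORECASE),
    "out_of_memory": re.compile(r"Out of memory|\bOOM\b", re.IGNORECASE),
    "api_rate_limit": re.compile(r"API limit reached", re.IGNORECASE),
}


def find_failures(log_text):
    """Return (line_number, label, line) for each line matching a known signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "boot ok\nwrite failed: No space left on device\n"
    for number, label, line in find_failures(sample):
        print(f"line {number}: {label}: {line}")
```

A match only indicates a likely cause. If the fix is not obvious, include the matching lines in your escalation.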
