Emergency Rollback Playbook

Who this is for

Anyone responding to an incident in which a recent deploy, config change, or restore has broken a production site or app, and who needs to recover service quickly.


Rollback Priority Order (Fastest Recovery First)

When production is down, use this order:

  1. Rollback deployment (fastest, least destructive)
  2. Restart service (if issue is runtime-only)
  3. Restore from backup (slower, more destructive)
  4. Cloud snapshot restore (slowest, disaster recovery path)
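
The ordering above amounts to an escalation ladder: try the cheapest action first and move on only when it does not restore service. The following is a minimal Python sketch of that logic, not a description of how the platform works internally. The function name `run_recovery_ladder` and the stub actions are hypothetical; in practice each action is a manual step in the dashboard or provider console.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (name, action). The action is a hypothetical callable that
# returns True when service is healthy after the step has been carried out.
RecoveryStep = Tuple[str, Callable[[], bool]]


def run_recovery_ladder(steps: List[RecoveryStep]) -> Optional[str]:
    """Try each step in order; return the name of the first step that
    restores service, or None if every step fails."""
    for name, action in steps:
        print(f"Trying: {name}")
        try:
            if action():
                return name
        except Exception as exc:  # a crashed step counts as a failed step
            print(f"{name} raised an error: {exc}")
    return None


# Stub actions for illustration only: here the rollback does not help,
# but the restart does, so the slower restore paths are never attempted.
example_steps: List[RecoveryStep] = [
    ("rollback deployment", lambda: False),
    ("restart service", lambda: True),
    ("restore from backup", lambda: True),
    ("cloud snapshot restore", lambda: True),
]
print(run_recovery_ladder(example_steps))
```

The point of the early return is that destructive steps (backup and snapshot restore) run only after the less destructive ones have failed.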

Step 0 — Stabilize Incident Communication

Before technical steps:

  1. Notify your team that rollback is in progress.
  2. Pause new deployments temporarily.
  3. Assign one person to execute rollback and one person to monitor user impact.

Step 1 — Roll Back the Latest Deployment

For Sites

  1. Go to Site detail → Deployments.
  2. Find the last known good deployment (green/success).
  3. Click Rollback.
  4. Wait for completion in Activity Center.

For Apps

  1. Go to App detail → Deployments.
  2. Select the last known good image/commit.
  3. Click Rollback.
  4. Confirm the app status returns to running.

Step 2 — Validate Recovery

After rollback completes:

  1. Open the production URL in a private/incognito window.
  2. Check core user actions (homepage, login, checkout/API endpoint).
  3. Check server CPU/memory and app logs for fresh errors.

If service is restored, keep the rollback state and begin root cause analysis before any re-deploy.
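
Part of the validation above can be scripted so it is repeatable under pressure. The sketch below uses only the Python standard library to request a few core paths and report their HTTP status codes. The function names, the example paths, and the rule that any 2xx or 3xx status counts as healthy are assumptions for illustration. It does not replace a manual check of login or checkout flows.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if the
    request could not be completed (DNS failure, timeout, refused)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as HTTPError
    except (urllib.error.URLError, OSError):
        return 0


def smoke_check(
    base_url: str,
    paths: List[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Map each path to the status code returned for base_url + path."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) for path in paths}


def all_healthy(results: Dict[str, int]) -> bool:
    """Assumption: any 2xx or 3xx status counts as healthy."""
    return all(200 <= code < 400 for code in results.values())


# Example usage (hypothetical URL and paths; replace with your own):
# results = smoke_check("https://www.example.com", ["/", "/login", "/api/health"])
# print(results, "OK" if all_healthy(results) else "STILL FAILING")
```

The `fetch` parameter exists so the check can be exercised with a fake fetcher before it is needed in a real incident.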


Step 3 — If Deployment Rollback Fails

If the rollback itself fails:

  1. Check SSH connectivity (rollback requires server access).
  2. Check disk space (rollback can fail if the disk is full).
     • If the disk is full, delete old backups/logs or resize the disk.
  3. Retry rollback.

If rollback still fails, proceed to backup restore (Step 4).
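
The two pre-checks in this step can be approximated with the Python standard library. This is a sketch under stated assumptions: `ssh_port_reachable` only confirms that a TCP connection to the SSH port opens (it does not test authentication), and `disk_free_percent` must be run on the server itself. The hostname and the 10% threshold in the usage comment are hypothetical.

```python
import shutil
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the SSH port can be opened.
    This checks network reachability only, not login credentials."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_free_percent(path: str = "/") -> float:
    """Return free space on the filesystem containing `path`,
    as a percentage of its total size."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


# Example usage (hypothetical host; run the disk check on the server):
# print(ssh_port_reachable("server.example.com"))
# if disk_free_percent("/") < 10.0:  # assumed threshold
#     print("Low disk space: clean up old backups/logs before retrying")
```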


Step 4 — Restore from Backup (Last Resort in App Layer)

  1. Go to Server detail → Backups.
  2. Choose the most recent known-good full backup.
  3. Click Restore with Safe Mode enabled.
  4. Monitor in Activity Center.

Warning: Backup restore overwrites current files/database state. Changes made after the backup point are lost.
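
To make the warning concrete, the sketch below picks a restore point and computes how much recent data a restore would discard. It assumes that "known-good" means a full backup taken before the breaking change. The `Backup` record and its fields are hypothetical and do not reflect the platform's actual backup metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Backup:
    backup_id: str        # hypothetical identifier
    created_at: datetime  # when the backup was taken
    is_full: bool         # True for full backups, False for partial ones


def pick_restore_point(backups: List[Backup], bad_change_at: datetime) -> Optional[Backup]:
    """Return the newest full backup taken before the breaking change,
    or None if no such backup exists."""
    candidates = [b for b in backups if b.is_full and b.created_at < bad_change_at]
    return max(candidates, key=lambda b: b.created_at, default=None)


def data_loss_window(backup: Backup, now: datetime) -> timedelta:
    """Everything written between the backup and `now` is lost on restore."""
    return now - backup.created_at


# Illustrative data only.
backups = [
    Backup("b1", datetime(2024, 1, 1, 2, 0), True),
    Backup("b2", datetime(2024, 1, 2, 2, 0), True),
    Backup("b3", datetime(2024, 1, 2, 14, 0), False),  # not a full backup
    Backup("b4", datetime(2024, 1, 3, 2, 0), True),    # taken after the bad change
]
chosen = pick_restore_point(backups, bad_change_at=datetime(2024, 1, 2, 18, 0))
if chosen is not None:
    lost = data_loss_window(chosen, now=datetime(2024, 1, 3, 9, 0))
    print(f"Restore {chosen.backup_id}; changes from the last {lost} would be lost")
```

Note that the newest backup is not necessarily the right one: a backup taken after the breaking change may already contain the damage.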

See KB-05-05: Restore from Backup.


Step 5 — Provider Snapshot Restore (Disaster Path)

If server-level corruption exists (disk failure, severe misconfiguration):

  • Restore from a cloud snapshot using your provider's console (AWS/GCP/Azure/DO).
  • Re-import the server in CloudAIPilot if needed.

This path is slower but can recover from deep infrastructure damage.


Post-Rollback Checklist

After service is stable:

  • [ ] Keep incident timeline notes (what failed, when, what fixed it)
  • [ ] Preserve failed deploy logs for analysis
  • [ ] Create a root cause issue/ticket
  • [ ] Add a pre-deploy backup policy if missing
  • [ ] Add a staging verification gate before production deploys

Related Articles