Emergency Rollback Playbook
Who this is for
Anyone handling an incident in which a recent deploy, config change, or restore has broken a production site or app, and who needs to recover service quickly.
Rollback Priority Order (Fastest Recovery First)
When production is down, use this order:
- Rollback deployment (fastest, least destructive)
- Restart service (if issue is runtime-only)
- Restore from backup (slower, more destructive)
- Cloud snapshot restore (slowest, disaster recovery path)
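The ordering above can be read as an escalation ladder: try the fastest path first and move down only when a path has been tried and did not restore service. A minimal Python sketch of that idea follows. The names `RecoveryPath` and `next_path` are illustrative only and are not part of any CloudAIPilot API.

```python
from enum import IntEnum


class RecoveryPath(IntEnum):
    """Recovery paths, numbered from fastest to slowest (hypothetical names)."""
    ROLLBACK_DEPLOYMENT = 1
    RESTART_SERVICE = 2
    RESTORE_FROM_BACKUP = 3
    CLOUD_SNAPSHOT_RESTORE = 4


def next_path(already_tried):
    """Return the fastest path not yet tried, or None if all are exhausted."""
    for path in RecoveryPath:  # enums iterate in definition order
        if path not in already_tried:
            return path
    return None
```

For example, once the deployment rollback and a service restart have both been tried without success, `next_path({RecoveryPath.ROLLBACK_DEPLOYMENT, RecoveryPath.RESTART_SERVICE})` returns `RecoveryPath.RESTORE_FROM_BACKUP`.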
Step 0 — Stabilize Incident Communication
Before technical steps:
- Notify your team that rollback is in progress.
- Pause new deployments temporarily.
- Assign one person to execute rollback and one person to monitor user impact.
Step 1 — Roll Back the Latest Deployment
For Sites
- Go to Site detail → Deployments.
- Find the last known good deployment (green/success).
- Click Rollback.
- Wait for completion in Activity Center.
For Apps
- Go to App detail → Deployments.
- Select the last known good image/commit.
- Click Rollback.
- Confirm the app status returns to running.
Step 2 — Validate Recovery
After rollback completes:
- Open the production URL in a private/incognito window.
- Check core user actions (homepage, login, checkout/API endpoint).
- Check server CPU/memory and app logs for fresh errors.
If service is restored, keep the rollback state and begin root cause analysis before any re-deploy.
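The manual checks above can be backed by a quick scripted smoke test. The sketch below uses only the Python standard library and assumes that a 2xx response means an endpoint is healthy; the function names and any URLs passed in are placeholders, not part of the product. It supplements the browser check and the log review and does not replace them.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_endpoints(urls, fetch=fetch_status):
    """Map each URL to True if it answered with a 2xx status."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = status is not None and 200 <= status < 300
    return results
```

Call `check_endpoints` with the production URLs for the homepage, login page, and a key API endpoint, and treat any `False` entry as a sign that recovery is incomplete. The `fetch` parameter exists so the logic can be exercised without network access.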
Step 3 — If Deployment Rollback Fails
If rollback also fails:
- Check SSH connectivity (rollback requires server access).
- Check disk space (rollback can fail if the disk is full).
- If the disk is full, delete old backups/logs or resize the disk.
- Retry rollback.
If rollback still fails, proceed to backup restore.
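The SSH and disk space checks above can be sketched in Python as follows. This is an illustration under stated assumptions, not the commands the platform itself runs: `ssh_port_reachable` only confirms that the SSH port accepts TCP connections (it does not test authentication), and `disk_free_fraction` must be run on the server itself. Both function names are hypothetical.

```python
import shutil
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def disk_free_fraction(path="/"):
    """Return the free share of the filesystem holding path, from 0.0 to 1.0."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total
```

A free fraction close to zero suggests clearing old backups or logs before retrying the rollback. What counts as "too full" depends on the size of the deployment artifacts, so no fixed threshold is assumed here.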
Step 4 — Restore from Backup (Last Resort in App Layer)
- Go to Server detail → Backups.
- Choose the most recent known-good full backup.
- Click Restore with Safe Mode enabled.
- Monitor in Activity Center.
Warning: Backup restore overwrites current files/database state. Changes made after the backup point are lost.
See KB-05-05: Restore from Backup.
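To make the selection rule and the warning concrete, the sketch below picks the newest full backup taken before the breaking change and computes how much recent data a restore would discard. It assumes "known-good" means "created before the change that broke production". The `Backup` fields are hypothetical and do not reflect the actual backup record format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    """Hypothetical backup record; field names are illustrative only."""
    backup_id: str
    kind: str  # "full" or "incremental"
    created_at: datetime


def latest_known_good_full(backups, breaking_change_at):
    """Return the newest full backup created before the breaking change, or None."""
    candidates = [
        b for b in backups
        if b.kind == "full" and b.created_at < breaking_change_at
    ]
    return max(candidates, key=lambda b: b.created_at, default=None)


def data_loss_window(backup, now):
    """Return the timedelta of changes that restoring this backup would discard."""
    return now - backup.created_at
```

If the resulting window is larger than the business can tolerate, that is a reason to spend more time on deployment rollback before falling back to a restore.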
Step 5 — Provider Snapshot Restore (Disaster Path)
If server-level corruption exists (disk failure, severe misconfiguration):
- Restore from a cloud snapshot using provider console (AWS/GCP/Azure/DO).
- Re-import the server in CloudAIPilot if needed.
This path is slower but can recover from deep infrastructure damage.
Post-Rollback Checklist
After service is stable:
- [ ] Keep incident timeline notes (what failed, when, what fixed it)
- [ ] Preserve failed deploy logs for analysis
- [ ] Create a root cause issue/ticket
- [ ] Add a pre-deploy backup policy if missing
- [ ] Add a staging verification gate before production deploys
Related Articles
- KB-12-01: Deployment Failed Playbook
- KB-05-05: Restore from Backup
- KB-03-06: Rollback Site Deployment
- KB-04-11: App Deployment Rollback