
Rolling Back Deployments and Recovering from Failures

Even with thorough testing, issues can slip through. This guide covers rapid rollback procedures, disaster recovery strategies, and pattern-based deployments that minimize blast radius. You'll learn how to recover from image corruptions, registry outages, cluster failures, and application regressions.

Kubernetes Rollback Fundamentals

Kubernetes keeps a revision history for each Deployment (bounded by spec.revisionHistoryLimit, 10 by default). Rolling back means reverting to a previous working state.

Kubectl Rollout Undo

View rollout history:

kubectl rollout history deployment/myapp -n production

# Output:
# REVISION  CHANGE-CAUSE
# 1         Deployed v1.0.0
# 2         Deployed v1.0.1
# 3         Deployed v1.1.0
# 4         Deployed v1.1.1 (CURRENT - BROKEN)

Rollback to previous revision:

kubectl rollout undo deployment/myapp -n production
# Rolls back to the previous revision (revision 3 here); it does not check whether that revision was healthy

# Check status
kubectl rollout status deployment/myapp -n production -w

Rollback to specific revision:

kubectl rollout undo deployment/myapp \
  --to-revision=2 \
  -n production

# Verify
kubectl get deployment myapp -o yaml -n production | grep image:

View detailed change history:

kubectl rollout history deployment/myapp --revision=3 -n production
# Shows the pod template recorded for revision 3

Under the Hood

Kubernetes keeps the ReplicaSet of each past revision. When you roll back:

# 1. The old ReplicaSet (rev 3) is scaled up to the desired replica count
kubectl get rs -n production -l app=myapp

# Output:
# NAME                 DESIRED  CURRENT  READY
# myapp-abc123def4     0        0        0       (rev 4, scaled down)
# myapp-xyz789ghi5     3        3        3       (rev 3, scaled up) <-- NOW RUNNING

# 2. Pods in the old ReplicaSet start with the previous image
kubectl describe replicaset myapp-xyz789ghi5 -n production | grep Image:
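To see which revision a plain `kubectl rollout undo` would target before running it, the history output can be parsed. The following is a minimal sketch, not a kubectl feature: `previous_revision` is a hypothetical helper that reads `kubectl rollout history` output on standard input and prints the revision listed just before the newest one, assuming revisions appear in ascending order as in the example output above.

```shell
#!/usr/bin/env bash
# previous_revision: hypothetical helper (not part of kubectl).
# Reads `kubectl rollout history` output on stdin and prints the revision
# listed just before the newest one. Fails if fewer than two revisions exist.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ { prev = cur; cur = $1 }   # remember the last two revision numbers
    END { if (prev == "") exit 1; print prev }
  '
}

# Example (requires cluster access):
#   kubectl rollout history deployment/myapp -n production | previous_revision
```

Header lines and the resource name line are skipped because their first field is not numeric, so the helper does not depend on the exact column layout.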

Helm Rollback

Helm maintains release history across deployments.

View Release History

helm history myapp -n production

# Output:
# REVISION  UPDATED                   STATUS      CHART            APP VERSION
# 1         Wed Mar 19 10:30:00 2025  superseded  myapp-1.0.0      1.0.0
# 2         Wed Mar 19 14:00:00 2025  superseded  myapp-1.0.1      1.0.1
# 3         Wed Mar 20 08:15:00 2025  superseded  myapp-1.1.0      1.1.0
# 4         Wed Mar 20 15:30:00 2025  deployed    myapp-1.1.1      1.1.1 (CURRENT - BROKEN)
# 5         Wed Mar 21 02:45:00 2025  failed      myapp-1.2.0      1.2.0

Rollback to Previous Release

# Simple: rollback to immediately prior release
helm rollback myapp -n production

# Rollback to specific revision
helm rollback myapp 3 -n production --wait

# Verify
helm status myapp -n production
kubectl get pods -n production -l app=myapp

Atomic Rollback (Recommended)

Automatically roll back if the upgrade does not become healthy:

helm upgrade myapp ./chart \
  --namespace production \
  --values values.yaml \
  --atomic \
  --wait \
  --timeout 5m

# If the release is not ready within the timeout
# (for example, a readiness probe keeps failing),
# Helm automatically rolls back to the previous release

Rollback with Audit Annotation

# helm rollback has no message flag; the flag for cleaning up on failure is --cleanup-on-fail
helm rollback myapp 2 \
  --cleanup-on-fail \
  -n production \
  --wait

# Optionally annotate for audit trail
kubectl patch deployment myapp \
  -p '{"metadata":{"annotations":{"rolled-back-to-revision":"2","reason":"image-regression"}}}' \
  -n production

Pre-Deployment Checklist

Before deploying any update, verify readiness:

1. Image Verification

# Fetch and inspect image
IMAGE="registry.cleanstart.com/python3:3.12.5-prod"

# Pull and inspect the configured health check
# (docker inspect returns a JSON array, hence the .[0])
docker pull "$IMAGE"
docker inspect "$IMAGE" | jq '.[0].Config.Healthcheck'

# Verify signature (cosign)
cosign verify \
  --certificate-identity-regexp "https://github.com/cleanstart/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  "$IMAGE"

# Extract SBOM
syft "$IMAGE" -o json > image.sbom.json

# Quick vulnerability scan
grype "$IMAGE" --fail-on critical

2. Helm Chart Validation

# Lint chart
helm lint ./chart

# Validate values
helm template myapp ./chart \
  --namespace production \
  --values values-prod.yaml | kubectl apply --dry-run=client -f -

# Check for template errors
helm template myapp ./chart --debug

3. Kubernetes Cluster Health

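# Check node status
kubectl get nodes
# All should be Ready

# Check etcd health (if you have access)
# componentstatuses is deprecated since Kubernetes v1.19; the readyz endpoint is the current alternative
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'

# Monitor resource availability
kubectl top nodes
kubectl describe nodes | grep -E "Allocatable|Allocated resources"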

4. Database Migration Status (if applicable)

# If deployment includes DB schema changes
kubectl exec -it deployment/myapp -n production -- \
  ./scripts/db_migration_status.sh

# Verify previous migration completed
# Never deploy if migration still running

5. Current Pod Status

# All pods ready and stable
kubectl get pods -n production -l app=myapp
kubectl describe pods -n production -l app=myapp | grep -E "Conditions:|Ready"

Pre-Deployment Checklist Script

Before deploying, the reference script performs the following checks:

Image verification: Verifies the image signature using cosign and scans for critical vulnerabilities using grype
Helm chart validation: Lints the chart and performs a dry-run deployment to ensure validity
Kubernetes cluster health: Verifies cluster connectivity and ensures no nodes are in NotReady or SchedulingDisabled states
Current deployment status: Checks that all current pods are in Running or Succeeded states
Resource availability: Reports the allocatable memory on cluster nodes so it can be compared with the application's requirement

Once all checks pass, the system confirms readiness to deploy. Here's a reference script:

#!/bin/bash
# pre-deploy-checklist.sh

set -e

NAMESPACE=${NAMESPACE:-production}
IMAGE=${IMAGE:-registry.cleanstart.com/python3:3.12.5-prod}
CHART_DIR=${CHART_DIR:-./chart}

echo "=== PRE-DEPLOYMENT CHECKLIST ==="

echo "Checking image verification..."
cosign verify \
  --certificate-identity-regexp "https://github.com/cleanstart/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  "$IMAGE" || exit 1
grype "$IMAGE" --fail-on critical || exit 1

echo "Checking Helm chart validation..."
helm lint "$CHART_DIR" || exit 1
helm template myapp "$CHART_DIR" | kubectl apply --dry-run=client -f - || exit 1

echo "Checking Kubernetes cluster health..."
kubectl cluster-info || exit 1
# --no-headers keeps the column header out of the grep input
if kubectl get nodes --no-headers | grep -E "NotReady|SchedulingDisabled"; then
  echo "Unhealthy or cordoned nodes found"
  exit 1
fi

echo "Checking current deployment status..."
# Select pods whose phase is neither Running nor Succeeded; any match fails the check
if kubectl get pods -n "$NAMESPACE" -l app=myapp --no-headers \
    --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null | grep -q .; then
  echo "Some pods are not in the Running or Succeeded phase"
  exit 1
fi

echo "Checking resource availability..."
# Sums the numeric part of each node's allocatable memory (typically reported in Ki)
AVAILABLE_MEM=$(kubectl describe nodes | grep "Allocatable:" -A 10 | grep "memory:" | awk '{sum+=$2} END {print sum}')
REQUIRED_MEM="1Gi"
echo "  Available memory: ${AVAILABLE_MEM}Ki (required: $REQUIRED_MEM)"

echo ""
echo "All checks passed. Ready to deploy."

Run before deployment:

chmod +x pre-deploy-checklist.sh
./pre-deploy-checklist.sh
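The reference script prints the available and required memory but leaves the comparison to the reader, partly because Kubernetes quantities use different unit suffixes. A minimal sketch of that comparison follows. The helper names `to_kib` and `has_enough_memory` are hypothetical, and only integer quantities with a Ki, Mi, or Gi suffix (or plain bytes) are handled.

```shell
#!/usr/bin/env bash
# to_kib: hypothetical helper that converts a Kubernetes memory quantity
# (integer with Ki, Mi, or Gi suffix, or plain bytes) to KiB.
to_kib() {
  local q="$1" num unit
  num="${q%%[!0-9]*}"      # leading digits
  unit="${q#"$num"}"       # whatever follows the digits
  if [ -z "$num" ]; then
    echo "unsupported quantity: $q" >&2
    return 1
  fi
  case "$unit" in
    Ki) echo "$num" ;;
    Mi) echo $(( num * 1024 )) ;;
    Gi) echo $(( num * 1024 * 1024 )) ;;
    "") echo $(( num / 1024 )) ;;          # plain bytes
    *)  echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# has_enough_memory AVAILABLE REQUIRED: succeeds when AVAILABLE >= REQUIRED.
has_enough_memory() {
  local avail req
  avail=$(to_kib "$1") || return 2
  req=$(to_kib "$2") || return 2
  [ "$avail" -ge "$req" ]
}
```

In the script, a call such as `has_enough_memory "${AVAILABLE_MEM}Ki" "$REQUIRED_MEM" || exit 1` would turn the printed values into an actual gate, assuming the summed value is in Ki.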

Blue/Green Deployment Pattern

Deploy new version alongside old. Switch traffic when new is healthy.

Architecture

Before:
  Traffic → Load Balancer → Blue (v1.0.0, 3 pods)

After (canary phase):
  Traffic → Load Balancer → Blue (v1.0.0, 3 pods)
                          → Green (v1.1.0, 3 pods)  [0% traffic initially]

After (switch):
  Traffic → Load Balancer → Blue (v1.0.0, 3 pods)  [0% traffic now]
                          → Green (v1.1.0, 3 pods)  [100% traffic]

Implementation with Helm

values-blue.yaml:

appVersion: blue

replicaCount: 3

image:
  repository: registry.cleanstart.com/python3
  tag: "3.12.4-prod"

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
  hosts:
    - host: myapp.example.com
      paths:
        - path: /
          pathType: Prefix

# Traffic weight: 100% blue, 0% green (defined in Ingress)

values-green.yaml:

appVersion: green

replicaCount: 3

image:
  repository: registry.cleanstart.com/python3
  tag: "3.12.5-prod"  # NEW VERSION

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
  hosts:
    - host: myapp.example.com
      paths:
        - path: /
          pathType: Prefix

Ingress.yaml (traffic split):

# Main ingress (routes to blue). It must NOT carry the canary annotations:
# ingress-nginx needs one regular ingress for the host before a canary ingress takes effect.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-bluegreen
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-blue
            port:
              number: 8080
---
# Canary ingress (routes subset of traffic to green)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-green-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "0"  # Increase gradually: 10, 25, 50, 100
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-green
            port:
              number: 8080

Deployment Process

Phase 1: Deploy Green (0% traffic)

helm install myapp-green ./chart \
  --namespace production \
  --values values-green.yaml \
  --wait

# Verify green deployment is healthy
kubectl get pods -n production -l version=green
kubectl exec -it deployment/myapp-green -n production -- curl localhost:8080/health

Phase 2: Gradual Traffic Shift

# Update Ingress annotation to shift 10% traffic to green
kubectl patch ingress myapp-green-canary \
  -n production \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value":"10"}]'

# Monitor green deployment metrics for 5 minutes
# Check error rate, latency, resource usage
sleep 300
kubectl top pods -n production -l version=green

Phase 3: Increase Traffic (25%, 50%, 75%, 100%)

for weight in 25 50 75 100; do
  kubectl patch ingress myapp-green-canary \
    -n production \
    --type='json' \
    -p="[{\"op\": \"replace\", \"path\": \"/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight\", \"value\":\"$weight\"}]"

  echo "Traffic shift to $weight%. Waiting 5 minutes..."
  sleep 300

  # Check metrics
  kubectl logs -n production -l version=green --tail=50 | grep -i error
done

Phase 4: Complete Switch

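# Green now receiving 100% traffic through the canary ingress

# Point the main ingress at green first, so traffic no longer depends on the canary rule
kubectl patch ingress myapp-bluegreen \
  -n production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value":"myapp-green"}]'

# Clean up canary ingress
kubectl delete ingress myapp-green-canary -n production

# Blue can be deleted (or kept for quick rollback)
helm uninstall myapp-blue -n production
# OR keep it for 24 hours as fallback

# Relabel green as the main deployment (this changes the label, not the resource name)
kubectl patch deployment myapp-green \
  -n production \
  -p '{"metadata":{"labels":{"version":"production"}}}'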

Rollback during Blue/Green:

# If green has issues during traffic shift, immediately revert
kubectl patch ingress myapp-green-canary \
  -n production \
  --type='json' \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value":"0"}]'

# All traffic back to blue
# Investigate green issues (logs, metrics)
# Once fixed, retry traffic shift
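The traffic-shift loop above inspects logs by eye, while the revert command is run manually. The decision between the two can be made explicit with a small gate. The sketch below is an illustration only: `within_error_budget` and `next_action` are hypothetical helpers, and the error and request counts are assumed to come from whatever metrics source is available (for example, a Prometheus query or an access-log count).

```shell
#!/usr/bin/env bash
# within_error_budget ERRORS TOTAL MAX_PCT
# Succeeds when ERRORS/TOTAL is at most MAX_PCT percent.
# Uses integer arithmetic only, so it works in plain bash.
within_error_budget() {
  local errors="$1" total="$2" max_pct="$3"
  [ "$total" -gt 0 ] || return 1      # no traffic observed: do not promote
  [ $(( errors * 100 )) -le $(( total * max_pct )) ]
}

# next_action ERRORS TOTAL MAX_PCT
# Prints "promote" or "rollback" for the current traffic step.
next_action() {
  if within_error_budget "$@"; then
    echo promote
  else
    echo rollback
  fi
}
```

Inside the loop, a `rollback` result would trigger the canary-weight patch to "0" and a `break`, instead of continuing to the next weight. Treating zero observed requests as a failure is a deliberately conservative choice.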

Canary Deployment with Flagger

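For more sophisticated progressive delivery, use Flagger, which automates canary analysis and traffic shifting on top of a service mesh or ingress controller (Istio in the examples below). Argo Rollouts is a separate tool that serves a similar purpose.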

Flagger Installation

# Add Flagger Helm repo
helm repo add flagger https://flagger.app
helm repo update

# Install Flagger for Istio with its bundled Prometheus
helm install flagger flagger/flagger \
  --namespace istio-system \
  --set meshProvider=istio \
  --set prometheus.install=true

# Install the load tester that the canary webhooks call
helm install flagger-loadtester flagger/loadtester \
  --namespace production

Canary Resource Definition

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  # Target Kubernetes deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  # The service that clients hit
  service:
    port: 8080

  # Progressive traffic shift
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -sd 'test' http://myapp-canary:8080/api/v1/health"
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"

  # Set to true to promote without running the analysis
  skipAnalysis: false

  # Set to true to pause Flagger's handling of this canary
  suspend: false

Deploy Update via Canary

# Patch deployment with new image
kubectl set image deployment/myapp \
  myapp=registry.cleanstart.com/python3:3.12.5-prod \
  -n production

# Flagger automatically:
# 1. Creates canary ReplicaSet (v1.1.0)
# 2. Routes 5% traffic to canary
# 3. Runs acceptance + load tests
# 4. Monitors metrics (success rate, latency)
# 5. Gradually increases traffic (5% → 10% → 15% → ... → 50%)
# 6. If metric thresholds breached, auto-rollback

# Monitor progress
kubectl describe canary myapp -n production
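The `interval`, `stepWeight`, `maxWeight`, and `threshold` values in the Canary resource determine how long a rollout takes. The sketch below works through that arithmetic using the relationships given in the Flagger documentation: minimum promotion time is interval × (maxWeight / stepWeight), and a failing canary is rolled back after interval × threshold. The function names are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# canary_weights STEP MAX
# Prints the traffic weights a canary passes through.
canary_weights() {
  local step="$1" max="$2" w weights=()
  [ "$step" -gt 0 ] || return 1
  for (( w = step; w <= max; w += step )); do
    weights+=("$w")
  done
  echo "${weights[*]}"
}

# min_promotion_minutes STEP MAX INTERVAL_MINUTES
# Minimum time to validate and promote: interval * (maxWeight / stepWeight).
min_promotion_minutes() {
  local step="$1" max="$2" interval_min="$3"
  echo $(( interval_min * (max / step) ))
}

# rollback_minutes INTERVAL_MINUTES THRESHOLD
# Time until a failing canary is rolled back: interval * threshold.
rollback_minutes() {
  echo $(( $1 * $2 ))
}
```

With the values from the resource above (stepWeight 5, maxWeight 50, interval 1m, threshold 5), `canary_weights 5 50` lists ten steps, promotion takes at least 10 minutes, and a failing canary is rolled back after about 5 minutes.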

GitOps Rollback

If using ArgoCD or Flux, rollback via git commit.

ArgoCD Rollback

Current Application State:

# Sync status and deployed Git revision in ArgoCD
kubectl get application myapp -n argocd \
  -o jsonpath='{.status.sync.status} {.status.sync.revision}{"\n"}'

Rollback to Previous Git Commit:

# Find previous working commit
git log --oneline | head -10

# Output:
# abc1234 fix: update CleanStart image to 3.12.5-prod (CURRENT - BROKEN)
# def5678 chore: update Helm chart version to 1.1.1
# ghi9012 fix: update CleanStart image to 3.12.4-prod (PREVIOUS - WORKING)

# Revert the broken commit (creates a new commit that undoes it)
git revert --no-edit abc1234
git push origin main

# ArgoCD detects the new commit and syncs automatically if auto-sync is enabled.
# To enable auto-sync:
kubectl patch application myapp -n argocd \
  -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}' \
  --type merge

Immediate Rollback without a Revert Commit:

# Hard reset to previous commit (not recommended: it rewrites history and loses the audit trail)
git reset --hard ghi9012
git push origin main --force  # Use with caution!

# ArgoCD syncs to new HEAD
argocd app sync myapp --prune

Flux Rollback

List past reconciliations:

flux get helmreleases myapp -n production

Rollback to previous release:

# Identify previous working Helm release revision
flux get helmreleases myapp -n production
# or
kubectl get helmrelease myapp -n production -o yaml | grep "lastReleaseRevision:"

# Suspend reconciliation so Flux does not re-apply the broken release
flux suspend helmrelease myapp -n production

# Manually roll back Helm release
helm rollback myapp 3 -n production

Database Migration Rollback Considerations

Critical rule: Never roll back the application after a database migration has been applied unless the previous version can still work with the new schema.

Safe Database Upgrade Pattern

1. Deploy new application version (code that reads both old and new schema)
2. Run migration (schema change deployed to database)
3. Monitor for issues
4. Only then:
   Option A: Move forward (code using new schema only)
   Option B: Keep backward-compatible code and roll back the application if needed

Avoid:

Rolling back the application after the migration has been applied (data corruption risk)
Deploying breaking schema changes without backward-compatible code first

Migration Rollback Example

Safe migration:

-- Phase 1: Add new column (application sees both old and new)
ALTER TABLE users ADD COLUMN new_email VARCHAR(255);

-- Phase 2: Copy data
UPDATE users SET new_email = email;

-- Phase 3: Drop old column (only after code updated)
ALTER TABLE users DROP COLUMN email;

Application code (during phase 1-2):

# Reads both old and new columns
user_email = row.get('new_email') or row.get('email')

If Migration Fails

# 1. Immediately pause the rollout
kubectl rollout pause deployment/myapp -n production

# 2. Rollback application (code that handles both schema versions), then resume
helm rollback myapp 1 -n production
kubectl rollout resume deployment/myapp -n production

# 3. Rollback schema change (here: reverse the ALTER from phase 1)
kubectl exec -it pod/postgres-0 -n production -- \
  psql mydb -c "ALTER TABLE users DROP COLUMN new_email;"

# 4. Investigation (logs, DB state)
# 5. Fix and retry

Registry Outage / Image Pull Failures

CleanStart registry unreachable? Containers can't pull new images.

Prevention: Local Image Cache

Option 1: Pre-pull images to all nodes

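# DaemonSet that pre-pulls critical images onto every node.
# Each init container references one image, so the kubelet pulls it into the node's cache.
# (A busybox container cannot pull images itself: it has neither docker nor ctr.)
# Assumes each image provides a shell; adjust the command if it does not.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-cache-warmer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-cache-warmer
  template:
    metadata:
      labels:
        app: image-cache-warmer
    spec:
      initContainers:
      - name: warm-python3
        image: registry.cleanstart.com/python3:3.12.5-prod
        command: ["sh", "-c", "true"]
      - name: warm-nodejs
        image: registry.cleanstart.com/nodejs:20.10.0-prod
        command: ["sh", "-c", "true"]
      - name: warm-go
        image: registry.cleanstart.com/go:1.21.5-prod
        command: ["sh", "-c", "true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9  # keeps the pod alive after the pulls finish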

Option 2: Fallback registry (mirror)

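# Configure Docker to use mirror
# /etc/docker/daemon.json (JSON does not allow comments; Docker falls back to the upstream registry on its own)
# Note: Docker's registry-mirrors setting applies only to Docker Hub pulls.
# For other registries, configure the mirror in the container runtime (for example, containerd's hosts.toml).
{
  "registry-mirrors": [
    "https://mirror1.cleanstart.local",
    "https://gcr.io/mirror-project"
  ]
}

# Restart Docker
systemctl restart docker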

Recovery: Image Pull Failure

When registry is down:

# Kubernetes shows ImagePullBackOff status
kubectl get pods -n production
# NAME                     READY  STATUS             RESTARTS
# myapp-abc123def-xyz1     0/1    ImagePullBackOff   0

# Check error
kubectl describe pod myapp-abc123def-xyz1 -n production
# Events:
# ... Failed to pull image "registry.cleanstart.com/python3:3.12.5-prod":
#     rpc error: code = Unknown desc = failed to pull and unpack image ...

Step 1: Check registry status

curl -v https://registry.cleanstart.com/v2/ || echo "Registry down"

Step 2: If the registry is truly down, fall back to a cached image

# Option A: Use local mirror if available
kubectl set image deployment/myapp \
  myapp=mirror-registry.local/myapp:cached-v1.0.0 \
  -n production

# Option B: If node has image cached, force reschedule
kubectl delete pod myapp-abc123def-xyz1 -n production
# Kubernetes will retry, potentially finding image in node cache

# Option C: Pull image on node directly
docker pull registry.cleanstart.com/python3:3.12.5-prod || \
  docker load < /backup/python3-3.12.5-prod.tar

Step 3: Wait for registry recovery

# Monitor until registry online
watch -n 5 'curl -s https://registry.cleanstart.com/v2/ && echo "Registry OK" || echo "Down"'

# Once online, restart the deployment so its pods pull the image again
# (annotating a single pod does not trigger a new pull)
kubectl rollout restart deployment/myapp -n production
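The wait-then-restart step can also run unattended instead of watching the terminal. A minimal sketch follows, built around a hypothetical `retry_until` helper that runs any command up to a fixed number of times with a pause between attempts. The attempt count and delay are arbitrary choices, not recommendations.

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (requires network and cluster access):
#   retry_until 120 5 curl -s -o /dev/null https://registry.cleanstart.com/v2/ \
#     && kubectl rollout restart deployment/myapp -n production
```

The example checks only that the registry answers at all. A registry that requires authentication may answer `/v2/` with HTTP 401, so adding curl's `-f` flag would need an authenticated request.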

Base Image Corruption / Cosign Verification Failure

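If cosign verify fails, the base image may be corrupted or compromised.

# Verification fails
cosign verify \
  --certificate-identity-regexp "https://github.com/cleanstart/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  registry.cleanstart.com/python3:3.12.5-prod
# Error: signature verification failed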

Steps:

# 1. Do NOT deploy this image

# 2. Verify network connectivity (not MITM)
curl -v https://registry.cleanstart.com/

# 3. Check cosign status
cosign version

# 4. Try mirror registry if available
cosign verify ghcr.io/cleanstart-mirror/python3:3.12.5-prod

# 5. Report to CleanStart security team
# security@cleanstart.dev

# 6. Fallback to last known-good image
kubectl set image deployment/myapp \
  myapp=registry.cleanstart.com/python3:3.12.4-prod \
  -n production

# 7. Investigate
# - Check image build logs at CleanStart
# - Verify build pipeline integrity
# - Request re-sign if build legitimate

Cluster Failure / Full Recovery

If the entire cluster fails (unlikely but possible), recovery depends on backups taken beforehand:

Backup Strategy

Regular etcd snapshots:

# For managed Kubernetes (GKE, EKS, AKS): automatic
# For self-managed: manual cron job

ETCD_BACKUP_DIR="/mnt/etcd-backups"
ETCD_ENDPOINT="https://etcd.your-cluster.local:2379"

# Snapshot
ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINT \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save $ETCD_BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db

# Verify
ls -lh $ETCD_BACKUP_DIR/

PersistentVolume backups:

# Velero (cloud-agnostic backup tool)
velero backup create daily-backup \
  --include-namespaces production,staging

# Verify backup
velero backup describe daily-backup

Restore Process

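# 1. Partial failure (control plane hosts still available): restore the snapshot
#    into a fresh data directory. snapshot restore works on local files, so it is
#    run on the etcd host and takes --data-dir rather than --endpoints.
ETCDCTL_API=3 etcdctl snapshot restore $ETCD_BACKUP_DIR/etcd-20250321.db \
  --data-dir /var/lib/etcd-restored
# Then point the etcd static pod manifest at the restored data directory

# 2. Restart control plane
kubectl delete pods -n kube-system kube-apiserver-master-01

# 3. If cluster completely lost, restore from Velero
velero restore create --from-backup daily-backup

# 4. Verify restored objects
kubectl get all -n production
kubectl get pvc -n production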
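The backup job names snapshots `etcd-YYYYMMDD-HHMMSS.db`, so file names sort chronologically and the newest snapshot can be selected by name when restoring. A minimal sketch, assuming that naming scheme; `latest_snapshot` is a hypothetical helper, not an etcdctl feature.

```shell
#!/usr/bin/env bash
# latest_snapshot DIR
# Prints the path of the newest etcd-*.db file in DIR, judged by file name.
# Relies on the timestamped naming scheme, under which name order equals time order.
# Fails if DIR contains no snapshot.
latest_snapshot() {
  local dir="$1" f latest=""
  for f in "$dir"/etcd-*.db; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    latest="$f"               # globs expand in sorted order, so the last match is the newest
  done
  [ -n "$latest" ] && echo "$latest"
}

# Example (run on the etcd host):
#   SNAPSHOT=$(latest_snapshot /mnt/etcd-backups) || exit 1
#   ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" --data-dir /var/lib/etcd-restored
```

Selecting by name rather than by modification time keeps the result stable even if files were copied or touched after creation.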

Post-Incident Review Checklist

After any rollback or incident:

[ ] Root cause identified: What actually broke?
[ ] Preventability: Could this have been caught in testing?
[ ] Timeline documented: Exact times and actions taken
[ ] Logs preserved: Collect logs before cleanup
[ ] Fix implemented: Patch applied or configuration changed
[ ] Test coverage added: New test that would have caught this
[ ] Runbook updated: Incident guide saved for future reference
[ ] Team debriefed: Share learnings
[ ] Monitoring improved: New alert rule added
[ ] Documentation updated: Update this playbook if new scenario discovered

Post-incident summary (template):

## Incident Report: Image Upgrade Regression

**Date**: 2025-03-21
**Duration**: 1h 20m (14:30 - 15:50 UTC)
**Impact**: API service intermittently unavailable (500 errors on 5% of requests)

### Timeline
- 14:30: Deployed CleanStart Python 3.12.5-prod to production
- 14:45: Canary traffic shift to 10%, no issues observed
- 15:15: Traffic shift to 100%, errors spike in logs
- 15:25: Incident declared, rollback initiated
- 15:40: Rolled back to Python 3.12.4-prod
- 15:50: All metrics green, incident resolved

### Root Cause
New cryptography library (openssl 3.1.5) has performance regression in our workload.
CPU spike from 2 cores → 8 cores per pod on certain request patterns.

### Fix
Allocate additional resources for Python 3.12.5 migration:
- CPU request: 250m → 500m
- CPU limit: 500m → 1000m
- Add 2 additional replicas temporarily

### Prevention
- Add load test with realistic request patterns in CI/CD
- Test with actual production-like resource constraints
- Add resource metric checks to pre-deployment validation

### New Alerts
- Pod CPU > 800m for 5 minutes
- Pod memory > 400Mi

Rollback Testing in CI/CD

Always test rollback before production.

# GitHub Actions rollback test
name: Test Rollback Procedure

on:
  pull_request:
    paths:
      - "Dockerfile"
      - "helm/**"
      - ".github/workflows/test-rollback.yml"

jobs:
  test-rollback:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Create kind cluster
        uses: helm/kind-action@v1.7.0

      - name: Deploy v1.0.0 (baseline)
        run: |
          helm install myapp ./helm \
            --values helm/values.yaml \
            --set image.tag=v1.0.0 \
            --wait

      - name: Verify v1.0.0 running
        run: |
          kubectl get pods | grep myapp
          kubectl logs deployment/myapp | grep "started"

      - name: Deploy v1.1.0 (new version)
        run: |
          helm upgrade myapp ./helm \
            --values helm/values.yaml \
            --set image.tag=v1.1.0 \
            --wait

      - name: Verify v1.1.0 running
        run: |
          kubectl get pods | grep myapp
          kubectl logs deployment/myapp | grep "version 1.1.0"

      - name: Test rollback
        run: |
          helm rollback myapp \
            --wait \
            --timeout 2m

      - name: Verify rollback successful
        run: |
          kubectl get pods | grep myapp
          kubectl logs deployment/myapp | grep "started"

      - name: Confirm back on v1.0.0
        run: |
          IMAGE=$(kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].image}')
          [[ "$IMAGE" == *"v1.0.0"* ]] || exit 1

Run the Test on Every Dockerfile Change

Before pushing a PR, run the local rollback test. The test should deploy v1.0.0, upgrade to v1.1.0, roll back to v1.0.0, and verify that the system returns to its original state. A passing run confirms that the rollback procedure works.

# Local test before pushing PR
./test-rollback-locally.sh

Summary

Rollback capabilities are your safety net:

Helm rollback: Revert entire release to previous revision
Kubectl rollout undo: Revert deployment to previous replicaset
Blue/Green: Deploy alongside old version, switch traffic when verified
Canary: Gradually shift traffic while monitoring metrics
GitOps: Revert git commits to trigger automatic rollback

Always:

Test rollback procedure in CI/CD
Verify pre-deployment checklist
Use atomic flag to auto-rollback on probe failure
Document incidents and improve monitoring
Never roll back after a database migration unless the previous version can work with the new schema
Keep fallback registries and image caches

Time to rollback from incident to resolved: Under 15 minutes when procedures are practiced.