Even with thorough testing, issues can slip through. This guide covers rapid rollback procedures, disaster recovery strategies, and deployment patterns that minimize blast radius. You'll learn how to recover from image corruption, registry outages, cluster failures, and application regressions.
Kubernetes Rollback Fundamentals
Kubernetes keeps a revision history for each Deployment (the 10 most recent revisions by default, set by revisionHistoryLimit). Rolling back means reverting the Deployment to one of those earlier revisions.
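To make "previous revision" concrete: a plain rollout undo targets the second-highest revision in the history. The sketch below uses a hypothetical helper named `previous_revision`, which is not part of kubectl. It reads the tabular output of `kubectl rollout history` on standard input and prints that revision number, assuming the default table layout.

```shell
# previous_revision: hypothetical helper, not a kubectl feature.
# Reads `kubectl rollout history` output on stdin and prints the
# second-highest revision number. Fails if fewer than two revisions exist.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ {
      r = $1 + 0
      if (r > max)       { prev = max; max = r }
      else if (r > prev) { prev = r }
      n++
    }
    END { if (n < 2) exit 1; print prev }
  '
}

# Possible usage:
#   kubectl rollout history deployment/myapp -n production | previous_revision
```

A non-zero exit status signals that there is nothing to roll back to, which a wrapper script can treat as a hard stop.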
Kubectl Rollout Undo
View rollout history:
    kubectl rollout history deployment/myapp -n production

    # Output:
    # REVISION  CHANGE-CAUSE
    # 1         Deployed v1.0.0
    # 2         Deployed v1.0.1
    # 3         Deployed v1.1.0
    # 4         Deployed v1.1.1 (CURRENT - BROKEN)

Rollback to previous revision:

    kubectl rollout undo deployment/myapp -n production
    # Rolls back to the previous revision (3 in the history above)

    # Check status
    kubectl rollout status deployment/myapp -n production -w

Rollback to specific revision:
    kubectl rollout undo deployment/myapp \
      --to-revision=2 \
      -n production

    # Verify
    kubectl get deployment myapp -o yaml -n production | grep image:

View detailed change history:

    kubectl rollout history deployment/myapp --revision=3 -n production
    # Shows the pod template recorded for revision 3

Under the Hood
Kubernetes keeps the ReplicaSets of past revisions, scaled down to zero. When you roll back:
    # 1. Old ReplicaSet (rev 3) scaled to desired replicas
    kubectl get rs -n production -l app=myapp

    # Output:
    # NAME               DESIRED   CURRENT   READY
    # myapp-abc123def4   0         0         0       (rev 4, scaled down)
    # myapp-xyz789ghi5   3         3         3       (rev 3, scaled up) <-- NOW RUNNING

    # 2. Pods in old ReplicaSet restart with previous image
    kubectl describe replicaset myapp-xyz789ghi5 -n production | grep Image:

Helm Rollback
Helm maintains release history across deployments.
View Release History
    helm history myapp -n production

    # Output:
    # REVISION  UPDATED                   STATUS      CHART        APP VERSION
    # 1         Wed Mar 19 10:30:00 2025  superseded  myapp-1.0.0  1.0.0
    # 2         Wed Mar 19 14:00:00 2025  superseded  myapp-1.0.1  1.0.1
    # 3         Thu Mar 20 08:15:00 2025  superseded  myapp-1.1.0  1.1.0
    # 4         Thu Mar 20 15:30:00 2025  deployed    myapp-1.1.1  1.1.1 (CURRENT - BROKEN)
    # 5         Fri Mar 21 02:45:00 2025  failed      myapp-1.2.0  1.2.0

Rollback to Previous Release
    # Simple: rollback to immediately prior release
    helm rollback myapp -n production

    # Rollback to specific revision
    helm rollback myapp 3 -n production --wait

    # Verify
    helm status myapp -n production
    kubectl get pods -n production -l app=myapp

Atomic Rollback (Recommended)
Automatically roll back if the release does not become healthy:

    helm upgrade myapp ./chart \
      --namespace production \
      --values values.yaml \
      --atomic \
      --wait \
      --timeout 5m

    # If the resources are not ready within the timeout
    # (for example, a readiness probe keeps failing),
    # Helm automatically rolls back to the previous release

Rollback with Custom Message
    helm rollback myapp 2 \
      --cleanup-on-fail \
      -n production \
      --wait

    # Optionally annotate for audit trail
    kubectl patch deployment myapp \
      -p '{"metadata":{"annotations":{"rolled-back-to-revision":"2","reason":"image-regression"}}}' \
      -n production

Pre-Deployment Checklist
Before deploying any update, verify readiness:
1. Image Verification
    # Fetch and inspect image
    IMAGE="registry.cleanstart.com/python3:3.12.5-prod"

    # Pull, then check size (bytes) and configured health check
    # (docker inspect returns a JSON array, hence the .[0])
    docker pull $IMAGE
    docker inspect $IMAGE | jq '.[0] | {Size, Healthcheck: .Config.Healthcheck}'

    # Verify signature (cosign)
    cosign verify \
      --certificate-identity-regexp "https://github.com/cleanstart/.*" \
      --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
      $IMAGE

    # Extract SBOM
    syft $IMAGE -o json > image.sbom.json

    # Quick vulnerability scan
    grype $IMAGE --fail-on critical

2. Helm Chart Validation
    # Lint chart
    helm lint ./chart

    # Validate values
    helm template myapp ./chart \
      --namespace production \
      --values values-prod.yaml | kubectl apply --dry-run=client -f -

    # Check for template errors
    helm template myapp ./chart --debug

3. Kubernetes Cluster Health
    # Check node status
    kubectl get nodes
    # All should be Ready

    # Check etcd health (if you have access)
    # Note: componentstatuses is deprecated since Kubernetes v1.19 and may return incomplete data
    kubectl get componentstatuses

    # Monitor resource availability
    kubectl top nodes
    kubectl describe nodes | grep -E "Allocatable|Allocated resources"

4. Database Migration Status (if applicable)
    # If deployment includes DB schema changes
    kubectl exec -it deployment/myapp -n production -- \
      ./scripts/db_migration_status.sh

    # Verify previous migration completed
    # Never deploy if migration still running

5. Current Pod Status

    # All pods ready and stable
    kubectl get pods -n production -l app=myapp
    kubectl describe pods -n production -l app=myapp | grep -E "Conditions:|Ready"

Pre-Deployment Checklist Script
Before deploying, the checklist script performs the following checks:
- Image verification: Verifies the image signature using cosign and scans for critical vulnerabilities using grype
- Helm chart validation: Lints the chart and performs a dry-run deployment to ensure validity
- Kubernetes cluster health: Verifies cluster connectivity and ensures no nodes are in NotReady or SchedulingDisabled states
- Current deployment status: Checks that all current pods are Running or Completed (the displayed status of the Succeeded phase)
- Resource availability: Reports the total allocatable memory across cluster nodes next to the required amount
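Comparing memory figures means normalizing Kubernetes quantities such as 16384Ki and 1Gi to a common unit first. The following is a minimal sketch of that conversion, not part of the reference script; the function names `quantity_to_bytes` and `has_enough_memory` are hypothetical, and only plain byte counts and the Ki, Mi, Gi, k, M, and G suffixes are handled.

```shell
# quantity_to_bytes: hypothetical helper. Converts a Kubernetes-style
# memory quantity (e.g. 16384Ki, 1Gi, 500M) to bytes.
# Returns non-zero for suffixes it does not know.
quantity_to_bytes() {
  local q="$1" num unit
  num=${q%%[!0-9]*}      # leading digits
  unit=${q#"$num"}       # whatever follows the digits
  [ -n "$num" ] || return 1
  case "$unit" in
    "") echo "$num" ;;
    Ki) echo $(( num * 1024 )) ;;
    Mi) echo $(( num * 1024 * 1024 )) ;;
    Gi) echo $(( num * 1024 * 1024 * 1024 )) ;;
    k)  echo $(( num * 1000 )) ;;
    M)  echo $(( num * 1000000 )) ;;
    G)  echo $(( num * 1000000000 )) ;;
    *)  return 1 ;;
  esac
}

# has_enough_memory <allocatable> <required>
# Succeeds when allocatable >= required; returns 2 on unparsable input.
has_enough_memory() {
  local have need
  have=$(quantity_to_bytes "$1") || return 2
  need=$(quantity_to_bytes "$2") || return 2
  [ "$have" -ge "$need" ]
}
```

For example, `has_enough_memory 4Gi 1Gi` succeeds, while `has_enough_memory 16384Ki 1Gi` fails because 16384Ki is only 16Mi.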
Once all checks pass, the system confirms readiness to deploy. Here's a reference script:
    #!/bin/bash
    # pre-deploy-checklist.sh
    set -e

    NAMESPACE=${NAMESPACE:-production}
    IMAGE=${IMAGE:-registry.cleanstart.com/python3:3.12.5-prod}
    CHART_DIR=${CHART_DIR:-./chart}

    echo "=== PRE-DEPLOYMENT CHECKLIST ==="

    echo "Checking image verification..."
    cosign verify \
      --certificate-identity-regexp "https://github.com/cleanstart/.*" \
      --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
      "$IMAGE" || exit 1
    grype "$IMAGE" --fail-on critical || exit 1

    echo "Checking Helm chart validation..."
    helm lint "$CHART_DIR" || exit 1
    helm template myapp "$CHART_DIR" | kubectl apply --dry-run=client -f - || exit 1

    echo "Checking Kubernetes cluster health..."
    kubectl cluster-info || exit 1
    # grep succeeds only when an unhealthy node is listed
    kubectl get nodes | grep -E "NotReady|SchedulingDisabled" && exit 1

    echo "Checking current deployment status..."
    # --no-headers keeps the header row from being mistaken for an unhealthy pod;
    # pods in the Succeeded phase are displayed as "Completed"
    kubectl get pods -n "$NAMESPACE" -l app=myapp --no-headers | grep -vE "Running|Completed" && exit 1

    echo "Checking resource availability..."
    # Sums the numeric part of each node's allocatable memory (normally reported in Ki)
    AVAILABLE_MEM=$(kubectl describe nodes | grep "Allocatable:" -A 10 | grep "memory:" | awk '{sum+=$2} END {print sum}')
    REQUIRED_MEM="1Gi"
    echo "  Allocatable memory: ${AVAILABLE_MEM}Ki (required: $REQUIRED_MEM)"

    echo ""
    echo "All checks passed. Ready to deploy."

Run before deployment:

    chmod +x pre-deploy-checklist.sh
    ./pre-deploy-checklist.sh

Blue/Green Deployment Pattern
Deploy the new version alongside the old one, then switch traffic once the new version is healthy.
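Switching traffic in this pattern comes down to changing one annotation on an ingress. As a minimal sketch, assuming the ingress-nginx canary-weight annotation used later in this section, the hypothetical helper `canary_weight_patch` builds the JSON Patch document for a given percentage and rejects anything outside 0-100. The `~1` in the path is the JSON Pointer escape for the `/` inside the annotation key.

```shell
# canary_weight_patch: hypothetical helper that prints a JSON Patch
# setting the ingress-nginx canary-weight annotation to the given value.
# Rejects non-integers and values above 100.
canary_weight_patch() {
  local weight="$1"
  case "$weight" in
    ''|*[!0-9]*)
      echo "weight must be an integer between 0 and 100" >&2
      return 1 ;;
  esac
  if [ "$weight" -gt 100 ]; then
    echo "weight must be an integer between 0 and 100" >&2
    return 1
  fi
  printf '[{"op":"replace","path":"/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight","value":"%s"}]\n' "$weight"
}

# Possible usage:
#   kubectl patch ingress myapp-green-canary -n production \
#     --type=json -p "$(canary_weight_patch 10)"
```

Validating the weight before calling kubectl keeps a typo from reaching the ingress controller during an incident.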
Architecture
    Before:
      Traffic → Load Balancer → Blue (v1.0.0, 3 pods)

    After (canary phase):
      Traffic → Load Balancer → Blue (v1.0.0, 3 pods)
                              → Green (v1.1.0, 3 pods) [0% traffic initially]

    After (switch):
      Traffic → Load Balancer → Blue (v1.0.0, 3 pods) [0% traffic now]
                              → Green (v1.1.0, 3 pods) [100% traffic]

Implementation with Helm
values-blue.yaml:
    appVersion: blue
    replicaCount: 3
    image:
      repository: registry.cleanstart.com/python3
      tag: "3.12.4-prod"
    service:
      type: ClusterIP
      port: 8080
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
      hosts:
        - host: myapp.example.com
          paths:
            - path: /
              pathType: Prefix
    # Traffic weight: 100% blue, 0% green (defined in Ingress)

values-green.yaml:

    appVersion: green
    replicaCount: 3
    image:
      repository: registry.cleanstart.com/python3
      tag: "3.12.5-prod"  # NEW VERSION
    service:
      type: ClusterIP
      port: 8080
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
      hosts:
        - host: myapp.example.com
          paths:
            - path: /
              pathType: Prefix

Ingress.yaml (traffic split):
    # Primary ingress (routes to blue). It must not carry canary annotations:
    # ingress-nginx needs a regular ingress for the canary to attach to.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: myapp-bluegreen
      namespace: production
      annotations:
        nginx.ingress.kubernetes.io/rewrite-target: /
    spec:
      ingressClassName: nginx
      rules:
        - host: myapp.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: myapp-blue
                    port:
                      number: 8080
    ---
    # Canary ingress (routes subset of traffic to green)
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: myapp-green-canary
      namespace: production
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        nginx.ingress.kubernetes.io/canary-weight: "0"  # Increase gradually: 10, 25, 50, 100
    spec:
      ingressClassName: nginx
      rules:
        - host: myapp.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: myapp-green
                    port:
                      number: 8080

Deployment Process
Phase 1: Deploy Green (0% traffic)
    helm install myapp-green ./chart \
      --namespace production \
      --values values-green.yaml \
      --wait

    # Verify green deployment is healthy
    kubectl get pods -n production -l version=green
    kubectl exec -it deployment/myapp-green -n production -- curl localhost:8080/health

Phase 2: Gradual Traffic Shift

    # Update Ingress annotation to shift 10% traffic to green
    kubectl patch ingress myapp-green-canary \
      -n production \
      --type='json' \
      -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value":"10"}]'

    # Monitor green deployment metrics for 5 minutes
    # Check error rate, latency, resource usage
    sleep 300
    kubectl top pods -n production -l version=green

Phase 3: Increase Traffic (25%, 50%, 75%, 100%)

    for weight in 25 50 75 100; do
      kubectl patch ingress myapp-green-canary \
        -n production \
        --type='json' \
        -p="[{\"op\": \"replace\", \"path\": \"/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight\", \"value\":\"$weight\"}]"

      echo "Traffic shift to $weight%. Waiting 5 minutes..."
      sleep 300

      # Check metrics
      kubectl logs -n production -l version=green --tail=50 | grep -i error
    done

Phase 4: Complete Switch
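    # Green now receiving 100% traffic

    # Point the primary ingress at green before removing anything,
    # otherwise deleting the canary ingress sends all traffic back to blue
    kubectl patch ingress myapp-bluegreen \
      -n production \
      --type='json' \
      -p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value":"myapp-green"}]'

    # Clean up canary ingress
    kubectl delete ingress myapp-green-canary -n production

    # Blue can be deleted (or kept for quick rollback)
    helm uninstall myapp-blue -n production
    # OR keep it for 24 hours as fallback

    # Relabel green as the production deployment
    kubectl patch deployment myapp-green \
      -n production \
      -p '{"metadata":{"labels":{"version":"production"}}}'

Rollback during Blue/Green: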
    # If green has issues during traffic shift, immediately revert
    kubectl patch ingress myapp-green-canary \
      -n production \
      --type='json' \
      -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value":"0"}]'

    # All traffic back to blue
    # Investigate green issues (logs, metrics)
    # Once fixed, retry traffic shift

Canary Deployment with Flagger
For more sophisticated progressive delivery, use a dedicated controller. The examples below use Flagger; Argo Rollouts is a comparable alternative.
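Controllers of this kind raise the canary's traffic share in fixed steps at a fixed interval, so the rollout duration can be estimated up front. Flagger's documentation gives the minimum promotion time as interval × (maxWeight / stepWeight) and the rollback time as interval × threshold. The sketch below applies those formulas; the function names are hypothetical, and the sample values (1-minute interval, step 5, max 50, threshold 5) are the ones used in the Canary resource in this section.

```shell
# canary_weights <stepWeight> <maxWeight>
# Prints the sequence of traffic weights the canary passes through.
canary_weights() {
  local step="$1" max="$2" w out=""
  [ "$step" -gt 0 ] || return 1
  w=$step
  while [ "$w" -le "$max" ]; do
    out="${out:+$out }$w"
    w=$(( w + step ))
  done
  echo "$out"
}

# min_promotion_minutes <intervalMinutes> <stepWeight> <maxWeight>
# interval * (maxWeight / stepWeight), using integer division.
min_promotion_minutes() {
  [ "$2" -gt 0 ] || return 1
  echo $(( $1 * ($3 / $2) ))
}

# rollback_minutes <intervalMinutes> <threshold>
# interval * threshold: time until failed checks trigger a rollback.
rollback_minutes() {
  echo $(( $1 * $2 ))
}
```

With the sample values, the canary steps through ten weights, needs at least 10 minutes to be promoted, and is rolled back after about 5 minutes of consecutive failed checks.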
Flagger Installation
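    # Add Flagger Helm repo
    helm repo add flagger https://flagger.app
    helm repo update

    # Install Flagger together with its bundled Prometheus
    # (prometheus.install is the chart value that deploys Prometheus)
    helm install flagger flagger/flagger \
      --namespace istio-system \
      --set prometheus.install=true

    # Install the load tester that the Canary webhooks call.
    # It is placed next to Flagger so the short URL http://flagger-loadtester/ resolves.
    helm install flagger-loadtester flagger/loadtester \
      --namespace istio-system

Canary Resource Definition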
    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: myapp
      namespace: production
    spec:
      # Target Kubernetes deployment
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      # The service that clients hit
      service:
        port: 8080
      # Progressive traffic shift
      analysis:
        interval: 1m
        threshold: 5
        maxWeight: 50
        stepWeight: 5
        metrics:
          - name: request-success-rate
            thresholdRange:
              min: 99
            interval: 1m
          - name: request-duration
            thresholdRange:
              max: 500
            interval: 1m
        webhooks:
          - name: acceptance-test
            url: http://flagger-loadtester/
            timeout: 30s
            metadata:
              type: bash
              cmd: "curl -sd 'test' http://myapp-canary:8080/api/v1/health"
          - name: load-test
            url: http://flagger-loadtester/
            timeout: 5s
            metadata:
              cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"
      # When true, skip the analysis and promote the new version directly
      skipAnalysis: false
      # When true, pause the canary rollout
      suspend: false

Deploy Update via Canary
    # Patch deployment with new image
    kubectl set image deployment/myapp \
      myapp=registry.cleanstart.com/python3:3.12.5-prod \
      -n production

    # Flagger automatically:
    # 1. Detects the new revision and scales up the canary pods
    # 2. Routes 5% traffic to canary
    # 3. Runs acceptance + load tests
    # 4. Monitors metrics (success rate, latency)
    # 5. Gradually increases traffic (5% → 10% → 15% → ... → 50%)
    # 6. Rolls back automatically after 5 failed checks (threshold: 5)
    # 7. Promotes the new version once the analysis at 50% passes

    # Monitor progress
    kubectl describe canary myapp -n production

GitOps Rollback
If you use ArgoCD or Flux, roll back by reverting the Git commit.
ArgoCD Rollback
Current Application State (Helm values):
    # Deployed version in ArgoCD (sync status and synced Git revision)
    kubectl get application myapp -n argocd \
      -o jsonpath='{.status.sync.status}{" "}{.status.sync.revision}{"\n"}'

Rollback to Previous Git Commit:
    # Find previous working commit
    git log --oneline | head -10

    # Output:
    # abc1234 fix: update CleanStart image to 3.12.5-prod (CURRENT - BROKEN)
    # def5678 chore: update Helm chart version to 1.1.1
    # ghi9012 fix: update CleanStart image to 3.12.4-prod (PREVIOUS - WORKING)

    # Revert the broken commit
    git revert abc1234
    git push origin main

    # ArgoCD detects the new commit and syncs automatically if auto-sync is enabled.
    # To enable auto-sync:
    kubectl patch application myapp -n argocd \
      -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}' \
      --type merge

Immediate Rollback by Rewriting Git History:
    # Hard reset to previous commit (not recommended for audit trail)
    git reset --hard ghi9012
    git push origin main --force  # Use with caution!

    # ArgoCD syncs to new HEAD
    argocd app sync myapp --prune

Flux Rollback
Show the HelmRelease status:

    flux get helmreleases myapp -n production

Rollback to previous release:

    # Identify previous working Helm release revision
    helm history myapp -n production
    # or
    kubectl get helmrelease myapp -n production -o yaml | grep "lastReleaseRevision:"

    # Suspend reconciliation so Flux does not re-apply the broken release
    flux suspend helmrelease myapp -n production

    # Manually roll back Helm release
    helm rollback myapp 3 -n production

Database Migration Rollback Considerations
Critical rule: Never roll back the application after a database migration has been applied unless the version you are rolling back to can work with the migrated schema.
Safe Database Upgrade Pattern
- Deploy new application version (code that reads both old and new schema)
- Run migration (schema change deployed to database)
- Monitor for issues
- Only then:
  - Option A: Move forward (code using new schema only)
  - Option B: Keep backward-compatible code and roll back the application if needed

Avoid:
- Rolling back to application code that cannot read the migrated schema (data corruption risk)
- Deploying breaking schema changes without backward-compatible code first
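The rule behind this pattern can be expressed as a simple compatibility check. The sketch below is one possible way to encode it and is purely illustrative: it assumes schema states are numbered (1 = only the old column, 2 = both columns, 3 = only the new column) and that each application version declares the range of schema states it can run against. The function name `rollback_is_safe` and the numbering are hypothetical.

```shell
# rollback_is_safe <current_schema> <target_min_schema> <target_max_schema>
# Succeeds when the schema currently in the database lies inside the range
# that the rollback target (the older application version) supports.
rollback_is_safe() {
  local current="$1" min="$2" max="$3"
  [ "$current" -ge "$min" ] && [ "$current" -le "$max" ]
}

# Example: the old code only knows the old column, so it supports states 1-2.
#   rollback_is_safe 2 1 2   -> safe, the old column still exists
#   rollback_is_safe 3 1 2   -> unsafe, the old column has been dropped
```

A deployment script could call this before any application rollback and refuse to proceed when it fails, which forces the decision between rolling forward and reversing the migration first.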
Migration Rollback Example
Safe migration:
    -- Phase 1: Add new column (application sees both old and new)
    ALTER TABLE users ADD COLUMN new_email VARCHAR(255);

    -- Phase 2: Copy data
    UPDATE users SET new_email = email;

    -- Phase 3: Drop old column (only after code updated)
    ALTER TABLE users DROP COLUMN email;

Application code (during phase 1-2):

    # Reads both old and new columns
    user_email = row.get('new_email') or row.get('email')

If Migration Fails
    # 1. Immediately pause the rollout
    kubectl rollout pause deployment/myapp -n production

    # 2. Roll back the application (to code that handles both schema versions),
    #    then resume so the rolled-back pod template is applied
    helm rollback myapp 1 -n production
    kubectl rollout resume deployment/myapp -n production

    # 3. Roll back the schema change (reverse the ALTER from phase 1)
    kubectl exec -it pod/postgres-0 -n production -- \
      psql mydb -c "ALTER TABLE users DROP COLUMN new_email;"

    # 4. Investigation (logs, DB state)
    # 5. Fix and retry

Registry Outage / Image Pull Failures
CleanStart registry unreachable? Containers can't pull new images.
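Telling "down" apart from "up but requires authentication" matters here: a registry that answers its /v2/ endpoint with 401 is reachable. As a minimal sketch, the hypothetical helper `registry_state` classifies the HTTP status code that curl reports; curl prints 000 when it could not connect at all. The state names are illustrative.

```shell
# registry_state <http_code>
# Hypothetical classifier for the status code returned by GET /v2/.
registry_state() {
  case "$1" in
    200|401) echo "up" ;;            # 401 means auth is required, but the registry answers
    000)     echo "unreachable" ;;   # curl could not connect or timed out
    5??)     echo "degraded" ;;      # server-side errors
    *)       echo "unknown" ;;
  esac
}

# Possible usage:
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
#     https://registry.cleanstart.com/v2/)
#   registry_state "$code"
```

Only the "unreachable" and "degraded" states would justify switching to a mirror or cached image; "unknown" is better handled by a human.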
Prevention: Local Image Cache
Option 1: Pre-pull images to all nodes
    # DaemonSet that pre-pulls critical images.
    # Each init container references one image, so the kubelet pulls it onto every node.
    # (A busybox container has no docker or ctr client, so it cannot pull images itself.)
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: image-cache-warmer
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: image-cache-warmer
      template:
        metadata:
          labels:
            app: image-cache-warmer
        spec:
          initContainers:
            # Assumes each image provides a "true" binary; adjust the command otherwise
            - name: python3
              image: registry.cleanstart.com/python3:3.12.5-prod
              command: ["true"]
            - name: nodejs
              image: registry.cleanstart.com/nodejs:20.10.0-prod
              command: ["true"]
            - name: go
              image: registry.cleanstart.com/go:1.21.5-prod
              command: ["true"]
          containers:
            # Keeps the pod running after the images are cached
            - name: pause
              image: registry.k8s.io/pause:3.9

Option 2: Fallback registry (mirror)
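    # Configure Docker to use mirror
    # /etc/docker/daemon.json
    # The last entry is the original registry. JSON does not allow inline comments.
    # Note: Docker's registry-mirrors setting only applies to images pulled from Docker Hub;
    # images referenced by another registry name need a mirror configured in the
    # container runtime or a rewritten image reference.
    {
      "registry-mirrors": [
        "https://mirror1.cleanstart.local",
        "https://gcr.io/mirror-project",
        "https://registry.cleanstart.com"
      ]
    }

    # Restart Docker
    systemctl restart docker

Recovery: Image Pull Failure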
When registry is down:
    # Kubernetes shows ImagePullBackOff status
    kubectl get pods -n production
    # NAME                   READY   STATUS             RESTARTS
    # myapp-abc123def-xyz1   0/1     ImagePullBackOff   0

    # Check error
    kubectl describe pod myapp-abc123def-xyz1 -n production
    # Events:
    # ... Failed to pull image "registry.cleanstart.com/python3:3.12.5-prod":
    # rpc error: code = Unknown desc = failed to pull and unpack image ...

Step 1: Check registry status

    curl -v https://registry.cleanstart.com/v2/ || echo "Registry down"

Step 2: If registry truly down, fallback to cached image

    # Option A: Use local mirror if available
    kubectl set image deployment/myapp \
      myapp=mirror-registry.local/myapp:cached-v1.0.0 \
      -n production

    # Option B: If node has image cached, force reschedule
    kubectl delete pod myapp-abc123def-xyz1 -n production
    # Kubernetes will retry, potentially finding image in node cache

    # Option C: Pull image on node directly
    docker pull registry.cleanstart.com/python3:3.12.5-prod || \
      docker load < /backup/python3-3.12.5-prod.tar

Step 3: Wait for registry recovery
    # Monitor until registry online
    watch -n 5 'curl -s https://registry.cleanstart.com/v2/ && echo "Registry OK" || echo "Down"'

    # Once online, restart the deployment to trigger a fresh image pull
    # (annotating an existing pod does not cause a new pull)
    kubectl rollout restart deployment/myapp -n production

Base Image Corruption / Cosign Verification Failure
If cosign verify fails, base image may be compromised.
    # Verification fails
    cosign verify registry.cleanstart.com/python3:3.12.5-prod
    # Error: signature verification failed

Steps:

    # 1. Do NOT deploy this image

    # 2. Verify network connectivity (not MITM)
    curl -v https://registry.cleanstart.com/

    # 3. Check cosign status
    cosign version

    # 4. Try mirror registry if available
    cosign verify ghcr.io/cleanstart-mirror/python3:3.12.5-prod

    # 5. Report to CleanStart security team
    # security@cleanstart.dev

    # 6. Fallback to last known-good image
    kubectl set image deployment/myapp \
      myapp=registry.cleanstart.com/python3:3.12.4-prod \
      -n production

    # 7. Investigate
    # - Check image build logs at CleanStart
    # - Verify build pipeline integrity
    # - Request re-sign if build legitimate

Cluster Failure / Full Recovery
If entire cluster fails (unlikely but possible):
Backup Strategy
Regular etcd snapshots:
    # For managed Kubernetes (GKE, EKS, AKS): automatic
    # For self-managed: manual cron job

    ETCD_BACKUP_DIR="/mnt/etcd-backups"
    ETCD_ENDPOINT="https://etcd.your-cluster.local:2379"

    # Snapshot
    ETCDCTL_API=3 etcdctl --endpoints=$ETCD_ENDPOINT \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      snapshot save $ETCD_BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db

    # Verify
    ls -lh $ETCD_BACKUP_DIR/

PersistentVolume backups:

    # Velero (cloud-agnostic backup tool)
    velero backup create daily-backup \
      --include-namespaces production,staging

    # Verify backup
    velero backup describe daily-backup

Restore Process
    # 1. If the control plane nodes survive but etcd data is lost or corrupted,
    #    restore the snapshot into a fresh data directory (example path)
    ETCDCTL_API=3 etcdctl snapshot restore $ETCD_BACKUP_DIR/etcd-20250321.db \
      --data-dir /var/lib/etcd-restored

    # 2. Point etcd at the restored data directory, then restart control plane
    kubectl delete pods -n kube-system kube-apiserver-master-01

    # 3. If cluster completely lost, restore from Velero
    velero restore create --from-backup daily-backup

    # 4. Verify restored objects
    kubectl get all -n production
    kubectl get pvc -n production

Post-Incident Review Checklist
After any rollback or incident:
[ ] Root cause identified: What actually broke?
[ ] Preventability: Could this have been caught in testing?
[ ] Timeline documented: Exact times and actions taken
[ ] Logs preserved: Collect logs before cleanup
[ ] Fix implemented: Patch applied or configuration changed
[ ] Test coverage added: New test that would have caught this
[ ] Runbook updated: Incident guide saved for future reference
[ ] Team debriefed: Share learnings
[ ] Monitoring improved: New alert rule added
[ ] Documentation updated: Update this playbook if new scenario discovered

Post-incident summary (template):
    ## Incident Report: Image Upgrade Regression

    **Date**: 2025-03-21
    **Duration**: 1h 20m (14:30 - 15:50 UTC)
    **Impact**: API service intermittently unavailable (500 errors on 5% of requests)

    ### Timeline
    - 14:30: Deployed CleanStart Python 3.12.5-prod to production
    - 14:45: Canary traffic shift to 10%, no issues observed
    - 15:15: Traffic shift to 100%, errors spike in logs
    - 15:25: Incident declared, rollback initiated
    - 15:40: Rolled back to Python 3.12.4-prod
    - 15:50: All metrics green, incident resolved

    ### Root Cause
    New cryptography library (openssl 3.1.5) has performance regression in our workload.
    CPU spike from 2 cores → 8 cores per pod on certain request patterns.

    ### Fix
    Allocate additional resources for Python 3.12.5 migration:
    - CPU request: 250m → 500m
    - CPU limit: 500m → 1000m
    - Add 2 additional replicas temporarily

    ### Prevention
    - Add load test with realistic request patterns in CI/CD
    - Test with actual production-like resource constraints
    - Add resource metric checks to pre-deployment validation

    ### New Alerts
    - Pod CPU > 800m for 5 minutes
    - Pod memory > 400Mi

Rollback Testing in CI/CD
Always test rollback before production.
    # GitHub Actions rollback test
    name: Test Rollback Procedure

    on:
      pull_request:
        paths:
          - "Dockerfile"
          - "helm/**"
          - ".github/workflows/test-rollback.yml"

    jobs:
      test-rollback:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Create kind cluster
            uses: helm/kind-action@v1.7.0

          - name: Deploy v1.0.0 (baseline)
            run: |
              helm install myapp ./helm \
                --values helm/values.yaml \
                --set image.tag=v1.0.0 \
                --wait

          - name: Verify v1.0.0 running
            run: |
              kubectl get pods | grep myapp
              kubectl logs deployment/myapp | grep "started"

          - name: Deploy v1.1.0 (new version)
            run: |
              helm upgrade myapp ./helm \
                --values helm/values.yaml \
                --set image.tag=v1.1.0 \
                --wait

          - name: Verify v1.1.0 running
            run: |
              kubectl get pods | grep myapp
              kubectl logs deployment/myapp | grep "version 1.1.0"

          - name: Test rollback
            run: |
              helm rollback myapp \
                --wait \
                --timeout 2m

          - name: Verify rollback successful
            run: |
              kubectl get pods | grep myapp
              kubectl logs deployment/myapp | grep "started"

          - name: Confirm back on v1.0.0
            run: |
              IMAGE=$(kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].image}')
              [[ "$IMAGE" == *"v1.0.0"* ]] || exit 1

Run Test on Every Dockerfile Change
Before pushing a PR, run the local rollback test. The test should successfully deploy v1.0.0, deploy v1.1.0 as the new version, rollback to v1.0.0, and verify that the system returns to its original state. Upon successful completion, the test confirms the rollback procedure works correctly.
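The local script itself is not shown in this guide. As a rough sketch of what it might build on, the hypothetical helper file below (called rollback-test-lib.sh here) mirrors the CI workflow: install v1.0.0, upgrade to v1.1.0, roll back, and confirm the image tag after each step. The file name, function names, and the RELEASE and CHART_DIR defaults are assumptions; the helm and kubectl invocations are the ones used in the workflow.

```shell
# rollback-test-lib.sh: hypothetical helper functions for a local rollback test.
# Defining the functions has no side effects; nothing touches a cluster
# until run_rollback_test is called.

RELEASE=${RELEASE:-myapp}
CHART_DIR=${CHART_DIR:-./helm}

# Succeeds when the image reference ends with ":<tag>".
image_has_tag() {
  case "$1" in
    *":$2") return 0 ;;
    *)      return 1 ;;
  esac
}

# Prints the image of the first container in the release's Deployment.
deployed_image() {
  kubectl get deployment "$RELEASE" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Deploys v1.0.0, upgrades to v1.1.0, rolls back, and checks the image tag
# after each step. Returns non-zero at the first failing step.
run_rollback_test() {
  helm install "$RELEASE" "$CHART_DIR" --set image.tag=v1.0.0 --wait || return 1
  image_has_tag "$(deployed_image)" v1.0.0 || return 1

  helm upgrade "$RELEASE" "$CHART_DIR" --set image.tag=v1.1.0 --wait || return 1
  image_has_tag "$(deployed_image)" v1.1.0 || return 1

  helm rollback "$RELEASE" --wait --timeout 2m || return 1
  image_has_tag "$(deployed_image)" v1.0.0 || return 1

  echo "Rollback verified: back on v1.0.0"
}
```

A local script could source this file and call `run_rollback_test` against a disposable kind cluster, exiting with the function's status.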
    # Local test before pushing PR
    ./test-rollback-locally.sh

Summary
Rollback capabilities are your safety net:
- Helm rollback: Revert entire release to previous revision
- Kubectl rollout undo: Revert deployment to previous replicaset
- Blue/Green: Deploy alongside old version, switch traffic when verified
- Canary: Gradually shift traffic while monitoring metrics
- GitOps: Revert git commits to trigger automatic rollback

Always:
- Test rollback procedure in CI/CD
- Verify pre-deployment checklist
- Use atomic flag to auto-rollback on probe failure
- Document incidents and improve monitoring
- Avoid rolling back after database migrations unless the old code is compatible with the new schema
- Keep fallback registries and image caches

Time to rollback from incident to resolved: Under 15 minutes when procedures are practiced.
