Supply Chain Disaster Recovery Plan

Purpose

Your production infrastructure depends on CleanStart container images. This plan ensures your organization can survive and recover from CleanStart supply chain disruptions—including registry outages, delayed security patches, or worst-case vendor discontinuation. The goal is zero downtime and full operational continuity when the registry is unavailable.

Risk Scenarios

Scenario 1: Registry Temporary Outage (hours). When the CleanStart registry becomes unreachable for a 2–8 hour window due to network issues, denial-of-service attacks, or scheduled maintenance, organizations cannot pull new images or deploy new application versions from CleanStart during this time. However, existing containers that are already running continue to operate normally because the image is already cached on the nodes—the outage doesn't immediately disrupt applications that are already running. The operational problem arises when you need to scale applications horizontally by adding new pods, because Kubernetes cannot pull the required images for the new pods. CI/CD pipelines block waiting for image pulls to complete. If a security issue is discovered and the team rebuilds the application with a fix, they cannot deploy the new version because the deployment process requires pulling the CleanStart base image, which is unreachable at that moment.

Scenario 2: Extended Registry Outage (days). The CleanStart registry remains unreachable for 1–7 days due to catastrophic infrastructure failure, an acquisition that takes the registry offline, or extended maintenance windows. The cluster auto-scaler cannot place new pods on newly created nodes because image pulls fail. Node replacement workflows break: when a node fails and the cluster spawns a replacement, the new pods cannot start because they cannot pull their images. Disaster recovery failover to a new region may fail if images aren't pre-cached in the new location. Development and staging environments feel the outage first because they deploy and rebuild most often; production is hit harder because every node failure or scale-up needs an image pull that can no longer succeed. If a production node fails, the auto-scaler spins up a replacement node, but the replacement cannot pull the CleanStart base images, leaving the cluster degraded and unable to recover capacity.

Scenario 3: Delayed Security Patch. A critical security vulnerability is publicly disclosed (for example, in glibc, the C standard library used by most Linux distributions), but CleanStart hasn't yet published a patched image because it is still testing the patch, coordinating with the upstream project, and ensuring the patch doesn't break dependent applications. This testing and release process is expected to take 4–12 hours. During this window, your production applications are running vulnerable images and compliance scanning tools flag the vulnerability in your systems. Until CleanStart publishes a patched image, your organization can fully remediate the vulnerability only by waiting for the official patch or by rebuilding from an alternative base image.

Scenario 4: Vendor Discontinuation. CleanStart ceases operations, is acquired and subsequently shut down, or withdraws from the market. The published images continue to work because they're OCI standard—Kubernetes doesn't care who built the image. However, you lose access to future updates and security patches. No one is maintaining the images, so security vulnerabilities discovered in future years have no official patch source available. Your organization must either accept running increasingly obsolete images or migrate to a different base image provider. This is a long-term survival scenario requiring detailed migration planning and execution.

Mitigation Strategy 1: Registry Mirroring

Why Registry Mirroring is Non-Negotiable

Every production deployment MUST maintain a local registry mirror of CleanStart images. This is the single most critical disaster recovery control. It provides availability (images are pulled from the internal registry during outages), speed (local pulls are 10–100x faster than remote pulls), compliance (air-gapped environments can function without an external dependency), cost reduction (less egress bandwidth from the CleanStart registry), and audit visibility into which images are used in production.

A well-maintained mirror means a 2-hour registry outage has zero impact on production.

Mirror Setup: Cloud Providers

AWS ECR (Elastic Container Registry)

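Create an ECR repository for each CleanStart image and populate it from the CleanStart registry with a sync job. ECR replication rules copy images only between ECR registries (across regions or accounts), so use them to fan the mirror out to other regions, not to pull from CleanStart. ECR private storage is billed at approximately $0.10 per GB per month, so a mirror of 100 images (~50 GB total) costs ~$5/month.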

Azure ACR (Azure Container Registry)

Enable geo-replication for automatic multi-region sync.

GCP Artifact Registry

Set up cross-region replication.

Harbor (Self-Hosted)

For air-gapped or on-premises environments, define a replication policy as a ConfigMap with scheduled triggers.

JFrog Artifactory

Configure federated repositories with remote and virtual repository configurations.

Mirror Setup: Automated Sync Pipeline

Create a scheduled job using Kubernetes CronJob to continuously sync CleanStart images. The sync runs every hour, authenticates to both registries, pulls from CleanStart, verifies signatures using cosign, tags for internal registry, pushes to internal registry, and logs all results.
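The verify-then-copy order of the sync can be sketched as follows. This is a minimal sketch, not a CleanStart-provided tool: the mirror host `registry.internal.example.com/cleanstart`, the key path, and the image names are hypothetical, and it assumes `cosign` and `skopeo` are installed and already logged in to both registries. `plan_sync` only builds the commands; `run_sync` executes them through a caller-supplied runner so that an image is never copied if its signature check fails.

```python
from dataclasses import dataclass

# Hypothetical values: replace with your own mirror host and cosign public key.
UPSTREAM = "registry.cleanstart.com"
MIRROR = "registry.internal.example.com/cleanstart"
COSIGN_KEY = "/etc/cosign/cleanstart.pub"


@dataclass(frozen=True)
class SyncStep:
    image: str
    verify_cmd: list
    copy_cmd: list


def plan_sync(images):
    """Build the verify and copy commands for each image (nothing is executed)."""
    steps = []
    for image in images:
        source = f"{UPSTREAM}/{image}"
        target = f"{MIRROR}/{image}"
        steps.append(SyncStep(
            image=image,
            verify_cmd=["cosign", "verify", "--key", COSIGN_KEY, source],
            # --all copies every platform in a multi-arch manifest list.
            copy_cmd=["skopeo", "copy", "--all",
                      f"docker://{source}", f"docker://{target}"],
        ))
    return steps


def run_sync(steps, runner):
    """Run each step; runner(cmd) must return the command's exit code."""
    results = {}
    for step in steps:
        if runner(step.verify_cmd) != 0:
            results[step.image] = "signature-failed"  # copy is skipped
        elif runner(step.copy_cmd) != 0:
            results[step.image] = "copy-failed"
        else:
            results[step.image] = "synced"
    return results
```

In a CronJob container the runner could be `lambda cmd: subprocess.run(cmd).returncode`, with the returned dictionary written to the job log.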

Sync Monitoring and Alerting

Create alerts to detect sync failures. Define Prometheus rules that trigger alerts when a mirror sync fails, when the mirror is out of sync (no successful sync in the past 2 hours, which tolerates one missed hourly run), or when mirror storage is nearly full.
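The three alert conditions can be expressed as plain logic before they are translated into Prometheus rules. The sketch below is illustrative only: `MirrorSyncLag` is the alert name used in the playbooks, while `MirrorSyncFailed`, `MirrorStorageNearlyFull`, and the threshold parameters are hypothetical names. The lag and storage thresholds are passed in rather than hard-coded so they can match whatever your rules define.

```python
def firing_alerts(now, last_success_ts, last_run_failed,
                  used_bytes, total_bytes,
                  max_lag_seconds, storage_limit_percent):
    """Return the names of the alerts that should fire for the given readings."""
    alerts = []
    if last_run_failed:
        alerts.append("MirrorSyncFailed")
    # Lag is the time since the last successful sync, in seconds.
    if now - last_success_ts > max_lag_seconds:
        alerts.append("MirrorSyncLag")
    if used_bytes / total_bytes * 100 >= storage_limit_percent:
        alerts.append("MirrorStorageNearlyFull")
    return alerts
```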

Mitigation Strategy 2: Image Caching

Node-Level Image Pre-Pulling

Ensure all worker nodes have critical CleanStart images pre-pulled using a DaemonSet. The DaemonSet runs on all nodes, pulls critical images at startup, and keeps them cached.
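One common way to build such a DaemonSet is to list each image as an init container that exits immediately, followed by a tiny long-running container that keeps the pod alive. The sketch below generates that manifest as a Python dictionary. The DaemonSet name, the namespace, and the use of `registry.k8s.io/pause:3.9` are assumptions, and the no-op command assumes each image ships a shell, which minimal images may not.

```python
import json


def prepull_daemonset(images, name="cleanstart-prepull", namespace="kube-system"):
    """Build a DaemonSet manifest that pulls every image onto every node."""
    labels = {"app": name}
    init_containers = [
        {
            "name": f"pull-{index}",
            "image": image,
            # Assumption: the image contains `sh`. Shell-less images need a
            # different no-op entrypoint.
            "command": ["sh", "-c", "true"],
        }
        for index, image in enumerate(images)
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # Keeps the pod running after the pulls complete.
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `to_json` can be saved to a file and applied with `kubectl apply -f`. The pause image should itself be available from the internal mirror so the DaemonSet does not add another external dependency.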

Image Garbage Collection

Configure kubelet to retain images appropriately by setting imageGCHighThresholdPercent to 85 percent and imageGCLowThresholdPercent to 75 percent with imageMinimumGCAge set to 48 hours.
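These settings correspond to three fields of the kubelet's `KubeletConfiguration`. The sketch below writes them out as a dictionary and adds a simplified model of what the thresholds mean: garbage collection starts when disk usage reaches the high threshold and aims to bring usage down to the low threshold. The function `bytes_to_free` is a hypothetical helper for reasoning about disk sizing, not kubelet code. The real kubelet removes only unused images older than `imageMinimumGCAge`, so it may free less than this target.

```python
KUBELET_IMAGE_GC = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    "imageGCHighThresholdPercent": 85,  # start collecting at 85% disk usage
    "imageGCLowThresholdPercent": 75,   # collect until usage is back at 75%
    "imageMinimumGCAge": "48h",         # never collect images unused for < 48h
}


def bytes_to_free(capacity_bytes, used_bytes, config=KUBELET_IMAGE_GC):
    """Simplified model: how much image GC tries to reclaim, in bytes."""
    high = config["imageGCHighThresholdPercent"]
    low = config["imageGCLowThresholdPercent"]
    if used_bytes * 100 / capacity_bytes < high:
        return 0  # below the high threshold, GC does not run
    target_used = capacity_bytes * low // 100
    return used_bytes - target_used
```

For example, on a 100 GiB image filesystem this model reclaims nothing at 80 GiB used, and targets 15 GiB once usage reaches 90 GiB.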

Mitigation Strategy 3: Air-Gapped Backup

For environments with strict air-gap requirements or where images must survive extended offline periods, export and back up images to tarball format.

Export Images to Tarball

Create a backup script that exports all critical CleanStart images to tarballs with checksums and metadata, stores them in an immutable backup location, and creates a manifest for restoration instructions.
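The checksum and manifest part of such a script might look like the sketch below. It assumes the tarballs have already been exported into one directory (for example with `skopeo copy docker://<image> docker-archive:<file>.tar`), and the `manifest.json` layout is a hypothetical format chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir):
    """Record name, SHA-256, and size of every *.tar file in backup_dir."""
    backup_dir = Path(backup_dir)
    entries = [
        {
            "file": tarball.name,
            "sha256": sha256_of(tarball),
            "bytes": tarball.stat().st_size,
        }
        for tarball in sorted(backup_dir.glob("*.tar"))
    ]
    manifest_path = backup_dir / "manifest.json"
    manifest_path.write_text(json.dumps({"images": entries}, indent=2))
    return manifest_path
```

Storing the manifest alongside the tarballs in the immutable location lets a later restore detect corruption or tampering.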

Restore from Tarball

Create a restore script that extracts backup tarballs, verifies checksums, loads images into local registry, and verifies successful restoration.
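The checksum verification step can be sketched as below. It assumes a hypothetical `manifest.json` in the backup directory of the form `{"images": [{"file": "...", "sha256": "..."}]}`. Only tarballs that pass this check should then be loaded into the local registry (for example with `skopeo copy docker-archive:<file>.tar docker://<mirror>/<image>`).

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir):
    """Return (file, problem) pairs for missing or corrupted tarballs."""
    backup_dir = Path(backup_dir)
    manifest = json.loads((backup_dir / "manifest.json").read_text())
    problems = []
    for entry in manifest["images"]:
        tarball = backup_dir / entry["file"]
        if not tarball.is_file():
            problems.append((entry["file"], "missing"))
        elif file_sha256(tarball) != entry["sha256"]:
            problems.append((entry["file"], "checksum-mismatch"))
    return problems
```

An empty list means every tarball listed in the manifest is present and intact.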

Scenario Response Playbooks

Playbook: Registry Temporary Outage (2–8 hours)

Detection (automated): Mirror sync job fails to pull from CleanStart registry and monitoring alert "MirrorSyncLag" is triggered.

Immediate Response (first 15 minutes): Confirm registry outage, page the on-call engineer, send status notification to all teams, and verify internal mirror is healthy and up-to-date.

Operational Adjustments: Switch all image pulls to the internal registry, pause only deployments that need images not yet in the mirror, ensure node auto-scaling uses pre-pulled images only, and monitor pod restart rates.
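Switching pulls to the internal registry comes down to rewriting image references. The sketch below shows that rewrite for a pod spec represented as a dictionary. The mirror path `registry.internal.example.com/cleanstart` is a hypothetical value; images from other registries are left untouched.

```python
UPSTREAM = "registry.cleanstart.com"
MIRROR = "registry.internal.example.com/cleanstart"  # hypothetical mirror path


def to_mirror(image_ref):
    """Point a CleanStart image reference at the internal mirror."""
    prefix = UPSTREAM + "/"
    if image_ref.startswith(prefix):
        return f"{MIRROR}/{image_ref[len(prefix):]}"
    return image_ref


def rewrite_pod_spec(pod_spec):
    """Return a copy of the pod spec with CleanStart images redirected."""
    updated = dict(pod_spec)
    for key in ("initContainers", "containers"):
        if key in pod_spec:
            updated[key] = [
                {**container, "image": to_mirror(container["image"])}
                for container in pod_spec[key]
            ]
    return updated
```

The same mapping could be applied in a mutating admission webhook or in the templating step of a deployment pipeline, so that workloads never reference the upstream registry directly.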

Monitoring: Monitor error rate in pods (should be stable), cluster resource utilization (should be stable), and internal registry latency (should be 10–100 ms).

Recovery (once registry returns): Monitor registry health through 30 minutes of successful syncs, resume normal deployments, run a mirror sync to ensure the latest images are available, and update the DR runbook after the incident.

Success Criteria: Zero production downtime, zero deployment delays during outage.

Playbook: Extended Registry Outage (1–7 days)

Detection: The registry outage is confirmed as extended (> 8 hours without recovery, beyond the temporary-outage window) and leadership is paged.

Escalation: Activate incident command center, notify all teams, freeze all non-critical changes, and activate backup communication channels.

Operational Adjustments: Scale all critical services to a minimum of 3 replicas, pause any scheduled maintenance or rolling updates, disable cluster auto-scaling, restrict manual pod restarts, and reduce log verbosity.

Workarounds: Use cleanimg-customize to patch cached images locally, or rebuild images from their Dockerfiles if necessary.

Monitoring: Monitor cluster stability metrics, registry cache hit rates (should be 100%), pod eviction and restart rates (should be minimal), and memory and disk pressure.

Recovery: Once the registry returns, stagger recovery over 2–4 hours: manually sync the mirror to the latest images, roll out new images gradually (canary deployment), and hold a post-incident review.

Playbook: Delayed Security Patch

Detection: A critical CVE is disclosed and CleanStart does not have a patch available immediately.

Assessment (first 30 minutes): Determine the CVSS score and exploitability, assess the risk in your specific environment, and check whether CleanStart has provided an advisory or estimated timeline.

Immediate Mitigations: Restrict network access to vulnerable pods using NetworkPolicy, add WAF/IDS rules to block known exploit patterns, use an admission controller to block new deployments of the vulnerable image, and escalate the alert level.
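As an illustration of the NetworkPolicy mitigation, the sketch below builds a policy that allows ingress to the vulnerable pods only from pods carrying a given label in the same namespace, and denies all other ingress. The function name, policy name, and labels are hypothetical, and the policy only takes effect on clusters whose network plugin enforces NetworkPolicy.

```python
def restrict_ingress_policy(name, namespace, pod_labels, allowed_from_labels):
    """Build a NetworkPolicy limiting ingress to the selected pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods the policy applies to (the vulnerable workload).
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            # Only same-namespace pods with these labels may connect.
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": allowed_from_labels}}]}
            ],
        },
    }


policy = restrict_ingress_policy(
    name="quarantine-vulnerable-api",
    namespace="production",
    pod_labels={"app": "api"},
    allowed_from_labels={"role": "gateway"},
)
```

Serialized with `json.dumps(policy)`, the result can be applied with `kubectl apply -f` and removed again once the patched image is deployed.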

Custom Workarounds (if the patch is delayed > 24 hours): Use cleanimg-customize v0.3.0 to build a patched image locally, or rebuild from an alternative base if necessary.

Communication: Provide hourly updates to stakeholders, update public status page, and clearly communicate "Vulnerable but mitigated" status.

Resolution: Deploy patched image via canary rollout, monitor for regressions from patch, and document patch timeline and workarounds.

Playbook: Vendor Discontinuation

Detection: CleanStart announces shutdown or registry domain becomes inaccessible.

Immediate Actions (first 24 hours): Pull complete copies of all CleanStart images to air-gapped storage, export all SBOMs and signatures, assess which applications depend on which images, and create detailed inventory.

Medium-Term Strategy (weeks 1–4): Evaluate alternatives (Alpine Linux, Ubuntu, UBI, or other vendors). Test compatibility by rebuilding against alternative bases and running full test suite. Plan gradual migration starting with non-critical services, performing canary rollout of each service, and monitoring for weeks.

Long-Term (months 2–6): Complete migration to new provider, update CI/CD pipelines and documentation, archive CleanStart images in air-gapped backup, and establish relationship with new vendor.

Communication: Provide transparent timeline to all teams, send regular status updates, and establish clear ownership.

Cost Impact: Expect 4–8 weeks of engineering effort for complete migration.

Testing Your DR Plan

Quarterly DR Drill Schedule

Conduct formal DR exercises every 3 months. A typical Q1 2025 drill might simulate a 6-hour registry outage on March 14, 2025, compressed into a 1-hour window from 15:00 to 16:00 UTC.

Execution:

Notify teams: "DR drill in progress. This is a test."
Temporarily block access to registry.cleanstart.com via firewall rule
Monitor production: error rates, deployment ability, pod startup
Teams attempt to deploy new application version (should succeed from mirror)
Teams attempt to scale pods (should succeed if pre-pulled)
After 1 hour, restore access

Success Criteria: The drill succeeds if there is zero production downtime, all deployments succeed using the mirror, all pod scaling operations succeed, and mirror sync lag returns to under 30 minutes once access is restored.

Debrief: After the drill, the team should discuss what worked well, what failed or was slower than expected, and what needs to be improved. The playbook should be updated based on the findings.

DR Readiness Checklist

Before each drill, verify that:

The internal registry mirror is running and healthy
The mirror sync job completed in the last 2 hours
All critical images are in the mirror
The image pre-pulling DaemonSet is running on all nodes
Kubelet image garbage collection is configured correctly
Air-gapped backup files exist and are < 7 days old
Backup integrity is verified (checksums validated)
Monitoring alerts are configured and tested
All team members are trained on the playbooks
Escalation contacts are updated and reachable
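The time-based and inventory items on this checklist lend themselves to an automated pre-drill check. The sketch below is one possible shape for it, using the 2-hour sync and 7-day backup limits from the checklist. The function and its parameters are hypothetical, and the inputs would come from your own monitoring and backup tooling.

```python
from datetime import timedelta


def readiness_failures(now, last_sync, last_backup,
                       backup_checksums_ok, missing_images):
    """Return a list of checklist items that currently fail (empty if ready)."""
    failures = []
    if now - last_sync > timedelta(hours=2):
        failures.append("mirror sync older than 2 hours")
    if now - last_backup >= timedelta(days=7):
        failures.append("air-gapped backup is 7 or more days old")
    if not backup_checksums_ok:
        failures.append("backup checksums not validated")
    if missing_images:
        failures.append(
            f"{len(missing_images)} critical image(s) missing from mirror"
        )
    return failures
```

Items such as team training and escalation contacts still need a human sign-off and are deliberately left out of the sketch.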

DR Monitoring Dashboard

Key metrics should be displayed on a wall dashboard in your network operations center (NOC) to provide visibility into mirror health and disaster recovery status.

Mirror Sync Status shows critical synchronization metrics including the timestamp of the last successful sync, current sync lag measured in hours or minutes, the sync success rate over the past 7 days, and the count of failed syncs in the last 24 hours.

Mirror Storage displays capacity metrics including total storage used and available in gigabytes, the percentage utilization, and the count of images currently stored.

Image Availability provides inventory metrics showing total images in the mirror, count of missing images not yet synced, the age of the least recently synced image, and the time when the most recent image was synced.

Production Impact metrics indicate the percentage of pods currently using the internal registry, the count of pods in pending state waiting for images, the count of nodes experiencing image pull errors, and the p99 latency for image pulls.

Air-Gapped Backup status shows the timestamp of the last backup, backup file size in gigabytes, backup integrity status (PASS or FAIL), and encryption status.

Prometheus Queries for DR Monitoring

# Mirror sync lag (hours)
(time() - mirror_last_sync_timestamp) / 3600

# Mirror storage utilization (percent)
(mirror_storage_used_bytes / mirror_storage_total_bytes) * 100

# Image pull success rate
sum(rate(image_pulls_total{result="success"}[1h])) / sum(rate(image_pulls_total[1h]))

# Pods in pending state (the metric is 1 for a pod's current phase and 0 otherwise, so sum rather than count)
sum(kube_pod_status_phase{phase="Pending"})

# Image pull latency (p99)
histogram_quantile(0.99, sum by (le) (rate(image_pull_duration_seconds_bucket[5m])))

What to Read Next

To learn about safely rolling out CleanStart image upgrades using canary deployments, read 06-operate/kubernetes-helm/canary-progressive-delivery.md. For setting up alerts for mirror outages and sync failures, see 06-operate/observability/alerting-playbooks.md. To understand how to verify signatures and SBOMs when syncing images to your mirror, consult 08-reference/deployment/container-image-security.md.