Deploying AI containers is the easy part. Operating them is where most organizations struggle: handling GPU failures, updating model versions without downtime, managing CUDA compatibility across clusters, and patching framework vulnerabilities in images serving live inference. This guide covers production-tested patterns for running stateful ML workloads in Kubernetes using hardened images from registry.cleanstart.com.
GPU Lifecycle Management
GPU nodes are compute-expensive and failure-prone. Unlike stateless web services, GPU failures cascade: running inference jobs hang, model serving latency spikes, and recovery requires graceful draining plus node replacement.
Monitor GPU health relentlessly by deploying node-exporter with NVIDIA GPU metrics to understand GPU state:

    helm install nvidia-metrics nvidia/nvidia-gpu-prometheus-exporter \
      --set nvidia.address=localhost:5555 \
      --namespace monitoring

Watch for GPU memory fragmentation (unused but unavailable memory), thermal throttling (GPU clock drops under 100% utilization), power throttling (GPU power-limited due to thermal issues), and uncorrectable ECC errors (single error indicates pending failure).
Prometheus queries to alert on critical GPU issues:

    # GPU memory pressure
    nvidia_smi_used_memory_mb / nvidia_smi_total_memory_mb > 0.95

    # Thermal throttling active
    nvidia_gpu_power_throttle_reason != 0

    # ECC errors accumulating
    rate(nvidia_gpu_ecc_errors_uncorrected[5m]) > 0

Handle GPU node failures without data loss by draining the node gracefully, which respects PodDisruptionBudgets:

    # Drain the node gracefully (respects PodDisruptionBudgets)
    kubectl drain <node-name> \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --pod-selector='!inference'

For inference workloads, use readiness probes to stop sending traffic before eviction:

    spec:
      containers:
      - name: model-server
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: workload-type
                  operator: In
                  values: ["inference"]
              topologyKey: kubernetes.io/hostname

CUDA and driver version alignment is critical since GPU containers fail silently when CUDA versions mismatch. Every node must report its driver version and available CUDA capability:

    # On each GPU node
    nvidia-smi --query-gpu=driver_version,compute_cap --format=csv > /etc/kubernetes/gpu-info.txt

    # Add to node labels
    kubectl label node <node-name> \
      nvidia.com/cuda-driver=550.107.02 \
      nvidia.com/compute-cap=8.0

Pin image CUDA versions to node capabilities:

    spec:
      nodeSelector:
        nvidia.com/compute-cap: "8.0"  # exact match: compute capability 8.0 (A100-class Ampere)
      containers:
      - name: model-server
        image: registry.cleanstart.com/pytorch:2.2-cuda12.1-cudnn8.9

During cluster upgrades that change driver versions, test the new driver and updated image combination on a canary node first:

    # Update GPU driver on single node
    # (orchestrated by node-driver-installer DaemonSet)
    kubectl label node canary-gpu nvidia.com/driver-update=true

    # Watch for successful CUDA handshake
    kubectl logs -l app=node-driver-installer -f

    # Only after success: roll out to production nodes
    kubectl label nodes -l pool=gpu-prod nvidia.com/driver-update=true --overwrite

Model Version Management
Models are the application state. A single inference service might serve v1.2.0, v1.3.0-canary, and v1.2.1-hotfix simultaneously. Kubernetes doesn't understand model semantics, so you must orchestrate updates explicitly.
Blue-green model deployments keep two parallel inference services (blue and green), each tied to a model version. Switch traffic by updating the load balancer:

    # Deploy v2.0 model to green service (scale: 0 initially)
    helm install model-inference-green ./charts/inference-api \
      --set model.version=v2.0 \
      --set replicas=0 \
      --namespace production

    # Scale green up and wait until its pods are ready
    kubectl scale deployment model-inference-green --replicas=3 --namespace production
    kubectl rollout status deployment/model-inference-green --namespace production

    # Send 10% of traffic to green for smoke testing
    # (in an Istio VirtualService, routes live under spec.http; spec.hosts is a list of hostnames)
    kubectl patch virtualservice model-inference --namespace production \
      --type merge \
      -p '{"spec":{"http":[{"route":[
            {"destination":{"host":"model-blue","port":{"number":8000}},"weight":90},
            {"destination":{"host":"model-green","port":{"number":8000}},"weight":10}]}]}}'

    # If metrics look good, shift 100% to green
    kubectl patch virtualservice model-inference --namespace production \
      --type merge \
      -p '{"spec":{"http":[{"route":[
            {"destination":{"host":"model-green","port":{"number":8000}},"weight":100}]}]}}'

    # Scale down blue
    kubectl scale deployment model-inference-blue --replicas=0 --namespace production

Canary model deployments provide gradual rollout for incremental updates. Not every traffic switchover needs to happen in a few large steps. Use Flagger to shift traffic gradually based on error rate and latency metrics:

    helm repo add flagger https://flagger.app
    helm install flagger flagger/flagger --namespace istio-system

    # Define canary rules: shift 10% every 5 minutes if no errors
    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: model-inference
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: model-inference
      service:
        port: 8000
      analysis:
        interval: 5m
        threshold: 5
        maxWeight: 50
        stepWeight: 10
        metrics:
        - name: error-rate
          thresholdRange:
            max: 1
          interval: 1m
        - name: latency-ms
          thresholdRange:
            max: 500
          interval: 1m

Instant rollback to known-good models requires preparing before deployment. Before deploying a new model version, tag the current model artifact with its image digest and inference metric baseline:

    # Record the image currently serving the v1.2.0 model
    kubectl get deployment model-inference \
      -o jsonpath='{.spec.template.spec.containers[0].image}' > /backups/model-v1.2.0-image.txt

    # Store inference latency p99 from Prometheus
    # (histogram_quantile operates on the rate of the histogram's _bucket series)
    curl -sG 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(model_latency_seconds_bucket[5m])) by (le))' \
      | jq -r '.data.result[].value[1]' > /backups/model-v1.2.0-latency-p99.txt

    # On rollback: redeploy from stored image
    kubectl set image deployment/model-inference \
      model="$(cat /backups/model-v1.2.0-image.txt)" \
      --record

    # Verify latency returns to baseline within 2 minutes

Image Update Workflows
CleanStart publishes patched images (e.g., pytorch:2.2-cuda12.1-cudnn8.9-20250301 fixing CUDA vulnerability CVE-2024-12345). Rolling these updates out across a production cluster requires staging, validation, and a coordinated rollout.
Stage image updates in a non-production environment so that patched images are tested before the production rollout:

    # 1. Add CleanStart Helm repo
    helm repo add cleanstart https://helm.cleanstart.dev
    helm repo update

    # 2. Deploy staging release with new image tag
    helm install inference-staging ./charts/inference-api \
      --set image.repository=registry.cleanstart.com/pytorch \
      --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
      --set environment=staging \
      --namespace staging

    # 3. Run smoke tests against staging
    #    - Model loading should complete in <10s
    #    - First inference batch should succeed
    #    - GPU memory should stabilize within 30s
    curl -f http://inference-staging:8000/health
    curl -f -X POST http://inference-staging:8000/inference \
      -H "Content-Type: application/json" \
      -d '{"input": "test data"}'

    # 4. Monitor staging resource usage (compare with current prod)
    kubectl top pods -n staging -l app=inference-api

Progressive rollout to production uses a Helm upgrade with a rolling update strategy:

    # Create new release candidate (v2.1)
    helm upgrade inference ./charts/inference-api \
      --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
      --set updateStrategy=RollingUpdate \
      --set maxUnavailable=1 \
      --set maxSurge=1 \
      --namespace production \
      --dry-run --debug  # Always dry-run first

    # If dry-run output looks correct:
    helm upgrade inference ./charts/inference-api \
      --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
      --set updateStrategy=RollingUpdate \
      --set maxUnavailable=1 \
      --set maxSurge=1 \
      --namespace production

    # Monitor rollout in real-time
    kubectl rollout status deployment/inference --namespace production --watch

Automated image scanning before deployment catches vulnerabilities before they reach production:

    #!/bin/bash
    # scan-and-deploy.sh
    set -euo pipefail

    IMAGE_TAG=${1:?usage: scan-and-deploy.sh IMAGE_TAG}
    NEW_IMAGE="registry.cleanstart.com/pytorch:${IMAGE_TAG}"

    # Scan image for vulnerabilities.
    # --exit-code 1 makes trivy return non-zero when HIGH/CRITICAL findings exist;
    # without it trivy exits 0 even when it reports vulnerabilities.
    if ! trivy image --severity HIGH,CRITICAL --exit-code 1 "$NEW_IMAGE"; then
      echo "Image scan failed. Aborting deployment."
      exit 1
    fi

    # Scan passed: proceed with upgrade
    helm upgrade inference ./charts/inference-api \
      --set image.tag="${IMAGE_TAG}" \
      --namespace production

Run this before any production deployment:

    ./scan-and-deploy.sh 2.2-cuda12.1-cudnn8.9-20250301

Roll back immediately when a patched image introduces a latency regression or crashes:
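
    # Find the last known-good revision
    helm history inference --namespace production

    # Roll back to that revision (5 is an example; omit the number to return to the previous release)
    helm rollback inference 5 --namespace production

    # Kubernetes rolls the pods back to the previous image; watch until it completes
    kubectl rollout status deployment/inference --namespace production --watch

Scaling Economics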
GPU time is expensive ($0.35–1.20 per GPU-hour on cloud). Running idle inference servers or maintaining excess GPU memory overhead compounds costs across thousands of deployments.
Right-size GPU memory requests by monitoring actual peak memory usage and requesting only what's needed:

    # Sample current GPU memory use in MiB; repeat under peak load and keep the largest value
    kubectl exec <pod> -- nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sort -n | tail -1

    # Example: if model peaks at 18GB, request 20GB (allow 10% headroom)
    # Note: `memory` below is host RAM; whole GPUs are allocated through nvidia.com/gpu,
    # which must be set under limits.
    resources:
      requests:
        nvidia.com/gpu: "1"
        memory: "20Gi"
      limits:
        nvidia.com/gpu: "1"
        memory: "24Gi"  # Allow 20% over request for temp spikes

Oversizing GPU memory requests prevents GPU sharing and wastes $5,000+/month per idle GPU.
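The headroom arithmetic in the example above (18GB peak, 10% headroom, 20% burst allowance) can be sketched as a small helper. This is an illustrative sketch only: the function name `size_memory` and its parameters are hypothetical, and the percentages are simply the ones quoted in the example.

```python
import math


def size_memory(peak_gib, headroom_pct=10, burst_pct=20):
    """Return (request_gib, limit_gib), each rounded up to a whole GiB.

    peak_gib     -- observed peak usage in GiB
    headroom_pct -- safety margin added to the peak to form the request
    burst_pct    -- extra margin added to the request to form the limit
    """
    # Work in integer percentages to avoid floating-point drift near whole numbers.
    request = math.ceil(peak_gib * (100 + headroom_pct) / 100)
    limit = math.ceil(request * (100 + burst_pct) / 100)
    return request, limit


if __name__ == "__main__":
    request, limit = size_memory(18)
    print(f'requests.memory: "{request}Gi", limits.memory: "{limit}Gi"')
```

With a peak of 18 the helper returns a 20Gi request and a 24Gi limit, matching the manifest above.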
Time-based scaling for batch workloads works when your inference service handles primarily evening peak loads:

    # Install KEDA for time-based scaling
    helm repo add kedacore https://kedacore.github.io/charts
    helm install keda kedacore/keda --namespace keda

    # Scale down to 1 replica (warm standby) from 9am–6pm on weekdays
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: inference-time-scaling
    spec:
      scaleTargetRef:
        name: inference-api
      minReplicaCount: 1
      maxReplicaCount: 10
      triggers:
      - type: cron
        metadata:
          timezone: America/New_York
          start: 0 9 * * 1-5   # 9am weekdays
          end: 0 18 * * 1-5    # 6pm weekdays
          desiredReplicas: "1"
      - type: cron
        metadata:
          timezone: America/New_York
          start: 0 18 * * 1-5  # 6pm weekdays
          end: 0 9 * * 2-6     # 9am next day
          desiredReplicas: "8"

Savings: ~$120k/year on a 10-GPU inference cluster with uneven daily demand.
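To sanity-check a schedule like the one above before applying it, the two cron windows can be modelled in a few lines. This is a rough sketch, not KEDA's implementation: `desired_replicas` is a hypothetical helper, it expects a datetime already expressed in the schedule's timezone, and it assumes that outside every window the workload falls back to `minReplicaCount`.

```python
from datetime import datetime

MIN_REPLICAS = 1   # minReplicaCount in the ScaledObject
PEAK_REPLICAS = 8  # desiredReplicas of the evening window


def desired_replicas(local_now):
    """Replica count implied by the two cron windows for a local wall-clock time."""
    weekday = local_now.weekday()  # Monday=0 ... Sunday=6
    hour = local_now.hour
    # Evening window opens at 18:00, Monday through Friday...
    opened_tonight = weekday <= 4 and hour >= 18
    # ...and stays open until 09:00 the next morning, Tuesday through Saturday.
    open_since_last_night = 1 <= weekday <= 5 and hour < 9
    if opened_tonight or open_since_last_night:
        return PEAK_REPLICAS
    # Daytime window (1 replica) and all uncovered time (weekends) use the minimum.
    return MIN_REPLICAS


if __name__ == "__main__":
    for moment in (datetime(2025, 3, 3, 12, 0), datetime(2025, 3, 3, 20, 0)):
        print(moment.isoformat(), desired_replicas(moment))
```

Walking a week of timestamps through the function shows that the weekend and early Monday morning are not covered by either window, so they run at the minimum replica count.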
Multi-Instance GPU (MIG) for small models partitions a single GPU into up to 7 independent instances. For models using <20GB VRAM:

    # Enable MIG mode on GPU node
    nvidia-smi -i 0 -mig 1

    # Create GPU instances with their compute instances
    # (1g.10gb = 1 compute slice with 10GB of memory)
    nvidia-smi mig -i 0 -cgi 1g.10gb -C

    # With the NVIDIA device plugin's mixed MIG strategy, Kubernetes exposes
    # each instance as nvidia.com/mig-1g.10gb
    # Schedule small models on fractional GPU:
    resources:
      limits:
        nvidia.com/mig-1g.10gb: "1"

Trade-off: GPU memory per instance is fixed, so a MIG partition cannot accommodate ad-hoc larger models. MIG is cost-effective for fixed-size model serving.
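Because the memory per MIG instance is fixed, choosing a profile amounts to finding the smallest one that fits the model plus some headroom. The sketch below illustrates that selection under stated assumptions: the profile table lists the nominal sizes of A100 80GB profiles (usable memory is somewhat lower in practice), and `pick_mig_profile` and its 10% default headroom are hypothetical.

```python
from typing import Optional

# Nominal (profile name, memory in GB) pairs for an A100 80GB, smallest first.
# Assumption for illustration: other GPUs expose different profiles.
A100_80GB_PROFILES = [
    ("1g.10gb", 10),
    ("2g.20gb", 20),
    ("3g.40gb", 40),
    ("4g.40gb", 40),
    ("7g.80gb", 80),
]


def pick_mig_profile(model_gb, headroom_pct=10):
    # type: (float, int) -> Optional[str]
    """Return the smallest profile that fits model_gb plus headroom, or None."""
    needed = model_gb * (100 + headroom_pct) / 100
    for name, memory_gb in A100_80GB_PROFILES:
        if memory_gb >= needed:
            return name
    return None  # The model does not fit on a single GPU of this type.


if __name__ == "__main__":
    for size in (8, 18, 30, 75):
        print(size, "GB ->", pick_mig_profile(size))
```

A `None` result is the signal that the model needs a full GPU of a larger class, or sharding, rather than a MIG slice.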
Spot instances suit non-critical inference, for example when a 30-second response latency is acceptable:

    # Create a lower-cost preemptible GPU node pool (GKE).
    # T4 is shown as an example; pick an accelerator type available for the machine family.
    gcloud container node-pools create spot-gpu \
      --cluster=<cluster-name> \
      --machine-type=n1-standard-8 \
      --accelerator=type=nvidia-tesla-t4,count=1 \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule

    # Tolerate preemption with priority class
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-inference
    value: 100
    globalDefault: false
    ---
    spec:
      priorityClassName: batch-inference
      tolerations:
      - key: preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: In
                values: ["true"]

Savings: up to 80% on compute cost with graceful restart handling.
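Graceful restart handling on a preemptible node usually means reacting to the SIGTERM that Kubernetes sends before the pod is killed: stop taking new work, finish the batch in flight, and exit. A minimal sketch of that pattern follows; the names `GracefulShutdown`, `serve`, `next_batch`, and `run_inference` are hypothetical stand-ins for the serving loop, not part of any CleanStart image.

```python
import signal
import threading


class GracefulShutdown:
    """Records that a termination signal arrived so the serving loop can drain."""

    def __init__(self):
        self.stop_requested = threading.Event()

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.stop_requested.set()


def serve(shutdown, next_batch, run_inference):
    """Process batches until the queue is empty or a shutdown is requested.

    The in-flight batch always completes; the flag is checked between batches.
    Returns the number of batches processed.
    """
    processed = 0
    while not shutdown.stop_requested.is_set():
        batch = next_batch()
        if batch is None:
            break
        run_inference(batch)
        processed += 1
    return processed


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    shutdown.install()
    queue = iter(["batch-1", "batch-2", "batch-3"])
    count = serve(shutdown, lambda: next(queue, None), lambda batch: print("ran", batch))
    print("processed", count, "batches")
```

Unfinished batches stay in whatever queue feeds `next_batch`, so a replacement pod on another node can pick them up after the preempted one exits.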
Incident Response for AI Containers
AI container incidents differ from traditional application incidents. GPU memory dumps leak sensitive model weights. Model inference degradation cascades to dependent services before visibility appears in application logs. Rollback must combine both image and model atomically.
Detect model quality degradation before users notice by deploying an inference validator that checks every Nth prediction:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: inference-validator
    data:
      validation-rules.json: |
        {
          "sample-rate": 0.05,
          "anomaly-detection": {
            "output-shape-mismatch": {"alert": true},
            "output-nan-or-inf": {"alert": true},
            "output-range-violation": {
              "min": -1000,
              "max": 1000,
              "alert": true
            }
          },
          "latency-thresholds-p99": {
            "normal": 250,
            "degraded": 500,
            "failure": 1000
          }
        }

The validator sidecar runs alongside inference and triggers alerts on anomalies:

    spec:
      containers:
      - name: model-server
        image: registry.cleanstart.com/pytorch:2.2-cuda12.1
      - name: inference-validator
        image: registry.cleanstart.com/inference-validator:1.0
        env:
        - name: INFERENCE_ENDPOINT
          value: "localhost:8000"
        - name: VALIDATION_CONFIG
          valueFrom:
            configMapKeyRef:
              name: inference-validator
              key: validation-rules.json
        resources:
          requests:
            cpu: "500m"

Limit model theft after a container escape by restricting the paths available for data exfiltration if the inference container is compromised:

    spec:
      securityContext:
        runAsNonRoot: true
      containers:
      - name: model-server
        securityContext:
          readOnlyRootFilesystem: true  # container-level field
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: model
          mountPath: /model
          readOnly: true
        - name: tmp
          mountPath: /tmp
      - name: gpu-monitor
        image: registry.cleanstart.com/nvidia-smi:latest
        securityContext:
          privileged: false
      volumes:
      - name: model
        secret:
          secretName: model-weights
          defaultMode: 0400
      - name: tmp
        emptyDir: {}
    ---
    # NetworkPolicy is a separate resource, not a field of the pod spec
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: inference-egress
    spec:
      podSelector:
        matchLabels:
          app: inference-api  # assumed pod label
      policyTypes:
      - Egress
      egress:
      - to:
        - namespaceSelector:
            matchLabels:
              name: production
        ports:
        - protocol: TCP
          port: 8000

If model weights are extracted despite protections, rotate immediately:

    # Generate new model artifact
    python retrain.py --output-path=/tmp/model-v1.2.1-hotfix

    # Patch secret
    kubectl create secret generic model-weights \
      --from-file=/tmp/model-v1.2.1-hotfix \
      --dry-run=client -o yaml | kubectl apply -f -

    # Force pod restart to load new secret
    kubectl rollout restart deployment/inference-api

GPU memory dump containment prevents sensitive data leakage on GPU failure. Prevent dumps from being exfiltrated:

    # Disable GPU memory dumps
    nvidia-smi -pm 0   # Disable persistence mode
    nvidia-smi -lgc 0  # Disable graphics clock persistence

    # Mount /dev/nvidia* as read-only where possible
    spec:
      containers:
      - name: model-server
        volumeDevices:
        - devicePath: /dev/nvidia0
          name: gpu
      volumes:
      - name: gpu
        hostPath:
          path: /dev/nvidia0
          type: CharDevice

A rapid rollback bundle ensures you can recover quickly. Before deploying a new image or model version, create a tagged rollback point:

    #!/bin/bash
    # create-rollback-point.sh
    set -euo pipefail

    RELEASE_NAME=$1
    NAMESPACE=$2

    # Capture current state
    CURRENT_IMAGE=$(kubectl get deployment "${RELEASE_NAME}" \
      -n "${NAMESPACE}" -o jsonpath='{.spec.template.spec.containers[0].image}')
    CURRENT_MODEL=$(kubectl get configmap "${RELEASE_NAME}-model" \
      -n "${NAMESPACE}" -o jsonpath='{.data.model-version}')
    # Record the Helm revision number, which is what `helm rollback` expects
    CURRENT_HELM_RELEASE=$(helm list -n "${NAMESPACE}" --filter "^${RELEASE_NAME}\$" -o json \
      | jq -r '.[0].revision')

    # Create rollback snapshot
    SNAPSHOT="/backups/${RELEASE_NAME}-rollback-$(date +%s).json"
    cat > "${SNAPSHOT}" <<EOF
    {
      "image": "${CURRENT_IMAGE}",
      "model": "${CURRENT_MODEL}",
      "helm-release": "${CURRENT_HELM_RELEASE}",
      "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    }
    EOF

    echo "Rollback point created: ${SNAPSHOT}"

On incident:

    # Restore from rollback point
    ROLLBACK_FILE="/backups/inference-api-rollback-1719432567.json"
    IMAGE=$(jq -r '.image' ${ROLLBACK_FILE})
    MODEL=$(jq -r '.model' ${ROLLBACK_FILE})
    HELM_RELEASE=$(jq -r '.helm-release' ${ROLLBACK_FILE})

    helm rollback inference-api ${HELM_RELEASE} --namespace production
    kubectl set image deployment/inference-api \
      model-server=${IMAGE} \
      --record \
      --namespace production

    # Verify inference returns to normal
    kubectl rollout status deployment/inference-api --namespace production --watch

Bitnami Chart Operations for AI Supporting Services
ML inference workloads depend on Redis (caching, session state), PostgreSQL (model metadata, request logs), and NGINX (load balancing, request routing). Coordinated updates of the full stack prevent cascading failures.
Upgrade Redis with zero cache loss by using Bitnami's Redis Helm chart which supports seamless upgrades using replication:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install cache-layer bitnami/redis \
      --set architecture=replication \
      --set replica.replicaCount=2 \
      --set replica.persistence.enabled=true \
      --namespace production

    # Upgrade Redis (replicas stay online, primary switches)
    # --reuse-values keeps the settings from the install; without it they reset to chart defaults
    helm upgrade cache-layer bitnami/redis \
      --reuse-values \
      --set image.tag=7.2-debian-12 \
      --set persistence.enabled=true \
      --namespace production

    # Monitor upgrade
    kubectl rollout status statefulset/cache-layer-redis-replicas --namespace production --watch

The cache remains available during the entire upgrade (a replica serves reads while the primary updates).
PostgreSQL holds the model metadata (model versions, training runs, and inference metrics) and can be upgraded with minimal downtime:

    helm install model-db bitnami/postgresql \
      --set primary.persistence.enabled=true \
      --set primary.persistence.size=50Gi \
      --set readReplicas.replicaCount=1 \
      --set readReplicas.persistence.enabled=true \
      --namespace production

    # Schema for model metadata
    # (-i without -t: the heredoc is piped over stdin, so no TTY is allocated)
    kubectl exec -i model-db-postgresql-0 --namespace production -- psql -U postgres <<EOF
    CREATE TABLE models (
      id SERIAL PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      version VARCHAR(50) NOT NULL,
      image_uri VARCHAR(255),
      cuda_version VARCHAR(20),
      framework VARCHAR(50),
      gpu_memory_mb INT,
      created_at TIMESTAMP DEFAULT NOW(),
      metadata JSONB
    );

    CREATE TABLE inference_runs (
      id SERIAL PRIMARY KEY,
      model_id INT REFERENCES models(id),
      latency_ms INT,
      gpu_memory_used_mb INT,
      output_hash VARCHAR(64),
      status VARCHAR(50),
      timestamp TIMESTAMP DEFAULT NOW()
    );
    EOF

Inference pods query model-db-postgresql.production:5432 to validate the model version before loading from the registry:
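
    import psycopg2

    # model_version is supplied by the pod's configuration
    conn = psycopg2.connect("host=model-db-postgresql port=5432 user=postgres")
    with conn, conn.cursor() as cursor:
        cursor.execute("SELECT image_uri FROM models WHERE version = %s", (model_version,))
        row = cursor.fetchone()
    conn.close()
    if row is None:  # unknown version: refuse to load anything
        raise LookupError(f"model version {model_version!r} is not registered")
    image_uri = row[0]

An NGINX reverse proxy with rate limiting in front of the inference API protects model-serving endpoints from DDoS and uneven load distribution: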

    helm install api-gateway bitnami/nginx \
      --set service.type=LoadBalancer \
      --set ingress.enabled=true \
      --set ingress.hostname=inference.company.com \
      --namespace production

    # ConfigMap for rate limiting rules
    kubectl create configmap nginx-config \
      --from-literal=limit-rate.conf="limit_req_zone \$binary_remote_addr zone=inference:10m rate=10r/s; \
        limit_req zone=inference burst=20 nodelay;" \
      --namespace production

Update the NGINX Helm release to mount the rate-limiting config:

    helm upgrade api-gateway bitnami/nginx \
      --set server.serverBlock='upstream inference {
        server inference-api:8000;
      }
      server {
        listen 80;
        server_name _;
        location /inference {
          limit_req zone=inference burst=20 nodelay;
          proxy_pass http://inference;
          proxy_connect_timeout 10s;
          proxy_read_timeout 30s;
        }
      }' \
      --namespace production

Coordinated stack upgrades update Redis, PostgreSQL, and NGINX in sequence, verifying inference remains online:

    #!/bin/bash
    # upgrade-stack.sh
    set -euo pipefail  # stop at the first failed step instead of upgrading the next service

    echo "=== Upgrading AI supporting services stack ==="

    # 1. Upgrade Redis (replication handles failover)
    echo "Upgrading Redis..."
    helm upgrade cache-layer bitnami/redis \
      --set image.tag=7.2-debian-12 \
      --namespace production
    kubectl rollout status statefulset/cache-layer-redis-replicas --namespace production

    # 2. Upgrade PostgreSQL (read replica stays online)
    echo "Upgrading PostgreSQL..."
    helm upgrade model-db bitnami/postgresql \
      --set image.tag=15.3-debian-12 \
      --namespace production
    kubectl rollout status statefulset/model-db-postgresql-read --namespace production

    # 3. Upgrade NGINX (rolling update keeps gateway online)
    echo "Upgrading NGINX..."
    helm upgrade api-gateway bitnami/nginx \
      --set image.tag=1.25-debian-12 \
      --namespace production
    kubectl rollout status deployment/api-gateway --namespace production

    # 4. Smoke test full stack (-f makes curl fail on HTTP errors)
    echo "Running smoke tests..."
    curl -sf http://inference.company.com/health | jq .
    curl -sf http://inference.company.com/inference \
      -X POST \
      -H "Content-Type: application/json" \
      -d '{"input": "test"}' | jq .

    echo "=== Stack upgrade complete ==="

Run this script during maintenance windows:

    ./upgrade-stack.sh 2>&1 | tee upgrade-$(date +%s).log

Related Documentation
deploying-ai-containers-production.md: Initial deployment, cluster provisioning, storage setup.
ai-container-attack-surface.md: GPU security, model extraction prevention, supply chain risks.
upgrade-patching-playbook.md: Framework for scheduled patching, vulnerability response timelines.
rollback-disaster-recovery.md: Disaster recovery procedures, backup strategies, RTO/RPO targets.
