
Operating AI Containers at Scale: Day-2 Patterns for ML Infrastructure

Deploying AI containers is the easy part. Operating them — handling GPU failures, updating model versions without downtime, managing CUDA compatibility across clusters, patching framework vulnerabilities in images serving live inference — is where most organizations struggle. This guide covers production-tested patterns for running stateful ML workloads in Kubernetes using hardened images from registry.cleanstart.com.

GPU Lifecycle Management

GPU nodes are compute-expensive and failure-prone. Unlike stateless web services, GPU failures cascade: running inference jobs hang, model serving latency spikes, and recovery requires graceful draining plus node replacement.

Monitor GPU health relentlessly by deploying node-exporter with NVIDIA GPU metrics to understand GPU state:

helm install nvidia-metrics nvidia/nvidia-gpu-prometheus-exporter \
  --set nvidia.address=localhost:5555 \
  --namespace monitoring

Watch for GPU memory fragmentation (memory that is free but unusable), thermal throttling (GPU clocks drop while utilization stays at 100%), power throttling (the GPU is power-limited, often alongside thermal issues), and uncorrectable ECC errors (even a single error can indicate a pending failure).

Prometheus queries to alert on critical GPU issues:

# GPU memory pressure
nvidia_smi_used_memory_mb / nvidia_smi_total_memory_mb > 0.95

# Throttling active (any non-zero throttle reason)
nvidia_gpu_power_throttle_reason != 0

# ECC errors accumulating
rate(nvidia_gpu_ecc_errors_uncorrected[5m]) > 0
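As a rough illustration of what the three alert expressions test, the sketch below applies the same conditions to a single in-memory sample. The `GpuSample` fields and the alert names are hypothetical and exist only for this example; the thresholds mirror the queries above.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One scrape of the metrics used in the alert rules (hypothetical shape)."""
    used_memory_mb: float
    total_memory_mb: float
    throttle_reason: int        # non-zero means some throttle is active
    ecc_uncorrected_prev: int   # counter value at the start of the window
    ecc_uncorrected_now: int    # counter value at the end of the window


def firing_alerts(sample: GpuSample, memory_threshold: float = 0.95) -> list[str]:
    """Return the names of the alert conditions that hold for this sample."""
    alerts = []
    if sample.total_memory_mb > 0 and (
        sample.used_memory_mb / sample.total_memory_mb > memory_threshold
    ):
        alerts.append("gpu_memory_pressure")
    if sample.throttle_reason != 0:
        alerts.append("gpu_throttling")
    # A positive rate() on a counter means it increased during the window.
    if sample.ecc_uncorrected_now > sample.ecc_uncorrected_prev:
        alerts.append("gpu_ecc_uncorrected")
    return alerts


if __name__ == "__main__":
    print(firing_alerts(GpuSample(30_000, 40_960, 0, 0, 0)))
    print(firing_alerts(GpuSample(39_500, 40_960, 4, 0, 1)))
```

In a real cluster these conditions would live in Prometheus alerting rules; the function is only a way to reason about which samples would trip them.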

Handle GPU node failures without data loss by draining the node gracefully, which respects PodDisruptionBudgets:

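# Drain the node gracefully (respects PodDisruptionBudgets).
# The selector evicts only pods that are not labelled workload-type=inference;
# rerun without --pod-selector once inference traffic has moved elsewhere.
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --pod-selector='workload-type!=inference'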

For inference workloads, use readiness probes to stop sending traffic before eviction:

spec:
  containers:
  - name: model-server
    readinessProbe:
      httpGet:
        path: /ready
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 5
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workload-type
              operator: In
              values: ["inference"]
          topologyKey: kubernetes.io/hostname

CUDA and driver version alignment is critical since GPU containers fail silently when CUDA versions mismatch. Every node must report its driver version and available CUDA capability:

# On each GPU node
nvidia-smi --query-gpu=driver_version,compute_cap --format=csv > /etc/kubernetes/gpu-info.txt

# Add to node labels
kubectl label node <node-name> \
  nvidia.com/cuda-driver=550.107.02 \
  nvidia.com/compute-cap=8.0

Pin image CUDA versions to node capabilities:

spec:
  nodeSelector:
    nvidia.com/compute-cap: "8.0"  # Ampere (exact match on the label value)
  containers:
  - name: model-server
    image: registry.cleanstart.com/pytorch:2.2-cuda12.1-cudnn8.9
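Before pinning an image to a node pool, it can help to check the image's CUDA version against the node's driver. The sketch below parses the CUDA version out of an image tag and compares the driver against a minimum per CUDA major version. The minimums in the table follow NVIDIA's published CUDA compatibility notes as understood at the time of writing; treat them as an assumption and confirm against the current release notes. The function names are hypothetical.

```python
import re

# Minimum Linux driver per CUDA major version (assumption: taken from NVIDIA's
# CUDA compatibility notes; verify before relying on these numbers).
MIN_DRIVER_FOR_CUDA_MAJOR = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def cuda_version_from_tag(tag: str) -> tuple[int, int]:
    """Extract (major, minor) from a tag such as '2.2-cuda12.1-cudnn8.9'."""
    match = re.search(r"cuda(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"no CUDA version found in tag {tag!r}")
    return int(match.group(1)), int(match.group(2))


def parse_driver(version: str) -> tuple[int, ...]:
    """Turn '550.107.02' into (550, 107, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports_tag(driver_version: str, image_tag: str) -> bool:
    """True when the driver meets the recorded minimum for the tag's CUDA major."""
    major, _minor = cuda_version_from_tag(image_tag)
    minimum = MIN_DRIVER_FOR_CUDA_MAJOR.get(major)
    if minimum is None:
        raise ValueError(f"no minimum driver recorded for CUDA {major}.x")
    return parse_driver(driver_version) >= minimum


if __name__ == "__main__":
    print(driver_supports_tag("550.107.02", "2.2-cuda12.1-cudnn8.9"))
```

A check like this could run in CI against the node labels recorded earlier, so a mismatch is caught before a pod is scheduled.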

During cluster upgrades that change driver versions, test the new driver + updated image combination on a canary node first:

# Update GPU driver on single node
# (orchestrated by node-driver-installer DaemonSet)
kubectl label node canary-gpu nvidia.com/driver-update=true

# Watch for successful CUDA handshake
kubectl logs -l app=node-driver-installer -f

# Only after success: roll out to production nodes
kubectl label nodes -l pool=gpu-prod nvidia.com/driver-update=true --overwrite

Model Version Management

Models are the application state. A single inference service might serve v1.2.0, v1.3.0-canary, and v1.2.1-hotfix simultaneously. Kubernetes doesn't understand model semantics, so you must orchestrate updates explicitly.

Blue-green model deployments keep two parallel inference services (blue and green), each tied to a model version. Switch traffic by updating the routing layer, here an Istio VirtualService:

# Deploy v2.0 model to green service (scale: 0 initially)
helm install model-inference-green ./charts/inference-api \
  --set model.version=v2.0 \
  --set replicas=0 \
  --namespace production

# Once green is ready, scale up
kubectl scale deployment model-inference-green --replicas=3 --namespace production

# Send 10% of traffic to green for smoke testing
# (VirtualService hosts is a list of strings; http routes sit directly under spec)
kubectl patch virtualservice model-inference \
  --namespace production \
  --type merge \
  -p '{"spec":{"hosts":["model.internal"],"http":[
    {"route":[{"destination":{"host":"model-blue","port":{"number":8000}},"weight":90},
              {"destination":{"host":"model-green","port":{"number":8000}},"weight":10}]}
  ]}}'

# If metrics look good, shift 100% to green
kubectl patch virtualservice model-inference \
  --namespace production \
  --type merge \
  -p '{"spec":{"hosts":["model.internal"],"http":[
    {"route":[{"destination":{"host":"model-green","port":{"number":8000}},"weight":100}]}
  ]}}'

# Scale down blue
kubectl scale deployment model-inference-blue --replicas=0 --namespace production
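Hand-written JSON patches are easy to get wrong, so one option is to generate them. The sketch below builds a merge patch for a blue/green weight split following the Istio VirtualService schema, where `hosts` is a list of strings and `http` sits directly under `spec`. The function name `route_patch` is hypothetical; the host, port, and service names are taken from the example above.

```python
import json


def route_patch(blue_weight: int, green_weight: int,
                host: str = "model.internal", port: int = 8000) -> str:
    """Build a JSON merge patch that splits traffic between blue and green."""
    if blue_weight < 0 or green_weight < 0 or blue_weight + green_weight != 100:
        raise ValueError("weights must be non-negative and sum to 100")
    routes = [
        {"destination": {"host": name, "port": {"number": port}}, "weight": weight}
        for name, weight in (("model-blue", blue_weight), ("model-green", green_weight))
        if weight > 0  # drop destinations that receive no traffic
    ]
    patch = {"spec": {"hosts": [host], "http": [{"route": routes}]}}
    return json.dumps(patch, separators=(",", ":"))


if __name__ == "__main__":
    print(route_patch(90, 10))
```

The returned string can be passed to `kubectl patch ... --type merge -p`. Rejecting weights that do not sum to 100 catches the most common copy-and-paste mistake before it reaches the cluster.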

Canary model deployments provide gradual rollout for incremental updates. Not all traffic switchovers need 10% jumps. Use Flagger to gradually shift traffic based on error rate and latency metrics:

helm repo add flagger https://flagger.app
helm install flagger flagger/flagger --namespace istio-system

# Define canary rules: shift 10% every 5 minutes if no errors
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: model-inference
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference
  service:
    port: 8000
  analysis:
    interval: 5m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    # Built-in checks; custom metric names require a MetricTemplate via templateRef
    - name: request-success-rate
      thresholdRange:
        min: 99   # at most 1% errors
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # milliseconds
      interval: 1m
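To see how `stepWeight`, `maxWeight`, and `threshold` interact, the sketch below simulates the traffic weight after each analysis interval. It is a simplified model of the rollout loop written for illustration, not Flagger's implementation; the function name and return labels are hypothetical.

```python
def canary_weights(checks, step_weight=10, max_weight=50, threshold=5):
    """Return (weights, outcome) for a sequence of per-interval check results.

    checks is an iterable of booleans, one per analysis interval: True when
    all metric checks passed. Failed intervals hold the weight steady and
    count toward the rollback threshold.
    """
    weight, failures, history = 0, 0, []
    for passed in checks:
        if passed:
            weight = min(weight + step_weight, max_weight)
        else:
            failures += 1
            if failures >= threshold:
                history.append(0)  # rollback: all traffic returns to primary
                return history, "rolled_back"
        history.append(weight)
        if weight >= max_weight:
            return history, "ready_to_promote"
    return history, "in_progress"


if __name__ == "__main__":
    print(canary_weights([True] * 5))
```

With the values above, five consecutive passing intervals move the canary through 10, 20, 30, 40, and 50 percent, so at a 5-minute interval the canary reaches its maximum weight after about 25 minutes.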

Instant rollback to a known-good model requires preparation. Before deploying a new model version, record the current image reference and a baseline of inference metrics:

# Record the image currently serving v1.2.0
kubectl get deployment model-inference \
  -o jsonpath='{.spec.template.spec.containers[0].image}' > /backups/model-v1.2.0-image.txt

# Store inference latency p99 from Prometheus
# (assumes model_latency_seconds is a histogram, so the query reads its _bucket series)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(model_latency_seconds_bucket[5m])) by (le))' \
  | jq '.data.result[].value[1]' > /backups/model-v1.2.0-latency-p99.txt

# On rollback: redeploy from stored image
kubectl set image deployment/model-inference \
  model="$(cat /backups/model-v1.2.0-image.txt)"

# Verify latency returns to baseline within 2 minutes
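The final verification step can be automated by comparing a fresh p99 reading against the stored baseline. The sketch below is one way to do that; the 10% tolerance is an assumption chosen for illustration, and both function names are hypothetical.

```python
def parse_prometheus_value(raw: str) -> float:
    """Parse a value as written by jq without -r, for example '"0.231"'."""
    return float(raw.strip().strip('"'))


def within_baseline(current_p99_s: float, baseline_p99_s: float,
                    tolerance: float = 0.10) -> bool:
    """True when current p99 latency is no more than `tolerance` above baseline."""
    if baseline_p99_s <= 0:
        raise ValueError("baseline must be positive")
    return current_p99_s <= baseline_p99_s * (1 + tolerance)


if __name__ == "__main__":
    baseline = parse_prometheus_value('"0.231"\n')
    print(within_baseline(0.25, baseline))
    print(within_baseline(0.30, baseline))
```

A rollback job could poll this check until it passes or the 2-minute window expires, then page an operator if latency has not recovered.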

Image Update Workflows

CleanStart publishes patched images (e.g., pytorch:2.2-cuda12.1-cudnn8.9-20250301 fixing CUDA vulnerability CVE-2024-12345). Rolling these updates out across a production cluster requires staging, validation, and a coordinated rollout.

Stage image updates in a dev environment to test patched images before production rollout:

# 1. Add CleanStart Helm repo
helm repo add cleanstart https://helm.cleanstart.dev
helm repo update

# 2. Deploy staging release with new image tag
helm install inference-staging ./charts/inference-api \
  --set image.repository=registry.cleanstart.com/pytorch \
  --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
  --set environment=staging \
  --namespace staging

# 3. Run smoke tests against staging
# - Model loading should complete in <10s
# - First inference batch should succeed
# - GPU memory should stabilize within 30s
curl -X POST http://inference-staging:8000/health
curl -X POST http://inference-staging:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"input": "test data"}'

# 4. Monitor staging resource usage (compare with current prod)
kubectl top pods -n staging -l app=inference-api

Progressive rollout to production uses a Helm upgrade with a rolling update strategy:

# Create new release candidate (v2.1)
helm upgrade inference ./charts/inference-api \
  --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
  --set updateStrategy=RollingUpdate \
  --set maxUnavailable=1 \
  --set maxSurge=1 \
  --namespace production \
  --dry-run --debug  # Always dry-run first

# If dry-run output looks correct:
helm upgrade inference ./charts/inference-api \
  --set image.tag=2.2-cuda12.1-cudnn8.9-20250301 \
  --set updateStrategy=RollingUpdate \
  --set maxUnavailable=1 \
  --set maxSurge=1 \
  --namespace production

# Monitor rollout in real-time
kubectl rollout status deployment/inference --namespace production --watch

Automated image scanning before deployment catches vulnerabilities before they reach production:

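#!/bin/bash
# scan-and-deploy.sh
set -euo pipefail

IMAGE_TAG=${1:?usage: scan-and-deploy.sh <image-tag>}
NEW_IMAGE="registry.cleanstart.com/pytorch:${IMAGE_TAG}"

# Scan image for vulnerabilities.
# --exit-code 1 makes trivy return non-zero when findings match the severity filter;
# without it trivy exits 0 even when vulnerabilities are found.
if ! trivy image --exit-code 1 --severity HIGH,CRITICAL "$NEW_IMAGE"; then
  echo "Image scan failed. Aborting deployment."
  exit 1
fi

# Scan passed: proceed with upgrade
helm upgrade inference ./charts/inference-api \
  --set image.tag="${IMAGE_TAG}" \
  --namespace production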

Run this before any production deployment:

./scan-and-deploy.sh 2.2-cuda12.1-cudnn8.9-20250301

Roll back immediately when a patched image introduces a latency regression or crashes:

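# Inspect release history to confirm what the previous revision was
helm history inference --namespace production

# Immediate rollback to previous release (omitting the revision targets the previous one)
helm rollback inference --namespace production

# Kubernetes reverts to the previous image
kubectl rollout status deployment/inference --namespace production --watch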

Scaling Economics

GPU time is expensive ($0.35–1.20 per GPU-hour on cloud). Running idle inference servers or maintaining excess GPU memory overhead compounds costs across thousands of deployments.

Right-size GPU memory requests by monitoring actual peak memory usage and requesting only what's needed:

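# Sample current GPU memory use in MiB; repeat over time (or query Prometheus) to find the peak
kubectl exec <pod> -- nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sort -n | tail -1

# Example: if model peaks at 18GB, request 20GB (allow 10% headroom)
# Note: `memory` is host RAM; GPU memory comes with the device (or MIG slice)
# assigned via nvidia.com/gpu, which must be set under limits.
resources:
  requests:
    memory: "20Gi"
  limits:
    nvidia.com/gpu: "1"
    memory: "24Gi"  # Allow 20% over request for temp spikes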

Oversizing the GPU memory request prevents GPU sharing and wastes $5,000+/month per idle GPU.
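The headroom rule in the example (an 18GB peak becomes a 20GB request) can be written down as a small calculation. The sketch below assumes peak samples are collected in MiB, as `nvidia-smi` reports them, and rounds up to whole GiB; the function name and the 10% default are illustrative.

```python
import math


def recommended_request_gib(peak_samples_mib, headroom=0.10):
    """Round the observed peak plus headroom up to a whole number of GiB."""
    if not peak_samples_mib:
        raise ValueError("need at least one sample")
    peak_gib = max(peak_samples_mib) / 1024
    # round() guards against float artifacts such as 11.000000000000002
    return math.ceil(round(peak_gib * (1 + headroom), 6))


if __name__ == "__main__":
    print(recommended_request_gib([15000, 18432, 17000]))
```

Feeding it samples from a representative traffic period matters more than the exact headroom figure, since a single quiet-hour sample will understate the peak.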

Time-based scaling for batch workloads works when your inference service handles primarily evening peak loads:

# Install KEDA for time-based scaling
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

# Scale down to 1 replica (warm standby) from 9am–6pm on weekdays
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-time-scaling
spec:
  scaleTargetRef:
    name: inference-api
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 9 * * 1-5  # 9am weekdays
      end: 0 18 * * 1-5   # 6pm weekdays
      desiredReplicas: "1"
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 18 * * 1-5  # 6pm weekdays
      end: 0 9 * * 2-6    # 9am next day
      desiredReplicas: "8"

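Estimated savings: ~$120k/year on a 10-GPU inference cluster with uneven daily demand; the actual figure depends on GPU pricing and on how many hours replicas stay scaled down.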

Multi-Instance GPU (MIG) for small models partitions a single GPU into up to 7 independent instances. For models using <20GB VRAM:

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1

# Create a 1g.10gb GPU instance (one compute slice with 10GB) and its compute instance
nvidia-smi mig -i 0 -cgi 1g.10gb -C

# With the NVIDIA device plugin's "mixed" MIG strategy, Kubernetes exposes
# the slice as nvidia.com/mig-1g.10gb.
# Schedule small models on fractional GPU:
resources:
  limits:
    nvidia.com/mig-1g.10gb: "1"

Trade-off: GPU memory per instance is fixed, so a slice cannot accommodate ad-hoc larger models. MIG is cost-effective for serving models of a fixed size.

Spot instances suit non-critical inference, such as workloads that can accept latency of around 30 seconds:

# Create a preemptible GPU node pool at lower cost (GKE; the cluster name is a placeholder,
# and the accelerator type must be one offered in your zone, T4 shown as an example)
gcloud container node-pools create spot-gpu \
  --cluster=<cluster-name> \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --node-taints=preemptible=true:NoSchedule \
  --preemptible

# Tolerate preemption with priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 100
globalDefault: false
---
# Pod spec fragment
spec:
  priorityClassName: batch-inference
  tolerations:
  - key: preemptible
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: cloud.google.com/gke-preemptible
            operator: In
            values: ["true"]

Savings: up to 80% on compute cost with graceful restart handling.

Incident Response for AI Containers

AI container incidents differ from traditional application incidents. GPU memory dumps leak sensitive model weights. Model inference degradation cascades to dependent services before visibility appears in application logs. Rollback must combine both image and model atomically.

Detect model quality degradation before users notice by deploying an inference validator that samples a fraction of predictions (5% in the configuration below):

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-validator
data:
  validation-rules.json: |
    {
      "sample-rate": 0.05,
      "anomaly-detection": {
        "output-shape-mismatch": {"alert": true},
        "output-nan-or-inf": {"alert": true},
        "output-range-violation": {
          "min": -1000,
          "max": 1000,
          "alert": true
        }
      },
      "latency-thresholds-p99": {
        "normal": 250,
        "degraded": 500,
        "failure": 1000
      }
    }
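The anomaly rules in this configuration map onto a few simple checks. The sketch below shows one possible reading of them for a flat list of output values; it is not the validator image's actual logic, and the function names and the `expected_len` parameter are hypothetical.

```python
import math
import random

# Bounds copied from the "output-range-violation" rule in the ConfigMap.
RANGE_RULE = {"min": -1000, "max": 1000}


def should_sample(rate: float = 0.05, rng=random) -> bool:
    """Decide whether to validate this prediction, given the sample rate."""
    return rng.random() < rate


def validate_output(values, expected_len, bounds=RANGE_RULE):
    """Return the names of the anomaly rules violated by one prediction."""
    anomalies = []
    if len(values) != expected_len:
        anomalies.append("output-shape-mismatch")
    if any(math.isnan(v) or math.isinf(v) for v in values):
        anomalies.append("output-nan-or-inf")
    # Range is checked on finite values only; NaN and inf are reported above.
    finite = [v for v in values if math.isfinite(v)]
    if any(v < bounds["min"] or v > bounds["max"] for v in finite):
        anomalies.append("output-range-violation")
    return anomalies


if __name__ == "__main__":
    print(validate_output([0.1, 2000.0, 3.0], expected_len=2))
```

Keeping each rule as an independent check means a single bad prediction can raise several alerts, which makes it easier to tell a shape regression from a numerical one.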

The validator sidecar runs alongside inference and triggers alerts on anomalies:

spec:
  containers:
  - name: model-server
    image: registry.cleanstart.com/pytorch:2.2-cuda12.1
  - name: inference-validator
    image: registry.cleanstart.com/inference-validator:1.0
    env:
    - name: INFERENCE_ENDPOINT
      value: "localhost:8000"
    - name: VALIDATION_CONFIG
      valueFrom:
        configMapKeyRef:
          name: inference-validator
          key: validation-rules.json
    resources:
      requests:
        cpu: "500m"

Limit model theft after a container escape by restricting what a compromised inference container can read, write, and reach over the network:

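spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: model-server
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true  # container-level field
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: model
      mountPath: /model
      readOnly: true
    - name: tmp
      mountPath: /tmp
  - name: gpu-monitor
    image: registry.cleanstart.com/nvidia-smi:latest
    securityContext:
      privileged: false
  volumes:
  - name: model
    secret:
      secretName: model-weights
      defaultMode: 0400
  - name: tmp
    emptyDir: {}
---
# Egress restrictions live in a separate NetworkPolicy resource, not in the pod spec.
# The podSelector label is an assumption; match it to your inference pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: inference-api
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: production
    ports:
    - protocol: TCP
      port: 8000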

If model weights are extracted despite protections, rotate immediately:

# Generate new model artifact
python retrain.py --output-path=/tmp/model-v1.2.1-hotfix

# Patch secret
kubectl create secret generic model-weights \
  --from-file=/tmp/model-v1.2.1-hotfix \
  --dry-run=client -o yaml | kubectl apply -f -

# Force pod restart to load new secret
kubectl rollout restart deployment/inference-api

GPU memory dump containment limits sensitive data leakage on GPU failure. Reduce the GPU state that persists on the node and restrict access to the device files:

# Reduce GPU state that persists between workloads
nvidia-smi -pm 0  # Disable persistence mode
nvidia-smi -rgc   # Reset any locked GPU clocks to defaults

# Mount /dev/nvidia* as read-only where possible
# (hostPath devices are attached with volumeMounts; volumeDevices applies only to block PVCs)
spec:
  containers:
  - name: model-server
    volumeMounts:
    - name: gpu
      mountPath: /dev/nvidia0
      readOnly: true
  volumes:
  - name: gpu
    hostPath:
      path: /dev/nvidia0
      type: CharDevice

Rapid rollback bundle ensures you can recover quickly. Before deploying a new image or model version, create a tagged rollback point:

#!/bin/bash
# create-rollback-point.sh
set -euo pipefail

RELEASE_NAME=$1
NAMESPACE=$2

# Capture current state
CURRENT_IMAGE=$(kubectl get deployment "${RELEASE_NAME}" \
  -n "${NAMESPACE}" -o jsonpath='{.spec.template.spec.containers[0].image}')
CURRENT_MODEL=$(kubectl get configmap "${RELEASE_NAME}-model" \
  -n "${NAMESPACE}" -o jsonpath='{.data.model-version}')
# Helm revision number (helm rollback expects a revision, not a chart name)
CURRENT_HELM_RELEASE=$(helm list -n "${NAMESPACE}" --filter "^${RELEASE_NAME}\$" -o json \
  | jq -r '.[0].revision')

# Create rollback snapshot
SNAPSHOT="/backups/${RELEASE_NAME}-rollback-$(date +%s).json"
cat > "${SNAPSHOT}" <<EOF
{
  "image": "${CURRENT_IMAGE}",
  "model": "${CURRENT_MODEL}",
  "helm-release": "${CURRENT_HELM_RELEASE}",
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF

echo "Rollback point created: ${SNAPSHOT}"

On incident:

# Restore from rollback point
ROLLBACK_FILE="/backups/inference-api-rollback-1719432567.json"
IMAGE=$(jq -r '.image' "${ROLLBACK_FILE}")
MODEL=$(jq -r '.model' "${ROLLBACK_FILE}")
# Keys containing a hyphen need bracket syntax in jq
HELM_RELEASE=$(jq -r '.["helm-release"]' "${ROLLBACK_FILE}")

helm rollback inference-api "${HELM_RELEASE}" --namespace production
kubectl set image deployment/inference-api \
  model-server="${IMAGE}" \
  --namespace production

# Verify inference returns to normal
kubectl rollout status deployment/inference-api --namespace production --watch
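During an incident it is easy to pick the wrong snapshot by hand. The sketch below selects the newest rollback file for a release by parsing the epoch timestamp out of names such as `inference-api-rollback-1719432567.json`. The function name is hypothetical, and the file list would normally come from listing `/backups`.

```python
import re
from pathlib import Path

# Matches "<release>-rollback-<epoch>.json", the naming used by the snapshot script.
SNAPSHOT_RE = re.compile(r"^(?P<release>.+)-rollback-(?P<epoch>\d+)\.json$")


def latest_snapshot(filenames, release):
    """Return the snapshot path with the highest epoch for `release`, or None."""
    best_name, best_epoch = None, -1
    for name in filenames:
        match = SNAPSHOT_RE.match(Path(name).name)
        if match and match.group("release") == release:
            epoch = int(match.group("epoch"))
            if epoch > best_epoch:
                best_name, best_epoch = name, epoch
    return best_name


if __name__ == "__main__":
    files = [
        "/backups/inference-api-rollback-1719000000.json",
        "/backups/inference-api-rollback-1719432567.json",
    ]
    print(latest_snapshot(files, "inference-api"))
```

Comparing the parsed epoch rather than file modification times keeps the choice stable even if backups are copied or restored from another volume.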

Bitnami Chart Operations for AI Supporting Services

ML inference workloads depend on Redis (caching, session state), PostgreSQL (model metadata, request logs), and NGINX (load balancing, request routing). Coordinated updates of the full stack prevent cascading failures.

Upgrade Redis with zero cache loss using Bitnami's Redis Helm chart, which supports seamless upgrades through replication:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install cache-layer bitnami/redis \
  --set architecture=replication \
  --set replica.replicaCount=2 \
  --set replica.persistence.enabled=true \
  --namespace production

# Upgrade Redis (replicas keep serving reads while the primary updates).
# --reuse-values keeps the replication and persistence settings from the install.
helm upgrade cache-layer bitnami/redis \
  --reuse-values \
  --set image.tag=7.2-debian-12 \
  --namespace production

# Monitor upgrade
kubectl rollout status statefulset/cache-layer-redis-replicas --namespace production --watch

The cache remains available for reads during the entire upgrade (replicas serve reads while the primary updates).

PostgreSQL stores model metadata (model versions, training runs, and inference metrics) and can be operated with minimal downtime:

helm install model-db bitnami/postgresql \
  --set primary.persistence.enabled=true \
  --set primary.persistence.size=50Gi \
  --set readReplicas.replicaCount=1 \
  --set readReplicas.persistence.enabled=true \
  --namespace production

# Schema for model metadata
# (-i without -t, because stdin is a heredoc rather than a terminal)
kubectl exec -i model-db-postgresql-0 --namespace production -- psql -U postgres <<EOF
CREATE TABLE models (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  version VARCHAR(50) NOT NULL,
  image_uri VARCHAR(255),
  cuda_version VARCHAR(20),
  framework VARCHAR(50),
  gpu_memory_mb INT,
  created_at TIMESTAMP DEFAULT NOW(),
  metadata JSONB
);

CREATE TABLE inference_runs (
  id SERIAL PRIMARY KEY,
  model_id INT REFERENCES models(id),
  latency_ms INT,
  gpu_memory_used_mb INT,
  output_hash VARCHAR(64),
  status VARCHAR(50),
  timestamp TIMESTAMP DEFAULT NOW()
);
EOF

Inference pods query model-db-postgresql on port 5432 to validate the model version before loading from the registry:

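import psycopg2

model_version = "v1.2.0"  # example value; normally supplied by the deployment config

# The password is expected via PGPASSWORD or a mounted secret
conn = psycopg2.connect("host=model-db-postgresql port=5432 user=postgres")
with conn, conn.cursor() as cursor:
    cursor.execute("SELECT image_uri FROM models WHERE version = %s", (model_version,))
    row = cursor.fetchone()
conn.close()
if row is None:  # refuse to load a version that was never registered
    raise LookupError(f"model version {model_version} is not registered")
image_uri = row[0]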

NGINX reverse proxy with rate limiting for inference API protects model-serving endpoints from DDoS and uneven load distribution:

helm install api-gateway bitnami/nginx \
  --set service.type=LoadBalancer \
  --set ingress.enabled=true \
  --set ingress.hostname=inference.company.com \
  --namespace production

# ConfigMap for rate limiting rules
kubectl create configmap nginx-config \
  --from-literal=limit-rate.conf="limit_req_zone \$binary_remote_addr zone=inference:10m rate=10r/s; \
                                   limit_req zone=inference burst=20 nodelay;" \
  --namespace production

Update NGINX Helm release to mount rate-limiting config:

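# The rate-limit zone is declared in the server block file, which NGINX includes in the http context
helm upgrade api-gateway bitnami/nginx \
  --reuse-values \
  --set serverBlock='limit_req_zone $binary_remote_addr zone=inference:10m rate=10r/s;
upstream inference { server inference-api:8000; }
server {
  listen 80;
  server_name _;
  location /inference {
    limit_req zone=inference burst=20 nodelay;
    proxy_pass http://inference;
    proxy_connect_timeout 10s;
    proxy_read_timeout 30s;
  }
}' \
  --namespace production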

Coordinated stack upgrades update Redis, PostgreSQL, and NGINX in sequence, verifying inference remains online:

#!/bin/bash
# upgrade-stack.sh
set -euo pipefail  # stop at the first failed step instead of continuing

echo "=== Upgrading AI supporting services stack ==="

# 1. Upgrade Redis (replication handles failover)
echo "Upgrading Redis..."
helm upgrade cache-layer bitnami/redis \
  --set image.tag=7.2-debian-12 \
  --namespace production
kubectl rollout status statefulset/cache-layer-redis-replicas --namespace production

# 2. Upgrade PostgreSQL (read replica stays online)
echo "Upgrading PostgreSQL..."
helm upgrade model-db bitnami/postgresql \
  --set image.tag=15.3-debian-12 \
  --namespace production
kubectl rollout status statefulset/model-db-postgresql-read --namespace production

# 3. Upgrade NGINX (rolling update keeps gateway online)
echo "Upgrading NGINX..."
helm upgrade api-gateway bitnami/nginx \
  --set image.tag=1.25-debian-12 \
  --namespace production
kubectl rollout status deployment/api-gateway --namespace production

# 4. Smoke test full stack
echo "Running smoke tests..."
curl -s http://inference.company.com/health | jq .
curl -s http://inference.company.com/inference \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"input": "test"}' | jq .

echo "=== Stack upgrade complete ==="

Run this script during maintenance windows:

./upgrade-stack.sh 2>&1 | tee upgrade-$(date +%s).log

Related Documentation

deploying-ai-containers-production.md: Initial deployment, cluster provisioning, storage setup.
ai-container-attack-surface.md: GPU security, model extraction prevention, supply chain risks.
upgrade-patching-playbook.md: Framework for scheduled patching, vulnerability response timelines.
rollback-disaster-recovery.md: Disaster recovery procedures, backup strategies, RTO/RPO targets.
