Moving AI workloads from development to production introduces challenges that standard web applications never face: GPU scheduling, model file management, inference latency requirements, and the security implications of running high-value model assets in shared infrastructure. This guide covers GPU node setup, inference server deployment, model distribution, scaling strategies, and production security hardening.
GPU Node Configuration
Kubernetes doesn't natively understand GPUs. You must explicitly install the NVIDIA device plugin and configure nodes to advertise GPU resources.
Installing the NVIDIA Device Plugin
The NVIDIA device plugin should be deployed as a DaemonSet to ensure every GPU node in your cluster automatically exposes its GPUs to the Kubernetes scheduler. This allows pods to request GPU resources just like they request CPU and memory.

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

To verify that GPU discovery is working correctly, describe any node that has GPU hardware. You should see nvidia.com/gpu in the allocatable resources. The expected output shows the number of GPUs available on each node, such as nvidia.com/gpu: 2 on GPU-enabled nodes.

    kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
    # Output: nvidia.com/gpu: 2 (on GPU nodes)

GPU Resource Requests in Pod Specs
When defining your pod specifications, explicitly request GPUs in the resources section. GPUs are an extended resource: Kubernetes requires a limit for them, and if you also specify a request it must equal the limit. Each GPU is then allocated whole to one container. Avoid sharing a GPU between workloads; shared GPU access is unreliable for ML workloads since inference jobs are sensitive to resource contention and performance variations.

    apiVersion: v1
    kind: Pod
    metadata:
      name: inference-server
    spec:
      containers:
      - name: server
        image: registry.cleanstart.com/inference-ollama:latest
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1

Node Affinity and GPU Node Taints
To prevent non-GPU workloads from consuming GPU node resources, you should taint GPU nodes so they only accept workloads that specifically request GPU access. This ensures your expensive GPU nodes are reserved for inference and training jobs rather than being occupied by general-purpose applications.

    kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

Any pod that requires GPUs must include matching tolerations to be scheduled on these tainted nodes.

    tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule

In addition to taints, use node affinity to guarantee pod placement on GPU nodes. This ensures that your inference workloads run on the correct hardware.

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: accelerator
              operator: In
              values:
              - nvidia-gpu

Multi-GPU Configurations
For models that require multiple GPUs to fit in memory or benefit from distributed tensor parallelism, request all necessary GPUs in your pod specification. Inside the container, the allocated GPUs are typically enumerated starting from index 0, so if you set CUDA_VISIBLE_DEVICES, list every allocated device (0 through N-1), not just the first one.

    resources:
      requests:
        nvidia.com/gpu: 4
      limits:
        nvidia.com/gpu: 4
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0,1,2,3"

Deploying Inference Servers with Helm
CleanStart maintains Helm charts for production inference servers. Use the universal values.yaml pattern across all charts to ensure consistent deployments and easy version management.
Ollama Deployment
Deploy Ollama, a local LLM server, using the CleanStart chart. First add the Helm repository, then install the chart with your custom values.

    helm repo add cleanstart https://helm.cleanstart.dev
    helm repo update
    helm install ollama cleanstart/inference-ollama -f ollama-values.yaml

Here's a sample ollama-values.yaml configuration that sets up a production-ready Ollama deployment with two replicas, each requesting one GPU, persistent storage for models, and proper health checks.

    replicaCount: 2
    image:
      repository: registry.cleanstart.com/inference-ollama
      tag: latest
      pullPolicy: IfNotPresent
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: 8Gi
        cpu: 4
      limits:
        nvidia.com/gpu: 1
        memory: 8Gi
        cpu: 4
    service:
      type: ClusterIP
      port: 11434
    modelStorage:
      enabled: true
      size: 100Gi
      storageClass: fast-ssd
    initContainers:
    - name: load-model
      image: registry.cleanstart.com/inference-ollama:latest
      command:
      - sh
      - -c
      - |
        # ollama pull talks to a running server, so start one for the download
        ollama serve &
        sleep 5
        ollama pull mistral:latest
      volumeMounts:
      - name: models
        mountPath: /root/.ollama
    livenessProbe:
      httpGet:
        path: /api/tags
        port: 11434
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /api/tags
        port: 11434
      initialDelaySeconds: 30
      periodSeconds: 5

vLLM Deployment with Tensor Parallelism
vLLM provides high-throughput inference with support for tensor parallelism across multiple GPUs. Deploy vLLM for large language models that benefit from distributed serving.

    helm install vllm cleanstart/inference-vllm -f vllm-values.yaml

Here's the configuration with tensor parallelism enabled for serving very large models like a 70-billion parameter LLM:

    replicaCount: 1
    image:
      repository: registry.cleanstart.com/inference-vllm
      tag: latest
    resources:
      requests:
        nvidia.com/gpu: 4
        memory: 32Gi
        cpu: 16
      limits:
        nvidia.com/gpu: 4
        memory: 32Gi
        cpu: 16
    env:
    - name: TENSOR_PARALLEL_SIZE
      value: "4"
    - name: MODEL_NAME
      value: "meta-llama/Llama-2-70b-hf"
    - name: MAX_MODEL_LEN
      value: "8192"
    modelStorage:
      enabled: true
      size: 200Gi
    service:
      port: 8000
      targetPort: 8000
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 5

Tensor parallelism distributes the model across multiple GPUs, which is essential for 70B+ parameter models that don't fit on a single accelerator.
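
The arithmetic behind that claim can be sketched as follows. This is a rough estimate that counts only the weights (no KV cache, activations, or runtime overhead) and assumes 16-bit weights split evenly across GPUs; the function names are illustrative and not part of vLLM.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def per_gpu_weight_gib(params_billion: float, tensor_parallel_size: int,
                       bytes_per_param: int = 2) -> float:
    """Weights held by each GPU, assuming tensor parallelism splits them evenly."""
    return weight_memory_gib(params_billion, bytes_per_param) / tensor_parallel_size


if __name__ == "__main__":
    total = weight_memory_gib(70)
    shard = per_gpu_weight_gib(70, tensor_parallel_size=4)
    print(f"70B weights at 16-bit: {total:.1f} GiB total, {shard:.1f} GiB per GPU")
```

Under these assumptions a 70B model needs well over 100 GiB for weights alone, which is why the example sets TENSOR_PARALLEL_SIZE to match the four requested GPUs.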
Health Checks and Readiness Probes
Inference servers start slowly because model weights must be read from storage and loaded into GPU memory, either at startup or on the first request. Set generous initial delays to avoid premature restarts and incorrect readiness signals.

    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 30
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 5
      failureThreshold: 2

Liveness probes restart containers that have truly crashed. Readiness probes remove unhealthy pods from the load balancer without killing them, allowing recovery time.
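
To judge whether the delays are generous enough, it can help to estimate when the kubelet would first act on a probe that never succeeds. The sketch below is an approximation that ignores timeoutSeconds and scheduling jitter and assumes the first probe fires at initialDelaySeconds; the function name is hypothetical.

```python
def seconds_until_probe_action(initial_delay: int, period: int,
                               failure_threshold: int) -> int:
    """Approximate time after container start at which a probe that always
    fails reaches its failure threshold (restart for liveness, removal from
    endpoints for readiness)."""
    return initial_delay + period * (failure_threshold - 1)


def startup_budget_ok(model_load_seconds: int, initial_delay: int,
                      period: int, failure_threshold: int) -> bool:
    """True if the model finishes loading before the liveness probe gives up."""
    return model_load_seconds < seconds_until_probe_action(
        initial_delay, period, failure_threshold)
```

With the liveness settings above (120, 30, 3) the container has roughly three minutes to become healthy, so a model that takes longer to load needs a larger initialDelaySeconds or failureThreshold.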
Horizontal Pod Autoscaler (HPA) Based on Inference Latency
Don't scale based on CPU, which doesn't reflect inference queue depth. Instead, scale based on request latency percentiles, which directly measure user experience.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: vllm-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: vllm
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: inference_request_latency_p99
          target:
            type: AverageValue
            averageValue: 500m  # 0.5 seconds

Custom metrics like inference latency are not available to the HPA by default. Export latency from your inference server using Prometheus client libraries, and expose it through the custom metrics API with an adapter such as Prometheus Adapter.
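
    from prometheus_client import Histogram

    # app, model, and CompletionRequest are assumed to be defined elsewhere
    # in the service (for example, a FastAPI app and a loaded model wrapper).
    latency_histogram = Histogram(
        'inference_request_latency_seconds',
        'Inference request latency',
        buckets=(0.1, 0.5, 1.0, 2.0, 5.0))

    @app.post("/v1/completions")
    async def inference(request: CompletionRequest):
        # Record the wall-clock duration of each request in the histogram.
        with latency_histogram.time():
            result = await model.generate(request)
        return result

Model Storage and Distribution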
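Model files are massive: at 16-bit precision, the weights of a 7B-parameter model take roughly 14 GB and those of a 70B-parameter model roughly 140 GB. Handle them deliberately rather than building them into container images.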
PersistentVolumeClaims for Model Storage
Create a shared volume for models that all inference containers can access. Using ReadOnlyMany access mode allows multiple pods to read the same models simultaneously, ensuring consistency and preventing accidental mutations.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ai-models
    spec:
      accessModes:
      - ReadOnlyMany
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi

Mount the volume as read-only in inference containers to prevent any modification of model files.

    # Pod-level volume definition
    volumes:
    - name: models
      persistentVolumeClaim:
        claimName: ai-models
    # Container-level mount
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true

Init Containers for Model Downloading
Download models before the inference server starts by using init containers. This ensures models are available before serving traffic and prevents cache misses on container startup.

    initContainers:
    - name: download-models
      image: registry.cleanstart.com/inference-ollama:latest
      command:
      - sh
      - -c
      - |
        # A marker file records that the copy completed on an earlier start.
        if [ ! -f /models/.mistral-ready ]; then
          echo "Downloading mistral model..."
          # ollama pull talks to a running server, so start one for the download
          ollama serve &
          sleep 5
          ollama pull mistral:latest
          cp -r /root/.ollama/models /models/
          touch /models/.mistral-ready
        fi
        echo "Models ready"
      volumeMounts:
      - name: models
        mountPath: /models
      - name: cache
        mountPath: /root/.ollama
      resources:
        requests:
          memory: 4Gi
          cpu: 4

Downloads happen once. On subsequent pod starts the init container still runs, but it finds the models already present and exits immediately, reducing startup time.
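
The same skip-if-present logic can be written in Python when the download step is a script rather than a shell one-liner. The sketch below is illustrative: ensure_model and the download callback are hypothetical names, and the expected SHA-256 digest is assumed to come from a trusted source such as a release manifest. It also verifies the file before reusing it and writes through a temporary file so that an interrupted download never leaves a partial model in place.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large model files are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_model(target: Path, expected_sha256: str,
                 download: Callable[[Path], None]) -> bool:
    """Return True if a download happened, False if a verified file was reused."""
    if target.exists() and sha256_of(target) == expected_sha256:
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        download(tmp_path)
        if sha256_of(tmp_path) != expected_sha256:
            raise ValueError(f"checksum mismatch for {target.name}")
        os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    finally:
        if tmp_path.exists():
            tmp_path.unlink()
    return True
```

Checking the digest on every start costs one read of the file; for very large models a marker file written after a successful verification is a cheaper alternative.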
Model Caching Strategies Across Pods
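Use a shared cache layer like Redis to cache model metadata and other small, JSON-serializable artifacts across pods; the weights themselves stay on the model volume. This reduces repeated loading work and improves request latency.

    import json

    import redis

    cache = redis.Redis(host='redis-cache', port=6379)

    def get_model(model_name: str):
        # Return the cached entry if another pod has already stored it.
        cached = cache.get(f"model:{model_name}")
        if cached:
            return json.loads(cached)
        # load_from_disk is assumed to be defined elsewhere and to return
        # JSON-serializable data (for example, model metadata), not raw weights.
        model = load_from_disk(model_name)
        cache.set(f"model:{model_name}", json.dumps(model), ex=3600)  # 1 hour TTL
        return model

For HuggingFace models, use the transformers library's built-in cache mechanism by setting appropriate environment variables.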

    env:
    - name: HF_HOME
      value: /models/huggingface-cache
    # Legacy variable; recent transformers releases prefer HF_HOME.
    - name: TRANSFORMERS_CACHE
      value: /models/huggingface-cache

Read-Only Volume Mounts for Model Integrity
Prevent containers from modifying model files by mounting volumes as read-only. This enforces immutability from inside the pod: a compromised inference container cannot tamper with or poison the weights that other pods load. A read-only mount does not by itself detect files that were corrupted before they reached the volume, so pair it with checksum verification at load time.

    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true

Scaling AI Workloads
Production AI systems must handle variable demand without accumulating idle resources.
Inference Scaling: Request-Based HPA
Scale based on request queue depth or custom metrics. This is more accurate than CPU-based scaling for inference workloads where queuing is the primary bottleneck.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: inference-scaler
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: inference-server
      minReplicas: 1
      maxReplicas: 20
      metrics:
      # CPU is kept only as a secondary signal; the HPA acts on whichever
      # metric asks for the most replicas.
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Pods
        pods:
          metric:
            name: http_requests_in_queue
          target:
            type: AverageValue
            averageValue: "10"

Queue depth is the most accurate scaling signal for inference. HPA increases pods when requests back up, ensuring requests don't time out waiting for processing.
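
The replica count the HPA aims for follows the documented ratio of the current metric value to the target, rounded up and clamped to the configured bounds. The sketch below reproduces that calculation for a single metric; it leaves out the controller's tolerance band and stabilization windows, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """ceil(current_replicas * current_value / target_value), clamped to bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 pods averaging 25 queued requests against a target of 10 would scale to 10 pods, while a nearly empty queue scales back toward minReplicas.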
Training Scaling: Distributed Training with Multiple GPU Pods
For training jobs, use Kubernetes Jobs with multiple GPU pods. Each pod runs a single training process, and distributed training frameworks coordinate across pods using environment variables.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: distributed-training
    spec:
      completionMode: Indexed  # required for the job-completion-index annotation
      parallelism: 4
      completions: 4
      template:
        spec:
          # Must match a headless Service of the same name so that
          # distributed-training-0 is resolvable by the other pods.
          subdomain: distributed-training
          containers:
          - name: trainer
            image: registry.cleanstart.com/training-pytorch:latest
            env:
            - name: RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: WORLD_SIZE
              value: "4"
            - name: MASTER_ADDR
              value: "distributed-training-0.distributed-training"
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never

Batch Inference: Kubernetes Jobs
For offline or scheduled inference that doesn't require immediate responses, use CronJobs to schedule batch processing. This is more cost-effective than running a continuously-up inference server for occasional workloads.

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: batch-inference
    spec:
      schedule: "0 2 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: batch-processor
                image: registry.cleanstart.com/inference-batch:latest
                env:
                - name: INPUT_PATH
                  value: /data/pending-tasks
                - name: OUTPUT_PATH
                  value: /data/results
                - name: BATCH_SIZE
                  value: "32"
                volumeMounts:
                - name: shared-data
                  mountPath: /data
                resources:
                  requests:
                    nvidia.com/gpu: 2
                  limits:
                    nvidia.com/gpu: 2  # GPU requests are only valid with a matching limit
              volumes:
              - name: shared-data
                persistentVolumeClaim:
                  claimName: batch-data
              restartPolicy: OnFailure

CronJobs run on a schedule. Jobs retry on failure. Setting restartPolicy: OnFailure restarts the container after transient failures; to avoid re-running successful inferences, the batch processor must skip inputs that already have results.
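
One way to make retries cheap is for the batch processor to work only on inputs that have no result yet. The following is a minimal sketch under assumed conventions: each task is a JSON file in INPUT_PATH and its result is written to OUTPUT_PATH under the same file name. Neither convention nor the function names come from the CleanStart image.

```python
from pathlib import Path
from typing import Iterator, List


def pending_inputs(input_dir: Path, output_dir: Path) -> List[Path]:
    """Inputs whose result file does not exist yet, in a stable order."""
    done = {path.stem for path in output_dir.glob("*.json")}
    return sorted(p for p in input_dir.glob("*.json") if p.stem not in done)


def batches(items: List[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

After a restart, pending_inputs returns only the unfinished tasks, so a container that failed halfway through the night's run resumes where it stopped.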
Cost Optimization: Spot/Preemptible GPU Nodes with Graceful Shutdown
Spot and preemptible GPU nodes can cost around 70% less than on-demand capacity, depending on provider and region, but they can be terminated on short notice. Use them with graceful shutdown handlers to drain workloads before node termination.

    # A100 GPUs are attached to A2 machine types, and --spot and
    # --preemptible are mutually exclusive, so only --spot is used here.
    gcloud container node-pools create gpu-spot \
      --cluster=my-cluster \
      --machine-type=a2-highgpu-4g \
      --accelerator=type=nvidia-tesla-a100,count=4 \
      --spot

Add a preemption handler to drain pods gracefully before node termination. This reduces data loss and allows workloads to complete or migrate to other nodes.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: preemption-handler
    spec:
      # apps/v1 requires a selector that matches the template labels.
      selector:
        matchLabels:
          app: preemption-handler
      template:
        metadata:
          labels:
            app: preemption-handler
        spec:
          containers:
          - name: handler
            image: registry.cleanstart.com/preemption-handler:latest
            env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            securityContext:
              privileged: true

The handler listens for GCP preemption notices and cordons the node, giving workloads 30 seconds to gracefully shut down.
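
Cordoning only helps if the workload itself reacts to termination. When a pod is evicted, its containers receive SIGTERM, and a server can use that signal to stop taking new work and finish what is in flight. The sketch below shows the pattern with the standard library only; GracefulShutdown and serve are hypothetical names, not part of the CleanStart handler.

```python
import signal
import threading
from typing import Callable, Iterable, List


class GracefulShutdown:
    """Records that SIGTERM arrived so the serving loop can stop cleanly."""

    def __init__(self) -> None:
        self.stop_requested = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.stop_requested.set()


def serve(shutdown: GracefulShutdown, work_items: Iterable,
          handle: Callable[[object], None]) -> List:
    """Process items until none remain or shutdown is requested.

    Returns the items that were not started, so they can be re-queued."""
    remaining = list(work_items)
    while remaining and not shutdown.stop_requested.is_set():
        handle(remaining.pop(0))
    return remaining
```

The item being processed when the signal arrives is allowed to finish; only work that has not started is handed back for another node to pick up.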
Security Hardening for Production AI
Production AI containers handle sensitive model assets and high-value inference requests. Harden your deployment.
Network Policies Restricting Model Container Egress
Prevent containers from reaching the internet and exfiltrating models. Whitelist only necessary network destinations.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: inference-egress-lock
    spec:
      podSelector:
        matchLabels:
          app: inference-server
      policyTypes:
      - Egress
      egress:
      # DNS. kubernetes.io/metadata.name is set automatically on every
      # namespace; a plain "name" label exists only if you add it yourself.
      - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
        ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
      - to:
        - podSelector:
            matchLabels:
              app: redis-cache
        ports:
        - protocol: TCP
          port: 6379
      - to:
        - podSelector:
            matchLabels:
              app: postgres-metadata
        ports:
        - protocol: TCP
          port: 5432

This policy allows DNS for name resolution and connections to internal services only. All other egress is denied.
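
The default-deny behaviour can be pictured as an allowlist lookup: a connection is permitted only if some rule names both its destination and its port. The model below is a deliberate simplification for reasoning about the policy (destinations are reduced to a single label string, and the names are hypothetical); it is not how a CNI plugin evaluates traffic.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class EgressRule:
    destination: str                    # simplified selector, e.g. "app=redis-cache"
    ports: FrozenSet[Tuple[str, int]]   # (protocol, port) pairs


# Mirrors the three egress rules of the policy above.
RULES: List[EgressRule] = [
    EgressRule("namespace=kube-system", frozenset({("TCP", 53), ("UDP", 53)})),
    EgressRule("app=redis-cache", frozenset({("TCP", 6379)})),
    EgressRule("app=postgres-metadata", frozenset({("TCP", 5432)})),
]


def egress_allowed(rules: List[EgressRule], destination: str,
                   protocol: str, port: int) -> bool:
    """Allowed only if one rule matches destination and (protocol, port)."""
    return any(rule.destination == destination and (protocol, port) in rule.ports
               for rule in rules)
```

Anything not listed, such as an HTTPS connection to an external host, falls through every rule and is dropped.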
Pod Security Standards (Restricted Profile)
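Enforce the restricted Pod Security Standard (PSS) to prevent privilege escalation and other container escape vectors. The restricted profile is the strictest and appropriate for untrusted workloads. On Kubernetes 1.25 and later, PodSecurityPolicy no longer exists; enforce the profile through Pod Security Admission by labeling the namespace with pod-security.kubernetes.io/enforce: restricted. The PodSecurityPolicy below applies only to older clusters that still serve the policy/v1beta1 API.

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted-ai
    spec:
      privileged: false
      allowPrivilegeEscalation: false
      requiredDropCapabilities:
      - ALL
      volumes:
      - 'configMap'
      - 'emptyDir'
      - 'projected'
      - 'secret'
      - 'downwardAPI'
      - 'persistentVolumeClaim'
      runAsUser:
        rule: 'MustRunAsNonRoot'
      runAsGroup:
        rule: 'MustRunAs'
        ranges:
        - min: 1000
          max: 65535
      readOnlyRootFilesystem: false
      fsGroup:
        rule: 'MustRunAs'
        ranges:
        - min: 1000
          max: 65535
      # seLinux and supplementalGroups are mandatory PodSecurityPolicy fields.
      seLinux:
        rule: 'RunAsAny'
      supplementalGroups:
        rule: 'RunAsAny'

Containers run as non-root, drop dangerous capabilities, and can't escalate privileges.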
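
Those three properties can be checked against a container spec before it is ever submitted, for example in CI. The sketch below covers only the checks named in the sentence above plus the privileged flag; it is a subset of the restricted profile, it looks only at container-level securityContext, and restricted_violations is a hypothetical helper.

```python
from typing import Dict, List


def restricted_violations(container: Dict) -> List[str]:
    """Return human-readable reasons a container spec fails the checked rules."""
    ctx = container.get("securityContext", {})
    violations = []
    if ctx.get("privileged", False):
        violations.append("privileged must not be true")
    if ctx.get("allowPrivilegeEscalation", True) is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if ctx.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        violations.append("capabilities.drop must include ALL")
    return violations
```

A container with no securityContext at all fails three of the four checks, which is a useful reminder that the defaults are not restricted.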
Secret Management for API Keys and Model Access Tokens
Store secrets in Kubernetes Secrets or external vault systems. Never commit credentials to configuration files or container images.
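
    kubectl create secret generic model-credentials \
      --from-literal=huggingface-token="$HF_TOKEN" \
      --from-literal=registry-password="$REGISTRY_PASS"

The example below injects the secrets as environment variables. Environment variables are visible to child processes and can leak into logs and crash dumps, so mount secrets as files instead where the application supports it.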

    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: model-credentials
          key: huggingface-token
    - name: REGISTRY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: model-credentials
          key: registry-password

For multi-environment secrets, use tools such as Sealed Secrets or the External Secrets Operator so that plaintext secrets never live in Git and are sourced from an encrypted store.
Admission Controllers Enforcing CleanStart Images Only
Prevent unauthorized images from running. Use a validating admission webhook to enforce image sources.

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: image-policy
    webhooks:
    - name: image-policy.cleanstart.dev
      # Both fields are required in admissionregistration.k8s.io/v1.
      admissionReviewVersions: ["v1"]
      sideEffects: None
      clientConfig:
        service:
          name: image-policy-service
          namespace: kube-system
          path: "/validate"
        caBundle: LS0tLS1CRUdJTi...
      rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
      failurePolicy: Fail
      namespaceSelector:
        matchLabels:
          enforce-cleanstart: "true"

The webhook rejects pods with images not from registry.cleanstart.com.
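
The decision logic behind such a webhook is small. The sketch below shows one possible shape for the function a /validate handler could call: it takes the AdmissionReview request body as a dictionary and returns the AdmissionReview response. The function name and the message text are illustrative, the HTTP and TLS serving code is omitted, and ephemeral containers are not checked.

```python
from typing import Dict

ALLOWED_PREFIX = "registry.cleanstart.com/"


def review_pod_images(admission_review: Dict) -> Dict:
    """Allow the pod only if every container image comes from the allowed registry."""
    request = admission_review["request"]
    spec = request["object"]["spec"]
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    rejected = [c["image"] for c in containers
                if not c["image"].startswith(ALLOWED_PREFIX)]
    response = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {
            "message": "images not from " + ALLOWED_PREFIX + ": " + ", ".join(rejected)
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Matching on the prefix including the trailing slash avoids accepting look-alike hosts such as registry.cleanstart.com.example.org.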
Monitoring AI Containers
Inference workloads have distinct performance characteristics from traditional web services. Monitor what matters.
GPU Utilization and Memory
Export NVIDIA GPU metrics to Prometheus for monitoring and alerting. Use the NVIDIA DCGM exporter to collect detailed metrics.
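
    serviceMonitor:
      enabled: true
      interval: 30s
    additionalPrometheusRules:
    - alert: HighGPUMemory
      # DCGM exporter reports framebuffer memory as used and free values,
      # so the percentage is derived from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU memory above 85%"

Add the NVIDIA DCGM exporter to your cluster to collect GPU metrics.

    helm install dcgm-exporter cleanstart/nvidia-dcgm-exporter

Inference Latency (p50, p95, p99)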
Track request latency percentiles to understand user experience. Export latency histograms with appropriate buckets from your inference service.

    from prometheus_client import Histogram

    inference_latency = Histogram(
        'inference_latency_seconds',
        'Inference request latency',
        buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0])

    @app.post("/v1/completions")
    async def generate(request: CompletionRequest):
        with inference_latency.time():
            result = await model.generate(request.prompt)
        return {"text": result}

Query percentiles from Prometheus to understand tail latency.

    histogram_quantile(0.95, rate(inference_latency_seconds_bucket[5m]))
    histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))

Request Throughput
Count requests per second to track system capacity utilization.

    from prometheus_client import Counter

    request_counter = Counter(
        'inference_requests_total',
        'Total inference requests',
        ['model', 'status'])

    @app.post("/v1/completions")
    async def generate(request: CompletionRequest):
        try:
            result = await model.generate(request.prompt)
            request_counter.labels(model=request.model, status='success').inc()
            return {"text": result}
        except Exception:
            # Count the failure, then let the framework handle the error.
            request_counter.labels(model=request.model, status='error').inc()
            raise

Model Load Time
Track how long models take to initialize on container startup. This metric helps you understand cold start latency and optimize model loading.

    import time

    from prometheus_client import Gauge

    model_load_time = Gauge(
        'model_load_time_seconds',
        'Time to load model on startup',
        ['model_name'])

    def load_model(name: str):
        # Model stands in for whichever model class the service uses.
        start = time.time()
        model = Model.from_pretrained(name)
        elapsed = time.time() - start
        model_load_time.labels(model_name=name).set(elapsed)
        return model

Error Rates and Types
Distinguish between different types of failures. OOM errors indicate resource constraints, timeouts indicate capacity issues, and other errors indicate bugs.

    from fastapi.responses import JSONResponse  # assumes a FastAPI service
    from prometheus_client import Counter
    from torch.cuda import OutOfMemoryError  # assumes a PyTorch backend

    error_counter = Counter(
        'inference_errors_total',
        'Total inference errors',
        ['error_type'])

    @app.post("/v1/completions")
    async def generate(request: CompletionRequest):
        try:
            result = await model.generate(request.prompt)
            return {"text": result}
        except OutOfMemoryError:
            error_counter.labels(error_type='oom').inc()
            # Return an explicit response so the status code is actually set.
            return JSONResponse(status_code=507, content={"error": "Model memory exceeded"})
        except TimeoutError:
            error_counter.labels(error_type='timeout').inc()
            return JSONResponse(status_code=504, content={"error": "Inference timeout"})
        except Exception:
            error_counter.labels(error_type='unknown').inc()
            raise

Example Prometheus Alerts
Configure alerts for common issues in AI workloads.

    groups:
    - name: ai-containers
      interval: 30s
      rules:
      - alert: HighGPUMemory
        # Percentage derived from the DCGM exporter's used and free framebuffer metrics.
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 85
        for: 5m
        annotations:
          summary: "GPU {{ $labels.gpu }} memory above 85%"
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "Inference p99 latency above 5s"
      - alert: InferenceErrorRate
        # The two counters carry different labels, so aggregate before dividing.
        expr: sum(rate(inference_errors_total[5m])) / sum(rate(inference_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Inference error rate above 5%"
      - alert: PodNotReady
        expr: kube_pod_status_ready{condition="true", pod=~"inference.*"} == 0
        for: 5m
        annotations:
          summary: "Inference pod not ready for 5 minutes"

Grafana Dashboard Configuration
Visualize the Prometheus metrics in Grafana for monitoring and alerting.

    {
      "dashboard": {
        "title": "AI Container Production Metrics",
        "panels": [
          {
            "title": "GPU Memory Usage",
            "targets": [
              {"expr": "DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100"}
            ]
          },
          {
            "title": "Inference Latency p99",
            "targets": [
              {"expr": "histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))"}
            ]
          },
          {
            "title": "Requests Per Second",
            "targets": [
              {"expr": "rate(inference_requests_total[1m])"}
            ]
          },
          {
            "title": "Error Rate",
            "targets": [
              {"expr": "sum(rate(inference_errors_total[5m])) / sum(rate(inference_requests_total[5m]))"}
            ]
          },
          {
            "title": "Active Pod Count",
            "targets": [
              {"expr": "sum(kube_pod_status_ready{pod=~\"inference.*\", condition=\"true\"})"}
            ]
          }
        ]
      }
    }

Bitnami Chart Migration for AI Stacks
If you're running support services like Redis, PostgreSQL, and NGINX on Bitnami charts, migrate to CleanStart using the universal values pattern for consistency and security hardening.
Complete AI Stack Deployment
Deploy an inference server with caching and metadata storage in a coordinated stack.

    # values-ai-stack.yaml

    # Inference Server (vLLM)
    inference:
      enabled: true
      chart: cleanstart/inference-vllm
      values:
        replicaCount: 2
        image:
          repository: registry.cleanstart.com/inference-vllm
          tag: latest
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: 16Gi
            cpu: 8
          limits:
            nvidia.com/gpu: 2
            memory: 16Gi
            cpu: 8
        service:
          type: ClusterIP
          port: 8000

    # Redis Cache Layer
    cache:
      enabled: true
      chart: cleanstart/redis
      values:
        architecture: replication
        auth:
          enabled: true
          # Placeholder: Helm does not expand ${...}; override at install time.
          password: ${REDIS_PASSWORD}
        replica:
          replicaCount: 2
          persistence:
            enabled: true
            size: 50Gi
        resources:
          requests:
            memory: 4Gi
            cpu: 2
          limits:
            memory: 4Gi
            cpu: 2

    # PostgreSQL for Metadata
    metadata:
      enabled: true
      chart: cleanstart/postgresql
      values:
        auth:
          username: aiapp
          # Placeholder: Helm does not expand ${...}; override at install time.
          password: ${PG_PASSWORD}
          database: inference_metadata
        primary:
          persistence:
            enabled: true
            size: 100Gi
        replica:
          replicaCount: 2
        resources:
          requests:
            memory: 4Gi
            cpu: 2
          limits:
            memory: 4Gi
            cpu: 2

    # NGINX Ingress Controller
    ingress:
      enabled: true
      chart: cleanstart/nginx-ingress
      values:
        controller:
          replicaCount: 2
          service:
            type: LoadBalancer
          resources:
            requests:
              memory: 512Mi
              cpu: 500m
            limits:
              memory: 512Mi
              cpu: 500m

Deploy the stack:
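
    helm repo add cleanstart https://helm.cleanstart.dev
    helm repo update

    # Helm does not expand ${...} placeholders in values files, so the
    # generated passwords are set on the corresponding value keys directly.
    helm install ai-stack cleanstart/ai-stack -f values-ai-stack.yaml \
      --set-string cache.values.auth.password="$(openssl rand -base64 32)" \
      --set-string metadata.values.auth.password="$(openssl rand -base64 32)"

Service Interconnection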
Configure the inference server to use Redis and PostgreSQL by injecting configuration through ConfigMaps and Secrets.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: inference-config
    data:
      REDIS_HOST: "redis-cache"
      REDIS_PORT: "6379"
      POSTGRES_HOST: "postgresql"
      POSTGRES_PORT: "5432"
      POSTGRES_DB: "inference_metadata"
      POSTGRES_USER: "aiapp"
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: service-credentials
    type: Opaque
    stringData:
      # Placeholders: kubectl does not expand ${...}, so substitute the real
      # values (for example with envsubst) before applying this manifest.
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      POSTGRES_PASSWORD: ${PG_PASSWORD}
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: inference-server
    spec:
      containers:
      - name: server
        image: registry.cleanstart.com/inference-vllm:latest
        envFrom:
        - configMapRef:
            name: inference-config
        - secretRef:
            name: service-credentials

Network Policies for Service Mesh
Restrict traffic to necessary services to prevent lateral movement in case of compromise.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: ai-stack-isolation
    spec:
      podSelector:
        matchLabels:
          app: inference-server
      policyTypes:
      - Ingress
      - Egress
      ingress:
      # kubernetes.io/metadata.name is set automatically on every namespace;
      # a plain "name" label exists only if you add it yourself.
      - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        ports:
        - protocol: TCP
          port: 8000
      egress:
      - to:
        - podSelector:
            matchLabels:
              app: redis
        ports:
        - protocol: TCP
          port: 6379
      - to:
        - podSelector:
            matchLabels:
              app: postgresql
        ports:
        - protocol: TCP
          port: 5432
      - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
        ports:
        - protocol: UDP
          port: 53
        - protocol: TCP  # DNS falls back to TCP for large responses
          port: 53

Allow Redis to replicate within the Redis cluster:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: redis-replication
    spec:
      podSelector:
        matchLabels:
          app: redis
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: redis
        ports:
        - protocol: TCP
          port: 6379

Allow PostgreSQL replication between primary and replicas:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: postgres-replication
    spec:
      podSelector:
        matchLabels:
          app: postgresql
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: postgresql
        ports:
        - protocol: TCP
          port: 5432

Related Documentation
ai-ml-container-stack-explained.md: Architecture overview and component interactions.
ai-container-attack-surface.md: Security vulnerabilities in AI workloads and mitigation.
helm-charts-kubernetes.md: Helm chart patterns and deployment strategies.
bitnami-helm-values-reference.md: Migration patterns from Bitnami to CleanStart charts.
