Deploying AI Containers to Production: GPU, Scaling, and Security

Moving AI workloads from development to production introduces challenges that standard web applications never face: GPU scheduling, model file management, inference latency requirements, and the security implications of running high-value model assets in shared infrastructure. This guide covers GPU node setup, inference server deployment, model distribution, scaling strategies, and production security hardening.

GPU Node Configuration

Kubernetes doesn't natively understand GPUs. You must explicitly install the NVIDIA device plugin and configure nodes to advertise GPU resources.

Installing the NVIDIA Device Plugin

The NVIDIA device plugin should be deployed as a DaemonSet to ensure every GPU node in your cluster automatically exposes its GPUs to the Kubernetes scheduler. This allows pods to request GPU resources just like they request CPU and memory.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

To verify that GPU discovery is working correctly, describe any node that has GPU hardware. You should see nvidia.com/gpu in the allocatable resources. The expected output shows the number of GPUs available on each node, such as nvidia.com/gpu: 2 on GPU-enabled nodes.

kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# Output: nvidia.com/gpu: 2 (on GPU nodes)

GPU Resource Requests in Pod Specs

When defining your pod specifications, explicitly request GPUs in the resources section. GPUs are an extended resource and cannot be overcommitted: if you specify both requests and limits, Kubernetes requires them to be equal, and by default each GPU is allocated to exactly one container. This exclusivity suits ML workloads, since inference jobs are sensitive to resource contention and performance variation.

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: server
    image: registry.cleanstart.com/inference-ollama:latest
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1

Node Affinity and GPU Node Taints

To prevent non-GPU workloads from consuming GPU node resources, you should taint GPU nodes so they only accept workloads that specifically request GPU access. This ensures your expensive GPU nodes are reserved for inference and training jobs rather than being occupied by general-purpose applications.

kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

Any pod that requires GPUs must include matching tolerations to be scheduled on these tainted nodes.

tolerations:
- key: nvidia.com/gpu
  operator: Equal
  value: present
  effect: NoSchedule
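The matching rule between a taint and a toleration can be sketched in a few lines of Python. This is a simplified model of the scheduler's documented behavior, not its actual implementation; the Taint and Toleration classes and the schedulable helper are illustrative names, and only the NoSchedule effect is considered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None
    operator: str = "Equal"
    value: str = ""
    effect: Optional[str] = None  # None tolerates every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    # An unset effect matches all effects; otherwise it must match exactly.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An unset key combined with Exists tolerates every taint.
        return toleration.key is None or toleration.key == taint.key
    # Operator "Equal": key and value must both match.
    return toleration.key == taint.key and toleration.value == taint.value


def schedulable(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )


gpu_taint = Taint("nvidia.com/gpu", "present", "NoSchedule")
gpu_pod = [Toleration("nvidia.com/gpu", "Equal", "present", "NoSchedule")]
print(schedulable(gpu_pod, [gpu_taint]))  # True
print(schedulable([], [gpu_taint]))       # False
```

A pod without the toleration is rejected by the tainted node, which is what keeps general-purpose workloads off GPU hardware.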

In addition to taints, use node affinity to guarantee pod placement on GPU nodes. This ensures that your inference workloads run on the correct hardware.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: accelerator
          operator: In
          values:
          - nvidia-gpu

Multi-GPU Configurations

For models that require multiple GPUs to fit in memory or that benefit from tensor parallelism, request all necessary GPUs in your pod specification. Inside the container, the allocated GPUs are typically enumerated starting from 0, so if you set CUDA_VISIBLE_DEVICES, list the index of every allocated GPU (0 through N-1) rather than only the first.

resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4
env:
- name: CUDA_VISIBLE_DEVICES
  value: "0,1,2,3"

Deploying Inference Servers with Helm

CleanStart maintains Helm charts for production inference servers. Use the universal values.yaml pattern across all charts to ensure consistent deployments and easy version management.

Ollama Deployment

Deploy Ollama, a local LLM server, using the CleanStart chart. First add the Helm repository, then install the chart with your custom values.

helm repo add cleanstart https://helm.cleanstart.dev
helm repo update
helm install ollama cleanstart/inference-ollama -f ollama-values.yaml

Here's a sample ollama-values.yaml configuration that sets up a production-ready Ollama deployment with two replicas, each requesting one GPU, persistent storage for models, and proper health checks.

replicaCount: 2

image:
  repository: registry.cleanstart.com/inference-ollama
  tag: latest
  pullPolicy: IfNotPresent

resources:
  requests:
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: 4
  limits:
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: 4

service:
  type: ClusterIP
  port: 11434

modelStorage:
  enabled: true
  size: 100Gi
  storageClass: fast-ssd

initContainers:
- name: load-model
  image: registry.cleanstart.com/inference-ollama:latest
  command:
  - sh
  - -c
  - |
    ollama pull mistral:latest
  volumeMounts:
  - name: models
    mountPath: /root/.ollama

livenessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 5

vLLM Deployment with Tensor Parallelism

vLLM provides high-throughput inference with support for tensor parallelism across multiple GPUs. Deploy vLLM for large language models that benefit from distributed serving.

helm install vllm cleanstart/inference-vllm -f vllm-values.yaml

Here's the configuration with tensor parallelism enabled for serving very large models like a 70-billion parameter LLM:

replicaCount: 1

image:
  repository: registry.cleanstart.com/inference-vllm
  tag: latest

resources:
  requests:
    nvidia.com/gpu: 4
    memory: 32Gi
    cpu: 16
  limits:
    nvidia.com/gpu: 4
    memory: 32Gi
    cpu: 16

env:
- name: TENSOR_PARALLEL_SIZE
  value: "4"
- name: MODEL_NAME
  value: "meta-llama/Llama-2-70b-hf"
- name: MAX_MODEL_LEN
  value: "8192"

modelStorage:
  enabled: true
  size: 200Gi

service:
  port: 8000
  targetPort: 8000

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 5

Tensor parallelism distributes the model across multiple GPUs, which is essential for 70B+ parameter models that don't fit on a single accelerator.
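A rough back-of-the-envelope calculation shows why a 70B model needs to be sharded. The sketch below counts weight memory only, assuming 2 bytes per parameter (FP16 or BF16) and an even split across tensor-parallel ranks; KV cache, activations, and framework overhead come on top, and the function names are illustrative.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def per_gpu_weight_gib(params_billion: float, tensor_parallel_size: int,
                       bytes_per_param: int = 2) -> float:
    """Weights held by each GPU when sharded evenly across ranks."""
    total = weight_memory_gib(params_billion, bytes_per_param)
    return total / tensor_parallel_size


total = weight_memory_gib(70)
per_gpu = per_gpu_weight_gib(70, tensor_parallel_size=4)
print(f"70B weights: {total:.1f} GiB total, {per_gpu:.1f} GiB per GPU at TP=4")
```

Under these assumptions the weights alone come to roughly 130 GiB, more than any single 80 GB accelerator holds, while a four-way split leaves about 33 GiB per GPU before cache and activation memory.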

Health Checks and Readiness Probes

Inference servers start slowly because large models must be loaded into GPU memory, either at startup or on the first request. Set generous initial delays to avoid premature restarts and incorrect readiness signals.

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 5
  failureThreshold: 2

Liveness probes restart containers that have truly crashed. Readiness probes remove unhealthy pods from the load balancer without killing them, allowing recovery time.
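To sanity-check probe settings against a model's load time, it helps to estimate how soon a probe can trip. The sketch below is an approximation that ignores probe timeouts and scheduling jitter: it assumes the first check runs at initialDelaySeconds and that failureThreshold consecutive failures are needed. The Probe class is an illustrative stand-in for the YAML fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def earliest_trip_seconds(probe: Probe) -> int:
    """Approximate earliest time after container start at which the probe
    has failed failure_threshold times in a row."""
    return probe.initial_delay_seconds + probe.period_seconds * (probe.failure_threshold - 1)


liveness = Probe(initial_delay_seconds=120, period_seconds=30, failure_threshold=3)
readiness = Probe(initial_delay_seconds=120, period_seconds=5, failure_threshold=2)
print(earliest_trip_seconds(liveness))   # 180
print(earliest_trip_seconds(readiness))  # 125
```

If the model regularly needs longer than the liveness figure to load, the container risks being restarted mid-load, so the delay or threshold should be raised.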

Horizontal Pod Autoscaler (HPA) Based on Inference Latency

Don't scale based on CPU, which doesn't reflect inference queue depth. Instead, scale based on request latency percentiles, which directly measure user experience.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_request_latency_p99
      target:
        type: AverageValue
        averageValue: 500m  # 0.5 seconds

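Custom metrics like inference latency are not available to the HPA by default. Collect them with Prometheus and expose them through the custom metrics API with an adapter such as the Prometheus Adapter. Export latency from your inference server using a Prometheus client library.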

from prometheus_client import Histogram

latency_histogram = Histogram(
    'inference_request_latency_seconds',
    'Inference request latency',
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0)
)

# app, CompletionRequest, and model are defined elsewhere in the service.
@app.post("/v1/completions")
async def inference(request: CompletionRequest):
    # Record the wall-clock duration of each request in the histogram.
    with latency_histogram.time():
        result = await model.generate(request)
    return result

Model Storage and Distribution

Model files are massive: models with 7B to 70B+ parameters occupy tens to hundreds of gigabytes on disk. Handle them deliberately rather than building them into container images.

PersistentVolumeClaims for Model Storage

Create a shared volume for models that all inference containers can access. Using ReadOnlyMany access mode allows multiple pods to read the same models simultaneously, ensuring consistency and preventing accidental mutations.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-models
spec:
  accessModes:
  - ReadOnlyMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Mount the volume as read-only in inference containers to prevent any modification of model files.

# Pod-level volume definition
volumes:
- name: models
  persistentVolumeClaim:
    claimName: ai-models

# Container-level mount
volumeMounts:
- name: models
  mountPath: /models
  readOnly: true

Init Containers for Model Downloading

Download models before the inference server starts by using init containers. This ensures models are available before serving traffic and prevents cache misses on container startup.

initContainers:
- name: download-models
  image: registry.cleanstart.com/inference-ollama:latest
  command:
  - sh
  - -c
  - |
    # Download only when the model file is not already on the shared volume.
    if [ ! -f /models/mistral-7b-instruct.bin ]; then
      echo "Downloading mistral model..."
      ollama pull mistral:latest
      cp -r /root/.ollama/models /models/
    fi
    echo "Models ready"
  volumeMounts:
  - name: models
    mountPath: /models
  - name: cache
    mountPath: /root/.ollama
  resources:
    requests:
      memory: 4Gi
      cpu: 4

Downloads happen once. On subsequent pod starts the init container still runs, but it finds the model already present, skips the download, and exits quickly, reducing startup time.

Model Caching Strategies Across Pods

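Use a shared cache layer like Redis to cache small model artifacts and metadata across pods. This avoids repeated lookups and improves request latency. Model weights themselves are far too large for this pattern and belong on the shared volume.

import json
import redis

cache = redis.Redis(host='redis-cache', port=6379)

def get_model(model_name: str):
    # Serve from Redis while the entry is still cached.
    cached = cache.get(f"model:{model_name}")
    if cached:
        return json.loads(cached)

    # load_from_disk is application-specific and must return JSON-serializable
    # data (for example model metadata), not raw weight tensors.
    model = load_from_disk(model_name)
    cache.set(f"model:{model_name}", json.dumps(model), ex=3600)  # expire after 1 hour
    return model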

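For Hugging Face models, use the libraries' built-in cache by pointing HF_HOME at the shared volume. TRANSFORMERS_CACHE is honored by older transformers releases and deprecated in newer ones in favor of HF_HOME. The cache directory must be writable by whichever container downloads the models.

env:
- name: HF_HOME
  value: /models/huggingface-cache
- name: TRANSFORMERS_CACHE  # only needed for older transformers releases
  value: /models/huggingface-cache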

Read-Only Volume Mounts for Model Integrity

Prevent containers from modifying model files by mounting volumes as read-only. This enforces immutability from inside the pod: a compromised or buggy inference container cannot overwrite or poison the weights that other pods read. A read-only mount does not by itself detect files that were corrupted before mounting, so pair it with integrity checks such as checksums.

volumeMounts:
- name: models
  mountPath: /models
  readOnly: true
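Detecting corrupted or tampered weights takes an explicit check at startup. The following is a minimal sketch, assuming you publish a manifest that maps each model file (relative to the mount point) to its expected SHA-256 digest; the manifest format and the function names are illustrative, not part of any CleanStart tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_models(model_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    bad = []
    for rel_path, expected in manifest.items():
        candidate = model_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel_path)
    return bad
```

A server could call verify_models(Path("/models"), manifest) before loading anything and refuse to start if the returned list is non-empty.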

Scaling AI Workloads

Production AI systems must handle variable demand without accumulating idle resources.

Inference Scaling: Request-Based HPA

Scale based on request queue depth or custom metrics. This is more accurate than CPU-based scaling for inference workloads where queuing is the primary bottleneck.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_in_queue
      target:
        type: AverageValue
        averageValue: "10"

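Queue depth is the most accurate scaling signal for inference. The HPA adds pods when requests back up, so requests don't time out waiting for processing. When several metrics are listed, as with CPU and queue depth here, the HPA computes a replica count for each and uses the largest.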
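The scaling decision follows the HPA's documented formula, desiredReplicas = ceil(currentReplicas × currentValue / targetValue). The sketch below reproduces that arithmetic with the default 10% tolerance and the min/max bounds from the manifest above; it is a simplified model that ignores stabilization windows and pod readiness, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula yields for one metric."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 25 queued requests each against a target of 10.
print(desired_replicas(4, current_value=25, target_value=10))  # 10

# With several metrics, the largest proposal wins.
by_queue = desired_replicas(4, 25, 10)
by_cpu = desired_replicas(4, 60, 70)
print(max(by_queue, by_cpu))  # 10
```

Because the largest proposal wins, keeping a CPU metric alongside queue depth can only add replicas, never remove ones the queue metric asked for.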

Training Scaling: Distributed Training with Multiple GPU Pods

For training jobs, use Kubernetes Jobs with multiple GPU pods. Each pod runs a single training process, and distributed training frameworks coordinate across pods using environment variables. Run the Job in Indexed completion mode so that each pod receives a stable completion index to use as its rank.

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed  # required for the job-completion-index annotation
  parallelism: 4
  completions: 4
  template:
    spec:
      subdomain: distributed-training  # must match a headless Service so MASTER_ADDR resolves
      containers:
      - name: trainer
        image: registry.cleanstart.com/training-pytorch:latest
        env:
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: WORLD_SIZE
          value: "4"
        - name: MASTER_ADDR
          value: "distributed-training-0.distributed-training"
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
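Inside the trainer container, the process reads these variables to find its place in the group. The following is a minimal sketch of that step using only the standard library; the DistributedConfig class is an illustrative name, and the MASTER_PORT fallback of 29500 is an assumption (the manifest above does not set a port), so adjust it to whatever your framework expects.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistributedConfig:
    rank: int
    world_size: int
    master_addr: str
    master_port: int


def config_from_env(env: Mapping[str, str] = os.environ) -> DistributedConfig:
    """Build the rendezvous settings from the variables the Job injects."""
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside [0, {world_size})")
    return DistributedConfig(
        rank=rank,
        world_size=world_size,
        master_addr=env["MASTER_ADDR"],
        # Assumed default; the Job manifest does not define MASTER_PORT.
        master_port=int(env.get("MASTER_PORT", "29500")),
    )
```

Validating the rank up front turns a misconfigured Job (for example, one not running in Indexed mode) into an immediate, readable error instead of a hung rendezvous.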

Batch Inference: Kubernetes Jobs

For offline or scheduled inference that doesn't require immediate responses, use CronJobs to schedule batch processing. This is more cost-effective than running a continuously-up inference server for occasional workloads.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-inference
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: batch-processor
            image: registry.cleanstart.com/inference-batch:latest
            env:
            - name: INPUT_PATH
              value: /data/pending-tasks
            - name: OUTPUT_PATH
              value: /data/results
            - name: BATCH_SIZE
              value: "32"
            volumeMounts:
            - name: shared-data
              mountPath: /data
            resources:
              requests:
                nvidia.com/gpu: 2
              limits:
                nvidia.com/gpu: 2  # extended resources must set a limit equal to the request
          volumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: batch-data
          restartPolicy: OnFailure

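CronJobs run on a schedule, and the Jobs they create retry on failure. With restartPolicy: OnFailure, a failed container is restarted in place. Whether completed inferences are skipped on retry depends on the batch processor tracking its own progress, for example by skipping inputs that already have a result under OUTPUT_PATH.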

Cost Optimization: Spot/Preemptible GPU Nodes with Graceful Shutdown

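Spot and preemptible GPU nodes can cost far less than on-demand capacity (the exact discount varies by provider, region, and GPU type) but can be terminated on short notice. Use them with graceful shutdown handlers to drain workloads before node termination.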

# --spot and --preemptible are mutually exclusive; A100 GPUs require an A2 machine type.
gcloud container node-pools create gpu-spot \
  --cluster=my-cluster \
  --machine-type=a2-highgpu-4g \
  --accelerator=type=nvidia-tesla-a100,count=4 \
  --spot

Add a preemption handler to drain pods gracefully before node termination. This reduces data loss and allows workloads to complete or migrate to other nodes.

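apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: preemption-handler
spec:
  selector:  # required in apps/v1; must match the template labels
    matchLabels:
      app: preemption-handler
  template:
    metadata:
      labels:
        app: preemption-handler
    spec:
      containers:
      - name: handler
        image: registry.cleanstart.com/preemption-handler:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true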

The handler listens for GCP preemption notices and cordons the node, giving workloads 30 seconds to gracefully shut down.
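On the workload side, graceful shutdown means reacting to the SIGTERM that Kubernetes sends when a pod is evicted. The following is a minimal sketch using only the standard library; GracefulShutdown and drain are illustrative names, and the in-memory list stands in for whatever request queue your server actually uses.

```python
import signal
import threading
from typing import Callable


class GracefulShutdown:
    """Records that SIGTERM arrived so work loops can stop cleanly."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self._stop.set()

    @property
    def requested(self) -> bool:
        return self._stop.is_set()


def drain(queue: list, shutdown: GracefulShutdown, process: Callable) -> list:
    """Process items until the queue is empty or shutdown is requested.

    Returns the unprocessed items so they can be handed to another pod.
    """
    while queue and not shutdown.requested:
        process(queue.pop(0))
    return queue
```

The in-flight item always finishes before the loop checks the flag, so with a 30-second notice each unit of work should be sized to complete well inside that window.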

Security Hardening for Production AI

Production AI containers handle sensitive model assets and high-value inference requests. Harden your deployment.

Network Policies Restricting Model Container Egress

Prevent containers from reaching the internet and exfiltrating models. Whitelist only necessary network destinations.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-egress-lock
spec:
  podSelector:
    matchLabels:
      app: inference-server
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system  # label set automatically on every namespace
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
  - to:
    - podSelector:
        matchLabels:
          app: redis-cache
    ports:
    - protocol: TCP
      port: 6379
  - to:
    - podSelector:
        matchLabels:
          app: postgres-metadata
    ports:
    - protocol: TCP
      port: 5432

This policy allows DNS for name resolution and connections to internal services only. All other egress is denied.

Pod Security Standards (Restricted Profile)

Enforce the restricted Pod Security Standard to prevent privilege escalation and other container escape vectors. The restricted profile is the strictest and is appropriate for untrusted workloads. PodSecurityPolicy, the older mechanism for this, was removed in Kubernetes 1.25; use the built-in Pod Security Admission controller instead by labeling the namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: ai-inference  # example namespace name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Each container in the namespace then needs a securityContext that satisfies the profile:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  seccompProfile:
    type: RuntimeDefault

Containers run as non-root, drop dangerous capabilities, and can't escalate privileges.

Secret Management for API Keys and Model Access Tokens

Store secrets in Kubernetes Secrets or external vault systems. Never commit credentials to configuration files or container images.

kubectl create secret generic model-credentials \
  --from-literal=huggingface-token="$HF_TOKEN" \
  --from-literal=registry-password="$REGISTRY_PASS"

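Expose secrets to the container as environment variables, as shown here, or as mounted files. Environment variables are simpler to consume, but they can leak through logs, crash dumps, and child processes, so consider file mounts for highly sensitive tokens.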

env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: model-credentials
      key: huggingface-token
- name: REGISTRY_PASSWORD
  valueFrom:
    secretKeyRef:
      name: model-credentials
      key: registry-password

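For multi-environment secrets, use tools such as Sealed Secrets, which encrypts Secret manifests so they can be stored safely in Git, or the External Secrets Operator, which syncs secrets into the cluster from an external secrets manager.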

Admission Controllers Enforcing CleanStart Images Only

Prevent unauthorized images from running. Use a validating admission webhook to enforce image sources.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy
webhooks:
- name: image-policy.cleanstart.dev
  admissionReviewVersions: ["v1"]  # required in admissionregistration.k8s.io/v1
  sideEffects: None                # required in admissionregistration.k8s.io/v1
  clientConfig:
    service:
      name: image-policy-service
      namespace: kube-system
      path: "/validate"
    caBundle: LS0tLS1CRUdJTi...
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  failurePolicy: Fail
  namespaceSelector:
    matchLabels:
      enforce-cleanstart: "true"

The webhook rejects pods with images not from registry.cleanstart.com.
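The decision logic behind such a webhook is small. The sketch below shows one way the service behind /validate might evaluate an AdmissionReview request; it is a hypothetical handler body, not the actual CleanStart implementation, and it omits the HTTPS server that would wrap it. The request and response shapes follow the admission.k8s.io/v1 API.

```python
ALLOWED_PREFIX = "registry.cleanstart.com/"


def review_pod(admission_review: dict) -> dict:
    """Allow the pod only if every container image uses the allowed registry."""
    request = admission_review["request"]
    spec = request["object"]["spec"]
    # Init and ephemeral containers can run arbitrary images too.
    containers = (
        spec.get("containers", [])
        + spec.get("initContainers", [])
        + spec.get("ephemeralContainers", [])
    )
    rejected = [c["image"] for c in containers
                if not c["image"].startswith(ALLOWED_PREFIX)]

    response = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {
            "code": 403,
            "message": "images not allowed: " + ", ".join(rejected),
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Matching on the full prefix including the trailing slash avoids accepting look-alike hosts such as registry.cleanstart.com.example.org.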

Monitoring AI Containers

Inference workloads have distinct performance characteristics from traditional web services. Monitor what matters.

GPU Utilization and Memory

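Export NVIDIA GPU metrics to Prometheus for monitoring and alerting. Use the NVIDIA DCGM exporter to collect detailed metrics. The metric name used in the rule below, nvidia_smi_memory_used_percent, depends on your exporter: the DCGM exporter's default metrics carry a DCGM_FI_DEV_ prefix (for example DCGM_FI_DEV_FB_USED for framebuffer memory in use), so adjust the expression to match what your cluster actually exposes.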

serviceMonitor:
  enabled: true
  interval: 30s

additionalPrometheusRules:
- alert: HighGPUMemory
  expr: nvidia_smi_memory_used_percent > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "GPU memory above 85%"

Add the NVIDIA DCGM exporter to your cluster to collect GPU metrics.

helm install dcgm-exporter cleanstart/nvidia-dcgm-exporter

Inference Latency (p50, p95, p99)

Track request latency percentiles to understand user experience. Export latency histograms with appropriate buckets from your inference service.

from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference request latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# app, CompletionRequest, and model are defined elsewhere in the service.
@app.post("/v1/completions")
async def generate(request: CompletionRequest):
    # Observe the duration of every request, successful or not.
    with inference_latency.time():
        result = await model.generate(request.prompt)
    return {"text": result}

Query percentiles from Prometheus to understand tail latency.

histogram_quantile(0.95, rate(inference_latency_seconds_bucket[5m]))
histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))
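Bucket boundaries determine how precise these percentiles are, because histogram_quantile estimates a value by linear interpolation inside the bucket that contains the requested rank. The sketch below re-implements that estimate in Python for illustration; it is a simplified model of the documented behavior for classic histograms, not Prometheus code, and it takes cumulative bucket counts as plain tuples.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    ending with the (math.inf, total) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls past the last finite bucket: report its bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# 100 requests: 90 finished within 0.5s, 98 within 1s, all within 2s.
observed = [(0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, observed))
```

In this example the p99 rank lands in the 1s to 2s bucket, so the estimate is the midpoint of that bucket regardless of where the slow requests really fell. Adding bucket boundaries near your latency target tightens the estimate.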

Request Throughput

Count requests per second to track system capacity utilization.

from prometheus_client import Counter

request_counter = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model', 'status']
)

@app.post("/v1/completions")
async def generate(request: CompletionRequest):
    try:
        result = await model.generate(request.prompt)
        request_counter.labels(model=request.model, status='success').inc()
        return {"text": result}
    except Exception:
        # Count the failure, then let the framework handle the exception.
        request_counter.labels(model=request.model, status='error').inc()
        raise

Model Load Time

Track how long models take to initialize on container startup. This metric helps you understand cold start latency and optimize model loading.

import time

from prometheus_client import Gauge

model_load_time = Gauge(
    'model_load_time_seconds',
    'Time to load model on startup',
    ['model_name']
)

def load_model(name: str):
    # Model is the application's model class, defined elsewhere.
    start = time.time()
    model = Model.from_pretrained(name)
    elapsed = time.time() - start
    model_load_time.labels(model_name=name).set(elapsed)
    return model

Error Rates and Types

Distinguish between different types of failures. OOM errors indicate resource constraints, timeouts indicate capacity issues, and other errors indicate bugs.

import torch
from fastapi.responses import JSONResponse
from prometheus_client import Counter

error_counter = Counter(
    'inference_errors_total',
    'Total inference errors',
    ['error_type']
)

# Assumes a FastAPI app and a PyTorch-backed model; adjust the exception
# types to match your serving stack.
@app.post("/v1/completions")
async def generate(request: CompletionRequest):
    try:
        result = await model.generate(request.prompt)
        return {"text": result}
    except torch.cuda.OutOfMemoryError:
        error_counter.labels(error_type='oom').inc()
        return JSONResponse({"error": "Model memory exceeded"}, status_code=507)
    except TimeoutError:
        error_counter.labels(error_type='timeout').inc()
        return JSONResponse({"error": "Inference timeout"}, status_code=504)
    except Exception:
        error_counter.labels(error_type='unknown').inc()
        raise

Example Prometheus Alerts

Configure alerts for common issues in AI workloads.

groups:
- name: ai-containers
  interval: 30s
  rules:
  - alert: HighGPUMemory
    expr: nvidia_smi_memory_used_percent > 85
    for: 5m
    annotations:
      summary: "GPU {{ $labels.gpu }} memory above 85%"

  - alert: HighInferenceLatency
    expr: histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 5
    for: 5m
    annotations:
      summary: "Inference p99 latency above 5s"

  - alert: InferenceErrorRate
    # sum() drops the differing labels so the two series can be divided
    expr: sum(rate(inference_errors_total[5m])) / sum(rate(inference_requests_total[5m])) > 0.05
    for: 5m
    annotations:
      summary: "Inference error rate above 5%"

  - alert: PodNotReady
    expr: kube_pod_status_ready{pod=~"inference.*"} == 0
    for: 5m
    annotations:
      summary: "Inference pod not ready for 5 minutes"

Grafana Dashboard Configuration

Visualize the Prometheus metrics in Grafana for dashboards and alerting.

{
  "dashboard": {
    "title": "AI Container Production Metrics",
    "panels": [
      {
        "title": "GPU Memory Usage",
        "targets": [
          {"expr": "nvidia_smi_memory_used_percent"}
        ]
      },
      {
        "title": "Inference Latency p99",
        "targets": [
          {"expr": "histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))"}
        ]
      },
      {
        "title": "Requests Per Second",
        "targets": [
          {"expr": "rate(inference_requests_total[1m])"}
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {"expr": "sum(rate(inference_errors_total[5m])) / sum(rate(inference_requests_total[5m]))"}
        ]
      },
      {
        "title": "Active Pod Count",
        "targets": [
          {"expr": "count(kube_pod_status_ready{pod=~\"inference.*\", condition=\"true\"})"}
        ]
      }
    ]
  }
}

Bitnami Chart Migration for AI Stacks

If you're running support services like Redis, PostgreSQL, and NGINX on Bitnami charts, migrate to CleanStart using the universal values pattern for consistency and security hardening.

Complete AI Stack Deployment

Deploy an inference server with caching and metadata storage in a coordinated stack.

# values-ai-stack.yaml

# Inference Server (vLLM)
inference:
  enabled: true
  chart: cleanstart/inference-vllm
  values:
    replicaCount: 2
    image:
      repository: registry.cleanstart.com/inference-vllm
      tag: latest
    resources:
      requests:
        nvidia.com/gpu: 2
        memory: 16Gi
        cpu: 8
      limits:
        nvidia.com/gpu: 2
        memory: 16Gi
        cpu: 8
    service:
      type: ClusterIP
      port: 8000

# Redis Cache Layer
cache:
  enabled: true
  chart: cleanstart/redis
  values:
    architecture: replication
    auth:
      enabled: true
      password: ${REDIS_PASSWORD}
    replica:
      replicaCount: 2
    persistence:
      enabled: true
      size: 50Gi
    resources:
      requests:
        memory: 4Gi
        cpu: 2
      limits:
        memory: 4Gi
        cpu: 2

# PostgreSQL for Metadata
metadata:
  enabled: true
  chart: cleanstart/postgresql
  values:
    auth:
      username: aiapp
      password: ${PG_PASSWORD}
      database: inference_metadata
    primary:
      persistence:
        enabled: true
        size: 100Gi
    replica:
      replicaCount: 2
    resources:
      requests:
        memory: 4Gi
        cpu: 2
      limits:
        memory: 4Gi
        cpu: 2

# NGINX Ingress Controller
ingress:
  enabled: true
  chart: cleanstart/nginx-ingress
  values:
    controller:
      replicaCount: 2
      service:
        type: LoadBalancer
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 512Mi
        cpu: 500m

Deploy the stack:

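helm repo add cleanstart https://helm.cleanstart.dev
helm repo update

# Helm does not expand ${...} placeholders in values files, so generate the
# passwords once and override the corresponding value paths explicitly.
export REDIS_PASSWORD="$(openssl rand -base64 32)"
export PG_PASSWORD="$(openssl rand -base64 32)"

helm install ai-stack cleanstart/ai-stack -f values-ai-stack.yaml \
  --set cache.values.auth.password="$REDIS_PASSWORD" \
  --set metadata.values.auth.password="$PG_PASSWORD"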

Service Interconnection

Configure the inference server to use Redis and PostgreSQL by injecting configuration through ConfigMaps and Secrets.

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
data:
  REDIS_HOST: "redis-cache"
  REDIS_PORT: "6379"
  POSTGRES_HOST: "postgresql"
  POSTGRES_PORT: "5432"
  POSTGRES_DB: "inference_metadata"
  POSTGRES_USER: "aiapp"
---
apiVersion: v1
kind: Secret
metadata:
  name: service-credentials
type: Opaque
stringData:
  # Placeholders: substitute the real values (for example with envsubst) before applying.
  REDIS_PASSWORD: ${REDIS_PASSWORD}
  POSTGRES_PASSWORD: ${PG_PASSWORD}
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: server
    image: registry.cleanstart.com/inference-vllm:latest
    envFrom:
    - configMapRef:
        name: inference-config
    - secretRef:
        name: service-credentials

Network Policies for Service Mesh

Restrict traffic to necessary services to prevent lateral movement in case of compromise.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-stack-isolation
spec:
  podSelector:
    matchLabels:
      app: inference-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  - to:
    - podSelector:
        matchLabels:
          app: postgresql
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP  # DNS falls back to TCP for large responses
      port: 53

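Allow Redis to replicate within the Redis cluster. Because this policy selects the Redis pods for ingress, it must also admit the inference server, or client connections will be dropped:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-replication
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: redis
    - podSelector:
        matchLabels:
          app: inference-server
    ports:
    - protocol: TCP
      port: 6379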

Allow PostgreSQL replication between primary and replicas, again admitting the inference server as a client:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-replication
spec:
  podSelector:
    matchLabels:
      app: postgresql
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: postgresql
    - podSelector:
        matchLabels:
          app: inference-server
    ports:
    - protocol: TCP
      port: 5432

Related Documentation

ai-ml-container-stack-explained.md: Architecture overview and component interactions.
ai-container-attack-surface.md: Security vulnerabilities in AI workloads and mitigation.
helm-charts-kubernetes.md: Helm chart patterns and deployment strategies.
bitnami-helm-values-reference.md: Migration patterns from Bitnami to CleanStart charts.
