Organizations deploying AI workloads face a confusing landscape of models, frameworks, and runners — each with different security profiles, container requirements, and attack surfaces. Understanding what each layer does is the first step to securing it.
The Three-Layer AI Stack
When you run a large language model (LLM) or machine learning inference in a container, you're actually combining three separate components. Each layer serves a distinct purpose, and each introduces its own security risks.
This separation is critical to understand because many security failures in AI workloads stem from confusion about which layer is responsible for a given task.
Layer 1: LLM Models (The Brain)
LLMs are trained neural networks: massive files containing billions of parameters learned from training data. Examples include Llama 3, Mistral, Gemma, Phi, and CodeLlama. (Stable Diffusion, an image-generation model rather than an LLM, is packaged and deployed in the same way.)
Key facts about models shape how you should think about them: Size ranges from roughly 4GB (a 7B-parameter Mistral quantized to 4 bits) to 140GB+ (a 70B-parameter Llama 3 at 16-bit precision). Larger models provide better reasoning but need more GPU memory, respond with higher latency, and cost more to run.
Formats vary considerably. Models are distributed as GGUF (quantized, CPU-friendly), safetensors (PyTorch-compatible, safer deserialization), ONNX (cross-platform), or raw PyTorch pickle files. Format choice affects loading speed, compatibility, and deserialization security.
Not self-executing: models are just weights and architecture, so they cannot run by themselves. The weights are numerical values with no inherent ability to execute code or make network requests, although some serialization formats can carry executable code alongside them.
Immutable in production — once trained, model files change rarely since they're data artifacts, not executable code. A model trained once is used to serve inference requests for months or years.
Storage-at-rest security is critical because models are often the most sensitive artifacts in an AI system. Unauthorized access to model weights can reveal training data patterns, enable model extraction attacks, or allow attackers to fine-tune stolen models for malicious purposes.
Models don't execute; they're loaded into memory by runners (Layer 3) and used to compute predictions. Think of a model as a mathematical function frozen into a file.
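The file sizes quoted for models follow from simple arithmetic: parameter count times bits per parameter. The sketch below is a rough estimate that covers the weights only and ignores file metadata, tokenizer files, and runtime memory such as the KV cache. The function names are illustrative.

```python
def estimate_weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Approximate size of the weights alone (no metadata, no KV cache)."""
    return n_params * bits_per_param // 8


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


# 7B parameters at 4-bit quantization, then 7B and 70B at 16-bit precision
print(to_gb(estimate_weight_bytes(7_000_000_000, 4)))    # 3.5
print(to_gb(estimate_weight_bytes(7_000_000_000, 16)))   # 14.0
print(to_gb(estimate_weight_bytes(70_000_000_000, 16)))  # 140.0
```

The same 7B model therefore occupies about four times more memory at 16-bit precision than at 4-bit, which is why quantization level matters when sizing GPU memory and storage.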
Layer 2: Training Frameworks (The Factory)
Training frameworks are the libraries that build, train, and fine-tune models. PyTorch, TensorFlow, and JAX are examples. (ONNX Runtime is sometimes grouped with them, but it is primarily an inference runtime.)
Key facts about frameworks determine their role in your architecture: Purpose is critical to understand. Frameworks are used during model development and fine-tuning and are not always needed in production. A framework handles backpropagation, gradient computation, and parameter updates during training. Once a model is trained, that training machinery is no longer needed to serve predictions, although some runners (vLLM, for example) still use PyTorch as their inference runtime.
Heavy dependencies pull in CUDA (NVIDIA GPU compute), cuDNN (GPU math libraries), NCCL (multi-GPU communication), and MPI (distributed training). These system-level dependencies introduce complexity and attack surface.
Massive attack surface results from installation complexity. A single pip install torch pulls in a chain of transitive dependencies, and a full training environment built around it can reach 200+ packages, each with its own dependencies. Most organizations have zero visibility into what's being installed. Any package installed from a source distribution can execute arbitrary code during installation.
C/C++ native extensions are used by many frameworks and are prone to memory safety bugs, exposing the runtime to buffer overflows, use-after-free bugs, and remote code execution. CUDA libraries in particular are proprietary NVIDIA code with frequent CVEs.
Development focus means frameworks are designed for ease of use in notebooks and research environments, not hardened for production. Frameworks prioritize flexibility over security.
Frameworks are necessary during model training and fine-tuning, but often unnecessary in production inference-only systems. This separation is important: your training container can be large and complex, while your inference container can be minimal and hardened.
Layer 3: Model Runners / Inference Servers (The Driver)
Model runners load trained models and expose them as services through APIs, enabling applications to send prompts and receive predictions.
Examples include Ollama, vLLM, TGI (Text Generation Inference), TensorRT-LLM, llama.cpp, and Triton Inference Server.
Key facts about runners determine their role: Core job involves multiple steps. Runners load model weights into GPU/CPU memory, tokenize text input into token IDs, manage memory allocation and GPU utilization, batch multiple requests, generate tokens autoregressively, decode tokens back to text, and return responses. Weights are loaded once at startup; the remaining steps repeat for every user request.
API endpoints expose HTTP/gRPC interfaces for external consumers. They typically follow OpenAI-compatible APIs for interoperability or custom APIs for specialized use cases.
Framework abstraction is key: runners provide a consistent interface regardless of which framework trained the model. You can train with PyTorch, save as safetensors, and serve through vLLM's API without your application touching PyTorch (vLLM still uses it internally), or convert to GGUF and run with llama.cpp or Ollama with no PyTorch installed at all.
Optimization techniques include kernel fusion (combining multiple operations), quantization (reducing precision to lower memory), attention optimization (using efficient attention algorithms), and speculative decoding (a small draft model proposes several tokens that the main model then verifies in parallel).
Stateful architecture means runners must maintain loaded models in GPU/CPU memory across multiple requests. GPU memory is precious since models can occupy 10-100GB, leaving little room for other processes.
Performance-critical operation means inference latency directly impacts user experience. Runners optimize for throughput (requests per second) and latency (milliseconds per token) at scale.
Runners are what actually run in production and are the piece that external applications interact with. They're also the piece with the highest attack surface since they expose APIs to untrusted input.
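The per-request work of a runner (tokenize, generate autoregressively, decode) can be sketched with a toy example. Everything here is hypothetical: the four-word vocabulary and the `toy_next_token` function stand in for a trained tokenizer and a real model forward pass, and no actual runner is this simple.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary; real runners use a trained tokenizer.
VOCAB: Dict[str, int] = {"<eos>": 0, "hello": 1, "world": 2, "again": 3}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to token IDs (unknown words raise KeyError)."""
    return [VOCAB[word] for word in text.lower().split()]


def toy_next_token(token_ids: List[int]) -> int:
    """Stand-in for a forward pass: next ID is last ID + 1, wrapping to <eos>."""
    return (token_ids[-1] + 1) % len(VOCAB)


def generate(prompt: str,
             next_token: Callable[[List[int]], int] = toy_next_token,
             max_new_tokens: int = 8) -> str:
    """Tokenize a non-empty prompt, generate until <eos> or the limit, decode."""
    ids = tokenize(prompt)
    new_ids: List[int] = []
    for _ in range(max_new_tokens):
        nxt = next_token(ids + new_ids)  # one forward pass per generated token
        if nxt == VOCAB["<eos>"]:
            break
        new_ids.append(nxt)
    return " ".join(ID_TO_TOKEN[i] for i in new_ids)  # decode IDs back to text


print(generate("hello"))  # world again
```

Each generated token requires another pass over the whole sequence so far, which is why real runners invest so heavily in caching and batching.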
How These Layers Combine in Containers
Different deployment scenarios use different combinations of these layers.
Pattern A: Ollama + GGUF Model
In this pattern, a container runs the Ollama binary, which loads a GGUF model (typically from /models/mistral.gguf), and exposes an HTTP API on port 11434 for client requests.
Use case encompasses development, single-model inference, CPU or GPU, and easy model switching. Ollama handles the entire runner responsibility, managing tokenization, attention, and token generation internally.
Why this pattern works: Ollama bundles the runner and model format in one tool, so you only need to manage one container. GGUF models are typically quantized, so they run efficiently on CPU with no GPU required. This approach is good for experimentation and development.
Security consideration: Ollama's simplicity comes at a cost. As a single tool combining multiple responsibilities, a vulnerability in Ollama affects the entire system. CVE-2024-37032 demonstrated this risk: a path traversal flaw in how Ollama handled model pulls could be escalated to arbitrary code execution through crafted requests.
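For reference, a client call against this pattern might look like the sketch below. It assumes an Ollama container listening on localhost:11434 with a model named mistral already available; the helper names are illustrative. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "mistral",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request; this only works with a running Ollama container."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Note that nothing in this request carries credentials. Any process that can reach the port can issue the same call, which is one reason the port should not be exposed beyond the clients that need it.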
Pattern B: vLLM + safetensors Model
In this pattern, a container runs vLLM with the PyTorch runtime, loads a safetensors model (typically from /models/llama-13b.safetensors), and exposes an OpenAI-compatible HTTP API on port 8000.
Use case covers production inference, high throughput, batching-optimized deployment, NVIDIA GPU usage, and model deployment at scale. vLLM is purpose-built for high-performance serving.
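Why this pattern works: vLLM implements advanced techniques like PagedAttention, which allocates the attention key-value (KV) cache in small fixed-size blocks instead of one contiguous region per request. The vLLM authors report that earlier serving systems wasted 60-80% of KV-cache memory to fragmentation and over-reservation; reclaiming that memory enables serving larger models or batching more requests on the same hardware. Production deployments typically use this pattern.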
Security consideration: vLLM requires PyTorch runtime, which adds dependency weight. However, for production deployments, the throughput gains justify the additional dependencies compared to Ollama.
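The memory benefit of block-based KV-cache allocation, the idea behind PagedAttention, can be illustrated with simplified accounting. This is not vLLM's allocator: the block size of 16 tokens and the 2048-token maximum are assumptions chosen for the example, and the function names are hypothetical.

```python
import math


def contiguous_reserved(max_seq_len: int) -> int:
    """Token slots reserved when a full max-length region is preallocated."""
    return max_seq_len


def paged_reserved(actual_len: int, block_size: int = 16) -> int:
    """Token slots reserved when memory is handed out in fixed-size blocks."""
    return math.ceil(actual_len / block_size) * block_size


def wasted_fraction(reserved: int, actual_len: int) -> float:
    """Share of reserved slots that hold no tokens."""
    return (reserved - actual_len) / reserved


actual = 100  # tokens the request actually used
print(wasted_fraction(contiguous_reserved(2048), actual))  # about 0.95
print(wasted_fraction(paged_reserved(actual), actual))     # about 0.11
```

With paging, waste is bounded by one partially filled block per sequence, regardless of how much shorter the request is than the configured maximum.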
Pattern C: PyTorch + Custom Model (Training/Fine-tuning)
In this pattern, a container runs PyTorch with CUDA and training dependencies. It loads a dataset and base model, executes the training loop, and saves the fine-tuned weights to /output/ upon completion.
Use case includes fine-tuning existing models on your data, specialized model development, research work, and batch processing. This pattern is not used in production serving.
Why this pattern works: Fine-tuning means starting with a pre-trained model and training it further on your specific dataset. This is much faster than training from scratch. You need the full training framework to compute gradients and update weights.
Security consideration: Training containers should be ephemeral. After training completes, the container is discarded. Separate training containers from serving containers to minimize attack surface in production. Training data is sensitive; ensure dataset access is tightly controlled.
Pattern D: TensorRT-LLM + Optimized Model
In this pattern, a container runs TensorRT-LLM with a compiled runtime. It loads a TensorRT-optimized model and exposes a high-throughput inference API for production requests.
Use case targets ultra-low latency, maximum throughput, enterprise production deployment, and NVIDIA GPU usage. This is the highest-performance pattern.
Why this pattern works: TensorRT compiles models to GPU kernels optimized for specific hardware. This can produce a 2-5x latency reduction compared to standard runners. For time-sensitive applications (real-time trading, medical diagnosis) or high-volume serving, the performance gain is worth the added complexity.
Security consideration: Model compilation happens offline, making compiled models opaque — you can't inspect what kernels are generated. This is a tradeoff: extreme performance means less visibility into the compiled code.
Why Every Layer Creates Security Risk
Each layer has a unique attack surface. Understanding these risks is essential because a compromise in any layer can compromise the entire system.
The good news is that risks in each layer can be addressed independently. A compromise in the training framework doesn't affect production inference if you use separate containers. A vulnerability in a runner doesn't affect model integrity if you verify model signatures before loading.
Model Layer Risks
Supply chain poisoning is a critical threat: model weights can be modified during download, storage, or transit. An attacker with access to the model repository, CDN, or network can inject malicious weights that subtly alter model behavior. Unlike code, malicious model weights don't produce obvious errors — they just change outputs slightly, making detection nearly impossible without extensive testing.
Deserialization attacks occur when model formats execute arbitrary Python code during loading. Some formats (PyTorch pickle, older TensorFlow SavedModel) are vulnerable to this. A malicious model file can execute code before the runner ever starts serving requests. This is critical: deserialization happens inside your container, with full container privileges.
Model extraction attacks become trivial if an attacker gains filesystem access to a model file. They can copy it, fine-tune it elsewhere, or republish it. Standard containers with shell access make this easy: a single command such as `docker exec container_id cat /models/model.safetensors > stolen_model.safetensors` copies the entire model out of the container.
Model-based prompt injection can trigger unintended behavior, especially in unfiltered models. Models trained on internet data may contain exploitable patterns learned from adversarial examples, jailbreaks, or prompt injection attempts in the training data. A simple prompt injection cannot compromise the container itself, but can expose sensitive information or bypass safety guardrails.
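The deserialization risk can be made concrete without executing anything. The sketch below uses the standard library's `pickletools` to list the opcodes in a pickle stream that import names or call objects during loading. It handles raw pickle bytes only, and the `Payload` class is a harmless stand-in. Legitimate framework checkpoints also contain these opcodes, so a real scanner would compare the imported names against an allowlist instead of rejecting on sight.

```python
import pickle
import pickletools
from typing import List

# Opcodes that import names or call objects while unpickling.
SUSPICIOUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
                      "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def find_suspicious_opcodes(data: bytes) -> List[str]:
    """List import/call opcodes in a pickle stream without unpickling it."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)
            if op.name in SUSPICIOUS_OPCODES]


class Payload:
    """Harmless stand-in for a malicious object: loading it would call print()."""

    def __reduce__(self):
        return (print, ("code ran during load",))


plain = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
crafted = pickle.dumps(Payload())  # dumping does not run the payload

print(find_suspicious_opcodes(plain))    # []
print(find_suspicious_opcodes(crafted))  # includes REDUCE
```

Plain containers of numbers need none of these opcodes, which is the property formats like safetensors enforce by design: their loaders parse a header and raw tensor bytes and never evaluate a callable.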
Framework Layer Risks
Transitive dependency vulnerabilities are pervasive because a PyTorch-based environment can include 200+ packages. A single CVE in any transitive dependency (not even something you explicitly installed) becomes a container vulnerability. Many of these packages are obscure and poorly maintained.
Native code memory bugs in PyTorch, TensorFlow, and CUDA involve C/C++ code without memory safety. Buffer overflows, use-after-free, and integer overflows in CUDA kernels have led to remote code execution exploits.
CUDA driver vulnerabilities are a recurring problem. CUDA is proprietary NVIDIA code, CVEs in the driver are frequent and often critical, and you cannot audit or patch the code yourself. Running a CUDA container means running code that NVIDIA controls, not you.
Runtime mutation is possible in standard containers with pip/apt at runtime. An attacker inside the container can install backdoors, cryptominers, or data exfiltration tools.
Runner Layer Risks
API server vulnerabilities are a major attack surface. Ollama had CVE-2024-37032, an unauthenticated remote code execution vulnerability allowing crafted model requests to execute arbitrary code inside the container. vLLM and other runners have had injection vulnerabilities, path traversal bugs, authentication bypasses, and memory corruption exploits. These runners expose network APIs accepting untrusted input — a classic attack surface.
Model file path traversal occurs when runners accept model paths from HTTP requests without careful validation. Attackers can escape the model directory with .. sequences, load arbitrary files as models (including configuration files, private keys, or sensitive data), or access files outside the intended model directory. Once a file is loaded as a model, its contents can be extracted via the runner's API.
Deserialization during model loading is dangerous if a runner loads model formats that execute code during deserialization (PyTorch pickle). An attacker can supply a malicious model file via an API request and achieve code execution.
Shell-based configuration is used by many runners for initialization or model loading. If attackers can write to these scripts or inject environment variables, they achieve code execution with container privileges.
Privilege escalation through GPU access becomes possible because runners often require GPU access through NVIDIA Container Runtime, which is granted at the container level with broad capabilities. An attacker escaping the runner process can leverage GPU capabilities to launch kernel exploits or establish persistence.
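The path traversal risk above has a well-known mitigation: resolve the requested path and confirm it still sits under the model directory before opening it. The sketch below assumes a model root of /models; the function name is illustrative and not taken from any particular runner.

```python
from pathlib import Path

MODEL_DIR = Path("/models")  # assumed model root


def resolve_model_path(requested: str, model_dir: Path = MODEL_DIR) -> Path:
    """Return the resolved path, or raise ValueError if it escapes model_dir."""
    base = model_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check
    candidate = (base / requested).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"model path escapes {base}: {requested!r}") from None
    return candidate
```

Because the check runs on the resolved path, it rejects `..` sequences, absolute paths such as /etc/passwd, and symlinks inside the model directory that point elsewhere.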
The Container Security Problem for AI
Standard approaches to containerizing AI workloads create particularly dangerous systems with multiple intersecting risks.
Massive images result from combining Ubuntu 22.04 or Debian 12 plus Python plus PyTorch plus CUDA, creating 2-8GB containers. Every megabyte adds attack surface. Larger images mean more code, more potential vulnerabilities, and more packages pulling in more dependencies. A 5GB image might contain thousands of files, many of which are never used.
Shell access enables extraction in standard containers that include /bin/bash and /bin/sh. This means anyone with container access can exec into the running container, mount the filesystem, and steal model weights or training data. A model worth millions can be extracted with a single shell command: docker exec container_id tar czf model.tar.gz /models/.
Runtime mutation is enabled by containers with apt/pip that allow attackers to install additional packages, rootkits, or backdoors at runtime. A compromised application inside the container can become more compromised. An attacker who gains code execution can run apt-get install netcat and establish a reverse shell, persistent backdoor, or cryptominer.
No model integrity verification occurs in most AI containers that load models from paths without cryptographic verification. If an attacker can write to the model directory, they can swap in malicious model weights. Model files are often stored on shared volumes or network filesystems — attack surface expands beyond the container.
Unaudited transitive dependencies result from running pip install torch and the packages around it, which can bring in 200+ packages you never reviewed, some of which run install scripts. Each one is a potential supply chain attack. Most organizations have zero visibility into what's being installed. Package maintainers' accounts get compromised frequently, leading to malicious package versions spreading to thousands of container builds.
GPU driver attack vector emerges because GPU drivers run at kernel privilege. A CVE in the CUDA driver or GPU runtime is a kernel-level exploit. Standard containers grant GPU access with broad capabilities — an attacker escaping the runner process can leverage this for kernel exploitation or persistence.
No SBOM or provenance means standard containers don't include software bills of materials. You have no way to prove what software is inside, audit its licenses, identify transitive dependencies with known CVEs, or verify no malicious modifications occurred during build. For compliance, security audits, or incident response, this is a critical gap.
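The missing integrity check is inexpensive to add. A minimal sketch, assuming the expected SHA-256 digest arrives through a trusted channel such as a signed manifest (how it is distributed is outside this example, and the function names are illustrative):

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large model files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> None:
    """Raise ValueError unless the file's digest matches the expected value."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"digest mismatch for {path}")
```

Calling `verify_model` before handing the path to the runner turns a silently swapped model file into a hard startup failure.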
Container Hardening Best Practices for AI Workloads
When deploying AI workloads to production, container security requires deliberate design choices across multiple dimensions.
Hardened Model Runners
Production runners should be designed to minimize exploitability of API vulnerabilities. Several key practices apply: Remove shell access to eliminate /bin/bash and /bin/sh. Without a shell, RCE vulnerabilities (like CVE-2024-37032 in Ollama) become much harder to exploit: an attacker who achieves code execution has no shell or system utilities with which to explore the filesystem or pivot further. The runner's only intended interface is its HTTP API.
Read-only filesystem makes model files and code immutable after container startup. If an attacker tries to modify a file, the write is rejected. This prevents silent corruption and mutation attacks.
Minimize base image by using distroless or Alpine images instead of Ubuntu. Distroless images ship without a shell or package manager; Alpine is small but still includes apk and a BusyBox shell unless you remove them in the final build stage. Fewer system utilities leave an attacker less to work with after a compromise.
Principle: A vulnerability that allows an arbitrary file path in a model load request is dangerous in a standard container (the attacker can escalate via the shell). In a hardened container with no shell and a read-only root filesystem, the same vulnerability is far better contained.
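Several of these runtime properties are set when the container is launched, not when it is built. The sketch below assembles a `docker run` command line with standard Docker flags. The UID 65532, the host path /srv/models, and the function name are assumptions made for illustration, and GPU access flags are omitted.

```python
from typing import List


def hardened_run_args(image: str, model_dir: str = "/srv/models") -> List[str]:
    """Build a docker run argv for a locked-down inference container."""
    return [
        "docker", "run", "--rm",
        "--read-only",                          # root filesystem mounted read-only
        "--tmpfs", "/tmp",                      # writable scratch space, memory only
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege gain via setuid binaries
        "--user", "65532:65532",                # assumed unprivileged UID:GID
        "--volume", f"{model_dir}:/models:ro",  # model weights mounted read-only
        image,
    ]


print(" ".join(hardened_run_args("registry.example.com/runner:1.0")))
```

These flags do not remove a shell or package manager that is already in the image; that part of the hardening has to happen at build time.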
Training Framework Containers
Training frameworks require extensive dependencies, but those dependencies should be kept out of inference images. Development containers include build tools and all framework dependencies but should document them in a software bill of materials (SBOM) so you know exactly what's included.
Production inference containers strip out training dependencies, leaving only runtime components. A model trained in a heavy development container can run in a minimal inference container, reducing attack surface significantly.
This separation is a best practice: train in a feature-complete container, deploy inference in a hardened, minimal container.
Hardening Across All Layers
Key hardening techniques applicable to all AI container layers include: Distroless or Alpine base images provide smaller images with fewer packages and reduced attack surface. No package manager should be available at runtime (distroless images ship without one; on Alpine, remove apk in the final build stage).
Read-only root filesystem makes model files and code immutable after container startup, preventing both accidental corruption and intentional modification.
No shell access removes /bin/bash and /bin/sh, restricting attacker capabilities even if code execution is achieved.
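Pre-compiled dependencies mean no pip/apt at runtime. All dependencies are installed and compiled during the build phase, which moves supply chain risk to build time, where it can be scanned and audited, and rules out installation-time attacks in the running container.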
Software Bill of Materials (SBOM) using SPDX 3.0 or CycloneDX format documents every dependency, enabling auditing for known CVEs and supply chain verification.
Cryptographic signatures using tools like Cosign enable verification of image authenticity and detection of tampering in transit.
Provenance attestations using SLSA or similar frameworks document the build process, source code version, and builder identity, enabling incident response and forensics.
Immutable pull semantics through digest-based pulling (hashes, not tags) ensures reproducibility — you get the exact image you tested, not a newer version.
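Digest-based pulling can be enforced mechanically, for example by rejecting any image reference that is not pinned to a SHA-256 digest before it reaches the deployment step. A minimal sketch, with a hypothetical function name and registry:

```python
import re

# A pinned reference ends in @sha256: followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the image reference names an exact digest, not only a tag."""
    return _DIGEST_RE.search(image_ref) is not None


print(is_digest_pinned("registry.example.com/runner:latest"))               # False
print(is_digest_pinned("registry.example.com/runner@sha256:" + "ab" * 32))  # True
```

A tag such as latest can be repointed at a different image at any time, while a digest names one specific image content.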
GPU Runtime Considerations
GPU support adds complexity to container security: CUDA driver supply chain risk is significant because CUDA is proprietary NVIDIA code. Pulling CUDA during container build or deployment introduces supply chain dependencies. Prefer pre-compiled, verified CUDA runtimes.
Pre-compilation strategy involves compiling CUDA runtime libraries for specific GPU types (H100, A100, L40S, A10, T4) and including them in the base image. This eliminates download time during deployment, intermediate caching of untrusted binaries, installer code execution in your container, dependency on external package repositories, and version mismatches between expected and installed CUDA.
The tradeoff is clear: pre-compiled runtimes are less flexible (one image per GPU type) but more secure (frozen, verifiable, documented dependencies).
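The one-image-per-GPU-type approach implies a lookup at deployment time. The sketch below maps the GPU models named above to their CUDA compute capabilities and derives an image tag from them. The registry name and the sm-style tag scheme are hypothetical; only the compute capability values come from NVIDIA's published specifications.

```python
from typing import Dict

# CUDA compute capability for each GPU model
COMPUTE_CAPABILITY: Dict[str, str] = {
    "H100": "9.0",
    "A100": "8.0",
    "L40S": "8.9",
    "A10": "8.6",
    "T4": "7.5",
}


def image_for_gpu(gpu: str, repo: str = "registry.example.com/runner") -> str:
    """Pick the prebuilt image tag for a GPU type (hypothetical tag scheme)."""
    try:
        capability = COMPUTE_CAPABILITY[gpu.upper()]
    except KeyError:
        raise ValueError(f"no prebuilt image for GPU type {gpu!r}") from None
    return f"{repo}:sm{capability.replace('.', '')}"


print(image_for_gpu("T4"))  # registry.example.com/runner:sm75
```

Failing loudly on an unknown GPU type is deliberate: silently falling back to a generic image would reintroduce the version mismatches that pre-compilation is meant to avoid.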
Standard vs. Hardened AI Container Approaches
| Aspect | Standard Ubuntu/Debian-based AI Image | Hardened Minimal AI Image |
|---|---|---|
| Base OS | Ubuntu 22.04 / Debian 12 | Distroless / Alpine |
| Image size | 2-8 GB | 200-800 MB |
| Deployment overhead | Large images slow deployment | Smaller pulls reduce latency and network load |
| Shell available | /bin/bash, /bin/sh, full POSIX environment | None (eliminates interactive access) |
| Package manager | apt, pip available at runtime | None (prevents runtime package installation) |
| CVE surface | 50-200+ from base OS alone | Dramatically reduced by eliminating unnecessary packages |
| SBOM documentation | Rarely provided, often incomplete | Best practice: SPDX 3.0 or CycloneDX format with full tree |
| Cryptographic signatures | Typically not included | Enables verification of authenticity and supply chain integrity |
| Model file security | Shell access enables trivial extraction | Read-only FS + no shell prevents direct extraction |
| Runtime mutation | Attackers can install backdoors via package manager | Immutable filesystem prevents runtime changes |
| FIPS compliance | Difficult to achieve | Achievable with hardened runtime libraries |
| GPU driver handling | Runtime installation increases supply chain risk | Pre-compiled CUDA reduces external dependencies |
| Compliance audit trail | Limited visibility into container composition | Enhanced by SBOM, signatures, and build provenance records |
| Incident response | Difficult to definitively identify running software | Forensics enabled by documented dependencies and attestations |
Production Readiness: Choosing Your Hardening Strategy
When deploying AI workloads to production, your container strategy must account for your threat model, compliance requirements, and operational constraints.
Development and experimental environments may use standard containers when data is non-sensitive (public datasets, internal testing), speed of iteration matters more than hardening, the container is temporary (spun up, used once, discarded), or security risks are acceptable for the deployment context.
Production environments benefit from hardened containers when models are valuable intellectual property (preventing extraction is important), input data is sensitive (preventing compromised container access to secrets), compliance is required (HIPAA, PCI, FIPS, SOC 2, GDPR, etc.), you're operating at scale with security auditing and incident response requirements, you want to minimize runtime mutation and shell access risks, or verifiable build provenance is required for forensics.
The decision isn't binary. Many organizations use a layered approach by using standard containers for development and experimentation, then hardened containers for staging and production. This separation is clean: train and experiment in feature-complete containers, deploy inference in hardened, minimal containers.
Security by Layers: Which Threats Apply to You?
Not every organization faces every threat. Understanding the three-layer stack helps you apply appropriate security controls.
Threat: Model weight extraction is relevant if your model is valuable IP, competitors would pay to steal it, or regulators audit model provenance. It's addressed by read-only FS plus no shell in CleanStart containers.
Threat: Supply chain poisoning in dependencies is relevant if you operate at scale with security auditing or handle sensitive data. It's addressed by pre-compiled dependencies in CleanStart and SBOM for auditability.
Threat: Runtime code injection via runner vulnerability is relevant if your runner is exposed to untrusted input (public API) or runs on an untrusted network. It's addressed by shell-less runners that eliminate RCE pathways with least-privilege design.
Threat: CUDA driver kernel exploits are relevant if untrusted code runs in the same container as the runner. They're addressed by pre-verified GPU runtimes in CleanStart and FIPS variants for cryptographic assurance.
Threat: Training data leakage during training is relevant if your training data is confidential (medical data, financial records). It's addressed by ephemeral training containers, read-only model outputs, and no runtime shell.
Real-World Example: What Goes Wrong
A typical deployment failure follows this pattern:
- Build phase: `docker build -t my-ai-api .` pulls Ubuntu 22.04 and PyTorch 2.1, then installs pip packages and CUDA drivers from NVIDIA. The image is 4GB with 150+ known CVEs.
- Deployment phase: Image is pushed to registry and pulled on GPU server, taking 10 minutes to transfer. No verification confirms the image is unmodified.
- Runtime phase: Container starts. A vulnerability in a transitive pip dependency (labeled CVE-2024-1234 in this illustration) is exploited by malicious input to the API, granting the attacker code execution inside the container.
- Containment failure: Container has shell and package manager. Attacker runs `apt-get install openssh-server` and establishes a reverse shell.
- Data exfiltration: Attacker mounts `/models/` and copies the 50GB model file to an external server. The model is now in the competitor's hands.
- Incident response nightmare: With no SBOM, build logs, or signatures, the security team cannot definitively determine what version of what software was running. Could the model have been poisoned in the container? Unknown. The team must assume the worst case.
With CleanStart, the scenario is completely different:
- Build phase: Build from verified source with automatic SBOM generation and all dependencies scanned.
- Deployment phase: Image is signed with cosign and has SLSA attestations attached. Deployment verifies the signature before pulling.
- Runtime phase: CVE triggers a code execution attempt, but there's no shell, so the attacker has no ready-made way to explore the filesystem or establish persistence.
- Containment success: Read-only filesystem means the attacker can't modify the model or drop tools into the container. No shell means the attacker can't run arbitrary shell commands. The compromised process is isolated.
- Incident response: SBOM identifies the exact CVE, SLSA attestations prove no tampering in transit, and build logs trace the issue to source. Mitigation is clear: rebuild with patched dependency.
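The SBOM step in incident response amounts to a lookup: list the components recorded in the SBOM and match them against advisories. The sketch below reads a CycloneDX JSON document using its top-level `bomFormat` and `components` fields. The component, the advisory identifiers, and the function names are all made up for illustration, and nested components are not handled.

```python
import json
from typing import Dict, List, Set, Tuple

# Hypothetical minimal CycloneDX document; the component is invented.
EXAMPLE_SBOM = json.dumps({
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "library", "name": "examplelib", "version": "1.2.3"},
    ],
})


def components(sbom_json: str) -> Set[Tuple[str, str]]:
    """Return (name, version) pairs from the top-level components list."""
    doc = json.loads(sbom_json)
    if doc.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX JSON document")
    return {(c["name"], c.get("version", "")) for c in doc.get("components", [])}


def affected(sbom_json: str, advisories: Dict[Tuple[str, str], str]) -> List[str]:
    """Return advisory IDs whose (name, version) appears in the SBOM."""
    present = components(sbom_json)
    return sorted(adv for key, adv in advisories.items() if key in present)


advisories = {
    ("examplelib", "1.2.3"): "EXAMPLE-ADVISORY-1",  # hypothetical identifiers
    ("otherlib", "0.1.0"): "EXAMPLE-ADVISORY-2",
}
print(affected(EXAMPLE_SBOM, advisories))  # ['EXAMPLE-ADVISORY-1']
```

In practice a vulnerability scanner performs this matching against a maintained advisory database; the point is that without the SBOM there is nothing to match against.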
Next Steps
- Understand AI runtime security in detail: See ai-runtime-overview.md.
- Learn container hardening best practices: See container-security-best-practices.md.
- Secure your build pipeline: See prebuild-stage-security.md.
- Understand the supply chain problem: See the-illusion-of-the-single-artifact.md.
