A container image with a 2-year-old vulnerability running as root with full kernel capabilities in a privileged pod is technically "containerized," but it's not secure. True container security requires explicit choices at every stage: from the Dockerfile you write to the runtime policies you enforce in production.
This document covers the security controls that actually matter, why each one exists, and how to implement them in production environments.
Build-Time Security: The Foundation
Security decisions made at build time determine what vulnerabilities exist downstream. A vulnerability baked into the image will reach production unless caught by scanning. A comprehensive defense-in-depth approach layers security controls from build through runtime.
Use Minimal Base Images
A base image is the starting point: it provides the OS userland, package manager, shell, and core utilities that applications build on. General-purpose base images carry more than most applications need. ubuntu:22.04 is about 77 MB and ships roughly a hundred packages, including bash, apt, and a full set of core utilities; images grow much larger once tools such as curl, wget, man pages, and compilers are installed on top. If any one of those packages contains a vulnerability, your image is vulnerable even if the application never uses that code.
Minimal base images include only what's necessary. Alpine Linux is about 7 MB, containing little more than musl libc, BusyBox, and the apk package manager. A distroless Python image is about 50 MB and contains only the Python runtime and its minimal dependencies, with no shell or utilities. Compared with full distribution images, the result is often 5–10x smaller: faster to pull, quicker to scan, and with less attack surface.
Trade-offs exist. Minimal images make debugging harder (no bash, no curl), but production containers should rarely need interactive debugging; reproduce the problem in staging instead. Alpine uses musl libc rather than glibc, which breaks some applications. The right base also depends on the runtime: Java needs a larger image that includes a JVM, while Python can use a slim variant.
Recommendation: For Python, Node.js, and Go, use Alpine or distroless. For Java, use a slim JRE image (for example, an eclipse-temurin JRE tag) or distroless Java. For .NET, use mcr.microsoft.com/dotnet/runtime-deps. Avoid ubuntu or debian when a smaller alternative works.
Multi-Stage Builds: Compile Separately From Runtime
Compiled applications need a compiler at build time but not at runtime. A single-stage Go build can produce a 1.3 GB image that carries the entire Go toolchain even though the application needs only the binary. A multi-stage build discards the compiler and build tools from the final image, which is often 40–100x smaller.
The pattern separates a builder stage from a runtime stage. The first stage starts from golang:1.21 (about 1.3 GB), which includes the compiler and build tools; the source code is copied in and the build runs there. The second stage starts from alpine:3.18 as a minimal runtime and copies in only the compiled binary from the builder. The final image is about 30 MB: Alpine plus the binary. The compiler, build tools, and source code never reach it.
Benefit: a 40–100x size reduction for compiled languages, and no compiler or build tools in production where they could be exploited.
No Secrets in Images
Secrets (passwords, API keys, SSH keys, tokens) must never be in container images. Hardcoding a secret in a Dockerfile makes it part of the image history, where it remains even if a later layer deletes it. Anyone with access to the image can extract it by inspecting the layers.
Build-time secrets avoid persistence: they are made available to a single build step (for example, through a BuildKit secret mount) and are not written to any image layer. Runtime secrets are fetched when the container starts, through mounted volumes or injected environment variables. Kubernetes Secrets support both mechanisms.
Best practice: Never hardcode secrets. Use build-time secrets for build access (via docker buildx build --secret) and runtime secrets for application access (via Kubernetes Secrets or external vaults).
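A quick check for the most common mistake, secrets assigned in ENV or ARG instructions, might look like the sketch below. The keyword list and the sample Dockerfile are assumptions made for illustration; a dedicated secret scanner covers far more cases.

```python
import re

# Hypothetical keyword list; tune it to your own naming conventions.
SECRET_HINT = re.compile(
    r"(PASSWORD|PASSWD|SECRET|TOKEN|API_?KEY|PRIVATE_?KEY)", re.IGNORECASE
)

def find_hardcoded_secrets(dockerfile_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for ENV/ARG lines that assign a secret-looking name."""
    hits = []
    for number, raw in enumerate(dockerfile_text.splitlines(), start=1):
        line = raw.strip()
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0].upper() not in ("ENV", "ARG"):
            continue
        body = parts[1]
        # Handle both "NAME=value" and the legacy "NAME value" form.
        if "=" in body:
            name, value = body.split("=", 1)
        else:
            name, _, value = body.partition(" ")
        # A bare "ARG NAME" declares a name without storing a value, so skip it.
        if SECRET_HINT.search(name) and value.strip():
            hits.append((number, line))
    return hits

EXAMPLE = """FROM python:3.12-slim
ENV APP_ENV=production
ENV DB_PASSWORD=hunter2
ARG API_KEY
"""

for number, line in find_hardcoded_secrets(EXAMPLE):
    print(f"line {number}: {line}")
```

A check like this fits in a pre-commit hook or CI step, so the secret is caught before any layer containing it is built.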
Pin Dependency Versions
Floating tags (latest, stable) change over time, so rebuilding an image weeks later can pull different packages. Pinning a specific version keeps rebuilds on the same release, which gives you reproducible builds, controlled updates, an audit trail of what was used, and dependable rollback.
Better: Pin by digest (a cryptographic hash of the image). A version tag can still be re-pushed to point at different content, whereas a digest identifies bit-for-bit identical content even if the registry is compromised. Digest pinning strengthens each of these properties: reproducibility (same Dockerfile → same image), security (updates are deliberate, not automatic), audit trail (immutable proof), and rollback (reverting is guaranteed).
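Why a digest pin resists tampering can be shown in a few lines. This is a minimal sketch: the byte strings are stand-ins for an image manifest, not a real registry payload, and the function names are hypothetical.

```python
import hashlib

def digest_of(content: bytes) -> str:
    """Compute a content digest in the sha256:<hex> form used for image pins."""
    return "sha256:" + hashlib.sha256(content).hexdigest()

def matches_pin(content: bytes, pinned_digest: str) -> bool:
    """True only if the content hashes to exactly the pinned digest."""
    return digest_of(content) == pinned_digest

# Stand-in manifest bytes (an assumption for illustration).
original = b'{"layers": ["layer-a", "layer-b"]}'
pin = digest_of(original)

# Changing a single layer reference changes the digest.
tampered = original.replace(b"layer-b", b"layer-x")

print(matches_pin(original, pin))
print(matches_pin(tampered, pin))
```

Because the pin is derived from the content itself, whoever controls the registry cannot swap the content without the mismatch being detectable by the client.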
Manage pinning with Renovate/Dependabot (automated pull requests), lock files (requirements.lock, package-lock.json), and digest scanning (warning about floating tags).
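The floating-tag warning mentioned above can be approximated with a small Dockerfile check. This sketch assumes that "floating" means no tag, latest, or stable, and the registry host in the sample is hypothetical; real linters handle more syntax (build arguments in FROM lines, for instance).

```python
FLOATING_TAGS = {"", "latest", "stable"}

def find_floating_references(dockerfile_text: str) -> list[str]:
    """Return base image references in FROM lines that are not pinned."""
    floating = []
    stage_names = set()
    for raw in dockerfile_text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].upper() != "FROM":
            continue
        # Ignore flags such as --platform=linux/amd64.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if not args:
            continue
        image = args[0]
        refers_to_stage = image.lower() in stage_names
        # Remember "AS <name>" so a later "FROM <name>" is not reported.
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2].lower())
        if refers_to_stage or image == "scratch" or "@sha256:" in image:
            continue
        # Look for the tag after the last "/" so registry ports are not mistaken for tags.
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in FLOATING_TAGS:
            floating.append(image)
    return floating

EXAMPLE = """FROM golang:1.21 AS builder
FROM alpine
FROM registry.example.com:5000/team/app:latest
FROM builder
FROM alpine:3.18
"""

print(find_floating_references(EXAMPLE))
```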
Sign Your Images
Image signing proves the image came from you and hasn't been tampered with since it was built. Without signing, anyone with push access to the registry can publish a malicious image under your name, and container runtimes have no way to tell the difference.
Cosign, part of the Sigstore project, is simpler than Docker Content Trust and supports keyless signing tied to an OIDC identity. Sigstore is free, keeps a public transparency log, and accepts identities from providers such as GitHub, Google, and Microsoft. Verification happens before the image runs: Kubernetes admission policies can enforce signature verification at the cluster level, so only images with valid signatures are admitted.
Registry-Time Security: Scan and Control
Registry-time controls catch vulnerabilities before production. Even perfect Dockerfiles can include vulnerable packages if the base image or dependencies are compromised.
Scan on Push
Every image pushed to the registry should be scanned automatically, and the pipeline should fail if critical vulnerabilities are found. Many registries offer built-in scanning (Google Artifact Registry, Amazon ECR, Harbor with Trivy), and standalone scanners such as Trivy, Snyk, Prisma Cloud, and JFrog Xray run in CI. A scanner that exits with a non-zero code on critical findings lets the pipeline stop before a vulnerable image is published.
Limitation: Scanning detects only known CVEs (zero-day vulnerabilities are not caught), flags vulnerable packages rather than flaws in your own application code, and depends on up-to-date CVE databases.
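The exit-code gate can be sketched as a small function over a scanner report. The report shape below is modeled on Trivy's JSON output (Results, Vulnerabilities, Severity, VulnerabilityID); treat those field names as an assumption and check your scanner's actual schema. The CVE identifiers are placeholders.

```python
def critical_findings(report: dict) -> list[str]:
    """Collect the IDs of all findings rated CRITICAL."""
    ids = []
    for result in report.get("Results", []):
        # The vulnerability list may be missing or null when nothing was found.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL":
                ids.append(vuln["VulnerabilityID"])
    return ids

def gate_exit_code(report: dict) -> int:
    """Return 1 to fail the pipeline when any critical finding exists, else 0."""
    return 1 if critical_findings(report) else 0

# Placeholder report for illustration only.
REPORT = {
    "Results": [
        {"Target": "os-packages", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "app-dependencies", "Vulnerabilities": None},
    ]
}

print(critical_findings(REPORT), gate_exit_code(REPORT))
```

In a CI job, passing the result to sys.exit would make the step fail and stop the push.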
Admission Control and Image Policies
Kubernetes admission controllers block pod creation when an image doesn't meet policy. Three policies matter most. Setting imagePullPolicy: Always avoids running stale cached images. A registry policy restricts which registries images may come from. Signature verification ensures only signed images deploy.
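The decision an image policy makes can be modeled in a few lines. This is only a sketch of the logic, not an admission webhook: the registry name is hypothetical, and requiring a digest is an extra rule assumed here for illustration.

```python
# Hypothetical allowlist. The trailing slash prevents lookalike hosts such as
# registry.example.com.attacker.test from matching.
ALLOWED_REGISTRIES = ("registry.example.com/",)

def admission_violations(image: str, allowed_prefixes=ALLOWED_REGISTRIES) -> list[str]:
    """Return the reasons an image reference would be rejected (empty list = admit)."""
    problems = []
    if not image.startswith(tuple(allowed_prefixes)):
        problems.append("registry not on the allowlist")
    if "@sha256:" not in image:
        problems.append("image is not pinned by digest")
    return problems

good = "registry.example.com/team/app@sha256:" + "0" * 64
bad = "docker.io/library/nginx:latest"

print(admission_violations(good))
print(admission_violations(bad))
```

In a cluster, the same rules would normally be expressed as a policy for an admission controller rather than as application code.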
Runtime Security: Lock Down Containers
Runtime security enforces restrictions on what containers can do after they start.
Non-Root User
Container processes run as root (UID 0) by default, so a container escape can lead to host compromise: the attacker would hold root privileges and could write to any file, install packages, load kernel modules, and access other processes. Running as non-root limits what an attacker gains from a compromised process.
Implementation: Create a non-root user in the Dockerfile and switch to it (USER appuser). Kubernetes enforces this with runAsNonRoot: true and an explicit runAsUser such as 65532 (the nonroot UID used by distroless images). Add allowPrivilegeEscalation: false so that setuid binaries such as sudo or su cannot regain root.
Read-Only Root Filesystem
Containers with writable root filesystems allow any process to modify files. This is fine for development but dangerous in production. Marking the filesystem as read-only prevents persistent modifications.
Implementation: readOnlyRootFilesystem: true in Kubernetes makes / read-only. Applications that write to /tmp or /var/run need explicit writable volumes there, typically emptyDir. Even with code execution, attackers cannot modify application files, drop backdoors into the image filesystem, or persist malware beyond those volumes.
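The non-root and read-only settings combine into a single Pod manifest. The sketch below builds it as a Python dictionary and prints JSON, which kubectl accepts alongside YAML. The pod name and image reference are hypothetical, and allowPrivilegeEscalation: false is an additional hardening assumption.

```python
import json

def hardened_pod(name: str, image: str) -> dict:
    """Build a Pod manifest that runs as non-root with a read-only root filesystem."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level settings apply to every container in the pod.
            "securityContext": {"runAsNonRoot": True, "runAsUser": 65532},
            "containers": [{
                "name": name,
                "image": image,
                "securityContext": {
                    "readOnlyRootFilesystem": True,
                    # Assumption beyond the text: also block setuid escalation.
                    "allowPrivilegeEscalation": False,
                },
                # Writable scratch space, since / itself is read-only.
                "volumeMounts": [
                    {"name": "tmp", "mountPath": "/tmp"},
                    {"name": "run", "mountPath": "/var/run"},
                ],
            }],
            "volumes": [
                {"name": "tmp", "emptyDir": {}},
                {"name": "run", "emptyDir": {}},
            ],
        },
    }

manifest = hardened_pod("web", "registry.example.com/team/web:1.4.2")
print(json.dumps(manifest, indent=2))
```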
Dropped Capabilities
Linux capabilities split the privileges that traditionally required full root into fine-grained units. Containers receive a sizable default set. Dropping all of them and adding back only what's needed removes most of the privileged operations an attacker could otherwise use.
Implementation: capabilities: drop: [ALL] drops everything. Then add back only what's needed, for example NET_BIND_SERVICE (CAP_NET_BIND_SERVICE in kernel terms; Kubernetes omits the CAP_ prefix) for binding to ports below 1024. Most applications need zero capabilities. When a process attempts an operation that requires a capability it lacks, the kernel rejects the call.
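The drop-all-then-add-back rule is mechanical enough to express as a helper. This sketch derives the capabilities block of a container securityContext from the ports an application listens on; the function name is hypothetical, and it assumes port binding is the only privileged operation the application needs. Note that Kubernetes spells capability names without the CAP_ prefix.

```python
def capabilities_for(ports: list[int]) -> dict:
    """Return a capabilities block: drop everything, add back only what ports require."""
    caps = {"drop": ["ALL"]}
    # Ports below 1024 are privileged and require NET_BIND_SERVICE.
    if any(port < 1024 for port in ports):
        caps["add"] = ["NET_BIND_SERVICE"]
    return caps

print(capabilities_for([8000]))
print(capabilities_for([80, 8443]))
```

Listening on a high port such as 8000 and mapping it to 80 at the Service level avoids needing the capability at all.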
Seccomp Profiles
Seccomp (secure computing mode) restricts which system calls a process can make. An allowlist of permitted syscalls is defined; unlisted calls are blocked. A Python web application needs calls such as read, write, socket, connect, listen, and accept, but not init_module or sysctl.
Benefit: Blocks unusual or dangerous syscalls, prevents privilege escalation, detects anomalous behavior.
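An allowlist profile in the JSON format used by Docker and OCI runtimes has a default action that denies and a rule that allows the named syscalls. The sketch below generates one; the syscall list is deliberately incomplete and for illustration only, since a real service needs many more calls (memory management, file metadata, signals, and so on).

```python
import json

def allowlist_profile(allowed_syscalls: list[str]) -> dict:
    """Build a seccomp profile that denies every syscall except the listed ones."""
    return {
        # Anything not explicitly allowed fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(set(allowed_syscalls)), "action": "SCMP_ACT_ALLOW"}
        ],
    }

# Illustrative subset only; a working profile is considerably longer.
profile = allowlist_profile(["read", "write", "socket", "connect", "listen", "accept"])
print(json.dumps(profile, indent=2))
```

Starting from the runtime's default profile and tightening it is usually less error-prone than building a list from scratch.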
No Privileged Mode
Never run containers in privileged mode unless absolutely necessary. Privileged containers have all Linux capabilities, access all devices, can load kernel modules, and can access the host kernel directly. This effectively removes the container isolation boundary.
Only use for: Container runtimes (Docker daemon in Docker, Kubernetes kubelet), system management tools (systemd, network management), device drivers. Regular applications never need privileged mode.
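Finding privileged containers before they deploy is a simple manifest check. This sketch assumes the Pod manifest has already been parsed into a dictionary; the pod and container names are hypothetical.

```python
def privileged_containers(pod: dict) -> list[str]:
    """Return the names of containers that request privileged mode."""
    spec = pod.get("spec", {})
    names = []
    # Init containers can be privileged too, so check both lists.
    for container in spec.get("containers", []) + spec.get("initContainers", []):
        if container.get("securityContext", {}).get("privileged") is True:
            names.append(container["name"])
    return names

POD = {
    "spec": {
        "initContainers": [
            {"name": "setup", "securityContext": {"privileged": True}},
        ],
        "containers": [
            {"name": "web", "securityContext": {"privileged": False}},
            {"name": "sidecar"},
        ],
    }
}

print(privileged_containers(POD))
```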
Network Security: Isolate Traffic
Containers have network access by default. Control what they can reach through network policies and service meshes.
Network Policies
Network policies are firewall rules for Kubernetes pods. By default, all pods can reach all other pods and the internet. Network policies replace that implicit allow-all with explicit rules, provided the cluster's network plugin enforces them. The recommended approach: deny all by default, then allow specific paths explicitly.
For example, a default-deny policy combined with a rule that allows ingress on port 8000 only from pods labeled role: frontend leaves the backend reachable from the frontend and from nothing else.
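The two policies described here can be written out as manifests. The sketch below builds them as Python dictionaries and prints JSON; the namespace, policy names, and the role: backend label on the protected pods are assumptions for illustration.

```python
import json

def default_deny(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        # An empty podSelector matches all pods; no rules means nothing is allowed.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }

def allow_frontend(namespace: str, backend_labels: dict) -> dict:
    """Allow TCP 8000 into the selected pods, only from pods labeled role: frontend."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-frontend", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": backend_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"role": "frontend"}}}],
                "ports": [{"protocol": "TCP", "port": 8000}],
            }],
        },
    }

deny = default_deny("shop")
allow = allow_frontend("shop", {"role": "backend"})
print(json.dumps([deny, allow], indent=2))
```

Network policies are additive, so the allow rule opens one path through the default deny rather than replacing it.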
Service Mesh
Service meshes (Istio, Linkerd) provide advanced network control: mutual TLS encrypts traffic between pods, authorization policies enforce fine-grained access, rate limiting prevents abuse, and traffic management enables canary deployments.
Trade-off: A service mesh adds operational complexity, but in return provides observability and strong network security guarantees.
Egress Filtering
Control what outside services your containers reach. Your web application should reach the database and payment API but not the entire internet or attacker infrastructure. Network policies enforce this by allowing only specific egress destinations.
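The decision an egress allowlist makes can be modeled with the standard ipaddress module. This is a sketch of the rule logic only, not of how a network plugin enforces it; the CIDR ranges and ports are hypothetical stand-ins for a database subnet and a payment API.

```python
import ipaddress

# Hypothetical allowlist of (destination CIDR, port) pairs.
ALLOWED_EGRESS = [
    ("10.0.5.0/24", 5432),     # database subnet
    ("203.0.113.10/32", 443),  # payment API
]

def egress_allowed(dest_ip: str, dest_port: int, rules=ALLOWED_EGRESS) -> bool:
    """True if the destination matches at least one allowlist entry."""
    address = ipaddress.ip_address(dest_ip)
    return any(
        address in ipaddress.ip_network(cidr) and dest_port == port
        for cidr, port in rules
    )

print(egress_allowed("10.0.5.17", 5432))    # database
print(egress_allowed("203.0.113.10", 443))  # payment API
print(egress_allowed("198.51.100.7", 443))  # anything else
```

In a NetworkPolicy, each pair would correspond to an egress rule with an ipBlock and a port.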
Secrets Management: Never Hardcode Credentials
Secrets hardcoded in images, in plain-text environment variable definitions, or in ConfigMaps are security disasters.
External Secret Vaults
Store secrets in dedicated vault services. Kubernetes Secrets are the basic option: values are stored base64-encoded in etcd, which is encoding rather than encryption, and etcd is not encrypted at rest by default. HashiCorp Vault is the enterprise option: applications authenticate and retrieve secrets at startup. Cloud providers offer native options: AWS Secrets Manager, Azure Key Vault, and GCP Secret Manager.
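The point about base64 is easy to demonstrate: it is a reversible encoding that needs no key. The secret value here is a placeholder.

```python
import base64

# What ends up in a Secret manifest's data field.
stored = base64.b64encode(b"s3cr3t-value").decode()

# Anyone who can read the stored value can recover the original.
recovered = base64.b64decode(stored)

print(stored)
print(recovered)
```

Protection therefore has to come from access control on the Secret objects and from enabling encryption at rest for etcd.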
Practice: Never hardcode. Use external vaults. Rotate credentials periodically. Audit access.
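On the application side, reading a secret that the platform mounted as a file might look like the sketch below. The mount directory and secret name are hypothetical; the demo writes to a temporary directory so it runs without a real mount.

```python
import tempfile
from pathlib import Path

def read_secret(name: str, base_dir: str = "/var/run/secrets/app") -> str:
    """Read a secret from a file mounted into the container at runtime."""
    # Reject names with path components so callers cannot escape the mount.
    if Path(name).name != name:
        raise ValueError(f"invalid secret name: {name!r}")
    return (Path(base_dir) / name).read_text(encoding="utf-8").strip()

# Demo with a temporary directory standing in for the mounted volume.
with tempfile.TemporaryDirectory() as mount:
    Path(mount, "db-password").write_text("example-value\n", encoding="utf-8")
    password = read_secret("db-password", base_dir=mount)

print(password)
```

Reading the file on each use, rather than caching the value for the process lifetime, lets the application pick up rotated credentials.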
Supply Chain Security: SBOM, Provenance, VEX
Modern security includes not just your code but the entire supply chain of dependencies.
Software Bill of Materials (SBOM)
An SBOM lists every component in a piece of software: packages, libraries, versions, and licenses. Tools like Syft generate one directly from an image. Use SBOMs to track components, cross-reference them against CVE databases to find vulnerabilities, detect supply chain attacks, and satisfy compliance requirements.
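Cross-referencing an SBOM against vulnerability data comes down to matching name and version pairs. The sketch below assumes a CycloneDX-style JSON document with a components list; the package names and the advisory table are hypothetical, and real matching uses version ranges and package identifiers rather than exact strings.

```python
# Hypothetical advisory data: package name -> versions known to be vulnerable.
ADVISORIES = {"examplelib": {"1.0.0", "1.0.1"}}

def vulnerable_components(sbom: dict, advisories: dict) -> list[str]:
    """Return name@version for every SBOM component listed in the advisories."""
    hits = []
    for component in sbom.get("components", []):
        name, version = component.get("name"), component.get("version")
        if version in advisories.get(name, set()):
            hits.append(f"{name}@{version}")
    return hits

# Minimal stand-in SBOM for illustration.
SBOM = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "library", "name": "examplelib", "version": "1.0.1"},
        {"type": "library", "name": "otherlib", "version": "2.3.0"},
    ],
}

print(vulnerable_components(SBOM, ADVISORIES))
```

Because the SBOM is stored, the same check can be rerun when a new advisory appears without rescanning the image.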
SLSA Provenance
SLSA (Supply-chain Levels for Software Artifacts) provenance is signed, verifiable evidence of how an artifact was built. A provenance statement includes builder identity (which CI/CD system), source code version (exact git commit), build command (what was executed), build timestamps, and build environment. This supports incident response: when a build system or dependency is compromised, provenance quickly shows which artifacts were affected.
VEX (Vulnerability Exploitability Exchange)
When an image contains a vulnerable package, you might determine that the image is not affected (the vulnerable code is never executed), affected (the vulnerability applies and action is needed), fixed (patched), or under investigation. A VEX document records these decisions, reducing false positives and noise.
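Applying VEX decisions to scanner output is a filtering step. The sketch below uses a simplified statement structure with the status names common to VEX formats (not_affected, affected, fixed, under_investigation); the structure and the CVE identifiers are assumptions for illustration, not a complete VEX document.

```python
# Statuses that mean a finding no longer needs attention.
SUPPRESSING = {"not_affected", "fixed"}

def apply_vex(findings: list[str], vex_statements: list[dict]) -> list[str]:
    """Drop findings that a VEX statement marks as not affected or fixed."""
    status_by_cve = {s["vulnerability"]: s["status"] for s in vex_statements}
    # Findings without a statement stay in the list: no decision means no suppression.
    return [cve for cve in findings if status_by_cve.get(cve) not in SUPPRESSING]

FINDINGS = ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"]
VEX = [
    {"vulnerability": "CVE-0000-0001", "status": "not_affected",
     "justification": "vulnerable_code_not_in_execute_path"},
    {"vulnerability": "CVE-0000-0002", "status": "under_investigation"},
]

print(apply_vex(FINDINGS, VEX))
```

Only an explicit not_affected or fixed statement suppresses a finding, so anything still under investigation keeps showing up in reports.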
The Defense-in-Depth Model
No single control prevents all attacks. Effective security layers multiple controls. Build-Time establishes the foundation with minimal base images, multi-stage builds, no hardcoded secrets, pinned versions, and signed images. Registry-Time scans on push, verifies signatures, performs continuous scanning, and enforces image policies. Runtime applies non-root users, read-only filesystems, dropped capabilities, seccomp profiles, network policies, and secrets injection. Monitoring tracks behavior with tools like Falco, maintains audit logs, alerts on anomalies, and executes incident response.
An attacker must bypass multiple layers. If they compromise the application, they hit non-root restrictions, read-only filesystem, dropped capabilities, and network policies. Each layer adds friction and defense depth.
Common Mistakes and How to Avoid Them
Mistake 1: Running as root. Fix: Always set USER in Dockerfile and runAsNonRoot: true in Kubernetes.
Mistake 2: Writable filesystem everywhere. Fix: Default to readOnlyRootFilesystem: true with explicit writable volumes.
Mistake 3: Hardcoded secrets. Fix: Use external vaults.
Mistake 4: Floating tags. Fix: Pin specific versions or digests.
Mistake 5: No image scanning in CI/CD. Fix: Scan every image, fail on critical vulnerabilities.
Mistake 6: Trusting all registries equally. Fix: Use image policies and admission controllers.
Mistake 7: No network policies. Fix: Default-deny, allow specific flows.
Mistake 8: Large base images. Fix: Use Alpine, distroless, or scratch.
Mistake 9: No secret rotation. Fix: Rotate monthly, use vaults supporting auto-rotation.
Mistake 10: Ignoring CVE scanning post-deployment. Fix: Continuously scan running images, treat new CVEs in old images as urgent.
Compliance Alignment
Container security best practices align with major compliance frameworks. NIST SP 800-190 relates container controls to NIST SP 800-53: image scanning to flaw remediation (SI-2), non-root to least privilege (AC-6), and network policies to information system connections (CA-3). CIS Docker Benchmark covers build-time, runtime, and orchestration with 40+ recommendations. PCI-DSS requires patch management (Requirement 6.2) and secure configuration (Requirement 2.2). FedRAMP requires continuous monitoring (CA-7) and vulnerability management (SI-2).
Quick Reference Checklist
Build-time: Minimal base image used, multi-stage Dockerfile for compiled languages, no hardcoded secrets, dependencies pinned to specific versions, images signed.
Registry-time: Scanning on push, admission policies enforce signature verification, continuous registry scanning enabled, only approved registries allowed.
Runtime: Containers run as non-root, root filesystem is read-only, capabilities dropped, seccomp profile applied, network policies restrict traffic, secrets fetched from vault at runtime.
Supply chain: SBOM generated and stored, provenance information captured, VEX documents for non-exploitable CVEs.
Monitoring: Runtime behavior monitoring enabled, audit logs collected, alerts configured for suspicious behavior.
