The Fundamental Asymmetry: What Lives Where
Containers are not operating systems. They are processes running on a shared operating system. This simple distinction determines every security property of containerization. Understanding what a container can and cannot do requires understanding the boundary between container scope and kernel scope.
Container scope includes everything the application sees and controls. This includes the application code and libraries it depends on, environment variables and configuration passed to the container, the working directory and application data the container creates, virtual network interfaces and routing rules within the container's namespace, mounted volumes and filesystems visible to the container, process IDs as numbered within the container's namespace, and file descriptors open to the application.
Kernel scope includes everything the kernel manages for the entire host system. This includes the actual kernel code and data structures, system calls (syscalls) which form the interface between user space and kernel space, physical memory management and page tables, device drivers and hardware access, real process IDs and process scheduling, real network interfaces and routing tables, physical storage and block devices, and interrupt handlers and system events.
The critical distinction that defines container security is that a container cannot directly access anything in kernel scope. Every interaction with kernel-scope resources must go through syscalls, which the kernel validates and controls. This is the security model: the kernel is the gatekeeper, validating every request from the container.
The Syscall Boundary: Container's Gateway to the Kernel
A syscall is a controlled call from user space (where container processes run) into kernel space (where the kernel runs). The syscall interface is the only way a container process can accomplish anything with system resources: allocate memory through brk or mmap; create processes via clone or fork and run programs via execve; create network connections with socket and connect; read or write files through open, read, and write; create namespaces with unshare or join existing ones with setns; or set resource limits with setrlimit. Every action a container takes eventually becomes a syscall.
When a container process executes a syscall like write(fd, data, len), a choreographed sequence occurs. The user-space process makes the syscall request, which triggers a transition from user mode to kernel mode. On entry, any seccomp filter attached to the process decides whether this syscall is allowed at all. The kernel then validates the request: it looks up the file descriptor in the process's own descriptor table, checks that the descriptor was opened for writing, and consults mandatory access control hooks such as AppArmor rules. Capability checks apply when the operation is privileged, and cgroup controllers account for the resources the operation consumes, such as memory or I/O. Based on these checks, the kernel either executes the operation or denies it with an error. Finally, the result is returned to user space and execution returns to the container process.
The container's process never directly manipulates kernel data structures or accesses hardware directly. Everything is mediated by syscalls and kernel security checks. This mediation is the entire basis of container security.
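The security model is: the kernel enforces isolation at the syscall boundary. Every security mechanism (namespaces, cgroups, capabilities, seccomp, AppArmor) is implemented inside the kernel and takes effect while a syscall is being handled. Seccomp filters run on every syscall; namespaces, capabilities, cgroups, and AppArmor rules apply wherever the requested operation touches the resources they govern. If every syscall from a container is handled correctly, the container remains isolated.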
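The mediation described in this section can be observed from user space. The following is a minimal Python sketch, not part of any container runtime; the helper name attempt_write is an assumption made for illustration. It issues write(2) through os.write against both ends of a pipe. The kernel accepts the request on the writable end and rejects it on the read-only end, and the process only ever sees the returned result or error code.

```python
import errno
import os


def attempt_write(fd: int, data: bytes) -> str:
    """Issue a write(2) syscall and report how the kernel answered."""
    try:
        written = os.write(fd, data)  # user mode -> kernel mode -> back
        return f"ok:{written}"
    except OSError as exc:
        # The kernel refused the request; map errno to its symbolic name.
        return errno.errorcode.get(exc.errno, f"errno {exc.errno}")


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    try:
        print(attempt_write(write_end, b"hello"))  # writable end: accepted
        print(attempt_write(read_end, b"hello"))   # read-only end: rejected
    finally:
        os.close(read_end)
        os.close(write_end)
```

The second call fails with EBADF because the descriptor was not opened for writing. A seccomp or capability denial would surface the same way, as an error code returned from the syscall.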
What a Container CAN Do (Within Its Scope)
A container can perform many operations within its scope. It can read and write its own memory freely: application memory belongs to the container's processes, and the application can read and write its own data structures without restriction. It can manipulate files in mounted volumes; if the container has a volume mounted at /data, it can read and write files within /data, subject to file permissions and any storage or I/O limits placed on it. This is intentional design: volumes are meant to be accessible to the container that needs them.
A container can fork and execute child processes. These children share the container's namespaces, so they see the same PID 1, the same network namespace, and the same filesystem root. The container's init process (PID 1 inside its PID namespace) is the ancestor of every process in the container and adopts any that are orphaned.
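One way to check which namespaces a process belongs to is to read the symlinks under /proc/<pid>/ns. The sketch below is illustrative only: the function names are hypothetical, and it assumes a Linux /proc filesystem (it returns an empty result elsewhere). Two processes share a namespace when the link targets are identical.

```python
import os


def namespace_ids(pid: str = "self") -> dict:
    """Map namespace type (e.g. 'pid', 'net', 'mnt') to its identifier.

    Each entry under /proc/<pid>/ns is a symlink whose target looks like
    'pid:[4026531836]'. Returns {} where /proc is missing or unreadable.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            ids[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        return {}
    return ids


def shares_all_namespaces(pid_a: str, pid_b: str) -> bool:
    """True only when both processes are readable and report the same ids."""
    a, b = namespace_ids(pid_a), namespace_ids(pid_b)
    return bool(a) and a == b


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:20s} {ident}")
```

Run inside a container, comparing a parent and a child PID with shares_all_namespaces should report True. Comparing against a process from a different container is normally impossible from the inside, because that PID is not visible there.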
A container can use file descriptors that were opened for it. If the container inherited file descriptors from the runtime (for example, a file descriptor pointing to a log aggregation socket), it can write to those file descriptors. A container can listen on network ports inside its own network namespace. Whether a port is reachable from outside depends on what the runtime publishes (for example, host ports in the range 8000-8999 mapped to ports inside the container).
A container can access mounted special filesystems. It can see /proc (within its own namespace, not the host's), /sys (limited access), and /dev (limited to assigned devices, not all devices). A container can use capabilities that were granted to it. If the container was granted CAP_NET_RAW, it can create raw sockets for network operations. If CAP_CHOWN was granted, it can change file ownership. A container can see which cgroup it belongs to by reading /proc/self/cgroup, and on cgroup v2 it can read its limits from files such as /sys/fs/cgroup/memory.max.
All of these actions are confined by namespaces, cgroups, capabilities, and seccomp filters. The container experiences the illusion of being a complete system—it has a root filesystem, processes, network interfaces, and the ability to make system calls. But this is an illusion; every aspect is constrained and mediated by the kernel.
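As a rough illustration of how a process can inspect its own cgroup placement and memory limit, the sketch below parses the two kernel-provided files involved. It assumes cgroup v2 mounted at /sys/fs/cgroup, which is common but not universal; the function and class names are hypothetical.

```python
from pathlib import Path
from typing import List, NamedTuple, Optional


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse /proc/<pid>/cgroup lines of the form 'ID:controllers:path'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


def parse_memory_max(text: str) -> Optional[int]:
    """Return the limit in bytes, or None when the file says 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_own_memory_limit(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Best-effort read of memory.max; None if unlimited or unreadable."""
    try:
        return parse_memory_max(Path(root, "memory.max").read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    try:
        for entry in parse_proc_cgroup(Path("/proc/self/cgroup").read_text()):
            print(entry)
    except OSError:
        print("no /proc/self/cgroup on this system")
    print("memory limit (bytes):", read_own_memory_limit())
```

Reading these files is itself mediated by the kernel: the container sees only the view of the cgroup hierarchy that its namespaces expose.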
What a Container CANNOT Do (Blocked by Scope)
A container cannot:
- Access the host's root filesystem: The container's / is not the host's /. Its root filesystem comes from the image, overlaid with container-specific mounts. The container cannot read the host's /etc/shadow; it sees only the /etc/shadow shipped in its image.
- See other containers' filesystems: Each container has its own mount namespace, so it cannot browse other containers' data, and without CAP_SYS_ADMIN it cannot mount the host filesystem to reach that data.
- See other containers' processes: Each container has its own PID namespace. ps aux inside the container shows only processes within that namespace. Other containers' processes are invisible.
- Access the host's network interfaces directly: The container's network namespace has virtual interfaces (veth pairs) connected to host bridges. The container cannot access the host's physical eth0 or wlan0, and it cannot sniff traffic on the host network unless it is explicitly started in host network mode.
- Create system-wide resources: The container cannot load kernel modules, configure the host's iptables rules, bind to the host's interfaces, or make changes visible to processes outside its namespaces.
- Directly access hardware: The container has no access to /dev/mem, /dev/kmem, or raw hardware. Device access is limited to a small default set such as /dev/null, /dev/zero, and /dev/urandom, plus explicitly allowed devices.
- Escape its cgroup limits: If the cgroup sets memory.max to 268435456 bytes (256 MiB), the container cannot use more than 256 MiB of RAM. The kernel enforces this by reclaiming memory and, if that is not enough, killing a process in the cgroup. The container cannot negotiate or exceed its limits.
- Use capabilities it wasn't granted: If CAP_SYS_ADMIN was dropped, the container cannot mount filesystems, even if the application tries.
- Execute syscalls blocked by seccomp: If the seccomp profile blocks mount, the container cannot execute that syscall. The call fails, typically with EPERM (operation not permitted).
This intentional limitation is the entire point of containerization: the container is a sandbox.
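To make the seccomp item concrete, here is a hedged sketch that generates a profile in the JSON layout Docker accepts via --security-opt seccomp=<file>. The function name and the choice of blocked syscalls are assumptions for illustration. Note that this is a denylist on top of a default-allow policy, which is weaker than the allowlist approach used by typical default profiles.

```python
import json
from typing import Iterable


def build_denylist_profile(blocked: Iterable[str]) -> dict:
    """Build a seccomp profile that allows everything except `blocked`."""
    names = sorted(set(blocked))  # de-duplicate and keep output stable
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": names,
                # Fail the call with an error instead of executing it.
                "action": "SCMP_ACT_ERRNO",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical selection; a real profile needs workload-specific review.
    profile = build_denylist_profile(["mount", "umount2", "ptrace"])
    print(json.dumps(profile, indent=2))
```

With such a profile applied, a mount attempt from inside the container never reaches the filesystem code: the filter rejects it at syscall entry.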
Container Escape: Breaking the Boundary
A container escape occurs when an attacker breaks out of container scope and gains access to kernel scope (or beyond).
Container escapes are rare but have happened. They are almost always due to vulnerabilities in:
- The kernel itself
- The container runtime (runc, containerd, etc.)
- The hypervisor (in the case of VM-based runtimes)
Real Container Escape Examples
CVE-2019-5736 (runc escape)
Vulnerability: runc mishandled file descriptors that referred to its own executable. A malicious container could obtain a handle to the host's runc binary through /proc/self/exe and overwrite it.
Impact: A container could modify the host's runc binary, so that the next runc invocation executed attacker-controlled code as root on the host.
Scope breach: Container scope → host runtime binary (runc) → host scope (root access).
Fixed in: runc v1.0.0-rc7 (the patch was published in February 2019).
CVE-2021-4034 (PwnKit)
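Vulnerability: Out-of-bounds memory access in /usr/bin/pkexec, a setuid-root binary shipped with polkit, triggered when it is invoked with an empty argument list. A local process could exploit it to escalate to root.
Impact: Exploit code running wherever a vulnerable pkexec is installed gains root. Inside a container that means root within the container, which removes the non-root layer of defense; run directly on a host, it means root on the host.
Scope breach: Container scope → user-level code execution → setuid-root binary (pkexec) → root.
Fixed in: vendor patches released in January 2022; polkit 0.120 and earlier are affected.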
Note: This is not a container-specific vulnerability. Any local process could exploit it. But it demonstrates that vulnerabilities outside the kernel can also break containerization.
CVE-2022-0185 (integer overflow in fsconfig)
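Vulnerability: An integer arithmetic bug (an underflow in a length check) in the kernel's legacy filesystem-context parameter parsing, reachable through the fsconfig syscall. It leads to a heap buffer overflow and privilege escalation to root. The attacker needs CAP_SYS_ADMIN in some user namespace, which an unprivileged user can obtain wherever unprivileged user namespaces are enabled.
Impact: Any container able to create a user namespace and call fsconfig could escalate to root on the host.
Scope breach: Container scope → kernel syscall → kernel code execution → root.
Fixed in: Linux kernel 5.16.2 and the corresponding stable updates (released January 2022).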
CVE-2022-0847 (Dirty Pipe)
Vulnerability: A page cache bug allowed writing to read-only files via splice and pipes.
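Impact: A container could overwrite the cached contents of files it is only allowed to read, such as binaries from its image or files mounted read-only from the host, and then have the modified code executed.
Scope breach: Container scope → modify a read-only binary → execute modified binary → (potentially break out).
Fixed in: Linux kernel 5.16.11, 5.15.25, and 5.10.102 (released February 2022).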
Patterns in Container Escapes
Most container escapes follow these patterns:
- Kernel vulnerability: A flaw in a syscall implementation allows privilege escalation or arbitrary memory access.
- Runtime vulnerability: A flaw in the container runtime (runc, containerd) allows the container to break out of namespaces.
- Privileged mode misuse: A container running with privileged: true has CAP_SYS_ADMIN and can mount the host filesystem, potentially breaking out.
- Shared kernel risk: A kernel vulnerability affects all containers on the host. Single container compromise can affect all containers.
The common thread: breaking the syscall validation boundary or exploiting privileged configuration.
The Shared Kernel Problem: Why Container Compromise Can Cascade
Because all containers share one kernel, a vulnerability in the kernel affects all containers on the host.
Scenario: Host is running 50 microservice containers, each isolated in their own namespaces.
Attack: One container is compromised. The attacker executes code with UID 1000 (non-root).
Limitation: UID 1000 cannot read other containers' data (different namespace isolation) or compromise the host directly (non-root).
Kernel vulnerability: CVE-2022-0185 allows any user to escalate to root via fsconfig.
Exploitation: The attacker calls fsconfig in the vulnerable way, exploits the overflow, and escalates to root.
Result: The attacker is now root on the host. All 50 containers are compromised. The host's data is accessible.
This cascade is the fundamental risk of shared-kernel containerization. One container escape can compromise the entire host and all other containers.
Mitigation strategies:
- Keep the kernel patched: Security updates are critical. A kernel vulnerability affects every container.
- Run untrusted code in isolated runtimes: gVisor handles syscalls in a user-space kernel, and Kata Containers run each workload in a lightweight VM with its own guest kernel. A kernel exploit inside the sandbox does not directly reach the host kernel.
- Use non-root containers: Running as non-root removes privileges that many exploits depend on and limits the damage from bugs outside the kernel. A kernel privilege escalation can still yield root, so this reduces risk rather than eliminating it.
- Minimize capabilities: Without CAP_SYS_ADMIN or similar, exploits that require specific privileges may not work.
- Use seccomp filters: Block dangerous syscalls like mount, ptrace, etc. An exploit requiring those syscalls fails.
- Separate high-risk workloads: Run untrusted or high-risk code on dedicated nodes with a different security posture.
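The "keep the kernel patched" item can be partly automated. The sketch below compares an upstream kernel release string against per-branch fix versions. It is a rough heuristic under stated assumptions: the function names are hypothetical, the Dirty Pipe data (introduced in 5.8; fixed in 5.10.102, 5.15.25, 5.16.11) comes from the public advisory, and distribution kernels often backport fixes without changing the upstream version number, so a vendor advisory remains the authoritative source.

```python
import platform
import re
from typing import Dict, Optional, Tuple


def parse_release(release: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from strings like '5.15.25-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def looks_vulnerable(release: str, introduced: Tuple[int, int],
                     fixed: Dict[Tuple[int, int], int]) -> Optional[bool]:
    """Compare an upstream version against per-branch fix releases.

    Returns None when the release string cannot be parsed.
    """
    version = parse_release(release)
    if version is None:
        return None
    branch, patch = version[:2], version[2]
    if branch < introduced:
        return False  # the bug did not exist yet
    if branch in fixed:
        return patch < fixed[branch]
    # No fix listed for this branch: anything newer than every fixed branch
    # inherits the fix; anything older is treated as an unpatched branch.
    return branch < max(fixed)


# Assumption: Dirty Pipe (CVE-2022-0847) data taken from the public advisory.
DIRTY_PIPE_INTRODUCED = (5, 8)
DIRTY_PIPE_FIXED = {(5, 10): 102, (5, 15): 25, (5, 16): 11}

if __name__ == "__main__":
    result = looks_vulnerable(platform.release(),
                              DIRTY_PIPE_INTRODUCED, DIRTY_PIPE_FIXED)
    print(platform.release(), "-> looks vulnerable:", result)
```

Because every container on a node shares that node's kernel, one check per node covers all of its containers.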
Privileged Containers: Intentional Scope Expansion
A container running with privileged: true (in Docker/Kubernetes) gains access to most kernel scope resources.

    apiVersion: v1
    kind: Pod
    metadata:
      name: privileged-app
    spec:
      containers:
      - name: privileged-app
        image: myapp:latest
        securityContext:
          privileged: true

Privileged containers have all capabilities, including CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_MODULE. They can see all devices, such as /dev/sda, /dev/mem, and /dev/kvm. They can mount filesystems, including the host root filesystem. They can load kernel modules and access physical memory directly.
Why this matters: A privileged container can:

    # Mount the host root filesystem
    mkdir -p /host-root
    mount /dev/sda1 /host-root

    # Read the host's /etc/shadow
    cat /host-root/etc/shadow

    # Modify the host's configuration
    echo "backdoor-user:x:0:0::/root:/bin/bash" >> /host-root/etc/passwd

    # Load a malicious kernel module
    insmod malicious-module.ko

Privileged containers are extremely dangerous and should be used only in specific circumstances. Acceptable use cases include Kubernetes DaemonSets that need to modify the host (such as CNI plugins and node agents), container runtimes themselves (such as Docker-in-Docker), and system administration utilities.
Never run application code in privileged containers.
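A quick way to see whether a process holds the capabilities that make privileged containers dangerous is to decode the CapEff bitmask in /proc/<pid>/status. The sketch below is illustrative: the function names are hypothetical, and the table lists only the capabilities mentioned in this article, with bit positions taken from linux/capability.h.

```python
from typing import Dict, Set

# Bit positions from linux/capability.h (subset used in this article).
CAP_BITS: Dict[str, int] = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
}

HIGH_RISK = {"CAP_SYS_ADMIN", "CAP_SYS_MODULE", "CAP_NET_ADMIN"}


def effective_caps(status_text: str) -> Set[str]:
    """Decode the CapEff line of /proc/<pid>/status into capability names."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return {name for name, bit in CAP_BITS.items() if (mask >> bit) & 1}
    return set()


def high_risk_caps(status_text: str) -> Set[str]:
    """Return the held capabilities that allow host-level changes."""
    return effective_caps(status_text) & HIGH_RISK


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as fh:
            print(sorted(high_risk_caps(fh.read())))
    except OSError:
        print("no /proc/self/status on this system")
```

An empty result inside an application container is the expected outcome; a non-empty one suggests the container is privileged or was granted more capabilities than it likely needs.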
Defense in Depth: Multiple Layers of Boundaries
Because no single boundary is perfect, containers use multiple overlapping boundaries:
Layer 1: Namespaces isolate resources including process IDs, network, mount, IPC, UTS, user, and cgroup. This prevents one container from seeing or accessing another container's resources.
Layer 2: Cgroups enforce resource limits, prevent resource exhaustion attacks, and prevent one container from starving others of available resources.
Layer 3: Capabilities restrict privileged operations, prevent privilege escalation attempts, and minimize the surface area for kernel vulnerabilities.
Layer 4: Seccomp blocks dangerous syscalls, prevents direct kernel module manipulation, and prevents filesystem mounting.
Layer 5: AppArmor/SELinux enforce mandatory access control, prevent file access violations, and restrict network operations.
Layer 6: Read-only root filesystem prevents persistent modification of container images, forces configuration via environment variables or volumes, and reduces the persistence of compromises.
Layer 7: Non-root user means an attacker who gains code execution starts without root privileges and must find a separate privilege escalation to go further, which reduces the blast radius of an attack.
Layer 8: Resource limits prevent denial-of-service attacks and prevent resource-exhaustion-based escape attempts.
If one layer is breached, others remain. An attacker must breach multiple layers to fully compromise a container.
Practical Security Boundaries
Container A (standard microservice) runs as a non-root user with dropped capabilities, seccomp enabled, and a read-only root filesystem. It uses mounted volumes for configuration. While such a container can be compromised, the compromise remains contained and an attacker gains only limited access to mounted volumes.
Container B (privileged DaemonSet) requires full host access for legitimate purposes, meaning any compromise results in full host compromise. The code surface should be minimized with a small image and minimal attack surface, and only trusted code should run in this context.
Container C (untrusted/multi-tenant) runs in gVisor or Kata for additional isolation. Even if the container is fully compromised, the VM or sandbox boundary remains intact. This approach has higher cost but provides maximum isolation.
Kubernetes Security Context: Implementing Scope Boundaries
Kubernetes securityContext allows you to configure scope boundaries:

    apiVersion: v1
    kind: Pod
    metadata:
      name: bounded-container
    spec:
      containers:
      - name: app
        image: myapp:latest
        securityContext:
          # User scope
          runAsNonRoot: true
          runAsUser: 1000
          # Capability scope
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
          # Filesystem scope
          readOnlyRootFilesystem: true
          # Privilege escalation scope
          allowPrivilegeEscalation: false

Each setting restricts scope in important ways. The runAsNonRoot: true setting restricts the container to user scope rather than root. The drop: ALL setting restricts the container to minimal capabilities. The readOnlyRootFilesystem: true setting restricts filesystem modification. The allowPrivilegeEscalation: false setting restricts privilege escalation attempts.
These settings do not prevent vulnerabilities but reduce what an attacker can do if they gain code execution.
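The settings discussed in this section lend themselves to a simple automated check. The following sketch audits a container-level securityContext that has already been loaded into a Python dict (for example, from a parsed manifest). The function name and the wording of the findings are assumptions for illustration, and the check covers only the fields discussed here.

```python
from typing import Any, Dict, List


def audit_security_context(ctx: Dict[str, Any]) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    if ctx.get("privileged") is True:
        findings.append("privileged is true")
    if ctx.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not false")
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem is not true")
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        findings.append("capabilities.drop does not include ALL")
    return findings


if __name__ == "__main__":
    bounded = {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
    }
    print(audit_security_context(bounded))
    print(audit_security_context({"privileged": True}))
```

The checks treat a missing field as a finding because the defaults for these fields are the permissive values. A check like this could run in CI before manifests are applied, as one layer alongside cluster-side policy enforcement.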
The Shared Kernel As a Feature and Risk
The shared kernel is why containerization is lightweight and fast: no kernel boot, no hardware emulation, no per-VM memory overhead.
But it is also why kernel vulnerabilities are catastrophic: one vulnerability affects all containers.
Trade-off matrix:
| Approach | Kernel Isolation | Performance | Density | Overhead |
|---|---|---|---|---|
| Standard containers | Shared | Excellent | High | Minimal |
| gVisor | Sandboxed | Good | High | 5-10% CPU |
| Kata Containers | Separate VM | Moderate | Low | 100-500 MB RAM |
The choice depends on risk tolerance. Standard containers trust the kernel and optimize for performance. gVisor assumes untrusted code and accepts a performance cost. Kata Containers provide maximum isolation and accept higher overhead.
Threat Models and Scope Boundaries
Different threat models require different scope boundaries:
Threat Model 1: Compromised application assumes your code has a vulnerability. The scope needed is to isolate the compromised container from other containers and the host. Mitigation includes namespaces, non-root users, dropped capabilities, and seccomp. The acceptable risk level is that an attacker can read the application's own data.
Threat Model 2: Malicious dependency assumes a third-party library contains malicious code. The scope needed is to prevent the library from accessing the host or other containers. Mitigation includes all of the above measures plus read-only root filesystem. The acceptable risk is that an attacker cannot modify the running system.
Threat Model 3: Untrusted tenant code assumes you are running code from an untrusted source. The scope needed is complete isolation from the host kernel and other tenants. Mitigation uses gVisor or Kata containers with separate kernels. The acceptable risk is that the container can do anything it wants internally, but cannot escape.
Threat Model 4: Supply chain attack assumes the base image is compromised. The scope needed is to detect and prevent modifications. Mitigation includes image signing, admission controllers, and runtime scanning. The acceptable risk is detecting the compromise before it runs in production.
Choose your scope boundaries based on threat model.
The Security Posture: Assume Container Scope, Not Kernel Scope
The fundamental security principle for containerization:
Assume a container can be compromised. Assume an attacker can execute arbitrary code within the container scope. Design your system so that compromise of one container does not compromise the entire system.
This design principle means you should never trust inter-container communication without encryption. Never assume one container cannot read another's volumes. Never assume container credentials are safe even with namespace isolation. Always assume container compromise is possible. Design multi-tenancy with container isolation as one layer among many, not as the only layer of defense.
The scope boundary is real and enforced by the kernel, but it is not a cryptographic guarantee. It is a kernel policy that can be violated by bugs.
Next Steps: Learn about container runtime options and their security trade-offs. See "Container Runtimes Explained" and "How Containers Interact with the Linux Kernel."
