How Containers Interact with the Linux Kernel: Namespaces, Cgroups, and Capabilities

Containers Are Not Virtual Machines: They Are Isolated Processes

This distinction is foundational to understanding container security and threat modeling correctly. A virtual machine runs its own kernel, bootloader, and filesystem. It is isolated from the host at the hypervisor level, which is a separate software layer that mediates all interaction with the underlying hardware. A VM cannot directly interact with the host kernel because the hypervisor stands between them as a security boundary. A container is fundamentally different in its architecture: it is a process running on the host kernel. The container process shares the kernel with every other process on the host, including other containers. The only difference between a container and a non-containerized process is that the container process runs inside isolated namespaces that logically separate its resources.

This architecture distinction matters profoundly for security: a kernel vulnerability affects every container on the host. There is no hypervisor boundary to contain the damage. A container process can trigger a kernel exploit, and the attacker gains host kernel access, potentially compromising all containers and the host itself.

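A container shares the host kernel, while each VM runs its own. Userspace privilege-escalation bugs behave the same way inside a container as outside it: CVE-2021-4034 (PwnKit) allowed any local user to escalate to root through a memory-corruption bug in the setuid binary /usr/bin/pkexec. Inside a container whose image ships a vulnerable pkexec, the exploit works identically and yields root inside the container. That is not root on the host by itself, but a root process in a container is a much stronger starting point for attacking the shared kernel.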

The benefit of containers is resource efficiency and startup speed: they don't need to boot a kernel, allocate fixed memory, or emulate hardware. The cost is shared kernel risk.

Understanding this is essential: containers provide process-level isolation, not kernel-level isolation. Defend accordingly.

Linux Namespaces: The Isolation Mechanism

Linux namespaces are the kernel's core mechanism for isolating groups of processes from each other. A namespace is a logical partition of one kind of kernel resource, and every process belongs to exactly one namespace of each type. Processes in different namespaces cannot see or interact with each other's resources of that type. Linux defines eight namespace types (mount, PID, network, UTS, IPC, user, cgroup, and time), and container runtimes use most of them in a typical deployment to build the container's isolated environment.
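The namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, and two processes share a namespace exactly when the corresponding links carry the same inode number. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are illustrative and not part of any library.

```python
import os
import re

# A link target looks like "pid:[4026531836]": namespace type, then inode.
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns symlink target into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self"):
    """Return {entry name: inode} for every namespace link of a process."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, inode in list_namespaces().items():
        print(f"{name:20} {inode}")
```

Running this once on the host and once inside a container, then comparing the inode numbers, shows which namespaces the container actually has to itself.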

PID Namespace: Process Isolation

The PID namespace isolates process IDs, creating a separate numbering scheme for containers. In the host system, process IDs are global: PID 1 is the init system, and subsequent PIDs are assigned sequentially. Inside a container's PID namespace, the process ID numbering is completely local to that container. The container sees its own PID 1 as the main application and subsequent PIDs for additional child processes, even though these processes have completely different host-level PIDs.

To illustrate this concretely, consider a host PID namespace containing PID 1 (systemd init), PID 234 (kubelet), and PID 567 (containerd). A container started by containerd gets its own PID namespace, in which the application runs as PID 1 (which corresponds to host PID 891 at the kernel level) and a child process runs as PID 2 (host PID 892). Critically, the application cannot see or interact with host PIDs 234, 567, or any other host process.

This matters for container security. A container cannot send signals to or inspect host processes because it cannot perceive them; it sees only its own isolated process tree. This isolation prevents a container from directly interfering with other containers or the host system through process manipulation. However, if the container is configured to share the host's PID namespace (hostPID: true in a Kubernetes pod spec, or --pid=host in Docker), the ps aux command inside the container reveals all host processes. An attacker with access to the container could identify and potentially target sensitive processes running on the host system. This is why containers should run in their own PID namespace unless there is a specific, reviewed reason to share the host's.
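The mapping between host PIDs and container PIDs described above can be observed from the host: the NSpid line of /proc/<pid>/status lists a process's PID in each nested PID namespace, outermost first. A small sketch of reading it follows; the sample text reuses the PIDs from the example above, and the process name is made up.

```python
def nspid_chain(status_text):
    """Return the PIDs from the NSpid line of /proc/<pid>/status.

    The first value is the PID as seen from the namespace of the procfs
    being read; each later value is the PID one namespace further in.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Hypothetical excerpt, as read on the host for a containerized process.
SAMPLE_STATUS = "Name:\tapp\nPid:\t891\nNSpid:\t891\t1\n"

host_pid, *inner_pids = nspid_chain(SAMPLE_STATUS)
print(f"host PID {host_pid} is PID {inner_pids[-1]} inside the container")
```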

Network Namespace: Network Isolation

The network namespace isolates network interfaces, routing tables, and firewall rules. Each container sees a virtual network interface rather than having direct access to the host's physical network. The host network namespace contains the physical eth0 interface and the docker0 bridge. Each container has its own network namespace with a virtual eth0 interface, its own routing table, and no direct access to the host's physical eth0 interface.

Network isolation is built from ordinary kernel networking primitives. The container's interface is one end of a virtual Ethernet (veth) pair whose other end is attached to a bridge on the host. When the container sends packets, they cross the veth pair to the bridge, which forwards them according to the host's networking configuration. The container sees only its own virtual interface and its own routing rules, preventing visibility into the host's networking infrastructure or other containers' networking.

This isolation keeps containers out of the host's network stack. Containers cannot sniff traffic on the host's interfaces because those interfaces do not exist in their namespace. They cannot see or redirect traffic meant for other containers or the host, and they cannot bind to ports on the host's interfaces, which prevents port hijacking. However, if the container shares the host's network namespace (hostNetwork: true in Kubernetes, or --network=host in Docker), these boundaries disappear: the container uses the host's entire network stack, including its interfaces, listening ports, and routing tables.

Mount Namespace: Filesystem Isolation

The mount namespace isolates mounted filesystems and provides each container with its own root filesystem hierarchy. Each container has its own root filesystem and cannot see the host's filesystem or other containers' filesystems. The host mount namespace contains the root filesystem with host-specific files. The container's mount namespace has its own completely separate root filesystem sourced from the container image.

This filesystem isolation is implemented through a sophisticated layering system. The container's filesystem is constructed from image layers (typically using copy-on-write mechanisms) plus any runtime overlays. The kernel creates new mount points that are visible only within the container's namespace, preventing the container from accessing or perceiving mount points in the host namespace. This is fundamental to container security: a container cannot read the host's /etc/passwd, cannot access /root/.ssh/ containing administrative credentials, and cannot see any other sensitive files on the host system.

However, this isolation can be breached under certain conditions. If host directories are mounted into the container (for example through hostPath volumes), or if the container is run with privileged: true, the security model collapses. A privileged container can access the host's device nodes under /dev and can therefore mount the host's disks, and with them the entire host filesystem, into its own namespace. An attacker with such access can read any host file, exfiltrate secrets, modify system configuration, or compromise the entire system.
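The container's own root filesystem shows up in /proc/self/mountinfo, where each line describes one mount visible in the current mount namespace. The sketch below finds the entry mounted at / and reports its filesystem type; inside a typical container this is an overlay filesystem built from the image layers. The sample lines are hypothetical, and the function name is illustrative.

```python
def root_mount(mountinfo_text):
    """Return details of the mount at "/" from /proc/<pid>/mountinfo text.

    Each line has the form:
      id parent major:minor root mount_point options [optional...] - fstype source super_options
    """
    for line in mountinfo_text.splitlines():
        left, separator, right = line.partition(" - ")
        if not separator:
            continue
        fields = left.split()
        if len(fields) >= 6 and fields[4] == "/":
            fstype, source = right.split()[:2]
            return {"fstype": fstype, "source": source, "options": fields[5]}
    return None


# Hypothetical excerpt as it might look inside a container.
SAMPLE_MOUNTINFO = (
    "1234 1100 0:55 / / rw,relatime master:1 - overlay overlay "
    "rw,lowerdir=/l,upperdir=/u,workdir=/w\n"
    "1235 1234 0:60 / /proc rw,nosuid - proc proc rw\n"
)

print(root_mount(SAMPLE_MOUNTINFO))
```

Reading the real file from inside a container lists only the mounts in that container's namespace, which is the isolation property this section describes.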

UTS Namespace: Hostname Isolation

The UTS namespace isolates hostname and domain name information. The container sees its own distinct hostname rather than adopting the host's hostname. On the host system, the hostname might be "production-server," while inside the container the hostname is typically the pod name under Kubernetes or the short container ID under Docker, isolating the container's identity from the host. The security impact of UTS namespace isolation is minimal compared to other namespace types, though it helps with clarity and logging by ensuring that log entries carry a hostname that identifies the container they came from.

IPC Namespace: Inter-Process Communication Isolation

The IPC namespace isolates System V IPC objects (shared memory segments, semaphores, and message queues) and POSIX message queues. The container cannot see or access the host's IPC objects. The host IPC namespace might contain shared memory segment 12345 used by host processes. The container's IPC namespace can have its own shared memory segment that is also numbered 12345, but this is completely isolated from the host's segment. The container cannot access the host's segment even though the segment IDs are identical, because the two IDs are resolved in different namespaces.

This isolation ensures that applications using System V shared memory or message queues remain isolated. The IPC objects created by one container cannot interfere with IPC objects in other containers. However, if the container shares the host's IPC namespace (for example, through the --ipc=host flag in Docker or hostIPC: true in Kubernetes), the container can access the host's shared memory and potentially read sensitive data or interfere with host processes.
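In Kubernetes, sharing the host's PID, network, or IPC namespace is requested through the pod spec fields hostPID, hostNetwork, and hostIPC. A reviewer or admission check might flag them with something like the following sketch, which assumes the pod spec has already been loaded into a Python dictionary; the function name and sample spec are hypothetical.

```python
# Pod spec fields that make a pod share a host namespace.
HOST_NAMESPACE_FIELDS = ("hostPID", "hostNetwork", "hostIPC")


def shared_host_namespaces(pod_spec):
    """Return the host-namespace fields that are set to true in a pod spec."""
    return [field for field in HOST_NAMESPACE_FIELDS if pod_spec.get(field) is True]


# Hypothetical spec: shares the host PID namespace but nothing else.
sample_spec = {
    "hostPID": True,
    "hostNetwork": False,
    "containers": [{"name": "app", "image": "myapp:latest"}],
}

for field in shared_host_namespaces(sample_spec):
    print(f"warning: pod sets {field}: true")
```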

Cgroup Namespace: Cgroup Isolation

The cgroup namespace isolates the cgroup hierarchy and prevents containers from seeing the complete cgroup resource management hierarchy of the system. The container sees cgroups only relative to its own cgroup, not the full system hierarchy. In the host cgroup hierarchy, the root cgroup contains subdirectories such as /system.slice and /docker. From the perspective of a container's cgroup namespace, it sees only its own cgroup root and does not perceive other cgroups. This prevents containers from directly modifying other containers' resource limits or even inspecting their cgroup settings.
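The effect of the cgroup namespace is visible in /proc/self/cgroup. On a cgroups v2 system the file has a single line of the form 0::<path>; read from the host, the path of a containerized process is its full location in the hierarchy, while the same file read inside the container's cgroup namespace typically shows just /. A minimal parsing sketch, with made-up sample paths:

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroups v2 path from /proc/<pid>/cgroup text, or None.

    The v2 entry is the line whose hierarchy ID is 0 and whose controller
    list is empty, for example "0::/system.slice/example.scope".
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


# Hypothetical contents: the same process seen from the host and from
# inside its own cgroup namespace.
seen_from_host = "0::/system.slice/docker-abc123.scope\n"
seen_from_container = "0::/\n"

print(cgroup_v2_path(seen_from_host))
print(cgroup_v2_path(seen_from_container))
```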

Time Namespace: Clock Isolation

The time namespace is an advanced feature that lets a container see different values for the monotonic and boot-time clocks (CLOCK_MONOTONIC and CLOCK_BOOTTIME) than the host, expressed as fixed offsets. It does not virtualize the wall clock (CLOCK_REALTIME), so it cannot give a container a different calendar date. This is a rare use case in production systems; it mainly helps when a process is checkpointed and restored on another machine and must not see these clocks jump, and in some testing scenarios.

Cgroups v2: The Resource Control Mechanism

Cgroups (control groups) are the kernel's fundamental mechanism for controlling and limiting resource usage by groups of processes. While namespaces provide logical isolation of resources, cgroups enforce hard limits on resource consumption. A cgroup is essentially a collection of processes, and the kernel enforces resource limits on all processes within a cgroup as a unit.

What Cgroups Control

CPU (cpu controller): The CPU controller manages the CPU time available to a cgroup. A bandwidth limit (cpu.max in cgroups v2) caps how much CPU time the cgroup can use in each scheduling period; for example, 50 ms out of every 100 ms is half a CPU. A weight (cpu.weight in v2, which replaces cpu.shares from v1) sets the cgroup's relative share when several cgroups compete for CPU, implementing proportional CPU allocation.

Memory (memory controller): The memory controller enforces limits on RAM consumption. A hard limit (memory.max) caps the total memory the cgroup can use; when the cgroup reaches it and the kernel cannot reclaim enough memory, the out-of-memory killer is invoked inside the cgroup. A throttling threshold (memory.high) acts as a soft limit: the cgroup may exceed it, but the kernel then reclaims its memory aggressively and slows its allocations. A swap limit (memory.swap.max) controls how much of the cgroup's memory can be swapped to disk.

Block I/O (io controller): The I/O controller manages disk and storage access patterns. Read and write limits can be set per storage device to prevent disk I/O from saturating the host storage. IOPS (input/output operations per second) limits prevent excessive filesystem operation rates.

PIDs (pids controller): The PIDs controller limits the number of tasks (processes and threads) a cgroup can contain. A container whose pids.max is set to 100 can have at most 100 tasks at once, which contains fork bomb attacks where an attacker tries to exhaust the system's process table.

Device Access (devices controller): Device access rules specify which device nodes a cgroup can use. In cgroups v1 this is the devices controller; in cgroups v2 the same policy is enforced by an eBPF program attached to the cgroup. A normal container can typically access only a small set of harmless devices such as /dev/null, /dev/zero, and /dev/urandom, while a privileged container can access all devices including block devices like /dev/sda and special devices like /dev/mem and /dev/kvm.
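On a cgroups v2 host these limits are plain text files in the cgroup's directory under /sys/fs/cgroup. As a rough sketch, the CPU bandwidth limit lives in cpu.max as a quota and a period in microseconds, and the hard memory limit lives in memory.max as a byte count, with the literal word max meaning unlimited in both. The helper names below are illustrative.

```python
def parse_cpu_max(text):
    """Convert cpu.max contents ("<quota> <period>") to a CPU count.

    Returns None when the quota is "max", meaning no CPU limit.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Convert memory.max contents to bytes, or None when unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


# Example values as they might be read from a container's cgroup directory.
print(parse_cpu_max("50000 100000\n"))   # half of one CPU
print(parse_memory_max("536870912\n"))   # 512 MiB in bytes
```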

Cgroup Hierarchy

Cgroups are organized into a hierarchical tree structure. The root cgroup encompasses all processes and contains all available system resources. Child cgroups branch from the root and represent subsets of resources with their own distinct limits. This hierarchical organization enables sophisticated resource management policies. The root cgroup might contain 100% of CPU time and 32GB of memory. This root cgroup contains two main branches: the system cgroup (allocated 30% CPU and 8GB memory) which runs kubelet and system processes, and the user.slice (allocated 60% CPU and 20GB memory) which contains multiple application containers.

Kubernetes leverages this cgroup hierarchy by mapping each pod to a cgroup. When you specify resources.requests and resources.limits in a pod's specification, Kubernetes translates these declarations into actual cgroup configuration settings in the kernel, ensuring that the pod's processes stay within the specified resource constraints.
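To make the translation concrete, the sketch below approximates how a CPU limit in millicores and a memory limit become cgroups v2 values. It assumes the commonly used 100 ms CFS period and a 1 ms minimum quota, and it only understands the binary memory suffixes Ki, Mi, and Gi; the real kubelet handles more cases, so treat the names and constants here as illustrative rather than as the kubelet's implementation.

```python
CPU_PERIOD_US = 100_000   # assumed default scheduling period (100 ms)
MIN_QUOTA_US = 1_000      # assumed minimum quota (1 ms)

MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_limit_to_cpu_max(milli_cpu, period_us=CPU_PERIOD_US):
    """Render a CPU limit in millicores as the contents of cpu.max."""
    if milli_cpu <= 0:
        return f"max {period_us}"          # no limit requested
    quota = max(milli_cpu * period_us // 1000, MIN_QUOTA_US)
    return f"{quota} {period_us}"


def memory_limit_to_bytes(quantity):
    """Convert a quantity such as "256Mi" to the byte count for memory.max."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)                   # plain byte count


# limits: {cpu: "500m", memory: "256Mi"} in a pod spec
print(cpu_limit_to_cpu_max(500))
print(memory_limit_to_bytes("256Mi"))
```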

Cgroup Limits and Container Behavior

When a container hits a cgroup limit, the kernel enforces the restriction with different consequences depending on the resource type. When a CPU limit is exceeded, the container processes are throttled. When a memory limit is exceeded, the kernel invokes the Out-Of-Memory (OOM) killer. When a container attempts to access a device that is not permitted, the syscall fails with a permission denied error. When a PID limit is exceeded, the fork() syscall fails with EAGAIN, preventing the creation of new child processes.
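Whether a container has actually been hitting its limits can be read back from the cgroup's statistics files. On cgroups v2, cpu.stat reports nr_throttled, memory.events reports oom_kill, and pids.events reports max; all three files use one "key value" pair per line. The following sketch parses that format; the sample contents are invented for illustration.

```python
def parse_flat_keyed(text):
    """Parse a cgroups v2 flat-keyed file ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def limit_hits(cpu_stat, memory_events, pids_events):
    """Summarize how often each limit was enforced."""
    return {
        "cpu_throttled_periods": parse_flat_keyed(cpu_stat).get("nr_throttled", 0),
        "oom_kills": parse_flat_keyed(memory_events).get("oom_kill", 0),
        "pid_limit_hits": parse_flat_keyed(pids_events).get("max", 0),
    }


# Hypothetical file contents for one container's cgroup.
CPU_STAT = "usage_usec 812345\nnr_periods 120\nnr_throttled 3\nthrottled_usec 4500\n"
MEMORY_EVENTS = "low 0\nhigh 0\nmax 7\noom 1\noom_kill 1\n"
PIDS_EVENTS = "max 0\n"

print(limit_hits(CPU_STAT, MEMORY_EVENTS, PIDS_EVENTS))
```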

Cgroups v1 vs v2

Linux provides two versions of cgroups. Cgroups v1 uses multiple independent hierarchies, with one hierarchy per resource type. This approach is more complex to manage but provides greater flexibility for advanced use cases. Cgroups v2 consolidates these multiple hierarchies into a single unified hierarchy, simplifying the overall architecture at the cost of slightly less flexibility. Cgroups v2 provides simpler configuration, clearer resource management, and better precision in resource control. Kubernetes has supported cgroups v2 as a stable feature since version 1.25. Cgroups v1 still works, but newer releases have moved it into maintenance mode, so v2 is the recommended choice for modern container deployments.

Check your system:

mount | grep cgroup
# v2: a single line such as "cgroup2 on /sys/fs/cgroup type cgroup2"
# v1: several "cgroup on /sys/fs/cgroup/<controller> type cgroup" lines, one per controller
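The same check can be scripted against /proc/mounts, where each line lists the source, mount point, and filesystem type. The sketch below is one way to classify a host; the "hybrid" label for a system that mounts both versions is this example's own terminology, and the sample lines are abbreviated.

```python
def detect_cgroup_version(mounts_text, root="/sys/fs/cgroup"):
    """Classify a host as "v2", "v1", "hybrid", or "unknown" from /proc/mounts text."""
    has_v1 = False
    has_v2_elsewhere = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        if fstype == "cgroup2" and mount_point == root:
            return "v2"                     # unified hierarchy at the root
        if fstype == "cgroup2":
            has_v2_elsewhere = True         # v2 mounted next to v1 controllers
        elif fstype == "cgroup" and mount_point.startswith(root + "/"):
            has_v1 = True                   # one mount per v1 controller
    if has_v1:
        return "hybrid" if has_v2_elsewhere else "v1"
    return "unknown"


V2_MOUNTS = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0\n"
V1_MOUNTS = (
    "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec 0 0\n"
    "cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,memory 0 0\n"
)

print(detect_cgroup_version(V2_MOUNTS))
print(detect_cgroup_version(V1_MOUNTS))
```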

Linux Capabilities: Fine-Grained Root Privileges

Traditional Unix security treats root (UID 0) as an all-or-nothing proposition: either a process has all privileges, or it has none. Linux capabilities break this binary model and divide root privileges into granular, independent permissions that can be granted or revoked individually. This approach allows the principle of least privilege to be applied effectively—instead of running processes as root and hoping they don't abuse these extensive privileges, capabilities can be dropped selectively, leaving only those permissions necessary for the application's function.

Common Capabilities and Their Risks

CAP_NET_RAW: This capability allows creation of raw sockets, enabling network sniffing. A compromised application with this capability can sniff network traffic and read sensitive data. This capability should be dropped unless the application specifically needs it for functions like network diagnostics or packet capture tools such as tcpdump.

CAP_NET_ADMIN: This capability permits modification of network interfaces, configuration of routing rules, and manipulation of iptables firewall rules. A malicious application with this capability can redirect network traffic, perform denial-of-service attacks, or intercept communications. It should be dropped unless the application is a network service that inherently needs to modify network configuration.

CAP_SYS_ADMIN: This is an extremely powerful capability that allows mounting filesystems and a long list of other administrative operations. A process with CAP_SYS_ADMIN can often escape containerization, for example by mounting the host filesystem into the container. Loading kernel modules is governed by a separate capability, CAP_SYS_MODULE, which is equally dangerous. Both should be dropped unless absolutely necessary.

CAP_DAC_OVERRIDE: This capability allows bypassing file permission checks, effectively granting the ability to read or write files regardless of standard Unix permission bits. A process with this capability can access files it should not have permission to access. It should always be dropped unless the application specifically needs to write to restricted files.

CAP_SYS_PTRACE: This capability allows attaching a debugger to other processes. A compromised application with this capability can inspect and modify other processes' memory, potentially extracting secrets or hijacking other applications. It should be dropped unless the application is a debugger.

CAP_CHOWN: This capability allows changing the owner and group of any file. An application with this capability can take ownership of sensitive files and then change their permissions as the new owner. It should be dropped unless specifically required.

CAP_SETUID / CAP_SETGID: These capabilities allow changing user and group IDs. A process with these capabilities can elevate privileges to other users or escalate to root. They should be dropped unless specifically needed by init systems or shells that must switch user contexts.

CAP_KILL: This capability allows sending signals to processes regardless of which user owns them. An application with this capability can terminate any process visible in its PID namespace, including processes running as other users. It should usually be dropped.

Minimum Capability Set

The reality of modern applications is that most need almost no capabilities to function. The secure approach is to start with all capabilities dropped and then selectively add back only those capabilities that the specific application requires. A Kubernetes pod demonstrating this principle would drop all capabilities then add back only NET_BIND_SERVICE if the application needs to listen on ports less than 1024. In practice, many applications function perfectly well with no added capabilities at all, relying on standard user-level operations to accomplish their work.
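The drop-everything-then-add-back pattern maps onto the capabilities field of a container's securityContext, where capability names are written without the CAP_ prefix. Below is a small sketch that builds that fragment as a Python dictionary and prints it as JSON, which the Kubernetes API accepts as well as YAML; the helper name is hypothetical.

```python
import json


def minimal_capabilities(add=()):
    """Build a container securityContext that drops ALL capabilities.

    Capabilities listed in `add` are added back; with no arguments the
    container ends up with none at all.
    """
    capabilities = {"drop": ["ALL"]}
    if add:
        capabilities["add"] = sorted(add)
    return {"capabilities": capabilities}


# A web server that must listen on a port below 1024.
web_context = minimal_capabilities(add=["NET_BIND_SERVICE"])
print(json.dumps(web_context, indent=2))

# A worker that needs nothing extra.
worker_context = minimal_capabilities()
```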

Default Capabilities in Containers

Container runtimes provide default capability sets that are automatically granted to all containers. Docker's default added capabilities include CHOWN, DAC_OVERRIDE, FSETID, FOWNER, SETGID, SETUID, SETFCAP, NET_RAW, and SYS_CHROOT. Docker's default dropped capabilities include SYS_ADMIN, NET_ADMIN, SYS_MODULE, SYS_PTRACE, and others. These defaults are reasonable from a compatibility perspective, preventing many applications from failing, but they are not minimal from a security perspective. For maximum security, the best practice is to explicitly drop all default capabilities and add back only those specific capabilities that your application requires.
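To see which capabilities a running process actually holds, read the CapEff line of /proc/<pid>/status, which encodes the effective set as a hexadecimal bitmask with one bit per capability number. The sketch below decodes such a mask using the capability numbering from the kernel headers. The mask in the usage example is the value commonly reported for a root process under Docker's default set and is shown for illustration only.

```python
# Capability names indexed by their bit number in the kernel headers.
CAPABILITY_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_capabilities(hex_mask):
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAPABILITY_NAMES) if mask & (1 << bit)]


def effective_capabilities(status_text):
    """Decode the CapEff line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_capabilities(line.split()[1])
    return []


default_caps = decode_capabilities("00000000a80425fb")
print(len(default_caps), "capabilities:", ", ".join(default_caps))
```

A container started with all capabilities dropped reports a CapEff of all zeros, which decodes to an empty list.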

Seccomp: System Call Filtering

Seccomp (secure computing mode) is a kernel feature that allows filtering of system calls. By using seccomp, you can specify which system calls a process is permitted to execute, creating an additional security layer that prevents access to dangerous syscalls even if a vulnerability allows arbitrary code execution. When a process attempts a forbidden syscall, the kernel evaluates the filter at syscall entry, before any of the syscall's own code runs, and applies the configured action. The kernel can handle forbidden syscalls in multiple ways: SCMP_ACT_KILL terminates the calling thread (SCMP_ACT_KILL_PROCESS terminates the entire process); SCMP_ACT_TRAP sends a SIGSYS signal to the process; SCMP_ACT_ERRNO makes the syscall fail with an error code; SCMP_ACT_ALLOW permits the syscall to execute normally.
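Whether a process is running under a seccomp filter can be checked from the Seccomp line of /proc/<pid>/status, where 0 means disabled, 1 means strict mode, and 2 means filter mode. A minimal sketch of reading that field, with a made-up status excerpt:

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named on the Seccomp line of a status file."""
    for line in status_text.splitlines():
        # "Seccomp:" must match exactly; newer kernels also print a
        # separate "Seccomp_filters:" line that should be ignored here.
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split()[1]), "unknown")
    return "unknown"


# Hypothetical excerpt for a container started with a seccomp profile.
SAMPLE_STATUS = "Name:\tapp\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(SAMPLE_STATUS))
```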

Default Docker Seccomp Profile

Docker provides a default seccomp profile that takes an allowlist approach: the default action is to return an error, and the profile explicitly allows the syscalls that ordinary applications need. The net effect is that roughly 44 of the 300-plus Linux syscalls are blocked. The blocked syscalls include mount and umount2, which prevents mounting and unmounting filesystems; init_module, finit_module, and delete_module, which prevents kernel module operations; open_by_handle_at, which prevents opening files by handle and was the basis of an early container escape; and reboot and swapon, which prevents system-level changes. ptrace is blocked on kernels older than 4.8, where it could be used to bypass seccomp itself. Some of these syscalls are allowed again when the container is granted the matching capability. This profile represents a practical balance, blocking most dangerous operations that would allow container escape while allowing legitimate applications to function normally.

Custom Seccomp Profiles

For applications with specific syscall requirements, you can define custom seccomp profiles that are more restrictive than the default. A custom profile might set a default action of SCMP_ACT_ERRNO, causing all syscalls to fail by default unless explicitly allowed. The profile specifies allowed syscalls such as "read", "write", "exit", "exit_group", and "brk" for basic operation, while blocking process creation syscalls like "clone", "fork", and "vfork". This approach restricts the application to a minimal set of allowed syscalls, preventing it from spawning child processes or executing most other operations.
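A profile like the one just described can be generated rather than written by hand. The sketch below emits JSON in the profile format used by Docker and other OCI runtimes, with a default action of SCMP_ACT_ERRNO and a single allow rule. The architecture list is an assumption (x86-64 only), and a real program needs considerably more syscalls than these five just to start, so this is an illustration of the structure rather than a usable profile.

```python
import json


def build_allowlist_profile(allowed_syscalls, architectures=("SCMP_ARCH_X86_64",)):
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",       # everything else fails
        "architectures": list(architectures),
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


profile = build_allowlist_profile(["read", "write", "exit", "exit_group", "brk"])
print(json.dumps(profile, indent=2))
```

Because clone, fork, and vfork are absent from the allow rule, they fall through to the default action and fail.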

In Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: custom-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: my-profile.json
  containers:
  - name: app
    image: myapp:latest

AppArmor and SELinux: Mandatory Access Control

Capabilities and seccomp limit which privileged operations and system calls a process may use, but within those limits access to files and other objects is still governed by discretionary access control: ownership and permission bits that the owner of an object, or root, can change at will. AppArmor and SELinux, by contrast, implement mandatory access control (MAC), enforcing a policy that applies regardless of the process's permissions or choices. With mandatory access control, the system administrator defines the restrictions, and a confined process cannot exceed them no matter what capabilities it holds.

AppArmor (Mandatory Access Control)

AppArmor operates through profiles that define what a process can access. A process confined by an AppArmor profile cannot exceed the profile's restrictions even if it has the necessary Linux capabilities. AppArmor profiles are defined in a readable, human-friendly syntax that specifies allowed file accesses, network operations, and other system interactions. An example AppArmor profile restricting the nginx web server would allow nginx to read /var/www/html (the web content directory) and /etc/nginx/nginx.conf (the configuration file), and to read and write /var/run/nginx.pid (the process ID file). The profile would explicitly deny access to /etc/passwd, /root/ (home directories with administrative credentials), and /home/ (user home directories), preventing nginx from reading these sensitive files even if a vulnerability allowed arbitrary file access.

SELinux (Security-Enhanced Linux)

SELinux implements a more comprehensive mandatory access control system than AppArmor, using label-based access control where every file, process, and network object carries a security context label. Every file on the system has a label such as system_u:object_r:httpd_sys_rw_content_t:s0, and every process has a corresponding label such as system_u:system_r:httpd_t:s0. The SELinux policy defines which labeled processes can interact with which labeled resources. The same httpd process labeled system_u:system_r:httpd_t:s0 is only allowed to access files labeled with contexts that the policy permits. SELinux is substantially more powerful than AppArmor but requires deep knowledge of security policies and is notoriously difficult to debug due to the complexity of label interactions.
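An SELinux label is four colon-separated fields: user, role, type, and level. Policy decisions are mostly made on the type field, so it helps to be able to pull a label apart. A small parsing sketch follows; the second label in the usage example, with a container_t type and category pair, is an illustrative value of the kind container runtimes commonly assign and does not come from the text above.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_selinux_context(label):
    """Split a label such as "system_u:system_r:httpd_t:s0" into its fields.

    The level may itself contain colons (for example "s0:c1,c2"), so only
    the first three colons are treated as separators.
    """
    parts = label.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"not a full SELinux context: {label!r}")
    return SELinuxContext(*parts)


process = parse_selinux_context("system_u:system_r:httpd_t:s0")
content = parse_selinux_context("system_u:object_r:httpd_sys_rw_content_t:s0")
print(f"policy must allow {process.type} to access {content.type}")

container = parse_selinux_context("system_u:system_r:container_t:s0:c123,c456")
print(container.level)
```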

Both AppArmor and SELinux provide defense-in-depth security. Even if a process somehow escapes cgroup limits or exploits a way to use capabilities beyond what was intended, the mandatory access control policy enforces absolute restrictions on what the process can access, confining it to the subset of the system authorized by the policy.

The Complete Kernel Interaction Stack

When a container process executes a syscall, several independent mechanisms decide whether and how it proceeds. Seccomp runs first, at syscall entry: the filter sees the syscall number and arguments and can reject the call before any of the syscall's own code runs. Inside the syscall, namespaces determine which resources the process can name at all (its PIDs, mounts, network interfaces, and IPC objects). Operations that need privilege are checked against the process's capabilities (such as CAP_SYS_ADMIN or CAP_NET_RAW), and file access is checked against ordinary ownership and permission bits. AppArmor or SELinux then checks whether the process's profile or security context (label) permits access to the requested resource. Cgroups are not a per-syscall permission check; they apply whenever the process consumes a resource, through CPU throttling, memory limits and the OOM killer, PID limits, and device access rules. The syscall completes only if every applicable mechanism allows it; otherwise it fails or the process is killed.

Each layer is independent. A process can pass one layer (e.g., hold the necessary capabilities) but be blocked by another (seccomp or SELinux). This is defense-in-depth: no single security mechanism is relied upon.

Kernel Vulnerability Risk: The Shared Kernel Problem

Because all containers share the host kernel, a kernel vulnerability affects every container.

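Example: CVE-2022-0185 (an integer underflow leading to a heap overflow in the kernel's filesystem context code, reachable through the fsconfig syscall) allows local privilege escalation to root and affects every container on a host running a vulnerable kernel. A single container escape compromises the entire host.

Example: CVE-2022-0847 (Dirty Pipe) lets an unprivileged process overwrite data in files it can only read, through a flaw in how pipes reference page-cache pages. A container could overwrite read-only files, including binaries in image layers it shares with other containers.

Counterexample: CVE-2021-4034 (PwnKit) is a memory-corruption bug in the setuid program /usr/bin/pkexec, not in the kernel. Any local process, including one in a container whose image ships a vulnerable pkexec, can exploit it to become root inside its own environment. It is included here because root inside a container makes the kernel bugs above far easier to reach.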

Kernel vulnerabilities are why defense-in-depth matters. Even if a vulnerability exists, seccomp can block the exploit syscall, or SELinux can prevent the privilege escalation.

Practical Example: Running a Container with Minimal Privileges

Secure Kubernetes pod configuration includes runAsNonRoot set to true, runAsUser set to 1000, fsGroup set to 1000, allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, capabilities dropped ALL with optional NET_BIND_SERVICE added back, and seccompProfile using RuntimeDefault. Volume mounts include /tmp and /var/cache as emptyDir volumes.

This pod runs as non-root (UID 1000), drops all capabilities then adds back only NET_BIND_SERVICE, uses default seccomp profile, has read-only root filesystem, and has writable /tmp and /var/cache. The container cannot escalate privileges, cannot mount filesystems, and cannot access host resources.
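The configuration described above can be written out as a complete manifest. The sketch below builds it as a Python dictionary and prints JSON, which kubectl accepts in place of YAML. The pod name, container name, and volume names are placeholders chosen for this example; the security settings are the ones listed in the text.

```python
import json


def secure_pod(name, image):
    """Build a pod manifest with the hardening settings described above."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                "runAsNonRoot": True,
                "runAsUser": 1000,
                "fsGroup": 1000,
                "seccompProfile": {"type": "RuntimeDefault"},
            },
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    "securityContext": {
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                        "capabilities": {
                            "drop": ["ALL"],
                            "add": ["NET_BIND_SERVICE"],  # optional
                        },
                    },
                    # Writable scratch space on an otherwise read-only root.
                    "volumeMounts": [
                        {"name": "tmp", "mountPath": "/tmp"},
                        {"name": "cache", "mountPath": "/var/cache"},
                    ],
                }
            ],
            "volumes": [
                {"name": "tmp", "emptyDir": {}},
                {"name": "cache", "emptyDir": {}},
            ],
        },
    }


pod = secure_pod("secure-app", "myapp:latest")
print(json.dumps(pod, indent=2))
```

Saving the output to a file and passing it to kubectl apply -f would create the pod, assuming the image exists and runs correctly as UID 1000.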

Combined with namespace isolation, this container has minimal attack surface.

Security Boundaries and Kernel Risk

The critical takeaway that underpins all container security understanding is that container isolation is process-level, not kernel-level. This fundamental difference has profound implications for security architecture.

Defenses at the process level include namespaces, which isolate process resources; cgroups, which enforce resource limits; capabilities, which restrict privileged operations; seccomp, which filters dangerous system calls; and AppArmor or SELinux, which enforce mandatory access control policies. These mechanisms combine to create a robust defense against many categories of attacks and provide good isolation between containers in the typical case.

However, this process-level defense faces a critical limitation: the kernel is shared among all containers. When a kernel vulnerability is discovered and exploited, it bypasses all these process-level defenses. A container process can trigger a kernel vulnerability, and the attacker gains not just container-level privileges but kernel-level access, potentially compromising the entire host and all containers running on it.

For organizations requiring maximum isolation, several approaches exist. gVisor provides a sandboxed runtime that intercepts the container's syscalls and handles most of them in a user-space kernel, so that only a narrow set of operations ever reaches the host kernel. Kata Containers use a VM-based approach where each pod runs in its own lightweight virtual machine with its own guest kernel, so a kernel exploit lands in the guest rather than on the host. User namespace remapping maps container root to an unprivileged user on the host, further limiting the damage from container escape; recent Kubernetes releases expose it through the pod field hostUsers: false, subject to runtime and kernel support.
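User namespace remapping is recorded in /proc/<pid>/uid_map, where each line gives a starting UID inside the namespace, the corresponding starting UID outside it, and the length of the range. The following sketch translates a container UID to the host UID it really runs as; the mapping shown is a typical-looking example, not a required value.

```python
def map_to_host_uid(uid_map_text, container_uid):
    """Translate a UID inside a user namespace to the UID outside it.

    Returns None when the UID is not covered by any mapped range.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside_start, outside_start, length = (int(f) for f in fields)
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None


# Hypothetical mapping: container UIDs 0..65535 become host UIDs 100000..165535.
SAMPLE_UID_MAP = "         0     100000      65536\n"

print(map_to_host_uid(SAMPLE_UID_MAP, 0))      # container root on the host
print(map_to_host_uid(SAMPLE_UID_MAP, 1000))
```

With a mapping like this, a process that escapes the container as UID 0 is an unprivileged user on the host, which is the damage limitation described above.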

For standard namespace-based containers running on shared kernels, the approach must assume that kernel vulnerability risk exists. The best practice is to layer defenses: drop all capabilities except those absolutely needed, apply restrictive seccomp profiles, enable AppArmor or SELinux policies, run containers as non-root, and implement other hardening measures. Additionally, actively monitor kernel security advisories and patches, applying security updates promptly when kernel vulnerabilities are disclosed.

Next Steps: Understand the security boundary implications. See "Container Scope vs Kernel Scope" and "Container Runtimes Explained."