A container starts. Code executes. Network connections open. Files are read and written. This is the runtime stage. An attacker with access to a running container can exploit processes, modify files, escape to the host, or pivot to other containers. Runtime security controls prevent these attacks.
The Runtime Attack Surface
A running container is a process with specific permissions. Unless configured otherwise, most containers run as root with the runtime's default capability set and a writable filesystem. This creates multiple attack vectors that an attacker can exploit.
The container process presents the first attack surface: it typically runs with UID 0 (root), holds the runtime's default capabilities (for Docker these include CAP_NET_RAW, CAP_SETUID, and CAP_DAC_OVERRIDE, while privileged containers additionally receive dangerous ones such as CAP_SYS_ADMIN and CAP_NET_ADMIN), and can read, write, and execute any file within the container. The filesystem itself is a second vector: most directories, including the root filesystem, /tmp, /var, and /home, are writable, allowing an attacker to create or modify any file.
Network access provides a third vector: in a cluster without network policies, every port the container listens on is reachable from other pods, there is no egress filtering to prevent outbound connections, and the container can resolve DNS and reach external services. Syscall filtering is another exposure: when no seccomp profile is applied (Kubernetes has historically run pods unconfined unless a profile is requested), all syscalls are permitted, which exposes calls such as ptrace and mount to any process holding the required capabilities, along with broad access to the /proc and /sys kernel interfaces.
Process execution represents a final vector: the container can spawn child processes, shell access may be available if /bin/sh is present, and an attacker can launch reverse shells or execute arbitrary payloads. Each of these attack surfaces provides opportunities for exploitation if an attacker gains code execution within the container.
Running as Non-Root: Principle of Least Privilege
By default, containers run as root (UID 0). Inside the container, root can read and write every file and use every capability the runtime grants. If a process is compromised, the attacker has complete control over the container.
Running as non-root removes the attacker's ability to execute certain privileged operations. Even if they gain code execution, they cannot install packages (which requires write access to /usr or /var), change ownership of files via chown, bind to ports below 1024, load kernel modules, or change system configuration in /proc and /sys.
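As a defense-in-depth measure, an application can also refuse to start when it finds itself running as root. The following is a minimal sketch of that idea, not part of any framework or of CleanStart; the function names `is_privileged_uid` and `refuse_root` are hypothetical.

```python
import os
import sys


def is_privileged_uid(uid: int) -> bool:
    """Return True when the given UID is root (UID 0)."""
    return uid == 0


def refuse_root() -> None:
    """Exit early if the process was started with an effective UID of 0."""
    if is_privileged_uid(os.geteuid()):
        sys.exit("refusing to start: process is running as root (UID 0)")
```

Calling `refuse_root()` as the first statement of the application's entry point turns a misconfigured deployment into an immediate, visible failure instead of a silently over-privileged process.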
Creating a non-root user:
```dockerfile
FROM ubuntu:22.04
RUN useradd -u 65532 -d /nonexistent -s /sbin/nologin appuser
USER appuser
COPY --chown=appuser:appuser app /app
ENTRYPOINT ["/app"]
```
UID 65532 is the conventional "nonroot" user in distroless-style images (the traditional nobody user is UID 65534). The shell is set to /sbin/nologin, which refuses interactive logins for the account. This does not by itself stop a compromised process from executing a shell binary that exists in the image, which is why shell-less images are covered separately below.
In Kubernetes:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 65532
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
```
The runAsNonRoot field makes the kubelet refuse to start any container that would run as UID 0. If the image is configured to run as root and no runAsUser value overrides it, the container fails to start.
CleanStart images are built with UID 65532 by default. The base runtime image includes a non-root user. The application runs as that user.
Read-Only Root Filesystem: Preventing Runtime Modification
A writable filesystem is an attacker's playground. After gaining code execution, an attacker can write malicious code to disk, modify application binaries, create backdoors, or establish persistence.
A read-only root filesystem prevents all of this. Once the container starts, no files can be created or modified except in designated writable paths such as /tmp, /var, or other mounted volumes.
Implementing read-only root filesystem:
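```dockerfile
FROM ubuntu:22.04

# Install the application (myapp is a placeholder package name) and clear apt caches
RUN apt-get update && apt-get install -y myapp \
    && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives

# Create the unprivileged user before switching to it
RUN useradd -u 65532 -d /nonexistent -s /sbin/nologin appuser

# Prepare the writable paths before declaring them as volumes
RUN mkdir -p /tmp /var/tmp /var/log && chmod 1777 /tmp /var/tmp
VOLUME ["/tmp", "/var"]

USER appuser
# The root filesystem is made read-only at run time (for example with
# docker run --read-only), not by the Dockerfile itself.
```
In Kubernetes: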
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var
          mountPath: /var
  volumes:
    - name: tmp
      emptyDir: {}
    - name: var
      emptyDir: {}
```
The readOnlyRootFilesystem: true field makes the root filesystem read-only. Applications that need to write temporary files must use mounted volumes (emptyDir, etc.). Note that an emptyDir mounted at /var hides whatever the image shipped under /var, so mount narrower paths such as /var/tmp or /var/log if the application needs files from that directory.
Why this matters: If an attacker exploits a vulnerability and gains execution, they cannot write to the root filesystem. They cannot modify binaries or configuration files, plant backdoors in the image's directories, or persist changes there. The writable mounts remain the only places an attacker can write, so keep them few and small.
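One way to confirm which paths actually accept writes inside a running container is to try creating a file in each of them. The sketch below assumes it runs inside the container under test; the helper names `is_writable` and `writable_paths` are hypothetical.

```python
import tempfile
from typing import List


def is_writable(directory: str) -> bool:
    """Return True if a file can actually be created inside `directory`."""
    try:
        # Creating a real file is more reliable than inspecting permission
        # bits, because a read-only mount rejects writes regardless of modes.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def writable_paths(candidates: List[str]) -> List[str]:
    """Return the subset of `candidates` that accept writes."""
    return [path for path in candidates if is_writable(path)]
```

With a read-only root filesystem and a single writable mount at /tmp, a call such as `writable_paths(["/", "/etc", "/usr/bin", "/tmp"])` would be expected to report only /tmp.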
Shell-Less Images: Removing the Post-Exploitation Tool
If a container is compromised, an attacker's first move is usually to gain an interactive shell (/bin/sh). From a shell, they can explore the container, run commands, and open a reverse shell to a machine they control.
If the container image has no shell, this becomes much harder, though not impossible. The attacker must work with raw shellcode, reuse the application's own runtime, or find some other command-execution mechanism.
Building shell-less images:
Use distroless images or minimal base images that exclude shells and standard utilities.
```dockerfile
# Instead of:
FROM ubuntu:22.04

# Use:
FROM gcr.io/distroless/base-debian11
```
Distroless images contain only the runtime dependencies needed by your application: libc, SSL certificates, and tzdata. They exclude shells (/bin/sh, /bin/bash), package managers (apt, yum, apk), standard utilities (curl, wget, grep, sed), programming language interpreters (Python, Node.js), and any development tools.
This makes interactive exploitation extremely difficult. An attacker cannot simply spawn a shell and run arbitrary commands, and they have no ready-made tools for exploring the filesystem or the network.
Verify shell-less images:
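```shell
# Override the entrypoint so that /bin/sh is executed directly rather than
# being passed as an argument to the image's own ENTRYPOINT
docker run --rm --entrypoint /bin/sh myapp:latest
# Expected: the container fails to start because /bin/sh does not exist in the image
```
CleanStart images are shell-less by default. The application runs in the container. No shell is present. If the application is compromised, the attacker has a much harder time executing arbitrary commands.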
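The same check can be automated against an unpacked image filesystem, for example one produced by extracting the output of docker export into a directory. This is a rough sketch under that assumption; `SHELL_PATHS` and `find_shells` are hypothetical names, and the path list is illustrative rather than exhaustive.

```python
import os
from typing import List

# Paths where shells commonly live; extend this list for your own base images.
SHELL_PATHS = [
    "bin/sh", "bin/bash", "bin/dash", "bin/ash", "bin/busybox",
    "usr/bin/sh", "usr/bin/bash",
]


def find_shells(rootfs: str) -> List[str]:
    """Return the shell paths (relative to `rootfs`) present in an unpacked image."""
    found = []
    for relative in SHELL_PATHS:
        # lexists also reports symlinks, e.g. /bin/sh pointing at dash or busybox.
        if os.path.lexists(os.path.join(rootfs, relative)):
            found.append(relative)
    return found
```

An empty result for a given root directory is consistent with a shell-less image; any entry in the list is worth investigating before the image is deployed.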
Linux Capabilities: Minimum Privilege via Granular Permissions
Linux capabilities split root's privileges into granular units. Instead of running as root (all capabilities), you can drop all capabilities and add only those the application needs.
Common capabilities include CAP_NET_BIND_SERVICE (bind to ports below 1024), CAP_NET_ADMIN (configure network interfaces, routing, firewall), CAP_SYS_ADMIN (a broad catch-all that covers mounting filesystems, quota management, and many other administrative operations), CAP_DAC_OVERRIDE (bypass file permission checks), CAP_SETUID (change UID), CAP_SYS_PTRACE (trace other processes), CAP_SYS_CHROOT (change root directory), and CAP_SYS_MODULE (load/unload kernel modules).
By default, container runtimes grant a subset of capabilities that is still larger than most applications need. Drop all of them and add back only what the application needs.
Dropping all capabilities:
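```dockerfile
FROM ubuntu:22.04
# libcap2-bin provides setcap; myapp is a placeholder package name
RUN apt-get update && apt-get install -y libcap2-bin myapp
RUN useradd -u 65532 -d /nonexistent -s /sbin/nologin appuser
# Remove file capabilities from the binary; this needs root, so it runs before USER
RUN setcap -r /usr/bin/myapp
USER appuser
```
The Dockerfile only strips file capabilities from the binary. The capabilities of the running process are dropped when the container is started, for example with docker run --cap-drop=ALL. In Kubernetes: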
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE  # Only if needed
        readOnlyRootFilesystem: true
        runAsNonRoot: true
```
The drop: [ALL] entry removes all capabilities. The add: [NET_BIND_SERVICE] entry adds back only the capability needed to bind to port 80 (if required).
Why this matters: Many privilege escalation exploits require specific capabilities. By dropping all capabilities, you eliminate entire classes of attacks. An attacker who gains code execution inside the container cannot use privilege escalation techniques that rely on CAP_SYS_ADMIN, CAP_SETUID, or other capabilities.
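To see which capabilities a process really holds, the CapEff bitmask in /proc/<pid>/status can be decoded into names. The sketch below follows the bit positions defined in linux/capability.h; the function names are hypothetical, and `effective_capabilities` assumes a Linux /proc filesystem.

```python
from typing import List

# Capability names indexed by bit position, as defined in linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_capabilities(hex_mask: str) -> List[str]:
    """Translate a hex bitmask such as the CapEff field into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_capabilities(status_path: str = "/proc/self/status") -> List[str]:
    """Read the CapEff line from a /proc/<pid>/status file and decode it."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_capabilities(line.split()[1])
    return []
```

A mask of 0000000000000000 decodes to an empty list, which is what a container started with all capabilities dropped should report. A mask of 0000000000000400 decodes to CAP_NET_BIND_SERVICE alone.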
Seccomp: Syscall Filtering
Seccomp (secure computing) is a Linux kernel feature that filters syscalls. A seccomp profile defines which syscalls a process can execute. An attempt to execute a blocked syscall either fails with an error or kills the process, depending on the policy.
A default seccomp profile allows most syscalls but blocks ones that are dangerous and rarely needed inside a container. Docker's default profile, for example, blocks clone with namespace-creating flags such as CLONE_NEWNET, mount (filesystem mounting), umount2 (filesystem unmounting), ptrace (process tracing, on kernels older than 4.8), add_key and request_key (keyring operations), keyctl (keyring control), and fsconfig, fsmount, and fsopen (filesystem configuration). Several of these are permitted again if the container holds the corresponding capability, such as CAP_SYS_ADMIN.
Kubernetes seccomp:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:latest
```
The type: RuntimeDefault setting applies the container runtime's default seccomp profile. For Docker and containerd, this is an allowlist that permits the large majority of syscalls and denies a set considered dangerous in containers.
Custom seccomp profiles:
If your application needs specific syscalls, you can define a custom profile:
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
This profile denies all syscalls by default (SCMP_ACT_ERRNO) and allows only read, write, exit, and exit_group. It illustrates the structure of a profile rather than a usable policy: a real application needs many more syscalls (execve to start, plus memory management and file access calls), so a practical profile is usually derived from the syscalls the application is observed to make.
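The decision a deny-by-default profile makes can be modeled in a few lines. This is a deliberately simplified sketch: it matches on syscall names only and ignores argument filters, architectures, and capability-conditional rules that real profiles support. `PROFILE` mirrors the example above, and `action_for` is a hypothetical helper, not part of any seccomp library.

```python
from typing import Any, Dict

PROFILE: Dict[str, Any] = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "exit", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}


def action_for(profile: Dict[str, Any], syscall: str) -> str:
    """Return the action a profile assigns to `syscall` (name match only)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule named the syscall, so the profile's default action applies.
    return profile["defaultAction"]
```

Running candidate syscalls through a model like this makes gaps visible early: `action_for(PROFILE, "execve")` falls through to the default action, which shows why this particular profile is too strict to start a real program.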
Network Policies: East-West Traffic Control
Network policies restrict which pods can communicate with each other. By default, all pods in a cluster can reach all other pods and external services.
A network policy allows traffic only between specific pods, namespaces, or CIDR ranges. Once a pod is selected by a policy, traffic in the covered direction that doesn't match any allow rule is denied. Policies take effect only when the cluster's network plugin enforces them.
Kubernetes network policy:
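```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 8080
    - to:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace; a plain
              # "name" label exists only if you add it yourself
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53  # DNS
        - protocol: TCP
          port: 53  # DNS over TCP for large responses
```
The first policy (deny-all) selects every pod in the namespace and, having no allow rules, blocks all ingress and egress traffic. The second policy allows the frontend to send traffic to the api pods on port 8080 and to make DNS queries. Because deny-all also blocks ingress, the api pods need a corresponding ingress rule before they accept the frontend's connections.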
Network policy benefits are substantial. An attacker who compromises one pod cannot reach other pods unless explicitly allowed, which sharply limits lateral movement. Data exfiltration over the network is blocked unless egress is explicitly allowed. The blast radius of a compromise is limited to the pod and its explicitly allowed peers.
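The allow-list semantics can be illustrated with a small model of egress evaluation. This sketch is a simplification, not the Kubernetes implementation: it handles only podSelector peers with matchLabels in a single namespace, ignores protocols, ipBlock and namespaceSelector peers, and named ports. The policy dictionaries and function names are hypothetical.

```python
from typing import Any, Dict, List

Labels = Dict[str, str]

DENY_ALL = {"spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}}
ALLOW_FRONTEND = {"spec": {
    "podSelector": {"matchLabels": {"app": "frontend"}},
    "policyTypes": ["Egress"],
    "egress": [{
        "to": [{"podSelector": {"matchLabels": {"app": "api"}}}],
        "ports": [{"protocol": "TCP", "port": 8080}],
    }],
}}


def selector_matches(selector: Dict[str, Any], labels: Labels) -> bool:
    """An empty selector matches every pod; otherwise all matchLabels must match."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def egress_allowed(policies: List[Dict[str, Any]], source: Labels,
                   destination: Labels, port: int) -> bool:
    """Decide whether `source` may connect to `destination` on `port`."""
    selecting = [
        policy for policy in policies
        if "Egress" in policy["spec"].get("policyTypes", [])
        and selector_matches(policy["spec"].get("podSelector", {}), source)
    ]
    if not selecting:
        return True  # no policy selects the pod, so egress is unrestricted
    for policy in selecting:
        for rule in policy["spec"].get("egress", []):
            # A rule without "to" or "ports" places no limit on that dimension.
            peer_ok = "to" not in rule or any(
                selector_matches(peer["podSelector"], destination)
                for peer in rule["to"] if set(peer) == {"podSelector"}
            )
            port_ok = "ports" not in rule or any(
                entry.get("port") == port for entry in rule["ports"]
            )
            if peer_ok and port_ok:
                return True
    return False  # selected by at least one policy, but no rule allowed it
```

With both policies loaded, the frontend can reach the api on port 8080, while any other combination of source, destination, and port is refused because deny-all selects the pod and no rule permits the connection.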
Pod Security Standards: Enforcing Constraints
Pod Security Standards, which replaced the deprecated PodSecurityPolicy mechanism, define security constraints that are enforced at the Kubernetes API level by the built-in Pod Security Admission controller. Three levels exist: Privileged (no restrictions, permits everything), Baseline (prevents known privilege escalations, permits most workloads), and Restricted (enforces strict security, works only with hardened images).
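On clusters older than Kubernetes 1.25, the same constraints were expressed as a PodSecurityPolicy:
```yaml
# Legacy API: PodSecurityPolicy was removed in Kubernetes 1.25
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'MustRunAs'
    seLinuxOptions:
      level: "s0:c123,c456"
  # supplementalGroups is a required field, and MustRunAs needs a range
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  readOnlyRootFilesystem: false
```
A restricted Pod Security Policy enforces multiple requirements: non-root execution, no privilege escalation, all capabilities dropped, and no host access (network, IPC, PID).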
Enforcing a policy at the namespace level:
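```shell
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
```
Any pod created in the production namespace must comply with the Restricted standard. The enforce label causes non-compliant pods to be rejected, while the audit and warn labels record violations in the audit log and return warnings to the client.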
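Before a manifest ever reaches the cluster, a CI step can check it against some of the same requirements. The sketch below covers only a subset of the Restricted standard (it skips init containers, volume types, and several other fields) and operates on a manifest already parsed into a dictionary; `restricted_violations` is a hypothetical helper, not a Kubernetes API.

```python
from typing import Any, Dict, List


def restricted_violations(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for a pod manifest given as a dict."""
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext", {})
    findings = []

    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(flag):
            findings.append(f"{flag} must not be enabled")

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        # Container-level settings take precedence over pod-level ones.
        if not ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot must be true")
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation must be false")
        if ctx.get("privileged"):
            findings.append(f"{name}: privileged must not be set")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities.drop must include ALL")
        seccomp = ctx.get("seccompProfile", pod_ctx.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            findings.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return findings
```

An empty list means the manifest passed these particular checks. It does not guarantee that the admission controller will accept the pod, since the real standard covers more fields than this sketch.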
Runtime Monitoring: Detecting Attacks in Progress
Monitoring detects attacks that slip through preventive controls. Runtime monitoring tools watch running containers for suspicious behavior.
eBPF-based runtime security: Extended Berkeley Packet Filter (eBPF) lets small, verified programs loaded from userspace run inside the kernel in response to events. Runtime monitoring tools like Falco, Tetragon (Cilium), and Tracee use eBPF to observe process creation and execution, file access and modification, network connections, and syscall invocation. These tools generate alerts when suspicious behavior is detected.
Falco example rules:
```yaml
# Placeholder allowlists; replace the entries with your application's processes.
- list: allowed_processes
  items: [myapp]
- list: allowed_parents
  items: [myapp]

# spawned_process, container, open_write, and outbound are macros
# defined in Falco's default rules file.
- rule: Unauthorized Process
  desc: Detect unauthorized process execution
  condition: >
    spawned_process and container
    and not proc.name in (allowed_processes)
    and not proc.pname in (allowed_parents)
  output: >
    Unauthorized process started
    (user=%user.name proc=%proc.name parent=%proc.pname)
  priority: WARNING

- rule: Write to System Directories
  desc: Detect writes to /etc or /bin
  condition: >
    open_write and container
    and (fd.name startswith "/etc/" or fd.name startswith "/bin/")
  output: >
    Write to system directory
    (user=%user.name file=%fd.name)
  priority: CRITICAL

- rule: Reverse Shell
  desc: Detect reverse shell connections
  condition: >
    outbound and container
    and (proc.name="sh" or proc.name="bash")
    and fd.type="ipv4"
  output: >
    Reverse shell connection detected
    (user=%user.name container=%container.name)
  priority: CRITICAL
```
Falco runs as a daemon on the host (or in a privileged container). It monitors container events through a kernel driver, typically an eBPF probe. When a rule matches (e.g., /bin/bash making a network connection), it generates an alert.
Behavioral detection monitors unexpected process spawning (shell, compiler, network tool), file writes to read-only locations (if the image is supposed to be immutable), network connections to unexpected destinations, privilege escalation attempts, kernel module loading, and suspicious syscall patterns.
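The core of behavioral detection is a predicate over a stream of events. The toy classifier below shows the shape of such a check; it is not how Falco or any other tool is implemented, and the event format (a dictionary with `type`, `proc`, and `path` keys) is an assumption made purely for illustration.

```python
from typing import Dict, Optional

SHELLS = {"sh", "bash", "dash", "ash", "zsh"}
SYSTEM_DIRS = ("/etc/", "/bin/", "/usr/bin/", "/sbin/")


def classify_event(event: Dict[str, str]) -> Optional[str]:
    """Return an alert string for a suspicious event, or None if it looks benign."""
    kind = event.get("type")
    if kind == "exec" and event.get("proc") in SHELLS:
        return "shell spawned in container"
    if kind == "write" and event.get("path", "").startswith(SYSTEM_DIRS):
        return "write to system directory"
    if kind == "connect" and event.get("proc") in SHELLS:
        return "possible reverse shell"
    return None
```

Real tools evaluate far richer context (parent process, container identity, destination address), but the principle is the same: a hardened image should produce almost no events that match rules like these, so every match deserves attention.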
Runtime attestation: After a container has been running for a period, verify that the running image matches the deployed image and that the filesystem hasn't been modified. This detects drift, the divergence of a container from its expected state over time.
Runtime Drift Detection: When Containers Change
A container should remain in a known state. Any deviation from that state (file modifications, new processes, network changes) is suspicious.
Detecting drift involves hashing the filesystem at startup and periodically rehashing and comparing. Record expected processes and alert on unexpected processes. Record expected network connections and alert on unexpected connections. Log all file modifications and alert if modifications occur outside of expected locations.
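The hash-and-compare step described above can be sketched as follows. This is a minimal illustration, assuming the monitored tree is small enough to hash in full; `snapshot` and `drift` are hypothetical helper names, and a production tool would also stream large files in chunks and track permissions and ownership.

```python
import hashlib
import os
from typing import Dict, List


def snapshot(root: str) -> Dict[str, str]:
    """Map each regular file under `root` to the SHA-256 digest of its content."""
    digests = {}
    for directory, _subdirs, files in os.walk(root):
        for filename in files:
            path = os.path.join(directory, filename)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, and device nodes
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def drift(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Describe files that were added, removed, or modified since the baseline."""
    changes = []
    for path in sorted(set(baseline) | set(current)):
        if path not in baseline:
            changes.append(f"added: {path}")
        elif path not in current:
            changes.append(f"removed: {path}")
        elif baseline[path] != current[path]:
            changes.append(f"modified: {path}")
    return changes
```

Taking a snapshot at startup and comparing it with periodic snapshots gives a list of changes; for an image that is meant to be immutable, any non-empty list is grounds for an alert.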
CleanStart approach: CleanStart images use a read-only root filesystem and a minimal runtime. The container's state is static. Any modification is immediately suspicious and should trigger an alert.
Incident Response: When Runtime Detection Fires
When runtime monitoring detects an attack, follow these steps in order:
1. Isolate the container: immediately stop it or restrict its network access.
2. Preserve logs: copy container logs, kernel logs, and eBPF traces to immutable storage.
3. Analyze the breach: review the runtime event that triggered the alert and identify the root cause.
4. Trace lateral movement: determine whether the attacker accessed other containers or host systems.
5. Revoke credentials: if secrets were exposed, revoke and rotate them immediately.
6. Update scanning: rescan the image for vulnerabilities.
7. Update policies: tighten network policies, seccomp profiles, or capabilities.
8. Patch: fix the vulnerability that allowed the initial exploitation.
CleanStart Runtime Security: From Image to Execution
CleanStart images are built for runtime security. The resulting image is deployed to a hardened runtime environment.
CleanStart images share six characteristics. They are small and minimal, containing only runtime dependencies and no development tools. They run as a non-root user (UID 65532 by default). They are shell-less, with no /bin/sh and no package managers. They use a read-only root filesystem, so the filesystem cannot be modified once the container has started. They drop all capabilities by default, and applications declare any capabilities they require. Finally, they run cleanimg-init as PID 1, a 1 MB Rust-based init system that handles signal forwarding, zombie reaping, and process management.
CleanStart runtime deployment:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    app: myapp  # matched by the anti-affinity rule and the NetworkPolicy below
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    fsGroup: 65532
  containers:
    - name: app
      image: registry.cleanstart.com/myorg/myapp:v1.2.3@sha256:abc123...
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 65532
      volumeMounts:
        - name: tmp
          mountPath: /tmp
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
  volumes:
    - name: tmp
      emptyDir: {}
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - myapp
            topologyKey: kubernetes.io/hostname
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-policy
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-controller  # custom label; the namespace must carry it
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: databases  # custom label; the namespace must carry it
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```
This manifest combines multiple runtime security controls: non-root user (UID 65532), all capabilities dropped, read-only root filesystem, network policy restricting ingress (only from the ingress controller) and egress (only to databases and DNS), resource limits, health checks, and pod anti-affinity (spread across nodes).
An attacker who exploits the application and gains code execution inside this container will find no root privileges, no capabilities to abuse, no writable filesystem (except /tmp), no shell to interact with, and no outbound network access (except to the database and DNS). Where a runtime monitor is deployed alongside, behavioral monitoring also detects the exploit attempt.
From Pre-Build Through Runtime
Container security is a chain. Security at each stage builds on the previous stage.
See Pre-Build Stage Security: Securing Your Supply Chain Before a Single Line Compiles in Pillar 1 for how to control source code and dependencies before the build.
See Build Stage Security: What Happens Inside the Build and Why It Matters in Pillar 2 for how to compile and assemble the image safely.
The runtime stage executes the image and enforces its security properties through policy and monitoring. A container that passes pre-build and build security, deployed with runtime security controls, and monitored for suspicious behavior, presents an extremely difficult target for attackers.
Runtime Security Checklist
[ ] Container runs as non-root user (UID > 1000, ideally 65532).
[ ] Root filesystem is read-only.
[ ] Image contains no shell (/bin/sh, /bin/bash).
[ ] All Linux capabilities are dropped (capabilities.drop: ALL).
[ ] Only required capabilities are explicitly added.
[ ] Seccomp default profile is applied (blocks dangerous syscalls).
[ ] Network policies restrict ingress and egress traffic.
[ ] Pod Security Standards (or PodSecurityPolicy on clusters older than Kubernetes 1.25) are enforced.
[ ] Resource limits are set (memory, CPU).
[ ] Liveness and readiness probes are configured.
[ ] Runtime monitoring (Falco, Tetragon, or equivalent) is enabled.
[ ] Alerts are configured for suspicious behavior.
[ ] Logs are stored immutably and retained for audit.
[ ] Images are verified (digest, signature, SBOM) before deployment.
[ ] Regular vulnerability scanning of running images is performed.
Summary
Runtime security protects containers from exploitation and lateral movement. It operates on multiple levels: user privileges, filesystem mutability, capabilities, syscall filtering, network policies, and behavioral monitoring.
Pre-build and build security ensure that the image is trustworthy and free of known vulnerabilities. Runtime security enforces constraints on that image's execution and detects attacks.
The three stages are interconnected. Pre-build security (see Pre-Build Stage Security) controls what code enters the build. Build security (see Build Stage Security) ensures the code is compiled safely. Runtime security ensures the compiled image executes safely and remains unmodified throughout its lifecycle.
Together, these three stages form a defense-in-depth strategy that protects the entire container lifecycle from source code through running processes.
