Combining Read-Only + Shell-Less: The 97% Attack Surface Reduction
Read-only filesystems and shell-less containers are each effective on their own, but together they form a security model that removes entire categories of attack: malware injection, interactive exploitation, and on-disk persistence. In the model used in this guide, that reduces the number of viable exploitation paths by roughly 97%.
Part 1: Why Both Together
The following diagram visualizes how combining read-only filesystems and shell-less containers progressively eliminates attack paths:
1. Traditional Container: ~400 viable paths
2. Read-Only Only: ~200 viable paths
3. Shell-Less Only: ~150 viable paths
4. Combined (Read-Only + Shell-Less): ~12 viable paths (97% reduction)

The Attack Surface Reduction Equation
Attack surface is not just the number of entry points, but the number of viable exploitation paths. Let's model a realistic container attack to understand how security controls affect the adversary's options.
Traditional Container (Writable + Shell)
In a traditional container with a writable filesystem and shell access, attackers have multiple exploitation paths available. The first path involves a buffer overflow or injection vulnerability in the application code, such as SQL injection or command injection, which grants the attacker code execution with root privileges. From there, the attacker can write malware to the filesystem at a location like /tmp/malware.so, then load that malware using dlopen("/tmp/malware.so"). Once loaded, the malware persists in memory or spreads to other containers, resulting in complete compromise of the system.
The second path takes advantage of supply chain vulnerabilities. An attacker injects a shell command into the base image build process. When the container starts, this command executes automatically, installing a backdoor or exfiltrating secrets from environment variables. This results in immediate compromise before the application even begins running.
The third path exploits privilege escalation opportunities. An attacker finds a setuid binary or kernel exploit and escalates their privileges to root. With root access, they can disable audit logging by running rm /var/log/audit.log, covering their tracks completely and enabling long-term, undetectable compromise.
The total number of viable exploitation paths in this scenario is approximately 400, representing every possible way an attacker could gain persistence and control within the container.
Read-Only Only (Read-Only Filesystem + Shell Present)
When the filesystem is read-only but a shell is still available, the attack landscape changes significantly. In the first path, an attacker exploits an application vulnerability and gains code execution. However, when they attempt to write malware to /tmp, this operation fails because the filesystem is read-only. The attacker cannot persist malware, which severely limits their ability to establish long-term control. However, code execution has still occurred, which means the attack is not entirely mitigated.
In the second path, an attacker spawns a shell using system("/bin/sh"). Because the shell is still present in this configuration, the call succeeds and the attacker gains interactive access to every utility already in the image. They cannot drop new tools onto the read-only filesystem, but the existing shell and binaries still provide an alternative path to achieving their goals.
In the third path, information gathering opportunities are limited. The attacker cannot modify the filesystem but can still read it. However, containers typically don't have files like /etc/shadow or readable /var/log directories, making information gathering harder but not impossible.
With a read-only filesystem, the number of viable exploitation paths decreases to approximately 200. The filesystem write capability, which was fundamental to many persistence strategies, has been eliminated.
Shell-Less Only (Writable Filesystem + No Shell)
When the filesystem is writable but no shell is available, a different set of constraints appears. The first path allows an attacker to write to /tmp successfully, since the filesystem remains writable. However, when they try to execute something from /tmp, the absence of a shell complicates matters significantly. The attacker cannot run shell commands, but they could write a compiled binary and run it directly with execve, or write a shared object and load it via dlopen, providing a workaround that bypasses some of the protection.
The second path attempts to establish interactive access by opening a reverse shell. This fails because no shell binary exists. However, an attacker could still execute a compiled reverse shell binary directly, using direct system calls rather than shell commands.
The third path for privilege escalation remains largely intact. The attacker can still find setuid binaries, exploit kernel vulnerabilities, and write a rootkit to the writable filesystem, establishing root-level persistence.
Shell-less containers reduce viable paths to approximately 150. The removal of shell commands blocks one dimension of attacks, but the ability to write and execute binaries directly still enables most classical exploitation techniques.
BOTH: Read-Only + Shell-Less (Immutable + No Shell)
When both read-only filesystems and shell-less containers are combined, the attack landscape becomes fundamentally constrained. In the first path, an attacker exploits an application vulnerability and gains code execution as a non-root user. However, when they attempt to write to /var, /etc, /bin, or any other path on the root filesystem, all of these operations fail because the filesystem is read-only. The only writable locations are volumes that the manifest mounts explicitly (for example, an emptyDir at /tmp), so those mounts should be kept as few and as small as the application allows. Outside them, the attacker cannot persist malware, cache executables, or establish a foothold. Even if they intend to load a shared object via dlopen, they first have to find a writable mount to place it on.
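One way to see which directories still accept writes is to try creating a file in each from inside the running process. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of any CleanStart tooling.

```python
import errno
import os
import tempfile


def is_writable(directory: str) -> bool:
    """Return True if a file can actually be created in `directory`."""
    try:
        # mkstemp creates a real file, so this exercises the mount itself
        # rather than only the permission bits that os.access() reports.
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError as exc:
        # EROFS: read-only filesystem; EACCES/EPERM: permission denied;
        # ENOENT/ENOTDIR: the directory does not exist in this image.
        if exc.errno in (errno.EROFS, errno.EACCES, errno.EPERM,
                         errno.ENOENT, errno.ENOTDIR):
            return False
        raise
    os.close(fd)
    os.remove(path)
    return True


def writable_report(directories):
    """Map each candidate directory to whether it accepts writes."""
    return {directory: is_writable(directory) for directory in directories}
```

In a pod that sets readOnlyRootFilesystem: true, a call such as writable_report(["/", "/etc", "/tmp"]) would be expected to report True only for paths backed by a mounted volume.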
In the second path, the block is complete. An attacker cannot spawn a shell because /bin/sh doesn't exist in the image. They cannot open a reverse shell, invoke system commands, or use any shell functionality whatsoever.
In the third path for privilege escalation, the defense becomes highly effective. An attacker who gains code execution as a non-root user cannot read root-only files such as /etc/shadow. Setuid binaries offer no way up either: minimal images ship few or none, and allowPrivilegeEscalation: false sets the kernel's no_new_privs flag so that setuid bits no longer grant extra privileges. Even if a kernel exploit exists, they cannot write a rootkit to the root filesystem to establish persistence.
In the fourth path for environmental access, the attacker's options are severely limited. They cannot read /etc/shadow due to non-root permissions. They cannot read audit logs or other system files. They can read process memory through /proc, but they only have access to non-root process memory, which significantly reduces the value of any leaked information.
With both controls in place, the number of viable exploitation paths drops to approximately 12. These remaining paths represent inherent risks that cannot be eliminated through container configuration alone: application-level bugs in the running process, direct memory corruption attacks, Kubernetes API server misconfigurations, and kernel vulnerabilities on the host node.
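The headline percentage follows directly from the approximate path counts quoted in this section. A small sketch of that arithmetic, using only the figures given above:

```python
# Approximate viable-path counts quoted in this section.
PATHS = {
    "traditional": 400,
    "read_only_only": 200,
    "shell_less_only": 150,
    "combined": 12,
}


def reduction_percent(baseline: int, remaining: int) -> float:
    """Percentage of the baseline paths that have been eliminated."""
    return 100.0 * (baseline - remaining) / baseline


reductions = {
    name: reduction_percent(PATHS["traditional"], count)
    for name, count in PATHS.items()
}
# reductions["read_only_only"]  -> 50.0
# reductions["shell_less_only"] -> 62.5
# reductions["combined"]        -> 97.0
```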
The Security Equation: Mathematically
If we model vulnerability surface as a mathematical function:
Total Attack Surface = (Filesystem Write Paths × Execution Paths) + (Shell Commands × Payload Delivery) + (Privilege Escalation Chains) + (Information Disclosure Paths)

Then we can calculate the attack surface for each configuration:

| Configuration | Calculation | Score (points) | Reduction vs. traditional |
|---|---|---|---|
| Traditional | 100 × 100 + 50 × 50 + 50 + 50 | 12,600 | baseline |
| Read-Only Only | 0 × 100 + 50 × 50 + 50 + 50 | 2,600 | ~79% |
| Shell-Less Only | 100 × 0 + 0 × 50 + 50 + 50 | 100 | ~99% |
| Combined Read-Only + Shell-Less | 0 × 0 + 0 × 0 + 0 + 12 | 12 | ~99.9% |

The weights are illustrative rather than measured. The 97% headline figure comes from the path counts earlier in this section (about 400 paths down to about 12), while this weighted model puts the combined reduction above 99%.
The combination is not additive; it's multiplicative. The two largest terms in the model are products, so driving either factor to zero removes the whole term. When you eliminate two independent dimensions of attack (filesystem writes and shell execution), the attack surface collapses far faster than it would if paths were removed one at a time. This is why combining read-only with shell-less is so powerful.
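The model can be evaluated directly to check the arithmetic. The sketch below encodes the four-term formula as a small dataclass; the class and field names are illustrative, and the weights are the ones used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    write_paths: int
    exec_paths: int
    shell_commands: int
    payload_delivery: int
    priv_esc_chains: int
    info_disclosure: int

    def score(self) -> int:
        # Two product terms plus two additive terms, as in the formula above.
        return (self.write_paths * self.exec_paths
                + self.shell_commands * self.payload_delivery
                + self.priv_esc_chains
                + self.info_disclosure)


CONFIGS = {
    "traditional": Surface(100, 100, 50, 50, 50, 50),
    "read_only_only": Surface(0, 100, 50, 50, 50, 50),
    "shell_less_only": Surface(100, 0, 0, 50, 50, 50),
    "combined": Surface(0, 0, 0, 0, 0, 12),
}

baseline = CONFIGS["traditional"].score()  # 12600
for name, surface in CONFIGS.items():
    score = surface.score()
    reduction = 100.0 * (baseline - score) / baseline
    print(f"{name:16s} {score:6d} points  {reduction:5.1f}% reduction")
```

Setting either factor of a product term to zero removes that term entirely, which is the effect the multiplicative argument relies on.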
Part 2: The Complete Kubernetes Security Context
This section provides the definitive, copy-paste, production-ready SecurityContext that implements both read-only and shell-less containerization in a Kubernetes environment.
The Master SecurityContext Block
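    spec:
      # Pod-level securityContext: applies to every container in the pod
      securityContext:
        # Non-root user (65532 is the conventional "nonroot" UID in distroless-style images)
        runAsNonRoot: true
        runAsUser: 65532
        runAsGroup: 65532
        # Group for volume ownership (ensures app can read/write mounted volumes)
        fsGroup: 65532
        # Use the default seccomp profile (blocks ~50 dangerous syscalls)
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app  # placeholder container name
          # Container-level securityContext: the three fields below are not
          # valid at pod level and must be repeated for each container
          securityContext:
            # Read-only root filesystem (MUST be true for production)
            readOnlyRootFilesystem: true
            # Prevent privilege escalation (setuid binaries cannot elevate)
            allowPrivilegeEscalation: false
            # Drop all capabilities and don't grant any
            capabilities:
              drop: ["ALL"]

What Each Field Does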
| Field | Value | Why |
|---|---|---|
| runAsNonRoot | true | The kubelet refuses to start the container if it would run as UID 0 |
| runAsUser | 65532 | Specific user ID (not root = 0); the conventional "nonroot" UID in distroless-style images |
| runAsGroup | 65532 | Specific group ID; ensures clean permission model |
| fsGroup | 65532 | Pod-level only. Makes mounted volumes readable/writable by this group |
| readOnlyRootFilesystem | true | CRITICAL: Container-level only. Filesystem is immutable; no writes to /, /etc, /bin, etc. |
| allowPrivilegeEscalation | false | Container-level only. Prevents the process from gaining more privileges than its parent (sets no_new_privs, so setuid binaries cannot elevate) |
| capabilities.drop | ["ALL"] | Container-level only. Removes all Linux capabilities (CAP_NET_RAW, CAP_SYS_ADMIN, etc.) |
| seccompProfile.type | RuntimeDefault | Applies the container runtime's default seccomp filter (blocks syscalls such as reboot, mount, and kexec_load) |
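The table can also be checked mechanically. Below is a minimal sketch of a validator that takes a pod spec already parsed into a Python dict (for example from kubectl get ... -o json) and reports which fields are missing. The function name is hypothetical. It relies on two facts about the Kubernetes API: readOnlyRootFilesystem, allowPrivilegeEscalation, and capabilities exist only on the container-level securityContext, while runAsNonRoot and seccompProfile may be set at either level with the container value taking precedence.

```python
from typing import Any, Dict, List


def check_pod_spec(pod_spec: Dict[str, Any]) -> List[str]:
    """Return findings for a pod spec; an empty list means all checks passed."""
    findings: List[str] = []
    pod_sc = pod_spec.get("securityContext") or {}
    containers = (list(pod_spec.get("initContainers") or [])
                  + list(pod_spec.get("containers") or []))

    for container in containers:
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        def effective(field: str) -> Any:
            # A container-level value overrides the pod-level one.
            return sc[field] if field in sc else pod_sc.get(field)

        if effective("runAsNonRoot") is not True:
            findings.append(f"{name}: runAsNonRoot is not true")
        if (effective("seccompProfile") or {}).get("type") != "RuntimeDefault":
            findings.append(f"{name}: seccompProfile.type is not RuntimeDefault")

        # The remaining fields are only honoured on the container itself.
        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            findings.append(f"{name}: capabilities.drop does not include ALL")

    return findings
```

A spec that sets readOnlyRootFilesystem only at pod level would be flagged here, because that placement has no effect on the running container.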
Compliance Mapping: CIS, DISA, NIST, FedRAMP
CIS Docker Benchmark v1.7.0
The CIS Docker Benchmark establishes best practices for securing container images and runtime configurations. Section 5.1 covers image and build best practices, requiring containers to be built from known base images. CleanStart images provide hardened, minimal base images that are scanned and verified, satisfying this requirement. Section 5.2 requires a HEALTHCHECK to be configured; the per-application examples in Part 3 address this with Kubernetes liveness and readiness probes. Section 5.3 mandates that layer count be kept as low as possible to reduce the potential attack surface. CleanStart images leverage multi-stage builds to minimize unnecessary layers. Section 5.4 requires that container images be scanned for known vulnerabilities before deployment. CleanStart images are scanned using grype at build time before being pushed to production. Section 5.12 specifies that COPY should be used instead of ADD in Dockerfiles to avoid automatic decompression vulnerabilities. CleanStart images follow this pattern consistently. Section 5.25 addresses runtime security, requiring that containers be restricted from acquiring additional privileges. This is enforced using allowPrivilegeEscalation: false in the SecurityContext. Section 5.26 requires limiting container read access to system process information, achieved through read-only filesystems and the seccompProfile: RuntimeDefault configuration. Section 5.27 mandates that container and application logs be configured appropriately, which CleanStart implements by directing all logs to stdout and stderr where they can be accessed through kubectl logs.
DISA STIG for Kubernetes
The Defense Information Systems Agency provides strict security requirements for Kubernetes deployments. Requirement V-242376 mandates that containers be configured with a read-only root filesystem. This is implemented using readOnlyRootFilesystem: true in the SecurityContext. Requirement V-242377 requires that containers run as a non-root user, enforced through runAsNonRoot: true and runAsUser: 65532. Requirement V-242378 specifies that containers must not have privilege escalation enabled, which is ensured by allowPrivilegeEscalation: false. Requirement V-242379 requires that all capabilities be dropped from containers, achieved through capabilities.drop: ["ALL"]. Together, these four requirements form the foundation of DISA STIG compliance for container security.
NIST 800-190 (Application Container Security)
The National Institute of Standards and Technology published guidelines for container security that address four key areas. Guideline 4.1 establishes the need to maintain an image repository, which is satisfied through the use of private registries such as Google Container Registry or Amazon ECR. Guideline 4.2 requires performing image scanning for known vulnerabilities before deployment. CleanStart images are scanned, and custom images should be scanned using tools like grype before being pushed to production. Guideline 4.3 requires implementing the principle of least privilege, which is achieved through read-only filesystems, non-root user execution, and dropped capabilities. Guideline 4.4 mandates implementing network isolation, which is addressed through Kubernetes NetworkPolicy objects in a separate guide.
FedRAMP Security Requirements
FedRAMP compliance requires security controls across multiple dimensions. Control AC-6 implements least privilege access, which is satisfied through non-root user execution, dropped capabilities, and read-only filesystems. Control CM-7 implements least functionality, requiring that systems only include necessary components. This is addressed through shell-less images that contain only essential binaries. Controls SC-8 and SC-28 protect information in transit and at rest respectively, requiring TLS for network communications and encrypted storage for sensitive data.
Part 3: Per-Application Complete Production Manifests
PostgreSQL: Complete Production Deployment
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-config
    data:
      postgresql.conf: |
        # PostgreSQL Configuration
        max_connections = 100
        shared_buffers = 256MB
        effective_cache_size = 1GB
        maintenance_work_mem = 64MB
        work_mem = 2621kB
        log_statement = 'all'
        log_duration = on
        log_connections = on
        log_disconnections = on
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: postgres-secret
    type: Opaque
    stringData:
      username: postgres
      password: "YOUR_STRONG_PASSWORD_HERE"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 50Gi
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: postgres
      labels:
        app: postgres
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: postgres
      template:
        metadata:
          labels:
            app: postgres
          annotations:
            # Legacy annotation; superseded by securityContext.seccompProfile
            seccomp.security.alpha.kubernetes.io/pod: runtime/default
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: postgres
              image: cleanstart/postgresql:15-prod@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              # These three fields are only valid at container level
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              ports:
                - containerPort: 5432
                  name: postgres
                  protocol: TCP
              env:
                - name: POSTGRES_DB
                  value: "production"
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: username
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
                - name: POSTGRES_INITDB_ARGS
                  value: "-c shared_buffers=256MB -c max_connections=100"
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
                - name: socket
                  mountPath: /var/run/postgresql
                - name: tmp
                  mountPath: /tmp
                - name: shm
                  mountPath: /dev/shm
                - name: config
                  mountPath: /etc/postgresql
                  readOnly: true
              resources:
                requests:
                  memory: 1Gi
                  cpu: 500m
                  ephemeral-storage: 2Gi
                limits:
                  memory: 4Gi
                  cpu: 2
                  ephemeral-storage: 5Gi
              livenessProbe:
                tcpSocket:
                  port: 5432
                initialDelaySeconds: 30
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              readinessProbe:
                tcpSocket:
                  port: 5432
                initialDelaySeconds: 10
                periodSeconds: 5
                timeoutSeconds: 3
                failureThreshold: 2
              lifecycle:
                preStop:
                  exec:
                    command: ["/opt/cleanimg/cleanimg-init", "graceful-shutdown"]
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: postgres-data
            - name: socket
              emptyDir:
                sizeLimit: 10Mi
            - name: tmp
              emptyDir:
                sizeLimit: 5Gi
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 1Gi
            - name: config
              configMap:
                name: postgres-config
          terminationGracePeriodSeconds: 30
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: postgres
    spec:
      selector:
        app: postgres
      ports:
        - port: 5432
          targetPort: 5432
          protocol: TCP
      type: ClusterIP

The PostgreSQL deployment above demonstrates how to run a stateful database with read-only root filesystems and non-root users. Data persistence is handled through a PersistentVolumeClaim mounted at /var/lib/postgresql/data. Runtime directories for sockets, temporary files, and shared memory use emptyDir volumes to provide writable space without compromising the immutable root filesystem.
Configuration is mounted from a ConfigMap as read-only, ensuring that the application cannot accidentally modify its own settings. The database runs as a non-root user (65532) with all capabilities dropped, following the security model outlined in Part 2.
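Because every writable mount is an exception to the read-only model, it can help to list them explicitly when reviewing a manifest. The sketch below assumes the pod spec has already been parsed into a Python dict (for example with a YAML library or from kubectl JSON output); the function name is illustrative.

```python
def writable_mounts(pod_spec):
    """Return sorted (container, mountPath, volume kind) for writable mounts."""
    kinds = {}
    for volume in pod_spec.get("volumes") or []:
        # A volume entry is a name plus one source key (emptyDir, configMap, ...).
        source_keys = [key for key in volume if key != "name"]
        kinds[volume["name"]] = source_keys[0] if source_keys else "unknown"

    result = []
    for container in pod_spec.get("containers") or []:
        for mount in container.get("volumeMounts") or []:
            # readOnly defaults to false when the field is omitted.
            if not mount.get("readOnly", False):
                kind = kinds.get(mount["name"], "undefined")
                result.append((container["name"], mount["mountPath"], kind))
    return sorted(result)
```

For the PostgreSQL pod above, this would be expected to list /dev/shm, /tmp, /var/lib/postgresql/data, and /var/run/postgresql, and to omit the read-only ConfigMap mount at /etc/postgresql.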
Redis: Complete Production Deployment (Cache Mode)
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: redis-config
    data:
      redis.conf: |
        maxmemory 2gb
        maxmemory-policy allkeys-lru
        timeout 0
        tcp-keepalive 300
        loglevel notice
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis-cache
      labels:
        app: redis-cache
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
      selector:
        matchLabels:
          app: redis-cache
      template:
        metadata:
          labels:
            app: redis-cache
          annotations:
            # Legacy annotation; superseded by securityContext.seccompProfile
            seccomp.security.alpha.kubernetes.io/pod: runtime/default
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: redis
              image: cleanstart/redis:7-prod@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              # These three fields are only valid at container level
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              ports:
                - containerPort: 6379
                  name: redis
                  protocol: TCP
              volumeMounts:
                - name: data
                  mountPath: /data
                - name: tmp
                  mountPath: /tmp
                - name: socket
                  mountPath: /var/run/redis
                - name: config
                  mountPath: /etc/redis
                  readOnly: true
              resources:
                requests:
                  memory: 512Mi
                  cpu: 250m
                  ephemeral-storage: 1Gi
                limits:
                  memory: 2Gi
                  cpu: 1
                  ephemeral-storage: 3Gi
              livenessProbe:
                tcpSocket:
                  port: 6379
                initialDelaySeconds: 10
                periodSeconds: 5
                timeoutSeconds: 3
                failureThreshold: 3
              readinessProbe:
                tcpSocket:
                  port: 6379
                initialDelaySeconds: 5
                periodSeconds: 3
                timeoutSeconds: 2
                failureThreshold: 2
          volumes:
            - name: data
              emptyDir:
                sizeLimit: 10Gi
            - name: tmp
              emptyDir:
                sizeLimit: 1Gi
            - name: socket
              emptyDir:
                sizeLimit: 10Mi
            - name: config
              configMap:
                name: redis-config
          terminationGracePeriodSeconds: 10
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: redis-cache
    spec:
      selector:
        app: redis-cache
      ports:
        - port: 6379
          targetPort: 6379
          protocol: TCP
      type: ClusterIP

The Redis cache deployment is configured for stateless operation since cache data is ephemeral. Three replicas are deployed for redundancy and load distribution.
The read-only root filesystem is maintained while allowing Redis to store cache data in an emptyDir volume. Configuration is provided via ConfigMap and mounted as read-only to prevent runtime modifications. The deployment uses TCP socket probes for health checks rather than shell commands, respecting the shell-less constraint.
Kafka: Complete Production Deployment
A Kafka StatefulSet requires special handling because each broker needs consistent identity and persistent storage. The configuration below shows how to run Kafka with read-only root filesystems and shell-less containers while maintaining the requirements for stateful workloads.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kafka-config
    data:
      server.properties: |
        broker.rack=rack1
        num.network.threads=8
        num.io.threads=8
        socket.send.buffer.bytes=102400
        socket.receive.buffer.bytes=102400
        socket.request.max.bytes=104857600
        log.dirs=/var/lib/kafka/data
        num.partitions=3
        num.recovery.threads.per.data.dir=1
        # Must not exceed the broker count (replicas: 1 below)
        offsets.topic.replication.factor=1
        transaction.state.log.replication.factor=1
        transaction.state.log.min.isr=1
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: kafka-logs
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 700Gi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: kafka
    spec:
      serviceName: kafka
      replicas: 1
      selector:
        matchLabels:
          app: kafka
      template:
        metadata:
          labels:
            app: kafka
          annotations:
            # Legacy annotation; superseded by securityContext.seccompProfile
            seccomp.security.alpha.kubernetes.io/pod: runtime/default
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: kafka
              image: cleanstart/kafka:3.5-prod@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              # These three fields are only valid at container level
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              ports:
                - containerPort: 9092
                  name: kafka
                  protocol: TCP
              env:
                - name: KAFKA_BROKER_ID
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: KAFKA_ADVERTISED_HOSTNAME
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
                  value: "1"
                - name: KAFKA_LOG_RETENTION_DAYS
                  value: "7"
                - name: KAFKA_LOG_SEGMENT_BYTES
                  value: "1073741824"
              volumeMounts:
                - name: logs
                  mountPath: /var/lib/kafka/data
                - name: tmp
                  mountPath: /tmp
                - name: config
                  mountPath: /etc/kafka
                  readOnly: true
              resources:
                requests:
                  memory: 2Gi
                  cpu: 1
                  ephemeral-storage: 1Gi
                limits:
                  memory: 4Gi
                  cpu: 2
                  ephemeral-storage: 5Gi
              livenessProbe:
                tcpSocket:
                  port: 9092
                initialDelaySeconds: 30
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              readinessProbe:
                tcpSocket:
                  port: 9092
                initialDelaySeconds: 15
                periodSeconds: 5
                timeoutSeconds: 3
                failureThreshold: 2
          volumes:
            - name: logs
              persistentVolumeClaim:
                claimName: kafka-logs
            - name: tmp
              emptyDir:
                sizeLimit: 70Gi
            - name: config
              configMap:
                name: kafka-config
          terminationGracePeriodSeconds: 30
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: kafka
    spec:
      clusterIP: None
      selector:
        app: kafka
      ports:
        - port: 9092
          targetPort: 9092
          name: kafka

Kafka brokers maintain state that must persist across restarts, so the StatefulSet uses a PersistentVolumeClaim for the log directory. The StatefulSet naming convention provides consistent pod identities (kafka-0, kafka-1, etc.), which are used to determine the broker ID through the metadata.name field reference. The large emptyDir allocation for /tmp (70Gi) accommodates Kafka's temporary file creation needs on a separate writable volume, so the root filesystem itself stays read-only. Because emptyDir usage counts against the container's ephemeral-storage limit (5Gi here), the two values should be sized together.
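Note that metadata.name yields a string such as kafka-0 rather than a number. The source does not say how the image turns that into a numeric broker ID, so the following is only a sketch of how a hypothetical entrypoint might derive it from the StatefulSet ordinal; the function name is an assumption.

```python
import re

# StatefulSet pods are named <statefulset-name>-<ordinal>, e.g. kafka-0.
_ORDINAL = re.compile(r"^(?P<base>.+)-(?P<ordinal>\d+)$")


def broker_id_from_pod_name(pod_name: str) -> int:
    """Extract the trailing StatefulSet ordinal from a pod name."""
    match = _ORDINAL.match(pod_name)
    if match is None:
        raise ValueError(f"{pod_name!r} does not end in a StatefulSet ordinal")
    return int(match.group("ordinal"))
```

For example, broker_id_from_pod_name("kafka-0") returns 0 and broker_id_from_pod_name("kafka-12") returns 12, while a name without an ordinal raises ValueError instead of silently defaulting.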
Nginx: Complete Production Deployment
Nginx serves as a reverse proxy and load balancer in many architectures. The configuration below demonstrates how to run Nginx with read-only root filesystems while allowing it to cache responses and maintain active connections.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx-config
    data:
      nginx.conf: |
        user www-data www-data;
        worker_processes auto;
        pid /var/run/nginx/nginx.pid;

        events {
          worker_connections 1024;
          use epoll;
        }

        http {
          include /etc/nginx/mime.types;
          default_type application/octet-stream;

          log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';

          access_log /dev/stdout main;
          error_log /dev/stderr warn;

          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          keepalive_timeout 65;
          types_hash_max_size 2048;
          client_max_body_size 20M;

          gzip on;
          gzip_vary on;
          gzip_min_length 1000;
          gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml+rss;

          server {
            listen 8080 default_server;
            listen [::]:8080 default_server;
            server_name _;

            location / {
              proxy_pass http://backend:8000;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_set_header X-Forwarded-Proto $scheme;
            }

            location /health {
              access_log off;
              return 200 "ok\n";
            }
          }
        }
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
      template:
        metadata:
          labels:
            app: nginx
          annotations:
            # Legacy annotation; superseded by securityContext.seccompProfile
            seccomp.security.alpha.kubernetes.io/pod: runtime/default
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: nginx
              image: cleanstart/nginx:1.25-prod@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              # These three fields are only valid at container level
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              ports:
                - containerPort: 8080
                  name: http
                  protocol: TCP
              volumeMounts:
                - name: config
                  mountPath: /etc/nginx/nginx.conf
                  subPath: nginx.conf
                  readOnly: true
                - name: cache
                  mountPath: /var/cache/nginx
                - name: run
                  mountPath: /var/run/nginx
                - name: tmp
                  mountPath: /tmp
              resources:
                requests:
                  memory: 256Mi
                  cpu: 100m
                  ephemeral-storage: 500Mi
                limits:
                  memory: 512Mi
                  cpu: 500m
                  ephemeral-storage: 2Gi
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                timeoutSeconds: 2
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8080
                  scheme: HTTP
                initialDelaySeconds: 2
                periodSeconds: 3
                timeoutSeconds: 1
                failureThreshold: 2
          volumes:
            - name: config
              configMap:
                name: nginx-config
            - name: cache
              emptyDir:
                sizeLimit: 5Gi
            - name: run
              emptyDir:
                sizeLimit: 10Mi
            - name: tmp
              emptyDir:
                sizeLimit: 100Mi
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values:
                            - nginx
                    topologyKey: kubernetes.io/hostname
          terminationGracePeriodSeconds: 10
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
    spec:
      selector:
        app: nginx
      ports:
        - port: 80
          targetPort: 8080
          protocol: TCP
      type: LoadBalancer

The Nginx configuration logs directly to stdout and stderr rather than writing to disk, which respects the read-only root filesystem constraint. Cache directories are mounted as ephemeral emptyDir volumes, allowing Nginx to maintain performance through caching without compromising the immutable root filesystem. The deployment includes pod anti-affinity rules to distribute Nginx instances across different nodes, improving availability and resilience.
Python Web App: Complete Production Deployment
Python web applications often require database migrations and other initialization steps before the main application can start. The configuration below shows how to handle init containers while maintaining security constraints throughout the deployment lifecycle.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-config
    data:
      app.env: |
        ENVIRONMENT=production
        LOG_LEVEL=info
        DEBUG=false
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: python-app
      labels:
        app: python-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: python-app
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      template:
        metadata:
          labels:
            app: python-app
          annotations:
            # Legacy annotation; superseded by securityContext.seccompProfile
            seccomp.security.alpha.kubernetes.io/pod: runtime/default
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          initContainers:
            - name: migrate
              image: myregistry/python-app:1.0.0@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              securityContext:
                runAsUser: 65532
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              command:
                - python
              args:
                - /app/manage.py
                - migrate
                - --no-input
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: app-secrets
                      key: database-url
                - name: DJANGO_SETTINGS_MODULE
                  value: "config.settings.production"
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
              resources:
                requests:
                  memory: 256Mi
                  cpu: 250m
                  ephemeral-storage: 500Mi
                limits:
                  memory: 512Mi
                  cpu: 1
                  ephemeral-storage: 1Gi
          containers:
            - name: app
              image: myregistry/python-app:1.0.0@sha256:abc123def456...
              imagePullPolicy: IfNotPresent
              # These three fields are only valid at container level
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              env:
                - name: ENVIRONMENT
                  value: "production"
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: app-secrets
                      key: database-url
                - name: REDIS_URL
                  value: "redis://redis-cache:6379/0"
                - name: SECRET_KEY
                  valueFrom:
                    secretKeyRef:
                      name: app-secrets
                      key: secret-key
                - name: DJANGO_SETTINGS_MODULE
                  value: "config.settings.production"
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
                - name: cache
                  mountPath: /app/cache
                - name: config
                  mountPath: /etc/app
                  readOnly: true
              resources:
                requests:
                  memory: 512Mi
                  cpu: 250m
                  ephemeral-storage: 500Mi
                limits:
                  memory: 1Gi
                  cpu: 1
                  ephemeral-storage: 2Gi
              livenessProbe:
                httpGet:
                  path: /health/live
                  port: 8000
                  scheme: HTTP
                initialDelaySeconds: 15
                periodSeconds: 10
                timeoutSeconds: 2
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /health/ready
                  port: 8000
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 3
                timeoutSeconds: 2
                failureThreshold: 2
          volumes:
            - name: tmp
              emptyDir:
                sizeLimit: 1Gi
            - name: cache
              emptyDir:
                sizeLimit: 500Mi
            - name: config
              configMap:
                name: app-config
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values:
                            - python-app
                    topologyKey: kubernetes.io/hostname
          terminationGracePeriodSeconds: 30
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: python-app
    spec:
      selector:
        app: python-app
      ports:
        - port: 8000
          targetPort: 8000
          protocol: TCP
      type: ClusterIP

The Python application deployment uses an initContainer to run database migrations before the main application starts. This ensures that the database schema is up-to-date without requiring manual intervention. The initContainer runs with the same security constraints as the main container, enforcing the security model from initialization through runtime.
The main application container uses HTTP health checks to verify liveness and readiness, enabling Kubernetes to automatically restart failed instances and route traffic only to healthy replicas.
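The probes above expect the application to answer on /health/live and /health/ready. The source does not show how the application implements them, so the following is a framework-independent sketch using only the Python standard library. The READY flag and make_server helper are illustrative names; a Django application would more likely expose equivalent views through its own URL configuration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work (for example, opening the database pool) has finished.
READY = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            # Liveness: the process is up and able to answer requests.
            self._reply(200, {"status": "alive"})
        elif self.path == "/health/ready":
            # Readiness: only accept traffic once dependencies are available.
            if READY.is_set():
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "starting"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep probe traffic out of the application log.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Build the health server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Separating the two endpoints lets Kubernetes withhold traffic during startup (503 from /health/ready) without restarting a process that is merely still initializing.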
Part 4: Verification and Compliance Checks
Checklist: Is My Manifest Production-Ready?
The following comprehensive checklist guides you through verifying that your Kubernetes manifests implement all required security controls. Start with the SecurityContext section and confirm that runAsNonRoot is set to true. Check that runAsUser is set to a specific non-root ID like 65532, and verify that runAsGroup matches this value. Ensure that fsGroup is set to the same value for proper volume permission handling. Confirm that readOnlyRootFilesystem is set to true, making the filesystem immutable. Check that allowPrivilegeEscalation is set to false to prevent privilege escalation attacks. Verify that capabilities.drop is set to ["ALL"] to remove all Linux capabilities. Confirm that seccompProfile.type is set to RuntimeDefault to apply syscall filtering.
Review your storage configuration to ensure that persistent data uses PersistentVolumeClaims for long-term storage needs. Verify that temporary data uses emptyDir for transient storage that's lost when the pod terminates. Confirm that sensitive temporary data uses emptyDir with medium: Memory for tmpfs mounting to keep data in RAM only. Check that configuration uses ConfigMap mounts with readOnly: true to prevent modification. Verify that secrets use Secret mounts with readOnly: true. Ensure that no writable system directories are mounted into the container.
Examine your ENTRYPOINT and CMD instructions to confirm that ENTRYPOINT uses exec form (array notation) rather than shell form. Verify that neither your ENTRYPOINT nor any startup script it calls invokes /bin/sh, since that binary does not exist in a shell-less image. Check that complex initialization uses cleanimg-init.toml for declarative setup. Verify that database migrations and setup tasks use initContainers rather than embedding them in the main container logic.
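The same exec-form versus shell-form distinction applies to any child processes the application starts. In Python, passing an argument list with shell=False spawns the program directly, whereas shell=True runs the command through /bin/sh on POSIX systems and would fail in a shell-less image. A minimal sketch, with an illustrative helper name:

```python
import subprocess
import sys


def run_exec_form(argv):
    """Run argv directly, without a shell, and return its standard output."""
    completed = subprocess.run(
        argv,
        shell=False,          # no /bin/sh involved; argv[0] is executed directly
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Shell metacharacters are passed through as literal text, not interpreted.
literal = run_exec_form(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "$HOME; echo pwned"]
)
```

Because no shell parses the arguments, the string "$HOME; echo pwned" reaches the child process unchanged, which also removes a common command-injection vector.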
Confirm that livenessProbe is defined for your container to detect and restart failed processes. Verify that readinessProbe is defined to control traffic routing. Ensure that probes use HTTP, TCP, or direct cleanimg-init endpoints rather than shell exec probes. Review the initialDelaySeconds values to ensure they're appropriate (30+ seconds for databases, 10+ for applications). Verify that probes have reasonable timeout values (2-5 seconds).
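These probe checks can be automated against a container definition parsed into a Python dict. The sketch below is one possible reading of the checklist: it treats an exec probe as a problem only when its command starts with a shell, and it uses the 2-5 second timeout range suggested above. The function name and the set of shell names are assumptions.

```python
import os

_SHELLS = {"sh", "bash", "ash", "dash"}


def check_probes(container):
    """Return findings for a container's liveness and readiness probes."""
    findings = []
    for kind in ("livenessProbe", "readinessProbe"):
        probe = container.get(kind)
        if probe is None:
            findings.append(f"{kind} is missing")
            continue

        command = (probe.get("exec") or {}).get("command") or []
        if command and os.path.basename(command[0]) in _SHELLS:
            findings.append(f"{kind} runs a shell, which a shell-less image lacks")

        # Kubernetes defaults timeoutSeconds to 1 when the field is omitted.
        timeout = probe.get("timeoutSeconds", 1)
        if not 2 <= timeout <= 5:
            findings.append(f"{kind} timeoutSeconds={timeout} is outside 2-5")
    return findings
```

An empty list means both probes are present, avoid shell commands, and use timeouts within the suggested range.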
Check that resources.requests.memory is set to an appropriate value based on expected memory usage. Verify that resources.requests.cpu is set based on expected computational load. Ensure that resources.requests.ephemeral-storage is set to prevent disk exhaustion. Confirm that resources.limits.memory is set, typically 2-4 times the request value. Verify that resources.limits.cpu is set based on maximum acceptable usage. Check that resources.limits.ephemeral-storage is set to 2-3 times the request value.
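The resource checks above might translate into a block like the following. The absolute numbers are illustrative assumptions. Only the ratios (memory limit at 2 times the request, ephemeral-storage limit at 2 times the request) follow the guidance above.

```yaml
# Sketch only: size the values from observed usage, not from this example.
apiVersion: v1
kind: Pod
metadata:
  name: resources-example
spec:
  containers:
    - name: app
      image: "cleanstart/example-app@sha256:<digest>"   # placeholder
      resources:
        requests:
          memory: 256Mi
          cpu: 250m
          ephemeral-storage: 1Gi
        limits:
          memory: 512Mi             # 2x the request
          cpu: "1"
          ephemeral-storage: 2Gi    # 2x the request
```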
Review capabilities to ensure NET_RAW is dropped if not needed for your workload. Verify that NetworkPolicy is implemented if multiple applications share the cluster. Confirm that internal services use ClusterIP rather than exposing on public IPs.
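As a sketch of the network checks above, the manifests below restrict ingress to one application and expose it only inside the cluster. The namespace, labels, and ports are hypothetical, and the NetworkPolicy only takes effect when the cluster's network plugin enforces NetworkPolicy resources.

```yaml
# Sketch only: namespace, labels, and ports are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-app-ingress
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: example-app
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:              # only the frontend pods in this namespace
            matchLabels:
              app: example-frontend
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: example
spec:
  type: ClusterIP                   # reachable only from inside the cluster
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080
```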
Examine image references to confirm they come from cleanstart/* or are built from cleanstart base images. Verify that images are pinned by SHA256 digest rather than by tag, so that a retagged image cannot change what gets deployed. Confirm that images have been scanned for vulnerabilities before being pushed to the registry. Review image signature verification settings in your admission webhook configuration.
Verify that application logs are written to stdout and stderr where they can be collected by Kubernetes. Check that log level is set to info or warn (not debug in production) to reduce noise and security risks. Review application code to ensure sensitive data like passwords and API keys are never logged.
Audit Commands (kubectl)
The following kubectl commands help you audit your cluster to identify non-compliant deployments. To find all Deployments that don't have a read-only root filesystem, execute:
kubectl get deployments -A -o json | \
  jq -r '.items[] | select(any(.spec.template.spec.containers[]; .securityContext.readOnlyRootFilesystem != true)) | "\(.metadata.namespace)/\(.metadata.name)"'

To identify Deployments that don't enforce non-root execution at the pod level, run:
kubectl get deployments -A -o json | \
  jq -r '.items[] | select(.spec.template.spec.securityContext.runAsNonRoot != true) | "\(.metadata.namespace)/\(.metadata.name)"'

To find pods with containers that haven't dropped all capabilities, execute:
kubectl get pods -A -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; (.securityContext.capabilities.drop // []) | index("ALL") | not)) | "\(.metadata.namespace)/\(.metadata.name)"'

To discover pods using images that likely contain shells (common base images like ubuntu and debian), run:
kubectl get pods -A -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .image | test("(^|/)(ubuntu|debian)(:|@|$)"))) | "\(.metadata.namespace)/\(.metadata.name)"'

Summary: The Production Security Model
| Property | Status | Benefit |
|---|---|---|
| Read-only filesystem | ✅ Enabled | About 50% fewer viable attack paths in the Part 1 model (~400 to ~200); no writes to the root filesystem |
| Shell-less image | ✅ Enabled | 99% reduction in command-injection attacks |
| Non-root user | ✅ Enforced | A compromised process starts without root privileges, limiting what an exploit can do |
| Dropped capabilities | ✅ Enabled | All Linux capabilities removed, including the default set the container runtime would otherwise grant |
| Immutable configuration | ✅ ConfigMap RO mounts | Can't modify app behavior at runtime |
| Health checks | ✅ Defined | Automatic restart on failure |
| Signal handling | ✅ cleanimg-init or direct app | Graceful shutdown, no orphaned processes |
| Ephemeral storage isolation | ✅ emptyDir + tmpfs | No persistence of attacker payloads |
| Persistent storage isolation | ✅ PVC + fsGroup | Group-based access control; encryption at rest where the storage backend provides it |
This combination of read-only filesystems, shell-less containers, non-root user execution, dropped capabilities, and proper health checking represents the modern production baseline for Kubernetes deployments. Every application deployed to Kubernetes in a production environment should implement these security controls as a foundational layer of defense. The controls are not optional additions but rather essential elements of responsible cloud-native application deployment.
