Enforcing Immutability: Read-Only Root with Ephemeral Storage
Read-only filesystems are fundamental to modern container security, but they require careful orchestration with ephemeral and persistent storage to work correctly in production. Immutability makes tampering attempts fail visibly, limits operator drift, and provides supporting evidence for the integrity controls in PCI-DSS, SOC 2, and FedRAMP.
Part 1: The Immutability Principle
Why the Root Filesystem MUST Be Read-Only
The immutability principle is not just about security; it enforces a contract between the application, the platform, and the compliance regime. Understanding the reasoning behind this requirement helps teams adopt it with confidence.
A read-only root filesystem functions as a tamper seal. When an attacker attempts to write to the root filesystem, the write fails immediately with a "Read-only file system" error, and any error the application logs becomes evidence of the attempt. Without this seal, attackers can modify application code, inject malware, alter logs after the fact, and potentially cover their tracks, leaving no trace of compromise. The seal covers only the root filesystem: explicitly mounted writable volumes remain writable and need their own controls.
Container drift is an operational problem that a read-only root filesystem largely eliminates. Without read-only enforcement, an operator can use kubectl exec to edit a config file inside a running container. This creates a divergence between the running instance and the image it was launched from, making reproducibility and debugging much harder. When configuration changes are made directly to a running container rather than through CI/CD pipelines, you lose the record of what changed, when, and why. Read-only enforcement pushes every change to image content through the CI/CD pipeline, where each modification is traceable.
Compliance frameworks recognize the value of this control. PCI-DSS, SOC 2, and FedRAMP all include controls for file integrity monitoring and prevention of unauthorized modification (for example, PCI-DSS Requirement 11.5 on change detection and NIST SP 800-53 SI-7, which FedRAMP inherits). Control identifiers vary by framework version, so confirm the exact mapping with your assessor. A read-only root filesystem supports these controls because no process in the container can modify the operating system files or system binaries shipped in the image; it complements file integrity monitoring rather than replacing it. Every production deployment should set readOnlyRootFilesystem: true in each container's securityContext as a foundational security requirement.
What Happens at Container Start: OCI Layers
When a Kubernetes pod starts a container from a CleanStart -prod image, a precise sequence of layering events ensures security while maintaining functionality.
The base image, such as cleanstart/python:3.11-prod or cleanstart/postgresql:15-prod, consists of immutable OCI layers that are stacked in a read-only union filesystem. Each layer is identified by a SHA256 digest that is verified when the image is pulled from the registry, so the content that reaches the node is exactly the content that was built. The layers themselves are never modified at runtime; any writes land in a separate layer. This layered architecture is the foundation of container portability and reproducibility.
By default, without explicit securityContext configuration, the container runtime adds a thin, writable copy-on-write layer on top of the image layers. This layer allows applications to write to paths like /tmp, /var/log, and /var/run transparently. It is ephemeral: it exists only for the lifetime of the container and is discarded when the container is removed, so nothing written there survives a restart.
However, when you set readOnlyRootFilesystem: true in the container's securityContext, the runtime mounts the container's root filesystem read-only. Any attempt to write to it fails immediately with a "Read-only file system" error. Applications that require writable state must use explicit volume mounts (emptyDir for temporary data, memory-backed emptyDir, that is tmpfs, for sensitive temporary data, or a PVC for durable state) to function correctly. This forces a clear separation between immutable image content and application state.
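As a minimal sketch of the configuration described above, the following Pod sets readOnlyRootFilesystem on the container and supplies a writable /tmp through an emptyDir. The pod name, container name, image reference, and size are hypothetical placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readonly-demo                        # hypothetical name
spec:
  containers:
  - name: app                                # hypothetical container
    image: registry.example.com/app:1.0.0    # placeholder image
    securityContext:
      readOnlyRootFilesystem: true           # container-level field
    volumeMounts:
    - name: tmp
      mountPath: /tmp                        # the only writable path in this sketch
  volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 100Mi
```

With this in place, a write under /tmp should succeed while a write anywhere else in the container fails with a "Read-only file system" error.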
Image Layers vs. Runtime State: The Distinction
Understanding the boundary between image layers and runtime state is essential for designing applications that work well with read-only filesystems.
Image layers form the immutable foundation of every container. These layers are baked into the OCI image at build time and are verified against their SHA256 digests when pulled from a registry, providing cryptographic proof of integrity (proof of authenticity additionally requires image signatures). Once the image is built, these layers cannot be modified at runtime. The image layers encompass the operating system, language runtime, application code, and management utilities like cleanimg-init. Together they represent a complete, reproducible definition of the container's software stack.
Runtime state, in contrast, is everything that the application creates, modifies, or generates while running. This state exists in explicitly mounted volumes rather than in the image layers. Ephemeral state is deleted when the pod terminates and is typically stored in emptyDir volumes or tmpfs mounts. Persistent state survives pod restarts and cluster failures and is stored in PersistentVolumeClaims that bind to underlying storage infrastructure.
For PostgreSQL as a concrete example, the immutable image layer contains /usr/lib/postgresql/15/bin/postgres—the database binary itself. The runtime state includes /var/lib/postgresql/data (the actual database files, stored on a persistent PVC), /var/run/postgresql (Unix sockets created at startup, stored on an ephemeral emptyDir), and /tmp (temporary space for queries, stored on ephemeral tmpfs). This separation ensures that the binary is immutable while allowing the application to manage its own runtime data without ever needing to write to system directories.
Part 2: Ephemeral Storage Deep Dive
emptyDir: Lifecycle Tied to Pod
An emptyDir volume is a temporary directory created on the node's local disk when the pod is assigned to the node and deleted when the pod is removed from that node, regardless of how its containers exited. The volume can be mounted by every container in the pod, making it useful for inter-process communication or shared working space. An emptyDir survives container crashes and restarts within the same pod, but the data disappears with the pod, so it is appropriate only for state that does not need to outlive the pod.
Common use cases for emptyDir include temporary files and caches that improve performance during execution, working directories for intermediate computations, session data that persists only for the current request, Unix sockets required by the application for IPC, and scratch space for database queries or other processing. In any of these scenarios, the ephemeral nature of the volume is a feature, not a limitation.
Performance characteristics of emptyDir are straightforward: the volume is stored on the node's local filesystem (typically SSD or HDD), so I/O performance depends directly on the node's disk hardware. Because data never travels over the network, there is no network overhead or latency, making emptyDir better suited to high-frequency reads and writes than network-attached persistent volumes.
Size management with emptyDir requires careful planning. You specify a size limit when declaring the volume:
    emptyDir:
      sizeLimit: 5Gi

If the volume's usage exceeds the size limit, the kubelet evicts the pod, and a controller such as a Deployment then creates a replacement. This forces you to be intentional about sizing. Base your estimate on the expected temporary data: 100MB typically suffices for most web applications, 1GB works for cache-intensive applications, and 5-10GB is appropriate for big data processing or data transformation workloads.
Example:
    volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 500Mi
    containers:
    - name: app
      volumeMounts:
      - name: tmp
        mountPath: /tmp

tmpfs (Memory-Backed): Faster, More Secure, Limited
tmpfs represents a fundamentally different storage approach from disk-backed emptyDir. A tmpfs volume is a temporary filesystem held in memory (RAM) rather than on the node's disk; in Kubernetes it is declared as an emptyDir with medium: Memory. To the application, it appears as a normal directory with standard filesystem semantics, but the entire contents exist in volatile memory.
tmpfs is the right choice for security-sensitive temporary data that must never touch persistent storage: encryption keys waiting to be used, decoded tokens that should not be persisted, intermediate values during cryptographic operations, and similar sensitive temporary values. Small temporary files that your compliance framework requires never hit disk should use tmpfs. High-performance caches requiring sub-millisecond latency are also excellent candidates. Shared memory buffers for multi-process applications like PostgreSQL parallel query execution are another important use case.
The security advantage of tmpfs is that data is not written to the node's disk (assuming swap is disabled on the node, as it is on most Kubernetes nodes). If the pod is removed or the node reboots, the data is lost immediately, which is exactly what you want for sensitive values. If the node is physically attacked or decommissioned carelessly, tmpfs data cannot be recovered from disk because it was never written there. This property makes tmpfs well suited to sensitive operations like decryption, secret handling, and temporary credential management.
Performance characteristics of tmpfs are dramatically different from disk-backed emptyDir. Because all operations happen in memory, tmpfs is typically much faster than disk. However, memory used by tmpfs counts against the memory limit of the containers that write to it, so you must be careful with sizing. tmpfs is best suited to small or carefully bounded allocations where you control memory usage precisely.
Size management for tmpfs requires explicit declaration and understanding of your pod's memory constraints:
    emptyDir:
      medium: Memory
      sizeLimit: 256Mi

Kubernetes has no separate tmpfs volume type; a tmpfs volume is an emptyDir with medium: Memory. The volume must fit within the pod's memory limit. Its contents are counted as container memory usage, so filling a 256Mi tmpfs leaves your application 256Mi less memory for heap, stack, and other needs.
Example (PostgreSQL parallel query shared memory):
    volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
    containers:
    - name: postgres
      volumeMounts:
      - name: shm
        mountPath: /dev/shm

PersistentVolumeClaim: Surviving Restarts and Failures
A PersistentVolumeClaim (PVC) is a request for a persistent storage volume provisioned through your cluster's storage class. Most cloud-hosted clusters use block storage backends like Google Cloud Persistent Disks, AWS Elastic Block Store, or Azure Managed Disks. The critical property of a PVC is that it survives pod termination, node failure, and even cluster restarts. The underlying storage persists independently of any running container. Storage classes that set allowVolumeExpansion: true also let you grow capacity without recreating the volume.
PersistentVolumeClaims are appropriate for databases and data stores like PostgreSQL, Redis, Kafka, or MongoDB where data loss is catastrophic. They are necessary for logs that must be archived and analyzed, enabling compliance auditing and forensic investigation. Stateful application data that must persist across deployments requires PVCs. Backups and snapshots that need to outlive the pod also require persistent storage. In general, any state that would be unacceptable to lose should be stored on a PVC.
The lifecycle of a PVC differs significantly from ephemeral volumes. You create the PVC, and Kubernetes binds it to a PersistentVolume that is either provisioned dynamically by the storage class or pre-created by administrators. When a pod that uses the claim is scheduled, the volume is attached to that pod's node by the storage orchestration layer. Importantly, the PVC and its data remain when the pod is deleted. The volume is released only when the PVC itself is deleted, at which point the PersistentVolume's persistentVolumeReclaimPolicy (Retain or Delete) decides whether the underlying storage is kept. This persistence allows a new pod to mount the same PVC and access the previous pod's data, enabling failover and data continuity.
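The reclaim behavior is usually set on the StorageClass, which stamps it onto each dynamically provisioned PersistentVolume. The sketch below keeps the underlying disk when a claim is deleted; the class name and provisioner are placeholders, so substitute your cluster's CSI driver.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-ssd                  # hypothetical name
provisioner: csi.example.com          # placeholder; use your cluster's CSI driver
reclaimPolicy: Retain                 # becomes persistentVolumeReclaimPolicy on each PV
allowVolumeExpansion: true            # permits growing bound PVCs
volumeBindingMode: WaitForFirstConsumer
```

Retain trades convenience for safety: released volumes must be cleaned up manually, but deleting a claim by mistake does not destroy the data.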
Performance characteristics of PVCs vary depending on the storage backend. In cloud environments, persistent volumes are typically network-attached, with per-operation latency on the order of a millisecond for services such as GCP Persistent Disk and AWS EBS (exact figures depend on disk type and load), making them slower than local emptyDir or tmpfs. However, database engines mitigate this latency with caching, write-ahead logging, and prefetching. PVC performance is typically sufficient for production database workloads, and many organizations achieve multi-thousand transaction-per-second throughput with persistent volumes.
Storage sizing requires planning based on your data retention policy:
    spec:
      resources:
        requests:
          storage: 100Gi

Size your PVC based on your data retention and growth expectations. For PostgreSQL, allocate approximately 2x the current data size to account for write-ahead logs and overhead from index structures. For Redis, the size depends on whether you use RDB (snapshot-based) or AOF (append-only file) persistence; AOF requires approximately 1.5x the working dataset. For Kafka, calculate by multiplying your daily message throughput by the number of days of retention you require.
Example (PostgreSQL with 50GB storage):
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 50Gi

Decision Matrix: Which Storage for What?
| Use Case | emptyDir | tmpfs | PVC |
|---|---|---|---|
| App cache | ✅ | ❌ | ❌ |
| Sensitive temp data | ❌ | ✅ | ❌ |
| Logs (short-term) | ✅ | ❌ | ❌ |
| Logs (archived) | ❌ | ❌ | ✅ |
| Database data | ❌ | ❌ | ✅ |
| Session files | ✅ | ❌ | ❌ |
| Encryption keys (temp) | ❌ | ✅ | ❌ |
| Message queue (durable) | ❌ | ❌ | ✅ |
| Unix sockets | ✅ | ❌ | ❌ |
| Shared memory (IPC) | ❌ | ✅ | ❌ |
| Working directory | ✅ | ❌ | ❌ |
| Bulk data processing | ✅ | ❌ | ❌ |
Part 3: Per-Application Ephemeral Storage Patterns
PostgreSQL
PostgreSQL requires three distinct storage locations, each with different durability and performance characteristics.
The data directory at /var/lib/postgresql/data contains the database cluster: write-ahead log files and all user data. This location must be persistent, because any data loss is catastrophic. The size should be approximately 2x the largest expected database size to accommodate WAL files and overhead. Use the ReadWriteOnce access mode, because a PostgreSQL data directory must be opened by only one server instance at a time (which is why the manifest below runs a single replica rather than several instances sharing the same storage).
The Unix socket directory at /var/run/postgresql holds the server's Unix-domain socket (.s.PGSQL.5432 for the default port) and its lock file, used for local connections between client and server. This can be ephemeral since the socket is recreated automatically on startup. A 10MB emptyDir is ample for these files.
Temporary query space at /tmp serves as scratch space for large sorts, hash joins, and temporary tables during query execution. Note that PostgreSQL by default writes these spill files under the data directory (base/pgsql_tmp); they land on the /tmp volume only if you place a tablespace there and list it in temp_tablespaces. The required size depends on the largest single query your workload executes; 1-5GB is typical for OLTP (Online Transaction Processing) workloads. For analytical queries, you might need larger allocations.
Shared memory at /dev/shm (implemented as tmpfs) is used for parallel query execution and the dynamic_shared_memory_type = posix configuration. The size should be approximately 25% of your PostgreSQL shared_buffers setting, typically 1-2GB. This is only needed if you run parallel queries; it is highly recommended for OLAP (Online Analytical Processing) workloads that benefit from parallelism.
Complete Manifest:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 50Gi
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: postgres
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: postgres
      template:
        metadata:
          labels:
            app: postgres
        spec:
          securityContext:
            fsGroup: 999
          containers:
          - name: postgres
            image: cleanstart/postgresql:15-prod@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 5432
              name: postgres
            env:
            - name: POSTGRES_DB
              value: "production"
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: socket
              mountPath: /var/run/postgresql
            - name: tmp
              mountPath: /tmp
            - name: shm
              mountPath: /dev/shm
            resources:
              requests:
                memory: 1Gi
                cpu: 500m
                ephemeral-storage: 2Gi
              limits:
                memory: 4Gi
                cpu: 2
                # Disk-backed emptyDir usage counts toward this limit
                ephemeral-storage: 5Gi
            livenessProbe:
              tcpSocket:
                port: 5432
              initialDelaySeconds: 30
              periodSeconds: 10
            readinessProbe:
              exec:
                command:
                - /usr/lib/postgresql/15/bin/pg_isready
                - -h
                - localhost
              initialDelaySeconds: 10
              periodSeconds: 5
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: postgres-data
          - name: socket
            emptyDir:
              sizeLimit: 10Mi
          - name: tmp
            emptyDir:
              sizeLimit: 5Gi
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 1Gi

Redis
Redis has simpler storage requirements than databases like PostgreSQL, but the choice between ephemeral and persistent storage is significant.
The data directory at /data can use either ephemeral emptyDir or persistent PVC depending on your use case. If you are using Redis as a cache where data loss is acceptable (you can rebuild the cache from source), use an ephemeral emptyDir. If you are using Redis as a durable database where data loss is unacceptable, use a persistent PVC with either AOF (append-only file) or RDB (snapshot) persistence. For cache mode, allocate approximately 1x your expected hot dataset size. For database mode, allocate 1.5x if using AOF persistence (which grows larger) or 1x if using RDB-only snapshots.
Temporary files at /tmp are small and used for logging and temporary Redis operations. An emptyDir with 100MB is sufficient for most deployments.
Complete Manifest (Cache Mode with Ephemeral Storage):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis-cache
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: redis-cache
      template:
        metadata:
          labels:
            app: redis-cache
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            fsGroup: 65532
          containers:
          - name: redis
            image: cleanstart/redis:7-prod@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field, not a pod-level one
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 6379
              name: redis
            volumeMounts:
            - name: data
              mountPath: /data
            - name: tmp
              mountPath: /tmp
            - name: socket
              mountPath: /var/run/redis
            resources:
              requests:
                memory: 512Mi
                cpu: 250m
                ephemeral-storage: 1Gi
              limits:
                memory: 2Gi
                cpu: 1
                # emptyDir usage counts toward this limit; raise it if /data
                # is expected to approach its 10Gi sizeLimit
                ephemeral-storage: 3Gi
            livenessProbe:
              tcpSocket:
                port: 6379
              initialDelaySeconds: 10
              periodSeconds: 5
            readinessProbe:
              tcpSocket:
                port: 6379
              initialDelaySeconds: 5
              periodSeconds: 3
          volumes:
          - name: data
            emptyDir:
              sizeLimit: 10Gi
          - name: tmp
            emptyDir:
              sizeLimit: 100Mi
          - name: socket
            emptyDir:
              sizeLimit: 50Mi

Complete Manifest (Database Mode with Persistent Storage):
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: redis-data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 20Gi
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis-db
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: redis-db
      template:
        metadata:
          labels:
            app: redis-db
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            fsGroup: 65532
          containers:
          - name: redis
            image: cleanstart/redis:7-prod@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field, not a pod-level one
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 6379
              name: redis
            command:
            - redis-server
            args:
            - /etc/redis/redis.conf
            - --appendonly
            - "yes"
            volumeMounts:
            - name: config
              mountPath: /etc/redis
              readOnly: true
            - name: data
              mountPath: /data
            - name: tmp
              mountPath: /tmp
            resources:
              requests:
                memory: 1Gi
                cpu: 500m
                ephemeral-storage: 500Mi
              limits:
                memory: 4Gi
                cpu: 2
                ephemeral-storage: 1Gi
            livenessProbe:
              tcpSocket:
                port: 6379
              initialDelaySeconds: 10
              periodSeconds: 5
            readinessProbe:
              tcpSocket:
                port: 6379
              initialDelaySeconds: 5
              periodSeconds: 3
          volumes:
          - name: config
            configMap:
              name: redis-config
          - name: data
            persistentVolumeClaim:
              claimName: redis-data
          - name: tmp
            emptyDir:
              sizeLimit: 500Mi

Kafka
Kafka requires careful storage sizing due to its message retention model. The log directory at /var/lib/kafka/data contains all message log segments and must be persistent. The required size depends critically on your retention policy. For a 7-day retention window, multiply your daily throughput by 7. For 30-day retention, multiply by 30. As a concrete example, if your system processes 100GB of messages per day, you need 700GB for 7 days of retention or 3TB for 30 days.
Temporary files at /tmp are used during log compaction and temporary operations. Allocate approximately 10% of your total log data size; for 700GB of logs, 70GB of temporary space is appropriate.
Complete Manifest (Single Broker, 7-Day Retention):
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: kafka-logs
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 700Gi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: kafka
    spec:
      serviceName: kafka
      replicas: 1
      selector:
        matchLabels:
          app: kafka
      template:
        metadata:
          labels:
            app: kafka
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            fsGroup: 65532
          containers:
          - name: kafka
            image: cleanstart/kafka:3.5-prod@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field, not a pod-level one
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 9092
              name: kafka
            env:
            # metadata.name resolves to a string such as kafka-0; this assumes the
            # image's entrypoint derives a numeric broker ID from it
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: KAFKA_ADVERTISED_LISTENERS
              value: "PLAINTEXT://kafka-0.kafka:9092"
            - name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
              value: "1"
            - name: KAFKA_LOG_RETENTION_DAYS
              value: "7"
            volumeMounts:
            - name: logs
              mountPath: /var/lib/kafka/data
            - name: tmp
              mountPath: /tmp
            resources:
              requests:
                memory: 2Gi
                cpu: 1
                ephemeral-storage: 1Gi
              limits:
                memory: 4Gi
                cpu: 2
                # emptyDir usage counts toward this limit; raise it if /tmp
                # is expected to approach its 70Gi sizeLimit
                ephemeral-storage: 5Gi
            livenessProbe:
              tcpSocket:
                port: 9092
              initialDelaySeconds: 30
              periodSeconds: 10
            readinessProbe:
              tcpSocket:
                port: 9092
              initialDelaySeconds: 15
              periodSeconds: 5
          volumes:
          - name: logs
            persistentVolumeClaim:
              claimName: kafka-logs
          - name: tmp
            emptyDir:
              sizeLimit: 70Gi

Nginx
Nginx has minimal storage requirements reflecting its role as a stateless reverse proxy and load balancer.
The proxy cache directory at /var/cache/nginx is optional and only needed if you are using Nginx's proxy_cache directive to cache upstream responses. The cache is ephemeral and cleared on each restart. Size this directory based on your expected cached content volume; 1-10GB is typical for medium-traffic deployments.
The working directory at /var/run/nginx stores the PID file and lock files used internally by Nginx. 10MB is always sufficient for these small files.
Logs at /var/log/nginx are optional and depend on your logging architecture. If logs are sent to stdout/stderr (recommended for containerized Nginx), no mount is needed because the container runtime captures the output. If you write log files within the container, use an emptyDir of 100MB-1GB depending on your traffic volume (the decision matrix reserves tmpfs for sensitive data, and logs held in memory count against the memory limit), and ship the logs to a centralized logging service before the volume fills up.
Complete Manifest (No Persistent Storage):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            fsGroup: 65532
          containers:
          - name: nginx
            image: cleanstart/nginx:1.25-prod@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field, not a pod-level one
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 8080
              name: http
            volumeMounts:
            - name: config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
              readOnly: true
            - name: cache
              mountPath: /var/cache/nginx
            - name: run
              mountPath: /var/run/nginx
            - name: tmp
              mountPath: /tmp
            resources:
              requests:
                memory: 256Mi
                cpu: 100m
                ephemeral-storage: 500Mi
              limits:
                memory: 512Mi
                cpu: 500m
                # emptyDir usage counts toward this limit; raise it if the cache
                # is expected to approach its 5Gi sizeLimit
                ephemeral-storage: 2Gi
            livenessProbe:
              httpGet:
                path: /
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 5
            readinessProbe:
              httpGet:
                path: /
                port: 8080
              initialDelaySeconds: 2
              periodSeconds: 3
          volumes:
          - name: config
            configMap:
              name: nginx-config
          - name: cache
            emptyDir:
              sizeLimit: 5Gi
          - name: run
            emptyDir:
              sizeLimit: 10Mi
          - name: tmp
            emptyDir:
              sizeLimit: 100Mi

Custom Applications (Python, Node, Java, Go)
Most application frameworks have minimal storage requirements, making them simpler to deploy than stateful services.
Application code at / is baked into the image layer and is read-only. No storage mount is needed for application code.
Temporary files at /tmp are used by most frameworks for temporary caches and working files. An emptyDir with 100MB to 1GB depending on your workload is appropriate. Web frameworks often use /tmp for session management, file uploads, and template compilation.
Application cache at /app/cache or equivalent is sometimes needed by frameworks that maintain local caches. The required size depends on your application's design; some applications do not use local caches at all, while others benefit from large in-memory caches.
Complete Manifest (Python + Gunicorn):
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: python-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: python-app
      template:
        metadata:
          labels:
            app: python-app
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            fsGroup: 65532
          containers:
          - name: app
            image: myregistry/python-app:1.0.0@sha256:abc123...
            # readOnlyRootFilesystem is a container-level field, not a pod-level one
            securityContext:
              readOnlyRootFilesystem: true
            ports:
            - containerPort: 8000
              name: http
            env:
            - name: APP_ENV
              value: "production"
            volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /app/cache
            resources:
              requests:
                memory: 256Mi
                cpu: 250m
                ephemeral-storage: 500Mi
              limits:
                memory: 512Mi
                cpu: 1
                ephemeral-storage: 2Gi
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 10
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /ready
                port: 8000
              initialDelaySeconds: 5
              periodSeconds: 3
          volumes:
          - name: tmp
            emptyDir:
              sizeLimit: 1Gi
          - name: cache
            emptyDir:
              sizeLimit: 500Mi

Part 4: Storage Sizing Guide
Calculating emptyDir Sizes
Estimating ephemeral storage sizes requires analyzing your application's specific usage patterns and peak demands.
For web applications, start with a 100MB base to account for logs and request handling. Add 10MB per maximum concurrent request stream to account for file uploads and temporary processing of request data. Reserve an additional 10-20% of your application's working memory for application-level caches. The total emptyDir should be at least 100MB plus (max_concurrent_requests × 10MB) plus your cache estimate.
For data processing applications, the calculation is different: add the size of your largest input batch being staged for processing, plus 2-3x the size of the largest intermediate dataset you generate during processing, plus the size of the final result set being staged for output. The total is the sum of these three components.
As a concrete example, consider PostgreSQL with 4GB of shared buffers and large sort operations. Allocate 2-3GB for /tmp to handle large sorts and hash joins. Allocate 100MB for /var/run for Unix sockets and temporary files. The total ephemeral storage needed is 2.1-3.1GB before any safety buffer.
Kubernetes ephemeral-storage Resource Limits
Kubernetes tracks ephemeral storage (all writes outside persistent volumes) under the ephemeral-storage resource metric. This metric applies to all temporary storage within the container, separate from persistent volumes. You specify ephemeral-storage in both requests and limits, similar to how you specify CPU and memory.
The request for ephemeral storage tells the scheduler how much space it should assume the pod will use when making node selection decisions. Nodes with insufficient available ephemeral storage will not be selected for scheduling the pod.
The limit for ephemeral storage is the maximum space the pod is allowed to use. If the pod exceeds this limit, the kubelet evicts the pod from the node with little or no graceful shutdown period, which is disruptive. Setting limits carefully prevents these failures.
What counts toward a pod's ephemeral-storage usage includes disk-backed emptyDir volumes (their actual used space), the container's stdout/stderr logs that the runtime stores on the node, and writes to the container's copy-on-write layer. What does not count includes memory-backed emptyDir volumes (charged to memory instead), data on PersistentVolumes (charged separately to the storage system), and container images cached on the node, which consume node disk but are not attributed to any one pod.
Setting guidelines are practical: set the request to the sum of all emptyDir sizeLimit values plus a 20% buffer for operating system overhead and temporary growth. Set the limit to twice the request, allowing some bursting for temporary spikes while preventing complete exhaustion.
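To make the guideline concrete, here is a sketch for a hypothetical container that mounts two disk-backed emptyDir volumes of 1Gi and 500Mi. The sum is 1524Mi; adding the 20% buffer gives roughly 1829Mi for the request, and doubling that gives 3658Mi for the limit. The container name, volume names, and sizes are illustrative assumptions.

```yaml
# Fragment of a pod spec (containers and volumes sections only)
containers:
- name: app
  resources:
    requests:
      ephemeral-storage: 1829Mi   # (1024Mi + 500Mi) * 1.2, rounded up
    limits:
      ephemeral-storage: 3658Mi   # 2 x the request
  volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /app/cache
volumes:
- name: tmp
  emptyDir:
    sizeLimit: 1Gi
- name: cache
  emptyDir:
    sizeLimit: 500Mi
```

Because the limit sits above the combined sizeLimit values, a full volume triggers the per-volume limit before the pod-wide one.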
Monitoring Ephemeral Storage Usage
    # Show the node's ephemeral-storage capacity and allocatable amount
    # (kubectl top nodes reports only CPU and memory)
    kubectl describe node <node-name> | grep -i ephemeral-storage

    # Show the pod's configured ephemeral-storage requests and limits
    kubectl describe pod <pod-name> | grep -i ephemeral

    # Measure actual usage inside the container (requires du in the image)
    kubectl exec <pod> -- du -sh /tmp /var/log /app/cache

In Prometheus:

    container_fs_usage_bytes{pod!=""} / container_fs_limit_bytes{pod!=""}

What Happens When Ephemeral Storage is Exhausted
When a pod exceeds its ephemeral-storage limit, the service is disrupted. The kubelet measures disk usage periodically and, upon detecting an overage, evicts the pod; its containers are terminated with little or no graceful shutdown period. If the pod is managed by a controller such as a Deployment or StatefulSet, a replacement pod is created and scheduled, though not necessarily on a different node. The evicted pod is left in the Failed phase with reason Evicted and a status message stating that its ephemeral storage usage exceeded the limit, so you can trace the cause with kubectl describe pod or kubectl get events.
Preventing ephemeral-storage evictions requires a multi-faceted approach. Right-size your ephemeral-storage limits by measuring actual usage (for example with du inside the container or the container_fs_usage_bytes metric; kubectl top pods reports only CPU and memory) and understanding your peak temporary data needs. Use tmpfs instead of disk-backed emptyDir for small, truly temporary data when possible, which moves the load from disk to memory and counts against the memory limit instead. Implement log rotation if you log to emptyDir volumes, preventing logs from consuming unlimited space. Most importantly, design your application to clean up temporary files proactively rather than accumulating them until eviction occurs.
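One way to catch growth before an eviction is an alert on the usage ratio from the Monitoring section. The following Prometheus rule file is a sketch: the group name, alert name, 80% threshold, and 10-minute window are assumptions to tune. Note that these cAdvisor metrics describe the underlying filesystem rather than the pod's ephemeral-storage limit, and their availability depends on the container runtime.

```yaml
groups:
- name: ephemeral-storage                    # hypothetical group name
  rules:
  - alert: ContainerFilesystemNearlyFull     # hypothetical alert name
    # Fires when usage stays above 80% of the reported filesystem limit
    expr: container_fs_usage_bytes{pod!=""} / container_fs_limit_bytes{pod!=""} > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Container filesystem usage above 80% in pod {{ $labels.pod }}"
```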
Part 5: Security Implications
Why tmpfs Is More Secure Than emptyDir for Secrets
Disk-backed emptyDir and tmpfs have fundamentally different security implications for sensitive data. A disk-backed emptyDir is stored on the node's filesystem. Kubernetes deletes the directory when the pod is removed, but deletion does not securely wipe the underlying blocks, and the data survives a node reboot for as long as the pod remains assigned to the node. This means sensitive data can linger on disk after the container exits. If the node is decommissioned carelessly without proper data wiping, sensitive data could be recovered from disk by an attacker with physical access. In both cases a root user on the node can read a live volume's contents, so the difference lies in what remains on disk afterward, not in isolation from privileged node processes.
tmpfs offers stronger protection for sensitive data at rest. Data is stored only in RAM and is cleared when the pod is removed. If the node crashes or reboots, the data is unrecoverable because it was never persisted to disk. The key property is that the data is never written to persistent storage.
Security best practices are straightforward: use tmpfs for sensitive temporary values like decoded authentication tokens, intermediate cryptographic data, and temporary credentials. Use emptyDir for application cache and working files that do not contain sensitive data. Avoid logging secrets at all; anything written to stdout is captured by the container runtime and stored in log files on the node's disk. If sensitive values must be written out temporarily for debugging, write them to a tmpfs volume rather than to disk-backed storage.
Filesystem Permissions on Volumes
Filesystem permissions on volumes require careful configuration to ensure both security and functionality. The fsGroup setting in the security context sets the group ID for all mounted volumes, ensuring that non-root user processes can write to mounted volumes. When you set fsGroup: 65532, all files in mounted volumes have their group ownership set to 65532, allowing the unprivileged user running in the container to write to those volumes even though the user is not root.
Per-mount read-only configuration adds a further layer of security. You can mark individual volume mounts as readOnly: true to prevent accidental or malicious modification. Use readOnly mounts for configuration files, TLS certificates, and secrets that should never change during the container's lifetime. This restriction provides defense in depth: even if the application has a bug or is compromised, it cannot modify these critical files.
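The two settings discussed above can be combined in one pod spec. The following is a sketch only: the object names (app-config, app-tls), image, and mount paths are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-volumes             # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 65532
    runAsGroup: 65532
    fsGroup: 65532                   # group ownership applied to supported volumes
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: app-config
          mountPath: /etc/app        # assumed config path
          readOnly: true             # configuration must not change at runtime
        - name: tls
          mountPath: /etc/tls        # assumed certificate path
          readOnly: true
        - name: work
          mountPath: /var/cache/app  # writable working directory
  volumes:
    - name: app-config
      configMap:
        name: app-config             # assumed ConfigMap name
    - name: tls
      secret:
        secretName: app-tls          # assumed Secret name
    - name: work
      emptyDir: {}
```

Only the work volume is writable; everything else the container sees is either the read-only root or a read-only mount.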
PVC access modes specify how a persistent volume can be mounted. ReadWriteOnce allows the volume to be mounted read-write by a single node; several pods scheduled on that node can still share it. It is the typical mode for database volumes where only one instance should write. If you need to guarantee that exactly one pod can use the volume, use ReadWriteOncePod where your cluster and CSI driver support it. ReadOnlyMany allows many nodes to mount the volume read-only, which suits shared configuration or reference data that multiple pods need to read. ReadWriteMany allows many nodes to mount the volume read-write and is typically backed by a shared filesystem such as NFS. Avoid ReadWriteMany for security-critical data because multiple concurrent writers create synchronization challenges and complicate auditing.
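A claim for a single-writer database volume might look like the sketch below. The claim name, storage class name, and size are illustrative assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                      # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce                  # read-write mount from a single node
  storageClassName: encrypted-ssd    # assumed class name; use one defined in your cluster
  resources:
    requests:
      storage: 20Gi                  # illustrative size
```

Swapping the access mode for ReadWriteOncePod tightens the guarantee from one node to one pod, on clusters and drivers that support that mode.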
Encryption at Rest for PersistentVolumes
Encryption at rest for PersistentVolumes depends on your cloud provider and storage class. Many cloud providers encrypt block storage at rest transparently with provider-managed keys, without explicit configuration. On Google Cloud, for example, Persistent Disks are encrypted by default, and you can supply a customer-managed encryption key (CMEK) through the StorageClass, for instance with the disk-encryption-kms-key parameter of the Compute Engine Persistent Disk CSI driver. Within Kubernetes, a PVC requests an encrypted volume by referencing that class in storageClassName; a name such as pd-ssd-encrypted is one you choose yourself, not a built-in class.
For compliance frameworks like PCI-DSS or HIPAA, prefer customer-managed encryption keys (CMEK) over provider-managed keys. This gives you control over key rotation and access policies. To check which storage classes configure encryption, run kubectl get storageclass -o yaml | grep -i encrypt and inspect the parameters of each class, since parameter names vary by provisioner.
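As an illustration of a CMEK-backed class, the sketch below assumes GKE with the Compute Engine Persistent Disk CSI driver. The class name is hypothetical and the key path uses placeholders that must be replaced with a real Cloud KMS key; other providers use different provisioners and parameter names.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-encrypted             # user-chosen name, not a built-in class
provisioner: pd.csi.storage.gke.io   # assumes the GKE Persistent Disk CSI driver
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: pd-ssd
  # Placeholder key path; substitute your own project, location, key ring, and key.
  disk-encryption-kms-key: projects/KEY_PROJECT_ID/locations/LOCATION/keyRings/RING_NAME/cryptoKeys/KEY_NAME
```

Volumes provisioned through this class are encrypted with the referenced key, so revoking access to the key in Cloud KMS also cuts off access to the data.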
Part 6: Troubleshooting
Error: "Read-only file system"
When an application fails with a "Read-only file system" error, it is trying to write to a path that is not mounted as writable. The most common cause is that readOnlyRootFilesystem: true is correctly set in the SecurityContext, but the application's required writable path has not been mounted as a volume.
The solution is to add the required volume mount and corresponding volume definition. If your application needs to write to /tmp, add a volumeMount for /tmp and create an emptyDir volume. The mountPath must be the exact absolute path (or a parent directory of the path) that the application writes to. Some applications expect writable directories in non-standard locations like /var/cache/app or /app/data, so inspect the error message carefully to understand which specific path needs to be writable.
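A minimal sketch of the fix described above, assuming the application writes to /tmp and /var/cache/app. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readonly-root-app            # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true # root stays immutable
      volumeMounts:
        - name: tmp
          mountPath: /tmp            # writable scratch path the app expects
        - name: cache
          mountPath: /var/cache/app  # example of a non-standard writable path
  volumes:
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir: {}
```

Each writable path gets its own volume, so the rest of the filesystem continues to reject writes.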
Error: "No space left on device"
When an application fails with "No space left on device" and the pod is evicted with DiskPressure, the pod has exceeded its ephemeral-storage limit or the node's local disk is full.
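Start by checking current disk usage within the pod. Run kubectl exec <pod> -- du -sh /tmp /var/log to see which directories are consuming the most space (this requires du to be present in the container image). Note that kubectl top reports only CPU and memory, not ephemeral storage. To compare pods, query the kubelet summary API with kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary", which reports ephemeral-storage usage per pod, and use kubectl describe node <node> to check the DiskPressure condition. This data-driven investigation reveals whether one pod is leaking disk space or whether multiple pods are all approaching their limits simultaneously.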
If disk usage is below the limit but approaching it, increase the ephemeral-storage limit to allow normal operation with headroom. If disk usage is far above what you expected, your application is likely leaking temporary files. Implement log rotation to prevent logs from growing unbounded. Add cleanup logic to your application code to delete temporary files after processing completes. Consider using tmpfs instead of disk-backed emptyDir for truly temporary data: tmpfs stores data in memory rather than on disk, reducing disk pressure and often improving performance. Keep in mind that tmpfs usage counts against the container's memory limit, so size that limit accordingly.
If multiple pods are all suffering simultaneously, the node itself may be running out of local disk capacity. Add additional disk capacity to the node pool or provision additional nodes to distribute the workload.
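The limits discussed in this section are set per container and per volume. The sketch below shows both; the pod name, image, and sizes are illustrative assumptions and should be tuned from observed usage.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bounded-scratch              # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          ephemeral-storage: 1Gi     # used by the scheduler when placing the pod
        limits:
          ephemeral-storage: 2Gi     # exceeding this gets the pod evicted
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 1Gi               # per-volume cap, enforced by eviction
```

Setting the request below the limit leaves headroom for bursts while still giving the scheduler a realistic figure to place the pod with.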
Error: "Permission denied" on Mounted Volume
When an application fails with "permission denied" on a mounted volume, the cause is usually a mismatch between the fsGroup setting and the volume's actual group ownership, or the volume is mounted as readOnly when the application expects to write.
The solution is to ensure the fsGroup in the pod-level SecurityContext matches the group the application runs as. Set fsGroup: 65532 so that mounted volumes supporting ownership management get group ownership 65532. Also set runAsUser: 65532 and runAsGroup: 65532 so the application process runs as that user and group. For volumes that need to be writable, ensure the volumeMount does not have readOnly: true (the default is writable). These fields cannot be changed on a running pod, so the fix takes effect only when pods are recreated, for example by rolling out the updated Deployment.
Summary: The Complete Storage Strategy
| Component | Technology | Lifecycle | Use Case |
|---|---|---|---|
| Application code | OCI image layer | image lifetime | immutable |
| Configuration | ConfigMap + read-only mount | Kubernetes object lifetime | dynamic but read-only |
| Secrets | Secret + tmpfs | pod lifetime | sensitive temporary data |
| Application cache | emptyDir | pod lifetime | non-critical temp data |
| Log files (short) | emptyDir | pod lifetime | transient logs |
| Large temp data | emptyDir with sizeLimit | pod lifetime | working space |
| Database data | PVC | cluster lifetime | durability-critical |
| Message queue logs | PVC | cluster lifetime | durability-critical |
| Shared memory (IPC) | tmpfs or emptyDir medium=Memory | pod lifetime | inter-process communication |
This layered approach (read-only root, ephemeral storage, and persistent storage) is a widely used production pattern for Kubernetes. It provides security (immutability), performance (fast tmpfs), and reliability (persistent storage), while supporting compliance with container security frameworks like CIS Docker Benchmark and FedRAMP.
