Container Filesystem: Definition, How It Works, Layers, Build, and Purpose

Containers may look lightweight on the surface, but everything they do depends on how their filesystem is built and controlled. This article explains what a container filesystem is, why it exists, and how it works under the hood. It walks through filesystem layers, OverlayFS and overlay2, the role of runtimes in creating the root filesystem, and how mounts and volumes change what a container sees. You will also learn how container filesystems differ from host filesystems, how to explore and export them, and which runtime security controls protect files and directories during execution.
What is a container filesystem and why does it exist?
A container filesystem is the file and directory view that a container (including a Docker container) sees as its own root (/). It is assembled at runtime from a base image that is typically read-only, plus a small writable area for runtime changes, and then presented to processes inside a container as a normal Linux filesystem.
It exists for three reasons:
- Isolation of files and paths: It gives each container a separate filesystem view so an application cannot automatically browse or modify unrelated host directories on the same machine. This reduces accidental exposure when a process gains local access.
- Reproducibility through read-only bases: A read-only image provides a consistent filesystem layout (directories, binaries, libraries) across environments, which makes Docker-based deployments predictable. Container monitoring can then verify at runtime that critical files and paths remain unchanged from that baseline.
- Efficient disk usage and change tracking: Many containers can share the same read-only image layers on disk, while each container only stores its unique runtime changes in its writable area. This minimizes storage duplication while keeping each container filesystem logically separate.
What is VFS (Virtual FS) and how does it abstract container filesystems?
VFS (Virtual File System) is a Linux kernel abstraction that provides a single, consistent interface for all filesystem operations, regardless of how data is actually stored or mounted underneath. In the context of Linux containers, VFS is what makes a container filesystem behave like a normal filesystem to processes, even though it is composed of multiple layers, mounts, and kernel-backed paths, which directly influences container security controls around how files are accessed and modified at runtime.
VFS abstracts container filesystems in the following ways:
- Single root view for container processes: When a container starts, the kernel mounts a merged filesystem as its root (/). VFS resolves all file paths against this root, so processes inside the container see a complete and coherent directory tree, even though it may combine image layers and a writable layer using OverlayFS.
- Isolation via mount namespaces: Each container runs in its own mount namespace. VFS uses the mount table associated with that namespace, not the host’s. As a result, path lookups inside the container resolve against the container’s mounts, not the host filesystem, even though the same kernel is shared.
- Layer transparency: VFS hides the complexity of layered storage. Whether a file comes from a read-only image layer or the writable container layer, applications access it through the same filesystem calls. This matters for container networking configurations, because network plugins and DNS settings often write runtime files that land in the writable layer while the base image remains shared. This is why multiple containers can share the same underlying image layers on disk while behaving as independent filesystems.
- Virtual kernel filesystems: Paths such as /proc and /dev are not traditional disk-backed files. VFS exposes them as part of the container’s filesystem so processes can interact with kernel data using standard file operations, scoped to the container’s process and namespace boundaries.
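A quick way to observe this abstraction is to compare a container's mount namespace with the host's and look at the overlay mount that backs the container's root. The sketch below assumes a container named web (the name and image are illustrative) and root privileges on the host.

```bash
# Start a throwaway container (name and image are only examples)
docker run -d --name web nginx:alpine

# Find the container's main process ID as seen from the host
PID=$(docker inspect --format '{{.State.Pid}}' web)

# Different namespace IDs mean the container resolves paths against its own mount table
sudo readlink /proc/$PID/ns/mnt    # container's mount namespace
sudo readlink /proc/self/ns/mnt    # host's mount namespace

# The overlay mount that VFS presents to the container as its root (/)
sudo grep overlay /proc/$PID/mountinfo
```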
How does a container filesystem work?
A container filesystem works by giving a container its own root filesystem (/) that appears complete and independent, even though it is assembled dynamically by the operating system kernel and the container runtime.
- When a new container starts, the runtime takes the filesystem layers from a Docker image and mounts them as a single unified view. This containerization step creates a consistent, isolated filesystem view for the workload: the image layers are mounted read-only, so files from the image cannot be modified directly. This guarantees that the base filesystem remains consistent across environments and across multiple containers using the same image.
- To allow changes at runtime, the container runtime adds a writable layer on top of the read-only image layers. Any files created, modified, or deleted inside the container are written only to this writable layer. From the application's perspective, the container's filesystem behaves like a normal Linux filesystem, even though its changes are isolated to that container.
- A storage driver such as an overlay driver merges the read-only image layers and the writable layer into a single filesystem view. This merged view is what processes see when they explore the filesystem with docker run -it or docker exec. Applications do not interact with image layers or the host layout directly.
- The runtime also mounts kernel-managed paths like /proc and /dev, and typically a tmpfs at /tmp, so processes can access system information, devices, and temporary storage. These runtime mounts sit alongside the image layers, which supply the read-only base files under paths like /usr, while the mounts provide live kernel interfaces. These paths are scoped to the container using Linux isolation mechanisms, not copied from the host filesystem.
- Data persistence depends on where files are written. Changes stored only in the container’s writable layer are lost when the container is removed. Data written through bind mounts or other host-backed mounts persists on the host machine, surviving container restarts and the container lifecycle.
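The writable layer and the persistence rules above can be observed directly with standard Docker commands. In this sketch the container name and paths are placeholders; docker diff lists exactly what landed in the writable layer.

```bash
# Write a file inside a container, then inspect the writable layer
docker run --name demo alpine sh -c 'echo hello > /tmp/inside.txt'
docker diff demo        # shows the added file (e.g. "A /tmp/inside.txt")

# Removing the container discards the writable layer and the file with it
docker rm demo

# A bind mount writes to the host instead, so the data outlives the container
mkdir -p ./demo-data
docker run --rm -v "$PWD/demo-data:/data" alpine sh -c 'echo hello > /data/outside.txt'
cat ./demo-data/outside.txt    # still present after the container is gone
```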
What are container filesystem layers?
Container filesystem layers are the stacked sets of filesystem changes that make up a container image. The image packages the application, its dependencies, and configuration into a reusable artifact that, at runtime, becomes the base of the container filesystem.
Each layer represents a set of file additions, modifications, or deletions produced during an image build, typically from steps in a Dockerfile, which defines the sequence of build instructions used to assemble the image. The runtime then reuses those layers across containers to save disk space and speed up creation.
A layer works like this:
- Most layers are read-only at runtime. Multiple containers can share the same read-only layers on the Docker host, which is why starting a newly created container from an existing image is fast and space-efficient.
- A running container adds a thin writable layer on top. Files created or changed inside that container are stored there, which is why changes are isolated per container.
- Layered storage is content-addressed. The platform stores layers as separate objects (often distributed as tar-style archives in registries) and reuses them whenever the underlying layer content is identical, while cgroups enforce per-container limits on CPU and memory so filesystem operations like unpacking layers and reading image data do not overwhelm the host.
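You can inspect these layers directly with Docker's built-in tooling; the image name below is just an example.

```bash
# The build steps that produced each layer and the size each one added
docker history nginx:alpine

# The content-addressed digests of the layers that make up the image
docker image inspect --format '{{json .RootFS.Layers}}' nginx:alpine

# How much disk space images, containers, and shared layers actually use
docker system df -v
```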
How do you explore a container filesystem?
You explore a container filesystem by entering the container’s runtime view of / and inspecting directories and mounts, or by extracting the image or container filesystem without entering it. The safest approach depends on whether the container is running, stopped, or only available as an image, because the container entrypoint controls which process starts first and what filesystem paths are initialized before you inspect them.
- Start an interactive shell and browse files inside a container: If the image includes a shell (common for Alpine and many Ubuntu images), start it and inspect paths like /usr, /tmp, and /dev to understand what the container actually sees at runtime.
- Inspect the filesystem of an already running container: Attach a shell to the running container and list directories, check file permissions, and confirm where data is stored. This is usually the quickest way to validate “what exists” inside a container.
- Explore /proc for process and namespace-scoped details: /proc lets you correlate what the container sees with PID and process state (for example, verifying which process is PID 1 in the container and what environment it is running with).
- Use nsenter for deep inspection from the host: When you need host-level visibility into the container's namespaces, nsenter can enter the target container's mount and process context and let you inspect the filesystem view more directly. This includes distinguishing files that were pulled from an image repository (where container images are stored, versioned, and distributed as the authoritative source for container files) from files created at runtime in the container. It is commonly used when you need deeper access to files and runtime mounts than normal container exec access provides.
- Inspect without entering the container: If the container is stopped or you only have an image file, you can export the filesystem to a tar archive and inspect it offline. This is useful for review, auditing, or comparing the filesystem across versions.
- Kubernetes-specific exploration: In Kubernetes, you typically exec into the container through the cluster API to inspect the runtime filesystem view, especially when validating mounted paths and what the workload can access.
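The commands below sketch each of these exploration paths; container, pod, and image names are placeholders to adapt to your environment.

```bash
# Interactive shell in a fresh container; then browse /usr, /tmp, /dev from inside
docker run -it --rm alpine sh

# Inspect an already running container
docker exec -it mycontainer sh -c 'ls -la / && mount'

# Verify which process is PID 1 inside the container
docker exec mycontainer cat /proc/1/cmdline | tr '\0' ' '

# Deep inspection from the host: enter the container's mount and PID namespaces
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
sudo nsenter --target "$PID" --mount --pid ls /

# Inspect without entering: export the filesystem and read the archive offline
docker export mycontainer -o rootfs.tar
tar -tf rootfs.tar | head

# Kubernetes: exec into the container through the cluster API
kubectl exec -it mypod -c mycontainer -- sh
```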
How do you export or copy a container filesystem?
You export or copy a container filesystem by either (1) exporting the container’s root filesystem as an archive, or (2) copying specific paths in and out of the container. The right method depends on whether you need the whole filesystem or only selected data.
- Export the entire container filesystem as a tar archive: This captures the container’s root filesystem (/) into a single tarball for offline inspection or transfer. In container orchestration workflows, this export helps teams capture a consistent filesystem snapshot from a managed workload for debugging across replicas and nodes. It is the most direct option when you need the full filesystem rather than a few files.
- Copy specific files or directories out of the container: This is best when you only need targeted paths (for example, /usr/local/app/config or /tmp/output) rather than the whole filesystem. It avoids the size and noise of a full export.
- Copy files or directories into a container: Use this when you need to inject configuration, scripts, or binaries into a running container for debugging or temporary fixes, rather than rebuilding an image.
- Export for analysis without requiring an interactive session: If you need the filesystem for review or auditing but cannot open an interactive session, exporting to a tar archive remains a practical all-in-one approach because it is readable with standard tooling and does not depend on runtime shells.
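As a sketch, these are the corresponding Docker commands; container names and paths are placeholders.

```bash
# 1. Export the entire container root filesystem as a tar archive
docker export mycontainer -o container-rootfs.tar
tar -tf container-rootfs.tar | less      # inspect offline with standard tooling

# 2. Copy specific paths out of the container
docker cp mycontainer:/usr/local/app/config ./config-backup
docker cp mycontainer:/tmp/output ./output

# 3. Copy files into a running container for debugging or a temporary fix
docker cp ./debug-script.sh mycontainer:/tmp/debug-script.sh
docker exec mycontainer sh /tmp/debug-script.sh
```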
How do you build a container filesystem from scratch?
You build a container filesystem from scratch by manually creating a minimal Linux root filesystem, mounting the required kernel-backed paths, and then running a process inside that filesystem using isolation primitives. This approach is commonly used in low-level container tutorials to understand how containers work beneath tools like Docker.
- Create a minimal root filesystem: Start with an empty directory and populate only the essentials such as /usr, /bin, /lib, /tmp, and /dev. This directory becomes the future root (/) seen by the container process. Ownership is typically set to root:root so permissions behave correctly.
- Add required binaries and libraries: Copy a minimal shell and its dependent libraries into the filesystem. Many examples use a tiny userland similar to Alpine to keep the filesystem small and focused, then publish that rootfs as an image to a container registry, which stores and serves the built filesystem layers for consistent pulls across environments.
- Mount kernel-backed filesystems: Mount /proc and /dev into the filesystem so processes can access process metadata and devices. These mounts are required for normal Linux behavior and are attached under the new root's mount tree.
- Run the process with isolation: Use Linux tools to start a process that treats this directory as its root filesystem. At this point, the process is interacting with the filesystem directly, not with the host’s /.
- Understand layering conceptually: While this setup creates a single filesystem, modern runtimes build on the same idea by stacking image layers on top of one another. Building it manually shows the base that higher-level tooling automates.
- Clean up mounts after execution: Once the process exits, unmount /proc, /dev, and any temporary mounts using sudo if required, to avoid leaking resources, especially when you build image artifacts repeatedly during testing and need a clean mount state for each build.
This manual approach does not rely on a daemon and does not store data beyond the lifecycle of the process, but it provides a clear foundation for understanding how higher-level container systems assemble and manage filesystems, including how an SBOM documents the files and dependencies present in the filesystem for visibility and auditability.
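The sketch below walks through those steps using BusyBox as the minimal userland and unshare plus chroot as the isolation primitives. The BusyBox path, the directory layout, and the bind-mounted /dev are assumptions; any statically linked shell and a purpose-built /dev would work equally well.

```bash
# 1. Create the minimal root filesystem layout
mkdir -p rootfs/{bin,lib,usr,tmp,dev,proc}

# 2. Add a tiny userland (assumes a statically linked busybox on the host)
cp /bin/busybox rootfs/bin/busybox
for cmd in sh ls ps mount echo; do ln -s busybox "rootfs/bin/$cmd"; done

# 3. Mount kernel-backed filesystems into the new root
sudo mount -t proc proc rootfs/proc
sudo mount -t tmpfs tmpfs rootfs/tmp
sudo mount --bind /dev rootfs/dev    # permissive; real runtimes build a minimal /dev

# 4. Run a shell that treats this directory as its root, in new mount/PID namespaces
sudo unshare --mount --pid --fork chroot ./rootfs /bin/sh

# 5. Clean up mounts after the process exits
sudo umount rootfs/proc rootfs/tmp rootfs/dev
```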
Container filesystem vs host filesystem: What’s the difference?
The host filesystem is the machine’s own directory tree: it is mounted at boot, backed directly by the host’s disks, and visible to every process on the machine. A container filesystem is assembled when the container starts from read-only image layers plus a per-container writable layer, is scoped to the container’s mount namespace, and is discarded with the container unless data is written through volumes or bind mounts. Both are served by the same kernel through VFS, which is why they look identical to applications even though they are constructed very differently.
What is OverlayFS and how does overlay2 implement container filesystems?
OverlayFS is a Linux filesystem that merges multiple directories into a single, unified view. It is widely used in containers to combine immutable image content, including a hardened container image baseline that reduces unnecessary packages and writable surface area, with container-specific changes while keeping storage efficient.
In Docker, overlay2 is the OverlayFS-based implementation that builds the container filesystem as follows:
OverlayFS stacks one or more read-only directories (the image layers) beneath a writable directory created for the running container. Docker’s overlay2 driver mounts these directories together so applications see a single filesystem tree. From the container’s perspective, this looks like a normal Linux layout with paths such as /usr, /tmp, /dev, and /proc, even though the data may come from different underlying locations.
When a container modifies a file that exists in a read-only image layer, overlay2 performs a copy-up operation. The file is copied into the container’s writable area and modified there, leaving the original image layer unchanged. New files are written directly to the writable area. This design allows multiple containers to share the same image layers without duplicating data on disk.
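The same stacking and copy-up behavior can be reproduced with a single manual OverlayFS mount; the directory names below are illustrative.

```bash
# Prepare the directories OverlayFS needs
mkdir -p /tmp/ovl/{lower,upper,work,merged}
echo "from the image layer" > /tmp/ovl/lower/file.txt

# Merge the read-only lower directory with the writable upper directory
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

cat /tmp/ovl/merged/file.txt     # served from the lower (read-only) layer

# Modifying the file through the merged view triggers copy-up into upper/
echo "changed at runtime" | sudo tee /tmp/ovl/merged/file.txt
cat /tmp/ovl/lower/file.txt      # original content, untouched
cat /tmp/ovl/upper/file.txt      # the copied-up, modified version

sudo umount /tmp/ovl/merged
```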
Overlay2 improves scalability and performance compared to older drivers by mapping image layers more directly to OverlayFS directories and reducing mount complexity. This efficiency is especially useful for distroless container images, which minimize userland files and rely on clean, shared layers for faster pulls and smaller attack surface. As a result, overlay2 has become the default storage driver for many Linux-based Docker installations and is central to understanding how container filesystems work under the hood.
Which runtime components create and mount the container rootfs?
The container rootfs is created and mounted by a small runtime stack that splits responsibilities between image management, filesystem snapshotting, and process startup:
- Docker daemon (dockerd): Orchestrates the overall workflow (image selection, container creation, mount setup requests) and delegates low-level work to containerd.
- containerd: Pulls/unpacks image layers, manages the container’s snapshot (the prepared root filesystem), and prepares the mount instructions for the container’s filesystem tree (including /usr and /tmp coming from the rootfs view).
- Storage driver / snapshotter: Implements the actual rootfs assembly on disk (for example, OverlayFS via overlay2). This is the core of the Docker image vs. container distinction: the image provides read-only layers, while the container adds writable changes at runtime. The driver merges the read-only image layers with the container’s writable changes and produces the final mountable rootfs.
- runc (OCI runtime): Applies the mount configuration, sets up the container’s mount namespace, performs the pivot_root/chroot into the new rootfs, mounts kernel-backed filesystems like /proc and /dev, and then starts the container process.
- Linux kernel: Enforces mount namespaces and mount rules, and provides the kernel-backed filesystems (/proc, /dev) that are mounted into the container’s filesystem view.
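The output of this stack is easy to see for a running container. The sketch below assumes the overlay2 driver and a container named web; the field names come from Docker’s inspect output.

```bash
# Which storage driver assembles the rootfs on this host
docker info --format '{{.Driver}}'          # e.g. overlay2

# The OverlayFS directories behind one container's merged rootfs:
# LowerDir (image layers), UpperDir (writable layer), WorkDir, MergedDir
docker inspect --format '{{json .GraphDriver.Data}}' web
```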
How do mounts and volumes change what a container sees in its filesystem?
Mounts and volumes change what a container sees by replacing parts of the container’s filesystem tree with content from another source at specific paths. The container still sees a single root (/), but selected directories are effectively “swapped” with mounted content.
- They override image-provided files at the mount path: If an image includes files under /usr or /tmp, mounting something at those paths makes the mounted content take precedence. The underlying image content at that location becomes hidden for as long as the mount exists.
- They control where data actually lives: A mounted path can point to storage outside the container’s writable layer. This is why data written under a mounted directory behaves differently from data written elsewhere in the container filesystem.
- They change persistence and lifecycle: Writes to a mounted path can remain available even if the container is recreated, because the mount source is not tied to the container’s own writable changes.
- They can expose system interfaces intentionally: Some mounts provide access to kernel-managed views like /proc and /dev so the container can read process information or interact with devices. Unlike Docker images, which package static filesystem content, these mounts inject live kernel interfaces at runtime. They influence what system details and device nodes the container can see.
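A short sketch of both cases; the volume name, host paths, and image are placeholders.

```bash
# Named volume: /data is backed by Docker-managed storage, not the writable layer
docker volume create appdata
docker run --rm -v appdata:/data alpine sh -c 'echo keep-me > /data/state.txt'
docker run --rm -v appdata:/data alpine cat /data/state.txt   # survives the first container

# Bind mount: a host directory hides whatever the image shipped at the target path
mkdir -p "$PWD/site"
docker run -d --name web \
  --mount type=bind,source="$PWD/site",target=/usr/share/nginx/html,readonly \
  nginx:alpine
```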
What container filesystem security controls apply at runtime?
At runtime, container filesystem security controls limit what processes can see, access, and modify within the filesystem, even though containers share the same kernel. Container scanning complements these runtime controls by identifying vulnerable files, packages, and configurations in images before deployment.
Key controls include:
- Read-only root filesystem: The container’s root (/) can be mounted as read-only so binaries and libraries under paths like /usr cannot be modified at runtime, reducing the risk of persistence after compromise.
- Mount options and path restrictions: Specific directories can be mounted with restrictive flags (noexec, nodev, nosuid) to prevent execution or device access from locations such as /tmp.
- Linux security modules: Mechanisms like SELinux or AppArmor enforce mandatory access rules on filesystem paths, controlling which processes can read or write files under directories like /usr or access special files under /dev.
- Scoped virtual filesystems: Kernel-backed paths such as /proc and /dev are mounted in a constrained form so the container only sees its own process and device context, not the full host view.
- User and permission isolation: Containers typically run with limited privileges, ensuring standard Unix file permissions apply and preventing unrestricted access to sensitive paths.
Together, these controls ensure the container filesystem remains isolated, minimally writable, and tightly scoped to what the workload needs at runtime, which is the operational goal of container image security practices that protect image content and runtime file access.
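Several of these controls map directly onto runtime flags. A minimal sketch, assuming an Alpine image and an arbitrary unprivileged UID:

```bash
# Read-only root, restricted tmpfs at /tmp, non-root user, no extra capabilities
docker run -d --name hardened \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --user 10001:10001 \
  --cap-drop ALL \
  alpine sleep 3600

# Writes outside the tmpfs fail, writes under /tmp succeed
docker exec hardened sh -c 'touch /usr/blocked || echo "rootfs is read-only"'
docker exec hardened sh -c 'touch /tmp/allowed && echo "tmpfs is writable"'
```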
FAQs
Q1. Does a container filesystem include the Linux kernel?
No. The kernel is shared with the host; the container filesystem provides the user-space files and root filesystem view.
Q2. How can you tell which image layer a specific file came from?
Use image-layer inspection tooling to map the file path to the layer diff that introduced or modified it, rather than assuming it came from the top layer.
Q3. What happens to filesystem changes when a container is deleted?
Changes stored only in the container’s writable layer are removed with the container; only data stored through mounts can remain.
Q4. Can you make only part of a container filesystem read-only?
Yes. You can mount specific paths as read-only while keeping other paths writable, which tightens runtime write access without breaking the app.
Q5. Why do two containers from the same image show different files at the same path?
Because mounts can override image paths, and each container has its own writable changes, even when both share the same base image layers.

