Docker Desktop runs a Virtual Machine to host Docker containers. Each component within the VM (including the Docker engine itself) runs as a separate isolated container. This extra layer of isolation introduces an interesting new problem: how do we capture all the logs so we can include them in Docker Desktop diagnostic reports? If we do nothing then the logs will be written separately into each individual container which obviously isn’t very useful!
The Docker Desktop VM boots from an ISO which is built using LinuxKit from a list of Docker images together with a list of capabilities and bind mounts. For a minimal example of a LinuxKit VM definition, see https://github.com/linuxkit/linuxkit/blob/master/examples/minimal.yml — more examples and documentation are available in the LinuxKit repository. The LinuxKit VM in Docker Desktop boots in two phases: in the first phase, the init process executes a series of one-shot “on-boot” actions sequentially using runc to isolate them in containers. These actions typically format disks, enable swap, configure sysctl settings and network interfaces. The second phase contains “services” which are started concurrently and run forever as containerd tasks.
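For orientation, a LinuxKit definition is a YAML file with (among other sections) an "onboot" list and a "services" list. The heavily trimmed sketch below is only illustrative: the image names are real LinuxKit packages but the version tags are placeholders; see the minimal.yml example linked above for a complete, buildable file.

```yaml
# Illustrative LinuxKit definition (version tags omitted); not buildable as-is.
kernel:
  image: linuxkit/kernel:<version>
  cmdline: "console=ttyS0"
init:
  - linuxkit/init:<tag>        # provides the init process
  - linuxkit/runc:<tag>
  - linuxkit/containerd:<tag>
onboot:
  # one-shot actions, run sequentially in runc containers
  - name: sysctl
    image: linuxkit/sysctl:<tag>
  - name: format
    image: linuxkit/format:<tag>
services:
  # long-running containerd tasks, started concurrently
  - name: getty
    image: linuxkit/getty:<tag>
```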
The following diagram shows a simplified high-level view of the boot process:
By default the “on-boot” actions’ stdout and stderr are written both to the VM console and to files in /var/log/onboot.*, while the “services” stdout and stderr are connected directly to open files in /var/log which are left to grow forever.
Initially we considered adding logging to the VM by running a syslog-compatible logging daemon as a regular service that exposes /dev/log or a port (or both). Other services would then connect to syslog to write logs. Unfortunately a logging daemon running as a service would start later than — and therefore miss all the logs from — the “on-boot” actions. Furthermore, since services start concurrently, there would be a race between the syslog daemon starting and syslog clients starting: either logs would be lost or each client startup would have to block waiting for the syslog service to start. Running a syslog daemon as an “on-boot” action would avoid the race with services, but we would have to choose where to put it in the “on-boot” actions list. Ideally we would start the logging daemon at the very beginning so that no logs are lost, but then we would not have access to persistent disks or the network to store the logs anywhere useful.
In summary, we wanted a logging mechanism for Docker Desktop that captures logs from the very start of boot (including the “on-boot” actions), requires no changes to the individual container definitions, and can persist the logs somewhere useful once the network and persistent storage become available.
We decided to implement first-class support for logging by adding a “memory log daemon” called memlogd which starts before the first on-boot action and buffers in memory the last few thousand lines of console output from each container. Since it is only buffering in memory, memlogd does not require any network or persistent storage. A log downloader starts later, after the network and persistent storage is available, connects to memlogd and streams the logs somewhere permanent.
As long as the logs are streamed before the in-memory buffer is full, no lines will be lost. The use of memlogd is entirely optional in LinuxKit; if it is not included in the image then logs are written to the console and directly to open files as before.
We decided to use the Go library container/ring to create a bounded circular buffer. The buffer is bounded to prevent a spammy logging client consuming too much memory in the VM. However if the buffer does fill, then the oldest log lines will be dropped. The following diagram shows the initial design:
Logging clients send log entries via a file descriptor (labelled “linuxkit-external-logging.sock”). Log downloading programs connect to a query socket (labelled “memlogdq.sock”), read logs from the internal buffer and write them somewhere else.
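To make the buffering concrete, here is a minimal sketch of a bounded buffer built on container/ring. It is our own illustration rather than memlogd’s actual source, but it shows the key property: once the ring is full, each new line silently overwrites the oldest one.

```go
// Minimal sketch (not memlogd's source) of a bounded log buffer on container/ring.
package main

import (
	"container/ring"
	"fmt"
)

type logBuffer struct {
	r *ring.Ring // current write position; also the oldest entry once full
}

func newLogBuffer(n int) *logBuffer {
	return &logBuffer{r: ring.New(n)}
}

// Append stores one log line, overwriting the oldest line if the buffer is full.
func (b *logBuffer) Append(line string) {
	b.r.Value = line
	b.r = b.r.Next()
}

// Dump returns the buffered lines, oldest first.
func (b *logBuffer) Dump() []string {
	var out []string
	b.r.Do(func(v interface{}) {
		if v != nil {
			out = append(out, v.(string))
		}
	})
	return out
}

func main() {
	buf := newLogBuffer(3)
	for _, l := range []string{"a", "b", "c", "d"} {
		buf.Append(l)
	}
	fmt.Println(buf.Dump()) // [b c d] — "a" has been overwritten
}
```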
Recall that one of our design goals was to avoid making changes to each individual container definition to use the new logging system. We don’t want to explicitly bind-mount a logging socket into the container or have to modify the container’s code to connect to it. How then do we capture the output from containers automatically and pass it all to the linuxkit-external-logging.sock?
When an on-boot action or service is launched, the VM’s init system creates a FIFO (for containerd) or a socketpair (for runc) for the stdout and stderr. By convention LinuxKit containers normally write their log entries to stderr. Therefore if we modify the init system, we can capture the logs written to the stderr FIFOs and the socketpairs without having to change the container definition or the code. Once the logs have been captured, the next step is to send them to memlogd — how do we do that?
A little-known feature of Linux is that you can pass open file descriptors to other processes via Unix domain sockets. Instead of proxying log lines, we can therefore pass an open socket directly to memlogd. We modified the design of memlogd to take advantage of this:
When a container is started, the init system passes the stdout and stderr file descriptors to memlogd along with the name of the container. Memlogd monitors all of its file descriptors in a select loop. When data is available it is read, tagged with the container name and timestamped before being appended to the in-memory ring buffer. When the container terminates, the file descriptor is closed and memlogd removes it from the loop.
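The file-descriptor passing itself relies on the SCM_RIGHTS ancillary-data mechanism of Unix domain sockets. The sketch below is our own illustration; the socket path, message format and function name are assumptions, not memlogd’s actual protocol. It shows how a process such as init could hand an already-open descriptor to another process:

```go
// Sketch of passing an open file descriptor over a Unix domain socket.
// The socket path and the in-band "name" payload are illustrative only.
package fdpass

import (
	"net"
	"os"
	"syscall"
)

// sendFD hands the open file f (e.g. the write end of a container's stderr
// socketpair) to whatever process is listening on socketPath.
func sendFD(socketPath, name string, f *os.File) error {
	conn, err := net.DialUnix("unix", nil, &net.UnixAddr{Name: socketPath, Net: "unix"})
	if err != nil {
		return err
	}
	defer conn.Close()
	// Encode the open file descriptor as ancillary (out-of-band) data.
	oob := syscall.UnixRights(int(f.Fd()))
	// Send the container name in-band and the fd out-of-band in one message.
	_, _, err = conn.WriteMsgUnix([]byte(name), oob, nil)
	return err
}
```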
So this means that containers need no modification to take part in the logging system: init hands their output file descriptors straight to memlogd when they are launched, the log data never has to be proxied through an intermediate process, and no output is lost while a container starts up.
To see memlogd in action on a Docker Desktop system, try the following command:
docker run -it --privileged --pid=host justincormack/nsenter1 /usr/bin/logread -F -socket /run/guest-services/memlogdq.sock
This will run a privileged container in the root namespace (containing the “memlogdq.sock” used for querying the logs) and run the utility “logread”, telling it to “follow” the stream i.e. to keep copying from memlogd to the terminal until interrupted. The output looks like this:
2019-02-22T16:04:23Z,docker;time="2019-02-22T16:04:23Z" level=debug msg="registering ttrpc server"
The initial timestamp indicates when memlogd received the message, and “docker” shows that the log came from the docker service. The rest of the line is the output written to stderr.
In Docker Desktop we include the Linux kernel logs in diagnostic reports to help us understand and fix Linux kernel bugs. We created the kmsg package for this purpose. When this service is started, it connects to /dev/kmsg, streams the kernel logs and writes them to its stderr. As stderr is automatically sent to memlogd, the kernel logs are included in the VM’s logs and therefore in the diagnostic report. Note that reading kernel logs via /dev/kmsg is a privileged operation, so the kmsg service needs the CAP_SYSLOG capability.
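As a rough illustration of what such a service does (this is our own minimal sketch, not the actual kmsg package), it boils down to copying records from /dev/kmsg to stderr:

```go
// Sketch of a kmsg-style service: stream kernel log records from /dev/kmsg
// to stderr so init/memlogd captures them like any other service output.
// Requires CAP_SYSLOG.
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/kmsg:", err)
		os.Exit(1)
	}
	defer f.Close()

	buf := make([]byte, 8192) // each read on /dev/kmsg returns a single record
	for {
		n, err := f.Read(buf)
		if err != nil {
			fmt.Fprintln(os.Stderr, "read /dev/kmsg:", err)
			os.Exit(1)
		}
		// Forward the raw record; memlogd adds its own timestamp and tag.
		os.Stderr.Write(buf[:n])
	}
}
```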
In Docker Desktop we persist the log entries to files (one per service), rotate them when they become large and then delete the oldest to avoid leaking space. We created the logwrite package for this purpose. When this service is started, it connects to the query socket memlogdq.sock, downloads the logs as they are written and manages the log files.
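The interesting part of logwrite is the rotation policy. The sketch below is our own illustration of that policy only (the actual logwrite package, and the protocol it uses to read from memlogdq.sock, differ): append lines to a per-service file, rotate once the file exceeds a size limit, and keep a fixed number of old files.

```go
// Sketch of a size-based log rotation policy (illustrative, not logwrite's source).
package logrotate

import (
	"fmt"
	"os"
)

type rotatingFile struct {
	path     string
	maxBytes int64
	maxFiles int
	f        *os.File
	written  int64
}

func newRotatingFile(path string, maxBytes int64, maxFiles int) *rotatingFile {
	return &rotatingFile{path: path, maxBytes: maxBytes, maxFiles: maxFiles}
}

// WriteLine appends one line, rotating first if the current file is too big.
func (r *rotatingFile) WriteLine(line string) error {
	if r.f == nil || r.written >= r.maxBytes {
		if err := r.rotate(); err != nil {
			return err
		}
	}
	n, err := fmt.Fprintln(r.f, line)
	r.written += int64(n)
	return err
}

// rotate shifts old files (path.1 -> path.2, ...) and opens a fresh file at
// path; the oldest file is overwritten and so falls off the end.
func (r *rotatingFile) rotate() error {
	if r.f != nil {
		r.f.Close()
	}
	for i := r.maxFiles - 1; i >= 1; i-- {
		os.Rename(fmt.Sprintf("%s.%d", r.path, i), fmt.Sprintf("%s.%d", r.path, i+1))
	}
	os.Rename(r.path, r.path+".1") // errors ignored: the files may not exist yet
	f, err := os.OpenFile(r.path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	r.f, r.written = f, 0
	return nil
}
```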
We now have a relatively simple and lightweight, yet extendable logging system that provides the features we need in Docker Desktop: it captures logs from both “on-boot” actions and services, and persists logs to files with rotation after the file system has been mounted. We developed the logging system in the upstream LinuxKit project where we hope the simple and modular design will allow it to be easily extended by other LinuxKit developers.
This post was joint work with Magnus Skjegstad.
The latest Edge release of Docker Desktop for Windows, 2.1.7.0, has a completely new filesharing system using FUSE instead of Samba. The initial blog post we released presents the performance improvements of this new implementation and explains how to give feedback. Please try it out and let us know what you think. Now we are going to go into the details and give you more insight into the new architecture.
Instead of Samba running over a Hyper-V virtual network, the new system uses a Filesystem in Userspace (FUSE) server running over gRPC over Hypervisor sockets.
The following diagram shows the path taken by a single request from a container, for example to read a PHP file:
In step (1) the web-server in the container calls “read” which is a Linux system call handled by the kernel’s Virtual File System (VFS) layer. The VFS is modular and supports many different filesystem implementations. In our case we use Filesystem in Userspace (FUSE) which sends the request to a helper process running inside the VM labelled “FUSE client.” This process runs within the same namespace as the Docker engine. The FUSE client can handle some requests locally, but when it needs to access the host filesystem it connects to the host via “Hypervisor sockets.”
Hypervisor sockets are a shared-memory communication mechanism which enables VMs to communicate with each other and with the host. They have a number of advantages over regular virtual networking: in particular, they involve no virtual network devices or IP addresses, so they keep working regardless of the user’s VPN, firewall or proxy configuration.
Docker Desktop already uses these sockets to forward the Docker API, to forward container ports, and now we use them for filesharing on Windows too!
Returning to the diagram, the FUSE client creates sockets using the AF_VSOCK address family, see step (3). The kernel contains a number of low-level transports, one per hypervisor. Since the underlying hypervisor is Hyper-V, we use the VMBus transport. In step (4) filesystem requests are written into the shared memory and read by the VMBus implementation in the Windows kernel. A FUSE server userspace process running in Windows reads the filesystem request over an AF_HYPERV socket in step (5).
The request to open/close/read/write etc is received by the FUSE server, which is running as a regular Windows process. Finally in step (6) the FUSE server uses the Windows APIs to perform the read or write and then returns the result to the caller.
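For step (3), opening a connection from the guest to the host over AF_VSOCK looks roughly like the following sketch (the port number and the wrapper are illustrative; Docker Desktop’s actual ports and framing are internal details):

```go
// Sketch of dialing the host from a Linux guest over AF_VSOCK (Linux-only).
package vsockdial

import (
	"os"

	"golang.org/x/sys/unix"
)

// dialHost opens a stream socket to the host on the given vsock port.
func dialHost(port uint32) (*os.File, error) {
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		return nil, err
	}
	sa := &unix.SockaddrVM{CID: unix.VMADDR_CID_HOST, Port: port}
	if err := unix.Connect(fd, sa); err != nil {
		unix.Close(fd)
		return nil, err
	}
	// Wrap the raw fd so it can be used with the standard io interfaces.
	return os.NewFile(uintptr(fd), "vsock"), nil
}
```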
The FUSE server runs as the user who is running the Docker app, so it only has access to the user’s files and folders. There is no possibility of the VM gaining access to any other files, as could happen in the previous design if a local admin account is used to mount the drive in the VM.
When files are modified in Linux, the kernel generates inotify events. Interested applications can watch for these events and take action. For example, a React app run with
$ npm start
will watch for inotify events and, when the code changes, automatically recompile and trigger the browser to refresh. In previous versions of Docker Desktop on Windows we weren’t able to generate inotify events, so these styles of development simply wouldn’t work.
Injecting inotify events is quite tricky. Normally a Linux VFS implementation like FUSE wouldn’t generate the events itself; instead the common code in the higher layer generates events as a side-effect of performing actions. For example when the VFS “unlink” is called and returns successfully then the “unlink” event will be generated. So when the user calls “unlink” under Windows, how does Linux find out about it?
Docker Desktop watches for events on the host when the user runs docker run -v. When an “unlink” event is received on the host, a request to “please inject unlink” is forwarded over gRPC to the Linux VM. The following diagram shows the sequence of operations:
A thread with a well-known pid inside the FUSE client in Linux “replays” the request by calling “unlink,” even though the directory has actually already been removed. The FUSE client intercepts requests from this well-known pid and pretends that the unlink hasn’t happened yet. For example, when FUSE_GETATTR is called, the FUSE client will say, “yes the directory is still here” (instead of ENOENT). When the FUSE_UNLINK is called, the FUSE client will say, “yes that worked” (instead of ENOENT). As a result of the successful FUSE_UNLINK the Linux kernel generates the inotify event.
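The host-side half of this, watching the shared directory and asking the VM to replay an unlink, could be sketched as follows. This uses the fsnotify library as a stand-in for Docker Desktop’s internal watcher, and forwardUnlink is a hypothetical placeholder for the real gRPC call:

```go
// Sketch of a host-side watcher that detects file removals and would ask the
// VM to inject the corresponding inotify event. Illustrative only.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// forwardUnlink is a hypothetical stand-in for the gRPC request that asks the
// FUSE client in the VM to replay the unlink.
func forwardUnlink(path string) {
	log.Printf("would ask the VM to inject unlink for %q", path)
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch a shared directory; in Docker Desktop this corresponds to the
	// host path given to `docker run -v`.
	if err := watcher.Add("."); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return
			}
			if ev.Op&fsnotify.Remove != 0 {
				forwardUnlink(ev.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}
```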
As you can see from the architecture diagram above, each I/O request has to make several user/kernel transitions and VM/host transitions to complete. This means the latency of a filesystem operation is much higher than when all the files are local to the VM. We have mitigated this with aggressive use of caching in the Linux kernel, so that many requests (for example, repeated reads of an unchanged file) can be satisfied without crossing over to the host at all.
Since we have enabled so many caches we have to carefully handle cache invalidation. When a user runs docker run -v and we are monitoring the filesystem events for inotify event injection, we also use these events to invalidate cache entries. When the docker run -v exits and the watches are disabled we invalidate all the cache entries.
We have lots of ideas for improving the performance further with even more aggressive caching. For example, in the Symfony benchmark above, the majority of the remaining FUSE calls in the cached case are calls to open and close file handles, even though the file contents are themselves cached (and haven’t changed). We may be able to make these open and close calls lazy and only issue them when needed.
The new filesystem implementation is not relevant on WSL 2 (currently available on early Windows Insider builds), since that already has a native filesharing mode which uses 9P. Of course we will keep benchmarking, optimising and incorporating user feedback to always use the best available filesharing implementation across all OS versions.