Docker Desktop runs a Virtual Machine to host Docker containers. Each component within the VM (including the Docker engine itself) runs as a separate isolated container. This extra layer of isolation introduces an interesting new problem: how do we capture all the logs so we can include them in Docker Desktop diagnostic reports? If we do nothing then the logs will be written separately into each individual container which obviously isn’t very useful!
The Docker Desktop VM boots from an ISO which is built using LinuxKit from a list of Docker images together with a list of capabilities and bind mounts. For a minimal example of a LinuxKit VM definition, see https://github.com/linuxkit/linuxkit/blob/master/examples/minimal.yml — more examples and documentation are available in the LinuxKit repository. The LinuxKit VM in Docker Desktop boots in two phases: in the first phase, the init process executes a series of one-shot “on-boot” actions sequentially using runc to isolate them in containers. These actions typically format disks, enable swap, configure sysctl settings and network interfaces. The second phase contains “services” which are started concurrently and run forever as containerd tasks.
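For orientation, a LinuxKit definition is a YAML file with (among other sections) an "onboot" list and a "services" list. The heavily trimmed sketch below is only illustrative: the image names are real LinuxKit packages but the version tags are placeholders; see the minimal.yml example linked above for a complete, buildable file.

```yaml
# Illustrative LinuxKit definition (version tags omitted); not buildable as-is.
kernel:
  image: linuxkit/kernel:<version>
  cmdline: "console=ttyS0"
init:
  - linuxkit/init:<tag>        # provides the init process
  - linuxkit/runc:<tag>
  - linuxkit/containerd:<tag>
onboot:
  # one-shot actions, run sequentially in runc containers
  - name: sysctl
    image: linuxkit/sysctl:<tag>
  - name: format
    image: linuxkit/format:<tag>
services:
  # long-running containerd tasks, started concurrently
  - name: getty
    image: linuxkit/getty:<tag>
```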
The following diagram shows a simplified high-level view of the boot process:
By default the “on-boot” actions’ stdout and stderr are written both to the VM console and to files in /var/log/onboot.*, while the “services” stdout and stderr are connected directly to open files in /var/log which are left to grow forever.
Initially we considered adding logging to the VM by running a syslog-compatible logging daemon as a regular service that exposes /dev/log or a port (or both). Other services would then connect to syslog to write logs. Unfortunately a logging daemon running as a service would start later than — and therefore miss all the logs from — the “on-boot” actions. Furthermore, since services start concurrently, there would be a race between the syslog daemon starting and syslog clients starting: either logs would be lost or each client startup would have to block waiting for the syslog service to start. Running a syslog daemon as an “on-boot” action would avoid the race with services, but we would have to choose where to put it in the “on-boot” actions list. Ideally we would start the logging daemon at the very beginning so that no logs are lost, but then we would not have access to persistent disks or the network to store the logs anywhere useful.
In summary, we wanted a logging mechanism for Docker Desktop that captures logs from the very start of boot (including the “on-boot” actions), requires no changes to the individual container definitions, and can persist the logs somewhere useful once the network and persistent storage become available.
We decided to implement first-class support for logging by adding a “memory log daemon” called memlogd which starts before the first on-boot action and buffers in memory the last few thousand lines of console output from each container. Since it is only buffering in memory, memlogd does not require any network or persistent storage. A log downloader starts later, after the network and persistent storage is available, connects to memlogd and streams the logs somewhere permanent.
As long as the logs are streamed before the in-memory buffer is full, no lines will be lost. The use of memlogd is entirely optional in LinuxKit; if it is not included in the image then logs are written to the console and directly to open files as before.
We decided to use the Go library container/ring to create a bounded circular buffer. The buffer is bounded to prevent a spammy logging client consuming too much memory in the VM. However if the buffer does fill, then the oldest log lines will be dropped. The following diagram shows the initial design:
Logging clients send log entries via a file descriptor (labelled “linuxkit-external-logging.sock”). Log downloading programs connect to a query socket (labelled “memlogdq.sock”), read logs from the internal buffer and write them somewhere else.
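To make the buffering concrete, here is a minimal sketch of a bounded buffer built on container/ring. It is our own illustration rather than memlogd’s actual source, but it shows the key property: once the ring is full, each new line silently overwrites the oldest one.

```go
// Minimal sketch (not memlogd's source) of a bounded log buffer on container/ring.
package main

import (
	"container/ring"
	"fmt"
)

type logBuffer struct {
	r *ring.Ring // current write position; also the oldest entry once full
}

func newLogBuffer(n int) *logBuffer {
	return &logBuffer{r: ring.New(n)}
}

// Append stores one log line, overwriting the oldest line if the buffer is full.
func (b *logBuffer) Append(line string) {
	b.r.Value = line
	b.r = b.r.Next()
}

// Dump returns the buffered lines, oldest first.
func (b *logBuffer) Dump() []string {
	var out []string
	b.r.Do(func(v interface{}) {
		if v != nil {
			out = append(out, v.(string))
		}
	})
	return out
}

func main() {
	buf := newLogBuffer(3)
	for _, l := range []string{"a", "b", "c", "d"} {
		buf.Append(l)
	}
	fmt.Println(buf.Dump()) // [b c d] — "a" has been overwritten
}
```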
Recall that one of our design goals was to avoid making changes to each individual container definition to use the new logging system. We don’t want to explicitly bind-mount a logging socket into the container or have to modify the container’s code to connect to it. How then do we capture the output from containers automatically and pass it all to the linuxkit-external-logging.sock?
When an on-boot action or service is launched, the VM’s init system creates a FIFO (for containerd) or a socketpair (for runc) for the stdout and stderr. By convention LinuxKit containers normally write their log entries to stderr. Therefore if we modify the init system, we can capture the logs written to the stderr FIFOs and the socketpairs without having to change the container definition or the code. Once the logs have been captured, the next step is to send them to memlogd — how do we do that?
A little-known feature of Linux is that you can pass open file descriptors to other processes via Unix domain sockets. Instead of proxying log lines, we can therefore pass an open socket directly to memlogd. We modified the design of memlogd to take advantage of this:
When a container is started, the init system passes the stdout and stderr file descriptors to memlogd along with the name of the container. Memlogd monitors all of its file descriptors in a select loop. When data is available it is read, tagged with the container name and timestamped before being appended to the in-memory ring buffer. When the container terminates, the file descriptor is closed and memlogd removes it from the loop.
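The file-descriptor passing itself relies on the SCM_RIGHTS ancillary-data mechanism of Unix domain sockets. The sketch below is our own illustration; the socket path, message format and function name are assumptions, not memlogd’s actual protocol. It shows how a process such as init could hand an already-open descriptor to another process:

```go
// Sketch of passing an open file descriptor over a Unix domain socket.
// The socket path and the in-band "name" payload are illustrative only.
package fdpass

import (
	"net"
	"os"
	"syscall"
)

// sendFD hands the open file f (e.g. the write end of a container's stderr
// socketpair) to whatever process is listening on socketPath.
func sendFD(socketPath, name string, f *os.File) error {
	conn, err := net.DialUnix("unix", nil, &net.UnixAddr{Name: socketPath, Net: "unix"})
	if err != nil {
		return err
	}
	defer conn.Close()
	// Encode the open file descriptor as ancillary (out-of-band) data.
	oob := syscall.UnixRights(int(f.Fd()))
	// Send the container name in-band and the fd out-of-band in one message.
	_, _, err = conn.WriteMsgUnix([]byte(name), oob, nil)
	return err
}
```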
So this means that containers need no modification to take part in the logging system: init hands their output file descriptors straight to memlogd when they are launched, the log data never has to be proxied through an intermediate process, and no output is lost while a container starts up.
To see memlogd in action on a Docker Desktop system, try the following command:
docker run -it --privileged --pid=host justincormack/nsenter1 /usr/bin/logread -F -socket /run/guest-services/memlogdq.sock
This will run a privileged container in the root namespace (containing the “memlogdq.sock” used for querying the logs) and run the utility “logread”, telling it to “follow” the stream i.e. to keep copying from memlogd to the terminal until interrupted. The output looks like this:
2019-02-22T16:04:23Z,docker;time="2019-02-22T16:04:23Z" level=debug msg="registering ttrpc server"
The initial timestamp indicates when memlogd received the message, and “docker” shows that the log came from the docker service. The rest of the line is the output written to stderr.
In Docker Desktop we include the Linux kernel logs in diagnostic reports to help us understand and fix Linux kernel bugs. We created the kmsg package for this purpose. When this service is started, it connects to /dev/kmsg, streams the kernel logs and writes them to its stderr. As stderr is automatically sent to memlogd, the kernel logs are included in the VM’s logs and therefore in the diagnostic report. Note that reading kernel logs via /dev/kmsg is a privileged operation, so the kmsg service needs the CAP_SYSLOG capability.
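As a rough illustration of what such a service does (this is our own minimal sketch, not the actual kmsg package), it boils down to copying records from /dev/kmsg to stderr:

```go
// Sketch of a kmsg-style service: stream kernel log records from /dev/kmsg
// to stderr so init/memlogd captures them like any other service output.
// Requires CAP_SYSLOG.
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/kmsg:", err)
		os.Exit(1)
	}
	defer f.Close()

	buf := make([]byte, 8192) // each read on /dev/kmsg returns a single record
	for {
		n, err := f.Read(buf)
		if err != nil {
			fmt.Fprintln(os.Stderr, "read /dev/kmsg:", err)
			os.Exit(1)
		}
		// Forward the raw record; memlogd adds its own timestamp and tag.
		os.Stderr.Write(buf[:n])
	}
}
```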
In Docker Desktop we persist the log entries to files (one per service), rotate them when they become large and then delete the oldest to avoid leaking space. We created the logwrite package for this purpose. When this service is started, it connects to the query socket memlogdq.sock, downloads the logs as they are written and manages the log files.
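The interesting part of logwrite is the rotation policy. The sketch below is our own illustration of that policy only (the actual logwrite package, and the protocol it uses to read from memlogdq.sock, differ): append lines to a per-service file, rotate once the file exceeds a size limit, and keep a fixed number of old files.

```go
// Sketch of a size-based log rotation policy (illustrative, not logwrite's source).
package logrotate

import (
	"fmt"
	"os"
)

type rotatingFile struct {
	path     string
	maxBytes int64
	maxFiles int
	f        *os.File
	written  int64
}

func newRotatingFile(path string, maxBytes int64, maxFiles int) *rotatingFile {
	return &rotatingFile{path: path, maxBytes: maxBytes, maxFiles: maxFiles}
}

// WriteLine appends one line, rotating first if the current file is too big.
func (r *rotatingFile) WriteLine(line string) error {
	if r.f == nil || r.written >= r.maxBytes {
		if err := r.rotate(); err != nil {
			return err
		}
	}
	n, err := fmt.Fprintln(r.f, line)
	r.written += int64(n)
	return err
}

// rotate shifts old files (path.1 -> path.2, ...) and opens a fresh file at
// path; the oldest file is overwritten and so falls off the end.
func (r *rotatingFile) rotate() error {
	if r.f != nil {
		r.f.Close()
	}
	for i := r.maxFiles - 1; i >= 1; i-- {
		os.Rename(fmt.Sprintf("%s.%d", r.path, i), fmt.Sprintf("%s.%d", r.path, i+1))
	}
	os.Rename(r.path, r.path+".1") // errors ignored: the files may not exist yet
	f, err := os.OpenFile(r.path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	r.f, r.written = f, 0
	return nil
}
```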
We now have a relatively simple and lightweight, yet extendable logging system that provides the features we need in Docker Desktop: it captures logs from both “on-boot” actions and services, and persists logs to files with rotation after the file system has been mounted. We developed the logging system in the upstream LinuxKit project where we hope the simple and modular design will allow it to be easily extended by other LinuxKit developers.
This post was joint work with Magnus Skjegstad.
The latest Edge release of Docker Desktop for Windows, 2.1.7.0, has a completely new filesharing system using FUSE instead of Samba. The initial blog post we released presents the performance improvements of this new implementation and explains how to give feedback. Please try it out and let us know what you think. Now we are going to go into the details and give you more insight into the new architecture.
Instead of Samba running over a Hyper-V virtual network, the new system uses a Filesystem in Userspace (FUSE) server running over gRPC over Hypervisor sockets.
The following diagram shows the path taken by a single request from a container, for example to read a PHP file:
In step (1) the web-server in the container calls “read” which is a Linux system call handled by the kernel’s Virtual File System (VFS) layer. The VFS is modular and supports many different filesystem implementations. In our case we use Filesystem in Userspace (FUSE) which sends the request to a helper process running inside the VM labelled “FUSE client.” This process runs within the same namespace as the Docker engine. The FUSE client can handle some requests locally, but when it needs to access the host filesystem it connects to the host via “Hypervisor sockets.”
Hypervisor sockets are a shared-memory communication mechanism which enables VMs to communicate with each other and with the host. They have a number of advantages over regular virtual networking: in particular, they involve no virtual network devices or IP addresses, so they keep working regardless of the user’s VPN, firewall or proxy configuration.
Docker Desktop already uses these sockets to forward the Docker API, to forward container ports, and now we use them for filesharing on Windows too!
Returning to the diagram, the FUSE client creates sockets using the AF_VSOCK address family, see step (3). The kernel contains a number of low-level transports, one per hypervisor. Since the underlying hypervisor is Hyper-V, we use the VMBus transport. In step (4) filesystem requests are written into the shared memory and read by the VMBus implementation in the Windows kernel. A FUSE server userspace process running in Windows reads the filesystem request over an AF_HYPERV socket in step (5).
The request to open/close/read/write etc is received by the FUSE server, which is running as a regular Windows process. Finally in step (6) the FUSE server uses the Windows APIs to perform the read or write and then returns the result to the caller.
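For step (3), opening a connection from the guest to the host over AF_VSOCK looks roughly like the following sketch (the port number and the wrapper are illustrative; Docker Desktop’s actual ports and framing are internal details):

```go
// Sketch of dialing the host from a Linux guest over AF_VSOCK (Linux-only).
package vsockdial

import (
	"os"

	"golang.org/x/sys/unix"
)

// dialHost opens a stream socket to the host on the given vsock port.
func dialHost(port uint32) (*os.File, error) {
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		return nil, err
	}
	sa := &unix.SockaddrVM{CID: unix.VMADDR_CID_HOST, Port: port}
	if err := unix.Connect(fd, sa); err != nil {
		unix.Close(fd)
		return nil, err
	}
	// Wrap the raw fd so it can be used with the standard io interfaces.
	return os.NewFile(uintptr(fd), "vsock"), nil
}
```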
The FUSE server runs as the user who is running the Docker app, so it only has access to the user’s files and folders. There is no possibility of the VM gaining access to any other files, as could happen in the previous design if a local admin account is used to mount the drive in the VM.
When files are modified in Linux, the kernel generates inotify events. Interested applications can watch for these events and take action. For example, a React app run with
$ npm start
will watch for inotify events and, when the code changes, automatically recompile and trigger the browser to refresh. In previous versions of Docker Desktop on Windows we weren’t able to generate inotify events, so these styles of development simply wouldn’t work.
Injecting inotify events is quite tricky. Normally a Linux VFS implementation like FUSE wouldn’t generate the events itself; instead the common code in the higher layer generates events as a side-effect of performing actions. For example when the VFS “unlink” is called and returns successfully then the “unlink” event will be generated. So when the user calls “unlink” under Windows, how does Linux find out about it?
Docker Desktop watches for events on the host when the user runs docker run -v. When an “unlink” event is received on the host, a request to “please inject unlink” is forwarded over gRPC to the Linux VM. The following diagram shows the sequence of operations:
A thread with a well-known pid inside the FUSE client in Linux “replays” the request by calling “unlink,” even though the directory has actually already been removed. The FUSE client intercepts requests from this well-known pid and pretends that the unlink hasn’t happened yet. For example, when FUSE_GETATTR is called, the FUSE client will say, “yes the directory is still here” (instead of ENOENT). When the FUSE_UNLINK is called, the FUSE client will say, “yes that worked” (instead of ENOENT). As a result of the successful FUSE_UNLINK the Linux kernel generates the inotify event.
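The host-side half of this, watching the shared directory and asking the VM to replay an unlink, could be sketched as follows. This uses the fsnotify library as a stand-in for Docker Desktop’s internal watcher, and forwardUnlink is a hypothetical placeholder for the real gRPC call:

```go
// Sketch of a host-side watcher that detects file removals and would ask the
// VM to inject the corresponding inotify event. Illustrative only.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// forwardUnlink is a hypothetical stand-in for the gRPC request that asks the
// FUSE client in the VM to replay the unlink.
func forwardUnlink(path string) {
	log.Printf("would ask the VM to inject unlink for %q", path)
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch a shared directory; in Docker Desktop this corresponds to the
	// host path given to `docker run -v`.
	if err := watcher.Add("."); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return
			}
			if ev.Op&fsnotify.Remove != 0 {
				forwardUnlink(ev.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}
```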
As you can see from the architecture diagram above, each I/O request has to make several user/kernel transitions and VM/host transitions to complete. This means the latency of a filesystem operation is much higher than when all the files are local to the VM. We have mitigated this with aggressive use of caching in the Linux kernel, so that many requests (for example, repeated reads of an unchanged file) can be satisfied without crossing over to the host at all.
Since we have enabled so many caches we have to carefully handle cache invalidation. When a user runs docker run -v and we are monitoring the filesystem events for inotify event injection, we also use these events to invalidate cache entries. When the docker run -v exits and the watches are disabled we invalidate all the cache entries.
We have lots of ideas for improving the performance further with even more aggressive caching. For example, in the Symfony benchmark above, the majority of the remaining FUSE calls in the cached case are calls to open and close file handles, even though the file contents are themselves cached (and haven’t changed). We may be able to make these open and close calls lazy and only issue them when needed.
The new filesystem implementation is not relevant on WSL 2 (currently available on early Windows Insider builds), since that already has a native filesharing mode which uses 9P. Of course we will keep benchmarking, optimising and incorporating user feedback to always use the best available filesharing implementation across all OS versions.