Docker Desktop 4.21: Support for new Wasm runtimes, Docker Init support for Rust, Docker Scout Dashboard enhancements, Builds view (Beta), and more
https://www.docker.com/blog/docker-desktop-4-21/ (Thu, 06 Jul 2023)

Docker Desktop 4.21 is now available and includes Docker Init support for Rust, new Wasm runtimes support, enhancements to Docker Scout dashboards, Builds view (Beta), performance and filesystem enhancements to Docker Desktop on macOS, and more. Docker Desktop 4.21 also uses substantially less memory, allowing developers to run more applications simultaneously on their machines without relying on swap.


Added support for new Wasm runtimes

Docker Desktop 4.21 adds support for the following Wasm runtimes: Slight, Spin, and Wasmtime. These runtimes can be downloaded on demand when the containerd image store is enabled. The following steps outline the process:

  1. In Docker Desktop, navigate to the settings by clicking the gear icon.
  2. Select the Features in development tab.
  3. Check the boxes for Use containerd for pulling and storing images and Enable Wasm.
  4. Select Apply & restart.
  5. When prompted for Wasm Runtimes Installation, select Install.
  6. After installation, these runtimes can be used to run Wasm workloads locally with the corresponding flags, for example:
    --runtime=io.containerd.spin.v1 --platform=wasi/wasm32
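
For example, a complete invocation using the Spin runtime might look like the following; the image name here is a placeholder, so substitute your own Wasm image:

docker run --rm --runtime=io.containerd.spin.v1 --platform=wasi/wasm32 my-registry/my-spin-app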

Docker Init (Beta) added support for Rust 

In the 4.21 release, we’ve added Rust server support to Docker Init. Docker Init is a CLI command in beta that simplifies the process of adding Docker to a project. (Learn more about Docker Init in our blog post: Docker Init: Initialize Dockerfiles and Compose files with a single CLI command.)

You can try Docker Init with Rust by updating to the latest version of Docker Desktop and typing docker init in the command line while inside a target project folder. 
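
A minimal sketch of that workflow, using a hypothetical project folder name, looks like this (Docker Init walks you through a few prompts and generates the Dockerfile and Compose files for you):

cd my-rust-service
docker init
docker compose up --build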

The Docker team is working on adding more languages and frameworks for this command, including Java and .NET. Let us know if you want us to support a specific language or framework. We welcome feedback as we continue to develop and improve Docker Init (Beta).

Docker Scout dashboard enhancements 

The Docker Scout Dashboard helps you share the analysis of images in an organization with your team. Developers can now see an overview of their security status across all their images from both Docker Hub and Artifactory (more registry integrations coming soon) and get remediation advice at their fingertips. Docker Scout analysis helps team members in roles such as security, compliance, and operations to know what vulnerabilities and issues they need to focus on.

Figure 1: The Docker Scout vulnerabilities overview, showing 2,412 critical, 13,106 high, 11,108 medium, and 3,138 low severity vulnerabilities, with a chart of vulnerability counts over the last 30 days (May 29 to June 29) and an increase starting on June 13

Visit the Docker Scout vulnerability dashboard to get end-to-end observability into your supply chain. 

Docker Buildx v0.11

The Docker Buildx component has been updated to v0.11, enabling many new features. For example, you can now load multi-platform images into the Docker image store when the containerd image store is enabled.
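
For example, with the containerd image store enabled, a multi-platform build can now be loaded straight into the local image store; the image name and platform list below are purely illustrative:

docker buildx build --platform linux/amd64,linux/arm64 --load -t my-app:multi .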

The buildx bake command now supports matrix builds, which let you define multiple configurations of the same build target and build them all together.

There are also multiple new experimental commands that provide better debugging support for your builds. Read more in the release changelog.

Builds (Beta)

Docker Desktop 4.21 includes our Builds view beta release. Builds view gives you visibility into the active builds currently running on your system and enables analysis and debugging of your completed builds.

All builds started with the docker build or docker buildx build commands automatically appear in the Builds view. From there, you can inspect all the properties of a build invocation, including timing information, build cache usage, and the Dockerfile source. Builds view also gives you full access to all of the logs and properties of individual build steps.

If you are working with multiple Buildx builder instances (for example, running builds inside a Docker container or Kubernetes cluster), the Builds view includes a new Builders settings view that makes it even easier to manage additional builders or set the default builder instance.

Builds view is currently in beta as we continue to improve it. To enable it, go to Settings > Features in development and turn on Builds view.

Figure 2: Builds view — List of active and completed builds, including an active build’s progress bar and timer
Figure 3: Builds view — Build details with logs visible
Figure 4: Builds view — Builder settings with the default builder expanded

Faster startup and file sharing for macOS 

Launching Docker Desktop on Apple Silicon Macs is at least 25% quicker in 4.21 compared to previous Docker Desktop versions. Previously the start time would scale linearly with the amount of memory allocated to Docker, which meant that users with higher-spec Macs would experience slower startup. This bug has been fixed and now Docker starts in four seconds on Apple Silicon. 

Docker Desktop 4.21 uses VirtioFS by default on macOS 12.5+, which provides substantial performance gains when sharing host files with containers (for example, via docker run -v). The time taken to build the Redis engine drops from seven minutes on Docker Desktop 4.20 to only two minutes on Docker Desktop 4.21, for example.
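
A quick way to exercise the shared filesystem is to bind-mount a host directory into a container and run a file-heavy command against it; the directory and command below are just an illustration:

docker run --rm -v "$(pwd)":/src -w /src alpine sh -c "find . -type f | wc -l"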

Conclusion

Upgrade now to explore what’s new in the 4.21 release of Docker Desktop. Do you have feedback? Leave feedback on our public GitHub roadmap and let us know what else you’d like to see.


Dockershim not needed: Docker Desktop with Kubernetes 1.24+
https://www.docker.com/blog/dockershim-not-needed-docker-desktop-with-kubernetes-1-24/ (Thu, 28 Apr 2022)

Docker Desktop includes Kubernetes, optimized and tuned for a fast, distraction-free developer experience. We track upstream Kubernetes changes and manage Kubernetes upgrades so developers can focus on their code rather than cluster administration. Lots of people have been asking us about the removal of a component called “dockershim” from upstream Kubernetes and how this will affect Docker Desktop. It won’t affect anyone at all, because in our recent 4.7.0 release we have already switched Kubernetes from “dockershim” to “cri-dockerd”.

Keep Calm and Carry On

As a developer using Docker Desktop, do I need to do anything?

No, there is no need to take any action, everything continues to work as before. There is no need to change your images or your developer workflows. Your images that ran on Kubernetes yesterday with dockershim will run unchanged on Kubernetes 1.24 without dockershim!

As a developer using Docker Desktop, do I need to care which container runtime implementation is used in production?

No, the container “runtime” is an implementation detail of Kubernetes and you don’t need to think about it. There is no need to rebuild any images or containers. Docker pioneered the industry-standard OCI specifications to ensure that containers can run anywhere on any runtime. 

What if I need to bind-mount the Docker socket in my container in a remote Kubernetes cluster?

Provided the Docker engine is still installed on the Kubernetes worker node, the Docker socket can be bind-mounted as before. It’s not necessary for the remote Kubernetes cluster to use the Docker runtime.

What did the dockershim do exactly?

Early versions of Kubernetes created containers directly using the Docker API, through an internal Kubernetes package called “dockershim” (“shim” meaning a small glue layer between Kubernetes internal APIs and the Docker API). Later the Kubernetes project gained the ability to create containers through a new interface called the Container Runtime Interface (CRI) . Until version 1.24, Kubernetes could use the Docker Engine to manage containers in two different ways: one way using the old “dockershim” and a second way using the new Docker CRI implementation from Mirantis (see the post “The future of dockershim is cri-dockerd”). To make Kubernetes easier to manage and maintain, the old “dockershim” has been removed, so all container management is now through the new CRI.

Docker Desktop has already switched from the old dockershim to the new Docker CRI implementation in version 4.7.0, so the removal of dockershim from Kubernetes does not affect Docker Desktop.

What about containerd?

Back in 2016, Docker started working with Google on “containerd”: a container runtime that could be shared by both Docker and Kubernetes. The low-level parts of dockerd were moved to containerd. Today there is a dockerd and containerd side-by-side in every instance of Docker Desktop. Here at Docker we’re committed to improving both dockerd and containerd to bring great new features to developers.

Benefits of Docker’s built-in Kubernetes distribution

Docker Desktop saves developers time and effort by including an optimized distribution of the latest Kubernetes, which can be enabled with a single checkbox.

Developers benefit from a fast “inner-loop” where they can “docker build” an image and then immediately test it from Kubernetes without having to push and then re-pull, all thanks to Docker’s shared image store and cri-dockerd.
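
A minimal sketch of that inner loop, with hypothetical image, pod and port names, could look like the following; because the image store is shared, the cluster uses the freshly built tag without any push or pull:

docker build -t my-app:dev .
kubectl run my-app --image=my-app:dev --image-pull-policy=Never
kubectl port-forward pod/my-app 8080:8080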

Docker Desktop automatically binds network ports on the host and tunnels traffic to Kubernetes, allowing developers to immediately interact with their applications without remembering internal IP addresses or reconfiguring their DNS.

Docker Desktop manages all the complexity of running containers and Kubernetes, allowing developers to focus on their code rather than cluster administration. There is no need to worry about local cluster upgrades, choosing CNI plugins or container runtimes. If you’d like to know more about Docker’s built-in Kubernetes then check out the blog post “How Kubernetes works under the hood with Docker Desktop”.

Managing a custom Kubernetes installation is a complicated task. The advantage of using Docker Desktop over a DIY solution is that the complexity is handled for you. The removal of dockershim from Kubernetes is an example: with a DIY solution you would have to sort out the migration yourself and then deploy the fix to all your developers’ desktops, whereas with Docker Desktop it’s all done for you. Your developers can focus on developing, while Docker Desktop takes care of the administration.

Keep calm and update Docker Desktop with confidence: you’ll get the latest Kubernetes, integrated with the latest Docker tools, all fully tested and ready to go!

How Docker Desktop Networking Works Under the Hood
https://www.docker.com/blog/how-docker-desktop-networking-works-under-the-hood/ (Tue, 25 Jan 2022)

Modern applications make extensive use of networks. At build time it’s common to apt-get/dnf/yum/apk install a package from a Linux distribution’s package repository. At runtime an application may wish to connect() to an internal postgres or mysql database to persist some state, while also calling listen() and accept() to expose APIs and UIs over TCP and UDP ports. Meanwhile developers need to be able to work from anywhere, whether in an office or at home or on mobile or on a VPN. Docker Desktop is designed to ensure that networking “just works” for all of these use-cases in all of these scenarios. This post describes the tools and techniques we use to make this happen, starting with everyone’s favorite protocol suite: TCP/IP.

TCP/IP

When containers want to connect to the outside world, they will use TCP/IP. Since Linux containers require a Linux kernel, Docker Desktop includes a helper Linux VM. Traffic from containers therefore originates from the Linux VM rather than the host, which causes a serious problem.

Many IT departments create VPN policies which say something like, “only forward traffic which originates from the host over the VPN”. The intention is to prevent the host accidentally acting as a router, forwarding insecure traffic from the Internet onto secure corporate networks. Therefore if the VPN software sees traffic from the Linux VM, it will not be routed via the VPN, preventing containers from accessing resources such as internal registries.

Docker Desktop avoids this problem by forwarding all traffic at user-level via vpnkit, a TCP/IP stack written in OCaml on top of the network protocol libraries of the MirageOS Unikernel project. The following diagram shows the flow of packets from the helper VM, through vpnkit and to the Internet:

[Diagram: the flow of packets from the helper VM, through vpnkit, to the Internet]

When the VM boots it requests an address using DHCP. The ethernet frame containing the request is transmitted from the VM to the host over shared memory, either through a virtio device on Mac or through a “hypervisor socket” (AF_VSOCK) on Windows. Vpnkit contains a virtual ethernet switch (mirage-vnetif) which forwards the request to the DHCP (mirage/charrua) server.

Once the VM receives the DHCP response containing the VM’s IP address and the IP of the gateway, it sends an ARP request to discover the ethernet address of the gateway (mirage/arp). Once it has received the ARP response it is ready to send a packet to the Internet.

When vpnkit sees an outgoing packet with a new destination IP address, it creates a virtual TCP/IP stack to represent the remote machine (mirage/mirage-tcpip). This stack acts as the peer of the one in Linux, accepting connections and exchanging packets. When a container calls connect() to establish a TCP connection, Linux sends a TCP packet with the SYNchronize flag set. Vpnkit observes the SYNchronize flag and calls connect() itself from the host. If the connect() succeeds, vpnkit replies to Linux with a TCP SYNchronize packet which completes the TCP handshake. In Linux the connect() succeeds and data is proxied in both directions (mirage/mirage-flow). If the connect() is rejected, vpnkit replies with a TCP RST (reset) packet which causes the connect() inside Linux to return an error. UDP and ICMP are handled similarly.

In addition to low-level TCP/IP, vpnkit has a number of built-in high-level network services, such as a DNS server (mirage/ocaml-dns) and HTTP proxy (mirage/cohttp). These services can be addressed directly via a virtual IP address / DNS name, or indirectly by matching on outgoing traffic and redirecting dynamically, depending on the configuration.

TCP/IP addresses are difficult to work with directly. The next section describes how Docker Desktop uses the Domain Name System (DNS) to give human-readable names to network services.

DNS

Inside Docker Desktop there are multiple DNS servers:

[Diagram: the DNS servers inside Docker Desktop]

DNS requests from containers are first processed by a server inside dockerd, which recognises the names of other containers on the same internal network. This allows containers to easily talk to each other without knowing their internal IP addresses. For example in the diagram there are 3 containers: “nginx”, “golang” and “postgres”, taken from the docker/awesome-compose example. Each time the application is started, the internal IP addresses might be different, but containers can still easily connect to each other by human-readable name thanks to the internal DNS server inside dockerd.

All other name lookups are sent to CoreDNS (from the CNCF). Requests are then forwarded to one of two different DNS servers on the host, depending on the domain name. The domain docker.internal is special and includes the DNS name host.docker.internal which resolves to a valid IP address for the current host. Although we prefer if everything is fully containerized, sometimes it makes sense to run part of an application as a plain old host service. The special name host.docker.internal allows containers to contact these host services in a portable way, without worrying about hardcoding IP addresses.
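
For example, if a service is listening directly on the host on port 8000 (a hypothetical port), a container can reach it by name rather than by a hardcoded IP address:

docker run --rm curlimages/curl -s http://host.docker.internal:8000/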

The second DNS server on the host handles all other requests by resolving them via standard OS system libraries. This ensures that, if a name resolves correctly in the developer’s web-browser, it will also resolve correctly in the developer’s containers. This is particularly important in sophisticated setups, such as pictured in the diagram where some requests are sent over a corporate VPN (e.g. internal.registry.mycompany) while other requests are sent to the regular Internet (e.g. docker.com).

Now that we’ve described DNS, let’s talk about HTTP.

HTTP(S) proxies

Some organizations block direct Internet access and require all traffic to be sent via HTTP proxies for filtering and logging. This affects pulling images during build as well as outgoing network traffic generated by containers.

The simplest method of using an HTTP proxy is to explicitly point the Docker engine at the proxy via environment variables. This has the disadvantage that if the proxy needs to be changed, the Docker engine process must be restarted to update the variables, causing a noticeable glitch. Docker Desktop avoids this by running a custom HTTP proxy inside vpnkit which forwards to the upstream proxy. When the upstream proxy changes, the internal proxy reconfigures itself dynamically, which avoids having to restart the Docker engine.
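
For comparison, the conventional environment-variable approach on a standalone Linux engine looks roughly like the sketch below (the proxy address is a placeholder); note the restart at the end, which is exactly the glitch the dynamic internal proxy avoids:

# These variables must be set in the environment of the dockerd process itself
# (for example via a systemd drop-in), not just in an interactive shell:
HTTP_PROXY=http://proxy.example.com:3128
HTTPS_PROXY=http://proxy.example.com:3128
NO_PROXY=localhost,127.0.0.1
# ...and the engine must then be restarted for the change to take effect:
sudo systemctl restart docker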

On Mac Docker Desktop monitors the proxy settings stored in system preferences. When the computer switches network (e.g. between WiFi networks or onto cellular), Docker Desktop automatically updates the internal HTTP proxy so everything continues to work without the developer having to take any action.

This just about covers containers talking to each other and to the Internet. How do developers talk to the containers?

Port forwarding

When developing applications, it’s useful to be able to expose UIs and APIs on host ports, accessible by debug tools such as web-browsers. Since Docker Desktop runs Linux containers inside a Linux VM, there is a disconnect: the ports are open in the VM but the tools are running on the host. We need something to forward connections from the host into the VM.

[Diagram: forwarding connections from host ports into the helper VM]

Consider debugging a web-application: the developer types docker run -p 80:80 to request that the container’s port 80 is exposed on the host’s port 80 to make it accessible via http://localhost. The Docker API call is written to /var/run/docker.sock on the host as normal. When Docker Desktop is running Linux containers, the Docker engine (dockerd in the diagram above) is a Linux program running inside the helper Linux VM, not natively on the host. Therefore Docker Desktop includes a Docker API proxy which forwards requests from the host to the VM. For security and reliability, the requests are not forwarded directly over TCP over the network. Instead Docker Desktop forwards Unix domain socket connections over a secure low-level transport such as shared-memory hypervisor sockets via processes labeled vpnkit-bridge in the diagram above.

The Docker API proxy can do more than simply forward requests back and forth. It can also decode and transform requests and responses, to improve the developer’s experience. When a developer exposes a port with docker run -p 80:80, the Docker API proxy decodes the request and uses an internal API to request a port forward via the com.docker.backend process. If something on the host is already listening on that port, a human-readable error message is returned to the developer. If the port is free, the com.docker.backend process starts accepting connections and forwarding them to the container via the process vpnkit-forwarder, running on top of vpnkit-bridge.

Docker Desktop does not run with “root” or “Administrator” on the host. A developer can use docker run --privileged to become root inside the helper VM but the hypervisor ensures the host remains completely protected at all times. This is great for security but it causes a usability problem on macOS: how can a developer expose port 80 (docker run -p 80:80) when this is considered a “privileged port” on Unix i.e. a port number < 1024? The solution is that Docker Desktop includes a tiny helper privileged service which does run as root from launchd and which exposes a “please bind this port” API. This raises the question: “is it safe to allow a non-root user to bind privileged ports?”

Originally the notion of a privileged port comes from a time when ports were used to authenticate services: it was safe to assume you were talking to the host’s HTTP daemon because it had bound to port 80, which requires root, so the admin must have arranged it. The modern way to authenticate a service is via TLS certificates and ssh fingerprints, so as long as system services have bound their ports before Docker Desktop has started – macOS arranges this by binding ports on boot with launchd –  there can be no confusion or denial of service. Accordingly, modern macOS has made binding privileged ports on all IPs (0.0.0.0 or INADDR_ANY) an unprivileged operation. There is only one case where Docker Desktop still needs to use the privileged helper to bind ports: when a specific IP is requested (e.g. docker run -p 127.0.0.1:80:80), which still requires root on macOS.

Summary

Applications need reliable network connections for lots of everyday activities including: pulling Docker images, installing Linux packages, communicating with database backends, exposing APIs and UIs and much more. Docker Desktop runs in many different network environments: in the office, at home and while traveling on unreliable wifi. Some machines have restrictive firewall policies installed. Other machines have sophisticated VPN configurations. For all these use-cases in all these environments, Docker Desktop aims to “just work”, so the developer can focus on building and testing their application (rather than debugging ours!)

If building this kind of tooling sounds interesting, come and make Docker Desktop networking even better. We are hiring: see https://www.docker.com/career-openings

DockerCon 2022

Join us for DockerCon 2022 on Tuesday, May 10. DockerCon is a free, one-day virtual event that is a unique experience for developers and development teams who are building the next generation of modern applications. If you want to learn how to go from code to cloud fast and how to solve your development challenges, DockerCon 2022 offers engaging live content to help you build, share and run your applications. Register today at https://www.docker.com/dockercon/

Capturing Logs in Docker Desktop
https://www.docker.com/blog/capturing-logs-in-docker-desktop/ (Fri, 17 Jan 2020)

Photo by Agence Olloweb on Unsplash

Docker Desktop runs a Virtual Machine to host Docker containers. Each component within the VM (including the Docker engine itself) runs as a separate isolated container. This extra layer of isolation introduces an interesting new problem: how do we capture all the logs so we can include them in Docker Desktop diagnostic reports? If we do nothing then the logs will be written separately into each individual container which obviously isn’t very useful!

The Docker Desktop VM boots from an ISO which is built using LinuxKit from a list of Docker images together with a list of capabilities and bind mounts. For a minimal example of a LinuxKit VM definition, see https://github.com/linuxkit/linuxkit/blob/master/examples/minimal.yml — more examples and documentation are available in the LinuxKit repository. The LinuxKit VM in Docker Desktop boots in two phases: in the first phase, the init process executes a series of one-shot “on-boot” actions sequentially using runc to isolate them in containers. These actions typically format disks, enable swap, configure sysctl settings and network interfaces. The second phase contains “services” which are started concurrently and run forever as containerd tasks.

The following diagram shows a simplified high-level view of the boot process:

[Diagram: a simplified high-level view of the boot process]

By default the “on-boot” actions’ stdout and stderr are written both to the VM console and to files in /var/log/onboot.*, while the services’ stdout and stderr are connected directly to open files in /var/log, which are left to grow forever.

Initially we considered adding logging to the VM by running a syslog compatible logging daemon as a regular service that exposes /dev/log or a port (or both). Other services would then connect to syslog to write logs. Unfortunately a logging daemon running in a service would start later — and therefore miss all the logs from — the “on-boot” actions. Furthermore, since services start concurrently, there would be a race between the syslog daemon starting and syslog clients starting: either logs would be lost or each client startup would have to block waiting for the syslog service to start. Running a syslog daemon as an “on-boot” action would avoid the race with services, but we would have to choose where to put it in the “on-boot” actions list. Ideally we would start the logging daemon at the beginning so that no logs are lost, but then we would not have access to persistent disks or the network to store the logs anywhere useful.

In summary we wanted to add a logging mechanism to Docker Desktop that:

  • was able to capture all the logs — both the on-boot actions and the service logs;
  • could write the logs to separate files to make them easier to read in a diagnostics report;
  • could rotate the log files so they don’t grow forever;
  • could be developed within the upstream LinuxKit project; and
  • would not force existing LinuxKit users to rewrite their YAML definitions or modify their existing code.

We decided to implement first-class support for logging by adding a “memory log daemon” called memlogd which starts before the first on-boot action and buffers in memory the last few thousand lines of console output from each container. Since it is only buffering in memory, memlogd does not require any network or persistent storage. A log downloader starts later, after the network and persistent storage is available, connects to memlogd and streams the logs somewhere permanent.

As long as the logs are streamed before the in-memory buffer is full, no lines will be lost. The use of memlogd is entirely optional in LinuxKit; if it is not included in the image then logs are written to the console and directly to open files as before.

Design

We decided to use the Go library container/ring to create a bounded circular buffer. The buffer is bounded to prevent a spammy logging client consuming too much memory in the VM. However if the buffer does fill, then the oldest log lines will be dropped. The following diagram shows the initial design:

[Diagram: the initial memlogd design]

Logging clients send log entries via a file descriptor (labelled “linuxkit-external-logging.sock”). Log downloading programs connect to a query socket (labelled “memlogdq.sock”), read logs from the internal buffer and write them somewhere else.

Recall that one of our design goals was to avoid making changes to each individual container definition to use the new logging system. We don’t want to explicitly bind-mount a logging socket into the container or have to modify the container’s code to connect to it. How then do we capture the output from containers automatically and pass it all to the linuxkit-external-logging.sock?

When an on-boot action or service is launched, the VM’s init system creates a FIFO (for containerd) or a socketpair (for runc) for the stdout and stderr. By convention LinuxKit containers normally write their log entries to stderr. Therefore if we modify the init system, we can capture the logs written to the stderr FIFOs and the socketpairs without having to change the container definition or the code. Once the logs have been captured, the next step is to send them to  memlogd — how do we do that?

A little known feature of Linux is that you can pass open file descriptors to other processes via Unix domain sockets. We can, instead of proxying log lines, just pass an open socket directly to memlogd. We modified the design for memlogd to take advantage of this:

[Diagram: the revised memlogd design, passing open file descriptors to memlogd]

When the container is started, the init system passes the stdout and stderr file descriptors to memlogd along with the name of the container. Memlogd monitors all its file descriptors in a select-loop. When data is available it will be read, tagged with the container name and timestamped before it is appended to the in-memory ringbuffer. When the container terminates, the fd is closed and memlogd removes the fd from the loop.

So this means:

  • we don’t have to modify container YAML definitions or code to be aware of the logging system; and
  • we don’t have to proxy logs between the container and memlogd.

Querying memlogd

To see memlogd in action on a Docker Desktop system, try the following command:

docker run -it --privileged --pid=host justincormack/nsenter1 /usr/bin/logread -F -socket /run/guest-services/memlogdq.sock

This will run a privileged container in the root namespace (containing the “memlogdq.sock” used for querying the logs) and run the utility “logread”, telling it to “follow” the stream i.e. to keep copying from memlogd to the terminal until interrupted. The output looks like this:

2019-02-22T16:04:23Z,docker;time="2019-02-22T16:04:23Z" level=debug msg="registering ttrpc server"

The initial timestamp indicates when memlogd received the message, and “docker” shows that the log came from the docker service. The rest of the line is the output written to stderr.

Kernel logs (kmsg)

In Docker Desktop we include the Linux kernel logs in diagnostic reports to help us understand and fix Linux kernel bugs. We created the kmsg package for this purpose. When this service is started, it connects to /dev/kmsg, streams the kernel logs and writes them to stderr. As stderr is automatically sent to memlogd, the kernel logs are then also included in the VM’s logs and in the diagnostic report. Note that reading kernel logs via /dev/kmsg is a privileged operation, so the kmsg service needs the CAP_SYSLOG capability.
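
To peek at the same kernel log stream yourself, you can read the kernel ring buffer from a privileged container (privileged mode grants the required capability), for example:

docker run --rm --privileged alpine dmesg | head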

Persisting the logs

In Docker Desktop we persist the log entries to files (one per service), rotate them when they become large and delete the oldest to avoid leaking space. We created the logwrite package for this purpose. When this service is started, it connects to the query socket memlogdq.sock, downloads the logs as they are written and manages the log files.

Summary

We now have a relatively simple and lightweight, yet extendable logging system that provides the features we need in Docker Desktop: it captures logs from both “on-boot” actions and services, and persists logs to files with rotation after the file system has been mounted. We developed the logging system in the upstream LinuxKit project where we hope the simple and modular design will allow it to be easily extended by other LinuxKit developers.

This post was joint work with Magnus Skjegstad.

Deep Dive Into the New Docker Desktop Filesharing Implementation Using FUSE
https://www.docker.com/blog/deep-dive-into-new-docker-desktop-filesharing-implementation/ (Mon, 16 Dec 2019)

Photo by Shane Aldendorff on Unsplash

The latest Edge release of Docker Desktop for Windows 2.1.7.0 has a completely new filesharing system using FUSE instead of Samba. The initial blog post we released presents the performance improvements of this new implementation and explains how to give feedback. Please try it out and let us know what you think. Now, we are going to go into details to give you more insight about the new architecture.

New Architecture

Instead of Samba running over a Hyper-V virtual network, the new system uses a Filesystem in Userspace (FUSE) server running over gRPC over Hypervisor sockets.

The following diagram shows the path taken by a single request from a container, for example to read a PHP file:

[Diagram: the path taken by a single read request from a container to the host filesystem]

In step (1) the web-server in the container calls “read” which is a Linux system call handled by the kernel’s Virtual File System (VFS) layer. The VFS is modular and supports many different filesystem implementations. In our case we use Filesystem in Userspace (FUSE) which sends the request to a helper process running inside the VM labelled “FUSE client.” This process runs within the same namespace as the Docker engine. The FUSE client can handle some requests locally, but when it needs to access the host filesystem it connects to the host via “Hypervisor sockets.”

Hypervisor Sockets

Hypervisor sockets are a shared-memory communication mechanism which enables VMs to communicate with each other and with the host. Hypervisor sockets have a number of advantages over using regular virtual networking, including:

  1. since the traffic does not flow over a virtual ethernet/IP network, it is not affected by firewall policies
  2. since the traffic is not routed like IP, it cannot be mis-routed by VPN clients
  3. since the traffic uses shared memory, it can never leave the machine so we don’t have to worry about third parties intercepting it

Docker Desktop already uses these sockets to forward the Docker API, to forward container ports, and now we use them for filesharing on Windows too!

Returning to the diagram, the FUSE client creates sockets using the AF_VSOCK address family, see step (3). The kernel contains a number of low-level transports, one per hypervisor. Since the underlying hypervisor is Hyper-V, we use the VMBus transport. In step (4) filesystem requests are written into the shared memory and read by the VMBus implementation in the Windows kernel. A FUSE server userspace process running in Windows reads the filesystem request over an AF_HYPERV socket in step (5).

FUSE Server

The request to open/close/read/write etc is received by the FUSE server, which is running as a regular Windows process. Finally in step (6) the FUSE server uses the Windows APIs to perform the read or write and then returns the result to the caller.

The FUSE server runs as the user who is running the Docker app, so it only has access to the user’s files and folders. There is no possibility of the VM gaining access to any other files, as could happen in the previous design if a local admin account is used to mount the drive in the VM.

Event Injection

When files are modified in Linux, the kernel generates inotify events. Interested applications can watch for these events and take action. For example, a React app run with

$ npm start

will watch for inotify events and automatically recompile when code changes and trigger the browser to refresh automatically, as shown in this video. In previous versions of Docker Desktop on Windows we weren’t able to generate inotify events so these styles of development simply wouldn’t work.

Injecting inotify events is quite tricky. Normally a Linux VFS implementation like FUSE wouldn’t generate the events itself; instead the common code in the higher layer generates events as a side-effect of performing actions. For example when the VFS “unlink” is called and returns successfully then the “unlink” event will be generated. So when the user calls “unlink” under Windows, how does Linux find out about it?

Docker Desktop watches for events on the host when the user runs docker run -v. When an “unlink” event is received on the host, a request to “please inject unlink” is forwarded over gRPC to the Linux VM. The following diagram shows the sequence of operations:

[Diagram: the sequence of operations used to inject an inotify event]

A thread with a well-known pid inside the FUSE client in Linux “replays” the request by calling “unlink,” even though the directory has actually already been removed. The FUSE client intercepts requests from this well-known pid and pretends that the unlink hasn’t happened yet. For example, when FUSE_GETATTR is called, the FUSE client will say, “yes the directory is still here” (instead of ENOENT). When the FUSE_UNLINK is called, the FUSE client will say, “yes that worked” (instead of ENOENT). As a result of the successful FUSE_UNLINK the Linux kernel generates the inotify event.

Caching

As you can see from the architecture diagram above, each I/O request has to make several user/kernel transitions and VM/host transitions to complete. This means the latency of a filesystem operation is much higher than the case when all the files are local in the VM. We have mitigated this with aggressive use of kernel caching, so many requests can be avoided altogether. We:

  1. use the file attribute cache, which minimises FUSE_GETATTR requests
  2. set FOPEN_CACHE_DIR, which caches directory contents in the kernel page cache
  3. set FOPEN_KEEP_CACHE, which caches file contents
  4. set CAP_MAX_PAGES to increase the maximum request size
  5. use a modern 4.19 series kernel with the latest FUSE patches backported

Since we have enabled so many caches we have to carefully handle cache invalidation. When a user runs docker run -v and we are monitoring the filesystem events for inotify event injection, we also use these events to invalidate cache entries. When the docker run -v exits and the watches are disabled we invalidate all the cache entries.

Future Evolution

We have lots of ideas to improve the performance further through even more aggressive use of caching. For example, in the Symfony benchmark above, the majority of the remaining FUSE calls in the cached case are calls to open and close file handles, even though the file contents are themselves cached (and haven’t changed). We may be able to make these open and close calls lazy and only issue them when needed.

The new filesystem implementation is not relevant on WSL 2 (currently available on early Windows Insider builds), since that already has a native filesharing mode which uses 9P. Of course we will keep benchmarking, optimising and incorporating user feedback to always use the best available filesharing implementation across all OS versions.

New Filesharing Implementation in Docker Desktop Windows Improves Developer Inner Loop UX
https://www.docker.com/blog/new-filesharing-implementation-in-docker-desktop-windows/ (Thu, 12 Dec 2019)

A common developer workflow when using frameworks like Symfony or React is to edit the source code using a Windows IDE while running the app itself in a Docker container. The source is shared between the host and the container with a command like the following:

$ docker run -v C:\Users\me:/code -p 8080:8080 my-symfony-app

This allows the developer to edit the source code, save the changes and immediately see the results in their browser. This is where file sharing performance becomes critical.

The latest Edge release of Docker Desktop for Windows 2.1.7.0 has a completely new filesharing implementation using Filesystem in Userspace (FUSE) instead of Samba which:

  • uses caching to (for example) reduce page load time in Symfony by up to 60%;
  • supports Linux inotify events, triggering automatic recompilation / reload when the source code is changed;
  • is independent of how you authenticate to Windows: smartcard, Azure AD, and other methods are all fine;
  • always works irrespective of whether your VPN is connected or disconnected;
  • reduces the amount of code running as Administrator.

Your feedback needed!

This improvement is available today in the Edge 2.1.7.0 release and will roll out to the stable channel later, once we’ve had enough positive feedback. Please download it, give it a try and let us know how it goes. If you discover any problems, please report them on GitHub and make sure you fill in the description and reproduction steps so that we can quickly investigate.

Big performance improvements

Performance is vital when application source code is being shared between the host and a container. For example when a developer uses the Symfony PHP framework, edits the source code and then reloads the page in the browser, the web-server in the container must re-read many PHP files stored on the host. This must be fast.

The following graph shows the time taken to load a page of a simple Symfony demo in three configurations:

  1. Previous version: this is the implementation in earlier versions of Docker Desktop
  2. Docker Desktop Edge 2.1.7.0: this is the new (faster!) implementation
  3. In-container: the files are not shared from the host at all, instead they are stored in the container to show the upper limit on possible future performance.
[Chart: page load latency (in seconds) for the three configurations, first and second fetch]

The two bars on the left hand side show the latency (in seconds) using an older version of Docker Desktop. Note that the second fetch is only slightly better than the first, suggesting that the effect of caching is small.

The two bars on the right hand side show the latency when the files are not shared at all, but are stored entirely inside the VM. This is the upper limit on performance if the volume sharing system were perfect and had zero overheads.

The two bars in the middle show the latency when the files are shared with the new system in Docker Desktop Edge 2.1.7.0. The initial (uncached) fetch is already better than with the previous Desktop version, but the second (cached) fetch is 60% faster!

Additional enhancements

As well as big performance improvements, the new implementation has the following additional benefits:

  • The new version can’t conflict with organisation-wide security policies as we don’t need to use Administrator privileges to share the drive and create a firewall exception for port 445.
  • The new version doesn’t require the user to enter their domain credentials. Not only is this fundamentally more secure, but it avoids the user having to re-enter their credentials every time they change their password. Many organisations require regular password changes, which means the user needed to refresh the credentials frequently.
  • The new version supports users who authenticate via a smartcard, or AzureAD or any other method. Previously we could only support users who login with a username and password.
  • The new version is immune to a class of problems caused by enterprise VPN clients and endpoint security software clashing with the Hyper-V network adapter.

Stay tuned for a follow up post that deep dives into the new Docker Desktop filesharing implementation using FUSE.

Addressing Time Drift in Docker Desktop for Mac
https://www.docker.com/blog/addressing-time-drift-in-docker-desktop-for-mac/ (Mon, 25 Feb 2019)

Docker Desktop for Mac runs the Docker engine and Linux containers in a helper LinuxKit VM since macOS doesn’t have native container support. The helper VM has its own internal clock, separate from the host’s clock. When the two clocks drift apart then suddenly commands which rely on the time, or on file timestamps, may start to behave differently. For example “make” will stop working properly across shared volumes (“docker run -v”) when the modification times on source files (typically written on the host) are older than the modification times on the binaries (typically written in the VM) even after the source files are changed. Time drift can be very frustrating as you can see by reading issues such as https://github.com/docker/for-mac/issues/2076.
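
An easy way to observe the drift, with no special tooling, is to compare the host clock with the clock inside a container; on an affected install the two values slowly diverge:

$ date -u ; docker run --rm alpine date -u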

Wait, doesn’t the VM have a (virtual) hardware Real Time Clock (RTC)?

When the helper VM boots the clocks are initially synchronised by an explicit invocation of “hwclock -s” which reads the virtual RTC in HyperKit. Unfortunately reading the RTC is a slow operation (both on physical hardware and virtual) so the Linux kernel builds its own internal clock on top of other sources of timing information, known as clocksources. The most reliable is usually the CPU Time Stamp Counter (“tsc”) clocksource which measures time by counting the number of CPU cycles since the last CPU reset. TSC counters are frequently used for benchmarking, where the current TSC value is read (via the rdtsc instruction) at the beginning and then again at the end of a test run. The two values can then be subtracted to yield the time the code took to run in CPU cycles. However there are problems when we try to use these counters long-term as a reliable source of absolute physical time, particularly when running in a VM:

  • There is no reliable way to discover the TSC frequency: without this we don’t know what to divide the counter values by to transform the result into seconds.
  • Some power management technology will change the TSC frequency dynamically.
  • The counter can jump back to 0 when the physical CPU is reset, for example over a host suspend / resume.
  • When a virtual CPU is stopped executing on one physical CPU core and later starts executing on another one, the TSC counter can suddenly jump forwards or backwards.

The unreliability of using TSC counters can be seen on this Docker Desktop for Mac install:

$ docker run --rm --privileged alpine /bin/dmesg | grep clocksource

[    3.486187] clocksource: Switched to clocksource tsc
[ 6963.789123] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 6963.792264] clocksource:                       'hpet' wd_now: 388d8fc2 wd_last: 377f3b7c mask: ffffffff
[ 6963.794806] clocksource:                       'tsc' cs_now: 104a0911ec5a cs_last: 10492ccc2aec mask: ffffffffffffffff
[ 6963.797812] clocksource: Switched to clocksource hpet

Many hypervisors fix these problems by providing an explicit “paravirtualised clock” interface, providing enough additional information to the VM to allow it to correctly convert a TSC value to seconds. Unfortunately the Hypervisor.framework on the Mac does not provide enough information (particularly over suspend/resume) to allow the implementation of such a paravirtualised timesource so we reported the issue to Apple and searched for a workaround.

How bad is this in practice?

I wrote a simple tool to measure the time drift between a VM and the host; the source is here. I created a small LinuxKit test VM without any kind of time synchronisation software installed and measured the “natural” clock drift after the VM boots:

[Graph: natural clock drift after VM boot; each line is a separate test run]

Each line on the graph shows the clock drift for a different test run. From the graph it appears that the time in the VM loses roughly 2ms for every 3s of host time which passes. Once the total drift gets to about 1s (after approximately 1500s or 25 minutes) it will start to get really annoying.

OK, can we turn on NTP and forget about it?

The Network Time Protocol (NTP) is designed to keep clocks in sync so it should be ideal. The question then becomes

  • which client?
  • which server?

How about using the “default” pool.ntp.org like everyone else?

Many machines and devices use the free pool.ntp.org as their NTP server. This is a bad idea for us for several reasons:

  • it’s against their guidelines (http://www.pool.ntp.org/en/vendors.html), although we could register as a vendor
  • there’s no guarantee clocks in the NTP server pool are themselves well-synchronised
  • people don’t like their Mac sending unexpected UDP traffic; they fear it’s a malware infestation
  • anyway… we don’t want the VM to synchronise with atomic clocks in some random physics lab, we want it to synchronise with the host (so the timestamps work). If the host itself has drifted 30 minutes away from “real” time, we want the VM to also be 30 minutes away from “real” time.

Therefore in Docker Desktop we should run our own NTP server on the host, serving the host’s clock.

Which server implementation should we use?

The NTP protocol is designed to be robust and globally scalable. Servers with accurate clock hardware (e.g. an atomic clock or a GPS feed containing a signal from an atomic clock) are relatively rare, so not all other hosts can connect directly to them. NTP servers are arranged in a hierarchy where lower “strata” synchronise with the stratum directly above, and end-users and devices synchronise with the servers at the bottom. Since our use-case only involves one server and one client, this is all unnecessary, so we use Simple NTP (SNTP) as described in RFC 2030, which enables clients (in our case the VM) to synchronise immediately with a server (in our case the host).

Which NTP client should we use (and does it even matter)?

Early versions of Docker Desktop included openntpd from the upstream LinuxKit package. The following graph shows the time drift on one VM boot where openntpd runs for the first 10000s and then we switch to the busybox NTP client:

[Graph: clock drift with openntpd running for the first 10000s, then the busybox NTP client]

The diagram shows the clock still drifting significantly with openntpd running but it’s “fixed” by running busybox — why is this? To understand this it’s important to first understand how an NTP client adjusts the Linux kernel clock:

  • adjtime (3) – this accepts a delta (e.g. -10s) and tells the kernel to gradually adjust the system clock avoiding suddenly moving the clock forward (or backward, which can cause problems with timing loops which aren’t using monotonic clocks)
  • adjtimex (2) – this allows the kernel clock rate itself to be adjusted, to cope with systematic drift like we are suffering from
  • settimeofday (2) – this immediately bumps the clock to the given time

If we look at line 433 of openntpd ntp.c (sorry no direct link in cvsweb) then we can see that openntpd is using adjtime to periodically add a delta to the clock, to try to correct the drift. This could also be seen in the openntpd logs. So why wasn’t this effective?

The following graph shows how the natural clock drift is affected by a call to adjtime(+10s) and adjtime(-10s):

[Graph: the effect of adjtime(+10s) and adjtime(-10s) on the natural clock drift]

It seems the “natural” drift we’re experiencing is so large it can’t be compensated for solely by using “adjtime”. The reason busybox performs better for us is because it adjusts the clock rate itself using “adjtimex”.

The following graph shows the change in kernel clock frequency (timex.freq) viewed using adjtimex. For the first 10000s we use openntpd (and hence the adjustment is 0 since it doesn’t use the API) and for the rest of the graph we used busybox:

[Graph: kernel clock frequency adjustment (timex.freq) with openntpd for the first 10000s, then busybox]

Note how the adjustment value remains flat and very negative after an initial spike upwards. I have to admit when I first saw this graph I was disappointed — I was hoping to see something zig-zag up and down, as the clock rate was constantly micro-managed to remain stable.

Is there anything special about the final value of the kernel frequency offset?

Unfortunately it is special. From the adjtimex (2) manpage:

ADJ_FREQUENCY
             Set frequency offset from buf.freq.  Since Linux 2.6.26, the
             supplied value is clamped to the range (-32768000, +32768000).

So it looks like busybox slowed the clock by the maximum amount (-32768000) to correct the systematic drift. According to the adjtimex(8) manpage a value of 65536 corresponds to 1ppm, so 32768000 corresponds to 500ppm. Recall that the original estimate of the systematic drift was 2ms every 3s, which is about 666ppm. This isn’t good: this means that we’re right at the limit of what adjtimex can do to compensate for it and are probably also relying on adjtime to provide additional adjustments. Unfortunately all our tests have been on one single machine and it’s easy to imagine a different system (perhaps with different powersaving behaviour) where even adjtimex + adjtime would be unable to cope with the drift.

So what should we do?

The main reason why NTP clients use APIs like adjtime and adjtimex is because they want

  • monotonicity: i.e. to ensure time never goes backwards because this can cause bugs in programs which aren’t using monotonic clocks for timing, for example How and why the leap second affected Cloudflare DNS; and
  • smoothness: i.e. no sudden jumping forwards, triggering lots of timing loops, cron jobs etc at once.

Docker Desktop is used by developers to build and test their code on their laptops and desktops. Developers routinely edit files on their host with an IDE and then build them in a container using docker run -v. This requires the clock in the VM to be synchronised with the clock in the host, otherwise tools like make will fail to rebuild changed source files correctly.

Option 1: adjust the kernel “tick”

According to adjtimex(8) it’s possible to adjust the kernel “tick”:

Set the number of microseconds that should be added to the system time for each kernel tick interrupt. For a kernel with USER_HZ=100, there are supposed to be 100 ticks per second, so val should be close to 10000. Increasing val by 1 speeds up the system clock by about 100 ppm,

If we knew (or could measure) the systematic drift, we could make a coarse-grained adjustment with the “tick” and then let busybox NTP manage the remaining drift.
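
As a rough sketch (the numbers are purely illustrative and assume the adjtimex(8) utility quoted above, with USER_HZ=100), a measured drift of roughly 600ppm could be compensated by bumping the tick value by about 6:

$ adjtimex --print          # inspect the current tick and frequency values
$ adjtimex --tick 10006     # each +1 on tick speeds the clock up by roughly 100ppm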

Option 2: regularly bump the clock forward with settimeofday (2)

If we assume that the clock in the VM is always running slower than the real physical clock (because it is virtualised, vCPUs are periodically descheduled etc) and if we don’t care about smoothness, we could use an NTP client which calls settimeofday (2) periodically to immediately resync the clock.

The choice

Although option 1 could potentially provide the best results, we decided to keep it simple and go with option 2: regularly bump the clock forward with settimeofday (2) rather than attempt to measure and adjust the kernel tick. We assume that the VM clock always runs slower than the host clock but we don’t have to measure exactly how much it runs slower, or assume that the slowness remains constant over time, or across different hardware setups. The solution is very simple and easy to understand. The VM clock should stay in close sync with the host and it should still be monotonic but it will not be very smooth.

We use an NTP client called SNTPC written by Natanael Copa, founder of Alpine Linux (quite a coincidence considering we use Alpine extensively in Docker Desktop). SNTPC can be configured to call settimeofday every n seconds with the following results:

[Graph: clock drift with sntpc calling settimeofday every 30s]

As you can see in the graph, every 30s the VM clock has fallen behind by 20ms and is then bumped forward. Note that since the VM clock is always running slower than the host, the VM clock always jumps forwards but never backwards, maintaining monotonicity.

Just a sec, couldn’t we just run hwclock -s every 30s?

Rather than running a simple NTP client and server communicating over UDP every 30s we could instead run hwclock -s to synchronise with the hardware RTC every 30s. Reading the RTC is inefficient because the memory reads trap to the hypervisor and block the vCPU, unlike UDP traffic which is efficiently queued in shared memory; however the code would be simple and an expensive operation once every 30s isn’t too bad. How well would it actually keep the clock in sync?
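
The loop we tested amounts to something like this inside the VM (a sketch):

$ while true; do hwclock -s; sleep 30; done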

[Graph: clock drift when running hwclock -s every 30s]

 

Unfortunately running hwclock -s in a HyperKit Linux VM only manages to keep the clock in sync within about 1s, which would be quite noticeable when editing code and immediately recompiling it. So we’ll stick with NTP.

Final design

The final design looks like this:

 

[Diagram: the final time synchronisation design]

In the VM the sntpc process sends UDP on port 123 (the NTP port) to the virtual NTP server running on the gateway, managed by the vpnkit process on the host. The NTP traffic is forwarded to a custom SNTP server running on localhost which executes gettimeofday and replies. The sntpc process receives the reply, calculates the local time by subtracting an estimate of the round-trip-time and calls settimeofday to bump the clock to the correct value.

In Summary

  • Timekeeping in computers is hard (especially so in virtual computers)
  • There’s a serious source of systematic drift in the macOS Hypervisor.framework / HyperKit / Linux system.
  • Minimising time drift is important for developer use-cases where file timestamps are used in build tools like make
  • For developer use-cases we don’t mind if the clock moves abruptly forwards
  • We’ve settled on a simple design using a standard protocol (SNTP) and a pre-existing client (sntpc)
  • The new design is very simple and should be robust: even if the clock drift rate is faster in future (e.g. because of a bug in the Hypervisor.framework or HyperKit) the clock will still be kept in sync.
  • The new code has shipped on both the stable and the edge channels (as of 18.05)
