Robots and hard drives

Fighting with hard drives

TL;DR

Changing kubelet’s root directory can break CSI drivers and other components. You can work around this, but the effort isn’t worth it: stick with the default /var/lib/kubelet.

Motivation

I was setting up the Kubernetes kubelet on a host with numerous NVMe drives. I thought, "Let's use these for ephemeral-storage in Kubernetes." The host had no other need for all these drives, and Kubernetes could let pods claim them as ephemeral storage.

From the documentation, ephemeral-storage typically lives in the kubelet's root-dir: /var/lib/kubelet. Since this base path is configurable, my initial thought was to assemble the NVMe drives into a RAID array and point the kubelet base path at that volume. Kubelet has just the parameter for this: --root-dir

--root-dir string    Default: /var/lib/kubelet

Directory path for managing kubelet files (volume mounts, etc).

Here was my setup:

# make RAID at /raid
mdadm ... 
mkdir -p /raid/kubelet
mkdir -p /raid/containers  # (TODO: containerd as well?)
kubelet --root-dir /raid/kubelet

Everything works! The node joins. I do kubectl describe node and see lots of available ephemeral storage. Everything is great!
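As a quick sanity check, the advertised capacity can also be read straight off the node object (<node-name> is a placeholder):

# Show the ephemeral-storage capacity kubelet is reporting for the node
kubectl get node <node-name> -o jsonpath='{.status.capacity.ephemeral-storage}'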

It wasn’t great

Later, I noticed a pod couldn’t schedule. The pod requested a mount of a remote file system provided by a CSI driver. The scheduler logs showed an error message like this:

pod/jacktest on node A fit failed: Insufficient cpu: 20
pod/jacktest on node B fit failed: Untolerated taint: nvidia.com/gpu
pod/jacktest on node C fit failed: Insufficient attachable-volumes-lustre-csi.hpe.com: 1

Initially, this error was very confusing to me. Going off the other errors, like insufficient CPU, I assumed there would be a node resource, maybe named lustre-csi.hpe.com, in the output of kubectl describe node <X>. But it seemed very strange to me that a Lustre file system would be a resource at all.

Instead, this resource was a CSI driver (obvious in retrospect). The idea is that a driver pod registers a Unix domain socket on the host, and kubelet communicates with the driver pod over that UDS using this gRPC spec. For the UDS to work, there needs to be an agreed-upon location for the socket that is shared by the pod and the host; with that, the pod and host processes can communicate. If you’re using a CSI driver from the internet, it’s probably configured like this, from the CSI driver docs:

      containers:
      - name: my-csi-driver
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
        - name: mountpoint-dir
          mountPath: /var/lib/kubelet/pods
          mountPropagation: "Bidirectional"
      volumes:
      # This volume is where the socket for kubelet->driver communication is done
      - name: socket-dir
        hostPath:
          path: /var/lib/kubelet/plugins/<driver-name>
          type: DirectoryOrCreate
      # This volume is where the driver mounts volumes
      - name: mountpoint-dir
        hostPath:
          path: /var/lib/kubelet/pods
          type: Directory
      # This volume is where the node-driver-registrar registers the plugin
      # with kubelet
      - name: registration-dir
        hostPath:
          path: /var/lib/kubelet/plugins_registry
          type: Directory

The really important parts are the hostPath entries and /var/lib/kubelet. Almost all CSI drivers default to this path, and some hard-code it. Here is an example inside the HewlettPackard Lustre driver.

When you change kubelet’s --root-dir, you change where kubelet expects to find all of these other drivers. So when I changed root-dir to /raid/kubelet, I also needed to update all of my CSI drivers to use a hostPath under /raid/kubelet.
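Concretely, with my /raid/kubelet root-dir, the hostPath volumes in the manifest above would have to be rewritten to something like this (these paths are just the earlier example with the prefix swapped):

      volumes:
      - name: socket-dir
        hostPath:
          path: /raid/kubelet/plugins/<driver-name>
          type: DirectoryOrCreate
      - name: mountpoint-dir
        hostPath:
          path: /raid/kubelet/pods
          type: Directory
      - name: registration-dir
        hostPath:
          path: /raid/kubelet/plugins_registry
          type: Directory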

While it’s possible to iterate over all my existing DaemonSets and update them, this process is very fragile and will break future DaemonSets I try to integrate. The much better solution is to bind mount your RAID volume at /var/lib/kubelet so that you don’t have to change any of these existing DaemonSets; a sketch of that follows.
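Here is a rough sketch of that approach. The device names, RAID level, and filesystem are placeholders for whatever your host actually has, and kubelet should be stopped while its state is moved:

# Assemble the NVMe drives into a RAID array (device names are placeholders)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md0
mkdir -p /raid
mount /dev/md0 /raid

# Move any existing kubelet state onto the RAID (with kubelet stopped)
mkdir -p /raid/kubelet
cp -a /var/lib/kubelet/. /raid/kubelet/

# Bind mount the RAID directory at the path every component already expects
mkdir -p /var/lib/kubelet
mount --bind /raid/kubelet /var/lib/kubelet
# Add matching entries to /etc/fstab if this needs to survive reboots

# Start kubelet as usual: --root-dir stays at its default /var/lib/kubelet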

Prior work

This solution is exactly what AWS does by default. Here is the script they use: setup-local-disks. Inside maybe_raid0 there is this comment:

# Sets up a RAID-0 of NVMe instance storage disks, moves
# the contents of /var/lib/kubelet and /var/lib/containerd
# to the new mounted RAID, and bind mounts the kubelet and
# containerd state directories.

The general steps are:

  • run mdadm ...
  • move the contents of /var/lib/kubelet and /var/lib/containerd onto the new RAID (since they use containerd)
  • bind mount the state directories back at their original paths (the containerd half is sketched below)
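The containerd half looks the same as the kubelet sketch above; a minimal version under the same assumptions (stop containerd before moving its state):

# Move containerd state onto the RAID and bind mount it back in place
mkdir -p /raid/containerd
cp -a /var/lib/containerd/. /raid/containerd/
mount --bind /raid/containerd /var/lib/containerd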

General advice: just keep /var/lib/kubelet

Many components in Kubernetes, especially DaemonSets, assume /var/lib/kubelet is the kubelet’s root directory. Just keep it there, and if you want a special disk or RAID volume behind it, use /var/lib/kubelet as the mount location.