When the NVIDIA Container Toolkit pod shuts down, it updates containerd's config.toml in such a way that the etcd and API server pods crash after a reboot. Here are the details, based on my observations of a single-node KVM configuration with GPU passthrough enabled.
Current NVIDIA image tags:
gpu-operator:v23.6.1
gpu-operator-validator:v23.6.1
node-feature-discovery:v0.13.1-minimal
container-toolkit:v1.13.4-centos7
Below is a snippet of the clean config.toml that the OMNI installer deploys. Specifically, the [plugins] section of config.toml contains the following entry:
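(For illustration, this is what a typical default runc runtime entry looks like in a containerd v2 config; the exact snippet deployed by the OMNI installer may differ slightly in values and formatting.)

```toml
# Typical containerd default: the CRI plugin launches containers via the
# runc v2 shim. SystemdCgroup may be true or false depending on the host's
# cgroup driver configuration.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```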
Kubernetes, etcd, and the API server pods work without a problem with this configuration. You can reboot the host and Kubernetes recovers fully, assuming no GPU is detected.
When the gpu-operator pod detects a GPU, additional pods are started, including the nvidia-container-toolkit DaemonSet pod. The nvidia-container-toolkit pod updates config.toml during startup. Among other changes, the entry above is rewritten as follows:
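(Again for illustration, this is a representative example of the rewritten entry; the exact output produced by the toolkit may differ, and the runc path is an assumption.)

```toml
# After the NVIDIA Container Toolkit rewrites the config, the runc entry is
# kept but an explicit BinaryName is added (path assumed here). The toolkit
# also adds separate "nvidia" runtime tables alongside this entry.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    BinaryName = "/usr/bin/runc"
    SystemdCgroup = true
```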
These are mostly formatting changes, but one difference is the addition of BinaryName. Kubernetes still works fine, the etcd and API server pods work fine, and the GPU works as expected (all tests passed).
However, when the nvidia-container-toolkit pod shuts down (via kubectl or during host shutdown), it appears that the pod attempts to clean up the changes it made. In the process, the “runc” runtime entry above is removed entirely from config.toml. When you then reboot the host, the etcd and API server pods start, crash, and repeat every few minutes, and Kubernetes never recovers. Other symptoms include:
This error in the shell console: haproxy[1284]: backend kube-api-backend has no server available!
crictl ps | egrep "etcd|kube" does not show all four static pods (etcd, kube-apiserver, kube-scheduler, kube-controller-manager), and/or some of the pods have a recent start time.
kubectl and k9s intermittently fail to connect (“Unable to connect to the server”).
When I manually add the deleted section back into config.toml and reboot, Kubernetes and the static pods come back up correctly.
I re-ran the test after upgrading the NVIDIA containers but saw the same behavior. I used the following updated artifacts for this test:
gpu-operator:v24.9.2
gpu-operator helm chart:v24.9.2
gpu-operator-validator:v24.9.2
node-feature-discovery:v0.16.6
container-toolkit:v1.17.5-ubi8
We will provide a patch that restores the default config.toml, but it may be worthwhile to reach out to NVIDIA to see whether there is a bug in their code or whether there is something we can configure on our side to address the Kubernetes failure after a reboot.