
Container Toolkit shutdown config.toml modification #1040

Open

mike-hayes opened this issue Apr 18, 2025 · 0 comments

Comments

@mike-hayes

When the NVIDIA Container Toolkit pod shuts down, it updates containerd's config.toml in a way that causes the etcd and API server pods to crash after a reboot. Here are the details, based on my observations of a single-node KVM configuration with GPU passthrough enabled.

Current NVIDIA image tags:
gpu-operator:v23.6.1
gpu-operator-validator:v23.6.1
node-feature-discovery:v0.13.1-minimal
container-toolkit:v1.13.4-centos7

Below is a snippet of the clean config.toml that the OMNI installer deploys. Specifically, the [plugins] section of config.toml contains the following entry:

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      runtime_engine = ""
      runtime_root = ""
      privileged_without_host_devices = false
      base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true

Kubernetes, the etcd pod, and the API server pod all work without a problem with this configuration. You can reboot the host and Kubernetes recovers fully, as long as no GPUs are detected.

When the gpu-operator pod detects a GPU, additional pods are started, including the nvidia-container-toolkit daemonset pod. The nvidia-container-toolkit pod updates config.toml during startup. Among other changes, the entry above becomes:

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      base_runtime_spec = ""
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        SystemdCgroup = true

These are mostly formatting changes; the one functional difference is the addition of BinaryName. Kubernetes still works fine, the etcd and API server pods work fine, and the GPU works as expected (all tests passed).
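For anyone reproducing this, one way to confirm which runtime configuration containerd is actually using (the paths below are the defaults on this node; adjust if your layout differs):

    # Dump the merged configuration containerd is currently running with
    containerd config dump | grep -A 10 'runtimes.runc'

    # Or inspect the file the toolkit rewrites directly
    grep -A 10 'containerd.runtimes.runc' /etc/containerd/config.toml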

However, when the nvidia-container-toolkit pod shuts down (via kubectl or during a host shutdown), the pod appears to "clean up" the changes it made. In the process, the "runc" runtime entry shown above is removed from config.toml entirely (a quick way to confirm this is sketched after the symptom list). Now when you reboot the host, the etcd and API server pods start, crash, and repeat every few minutes, and Kubernetes never recovers. Other symptoms include:

- This error on the shell console: haproxy[1284]: backend kube-api-backend has no server available!
- crictl ps | egrep "etcd|kube" does not show the four static pods (etcd, kube-apiserver, kube-scheduler, kube-controller-manager), and/or some of the pods have a very recent start time.
- kubectl and k9s intermittently fail with "Unable to connect to the server".
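On an affected node, the following checks (assuming the default containerd path and systemd units) make the root cause visible:

    # The runc runtime block is gone from the rewritten config
    grep -c 'runtimes.runc' /etc/containerd/config.toml   # prints 0 on an affected node

    # containerd/kubelet logs complain about the missing runtime handler
    # (exact wording varies by version)
    journalctl -u containerd --since "30 min ago" | grep -i runc
    journalctl -u kubelet --since "30 min ago" | grep -i runc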

When I manually add the deleted section back into config.toml and reboot, Kubernetes and the static pods come back up correctly.
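For reference, the manual recovery boils down to re-adding the runc block shown above and restarting containerd. A full reboot should not strictly be necessary, although the reboot path is the one I actually verified:

    # After re-adding the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    # block to /etc/containerd/config.toml:
    sudo systemctl restart containerd

    # The four static pods should come back and stay running
    crictl ps | egrep 'etcd|kube'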

I re-ran the test after upgrading the NVIDIA containers but saw the same behavior. I used the following updated artifacts for this test:

gpu-operator:v24.9.2
gpu-operator helm chart:v24.9.2
gpu-operator-validator:v24.9.2
node-feature-discovery:v0.16.6
container-toolkit:v1.17.5-ubi8

We will provide a patch that restores the default config.toml, but it may be worthwhile to reach out to NVIDIA to see whether there is a bug in their code, or whether there is something we can configure on our side to avoid the Kubernetes failure after a reboot.
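For what it's worth, the interim patch we have in mind is essentially a boot-time guard along the following lines. This is only a sketch: the backup path and the idea of keeping a known-good copy of config.toml are our own assumptions, not something the toolkit provides, and the script would need to run before kubelet starts (for example from a systemd oneshot unit ordered before kubelet.service).

    #!/bin/sh
    # restore-containerd-runc.sh -- sketch of a boot-time guard (paths are assumptions)
    CONF=/etc/containerd/config.toml
    BACKUP=/etc/containerd/config.toml.orig   # known-good copy saved at install time

    # If the toolkit's shutdown clean-up stripped the runc runtime block,
    # fall back to the known-good config and restart containerd.
    if ! grep -q 'containerd\.runtimes\.runc\]' "$CONF"; then
        cp "$BACKUP" "$CONF"
        systemctl restart containerd
    fi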
