When the NVIDIA Container Toolkit pod shuts down, it updates containerd's config.toml in such a way that the etcd and API server pods crash after a reboot. Here are the details, based on my observations of a single-node KVM configuration with GPU passthrough enabled.
Current NVIDIA image tags:
gpu-operator:v23.6.1
gpu-operator-validator:v23.6.1
node-feature-discovery:v0.13.1-minimal
container-toolkit:v1.13.4-centos7
Below is a snippet of the clean config.toml that the OMNI installer deploys. Specifically, the [plugins] section of config.toml contains the following entry:
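(For illustration, this is what a typical default runc runtime entry looks like in a containerd v2 config; the exact snippet deployed by the OMNI installer may differ slightly in values and formatting.)

```toml
# Typical containerd default: the CRI plugin launches containers via the
# runc v2 shim. SystemdCgroup may be true or false depending on the host's
# cgroup driver configuration.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```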
Kubernetes, etcd, and the API server pods work without a problem with this configuration. You can reboot the host and Kubernetes recovers fully, assuming no GPU is detected.
When the gpu-operator pod detects a GPU, additional pods are started, including the nvidia-container-toolkit DaemonSet pod. The nvidia-container-toolkit pod updates config.toml during startup. Among other changes, the entry above is rewritten as follows:
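(Again for illustration, this is a representative example of the rewritten entry; the exact output produced by the toolkit may differ, and the runc path is an assumption.)

```toml
# After the NVIDIA Container Toolkit rewrites the config, the runc entry is
# kept but an explicit BinaryName is added (path assumed here). The toolkit
# also adds separate "nvidia" runtime tables alongside this entry.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    BinaryName = "/usr/bin/runc"
    SystemdCgroup = true
```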
These are mostly formatting changes, but one difference is the addition of BinaryName. Kubernetes still works fine, the etcd and API server pods work fine, and the GPU works as expected (all tests passed).
However, when the nvidia-container-toolkit pod shuts down (via kubectl or during host shutdown), it appears that the pod attempts to clean up the changes it made. In the process, the “runc” runtime entry above is removed entirely from config.toml. When you then reboot the host, the etcd and API server pods start, crash, and repeat every few minutes, and Kubernetes never recovers. Other symptoms include:
This error in the shell console: haproxy[1284]: backend kube-api-backend has no server available!
crictl ps | egrep "etcd|kube" does not show all four static pods (etcd, kube-apiserver, kube-scheduler, kube-controller-manager), and/or some of the pods have a recent start time.
kubectl and k9s intermittently fail to connect (“Unable to connect to the server”).
When I manually add the deleted section back into config.toml and reboot, Kubernetes and the static pods come back up correctly.
I re-ran the test after upgrading the NVIDIA containers but saw the same behavior. I used the following updated artifacts for this test:
gpu-operator:v24.9.2
gpu-operator helm chart:v24.9.2
gpu-operator-validator:v24.9.2
node-feature-discovery:v0.16.6
container-toolkit:v1.17.5-ubi8
We will provide a patch that restores the default config.toml, but it may be worthwhile to reach out to NVIDIA to see whether there is a bug in their code or whether there is something we can configure on our side to address the Kubernetes failure after a reboot.