NVIDIA Container Runtime Not Functioning Correctly in RKE2 (Missing Devices/Libraries) #1089
Comments
Just so that it's listed, which version of the NVIDIA Container Toolkit is installed? i.e. what is the output of nvidia-ctk --version?
Will do first thing Monday morning.
nvidia-ctk --version: NVIDIA Container Toolkit CLI version 1.17.6
nvidia-container-cli --version: cli-version: 1.17.6
Host machine is running:
I think I know what the issue is. The toolkit is configured to REQUIRE volume mounts to request devices for non-privileged containers, but the device plugin is configured to use environment variables (the default). Installing / upgrading the device plugin and specifying the volume-mounts device-list strategy should address this. Since you mention that v0.17.1 of the device plugin isn't working, do you have the logs for that case?
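For reference, a minimal sketch of the device-plugin side of that change, assuming the nvidia-device-plugin Helm chart's deviceListStrategy and runtimeClassName values; the snippet below illustrates the fix described above and is not quoted from the thread:

```yaml
# values.yaml sketch for the nvidia-device-plugin Helm chart (illustrative).
# deviceListStrategy makes the plugin advertise GPUs to the runtime via
# volume mounts instead of the NVIDIA_VISIBLE_DEVICES env var (the default),
# matching a toolkit configured to require volume-mount device requests.
deviceListStrategy: volume-mounts

# On RKE2 the plugin pods themselves typically also need the nvidia
# runtime class so they can see the driver libraries.
runtimeClassName: nvidia
```

The same setting can alternatively be passed on the command line with helm's --set flag (e.g. --set deviceListStrategy=volume-mounts).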
I will implement these changes shortly.
I used v0.17.1 and it worked!
@nvidia-test:/$ nvidia-smi
Thu May 22 18:16:43 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA TITAN V Off | 00000000:88:00.0 Off | N/A |
| 29% 43C P8 27W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Sincerely, thank you so much for your help. Please let me know if I should provide any other details for the benefit of others in this issue.
Desired goal
Spin up an RKE2 cluster with the ability to deploy GPU-enabled pods via runtimeClassName: nvidia.
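For concreteness, a minimal sketch of such a pod spec; the pod name, image tag, and resource request below are placeholders rather than values taken from the issue:

```yaml
# Minimal GPU test pod (illustrative; names and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  runtimeClassName: nvidia          # select the nvidia containerd runtime handler
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1         # request one GPU from the device plugin
```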
Issues I've observed
After I create a pod with a pod spec that sets runtimeClassName: nvidia, the pod deploys successfully.
Describing the pod:
Grabbing a shell inside the container and trying nvidia-smi:
Missing GPU devices under /dev inside the container:
I have no name!@nvidia-test:/dev$ ls
core  fd  full  mqueue  null  ptmx  pts  random  shm  stderr  stdin  stdout  termination-log  tty  urandom  zero
Missing nvidia-smi and other NVIDIA executables in /usr/bin.
Below I'll post all relevant setup steps and configs.
Node configuration:
sudo apt install nvidia-headless-570-server nvidia-utils-570-server
nvidia-smi from the host:
which nvidia-container-runtime:
Worth mentioning that installing NVIDIA's Container Toolkit sets up almost all of the needed configuration in both /etc/nvidia-container-runtime/config.toml and /var/lib/rancher/rke2/agent/etc/containerd/config.toml.
Cluster deployments, resources, etc
kubectl describe node
kubectl get runtimeclasses
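For completeness, a RuntimeClass named nvidia of the kind referenced above typically looks like the sketch below; the handler name assumes the containerd runtime handler is registered as nvidia in the RKE2 containerd config, as the toolkit setup described above does:

```yaml
# RuntimeClass mapping the Kubernetes name "nvidia" to the containerd
# runtime handler of the same name (sketch; handler name assumed).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```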
nvidia-device-plugin deployed via Helm. Had to use v0.15.0 because v0.17.1 (latest) didn't seem to work.
I'm really at a loss as to what's going on, and I've spent the last few weeks troubleshooting.
Any help is greatly appreciated.