NVIDIA Container Runtime Not Functioning Correctly in RKE2 (Missing Devices/Libraries) #1089

Open
natestaples opened this issue May 16, 2025 · 6 comments
@natestaples

natestaples commented May 16, 2025

Desired goal

Spin up an RKE2 cluster with the ability to deploy GPU-enabled pods via runtimeClassName: nvidia.

Issues I've observed

After I create a pod with the following spec:

cat <<EOF | kubectl create -n default -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
  labels:
    name: nvidia-test
spec:
  runtimeClassName: nvidia 
  containers:
  - name: nvidia-test
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    imagePullPolicy: Always
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      seccompProfile:
        type: RuntimeDefault
    resources:
      limits:
        memory: "1Gi"
        cpu: "1"
        nvidia.com/gpu: 1
    command: ["tail"]
    args: ["-f", "/dev/null"]
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Exists"
    effect: "NoSchedule"
EOF
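
For reference, the checks below were gathered with roughly the following commands (exact invocations assumed):

kubectl get pod nvidia-test -n default
kubectl describe pod nvidia-test -n default
kubectl exec -it nvidia-test -n default -- bash   # then nvidia-smi, ls /dev, ls /usr/bin inside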

The pod successfully deploys

NAME          READY   STATUS    RESTARTS   AGE
nvidia-test   1/1     Running   0          11s

Describing the pod

kubectl describe pod nvidia-test 
Name:                nvidia-test
Namespace:           default
Priority:            0
Runtime Class Name:  nvidia
Service Account:     default
Node:                howe/10.202.0.14
Start Time:          Fri, 16 May 2025 11:38:29 -0400
Labels:              name=nvidia-test
Annotations:         cni.projectcalico.org/containerID: 7f489e51d15fb9fec8d0497caf9ec4cc276fc1cb5c9599bb6e1e1dcda0fd71ec
                     cni.projectcalico.org/podIP: 10.42.0.65/32
                     cni.projectcalico.org/podIPs: 10.42.0.65/32
Status:              Running
IP:                  10.42.0.65
IPs:
  IP:  10.42.0.65
Containers:
  nvidia-test:
    Container ID:    containerd://2abb2775c07937a96279d255db9ce3d9104d2c8844ff35c350f7f51f5b53c1dc
    Image:           nvidia/cuda:12.4.1-base-ubuntu22.04
    Image ID:        docker.io/nvidia/cuda@sha256:0f6bfcbf267e65123bcc2287e2153dedfc0f24772fb5ce84afe16ac4b2fada95
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      tail
    Args:
      -f
      /dev/null
    State:          Running
      Started:      Fri, 16 May 2025 11:38:30 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:            <none>
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:                      <none>
QoS Class:                    Guaranteed
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                              node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  53s   default-scheduler  Successfully assigned default/nvidia-test to howe
  Normal  Pulling    47s   kubelet            Pulling image "nvidia/cuda:12.4.1-base-ubuntu22.04"
  Normal  Pulled     47s   kubelet            Successfully pulled image "nvidia/cuda:12.4.1-base-ubuntu22.04" in 299ms (299ms including waiting). Image size: 91769375 bytes.
  Normal  Created    47s   kubelet            Created container nvidia-test
  Normal  Started    46s   kubelet            Started container nvidia-test

Grabbing a shell inside the container and trying nvidia-smi

groups: cannot find name for group ID 1000
I have no name!@nvidia-test:/$ nvidia-smi
bash: nvidia-smi: command not found

Missing NVIDIA device nodes under /dev inside the container:

I have no name!@nvidia-test:/dev$ ls
core  fd  full	mqueue	null  ptmx  pts  random  shm  stderr  stdin  stdout  termination-log  tty  urandom  zero

Missing nvidia-smi and other NVIDIA executables in /usr/bin:

I have no name!@nvidia-test:/usr/bin$ ls *nvidia*
ls: cannot access '*nvidia*': No such file or directory

Below I'll post all relevant setup steps and configs.

Node configuration:

  • Driver installed with sudo apt install nvidia-headless-570-server nvidia-utils-570-server

nvidia-smi from host

nvidia-smi 
Fri May 16 16:26:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN V                 Off |   00000000:04:00.0 Off |                  N/A |
| 29%   43C    P8             26W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN V                 Off |   00000000:05:00.0 Off |                  N/A |
| 32%   46C    P8             32W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA TITAN V                 Off |   00000000:08:00.0 Off |                  N/A |
| 28%   40C    P8             26W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA TITAN V                 Off |   00000000:09:00.0 Off |                  N/A |
| 28%   42C    P8             26W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA TITAN V                 Off |   00000000:83:00.0 Off |                  N/A |
| 28%   41C    P8             26W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA TITAN V                 Off |   00000000:84:00.0 Off |                  N/A |
| 28%   42C    P8             26W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA TITAN V                 Off |   00000000:87:00.0 Off |                  N/A |
| 29%   43C    P8             29W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA TITAN V                 Off |   00000000:88:00.0 Off |                  N/A |
| 28%   42C    P8             27W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • which nvidia-container-runtime returns /usr/bin/nvidia-container-runtime
  • nvidia-container-toolkit installed following NVIDIA's setup instructions
  • Config of /etc/nvidia-container-runtime/config.toml:
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

Worth mentioning that installing NVIDIA's Container Toolkit sets up almost all of the needed configuration in both /etc/nvidia-container-runtime/config.toml and /var/lib/rancher/rke2/agent/etc/containerd/config.toml.

  • Here is my /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
  privileged_without_host_devices = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true
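
A quick sanity check (just a sketch) that the rendered containerd config contains the nvidia runtime block and that the binary it points at exists:

grep -A 4 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml
ls -l /usr/bin/nvidia-container-runtime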

Cluster deployments, resources, etc.
kubectl describe node

Name:               howe
Roles:              control-plane,etcd,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=howe
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/etcd=true
                    node-role.kubernetes.io/master=true
                    nvidia.com/gpu.present=true
                    p2p.rke2.cattle.io/enabled=true
Annotations:        etcd.rke2.cattle.io/local-snapshots-timestamp: 2025-05-16T12:00:05Z
                    etcd.rke2.cattle.io/node-address: 10.202.0.14
                    etcd.rke2.cattle.io/node-name: howe-d4b629b1
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"8a:29:44:31:05:9b"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.202.0.14
                    node.alpha.kubernetes.io/ttl: 0
                    p2p.rke2.cattle.io/node-address: /ip4/10.202.0.14/tcp/5001/p2p/QmVfVy9d8upJ5Q1EvMN7Cjr1z2sFkTScW7eYURy7eujGFq
                    rke2.io/encryption-config-hash: start-81bbda467aad65a2ec3b2f300b8afb699d0f00cc7afac25188e2526c45356e81
                    rke2.io/node-args:
                      ["server","--kubelet-arg","fail-swap-on=true","--kube-proxy-arg","proxy-mode=ipvs","--kube-proxy-arg","ipvs-strict-arp=true","--node-ip","...
                    rke2.io/node-config-hash: N5SV7K4RVHI3CNOZT7O3LNNQJSQUTWOO3EIT3Z47TEYIASPR6YHQ====
                    rke2.io/node-env: {}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 08 May 2025 13:33:17 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  howe
  AcquireTime:     <unset>
  RenewTime:       Fri, 16 May 2025 12:14:17 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 09 May 2025 12:15:13 -0400   Fri, 09 May 2025 12:15:13 -0400   FlannelIsUp                  Flannel is running on this node
  EtcdIsVoter          True    Fri, 16 May 2025 12:11:44 -0400   Thu, 08 May 2025 13:33:33 -0400   MemberNotLearner             Node is a voting member of the etcd cluster
  MemoryPressure       False   Fri, 16 May 2025 12:12:21 -0400   Thu, 15 May 2025 14:36:29 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 16 May 2025 12:12:21 -0400   Thu, 15 May 2025 14:36:29 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 16 May 2025 12:12:21 -0400   Thu, 15 May 2025 14:36:29 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 16 May 2025 12:12:21 -0400   Thu, 15 May 2025 14:36:29 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.202.0.14
  Hostname:    howe
Capacity:
  cpu:                40
  ephemeral-storage:  1918572936Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             792493260Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                40
  ephemeral-storage:  1866387750678
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             792493260Ki
  nvidia.com/gpu:     8
  pods:               110
System Info:
  Machine ID:                 2cb6d467ab3c414b88dad8a82c967ac5
  System UUID:                b8955927-f015-2b0c-0707-704d7bc75eb9
  Boot ID:                    1810d5c5-e7a0-4a86-b36c-86113fd62181
  Kernel Version:             6.8.0-59-generic
  OS Image:                   Ubuntu 24.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.20-k3s1
  Kubelet Version:            v1.31.0+rke2r1
  Kube-Proxy Version:         
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
Non-terminated Pods:          (20 in total)
  Namespace                   Name                                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                     ------------  ----------  ---------------  -------------  ---
  default                     nvidia-test                                              1 (2%)        1 (2%)      1Gi (0%)         1Gi (0%)       26m
  flux-system                 helm-controller-55479b8b94-pm9rj                         100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  flux-system                 image-automation-controller-868c5b98c4-qth4d             100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  flux-system                 image-reflector-controller-5d6f6f9587-gg6g4              100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  flux-system                 kustomize-controller-74bdb64f84-f977k                    100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  flux-system                 notification-controller-7d8d56877d-6cxnn                 100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  flux-system                 source-controller-fc7bc69c-gkn7g                         100m (0%)     0 (0%)      64Mi (0%)        0 (0%)         7d22h
  kube-system                 etcd-howe                                                200m (0%)     0 (0%)      512Mi (0%)       0 (0%)         25h
  kube-system                 kube-apiserver-howe                                      250m (0%)     0 (0%)      1Gi (0%)         0 (0%)         6d22h
  kube-system                 kube-controller-manager-howe                             200m (0%)     0 (0%)      256Mi (0%)       0 (0%)         7d22h
  kube-system                 kube-proxy-howe                                          250m (0%)     0 (0%)      128Mi (0%)       0 (0%)         7d22h
  kube-system                 kube-scheduler-howe                                      100m (0%)     0 (0%)      128Mi (0%)       0 (0%)         7d22h
  kube-system                 local-path-provisioner-dbff48958-pnm2n                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d22h
  kube-system                 rke2-canal-ll2k4                                         250m (0%)     0 (0%)      0 (0%)           0 (0%)         7d22h
  kube-system                 rke2-coredns-rke2-coredns-787bc4b7b7-jp9gq               100m (0%)     100m (0%)   128Mi (0%)       128Mi (0%)     7d22h
  kube-system                 rke2-coredns-rke2-coredns-autoscaler-6dc69d7b97-mnnkp    25m (0%)      100m (0%)   16Mi (0%)        64Mi (0%)      7d22h
  kube-system                 rke2-metrics-server-6d99b6d454-t8chx                     100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         7d22h
  kube-system                 rke2-snapshot-controller-658d97fccc-p94wb                0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d22h
  kube-system                 rke2-snapshot-validation-webhook-784bcc6c8-lsznf         0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d22h
  nvidia-device-plugin        nvdp-nvidia-device-plugin-bn8vc                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d20h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3075m (7%)   1200m (3%)
  memory             3800Mi (0%)  1216Mi (0%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     1            1
Events:              <none>

kubectl get runtimeclasses

NAME     HANDLER   AGE
nvidia   nvidia    6d23h
  • Mapping of that runtimeClassName:
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"node.k8s.io/v1","handler":"nvidia","kind":"RuntimeClass","metadata":{"annotations":{},"name":"nvidia"}}
  creationTimestamp: "2025-05-09T17:08:56Z"
  name: nvidia
  resourceVersion: "312392"
  uid: b5906bbc-2461-48a0-8ef0-78deb1f07018
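
For completeness, the manifest behind this object is presumably just the minimal RuntimeClass, reconstructed from the last-applied-configuration annotation above:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia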

nvidia-device-plugin deployed via Helm

Had to use 0.15.0 because it seemed 0.17.1 (latest) wasn't working:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.0
  • Here is my Helm values file (a sketch of how it would be applied follows the listing):
# Plugin configuration
# Only one of "name" or "map" should ever be set for a given deployment.
# Use "name" to point to an external ConfigMap with a list of configurations.
# Use "map" to build an integrated ConfigMap from a set of configurations as
# part of this helm chart. An example of setting "map" might be:
# config:
#   map:
#     default: |-
#       version: v1
#       flags:
#         migStrategy: none
#     mig-single: |-
#       version: v1
#       flags:
#         migStrategy: single
#     mig-mixed: |-
#       version: v1
#       flags:
#         migStrategy: mixed
config:
  # ConfigMap name if pulling from an external ConfigMap
  name: ""
  # Set of named configs to build an integrated ConfigMap from
  map: {}
  # Default config name within the ConfigMap
  default: ""
  # List of fallback strategies to attempt if no config is selected and no default is provided
  fallbackStrategies: ["named" , "single"]

compatWithCPUManager: null
migStrategy: null
failOnInitError: null
deviceListStrategy: null
deviceIDStrategy: null
nvidiaDriverRoot: null
gdsEnabled: null
mofedEnabled: null
deviceDiscoveryStrategy: null

nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""
selectorLabelsOverride: {}

allowDefaultNamespace: false

imagePullSecrets: []
image:
  repository: nvcr.io/nvidia/k8s-device-plugin
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

updateStrategy:
  type: RollingUpdate

podAnnotations: {}
podSecurityContext: {}
securityContext: {}

resources: {}
nodeSelector: {}
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
        - key: feature.node.kubernetes.io/pci-10de.present
          operator: In
          values:
          - "true"
      - matchExpressions:
        # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
        - key: feature.node.kubernetes.io/cpu-model.vendor_id
          operator: In
          values:
          - "NVIDIA"
      - matchExpressions:
        # We allow a GPU deployment to be forced by setting the following label to "true"
        - key: "nvidia.com/gpu.present"
          operator: In
          values:
          - "true"
tolerations:
  # This toleration is deprecated. Kept here for backward compatibility
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

# Mark this pod as a critical add-on; when enabled, the critical add-on
# scheduler reserves resources for critical add-on pods so that they can
# be rescheduled after a failure.
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"

runtimeClassName: nvidia

devicePlugin:
  enabled: true

gfd:
  enabled: false
  nameOverride: gpu-feature-discovery
  namespaceOverride: ""
  noTimestamp: null
  sleepInterval: null
  securityContext:
    # privileged access is required for the gpu-feature-discovery to access the
    # vgpu info on a host.
    # TODO: This should be optional and detected automatically.
    privileged: true

# Helm dependency
nfd:
  nameOverride: node-feature-discovery
  enableNodeFeatureApi: false
  master:
    serviceAccount:
      name: node-feature-discovery
      create: true
    config:
      extraLabelNs: ["nvidia.com"]

  worker:
    tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    config:
      sources:
        pci:
          deviceClassWhitelist:
          - "02"
          - "03"
          deviceLabelFields:
          - vendor

mps:
  # root specifies the location where files and folders for managing MPS will
  # be created. This includes a daemon-specific /dev/shm and pipe and log
  # directories.
  # Pipe directories will be created at {{ mps.root }}/{{ .ResourceName }}
  root: "/run/nvidia/mps"


cdi:
  # nvidiaHookPath specifies the path to the nvidia-cdi-hook or nvidia-ctk executables on the host.
  # This is required to ensure that the generated CDI specification refers to the correct CDI hooks.
  nvidiaHookPath: null
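
If these values are applied explicitly, the install would presumably look something like the following (a sketch; values.yaml refers to the listing above):

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.0 \
  -f values.yaml
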
  • Describing the nvidia-device-plugin pod:
kubectl describe pod  -n nvidia-device-plugin 
Name:                 nvdp-nvidia-device-plugin-bn8vc
Namespace:            nvidia-device-plugin
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      default
Node:                 howe/10.202.0.14
Start Time:           Mon, 12 May 2025 16:02:51 -0400
Labels:               app.kubernetes.io/instance=nvdp
                      app.kubernetes.io/name=nvidia-device-plugin
                      controller-revision-hash=59f8d4c744
                      pod-template-generation=5
Annotations:          cni.projectcalico.org/containerID: 297b88a8839dbea403f1d9583348879da52bab411ad5a5c86b7c70cde2661cfe
                      cni.projectcalico.org/podIP: 10.42.0.38/32
                      cni.projectcalico.org/podIPs: 10.42.0.38/32
Status:               Running
IP:                   10.42.0.38
IPs:
  IP:           10.42.0.38
Controlled By:  DaemonSet/nvdp-nvidia-device-plugin
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://b1703d20e80b8a069a04cb54326c4d0cf90770f5981e81b17a810de9fccfc0d4
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    Image ID:      sha256:fa3ba2723b8864aa319016e2ae3469e5c63a1c991d36cc8938657a59ec0ada22
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-device-plugin
    State:          Running
      Started:      Thu, 15 May 2025 10:32:50 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 12 May 2025 16:03:10 -0400
      Finished:     Thu, 15 May 2025 10:29:10 -0400
    Ready:          True
    Restart Count:  1
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-n7nsz (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  kube-api-access-n7nsz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>

I'm really at a loss as to what's going on, and I've spent the last few weeks troubleshooting.
Any help is greatly appreciated.

@elezar
Member

elezar commented May 16, 2025

Just so that it's listed, which version of the NVIDIA Container Toolkit is installed? i.e. What is the output of nvidia-ctk --version and nvidia-container-cli --version?

@natestaples
Author

Will do first thing Monday morning.

@natestaples
Author

nvidia-ctk --version:
nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.17.6
commit: e627eb2

nvidia-container-cli --version:

NVIDIA Container Toolkit CLI version 1.17.6
commit: e627eb2

cli-version: 1.17.6
lib-version: 1.17.6
build date: 2025-04-22T08:29+00:00
build revision: a198166e1c1166f4847598438115ea97dacc7a92
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Host machine is running Ubuntu 24.04.2 LTS

@elezar
Member

elezar commented May 22, 2025

I think I know what the issue is. The toolkit is configured to REQUIRE volume mounts to request devices for non-privileged containers:

accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

but the device plugin is configured to use envvars (the default). Installing / upgrading the device plugin and specifying deviceListStrategy=volume-mounts as follows:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.0 \
  --set deviceListStrategy=volume-mounts

should address this.
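
If you prefer setting this in the values file (where deviceListStrategy is currently null), the equivalent would presumably be:

deviceListStrategy: volume-mounts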

Since you mention that v0.17.1 of the device plugin isn't working, do you have the logs for that case?

@elezar elezar self-assigned this May 22, 2025
@natestaples
Author

I will implement these changes shortly.
Before I do I'll deploy v0.17.1 and gather those logs for you as well.

@natestaples
Author

I used v0.17.1 and it worked!
Setting deviceListStrategy=volume-mounts was the key with the Helm deploy (a sketch of the command is below).
I was then able to create an unprivileged pod that successfully had a GPU available inside it.
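
For reference, the working install was presumably along these lines (exact command assumed, combining the version bump with the strategy flag from the previous comment):

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.17.1 \
  --set deviceListStrategy=volume-mounts

Inside the unprivileged test pod, nvidia-smi now sees a GPU: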

@nvidia-test:/$ nvidia-smi 
Thu May 22 18:16:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN V                 Off |   00000000:88:00.0 Off |                  N/A |
| 29%   43C    P8             27W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Sincerely, thank you so much for your help. Please let me know if I should provide any other details for the benefit of others who find this issue.
