Use just-in-time CDI spec generation by default in the NVIDIA Container Runtime #910

Open

elezar wants to merge 5 commits into main from default-to-cdi

Conversation

@elezar (Member) commented Feb 7, 2025

This PR changes the default behaviour of the NVIDIA Container Runtime: instead of invoking the nvidia-container-runtime-hook (i.e. "legacy" mode), it now uses an in-memory CDI specification to apply modifications to the incoming OCI runtime specification. This behaviour was previously opt-in, applying only when a user requested a CDI device using the predefined runtime.nvidia.com/gpu device class as part of the CDI device name.

With this change, all device requests (annotations, volume mounts, environment variables) are mapped to runtime.nvidia.com/gpu={{ .DeviceID }} requests if they are not already CDI device requests. This means that CDI specifications for these devices are generated as the container is being created, and these specifications are used to modify the incoming OCI runtime specification.
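
To illustrate the mapping, here is a minimal sketch in Go (the function and the qualified-name check below are simplified illustrations for this description, not the actual normalizeDeviceList implementation):

package main

import (
	"fmt"
	"strings"
)

// isQualifiedCDIName is a simplified heuristic for this sketch: a fully
// qualified CDI device name has the form "vendor.example.com/class=device",
// i.e. it contains a "/" before the first "=".
func isQualifiedCDIName(device string) bool {
	class, _, ok := strings.Cut(device, "=")
	return ok && strings.Contains(class, "/")
}

// normalize maps plain device requests (indices, UUIDs, "all") onto the
// default CDI device class, passing existing CDI requests through unchanged.
func normalize(defaultKind string, devices ...string) []string {
	normalized := make([]string, 0, len(devices))
	for _, device := range devices {
		if isQualifiedCDIName(device) {
			normalized = append(normalized, device)
			continue
		}
		normalized = append(normalized, fmt.Sprintf("%s=%s", defaultKind, device))
	}
	return normalized
}

func main() {
	fmt.Println(normalize("runtime.nvidia.com/gpu", "0", "all", "nvidia.com/gpu=1"))
	// Output: [runtime.nvidia.com/gpu=0 runtime.nvidia.com/gpu=all nvidia.com/gpu=1]
}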

When the nvidia runtime is invoked in this mode, any existing nvidia-container-runtime-hook in the OCI runtime specification is removed so that the legacy injection mechanism is not used.

Note that when only the nvidia-container-runtime-hook is used (as is the case with the Docker --gpus flag or the default CRI-O configuration), the default runtime mode is set to legacy so that this mode of operation remains unchanged.

This means that:

$ docker run --rm -ti --runtime=runc --gpus=all ubuntu nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB-N (UUID: GPU-edfee158-11c1-52b8-0517-92f30e7fac88)
GPU 1: Tesla V100-SXM2-16GB-N (UUID: GPU-f22fb098-d1b3-3806-2655-ba25f02229c1)
GPU 2: Tesla V100-SXM2-16GB-N (UUID: GPU-f613f823-1032-b3ec-a876-50f2e35e6f9e)
GPU 3: Tesla V100-SXM2-16GB-N (UUID: GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e)
GPU 4: Tesla V100-SXM2-16GB-N (UUID: GPU-e28a6529-288c-7ddf-8fea-68c4833cda70)
GPU 5: Tesla V100-SXM2-16GB-N (UUID: GPU-a27fb382-bad2-c02a-95ba-f6a1da38e76c)
GPU 6: Tesla V100-SXM2-16GB-N (UUID: GPU-f5bb8d07-ee19-1787-4d9a-a84c4ac6b086)
GPU 7: Tesla V100-SXM2-16GB-N (UUID: GPU-1ba0ca0e-6d1d-d9db-07d8-c1c5a8c32814)

works as expected

(Without the change to the default, a CDI device request made via the --gpus flag would return an error similar to the following:)

$ docker run --rm -ti --runtime=runc --gpus="device=runtime.nvidia.com/gpu=0" ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'cdi'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead: unknown.

@elezar elezar marked this pull request as draft February 10, 2025 12:32
@elezar elezar force-pushed the default-to-cdi branch 4 times, most recently from 3e2c4d9 to 23f6d28 Compare February 11, 2025 21:47
@elezar elezar added this to the Disable legacy code path by default milestone Feb 17, 2025
@@ -116,6 +117,24 @@ func getDevicesFromSpec(logger logger.Interface, ociSpec oci.Spec, cfg *config.C
return nil, nil
}

func normalizeDeviceList(logger logger.Interface, defaultKind string, devices ...string) []string {

Maybe add a code comment showing an example of what this function is doing (supposed to do).
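
For illustration, such a comment could read (hypothetical wording, not part of the PR):

// normalizeDeviceList converts the supplied device requests into fully
// qualified CDI device names using defaultKind. For example, with
// defaultKind "runtime.nvidia.com/gpu" the request "0" becomes
// "runtime.nvidia.com/gpu=0"; requests that are already fully qualified
// CDI device names are returned unchanged.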

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")

let's invert the world

jgehrcke previously approved these changes Mar 21, 2025

@jgehrcke left a comment:

I know this is a drafty draft and I don't know what I am talking about. But we need to make progress here and you know what you are doing. So, approved!

cdesiniotis (Contributor) previously approved these changes Mar 21, 2025

@cdesiniotis left a comment:

LGTM

@@ -125,8 +125,8 @@ func TestGoodInput(t *testing.T) {
 	// Check config.json for NVIDIA prestart hook
 	spec, err = cfg.getRuntimeSpec()
 	require.NoError(t, err, "should be no errors when reading and parsing spec from config.json")
-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
Contributor:

Suggested change
-	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Empty(t, spec.Hooks, "there should be no hooks in config.json")

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
Contributor:

Suggested change
-	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json")

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be no hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
Contributor:

Suggested change
-	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json")

logger.Debugf("Ignoring duplicate device %q", name)
continue
if cfg.AcceptDeviceListAsVolumeMounts {
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...)
Contributor:

Before this change I notice that we were only calling container.CDIDevicesFromMounts() to construct the list of devices specified as volume mounts, but now we are also calling container.DevicesFromMounts() -- is there a reason we weren't calling both functions previously?
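
(For background, an illustrative sketch of the two mount-based request forms these functions appear to handle; the paths below follow the documented /var/run/nvidia-container-devices convention, but the exact layout is an assumption here:)

// A plain device request encoded as a container mount, as presumably parsed
// by container.DevicesFromMounts():
//     /var/run/nvidia-container-devices/0  ->  "0"
//
// A CDI device request encoded as a container mount, as presumably parsed
// by container.CDIDevicesFromMounts():
//     /var/run/nvidia-container-devices/cdi/runtime.nvidia.com/gpu/0  ->  "runtime.nvidia.com/gpu=0"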

Collaborator:

++

@ArangoGutierrez (Collaborator):

@elezar is this PR ready for review, or why is it still set as a draft?

ArangoGutierrez (Collaborator) previously approved these changes May 9, 2025

@ArangoGutierrez left a comment:

LGTM
Just a couple of comments.

logger.Debugf("Ignoring duplicate device %q", name)
continue
if cfg.AcceptDeviceListAsVolumeMounts {
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...)
Collaborator:

++

@elezar elezar modified the milestones: Disable legacy code path by default, v1.18.0 May 9, 2025
@coveralls commented Jun 12, 2025

Pull Request Test Coverage Report for Build 15677002794

Details

  • 59 of 127 (46.46%) changed or added relevant lines in 7 files are covered.
  • 6 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.04%) to 33.666%

Changes missing coverage:

File                                                   Covered Lines  Changed/Added Lines  %
cmd/nvidia-container-runtime-hook/main.go              0              3                    0.0%
cmd/nvidia-container-runtime-hook/container_config.go  0              11                   0.0%
internal/info/auto.go                                  32             45                   71.11%
cmd/nvidia-container-runtime-hook/hook_config.go       3              18                   16.67%
internal/modifier/cdi.go                               0              26                   0.0%

Files with coverage reduction:

File                                              New Missed Lines  %
cmd/nvidia-container-runtime-hook/main.go         1                 0.0%
internal/info/auto.go                             1                 75.44%
internal/modifier/cdi.go                          1                 8.74%
cmd/nvidia-container-runtime-hook/hook_config.go  3                 55.32%

Totals:

Change from base Build 15640396854: +0.04%
Covered Lines: 4353
Relevant Lines: 12930

💛 - Coveralls

@elezar elezar force-pushed the default-to-cdi branch 2 times, most recently from c4e1729 to 153a277 Compare June 13, 2025 15:28
@elezar elezar changed the title from "Default to cdi" to "Use just-in-time CDI spec generation by default in the NVIDIA Container Runtime" Jun 13, 2025
@elezar elezar marked this pull request as ready for review June 13, 2025 15:37
@elezar elezar dismissed stale reviews from ArangoGutierrez, cdesiniotis, and jgehrcke via force-push June 16, 2025 08:45

 	RuntimeModeLegacy = RuntimeMode("legacy")
 	RuntimeModeCSV = RuntimeMode("csv")
 	RuntimeModeCDI = RuntimeMode("cdi")
+	RuntimeModeJitCDI = RuntimeMode("jit-cdi")
Member Author (@elezar):

Do we have a preference on what to call this mode?
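
(For context, a minimal sketch of how the default mode resolution could behave under this change; the function name and structure below are illustrative assumptions, not the actual implementation:)

// resolveMode sketches the behaviour described in this PR: prefer jit-cdi by
// default, but keep legacy mode when only the runtime hook was invoked
// directly (e.g. via the Docker --gpus flag).
func resolveMode(configured RuntimeMode, hookOnlyInvocation bool) RuntimeMode {
	if configured != "" && configured != "auto" {
		return configured
	}
	if hookOnlyInvocation {
		return RuntimeModeLegacy
	}
	return RuntimeModeJitCDI
}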
