Use just-in-time CDI spec generation by default in the NVIDIA Container Runtime #910

Open

elezar wants to merge 5 commits into main from default-to-cdi

Conversation

@elezar (Member) commented Feb 7, 2025

This PR changes the default behaviour of the NVIDIA Container Runtime: instead of invoking the nvidia-container-runtime-hook (i.e. "legacy" mode), it now uses an in-memory CDI specification to apply modifications to the incoming OCI runtime specification. This behaviour was previously opt-in, applying only when a user requested a CDI device using the predefined runtime.nvidia.com/gpu device class as part of the CDI device name.

With this change, all device requests (annotations, volume mounts, environment variables) are mapped to runtime.nvidia.com/gpu={{ .DeviceID }} requests if they are not already CDI device requests. This means that CDI specifications for these devices are generated as the container is being created, and these specifications are used to modify the incoming OCI runtime specification.
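
To illustrate the mapping, here is a minimal sketch in Go (the function and the qualified-name check below are simplified illustrations for this description, not the actual normalizeDeviceList implementation):

package main

import (
	"fmt"
	"strings"
)

// isQualifiedCDIName is a simplified heuristic for this sketch: a fully
// qualified CDI device name has the form "vendor.example.com/class=device",
// i.e. it contains a "/" before the first "=".
func isQualifiedCDIName(device string) bool {
	class, _, ok := strings.Cut(device, "=")
	return ok && strings.Contains(class, "/")
}

// normalize maps plain device requests (indices, UUIDs, "all") onto the
// default CDI device class, passing existing CDI requests through unchanged.
func normalize(defaultKind string, devices ...string) []string {
	normalized := make([]string, 0, len(devices))
	for _, device := range devices {
		if isQualifiedCDIName(device) {
			normalized = append(normalized, device)
			continue
		}
		normalized = append(normalized, fmt.Sprintf("%s=%s", defaultKind, device))
	}
	return normalized
}

func main() {
	fmt.Println(normalize("runtime.nvidia.com/gpu", "0", "all", "nvidia.com/gpu=1"))
	// Output: [runtime.nvidia.com/gpu=0 runtime.nvidia.com/gpu=all nvidia.com/gpu=1]
}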

When the nvidia runtime is invoked in this mode, any existing nvidia-container-runtime-hook in the OCI runtime specification is removed so that the legacy injection mechanism is not used.

Note that when only the nvidia-container-runtime-hook is used (as is the case with the Docker --gpus flag or the default CRI-O configuration), the default runtime mode is set to legacy so that this mode of operation remains unchanged.

This means that:

$ docker run --rm -ti --runtime=runc --gpus=all ubuntu nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB-N (UUID: GPU-edfee158-11c1-52b8-0517-92f30e7fac88)
GPU 1: Tesla V100-SXM2-16GB-N (UUID: GPU-f22fb098-d1b3-3806-2655-ba25f02229c1)
GPU 2: Tesla V100-SXM2-16GB-N (UUID: GPU-f613f823-1032-b3ec-a876-50f2e35e6f9e)
GPU 3: Tesla V100-SXM2-16GB-N (UUID: GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e)
GPU 4: Tesla V100-SXM2-16GB-N (UUID: GPU-e28a6529-288c-7ddf-8fea-68c4833cda70)
GPU 5: Tesla V100-SXM2-16GB-N (UUID: GPU-a27fb382-bad2-c02a-95ba-f6a1da38e76c)
GPU 6: Tesla V100-SXM2-16GB-N (UUID: GPU-f5bb8d07-ee19-1787-4d9a-a84c4ac6b086)
GPU 7: Tesla V100-SXM2-16GB-N (UUID: GPU-1ba0ca0e-6d1d-d9db-07d8-c1c5a8c32814)

works as expected

(Without the change to the default, a CDI device request made via the --gpus flag would return an error similar to the following:)

$ docker run --rm -ti --runtime=runc --gpus="device=runtime.nvidia.com/gpu=0" ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'cdi'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead: unknown.

@elezar elezar marked this pull request as draft February 10, 2025 12:32
@elezar elezar force-pushed the default-to-cdi branch 4 times, most recently from 3e2c4d9 to 23f6d28 Compare February 11, 2025 21:47
@elezar elezar added this to the Disable legacy code path by default milestone Feb 17, 2025
@@ -116,6 +117,24 @@ func getDevicesFromSpec(logger logger.Interface, ociSpec oci.Spec, cfg *config.C
return nil, nil
}

func normalizeDeviceList(logger logger.Interface, defaultKind string, devices ...string) []string {

Maybe add a code comment showing an example of what this function is doing (supposed to do).
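
For illustration, such a comment could read (hypothetical wording, not part of the PR):

// normalizeDeviceList converts the supplied device requests into fully
// qualified CDI device names using defaultKind. For example, with
// defaultKind "runtime.nvidia.com/gpu" the request "0" becomes
// "runtime.nvidia.com/gpu=0"; requests that are already fully qualified
// CDI device names are returned unchanged.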

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")

let's invert the world

jgehrcke previously approved these changes Mar 21, 2025

@jgehrcke left a comment:

I know this is a drafty draft and I don't know what I am talking about. But we need to make progress here and you know what you are doing. So, approved!

cdesiniotis (Contributor) previously approved these changes Mar 21, 2025

@cdesiniotis left a comment:

LGTM

@@ -125,8 +125,8 @@ func TestGoodInput(t *testing.T) {
 	// Check config.json for NVIDIA prestart hook
 	spec, err = cfg.getRuntimeSpec()
 	require.NoError(t, err, "should be no errors when reading and parsing spec from config.json")
-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
Contributor:

Suggested change
-	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Empty(t, spec.Hooks, "there should be no hooks in config.json")

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
Contributor:

Suggested change
-	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json")

-	require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json")
-	require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Empty(t, spec.Hooks, "there should be no hooks in config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
Contributor:

Suggested change
-	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json")
+	require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json")

logger.Debugf("Ignoring duplicate device %q", name)
continue
if cfg.AcceptDeviceListAsVolumeMounts {
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...)
Contributor:

Before this change I notice that we were only calling container.CDIDevicesFromMounts() to construct the list of devices specified as volume mounts, but now we are also calling container.DevicesFromMounts() -- is there a reason we weren't calling both functions previously?
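
(For background, an illustrative sketch of the two mount-based request forms these functions appear to handle; the paths below follow the documented /var/run/nvidia-container-devices convention, but the exact layout is an assumption here:)

// A plain device request encoded as a container mount, as presumably parsed
// by container.DevicesFromMounts():
//     /var/run/nvidia-container-devices/0  ->  "0"
//
// A CDI device request encoded as a container mount, as presumably parsed
// by container.CDIDevicesFromMounts():
//     /var/run/nvidia-container-devices/cdi/runtime.nvidia.com/gpu/0  ->  "runtime.nvidia.com/gpu=0"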

Collaborator:

++

@ArangoGutierrez (Collaborator):

@elezar is this PR ready for review, or why is it still set as a draft?

ArangoGutierrez (Collaborator) previously approved these changes May 9, 2025

@ArangoGutierrez left a comment:

LGTM
Just a couple of comments.

logger.Debugf("Ignoring duplicate device %q", name)
continue
if cfg.AcceptDeviceListAsVolumeMounts {
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...)
Collaborator:

++

@elezar elezar modified the milestones: Disable legacy code path by default, v1.18.0 May 9, 2025
@coveralls commented Jun 12, 2025

Pull Request Test Coverage Report for Build 15677002794

Details

  • 59 of 127 (46.46%) changed or added relevant lines in 7 files are covered.
  • 6 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.04%) to 33.666%

Changes missing coverage:

File                                                   Covered Lines  Changed/Added Lines  %
cmd/nvidia-container-runtime-hook/main.go              0              3                    0.0%
cmd/nvidia-container-runtime-hook/container_config.go  0              11                   0.0%
internal/info/auto.go                                  32             45                   71.11%
cmd/nvidia-container-runtime-hook/hook_config.go       3              18                   16.67%
internal/modifier/cdi.go                               0              26                   0.0%

Files with coverage reduction:

File                                              New Missed Lines  %
cmd/nvidia-container-runtime-hook/main.go         1                 0.0%
internal/info/auto.go                             1                 75.44%
internal/modifier/cdi.go                          1                 8.74%
cmd/nvidia-container-runtime-hook/hook_config.go  3                 55.32%

Totals:

Change from base Build 15640396854: +0.04%
Covered Lines: 4353
Relevant Lines: 12930

💛 - Coveralls

@elezar elezar force-pushed the default-to-cdi branch 2 times, most recently from c4e1729 to 153a277 Compare June 13, 2025 15:28
@elezar elezar changed the title from "Default to cdi" to "Use just-in-time CDI spec generation by default in the NVIDIA Container Runtime" Jun 13, 2025
@elezar elezar marked this pull request as ready for review June 13, 2025 15:37
@elezar elezar dismissed stale reviews from ArangoGutierrez, cdesiniotis, and jgehrcke via force-push June 16, 2025 08:45

 	RuntimeModeLegacy = RuntimeMode("legacy")
 	RuntimeModeCSV = RuntimeMode("csv")
 	RuntimeModeCDI = RuntimeMode("cdi")
+	RuntimeModeJitCDI = RuntimeMode("jit-cdi")
Member Author (@elezar):

Do we have a preference on what to call this mode?
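
(For context, a minimal sketch of how the default mode resolution could behave under this change; the function name and structure below are illustrative assumptions, not the actual implementation:)

// resolveMode sketches the behaviour described in this PR: prefer jit-cdi by
// default, but keep legacy mode when only the runtime hook was invoked
// directly (e.g. via the Docker --gpus flag).
func resolveMode(configured RuntimeMode, hookOnlyInvocation bool) RuntimeMode {
	if configured != "" && configured != "auto" {
		return configured
	}
	if hookOnlyInvocation {
		return RuntimeModeLegacy
	}
	return RuntimeModeJitCDI
}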
