-
Notifications
You must be signed in to change notification settings - Fork 354
Use just-in-time CDI spec generation by default in the NVIDIA Container Runtime #910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3e2c4d9
to
23f6d28
Compare
internal/modifier/cdi.go
Outdated
@@ -116,6 +117,24 @@ func getDevicesFromSpec(logger logger.Interface, ociSpec oci.Spec, cfg *config.C | |||
return nil, nil | |||
} | |||
|
|||
func normalizeDeviceList(logger logger.Interface, defaultKind string, devices ...string) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a code comment showing an example of what this function is doing (supposed to do).
require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json") | ||
require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | ||
require.Empty(t, spec.Hooks, "there should be hooks in config.json") | ||
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's invert the world
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a drafty draft and I don't know what I am talking about. But we need to make progress here and you know what you are doing. So, approved!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -125,8 +125,8 @@ func TestGoodInput(t *testing.T) { | |||
// Check config.json for NVIDIA prestart hook | |||
spec, err = cfg.getRuntimeSpec() | |||
require.NoError(t, err, "should be no errors when reading and parsing spec from config.json") | |||
require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json") | |||
require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | |||
require.Empty(t, spec.Hooks, "there should be hooks in config.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Empty(t, spec.Hooks, "there should be hooks in config.json") | |
require.Empty(t, spec.Hooks, "there should be no hooks in config.json") |
require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json") | ||
require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | ||
require.Empty(t, spec.Hooks, "there should be hooks in config.json") | ||
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | |
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json") |
require.NotEmpty(t, spec.Hooks, "there should be hooks in config.json") | ||
require.Equal(t, 1, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | ||
require.Empty(t, spec.Hooks, "there should be no hooks in config.json") | ||
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "exactly one nvidia prestart hook should be inserted correctly into config.json") | |
require.Equal(t, 0, nvidiaHookCount(spec.Hooks), "no nvidia prestart hook should be inserted into config.json") |
internal/modifier/cdi.go
Outdated
logger.Debugf("Ignoring duplicate device %q", name) | ||
continue | ||
if cfg.AcceptDeviceListAsVolumeMounts { | ||
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this change I notice that we were only calling container.CDIDevicesFromMounts()
to construct the list of devices specified as volume mounts, but now we are also calling container.DeviceFromMounts()
-- is there a reason we weren't calling both functions prior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
@elezar is this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Just a couple comments
internal/modifier/cdi.go
Outdated
logger.Debugf("Ignoring duplicate device %q", name) | ||
continue | ||
if cfg.AcceptDeviceListAsVolumeMounts { | ||
devices = normalizeDeviceList(logger, defaultKind, append(container.DevicesFromMounts(), container.CDIDevicesFromMounts()...)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
Pull Request Test Coverage Report for Build 15677002794Details
💛 - Coveralls |
c4e1729
to
153a277
Compare
force-push
internal/info/auto.go
Outdated
RuntimeModeLegacy = RuntimeMode("legacy") | ||
RuntimeModeCSV = RuntimeMode("csv") | ||
RuntimeModeCDI = RuntimeMode("cdi") | ||
RuntimeModeJitCDI = RuntimeMode("jit-cdi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a preference on what to call this mode?
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
This PR changes the default behaviour of the NVIDIA Container Runtime to use an in-memory CDI specification to make changes to the incoming OCI runtime specification instead of invoking the
nvidia-container-runtime-hook
(i.e. "legacy" mode). This was previously available as an opt-in option when a user would request a CDI device using the predefinedruntime.nvidia.com/gpu
CDI device class as part of the CDI device name.With this change, all device requests (annotations, volume mounts, environment variables) are mapped to
runtime.nvidia.com/gpu={{ .DeviceID }}
requests if they are not already CDI device requests. This means that CDI specifications for these defices are generated as the container is being created and these specifications are used to modify the incomming OCI runtime specification.When the
nvidia
runtime is invoked in this mode, any existingnvidia-container-runtime-hook
in the OCI runtime specification is removed so as to not use the legacy injection mechanism.Note that when only the
nvidia-container-runtime-hook
is used (as is the case with the Docker--gpus
flag or the defaultcrio
configuration), the default runtime mode is set tolegacy
to ensure that this mode of operation is not changed.This means that:
works as expected
(without the change of the default this would return an error similar to):