Skip to content

BUGFIX: modifier: respect GPU volume-mount device requests #1130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jun 5, 2025

The gated modifiers used to add support for GDS, Mofed, and CUDA Forward Comatibility only check the NVIDIA_VISIBLE_DEVICES envvar to determine whether GPUs are requested and modifications should be made. This means that use cases where volume mounts are used to request devices (e.g. when using the GPU Device Plugin) are not supported.

This patch takes visibleDevicesFromEnvVar private, making VisibleDevices the only exported method to query valid devices.
And edits the gated modifiers to use this func, ensuring device requests
via mounts are also taken into acount.

Fixes: #1049

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot June 5, 2025 11:28
@ArangoGutierrez ArangoGutierrez self-assigned this Jun 5, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR privatizes the legacy VisibleDevicesFromEnvVar method and ensures all modifiers use the public VisibleDevices API to query requested devices.

  • Swapped calls from VisibleDevicesFromEnvVar() to VisibleDevices() in each modifier
  • Renamed and privatized VisibleDevicesFromEnvVar to visibleDevicesFromEnvVar, and introduced a new visibleEnvVars helper
  • Updated the unit test to exercise the private visibleDevicesFromEnvVar method

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/modifier/graphics.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/gated.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/csv.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/cdi.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/config/image/cuda_image_test.go Switched test to call visibleDevicesFromEnvVar
internal/config/image/cuda_image.go Privatized VisibleDevicesFromEnvVar, added visibleEnvVars, updated VisibleDevices
Comments suppressed due to low confidence (2)

internal/config/image/cuda_image.go:233

  • The new helper visibleEnvVars isn’t directly covered by existing tests. Consider adding unit tests to verify its behavior across different environment variable combinations.
func (i CUDA) visibleEnvVars() []string {

internal/config/image/cuda_image_test.go:432

  • This test now calls an unexported method; ensure the test file is in the same package (not *_test) so it can access visibleDevicesFromEnvVar, or consider testing via the public VisibleDevices method instead.
devices := image.visibleDevicesFromEnvVar()

@coveralls
Copy link

coveralls commented Jun 5, 2025

Pull Request Test Coverage Report for Build 15468835763

Details

  • 27 of 31 (87.1%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 33.88%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/config/image/cuda_image.go 18 19 94.74%
internal/modifier/gated.go 0 1 0.0%
internal/modifier/cdi.go 7 9 77.78%
Totals Coverage Status
Change from base Build 15466637032: 0.3%
Covered Lines: 4316
Relevant Lines: 12739

💛 - Coveralls

@@ -95,7 +95,7 @@ func getDevicesFromSpec(logger logger.Interface, ociSpec oci.Spec, cfg *config.C

var devices []string
seen := make(map[string]bool)
for _, name := range container.VisibleDevicesFromEnvVar() {
for _, name := range container.VisibleDevices() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have mentioned this in #1110, but we already process the mount devices from above, and although this MAY have the desired effect we may want to revisit how we handle CDI device requests and how we distinguish between these are "regular" requests.

My suggestion would be to:

  1. Keep VisibleDevicesFromEnvVar public for now.
  2. Update the other usages as per this PR.
  3. Handle CDI devices in a follow up.

@elezar
Copy link
Member

elezar commented Jun 5, 2025

Please update the commit messages / PR heading and description to better indicate intent. We are making this change to FIX A BUG. That is not clear from the commit messages and as such this will be missed in our release notes.

@elezar
Copy link
Member

elezar commented Jun 5, 2025

Please also add unit tests to capture the new behaviour in the affected code. (ideally first failing with the existing implementation).

This PR is not a blocker for the v1.18.0-rc.1 release.

@ArangoGutierrez ArangoGutierrez changed the title Only export VisibleDevices func as valid method to query valid devices BUGFIX: modifier: respect GPU volume-mount device requests Jun 5, 2025
The gated modifiers used to add support for GDS, Mofed, and CUDA Forward Comatibility only check the NVIDIA_VISIBLE_DEVICES envvar to determine whether GPUs are requested and modifications should be made. This means that use cases where volume mounts are used to request devices (e.g. when using the GPU Device Plugin) are not supported.

This patch takes visibleDevicesFromEnvVar private, making VisibleDevices the only exported method to query valid devices.
And edits the gated modifiers to use this func, ensuring device requests
via mounts are also taken into acount.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue/PR to expose/discuss/fix a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gated modifiers ignore device requests by volume mounts
3 participants