[QUESTION] Does CV-CUDA support for multigpu? #212

Open
zhengjs opened this issue Oct 18, 2024 · 7 comments

zhengjs commented Oct 18, 2024

Hi, I want to use this great work to speed up torch-based distributed training. It works well with a single GPU, but with more than one GPU it crashes with the following error:
terminate called after throwing an instance of 'pybind11::error_already_set' what(): ValueError: Hold resources failed: cudaErrorInvalidResourceHandle: invalid resource handle
I added some print statements to debug the problem. Everything is fine on rank_0, but cvcuda crashes on rank_1.

The main code is shown below:

```python
# Define the cuda device, context and streams.
cuda_device = cuda.Device(self.rank)
cuda_ctx = cuda_device.retain_primary_context()
cuda_ctx.push()
cvcuda_stream = cvcuda.Stream().current
torch_stream = torch.cuda.default_stream(device=cuda_device)

print(f'rank_{self.rank} start train, cvcuda stream: {cvcuda_stream}, torch_stream: {torch_stream}')
self.data_preprocessor = PreprocessorCvcuda(
    self.rank,
    cuda_ctx,
    cvcuda_stream,
)

# Do everything in streams.
with cvcuda_stream, torch.cuda.stream(torch_stream):
    self.train(train_dataloaders, test_dataloaders, iterations=iterations)
    cuda_ctx.pop()
```
````python
class ImageBatchDecoder:
    def __init__(
        self,
        device_id,
        cuda_ctx,
        cuda_stream,
        cvcuda_perf=None,
    ):
        self.device_id = device_id
        self.cuda_ctx = cuda_ctx
        self.cuda_stream = cuda_stream
        self.cvcuda_perf = cvcuda_perf
        self.decoder = nvimgcodec.Decoder(device_id=device_id)

    def __call__(self, batch: list, aug_params: dict):
        # args: 
        #   batch: batch of undecoded images bytes
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.push_range("decoder.nvimagecodec")

        data_batch = [img for frame in batch for img in frame]

        tensor_list = []
        print(f'rank_{self.device_id} start decode, stream: {self.cuda_stream}...', flush=True)
        image_list = self.decoder.decode(data_batch, cuda_stream=self.cuda_stream)
        print(f'rank_{self.device_id} end decode...', flush=True)

        resize = aug_params['resize'].view(-1, 2).cpu().numpy()
        crop = aug_params['crop'].view(-1, 4).cpu().numpy()
        rotate = aug_params['rotate'].view(-1).cpu().numpy()
        rotate_rad = np.deg2rad(rotate)
        sin_r = np.sin(rotate_rad)
        cor_r = np.cos(rotate_rad)
        # Convert the decoded images to nvcv tensors in a list.
        for i in range(len(image_list)):
            print(f'rank_{self.device_id} start resize_crop_convert_reformat...', flush=True)
            aug_img = cvcuda.resize_crop_convert_reformat(
                cvcuda.as_tensor(image_list[i], "HWC"),
                (resize[i, 0], resize[i, 1]),
                cvcuda.Interp.LINEAR,
                cvcuda.RectI(
                    crop[i, 0], 
                    crop[i, 1], 
                    round(crop[i, 2] - crop[i, 0]), 
                    round(crop[i, 3] - crop[i, 1])),
                layout="HWC",
                data_type=nvcv.Type.U8,
                # manip=cvcuda.ChannelManip.REVERSE,
                # scale=1. / 255,
                stream=self.cuda_stream,
            )
            print(f'rank_{self.device_id} start rotate...', flush=True)
            aug_img = cvcuda.rotate(
                aug_img,
                rotate[i],
                [0.5 * (aug_img.shape[1] - aug_img.shape[1] * cor_r[i] - aug_img.shape[0] * sin_r[i]),
                 0.5 * (aug_img.shape[0] + aug_img.shape[1] * sin_r[i] - aug_img.shape[0] * cor_r[i])], 
                cvcuda.Interp.LINEAR,
                stream=self.cuda_stream,
            )
            tensor_list.append(aug_img)

        # Stack the list of tensors to a single NHWC tensor and convert to NCHW.
        print(f'rank_{self.device_id} start reformat...', flush=True)
        cvcuda_decoded_tensor = cvcuda.reformat(cvcuda.stack(tensor_list), "NCHW", stream=self.cuda_stream)

        if self.cvcuda_perf is not None:
            self.cvcuda_perf.pop_range()
        print(f'rank_{self.device_id} end of ImageBatchDecoder...', flush=True)
        return cvcuda_decoded_tensor
````
zhengjs added the question label (Further information is requested) on Oct 18, 2024

JanuszL commented Oct 18, 2024

Hi @zhengjs,

While CV-CUDA and nvImageCodec work great for inference, they may not be well suited to the multiprocess data loading approach that PyTorch applies for training.
Have you heard of DALI, and have you considered using it for this purpose? It provides seamless integration with PyTorch.


zhengjs commented Oct 18, 2024

> Hi @zhengjs,
>
> While CV-CUDA and nvImageCodec work great for inference, they may not be well suited to the multiprocess data loading approach that PyTorch applies for training. Have you heard of DALI, and have you considered using it for this purpose? It provides seamless integration with PyTorch.

@JanuszL Thank you for your reply! Yes, I have considered using DALI, but I think it is a little complicated: I would have to refactor a lot of dataset code, so I use CV-CUDA instead. In fact, I do not use CV-CUDA in the dataset. I use it at the beginning of each iteration: the dataloader only reads image bytes, and before the model forward pass I use nvImageCodec and CV-CUDA to do decoding and augmentation on the GPU. I think the problem may be that cvcuda.Stream().current does not specify the device_id, but I have not found any code to do that...


Novelian commented Jan 8, 2025

CV-CUDA uses a vector to cache items for reuse. The problem is that when the second GPU is used, the cache returns a resource that belongs to the previous GPU.
Functions related to reformat: "Tensor::CreateFromReqs" and "CreateOperatorEx".

Novelian commented

@zhengjs @JanuszL A possible fix: add device_id to the key of each cached resource (tensor), or create a separate resource collection for each GPU.
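
The suggestion above can be illustrated with a small, library-independent sketch. This is not CV-CUDA's actual cache: `PerDeviceCache`, `acquire`, `release`, and the `create` callback are hypothetical names chosen for this example. The only point it makes is that the device ID is part of the lookup key, so a resource released on one GPU is never handed out on another.

```python
from collections import defaultdict


class PerDeviceCache:
    """Toy reuse cache keyed by (device_id, requirements). Hypothetical, not CV-CUDA code."""

    def __init__(self):
        # One free list per (device_id, requirements) pair.
        self._free = defaultdict(list)

    def acquire(self, device_id, requirements, create):
        """Return a cached resource for this device, or build one with create()."""
        bucket = self._free[(device_id, requirements)]
        if bucket:
            return bucket.pop()
        return create(device_id, requirements)

    def release(self, device_id, requirements, resource):
        """Put a resource back so a later acquire() on the same device can reuse it."""
        self._free[(device_id, requirements)].append(resource)
```

With this layout, a tensor released by rank 0 can only be reused by a later request on device 0; a request with the same requirements on device 1 falls through to `create`.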


AlphaCat00 commented Apr 1, 2025

@zhengjs I found a solution: call cudart.cudaSetDevice from cuda-python before creating an NVCV stream. NVCV uses cudaStreamCreateWithFlags to create its CUDA stream.
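
A minimal sketch of that ordering, assuming cuda-python and cvcuda are installed. A CUDA stream is created on whichever device is current for the calling thread, which is why the device has to be selected first. `make_stream_for_rank` and its two optional parameters are hypothetical names introduced here so the ordering can be exercised without a GPU; the `from cuda import cudart` import path is also an assumption and may differ between cuda-python versions.

```python
def make_stream_for_rank(rank, set_device=None, stream_factory=None):
    """Make GPU `rank` current for the calling thread, then create a stream on it."""
    if set_device is None:
        # Assumption: cuda-python exposes the runtime bindings under this path.
        from cuda import cudart

        def set_device(device_id):
            # cuda-python calls return a tuple whose first element is the error code.
            (err,) = cudart.cudaSetDevice(device_id)
            if err != cudart.cudaError_t.cudaSuccess:
                raise RuntimeError(f"cudaSetDevice({device_id}) failed: {err}")

    if stream_factory is None:
        import cvcuda

        stream_factory = cvcuda.Stream

    # Select the device first; the stream is then created on that device.
    set_device(rank)
    return stream_factory()
```

In the code from the original post, this would stand in for `cvcuda.Stream().current`, for example `cvcuda_stream = make_stream_for_rank(self.rank)`.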

dsuthar-nvidia self-assigned this Apr 3, 2025

dsuthar-nvidia (Contributor) commented

@Novelian @AlphaCat00 The CV-CUDA samples and benchmarking scripts can run on multiple GPUs; in fact, the README talks about multi-GPU launches. These lines in benchmark.py launch any CV-CUDA Python sample on more than one GPU at the same time. Each sample runs as a subprocess with its own CUDA device pre-assigned: CUDA_VISIBLE_DEVICES is set to only the GPU on which that subprocess is going to execute. This makes sure that only that particular GPU is visible to the process, and that it shows up as GPU ID 0 within that process.

Let me know if this answers your question.
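
The launch pattern described above can be sketched with the standard library alone. This is not the code from benchmark.py: `launch_one_per_gpu` and `REPORT_VISIBLE` are hypothetical names, and the child command is a stand-in that only prints the variable it received, so the sketch runs on a machine without any GPU.

```python
import os
import subprocess
import sys


def launch_one_per_gpu(gpu_ids, command):
    """Start `command` once per GPU, each child seeing only its own GPU."""
    procs = []
    for gpu_id in gpu_ids:
        # Copy the parent environment and restrict the child to a single GPU.
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        procs.append(
            subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)
        )
    # Wait for every child and collect its standard output.
    return [proc.communicate()[0].strip() for proc in procs]


# Stand-in for a real sample script: the child reports what it was given.
REPORT_VISIBLE = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]
```

Calling `launch_one_per_gpu([0, 1], REPORT_VISIBLE)` starts two children at once. In a real launch, `command` would be the sample script, and inside each child the selected GPU would be addressed as device 0.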

Novelian commented

@dsuthar-nvidia Thank you for your reply. How can I use multiple GPUs with multiprocessing.Process? Setting os.environ["CUDA_VISIBLE_DEVICES"] does not work in the subprocess.
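
One pattern that may help here, offered as an untested assumption rather than a confirmed answer for CV-CUDA: CUDA reads CUDA_VISIBLE_DEVICES when it initializes in a process, so the variable has to be set in the child before anything there touches CUDA, and the child should not inherit an already initialized CUDA state from the parent. The sketch below uses the "spawn" start method for that reason. `worker` and `run_workers` are hypothetical names, and the worker only reports the value it set instead of doing GPU work.

```python
import multiprocessing as mp
import os


def worker(gpu_id, results):
    # Set this before importing or initializing anything that touches CUDA
    # in this process; the variable is read when CUDA starts up.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Import torch / cvcuda / nvimgcodec here, then address the GPU as device 0.
    results.put((gpu_id, os.environ["CUDA_VISIBLE_DEVICES"]))


def run_workers(gpu_ids):
    # "spawn" starts each child as a fresh interpreter instead of forking
    # the parent, so no CUDA state is carried over.
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(gpu_id, results)) for gpu_id in gpu_ids]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so the children can exit cleanly.
    collected = sorted(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    return collected
```

Because of the "spawn" start method, `run_workers([0, 1])` should be called from under an `if __name__ == "__main__":` guard in a script.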
