CUDA Error when pytorch distribution training... #63

Open
@liming-ai

Hi, thanks for your contribution. When I use distributed training (torch.nn.DataParallel), I always get RuntimeError: CUDA error: invalid device function. Here is my test code:

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)

model = torch.nn.DataParallel(correlation_sampler, device_ids=[0,1,2]).cuda()

out = model(input1, input2)

print(out.shape)

My environment is:

Ubuntu 18.04.5 LTS
PyTorch -- 1.6.0
torchvision -- 0.7.0
gcc -- 7.5.0
CUDA -- 10.2
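
The list above does not include the GPU model or its compute capability. This matters because "invalid device function" is often reported when a CUDA extension was compiled for a different GPU architecture than the one it runs on. Whether that is the cause here is an assumption, not something the report confirms. A minimal sketch for collecting those details follows; the helper name collect_env is hypothetical and only standard torch.cuda queries are used:

```python
import platform


def collect_env():
    """Return version and GPU details that are useful in a bug report."""
    info = {
        "python": platform.python_version(),
        "torch": None,       # filled in only if PyTorch can be imported
        "cuda_build": None,  # CUDA version PyTorch was built against
        "gpus": [],          # one entry per visible CUDA device
    }
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(index)
            info["gpus"].append({
                "index": index,
                "name": torch.cuda.get_device_name(index),
                "capability": f"{major}.{minor}",
            })
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If the reported capability differs between the GPUs in device_ids, or from the architecture the extension was built for, rebuilding the extension for the matching architectures would be a reasonable next thing to try.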

The full error output is:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1]    20866 segmentation fault (core dumped)  python test.py

For non-distributed (single-GPU) training there is no CUDA error, but the process still ends with a segmentation fault after printing the output shape:

torch.Size([1, 1, 1, 3, 3])
[1]    22742 segmentation fault (core dumped)  python test.py
