Hi, thanks for your contribution. When I run multi-GPU training with torch.nn.DataParallel, I always get RuntimeError: CUDA error: invalid device function. Here is my test code:
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler
device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32
input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
model = torch.nn.DataParallel(correlation_sampler, device_ids=[0,1,2]).cuda()
out = model(input1, input2)
print(out.shape)
My environment is:
Ubuntu 18.04.5 LTS
PyTorch -- 1.6.0
torchvision -- 0.7.0
gcc -- 7.5.0
CUDA -- 10.2
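The list above does not include the GPU model. Since "invalid device function" is often reported when an extension's CUDA kernels were compiled for a different GPU architecture than the one they run on (which may or may not be the cause here), the compute capability of each device could be relevant too. A minimal sketch for collecting it, where `collect_env` is a hypothetical helper and not part of PyTorch or spatial_correlation_sampler:

```python
import importlib.util
import platform
import sys


def collect_env():
    """Return a dict describing the Python/PyTorch/CUDA setup, as far as it can be detected."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with (None for CPU-only builds)
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpus"] = []
    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            info["gpus"].append((i, torch.cuda.get_device_name(i), f"sm_{major}{minor}"))
    return info


env = collect_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

The `gpus` entries can then be compared against the architectures the extension was built for.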
The full error output is:
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1] 20866 segmentation fault (core dumped) python test.py
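CUDA kernels are launched asynchronously, so the frame where the error surfaces (here `comm.gather`) is not necessarily the operation that failed. A small sketch of how the test script could force synchronous launches so that the traceback points at the failing call; this only changes where the error is reported and is not a fix:

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... then import torch and run the test code from above unchanged.
```

Setting the variable in the shell when launching `test.py` has the same effect.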
For non-distributed (single-GPU) training there is no RuntimeError, but the output is still strange: the shape is printed and then the process segfaults:
torch.Size([1, 1, 1, 3, 3])
[1] 22742 segmentation fault (core dumped) python test.py
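The segmentation fault message alone gives no Python-level location. One way to get more detail, as a sketch: the standard-library `faulthandler` module can dump the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. The log path below is an arbitrary choice for illustration.

```python
import faulthandler
import os
import tempfile

# Arbitrary log location (an assumption for this sketch); any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "test_faulthandler.log")

# Keep the file object open for the whole run: faulthandler writes to its
# file descriptor at the moment the fatal signal arrives.
fault_log = open(log_path, "w")
faulthandler.enable(file=fault_log, all_threads=True)

# ... the test code from above would follow here.
```

If the crash happens during interpreter shutdown, the dump may be short, but it should still show whether the fault occurs inside the forward call or afterwards.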