cuda index out of bounds error while training RetinaNet #785

Closed
ssyrc opened this issue Feb 26, 2025 · 1 comment

ssyrc commented Feb 26, 2025

My Environment

  • docker: nvcr.io/nvidia/pytorch:24.01-py3
  • Python 3.10.12
  • cuda 12.3
  • torch 2.2.0a0+81ea7a4

After sourcing the configuration file for my system and running the benchmark,
training stops in the middle of the first epoch.
Could it be an issue with the dataset?
The number of downloaded train images is 1,170,301, and the number of validation images is 24,781.
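
A quick way to sanity-check the download is to compare those counts with what the annotation files actually report. A minimal sketch, assuming the annotations are COCO-format JSON (the benchmark loads them with pycocotools, as in the log below); the path is a placeholder:

# Minimal sanity-check sketch, not part of the benchmark.
# ANN_FILE is a placeholder; point it at your own annotation JSON.
from pycocotools.coco import COCO

ANN_FILE = "/path/to/openimages-mlperf_train.json"

coco = COCO(ANN_FILE)
print("images listed in annotations:", len(coco.getImgIds()))
print("annotation instances:        ", len(coco.getAnnIds()))
print("categories:                  ", len(coco.getCatIds()))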

Errors

Creating data loaders
Loading annotations into memory...
Done (t=42.38s)
Creating index...
index created!
Loading annotations into memory...
Done (t=1.04s)
Creating index...
index created!
:::MLLOG {"namespace": "", "time_ms": 1740528034501, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 36571, "metadata": {"file": "train.py", "lineno": 220}}
:::MLLOG {"namespace": "", "time_ms": 1740528034502, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 775, "metadata": {"file": "train.py", "lineno": 221}}
Running ...
:::MLLOG {"namespace": "", "time_ms": 1740528034503, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

Epoch: [0] [ 0/36571] eta: 1 day, 22:50:54 lr: 0.000000 loss: 2.2681 (2.2681) classification: 1.5570 (1.5570) bbox_regression: 0.7112 (0.7112) time: 4.6117 data: 2.3579 max mem: 51676
Epoch: [0] [ 20/36571] eta: 6:43:28 lr: 0.000000 loss: 2.1965 (2.2525) classification: 1.4899 (1.5364) bbox_regression: 0.7040 (0.7162) time: 0.4649 data: 0.0004 max mem: 52126
Epoch: [0] [ 40/36571] eta: 5:44:49 lr: 0.000000 loss: 2.1948 (2.2442) classification: 1.4947 (1.5284) bbox_regression: 0.6966 (0.7158) time: 0.4656 data: 0.0004 max mem: 52126
Epoch: [0] [ 60/36571] eta: 5:24:05 lr: 0.000000 loss: 2.2333 (2.2632) classification: 1.5102 (1.5471) bbox_regression: 0.7020 (0.7160) time: 0.4634 data: 0.0004 max mem: 52126
Epoch: [0] [ 80/36571] eta: 5:13:44 lr: 0.000000 loss: 2.1976 (2.2609) classification: 1.4952 (1.5471) bbox_regression: 0.7035 (0.7138) time: 0.4649 data: 0.0005 max mem: 52126
Epoch: [0] [ 100/36571] eta: 5:07:41 lr: 0.000000 loss: 2.2347 (2.2632) classification: 1.5412 (1.5491) bbox_regression: 0.7127 (0.7141) time: 0.4670 data: 0.0005 max mem: 52126
Epoch: [0] [ 120/36571] eta: 5:03:12 lr: 0.000000 loss: 2.2351 (2.2656) classification: 1.5331 (1.5520) bbox_regression: 0.6994 (0.7136) time: 0.4632 data: 0.0005 max mem: 52126

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [40,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

Traceback (most recent call last):
  File "/workspace/ssd/train.py", line 266, in <module>
    main(args)
  File "/workspace/ssd/train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1509, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1345, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/ssd/model/retinanet.py", line 552, in forward
    losses = self.compute_loss(targets, head_outputs, anchors)
  File "/workspace/ssd/model/retinanet.py", line 413, in compute_loss
    return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
  File "/workspace/ssd/model/retinanet.py", line 57, in compute_loss
    'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
  File "/workspace/ssd/model/retinanet.py", line 122, in compute_loss
    gt_classes_target[
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
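
Device-side asserts are reported asynchronously, so the traceback points at the indexing expression in compute_loss rather than at the offending sample. Rerunning with CUDA_LAUNCH_BLOCKING=1 makes the failure synchronous; alternatively, validating the targets on the CPU before the forward pass reports the bad sample directly. A rough sketch (the helper name, the 264-class assumption, and the suggested call site are illustrative, not part of the benchmark):

# Rough debugging sketch: check target labels on the CPU before the forward pass
# so an out-of-range label raises a readable Python exception instead of an
# asynchronous device-side assert. num_classes=264 matches the one-hot width
# seen in the fix below; the helper name and call site are assumptions.
def assert_labels_in_range(targets, num_classes=264):
    for i, t in enumerate(targets):
        labels = t["labels"]
        if labels.numel() > 0 and (labels.min() < 0 or labels.max() >= num_classes):
            raise ValueError(
                f"sample {i}: labels outside [0, {num_classes}) "
                f"(min={int(labels.min())}, max={int(labels.max())})"
            )

# e.g. in engine.py's train_one_epoch, just before `loss_dict = model(images, targets)`:
# assert_labels_in_range(targets)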

ssyrc commented Feb 27, 2025

Well, it may not be a perfect solution, but I handled this index issue by simply skipping the data that crashes, as shown below.
Path of the modified file and class:
single_stage_detector > ssd > model > retinanet.py > class RetinaNetClassificationHead

def compute_loss(self, targets, head_outputs, matched_idxs):
    # type: (List[Dict[str, Tensor]], Dict[str, Tensor], List[Tensor]) -> Tensor
    losses = []
    skip = 0

    cls_logits = head_outputs['cls_logits']
    
    for targets_per_image, cls_logits_per_image, matched_idxs_per_image in zip(targets, cls_logits, matched_idxs):
        
        # determine only the foreground
        foreground_idxs_per_image = matched_idxs_per_image >= 0
        num_foreground = foreground_idxs_per_image.sum()
        
        # create the target classification
        gt_classes_target = torch.zeros_like(cls_logits_per_image) # torch.Size([120087, 264])

        # [Modified] a label value >= 264 would index out of bounds in gt_classes_target, so skip this image
        if (targets_per_image['labels'] >= 264).any():
            skip += 1
            print(f"Skipping {skip} because labels contain values >= 264")
            continue

        gt_classes_target[
            foreground_idxs_per_image,
            targets_per_image['labels'][matched_idxs_per_image[foreground_idxs_per_image]]
        ] = 1.0 # torch.Size([120087, 264])

        # find indices for which anchors should be ignored
        valid_idxs_per_image = matched_idxs_per_image != self.BETWEEN_THRESHOLDS # torch.Size([120087])

        # compute the classification loss
        losses.append(sigmoid_focal_loss(
            cls_logits_per_image[valid_idxs_per_image],
            gt_classes_target[valid_idxs_per_image],
            reduction='sum',
        ) / max(1, num_foreground))

    # normalize by the number of images actually used; guard the denominator against zero
    return _sum(losses) / max(1, len(targets) - skip)
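
Skipping the affected images avoids the crash, but it silently drops training data. A complementary step is to find the offending annotations up front and fix or filter them at the dataset level. A rough diagnostic sketch, assuming COCO-format annotations and not assuming how the dataset class maps category_id to the 264 model labels (the path is a placeholder):

# Rough diagnostic sketch, not part of the benchmark. Reports what is in the
# annotation file; if the dataset class remaps category ids to contiguous
# labels, adjust the check accordingly.
from collections import Counter
from pycocotools.coco import COCO

ANN_FILE = "/path/to/openimages-mlperf_train.json"  # placeholder path
NUM_CLASSES = 264  # one-hot width of gt_classes_target in the comments above

coco = COCO(ANN_FILE)
cat_ids = sorted(coco.getCatIds())
print(f"{len(cat_ids)} categories, ids {cat_ids[0]}..{cat_ids[-1]} "
      f"(model one-hot width: {NUM_CLASSES})")

# Count annotations whose raw category_id could not index a 264-wide target.
offenders = Counter(
    a["image_id"]
    for a in coco.loadAnns(coco.getAnnIds())
    if a["category_id"] >= NUM_CLASSES
)
print(f"{sum(offenders.values())} annotations in {len(offenders)} images "
      f"have category_id >= {NUM_CLASSES}")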

ssyrc closed this as completed Feb 27, 2025