cuda index out of bounds error while training RetinaNet #785

Closed
ssyrc opened this issue Feb 26, 2025 · 1 comment

ssyrc commented Feb 26, 2025

My Environment

  • docker: nvcr.io/nvidia/pytorch:24.01-py3
  • Python 3.10.12
  • cuda 12.3
  • torch 2.2.0a0+81ea7a4

After sourcing the configuration file for my system and running the benchmark,
training stops in the middle of the first epoch.
Could it be an issue with the dataset?
The number of downloaded train images is 1,170,301, and the number of validation images is 24,781.
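
A quick way to sanity-check the download is to compare those counts with what the annotation files actually report. A minimal sketch, assuming the annotations are COCO-format JSON (the benchmark loads them with pycocotools, as in the log below); the path is a placeholder:

# Minimal sanity-check sketch, not part of the benchmark.
# ANN_FILE is a placeholder; point it at your own annotation JSON.
from pycocotools.coco import COCO

ANN_FILE = "/path/to/openimages-mlperf_train.json"

coco = COCO(ANN_FILE)
print("images listed in annotations:", len(coco.getImgIds()))
print("annotation instances:        ", len(coco.getAnnIds()))
print("categories:                  ", len(coco.getCatIds()))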

Errors

Creating data loaders
Loading annotations into memory...
Done (t=42.38s)
Creating index...
index created!
Loading annotations into memory...
Done (t=1.04s)
Creating index...
index created!
:::MLLOG {"namespace": "", "time_ms": 1740528034501, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 36571, "metadata": {"file": "train.py", "lineno": 220}}
:::MLLOG {"namespace": "", "time_ms": 1740528034502, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 775, "metadata": {"file": "train.py", "lineno": 221}}
Running ...
:::MLLOG {"namespace": "", "time_ms": 1740528034503, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

Epoch: [0] [ 0/36571] eta: 1 day, 22:50:54 lr: 0.000000 loss: 2.2681 (2.2681) classification: 1.5570 (1.5570) bbox_regression: 0.7112 (0.7112) time: 4.6117 data: 2.3579 max mem: 51676
Epoch: [0] [ 20/36571] eta: 6:43:28 lr: 0.000000 loss: 2.1965 (2.2525) classification: 1.4899 (1.5364) bbox_regression: 0.7040 (0.7162) time: 0.4649 data: 0.0004 max mem: 52126
Epoch: [0] [ 40/36571] eta: 5:44:49 lr: 0.000000 loss: 2.1948 (2.2442) classification: 1.4947 (1.5284) bbox_regression: 0.6966 (0.7158) time: 0.4656 data: 0.0004 max mem: 52126
Epoch: [0] [ 60/36571] eta: 5:24:05 lr: 0.000000 loss: 2.2333 (2.2632) classification: 1.5102 (1.5471) bbox_regression: 0.7020 (0.7160) time: 0.4634 data: 0.0004 max mem: 52126
Epoch: [0] [ 80/36571] eta: 5:13:44 lr: 0.000000 loss: 2.1976 (2.2609) classification: 1.4952 (1.5471) bbox_regression: 0.7035 (0.7138) time: 0.4649 data: 0.0005 max mem: 52126
Epoch: [0] [ 100/36571] eta: 5:07:41 lr: 0.000000 loss: 2.2347 (2.2632) classification: 1.5412 (1.5491) bbox_regression: 0.7127 (0.7141) time: 0.4670 data: 0.0005 max mem: 52126
Epoch: [0] [ 120/36571] eta: 5:03:12 lr: 0.000000 loss: 2.2351 (2.2656) classification: 1.5331 (1.5520) bbox_regression: 0.6994 (0.7136) time: 0.4632 data: 0.0005 max mem: 52126

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [40,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

Traceback (most recent call last):
  File "/workspace/ssd/train.py", line 266, in <module>
    main(args)
  File "/workspace/ssd/train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1509, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1345, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/ssd/model/retinanet.py", line 552, in forward
    losses = self.compute_loss(targets, head_outputs, anchors)
  File "/workspace/ssd/model/retinanet.py", line 413, in compute_loss
    return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
  File "/workspace/ssd/model/retinanet.py", line 57, in compute_loss
    'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
  File "/workspace/ssd/model/retinanet.py", line 122, in compute_loss
    gt_classes_target[
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
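
Device-side asserts are reported asynchronously, so the traceback points at the indexing expression in compute_loss rather than at the offending sample. Rerunning with CUDA_LAUNCH_BLOCKING=1 makes the failure synchronous; alternatively, validating the targets on the CPU before the forward pass reports the bad sample directly. A rough sketch (the helper name, the 264-class assumption, and the suggested call site are illustrative, not part of the benchmark):

# Rough debugging sketch: check target labels on the CPU before the forward pass
# so an out-of-range label raises a readable Python exception instead of an
# asynchronous device-side assert. num_classes=264 matches the one-hot width
# seen in the fix below; the helper name and call site are assumptions.
def assert_labels_in_range(targets, num_classes=264):
    for i, t in enumerate(targets):
        labels = t["labels"]
        if labels.numel() > 0 and (labels.min() < 0 or labels.max() >= num_classes):
            raise ValueError(
                f"sample {i}: labels outside [0, {num_classes}) "
                f"(min={int(labels.min())}, max={int(labels.max())})"
            )

# e.g. in engine.py's train_one_epoch, just before `loss_dict = model(images, targets)`:
# assert_labels_in_range(targets)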

ssyrc commented Feb 27, 2025

Well, it may not be a perfect solution, but I handled this index issue by simply skipping the data that crashes, as shown below.
Path of the modified file and class:
single_stage_detector > ssd > model > retinanet.py > class RetinaNetClassificationHead

def compute_loss(self, targets, head_outputs, matched_idxs):
    # type: (List[Dict[str, Tensor]], Dict[str, Tensor], List[Tensor]) -> Tensor
    losses = []
    skip = 0

    cls_logits = head_outputs['cls_logits']
    
    for targets_per_image, cls_logits_per_image, matched_idxs_per_image in zip(targets, cls_logits, matched_idxs):
        
        # determine only the foreground
        foreground_idxs_per_image = matched_idxs_per_image >= 0
        num_foreground = foreground_idxs_per_image.sum()
        
        # create the target classification
        gt_classes_target = torch.zeros_like(cls_logits_per_image) # torch.Size([120087, 264])

        # [Modified] a label value >= 264 would index out of bounds in gt_classes_target, so skip this image
        if (targets_per_image['labels'] >= 264).any():
            skip += 1
            print(f"Skipping {skip} because labels contain values >= 264")
            continue

        gt_classes_target[
            foreground_idxs_per_image,
            targets_per_image['labels'][matched_idxs_per_image[foreground_idxs_per_image]]
        ] = 1.0 # torch.Size([120087, 264])

        # find indices for which anchors should be ignored
        valid_idxs_per_image = matched_idxs_per_image != self.BETWEEN_THRESHOLDS # torch.Size([120087])

        # compute the classification loss
        losses.append(sigmoid_focal_loss(
            cls_logits_per_image[valid_idxs_per_image],
            gt_classes_target[valid_idxs_per_image],
            reduction='sum',
        ) / max(1, num_foreground))

    # normalize by the number of images actually used; guard the denominator against zero
    return _sum(losses) / max(1, len(targets) - skip)
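
Skipping the affected images avoids the crash, but it silently drops training data. A complementary step is to find the offending annotations up front and fix or filter them at the dataset level. A rough diagnostic sketch, assuming COCO-format annotations and not assuming how the dataset class maps category_id to the 264 model labels (the path is a placeholder):

# Rough diagnostic sketch, not part of the benchmark. Reports what is in the
# annotation file; if the dataset class remaps category ids to contiguous
# labels, adjust the check accordingly.
from collections import Counter
from pycocotools.coco import COCO

ANN_FILE = "/path/to/openimages-mlperf_train.json"  # placeholder path
NUM_CLASSES = 264  # one-hot width of gt_classes_target in the comments above

coco = COCO(ANN_FILE)
cat_ids = sorted(coco.getCatIds())
print(f"{len(cat_ids)} categories, ids {cat_ids[0]}..{cat_ids[-1]} "
      f"(model one-hot width: {NUM_CLASSES})")

# Count annotations whose raw category_id could not index a 264-wide target.
offenders = Counter(
    a["image_id"]
    for a in coco.loadAnns(coco.getAnnIds())
    if a["category_id"] >= NUM_CLASSES
)
print(f"{sum(offenders.values())} annotations in {len(offenders)} images "
      f"have category_id >= {NUM_CLASSES}")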

ssyrc closed this as completed Feb 27, 2025