After sourcing the configuration file for my system and running the benchmark, training stops in the middle of the first epoch.
Could it be an issue with the dataset?
The number of downloaded train images is 1,170,301, and the number of validation images is 24,781.
Errors
Creating data loaders
Loading annotations into memory...
Done (t=42.38s)
Creating index...
index created!
Loading annotations into memory...
Done (t=1.04s)
Creating index...
index created!
:::MLLOG {"namespace": "", "time_ms": 1740528034501, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 36571, "metadata": {"file": "train.py", "lineno": 220}}
:::MLLOG {"namespace": "", "time_ms": 1740528034502, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 775, "metadata": {"file": "train.py", "lineno": 221}}
Running ...
:::MLLOG {"namespace": "", "time_ms": 1740528034503, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch: [0] [ 0/36571] eta: 1 day, 22:50:54 lr: 0.000000 loss: 2.2681 (2.2681) classification: 1.5570 (1.5570) bbox_regression: 0.7112 (0.7112) time: 4.6117 data: 2.3579 max mem: 51676
Epoch: [0] [ 20/36571] eta: 6:43:28 lr: 0.000000 loss: 2.1965 (2.2525) classification: 1.4899 (1.5364) bbox_regression: 0.7040 (0.7162) time: 0.4649 data: 0.0004 max mem: 52126
Epoch: [0] [ 40/36571] eta: 5:44:49 lr: 0.000000 loss: 2.1948 (2.2442) classification: 1.4947 (1.5284) bbox_regression: 0.6966 (0.7158) time: 0.4656 data: 0.0004 max mem: 52126
Epoch: [0] [ 60/36571] eta: 5:24:05 lr: 0.000000 loss: 2.2333 (2.2632) classification: 1.5102 (1.5471) bbox_regression: 0.7020 (0.7160) time: 0.4634 data: 0.0004 max mem: 52126
Epoch: [0] [ 80/36571] eta: 5:13:44 lr: 0.000000 loss: 2.1976 (2.2609) classification: 1.4952 (1.5471) bbox_regression: 0.7035 (0.7138) time: 0.4649 data: 0.0005 max mem: 52126
Epoch: [0] [ 100/36571] eta: 5:07:41 lr: 0.000000 loss: 2.2347 (2.2632) classification: 1.5412 (1.5491) bbox_regression: 0.7127 (0.7141) time: 0.4670 data: 0.0005 max mem: 52126
Epoch: [0] [ 120/36571] eta: 5:03:12 lr: 0.000000 loss: 2.2351 (2.2656) classification: 1.5331 (1.5520) bbox_regression: 0.6994 (0.7136) time: 0.4632 data: 0.0005 max mem: 52126
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [40,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
File "/workspace/ssd/train.py", line 266, in
main(args)
File "/workspace/ssd/train.py", line 235, in main
train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
File "/workspace/ssd/engine.py", line 35, in train_one_epoch
loss_dict = model(images, targets)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1509, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1345, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/ssd/model/retinanet.py", line 552, in forward
losses = self.compute_loss(targets, head_outputs, anchors)
File "/workspace/ssd/model/retinanet.py", line 413, in compute_loss
return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
File "/workspace/ssd/model/retinanet.py", line 57, in compute_loss
'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
File "/workspace/ssd/model/retinanet.py", line 122, in compute_loss
gt_classes_target[
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
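For context on what the assert means: the failing statement at retinanet.py line 122 builds the one-hot classification targets by advanced indexing of a [num_anchors, num_classes] tensor, with the ground-truth labels used as the column index. Any label outside [0, num_classes - 1] trips exactly this bounds check; because CUDA kernels launch asynchronously, it only surfaces later as "device-side assert triggered" (running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes it surface at the offending call). A minimal CPU sketch of the failing pattern, using illustrative sizes rather than the benchmark's real tensors:

import torch

# gt_classes_target has one column per class; labels index those columns.
num_anchors, num_classes = 8, 264
gt_classes_target = torch.zeros(num_anchors, num_classes)

foreground = torch.tensor([0, 3, 5])   # anchors matched to a ground-truth box
labels = torch.tensor([12, 263, 264])  # 264 is out of range for a 264-class head

try:
    gt_classes_target[foreground, labels] = 1.0
except IndexError as e:
    # On CPU the out-of-range label raises IndexError immediately; on CUDA the
    # same bounds check fails inside the kernel, exactly as in the log above.
    print(e)

So the real question is where labels >= num_classes come from, which points back at the dataset and annotation files.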
Well, it might not be a perfect solution, but I handled this index issue by simply skipping the offending images, as shown below.
Modified file & class: single_stage_detector > ssd > model > retinanet.py > class RetinaNetClassificationHead
def compute_loss(self, targets, head_outputs, matched_idxs):
    # type: (List[Dict[str, Tensor]], Dict[str, Tensor], List[Tensor]) -> Tensor
    losses = []
    skip = 0

    cls_logits = head_outputs['cls_logits']

    for targets_per_image, cls_logits_per_image, matched_idxs_per_image in zip(targets, cls_logits, matched_idxs):
        # determine only the foreground
        foreground_idxs_per_image = matched_idxs_per_image >= 0
        num_foreground = foreground_idxs_per_image.sum()

        # create the target classification
        gt_classes_target = torch.zeros_like(cls_logits_per_image)  # torch.Size([120087, 264])

        # [Modified] If any label is >= 264 (i.e. outside the 264-class head),
        # the indexing below would trigger the device-side assert, so skip the image.
        if (targets_per_image['labels'] >= 264).any():
            skip += 1
            print(f"Skipping {skip} because labels contain values >= 264")
            continue

        gt_classes_target[
            foreground_idxs_per_image,
            targets_per_image['labels'][matched_idxs_per_image[foreground_idxs_per_image]]
        ] = 1.0  # torch.Size([120087, 264])

        # find indices for which anchors should be ignored
        valid_idxs_per_image = matched_idxs_per_image != self.BETWEEN_THRESHOLDS  # torch.Size([120087])

        # compute the classification loss
        losses.append(sigmoid_focal_loss(
            cls_logits_per_image[valid_idxs_per_image],
            gt_classes_target[valid_idxs_per_image],
            reduction='sum',
        ) / max(1, num_foreground))

    # [Modified] average only over the images that were not skipped; the max()
    # guards against a zero division if every image in the batch is skipped
    return _sum(losses) / max(1, len(targets) - skip)
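A complementary check, since the original question was whether the dataset itself is at fault: compare the categories declared in the annotation JSON against the 264 classes the classification head was built with. A rough sketch, assuming COCO-style annotation files (as suggested by the pycocotools "Loading annotations into memory" output above); the NUM_CLASSES constant and the stand-alone script form are illustrative:

import json
import sys

NUM_CLASSES = 264  # size of the classification head in the model above

path = sys.argv[1]  # path to the train annotation JSON
with open(path) as f:
    ann = json.load(f)

declared = {c["id"] for c in ann.get("categories", [])}
used = {a["category_id"] for a in ann.get("annotations", [])}

print(f"categories declared: {len(declared)}")
print(f"category ids used by annotations: {len(used)}")
print(f"ids used but never declared: {sorted(used - declared)[:10]}")

if len(declared) != NUM_CLASSES:
    print(f"WARNING: the head expects {NUM_CLASSES} classes but the annotations "
          f"declare {len(declared)}; after contiguous remapping this can easily "
          f"produce labels >= {NUM_CLASSES}.")

If the numbers disagree, the skipped images are a symptom of mismatched annotations rather than corrupt data, and regenerating or re-downloading annotation files that match the benchmark's class list would be the more direct fix. The gap between the 1,170,301 downloaded train images and the 36,571 train_samples reported in the MLLOG output already hints that the annotation files and the downloaded images may not come from the same subset.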