Closed
Bug description
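I get a "CUDA unknown error" when training with PyTorch Lightning (PTL). According to the traceback, it is raised during tuner.lr_find, at the point where the CUDA device is initialized.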
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
Error messages and logs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Traceback (most recent call last):
  File "/home/aniket/Projects/kaggle/essay-score/tuner.py", line 28, in <module>
    main()
  File "/home/aniket/Projects/kaggle/essay-score/tuner.py", line 24, in main
    best_lr = tuner.lr_find(model, train_dataloader, val_dataloader)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/tuner/tuning.py", line 180, in lr_find
    self._trainer.fit(model, train_dataloaders, val_dataloaders, datamodule)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 943, in _run
    self.strategy.setup_environment()
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 129, in setup_environment
    self.accelerator.setup_device(self.root_device)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/accelerators/cuda.py", line 46, in setup_device
    _check_cuda_matmul_precision(device)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/fabric/accelerators/cuda.py", line 361, in _check_cuda_matmul_precision
    if not torch.cuda.is_available() or not _is_ampere_or_later(device):
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/fabric/accelerators/cuda.py", line 355, in _is_ampere_or_later
    major, _ = torch.cuda.get_device_capability(device)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
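The last frames of the traceback are inside torch/cuda/__init__.py (`_lazy_init` calling `torch._C._cuda_init()`), so the failure may be reproducible without Lightning at all. The snippet below is a minimal diagnostic sketch, not a fix: it records `CUDA_VISIBLE_DEVICES` (which the error message itself points at) and then makes the same `torch.cuda.get_device_capability` call that fails above. The function name `cuda_diagnostics` and the choice of device index 0 are assumptions made for illustration; they do not come from the original report.

```python
import os


def cuda_diagnostics():
    """Collect basic facts about CUDA initialization, independent of Lightning.

    Hypothetical helper for narrowing down the error; not part of any library.
    """
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}

    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; nothing more to check.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["is_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()

    try:
        # Same call that fails in the traceback (via _is_ampere_or_later).
        # Device index 0 is an assumption; the report does not state the index.
        info["capability"] = torch.cuda.get_device_capability(0)
    except Exception as exc:  # broad on purpose: this is only a diagnostic
        info["capability_error"] = f"{type(exc).__name__}: {exc}"

    return info


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If `capability_error` shows the same "CUDA unknown error" in a fresh Python process, the problem most likely sits in the driver or environment setup rather than in Lightning's tuner. If the call succeeds here, it would be worth checking whether tuner.py changes `CUDA_VISIBLE_DEVICES` after the program has started.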
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response