CUDA unknown error  #19903

Closed
@aniketmaurya

Bug description

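I get a `CUDA unknown error` while training with PyTorch Lightning (PTL). As the traceback below shows, it is raised during CUDA initialization when calling `tuner.lr_find`.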

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Traceback (most recent call last):
  File "/home/aniket/Projects/kaggle/essay-score/tuner.py", line 28, in <module>
    main()
  File "/home/aniket/Projects/kaggle/essay-score/tuner.py", line 24, in main
    best_lr = tuner.lr_find(model, train_dataloader, val_dataloader)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/tuner/tuning.py", line 180, in lr_find
    self._trainer.fit(model, train_dataloaders, val_dataloaders, datamodule)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 943, in _run
    self.strategy.setup_environment()
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 129, in setup_environment
    self.accelerator.setup_device(self.root_device)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/pytorch/accelerators/cuda.py", line 46, in setup_device
    _check_cuda_matmul_precision(device)
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/fabric/accelerators/cuda.py", line 361, in _check_cuda_matmul_precision
    if not torch.cuda.is_available() or not _is_ampere_or_later(device):
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/lightning/fabric/accelerators/cuda.py", line 355, in _is_ampere_or_later
    major, _ = torch.cuda.get_device_capability(device)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/home/aniket/miniconda3/envs/am/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
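
The traceback ends in `torch._C._cuda_init()`, reached through `torch.cuda.get_device_capability`, so the failure can likely be reproduced without Lightning at all. Below is a minimal diagnostic sketch, assuming only that `torch` is importable in the same environment. The helper name `cuda_diagnostics` is hypothetical and not part of PyTorch or Lightning. It only gathers information and does not fix anything.

```python
import os


def cuda_diagnostics() -> dict:
    """Collect basic facts about CUDA initialization without starting a Trainer.

    Hypothetical helper for illustration; not a PyTorch or Lightning API.
    """
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError as err:
        # torch is missing from this environment; nothing more to check.
        info["torch_import_error"] = str(err)
        return info

    info["torch_version"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["is_available"] = torch.cuda.is_available()
    try:
        # Force the lazy CUDA initialization that fails in the traceback.
        torch.cuda.init()
        info["devices"] = [
            (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
            for i in range(torch.cuda.device_count())
        ]
    except (RuntimeError, AssertionError) as err:
        # RuntimeError: driver/runtime problem; AssertionError: CPU-only build.
        info["init_error"] = str(err)
    return info


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If `init_error` shows the same `CUDA unknown error`, the problem sits below Lightning, in the driver, the CUDA runtime, or the environment variables. Running `python -m torch.utils.collect_env` would also produce the details that the environment template asks for.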

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)
