I am trying to install the latest TensorFlow GPU version in an Ubuntu environment. The CUDA drivers are correctly installed and working, which I can confirm with the following commands:
nvcc --version
With output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Also, nvidia-smi returns a valid result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 40C P0 68W / 300W | 9712MiB / 16384MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
So it seems I have CUDA version 11.7. After creating an empty Conda environment for Python 3.9, I want to install cudatoolkit and cudnn as instructed at https://www.tensorflow.org/install/pip?hl=en:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
However, Conda complains that I do not have the correct CUDA version and won't install:
- cudatoolkit=11.2 -> __cuda[version='>=11.2.1|>=11|>=11.2|>=11.2.2']
Your installed CUDA driver is: 11.7
Obviously, my CUDA version meets the requirements, but somehow Conda does not see it. This seems to be a rare error; I didn't find many similar issues in my search, so I'm asking here. What could be wrong?
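One thing worth checking (a guess based on how Conda detects drivers, not a confirmed fix): Conda exposes the driver through the __cuda virtual package, which you can inspect with conda info and, if detection is failing, override via an environment variable:
conda info  # the "virtual packages" section should list __cuda=11.7
CONDA_OVERRIDE_CUDA="11.7" conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0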
Related
I am following a tutorial to fine-tune a SOTA model with MXNet. I am doing it in Google Colab:
https://cv.gluon.ai/build/examples_action_recognition/finetune_custom.html
However, I am unable to make it work. I believe it has to do with the MXNet version and the CUDA version in Google Colab. I am getting this error:
MXNetError: Traceback (most recent call last):
File "../src/ndarray/../operator/tensor/./../mxnet_op.h", line 1120
Name: Check failed: err == cudaSuccess (209 vs. 0) : mxnet_generic_kernel ErrStr:no kernel image is available for execution on the device
when it reaches this part:
train_loss += sum([l.mean().asscalar() for l in loss])
The version of CUDA I get is the following:
!nvcc --version  # to check the CUDA version
!nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Wed Mar 16 22:13:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 71C P8 33W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The last time I ran it, it worked, but when I tried to run the same notebook two weeks later, I could not make it work. This is how I installed the libraries I need:
!pip install mxnet-cu110
!pip install torch==1.8.0 torchvision
!pip install gluoncv[full]
!pip install mmcv
Any help would be very much appreciated.
Thanks!
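For anyone diagnosing this, here is a minimal repro that triggers the same failure outside the training loop (a sketch; the "no kernel image" error typically means the wheel was not compiled for this GPU's architecture, and the Tesla K80 is sm_37):
import mxnet as mx
print(mx.__version__, mx.context.num_gpus())
# asnumpy() forces the GPU kernel to actually run, so this raises the same
# MXNetError if the build lacks kernels for the device's architecture
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print(x.asnumpy())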
I would like to run a g5.xlarge on AWS with PyTorch.
However, I get this error when I try to do anything with CUDA in Python (for example torch.tensor(1., device="cuda")):
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Here's the nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00 Driver Version: 470.82.00 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 25C P0 55W / 300W | 1815MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10415 C /miniconda3/bin/python 1813MiB |
+-----------------------------------------------------------------------------+
Any idea? Which version of CUDA/PyTorch should I use?
The A10G GPU uses sm_86, which is natively supported in CUDA >= 11.1.
The PyTorch website provides a very handy tool to find the right install command for any required OS / PyTorch / CUDA configuration (https://pytorch.org/get-started/locally/).
For the AWS g5.xlarge instances with A10G GPUs the following configuration works for me:
Ubuntu 18.04
Python 3.8.5
PyTorch 1.12.1
CUDA 11.6
Conda install command:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Pip install command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
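A quick sanity check after installing (a sketch; the capability reported for an A10G should be (8, 6), i.e. sm_86):
import torch
print(torch.__version__, torch.version.cuda)  # e.g. 1.12.1, 11.6
print(torch.cuda.is_available())              # should be True
print(torch.cuda.get_device_capability(0))    # (8, 6) on an A10G
x = torch.tensor(1., device="cuda")           # the originally failing call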
Have a look at this discussion thread on the pytorch-cuda-arch issue; the problem seems similar.
I tried to use PyTorch 1.4.0 with the GPU, but torch.cuda.is_available() returned False.
I installed pytorch using conda:
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1
This is the output of cat /proc/version:
Linux version 5.8.0-40-generic (buildd@lcy01-amd64-014) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #45~20.04.1-Ubuntu SMP
and this is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:65:00.0 Off | N/A |
| 33% 27C P8 4W / 320W | 47MiB / 10010MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1134 G /usr/lib/xorg/Xorg 36MiB |
| 0 N/A N/A 1297 G /usr/bin/gnome-shell 9MiB |
+-----------------------------------------------------------------------------+
and the output of conda list (part of it):
Name Version Build Channel
cudatoolkit 10.1.243 h6bb024c_0
ninja 1.10.2 py38hff7bd54_0
python 3.8.2 hcff3b4d_14
pytorch 1.4.0 py3.8_cuda10.1.243_cudnn7.6.3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
torchvision 0.5.0 py38_cu101 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
yacs 0.1.6 py_0 conda-forge
yaml 0.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
As far as I know, my Linux server should support PyTorch 1.4.0, but it doesn't work. What should I do?
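For anyone diagnosing this, a minimal sketch to surface the relevant versions from Python (torch.version.cuda reports the CUDA version the wheel was built against; an RTX 3080 is Ampere, sm_86, which needs CUDA >= 11.1, while this wheel targets 10.1):
import torch
print(torch.__version__)          # 1.4.0
print(torch.version.cuda)         # 10.1 -- predates sm_86 (RTX 3080) support
print(torch.cuda.is_available())  # False on this setup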
My code is very simple for now:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.current_device()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-20-3380d2c12118> in <module>
----> 1 torch.cuda.current_device()
~/.conda/envs/tensorflow/lib/python3.6/site-packages/torch/cuda/__init__.py in current_device()
349 def current_device():
350 r"""Returns the index of a currently selected device."""
--> 351 _lazy_init()
352 return torch._C._cuda_getDevice()
353
~/.conda/envs/tensorflow/lib/python3.6/site-packages/torch/cuda/__init__.py in _lazy_init()
161 "Cannot re-initialize CUDA in forked subprocess. " + msg)
162 _check_driver()
--> 163 torch._C._cuda_init()
164 _cudart = _load_cudart()
165 _cudart.cudaGetErrorName.restype = ctypes.c_char_p
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51
Looking on the internet, it looks like a version problem, but I swear I have tried all combinations of drivers with CUDA 10.0 and 10.1, tensorflow-gpu 1.13, 1.12, etc., and nothing seems to work.
NVIDIA driver: nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14 Driver Version: 430.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 930MX Off | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 N/A / N/A | 139MiB / 2004MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 986 G /usr/lib/xorg/Xorg 64MiB |
| 0 1242 G /usr/bin/gnome-shell 72MiB |
+-----------------------------------------------------------------------------+
CUDA version: nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
tensorflow-gpu version: pip list | grep tensorflow:
tensorflow 1.13.1
tensorflow-estimator 1.13.0
pytorch version: pip list | grep torch:
pytorch-pretrained-bert 0.6.2
torch 1.1.0
torchvision 0.3.0
Can anyone spot a compatibility problem and explain why it happens and how I can fix it?
Did you test your CUDA installation? If not, you can build and run the samples (which will take a while):
$ cd ~/NVIDIA_CUDA-10.0_Samples
$ make
And then:
$ cd ~/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release
$ ./deviceQuery
You should get "Test passed!" as the result.
In Python, after importing theano, I get the following:
In [1]: import theano
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available
(error: Unable to get the number of gpus available: unknown error)
I'm running this on Ubuntu 14.04 and I have an old GPU: a GeForce GTX 280.
And my nvidia driver:
$ nvidia-smi
Wed Jul 13 21:25:58 2016
+------------------------------------------------------+
| NVIDIA-SMI 340.96 Driver Version: 340.96 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 280 Off | 0000:02:00.0 N/A | N/A |
| 40% 65C P0 N/A / N/A | 638MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I'm not sure why it says 'Not Supported', but it seems that might not be an issue, as said here.
Also, the CUDA version:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
Any help I can get would be awesome. I've been at this all day...
I feel your pain. I spent a few days ploughing through all the CUDA-related errors.
First, update to a more recent driver, e.g. 361 (do a CLEAN INSTALL of it!). Then completely wipe CUDA and cuDNN from your hard drive with
sudo rm -rf /usr/local/cuda
or wherever else you installed it, then install CUDA 7.5 (seriously, this specific version) and cuDNN v4 (again, this specific version).
You can run the following commands to sort out CUDA:
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
bash cuda_7.5.18_linux.run --override
Follow the instructions and say NO when asked to install the 350 driver. Then you should be set.
For cuDNN, there's no direct link for wget, so you have to get the installer from https://developer.nvidia.com/cudnn and run the following commands:
tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include
sudo cp -r cuda/lib64/. /usr/local/cuda-7.5/lib64
echo -e 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/lib64"\nexport CUDA_HOME=/usr/local/cuda-7.5' >> ~/.bash_profile
source ~/.bash_profile
Now to handle Theano on GPU:
nano ~/.theanorc
add these lines:
[global]
floatX = float32
device = gpu0
If you get an nvcc error, use this instead:
[global]
floatX = float32
device = gpu0
[nvcc]
flags=-D_FORCE_INLINES
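To confirm the GPU is actually being used, here is a small test in the spirit of the old Theano docs (a sketch; with device = gpu0, the import itself should also print something like "Using gpu device 0"):
import theano
import theano.tensor as T
from theano import function
x = T.fmatrix('x')
f = function([x], T.exp(x))       # compiled for the GPU when device = gpu0
print(f.maker.fgraph.toposort())  # GPU ops (e.g. GpuElemwise) should appear here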
I had the same issue and was able to solve it by doing two things:
Installing gcc-5 and linking /usr/bin/gcc to /usr/bin/gcc-5 as well as /usr/bin/g++ to /usr/bin/g++-5 (PS: I am using CUDA 8); see the sketch after this list.
Adding the flag flags=-D_FORCE_INLINES to the file ~/.theanorc under the [nvcc] section, since apparently a bug in glibc 2.23 causes this issue.
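Roughly, the first step looks like this (a sketch assuming Ubuntu's gcc-5/g++-5 packages; adjust the paths if your toolchain lives elsewhere):
sudo apt-get install gcc-5 g++-5
# point the default compilers at the 5.x versions CUDA 8 expects
sudo ln -sf /usr/bin/gcc-5 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-5 /usr/bin/g++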