PyTorch version on g5.xlarge

I would like to run PyTorch on an AWS g5.xlarge instance.
However, I get this error when I try to do anything with CUDA in Python (for example torch.tensor(1., device="cuda")):
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Here is the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00 Driver Version: 470.82.00 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 25C P0 55W / 300W | 1815MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10415 C /miniconda3/bin/python 1813MiB |
+-----------------------------------------------------------------------------+
Any ideas? Which version of CUDA/PyTorch should I use?

The A10G GPU uses sm_86, which is natively supported in CUDA >= 11.1.
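The mismatch behind the warning can be sketched as follows. This is a simplified check, not PyTorch's own code: the helper name `build_supports_device` is hypothetical, and it assumes that kernels compiled for one `sm_XY` target run on any GPU with the same major compute capability. On a machine with a CUDA build of PyTorch, `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` report the real values.

```python
import re


def build_supports_device(arch_list, capability):
    """Rough check: does any compiled sm_XY target share the GPU's major version?

    Assumption: binaries are compatible within one major compute capability.
    """
    major, _minor = capability
    compiled = []
    for arch in arch_list:
        match = re.match(r"sm_(\d+)", arch)
        if match:
            compiled.append(int(match.group(1)))
    return any(sm // 10 == major for sm in compiled)


# Arch list quoted in the warning above, and the A10G's capability (sm_86).
old_build = ["sm_37", "sm_50", "sm_60", "sm_70"]
a10g = (8, 6)
print(build_supports_device(old_build, a10g))  # False

# Optional: inspect the real installation if PyTorch and a GPU are present.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    print(torch.cuda.get_arch_list())
    print(torch.cuda.get_device_capability(0))
```

A build whose arch list contains sm_80 or sm_86 would pass this check for the A10G.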
The PyTorch website provides a handy tool for finding the right install command for any required OS / PyTorch / CUDA configuration: https://pytorch.org/get-started/locally/
For AWS g5.xlarge instances with A10G GPUs, the following configuration works for me:
Ubuntu 18.04
Python 3.8.5
PyTorch 1.12.1
CUDA 11.6
Conda install command:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Pip install command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
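After installing, a quick sanity check along these lines can confirm which build ended up in the environment. It is a sketch under assumptions: `split_torch_version` is a hypothetical helper, and it relies on wheels from the cu116 index reporting a version such as `1.12.1+cu116` (conda builds may carry no `+cu...` suffix, in which case `torch.version.cuda` is the value to look at).

```python
def split_torch_version(version):
    """Split '1.12.1+cu116' into ('1.12.1', 'cu116'); the tag is None if absent."""
    release, _, tag = version.partition("+")
    return release, (tag or None)


print(split_torch_version("1.12.1+cu116"))  # ('1.12.1', 'cu116')

# Optional: report on the real installation if PyTorch is importable.
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    print("torch version:", split_torch_version(torch.__version__))
    print("built against CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        # The operation from the question; it should now run without the warning.
        print(torch.tensor(1.0, device="cuda"))
```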

Have a look at this discussion thread on the pytorch-cuda-arch issue. The problem seems similar.

Related

Conda cannot see Cuda version

I am trying to install the newest TensorFlow GPU version in an Ubuntu environment. The CUDA drivers are correctly installed and working, which I can confirm with the following commands:
nvcc --version
With output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Also, nvidia-smi returns a valid result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 40C P0 68W / 300W | 9712MiB / 16384MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
It seems I have CUDA version 11.7. After creating an empty Conda environment for Python 3.9, I want to install cudatoolkit and cudnn as instructed at https://www.tensorflow.org/install/pip?hl=en:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
However, Conda complains that I do not have the correct CUDA version and refuses to install:
- cudatoolkit=11.2 -> __cuda[version='>=11.2.1|>=11|>=11.2|>=11.2.2']
Your installed CUDA driver is: 11.7
Obviously, my CUDA version meets the requirements, but somehow Conda does not see it. This seems to be a rare error; I did not find many similar issues in my search, so I am asking here. What could be wrong?
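The claim that 11.7 meets the listed constraint can be checked with a few lines of Python. This is only a simplified comparison and not how Conda's solver works: `satisfies` is a hypothetical helper, and it assumes the spec is a `|`-separated list of `>=` alternatives, as in the message above.

```python
def parse_version(text):
    """Turn '11.2.1' into the tuple (11, 2, 1)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, spec):
    """Return True if `installed` meets at least one '>=X.Y.Z' alternative."""
    have = parse_version(installed)
    for alternative in spec.split("|"):
        if not alternative.startswith(">="):
            raise ValueError(f"unsupported constraint: {alternative}")
        need = parse_version(alternative[2:])
        # Pad both tuples with zeros so that 11.7 compares as 11.7.0.
        width = max(len(have), len(need))
        have_padded = have + (0,) * (width - len(have))
        need_padded = need + (0,) * (width - len(need))
        if have_padded >= need_padded:
            return True
    return False


spec = ">=11.2.1|>=11|>=11.2|>=11.2.2"
print(satisfies("11.7", spec))  # True
```

Since the version numbers themselves are compatible, the conflict Conda reports presumably comes from somewhere other than the raw comparison.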

Error with MXNet and CUDA in Google Colab: no kernel image is available for execution on the device

I am following a tutorial to fine-tune a SOTA model with MXNet. I am doing it in Google Colab:
https://cv.gluon.ai/build/examples_action_recognition/finetune_custom.html
However, I am unable to make it work. I believe it has to do with the MXNet and CUDA versions in Google Colab. I am getting this error:
MXNetError: Traceback (most recent call last):
File "../src/ndarray/../operator/tensor/./../mxnet_op.h", line 1120
Name: Check failed: err == cudaSuccess (209 vs. 0) : mxnet_generic_kernel ErrStr:no kernel image is available for execution on the device
when it reaches this part:
train_loss += sum([l.mean().asscalar() for l in loss])
The version of CUDA I get is the following:
!nvcc --version # check the CUDA version
!nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Wed Mar 16 22:13:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 71C P8 33W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The last time I ran it, it worked, but two weeks later I tried to run the same notebook and could not make it work. This is how I installed the libraries I need:
!pip install mxnet-cu110
!pip install torch==1.8.0 torchvision
!pip install gluoncv[full]
!pip install mmcv
Any help would be very much appreciated.
Thanks!
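CUDA error 209 ("no kernel image is available") generally means that the loaded binary contains no kernels compiled for the GPU's architecture. As a rough aid, the sketch below maps the GPUs that appear in the nvidia-smi outputs on this page to the `sm_XY` target each one needs. The table and the helper `required_arch` are illustrative assumptions based on NVIDIA's published compute capabilities; whether a particular mxnet or torch wheel includes a given target is not something this snippet can determine.

```python
# Compute capability (major, minor) per GPU name, as listed by NVIDIA.
COMPUTE_CAPABILITY = {
    "Tesla K80": (3, 7),
    "Tesla V100": (7, 0),
    "NVIDIA A10G": (8, 6),
    "GeForce RTX 3080": (8, 6),
    "GeForce GTX 760": (3, 0),
    "GeForce 930MX": (5, 0),
}


def required_arch(gpu_name):
    """Return the sm_XY architecture string for a GPU in the table."""
    major, minor = COMPUTE_CAPABILITY[gpu_name]
    return f"sm_{major}{minor}"


# The Colab session in this question was assigned a Tesla K80.
print(required_arch("Tesla K80"))  # sm_37
```

Colab can hand out different GPU models between sessions, which may be one reason a notebook that worked earlier fails later with the same install commands.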

torch.cuda.is_available() returns False on Linux

I tried to use PyTorch 1.4.0 with a GPU, but torch.cuda.is_available() returned False.
I installed pytorch using conda:
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1
This is the output of cat /proc/version:
Linux version 5.8.0-40-generic (buildd@lcy01-amd64-014) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #45~20.04.1-Ubuntu SMP
This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 Off | 00000000:65:00.0 Off | N/A |
| 33% 27C P8 4W / 320W | 47MiB / 10010MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1134 G /usr/lib/xorg/Xorg 36MiB |
| 0 N/A N/A 1297 G /usr/bin/gnome-shell 9MiB |
+-----------------------------------------------------------------------------+
And this is part of the output of conda list:
Name Version Build Channel
cudatoolkit 10.1.243 h6bb024c_0
ninja 1.10.2 py38hff7bd54_0
python 3.8.2 hcff3b4d_14
pytorch 1.4.0 py3.8_cuda10.1.243_cudnn7.6.3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
torchvision 0.5.0 py38_cu101 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
yacs 0.1.6 py_0 conda-forge
yaml 0.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
As far as I know, my Linux server should support PyTorch 1.4.0, but it does not work. What should I do?
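One thing worth checking here is the CUDA version the installed package was built against. The RTX 3080 is an sm_86 device like the A10G, and sm_86 is natively supported only from CUDA 11.1 on. The sketch below reads the CUDA version out of a conda build string and compares it with that minimum. The helper name and the constant are assumptions made for illustration, and the result does not by itself explain why is_available() returns False.

```python
import re

# Assumption: sm_86 GPUs need a build made with CUDA 11.1 or newer.
MIN_CUDA_FOR_SM86 = (11, 1)


def cuda_from_build(build_string):
    """Extract (major, minor) from a build string such as 'py3.8_cuda10.1.243_cudnn7.6.3_0'."""
    match = re.search(r"cuda(\d+)\.(\d+)", build_string)
    if match is None:
        return None  # no CUDA marker found in the build string
    return int(match.group(1)), int(match.group(2))


build = "py3.8_cuda10.1.243_cudnn7.6.3_0"  # from the conda list output above
built_with = cuda_from_build(build)
print(built_with)                       # (10, 1)
print(built_with >= MIN_CUDA_FOR_SM86)  # False
```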

CUDA runtime unknown error, maybe a driver problem? CUDA can't see my gpu

My code is very simple for now:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.current_device()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-20-3380d2c12118> in <module>
----> 1 torch.cuda.current_device()
~/.conda/envs/tensorflow/lib/python3.6/site-packages/torch/cuda/__init__.py in current_device()
349 def current_device():
350 r"""Returns the index of a currently selected device."""
--> 351 _lazy_init()
352 return torch._C._cuda_getDevice()
353
~/.conda/envs/tensorflow/lib/python3.6/site-packages/torch/cuda/__init__.py in _lazy_init()
161 "Cannot re-initialize CUDA in forked subprocess. " + msg)
162 _check_driver()
--> 163 torch._C._cuda_init()
164 _cudart = _load_cudart()
165 _cudart.cudaGetErrorName.restype = ctypes.c_char_p
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51
Searching the internet suggests it is a version problem, but I swear I have tried every combination of drivers, CUDA 10.0 and 10.1, tensorflow-gpu 1.13 and 1.12, etc., and nothing seems to work.
NVIDIA driver (nvidia-smi):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14 Driver Version: 430.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 930MX Off | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 N/A / N/A | 139MiB / 2004MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 986 G /usr/lib/xorg/Xorg 64MiB |
| 0 1242 G /usr/bin/gnome-shell 72MiB |
+-----------------------------------------------------------------------------+
CUDA version (nvcc --version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
tensorflow-gpu version (pip list | grep tensorflow):
tensorflow 1.13.1
tensorflow-estimator 1.13.0
PyTorch version (pip list | grep torch):
pytorch-pretrained-bert 0.6.2
torch 1.1.0
torchvision 0.3.0
Can anyone see a compatibility problem and explain why it occurs and how I can fix it?
Did you test your CUDA installation? If not, you can build the samples (this will take a while):
$ cd ~/NVIDIA_CUDA-10.0_Samples
$ make
And then:
$ cd ~/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release
$ ./deviceQuery
You should see "Result = PASS" at the end of the output.
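The same check can be scripted from Python. This is a minimal sketch that assumes the samples were built in the default location used above and that deviceQuery ends its output with a line containing `Result = PASS`. The helper names are hypothetical.

```python
import os
import subprocess

# Assumed location of the compiled CUDA 10.0 samples (same path as above).
SAMPLES_BIN = os.path.expanduser(
    "~/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release"
)


def device_query_passed(output):
    """Return True if the deviceQuery output reports a passing result."""
    return "Result = PASS" in output


def run_device_query(binary=os.path.join(SAMPLES_BIN, "deviceQuery")):
    """Run deviceQuery if it exists; return True/False, or None if not built."""
    if not os.path.exists(binary):
        return None
    completed = subprocess.run([binary], capture_output=True, text=True)
    return device_query_passed(completed.stdout)


print(run_device_query())
```

If this fails outside of PyTorch as well, the problem lies with the driver or CUDA setup rather than with the Python packages.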

Jupyter Notebook Kernel dies when importing Tensorflow

I am trying to use tensorflow-gpu in a Jupyter notebook inside a Docker container running on my Ubuntu 18.04 Bionic Beaver server.
I have done the following steps:
1) Installed NVIDIA driver 390.67: sudo apt-get install nvidia-driver-390
2) Installed CUDA 9.0: cuda_9.0.176_384.81_linux.run
3) Installed cuDNN 7.0.5: cudnn-9.0-linux-x64-v7.tgz
4) Installed Docker: sudo apt install docker-ce
5) Installed nvidia-docker2: sudo apt install nvidia-docker2
I then run the following:
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3
The reason I am using TensorFlow 1.5.1 is that I was getting the same dead-kernel error on 1.8.0-gpu-py, and I read that you need to use TensorFlow 1.5 for older CPUs. I don't think that is really the issue, though, since I am simply trying to import it and I am using tensorflow-gpu.
When I run any cell that imports tensorflow for the first time, the kernel dies.
My server hardware is as follows
CPU: AMD Phenom(tm) II X4 965 Processor
GPU: GeForce GTX 760
Motherboard: ASRock 960GM/U3S3 FX
Memory: G Skill F3-1600C9D-8GAB (8 GB Memory)
How can I determine why the kernel dies when I simply import TensorFlow with import tensorflow as tf?
Here is the output of nvidia-smi run through nvidia-docker:
$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Fri Jun 22 17:53:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67 Driver Version: 390.67 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 760 Off | 00000000:01:00.0 N/A | N/A |
| 0% 34C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
This exactly matches the output of nvidia-smi outside Docker.
Here is the nvcc --version result:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
If I run nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3 bash to bring up a bash prompt, start a Python session with python, and then run import tensorflow as tf, I get Illegal instruction (core dumped), so it does not work outside Jupyter either. The error still occurs even if I import numpy first and then run import tensorflow as tf.
It turns out I needed to downgrade to TensorFlow 1.5.0. 1.5.1 is where AVX was added. AVX instructions are apparently used on module load to set up the library.
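Whether a CPU supports AVX can be checked before picking a TensorFlow build. The sketch below assumes an x86 Linux system where /proc/cpuinfo contains a `flags` line. The helper names are hypothetical. To my knowledge, CPUs of the Phenom II generation do not list `avx`.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found (for example on non-x86 systems)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


print("AVX supported:", "avx" in cpu_flags(read_cpuinfo()))
```

An "Illegal instruction (core dumped)" on import together with a missing `avx` flag would be consistent with the explanation above.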
