In Python, after importing Theano, I get the following:
In [1]: import theano
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available
(error: Unable to get the number of gpus available: unknown error)
I'm running this on Ubuntu 14.04, and I have an old GPU: a GeForce GTX 280.
And my NVIDIA driver:
$ nvidia-smi
Wed Jul 13 21:25:58 2016
+------------------------------------------------------+
| NVIDIA-SMI 340.96 Driver Version: 340.96 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 280 Off | 0000:02:00.0 N/A | N/A |
| 40% 65C P0 N/A / N/A | 638MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I'm not sure why it's saying 'Not Supported', but it seems that might not be an issue, as noted here.
Also, the CUDA version:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
Any help I can get would be awesome. I've been at this all day...
I feel your pain. I spent a few days ploughing through all the CUDA-related errors.
Firstly, update to a more recent driver, e.g. 361 (do a CLEAN INSTALL!). Then completely wipe CUDA and cuDNN from your hard drive with
sudo rm -rf /usr/local/cuda
or wherever else you installed it. Then install CUDA 7.5 (seriously, this specific version) and cuDNN v4 (again, this specific version).
You can run the following commands to sort out CUDA:
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
bash cuda_7.5.18_linux.run --override
Follow the instructions, and say NO when they ask you to install the 350 driver. You should then be set.
For cudnn, there's no direct link to wget, so you have to get the installer from https://developer.nvidia.com/cudnn and run the following commands:
tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include
sudo cp -r cuda/lib64/. /usr/local/cuda-7.5/lib64
echo -e 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/lib64"\nexport CUDA_HOME=/usr/local/cuda-7.5' >> ~/.bash_profile
source ~/.bash_profile
Now to handle Theano on GPU:
nano ~/.theanorc
add these lines:
[global]
floatX = float32
device = gpu0
If you get an nvcc error, use this instead:
[global]
floatX = float32
device = gpu0
[nvcc]
flags=-D_FORCE_INLINES
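As a quick sanity check that the file will parse the way Theano reads it, you can run it through Python's standard-library INI parser (a minimal sketch; Theano itself is not needed, and the helper name is mine):

```python
# Sanity-check a .theanorc: Theano reads it as INI-style config, so the
# stdlib configparser should see the same sections and keys.
import configparser

THEANORC = """\
[global]
floatX = float32
device = gpu0

[nvcc]
flags = -D_FORCE_INLINES
"""

def parse_theanorc(text):
    """Parse theanorc-style INI text into nested dicts."""
    cp = configparser.ConfigParser()
    cp.read_string(text)
    return {section: dict(cp[section]) for section in cp.sections()}

if __name__ == "__main__":
    cfg = parse_theanorc(THEANORC)
    print(cfg["global"]["device"])  # gpu0
    print(cfg["nvcc"]["flags"])     # -D_FORCE_INLINES
```

If the parse fails or a key comes back wrong, the file is malformed before Theano ever sees it.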
I had the same issue and was able to solve it by doing two things:
Install gcc-5, then link /usr/bin/gcc to /usr/bin/gcc-5 and /usr/bin/g++ to /usr/bin/g++-5 (PS: I am using CUDA 8)
Add the flag flags=-D_FORCE_INLINES to ~/.theanorc under [nvcc], since apparently a bug in glibc 2.23 causes this issue
I am trying to install the newest TensorFlow GPU version in an Ubuntu environment. The CUDA drivers are correctly installed and working, which I can confirm with the following commands:
nvcc --version
With output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Also, nvidia-smi returns a valid result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 40C P0 68W / 300W | 9712MiB / 16384MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
It seems I have CUDA version 11.7. After creating an empty Conda environment for Python 3.9, I want to install cudatoolkit and cudnn as instructed at https://www.tensorflow.org/install/pip?hl=en:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
However, it complains that I do not have the correct CUDA version and won't install:
- cudatoolkit=11.2 -> __cuda[version='>=11.2.1|>=11|>=11.2|>=11.2.2']
Your installed CUDA driver is: 11.7
Obviously, my CUDA version meets the requirement, but somehow Conda doesn't see it. This seems to be a rare error; I didn't find many similar issues in my search, so I turned to here. What could be wrong?
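For reference, the constraint from the error message can be checked by hand. Below is a simplified sketch of conda's version-spec handling (helper names are mine; conda's real spec grammar is richer), showing that 11.7 does satisfy the `>=11.2.1|>=11|>=11.2|>=11.2.2` constraint:

```python
# Sketch of the solver check: the __cuda virtual package (derived from
# the installed driver) must satisfy the OR-constraint attached to
# cudatoolkit=11.2. The constraint string is copied from the error
# message above; '|' means OR, and only '>=' clauses appear here.
def version_tuple(v):
    return tuple(int(x) for x in v.split("."))

def satisfies(installed, constraint):
    """Return True if `installed` meets any '>='-clause in `constraint`."""
    for clause in constraint.split("|"):
        assert clause.startswith(">=")
        if version_tuple(installed) >= version_tuple(clause[2:]):
            return True
    return False

print(satisfies("11.7", ">=11.2.1|>=11|>=11.2|>=11.2.2"))  # True
```

Since the constraint itself is satisfied, the failure looks like conda failing to detect the `__cuda` virtual package rather than a real version mismatch. `conda info` lists the virtual packages conda detected, and setting the `CONDA_OVERRIDE_CUDA` environment variable is a commonly suggested way to override the detected value.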
I would like to run a g5.xlarge on AWS with PyTorch.
However, I get this error when I try to do anything with CUDA in Python (for example torch.tensor(1., device="cuda")):
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Here's the nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00 Driver Version: 470.82.00 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 25C P0 55W / 300W | 1815MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10415 C /miniconda3/bin/python 1813MiB |
+-----------------------------------------------------------------------------+
Any idea? Which version of CUDA/PyTorch should I use?
The A10G GPU uses sm_86, which is natively supported in CUDA >= 11.1.
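The mismatch in the error message boils down to a simple membership test, sketched below (the capability tuples are what `torch.cuda.get_device_capability()` would return; the helper names are mine, not PyTorch's internals):

```python
# A PyTorch build ships kernels for a fixed list of compute capabilities;
# the GPU's own capability must be on that list. Architecture names are
# "sm_" plus the major and minor compute-capability digits.
def arch_name(capability):
    major, minor = capability
    return f"sm_{major}{minor}"

def is_supported(capability, built_for):
    return arch_name(capability) in built_for

# The list reported by the error message above:
BUILT_FOR = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(is_supported((7, 0), BUILT_FOR))  # True  (e.g. a V100)
print(is_supported((8, 6), BUILT_FOR))  # False (the A10G)
```

So the fix is to install a PyTorch build compiled for sm_86, i.e. one built against CUDA >= 11.1.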
The PyTorch website provides a very handy tool to find the right install command for any required OS / PyTorch / CUDA configuration:
For the AWS g5.xlarge instances with A10G GPUs the following configuration works for me:
Ubuntu 18.04
Python 3.8.5
PyTorch 1.12.1
CUDA 11.6
Conda install command:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Pip install command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Have a look at this discussion thread: pytorch-cuda-arch issue. The issue seems similar.
I'm on Ubuntu 20.04 with CUDA 10.1 and cuDNN 7.6.5-32, and I'm trying to build TensorFlow 2.3 from source, but I keep getting a ValueError while running ./configure:
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 10.1.243
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.6.5.32
Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]:
Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/src/linux-headers-5.4.0-42/include/linux/,/usr/src/linux-headers-5.4.0-42/include/uapi/linux/,/usr/src/linux-headers-5.4.0-26/include/uapi/linux/,/usr/src/linux-headers-5.4.0-26/include/linux/,/usr/share/man/man3/,/usr/include/linux/,/usr/include/,/usr/lib/cuda/,/usr/include/
Traceback (most recent call last):
  File "third_party/gpus/find_cuda_config.py", line 648, in <module>
    main()
  File "third_party/gpus/find_cuda_config.py", line 640, in main
    for key, value in sorted(find_cuda_config().items()):
  File "third_party/gpus/find_cuda_config.py", line 578, in find_cuda_config
    result.update(_find_cuda_config(cuda_paths, cuda_version))
  File "third_party/gpus/find_cuda_config.py", line 252, in _find_cuda_config
    cuda_header_path, header_version = _find_header(base_paths, "cuda.h",
  File "third_party/gpus/find_cuda_config.py", line 240, in _find_header
    return _find_versioned_file(base_paths, _header_paths(), header_name,
  File "third_party/gpus/find_cuda_config.py", line 230, in _find_versioned_file
    actual_version = get_version(file)
  File "third_party/gpus/find_cuda_config.py", line 247, in get_header_version
    version = int(_get_header_version(path, "CUDA_VERSION"))
ValueError: invalid literal for int() with base 10: ''
Asking for detailed CUDA configuration...
I ran this command to find the base paths:
$ locate cuda.h
/snap/gnome-3-34-1804/24/usr/include/linux/cuda.h
/snap/gnome-3-34-1804/36/usr/include/linux/cuda.h
/usr/include/cuda.h
/usr/include/linux/cuda.h
/usr/share/man/man3/cuda.h.3.gz
/usr/src/linux-headers-5.4.0-26/include/linux/cuda.h
/usr/src/linux-headers-5.4.0-26/include/uapi/linux/cuda.h
/usr/src/linux-headers-5.4.0-42/include/linux/cuda.h
/usr/src/linux-headers-5.4.0-42/include/uapi/linux/cuda.h
And here is my CUDA installation path:
$ whereis cuda
cuda: /usr/lib/cuda /usr/include/cuda.h
And here are my NVIDIA and CUDA versions:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
And here's my driver:
$ nvidia-smi
Sun Aug 2 01:39:54 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 00000000:01:00.0 On | N/A |
| 27% 41C P0 1W / 38W | 171MiB / 1997MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1049 G /usr/lib/xorg/Xorg 20MiB |
| 0 1859 G /usr/lib/xorg/Xorg 44MiB |
| 0 2058 G /usr/bin/gnome-shell 94MiB |
+-----------------------------------------------------------------------------+
Building Tensorflow with NVIDIA GPU support (or any CUDA project) from source requires that you have a full CUDA toolkit installed (which implies all of the necessary dependencies which CUDA requires). Note that the conda distributed "cudatoolkit" package is not a full CUDA toolkit and cannot be used to build code.
You do not have a CUDA toolkit installed. Therefore you cannot build Tensorflow.
Install one.
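For what it's worth, the traceback itself is explainable: find_cuda_config.py greps each candidate cuda.h for the CUDA_VERSION macro, and the base-path list in the question starts with the Linux *kernel* headers, whose cuda.h is an unrelated file that defines no such macro. The sketch below is a simplified re-implementation of that lookup, not TensorFlow's actual code, and the header snippets are illustrative stand-ins:

```python
# Why configure dies with int(''): the version string extracted from the
# wrong cuda.h is empty, and int('') raises ValueError.
import re

TOOLKIT_CUDA_H = "#define CUDA_VERSION 10010\n"   # a real toolkit header defines this
KERNEL_CUDA_H = "/* uapi/linux/cuda.h: unrelated to the CUDA toolkit */\n"

def get_header_version(text, macro="CUDA_VERSION"):
    """Return the macro's value as a string, or '' if it is absent."""
    m = re.search(r"#define\s+%s\s+(\d+)" % macro, text)
    return m.group(1) if m else ""

print(int(get_header_version(TOOLKIT_CUDA_H)))  # 10010

try:
    int(get_header_version(KERNEL_CUDA_H))
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: ''
```

So even with a toolkit installed, listing kernel-header directories in the base paths can steer the script to the wrong cuda.h; the base paths should point at the toolkit (e.g. /usr/local/cuda) only.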
I am trying to use tensorflow-gpu in a Jupyter notebook inside a Docker container running on my Ubuntu 18.04 Bionic Beaver server.
I have done the following steps:
1) Installed NVIDIA drivers 390.67: sudo apt-get install nvidia-driver-390
2) Installed CUDA drivers 9.0: cuda_9.0.176_384.81_linux.run
3) Installed cuDNN 7.0.5: cudnn-9.0-linux-x64-v7.tgz
4) Installed Docker: sudo apt install docker-ce
5) Installed nvidia-docker2: sudo apt install nvidia-docker2
I attempt to do the following
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3
The reason I am using TensorFlow 1.5.1 is that I was getting this same kernel-dead error on 1.8.0-gpu-py, and I read that you need TensorFlow 1.5 for older CPUs. I don't think that's really the issue, though, since I'm simply trying to import it, and I'm using tensorflow-gpu.
When I run any cell that imports TensorFlow for the first time, the kernel dies.
My server hardware is as follows
CPU: AMD Phenom(tm) II X4 965 Processor
GPU: GeForce GTX 760
Motherboard: ASRock 960GM/U3S3 FX
Memory: G Skill F3-1600C9D-8GAB (8 GB Memory)
How can I determine why the kernel is dying when I simply import TensorFlow using import tensorflow as tf?
Here is the result of nvidia-smi run through Docker:
$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Fri Jun 22 17:53:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67 Driver Version: 390.67 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 760 Off | 00000000:01:00.0 N/A | N/A |
| 0% 34C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
This matches exactly what I see if I use nvidia-smi outside Docker.
Here is the nvcc --version result:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
If I use nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3 bash to bring up a bash prompt, and then enter a Python session via python, then when I do import tensorflow as tf I get Illegal instruction (core dumped), so it isn't working in a non-Jupyter environment either. The error still occurs even if I import numpy first and then import tensorflow as tf.
It turns out I needed to downgrade to TensorFlow 1.5.0; 1.5.1 is where AVX was added. AVX instructions are apparently used at module load to set up the library.
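A quick way to confirm this diagnosis on a given box is to check whether the CPU advertises AVX at all. This is a Linux-only sketch (it reads /proc/cpuinfo; the function name is mine); a Phenom II predates AVX, so it would print False there:

```python
# Check /proc/cpuinfo for the 'avx' CPU flag. A TensorFlow build that
# uses AVX instructions will die with "Illegal instruction" on CPUs
# where this is False.
def cpu_has_avx(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            return any("avx" in line.split()
                       for line in f if line.startswith("flags"))
    except OSError:
        return False  # not Linux, or /proc unavailable

print(cpu_has_avx())
```

If this prints False, any TensorFlow binary built with AVX (1.5.1 and later official builds) will crash on import, and a pre-AVX build or a from-source build is needed.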
I'm having trouble getting Theano to use the GPU on my machine.
When I run:
/usr/local/lib/python2.7/dist-packages/theano/misc$ THEANO_FLAGS=floatX=float32,device=gpu python check_blas.py
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)
I've also checked that the NVIDIA driver is installed with: lspci -vnn | grep -i VGA -A 12
with result: Kernel driver in use: nvidia
However, when I run: nvidia-smi
result: NVIDIA: could not open the device file /dev/nvidiactl (No such file or directory).
NVIDIA-SMI has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.
and /dev/nvidiactl doesn't exist. What's going on?
UPDATE: nvidia-smi now works, with this result:
+------------------------------------------------------+
| NVIDIA-SMI 4.304... Driver Version: 304.116 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 | 0000:00:03.0 N/A | N/A |
| N/A 39C N/A N/A / N/A | 0% 10MB / 4095MB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
and after compiling the NVIDIA_CUDA-6.0_Samples then running deviceQuery I get result:
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
CUDA GPUs in a linux system are not usable until certain "device files" have been properly established.
There is a note to this effect in the documentation.
In general there are several ways these device files can be established:
If an X-server is running.
If a GPU activity is initiated as root user (such as running nvidia-smi, or any CUDA app.)
Via startup scripts (refer to the documentation linked above for an example).
If none of these steps are taken, the GPUs will not be functional for non-root users. Note that the files do not persist through re-boots, and must be re-established on each boot cycle, through one of the 3 above methods. If you use method 2, and reboot, the GPUs will not be available until you use method 2 again.
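A quick way to see which state a machine is in is to list the device files directly (a small sketch; the function name is mine):

```python
# List the NVIDIA device files. If /dev/nvidiactl and /dev/nvidia0,
# /dev/nvidia1, ... are missing, CUDA apps run as a non-root user will
# fail until one of the three methods above re-creates them.
import glob

def nvidia_device_files():
    return sorted(glob.glob("/dev/nvidia*"))

files = nvidia_device_files()
if files:
    print("device files present:", files)
else:
    print("no /dev/nvidia* files; start X, run a CUDA app as root, "
          "or use a startup script")
```

Because the files do not survive a reboot, running this check after each boot quickly tells you whether the setup step needs to be repeated.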
I suggest reading the linux getting started guide entirely (linked above), if you are having trouble setting up a linux system for CUDA GPU usage.
If you are using CUDA 7.5, make sure to follow the official instructions:
CUDA 7.5 doesn't support the default g++ version. Install a supported version and make it the default.
sudo apt-get install g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 20
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 10
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 30
sudo update-alternatives --set cc /usr/bin/gcc
sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
sudo update-alternatives --set c++ /usr/bin/g++
If the Theano GPU test code fails with this error:
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu:
libcublas.so.7.5: cannot open shared object file: No such file or
directory WARNING (theano.sandbox.cuda): CUDA is installed, but
device gpu is not available (error: cuda unavilable)
just use the ldconfig command to register the CUDA 7.5 shared objects:
sudo ldconfig /usr/local/cuda-7.5/lib64
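To verify the fix took, the dynamic-linker cache can be inspected without rerunning Theano (a sketch; `ldconfig -p` needs no root, and the path assumes the default /usr/local/cuda-7.5 install location from above):

```shell
# ldconfig -p lists every library in the linker cache; grep for the one
# the error message complained about. If it is still missing, the
# sudo ldconfig step above did not register the CUDA lib64 directory.
ldconfig -p 2>/dev/null | grep libcublas || echo "libcublas still not registered"
```

Seeing a line like `libcublas.so.7.5 (libc6,x86-64) => /usr/local/cuda-7.5/lib64/libcublas.so.7.5` means Theano should now be able to load it.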
I've wasted a lot of hours trying to get AWS G2 to work on Ubuntu, but failed with exactly the error you got. Currently I'm running Theano with the GPU smoothly using this Red Hat AMI. To install Theano on Red Hat, follow the "Installing Theano in CentOS" process in the Theano documentation.
Had the same problem. I reinstalled CUDA, and at the end the installer said I had to update PATH to include /usr/local/cuda7.0/bin and LD_LIBRARY_PATH to include /usr/local/cuda7.0/lib64. PATH (add LD_LIBRARY_PATH in the same file) can be found in /etc/environment. Then Theano found the GPU. A basic error on my part...
I got
-> CUDA driver version is insufficient for CUDA runtime version
and my problem was related to the selected GPU mode.
In other words: when you select the integrated Intel GPU (with the nvidia-settings utility, under the "PRIME Profiles" configuration) and execute the deviceQuery script, you get this error. But the error is misleading; selecting the NVIDIA (Performance mode) profile again with nvidia-settings makes the problem disappear. This is not a version problem.
Regards
P.S.: The selection is only available when the PRIME-related packages are installed. Further details: https://askubuntu.com/questions/858030/nvidia-prime-in-nvidia-x-server-settings-in-16-04-1