Using Theano with GPU on Ubuntu 14.04 on AWS g2 - python

I'm having trouble getting Theano to use the GPU on my machine.
When I run:
/usr/local/lib/python2.7/dist-packages/theano/misc$ THEANO_FLAGS=floatX=float32,device=gpu python check_blas.py
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)
I've also checked that the NVIDIA driver is installed with: lspci -vnn | grep -i VGA -A 12
with result: Kernel driver in use: nvidia
However, when I run: nvidia-smi
result: NVIDIA: could not open the device file /dev/nvidiactl (No such file or directory).
NVIDIA-SMI has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.
and /dev/nvidiactl doesn't exist. What's going on?
UPDATE: nvidia-smi now works, with this result:
+------------------------------------------------------+
| NVIDIA-SMI 4.304... Driver Version: 304.116 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 | 0000:00:03.0 N/A | N/A |
| N/A 39C N/A N/A / N/A | 0% 10MB / 4095MB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
and after compiling the NVIDIA_CUDA-6.0_Samples and running deviceQuery, I get this result:
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

CUDA GPUs in a Linux system are not usable until certain "device files" have been properly established.
There is a note to this effect in the documentation.
In general, there are several ways these device files can be established:
1. If an X server is running.
2. If GPU activity is initiated as the root user (such as running nvidia-smi, or any CUDA app).
3. Via startup scripts (refer to the documentation linked above for an example; a sketch follows below).
If none of these steps is taken, the GPUs will not be functional for non-root users. Note that the device files do not persist through reboots and must be re-established on each boot cycle through one of the three methods above. If you use method 2 and then reboot, the GPUs will not be available until you use method 2 again.
I suggest reading the Linux getting started guide in its entirety (linked above) if you are having trouble setting up a Linux system for CUDA GPU usage.
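For method 3, here is a minimal sketch of the kind of boot script the getting started guide describes (an illustration under the assumption of a standard driver install, not a verbatim copy of the guide's script):
#!/bin/bash
# Load the NVIDIA kernel module, then create the device files the driver needs.
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # One device file per NVIDIA controller found on the PCI bus...
  N=$(lspci | grep -i NVIDIA | grep -cE "3D|VGA")
  for i in $(seq 0 $((N - 1))); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  # ...plus the control device that nvidia-smi complained about above.
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi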

If you are using CUDA 7.5, make sure to follow the official instructions:
CUDA 7.5 doesn't support the default g++ version. Install a supported version and make it the default.
sudo apt-get install g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 20
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 10
sudo update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 30
sudo update-alternatives --set cc /usr/bin/gcc
sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
sudo update-alternatives --set c++ /usr/bin/g++
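A quick way to confirm the switch took effect (both should now report 4.9.x; the exact point release will vary by system):
gcc --version
g++ --version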
If the Theano GPU test code fails with this error:
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu:
libcublas.so.7.5: cannot open shared object file: No such file or
directory WARNING (theano.sandbox.cuda): CUDA is installed, but
device gpu is not available (error: cuda unavilable)
just use the ldconfig command to link the CUDA 7.5 shared objects:
sudo ldconfig /usr/local/cuda-7.5/lib64
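To make this survive a reboot, one option (my addition, not part of the original answer) is to register the path permanently instead of passing it on the command line:
echo "/usr/local/cuda-7.5/lib64" | sudo tee /etc/ld.so.conf.d/cuda-7.5.conf
sudo ldconfig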

I wasted many hours trying to get AWS G2 to work on Ubuntu and kept failing with exactly the error you got. I'm currently running Theano with the GPU smoothly on a Red Hat AMI. To install Theano on Red Hat, follow the "Installing Theano on CentOS" process in the Theano documentation.

Had the same problem and reinstalled CUDA; at the end, the installer says you have to update PATH to include /usr/local/cuda-7.0/bin and LD_LIBRARY_PATH to include /usr/local/cuda-7.0/lib64. PATH can be found in /etc/environment (add LD_LIBRARY_PATH in the same file). Then Theano found the GPU. Basic error on my part...
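For illustration, the two entries in /etc/environment would look something like this (a hypothetical example; the existing PATH value on your system will differ):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda-7.0/bin"
LD_LIBRARY_PATH="/usr/local/cuda-7.0/lib64"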

I got
-> CUDA driver version is insufficient for CUDA runtime version
and my problem was related to the selected GPU mode (Performance/Power Saving Mode): when you select the integrated Intel GPU (with the nvidia-settings utility, in the "PRIME Profiles" configuration) and then execute the deviceQuery script, you get this error.
The error is misleading, though: selecting the NVIDIA GPU (Performance mode) again with the nvidia-settings utility makes the problem disappear.
This is not a version problem.
Regards
P.S.: The selection is only available when the PRIME-related packages are installed. Further details: https://askubuntu.com/questions/858030/nvidia-prime-in-nvidia-x-server-settings-in-16-04-1
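The same switch can also be made from the command line with the prime-select tool that ships with Ubuntu's PRIME packages (a convenience I'm adding, not part of the original answer):
prime-select query          # show which GPU profile is currently selected
sudo prime-select nvidia    # switch to the NVIDIA GPU (Performance mode)
A log-out/log-in or reboot is typically needed for the switch to take effect.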


How do I make torch.cuda.is_available() be True? [duplicate]

I'm trying to run Pytorch on a laptop that I have. It's an older model but it does have an Nvidia graphics card. I realize it is probably not going to be sufficient for real machine learning but I am trying to do it so I can learn the process of getting CUDA installed.
I have followed the steps on the installation guide for Ubuntu 18.04 (my specific distribution is Xubuntu).
My graphics card is a GeForce 845M, verified by lspci | grep nvidia:
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce 845M] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
I also have gcc 7.5 installed, verified by gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
And I have the correct headers installed, verified by trying to install them with sudo apt-get install linux-headers-$(uname -r):
Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-headers-4.15.0-106-generic is already the newest version (4.15.0-106.107).
I then followed the installation instructions using a local .deb for version 10.1.
Now, when I run nvidia-smi, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 845M On | 00000000:01:00.0 Off | N/A |
| N/A 40C P0 N/A / N/A | 88MiB / 2004MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 982 G /usr/lib/xorg/Xorg 87MiB |
+-----------------------------------------------------------------------------+
and when I run nvcc -V I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
I then performed the post-installation instructions from section 6.1, and so as a result, echo $PATH looks like this:
/home/isaek/anaconda3/envs/stylegan2_pytorch/bin:/home/isaek/anaconda3/bin:/home/isaek/anaconda3/condabin:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
echo $LD_LIBRARY_PATH looks like this:
/usr/local/cuda-10.1/lib64
and my /etc/udev/rules.d/40-vm-hotadd.rules file looks like this:
# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"
LABEL="vm_hotadd_apply"
# Memory hotadd request
# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"
LABEL="vm_hotadd_end"
After all of this, I even compiled and ran the samples. ./deviceQuery returns:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 845M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2004 MBytes (2101870592 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 863 MHz (0.86 GHz)
Memory Clock rate: 1001 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
and ./bandwidthTest returns:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce 845M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.7
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 14.5
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
But after all of this, this Python snippet (in a conda environment with all dependencies installed):
import torch
torch.cuda.is_available()
returns False
Does anybody have any idea about how to resolve this? I've tried to add /usr/local/cuda-10.1/bin to /etc/environment like this:
PATH=$PATH:/usr/local/cuda-10.1/bin
And restarting the terminal, but that didn't fix it. I really don't know what else to try.
EDIT - Results of collect_env for @kHarshit
Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce 845M
Nvidia driver version: 418.87.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.18.5
[pip] pytorch-ranger==0.1.1
[pip] stylegan2-pytorch==0.12.0
[pip] torch==1.5.0
[pip] torch-optimizer==0.0.1a12
[pip] torchvision==0.6.0
[pip] vector-quantize-pytorch==0.0.2
[conda] numpy 1.18.5 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] stylegan2-pytorch 0.12.0 pypi_0 pypi
[conda] torch 1.5.0 pypi_0 pypi
[conda] torch-optimizer 0.0.1a12 pypi_0 pypi
[conda] torchvision 0.6.0 pypi_0 pypi
[conda] vector-quantize-pytorch 0.0.2 pypi_0 pypi
PyTorch doesn't use the system's CUDA library. When you install the precompiled binaries using either pip or conda, PyTorch ships with its own copy of the specified CUDA version, installed locally. In fact, you don't even need to install CUDA on your system to use PyTorch with CUDA support.
There are two scenarios which could have caused your issue; a quick check to tell them apart follows below.
You installed the CPU-only version of PyTorch. In this case PyTorch wasn't compiled with CUDA support at all.
You installed the CUDA 10.2 version of PyTorch. In this case the problem is that your graphics card currently uses the 418.87 drivers, which only support up to CUDA 10.1. The two potential fixes in this case would be to either install updated drivers (version >= 440.33 according to Table 2) or to install a version of PyTorch compiled against CUDA 10.1.
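A quick check to tell the two scenarios apart (torch.version.cuda is None on CPU-only builds; otherwise it reports the CUDA version the binaries were compiled against):
python3 -c 'import torch; print(torch.__version__, torch.version.cuda)'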
To determine the appropriate command to use when installing PyTorch you can use the handy widget in the "Install PyTorch" section at pytorch.org. Just select the appropriate operating system, package manager, and CUDA version then run the recommended command.
In your case one solution was to use
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
which explicitly specifies to conda that you want to install the version of PyTorch compiled against CUDA 10.1.
For more information about PyTorch CUDA compatibility with respect to drivers and hardware, see this answer.
Edit: After you added the output of collect_env, we can see that the problem was that you had the CUDA 10.2 version of PyTorch installed. Based on that, an alternative solution would have been to update the graphics driver as elaborated in item 2 and the linked answer.
TL;DR
Install the NVIDIA CUDA Toolkit provided by Canonical or by the NVIDIA third-party PPA.
Reboot your workstation.
Create a clean Python virtual environment (or reinstall all CUDA dependent packages).
Description
First install NVIDIA CUDA Toolkit provided by Canonical:
sudo apt install -y nvidia-cuda-toolkit
or follow NVIDIA developers instructions:
# ENVARS ADDED **ONLY FOR READABILITY**
NVIDIA_CUDA_PPA=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/
NVIDIA_CUDA_PREFERENCES=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
NVIDIA_CUDA_PUBKEY=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
# Add NVIDIA Developers 3rd-Party PPA
sudo wget ${NVIDIA_CUDA_PREFERENCES} -O /etc/apt/preferences.d/nvidia-cuda
sudo apt-key adv --fetch-keys ${NVIDIA_CUDA_PUBKEY}
echo "deb ${NVIDIA_CUDA_PPA} /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
# Install development tools
sudo apt update
sudo apt install -y cuda
then reboot the OS to load the kernel with the NVIDIA drivers.
Create an environment using your favorite manager (conda, venv, etc.):
conda create -n stack-overflow pytorch torchvision
conda activate stack-overflow
or reinstall pytorch and torchvision into the existing one:
conda activate stack-overflow
conda install --force-reinstall pytorch torchvision
otherwise NVIDIA CUDA C/C++ bindings may not be correctly detected.
Finally ensure CUDA is correctly detected:
(stack-overflow)$ python3 -c 'import torch; print(torch.cuda.is_available())'
True
Versions
NVIDIA CUDA Toolkit v11.6
Ubuntu LTS 20.04.x
Ubuntu LTS 22.04 (prior to its official release)
In my case, just restarting my machine made the GPU active again. The initial message I got was that the GPU was currently in use by another application, but when I looked at nvidia-smi there was nothing to see. So, with no changes to dependencies, it just started working again.
Another possible scenario is that the environment variable CUDA_VISIBLE_DEVICES is not set correctly before installing PyTorch.
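For example, you can inspect and set the variable in the shell before installing or launching anything (a hedged illustration; an empty or wrong value here hides the GPUs from CUDA):
echo $CUDA_VISIBLE_DEVICES     # inspect the current value
export CUDA_VISIBLE_DEVICES=0  # expose only GPU 0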
In my case it worked to do as follows:
remove the CUDA drivers
sudo apt-get remove --purge nvidia*
Then get the exact installation script of the drivers based on your distro and system from the link: https://developer.nvidia.com/cuda-downloads?target_os=Linux
In my case it was Debian on x64, so I did:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
And now nvidia-smi works as intended!
I hope that helps
If your CUDA version does not match what PyTorch expects, you will see this issue.
On Arch / Manjaro:
Get Pytorch from here: https://pytorch.org/get-started/locally/
Note what CUDA version you are getting PyTorch for
Get the same CUDA version from here: https://archive.archlinux.org/packages/c/cuda/
Install CUDA using (e.g.) sudo pacman -U --noconfirm cuda-11.6.2-1-x86_64.pkg.tar.zst
Do not update to a newer version of CUDA than PyTorch expects. If PyTorch wants 11.6 and you have updated to 11.7, you will get the error message.
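One way to hold CUDA at the version PyTorch expects (my suggestion, not from the original answer) is to pin the package in /etc/pacman.conf so a routine system upgrade doesn't replace it:
# In /etc/pacman.conf, under the [options] section:
IgnorePkg = cuda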
Make sure that os.environ['CUDA_VISIBLE_DEVICES'] = '0' is set after if __name__ == "__main__":. So your code should look like this:
import torch
import os

if __name__ == "__main__":
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    print(torch.cuda.is_available())  # True
    ...

Deep Learning on Nvidia 1070 Ti Ubuntu 18.04

I'm pulling my hair out at this point; I've spent a ton of time trying different things to get my card working with TensorFlow.
My latest attempt (which has problems similar to the earlier ones) was installing the TensorFlow Docker image:
https://hub.docker.com/r/tensorflow/tensorflow/
I installed nvidia-docker and ran nvidia-smi, and it seemed to report that my GPU exists.
Then I ran this command
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
After downloading and starting up, I try running the notebooks (first the hello tensorflow notebook).
As soon as I try to import TensorFlow (just using the default, unmodified notebook) I get a kernel restart:
KernelRestarter: restarting kernel (1/5), keep random ports
I'm not really sure what the next best step is; I don't know how to troubleshoot Docker containers, let alone the Jupyter notebook running inside one.
I've had similar issues previously while attempting to run locally without a docker container.
Any suggestions on what a good next step is? I spent more than I cared to on this card and am out of ideas on how to get it to work.
(I believe I could import TensorFlow locally on my machine with tensorflow-gpu installed; however, when I got to a conv2d section I would get could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED, if I'm recalling correctly, but it's been a few hectic days.)
Edit: Yes to CUDA and cuDNN, and I apt-get installed nvidia-390. It seemed like a good test was nvidia-smi, which works. I just finished compiling TF from scratch and it is still failing (in this case importing TF doesn't fail, but I get the same 'not initialized' error; it also mentioned something about perhaps not having the right NVIDIA version, and called out nvidia-390.77, I think).
I'm considering a fresh 18.04 install with an earlier nvidia-3xx driver version; attempting to 'downgrade' resulted in a broken apt and multiple days of trying to fix it.
EDIT2:
I also came to the realization that I installed CUDA 9.0, but the cuDNN 7.1 build for CUDA 9.1 (you can download that combination from NVIDIA, whatever that means).
I'm attempting to revert, but I am having plenty of trouble backing out; I'm extremely close to just wiping and re-installing Ubuntu and going from there. I have all the commands and think it might be easier, but I'm not sure if that will solve it (e.g. cudnn-9.0-linux-x64-v7.1).
EDIT3:
Came back to respond to this. I wrote up a gist of what I had to do to get my GPU working in Ubuntu 16.04 on my main machine; however, I didn't test it in Docker. Here is the gist:
https://gist.github.com/onaclov2000/c22fe1456ffa7da6cebd67600003dffb
Copy pasting here:
# 1070 Ti
Fresh Install 16.04
(download updates, and include 3rd party)
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-384
# Contents
sudo bash -c 'cat >> /etc/modprobe.d/blacklist-nouveau.conf << 'EOF'
blacklist nouveau
options nouveau modeset=0
EOF'
sudo update-initramfs -u
sudo reboot
# Takes about 30-40 minutes 1.5GB approx
wget https://developer.download.nvidia.com/compute/cuda/9.0/secure/Prod/local_installers/cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run
No to install nvidia accelerated Graphics Driver for Linux
yes to Cuda 9.0 toolkit
default
yes to symbolic link
yes to samples
default location is fine
#Alternately (need to test)
#sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --samples
cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery # Assuming make was successful
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest # Assuming make was successful
# Look for Result = PASS
sudo apt-get install nvidia-cuda-toolkit
# Couldn't find on 16.04 maybe this is a 18.04 upgrade?
#sudo apt-get install cuda-toolkit-9.0 cuda-command-line-tools-9-0
# At this point the driver and CUDA are installed, now it's time to install the CUDNN driver/piece.
#This is the link that I have, be sure to use v7 not v7.1 as I haven't had luck in the past with that (though it might work).
# https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7
# 333 MB so will take a bit
cd ~/Downloads
tar -xvf cudnn-9.0-linux-x64-v7.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
sudo apt-get install git tmux
cd ~/Downloads
# At this point I'm going to install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-4.3.1-Linux-x86_64.sh -O anaconda-install.sh
bash anaconda-install.sh # Follow Prompts adding path to bash
source ~/.bashrc
conda create --name ml
source activate ml
pip install tensorflow-gpu==1.5
# test the install
cd ~
mkdir projects
cd projects
git clone https://github.com/tensorflow/models
# Additional notes
Run a sample from the cuda samples folder
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
Output:
Plenty but ends with the following
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS
# This tells you which cuDNN is installed
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
Outputs:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
# This tells you which CUDA toolkit (nvcc) version is installed
nvcc --version
Outputs:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Finally, I updated to 18.04 but haven't chased all this down again, so I will add an 18.04 version to the gist above as I move forward.

Jupyter Notebook Kernel dies when importing Tensorflow

I am trying to use tensorflow-gpu in a Jupyter notebook inside a Docker container running on my Ubuntu 18.04 Bionic Beaver server.
I have done the following steps:
1) Installed Nvidia Drivers 390.67 sudo apt-get install nvidia-driver-390
2) Installed CUDA Drivers 9.0 cuda_9.0.176_384.81_linux.run
3) Installed CuDNN 7.0.5 cudnn-9.0-linux-x64-v7.tgz
4) Installed Docker sudo apt install docker-ce
5) Installed nvidia-docker2 sudo apt install nvidia-docker2
I attempt to do the following
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3
The reason I am using TensorFlow 1.5.1 is that I was getting the same kernel-dead error on 1.8.0-gpu-py, and I read that you need to use TensorFlow 1.5 for older CPUs. I don't think that is really the issue, though, since I'm simply trying to import it and I'm using tensorflow-gpu.
When I run any cell that imports TensorFlow for the first time, the kernel dies.
My server hardware is as follows
CPU: AMD Phenom(tm) II X4 965 Processor
GPU: GeForce GTX 760
Motherboard: ASRock 960GM/U3S3 FX
Memory: G Skill F3-1600C9D-8GAB (8 GB Memory)
How can I determine why the kernel is dying when I simply import TensorFlow with import tensorflow as tf?
Here is the result of running nvidia-smi via the NVIDIA Docker runtime:
$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Fri Jun 22 17:53:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67 Driver Version: 390.67 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 760 Off | 00000000:01:00.0 N/A | N/A |
| 0% 34C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
This matches exactly what I get if I run nvidia-smi outside Docker.
Here is the nvcc --version result:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
If I do nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:1.5.1-gpu-py3 bash to bring up a bash prompt and then enter a Python session via python, then import tensorflow as tf gives Illegal instruction (core dumped), so it isn't working in a non-Jupyter environment either. This error still occurs even if I import numpy first and then import tensorflow as tf.
It turns out I needed to downgrade to TensorFlow 1.5.0; 1.5.1 is where AVX was added. AVX instructions are apparently used on module load to set up the library.
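If you want to check whether your CPU supports AVX at all (a quick diagnostic I'm adding, not part of the original answer), grep the CPU flags; no output means no AVX, and an AVX-built TensorFlow will then die with "Illegal instruction":
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u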

Problems installing TensorFlow from source on Ubuntu 16 with GPU

I have Ubuntu 16.04.
I have installed CUDA 7.5 from the Ubuntu repo and cuDNN 5.1.3 for CUDA 7.5, and I ran the CUDA examples, which work; the same goes for PyCUDA.
I want to install TensorFlow from source with GPU support, but the TensorFlow configuration is stuck on nvvm, and I can't find it on the system; find isn't any help either:
$ find / -name nvvm*
/usr/include/nvvm.h
/usr/share/doc/nvidia-cuda-doc/html/nvvm-ir-spec
Where can I find nvvm?
Your find command is wrong. You need to quote the -name argument to prevent shell globbing:
$ find / -name 'nvvm*'
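The difference matters because, unquoted, the shell expands the glob against the current directory before find ever runs. For instance (a hypothetical illustration), if you happen to be in /usr/include, which contains nvvm.h:
find / -name nvvm*      # the shell rewrites this to: find / -name nvvm.h
find / -name 'nvvm*'    # the pattern reaches find intact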

Unknown CUDA error when importing theano

In python, after importing theano, I get the following:
In [1]: import theano
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available
(error: Unable to get the number of gpus available: unknown error)
I'm running this on ubuntu 14.04 and I have an old gpu: GeForce GTX280
And my nvidia driver:
$ nvidia-smi
Wed Jul 13 21:25:58 2016
+------------------------------------------------------+
| NVIDIA-SMI 340.96 Driver Version: 340.96 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 280 Off | 0000:02:00.0 N/A | N/A |
| 40% 65C P0 N/A / N/A | 638MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I'm not sure why it's saying 'Not Supported', but it seems as though that might not be an issue, as discussed here.
Also, the CUDA version:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
Any help I can get would be awesome. I've been at this all day...
I feel your pain. I spent a few days ploughing through all the CUDA-related errors.
Firstly, update to a more recent driver, e.g. 361 (CLEAN INSTALL IT!). Then completely wipe CUDA and cuDNN from your hard drive with
sudo rm -rf /usr/local/cuda
or wherever else you installed them, then install CUDA 7.5 (seriously, this specific version) and cuDNN v4 (again, this specific version).
You can run the following commands to settle CUDA.
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
bash cuda_7.5.18_linux.run --override
Follow the instructions, and say NO when it asks you to install the 350 driver. You should then be set.
For cuDNN there's no direct link to wget, so you have to get the installer from https://developer.nvidia.com/cudnn and run the following commands:
tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include
sudo cp -r cuda/lib64/. /usr/local/cuda-7.5/lib64
echo -e 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/lib64"\nexport CUDA_HOME=/usr/local/cuda-7.5' >> ~/.bash_profile
source ~/.bash_profile
Now to handle Theano on GPU:
nano ~/.theanorc
add these lines:
[global]
floatX = float32
device = gpu0
If you get an nvcc error, make it this instead:
[global]
floatX = float32
device = gpu0
[nvcc]
flags=-D_FORCE_INLINES
I had the same issue and was able to solve it by doing two things:
Installing gcc-5 and linking /usr/bin/gcc to /usr/bin/gcc-5, as well as /usr/bin/g++ to /usr/bin/g++-5 (PS: I am using CUDA 8). A sketch follows below.
Adding the flag flags=-D_FORCE_INLINES to the file ~/.theanorc under [nvcc], since apparently a bug in glibc 2.23 causes this issue.
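A sketch of step 1 using update-alternatives, in the same style as the CUDA 7.5 answer further up (this assumes the gcc-5/g++-5 packages; adapt the version numbers to your setup):
sudo apt-get install gcc-5 g++-5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50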
