Tensorflow GPU / CUDA installation on Ubuntu


I have set up Ubuntu 18.04 and tried to get TensorFlow 2.2 with GPU support working with Python (I have an NVIDIA/CUDA graphics card).
Even after following the documentation at https://www.tensorflow.org/install/gpu#linux_setup, it failed (see below for details about how it failed).
Question: is there a canonical "todo" list (starting point: a freshly installed Ubuntu server) for installing tensorflow-gpu and getting it to work in a few steps?
Notes:
I have read many similar forum posts, and I think a canonical "todo" list (from a fresh Ubuntu install to a working tensorflow-gpu), with a few steps/bash commands, would be useful.
The documentation I used involved:
export LD_LIBRARY_PATH...
# Add NVIDIA package repository
sudo apt-key adv --fetch-keys http://developer.download...
...
# Install CUDA and tools. Include optional NCCL 2.x
sudo apt install cuda9.0 cuda...
Even after a lot of trial and error (I am not pasting all the different errors here, it would be too long), in the end:
import tensorflow
always failed. One of the errors was ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory. I have already read the relevant question here, and this very long (!) GitHub issue.
After some more trial and error, import tensorflow works, but it doesn't use the GPU (see also Tensorflow not running on GPU).

Well, I was facing the same problem. The first thing to do is to look up which CUDA and cuDNN versions your TensorFlow version requires. In your case, TensorFlow 2.2 requires CUDA 10.1, and the matching cuDNN version is 7.6 according to the tested build configurations table. The installed Python version matters too: TensorFlow 2.2 supports Python 3.5-3.8. If any of these mismatch, getting a working setup is almost impossible.
So if you want a checklist, here you go:
Install CUDA 10.1, for example by installing nvidia-cuda-toolkit, and check with nvcc --version that you really got 10.1 (the version packaged by your Ubuntu release may differ).
Install the cuDNN version compatible with CUDA 10.1.
Export CUDA environment variables.
If you build TensorFlow from source and Bazel is not installed, you will be asked to install it.
Install TensorFlow 2.2 using pip. I highly recommend using a virtual environment.
You can find the compatibility check list of Tensorflow and CUDA here
You can find the CUDA Toolkit here
Finally get cuDNN in the correct version here
That's all.
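The version matching from the checklist above can be written down as a small lookup. This is only a sketch: the two rows are transcribed from the tested build configurations table at https://www.tensorflow.org/install/source#gpu (please verify them there), and the names TESTED_CONFIGS and check_stack are made up for this example.

```python
import sys

# Two rows copied from the tested build configurations table
# (https://www.tensorflow.org/install/source#gpu); verify before relying on them.
TESTED_CONFIGS = {
    "2.2": {"cuda": "10.1", "cudnn": "7.6", "python": ((3, 5), (3, 8))},
    "2.4": {"cuda": "11.0", "cudnn": "8.0", "python": ((3, 6), (3, 8))},
}


def check_stack(tf_version, cuda, cudnn, python):
    """Return a list of mismatches; an empty list means the stack matches the table."""
    expected = TESTED_CONFIGS.get(tf_version)
    if expected is None:
        return ["no table entry for TensorFlow " + tf_version]
    problems = []
    if cuda != expected["cuda"]:
        problems.append("CUDA {} found, {} expected".format(cuda, expected["cuda"]))
    # Compare only major.minor of cuDNN, so "7.6.5" matches "7.6".
    if cudnn.split(".")[:2] != expected["cudnn"].split("."):
        problems.append("cuDNN {} found, {} expected".format(cudnn, expected["cudnn"]))
    low, high = expected["python"]
    if not (low <= tuple(python[:2]) <= high):
        problems.append("Python {}.{} is outside the tested range".format(*python[:2]))
    return problems


# The CUDA and cuDNN values here are placeholders; fill in what is installed.
for problem in check_stack("2.2", "10.1", "7.4.2", sys.version_info[:2]):
    print(problem)
```

An empty result only means the combination appears in the table, not that the libraries are actually found at runtime.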

I faced the problem as well when using the Google Cloud Platform for two deep learning projects. They provide servers with nothing but a freshly installed Ubuntu OS. Based on my experience, I recommend the following steps:
Look up the CUDA and cuDNN versions supported by the current TensorFlow release on the TensorFlow page.
Install the targeted CUDA version from the deb package available on NVIDIA's CUDA page, and be careful: more recent CUDA versions might not work! This will automatically install the corresponding NVIDIA drivers.
Install the targeted cuDNN version from this page, and again be careful: a more recent cuDNN version might not work.
Install tensorflow-gpu using pip.
This should work. Your problem is probably that you are using a more recent CUDA version than the one targeted by the current TensorFlow release.
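To see which CUDA release is actually installed before comparing it with the targeted one, the output of nvcc --version can be parsed. A minimal sketch, assuming nvcc is on the PATH; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract the CUDA release (for example '10.1') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Return the release reported by nvcc, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


print("CUDA release reported by nvcc:", installed_cuda_release())
```

Note that nvcc reports the toolkit it belongs to, which is not necessarily the toolkit TensorFlow loads at runtime if several are installed.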

The guidelines for installing tensorflow-gpu on the official website are very tedious for beginners. Instead, we can follow these simple steps:
Note: the NVIDIA driver must be installed beforehand (you can verify this with the command nvidia-smi).
Install Anaconda: https://www.anaconda.com/distribution/?
Create a virtual environment with the command "conda create -n envname"
Then activate the environment with the command "conda activate envname"
Finally, install TensorFlow with the command "conda install tensorflow-gpu"
You can then check whether the GPU is used with the following code:
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("not using gpu")
You can find the tutorial at the link below:
https://www.pugetsystems.com/labs/hpc/Install-TensorFlow-with-GPU-Support-the-Easy-Way-on-Ubuntu-18-04-without-installing-CUDA-1170/?

I would suggest first checking that the GPU is available using the nvidia-smi command.
I faced the same issue and was able to resolve it by using a Docker container. You can install Docker by following Install Docker Engine on Ubuntu, or the DigitalOcean guide How To Install and Use Docker on Ubuntu 18.04 (I used this one).
After that it is simple: just run one of the following commands, depending on your requirements:
NV_GPU='0' nvidia-docker run --runtime=nvidia -it -v /path/to/folder:/path/to/folder/for/docker/container nvcr.io/nvidia/tensorflow:17.11
NV_GPU='0' nvidia-docker run --runtime=nvidia -it -v /storage/research/:/storage/research/ nvcr.io/nvidia/tensorflow:20.12-tf2-py3
Here '0' is the GPU number. If you want to use more than one GPU, use '0,1,2' and so on.
Hope this solves the issue.
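If you start such containers from a script, the command line above can be assembled programmatically. This sketch only builds the environment and argument list shown in the commands above and prints them; nvidia_docker_command is a made-up helper name, not part of any Docker tooling.

```python
import shlex


def nvidia_docker_command(gpu_ids, host_dir, container_dir, image):
    """Build the NV_GPU environment entry and the argv for `nvidia-docker run`."""
    env = {"NV_GPU": ",".join(str(gpu_id) for gpu_id in gpu_ids)}
    argv = [
        "nvidia-docker", "run", "--runtime=nvidia", "-it",
        "-v", "{}:{}".format(host_dir, container_dir),
        image,
    ]
    return env, argv


env, argv = nvidia_docker_command(
    [0, 1],
    "/storage/research/",
    "/storage/research/",
    "nvcr.io/nvidia/tensorflow:20.12-tf2-py3",
)
# Print the equivalent shell command instead of running it.
print("NV_GPU=" + shlex.quote(env["NV_GPU"]), " ".join(shlex.quote(a) for a in argv))
```

To actually launch the container, the list could be passed to subprocess.run(argv, env={**os.environ, **env}).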

Related

Tensorflow install on Mac

I am having a difficult time getting tensorflow to install on a MacBook Pro.
Initially, I tried pip3.8 install tensorflow in my virtual environment. It installed but gives the following error when I try to use it:
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
I get that this is a warning, but I think it's a serious warning that I am going to have performance issues with any non-trivial work.
Based on this post (Tensorflow on MacOS: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA), I uninstalled tensorflow and followed the steps. After I installed Bazel, using homebrew, I got an error when I ran ./configure:
Please downgrade your Bazel installation to version 0.26.1.
According to this: (https://github.com/bazelbuild/bazel/releases) the oldest version is 3.2. I had 3.7 installed. So I uninstalled Bazel using homebrew. This felt like a dead end even though compiling from the source seems like the correct way to go. That version it's asking for is not even remotely close to a current version. I think the message is not telling me what I need to know.
Next I tried using pip to install the version recommended by Tensorflow.org. (https://www.tensorflow.org/install/pip.html)
pip3.8 install https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-2.4.0-cp38-cp38-macosx_10_14_x86_64.whl
This successfully installed as well, but gives me similar errors as my original install and fails the test recommended under the list of installations on the tensorflow.org page.
I think I'm making a novice mistake. Can anyone assist here?

It's not a serious warning. You can either wait for macOS 12.0+ to have built-in Apple Metal acceleration, or run the Intel-optimized wheels for macOS, but those only support TensorFlow 2.0.

Why does my CUDA work for Pytorch but not for Tensorflow suddenly?

The machine I'm using has a Titan XP and runs Ubuntu 18.10. I'm not the owner, so I'm not sure how it was configured previously. The CUDA version is 9.*, most likely 9.0. There is no folder like /usr/local/cuda. Though it sounds strange (because no CUDA version is compatible with 18.10), it previously worked pretty well for both TensorFlow and PyTorch. Now, when running tensorflow-gpu v1.12.0 in Python 2.7 with cudatoolkit 9.2 and cudnn 7.2.1 (this worked well previously without any change), it reports:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
But when I change my conda env to Python 3.6 with pytorch 0.4.1, cudatoolkit 9.0 and cudnn 7.6 (as shown in PyCharm), I get:
torch.cuda.is_available() # True
This shows that the GPU is used by the PyTorch code. I've also checked GPU memory with nvidia-smi: when PyTorch is running, the memory is occupied.
Although there is no CUDA folder like /usr/local/cuda/, when I run:
nvcc -V
I get:
Cuda compilation tools, release 9.1, V9.1.85
Can someone give me a hint about how these strange things happen? What should I do to make my tensorflow-gpu work? I am totally confused orz.

Anaconda environments install their own version of the CUDA toolkit when you install things like pytorch and tensorflow-gpu with conda. That looks like how your Python 3.6 environment was set up. Is your 2.7 version of Python a system install or part of another Python environment? It's possible that your TensorFlow was built against a CUDA toolkit that is no longer installed, for whatever reason, or in any case that you were trying to use TensorFlow without the path to the libraries it was built against in your LD_LIBRARY_PATH (perhaps because of an unusual install location).
You can type which nvcc to see which part of your PATH is currently pointing to that executable. That will tell you where your CUDA toolkit is installed. I'm guessing that your PATH was still pointing to a conda environment when you last ran nvcc, or to some version of the CUDA toolkit in an unusual install location in any case.
First, I'd suggest abandoning any effort to use your system python with Tensorflow. My suggestion is to either modify or create a new conda environment and install tensorflow-gpu with conda, which will also install the CUDA toolkit for that environment. Note that your CUDA install will not be in /usr/local/cuda if you go down this path, it'll be located inside your conda environment instead.
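To see what your current shell would actually pick up, you can check where nvcc resolves and whether any directory in LD_LIBRARY_PATH contains the library from the error message. A rough sketch (the helper names are made up, and the dynamic linker also searches other places such as the ldconfig cache, so an empty result here is a hint rather than proof):

```python
import glob
import os
import shutil


def find_shared_libs(name, search_dirs):
    """Return sorted paths matching '<name>.so*' in the given directories."""
    hits = []
    for directory in search_dirs:
        hits.extend(glob.glob(os.path.join(directory, name + ".so*")))
    return sorted(hits)


def ld_library_dirs():
    """Split LD_LIBRARY_PATH into its non-empty entries."""
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


print("nvcc resolves to:", shutil.which("nvcc"))
print("libcublas candidates:", find_shared_libs("libcublas", ld_library_dirs()))
```

Running this once in the system shell and once inside the activated conda environment should show whether the two see different toolkits.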

Issue with tensorflow-mkl on CPU

I am new to TensorFlow. In fact, I am only using it because the server code I am writing calls it.
I am using conda to set up the various packages. I ran conda install -c anaconda tensorflow-mkl. (Note: I don't have a GPU, I am using a CPU.)
I always get this error:
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
The specific line of code where this happens:
tensorflow.contrib.predictor.from_saved_model(path)
On further research, I figured out that this is because the TensorFlow package I have was not built with support for the above instruction set.
Some questions:
1. How do I make sure the TensorFlow package I have supports the above instruction? Is there a source from which I can download one?
2. If it is not important, is there a way to suppress this message or any errors from it?
Thanks in advance!

You can install a CPU build of TensorFlow with conda or pip. Use one of the following commands in your terminal:
conda install tensorflow -c anaconda
or
pip install tensorflow==1.13.1
If you haven't installed pip yet, see this link:
How to install pip3 on Windows?
Hope this helps.

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
This is just a warning, not an error.
To suppress this warning, add the following lines before your actual code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
As per the official TensorFlow documentation, starting with TensorFlow 1.6, binaries use AVX instructions, which may not run on older CPUs.
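If you want to know whether your CPU supports AVX at all, on an x86 Linux machine the feature flags can be read from /proc/cpuinfo. A small sketch under that assumption (other platforms expose this differently, and the function names are hypothetical):

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags from the given file; return an empty set if it is unreadable."""
    try:
        with open(path) as handle:
            return cpu_flags(handle.read())
    except OSError:
        return set()


flags = read_cpu_flags()
for feature in ("avx", "avx2", "fma"):
    print(feature, "supported:", feature in flags)
```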
You can refer to the URL below for more details on installing Intel-optimized TensorFlow:
https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide
Hope this answers your query. Thank you.

Which versions of TensorFlow work with which versions of CUDA [duplicate]

I have noticed that some newer TensorFlow versions are incompatible with older CUDA and cuDNN versions. Does an overview of the compatible versions or even a list of officially tested combinations exist? I can't find it in the TensorFlow documentation.

TL;DR) See this table: https://www.tensorflow.org/install/source#gpu
Generally:
Check the CUDA version:
cat /usr/local/cuda/version.txt
and cuDNN version:
grep CUDNN_MAJOR -A 2 /usr/local/cuda/include/cudnn.h
and install a combination as given below in the images or here.
The following images and the link provide an overview of the officially supported/tested combinations of CUDA and TensorFlow on Linux, macOS and Windows:
Minor configurations:
Since the given specifications below in some cases might be too broad, here is one specific configuration that works:
tensorflow-gpu==1.12.0
cuda==9.0
cuDNN==7.1.4
The corresponding cudnn can be downloaded here.
Tested build configurations
Please refer to https://www.tensorflow.org/install/source#gpu for an up-to-date compatibility chart (for official TF wheels).
[Figures, updated May 20, 2020: tested build configuration tables for Linux GPU, Linux CPU, macOS GPU, macOS CPU, Windows GPU and Windows CPU]
Updated as of Dec 5 2020: for updated information, please refer to Link for Linux and Link for Windows.

The compatibility table given in the tensorflow site does not contain specific minor versions for cuda and cuDNN. However, if the specific versions are not met, there will be an error when you try to use tensorflow.
For tensorflow-gpu==1.12.0 and cuda==9.0, the compatible cuDNN version is 7.1.4, which can be downloaded from here after registration.
You can check your cuda version using
nvcc --version
cuDNN version using
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
tensorflow-gpu version using
pip freeze | grep tensorflow-gpu
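The cuDNN check above can also be done from Python by reading the version macros out of the header. A minimal sketch, assuming the header lives at /usr/include/cudnn.h as in the command above (newer cuDNN releases keep these macros in cudnn_version.h instead); the function names are made up.

```python
import re


def parse_cudnn_version(header_text):
    """Return 'major.minor.patch' from cudnn.h-style text, or None if a macro is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#\s*define\s+" + macro + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


def read_cudnn_version(path="/usr/include/cudnn.h"):
    """Parse the header at `path`; return None if it cannot be read."""
    try:
        with open(path, errors="replace") as handle:
            return parse_cudnn_version(handle.read())
    except OSError:
        return None


print("cuDNN version from header:", read_cudnn_version())
```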
UPDATE:
Since TensorFlow 2.0 has been released, I will share the compatible CUDA and cuDNN versions for it as well (for Ubuntu 18.04).
tensorflow-gpu = 2.0.0
cuda = 10.0
cuDNN = 7.6.0

If you are coding in a Jupyter notebook and want to check which CUDA version TF is using, run the following commands directly in a Jupyter cell:
!conda list cudatoolkit
!conda list cudnn
and to check if the GPU is visible to TF:
tf.test.is_gpu_available(
    cuda_only=False, min_cuda_compute_capability=None
)
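In TensorFlow 2.1 and later, tf.test.is_gpu_available is deprecated in favour of tf.config.list_physical_devices('GPU'). A small sketch of the newer check, wrapped so that it also runs where TensorFlow is not importable (visible_gpus is a made-up helper name):

```python
def visible_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if TF cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow could not be imported")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("TensorFlow imported, but no GPU is visible")
```

An empty list with a working nvidia-smi usually points to a CUDA/cuDNN version mismatch rather than a driver problem.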

You can use this configuration for CUDA 10.0 (10.1 does not work as of 3/18); this works for me:
tensorflow>=1.12.0
tensorflow_gpu>=1.4
Install this version of tensorflow-gpu:
pip install tensorflow-gpu==1.4.0

Thanks for the first answer.
A note on backward compatibility:
I can successfully install tensorflow-2.4.0 with cuda-11.1 and cudnn 8.0.5.
Source: https://www.tensorflow.org/install/source#gpu

I had installed CUDA 10.1 and cuDNN 7.6 by mistake. You can use the following configuration (this worked for me as of 9/10):
Tensorflow-gpu == 1.14.0
CUDA 10.1
CUDNN 7.6
Ubuntu 18.04
But I had to create symlinks for it to work, as this TensorFlow version expects CUDA 10.0:
sudo ln -s /opt/cuda/targets/x86_64-linux/lib/libcublas.so /opt/cuda/targets/x86_64-linux/lib/libcublas.so.10.0
sudo cp /usr/lib/x86_64-linux-gnu/libcublas.so.10 /usr/local/cuda-10.1/lib64/
sudo ln -s /usr/local/cuda-10.1/lib64/libcublas.so.10 /usr/local/cuda-10.1/lib64/libcublas.so.10.0
sudo ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so.10 /usr/local/cuda/lib64/libcusolver.so.10.0
sudo ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so.10 /usr/local/cuda/lib64/libcurand.so.10.0
sudo ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10 /usr/local/cuda/lib64/libcufft.so.10.0
sudo ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so /usr/local/cuda/lib64/libcudart.so.10.0
sudo ln -s /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so.10 /usr/local/cuda/lib64/libcusparse.so.10.0
And add the following to my ~/.bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/targets/x86_64-linux/lib/
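Before creating links like these by hand, it can help to list which ones would be needed. The following dry-run sketch only prints candidate ln -s commands and changes nothing; plan_symlinks is a hypothetical helper, the directory is the one used in the commands above, and whether a 10.1 library really works under a 10.0 name is something you still have to test.

```python
import os
import re


def plan_symlinks(lib_dir, installed_suffix="10", wanted_suffix="10.0"):
    """List (target, link_name) pairs for libs that lack a '.so.<wanted_suffix>' name."""
    pattern = re.compile(r"(lib[\w\-+]+\.so)\." + re.escape(installed_suffix))
    plan = []
    for entry in sorted(os.listdir(lib_dir)):
        match = pattern.fullmatch(entry)
        if match is None:
            continue
        link_name = os.path.join(lib_dir, match.group(1) + "." + wanted_suffix)
        # Skip names that already exist, whether as files or as links.
        if not os.path.lexists(link_name):
            plan.append((os.path.join(lib_dir, entry), link_name))
    return plan


cuda_lib_dir = "/usr/local/cuda/lib64"  # path taken from the commands above
if os.path.isdir(cuda_lib_dir):
    for target, link_name in plan_symlinks(cuda_lib_dir):
        print("sudo ln -s", target, link_name)
```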

I had a similar problem after upgrading to TF 2.0. The CUDA version that TF was reporting did not match what Ubuntu 18.04 thought I had installed. It said I was using CUDA 7.5.0, but apt thought I had the right version installed.
What I eventually had to do was grep recursively in /usr/local for CUDNN_MAJOR, and I found that /usr/local/cuda-10.0/targets/x86_64-linux/include/cudnn.h did indeed specify the version as 7.5.0.
/usr/local/cuda-10.1 got it right, and /usr/local/cuda pointed to /usr/local/cuda-10.1, so it was (and remains) a mystery to me why TF was looking at /usr/local/cuda-10.0.
Anyway, I just moved /usr/local/cuda-10.0 to /usr/local/old-cuda-10.0 so TF couldn't find it any more and everything then worked like a charm.
It was all very frustrating, and I still feel like I just did a random hack. But it worked :) and perhaps this will help someone with a similar issue.

Tensorflow installation with anaconda installation and run issue

I have been trying to install TensorFlow for a really long time now, but I never seem to make it work. I have tried to install TensorFlow via pip, a virtual environment and Anaconda so far. The installation process seems to run smoothly with all three methods. But as soon as I try to validate the installation by running "import tensorflow" I get the following error. I know it looks kind of chaotic, I wasn't sure how to pose the question.
By now, all help is appreciated
Thanks

As mentioned, you have to install Python 3.5.x first.
Secondly, I strongly recommend using Anaconda. You should install Anaconda 4.4.0 (Python 3.6 version, 64-bit installer).
Then run the following command:
conda create -n tensorflow python=3.5
By the way, have you watched the TensorFlow installation tutorial?

TensorFlow versions 1.2 and later are compatible with Python 3.6. The error message points to the actual problem:
ImportError: libcudnn.so.5: cannot open shared object file: No such file or directory
This implies two things:
You have installed the tensorflow-gpu package, which requires a CUDA-capable GPU and a working installation of CUDA and cuDNN.
TensorFlow cannot find cuDNN.
This answer explains how to fix your cuDNN installation.
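Since the ImportError names the exact library and version that is missing, it can be extracted automatically when diagnosing an install. A minimal sketch (missing_library is a made-up helper, and the regular expression only covers messages of the form shown above):

```python
import re

# Matches e.g. "libcudnn.so.5: cannot open shared object file"
_MISSING_LIB = re.compile(
    r"(lib[\w\-+]+)\.so\.([\d.]+): cannot open shared object file"
)


def missing_library(error_message):
    """Return (library, version) named in a loader error, or None if none is found."""
    match = _MISSING_LIB.search(error_message)
    return (match.group(1), match.group(2)) if match else None


try:
    import tensorflow  # noqa: F401
except ImportError as error:
    print("Import failed, missing library:", missing_library(str(error)))
else:
    print("TensorFlow imported without a loader error")
```

For the message above this yields libcudnn and 5, which tells you that this TensorFlow build expects cuDNN 5.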

Update:
I get a similar error but the message in the bottom is now:
"ImportError libnvidia-fatbinaryloader.so.384.47: cannot open shared object file: No such file or directory"
Also when I enter "which nvcc" it returns /usr/bin/nvcc
