Tensorflow stuck after "Created TensorFlow device" - python

I'm working on Ubuntu 18.04 with tensorflow==2.2.
I've installed CUDA 10.1. My GPU is detected, but the program seems to be stuck after "Created TensorFlow device", or at least takes 2-3 minutes to run.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti On | 00000000:01:00.0 On | N/A |
| 32% 34C P0 1W / 38W | 320MiB / 2000MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1013 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 1130 G /usr/bin/gnome-shell 48MiB |
| 0 N/A N/A 1343 G /usr/lib/xorg/Xorg 178MiB |
| 0 N/A N/A 1489 G /usr/bin/gnome-shell 48MiB |
| 0 N/A N/A 2643 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 2734 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 2778 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 5438 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 6178 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 6691 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 7007 G /usr/lib/firefox/firefox 1MiB |
+-----------------------------------------------------------------------------+
Importing works fine:
import tensorflow as tf
tf.config.list_physical_devices("GPU")
Output:
Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 750 Ti computeCapability: 5.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
But this seems to be stuck:
print(tf.reduce_sum(tf.random.normal([1000, 1000])))
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1425 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
...
I've installed tensorflow with pip3 install tensorflow==2.2 and also tried pip3 install tensorflow-gpu.
Any ideas?

Please check below whether you have followed all the required steps to install TensorFlow with GPU support on Ubuntu.
STEP 1: Nvidia Drivers
STEP 2: Installation of NVIDIA CUDA
STEP 3: Installation of Deep Neural Network library (cuDNN)
STEP 4: Finally installing TENSORFLOW with GPU support
pip install --upgrade tensorflow-gpu
STEP 5: Checking the installation
python -c "from tensorflow.python.client import device_lib;
print(device_lib.list_local_devices())
You can check this reference to complete all these steps, then execute the same code above. Please let us know if the issue still persists.
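As an extra sanity check (a minimal sketch, not part of the original steps), you can also confirm the GPU is visible from TensorFlow and force a small op onto it:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))           # should list the GTX 750 Ti
with tf.device('/GPU:0'):                               # force placement on the GPU
    print(tf.reduce_sum(tf.random.normal([100, 100])))  # small op; should finish quickly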

Related

Running gpt-neo on GPU

The following code runs the EleutherAI/gpt-neo-1.3B model. The model runs on the CPU, but I don't understand why it does not use my GPU. Did I miss something?
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
prompt = ("What is the capital of France?")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=50 )
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print (gen_text)
By the way, here is the output of the nvidia-smi command
Thu Feb 16 14:58:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:73:00.0 On | N/A |
| 30% 31C P8 34W / 350W | 814MiB / 24576MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:A6:00.0 Off | Off |
| 30% 31C P8 16W / 230W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3484 G /usr/lib/xorg/Xorg 378MiB |
| 0 N/A N/A 3660 G /usr/bin/gnome-shell 62MiB |
| 0 N/A N/A 4364 G ...662097787256072160,131072 225MiB |
| 0 N/A N/A 37532 G ...6/usr/lib/firefox/firefox 142MiB |
| 1 N/A N/A 3484 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
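For what it's worth, here is a minimal sketch of the same generation with the model and inputs explicitly moved to the GPU (assuming a CUDA-enabled PyTorch build; from_pretrained loads the model on the CPU unless you move it yourself):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU is usable

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").to(device)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

prompt = "What is the capital of France?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)  # inputs must live on the same device as the model

gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=50)
print(tokenizer.batch_decode(gen_tokens)[0])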

Tensorflow does not get GPU

Python version: 3.7.6
Tensorflow version: 2.3.0
CUDA: 10.2.89
CUDNN: 10.2
nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48 Driver Version: 451.48 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 WDDM | 00000000:04:00.0 On | N/A |
| 0% 47C P8 8W / 200W | 463MiB / 8192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1268 C+G Insufficient Permissions N/A |
| 0 N/A N/A 1308 C+G Insufficient Permissions N/A |
| 0 N/A N/A 4936 C+G ...\Direct4\jabra-direct.exe N/A |
| 0 N/A N/A 7500 C+G Insufficient Permissions N/A |
| 0 N/A N/A 7516 C+G ...w5n1h2txyewy\SearchUI.exe N/A |
| 0 N/A N/A 9668 C+G Insufficient Permissions N/A |
| 0 N/A N/A 10676 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 10828 C+G ...st\Desktop\Mattermost.exe N/A |
| 0 N/A N/A 11536 C+G ...8bbwe\Microsoft.Notes.exe N/A |
| 0 N/A N/A 14604 C+G ...es.TextInput.InputApp.exe N/A |
+-----------------------------------------------------------------------------+
I tried:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Num GPUs Available: 0
Why is TensorFlow not able to detect the GPU?
It seems you are trying to use the TensorFlow GPU build, but you have installed unsupported versions of its dependencies.
Note: GPU support is available for Ubuntu and Windows with CUDA-enabled cards only.
If you have a CUDA-enabled card, follow the instructions provided below.
As stated in the TensorFlow documentation, the software requirements are as follows:
NVIDIA GPU drivers - 418.x or higher
CUDA - 10.1 (TensorFlow >= 2.1.0)
cuDNN - 7.6
Make sure you have these exact versions of the software mentioned above; the prebuilt TensorFlow 2.3 wheel links against CUDA 10.1 and will not find the CUDA 10.2 toolkit shown in your nvcc output. See this
Also, check the system requirements here.
Make sure you have installed all the C++ redistributables - here
For downloading the software mentioned above, see here.
For downloading TensorFlow, follow the instructions provided here to correctly install the necessary packages.
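One quick way to cross-check which CUDA and cuDNN versions a given TensorFlow wheel was built against (a sketch; tf.sysconfig.get_build_info is available from TF 2.3 onwards):
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))   # an empty list means no usable GPU was found

build = tf.sysconfig.get_build_info()           # build-time CUDA/cuDNN versions of this wheel
print(build.get('cuda_version'), build.get('cudnn_version'))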

Why Python stops using the GPU and switches to the CPU at runtime

I have been using this:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
in order to run on the GPU. It had been working properly until today.
The problem now is that, in the middle of the run, my program stops using the GPU and switches to the CPU, so it becomes far too slow.
Any idea why that is happening?
nvidia-smi output at the beginning of the execution:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 On | 00000000:01:00.0 On | N/A |
| 0% 42C P8 14W / 200W | 363MiB / 4039MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c On | 00000000:05:00.0 Off | 0 |
| 35% 74C P0 136W / 235W | 11011MiB / 11441MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1037 G /usr/lib/xorg/Xorg 20MiB |
| 0 1150 G /usr/bin/gnome-shell 12MiB |
| 0 7430 G /usr/lib/xorg/Xorg 166MiB |
| 0 7560 G /usr/bin/gnome-shell 158MiB |
| 1 13772 C python3 10998MiB |
+-----------------------------------------------------------------------------+
And then, when it begins to run too slowly:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 On | 00000000:01:00.0 On | N/A |
| 0% 42C P8 14W / 200W | 363MiB / 4039MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c On | 00000000:05:00.0 Off | 0 |
| 35% 69C P0 63W / 235W | 11011MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1037 G /usr/lib/xorg/Xorg 20MiB |
| 0 1150 G /usr/bin/gnome-shell 12MiB |
| 0 7430 G /usr/lib/xorg/Xorg 166MiB |
| 0 7560 G /usr/bin/gnome-shell 158MiB |
| 1 13772 C python3 10998MiB |
+-----------------------------------------------------------------------------+
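Not from the original thread, but one way to confirm at runtime which device the ops are actually placed on (a minimal sketch assuming TensorFlow 2.x) is to set CUDA_VISIBLE_DEVICES before importing TensorFlow and enable device-placement logging:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"     # must be set before TensorFlow is imported

import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # log the device every op executes on

x = tf.random.normal([1000, 1000])           # should be logged on /device:GPU:0,
print(tf.reduce_sum(x))                      # i.e. the only visible GPU (the K40c)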

How to check if keras tensorflow backend is running on the GPU or CPU?

I have a notebook with a GPU. nvidia-smi:
Thu Oct 18 20:49:22 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87 Driver Version: 390.87 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 540M Off | 00000000:01:00.0 N/A | N/A |
| N/A 44C P8 N/A / N/A | 12MiB / 964MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
It is running Keras code:
Using TensorFlow backend.
2018-10-18 20:26:08.963084: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-18 20:26:08.963593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GT 540M major: 2 minor: 1 memoryClockRate(GHz): 1.344
pciBusID: 0000:01:00.0
totalMemory: 964.50MiB freeMemory: 917.75MiB
2018-10-18 20:26:08.963633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GeForce GT 540M, pci bus id: 0000:01:00.0, compute capability: 2.1) with Cuda compute capability 2.1. The minimum required Cuda capability is 3.5.
2018-10-18 20:26:08.963652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-18 20:26:08.963663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-10-18 20:26:08.963673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
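The log above already answers part of the question: the GT 540M is ignored because its compute capability (2.1) is below TensorFlow's minimum of 3.5, so the work falls back to the CPU. For the general check in a TF 1.x setup like this one, a minimal sketch:
import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())  # lists the CPU and any usable GPU devices
print(tf.test.is_gpu_available())       # False here, since the only GPU is being ignored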

Tensorflow is unable to run code on too many GPUs?

I have the following test code:
import tensorflow as tf
import numpy as np
def body(x):
    a = tf.random_uniform(shape=[2, 2], dtype=tf.int32, maxval=100)
    b = tf.constant(np.array([[1, 2], [3, 4]]), dtype=tf.int32)
    c = a + b
    return tf.nn.relu(x + c)

def condition(x):
    return tf.reduce_sum(x) < 100

x = tf.Variable(tf.constant(0, shape=[2, 2]))

with tf.Session():
    tf.initialize_all_variables().run()
    result = tf.while_loop(condition, body, [x])
    print(result.eval())
When I run it on my GPU cluster, it produces the following output before being killed:
2018-03-30 18:10:33.473913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-30 18:10:33.591203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-30 18:10:33.688390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-30 18:10:33.806845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-30 18:10:33.913200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-30 18:10:34.018533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1)
Killed
When I run the script using CUDA_VISIBLE_DEVICES='6' python script.py, it aborts while using the GPU. What could be causing this? Could this be a defective GPU?
nvidia-smi reports the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:3D:00.0 Off | N/A |
| 28% 21C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:3E:00.0 Off | N/A |
| 28% 21C P8 7W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:60:00.0 Off | N/A |
| 28% 24C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:61:00.0 Off | N/A |
| 28% 25C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:B1:00.0 Off | N/A |
| 28% 19C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:B2:00.0 Off | N/A |
| 28% 20C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:DA:00.0 Off | N/A |
| 28% 22C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:DB:00.0 Off | N/A |
| 28% 21C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
TensorFlow version is 1.7.0 and CUDA version is 9.0.176.
The problem was that I didn't request enough RAM when creating a job that uses that many GPUs. To use 8 GPUs you need a good amount of host memory, perhaps ~60 GiB.
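If requesting more host memory is not an option, the usual workaround is to expose only the GPUs the job actually needs; a sketch in the TF 1.x style used above (the device list is just an example):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose only two GPUs to this process

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True      # allocate GPU memory on demand instead of all at once
with tf.Session(config=config) as sess:
    pass                                    # build and run the graph here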
