I want to run over multiple GPUs in parallel torch.inverse(), and am not able to.
I saw this post Matmul on multiple GPUs , which goes over the process for matmul. It shows that if you have multiple tensors allocated to each GPU matmul will be run in parallel. I was able to replicate this behavior for matmul but when I try to do the same thing for torch.inverse() it seems to run sequentially when I check "watch nvidia-smi". Also when I replace the torch.inverse() function with the torch fft function I get parallel GPU usage. Any ideas?
import torch
ngpu = torch.cuda.device_count()
# This is the allocation to each GPU.
lis = []
for i in range(ngpu):
lis.append(torch.rand(5000,5000,device = 'cuda:'+ str(i)))
# per the matmul on multiple GPUs post this should already be in parallel
# but doesnt seem to be based on watch nvidia-smi
C_ = []
for i in range(ngpu):
C_.append(torch.inverse(lis[i]))
Edit: This can be compared to the FFT code(below) and the Matmul code in the link above.
import torch
ngpu = torch.cuda.device_count()
# This is the allocation to each GPU.
lis = []
for i in range(ngpu):
lis.append(torch.rand(5000,5000,2,device = 'cuda:'+ str(i)))
# per the matmul on multiple GPUs post this should already be in parallel
# but doesnt seem to be based on watch nvidia-smi
C_ = []
for i in range(ngpu):
C_.append(torch.fft(lis[i],2))
Related
TL;DR: using PyTorch with Optuna with multiprocessing done with Queue(), a GPU (out of 4) can hang. Probably not a deadlock. Any ideas?
Normal version:
I am using PyTorch in combination with Optuna (a hyperparameter optimization framework; basically starts different trials for one model with different parameters, see: https://optuna.readthedocs.io/en/stable/) for my model training on a setup with 4 GPUs. Here, I've been looking for a way to distribute the workload more efficiently on the GPUs, hence I explored the multiprocessing library.
The core of the multiprocessing code looks like following:
class GpuQueue:
def __init__(self):
self.queue = multiprocessing.Manager().Queue()
all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
for idx in all_idxs:
self.queue.put(idx)
#contextmanager
def one_gpu_per_process(self):
current_idx = self.queue.get()
yield current_idx
self.queue.put(current_idx)
class Objective:
def __init__(self, gpu_queue: GpuQueue, params, signals):
self.gpu_queue = gpu_queue
# create dataset
# ...
def __call__(self, trial: optuna.Trial):
with self.gpu_queue.one_gpu_per_process() as gpu_i:
val = trainer(trial, gpu=gpu_i, ...)
return val
And in main, optuna study and optuna optimize are initiated with:
study = optuna.create_study(direction="minimize", sampler = optuna.samplers.TPESampler(seed=17)) # storage = "sqlite:///trials.db")
study.optimize(Objective(GpuQueue(), ..., n_jobs=4))
Same implementation can be found in this StackOverflow post (used as inspiration): Is there a way to pass arguments to multiple jobs in optuna?
What this code does is that every trial gets its own GPU, hence the GPU usage and distribution is better than other methods. However it happens often that a GPU is stuck and just 'shuts itself off' and does not finish the trial, hence the code actually never finishes running and that GPU is never freed.
Say, for example, that I am running 100 trials, then trial 1,2,3,4 get assigned GPUs 0,1,2,3 (not always in that order), and whenever a GPU is freed, say GPU 2, it takes on trial 5, etc. The issue is, it can happen that the trial that the GPU is assigned to 'quits' in the process and never finishes the trial, hence not taking on another trial and resulting in the run with many trials not completing.
I suspected a deadlock, but apparently Queue() is thread-safe (see: Is Python multiprocessing.Queue thread safe?).
Any clue on what can cause the hang and what I can look for?
I'm trying find a way to efficiently and comfortably measure time of execution of a single layer in TF Keras model. I've already searched quite a lot, but I haven't found a solution that fully satisfies me.
One thing I've came across, which was similar, was an approach used in Pytorch which basically used time library. And for GPU case, it also used: torch.cuda.synchronize() function.
Can I do something similar in TF?
I've tried in a following way:
def measure_layer_latency(layer, batch_size=4, runtimes=5):
input_shape = (batch_size,) + tuple(layer.input_shape[1:]) #input data shape including batch size
total_time = .0
for i in range(runtimes):
x = tf.random.normal(input_shape)
start = time.time()
layer(x)
finish = time.time()
total_time += (finish-start)
return total_time/float(runtimes)
but I'm afraid it won't be a right choice for GPU. Is it at least fine for calculations on CPU?
I've also found information about Eager Execution mode in Tensorflow and such method:
import time
def measure(x, steps):
# TensorFlow initializes a GPU the first time it's used, exclude from timing.
tf.matmul(x, x)
start = time.time()
for i in range(steps):
x = tf.matmul(x, x)
# tf.matmul can return before completing the matrix multiplication
# (e.g., can return after enqueing the operation on a CUDA stream).
# The x.numpy() call below will ensure that all enqueued operations
# have completed (and will also copy the result to host memory,
# so we're including a little more than just the matmul operation
# time).
_ = x.numpy()
end = time.time()
return end - start
Would anybody recommend such approach?
The last thing, most often recommended is TF Profiler, but for now I find it a bit uncomfortable for my task (maybe because I don't know it well).
Finally, the thing which I would like to achieve is having a model in which I can iterate over layers, change number of filters in Conv operations and calculate latency of a single layer depending on the number of input and output channels. I will be grateful for any ideas!
In the following code, it is absolutely imperative for me to execute the complete function in GPU without a single jump back to CPU. This is because I have 4 CPU cores but I have 1200 cuda cores. Theoretically, it is possible because the tensorflow feed_forwards, if statements and and the variable assigns can be done on GPU (I have NVIDIA GTX 1060).
The problem I'm facing is tensorflow2.0 does this automatic assignment to GPU and CPU in the backend and doesn't mention which of it's ops are GPU compatible. When I run the following function with device as GPU, I get
parallel_func could not be transformed and will be staged without change.
and it runs sequentially on GPU.
My question is where to use tf.device? What part of code will be converted by autograph to GPU code and what will remain on CPU? How can I convert that too to GPU?
#tf.function
def parallel_func(self):
for i in tf.range(114): #want this parallel on GPU
for count in range(320): #want this sequential on GPU
retrivedValue = self.data[i][count]
if self.var[i]==1:
self.value[i] = retrievedValue # assigns, if else
elif self.var[i]==-1: # some links to class data through
self.value[i] = -retrivedValue # self.data, self.a and self.b
state = tf.reshape(tf.Variable([self.a[i], self.b[i][count]]), [-1,2])
if self.workerSwitch == False:
action = tf.math.argmax(self.feed_forward(i, count, state))
else:
action = tf.math.argmax(self.worker_feed_forward(i, count, state))
if (action==1 or action==-1):
self.actionCount +=1
Side note: the message parallel_func could not be transformed and will be staged without change is output by autograph, and since it contains data-dependent control flow, it's likely that the function can't run at all. It would be worth filing an issue with steps to reproduce and more detailed log messages.
In tensorflow's guide about the performance of eager execution, there is a piece of code as follows:
import time
def measure(x, steps):
# TensorFlow initializes a GPU the first time it's used, exclude from timing.
tf.matmul(x, x)
start = time.time()
for i in range(steps):
x = tf.matmul(x, x)
_ = x.numpy() # Make sure to execute op and not just enqueue it
end = time.time()
return end - start
...
with tf.device("/cpu:0"):
print("CPU: {} secs".format(measure(tf.random_normal(shape), steps)))
with tf.device("/gpu:0"):
print("GPU: {} secs".format(measure(tf.random_normal(shape), steps)))
what is the meaning of the code before the second comment: "_ = x.numpy()"?
If I comment out this line, will tf.matmul(x,x) not be executed on cpu/gpu?
Technically, the call to tf.matmul can return before the matrix multiplication is complete.
In practice:
If executing on the CPU (and not using execution_mode=tf.contrib.eager.ASYNC), then tf.matmul returns only after the matrix multiplication has completed.
If executing on the GPU, then tf.matmul returns after enqueueing the matrix multiplication on the CUDA stream (see NVIDIA developer documentation for more information on streams)
The .numpy() call causes the result to be copied back to host memory (since numpy arrays must be backed by host and not GPU memory). In order to correctly do that, it has to wait for all compute operations enqueued on the CUDA stream to complete. Thus the .numpy() call is a means of ensuring "the CUDA stream has been processed". The intent there is to ensure that end - start accounts for the time it takes the operation to complete, not just be enqueued on the CUDA stream.
That said, that code snippet seems like it is over-estimating time executed on GPU since it also includes the time to copy to host after each step. That _ = x.numpy() line should be moved outside the for loop to get a more accurate measure (i.e., time to execute matrix multiplication steps times, then wait for the CUDA stream to finish, and copy to host memory once). Ideally, we would be able to exclude the time it takes to copy back to host memory.
Hope that makes sense.
I've trained 3 models and am now running code that loads each of the 3 checkpoints in sequence and runs predictions using them. I'm using the GPU.
When the first model is loaded it pre-allocates the entire GPU memory (which I want for working through the first batch of data). But it doesn't unload memory when it's finished. When the second model is loaded, using both tf.reset_default_graph() and with tf.Graph().as_default() the GPU memory still is fully consumed from the first model, and the second model is then starved of memory.
Is there a way to resolve this, other than using Python subprocesses or multiprocessing to work around the problem (the only solution I've found on via google searches)?
A git issue from June 2016 (https://github.com/tensorflow/tensorflow/issues/1727) indicates that there is the following problem:
currently the Allocator in the GPUDevice belongs to the ProcessState,
which is essentially a global singleton. The first session using GPU
initializes it, and frees itself when the process shuts down.
Thus the only workaround would be to use processes and shut them down after the computation.
Example Code:
import tensorflow as tf
import multiprocessing
import numpy as np
def run_tensorflow():
n_input = 10000
n_classes = 1000
# Create model
def multilayer_perceptron(x, weight):
# Hidden layer with RELU activation
layer_1 = tf.matmul(x, weight)
return layer_1
# Store layers weight & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
pred = multilayer_perceptron(x, weights)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for i in range(100):
batch_x = np.random.rand(10, 10000)
batch_y = np.random.rand(10, 1000)
sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
print "finished doing stuff with tensorflow!"
if __name__ == "__main__":
# option 1: execute code with extra process
p = multiprocessing.Process(target=run_tensorflow)
p.start()
p.join()
# wait until user presses enter key
raw_input()
# option 2: just execute the function
run_tensorflow()
# wait until user presses enter key
raw_input()
So if you would call the function run_tensorflow() within a process you created and shut the process down (option 1), the memory is freed. If you just run run_tensorflow() (option 2) the memory is not freed after the function call.
You can use numba library to release all the gpu memory
pip install numba
from numba import cuda
device = cuda.get_current_device()
device.reset()
This will release all the memory
I use numba to release GPU. With TensorFlow, I cannot find an effective method.
import tensorflow as tf
from numba import cuda
a = tf.constant([1.0,2.0,3.0],shape=[3],name='a')
b = tf.constant([1.0,2.0,3.0],shape=[3],name='b')
with tf.device('/gpu:1'):
c = a+b
TF_CONFIG = tf.ConfigProto(
gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.1),
allow_soft_placement=True)
sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
i=1
while(i<1000):
i=i+1
print(sess.run(c))
sess.close() # if don't use numba,the gpu can't be released
cuda.select_device(1)
cuda.close()
with tf.device('/gpu:1'):
c = a+b
TF_CONFIG = tf.ConfigProto(
gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5),
allow_soft_placement=True)
sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
while(1):
print(sess.run(c))
Now there seem to be two ways to resolve the iterative training model or if you use future multipleprocess pool to serve the model training, where the process in the pool will not be killed if the future finished. You can apply two methods in the training process to release GPU memory meanwhile you wish to preserve the main process.
call a subprocess to run the model training. when one phase training completed, the subprocess will exit and free memory. It's easy to get the return value.
call the multiprocessing.Process(p) to run the model training(p.start), and p.join will indicate the process exit and free memory.
Here is a helper function using multiprocess.Process which can open a new process to run your python written function and reture value instead of using Subprocess,
# open a new process to run function
def process_run(func, *args):
def wrapper_func(queue, *args):
try:
logger.info('run with process id: {}'.format(os.getpid()))
result = func(*args)
error = None
except Exception:
result = None
ex_type, ex_value, tb = sys.exc_info()
error = ex_type, ex_value,''.join(traceback.format_tb(tb))
queue.put((result, error))
def process(*args):
queue = Queue()
p = Process(target = wrapper_func, args = [queue] + list(args))
p.start()
result, error = queue.get()
p.join()
return result, error
result, error = process(*args)
return result, error
I am figuring out which option is better in the Jupyter Notebook. Jupyter Notebook occupies the GPU memory permanently even a deep learning application is completed. It usually incurs the GPU Fan ERROR that is a big headache. In this condition, I have to reset nvidia_uvm and reboot the linux system regularly. I conclude the following two options can remove the headache of GPU Fan Error but want to know which is better.
Environment:
CUDA 11.0
cuDNN 8.0.1
TensorFlow 2.2
Keras 2.4.3
Jupyter Notebook 6.0.3
Miniconda 4.8.3
Ubuntu 18.04 LTS
First Option
Put the following code at the end of the cell. The kernel immediately ended upon the application runtime is completed. But it is not much elegant. Juputer will pop up a message for the died ended kernel.
import os
pid = os.getpid()
!kill -9 $pid
Section Option
The following code can also end the kernel with Jupyter Notebook. I do not know whether numba is secure. Nvidia prefers the "0" GPU that is the most used GPU by personal developer (not server guys). However, both Neil G and mradul dubey have had the response: This leaves the GPU in a bad state.
from numba import cuda
cuda.select_device(0)
cuda.close()
It seems that the second option is more elegant. Can some one confirm which is the best choice?
Notes:
It is not such the problem to automatically release the GPU memory in the environment of Anaconda by direct executing "$ python abc.py". However, I sometimes need to use Jyputer Notebook to handle .ipynb application.
I was able to solve an OOM error just now with the garbage collector.
import gc
gc.collect()
model.evaluate(x1, y1)
gc.collect()
model.evaluate(x2, y2)
gc.collect()
etc.
Based on what Yaroslav Bulatov said in their answer (that tf deallocates GPU memory when the object is destroyed), I surmised that it could just be that the garbage collector hadn't run yet. Forcing it to collect freed me up, so that might be a good way to go.
GPU memory allocated by tensors is released (back into TensorFlow memory pool) as soon as the tensor is not needed anymore (before the .run call terminates). GPU memory allocated for variables is released when variable containers are destroyed. In case of DirectSession (ie, sess=tf.Session("")) it is when session is closed or explicitly reset (added in 62c159ff)
I have trained my models in a for loop for different parameters when I got this error after 120 models trained. Afterwards I could not even train a simple model if I did not kill the kernel.
I was able to solve my issue by adding the following line before building the model:
tf.keras.backend.clear_session()
(see https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session)
To free my resources, I use:
import os, signal
os.kill(os.getpid(), signal.SIGKILL)