I've trained 3 models and am now running code that loads each of the 3 checkpoints in sequence and runs predictions using them. I'm using the GPU.
When the first model is loaded it pre-allocates the entire GPU memory (which I want, for working through the first batch of data), but it doesn't release that memory when it's finished. When the second model is loaded, even using both tf.reset_default_graph() and with tf.Graph().as_default(), the GPU memory is still fully consumed by the first model, and the second model is then starved of memory.
Is there a way to resolve this, other than using Python subprocesses or multiprocessing to work around the problem (the only solution I've found via Google searches)?
A GitHub issue from June 2016 (https://github.com/tensorflow/tensorflow/issues/1727) indicates that there is the following problem:
currently the Allocator in the GPUDevice belongs to the ProcessState,
which is essentially a global singleton. The first session using GPU
initializes it, and frees itself when the process shuts down.
Thus the only workaround would be to use processes and shut them down after the computation.
Example Code:
import tensorflow as tf
import multiprocessing
import numpy as np

def run_tensorflow():

    n_input = 10000
    n_classes = 1000

    # Create model
    def multilayer_perceptron(x, weight):
        # Hidden layer with RELU activation
        layer_1 = tf.matmul(x, weight)
        return layer_1

    # Store layers weight & bias
    weights = tf.Variable(tf.random_normal([n_input, n_classes]))

    x = tf.placeholder("float", [None, n_input])
    y = tf.placeholder("float", [None, n_classes])
    pred = multilayer_perceptron(x, weights)

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)

        for i in range(100):
            batch_x = np.random.rand(10, 10000)
            batch_y = np.random.rand(10, 1000)
            sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})

    print("finished doing stuff with tensorflow!")

if __name__ == "__main__":

    # option 1: execute code with extra process
    p = multiprocessing.Process(target=run_tensorflow)
    p.start()
    p.join()

    # wait until user presses enter key
    input()

    # option 2: just execute the function
    run_tensorflow()

    # wait until user presses enter key
    input()
So if you call run_tensorflow() within a process you created and then shut that process down (option 1), the memory is freed. If you just run run_tensorflow() directly (option 2), the memory is not freed after the function call.
You can use the numba library to release all of the GPU memory:
pip install numba
from numba import cuda
device = cuda.get_current_device()
device.reset()
This will release all the memory
I use numba to release the GPU memory. With TensorFlow, I cannot find an effective method.
import tensorflow as tf
from numba import cuda

a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.1),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
i = 1
while i < 1000:
    i = i + 1
    print(sess.run(c))

sess.close()  # if you don't use numba, the GPU memory can't be released

cuda.select_device(1)
cuda.close()

with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
while True:
    print(sess.run(c))
There now seem to be two ways to handle this for iterative model training, or for the case where you use a futures-based multiprocess pool to serve model training (where a process in the pool is not killed when its future finishes). You can apply either of the following two methods in the training process to release GPU memory while still preserving the main process:
Call a subprocess to run the model training. When one phase of training is completed, the subprocess exits and frees the memory, and it is easy to get the return value back (a minimal sketch of this approach follows below).
Call multiprocessing.Process(p) to run the model training (p.start()); p.join() indicates that the process has exited and its memory has been freed.
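For the first method, here is a minimal sketch of the subprocess approach; the script name train_one_model.py and its JSON command-line interface are hypothetical placeholders, not something from the original posts:
import json
import subprocess
import sys

def train_in_subprocess(config):
    # Run one training phase in a child process; all of its GPU memory is
    # returned to the driver when the child exits.
    completed = subprocess.run(
        [sys.executable, "train_one_model.py", json.dumps(config)],
        capture_output=True, text=True, check=True)
    # Assume the child prints its result (e.g. the final loss) as JSON on stdout.
    return json.loads(completed.stdout)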
Here is a helper function using multiprocessing.Process which opens a new process to run your Python function and returns its value, instead of using subprocess:
import os
import sys
import traceback
import logging
from multiprocessing import Process, Queue

logger = logging.getLogger(__name__)

# open a new process to run function
def process_run(func, *args):

    def wrapper_func(queue, *args):
        try:
            logger.info('run with process id: {}'.format(os.getpid()))
            result = func(*args)
            error = None
        except Exception:
            result = None
            ex_type, ex_value, tb = sys.exc_info()
            error = ex_type, ex_value, ''.join(traceback.format_tb(tb))
        queue.put((result, error))

    def process(*args):
        queue = Queue()
        p = Process(target=wrapper_func, args=[queue] + list(args))
        p.start()
        result, error = queue.get()
        p.join()
        return result, error

    result, error = process(*args)
    return result, error
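Usage might look like this (a small sketch reusing the run_tensorflow function from the earlier answer; GPU memory is freed when the child process exits):
# run the training function in its own process and collect its result
result, error = process_run(run_tensorflow)
if error is not None:
    ex_type, ex_value, tb_text = error
    print('training failed:', ex_type, ex_value)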
I am trying to figure out which option is better in Jupyter Notebook. Jupyter Notebook occupies the GPU memory permanently even after a deep learning application has completed. This usually triggers a GPU Fan ERROR, which is a big headache. In this condition, I have to reset nvidia_uvm and reboot the Linux system regularly. I have concluded that the following two options can remove the headache of the GPU Fan Error, but I want to know which is better.
Environment:
CUDA 11.0
cuDNN 8.0.1
TensorFlow 2.2
Keras 2.4.3
Jupyter Notebook 6.0.3
Miniconda 4.8.3
Ubuntu 18.04 LTS
First Option
Put the following code at the end of the cell. The kernel ends immediately once the application run is complete. But it is not very elegant: Jupyter will pop up a message about the dead kernel.
import os
pid = os.getpid()
!kill -9 $pid
Second Option
The following code can also end the kernel in Jupyter Notebook. I do not know whether numba is safe. Nvidia prefers the "0" GPU, which is the GPU most commonly used by individual developers (rather than server operators). However, both Neil G and mradul dubey have reported the same response: this leaves the GPU in a bad state.
from numba import cuda
cuda.select_device(0)
cuda.close()
It seems that the second option is more elegant. Can someone confirm which is the best choice?
Notes:
Automatically releasing the GPU memory is not a problem in the Anaconda environment when directly executing "$ python abc.py". However, I sometimes need to use Jupyter Notebook to work with .ipynb applications.
I was able to solve an OOM error just now with the garbage collector.
import gc
gc.collect()
model.evaluate(x1, y1)
gc.collect()
model.evaluate(x2, y2)
gc.collect()
etc.
Based on what Yaroslav Bulatov said in their answer (that tf deallocates GPU memory when the object is destroyed), I surmised that it could just be that the garbage collector hadn't run yet. Forcing it to collect freed me up, so that might be a good way to go.
GPU memory allocated by tensors is released (back into the TensorFlow memory pool) as soon as the tensor is no longer needed (before the .run call terminates). GPU memory allocated for variables is released when the variable containers are destroyed. In the case of DirectSession (i.e., sess=tf.Session("")), that is when the session is closed or explicitly reset (added in 62c159ff).
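Applied to the original question, a minimal sketch of what this implies (TF 1.x API; the checkpoint path and the tensor names "pred:0" and "x:0" are hypothetical): give each checkpoint its own graph and session, so that the variables' memory goes back to TensorFlow's pool when the session is closed.
import tensorflow as tf

def predict_with_checkpoint(checkpoint_path, batches):
    results = []
    with tf.Graph().as_default():            # fresh graph per model
        with tf.Session() as sess:
            saver = tf.train.import_meta_graph(checkpoint_path + ".meta")
            saver.restore(sess, checkpoint_path)
            pred = tf.get_default_graph().get_tensor_by_name("pred:0")  # hypothetical tensor name
            x = tf.get_default_graph().get_tensor_by_name("x:0")        # hypothetical placeholder name
            for batch in batches:
                results.append(sess.run(pred, feed_dict={x: batch}))
    # The session is closed here, so variable memory returns to the TF pool
    # (note: the pool itself is not handed back to the OS).
    return results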
I was training my models in a for loop over different parameters when I got this error after 120 models had been trained. Afterwards I could not even train a simple model unless I killed the kernel.
I was able to solve my issue by adding the following line before building the model:
tf.keras.backend.clear_session()
(see https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session)
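As an illustration, a minimal sketch of where that call might sit in such a parameter loop; the parameter grid, model builder, and dummy data below are hypothetical, only there to make the example self-contained:
import numpy as np
import tensorflow as tf

# hypothetical parameter grid and training data, just to illustrate the loop
param_grid = [{"units": 32}, {"units": 64}, {"units": 128}]
x_train = np.random.rand(256, 10).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

def build_model(units):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

for params in param_grid:
    tf.keras.backend.clear_session()   # drop the previous graph/state before building the next model
    model = build_model(**params)
    model.fit(x_train, y_train, epochs=3, verbose=0)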
To free my resources, I use:
import os, signal
os.kill(os.getpid(), signal.SIGKILL)
TL;DR: using PyTorch with Optuna with multiprocessing done with Queue(), a GPU (out of 4) can hang. Probably not a deadlock. Any ideas?
Normal version:
I am using PyTorch in combination with Optuna (a hyperparameter optimization framework; basically starts different trials for one model with different parameters, see: https://optuna.readthedocs.io/en/stable/) for my model training on a setup with 4 GPUs. Here, I've been looking for a way to distribute the workload more efficiently on the GPUs, hence I explored the multiprocessing library.
The core of the multiprocessing code looks like the following:
class GpuQueue:

    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)

    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)


class Objective:

    def __init__(self, gpu_queue: GpuQueue, params, signals):
        self.gpu_queue = gpu_queue
        # create dataset
        # ...

    def __call__(self, trial: optuna.Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            val = trainer(trial, gpu=gpu_i, ...)
            return val
And in main, the Optuna study and study.optimize call are initiated with:
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=17))  # storage="sqlite:///trials.db"
study.optimize(Objective(GpuQueue(), ...), n_jobs=4)
The same implementation can be found in this StackOverflow post (used as inspiration): Is there a way to pass arguments to multiple jobs in optuna?
What this code does is give every trial its own GPU, so GPU usage and distribution are better than with other methods. However, it often happens that a GPU gets stuck and just 'shuts itself off' and does not finish the trial, so the code never actually finishes running and that GPU is never freed.
Say, for example, that I am running 100 trials, then trial 1,2,3,4 get assigned GPUs 0,1,2,3 (not always in that order), and whenever a GPU is freed, say GPU 2, it takes on trial 5, etc. The issue is, it can happen that the trial that the GPU is assigned to 'quits' in the process and never finishes the trial, hence not taking on another trial and resulting in the run with many trials not completing.
I suspected a deadlock, but apparently Queue() is thread-safe (see: Is Python multiprocessing.Queue thread safe?).
Any clue on what can cause the hang and what I can look for?
I've encountered a mysterious bug while trying to implement Hogwild with torch.multiprocessing. In particular, one version of the code runs fine, but when I add in a seemingly unrelated bit of code before the multiprocessing step, this somehow causes an error during the multiprocessing step: RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork
I reproduced the error in a minimal code sample, pasted below. If I comment out the two lines of code m0 = Model(); train(m0) which carry out a non-parallel training run on a separate model instance, then everything runs fine. I can't figure out how these lines could be causing a problem.
I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.
import torch
import torch.multiprocessing as mp
from torch import nn

def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()

# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)

############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################

num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
print(model.x)
If you modify your code to create new processes like this:
processes = []
ctx = mp.get_context('spawn')
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
it seems to run fine (rest of code same as yours, tested on pytorch 1.5.0 / python 3.6 / NVIDIA T4 GPU).
I'm not completely sure what is carried over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), and/or deleting anything from the original, and/or making sure to delete any tensors and free up memory, and none of that made any difference.
What did make a difference was making sure that .backward() never got called outside of mp.Process() before it was called by a function within mp.Process(). I think what may be carried over is an autograd thread; if the thread exists before multiprocessing with the default fork method it fails, if the thread is created after fork it seems to work okay, and if using spawn it also works okay.
Btw: That's a really interesting question - thank you especially for digesting it to a minimal example!
You missed this:
if __name__ == '__main__':
which is very important for multi-processing!
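For illustration, a sketch of how the process-spawning part of the script from the question would look once wrapped by that guard (same code as above, just guarded):
# Guarding the spawning code keeps child processes (notably with the 'spawn'
# start method) from re-executing it when they import this module.
if __name__ == '__main__':
    num_processes = 2
    model = Model()
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    print(model.x)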
In the following code, it is absolutely imperative for me to execute the complete function on the GPU without a single jump back to the CPU. This is because I have 4 CPU cores but 1200 CUDA cores. Theoretically, it is possible because the TensorFlow feed-forwards, if statements, and variable assignments can all be done on the GPU (I have an NVIDIA GTX 1060).
The problem I'm facing is that TensorFlow 2.0 does this assignment to GPU and CPU automatically in the backend and doesn't mention which of its ops are GPU compatible. When I run the following function with the device set to GPU, I get
parallel_func could not be transformed and will be staged without change.
and it runs sequentially on the GPU.
My question is: where should I use tf.device? What part of the code will be converted by autograph into GPU code, and what will remain on the CPU? How can I convert that part to run on the GPU as well?
@tf.function
def parallel_func(self):
    for i in tf.range(114):        # want this parallel on GPU
        for count in range(320):   # want this sequential on GPU
            retrievedValue = self.data[i][count]
            if self.var[i] == 1:
                self.value[i] = retrievedValue    # assigns, if/else
            elif self.var[i] == -1:               # some links to class data through
                self.value[i] = -retrievedValue   # self.data, self.a and self.b
            state = tf.reshape(tf.Variable([self.a[i], self.b[i][count]]), [-1, 2])
            if self.workerSwitch == False:
                action = tf.math.argmax(self.feed_forward(i, count, state))
            else:
                action = tf.math.argmax(self.worker_feed_forward(i, count, state))
            if action == 1 or action == -1:
                self.actionCount += 1
Side note: the message parallel_func could not be transformed and will be staged without change is output by autograph, and since it contains data-dependent control flow, it's likely that the function can't run at all. It would be worth filing an issue with steps to reproduce and more detailed log messages.
In TensorFlow's guide about the performance of eager execution, there is a piece of code as follows:
import time

def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
        _ = x.numpy()  # Make sure to execute op and not just enqueue it
    end = time.time()
    return end - start

...

with tf.device("/cpu:0"):
    print("CPU: {} secs".format(measure(tf.random_normal(shape), steps)))

with tf.device("/gpu:0"):
    print("GPU: {} secs".format(measure(tf.random_normal(shape), steps)))
What is the meaning of the code on the line with the second comment, `_ = x.numpy()`?
If I comment out this line, will tf.matmul(x, x) not be executed on the CPU/GPU?
Technically, the call to tf.matmul can return before the matrix multiplication is complete.
In practice:
If executing on the CPU (and not using execution_mode=tf.contrib.eager.ASYNC), then tf.matmul returns only after the matrix multiplication has completed.
If executing on the GPU, then tf.matmul returns after enqueueing the matrix multiplication on the CUDA stream (see NVIDIA developer documentation for more information on streams)
The .numpy() call causes the result to be copied back to host memory (since numpy arrays must be backed by host and not GPU memory). In order to correctly do that, it has to wait for all compute operations enqueued on the CUDA stream to complete. Thus the .numpy() call is a means of ensuring "the CUDA stream has been processed". The intent there is to ensure that end - start accounts for the time it takes the operation to complete, not just be enqueued on the CUDA stream.
That said, that code snippet seems like it is over-estimating time executed on GPU since it also includes the time to copy to host after each step. That _ = x.numpy() line should be moved outside the for loop to get a more accurate measure (i.e., time to execute matrix multiplication steps times, then wait for the CUDA stream to finish, and copy to host memory once). Ideally, we would be able to exclude the time it takes to copy back to host memory.
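A sketch of that suggested change (the same measure function as above, with the host copy moved outside the loop):
import time
import tensorflow as tf

def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
    _ = x.numpy()  # wait for the CUDA stream (and copy to host) only once, after all steps
    end = time.time()
    return end - start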
Hope that makes sense.
I ran the model on a four-way GTX 1070 setup in Ubuntu. When I open a terminal to run the program and type python ... py --job_name="ps" --task_index=0, the memory of all four GPUs looks fully occupied, even though I have not yet opened a new terminal to run the worker. What is the problem?
That is how TensorFlow works. When it starts up with a GPU, it allocates almost all of the memory.
One small thing you could try is to limit the fraction of GPU memory that is allocated:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.5)
sess = tf.Session(config = tf.ConfigProto(gpu_options = gpu_options))
But it controls the memory of all GPUs together, so you cannot be sure how the memory will be split (if you put 0.25, it could take all of the memory on one GPU and none on another, or some other configuration).
I just ran into this problem recently. It might be because you used server = tf.train.Server(...) in your code and didn't pass a config argument,
so by default TF took all the memory of all your GPUs, and thus there's no memory left for the worker task.
The solution might be:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.5)
config = tf.ConfigProto(gpu_options = gpu_options)
server = tf.train.Server(..., config=config)
Anyway, it worked for me, hope that helps you.