How to measure time of a layer execution in TF Keras model? - python

I'm trying to find a way to efficiently and conveniently measure the execution time of a single layer in a TF Keras model. I've already searched quite a lot, but I haven't found a solution that fully satisfies me.
One similar thing I've come across was an approach used in PyTorch, which basically used the time library and, for the GPU case, also called torch.cuda.synchronize().
Can I do something similar in TF?
I've tried the following:
import time
import tensorflow as tf
def measure_layer_latency(layer, batch_size=4, runtimes=5):
    # input data shape including batch size
    input_shape = (batch_size,) + tuple(layer.input_shape[1:])
    total_time = .0
    for i in range(runtimes):
        x = tf.random.normal(input_shape)
        start = time.time()
        layer(x)
        finish = time.time()
        total_time += (finish-start)
    return total_time/float(runtimes)
but I'm afraid this isn't the right approach for GPU. Is it at least fine for computations on CPU?
I've also found information about eager execution mode in TensorFlow and the following method:
import time
def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
    # tf.matmul can return before completing the matrix multiplication
    # (e.g., can return after enqueueing the operation on a CUDA stream).
    # The x.numpy() call below will ensure that all enqueued operations
    # have completed (and will also copy the result to host memory,
    # so we're including a little more than just the matmul operation
    # time).
    _ = x.numpy()
    end = time.time()
    return end - start
Would anybody recommend this approach?
Finally, the most often recommended option is the TF Profiler, but for now I find it a bit uncomfortable for my task (maybe because I don't know it well).
What I would ultimately like to achieve is a model in which I can iterate over layers, change the number of filters in Conv operations, and calculate the latency of a single layer depending on the number of input and output channels. I will be grateful for any ideas!
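The two snippets above differ mainly in whether they wait for the device before stopping the clock. A minimal, framework-agnostic sketch of that pattern follows. The names time_callable and sync are hypothetical, not TensorFlow API. With TF one would presumably pass something like lambda: layer(x) as fn and lambda out: out.numpy() as sync, mirroring the second snippet.

```python
import time


def time_callable(fn, sync=None, warmup=2, runs=5):
    """Return the mean wall-clock seconds per call of fn().

    sync (a hypothetical hook) is called with fn's result before the clock
    stops, so that asynchronously enqueued work is included in the timing.
    """
    for _ in range(warmup):            # exclude one-time setup costs
        out = fn()
        if sync is not None:
            sync(out)
    start = time.perf_counter()
    for _ in range(runs):
        out = fn()
    if sync is not None:
        sync(out)                      # wait once, after the last call
    return (time.perf_counter() - start) / runs


# Stand-in workload; in practice this would be a layer call.
mean_s = time_callable(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms per call")
```

Waiting once after the loop, rather than after every call, keeps the cost of the synchronisation itself from being counted runs times.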

Related

PyTorch multiprocessing error with Hogwild

I've encountered a mysterious bug while trying to implement Hogwild with torch.multiprocessing. In particular, one version of the code runs fine, but when I add in a seemingly unrelated bit of code before the multiprocessing step, this somehow causes an error during the multiprocessing step: RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork
I reproduced the error in a minimal code sample, pasted below. If I comment out the two lines of code m0 = Model(); train(m0) which carry out a non-parallel training run on a separate model instance, then everything runs fine. I can't figure out how these lines could be causing a problem.
I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.
import torch
import torch.multiprocessing as mp
from torch import nn
def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()
# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))
    def forward(self, x):
        return torch.sum(self.x)
############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################
num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
print(model.x)
If you modify your code to create new processes like this:
processes = []
ctx = mp.get_context('spawn')
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
it seems to run fine (rest of code same as yours, tested on pytorch 1.5.0 / python 3.6 / NVIDIA T4 GPU).
I'm not completely sure what is carried over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), and/or deleting anything from the original, and/or making sure to delete any tensors and free up memory, and none of that made any difference.
What did make a difference was making sure that .backward() never got called outside of mp.Process() before it was called by a function within mp.Process(). I think what may be carried over is an autograd thread; if the thread exists before multiprocessing with the default fork method it fails, if the thread is created after fork it seems to work okay, and if using spawn it also works okay.
Btw: That's a really interesting question - thank you especially for digesting it to a minimal example!
You missed this:
if __name__ == '__main__':
which is very important for multi-processing!

Is there a way to pass arguments to multiple jobs in optuna?

I am trying to use optuna for searching hyper parameter spaces.
In one particular scenario I train a model on a machine with a few GPUs.
The model and batch size allows me to run 1 training per 1 GPU.
So, ideally I would like to let optuna spread all trials across the available GPUs
so that there is always 1 trial running on each GPU.
In the docs it says, I should just start one process per GPU in a separate terminal like:
CUDA_VISIBLE_DEVICES=0 optuna study optimize foo.py objective --study foo --storage sqlite:///example.db
I want to avoid that because the whole hyper parameter search continues in multiple rounds after that. I don't want to always manually start a process per GPU, check when all are finished, then start the next round.
I saw study.optimize has a n_jobs argument.
At first glance this seems to be perfect.
E.g. I could do this:
import optuna
def objective(trial):
    # the actual model would be trained here
    # the trainer here would need to know which GPU
    # it should be using
    best_val_loss = trainer(**trial.params)
    return best_val_loss
study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=8)
This starts multiple threads each starting a training.
However, the trainer within objective somehow needs to know which GPU it should be using.
Is there a trick to accomplish that?
After a few mental breakdowns I figured out that I can do what I want using a multiprocessing.Queue. To get it into the objective function I need to define it as a lambda function or as a class (I guess partial also works). E.g.
from contextlib import contextmanager
import multiprocessing
import optuna
from optuna.trial import Trial
N_GPUS = 2
class GpuQueue:
    def __init__(self):
        self.queue = multiprocessing.Manager().Queue()
        all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
        for idx in all_idxs:
            self.queue.put(idx)
    @contextmanager
    def one_gpu_per_process(self):
        current_idx = self.queue.get()
        yield current_idx
        self.queue.put(current_idx)
class Objective:
    def __init__(self, gpu_queue: GpuQueue):
        self.gpu_queue = gpu_queue
    def __call__(self, trial: Trial):
        with self.gpu_queue.one_gpu_per_process() as gpu_i:
            best_val_loss = trainer(**trial.params, gpu=gpu_i)
            return best_val_loss
if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(Objective(GpuQueue()), n_trials=100, n_jobs=8)
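One thing the queue approach above leaves open is what happens when a trial raises (for example, when it is pruned): the GPU index would not be put back. The sketch below is a standard-library variant that returns the index in a finally block. SlotPool, acquire, and job are hypothetical names, and a plain queue.Queue is used on the assumption that the jobs run as threads in one process, as the question describes for n_jobs.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class SlotPool:
    """Hands out device indices so that each is used by one job at a time."""

    def __init__(self, n_slots):
        self._free = queue.Queue()
        for idx in range(n_slots):
            self._free.put(idx)

    @contextmanager
    def acquire(self):
        idx = self._free.get()         # blocks until a slot is free
        try:
            yield idx
        finally:
            self._free.put(idx)        # returned even if the job raises


pool = SlotPool(n_slots=2)
in_use = set()
lock = threading.Lock()
clashes = []                            # filled only if two jobs share a slot


def job(i):
    with pool.acquire() as idx:
        with lock:
            if idx in in_use:
                clashes.append((i, idx))
            in_use.add(idx)
        # ... a real objective would train on device idx here ...
        with lock:
            in_use.discard(idx)
        return idx


with ThreadPoolExecutor(max_workers=8) as ex:
    used = list(ex.map(job, range(20)))

print("slots used:", sorted(set(used)), "clashes:", clashes)
```

In an Optuna objective, the body of job would correspond to the with block inside __call__.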
If you want a documented solution of passing arguments to objective functions used by multiple jobs, then Optuna docs present two solutions:
callable classes (it can be combined with multiprocessing),
lambda function wrapper (caution: simpler, but does not work with multiprocessing).
If you are prepared to take a few shortcuts, then you can skip some boilerplate by passing global values (constants such as number of GPUs used) directly (via python environment) to the __call__() method (rather than as arguments of __init__()).
The callable classes solution was tested to work (in optuna==2.0.0) with the two multiprocessing backends (loky/multiprocessing) and remote database backends (mariadb/postgresql).
To overcome the problem, I introduced a global variable that tracks which GPUs are currently in use, which can then be read in the objective function. The code looks like this:
import random
import time
import optuna
import torch
EPOCHS = n  # placeholder: set to the number of epochs
USED_DEVICES = []
def objective(trial):
    time.sleep(random.uniform(0, 2)) #used because all n_jobs start at the same time
    gpu_list = list(range(torch.cuda.device_count()))
    unused_gpus = [x for x in gpu_list if x not in USED_DEVICES]
    idx = random.choice(unused_gpus)
    USED_DEVICES.append(idx)
    unused_gpus.remove(idx)
    DEVICE = f"cuda:{idx}"
    model = define_model(trial).to(DEVICE)
    #... YOUR CODE ...
    for epoch in range(EPOCHS):
        # ... YOUR CODE ...
        if trial.should_prune():
            USED_DEVICES.remove(idx)
            raise optuna.exceptions.TrialPruned()
    #remove idx from list to reuse in next trial
    USED_DEVICES.remove(idx)

Tensor Inverse in parallel over multiple GPUs using PyTorch

I want to run torch.inverse() in parallel over multiple GPUs, but I am not able to.
I saw the post "Matmul on multiple GPUs", which goes over the process for matmul. It shows that if you have multiple tensors, each allocated to a different GPU, matmul will run in parallel. I was able to replicate this behavior for matmul, but when I try to do the same thing for torch.inverse() it seems to run sequentially when I check "watch nvidia-smi". Also, when I replace torch.inverse() with the torch fft function, I get parallel GPU usage. Any ideas?
import torch
ngpu = torch.cuda.device_count()
# This is the allocation to each GPU.
lis = []
for i in range(ngpu):
    lis.append(torch.rand(5000,5000,device = 'cuda:'+ str(i)))
# per the matmul on multiple GPUs post this should already be in parallel
# but doesnt seem to be based on watch nvidia-smi
C_ = []
for i in range(ngpu):
    C_.append(torch.inverse(lis[i]))
Edit: This can be compared to the FFT code(below) and the Matmul code in the link above.
import torch
ngpu = torch.cuda.device_count()
# This is the allocation to each GPU.
lis = []
for i in range(ngpu):
    lis.append(torch.rand(5000,5000,2,device = 'cuda:'+ str(i)))
# per the matmul on multiple GPUs post this should already be in parallel
# but doesnt seem to be based on watch nvidia-smi
C_ = []
for i in range(ngpu):
    C_.append(torch.fft(lis[i],2))

performance measurement in Tensorflow's eager mode

In tensorflow's guide about the performance of eager execution, there is a piece of code as follows:
import time
def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
        _ = x.numpy() # Make sure to execute op and not just enqueue it
    end = time.time()
    return end - start
...
with tf.device("/cpu:0"):
    print("CPU: {} secs".format(measure(tf.random_normal(shape), steps)))
with tf.device("/gpu:0"):
    print("GPU: {} secs".format(measure(tf.random_normal(shape), steps)))
What is the meaning of the code before the second comment, "_ = x.numpy()"?
If I comment out this line, will tf.matmul(x, x) not be executed on the CPU/GPU?
Technically, the call to tf.matmul can return before the matrix multiplication is complete.
In practice:
If executing on the CPU (and not using execution_mode=tf.contrib.eager.ASYNC), then tf.matmul returns only after the matrix multiplication has completed.
If executing on the GPU, then tf.matmul returns after enqueueing the matrix multiplication on the CUDA stream (see NVIDIA developer documentation for more information on streams)
The .numpy() call causes the result to be copied back to host memory (since numpy arrays must be backed by host and not GPU memory). In order to correctly do that, it has to wait for all compute operations enqueued on the CUDA stream to complete. Thus the .numpy() call is a means of ensuring "the CUDA stream has been processed". The intent there is to ensure that end - start accounts for the time it takes the operation to complete, not just be enqueued on the CUDA stream.
That said, that code snippet seems like it is over-estimating time executed on GPU since it also includes the time to copy to host after each step. That _ = x.numpy() line should be moved outside the for loop to get a more accurate measure (i.e., time to execute matrix multiplication steps times, then wait for the CUDA stream to finish, and copy to host memory once). Ideally, we would be able to exclude the time it takes to copy back to host memory.
Hope that makes sense.
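The difference between "enqueued" and "completed" can be imitated without a GPU. In the sketch below a single worker thread stands in for the CUDA stream, and waiting on the last future plays the role of the .numpy() call. The enqueue helper and the sleep durations are made up for illustration, so this is an analogy, not a measurement of TensorFlow itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

stream = ThreadPoolExecutor(max_workers=1)    # stand-in for a CUDA stream


def enqueue(seconds):
    """Queue some 'device work' and return immediately with a future."""
    return stream.submit(time.sleep, seconds)


steps, work = 5, 0.02

start = time.perf_counter()
for _ in range(steps):
    last = enqueue(work)
enqueue_only = time.perf_counter() - start    # clock stopped too early

last.result()                                 # plays the role of x.numpy()
with_sync = time.perf_counter() - start       # includes the queued work

print(f"enqueue only: {enqueue_only:.4f}s, with sync: {with_sync:.4f}s")
stream.shutdown()
```

Because the worker processes its queue in order, waiting for the last item once after the loop is enough to cover all earlier items, which is the same reasoning as moving the .numpy() call out of the loop.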

Clearing Tensorflow GPU memory after model execution

I've trained 3 models and am now running code that loads each of the 3 checkpoints in sequence and runs predictions using them. I'm using the GPU.
When the first model is loaded it pre-allocates the entire GPU memory (which I want for working through the first batch of data). But it doesn't unload memory when it's finished. When the second model is loaded, using both tf.reset_default_graph() and with tf.Graph().as_default() the GPU memory still is fully consumed from the first model, and the second model is then starved of memory.
Is there a way to resolve this, other than using Python subprocesses or multiprocessing to work around the problem (the only solution I've found via Google searches)?
A GitHub issue from June 2016 (https://github.com/tensorflow/tensorflow/issues/1727) indicates that there is the following problem:
currently the Allocator in the GPUDevice belongs to the ProcessState,
which is essentially a global singleton. The first session using GPU
initializes it, and frees itself when the process shuts down.
Thus the only workaround would be to use processes and shut them down after the computation.
Example Code:
# Written against the TensorFlow 1.x API; print/input updated to Python 3 syntax.
import tensorflow as tf
import multiprocessing
import numpy as np
def run_tensorflow():
    n_input = 10000
    n_classes = 1000
    # Create model
    def multilayer_perceptron(x, weight):
        # Hidden layer with RELU activation
        layer_1 = tf.matmul(x, weight)
        return layer_1
    # Store layers weight & bias
    weights = tf.Variable(tf.random_normal([n_input, n_classes]))
    x = tf.placeholder("float", [None, n_input])
    y = tf.placeholder("float", [None, n_classes])
    pred = multilayer_perceptron(x, weights)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for i in range(100):
            batch_x = np.random.rand(10, 10000)
            batch_y = np.random.rand(10, 1000)
            sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
    print("finished doing stuff with tensorflow!")
if __name__ == "__main__":
    # option 1: execute code with extra process
    p = multiprocessing.Process(target=run_tensorflow)
    p.start()
    p.join()
    # wait until user presses enter key
    input()
    # option 2: just execute the function
    run_tensorflow()
    # wait until user presses enter key
    input()
So if you would call the function run_tensorflow() within a process you created and shut the process down (option 1), the memory is freed. If you just run run_tensorflow() (option 2) the memory is not freed after the function call.
You can use numba library to release all the gpu memory
pip install numba
from numba import cuda
device = cuda.get_current_device()
device.reset()
This will release all the memory
I use numba to release GPU. With TensorFlow, I cannot find an effective method.
import tensorflow as tf
from numba import cuda
a = tf.constant([1.0,2.0,3.0],shape=[3],name='a')
b = tf.constant([1.0,2.0,3.0],shape=[3],name='b')
with tf.device('/gpu:1'):
    c = a+b
TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.1),
    allow_soft_placement=True)
sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
i=1
while(i<1000):
    i=i+1
    print(sess.run(c))
sess.close() # if don't use numba,the gpu can't be released
cuda.select_device(1)
cuda.close()
with tf.device('/gpu:1'):
    c = a+b
TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5),
    allow_soft_placement=True)
sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
while(1):
    print(sess.run(c))
There seem to be two ways to release GPU memory when you train models iteratively, or when you use a futures-based process pool for training, where the pool's processes are not killed when a future finishes. Either can be applied to the training step while preserving the main process:
Call a subprocess to run the model training. When one training phase completes, the subprocess exits and frees the memory. It's easy to get the return value.
Use multiprocessing.Process to run the model training (p.start()); p.join() then waits for the process to exit, which frees the memory.
Here is a helper function using multiprocessing.Process, which opens a new process to run your Python function and returns its value, instead of using subprocess:
import logging
import os
import sys
import traceback
from multiprocessing import Process, Queue
# assumed: the original did not show how logger was defined
logger = logging.getLogger(__name__)
# open a new process to run function
def process_run(func, *args):
    def wrapper_func(queue, *args):
        try:
            logger.info('run with process id: {}'.format(os.getpid()))
            result = func(*args)
            error = None
        except Exception:
            result = None
            ex_type, ex_value, tb = sys.exc_info()
            error = ex_type, ex_value, ''.join(traceback.format_tb(tb))
        queue.put((result, error))
    def process(*args):
        queue = Queue()
        p = Process(target=wrapper_func, args=[queue] + list(args))
        p.start()
        result, error = queue.get()
        p.join()
        return result, error
    result, error = process(*args)
    return result, error
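For the first of the two methods above (a subprocess), a minimal standard-library sketch might look like the following. CHILD_CODE, run_in_subprocess, and the convention of printing the result as JSON on the last line of stdout are assumptions made for illustration, and the placeholder values stand in for real training results.

```python
import json
import subprocess
import sys

# Code run in the child; a real script would build and train a model here.
CHILD_CODE = """
import json, os
result = {"pid": os.getpid(), "loss": 0.5}   # placeholder values
print(json.dumps(result))
"""


def run_in_subprocess(code):
    """Run code in a fresh interpreter and parse its last stdout line as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout.strip().splitlines()[-1])


result = run_in_subprocess(CHILD_CODE)
print(result["loss"])
```

Since the child is a separate interpreter that exits when its code finishes, anything it allocated goes away with it, which is the property both methods rely on.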
I am trying to figure out which option is better in Jupyter Notebook. Jupyter Notebook occupies the GPU memory permanently, even after a deep learning application has completed. This usually leads to a GPU Fan ERROR, which is a big headache. When that happens, I have to reset nvidia_uvm and reboot the Linux system regularly. I conclude that the following two options can remove the headache of the GPU Fan Error, but I want to know which is better.
Environment:
CUDA 11.0
cuDNN 8.0.1
TensorFlow 2.2
Keras 2.4.3
Jupyter Notebook 6.0.3
Miniconda 4.8.3
Ubuntu 18.04 LTS
First Option
Put the following code at the end of the cell. The kernel ends immediately once the application has finished running. But it is not very elegant: Jupyter will pop up a message about the dead kernel.
import os
pid = os.getpid()
!kill -9 $pid
Second Option
The following code can also end the kernel in Jupyter Notebook. I do not know whether numba is safe. Nvidia prefers GPU "0", which is the GPU most used by individual developers (as opposed to servers). However, both Neil G and mradul dubey have responded that this leaves the GPU in a bad state.
from numba import cuda
cuda.select_device(0)
cuda.close()
It seems that the second option is more elegant. Can someone confirm which is the better choice?
Notes:
Automatically releasing the GPU memory is not a problem in an Anaconda environment when directly executing "$ python abc.py". However, I sometimes need to use Jupyter Notebook to handle .ipynb applications.
I was able to solve an OOM error just now with the garbage collector.
import gc
gc.collect()
model.evaluate(x1, y1)
gc.collect()
model.evaluate(x2, y2)
gc.collect()
etc.
Based on what Yaroslav Bulatov said in their answer (that tf deallocates GPU memory when the object is destroyed), I surmised that it could just be that the garbage collector hadn't run yet. Forcing it to collect freed me up, so that might be a good way to go.
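Why a forced collection can help may be easier to see with a toy example. The sketch below only shows CPython's behaviour for objects caught in a reference cycle; whether the TensorFlow objects in question actually sit in such cycles is an assumption, and Node is a made-up class.

```python
import gc
import weakref


class Node:
    """Toy object standing in for something that holds a large buffer."""

    def __init__(self):
        self.other = None


gc.disable()                  # keep the automatic collector out of the way
a, b = Node(), Node()
a.other, b.other = b, a       # reference cycle: refcounts never reach zero
probe = weakref.ref(a)        # lets us check whether the object still exists

del a, b
alive_before = probe() is not None    # the cycle is still in memory

gc.collect()
alive_after = probe() is not None     # the collector has reclaimed the cycle

gc.enable()
print(alive_before, alive_after)
```

Objects like these are only destroyed when the cyclic collector runs, so any resource they release in their destructor stays held until then.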
GPU memory allocated by tensors is released (back into TensorFlow memory pool) as soon as the tensor is not needed anymore (before the .run call terminates). GPU memory allocated for variables is released when variable containers are destroyed. In case of DirectSession (ie, sess=tf.Session("")) it is when session is closed or explicitly reset (added in 62c159ff)
I was training my models in a for loop with different parameters when I got this error after 120 models had been trained. Afterwards I could not even train a simple model unless I killed the kernel.
I was able to solve my issue by adding the following line before building the model:
tf.keras.backend.clear_session()
(see https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session)
To free my resources, I use:
import os, signal
os.kill(os.getpid(), signal.SIGKILL)
