In tensorflow's guide about the performance of eager execution, there is a piece of code as follows:
import time

def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
        _ = x.numpy()  # Make sure to execute op and not just enqueue it
    end = time.time()
    return end - start

...

with tf.device("/cpu:0"):
    print("CPU: {} secs".format(measure(tf.random_normal(shape), steps)))

with tf.device("/gpu:0"):
    print("GPU: {} secs".format(measure(tf.random_normal(shape), steps)))
What is the meaning of the line of code before the second comment, _ = x.numpy()?
If I comment out this line, will tf.matmul(x, x) not be executed on the CPU/GPU?
Technically, the call to tf.matmul can return before the matrix multiplication is complete.
In practice:
If executing on the CPU (and not using execution_mode=tf.contrib.eager.ASYNC), then tf.matmul returns only after the matrix multiplication has completed.
If executing on the GPU, then tf.matmul returns after enqueueing the matrix multiplication on the CUDA stream (see NVIDIA developer documentation for more information on streams).
The .numpy() call causes the result to be copied back to host memory (since numpy arrays must be backed by host and not GPU memory). In order to correctly do that, it has to wait for all compute operations enqueued on the CUDA stream to complete. Thus the .numpy() call is a means of ensuring "the CUDA stream has been processed". The intent there is to ensure that end - start accounts for the time it takes the operation to complete, not just be enqueued on the CUDA stream.
That said, that code snippet seems to over-estimate the time executed on the GPU, since it also includes the time to copy the result to the host after each step. That _ = x.numpy() line should be moved outside the for loop to get a more accurate measure (i.e., execute the matrix multiplication steps times, then wait for the CUDA stream to finish, and copy to host memory once), as sketched below. Ideally, we would be able to exclude the time it takes to copy back to host memory altogether.
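For example, a minimal sketch of that adjustment (same assumptions as the snippet above, with shape and steps defined elsewhere):

import time
import tensorflow as tf

def measure_sync_once(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
    # Synchronize and copy to host once, after all steps have been enqueued.
    _ = x.numpy()
    end = time.time()
    return end - start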
Hope that makes sense.
Related
In my Python application I am using Detectron2 to run prediction on an image and detect the key-points of all the humans in the image.
I want to run the prediction on frames that are streamed to my app live (using aiortc), but I discovered that the prediction time is much worse, because it now runs on a new thread (the main thread is occupied with the server).
Running predictions on a thread takes anywhere between 1.5 and 4 seconds, which is a lot.
When running the predictions on the main thread (without the video streaming part), I get prediction times of less than a second.
My question is why this happens and how I can fix it. Why is GPU performance degraded so drastically when it is used from a new thread?
Notes:
The code is tested in Google Colab with Tesla P100 GPU and the video streaming is emulated by reading frames from a video file.
I calculate the time it takes to run prediction on a frame using the code in this question.
I tried switching to multiprocessing instead, but couldn't make it work with CUDA (I tried both import multiprocessing and import torch.multiprocessing with set_start_method('spawn')); it just gets stuck when calling start on the process.
Example code:
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
import threading
from typing import List
import numpy as np
import timeit
import time
import random
import cv2

# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cuda"

predictor = DefaultPredictor(cfg)

def get_frames(video: cv2.VideoCapture):
    frames = list()
    while True:
        has_frame, frame = video.read()
        if not has_frame:
            break
        frames.append(frame)
    return frames

class CodeTimer:
    # Source: https://stackoverflow.com/a/52749808/9977758
    def __init__(self, name=None):
        self.name = " '" + name + "'" if name else ''

    def __enter__(self):
        self.start = timeit.default_timer()

    def __exit__(self, exc_type, exc_value, traceback):
        self.took = (timeit.default_timer() - self.start) * 1000.0
        print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')

video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()

def send_frames():
    # This function emulates the stream, so here we "get" a frame and add it to our buffer
    for frame in get_frames(video):
        frames_buffer.append(frame)
        # Simulate delays between frames
        time.sleep(random.uniform(0.3, 2.1))

def predict_frames():
    predicted_frames = 0  # The number of frames predicted so far
    while predicted_frames < num_frames:  # Stop after we predicted all frames
        buffer_length = len(frames_buffer)
        if buffer_length <= predicted_frames:
            continue  # Wait until we get a new frame

        # Read all the frames from the point we stopped
        for frame in frames_buffer[predicted_frames:]:
            # Measure the prediction time
            with CodeTimer('In stream prediction'):
                predictions.append(predictor(frame))
            predicted_frames += 1

t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()
Python threads are subject to the GIL, which must be acquired by any C binding that touches Python objects. GPU computing libraries typically use C bindings, and they can hold the GIL from time to time, pausing Python code execution in the other threads.
It is a wild guess, but it is possible that the predictor function, which has to go through C and acquire the GIL, finds itself waiting for the other thread that is writing to the video buffer. Depending on how the computation is broken down and how Python schedules your other thread, the impact on performance may become visible.
You may:
avoid multi-threading by performing the reading and the prediction in the same thread;
use multiprocessing so that the GIL does not interfere between the two processes (a sketch follows this list);
code this in a native language such as C or C++.
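A minimal sketch of the multiprocessing route, under the assumption that the hang mentioned in the question comes from CUDA being initialized in the parent before the child starts: the predictor is built inside the worker process, frames go in through one queue, and CPU-side results come back through another. build_predictor is a hypothetical helper that repeats the cfg setup from the question.

import multiprocessing as mp

def build_predictor():
    # Hypothetical helper: repeats the cfg / DefaultPredictor setup from the
    # question, so CUDA is only initialized inside the worker process.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.DEVICE = "cuda"
    return DefaultPredictor(cfg)

def prediction_worker(frame_queue, result_queue):
    predictor = build_predictor()
    while True:
        frame = frame_queue.get()
        if frame is None:  # sentinel: stream is over
            break
        outputs = predictor(frame)
        result_queue.put(outputs["instances"].to("cpu"))  # move off the GPU before pickling

if __name__ == "__main__":
    mp.set_start_method("spawn")  # needed for CUDA in a child process
    frame_q, result_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=prediction_worker, args=(frame_q, result_q))
    worker.start()
    # The streaming code stays in the main process and only does frame_q.put(frame)
    # and result_q.get(); here we just shut the worker down again.
    frame_q.put(None)
    worker.join()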
The problem lies in your hardware, your libraries, or the differences between your example code and your real code.
I implemented your code on an Nvidia Jetson Xavier. I installed all needed libraries using the following commands:
# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate
#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl
# torchvision
pip install 'git+https://github.com/pytorch/vision.git#v0.9.0'
# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# ipython bindings (optional)
pip install ipykernel cloudpickle
# opencv
pip install opencv-python
After that I ran your example script on an example video and received the following output:
Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms
After the first pass, the detector consistently takes around 400 ms to run the detection, which seems about right for a Jetson Xavier. I do not experience the slowdown you described.
I have to note that the Jetson is a specific piece of hardware: its RAM is shared between the CPU and the GPU, so I do not have to transfer the data from CPU to GPU memory. If your slowdown is caused by the transfer between CPU and GPU memory, I would not experience that problem in my setup.
Without seeing the full code, here are a few suggestions:
You might be running into the overhead of starting new threads every time, so explore the option of a thread pool instead of starting a new thread for each frame (a sketch follows these notes).
If you are not moving the workload to the GPU, that means the task is CPU-bound and Python threads are not the right tool for it. For CPU-intensive tasks you should use https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing
Some operations are I/O bound. For example, each cv2.imread call incurs I/O overhead. You can read this article, which says: "Not all algorithms can be made parallel and distributed to all cores of a processor — some algorithms are simply single threaded in nature."
This means that multiprocessing for computer vision algorithms must be global: a single operation (such as imread) will not be improved by multithreading. However, you will sometimes gain speed by performing other operations in parallel because they are not limited by I/O or anything else. At this point, you will probably see an overall speedup:
If you run a single imread:
non-multithreaded: 5 ms = cost of imread
multithreaded: 7 ms = cost of multithreading + cost of imread
But if you run operations that can be multithreaded:
non-multithreaded: 5 ms + 10 ms = cost of imread + cost of operation
multithreaded: 2 ms + 5 ms + 5 ms = cost of multithreading + cost of imread + cost of parallel operations
(these figures are not real; they are just to illustrate what I mean)
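As a small, hedged sketch of the thread-pool suggestion above (assuming a predictor callable and a list of frames as in the question): the pool is created once, so the thread start-up cost is not paid per frame.

from concurrent.futures import ThreadPoolExecutor

def predict_all(predictor, frames):
    # One long-lived worker thread instead of a new thread per frame.
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(predictor, frame) for frame in frames]
        return [future.result() for future in futures]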
I'm trying to find a way to efficiently and comfortably measure the execution time of a single layer in a TF Keras model. I've already searched quite a lot, but I haven't found a solution that fully satisfies me.
One similar thing I've come across was an approach used in PyTorch, which basically uses the time library and, for the GPU case, also calls the torch.cuda.synchronize() function.
Can I do something similar in TF?
I've tried the following:
import time
import tensorflow as tf

def measure_layer_latency(layer, batch_size=4, runtimes=5):
    input_shape = (batch_size,) + tuple(layer.input_shape[1:])  # input data shape including batch size
    total_time = .0
    for i in range(runtimes):
        x = tf.random.normal(input_shape)
        start = time.time()
        layer(x)
        finish = time.time()
        total_time += (finish - start)
    return total_time / float(runtimes)
but I'm afraid it won't be the right choice for the GPU. Is it at least fine for calculations on the CPU?
I've also found information about Eager Execution mode in TensorFlow and the following method:
import time

def measure(x, steps):
    # TensorFlow initializes a GPU the first time it's used, exclude from timing.
    tf.matmul(x, x)
    start = time.time()
    for i in range(steps):
        x = tf.matmul(x, x)
    # tf.matmul can return before completing the matrix multiplication
    # (e.g., can return after enqueuing the operation on a CUDA stream).
    # The x.numpy() call below will ensure that all enqueued operations
    # have completed (and will also copy the result to host memory,
    # so we're including a little more than just the matmul operation
    # time).
    _ = x.numpy()
    end = time.time()
    return end - start
Would anybody recommend such an approach?
The last and most often recommended option is the TF Profiler, but for now I find it a bit inconvenient for my task (maybe because I don't know it well).
Finally, what I would like to achieve is a model in which I can iterate over the layers, change the number of filters in the Conv operations, and calculate the latency of a single layer depending on the number of input and output channels. I will be grateful for any ideas!
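For what it's worth, a minimal sketch that combines the per-layer timer with the .numpy() synchronization trick from the snippet above might look like this (the Conv2D layer is just a hypothetical stand-in):

import time
import tensorflow as tf

def measure_layer_latency_sync(layer, input_shape, runs=5):
    x = tf.random.normal(input_shape)
    layer(x)                       # build / warm up the layer outside the timing
    start = time.time()
    for _ in range(runs):
        y = layer(x)
    _ = y.numpy()                  # block until all enqueued GPU work has finished
    return (time.time() - start) / runs

conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same")
print(measure_layer_latency_sync(conv, input_shape=(4, 224, 224, 3)))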
The code below compares computation time on the CPU vs GPU. Only for the first execution, I get slower runtime on GPU than CPU, and in all subsequent runs the GPU is much faster. Why is the first run on GPU slow? How do I make even the first run on GPU fast?
from __future__ import absolute_import, division, print_function
import tensorflow as tf
tf.enable_eager_execution()
import time

def time_matmul(x):
    start = time.time()
    for loop in range(10):
        tf.matmul(x, x)
    result = time.time() - start
    print("10 loops: {:0.2f}ms".format(1000 * result))

print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
    with tf.device("GPU:0"):  # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
        x = tf.random_uniform([1000, 1000])
        assert x.device.endswith("GPU:0")
        time_matmul(x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
    x = tf.random_uniform([1000, 1000])
    assert x.device.endswith("CPU:0")
    time_matmul(x)
Output on first run:
On GPU:
10 loops: 443.04ms
On CPU:
10 loops: 100.01ms
Output on subsequent runs:
On GPU:
10 loops: 1.00ms
On CPU:
10 loops: 103.01ms
PS: This is different from a seemingly related question because tf.device("GPU:0") already chooses /device:GPU:0 and not /device:XLA_GPU:0
Out of curiosity I tried the OP's script 3 years later. The same thing happens on the latest versions of TF and CUDA (albeit on an old GTX 1050 card). A possible explanation is data movement.
On the first run, whether on GPU or CPU, data moves around to get ready for the action. Data movement is well known to slow things down significantly. CPU memory is physically "closer" than GPU memory, the latter usually sitting on an external board. The default compute device is the CPU and its memory, so a program is almost ready for CPU runs: there is little or nothing to move, and everything basically stays on the same chip. GPU memory is a physically different chip, "far away", so moving data there is likely to take much more time.
This thinking can be supported by looping over the OP's script (slightly changed to match TF 2.9.1):
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
import time

def time_matmul(run, x):
    start = time.time()
    for loop in range(10):
        tf.matmul(x, x)
    result = time.time() - start
    print(f"Run #{run}: {1000*result:0.2f}ms")

print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
    with tf.device("GPU:0"):  # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
        x = tf.random.uniform([1000, 1000])
        assert x.device.endswith("GPU:0")
        for run in range(10):
            time_matmul(run, x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("CPU:0")
    for run in range(10):
        time_matmul(run, x)
Which results in:
On GPU:
Run #0: 273.66ms
Run #1: 0.37ms
Run #2: 0.36ms
Run #3: 0.36ms
Run #4: 0.37ms
Run #5: 0.36ms
Run #6: 0.35ms
Run #7: 0.41ms
Run #8: 0.37ms
Run #9: 0.35ms
On CPU:
Run #0: 56.89ms
Run #1: 44.31ms
Run #2: 47.60ms
Run #3: 46.97ms
Run #4: 46.40ms
Run #5: 44.84ms
Run #6: 43.88ms
Run #7: 45.28ms
Run #8: 43.46ms
Run #9: 43.57ms
Eyeballing what happens (a proper statistical approach would be to run this many times; I did, with no further insight), the first run is slow, but the following runs are faster and, more importantly, stable. Stability is what we expect in the first place (running the same thing should behave the same), but the first run has to be set up by placing the data at the "right" place in memory.
I am not aware of APIs to place the data manually and then start the runs, but that would be an "illusion" anyway. Run #0 here includes both the movement and the computation. Splitting the two would likely make Run #0 as fast as all the other runs, yet we would still have to move the data beforehand; the required time just would not show up in the result table.
Please note that this memory movement is a likely cause (abductive reasoning here), and there might be something else going on. The thinking is supported by the script's results, yet they only allow us to conclude that memory movement is a likely cause. This post proves nothing; a proper analysis of the root cause would require more time with a profiler (and a Python profiler may not be enough).
That disclaimer aside, it really looks like a memory-movement cost we are observing here.
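If the goal is simply to keep even the first timed run representative, a small, hedged workaround is an untimed warm-up pass on the device before the measured loop, in the spirit of the "exclude from timing" comment in the very first snippet (this assumes a visible GPU and TF 2.x eager execution):

import time
import tensorflow as tf

def time_matmul(run, x):
    start = time.time()
    for _ in range(10):
        y = tf.matmul(x, x)
    _ = y.numpy()                     # wait for enqueued work before stopping the clock
    print(f"Run #{run}: {1000 * (time.time() - start):0.2f}ms")

with tf.device("GPU:0"):
    x = tf.random.uniform([1000, 1000])
    _ = tf.matmul(x, x).numpy()       # untimed warm-up: device init and data placement
    for run in range(10):
        time_matmul(run, x)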
In the following code, it is absolutely imperative for me to execute the complete function on the GPU, without a single jump back to the CPU. This is because I have 4 CPU cores but 1200 CUDA cores. Theoretically, it is possible, because the TensorFlow feed_forwards, the if statements, and the variable assigns can all be done on the GPU (I have an NVIDIA GTX 1060).
The problem I'm facing is that TensorFlow 2.0 does this assignment to GPU and CPU automatically in the backend and doesn't say which of its ops are GPU compatible. When I run the following function with the device set to GPU, I get
parallel_func could not be transformed and will be staged without change.
and it runs sequentially on the GPU.
My question is: where should I use tf.device? What part of the code will be converted by AutoGraph to GPU code, and what will remain on the CPU? How can I convert that to the GPU as well?
@tf.function
def parallel_func(self):
    for i in tf.range(114):                      # want this parallel on GPU
        for count in range(320):                 # want this sequential on GPU
            retrievedValue = self.data[i][count]
            if self.var[i] == 1:
                self.value[i] = retrievedValue   # assigns, if else
            elif self.var[i] == -1:              # some links to class data through
                self.value[i] = -retrievedValue  # self.data, self.a and self.b
            state = tf.reshape(tf.Variable([self.a[i], self.b[i][count]]), [-1, 2])
            if self.workerSwitch == False:
                action = tf.math.argmax(self.feed_forward(i, count, state))
            else:
                action = tf.math.argmax(self.worker_feed_forward(i, count, state))
            if action == 1 or action == -1:
                self.actionCount += 1
Side note: the message parallel_func could not be transformed and will be staged without change is output by AutoGraph, and since the function contains data-dependent control flow, it's likely that it can't run at all. It would be worth filing an issue with steps to reproduce and more detailed log messages.
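As a small, hedged aside on "which of its ops are GPU compatible": TensorFlow can log the device chosen for every op, which at least shows what actually ran on the GPU versus the CPU.

import tensorflow as tf

tf.debugging.set_log_device_placement(True)   # print the device picked for each op

with tf.device("GPU:0"):                      # assumes a GPU is visible
    a = tf.random.uniform([4, 4])
    b = tf.matmul(a, a)                       # the log reports where MatMul executed
print(b)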
I've trained 3 models and am now running code that loads each of the 3 checkpoints in sequence and runs predictions using them. I'm using the GPU.
When the first model is loaded it pre-allocates the entire GPU memory (which I want, for working through the first batch of data), but it doesn't release the memory when it's finished. When the second model is loaded, even using both tf.reset_default_graph() and with tf.Graph().as_default(), the GPU memory is still fully consumed by the first model, and the second model is then starved of memory.
Is there a way to resolve this, other than using Python subprocesses or multiprocessing to work around the problem (the only solution I've found via google searches)?
A git issue from June 2016 (https://github.com/tensorflow/tensorflow/issues/1727) indicates that there is the following problem:
currently the Allocator in the GPUDevice belongs to the ProcessState,
which is essentially a global singleton. The first session using GPU
initializes it, and frees itself when the process shuts down.
Thus the only workaround would be to use processes and shut them down after the computation.
Example Code:
import tensorflow as tf
import multiprocessing
import numpy as np
def run_tensorflow():
n_input = 10000
n_classes = 1000
# Create model
def multilayer_perceptron(x, weight):
# Hidden layer with RELU activation
layer_1 = tf.matmul(x, weight)
return layer_1
# Store layers weight & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
pred = multilayer_perceptron(x, weights)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for i in range(100):
batch_x = np.random.rand(10, 10000)
batch_y = np.random.rand(10, 1000)
sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
print "finished doing stuff with tensorflow!"
if __name__ == "__main__":
# option 1: execute code with extra process
p = multiprocessing.Process(target=run_tensorflow)
p.start()
p.join()
# wait until user presses enter key
raw_input()
# option 2: just execute the function
run_tensorflow()
# wait until user presses enter key
raw_input()
So if you call the function run_tensorflow() within a process you created and then shut the process down (option 1), the memory is freed. If you just run run_tensorflow() (option 2), the memory is not freed after the function call.
You can use the numba library to release all the GPU memory
pip install numba
from numba import cuda
device = cuda.get_current_device()
device.reset()
This will release all the memory
I use numba to release the GPU memory. With TensorFlow, I cannot find an effective method.
import tensorflow as tf
from numba import cuda

a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.1),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
i = 1
while i < 1000:
    i = i + 1
    print(sess.run(c))

sess.close()  # if you don't use numba, the GPU memory can't be released
cuda.select_device(1)
cuda.close()

with tf.device('/gpu:1'):
    c = a + b

TF_CONFIG = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5),
    allow_soft_placement=True)

sess = tf.Session(config=TF_CONFIG)
sess.run(tf.global_variables_initializer())
while True:
    print(sess.run(c))
There now seem to be two ways to resolve this, whether you are training models iteratively or serving the training from a futures-based multiprocessing pool, where a pool process is not killed when its future finishes. You can apply either method in the training process to release GPU memory while preserving the main process.
Call a subprocess to run the model training. When one training phase is completed, the subprocess exits and frees the memory, and it is easy to get the return value.
Call multiprocessing.Process(p) to run the model training (p.start()); p.join() indicates that the process has exited and its memory has been freed.
Here is a helper function using multiprocessing.Process which can open a new process to run a Python function and return its value, instead of using subprocess:
# open a new process to run function
import os
import sys
import logging
import traceback
from multiprocessing import Process, Queue

logger = logging.getLogger(__name__)

def process_run(func, *args):
    def wrapper_func(queue, *args):
        try:
            logger.info('run with process id: {}'.format(os.getpid()))
            result = func(*args)
            error = None
        except Exception:
            result = None
            ex_type, ex_value, tb = sys.exc_info()
            error = ex_type, ex_value, ''.join(traceback.format_tb(tb))
        queue.put((result, error))

    def process(*args):
        queue = Queue()
        p = Process(target=wrapper_func, args=[queue] + list(args))
        p.start()
        result, error = queue.get()
        p.join()
        return result, error

    result, error = process(*args)
    return result, error
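Hypothetical usage of the helper above, where train_once stands in for whatever training function should give its GPU memory back on return:

def train_once(steps):
    # stand-in body: a real version would build and train a model here
    return {"steps": steps}

result, error = process_run(train_once, 100)
print(result, error)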
I am trying to figure out which option is better in a Jupyter Notebook. Jupyter Notebook occupies the GPU memory permanently even after a deep learning application has completed. It usually incurs a GPU Fan ERROR, which is a big headache: in that condition I have to reset nvidia_uvm and reboot the Linux system regularly. I conclude that the following two options can remove the headache of the GPU Fan Error, but I want to know which one is better.
Environment:
CUDA 11.0
cuDNN 8.0.1
TensorFlow 2.2
Keras 2.4.3
Jupyter Notebook 6.0.3
Miniconda 4.8.3
Ubuntu 18.04 LTS
First Option
Put the following code at the end of the cell. The kernel ends immediately once the application run is completed. It is not very elegant, though: Jupyter will pop up a message about the dead kernel.
import os
pid = os.getpid()
!kill -9 $pid
Second Option
The following code can also end the kernel from within the Jupyter Notebook. I do not know whether numba is safe. Nvidia prefers the "0" GPU, which is the GPU most used by individual developers (not server folks). However, both Neil G and mradul dubey have responded that this leaves the GPU in a bad state.
from numba import cuda
cuda.select_device(0)
cuda.close()
It seems that the second option is more elegant. Can someone confirm which is the best choice?
Notes:
Automatically releasing the GPU memory is not a problem in an Anaconda environment when directly executing "$ python abc.py". However, I sometimes need to use Jupyter Notebook to handle .ipynb applications.
I was able to solve an OOM error just now with the garbage collector.
import gc
gc.collect()
model.evaluate(x1, y1)
gc.collect()
model.evaluate(x2, y2)
gc.collect()
etc.
Based on what Yaroslav Bulatov said in their answer (that tf deallocates GPU memory when the object is destroyed), I surmised that it could just be that the garbage collector hadn't run yet. Forcing it to collect freed me up, so that might be a good way to go.
GPU memory allocated by tensors is released (back into the TensorFlow memory pool) as soon as the tensor is not needed anymore (before the .run call terminates). GPU memory allocated for variables is released when variable containers are destroyed. In the case of DirectSession (i.e., sess = tf.Session("")) that is when the session is closed or explicitly reset (added in 62c159ff).
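A small sketch of that lifetime with the TF 1.x API used elsewhere in this thread (hedged: exactly when memory goes back to the allocator's pool is an implementation detail):

import tensorflow as tf  # TF 1.x style API

with tf.Session() as sess:
    v = tf.Variable(tf.random_normal([1000, 1000]))
    sess.run(tf.global_variables_initializer())
    sess.run(v)          # the variable's GPU memory is held by this session's containers
# leaving the block closes the session, destroying the variable containers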
I trained my models in a for loop over different parameters and got this error after 120 models had been trained. Afterwards I could not even train a simple model unless I killed the kernel.
I was able to solve my issue by adding the following line before building the model:
tf.keras.backend.clear_session()
(see https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session)
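A minimal sketch of that pattern, with hypothetical layer sizes, just to show where the call sits in a parameter sweep:

import tensorflow as tf

for units in (32, 64, 128):
    tf.keras.backend.clear_session()   # drop graph state left by the previous model
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(...) would go here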
To free my resources, I use:
import os, signal
os.kill(os.getpid(), signal.SIGKILL)