Multithreading degrades GPU performance

Multithreading degrades GPU performance - python

In my Python application I am using Detectron2 to run prediction on an image and detect the key-points of all the humans in the image.
I want to run the prediction on frames that are streamed to my app live (using aiortc), but I discovered that the predictions time is much worse because it now runs on a new thread (the main thread is occupied with the server).
Running predictions on a thread takes anywhere between 1.5 to 4 seconds, which is a lot.
When running the predictions on the main-thread (without the video streaming part), I get predictions times of less than a second.
My question is why it happens and how can I fix it¿ Why the GPU performance is degraded so drastically when using it from a new thread¿
Notes:
The code is tested in Google Colab with Tesla P100 GPU and the video streaming is emulated by reading frames from a video file.
I calculate the time it takes to run prediction on a frame using the code in this question.
I tried switching to multiprocessing instead, but couldn't make it work with cuda (I tried both import multiprocessing as well as import torch.multiprocessing with set_stratup_method('spawn')) it just gets stuck when calling start on the process.
Example code:
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
import threading
from typing import List
import numpy as np
import timeit
import cv2
# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7 # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)
def get_frames(video: cv2.VideoCapture):
frames = list()
while True:
has_frame, frame = video.read()
if not has_frame:
break
frames.append(frame)
return frames
class CodeTimer:
# Source: https://stackoverflow.com/a/52749808/9977758
def __init__(self, name=None):
self.name = " '" + name + "'" if name else ''
def __enter__(self):
self.start = timeit.default_timer()
def __exit__(self, exc_type, exc_value, traceback):
self.took = (timeit.default_timer() - self.start) * 1000.0
print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')
video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()
def send_frames():
# This function emulates the stream, so here we "get" a frame and add it to our buffer
for frame in get_frames(video):
frames_buffer.append(frame)
# Simulate delays between frames
time.sleep(random.uniform(0.3, 2.1))
def predict_frames():
predicted_frames = 0 # The number of frames predicted so far
while predicted_frames < num_frames: # Stop after we predicted all frames
buffer_length = len(frames_buffer)
if buffer_length <= predicted_frames:
continue # Wait until we get a new frame
# Read all the frames from the point we stopped
for frame in frames_buffer[predicted_frames:]:
# Measure the prediction time
with CodeTimer('In stream prediction'):
predictions.append(predictor(frame))
predicted_frames += 1
t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()

Python threads rely on the GIL which must be locked by all C bindings trying to access Python objects. GPU computing libraries typically use C bindings, and could potentially lock the GIL from time to time and thus pause Python code execution.
It is a wild guess, but this is possible that the predictor function, which needs to go through C and a lock of the GIL finds itself waiting for the other threads that are writing the video buffers. Then depending on how the computation is broken down and how Python juggles with your other thread, I suppose the impact on performance may become visible.
You may:
avoid multi-threading by performing the reading and the prediction in the same thread.
use multi-processing so that the GIL does not interfere between the two processes
code this in a native language such as C, C++...

The problem is in: your hardware, your libraries or, in the differences between your example code and the real code.
I implemented your code on an Nvidia Jetson Xavier. I installed all needed libraries using the following commands:
# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate
#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl
# torchvision
pip install 'git+https://github.com/pytorch/vision.git#v0.9.0'
# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# ipython bindings (optional)
pip install ipykernel cloudpickle
# opencv
pip install opencv-python
After that I run your example script on an example video and received the following output:
Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms
After the first pass, the detector consistently takes around 400ms to run the detection. Which seems about right for an Jetson Xavier. I do not experience the slowdown you described.
I have to note that the Jetson is a specific piece of hardware. In this hardware the RAM memory is shared between the CPU and the GPU. Therefore I do not have to transfer the data from CPU to GPU. So if your slow down is caused by the transfer between CPU and GPU memory, I will not experience this problem in my setup.

Not seeing the full code, here is a few suggestions:
You might be running into overhead of starting new threads every time. So explore option of the thread pool instead of starting new threads every time.
If you are not moving workload to GPU - that means it's CPU bound task and Python threads is not the right tool for the task. For CPU intensive tasks you should be using https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing

Some operations are I/O bound. For example, each cv2.imread call results in I/O overhead. You can read this article which says : "Not all algorithms can be made parallel and distributed to all cores of a processor — some algorithms are simply single threaded in nature."
This means that multiprocessing for computer vision algorithms must be global: a single operation (such as imread) will not be improved by multithreading. However, you will sometimes gain speed by performing other operations in parallel because they are not limited by I/O or anything else. At this point, you will probably see an overall speedup:
If you run single imread:
non-multithreaded: 5 ms = cost of imread
multithreaded: 7 ms = cost of multithreading + cost of imread
But if you run operations that can be multithreaded :
non multithreaded: 5 ms + 10 ms = cost of imread + cost of operation
multi-threaded: 2ms + 5 ms + 5 ms = cost of multithreading + cost of imread + cost of parallel operations
(these figures are not real, they are just to illustrate what I mean)

Related

Faster Startup of Processes Python

I'm trying to run two functions in Python3 in parallel. They both take about 30ms, and unfortunately, after writing a testing script, I've found that the startup-time to get the processes running in the background takes over 100ms which is a pretty high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python3 (having a lower overhead -- ideally in the ones or tens of milliseconds) where I can still get the results of their functions in the main process. Any guidance on this would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np
def t(s):
return (time.perf_counter() - s) * 1000
def run0(s):
print(f"Time to reach run0: {t(s):.2f}ms")
time.sleep(0.03)
return np.ones((1,4))
def run1(s):
print(f"Time to reach run1: {t(s):.2f}ms")
time.sleep(0.03)
return np.zeros((1,5))
def main():
s = time.perf_counter()
with mp.Pool(processes=2) as p:
print(f"Time to init pool: {t(s):.2f}ms")
f0 = p.apply_async(run0, args=(time.perf_counter(),))
f1 = p.apply_async(run1, args=(time.perf_counter(),))
r0 = f0.get()
r1 = f1.get()
print(r0, r1)
print(f"Time to run end-to-end: {t(s):.2f}ms")
if __name__ == "__main__":
main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to decrease the quantities on the 2nd and 3rd line by a factor of 10-20x smaller. I know that that is a lot, and if it is not possible, that is perfectly fine, but I was just wondering if anybody more knowledgable would know any methods. Thanks!

several points to consider:
"Time to init pool" is wrong. The child processes haven't finished starting, only the main process has initiated their startup. Once the workers have actually started, the speed of "Time to reach run" should drop to not include process startup. If you have a long lived pool of workers, you only pay startup cost once.
startup cost of the interpreter is often dominated by imports in this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another that can be slow is the automatic import of site, but it makes other imports difficult to skip that one.
you're on MacOS, and can switch to using "fork" instead of "spawn" which should be much faster, but fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries)
example:
import multiprocessing as mp
import time
# import numpy as np
def run():
time.sleep(0.03)
return "whatever"
def main():
s = time.perf_counter()
with mp.Pool(processes=1) as p:
p.apply_async(run).get()
print(f"first job time: {(time.perf_counter() -s)*1000:.2f}ms")
#first job 166ms with numpy ; 85ms without ; 45ms on linux (wsl2 ubuntu 20.04) with fork
s = time.perf_counter()
p.apply_async(run).get()
print(f"after startup job time: {(time.perf_counter() -s)*1000:.2f}ms")
#second job about 30ms every time
if __name__ == "__main__":
main()

you can switch to python 3.11+ as it has a faster startup time (and faster everything), but as your application grows you will get even slower startup times compared to your toy example.
one option, is to run your application inside a linux docker image so you can use fork to avoid the spawn overhead, (though the COW overhead will still be visible)
the ultimate solution ? don't write your application in python (or any other language with a VM or a garbage collector), python multiprocessing isn't made for small fast tasks but for long running tasks, if you need that low startup time then write it in C or C++.
if you have to use python then you should reuse your workers to "absorb" this startup time in a much larger task time.

Python: parallel execution of a function which has a sequential loop inside

I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction.
Some of these require significant computation time so I tried to get the advantage of my multicore CPU.
Here is the function which i need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
import numpy as np
def foo(eps): # need an (unused) argument to use pool.map()
# initialising
# the true values of the actions
q = np.random.normal(0, 1, size=10)
# the estimated values
q_est = np.zeros(10)
# the counter of how many times each of the 10 actions was chosen
n = np.zeros(10)
rewards = []
for i in range(1000):
# choose an action based on its estimated value
a = np.argmax(q_est)
# get the normally distributed reward
rewards.append(np.random.normal(q[a], 1))
# increment the chosen action counter
n[a] += 1
# update the estimated value of the action
q_est[a] += (rewards[-1] - q_est[a]) / n[a]
return rewards
I execute this function 2000 times to get (2000, 1000) array:
reward = np.array([foo(0) for _ in range(2000)])
Then I plot the mean reward across 2000 experiments:
import matplotlib.pyplot as plt
plt.plot(np.arange(1000), reward.mean(axis=0))
sequential plot
which fully corresponds the expected result (looks the same as in the book).
But when I try to execute it in parallel, I get much greater standard deviation of the average reward:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
reward_p = np.array(pool.map(foo, [0]*2000))
plt.plot(np.arange(1000), reward_p.mean(axis=0))
parallel plot
I suppose this is due to the parallelization of a loop inside of the foo. As i reduce the number of cores allocated to the task, the reward plot approaches the expected shape.
Is there a way to get the advantage of the multiprocessing here while getting the correct results?
UPD:
I tried running the same code on Windows 10 and sequential vs parallel and the results turned out to be the same! What may be the reason?
Ubuntu 20.04, Python 3.8.5, jupyter
Windows 10, Python 3.7.3, jupyter

As we found out it is different on windows and ubuntu. It is probably because of this:
spawn The parent process starts a fresh python interpreter process.
The child process will only inherit those resources necessary to run
the process objects run() method. In particular, unnecessary file
descriptors and handles from the parent process will not be inherited.
Starting a process using this method is rather slow compared to using
fork or forkserver.
Available on Unix and Windows. The default on Windows and macOS.
fork The parent process uses os.fork() to fork the Python interpreter.
The child process, when it begins, is effectively identical to the
parent process. All resources of the parent are inherited by the child
process. Note that safely forking a multithreaded process is
problematic.
Available on Unix only. The default on Unix.
Try adding this line to your code:
mp.set_start_method('spawn')

Tensorflow Eager mode: First execution on GPU slow

The code below compares computation time on the CPU vs GPU. Only for the first execution, I get slower runtime on GPU than CPU, and in all subsequent runs the GPU is much faster. Why is the first run on GPU slow? How do I make even the first run on GPU fast?
from __future__ import absolute_import, division, print_function
import tensorflow as tf
tf.enable_eager_execution()
import time
def time_matmul(x):
start = time.time()
for loop in range(10):
tf.matmul(x, x)
result = time.time()-start
print("10 loops: {:0.2f}ms".format(1000*result))
print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
x = tf.random_uniform([1000, 1000])
assert x.device.endswith("GPU:0")
time_matmul(x)
# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
x = tf.random_uniform([1000, 1000])
assert x.device.endswith("CPU:0")
time_matmul(x)
Output on first run:
On GPU:
10 loops: 443.04ms
On CPU:
10 loops: 100.01ms
Output on subsequent runs:
On GPU:
10 loops: 1.00ms
On CPU:
10 loops: 103.01ms
PS: This is different from a seemingly related question because tf.device("GPU:0") already chooses /device:GPU:0 and not /device:XLA_GPU:0

Out of curiosity I have tried the OP script 3 years later. The same situation happens on the latest version of TF, CUDA (yet an old GTX1050 card). A possible explanation is data movement.
On the first runs---either GPU or CPU---data moves around, ready for action. Data movements are well-known to slow things down significantly. CPU memory is physically "closer" than GPU memory, the latter usually on an external board. The default compute is the CPU and its memory, so a program is almost ready for CPU runs---little or nothing to move, and basically staying on the same chip. GPU memory is physically a different chip, "far away", so moving there is likely to take much more time.
This thinking can be supported by looping over the OP script (slight change to match TF2.9.1):
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
import time
def time_matmul(run, x):
start = time.time()
for loop in range(10):
tf.matmul(x, x)
result = time.time()-start
print(f"Run #{run}: {1000*result:0.2f}ms")
print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
x = tf.random.uniform([1000, 1000])
assert x.device.endswith("GPU:0")
for run in range(10):
time_matmul(run, x)
# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
x = tf.random.uniform([1000, 1000])
assert x.device.endswith("CPU:0")
for run in range(10):
time_matmul(run, x)
Which results in:
Run #0: 273.66ms
Run #1: 0.37ms
Run #2: 0.36ms
Run #3: 0.36ms
Run #4: 0.37ms
Run #5: 0.36ms
Run #6: 0.35ms
Run #7: 0.41ms
Run #8: 0.37ms
Run #9: 0.35ms
On CPU:
Run #0: 56.89ms
Run #1: 44.31ms
Run #2: 47.60ms
Run #3: 46.97ms
Run #4: 46.40ms
Run #5: 44.84ms
Run #6: 43.88ms
Run #7: 45.28ms
Run #8: 43.46ms
Run #9: 43.57ms
Eye-balling what happens (a proper stats approach would run that many times, done but no more insight) the first run is slow, but then faster, and more importantly stable. Stability is what we expect in the first place (run the same should behave the same), but the first run needs to be setup by placing the data at the "right" place in memory.
I am not aware of APIs to place the data manually, and then start the runs. But this would be an "illusion". Run #0 here includes movement and computation. Splitting the two would likely make Run #0 as fast as all other runs, yet we still have to move the data beforehand---the required time would just not show in the result table...
Please note this memory movement is a likely cause (abductive reasoning here), and there might be something else happening. The thinking is supported by the script result, yet it just allows to conclude memory movement is a likely cause. This post proves nothing. A proper analysis to get the root cause would require more time with a profiler (and a Python profiler may not be enough).
Aside this disclaimer, it really looks like a memory movement cost we observe here.

How to profile CPU usage of a Python script?

Ideally what I want is to record the CPU usage of a Python script that is executing a deep neural net Keras model. I'm looking for the CPU equivalent of memory_profiler, which provides the memory consumption of a process.
I have looked at using psutil (suggested in this answer) which would indicate my script could contain some variant of
p = psutil.Process(current_pid)
p.cpu_percent()
but the problem is the important function call I really want to capture the effect of would be the inference stage of the model
model.predict(x_test)
and if I run psutil before/after this step the CPU usage recorded won't be a true reflection of the CPU usage of the process.
So then I was thinking could I use something like top/htop to log the CPU usage of the script to some file, capturing the fluctuating CPU usage while the process
is running, and then calculate an average (or something similar) after the fact. The issue I see with this, however, is don't I need to know the PID to utilise top,
so how can I use top to monitor the script before it is executed (and hasn't even been assigned a PID)?
I can see this highly-ranked answer suggests
cProfile which gives the running time of functions within a script. Although this isn't exactly what I want I do notice that it returns
the total number of CPU seconds, which would at least let me compare CPU usage in that regard.

You can run model.predict(x_test) in a subprocess and log its CPU usage simultaneously in the main process. For example,
import time
import multiprocessing as mp
import psutil
import numpy as np
from keras.models import load_model
def run_predict():
model = load_model('1.h5')
x_test = np.random.rand(10000, 1000)
time.sleep(1)
for _ in range(3):
model.predict(x_test)
time.sleep(0.5)
def monitor(target):
worker_process = mp.Process(target=target)
worker_process.start()
p = psutil.Process(worker_process.pid)
# log cpu usage of `worker_process` every 10 ms
cpu_percents = []
while worker_process.is_alive():
cpu_percents.append(p.cpu_percent())
time.sleep(0.01)
worker_process.join()
return cpu_percents
cpu_percents = monitor(target=run_predict)
The values in cpu_percents for the above script would be something like:

I/O slowdown with multithreading in python

I have a python script, which works on the following scheme: read a large file (e.g., movie) - compose selected information from it into a number of small temporary files - spawn in subprocesses a C++ application to perform the files processing/calculations (separately for each file) - read the application output. To speed up the script I used multiprocessing. However, it has major drawback: each process has to maintain in RAM the whole copy of the large input file, and therefore I can run only few processes, as I run out of memory. Thus I decided to try multithreading instead (or some combination of multiprocessing and multithreading) due to the fact that threads share the address space. As the python part most of the time works with file I/O or waits for the C++ application to complete, I thought that GIL must not be an issue here. Nevertheless, instead of some gain in performance I observe drastic slowdown, mainly owing to the I/O part.
I illustrate the problem with the following code (saved as test.py):
import sys, threading, tempfile, time
nthreads = int(sys.argv[1])
class IOThread (threading.Thread):
def __init__(self, thread_id, obj):
threading.Thread.__init__(self)
self.thread_id = thread_id
self.obj = obj
def run(self):
run_io(self.thread_id, self.obj)
def gen_object(nlines):
obj = []
for i in range(nlines):
obj.append(str(i) + '\n')
return obj
def run_io(thread_id, obj):
ntasks = 100 // nthreads + (1 if thread_id < 100 % nthreads else 0)
for i in range(ntasks):
tmpfile = tempfile.NamedTemporaryFile('w+')
with open(tmpfile.name, 'w') as ofile:
for elem in obj:
ofile.write(elem)
with open(tmpfile.name, 'r') as ifile:
content = ifile.readlines()
tmpfile.close()
obj = gen_object(100000)
starttime = time.time()
threads = []
for thread_id in range(nthreads):
threads.append(IOThread(thread_id, obj))
threads[thread_id].start()
for thread in threads:
thread.join()
runtime = time.time() - starttime
print('Runtime: {:.2f} s'.format(runtime))
When I run it with different number of threads, I get this:
$ python3 test.py 1
Runtime: 2.84 s
$ python3 test.py 1
Runtime: 2.77 s
$ python3 test.py 1
Runtime: 3.34 s
$ python3 test.py 2
Runtime: 6.54 s
$ python3 test.py 2
Runtime: 6.76 s
$ python3 test.py 2
Runtime: 6.33 s
Can someone explain me the result, as well as give some advice, how to effectively parallelize I/O using multithreading?
EDIT:
The slowdown is not due to HDD performance, because:
1) the files are getting cached to RAM anyway
2) the same operations with multiprocessing (not multithreading) are indeed getting faster (almost by factor of CPUs number)

As I delved deeper into the problem, I made comparison benchmarks for 4 different parallelisation methods, 3 of which are using python and 1 is using java (the purpose of the test was not to compare I/O machinery between different languages but to see if multithreading can boost I/O operations). The test was performed on Ubuntu 14.04.3, all files were placed to a RAM disk.
Although the data are quite noisy, the clear trend is evident (see the chart; n=5 for each bar, error bars represent SD): python multithreading fails to boost the I/O performance. The most probable reason is GIL, and therefore there is no way around it.

I think your performance measures don't lie: you're asking your hard disk to do many things at the same time. Reads, writes, fsync when closing the files, ... and on several files at the same time. It triggers a lot of hardware physical operations. And the more files you write at the same time, the more contention you get.
So the CPU is waiting for the disk operation to finish...
Moreover, maybe you don't have a SSD hard disk, so the syncs actually mean some physical moves.
EDIT: it could be a GIL problem. When you iterate elem in obj in run_io, you execute python code between each write. The ofile.write probably release the GIL, so that the IO doesnt block the other threads, but the lock is released/acquired with each iteration. So maybe your writes don't really run "concurrently".
EDIT2: to test the hypothesis you can try to replace:
for elem in obj:
ofile.write(elem)
with:
ofile.write("".join(obj))
and see if perf gets better

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.