How to profile CPU usage of a Python script? - python

Ideally what I want is to record the CPU usage of a Python script that is executing a deep neural net Keras model. I'm looking for the CPU equivalent of memory_profiler, which provides the memory consumption of a process.
I have looked at using psutil (suggested in this answer) which would indicate my script could contain some variant of
p = psutil.Process(current_pid)
p.cpu_percent()
but the problem is the important function call I really want to capture the effect of would be the inference stage of the model
model.predict(x_test)
and if I run psutil before/after this step the CPU usage recorded won't be a true reflection of the CPU usage of the process.
So then I was thinking could I use something like top/htop to log the CPU usage of the script to some file, capturing the fluctuating CPU usage while the process
is running, and then calculate an average (or something similar) after the fact. The issue I see with this, however, is don't I need to know the PID to utilise top,
so how can I use top to monitor the script before it is executed (and hasn't even been assigned a PID)?
I can see this highly-ranked answer suggests
cProfile which gives the running time of functions within a script. Although this isn't exactly what I want I do notice that it returns
the total number of CPU seconds, which would at least let me compare CPU usage in that regard.

You can run model.predict(x_test) in a subprocess and log its CPU usage simultaneously in the main process. For example,
import time
import multiprocessing as mp
import psutil
import numpy as np
from keras.models import load_model
def run_predict():
model = load_model('1.h5')
x_test = np.random.rand(10000, 1000)
time.sleep(1)
for _ in range(3):
model.predict(x_test)
time.sleep(0.5)
def monitor(target):
worker_process = mp.Process(target=target)
worker_process.start()
p = psutil.Process(worker_process.pid)
# log cpu usage of `worker_process` every 10 ms
cpu_percents = []
while worker_process.is_alive():
cpu_percents.append(p.cpu_percent())
time.sleep(0.01)
worker_process.join()
return cpu_percents
cpu_percents = monitor(target=run_predict)
The values in cpu_percents for the above script would be something like:

Related

What might cause a GPU to hang/get stuck when using multiprocessing in Python?

TL;DR: using PyTorch with Optuna with multiprocessing done with Queue(), a GPU (out of 4) can hang. Probably not a deadlock. Any ideas?
Normal version:
I am using PyTorch in combination with Optuna (a hyperparameter optimization framework; basically starts different trials for one model with different parameters, see: https://optuna.readthedocs.io/en/stable/) for my model training on a setup with 4 GPUs. Here, I've been looking for a way to distribute the workload more efficiently on the GPUs, hence I explored the multiprocessing library.
The core of the multiprocessing code looks like following:
class GpuQueue:
def __init__(self):
self.queue = multiprocessing.Manager().Queue()
all_idxs = list(range(N_GPUS)) if N_GPUS > 0 else [None]
for idx in all_idxs:
self.queue.put(idx)
#contextmanager
def one_gpu_per_process(self):
current_idx = self.queue.get()
yield current_idx
self.queue.put(current_idx)
class Objective:
def __init__(self, gpu_queue: GpuQueue, params, signals):
self.gpu_queue = gpu_queue
# create dataset
# ...
def __call__(self, trial: optuna.Trial):
with self.gpu_queue.one_gpu_per_process() as gpu_i:
val = trainer(trial, gpu=gpu_i, ...)
return val
And in main, optuna study and optuna optimize are initiated with:
study = optuna.create_study(direction="minimize", sampler = optuna.samplers.TPESampler(seed=17)) # storage = "sqlite:///trials.db")
study.optimize(Objective(GpuQueue(), ..., n_jobs=4))
Same implementation can be found in this StackOverflow post (used as inspiration): Is there a way to pass arguments to multiple jobs in optuna?
What this code does is that every trial gets its own GPU, hence the GPU usage and distribution is better than other methods. However it happens often that a GPU is stuck and just 'shuts itself off' and does not finish the trial, hence the code actually never finishes running and that GPU is never freed.
Say, for example, that I am running 100 trials, then trial 1,2,3,4 get assigned GPUs 0,1,2,3 (not always in that order), and whenever a GPU is freed, say GPU 2, it takes on trial 5, etc. The issue is, it can happen that the trial that the GPU is assigned to 'quits' in the process and never finishes the trial, hence not taking on another trial and resulting in the run with many trials not completing.
I suspected a deadlock, but apparently Queue() is thread-safe (see: Is Python multiprocessing.Queue thread safe?).
Any clue on what can cause the hang and what I can look for?

Start a new multiprocessing.pool() process for each one that dies

I'm using a python api for a proprietary software to run numerical simulations. I need to do quite a few so have tried to speed things up using multiprocessing.pool() to run simulations in parallel. The simulations are independent and the function passed to multiprosessing.pool() returns nothing but the simulation results are saved to disk. As far as I understand this should be similar to opening X no of terminals and running a call to the API from each.
Using multiprocessing starts off well, I can see all processors running at 100% which is expected for the simulations. However after a while the processes seem to die. Eventually I end up with no active processes but still simulations that have not started. I think that the problem is that the API is sometimes a a little buggy. Certain errors cause python kernel to crash. I think this likely what is happening with my multiprocessing.pool().
Is there a way that I can add a new process for each one that dies so that there will always be processes in the pool? For now I can run the individual simulations that give problems manually.
Below is a minimum working example but I am not sure how to reproduce an error that causes the kernel to crash so it is not of much use.
from multiprocessing import Pool
from multiprocessing import cpu_count
import time
def test_function(a,b):
"Takes in two variables to justify starmap, pause,return nothing"
print(f'running case {a}')
' api(a,b) - Runs a simulation and saves output to disk'
'include error that "randomly" crashes python console/process'
time.sleep(5)
if __name__ == '__main__':
case_names = list(range(60))
b = 'b'
inputs = [(a,b) for a in case_names] #All the inputs in order needed by run_wdi
start_time = time.time()
# no_processes = cpu_count()
no_processes = min(cpu_count(),len(inputs))
print(f"Using {no_processes} processes on {cpu_count()} cpu's")
# with Pool(processes=no_processes) as pool:
with Pool() as pool:
result = pool.starmap(test_function, inputs)
end_time = time.time()
print(f'Total time {end_time-start_time}')

Multithreading degrades GPU performance

In my Python application I am using Detectron2 to run prediction on an image and detect the key-points of all the humans in the image.
I want to run the prediction on frames that are streamed to my app live (using aiortc), but I discovered that the predictions time is much worse because it now runs on a new thread (the main thread is occupied with the server).
Running predictions on a thread takes anywhere between 1.5 to 4 seconds, which is a lot.
When running the predictions on the main-thread (without the video streaming part), I get predictions times of less than a second.
My question is why it happens and how can I fix it¿ Why the GPU performance is degraded so drastically when using it from a new thread¿
Notes:
The code is tested in Google Colab with Tesla P100 GPU and the video streaming is emulated by reading frames from a video file.
I calculate the time it takes to run prediction on a frame using the code in this question.
I tried switching to multiprocessing instead, but couldn't make it work with cuda (I tried both import multiprocessing as well as import torch.multiprocessing with set_stratup_method('spawn')) it just gets stuck when calling start on the process.
Example code:
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
import threading
from typing import List
import numpy as np
import timeit
import cv2
# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7 # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)
def get_frames(video: cv2.VideoCapture):
frames = list()
while True:
has_frame, frame = video.read()
if not has_frame:
break
frames.append(frame)
return frames
class CodeTimer:
# Source: https://stackoverflow.com/a/52749808/9977758
def __init__(self, name=None):
self.name = " '" + name + "'" if name else ''
def __enter__(self):
self.start = timeit.default_timer()
def __exit__(self, exc_type, exc_value, traceback):
self.took = (timeit.default_timer() - self.start) * 1000.0
print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')
video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()
def send_frames():
# This function emulates the stream, so here we "get" a frame and add it to our buffer
for frame in get_frames(video):
frames_buffer.append(frame)
# Simulate delays between frames
time.sleep(random.uniform(0.3, 2.1))
def predict_frames():
predicted_frames = 0 # The number of frames predicted so far
while predicted_frames < num_frames: # Stop after we predicted all frames
buffer_length = len(frames_buffer)
if buffer_length <= predicted_frames:
continue # Wait until we get a new frame
# Read all the frames from the point we stopped
for frame in frames_buffer[predicted_frames:]:
# Measure the prediction time
with CodeTimer('In stream prediction'):
predictions.append(predictor(frame))
predicted_frames += 1
t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()
Python threads rely on the GIL which must be locked by all C bindings trying to access Python objects. GPU computing libraries typically use C bindings, and could potentially lock the GIL from time to time and thus pause Python code execution.
It is a wild guess, but this is possible that the predictor function, which needs to go through C and a lock of the GIL finds itself waiting for the other threads that are writing the video buffers. Then depending on how the computation is broken down and how Python juggles with your other thread, I suppose the impact on performance may become visible.
You may:
avoid multi-threading by performing the reading and the prediction in the same thread.
use multi-processing so that the GIL does not interfere between the two processes
code this in a native language such as C, C++...
The problem is in: your hardware, your libraries or, in the differences between your example code and the real code.
I implemented your code on an Nvidia Jetson Xavier. I installed all needed libraries using the following commands:
# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate
#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl
# torchvision
pip install 'git+https://github.com/pytorch/vision.git#v0.9.0'
# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# ipython bindings (optional)
pip install ipykernel cloudpickle
# opencv
pip install opencv-python
After that I run your example script on an example video and received the following output:
Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms
After the first pass, the detector consistently takes around 400ms to run the detection. Which seems about right for an Jetson Xavier. I do not experience the slowdown you described.
I have to note that the Jetson is a specific piece of hardware. In this hardware the RAM memory is shared between the CPU and the GPU. Therefore I do not have to transfer the data from CPU to GPU. So if your slow down is caused by the transfer between CPU and GPU memory, I will not experience this problem in my setup.
Not seeing the full code, here is a few suggestions:
You might be running into overhead of starting new threads every time. So explore option of the thread pool instead of starting new threads every time.
If you are not moving workload to GPU - that means it's CPU bound task and Python threads is not the right tool for the task. For CPU intensive tasks you should be using https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing
Some operations are I/O bound. For example, each cv2.imread call results in I/O overhead. You can read this article which says : "Not all algorithms can be made parallel and distributed to all cores of a processor — some algorithms are simply single threaded in nature."
This means that multiprocessing for computer vision algorithms must be global: a single operation (such as imread) will not be improved by multithreading. However, you will sometimes gain speed by performing other operations in parallel because they are not limited by I/O or anything else. At this point, you will probably see an overall speedup:
If you run single imread:
non-multithreaded: 5 ms = cost of imread
multithreaded: 7 ms = cost of multithreading + cost of imread
But if you run operations that can be multithreaded :
non multithreaded: 5 ms + 10 ms = cost of imread + cost of operation
multi-threaded: 2ms + 5 ms + 5 ms = cost of multithreading + cost of imread + cost of parallel operations
(these figures are not real, they are just to illustrate what I mean)

Python: parallel execution of a function which has a sequential loop inside

I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction.
Some of these require significant computation time so I tried to get the advantage of my multicore CPU.
Here is the function which i need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
import numpy as np
def foo(eps): # need an (unused) argument to use pool.map()
# initialising
# the true values of the actions
q = np.random.normal(0, 1, size=10)
# the estimated values
q_est = np.zeros(10)
# the counter of how many times each of the 10 actions was chosen
n = np.zeros(10)
rewards = []
for i in range(1000):
# choose an action based on its estimated value
a = np.argmax(q_est)
# get the normally distributed reward
rewards.append(np.random.normal(q[a], 1))
# increment the chosen action counter
n[a] += 1
# update the estimated value of the action
q_est[a] += (rewards[-1] - q_est[a]) / n[a]
return rewards
I execute this function 2000 times to get (2000, 1000) array:
reward = np.array([foo(0) for _ in range(2000)])
Then I plot the mean reward across 2000 experiments:
import matplotlib.pyplot as plt
plt.plot(np.arange(1000), reward.mean(axis=0))
sequential plot
which fully corresponds the expected result (looks the same as in the book).
But when I try to execute it in parallel, I get much greater standard deviation of the average reward:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
reward_p = np.array(pool.map(foo, [0]*2000))
plt.plot(np.arange(1000), reward_p.mean(axis=0))
parallel plot
I suppose this is due to the parallelization of a loop inside of the foo. As i reduce the number of cores allocated to the task, the reward plot approaches the expected shape.
Is there a way to get the advantage of the multiprocessing here while getting the correct results?
UPD:
I tried running the same code on Windows 10 and sequential vs parallel and the results turned out to be the same! What may be the reason?
Ubuntu 20.04, Python 3.8.5, jupyter
Windows 10, Python 3.7.3, jupyter
As we found out it is different on windows and ubuntu. It is probably because of this:
spawn The parent process starts a fresh python interpreter process.
The child process will only inherit those resources necessary to run
the process objects run() method. In particular, unnecessary file
descriptors and handles from the parent process will not be inherited.
Starting a process using this method is rather slow compared to using
fork or forkserver.
Available on Unix and Windows. The default on Windows and macOS.
fork The parent process uses os.fork() to fork the Python interpreter.
The child process, when it begins, is effectively identical to the
parent process. All resources of the parent are inherited by the child
process. Note that safely forking a multithreaded process is
problematic.
Available on Unix only. The default on Unix.
Try adding this line to your code:
mp.set_start_method('spawn')

dask.delayed results in no speedup

I am trying to get into Dask. For that I attempted to parallelize some time consuming sequential code I got. The original code is this:
def sequential():
sims = []
chunksize = len(tokens)//4
for i in range(0, len(tokens), chunksize):
print(i, i+chunksize)
chunk = tokens[i:i+chunksize]
sims.append(process(chunk))
return sims
%time sequential()
and the prallelized code is this:
def parallel():
sims = []
chunksize = len(tokens)//4
for i in range(0, len(tokens), chunksize):
print(i, i+chunksize)
chunk = dask.delayed(tokens[i:i+chunksize])
sims.append(dask.delayed(process)(chunk))
return dask.delayed(sims)
%time parallel().visualize()
But the parallelized code always runs around 10% slower than the parallel one. when I visualize the computation graph for sims I get this:
Not sure where list-#8 comes from, but other than that it looks correct. So why is there no speedup? When I look into htop I can see 3 cores active (~30% load each), while for the sequential code I see only 1 core active (100% load). The sequential code runs 7 minutes and the parallel code runs 7 - 8 minutes.
I guess I am misunderstanding how delayed and compute should be used here?
The setup is this, if you require it:
import numpy
import spacy
import dask
nlp = spacy.load('en_core_web_lg')
tokens = [t for t in nlp(" ".join(t.strip() for t in open('./words.txt','r').readlines())) if len(t.text) > 1 and len(t.text) < 20]
def process(chunk):
sims = numpy.zeros([len(chunk),len(tokens)], dtype=numpy.float32)
for i in range(len(chunk)):
for j in range(len(tokens)):
sims[i,j] = chunk[i].similarity(tokens[j])
return sims
You are seeing this behaviour because the default execution engine for dask is based on multiple threads in a single process (the "threaded" scheduler). Python has a lock, the GIL, which ensures the safety of the interpreter by only executing one python statement at a time. Therefore, each thread is spending most of its time waiting for the lock to become available.
To avoid this problem, you have two options:
find a version of your computation that releases the GIL. This is possible if you can phrase it as (mainly) some numpy, pandas, numba, etc., computation, code that executes at the C level and doesn't need the interpreter, unlike your nested loops.
run your code using processes, using either the "mutiprocessing" scheduler or (better) the "distributed" scheduler which, despite the name, also runs well on a single machine.
Further information: http://dask.pydata.org/en/latest/scheduler-overview.html

Categories