How to increase processing speed of python code through parallelization


I currently have a sequential code with the following parts (all parts are properly encapsulated and isolated in class methods and so on):
1. Frame capture from network stream with opencv VideoCapture
2. Processing of the image with yolov7 through pytorch (with cuda)
3. Classical processing of the yolov7 output
4. Extra heavy classical processing done every X frames (X will be 30 or 60)
Because this will need to run in real time (a small constant latency is allowed, but not a growing one), something needs to be done, since it currently runs at 15 fps (already ignoring odd frames).
Time profiling shows that the most time consuming processes are (2) and (4) (no surprise).
When I started looking up info for (1) I learned about the threading module, which seemed promising and popped up in a lot of stackoverflow answers about increasing code speed when image capture from cameras was involved. This led me to see salvation in this module (because of misconceptions about parallelization) until I learnt that it still executes one thread at a time because of the GIL. I am also aware of the existence of asyncio, multithreading and concurrent.futures.ProcessPoolExecutor. I have also read this post on threading. Multicore processing is available.
Aim is to have (1) capture frames into a queue. (2) takes the frame when there's one available and processes it. As soon as it finishes the processing, give the output to (3) and read the next available frame in the queue to keep processing while (3) and (4) are executed too. (3) takes (2) output and processes it super fast and waits until there's output from (2) and repeat. Finally, every X frames, (4) will read the outputs generated by (3) until that point and perform some heavy calculations.
If I have understood well, I should use multiprocessing instead of multithreading since it's a rather intensive calculation problem (apart from the I/O on (1)).
So the questions really are:
For (1), since this is I/O related, is threading combined with queue a good way to go?
For (2), (3) and (4), what is the way to proceed? I need them to run at the same time, especially (2) and (4), since (3) really runs at nearly 300 fps
For (2), is there a way to process two frames at the same time? For example processing even and odd frames at the same time. This is a critical point right now to get to real-time processing.
How difficult is all this? I'm not really an expert in this topic (I'm actually a physicist) so I don't know if I'm getting myself in too much of a slippery slope here.
I just need someone who knows about all this mess to point me in the right direction, so don't hesitate to add some references. Thank you very much in advance!

It is difficult to provide an authoritative answer to a question like this. I can share some experience from working on a similar problem (image processing on a Raspberry Pi on a drone).
In my experience, multiprocessing is the way to go. Because of the GIL limitations, if you are pushing the limits of a single core, threading will do very little. You would use threading in an environment where you have blocking but not resource-intensive operations, such as IO/network. Also, you want to avoid mixing threads and processes, especially if you are not confident you know what you are doing.
You essentially want a pipeline of processes, multiple stages, each handling a specific task, running in parallel. In this case, your per frame processing time goes from the sum of your individual stage processing times to the max of your per-stage processing times.
Practically, you want a Process per stage, with a multiprocessing Queue between each pair of stages. Test each stage separately first, passing items in and out in a controlled manner, then try to integrate them. It is not trivial, but you can do it.
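As a rough illustration of the "one Process per stage, one Queue between stages" idea, here is a minimal sketch. The functions detect and postprocess are hypothetical stand-ins for stages (2) and (3), the None sentinel and the queue sizes are my own assumptions, and real frames would replace the integers used here.

```python
import multiprocessing as mp

SENTINEL = None  # assumed end-of-stream marker; real items must never be None


def stage(func, in_q, out_q):
    """Apply func to every item from in_q and forward the result to out_q."""
    while True:
        item = in_q.get()           # blocks until the previous stage delivers
        if item is SENTINEL:
            out_q.put(SENTINEL)     # tell the next stage to shut down too
            break
        out_q.put(func(item))


def detect(frame):
    """Hypothetical stand-in for stage (2)."""
    return frame * 2


def postprocess(detection):
    """Hypothetical stand-in for stage (3)."""
    return detection + 1


def run_pipeline(frames):
    # Bounded queues between stages; the final queue is unbounded so the
    # last stage never blocks while the parent is still feeding frames.
    q1, q2, q3 = mp.Queue(maxsize=8), mp.Queue(maxsize=8), mp.Queue()
    workers = [
        mp.Process(target=stage, args=(detect, q1, q2)),
        mp.Process(target=stage, args=(postprocess, q2, q3)),
    ]
    for w in workers:
        w.start()
    for frame in frames:            # the parent plays the role of stage (1)
        q1.put(frame)
    q1.put(SENTINEL)
    results = list(iter(q3.get, SENTINEL))  # drain the output before joining
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5)))
```

Because stage only relies on get() and put(), each stage can be tested on its own with an ordinary queue.Queue before any processes are involved.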
As to having two processes for (2) in parallel, while possible, there are some things to consider:
Is CUDA capable of handling that? Because you are passing some of the processing over to your GPU, you may become limited by the GPU, not the CPU. Sending twice the amount of work to it may therefore gain you nothing.
You want to make sure you have a robust way of re-ordering your frames. You'd most likely end up with two Queues published to by (1), and two Queues for (3) to consume from. You need to make sure (3) consumes items from the right queue and emits them in order. You may want to pass along some metadata to help with this.
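One possible form of that metadata is a sequence number attached to each frame by (1). The sketch below is an assumption of mine, not something from the setup described above: in_order is a hypothetical helper that takes (seq, result) pairs in arrival order and releases results strictly in sequence, holding back any that arrive early.

```python
import heapq


def in_order(tagged_results):
    """Yield results in sequence order as soon as each one is releasable.

    tagged_results: iterable of (seq, result) pairs arriving in any order,
    where seq counts 0, 1, 2, ... without gaps or repeats.
    """
    pending = []     # min-heap of results that arrived ahead of their turn
    next_seq = 0     # sequence number we are currently waiting for
    for seq, result in tagged_results:
        heapq.heappush(pending, (seq, result))
        # Release every result that is now contiguous with what was emitted.
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1


if __name__ == "__main__":
    arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
    print(list(in_order(arrivals)))
```

If a frame can be dropped upstream, the helper would wait forever for the missing number, so a real version would also need a timeout or a skip rule.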

As you mentioned, Python is limited by the GIL, which means that only one thread in a Python process executes Python bytecode at a time. For I/O, threading works well, but threads will not let you use multiple CPU cores even if you have them available.
However, there is the wonderful multiprocessing module in python that allows you to create a new process. This takes more time to create than a thread but can run on a different CPU core.
There are plenty of examples of this (see the python module page linked above), but here is one using a Queue, from DigitalOcean:
from multiprocessing import Lock, Process, Queue, current_process
import time
import queue  # imported for using queue.Empty exception


def do_job(tasks_to_accomplish, tasks_that_are_done):
    while True:
        try:
            # Try to get a task from the queue. get_nowait() raises
            # queue.Empty if the queue is empty; get(False) would do
            # the same.
            task = tasks_to_accomplish.get_nowait()
        except queue.Empty:
            break
        else:
            # If no exception has been raised, add the task completion
            # message to the tasks_that_are_done queue.
            print(task)
            tasks_that_are_done.put(task + ' is done by ' + current_process().name)
            time.sleep(.5)
    return True


def main():
    number_of_task = 10
    number_of_processes = 4
    tasks_to_accomplish = Queue()
    tasks_that_are_done = Queue()
    processes = []

    for i in range(number_of_task):
        tasks_to_accomplish.put("Task no " + str(i))

    # creating processes
    for w in range(number_of_processes):
        p = Process(target=do_job, args=(tasks_to_accomplish, tasks_that_are_done))
        processes.append(p)
        p.start()

    # completing process
    for p in processes:
        p.join()

    # print the output
    while not tasks_that_are_done.empty():
        print(tasks_that_are_done.get())

    return True


if __name__ == '__main__':
    main()

Related

Python processing items from list/queue and saving progress

If I have 10+ million small tasks to process in python (converting images or similar), how can I create a queue and save progress in case the processing crashes? To be clear, how can I save progress, or stop the process whenever I want, and continue processing from the last point?
Also, how do I deal with multiple threads in that case?
In general, the question is how to save progress on processed data to a file. The issue is that with a huge number of very small files, saving a file after each iteration will take longer than the processing itself...
Thanks!
(sorry for my English if it's not clear)

First off, I would suggest not going for multi-threading. Use multi-processing instead. Because of the GIL, multiple threads do not run in parallel in python when it comes to computation-intensive tasks.
To solve the problem of saving results, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Assign each process one chunk.
3. Append the names of processed files to some file (say monitor.txt) after every 1000 steps (assuming that you can afford to process up to 1000 files again in case of failure).
4. In case of failure, skip all the files which are saved in monitor.txt for each process.
5. You can have monitor_1.txt, monitor_2.txt ... for each process, so you will not have to read the whole file for each process.
The following gist might help you. You just need to add code for the 4th point.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
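Independently of the gist, steps 3 and 4 could look roughly like the sketch below. The helper names (load_done, flush, process_chunk) and the work callback are hypothetical, and it assumes file names contain no newlines.

```python
import os


def load_done(monitor_path):
    """Return the set of file names already recorded as processed."""
    if not os.path.exists(monitor_path):
        return set()
    with open(monitor_path) as fh:
        return {line.strip() for line in fh if line.strip()}


def flush(buffer, monitor_path):
    """Append the buffered names to the monitor file and clear the buffer."""
    if buffer:
        with open(monitor_path, "a") as fh:
            fh.write("\n".join(buffer) + "\n")
        buffer.clear()


def process_chunk(file_names, monitor_path, work, batch=1000):
    """Process one chunk, checkpointing to monitor_path every `batch` files."""
    done = load_done(monitor_path)      # step 4: what an earlier run finished
    buffer = []
    for name in file_names:
        if name in done:
            continue                    # skip work that is already recorded
        work(name)
        buffer.append(name)
        if len(buffer) >= batch:        # step 3: one write per batch
            flush(buffer, monitor_path)
    flush(buffer, monitor_path)         # record the final partial batch
```

Each process would call process_chunk with its own chunk and its own monitor file (monitor_1.txt, monitor_2.txt, ...). After a crash, at most batch files per process are repeated.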

I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest is to use multithreading and not multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be
from threading import Thread


def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')


if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    processes = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the processes
    for proc in processes:
        proc.start()
    # here you wait for all processes to finish
    for proc in processes:
        proc.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the files from the "data processing" side, and let the I/O process pick these up and handle them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know if this will yield a big advantage over the threading-based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster than the I/O, the Queue can grow very big. This might have a performance impact, depending on your system amongst other things. A quick workaround is to pause the data processing if the queue gets too large.
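The "pause when the queue gets too large" workaround does not have to be written by hand: a queue created with a maxsize makes put() block while it is full. The sketch below uses a thread and queue.Queue to keep it short, which is a substitution on my part; multiprocessing.Queue accepts the same maxsize argument. The names writer and run, and the list used as a sink, are placeholders.

```python
import queue
import threading

SENTINEL = None  # assumed end-of-stream marker; real items must never be None


def writer(q, sink):
    """Dedicated I/O worker: take items off the queue one at a time."""
    for item in iter(q.get, SENTINEL):
        sink.append(item)               # stand-in for the slow file write


def run(items, maxsize=4):
    q = queue.Queue(maxsize=maxsize)    # at most `maxsize` items may wait
    sink = []
    t = threading.Thread(target=writer, args=(q, sink))
    t.start()
    for item in items:
        q.put(item)                     # blocks (pauses the producer) while full
    q.put(SENTINEL)
    t.join()
    return sink


if __name__ == "__main__":
    print(run(range(10)))
```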
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
    # mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.

Multithreaded parsing is slower than sequential

I am parsing 4 large XML files through threads and somehow the multithreaded code is slower than the sequential code?
Here is my multithreaded code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
And that's the "faster" code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)
    for t in thread_list:
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
The sequential code is faster by 10 whole minutes, how is this possible?

Python uses the GIL (Global Interpreter Lock) to ensure only one thread executes Python code at a time. This is done to prevent data races and for some other reasons. It means, however, that multithreading in the default CPython will barely speed up CPU-bound code, and may even slow it down, as it did in your case.
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL.
Here's a SO question on that topic

Where did you read that multi-threading or even multi-processing should always be faster than sequential? That is simply wrong. Which one of the 3 modes is faster depends heavily on the problem to solve, and on where the bottleneck is.
if the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet
if the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for IO completion during that time and you will get a much better throughput - even if the really faster way is normally polling IO with select when possible
only if the bottleneck is CPU processing - which IMHO is not the most common use case - parallelization over different cores is the winner. In Python that means multi-processing (*). That mainly concerns heavy computations
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 ends long before thread0, you will only process it after thread0 has ended, which is suboptimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
active_list = thread_list[:]
while len(active_list) > 0:
    for t in active_list[:]:  # iterate over a copy, since we remove items
        if not t.is_alive():
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example of that is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Technically this means releasing the Global Interpreter Lock.
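The "process each result as soon as its thread has finished" pattern is also available in the standard library, without a polling loop, through concurrent.futures.as_completed. This is an alternative to the loop above rather than the same technique, and parse_file with its delay argument is a hypothetical stand-in for parsing one XML file.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_file(name, delay):
    """Hypothetical stand-in for parsing one file; sleeps to simulate I/O."""
    time.sleep(delay)
    return name


def parse_all(jobs):
    """Run each (name, delay) job in a thread; return names in completion order."""
    finished = []
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(parse_file, name, delay) for name, delay in jobs]
        for fut in as_completed(futures):
            # Each result is handled as soon as it is ready, regardless of
            # the order in which the jobs were submitted.
            finished.append(fut.result())
    return finished


if __name__ == "__main__":
    print(parse_all([("slow.xml", 0.2), ("fast.xml", 0.0)]))
```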

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

multiprocessing of video frames in python

I am new to multiprocessing in python. I want to extract features from each frame of hour-long video files. Processing each frame takes on the order of 30 ms. I thought multiprocessing was a good idea because each frame is processed independently of all other frames.
I want to store the results of the feature extraction in a custom class.
I read a few examples and ended up using multiprocessing and Queues as suggested here. The result was disappointing though: each frame now takes about 1000 ms to process. I am guessing I produced a ton of overhead.
Is there a more efficient way to process the frames in parallel and collect the results?
To illustrate, I put together a dummy example.
import multiprocessing as mp
from multiprocessing import Process, Queue
import numpy as np
import cv2


def main():
    #path='path\to\some\video.avi'
    coordinates=np.random.random((1000,2))
    #video = cv2.VideoCapture(path)
    listOf_FuncAndArgLists=[]
    for i in range(50):
        #video.set(cv2.CAP_PROP_POS_FRAMES,i)
        #img_frame_original = video.read()[1]
        #img_frame_original=cv2.cvtColor(img_frame_original, cv2.COLOR_BGR2GRAY)
        img_frame_dummy=np.random.random((300,300)) #using dummy image for this example
        frame_coordinates=coordinates[i,:]
        listOf_FuncAndArgLists.append([parallel_function,frame_coordinates,i,img_frame_dummy])
    queues=[Queue() for fff in listOf_FuncAndArgLists] #create a queue object for each function
    jobs = [Process(target=storeOutputFFF,args=[funcArgs[0],funcArgs[1:],queues[iii]]) for iii,funcArgs in enumerate(listOf_FuncAndArgLists)]
    for job in jobs: job.start() # Launch them all
    for job in jobs: job.join() # Wait for them all to finish
    # And now, collect all the outputs:
    return([queue.get() for queue in queues])


def storeOutputFFF(fff,theArgs,que): #add a argument to function for assigning a queue
    print('MULTIPROCESSING: Launching %s in parallel ' % fff.__name__)
    que.put(fff(*theArgs)) #we're putting return value into queue


def parallel_function(frame_coordinates,i,img_frame_original):
    #do some image processing that takes about 20-30 ms
    dummyResult=np.argmax(img_frame_original)
    return(resultClass(dummyResult,i))


class resultClass(object):
    def __init__(self,maxIntensity,i):
        self.maxIntensity=maxIntensity
        self.i=i


if __name__ == '__main__':
    mp.freeze_support()
    a=main()
    [x.maxIntensity for x in a]

Parallel processing in (regular) python is a bit of a pain: in other languages we'd just use threads, but the GIL makes that problematic, and using multiprocessing has a big overhead in moving data around. I've found that fine-grained parallelism is (relatively) hard to do, whilst processing 'chunks' of work that take tens of seconds (or more) to process in a single process can be much more straightforward.
An easier path to parallel processing your problem - if you're on a UNIXy system - would be to make a python program which processes a segment of video specified on the command line (i.e. a frame number to start with, and a number of frames to process), and then use the GNU parallel tool to process multiple segments at once. A second python program can consolidate the results from a collection of files, or read them from stdin, piped from parallel. This way the processing code doesn't need to do its own parallelism, but it does require the input file to be accessed by several processes at once and frames to be extracted starting from mid-points. (This might also be extendable to work across multiple machines without changing the python...)
multiprocessing.Pool.map could be used in a similar way if you need a pure-python solution: map over a list of tuples (say, (file, startframe, endframe)) and then open the file in the function and process that segment.
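A minimal sketch of the Pool.map variant follows. For brevity the tuples here are only (start, end) and the per-frame work is a placeholder; make_segments and process_segment are hypothetical names, and a real worker would open the video file itself and seek to its start frame.

```python
import multiprocessing as mp


def make_segments(n_frames, n_parts):
    """Split range(n_frames) into n_parts contiguous (start, end) pairs, end exclusive."""
    step, extra = divmod(n_frames, n_parts)
    segments, start = [], 0
    for i in range(n_parts):
        end = start + step + (1 if i < extra else 0)  # spread the remainder
        segments.append((start, end))
        start = end
    return segments


def process_segment(segment):
    """Hypothetical worker: handles one whole segment inside one process."""
    start, end = segment
    return [i * i for i in range(start, end)]  # placeholder per-frame result


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        per_segment = pool.map(process_segment, make_segments(50, 4))
    # pool.map preserves input order, so flattening keeps frame order.
    results = [r for seg in per_segment for r in seg]
    print(len(results))
```

Only the small (start, end) tuples and the final results cross process boundaries, so the per-frame overhead of starting a process and shipping an image through a queue disappears.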

Multiprocessing creates some overhead for starting several processes and bringing them all back together.
Your code does that for every frame.
Try splitting your video into N evenly-sized pieces and processing them in parallel.
Set N equal to the number of cores on your machine or something like that (your mileage may vary, but it's a good number to start experimenting with). There's no point in creating 50 processes if, say, 4 of them are executing and the rest are simply waiting for their turn.

Will I run into trouble with python's Global Interpreter Lock?

I am aware that this question is rather high-level and may be vague. Please ask if you need any more details and I will try to edit.
I am using QuickFix with Python bindings to consume high-throughput market data from circa 30 markets simultaneously. Most of the computing work is done on separate CPUs via the multiprocessing module. These parallel processes are spawned by the main process on startup. If I wish to interact with the market in any way via QuickFix, I have to do this within the main process, thus any commands (to enter orders, for example) which come from the child processes must be piped (via an mp.Queue object we will call Q) to the main process before execution.
This raises the problem of monitoring Q, which must be done within the main process. I cannot use Q.get(), since this method blocks and my entire main process will hang until something shows up in Q. In order to decrease latency, I must check Q frequently, on the order of 50 times per second. I have been using the apscheduler to do this, but I keep getting Warning errors stating that the runtime was missed. These errors are a serious issue because they prevent me from easily viewing important information.
I have therefore refactored my application to use the code posted by MestreLion as an answer to this question. This is working for me because it starts a new thread from the main process, and it does not print error messages. However, I am worried that this will cause nasty problems down the road.
I am aware of the Global Interpreter Lock in python (this is why I used the multiprocessing module to begin with), but I don't really understand it. Owing to the high-frequency nature of my application, I do not know if the Q monitoring thread and the main process consuming lots of incoming messages will compete for resources and slow each other down.
My questions:
Am I likely to run into trouble in this scenario?
If not, can I add more monitoring threads using the present approach and still be okay? There are at least two other things I would like to monitor at high frequency.
Thanks.

@MestreLion's solution that you've linked creates 50 threads per second in your case.
All you need is a single thread to consume the queue without blocking the rest of the main process:
import threading


def consume(queue, sentinel=None):
    for item in iter(queue.get, sentinel):
        pass_to_quickfix(item)


threading.Thread(target=consume, args=[queue], daemon=True).start()
GIL may or may not matter for performance in this case. Measure it.

Without knowing your scenario, it's difficult to say anything specific. Your question suggests that the threads spend most of their time waiting in get, so the GIL is not a problem. Interprocess communication may cause problems much earlier. There you could think of switching to another protocol, using some kind of TCP sockets. Then you can write the scheduler more efficiently with select instead of threads, as threads are also slow and resource-consuming. select is a system function that allows you to monitor many socket connections at once; it therefore scales very efficiently with the number of connections and needs nearly no CPU power for monitoring.

Multiprocessing in python with more than 2 levels

I want to write a program that spawns processes like this: process -> n processes -> n processes.
Can the second level spawn processes with multiprocessing? I am using the multiprocessing module of python 2.6.
Thanks.

@vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little: your original program would spawn its n processes, but they would differ slightly from the original in that each of them (if I understand your question) would spawn n more processes. You could accomplish this either by having them run code similar to your original process, which spawns new sets of processes that perform the task at hand without spawning further, or by using the same code/entry point with different arguments - something like
def main(level):
    if level == 0:
        do_work()
    else:
        for i in range(n):
            spawn_process_that_runs_main(level - 1)
and start it off with level == 2
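A runnable version of that pseudocode might look like the sketch below. It is written for a current Python 3 rather than 2.6, the extra n and results parameters are my own additions, and the "work" at level 0 is just reporting the name of the process that did it.

```python
import multiprocessing as mp


def main(level, n, results):
    """At level 0 do the work; otherwise spawn n children one level down."""
    if level == 0:
        results.put(mp.current_process().name)   # placeholder for real work
        return
    children = [
        mp.Process(target=main, args=(level - 1, n, results))
        for _ in range(n)
    ]
    for child in children:
        child.start()
    for child in children:
        child.join()


if __name__ == "__main__":
    n = 2
    results = mp.Queue()
    main(2, n, results)                  # process -> n processes -> n processes
    leaves = [results.get() for _ in range(n ** 2)]
    print(len(leaves))
```

The children must not be created with daemon=True, because daemonic processes are not allowed to have children of their own.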

You can structure your app as a series of process pools communicating via Queues at any nested depth, though it can get hairy pretty quickly (probably due to the required context switching).
It's not erlang though, that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to be fetched, which puts its results in a queue. A Process Pool of 4 workers picks up those results and fetches the feeds; its results (if any) are then put in a queue for another Process Pool to parse and put into a queue to be written back to the database. Done sequentially, this would be really slow because some sites take their own sweet time to respond, so most of the time the process was waiting on data from the internet and would only use one core. Under this process-based model, it seems I'm actually waiting on the database the most, my NIC is saturated most of the time, and all 4 cores are actually doing something. Your mileage may vary.

Yes - but, you might run into an issue which would require the fix I committed to python trunk yesterday. See bug http://bugs.python.org/issue5313

Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are rarely used.
