I am writing a GUI application in PyQt5 and would like to store a large number of images (>5000 RGB images).
Right now I have a function that stores every picture with cv2.imwrite, but this takes a lot of time. I read here on Stack Overflow that I can speed this up with multiprocessing, but I'm very new to Python.
My multiprocessing code:
def SaveImages(self):
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target = self.SaveAllImages, args=self)
        jobs.append(p)
        p.start()
The function SaveAllImages contains the basic code for storing the images frame by frame. If I run this code, I get an error:
p = multiprocessing.Process(target = SaveAllImages, args=self)
NameError: name 'SaveAllImages' is not defined
But SaveAllImages is defined: def SaveAllImages(self)
So my questions are:
Why do I get this error?
How can I implement very simple multiprocessing for storing images?
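The error you're seeing is because you're calling a name that doesn't exist: the traceback shows target = SaveAllImages, and a method is only reachable through its instance, as self.SaveAllImages. Note also that args has to be a tuple of the arguments to pass, not self.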
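To make the difference concrete, here is a minimal sketch. The class Saver and the method save_all_images are hypothetical stand-ins for your widget and SaveAllImages, and it uses threading.Thread, which takes the same target and args parameters as multiprocessing.Process.

```python
import threading


class Saver:
    def __init__(self):
        self.saved = []

    def save_all_images(self, start, stop):
        # Stand-in for the real per-frame save loop.
        self.saved.extend(range(start, stop))

    def broken(self):
        # Bare name: Python looks for a module-level function and fails.
        return threading.Thread(target=save_all_images, args=(0, 3))

    def working(self):
        # Bound method via self; args is a tuple of the remaining arguments.
        t = threading.Thread(target=self.save_all_images, args=(0, 3))
        t.start()
        t.join()
        return self.saved


saver = Saver()
try:
    saver.broken()
except NameError as exc:
    error_message = str(exc)

result = saver.working()
```

The broken variant raises the same kind of NameError as in the question, because the method name only exists as an attribute of the instance.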
You'll likely see better performance with multithreading than multiprocessing. Multiprocessing is best for CPU-bound tasks, for the simple reason that CPython uses a global interpreter lock (GIL) that lets only one thread execute Python bytecode at a time. Multiprocessing is a hack to get around this lock. It's nastier to work with than threading and it's best to avoid it unless absolutely necessary.
Multithreading is likely enough for your use case and it won't create lots of gotchas for a new programmer. Here's a working sample setup using Python 3's Futures API that will easily scale your problem size, just add your arguments and actual save code in the marked places.
import concurrent.futures

# Save single image
def save_image(image_arg):
    # your image save code goes here
    print("Working on image {}...".format(image_arg))
    return True

# max_workers specifies the number of threads. If None, Python picks a default:
# min(32, os.cpu_count() + 4) since Python 3.8, 5x your CPU count before that.
with concurrent.futures.ThreadPoolExecutor(max_workers=None) as executor:
    # Images we'll save. Depending on how you generate your images you might not
    # want to materialize a list like this to avoid running out of memory.
    image_args = ["image1", "image2", "image3"]
    # Submit futures to the executor pool.
    # Map each future back to the arguments used to create that future. That way
    # if one fails we know which image it was that failed.
    future_to_args = {executor.submit(save_image, image_arg): image_arg for image_arg in image_args}
    # Images are being saved in worker threads. They will complete in any order.
    for future in concurrent.futures.as_completed(future_to_args):
        image_arg = future_to_args[future]
        try:
            result = future.result()
        except Exception as exc:
            print("Saving image {} generated an exception: {}".format(image_arg, exc))
        else:
            print("Image {} saved successfully.".format(image_arg))
If you insist on multiprocessing, just use ProcessPoolExecutor instead. That might be worthwhile if you also want to generate your images in parallel.
Whether ThreadPoolExecutor or ProcessPoolExecutor is better depends a lot on what the rest of your workload is and how you structured it. Try both to see which works better for you. Note that multiprocessing places restrictions on communication and sharing state between workers, hence why I suggest trying threads first.
Before you try to improve, you should always measure performance.
Use a disk testing program to see what the maximum sustained write throughput of your disk is.
Then use a performance monitoring program to check the write throughput that your program generates (without multithreading or multiprocessing).
If your program can reach the same throughput as the test program most of the time, then there is little you can do.
Assuming that you are using a regular hard disk, the best way to improve write performance is to use an SSD instead.
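If you don't have a disk testing program at hand, a rough first estimate can be taken from Python itself. This is only a sketch: the function name measure_write_throughput and the sizes are my own choices, the result depends on OS caching, and the temporary file may live on a different disk than the one your images go to.

```python
import os
import tempfile
import time


def measure_write_throughput(total_mb=16, chunk_mb=4):
    """Write total_mb of random data to a temporary file and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device
        elapsed = time.perf_counter() - start
    return total_mb / elapsed


throughput = measure_write_throughput()
print("Sustained write: {:.0f} MB/s".format(throughput))
```

Compare this number with the total size of your images divided by the time your save loop takes. If the two are close, the disk is the bottleneck and more threads or processes will not help.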
I currently have a sequential code with the following parts (all parts are properly encapsulated and isolated in class methods and so on):
Frame capture from network stream with opencv VideoCapture
Processing of the image with yolov7 through pytorch (with cuda)
Classical processing of the yolov7 output
Extra heavy classical processing done every X frames (X will be 30 or 60)
Because this will need to run in real time (a small constant latency is allowed, but not a growing one), something needs to be done, since it currently runs at 15 fps (already ignoring odd frames).
Time profiling shows that the most time-consuming steps are (2) and (4) (no surprise).
When I started looking up info for (1), I learned about the threading module, which seemed promising and popped up in a lot of Stack Overflow answers about increasing code speed when image capture from cameras is involved. This led me to see salvation in this module (because of misconceptions about parallelization) until I learnt that it still executes only one thread at a time because of the GIL. I am also aware of the existence of asyncio, multithreading and concurrent.futures.ProcessPoolExecutor. I have also read this post on threading. Multicore processing is available.
Aim is to have (1) capture frames into a queue. (2) takes the frame when there's one available and processes it. As soon as it finishes the processing, give the output to (3) and read the next available frame in the queue to keep processing while (3) and (4) are executed too. (3) takes (2) output and processes it super fast and waits until there's output from (2) and repeat. Finally, every X frames, (4) will read the outputs generated by (3) until that point and perform some heavy calculations.
If I have understood well, I should use multiprocessing instead of multithreading since it's a rather intensive calculation problem (apart from the I/O on (1)).
So the questions really are:
For (1), since this is I/O related, is threading combined with queue a good way to go?
For (2), (3) and (4), what is the way to proceed? I need them to run at the same time, especially (2) and (4), since (3) really runs at nearly 300 fps
For (2), is there a way to process two frames at the same time? For example processing even and odd frames at the same time. This is a critical point right now to get to real-time processing.
How difficult is all this? I'm not really an expert in this topic (I'm actually a physicist) so I don't know if I'm getting myself in too much of a slippery slope here.
I just need someone who knows about all this mess to point me in the right direction, so don't hesitate to add some references. Thank you very much in advance!
It is difficult to provide an authoritative answer to a question like this. I can share some experience from working on a similar problem (image processing on a Raspberry Pi on a drone).
In my experience, multiprocessing is the way to go. Because of the GIL limitations, if you are pushing the limits of a single core, threading will do very little. You would use threading in an environment where you have blocking but not resource-intensive operations, such as IO/network. Also, you want to avoid mixing Threads and Processes, especially if you are not confident you know what you are doing.
You essentially want a pipeline of processes, multiple stages, each handling a specific task, running in parallel. In this case, your per frame processing time goes from the sum of your individual stage processing times to the max of your per-stage processing times.
Practically, you want a Process per stage, with a multiprocessing Queue between each pair of stages. You want to test each stage separately first, passing items in and out in a controlled manner, and then try to integrate them. It is not trivial, but you can do it.
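A minimal sketch of that stage-and-queue structure follows. To keep it runnable anywhere, it uses threading.Thread and queue.Queue; the same stage function should work with multiprocessing.Process and multiprocessing.Queue as long as the stage functions are defined at module level. The functions detect and postprocess are hypothetical stand-ins for stages (2) and (3).

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # pass the shutdown signal downstream
            break
        outbox.put(func(item))


def detect(frame):
    # Hypothetical stand-in for stage (2).
    return frame * 2


def postprocess(detection):
    # Hypothetical stand-in for stage (3).
    return detection + 1


q_frames, q_detections, q_results = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(detect, q_frames, q_detections)),
    threading.Thread(target=stage, args=(postprocess, q_detections, q_results)),
]
for w in workers:
    w.start()

# Stage (1): feed frames, then signal the end of the stream.
for frame in range(5):
    q_frames.put(frame)
q_frames.put(SENTINEL)

results = []
while True:
    item = q_results.get()
    if item is SENTINEL:
        break
    results.append(item)

for w in workers:
    w.join()
```

With one worker per stage the order of the frames is preserved, and each stage can be tested on its own by feeding its inbox by hand.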
As to having two processes for (2) in parallel, while possible, there are some things to consider:
Is CUDA capable of handling that? Because you are passing some of the processing over to your GPU, you may become limited by the GPU, not the CPU. Thus, sending twice the amount of work to it may just do nothing.
You want to make sure you have a robust way of re-ordering your frames. You'd most likely end up with two Queues published to by (1), and two Queues for (3) to consume from. You need to make sure (3) consumes items from the right queue and emits them in order. You may want to pass along some metadata, such as a frame index, to help with this.
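The re-ordering step can be sketched as a small buffer keyed on a frame index. This assumes that every result carries a unique, consecutive index starting at 0; the function name reorder is my own.

```python
import heapq


def reorder(results):
    """Yield (index, payload) pairs in index order from an out-of-order stream."""
    pending = []      # min-heap keyed on the frame index
    next_index = 0    # index of the next frame we are allowed to emit
    for index, payload in results:
        heapq.heappush(pending, (index, payload))
        # Emit everything that is now contiguous with what was already emitted.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)
            next_index += 1


arrived = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
ordered = list(reorder(arrived))
```

The buffer only holds frames that arrived early, so its size stays small as long as the two workers run at a similar pace.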
As you mentioned, Python is limited by the GIL, which means that only one thread in a Python process executes Python bytecode at a time. For I/O, threading works well, but you will not be able to use multiple CPU cores for Python code even if you have them available.
However, there is the wonderful multiprocessing module in Python that allows you to create a new process. This takes more time to create than a thread but can run on a different CPU core.
There are plenty of examples of this (see the python module page linked above), but here is one using a Queue, from DigitalOcean:
from multiprocessing import Process, Queue, current_process
import time
import queue # imported for using queue.Empty exception


def do_job(tasks_to_accomplish, tasks_that_are_done):
    while True:
        try:
            '''
                try to get task from the queue. get_nowait() function will
                raise queue.Empty exception if the queue is empty.
                get(False) function would do the same task also.
            '''
            task = tasks_to_accomplish.get_nowait()
        except queue.Empty:
            break
        else:
            '''
                if no exception has been raised, add the task completion
                message to task_that_are_done queue
            '''
            print(task)
            tasks_that_are_done.put(task + ' is done by ' + current_process().name)
            time.sleep(.5)
    return True


def main():
    number_of_task = 10
    number_of_processes = 4
    tasks_to_accomplish = Queue()
    tasks_that_are_done = Queue()
    processes = []

    for i in range(number_of_task):
        tasks_to_accomplish.put("Task no " + str(i))

    # creating processes
    for w in range(number_of_processes):
        p = Process(target=do_job, args=(tasks_to_accomplish, tasks_that_are_done))
        processes.append(p)
        p.start()

    # completing process
    for p in processes:
        p.join()

    # print the output
    while not tasks_that_are_done.empty():
        print(tasks_that_are_done.get())

    return True


if __name__ == '__main__':
    main()
I have this code:
fog_coeff = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
start = time.time()
for f in fog_coeff:
    foggy_images= am.add_fog(images[0:278],fog_coeff=f)
    for img in foggy_images:
        im = Image.fromarray(img)
        im.save('./result/'+str(counter)+'-'+str(f)+'.jpg')
        counter += 1
print("time taken"+ str(time.time()-start))
I want to parallelize this. How can I do this? My main idea was to take each value from the fog_coeff list and give it to one core. Each core will then process 278 images. Is this the right direction? If so, how can I proceed?
You have two options for this: threads or processes. Threads are allowed to share memory, but they are limited in what they can do concurrently. You can, for example, use variables to share the results, but you will have to use locks to avoid data races.
Processes, on the other hand, do not share memory, which gives them full concurrency at the OS level. You will have to use some form of inter-process communication, such as queues, pipes or sockets, to send the output back to the main process, or have them write their results to files.
The answer would depend on which of these two mechanisms you choose.
Edit: elaborating on multiprocessing.
This is done with the multiprocessing library. You basically define the function that you want the other process to run and then run it in a different process. Processes are handled by the OS, not by Python, so your OS scheduler is in charge of where each process is executed. There are more advanced tools, like process pools, that let you keep, say, 4 processes running at all times (if you are on a quad-core machine), but you won't be able to tell your OS how it should handle those processes. It may want to run its own background processes as well.
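As a sketch of the process-pool idea, assuming one task per fog coefficient as proposed in the question: process_coefficient is a hypothetical stand-in for the am.add_fog call plus the save loop, and the executor class is a parameter so that you can compare processes and threads with the same code.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def process_coefficient(coeff):
    """Hypothetical stand-in for am.add_fog(...) plus saving the images."""
    n_images = 278
    # Real code would load or receive the images here, add the fog and save
    # each result under a name built from the coefficient.
    return coeff, n_images


def run_all(coeffs, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run one task per coefficient and return the results in input order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(process_coefficient, coeffs))
```

You would call run_all(fog_coeff) from under an if __name__ == '__main__': guard, or run_all(fog_coeff, executor_cls=ThreadPoolExecutor) to try threads. Keep in mind that with processes the arguments and return values are pickled and copied between processes, so it is usually cheaper to let each worker save its own images than to send them back.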
If I have 10+ million small tasks to process in Python (converting images or similar), how can I create a queue and save progress in case the processing crashes? To be clear, how can I save progress, or stop the process whenever I want, and continue processing from the last point?
Also, how do I deal with multiple threads in that case?
In general, the question is how to save progress on the processed data to a file. The issue is that with a huge number of very small files, saving the file after each iteration will take longer than the processing itself...
Thanks!
(sorry for my English if it's not clear)
First off, I would suggest not using multithreading. Use multiprocessing instead. Because of the GIL, multiple threads do not run in parallel in Python when it comes to computation-intensive tasks.
To solve the problem of saving results, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Assign each process one chunk.
3. After every 1000 steps, append the names of the processed files to a file on disk (say monitor.txt), assuming that you can afford to process up to 1000 files again in case of failure.
4. In case of failure, skip all the files that are recorded in monitor.txt for each process.
5. You can have monitor_1.txt, monitor_2.txt ... for each process so you will not have to read the whole file for each process.
The following gist might help you. You just need to add code for the 4th point.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
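The monitor-file steps can be sketched roughly like this for a single chunk. This is not the code from the gist: the function names are my own, and the process argument stands for whatever you do with each file.

```python
import os
import tempfile


def load_done(monitor_path):
    """Return the set of file names already recorded as processed."""
    if not os.path.exists(monitor_path):
        return set()
    with open(monitor_path) as f:
        return {line.strip() for line in f if line.strip()}


def flush(batch, monitor_path):
    """Append the batch of names to the monitor file and clear the batch."""
    if batch:
        with open(monitor_path, "a") as f:
            f.write("\n".join(batch) + "\n")
        batch.clear()


def process_chunk(file_names, monitor_path, process, batch_size=1000):
    """Process file_names, skipping recorded ones and checkpointing in batches."""
    done = load_done(monitor_path)
    batch = []
    for name in file_names:
        if name in done:
            continue  # already handled in an earlier run
        process(name)
        batch.append(name)
        if len(batch) >= batch_size:
            flush(batch, monitor_path)
    flush(batch, monitor_path)  # record the final partial batch


# Demo: the second run skips everything the first run recorded.
monitor = os.path.join(tempfile.mkdtemp(), "monitor_1.txt")
names = ["img_{}.png".format(i) for i in range(5)]
first_run, second_run = [], []
process_chunk(names, monitor, first_run.append, batch_size=2)
process_chunk(names, monitor, second_run.append, batch_size=2)
```

After a crash, at most batch_size - 1 files are processed a second time, which is the trade-off against writing to the monitor file after every single task.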
I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest is to use multithreading and not multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be
from threading import Thread


def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')


if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    threads = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]

    # here you start all the threads
    for thread in threads:
        thread.start()

    # here you wait for all threads to finish
    for thread in threads:
        thread.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the files from the data-processing process, and let the I/O process pick these up and handle them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know if this will yield a big advantage over the threading based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster, then the Queue can grow very big. This might have a performance impact, depending on your system amongst other things. A quick workaround is to pause the data processing if the queue gets too large.
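One way to get that pause for free is a bounded queue: put() blocks while the queue is full. The sketch below uses queue.Queue with threads so that it runs anywhere; multiprocessing.Queue accepts a maxsize argument as well. The names producer and consumer and the size of 4 are my own choices.

```python
import queue
import threading


def producer(q, n_items):
    for i in range(n_items):
        q.put(i)      # blocks while the queue already holds maxsize items
    q.put(None)       # sentinel: no more items


def consumer(q, written):
    while True:
        item = q.get()
        if item is None:
            break
        written.append(item)  # stand-in for the slow write to disk


q = queue.Queue(maxsize=4)
written = []
threads = [
    threading.Thread(target=producer, args=(q, 20)),
    threading.Thread(target=consumer, args=(q, written)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer never gets more than four items ahead of the consumer, so memory use stays bounded no matter how fast the data processing is.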
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
    # mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.
I have 6 large files, each of which contains a dictionary object that I saved to a hard disk using pickle. It takes about 600 seconds to load all of them sequentially. I want to load them all at the same time to speed up the process. Assuming they are all the same size, I hope to load them in 100 seconds instead. I used multiprocessing and apply_async to load each of them separately, but it still runs sequentially. This is the code I used, and it doesn't work.
The code is for 3 of these files, but it would be the same for all six. I put the 3rd file on another hard disk to make sure the I/O is not the limit.
def loadMaps():
    start = timeit.default_timer()
    procs = []
    pool = Pool(3)
    pool.apply_async(load1(),)
    pool.apply_async(load2(),)
    pool.apply_async(load3(),)
    pool.close()
    pool.join()
    stop = timeit.default_timer()
    print('loadFiles takes in %.1f seconds' % (stop - start))
If your code is primarily limited by IO and the files are on multiple disks, you might be able to speed it up using threads:
import concurrent.futures
import pickle

def read_one(fname):
    with open(fname, 'rb') as f:
        return pickle.load(f)

def read_parallel(file_names):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(read_one, f) for f in file_names]
        return [fut.result() for fut in futures]
The GIL will not force IO operations to run serially, because Python consistently releases it when doing IO.
Several remarks on alternatives:
multiprocessing is unlikely to help because, while it guarantees to do its work in multiple processes (and therefore free of the GIL), it also requires the content to be transferred between the subprocess and the main process, which takes additional time.
asyncio will not help you at all because it doesn't natively support asynchronous file system access (and neither do the popular OSes). While it can emulate it with threads, the effect is the same as the code above, only with much more ceremony.
Neither option will speed up loading the six files by a factor of six. Consider that at least some of the time is spent creating the dictionaries, which will be serialized by the GIL. If you want to really speed up startup, a better approach is not to create the whole dictionary upfront and switch to an in-file database, possibly using the dictionary to cache access to its content.
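As a rough sketch of the in-file database idea, here is one possible shape using sqlite3 from the standard library, with a plain dict as the cache. The class name DiskDict, the table layout and the use of pickle for the values are my own assumptions, not something your data requires.

```python
import os
import pickle
import sqlite3
import tempfile


class DiskDict:
    """Mapping stored in an SQLite file, with an in-memory cache for reads."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value BLOB)")
        self.cache = {}

    def bulk_insert(self, mapping):
        rows = [(k, pickle.dumps(v)) for k, v in mapping.items()]
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO items VALUES (?, ?)", rows)
        self.cache.clear()  # cached values may be stale now

    def __getitem__(self, key):
        if key not in self.cache:
            row = self.conn.execute(
                "SELECT value FROM items WHERE key = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            self.cache[key] = pickle.loads(row[0])
        return self.cache[key]


path = os.path.join(tempfile.mkdtemp(), "maps.sqlite")
store = DiskDict(path)
store.bulk_insert({"a": [1, 2], "b": {"x": 3}})
value = store["b"]
```

Startup then only opens the file, and each entry is unpickled the first time it is requested instead of all at once.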
I am new to multiprocessing in Python. I want to extract features from each frame of hour-long video files. Processing each frame takes on the order of 30 ms. I thought multiprocessing was a good idea because each frame is processed independently of all other frames.
I want to store the results of the feature extraction in a custom class.
I read a few examples and ended up using multiprocessing and Queues as suggested here. The result was disappointing though: now each frame takes about 1000 ms to process. I am guessing I produced a ton of overhead.
Is there a more efficient way to process the frames in parallel and collect the results?
To illustrate, I put together a dummy example.
import multiprocessing as mp
from multiprocessing import Process, Queue
import numpy as np
import cv2

def main():
    #path='path\to\some\video.avi'
    coordinates=np.random.random((1000,2))
    #video = cv2.VideoCapture(path)
    listOf_FuncAndArgLists=[]
    for i in range(50):
        #video.set(cv2.CAP_PROP_POS_FRAMES,i)
        #img_frame_original = video.read()[1]
        #img_frame_original=cv2.cvtColor(img_frame_original, cv2.COLOR_BGR2GRAY)
        img_frame_dummy=np.random.random((300,300)) #using dummy image for this example
        frame_coordinates=coordinates[i,:]
        listOf_FuncAndArgLists.append([parallel_function,frame_coordinates,i,img_frame_dummy])
    queues=[Queue() for fff in listOf_FuncAndArgLists] #create a queue object for each function
    jobs = [Process(target=storeOutputFFF,args=[funcArgs[0],funcArgs[1:],queues[iii]]) for iii,funcArgs in enumerate(listOf_FuncAndArgLists)]
    for job in jobs: job.start() # Launch them all
    for job in jobs: job.join() # Wait for them all to finish
    # And now, collect all the outputs:
    return([queue.get() for queue in queues])

def storeOutputFFF(fff,theArgs,que): #add a argument to function for assigning a queue
    print('MULTIPROCESSING: Launching %s in parallel ' % fff.__name__)
    que.put(fff(*theArgs)) #we're putting return value into queue

def parallel_function(frame_coordinates,i,img_frame_original):
    #do some image processing that takes about 20-30 ms
    dummyResult=np.argmax(img_frame_original)
    return(resultClass(dummyResult,i))

class resultClass(object):
    def __init__(self,maxIntensity,i):
        self.maxIntensity=maxIntensity
        self.i=i

if __name__ == '__main__':
    mp.freeze_support()
    a=main()
    [x.maxIntensity for x in a]
Parallel processing in (regular) Python is a bit of a pain: in other languages we'd just use threads, but the GIL makes that problematic, and using multiprocessing has a big overhead in moving data around. I've found that fine-grained parallelism is (relatively) hard to do, whilst processing 'chunks' of work that take tens of seconds (or more) to process in a single process can be much more straightforward.
An easier path to parallel processing for your problem, if you're on a UNIXy system, would be to write a Python program which processes a segment of video specified on the command line (i.e. a frame number to start with, and a number of frames to process), and then use the GNU parallel tool to process multiple segments at once. A second Python program can consolidate the results from a collection of files, or by reading from stdin, piped from parallel. This way the processing code doesn't need to do its own parallelism, but it does require the input file to be opened by several processes at once and frames to be extracted starting from mid-points. (This might also be extendable to work across multiple machines without changing the Python code...)
multiprocessing.Pool.map could be used in a similar way if you need a pure-Python solution: map over a list of tuples (say, (file, startframe, endframe)) and then open the file in the function and process that segment.
Multiprocessing creates some overhead for starting several processes and bringing them all back together.
Your code does that for every frame.
Try splitting your video into N evenly-sized pieces and processing them in parallel.
Set N equal to the number of cores on your machine or something like that (your mileage may vary, but it's a good number to start experimenting with). There's no point in creating 50 processes if, say, 4 of them are being executed and the rest are simply waiting for their turn.
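A minimal sketch of the split-into-N-pieces approach might look like this. The function names are my own, process_segment is a hypothetical stand-in that would open the video, seek to the start frame and process up to the stop frame, and the executor class is a parameter so that the same code can be tried with processes or threads.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def make_segments(total_frames, n_segments):
    """Split range(total_frames) into n_segments contiguous (start, stop) pairs."""
    base, extra = divmod(total_frames, n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments get one additional frame each.
        stop = start + base + (1 if i < extra else 0)
        segments.append((start, stop))
        start = stop
    return segments


def process_segment(segment):
    """Hypothetical stand-in: returns (frame index, dummy feature) pairs."""
    start, stop = segment
    return [(i, i * i) for i in range(start, stop)]


def process_video(total_frames, n_workers, executor_cls=ProcessPoolExecutor):
    """Process all segments in parallel and return the results in frame order."""
    segments = make_segments(total_frames, n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        per_segment = executor.map(process_segment, segments)
        return [item for chunk in per_segment for item in chunk]
```

Each worker is started once and handles a whole segment, so the process start-up cost is paid N times instead of once per frame. Call process_video from under an if __name__ == '__main__': guard when using processes.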