Python processing items from list/queue and saving progress

I have 10+ million small tasks to process in Python (converting images and the like). How can I create a queue and save progress so that I can resume if the processing crashes? To be clear: how can I save progress, stop the process whenever I want, and continue processing from the last point?
Also, how should I deal with multiple threads in that case?
In general, the question is how to save progress on processed data to a file. The issue is that with a huge number of very small files, saving the file after each iteration will take longer than the processing itself...
Thanks!
(Sorry for my English if it's not clear.)

First off, I would suggest not going for multithreading; use multiprocessing instead. Because of the GIL, multiple threads do not run in parallel in Python when the task is computation-intensive.
To solve the problem of saving progress, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Assign each process one chunk.
3. Every 1000 steps, append the names of the processed files to a file on disk (say monitor.txt). This assumes that you can afford to process up to 1000 files again after a failure.
4. After a failure, skip all the files that are recorded in monitor.txt for each process.
You can have monitor_1.txt, monitor_2.txt, ... one per process, so that no process has to read the whole list.
The following gist might help you. You just need to add code for the 4th point.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
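Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the code from the gist: the helper names `load_done`, `flush` and `process_chunk` are made up for this example, and it assumes file names contain no newline characters. Each worker process would call `process_chunk` with its own chunk and its own monitor file.

```python
import os


def load_done(monitor_path):
    """Return the set of names already recorded in the monitor file."""
    if not os.path.exists(monitor_path):
        return set()
    with open(monitor_path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def flush(monitor_path, pending):
    """Append the pending names to the monitor file and empty the list."""
    if pending:
        with open(monitor_path, "a", encoding="utf-8") as f:
            f.write("\n".join(pending) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to put the checkpoint on disk
        pending.clear()


def process_chunk(names, monitor_path, work, batch_size=1000):
    """Run work(name) for every name not yet recorded, checkpointing in batches."""
    done = load_done(monitor_path)
    pending = []  # names processed since the last checkpoint
    processed = 0
    try:
        for name in names:
            if name in done:
                continue  # step 4: skip what an earlier run already finished
            work(name)
            pending.append(name)
            processed += 1
            if len(pending) >= batch_size:
                flush(monitor_path, pending)  # step 3: one write per batch
    finally:
        # On a clean stop or a Python exception, record the partial batch too.
        flush(monitor_path, pending)
    return processed
```

If the process is killed outright (power loss, `kill -9`), the `finally` block does not run, so up to `batch_size - 1` items may be processed again on restart. That is the trade-off the answer accepts in exchange for writing to disk only once per batch.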

I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest approach is to use multithreading rather than multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be:
from threading import Thread
def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')
if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    threads = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the threads
    for thread in threads:
        thread.start()
    # here you wait for all threads to finish
    for thread in threads:
        thread.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the files from the data-processing process, and let the I/O process pick them up and handle them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know if this will yield a big advantage over the threading-based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster than the I/O, the Queue can grow very big. This might have a performance impact, depending on your system among other things. A quick workaround is to pause the data processing if the queue gets too large.
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
    # mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.
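The "pause the data processing if the queue gets too large" workaround can be had for free from a bounded queue, whose `put()` blocks while the queue is full. The sketch below is only an illustration under a few assumptions: it uses `threading.Thread` and `queue.Queue` instead of a separate process to stay short (`multiprocessing.Process` and `multiprocessing.Queue` offer the same `start`/`join` and `put`/`get` calls, and `multiprocessing.Queue` also accepts `maxsize`), the names `writer`, `run` and `SENTINEL` are made up, and appending to a list stands in for the real file write.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker; real items must not be None


def writer(q, sink):
    """Consume results one after the other until the sentinel arrives."""
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        sink.append(item)  # stand-in for the real (slow) file write


def run(items, transform, maxsize=100):
    """Transform items in the caller's thread and hand the results to a writer thread."""
    q = queue.Queue(maxsize=maxsize)  # put() blocks while maxsize items are waiting
    sink = []
    t = threading.Thread(target=writer, args=(q, sink))
    t.start()
    for item in items:
        q.put(transform(item))  # pauses here whenever the writer falls behind
    q.put(SENTINEL)  # tell the writer that no more work is coming
    t.join()
    return sink
```

With a separate process instead of a thread, the list would not be shared between the two sides, so the writer would have to write to disk (or send results back through a second queue).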

Related

How to increase processing speed of python code through parallelization

I currently have sequential code with the following parts (all parts are properly encapsulated and isolated in class methods and so on):
(1) Frame capture from a network stream with opencv VideoCapture
(2) Processing of the image with yolov7 through pytorch (with cuda)
(3) Classical processing of the yolov7 output
(4) Extra heavy classical processing done every X frames (X will be 30 or 60)
Because this will need to run in real time (a small constant latency is allowed, but not a growing one), something needs to be done, since it currently runs at 15 fps (already ignoring odd frames).
Time profiling shows that the most time-consuming parts are (2) and (4) (no surprise).
When I started looking up info for (1), I learned about the threading module, which seemed promising and popped up in a lot of Stack Overflow answers about increasing code speed when image capture from cameras was involved. This led me to see salvation in this module (because of misconceptions about parallelization) until I learnt that it still executes one thread at a time because of the GIL. I am also aware of the existence of asyncio, multithreading and concurrent.futures.ProcessPoolExecutor. I have also read this post on threading. Multicore processing is available.
The aim is to have (1) capture frames into a queue. (2) takes a frame when there's one available and processes it. As soon as it finishes the processing, it gives the output to (3) and reads the next available frame from the queue, to keep processing while (3) and (4) are executed too. (3) takes (2)'s output, processes it super fast, waits until there's more output from (2), and repeats. Finally, every X frames, (4) will read the outputs generated by (3) up to that point and perform some heavy calculations.
If I have understood correctly, I should use multiprocessing instead of multithreading, since it's a rather calculation-intensive problem (apart from the I/O in (1)).
So the questions really are:
For (1), since this is I/O related, is threading combined with queue a good way to go?
For (2), (3) and (4), what is the way to proceed? I need them to run at the same time, especially (2) and (4), since (3) really runs at nearly 300 fps
For (2), is there a way to process two frames at the same time? For example processing even and odd frames at the same time. This is a critical point right now to get to real-time processing.
How difficult is all this? I'm not really an expert in this topic (I'm actually a physicist) so I don't know if I'm getting myself in too much of a slippery slope here.
I just need someone who knows about all this mess to point me in the right direction, so don't hesitate to add some references. Thank you very much in advance!

It is difficult to give an authoritative answer to a question like this. I can share some experience from working on a similar problem (image processing on a Raspberry Pi on a drone).
In my experience, multiprocessing is the way to go. Because of the GIL limitations, if you are pushing the limits of a single core, threading will do very little. You would use threading in an environment where you have blocking but not resource-intensive operations, such as IO/network. Also, you want to avoid mixing threads and processes, especially if you are not confident you know what you are doing.
You essentially want a pipeline of processes: multiple stages, each handling a specific task, running in parallel. In this case, your effective per-frame processing time (the interval between finished frames) goes from the sum of your individual stage processing times to the max of your per-stage processing times.
Practically, you want a Process per stage, with a multiprocessing Queue between them. You want to test each stage separately first, passing items in and out in a controlled manner, and then try to integrate them. It is not trivial, but you can do it.
As to having two processes for (2) in parallel, while possible, there are some things to consider:
- Is CUDA capable of handling that? Because you are passing some of the processing over to your GPU, the pipeline may be limited by the GPU rather than the CPU. Thus, sending twice the amount of work to it may achieve nothing.
- You want to make sure you have a robust way of re-ordering your frames. You'd most likely end up with two Queues published to by (1), and two Queues for (3) to consume from. You need to make sure (3) consumes items from the right queue and emits them in order. You may want to pass along some metadata to help with this.
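One possible form for the metadata mentioned in the last point is a frame sequence number attached by (1). The sketch below is an assumption about how the re-ordering could be done, not part of the answer: the class name `ReorderBuffer` is made up, and it assumes sequence numbers are unique and have no gaps (a dropped frame would stall it).

```python
import heapq


class ReorderBuffer:
    """Release (seq, payload) pairs in sequence order, whatever order they arrive in."""

    def __init__(self, first_seq=0):
        self._next = first_seq  # sequence number we are waiting for
        self._heap = []         # results that arrived early, smallest seq on top

    def push(self, seq, payload):
        """Store one result and return the list of results that are now in order."""
        heapq.heappush(self._heap, (seq, payload))
        ready = []
        while self._heap and self._heap[0][0] == self._next:
            ready.append(heapq.heappop(self._heap))
            self._next += 1
        return ready
```

Stage (3) would call `push` with whatever comes out of either queue and forward only the pairs that are returned:

```python
buf = ReorderBuffer()
buf.push(1, "odd frame result")   # returns [] because frame 0 is still missing
buf.push(0, "even frame result")  # returns both results, frame 0 first
```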

As you mentioned, Python is limited by the GIL, which means a Python process can only execute Python code on one CPU core at a time. For I/O, threading works well, but you will not be able to use multiple CPU cores if you have them available.
However, there is the wonderful multiprocessing module in Python that allows you to create a new process. A process takes more time to create than a thread, but it can run on a different CPU core.
There are plenty of examples of this (see the Python module page for multiprocessing); here is one using a Queue, from DigitalOcean:
from multiprocessing import Lock, Process, Queue, current_process
import time
import queue  # imported for using queue.Empty exception
def do_job(tasks_to_accomplish, tasks_that_are_done):
    while True:
        try:
            '''
            try to get task from the queue. get_nowait() function will
            raise queue.Empty exception if the queue is empty.
            get(False) function would do the same task also.
            '''
            task = tasks_to_accomplish.get_nowait()
        except queue.Empty:
            break
        else:
            '''
            if no exception has been raised, add the task completion
            message to task_that_are_done queue
            '''
            print(task)
            tasks_that_are_done.put(task + ' is done by ' + current_process().name)
            time.sleep(.5)
    return True
def main():
    number_of_task = 10
    number_of_processes = 4
    tasks_to_accomplish = Queue()
    tasks_that_are_done = Queue()
    processes = []
    for i in range(number_of_task):
        tasks_to_accomplish.put("Task no " + str(i))
    # creating processes
    for w in range(number_of_processes):
        p = Process(target=do_job, args=(tasks_to_accomplish, tasks_that_are_done))
        processes.append(p)
        p.start()
    # completing process
    for p in processes:
        p.join()
    # print the output
    while not tasks_that_are_done.empty():
        print(tasks_that_are_done.get())
    return True
if __name__ == '__main__':
    main()

How to increase speed of reading excel file using python file object?

I am processing around 2800 Excel files using a Python file object. Reading them takes so long that my tool needs 5 hours to execute, so I want to know whether there is any way to make reading the Excel files faster.
The code that reads the Excel files:
import os
path = os.getcwd()
folder = path + "\\input"
files = os.listdir(folder)
for file in files:
    _input = folder + '\\' + file
    f = open(_input)
    data = f.read()

Try processing the Excel files in parallel with each other. Have a look at:
Multiprocessing
Threading

Fundamentally, there are two things you can do: either speed up the processing of each file, or process multiple files simultaneously. The best solution depends on why it is taking so long. You could start by checking whether the processing that happens on each file is as fast as it can be.
As for processing in parallel:
If a Python program is taking a long time to run because it's waiting for files to be read and written, it can help to use threading. This will allow one thread to process one file while another thread is waiting for its data to be read or written. Whether or not this will help depends on many factors. If the processing itself accounts for most of the time, it won't help. If file IO accounts for most of the time, it might help. Reading multiple files in parallel won't be faster than reading them sequentially if the hard drive is already serving them as fast as it can. Essentially, threading (in Python) only helps if the computer switches back and forth between waiting for the CPU to finish processing, and then waiting for the hard drive to write, and then waiting for the hard drive to read, etcetera. This is because of the Global Interpreter Lock in Python.
To work around the GIL, we need to use multi-processing, where Python actually launches multiple separate processes. This allows it to use more CPU resources, which can dramatically speed things up. It doesn't come for free, however. Each process takes a lot longer to start up than each thread, and they can't really share much in the way of resources so they will use more memory. Whether or not it's worth it depends on the task at hand.
The easiest (in my opinion) way to use multiple threads or processes in parallel is the concurrent.futures module. Assuming we have some function that we want to run on each file:
def process_file(file_path):
    pass  # do stuff
Then we can run this sequentially:
for file_name in some_list_of_files:
    process_file(file_name)
... or in parallel, either via threads:
import concurrent.futures
number_of_threads = 4
with concurrent.futures.ThreadPoolExecutor(number_of_threads) as executor:
    for file_name in some_list_of_files:
        executor.submit(process_file, file_name)
# leaving the with block waits for all submitted tasks to finish
print("All done!")
Or with multiprocessing:
import concurrent.futures
if __name__ == "__main__":
    number_of_processes = 4
    with concurrent.futures.ProcessPoolExecutor(number_of_processes) as executor:
        for file_name in some_list_of_files:
            executor.submit(process_file, file_name)
    # leaving the with block waits for all submitted tasks to finish
    print("All done!")
We need the if __name__ == "__main__" bit because the processes that we spin up will actually import the Python file (but the name won't be "__main__"), so we need to stop them from recursively redoing the same work.
Which is faster will depend entirely on the actual work that needs doing. Sometimes it's faster to just do it sequentially in the main thread like in "normal" code.
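One thing to be aware of with `executor.submit` is that the return value and any exception raised by `process_file` stay inside the returned future until `result()` is called on it, so a failing file goes unnoticed if the futures are thrown away. A minimal sketch of collecting both, assuming a made-up helper name `run_all` and using threads (the same pattern works with `ProcessPoolExecutor`):

```python
import concurrent.futures


def run_all(process_file, file_names, max_workers=4):
    """Run process_file on every name; return (results, errors) keyed by file name."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        # Remember which future belongs to which file.
        futures = {executor.submit(process_file, name): name for name in file_names}
        # as_completed yields each future as soon as its task has finished.
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()  # re-raises the task's exception here
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

The `errors` dictionary then tells you which of the files need a second look, instead of the run silently reporting "All done!".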

Multithreaded parsing is slower than sequential

I am parsing 4 large XML files through threads, and somehow the multithreaded code is slower than the sequential code?
Here is my multithreaded code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
and that's the "faster" code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)
    for t in thread_list:
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
The sequential code is faster by 10 whole minutes. How is this possible?

Python uses the GIL (Global Interpreter Lock) to ensure only one thread executes Python code at a time. This is done to prevent data races and for some other reasons. That, however, means that multithreading in the default CPython will give you barely any speedup for CPU-bound code (and may even slow it down, as it did in your case).
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL.
Here's a SO question on that topic

Where did you read that multithreading or even multiprocessing should always be faster than sequential code? That is simply wrong. Which one of the three modes is faster depends heavily on the problem to solve and on where the bottleneck is:
- if the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet
- if the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for IO completion during that time and you will get a much better throughput, even if the really faster way is normally polling IO with select when possible
- only if the bottleneck is CPU processing (which IMHO is not the most common use case) is parallelization over different cores the winner. In Python that means multiprocessing (*). That mainly concerns heavy computations
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 ends long before thread0, you will only process its results after thread0 has ended, which is suboptimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
active_list = thread_list[:]
while len(active_list) > 0:
    for t in active_list[:]:  # iterate over a copy, since we remove items
        if not t.is_alive():
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Technically, this means releasing the Global Interpreter Lock.
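The IO-bound case from the list above can be illustrated with a small timing sketch. It is only a toy under a stated assumption: `time.sleep` stands in for a blocking read or write, since a sleeping thread releases the GIL just as a thread waiting on IO does. The function names are made up for this example.

```python
import threading
import time


def fake_io(seconds):
    time.sleep(seconds)  # the GIL is released while the thread sleeps


def sequential(n, seconds):
    """Do n blocking waits one after the other."""
    for _ in range(n):
        fake_io(seconds)


def threaded(n, seconds):
    """Do n blocking waits in n threads, so the waits overlap."""
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def timed(func, *args):
    """Return how long func(*args) takes, in seconds."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

Comparing `timed(sequential, 4, 0.05)` with `timed(threaded, 4, 0.05)` should show the threaded version finishing in roughly the time of a single wait. Replacing the sleep with a pure-Python computation would be expected to remove that advantage, which is the CPU-bound case where multiprocessing comes in.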

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

In Python, should I use threading for this?

I'm working on a small program that collects certain data and processes it. Currently, the program runs constantly on my server, storing the data to the disk. Every once in a while I run another program that reads the stored data, processes it, sorts it, saves it to a new location, and clears the old data files.
I've never learned about threads, but it SOUNDS like this is a good place to use them? If threading works the way I think it does, I could set up a queue to hold the data, and have a separate thread that could pull data from the queue and process it as it's ready. If the queue is full, thread1 could sleep for a bit. If it's empty, thread2 could sleep for a bit.
That would reduce disk writing, get rid of disk reading, and make the data collection run side-by-side with the data processing to save time.
Is any of this accurate? I'm a senior CS student and threads have never once come up (Surely that's a little odd?). I would appreciate any tips/knowledge/advice with using threads, as well as if this is the correct solution to my "problem".
Thanks!

That does sound like a situation where some form of parallelism might be useful. However, since this is Python, you might not want to actually use threads. Python, in the standard implementation, has something called the Global Interpreter Lock. Effectively, to allow the garbage collector to work, only one thread of a Python program can actually be running Python code at any time (modules written directly in C, or external operations such as disk IO or database queries, are not "running Python code" for this purpose, though you will have called them from Python).
Because of this, threading in Python is generally only a good idea if your Python code spends significant amounts of time waiting on responses from non-Python parts of the program, or external sources. If the data collection or processing is being done outside Python (collecting from a database or website, processing in numpy, etc.), that might be reasonable. If your code isn't in this situation often enough, your program ends up wasting more time switching between threads than it gains (because if two threads are both in Python code, it still only runs one at a time).
In that case, you should try the multiprocessing module instead. This is also a generally safer model, as the only things that can be shared between processes in multiprocessing are the things you explicitly share (while threads share all state, potentially allowing one thread to break another because you forgot to lock something).
Alternately, you might use subprocess. Effectively, this would be having your first program intermittently restart the second one every time it finishes a batch of data.
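The subprocess route could look roughly like the sketch below. It is an illustration only: `process_batch` is a made-up helper, and the inline `PROCESSOR_CODE` stands in for the second program so that the example is self-contained (a real setup would pass the path of the processing script instead of `-c` plus inline code, and would more likely hand over a file name than the data itself).

```python
import subprocess
import sys

# Stand-in for the second program: it just sorts and prints its arguments.
PROCESSOR_CODE = "import sys; print(sorted(sys.argv[1:]))"


def process_batch(batch):
    """Run the processing program once for this batch and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", PROCESSOR_CODE, *batch],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()
```

The collector would call `process_batch` each time it has gathered a batch. Because `subprocess.run` waits for the child to exit, collection pauses during processing; `subprocess.Popen` starts the child without waiting, if the two should overlap.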

What's the Best Way to Schedule and Manage Multiple Processes in Python 3

I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull that line from the Queue, do their thing, and put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!

If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
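A rough sketch of the split and combine steps, assuming that lines are independent and that the output order should match the input order. The function names and the `chunk_0000.txt` naming are made up for this example, and it splits by a fixed number of lines per chunk rather than into exactly N chunks, which avoids having to count the lines first.

```python
import os


def split_file(src_path, lines_per_chunk, out_dir):
    """Write consecutive runs of lines to chunk files; return their paths in order."""
    paths = []
    out = None
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:  # time to start the next chunk file
                if out:
                    out.close()
                path = os.path.join(out_dir, f"chunk_{len(paths):04d}.txt")
                paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()
    return paths


def manipulate_chunk(in_path, out_path, manipulate):
    """The serial script: apply manipulate() to every line of one chunk."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(manipulate(line))


def combine(paths, dst_path):
    """Concatenate the processed chunks, in the given order, into one file."""
    with open(dst_path, "w", encoding="utf-8") as dst:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    dst.write(line)
```

Each `manipulate_chunk` call touches only its own pair of files, so the calls can be started as separate processes (or as separate invocations of a script) without any queue between them.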

You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/
