How to parallelise python script for processing 10,000 files? - python

I have more than 10,000 C files, each of which I need to pass to some application foo.exe for processing, generating a disassembly file for each C file, i.e. at the end of this process I will have 10,000 lst/output files. Assume that this process is not IO-bound (despite the fact that foo.exe writes a new lst file to disk for each C file; is that a correct assumption?).
My task
To implement a parallel Python program that gets the job done in minimum time, by utilizing all CPU cores for this task.
My approach
I have implemented this program and it works for me; the pseudo code is listed below:
Iterate over all the C files and push the absolute path of each one into a global list, files_list.
Calculate the number of logical CPU cores (with the psutil module); this will be the maximum number of worker threads to dispatch later. Let's assume it is 8 threads.
Generate a new list, workers_list (a list of lists), which contains the intervals or indexes (L_index, R_index) yielded by dividing files_list by 8, e.g. if I have 800 C files then workers_list will look like this: workers_list = [[0,99],[100,199],...,[700,799]].
Dispatch 8 worker threads, each of which handles a single entry in workers_list. Each thread opens a process (subprocess.call(...)) and calls foo.exe on the current C file.
posting the relevant code below:
The relevant Code
import subprocess
import psutil
import threading
import os


class LstGenerator(object):
    def __init__(self):
        self.elfdumpExePath = r"C:\.....\elfdump.exe"  # abs path to the executable
        self.output_dir = r"C:\.....\out"  # abs path to where i want the lst files to be generated
        self.files = []  # assuming that i have all the files in this list (abs path for each .C file)

    def slice(self, files):
        # split the file list into one chunk per logical core
        files_len = len(files)
        j = psutil.cpu_count()
        slice_step = files_len / j  # integer division on Python 2
        workers_list = []
        lhs = 0
        rhs = slice_step
        while j:
            workers_list.append(files[lhs:rhs])
            lhs += slice_step
            rhs += slice_step
            j -= 1
            if j == 1:  # last iteration: the final chunk gets the remainder
                workers_list.append(files[lhs:files_len])
                break
        for each in workers_list:  # for debug only
            print len(each)
        return workers_list

    def disassemble(self, objectfiles):
        for each_object in objectfiles:
            cmd = "{elfdump} -T {object} -o {lst}".format(
                elfdump=self.elfdumpExePath,
                object=each_object,
                lst=os.path.join(self.output_dir, os.path.basename(each_object).rstrip('o') + 'lst'))
            subprocess.call(cmd, shell=True)

    def execute(self):
        class FuncThread(threading.Thread):
            def __init__(self, target, *args):
                threading.Thread.__init__(self)
                self._target = target
                self._args = args

            def run(self):
                # Python 2's Thread will not call a target stored this way by
                # itself, so run() has to do it explicitly
                self._target(*self._args)

        workers = []
        for portion in self.slice(self.files):
            workers.append(FuncThread(self.disassemble, portion))
        # dispatch the workers
        for worker in workers:
            worker.start()
        # wait for (join) the dispatched workers
        for worker in workers:
            worker.join()


if __name__ == '__main__':
    lst_gen = LstGenerator()
    lst_gen.execute()
My Questions
Can I do this in a more efficient way?
Does Python have a standard lib or module that can get the job done and reduce my code/logic complexity? Maybe multiprocessing.Pool?
Running on Windows, with Python 2.7!
Thanks

Yes, multiprocessing.Pool can help with this. That also does the work of sharding the list of inputs for each CPU. Here is python code (untested) that should get you on your way.
import multiprocessing
import os
import subprocess


def convert(objectfile):
    elfdumpExePath = r"C:\.....\elfdump.exe"
    output_dir = r"C:\.....\out"
    cmd = "{elfdump} -T {obj} -o {lst}".format(
        elfdump=elfdumpExePath,
        obj=objectfile,
        lst=os.path.join(output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
    # run the command and return its exit code
    return subprocess.call(cmd, shell=True)


if __name__ == "__main__":  # the __main__ guard is required for multiprocessing on Windows
    files = ["foo.c", "foo1.c", "foo2.c"]
    p = multiprocessing.Pool()
    outputs = p.map(convert, files)
Keep in mind that your worker function (convert above) must accept one argument. So if you need to pass in an input path and output path, that must be done as a single argument, and your list of filenames will have to be transformed into a list of pairs, where each pair is input and output.
The answer above is for python 2.7, but keep in mind that python2 has reached its end-of-life. In python3, you can use multiprocessing.Pool in a with statement so that it cleans up on its own.
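For example, on Python 3 a minimal sketch of the same worker could look like this; the (input file, output dir) pairs are only an illustration of packing two values into the single argument mentioned above:
import multiprocessing
import os
import subprocess


def convert(job):
    # single argument: an (objectfile, output_dir) pair
    objectfile, output_dir = job
    cmd = "{elfdump} -T {obj} -o {lst}".format(
        elfdump=r"C:\.....\elfdump.exe",
        obj=objectfile,
        lst=os.path.join(output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
    return subprocess.call(cmd, shell=True)


if __name__ == "__main__":
    jobs = [(f, r"C:\.....\out") for f in ["foo.c", "foo1.c", "foo2.c"]]
    # the with statement closes and joins the pool automatically (Python 3)
    with multiprocessing.Pool() as pool:
        exit_codes = pool.map(convert, jobs)
    print(exit_codes)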

Posting an answer to my own question after struggling with it for a while, and noticing that I can import concurrent.futures in Python 2.x (via the futures backport)! This approach reduces the code complexity to a minimum and even improves the execution time. Unlike my first thoughts, this process is more IO-bound than CPU-bound! Yet the time efficiency I got was convincing enough to run the program with multiple processes.
concurrent.futures
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using
ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
Both implement the same interface, which is defined by the abstract
Executor class.
class concurrent.futures.Executor
An abstract class that provides
methods to execute calls asynchronously. It should not be used
directly, but through its concrete subclasses.
submit(fn, *args, **kwargs)
Schedules the callable, fn, to be executed as fn(*args, **kwargs) and
returns a Future object representing the execution of the callable.
For further reading please follow the link below:
parallel tasks with concurrent.futures
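As a tiny illustration of the submit interface quoted above (pow is just a stand-in callable, mirroring the pattern from the docs):
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(pow, 2, 10)
    print(future.result())  # prints 1024
The full solution using ProcessPoolExecutor then looks like this: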
import subprocess
import os
import concurrent.futures  # on Python 2.7 this comes from the "futures" backport


class LstGenerator(object):
    def __init__(self):
        self.elfdumpExePath = r"C:\.....\elfdump.exe"  # abs path to the executable
        self.output_dir = r"C:\.....\out"  # abs path to where i want the lst files to be generated
        self.files = []  # assuming that i have all the files in this list (abs path for each .C file)

    def disassemble(self, objectfile):
        cmd = "{elfdump} -T {object} -o {lst}".format(
            elfdump=self.elfdumpExePath,
            object=objectfile,
            lst=os.path.join(self.output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
        return subprocess.call(cmd, shell=True, stdout=subprocess.PIPE)

    def execute(self):
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # pass the callable and its argument separately; calling
            # self.disassemble(f) here would run it in the parent process
            results = [executor.submit(self.disassemble, objfile) for objfile in self.files]


if __name__ == '__main__':
    lst_gen = LstGenerator()
    lst_gen.execute()

Related

How to run multiple python scripts simultaneously from a wrapper script in such a way that CPU utilization is maximized?

I have to run about 200-300 python scripts daily having different arguments, for example:
python scripts/foo.py -a bla -b blabla ..
python scripts/foo.py -a lol -b lolol ..
....
Let's say I already have all these arguments for every script present inside a list, and I would like to execute them concurrently such that the CPU is always busy. How can I do so?
My current solution:
script for running multiple processes:
workers = 15

for i in range(0, len(jobs), workers):
    job_string = ""
    for j in range(i, min(i + workers, len(jobs))):
        job_string += jobs[j] + " & "
    if len(job_string) == 0:
        continue
    print(job_string)
    val = subprocess.check_call("./scripts/parallelProcessing.sh '%s'" % job_string, shell=True)
scripts/parallelProcessing.sh (used in the above script)
echo $1
echo "running scripts in parallel"
eval $1
wait
echo "done processing"
Drawback:
I am executing K processes in a batch, and then another K, and so on. But CPU core utilization is much lower as the number of running processes keeps decreasing, and eventually only one process is running at a time (for a given batch). As a result, the time taken to complete all the processes is significant.
One simple solution is to ensure K processes are always running, i.e. once a process completes, a new one must be scheduled. But I am not sure how to implement such a solution.
Expectations:
As the task is not very latency-sensitive, I am looking for a simple solution which keeps the CPU mostly busy.
Note: Any two of those processes can execute simultaneously without any concurrency issues. The host where these processes run has python2.
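For reference, one simple way to keep K workers busy the whole time is a multiprocessing.Pool whose workers just shell out; this is only a sketch, with placeholder job strings and K=15 taken from the snippet above:
import multiprocessing
import subprocess


def run_job(job):
    # job is one full command line, e.g. "python scripts/foo.py -a bla -b blabla"
    return job, subprocess.call(job, shell=True)


if __name__ == "__main__":
    jobs = ["python scripts/foo.py -a bla -b blabla",
            "python scripts/foo.py -a lol -b lolol"]
    pool = multiprocessing.Pool(processes=15)  # a new job starts as soon as a worker frees up
    for job, returncode in pool.imap_unordered(run_job, jobs):
        print("%s -> %d" % (job, returncode))
    pool.close()
    pool.join()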
This is a technique I developed for calling many external programs using subprocess.Popen. In this example, I'm calling convert to make JPEG images from DICOM files.
In short: it uses manageprocs() to keep checking a list of running subprocesses. If one has finished, it is removed and a new one is started, as long as unprocessed files remain. After that, the remaining processes are watched until they have all finished.
from datetime import datetime
from functools import partial
import argparse
import logging
import os
import subprocess as sp
import sys
import time


def main():
    """
    Entry point for dicom2jpg.
    """
    args = setup()
    if not args.fn:
        logging.error("no files to process")
        sys.exit(1)
    if args.quality != 80:
        logging.info(f"quality set to {args.quality}")
    if args.level:
        logging.info("applying level correction.")
    start_partial = partial(start_conversion, quality=args.quality, level=args.level)
    starttime = str(datetime.now())[:-7]
    logging.info(f"started at {starttime}.")
    # List of subprocesses
    procs = []
    # Do not launch more processes concurrently than your CPU has cores.
    # That will only lead to the processes fighting over CPU resources.
    maxprocs = os.cpu_count()
    # Launch and manage subprocesses for all files.
    for path in args.fn:
        while len(procs) == maxprocs:
            manageprocs(procs)
        procs.append(start_partial(path))
    # Wait for all subprocesses to finish.
    while len(procs) > 0:
        manageprocs(procs)
    endtime = str(datetime.now())[:-7]
    logging.info(f"completed at {endtime}.")
def start_conversion(filename, quality, level):
    """
    Convert a DICOM file to a JPEG file.
    Removing the blank areas from the Philips detector.

    Arguments:
        filename: name of the file to convert.
        quality: JPEG quality to apply
        level: Boolean to indicate whether level adjustment should be done.

    Returns:
        Tuple of (input filename, output filename, subprocess.Popen)
    """
    outname = filename.strip() + ".jpg"
    size = "1574x2048"
    args = [
        "convert",
        filename,
        "-units",
        "PixelsPerInch",
        "-density",
        "300",
        "-depth",
        "8",
        "-crop",
        size + "+232+0",
        "-page",
        size + "+0+0",
        "-auto-gamma",
        "-quality",
        str(quality),
    ]
    if level:
        args += ["-level", "-35%,70%,0.5"]
    args.append(outname)
    proc = sp.Popen(args, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
    return (filename, outname, proc)
def manageprocs(proclist):
    """Check a list of subprocesses for processes that have ended and
    remove them from the list.

    Arguments:
        proclist: List of tuples. The last item in the tuple must be
            a subprocess.Popen object.
    """
    for item in proclist[:]:  # iterate over a copy; we remove items from the list
        filename, outname, proc = item
        if proc.poll() is not None:
            logging.info(f"conversion of “{filename}” to “{outname}” finished.")
            proclist.remove(item)
    # since manageprocs is called from a loop, keep CPU usage down.
    time.sleep(0.05)


if __name__ == "__main__":
    main()
I've left out setup(); it's using argparse to deal with command-line arguments.
Here the thing to be processed is just a list of file names.
But it could also be (in your case) a list of tuples of script names and arguments.
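For instance, adapted to the question above, the launcher could take (script, argument list) tuples instead of file names; this is only a sketch, with hypothetical jobs and the plain python interpreter name:
import os
import subprocess as sp
import time


def start_job(script, scriptargs):
    # hypothetical: run "python script arg1 arg2 ..." without a shell
    cmd = ["python", script] + scriptargs
    return (" ".join(cmd), sp.Popen(cmd, stdout=sp.DEVNULL, stderr=sp.DEVNULL))


def manageprocs(proclist):
    for item in proclist[:]:  # iterate over a copy; we remove items from the list
        desc, proc = item
        if proc.poll() is not None:
            print("{} finished with return code {}".format(desc, proc.returncode))
            proclist.remove(item)
    time.sleep(0.05)


if __name__ == "__main__":
    jobs = [("scripts/foo.py", ["-a", "bla", "-b", "blabla"]),
            ("scripts/foo.py", ["-a", "lol", "-b", "lolol"])]
    maxprocs = os.cpu_count()
    procs = []
    for script, scriptargs in jobs:
        while len(procs) == maxprocs:
            manageprocs(procs)
        procs.append(start_job(script, scriptargs))
    while procs:
        manageprocs(procs)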

Creating main process for a for loop

This program returns the resolution of a video, but since I need it for a large-scale project I need multiprocessing. I have tried parallel processing with a different function, but that would just run it multiple times, which is not efficient. I am posting the entire code. Can you help me create a main process that uses all cores?
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip

if __name__ == "__main__":
    dire = askdirectory()
    d = dire[:]
    print(dire)
    death = os.listdir(dire)
    print(death)
    for i in death:  # multiprocess this loop
        dire = d
        dire += f"/{i}"
        v = VideoFileClip(dire)
        print(f"{i}: {v.size}")
This code works fine, but I need help with creating a main process (that uses all cores) for the for loop alone. Please excuse the variable names; I was angry at multiprocessing. Also, if you have tips on making the code more efficient, I would appreciate it.
You are, I suppose, assuming that every file in the directory is a video clip. I am assuming that processing the video clip is an I/O-bound "process" for which threading is appropriate. Here I have rather arbitrarily created a thread pool with a size of 20 threads this way:
MAX_WORKERS = 20 # never more than this
N_WORKERS = min(MAX_WORKERS, len(death))
You would have to experiment with how large MAX_WORKERS could be before performance degrades. This might be a low number not because your system cannot support lots of threads but because concurrent access to multiple files on your disk that may be spread across the medium may be inefficient.
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip
from concurrent.futures import ThreadPoolExecutor as Executor
from functools import partial


def process_video(parent_dir, file):
    v = VideoFileClip(f"{parent_dir}/{file}")
    print(f"{file}: {v.size}")


if __name__ == "__main__":
    dire = askdirectory()
    print(dire)
    death = os.listdir(dire)
    print(death)
    worker = partial(process_video, dire)
    MAX_WORKERS = 20  # never more than this
    N_WORKERS = min(MAX_WORKERS, len(death))
    with Executor(max_workers=N_WORKERS) as executor:
        results = executor.map(worker, death)  # an iterator of return values (all None here)
Update
According to @Reishin, moviepy ends up executing the ffmpeg executable, and thus creates a process in which the work is being done. So there is no point in also using multiprocessing here.
moviepy is just a wrapper around ffmpeg, designed to edit clips, and thus works with one file at a time; the performance is quite poor. Invoking a new process for each of a number of files is time-consuming. In the end, the need for multiple processes might be the result of choosing the wrong lib.
I'd like to recommend using the pyAV lib instead, which provides direct Python bindings for ffmpeg and good performance:
import av
import os
from tkinter.filedialog import askdirectory
import multiprocessing
from concurrent.futures import ThreadPoolExecutor as Executor

MAX_WORKERS = int(multiprocessing.cpu_count() * 1.5)


def get_video_resolution(path):
    container = None
    try:
        container = av.open(path)
        frame = next(container.decode(video=0))
        return path, f"{frame.width}x{frame.height}"
    finally:
        if container:
            container.close()


def files_to_proccess():
    video_dir = askdirectory()
    return (full_file_path for f in os.listdir(video_dir)
            if (full_file_path := os.path.join(video_dir, f)) and not os.path.isdir(full_file_path))


def main():
    for f in files_to_proccess():
        print(f"{os.path.basename(f)}: {get_video_resolution(f)[1]}")


def main_multi_threaded():
    with Executor(max_workers=MAX_WORKERS) as executor:
        for path, resolution in executor.map(get_video_resolution, files_to_proccess()):
            print(f"{os.path.basename(path)}: {resolution}")


if __name__ == "__main__":
    # main()
    main_multi_threaded()
Above are single-threaded and multi-threaded implementations, with an optimal parallelism setting (in case multithreading is absolutely required).

Python concurrent.futures using subprocess, running several python script

I want to run several Python scripts at the same time using concurrent.futures.
The serial version of my code goes and looks for a specific Python file in each folder and executes it.
import os
import subprocess
from glob import glob

FileList = []
start_dir = os.getcwd()
pattern = "Read.py"

for dir, _, _ in os.walk(start_dir):
    FileList.extend(glob(os.path.join(dir, pattern)))

i = 0
for file in FileList:
    dir = os.path.dirname(file)
    dirname1 = os.path.basename(dir)
    print(dirname1)
    i = i + 1
    Str = 'python ' + file
    print(Str)
    completed_process = subprocess.run(Str)
For the parallel version of my code:
def Python_callback(future):
    print(future.run_type, future.jid)
    return "One Folder finished executing"


def Python_execute():
    from concurrent.futures import ProcessPoolExecutor as Pool
    args = FileList
    pool = Pool(max_workers=1)
    future = pool.submit(subprocess.call, args, shell=1)
    future.run_type = "run_type"
    future.jid = FileList
    future.add_done_callback(Python_callback)
    print("Python executed")


if __name__ == '__main__':
    import subprocess
    Python_execute()
The issue is that I am not sure how to pass each element of FileList to a separate CPU.
Thanks for your help in advance.
The smallest change is to use submit once for each element, instead of once for the whole list:
futures = []
for file in FileList:
    future = pool.submit(subprocess.call, file, shell=1)
    future.blah blah
    futures.append(future)
The futures list is only necessary if you want to do something with the futures—wait for them to finish, check their return values, etc.
Meanwhile, you're explicitly creating the pool with max_workers=1. Not surprisingly, this means you'll only get 1 worker child process, so it'll end up waiting for one subprocess to finish before grabbing the next one. If you want to actually run them concurrently, remove that max_workers and let it default to one per core (or pass max_workers=8 or some other number that's not 1, if you have a good reason to override the default).
While we're at it, there are a lot of ways to simplify what you're doing:
Do you really need multiprocessing here? If you need to communicate with each subprocess, that can be painful to do in a single thread—but threads, or maybe asyncio, will work just as well as processes here.
More to the point, it doesn't look like you actually do need anything but launch the process and wait for it to finish, and that can be done in simple, synchronous code.
Why are you building a string and using shell=1 instead of just passing a list and not using the shell? Using the shell unnecessarily creates overhead, safety problems, and debugging annoyances.
You really don't need the jid on each future—it's just the list of all of your invocation strings, which can't be useful. What might be more useful is some kind of identifier, or the subprocess return code, or… probably lots of other things, but they're all things that could be done by reading the return value of subprocess.call or a simple wrapper.
You really don't need the callback either. If you just gather all the futures in a list and as_completed it, you can print the results as they show up more simply.
If you do both of the above, you've got nothing left but a pool.submit inside the loop—which means you can replace the entire loop with pool.map (see the sketch after this list).
You rarely need, or want, to mix os.walk and glob. When you actually have a glob pattern, apply fnmatch over the files list from os.walk. But here, you're just looking for a specific filename in each dir, so really, all you need to filter on is file == 'Read.py'.
You're not using the i in your loop. But if you do need it, it's better to do for i, file in enumerate(FileList): than to do for file in FileList: and manually increment an i.
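A minimal sketch putting several of those points together (no shell, one submit per file, results gathered with as_completed; the plain python interpreter name is an assumption):
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_script(path):
    # pass an argument list directly: no string building, no shell=1
    return path, subprocess.call(["python", path])


def run_all(filelist):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_script, path) for path in filelist]
        for future in as_completed(futures):
            path, returncode = future.result()
            print(path, "finished with return code", returncode)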

parallel writing to list in python

I got multiple parallel processes writing into one list in python. My code is:
global_list = []

class MyThread(threading.Thread):
    ...
    def run(self):
        results = self.calculate_results()
        global_list.extend(results)

def total_results():
    for param in params:
        t = MyThread(param)
        t.start()
    while threading.active_count() > 1:
        pass
    return global_list
I don't like this approach, as it has:
An overall global variable -> what would be the way to have a local variable for the total_results function?
The way I check when the list is returned seems somewhat clumsy; what would be the standard way?
Is your computation CPU-intensive? If so you should look at the multiprocessing module which is included with Python and offers a fairly easy to use Pool class into which you can feed compute tasks and later get all the results. If you need a lot of CPU time this will be faster anyway, because Python doesn't do threading all that well: only a single interpreter thread can run at a time in one process. Multiprocessing sidesteps that (and offers the Pool abstraction which makes your job easier). Oh, and if you really want to stick with threads, multiprocessing has a ThreadPool too.
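A minimal sketch of that Pool idea, assuming calculate_results(param) is the existing per-parameter work function; this keeps the results local to total_results and avoids the busy-wait:
from multiprocessing.pool import ThreadPool  # or multiprocessing.Pool for CPU-bound work


def total_results(params):
    pool = ThreadPool()
    try:
        # map blocks until every task is done and returns the results in order
        per_param = pool.map(calculate_results, params)
    finally:
        pool.close()
        pool.join()
    # flatten the per-parameter result lists into one list
    return [item for results in per_param for item in results]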
1 - Use a class variable shared between all Worker instances to collect your results
from threading import Thread

class Worker(Thread):
    results = []
    ...
    def run(self):
        results = self.calculate_results()
        Worker.results.extend(results)  # extending a list is thread safe
2 - Use join() to wait until all the threads are done and let them have some computation time
def total_results(params):
    # create all workers
    workers = [Worker(p) for p in params]
    # start all workers
    [w.start() for w in workers]
    # wait for all of them to finish
    [w.join() for w in workers]
    # get the result
    return Worker.results

How to limit number of concurrent threads in Python?

How can I limit the number of concurrent threads in Python?
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
Here is what I have so far:
def process_file(fname):
    # open file and do something

def process_file_thread(queue, fname):
    queue.put(process_file(fname))

def process_all_files(d):
    files = glob.glob(d + '/*')
    q = Queue.Queue()
    for fname in files:
        t = threading.Thread(target=process_file_thread, args=(q, fname))
        t.start()
    q.join()

def main():
    process_all_files('.')
    # Do something after all files have been processed
How can I modify the code so that only 4 threads are run at a time?
Note that I want to wait for all files to be processed and then continue and work on the processed files.
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
That's exactly what a thread pool does: You create jobs, and the pool runs 4 at a time in parallel. You can make things even simpler by using an executor, where you just hand it functions (or other callables) and it hands you back futures for the results. You can build all of this yourself, but you don't have to.*
The stdlib's concurrent.futures module is the easiest way to do this. (For Python 3.1 and earlier, see the backport.) In fact, one of the main examples is very close to what you want to do. But let's adapt it to your exact use case:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        concurrent.futures.wait(fs)
If you wanted process_file to return something, that's almost as easy:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        for f in concurrent.futures.as_completed(fs):
            do_something(f.result())
And if you want to handle exceptions too… well, just look at the example; it's just a try/except around the call to result().
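For completeness, a sketch of that variant, following the docs example's pattern (process_file and do_something are the same placeholders as above):
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = {executor.submit(process_file, file): file for file in files}
        for f in concurrent.futures.as_completed(fs):
            try:
                do_something(f.result())
            except Exception as exc:
                print('%r generated an exception: %s' % (fs[f], exc))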
* If you want to build them yourself, it's not that hard. The source to multiprocessing.pool is well written and commented, and not that complicated, and most of the hard stuff isn't relevant to threading; the source to concurrent.futures is even simpler.
I have used this technique a few times; I think it's a bit ugly though:
import threading

def process_something():
    something = list(get_something)
    def worker():
        while something:
            obj = something.pop()
            # do something with obj
    threads = [threading.Thread(target=worker) for i in range(4)]
    [t.start() for t in threads]
    [t.join() for t in threads]
