I have more than 10,000 C files, and I need to pass each one of them to some application, foo.exe, for processing and generating a disassembly file per C file, i.e. at the end of this process I will have 10,000 lst/output files. I am assuming that this process is not I/O-bound (despite the fact that foo.exe writes a new lst file to disk for each C file); is that a correct assumption?
My task
To implement a parallel Python program that gets the job done in minimum time, by utilizing all CPU cores for this task.
My approach
I have implemented this program and it works for me; the pseudocode is listed below:
Iterate over all the C files and push the absolute path of each one into a global list, files_list.
Calculate the number of logical CPU cores (with the psutil Python module); this will be the maximum number of threads to dispatch later. Let's assume it is 8 threads.
Generate a new list, workers_list (a list of lists), which contains the slices of files_list produced by dividing it into 8 parts, e.g. if I have 800 C files then workers_list will look like this: workers_list = [[0-99], [100-199], ..., [700-799]].
Dispatch 8 worker threads, each of which handles a single entry of workers_list. Each thread opens a process (subprocess.call(...)) and calls foo.exe on the current C file.
I am posting the relevant code below.
The relevant Code
import subprocess
import threading
import os

import psutil


class LstGenerator(object):
    def __init__(self):
        self.elfdumpExePath = r"C:\.....\elfdump.exe"  # abs path to the executable
        self.output_dir = r"C:\.....\out"  # abs path to where I want the lst files to be generated
        self.files = []  # assuming that I have all the files in this list (abs path of each .c file)

    def slice(self, files):
        # split the file list into one chunk per logical core
        files_len = len(files)
        j = psutil.cpu_count()
        slice_step = files_len / j  # integer division in Python 2
        workers_list = []
        lhs = 0
        rhs = slice_step
        while j:
            workers_list.append(files[lhs:rhs])
            lhs += slice_step
            rhs += slice_step
            j -= 1
            if j == 1:  # last iteration takes whatever remains
                workers_list.append(files[lhs:files_len])
                break
        for each in workers_list:  # for debug only
            print len(each)
        return workers_list

    def disassemble(self, objectfiles):
        for each_object in objectfiles:
            cmd = "{elfdump} -T {object} -o {lst}".format(
                elfdump=self.elfdumpExePath,
                object=each_object,
                lst=os.path.join(self.output_dir,
                                 os.path.basename(each_object).rstrip('o') + 'lst'))
            subprocess.call(cmd, shell=True)

    def execute(self):
        class FuncThread(threading.Thread):
            def __init__(self, target, *args):
                # hand target/args to Thread itself so its default run() invokes them
                threading.Thread.__init__(self, target=target, args=args)

        workers = []
        for portion in self.slice(self.files):
            workers.append(FuncThread(self.disassemble, portion))

        # dispatch the workers
        for worker in workers:
            worker.start()

        # wait for (join) the previously dispatched workers
        for worker in workers:
            worker.join()


if __name__ == '__main__':
    lst_gen = LstGenerator()
    lst_gen.execute()
My Questions
Can I do this in a more efficient way?
Does Python have a standard library or module that can get the job done and reduce my code/logic complexity? Maybe multiprocessing.Pool?
I am running on Windows, with Python 2.7.
Thanks.
Yes, multiprocessing.Pool can help with this. It also does the work of sharding the list of inputs across the CPUs. Here is Python code (untested) that should get you on your way.
import multiprocessing
import os
import subprocess


def convert(objectfile):
    elfdumpExePath = r"C:\.....\elfdump.exe"
    output_dir = r"C:\.....\out"
    cmd = "{elfdump} -T {obj} -o {lst}".format(
        elfdump=elfdumpExePath,
        obj=objectfile,
        lst=os.path.join(output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
    return subprocess.call(cmd, shell=True)  # run elfdump and report its exit code


if __name__ == "__main__":  # required on Windows so the pool's children don't re-run this block
    files = ["foo.c", "foo1.c", "foo2.c"]
    p = multiprocessing.Pool()
    outputs = p.map(convert, files)
Keep in mind that your worker function (convert above) must accept exactly one argument. So if you need to pass in both an input path and an output path, they must be combined into a single argument, and your list of filenames will have to be transformed into a list of pairs, where each pair is (input, output), as in the sketch below.
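For example, a minimal sketch of that transformation (the names convert_pair, inputs, outputs, in_path and out_path are made up for illustration; convert_pair plays the role of convert above):

import multiprocessing
import os
import subprocess


def convert_pair(pair):
    # Pool.map still passes a single argument: here it is an (input, output) tuple
    in_path, out_path = pair
    cmd = "{elfdump} -T {obj} -o {lst}".format(
        elfdump=r"C:\.....\elfdump.exe", obj=in_path, lst=out_path)
    return subprocess.call(cmd, shell=True)


if __name__ == "__main__":
    inputs = ["foo.c", "foo1.c", "foo2.c"]
    outputs = ["foo.lst", "foo1.lst", "foo2.lst"]
    pairs = zip(inputs, outputs)  # list of (input, output) tuples in Python 2
    results = multiprocessing.Pool().map(convert_pair, pairs)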
The code above is for Python 2.7, but keep in mind that Python 2 has reached its end of life. In Python 3, you can use multiprocessing.Pool in a with statement so that it cleans up after itself, as sketched below.
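For reference, a minimal Python 3 sketch of the same call (reusing the convert function from above); the with block terminates the pool automatically when it exits:

import multiprocessing

if __name__ == "__main__":
    files = ["foo.c", "foo1.c", "foo2.c"]
    with multiprocessing.Pool() as pool:  # the pool is cleaned up when the block exits
        outputs = pool.map(convert, files)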
Posting an answer to my own question after struggling with this for a while, and noticing that I can import concurrent.futures in Python 2.x (via the futures backport package on PyPI). This approach reduces the code complexity to a minimum and even improves the execution time. Unlike my first thoughts, this process is more I/O-bound than CPU-bound! Yet, the time efficiency I got was convenient enough to run the program with multiple processes.
concurrent.futures
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using
ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
Both implement the same interface, which is defined by the abstract
Executor class.
class concurrent.futures.Executor
An abstract class that provides
methods to execute calls asynchronously. It should not be used
directly, but through its concrete subclasses.
submit(fn, *args, **kwargs)
Schedules the callable, fn, to be executed as fn(*args, **kwargs) and
returns a Future object representing the execution of the callable.
For further reading, please follow the link below:
parallel tasks with concurrent.futures
import subprocess
import os
import concurrent.futures


class LstGenerator(object):
    def __init__(self):
        self.elfdumpExePath = r"C:\.....\elfdump.exe"  # abs path to the executable
        self.output_dir = r"C:\.....\out"  # abs path to where I want the lst files to be generated
        self.files = []  # assuming that I have all the files in this list (abs path of each .c file)

    def disassemble(self, objectfile):
        cmd = "{elfdump} -T {object} -o {lst}".format(
            elfdump=self.elfdumpExePath,
            object=objectfile,
            lst=os.path.join(self.output_dir,
                             os.path.basename(objectfile).rstrip('o') + 'lst'))
        return subprocess.call(cmd, shell=True, stdout=subprocess.PIPE)

    def execute(self):
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # submit the callable and its argument separately; calling it here would run it serially
            results = [executor.submit(self.disassemble, path) for path in self.files]


if __name__ == '__main__':
    lst_gen = LstGenerator()
    lst_gen.execute()
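Since submit() returns a Future for each job, the exit codes can also be collected as the jobs finish. A small, hedged sketch of how execute() above could be extended (same class as above, just iterating over the futures):

    def execute(self):
        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.disassemble, path) for path in self.files]
            for future in concurrent.futures.as_completed(futures):
                retcode = future.result()  # re-raises any exception from the worker
                if retcode != 0:
                    print("elfdump exited with code %d" % retcode)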
I have some text files that I need to read with Python. The text files contain an array of floats only (i.e. no strings), and the size of the array is 2000-by-2000. I tried to use the multiprocessing package, but for some reason it now runs slower. The times I get on my PC for the code attached below are:
Multi thread: 73.89 secs
Single thread: 60.47 secs
What am I doing wrong here, and is there a way to speed up this task? My PC is powered by an Intel Core i7 processor, and in real life I have several hundred of these text files, 600 or even more.
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
import os
import time
from datetime import datetime


def read_from_disk(full_path):
    print('%s reading %s' % (datetime.now().strftime('%H:%M:%S'), full_path))
    out = np.genfromtxt(full_path, delimiter=',')
    return out


def make_single_path(n):
    return r"./dump/%d.csv" % n


def save_flatfiles(n):
    for i in range(n):
        temp = np.random.random((2000, 2000))
        _path = os.path.join('.', 'dump', str(i) + '.csv')
        np.savetxt(_path, temp, delimiter=',')


if __name__ == "__main__":
    # make some text files
    n = 10
    save_flatfiles(n)

    # list with the paths to the text files
    file_list = [make_single_path(d) for d in range(n)]

    pool = ThreadPool(8)
    start = time.time()
    results = pool.map(read_from_disk, file_list)
    pool.close()
    pool.join()
    print('finished multi thread in %s' % (time.time() - start))

    start = time.time()
    for d in file_list:
        out = read_from_disk(d)
    print('finished single thread in %s' % (time.time() - start))

    print('Done')
You are using multiprocessing.dummy, which replicates the API of multiprocessing but is actually a wrapper around the threading module.
So basically you are using threads instead of processes, and threads in Python are not useful (due to the GIL) when you want to perform computational tasks.
So replace:
from multiprocessing.dummy import Pool as ThreadPool
With:
from multiprocessing import Pool
I tried running your code on my machine (which has an i5 processor), and it finished execution in 45 seconds, so I would say that's a big improvement.
Hope this clears things up.
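For context, a minimal hedged sketch of the process-based pool section, reusing read_from_disk and file_list from the question's script (those names are assumed from there, not redefined here):

from multiprocessing import Pool
import time

if __name__ == "__main__":
    start = time.time()
    pool = Pool(8)  # 8 worker processes instead of 8 threads
    results = pool.map(read_from_disk, file_list)
    pool.close()
    pool.join()
    print('finished multi process in %s' % (time.time() - start))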
I am trying to run a function concurrently over multiple files in a GUI using tkinter and concurrent.futures.
Outside of the GUI, this script works fine. However, whenever I translate it over into the GUI script, instead of running the function in parallel, the script opens up 5 new tkinter windows (the number of windows it opens is equal to the number of processors I allow the program to use).
I've looked over the code thoroughly and just can't understand why it is opening new windows as opposed to just running the function over the files.
Can anyone see something I am missing?
An abridged version of the code is below. I have cut out a significant part of the code and only left in the parts pertinent to parallelization. This code undoubtedly has variables in it that I have not defined in this example.
import pandas as pd
import numpy as np
import glob
from pathlib import Path
from tkinter import *
from tkinter import filedialog
from concurrent.futures import ProcessPoolExecutor

window = Tk()
window.title('Problem with parralelizing')
window.geometry('1000x700')


def calculate():
    # establish where the files are coming from to operate on
    folder_input = folder_entry_var.get()

    # establish the number of processors to use
    nbproc = int(np_var.get())

    # loop over files to get a list of files to be worked on by concurrent.futures
    files = []
    for file in glob.glob(rf'{folder_input}' + '//*'):
        files.append(file)

    # this function gets passed to concurrent.futures. I have taken out a significant portion of the
    # function itself, as I do not believe the problem resides in the function itself.
    def process_file(filepath):
        excel_input = excel_entry_var.get()
        minxv = float(min_x_var.get())
        maxxv = float(man_x_var.get())
        output_dir = odir_var.get()

        path = filepath
        event_name = Path(path).stem
        event['event_name'] = event_name

        min_x = 292400
        max_x = 477400

        list_of_objects = list(event.object.unique())
        missing_master_surface = []
        for line in list_of_objects:
            df = event.loc[event.object == line]
            current_y = df.y.max()
            y_cl = df.x.values.tolist()
            full_ys = np.arange(min_x, max_x + 200, 200).tolist()
            for i in full_ys:
                missing_values = []
                missing_v_y = []
                exist_yn = []
                event_name_list = []
                if i in y_cl:
                    next
                elif i not in y_cl:
                    missing_values.append(i)
                    missing_v_y.append(current_y)
                    exist_yn.append(0)
                    event_name_list.append(event_name)

    # feed the function to ProcessPoolExecutor to run. At this point, I hear the processors
    # spin up, but all it does is open 5 new tkinter windows (the number of windows is proportional
    # to the number of processors I give it to run)
    if __name__ == '__main__':
        with ProcessPoolExecutor(max_workers=nbproc) as executor:
            executor.map(process_file, files)


window.mainloop()
I've looked over the code thoroughly and just can't understand why it is opening new windows as opposed to just running the function over the files.
Each worker process has to re-import your script; that is how multiprocessing starts new processes on Windows. At the very top of your script, at module level, you call window = Tk(), so every worker runs that line again. That is why you get one window per process.
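A hedged sketch of one way to restructure the abridged code: keep the worker at module level (so the executor can import and pickle it) and create the Tk window only under the __main__ guard, so a re-imported worker never builds a GUI. The *_var widgets and the per-file computation are assumed to exist as in the question and are not redefined here:

import glob
from tkinter import Tk
from concurrent.futures import ProcessPoolExecutor


def process_file(args):
    # module-level worker: receives plain values only and never touches tkinter
    filepath, excel_input, min_x, max_x, output_dir = args
    # ... the per-file computation from the question goes here ...
    return filepath


def calculate():
    folder_input = folder_entry_var.get()
    files = glob.glob(folder_input + '/*')
    # read every tkinter variable here, in the main process, and pass plain values to the workers
    job_args = [(f, excel_entry_var.get(), float(min_x_var.get()),
                 float(man_x_var.get()), odir_var.get()) for f in files]
    with ProcessPoolExecutor(max_workers=int(np_var.get())) as executor:
        results = list(executor.map(process_file, job_args))


if __name__ == '__main__':
    window = Tk()  # created only in the main process
    window.title('Problem with parralelizing')
    window.geometry('1000x700')
    # ... build the entry widgets (folder_entry_var, np_var, etc.) and a button wired to calculate ...
    window.mainloop()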
I tried the following Python programs, both sequential and parallel versions, on a cluster computing facility. I could clearly see (using the top command) more processes starting for the parallel program. But when I time it, the parallel version seems to take more time. What could be the reason? I am attaching the code and the timing info below.
# parallel.py
from multiprocessing import Pool
import numpy


def sqrt(x):
    return numpy.sqrt(x)


pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)

# seq.py
import numpy


def sqrt(x):
    return numpy.sqrt(x)


results = [sqrt(x) for x in range(100000)]
user#domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user#domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is by far too little to compensate for the work-distribution overhead. First, you should increase the chunksize, but even then a single square-root operation is too short to make up for the cost of sending the data back and forth between processes. You can see an effective speedup with something like this:
def sqrt(x):
    for _ in range(100):
        x = numpy.sqrt(x)
    return x


results = pool.map(sqrt, range(10000), chunksize=100)
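For reference, a self-contained version of that comparison is sketched below (timings are machine-dependent and not guaranteed); the heavier per-item work and the larger chunksize are what give the pool a chance to win:

from multiprocessing import Pool
import numpy
import time


def sqrt(x):
    # artificially heavier task: 100 chained square roots per input value
    for _ in range(100):
        x = numpy.sqrt(x)
    return x


if __name__ == '__main__':
    start = time.time()
    pool = Pool()
    par_results = pool.map(sqrt, range(10000), chunksize=100)
    pool.close()
    pool.join()
    print('parallel:   %.3f s' % (time.time() - start))

    start = time.time()
    seq_results = [sqrt(x) for x in range(10000)]
    print('sequential: %.3f s' % (time.time() - start))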
I am using multiprocessing.Pool to create a pool of workers that load files into pandas. It hangs frequently, probably about 75% of the time.
import pandas as pd
import multiprocessing as mp


def load(filename):
    thing = pd.read_table(filename)
    return thing


files = ['a', 'b', 'c']  # A list of a bunch of files

with mp.Pool(5) as pool:
    result = pool.map(load, files)
By "hangs" I mean never finishes. I've straced the procs and they are waiting on futuexes, so I have no idea what that means. Am I invoking the pool correctly?
Again, it works perfectly 25% of the time, so I must be doing something right... thx!
Ubuntu Xenial Python3.5.2