I have some text files that I need to read with Python. Each text file contains a 2000-by-2000 array of floats only (i.e. no strings). I tried to use the multiprocessing package, but for some reason it runs slower. The times I have on my PC for the code attached below are
Multi thread: 73.89 secs
Single thread: 60.47 secs
What am I doing wrong here? Is there a way to speed up this task? My PC has an Intel Core i7 processor, and in real life I have several hundred of these text files, 600 or even more.
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
import os
import time
from datetime import datetime

def read_from_disk(full_path):
    print('%s reading %s' % (datetime.now().strftime('%H:%M:%S'), full_path))
    out = np.genfromtxt(full_path, delimiter=',')
    return out

def make_single_path(n):
    return r"./dump/%d.csv" % n

def save_flatfiles(n):
    for i in range(n):
        temp = np.random.random((2000, 2000))
        _path = os.path.join('.', 'dump', str(i) + '.csv')
        np.savetxt(_path, temp, delimiter=',')

if __name__ == "__main__":
    # make some text files
    n = 10
    save_flatfiles(n)

    # list with the paths to the text files
    file_list = [make_single_path(d) for d in range(n)]

    pool = ThreadPool(8)
    start = time.time()
    results = pool.map(read_from_disk, file_list)
    pool.close()
    pool.join()
    print('finished multi thread in %s' % (time.time() - start))

    start = time.time()
    for d in file_list:
        out = read_from_disk(d)
    print('finished single thread in %s' % (time.time() - start))

    print('Done')
You are using multiprocessing.dummy, which replicates the API of multiprocessing but is actually a wrapper around the threading module.
So you are basically using threads instead of processes, and threads in Python are not useful for computational tasks because of the GIL.
So replace:
from multiprocessing.dummy import Pool as ThreadPool
With:
from multiprocessing import Pool
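With that change, pool.map dispatches the files to worker processes instead of threads, so the CPU-bound parsing inside np.genfromtxt can run in parallel. A minimal sketch of the adjusted main block (reusing read_from_disk and make_single_path from your code):

from multiprocessing import Pool

if __name__ == "__main__":
    file_list = [make_single_path(d) for d in range(10)]
    # a pool of worker processes; each one parses whole files independently
    with Pool(processes=8) as pool:
        results = pool.map(read_from_disk, file_list)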
I tried running your code on my machine, which has an i5 processor, and it finished execution in 45 seconds, so I would say that's a big improvement.
Hope this clears things up.
Related
This program returns the resolution of a video, but since I need it for a large-scale project I need multiprocessing. I have tried parallel processing using a different function, but that would just run it multiple times without making it efficient, so I am posting the entire code. Can you help me create a main process that uses all cores?
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip

if __name__ == "__main__":
    dire = askdirectory()
    d = dire[:]
    print(dire)
    death = os.listdir(dire)
    print(death)
    for i in death:  # multiprocess this loop
        dire = d
        dire += f"/{i}"
        v = VideoFileClip(dire)
        print(f"{i}: {v.size}")
This code works fine, but I need help with creating a main process (one that uses all cores) for the for loop alone. Please excuse the variable names; I was angry at multiprocessing. Also, if you have tips on making the code more efficient, I would appreciate them.
You are, I suppose, assuming that every file in the directory is a video clip. I am assuming that processing the video clip is an I/O-bound "process" for which threading is appropriate. Here I have rather arbitrarily created a thread pool of 20 threads this way:
MAX_WORKERS = 20 # never more than this
N_WORKERS = min(MAX_WORKERS, len(death))
You would have to experiment with how large MAX_WORKERS can be before performance degrades. The limit might be low not because your system cannot support lots of threads, but because concurrent access to multiple files that may be spread across the disk can be inefficient.
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip
from concurrent.futures import ThreadPoolExecutor as Executor
from functools import partial

def process_video(parent_dir, file):
    v = VideoFileClip(f"{parent_dir}/{file}")
    print(f"{file}: {v.size}")

if __name__ == "__main__":
    dire = askdirectory()
    print(dire)
    death = os.listdir(dire)
    print(death)
    worker = partial(process_video, dire)
    MAX_WORKERS = 20  # never more than this
    N_WORKERS = min(MAX_WORKERS, len(death))
    with Executor(max_workers=N_WORKERS) as executor:
        results = executor.map(worker, death)  # an iterator of None values (process_video returns nothing)
Update
According to @Reishin, moviepy works by executing the ffmpeg executable and thus already creates a separate process in which the work is done. So there is no point in also using multiprocessing here.
moviepy is just a wrapper around ffmpeg and is designed to edit clips, working with one file at a time, so the performance is quite poor. Invoking a new process for each of many files is time consuming. In the end, the need for multiple processes might be the result of choosing the wrong library.
I'd recommend using the PyAV library instead, which provides direct Python bindings for ffmpeg and good performance:
import av
import os
from tkinter.filedialog import askdirectory
import multiprocessing
from concurrent.futures import ThreadPoolExecutor as Executor

MAX_WORKERS = int(multiprocessing.cpu_count() * 1.5)

def get_video_resolution(path):
    container = None
    try:
        container = av.open(path)
        frame = next(container.decode(video=0))
        return path, f"{frame.width}x{frame.height}"
    finally:
        if container:
            container.close()

def files_to_proccess():
    video_dir = askdirectory()
    return (full_file_path for f in os.listdir(video_dir)
            if (full_file_path := os.path.join(video_dir, f)) and not os.path.isdir(full_file_path))

def main():
    for f in files_to_proccess():
        print(f"{os.path.basename(f)}: {get_video_resolution(f)[1]}")

def main_multi_threaded():
    with Executor(max_workers=MAX_WORKERS) as executor:
        for path, resolution in executor.map(get_video_resolution, files_to_proccess()):
            print(f"{os.path.basename(path)}: {resolution}")

if __name__ == "__main__":
    # main()
    main_multi_threaded()
Above are single-threaded and multi-threaded implementations, with a parallelism setting tuned to the CPU count (in case multithreading is absolutely required).
I have built a very simple program to understand how parallelization may help (save time).
This program simply writes a list of integers to a file, and I split the work by integer range.
After reading a relevant question, I added a Lock, but there was no improvement.
The program runs, but instead of having, for instance, 1000000 lines in my file, I get 99996xx... and the number of workers does not change anything (even with one worker the problem curiously remains, so there is no competition problem between workers).
Can somebody explain to me what is wrong? (I have tried opening and writing a separate file within each worker and then concatenating them all: that works, and a sketch of that approach appears after the code below. But I'd like to understand why I cannot manage to write to a shared file with multiprocessing.)
Please find below the code:
import csv
import time
import multiprocessing

SizeOfSet = 8000000
PathCorpus = '/home'
Myfile = 'output.csv'
MyFile = PathCorpus + '/' + Myfile
step = 1000000
nb_worker = SizeOfSet // step

def worker_verrou(verrou, start, end):
    verrou.acquire()
    for k in range(start, end):
        try:
            writerfile.writerow([k])
        except:
            print('pb with writerow %i', k)
    verrou.release()
    return

pointeurwritefile = open(MyFile, "wt")
writerfile = csv.writer(pointeurwritefile)
verrou = multiprocessing.Lock()

start_time = time.time()
jobs = []
for k in range(nb_worker):
    print('Starting process number %i', k)
    p = multiprocessing.Process(target=worker_verrou, args=(verrou, k*step, (k+1)*step))
    jobs.append(p)
    p.start()
for j in jobs:
    j.join()
    print '%s.exitcode = %s' % (j.name, j.exitcode)

interval = time.time() - start_time
print('Total time in seconds:', interval)
print('ended')
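For reference, here is a minimal sketch of the per-worker-file approach mentioned above (each worker writes its own CSV and the pieces are concatenated at the end; the worker count and file names here are illustrative):

import csv
import multiprocessing

def worker_own_file(start, end, out_path):
    # each worker writes to its own file, so no lock or shared handle is needed
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for k in range(start, end):
            writer.writerow([k])

if __name__ == '__main__':
    step = 1000000
    nb_worker = 8
    paths = ['part_%d.csv' % k for k in range(nb_worker)]
    jobs = []
    for k in range(nb_worker):
        p = multiprocessing.Process(target=worker_own_file,
                                    args=(k * step, (k + 1) * step, paths[k]))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    # concatenate the per-worker files into the final output
    with open('output.csv', 'w') as out:
        for path in paths:
            with open(path) as part:
                out.write(part.read())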
I have a function called cartoonize(image_path), which takes 'path/to/image' as an argument. The script is a bit hefty and takes a couple of minutes to process a 1920x1080 image. I tried to use all of my 8 cores with the multiprocessing module, but I see no performance gain with the following code. Another problem is how to save the image: the function returns a cv2 object, and normally I save the image to disk with the code below, but with multiprocessing it gives the error "img is not a numpy array, neither a scalar." I also want a performance gain, but I can't figure out how to achieve it efficiently.
out_final = cartoonize('path/to/image')
cv2.imwrite('cartoon.png', out_final)
import multiprocessing
import time

if __name__ == '__main__':
    # mark the start time
    startTime = time.time()
    print "cartoonizing please wait ..."
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    pool_outputs = pool.apply_async(cartoonize, args=(image_path,))
    pool.close()
    pool.join()
    print ('Pool:', pool_outputs)
    # mark the end time
    endTime = time.time()
    print('Took %.3f seconds' % (endTime - startTime))
Question: with multiprocessing it is giving the error "img is not a numpy array, neither a scalar."
pool_outputs = pool.apply_async(cartoonize, args=(image_path,))
Your code returns an AsyncResult object, not the processed image.
Python » 3.6.1 Documentation » multiprocessing.pool.AsyncResult
class multiprocessing.pool.AsyncResult
The class of the result returned by Pool.apply_async() and Pool.map_async().
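In other words, pool_outputs is only a handle for the pending result; you have to call .get() on it to obtain the actual image before passing it to cv2.imwrite. A minimal sketch (assuming cartoonize and image_path are defined as in your code):

import multiprocessing
import cv2

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    async_result = pool.apply_async(cartoonize, args=(image_path,))
    pool.close()
    pool.join()
    out_final = async_result.get()         # the actual image (a numpy array)
    cv2.imwrite('cartoon.png', out_final)

Note also that a single apply_async call submits only one task, so by itself it will not spread the work over all 8 cores.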
I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more pronounced.
There are no other cpu-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time  # needed for the timing below

def worker_fn(data):
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads
    pool = Pool(processes=num_procs)
    data = MyClass()
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
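If you want to test the OpenBLAS hypothesis, one common check (a suggestion, not something from the original post) is to pin the BLAS/OpenMP thread pools to a single thread before numpy is imported, so the worker processes do not oversubscribe the CPU:

import os

# These variables must be set before numpy is imported; they limit each
# process to a single BLAS/OpenMP thread so that the worker processes
# do not compete for cores with the BLAS thread pools.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

If the per-worker run times stop growing with the number of processes after this change, oversubscription by the BLAS library was the likely cause.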
I have many, many small tasks to do in a for loop. I want to use concurrency to speed it up. I used joblib because it is easy to integrate. However, I found that using joblib makes my program run much slower than a simple for loop. Here is the demo code:
import time
import random
from os import path
import tempfile
import numpy as np
import gc
from joblib import Parallel, delayed, load, dump

def func(a, i):
    '''a simple task for demonstration'''
    a[i] = random.random()

def memmap(a):
    '''use memory mapping to prevent memory allocation for each worker'''
    tmp_dir = tempfile.mkdtemp()
    mmap_fn = path.join(tmp_dir, 'a.mmap')
    print 'mmap file:', mmap_fn
    _ = dump(a, mmap_fn)          # dump
    a_mmap = load(mmap_fn, 'r+')  # load
    del a
    gc.collect()
    return a_mmap

if __name__ == '__main__':
    N = 10000
    a = np.zeros(N)

    # memory mapping
    a = memmap(a)

    # parfor
    t0 = time.time()
    Parallel(n_jobs=4)(delayed(func)(a, i) for i in xrange(N))
    t1 = time.time() - t0

    # for
    t0 = time.time()
    [func(a, i) for i in xrange(N)]
    t2 = time.time() - t0

    # joblib time vs for time
    print t1, t2
On my laptop with an i5-2520M CPU, 4 cores, Win7 64-bit, the running time is 6.464 s for joblib and 0.004 s for the simple for loop.
I've passed the argument as a memory map to prevent the overhead of reallocation for each worker.
I've read this related post, but it still didn't solve my problem.
Why does this happen? Did I miss something about how to use joblib correctly?
"Many small tasks" are not a good fit for joblib. The coarser the task granularity, the less overhead joblib causes and the more benefit you will have from it. With tiny tasks, the cost of setting up worker processes and communicating data to them will outweigh any any benefit from parallelization.