I search the way to use all cores. But all I try, only decrease spreed.
I tried following:
from joblib import Parallel, delayed
import multiprocessing
from time import time
import numpy as np
inputs = range(1000)
def processInput(i):
return i * i
using multiprocessing
num_cores = multiprocessing.cpu_count()
start=time()
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
print 'multiproc time ', time()-start
without multiprocessing
start=time()
results =[]
for i in inputs:
results.append(processInput(i))
print 'simple time ', time()-start
and get output:
multiproc time 0.14687204361
simple time 0.000839948654175
This is a classic problem with multi-threading / multi-processing. Whenever you want to process something in parallel, you should make sure that the time you are saving because of parallelism is greater than the time it takes to manage parallel processes.
Try increasing the input size. Then you will see the impact of parallelism.
Related
I am running prediction using a trained tensorflow model and generating data using it on the images coming from a simulator. But the issue here I need to save image too for each prediction I am making which is creating delay in the loop sometime causing issues in simulator. Is there any way we can use python's multiprocessing module to create a producer consumer architecture to avoid the IO cost in the loop?
for data in data_arr:
speed=float(data['speed'])
image=Image.open(BytesIO(base64.b64decode(data['image'])))
image=np.asarray(image)
img_c=image.copy()
image=img_preprocess(image)
image=np.array([image])
steering_angle=float(model_steer.predict(image))
#throttle=float(model_thr.predict(image))
throttle=1.0-speed/speed_limit
save_image(img_c,steering_angle)
print('{} {} {}'.format(steering_angle,throttle,speed))
send_control(steering_angle,throttle)
I tried to experiment similar concept for processing images from color to grayscale but instead of decreasing time. The total time increased from 0.1 sec to 17 sec.
import numpy as np
import cv2
import os
import time
from multiprocessing import Pool,RawArray
import ctypes
files_path=os.listdir('./imgs/')
files_path=list(map(lambda x:'./imgs/'+x,files_path))
temp_img=np.zeros((160,320))
var_dict = {}
def init_worker(X, h,w):
# Using a dictionary is not strictly necessary. You can also
# use global variables.
var_dict['X']=X
var_dict['h'] = h
var_dict['w'] = w
def worker_func(idx):
# Simply computes the sum of the i-th row of the input matrix X
X_np = np.frombuffer(var_dict['X'], dtype=np.uint8)
X_np=X_np.reshape(var_dict['h'],var_dict['w'])
cv2.imwrite('./out/'+str(idx)+'.jpg',X_np)
if __name__=='__main__':
start_time=time.time()
for idx,filepath in enumerate(files_path):
img=cv2.imread(filepath)
img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
h,w=img.shape[:2]
mulproc_array=RawArray(ctypes.c_uint8,160*320)
X_np = np.frombuffer(mulproc_array, dtype=np.uint8).reshape(160,320)
np.copyto(X_np,img)
#cv2.imwrite('./out/'+str(idx)+'.jpg',img)
with Pool(processes=1, initializer=init_worker, initargs=(mulproc_array, h,w)) as pool:
pool.map(worker_func,[idx])
end_time=time.time()
print('Time taken=',(end_time-start_time))
there is no reason for using RawArray, as multiprocessing will already use pickle for objects transfer which has approximately the same size as the numpy array, and using RawArray is different from your use case.
you don't need to wait for the saving function to end, you can run it asynchronously.
you shouldn't be closing the pool until you are done with everything, as creating a worker takes a very long time (in the order of 10-100ms)
def worker_func(img,idx):
cv2.imwrite('./out/'+str(idx)+'.jpg',img)
if __name__=='__main__':
start_time=time.time()
with Pool(processes=1) as pool:
results = []
for idx,filepath in enumerate(files_path):
img=cv2.imread(filepath)
img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) # do other work here
# next line converts image to uint8 before sending it to reduce its size
results.append(pool.apply_async(worker_func,args=(img.astype(np.uint8),idx)))
end_time=time.time() # technically the transfer is done, at this line.
for res in results:
res.get() # call this before closing the pool to make sure all images are saved.
print('Time taken=',(end_time-start_time))
you might want to experiment with threading instead of multiprocessing, to avoid data copy altogether, since writing to disk drops the GIL, but the results are not guaranteed to be faster.
Is there a way to use both ThreadPool and Pool in python to parallelise a loop by specifying the number of CPUs and cores you wish to use?
For example I would have a loop execute as:
from multiprocessing.dummy import Pool as ThreadPool
from tqdm import tqdm
import numpy as np
def my_function(x):
return x + 1
pool = ThreadPool(4)
my_array = np.arange(0,1e6,1)
results = list(tqdm(pool.imap(my_function, my_array),total=len(my_array)))
For 4 cores but it I wanted to spread these out on multiple CPUs as well, is there a simple way to adapt the code?
You are confusing between a core and a CPU. Generally, for all purposes both are considered to be the same(let's call them processor from now on).
When creating a thread pool in python, the threads are user level threads and are run on the same processor, due to Global Interpreter Lock(GIL) in python. As only one thread can control the python interpreter at a time. So, using (python)threads we don't get any real concurrency in data-intensive tasks.
How to solve this? Easy. Spawn multiple python processes running on different processors(each with its own interpreter). This is where the multi processing(mp) module is used, to spawn multiple processes from the parent python process in which it is called.
You can verify this by running htop(on linux, mac) and analysing the number of python processes. In case of mp module, they all will have the same name as the parent script where the pool.map function is called.
Timings for your code on a 8 core mac: 39.7s
Timing for this code on the same machine : 2.9s(note I can use 8 cores at max, but for comparison purposes using only 4)
Below is the modified code:
from multiprocessing.dummy import Pool as ThreadPool
from tqdm import tqdm
import numpy as np
import time
import multiprocessing as mp
def my_function(x):
return x + 1
pool = ThreadPool(4)
my_array = np.arange(0,1e6,1)
t1 = time.time()
# results = list(tqdm(pool.imap(my_function, my_array),total=len(my_array)))
pool = mp.Pool(processes=4) # Generally, set to 2*num_cores you have
res = pool.map(my_function, my_array)
print("Time taken = ", time.time() - t1)
multiprocessing.dummy.Pool is exactly simple ThreadPool, which don't use multicores and multicpus (because of GIL). you must use multiprocessing.Pool to run Process, which is process in your OS (if you define Pool(N) - N is number of this processes, if no - number of your cores in OS is default). Arguments this processes get from internal queue of Pool. 'case of that U will use all cpu and all core in your OS
I have this code that I tried to make parallel based on a previous question. Here is the code using 2 processes.
import multiprocessing
import timeit
start_time = timeit.default_timer()
d1 = dict( (i,tuple([i*0.1,i*0.2,i*0.3])) for i in range(500000) )
d2={}
def fun1(gn):
x,y,z = d1[gn]
d2.update({gn:((x+y+z)/3)})
#
if __name__ == '__main__':
gen1 = [x for x in d1.keys()]
#fun1(gen1)
p= multiprocessing.Pool(2)
p.map(fun1,gen1)
print('Script finished')
stop_time = timeit.default_timer()
print(stop_time - start_time)
Output is:
Script finished
1.8478448875989333
If I change the program to sequential,
fun1(gen1)
#p= multiprocessing.Pool(2)
#p.map(fun1,gen1)
output is:
Script finished
0.8345944193950299
So parallel loop is taking more time that sequential loop, more than double. (My computer has 2 cores, running on Windows.) I tried to find similar questions on the topic, this and this but could not figure out the reason. How can I get performance improvement using multiprocessing module in this example?
When you do p.map(fun1,gen1) you send gen1 over to the other process. This includes serializing the list which is 500000 elements big.
Comparing serialization to the small computation, it takes much longer.
You can measure or profile where the time is spent.
I tried the following python programs, both sequential and parallel versions on a cluster computing facility. I could clearly see(using top command) more processes initiating for the parallel program. But when I time it, it seems the parallel version is taking more time. What could be the reason? I am attaching the codes and the timing info herewith.
#parallel.py
from multiprocessing import Pool
import numpy
def sqrt(x):
return numpy.sqrt(x)
pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
#seq.py
import numpy
def sqrt(x):
return numpy.sqrt(x)
results = [sqrt(x) for x in range(100000)]
user#domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user#domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is by far too little to compensate for the work-distribution-overhead. First you should increase the chunksize, but still a single square root operation is too short to compensate for the cost of sending around the data between processes. You can see an effective speedup from something like this:
def sqrt(x):
for _ in range(100):
x = numpy.sqrt(x)
return x
results = pool.map(sqrt, range(10000), chunksize=100)
I have many many small tasks to do in a for loop. I Want to use concurrency to speed it up. I used joblib for its easy to integrate. However, I found using joblib makes my program run much slower than a simple for iteration. Here is the demo code:
import time
import random
from os import path
import tempfile
import numpy as np
import gc
from joblib import Parallel, delayed, load, dump
def func(a, i):
'''a simple task for demonstration'''
a[i] = random.random()
def memmap(a):
'''use memory mapping to prevent memory allocation for each worker'''
tmp_dir = tempfile.mkdtemp()
mmap_fn = path.join(tmp_dir, 'a.mmap')
print 'mmap file:', mmap_fn
_ = dump(a, mmap_fn) # dump
a_mmap = load(mmap_fn, 'r+') # load
del a
gc.collect()
return a_mmap
if __name__ == '__main__':
N = 10000
a = np.zeros(N)
# memory mapping
a = memmap(a)
# parfor
t0 = time.time()
Parallel(n_jobs=4)(delayed(func)(a, i) for i in xrange(N))
t1 = time.time()-t0
# for
t0 = time.time()
[func(a, i) for i in xrange(N)]
t2 = time.time()-t0
# joblib time vs for time
print t1, t2
On my laptop with i5-2520M CPU, 4 cores, Win7 64bit, the running time is 6.464s for joblib and 0.004s for simplely for loop.
I've made the arguments as memory mapping to prevent the overhead of reallocation for each worker.
I've red this relative post, still not solved my problem.
Why is that happen? Did I missed some disciplines to correctly use joblib?
"Many small tasks" are not a good fit for joblib. The coarser the task granularity, the less overhead joblib causes and the more benefit you will have from it. With tiny tasks, the cost of setting up worker processes and communicating data to them will outweigh any any benefit from parallelization.