I tried the following Python programs, both sequential and parallel versions, on a cluster computing facility. I could clearly see (using the top command) more processes starting for the parallel program. But when I time it, the parallel version seems to take more time. What could be the reason? I am attaching the code and the timing info below.
#parallel.py
from multiprocessing import Pool
import numpy

def sqrt(x):
    return numpy.sqrt(x)

pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
#seq.py
import numpy

def sqrt(x):
    return numpy.sqrt(x)

results = [sqrt(x) for x in range(100000)]
user@domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user@domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is far too small to compensate for the overhead of distributing the work. First, you should increase the chunksize, but even then a single square-root operation is too short to make up for the cost of shipping data between processes. You can see an effective speedup from something like this:
def sqrt(x):
    for _ in range(100):
        x = numpy.sqrt(x)
    return x

results = pool.map(sqrt, range(10000), chunksize=100)
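Putting it all together, a minimal self-contained sketch of that modified script (with an added __main__ guard and simple timing, which the snippet above leaves out) would look like:

# modified_parallel.py - a rough sketch, not a verbatim copy of the answer above
from multiprocessing import Pool
import numpy
import time

def sqrt(x):
    # repeat the operation so each task carries enough work
    for _ in range(100):
        x = numpy.sqrt(x)
    return x

if __name__ == '__main__':
    start = time.perf_counter()
    with Pool() as pool:
        results = pool.map(sqrt, range(10000), chunksize=100)
    print('parallel:  ', time.perf_counter() - start)

    start = time.perf_counter()
    results = [sqrt(x) for x in range(10000)]
    print('sequential:', time.perf_counter() - start)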
I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving onto the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd

def myfunction(x): # toy example function
    return(x**2)

for start in np.arange(0, 500, 100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
        pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, then your performance will be worse with multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily recreating on every loop iteration) and in passing arguments from one process to another.
Assuming that myfunction is pure CPU, then after map has returned 100 values there is an opportunity to overlap the writing of the CSV files that you are not taking advantage of (it's not clear how much performance would be improved by concurrent disk writing; it depends on the type of drive, head movement, etc.). A combination of multithreading and multiprocessing could therefore be the solution. The number of processes in your processing pool should be limited to the number of CPU cores, on the assumption that myfunction is 100% CPU-bound and never releases the Global Interpreter Lock, so that nothing is gained from a pool size larger than the number of CPUs you have. That is my assumption, anyway; if you are going to be using certain numpy functions, for example, then it is an erroneous assumption. On the other hand, numpy uses multithreading for some of its own operations, in which case combining numpy's parallelism with your own multiprocessing could actually result in worse performance.

Your current code only uses numpy to generate ranges, which seems like overkill since there are other ways to generate them. I have taken the liberty of generating the ranges in a slightly different fashion: I define RANGE_START and RANGE_STOP values plus N_SPLITS, the number of equal (or as nearly equal as possible) divisions of that range, and generate tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool, with one of the arguments being the processing pool, which the worker uses to do the CPU-intensive calculations; once the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd

def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)

def myfunction(x): # toy example function
    return(x ** 2)

def split(start, stop, n):
    # divide [start, stop) into n splits of (nearly) equal size
    k, m = divmod(stop - start, n)
    return [(start + i * k + min(i, m), start + (i + 1) * k + min(i + 1, m)) for i in range(n)]

def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS) # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()

# required for Windows:
if __name__ == '__main__':
    main()
I am looking for a way to use all cores, but everything I try only decreases the speed.
I tried the following:
from joblib import Parallel, delayed
import multiprocessing
from time import time
import numpy as np

inputs = range(1000)

def processInput(i):
    return i * i

# using multiprocessing
num_cores = multiprocessing.cpu_count()

start = time()
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
print 'multiproc time ', time()-start

# without multiprocessing
start = time()
results = []
for i in inputs:
    results.append(processInput(i))
print 'simple time ', time()-start
and get output:
multiproc time 0.14687204361
simple time 0.000839948654175
This is a classic problem with multi-threading / multi-processing. Whenever you want to process something in parallel, you should make sure that the time you are saving because of parallelism is greater than the time it takes to manage parallel processes.
Try increasing the input size. Then you will see the impact of parallelism.
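For example, something like the following sketch, with a hypothetical processInput that does real numerical work instead of a single multiplication, should give parallelism a chance to pay off:

from joblib import Parallel, delayed
import multiprocessing
import numpy as np
from time import time

def processInput(i):
    # a heavier per-task workload: sum of square roots over a large array
    return np.sqrt(np.arange(i, i + 1_000_000)).sum()

inputs = range(1000)
num_cores = multiprocessing.cpu_count()

start = time()
parallel_results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
print('multiproc time ', time() - start)

start = time()
serial_results = [processInput(i) for i in inputs]
print('simple time    ', time() - start)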
I have many, many small tasks to do in a for loop. I want to use concurrency to speed it up. I used joblib because it is easy to integrate. However, I found that using joblib makes my program run much slower than a simple for loop. Here is the demo code:
import time
import random
from os import path
import tempfile
import numpy as np
import gc
from joblib import Parallel, delayed, load, dump

def func(a, i):
    '''a simple task for demonstration'''
    a[i] = random.random()

def memmap(a):
    '''use memory mapping to prevent memory allocation for each worker'''
    tmp_dir = tempfile.mkdtemp()
    mmap_fn = path.join(tmp_dir, 'a.mmap')
    print 'mmap file:', mmap_fn
    _ = dump(a, mmap_fn)  # dump
    a_mmap = load(mmap_fn, 'r+')  # load
    del a
    gc.collect()
    return a_mmap

if __name__ == '__main__':
    N = 10000
    a = np.zeros(N)
    # memory mapping
    a = memmap(a)
    # parfor
    t0 = time.time()
    Parallel(n_jobs=4)(delayed(func)(a, i) for i in xrange(N))
    t1 = time.time() - t0
    # for
    t0 = time.time()
    [func(a, i) for i in xrange(N)]
    t2 = time.time() - t0
    # joblib time vs for time
    print t1, t2
On my laptop with an i5-2520M CPU (4 cores, Win7 64-bit), the running time is 6.464 s for joblib and 0.004 s for the simple for loop.
I've memory-mapped the array argument to avoid the overhead of reallocating it in each worker.
I've read this related post, but it still didn't solve my problem.
Why does this happen? Did I miss some guidelines for using joblib correctly?
"Many small tasks" are not a good fit for joblib. The coarser the task granularity, the less overhead joblib causes and the more benefit you will have from it. With tiny tasks, the cost of setting up worker processes and communicating data to them will outweigh any any benefit from parallelization.
I wrote this bit of code to test out Python's multiprocessing on my computer:
from multiprocessing import Pool

var = range(5000000)

def test_func(i):
    return i + 1

if __name__ == '__main__':
    p = Pool()
    var = p.map(test_func, var)
I timed this using Unix's time command and the results were:
real 0m2.914s
user 0m4.705s
sys 0m1.406s
Then, using the same var and test_func() I timed:
var = map(test_func, var)
and the results were
real 0m1.785s
user 0m1.548s
sys 0m0.214s
Shouldn't the multiprocessing code be much faster than plain old map?
Why should it be?
With the plain map function, you are just calling the function sequentially.
A multiprocessing pool creates a set of workers to which your tasks are mapped; it coordinates multiple worker processes to run those function calls.
Try doing some significant work inside your function, then time both versions again and see whether multiprocessing helps you compute faster.
You have to understand that there are overheads in using multiprocessing. Only when the computing effort is significantly greater than those overheads will you see its benefits.
See the last example in the excellent introduction by Hellmann: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
# start_process, do_calculation and inputs are defined in the linked example
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process,
                            maxtasksperchild=2,
                            )
pool_outputs = pool.map(do_calculation, inputs)
You create the pool with a size based on the number of cores you have.
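For completeness, a self-contained sketch of that pattern; the bodies of start_process and do_calculation here are minimal stand-ins modelled on the linked example, not the only way to write them:

import multiprocessing

def do_calculation(data):
    # the per-task work; in the linked example it simply doubles the input
    return data * 2

def start_process():
    # initializer run once in each worker process
    print('Starting', multiprocessing.current_process().name)

if __name__ == '__main__':
    inputs = list(range(10))
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size,
                                initializer=start_process,
                                maxtasksperchild=2,
                                )
    pool_outputs = pool.map(do_calculation, inputs)
    pool.close()
    pool.join()
    print('Pool outputs:', pool_outputs)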
There is overhead in using parallelization; it only pays off if each work unit takes long enough to compensate for that overhead.
Also, if you only have one CPU (or CPU thread) on your machine, there's no point in using parallelization at all. You'll only see gains on at least a hyperthreaded machine or one with at least two CPU cores.
In your case, a simple addition operation doesn't compensate for that overhead.
Try something a bit more costly such as:
from multiprocessing import Pool
import math

def test_func(i):
    j = 0
    for x in xrange(1000000):
        j += math.atan2(i, i)
    return j

if __name__ == '__main__':
    var = range(500)
    p = Pool()
    var = p.map(test_func, var)
I am looking for a definitive answer to MATLAB's parfor for Python (Scipy, Numpy).
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical piece of numerical computation code that I need to speed up:
import numpy as np

N = 2000
output = np.zeros([N, N])

for i in range(N):
    for j in range(N):
        output[i, j] = HeavyComputationThatIsThreadSafe(i, j)
An example of a heavy computation function is:
import scipy.optimize

def HeavyComputationThatIsThreadSafe(i, j):
    n = i * j
    return scipy.optimize.anneal(lambda x: np.sum((x - np.arange(n)**2)), np.random.random((n, 1)))[0][0, 0]
The one built into Python would be multiprocessing; the docs are here. I always use multiprocessing.Pool with as many workers as processors. Then whenever I need a for-loop-like structure I use Pool.imap.
As long as the body of your function does not depend on any previous iteration then you should have near linear speed-up. This also requires that your inputs and outputs are pickle-able but this is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product

output = np.zeros((N, N))
pool = Pool()   # defaults to number of available CPUs
chunksize = 20  # this may take some guessing ... take a look at the docs to decide
# Fun must accept a single (i, j) tuple, e.g. a small wrapper around HeavyComputationThatIsThreadSafe
for ind, res in enumerate(pool.imap(Fun, product(xrange(N), xrange(N)), chunksize)):
    output.flat[ind] = res
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
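In current IPython versions this functionality lives in the separate ipyparallel package. A minimal sketch, assuming you have started engines with `ipcluster start -n 4`, could look like this:

# minimal ipyparallel sketch; requires engines started with: ipcluster start -n 4
import math
import ipyparallel as ipp

rc = ipp.Client()                 # connect to the running cluster
view = rc.load_balanced_view()    # dynamically load-balance tasks across engines

# map a function over an iterable, much like parfor over loop indices
results = view.map_sync(math.sqrt, range(100))
print(results[:5])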
Jupyter Notebook
To see an example, consider writing the equivalent of this MATLAB code in Python:
matlabpool open 4
parfor n=0:9
    for i=1:10000
        for j=1:10000
            s=j*i
        end
    end
    n
end
disp('done')
Here is how one may write this in Python, particularly in a Jupyter notebook. You have to create a function in the working directory (I called the file FunForParFor.py) containing the following:
def func(n):
    for i in range(10000):
        for j in range(10000):
            s = j * i
    print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()
    print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray

ray.init()

# Define the function. Each remote function will be executed
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i * j
    time.sleep(0.5)  # Simulate some heavy computation.
    return n

N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e., an identifier to the
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous one has finished,
        # which leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i, j))

# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)
# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: one point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, so the remote function's computation should take longer than the cost of invoking it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize the scheduling and startup overhead of a remote function.
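One common way to satisfy that rule of thumb is to batch the work, for example one remote call per row instead of per element. A rough sketch (my own variation on the example above, not the only way to do it):

import numpy as np
import ray

ray.init()  # start Ray (omit if it is already initialized)

@ray.remote
def compute_row(i, N):
    # one remote call computes a whole row instead of a single element,
    # so the per-task work amortizes Ray's scheduling overhead
    return [i * j for j in range(N)]

N = 10
row_ids = [compute_row.remote(i, N) for i in range(N)]
outputs = np.array(ray.get(row_ids))  # shape (N, N)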
I've always used Parallel Python, but it's not a complete analogue, since I believe it typically uses separate processes, which can be expensive on certain operating systems. Still, if the bodies of your loops are chunky enough then this won't matter, and it can actually have some benefits.
I tried all the solutions here, but found that the simplest way and the closest equivalent to MATLAB's parfor is numba's prange.
Essentially you change a single letter in your loop, range to prange:
from numba import njit, prange

@njit(parallel=True)  # parallel=True is required for prange to actually run the loop in parallel
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
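A quick usage check, assuming the parallel_sum above:

import numpy as np

A = np.random.rand(10_000_000)
print(parallel_sum(A))                       # first call includes JIT compilation time
print(np.isclose(parallel_sum(A), A.sum()))  # parallel reduction matches the serial sum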
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
Instead of writing a for loop
from time import sleep
for _ in range(10):
sleep(.2)
rewrite your operation into a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not directly evaluate the expression, but collect what should be done.
This is what the delayed method is for.
from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]
Next, instantiate Parallel with n_jobs workers and process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool

def heavy_func(key):
    # do some heavy computation on each key
    output = key**2
    return key, output

output_data = {}  # <-- this dict will store the results
keys = [1, 5, 7, 8, 10]  # <-- compute heavy_func over all the values of keys

with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]
Now output_data is a dictionary that contains, for every key, the result of the computation on that key.
That is it.