Why Multiprocessing is taking more time than sequential processing?

Why Multiprocessing is taking more time than sequential processing? - python

The below code is taking around 15 seconds to get the result. But when I run a it sequentially it
only takes around 11 seconds. What can be the reason for this ?
import multiprocessing
import os
import time
def square(x):
# print(os.getpid())
return x*x
if __name__=='__main__':
start_time = time.time()
p = multiprocessing.Pool()
r = range(100000000)
p1 = p.map(square,r)
end_time = time.time()
print('time_taken::',end_time-start_time)
Sequential code
start_time = time.time()
d = list(map(square,range(100000000)))
end_time = time.time()

Regarding your code example, there are two important factors which influence runtime performance gains achievable by parallelization:
First, you have to take the administrative overhead into account. This means, that spawning new processes is rather expensive in comparison to simple arithmetic operations. Therefore, you gain performance, when the computation's complexity exceeds a certain threshold. Which was not the case in your example above.
Secondly, you have to think of a "clever way" of splitting your computation into parts which could be independently executed. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self contained package of computations to perform.
E.g., this could be accomplished with the following modifications of your code:
def square(x):
return x ** 2
def square_chunk(i, j):
return list(map(square, range(i, j)))
def calculate_in_parallel(n, c=4):
"""Calculates a list of squares in a parallelized manner"""
result = []
step = math.ceil(n / c)
with Pool(c) as p:
partial_results = p.starmap(
square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
)
for res in partial_results:
result += res
return result
Please note, that I used the operation x**2 (instead of the heavily optimized x*x) to increase the load and underline resulting runtime differences.
Here, the Pool's starmap()-function is used which unpacks arguments of the passed tuples. Using it, we can effectively pass more than one argument to the mapped function. Furthermore, we distribute the workload evenly to the amount of available cores. On each core the range of numbers between (i, min(i + step, n)) is calculated, whereas the step denotes the chunksize, calculated as the maximum_number divided by the count of CPU.
By running the code with different parametrizations, one can clearly see, that the performance gain increases when the maximum number (denoted n) increases. As expected, when more cores are used in parallel the runtime is reduced as well.
Edit:
As #KellyBundy pointed out, parallelism (especially) shines, when you minimize not only the input to the worker processes but the output as well. Performing several measurements calculating the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists, showed an even larger increase in runtime performance as the following figure illustrates.

Related

Function parallelized with mp.Pool() taking longer execution time with each iteration

I parallelized a function that (i) generates a list of sublists (ii) parameterizes a matrix with each sublist and (iii) returns the eigenvalues of the matrix. I parallelized step (ii) and (iii) so that the eigenvalue calculation for each sublist is parallelized. Then I run this function for many iterations but when I am monitoring the execution time for each iteration I observe that every execution of the function takes longer and longer (iter 1 -0.705 mins, iter 2 - 0.711 mins, iter 3 - 0.717 mins and so on). The pseudocode is given below
import multiprocessing as mp
def parallelized_func():
global calc_eig
pool = mp.Pool()
def calc_eig(sublist):
max_eig = other_module.parameterize_and_decompose(sublist)
return max_eig
list_of_sublists = some_generator.generate(n_sublists = 100)
max_eig = pool.map(calc_eig, list_of_sublists)
pool.close()
return max(max_eig)
for iter in range(10000):
start = time.time()
this_max_eig = parallelized_func()
end = time.time()
print(end-start)
I am not sure why the executions are taking longer with each iteration. I am monitoring memory and core utilization through htop and there doesn't seem to be any abnormalities. I also checked the other modules I am using inside the function and they do not have increasing execution times so I am guessing the problem is the way I am using Pool. Any feedback would be appreciated.

How to use python Ray to parallelise over a large list?

I want to parallelise the operation of a function on each element of a list using ray. A simplified snippet is below
import numpy as np
import time
import ray
import psutil
num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus)
#ray.remote
def f(a, b, c):
return a * b - c
def g(a, b, c):
return a * b - c
def my_func_par(large_list):
# arguments a and b are constant just to illustrate
# argument c is is each element of a list large_list
[f.remote(1.5, 2, i) for i in large_list]
def my_func_seq(large_list):
# arguments a anf b are constant just to illustrate
# argument c is is each element of a list large_list
[g(1.5, 2, i) for i in large_list]
my_list = np.arange(1, 10000)
s = time.time()
my_func_par(my_list)
print(time.time() - s)
>>> 2.007
s = time.time()
my_func_seq(my_list)
print(time.time() - s)
>>> 0.0372
The problem is, when I time my_func_par, it is much slower (~54x as can be seen above) than my_func_seq. One of the authors of ray does answer a comment on this blog that seems to explain what I am doing is setting up len(large_list) different tasks, which is incorrect.
How do I use ray and modify the code above to run it in parallel? (maybe by splitting large_list into chunks with the number of chunks being equal to the number of cpus)
EDIT: There are two important criteria in this question
The function f needs to accept multiple arguments
It may be necessarry to use ray.put(large_list) so that the larg_list variable can be stored in shared memory rather than copied to each processor

To add to what Sang said above:
Ray Distributed multiprocessing.Pool supports a fixed-size pool of Ray Actors for easier parallelization.
import numpy as np
import time
import ray
from ray.util.multiprocessing import Pool
pool = Pool()
def f(x):
# time.sleep(1)
return 1.5 * 2 - x
def my_func_par(large_list):
pool.map(f, large_list)
def my_func_seq(large_list):
[f(i) for i in large_list]
my_list = np.arange(1, 10000)
s = time.time()
my_func_par(my_list)
print('Parallel time: ' + str(time.time() - s))
s = time.time()
my_func_seq(my_list)
print('Sequential time: ' + str(time.time() - s))
With the above code, my_func_par runs much faster (about 0.1 sec). If you play with the code and make f(x) slower by something like time.sleep, you can see the clear advantage of multiprocessing.

The reason why the parallized version is slower is that running ray tasks unavoidably have overhead to run (although it puts lots of effort to optimize it). It is because running things in parallel requires to have inter-process communication, serialization, and things like that.
That being said, if your function is really fast (as fast as the running function takes less time than other overhead in distributed computation, in which your code is perfectly the case because the function f is really really tiny. I assume it will take less than a microsecond to run that function).
This means you should make f function more computationally heavier in order to get benefit from parallelization. Your proposed solution might not work because even after that, the function f might be still lightweight enough depending on your list size.

Why is there no influence after using more than 2 processes in Pool?

By using the map function in the multiprocessing library I see no difference in execution time when using more than 2 processes. I'm running the program using 4 cores.
The actual code is pretty straight forward and calculates the first 4000 fibonacci numbers 4 times (= the amount of cores). It distributes the work evenly between N cores (e.g. when using a Pool with 2 processes, each process will calculate the first 4000 fibonacci numbers twice). This whole process is done for N = 1 up to the amount of cores.
The output, with in every row the amount of cores and the corresponding execution time in seconds, is:
3,147
1,72
1,896
1.899
Does anyone know why there is no decrease in execution time given more than 2 cores? The actual code is:
import multiprocessing
from time import time
def F(_):
for n in range(4 * 10 ** 3):
a, b = 0, 1
for i in range(0, n):
a, b = b, a + b
return
def pool_fib():
n_cores = multiprocessing.cpu_count()
args = list(range(multiprocessing.cpu_count()))
for i in range(1, n_cores + 1):
with multiprocessing.Pool(i) as p:
start = time()
p.map(F, args)
print(i, time() - start)
if __name__ == '__main__':
pool_fib()

Provided you are using a fairly modern CPU, multiprocessing.cpu_count() won't give you the number of physical cores your machine has, but the number of hyper-threads. In a nutshell, hyper-threading allows a single physical core to have n (most commonly, two) pipelines, which fools your OS into thinking you've got n times the number of cores you really have. This is helpful, when you are doing some stuff that might starve a core with data (most notably, IO or RAM lookups caused by cache-misses), but your workload is purely arithmetic and it is not likely to starve your CPU, resulting in little to no gains from hyper-threading. And the little gains you might get will be overshadowed by multiprocessing overhead, which is quite significant.
P.S.
I usually post this sort of things as comments, but I've exceeded the comment size limitation. By the way, if you've chosen the Fibonacci series for something more that just an example, you might want to consider a faster algorithm: Fast Fibonacci computation

Python multiprocessing not yielding the speed-up expected

I am trying to optimize my code using Python's multiprocessing.Pool module, but I am not getting the speed-up results that I would logically expect.
The main method I am doing involves calculating matrix-vector products for a large number of vectors and a fixed large sparse matrix. Below is a toy example which performs what I need, but with random matrices.
import time
import numpy as np
import scipy.sparse as sp
def calculate(vector, matrix = None):
for i in range(50):
v = matrix.dot(vector)
return v
if __name__ == '__main__':
N = 1e6
matrix = sp.rand(N, N, density = 1e-5, format = 'csr')
t = time.time()
res = []
for i in range(10):
res.append(calculate(np.random.rand(N), matrix = matrix))
print time.time() - t
The method terminates in about 30 seconds.
Now, since the calculation of each element of results does not depend on the results of any other calculation, it is natural to think that paralel calculation will speed up the process. The idea is to create 4 processes and if each does some of the calculations, then the time it takes for all the processes to complete should decrease by some factor around 4. To do this, I wrote the following code:
import time
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool
from functools import partial
def calculate(vector, matrix = None):
for i in range(50):
v = matrix.dot(vector)
return v
if __name__ == '__main__':
N = 1e6
matrix = sp.rand(N, N, density = 1e-5, format = 'csr')
t = time.time()
input = []
for i in range(10):
input.append(np.random.rand(N))
mp = partial(calculate, matrix = matrix)
p = Pool(4)
res = p.map(mp, input)
print time.time() - t
My problem is that this code takes slightly above 20 seconds to run, so I did not even improve performance by a factor of 2! Even worse, the performance does not improve even if the pool contains 8 processes! Any idea why the speed-up is not happening?
Note: My actual method takes much longer, and the input vectors are stored in a file. If I split the file in 4 pieces and then run my script in a separate process for each file manually, each process terminates four times as quickly as it would for the whole file (as expected). I am confuzed why this speed-up (which is obviously possible) is not happening with multiprocessing.Pool
Edi: I have just found Multiprocessing.Pool makes Numpy matrix multiplication slower this question which may be related. I have to check, though.

Try:
p = Pool(4)
for i in range(10):
input = np.random.rand(N)
p.apply_async(calculate, args=(input, matrix)) # perform function calculate as new process with arguments input and matrix
p.close()
p.join() # wait for all processes to complete
I suspect that the "partial" object and map are resulting in a blocking behavior. (though I have never used partial, so I'm not familiar with it.)
"apply_async" (or "map_async") are multiprocessing methods that specifically do not block - (see: Python multiprocessing.Pool: when to use apply, apply_async or map?)
Generally, for "embarrassingly parallel problems" like this, apply_async works for me.
EDIT:
I tend to write results to MySQL databases when I'm done - the implementation I provided doesn't work if that's not your approach. "map" is probably the right answer if you want to use order in the list as your way of tracking which entry is which, but I remain suspicious of the "partial" objects.

Parralelized code runs much slower in Python than in Matlab

I have a piece of code that does the following:
for each file (already read in the RAM):
call a function and obtain a result
add the results up and disply
Each file can be analyzed in parallel. The function that analyzes each file is as following:
# Complexity = 1000*19*19 units of work
def fun(args):
(a, b, p) = args
for itr in range(1000):
for i in range(19):
for j in range(19):
# The following random number generated depends on
# latest values in (i-1, j), (i+1, j), (i, j-1) & (i, j+1)
# cells of latest a and b arrays
u = np.random.rand();
if (u < p):
a[i, j] += -1
else:
b[i, j] += 1
return a+b
I am using multiprocessing package to achieve parallelism:
import numpy as np
import time
from multiprocessing import Pool, cpu_count
if __name__ == '__main__':
t = time.time()
pool = Pool(processes=cpu_count())
args = [None]*100
for i in range(100):
a = np.random.randint(2, size=(19, 19))
b = np.random.randint(2, size=(19, 19))
p = np.random.rand()
args[i] = (a, b, p)
result = pool.map(fun, args)
for i in range(2, 100):
result[0] += result[i]
print result[0]
print time.time() - t
I have written equivalent MATLAB code that uses parfor and calls fun in each iteration of parfor:
tic
args = cell(100, 1);
r = cell(100, 1);
parfor i = 1:100
a = randi(2, 19, 19);
b = randi(2, 19, 19);
p = rand();
args{i}.a = a;
args{i}.b = b;
args{i}.p = p;
r{i} = fun(args{i});
end
for i = 2:100
r{1} = r{1} + r{i};
end
disp(r{1});
toc
The implementation of fun is as follows:
function [ ret ] = fun( args )
a = args.a;
b = args.b;
p = args.p;
for itr = 1:1000
for i = 1:19
for j = 1:19
u = rand();
if (u < p)
a(i, j) = a(i, j) + -1;
else
b(i, j) = b(i, j) + 1;
end
end
end
end
ret = a + b;
end
I find that MATLAB is blazingly fast, it takes around 1.5 seconds on a dual core processor compared to around 33-34 seconds taken by Python program. Why is this so?
EDIT: A lot of answers suggested that I should vectorize the random number generation. Actually the way it works is, random number generated depends on the latest a and b 2D arrays. I just put a simple rand() call to keep program simple and readable. In actuality of my program, random number is always generated by looking at certain horizontally and vertically neighbouring cells of (i, j) cell. So it is not possible to vectorize that.

Have you benchmarked both implementations of fun in a non-parallel context? One might just be a lot faster. In particular, those nested loops in the Python fun look like they might turn in to a faster vectorized solution in Matlab, or may be optimized by Matlab's JIT.
Stick both implementations in profilers to see where they're spending their time. Convert both implementations to non-parallel and profile those first, to make sure they're equivalent in performance before introducing the complexities of the parallelization stuff.
And one last check - you're setting up Matlab's Parallel Computation Toolbox with a local pool of workers, right, and not hooking in to a remote machine or picking up some other resources? How many workers on the Matlab side?

I did some tests on your Python code, though without the multiprocessing part and achieved a roughly 25x speedup by making the following changes:
Used Python lists instead of a NumPy array, because the latter is really slow when you need to do a lot of indexing. My timings are including the time it takes to do ndarray.tolist() so this might actually be a viable option, as long as the array's are not to big I think.
Ran it in PyPy instead of the regular Python interpreter, because PyPy has a JIT-compiler, to make the comparison with MATLAB a bit more fair. Regular CPython doesn't have such a feature.
Made the "random" call local to the function, i.e. do rand = np.random.rand and later u = rand(), because in Python lookups in the local namespace are faster, which can be significant in tight loops such as this.
Used Python's random.random instead of np.random.rand (also bound to a name local to the function).
Exchanged range for the generator-based xrange.
(This list is sorted from most significant speedup to only very small gain)
Of course there's also the parallel computing aspect. With multiprocessing all the Python objects that are passed around between processes are "pickled", meaning they have to be serialized and de-serialized on top of being copied between processes. MATLAB also copies data between processes (or threads?), but probably does so in a less wasteful way. Next to that, setting up a multiprocessing.Pool also takes a short while, which may not be fair to your MATLAB benchmark, but I'm not sure about this.
Based on your timings and mine, I would say that Python and MATLAB could be about equally fast for this specific task. But, unfortunately you have to jump through some hoops to get that speed with Python. Maybe using the #autojit capability of Numba could be interesting in this regard, if you have that available.

Try this version of fun and see if it gives you a speedup.
def fun(args):
a, b, p = args
n = 1000
u = np.random.random((n, 19, 19))
msk = u < p
msk_sum = msk.sum(0)
a -= msk_sum
b += (n - msk_sum)
return a + b
This is the more efficient way of implement this kind of function using numpy.
These kinds of nested loops can have quite high overhead in interpreted languages like matlab and python, but I suspect the JIT is compensating, at least partially, in matlab so the performance difference between the vectorized and loop implementations will be smaller. Cpython does not currently have any optimizations for these kinds of loops (that I know of) but but at least one python implementation, pypy, does have a JIT. Unfortunately pypy currently only has limited numpy support.
Update:
It seems like you have an iterative algorithm, and at least in my experience, those are the hardest to optimize with numpy/cpython. Consider using cython, this tutorial might also be useful, to write your nested loops. Other people might have other suggestions, but that's the best I can think of.

Given the available information, you would probably benefit from using Cython, which let's you also use some parallelism. Depending on the random numbers you need you could for instance use GSL to generate them.
Original question
There's no need to use multiprocessing, because fun can be easily vectorized, which yields a massive speed up (over 50x).
I'm not sure why matlab is not hit so hard as numpy, but its JIT may be saving it. Python does not like doing two . look ups in the inner loop, nor calling expensive functions there.
def fun_fast(args):
a, b, p = args
for i in xrange(19):
for j in xrange(19):
u = np.random.rand(1000)
msk = u < p
msk_sum = msk.sum()
a[i, j] -= msk_sum
b[i, j] += msk.size - msk_sum
return a + b

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.