I am trying to optimize my code using Python's multiprocessing.Pool module, but I am not getting the speed-up results that I would logically expect.
The core computation involves calculating matrix-vector products for a large number of vectors and a fixed, large sparse matrix. Below is a toy example that performs the same kind of work, but with random matrices.
import time
import numpy as np
import scipy.sparse as sp

def calculate(vector, matrix=None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = int(1e6)
    matrix = sp.rand(N, N, density=1e-5, format='csr')

    t = time.time()
    res = []
    for i in range(10):
        res.append(calculate(np.random.rand(N), matrix=matrix))
    print(time.time() - t)
The method terminates in about 30 seconds.
Now, since the calculation of each element of the results does not depend on the result of any other calculation, it is natural to think that parallel calculation will speed up the process. The idea is to create 4 processes, and if each does some of the calculations, then the time it takes for all the processes to complete should decrease by some factor around 4. To do this, I wrote the following code:
import time
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool
from functools import partial

def calculate(vector, matrix=None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = int(1e6)
    matrix = sp.rand(N, N, density=1e-5, format='csr')

    t = time.time()
    input = []
    for i in range(10):
        input.append(np.random.rand(N))

    mp = partial(calculate, matrix=matrix)
    p = Pool(4)
    res = p.map(mp, input)
    print(time.time() - t)
My problem is that this code takes slightly above 20 seconds to run, so I did not even improve performance by a factor of 2! Even worse, the performance does not improve even if the pool contains 8 processes! Any idea why the speed-up is not happening?
Note: My actual method takes much longer, and the input vectors are stored in a file. If I split the file into 4 pieces and then run my script in a separate process for each file manually, each process terminates four times as quickly as it would for the whole file (as expected). I am confused why this speed-up (which is obviously possible) is not happening with multiprocessing.Pool.
Edit: I have just found this question, Multiprocessing.Pool makes Numpy matrix multiplication slower, which may be related. I have to check, though.
Try:
p = Pool(4)
for i in range(10):
    input = np.random.rand(N)
    p.apply_async(calculate, args=(input, matrix))  # run calculate in a worker process with arguments input and matrix
p.close()
p.join()  # wait for all worker processes to complete
I suspect that the "partial" object and map are resulting in blocking behavior. (Though I have never used partial, so I'm not familiar with it.)
"apply_async" (or "map_async") are multiprocessing methods that specifically do not block - (see: Python multiprocessing.Pool: when to use apply, apply_async or map?)
Generally, for "embarrassingly parallel problems" like this, apply_async works for me.
EDIT:
I tend to write results to MySQL databases when I'm done - the implementation I provided doesn't work if that's not your approach. "map" is probably the right answer if you want to use order in the list as your way of tracking which entry is which, but I remain suspicious of the "partial" objects.
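If you do want the results back in the parent process, and in the original order, a minimal sketch (reusing the calculate function and matrix from the question, with a smaller N just for illustration) could keep apply_async and collect the results afterwards:
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool

def calculate(vector, matrix=None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = int(1e5)  # smaller toy size, just for the sketch
    matrix = sp.rand(N, N, density=1e-5, format='csr')
    inputs = [np.random.rand(N) for _ in range(10)]

    p = Pool(4)
    # submit everything without blocking, then fetch results in submission order
    async_results = [p.apply_async(calculate, args=(vec, matrix)) for vec in inputs]
    res = [ar.get() for ar in async_results]  # .get() blocks per task
    p.close()
    p.join()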
Related
The below code takes around 15 seconds to get the result. But when I run it sequentially, it
only takes around 11 seconds. What can be the reason for this?
import multiprocessing
import os
import time

def square(x):
    # print(os.getpid())
    return x*x

if __name__ == '__main__':
    start_time = time.time()
    p = multiprocessing.Pool()
    r = range(100000000)
    p1 = p.map(square, r)
    end_time = time.time()
    print('time_taken::', end_time - start_time)
Sequential code
start_time = time.time()
d = list(map(square,range(100000000)))
end_time = time.time()
Regarding your code example, there are two important factors which influence the runtime performance gains achievable through parallelization:
First, you have to take the administrative overhead into account: spawning new processes is rather expensive compared to simple arithmetic operations. You therefore only gain performance when the computation's complexity exceeds a certain threshold, which was not the case in your example above.
Secondly, you have to think of a "clever" way of splitting your computation into parts that can be executed independently. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self-contained package of computations to perform.
E.g., this could be accomplished with the following modifications of your code:
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def square_chunk(i, j):
    return list(map(square, range(i, j)))

def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    for res in partial_results:
        result += res
    return result
Please note that I used the operation x**2 (instead of the heavily optimized x*x) to increase the load and underline the resulting runtime differences.
Here, the Pool's starmap() function is used, which unpacks the arguments of the passed tuples. Using it, we can effectively pass more than one argument to the mapped function. Furthermore, we distribute the workload evenly across the available cores: on each core the range of numbers between (i, min(i + step, n)) is calculated, where step denotes the chunk size, computed as the maximum number divided by the CPU count.
By running the code with different parametrizations, one can clearly see that the performance gain increases as the maximum number (denoted n) increases. As expected, using more cores in parallel reduces the runtime as well.
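A minimal driver for such a measurement (assuming square, square_chunk and calculate_in_parallel above live in the same module; the timing values will of course depend on the machine) might look like:
import time

# square, square_chunk and calculate_in_parallel as defined above

if __name__ == "__main__":
    n = 10000000
    t0 = time.time()
    serial = [x ** 2 for x in range(n)]
    t1 = time.time()
    parallel = calculate_in_parallel(n, c=4)
    t2 = time.time()
    print("serial:   %.2f s" % (t1 - t0))
    print("parallel: %.2f s" % (t2 - t1))
    assert serial == parallel  # same result either way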
Edit:
As @KellyBundy pointed out, parallelism especially shines when you minimize not only the input to the worker processes but the output as well. Several measurements calculating the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists showed an even larger increase in runtime performance.
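A hedged sketch of that variant (same chunking as above, but each worker returns a single number instead of a list) might look like:
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def sum_square_chunk(i, j):
    # each worker returns one number instead of a whole list
    return sum(map(square, range(i, j)))

def sum_of_squares_in_parallel(n, c=4):
    """Sums the squares of 0..n-1, with one chunk per worker process"""
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_sums = p.starmap(
            sum_square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    return sum(partial_sums)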
I want to parallelise the operation of a function on each element of a list using ray. A simplified snippet is below
import numpy as np
import time
import ray
import psutil

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus)

@ray.remote
def f(a, b, c):
    return a * b - c

def g(a, b, c):
    return a * b - c

def my_func_par(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of the list large_list
    [f.remote(1.5, 2, i) for i in large_list]

def my_func_seq(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of the list large_list
    [g(1.5, 2, i) for i in large_list]

my_list = np.arange(1, 10000)

s = time.time()
my_func_par(my_list)
print(time.time() - s)
>>> 2.007
s = time.time()
my_func_seq(my_list)
print(time.time() - s)
>>> 0.0372
The problem is, when I time my_func_par, it is much slower (~54x, as can be seen above) than my_func_seq. One of the ray authors answers a comment on this blog that seems to explain that what I am doing here is setting up len(large_list) different tasks, which is incorrect.
How do I use ray and modify the code above to run it in parallel? (maybe by splitting large_list into chunks with the number of chunks being equal to the number of cpus)
EDIT: There are two important criteria in this question:
The function f needs to accept multiple arguments.
It may be necessary to use ray.put(large_list) so that the large_list variable can be stored in shared memory rather than copied to each processor.
To add to what Sang said above:
Ray Distributed multiprocessing.Pool supports a fixed-size pool of Ray Actors for easier parallelization.
import numpy as np
import time
import ray
from ray.util.multiprocessing import Pool

pool = Pool()

def f(x):
    # time.sleep(1)
    return 1.5 * 2 - x

def my_func_par(large_list):
    pool.map(f, large_list)

def my_func_seq(large_list):
    [f(i) for i in large_list]

my_list = np.arange(1, 10000)

s = time.time()
my_func_par(my_list)
print('Parallel time: ' + str(time.time() - s))

s = time.time()
my_func_seq(my_list)
print('Sequential time: ' + str(time.time() - s))
With the above code, my_func_par runs much faster (about 0.1 sec). If you play with the code and make f(x) slower by something like time.sleep, you can see the clear advantage of multiprocessing.
The reason the parallelized version is slower is that running ray tasks unavoidably involves overhead (although a lot of effort goes into minimizing it). This is because running things in parallel requires inter-process communication, serialization, and so on.
That being said, parallelization does not pay off if your function is really fast, i.e., if running the function takes less time than the other overhead involved in distributed computation. Your code is exactly that case, because the function f is really tiny; I assume it takes less than a microsecond to run.
This means you should make the function f computationally heavier in order to benefit from parallelization. Your proposed solution might not work, because even after chunking, the function f might still be lightweight enough, depending on your list size.
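One common way to make each task heavier is to batch the list into chunks, so that every remote call processes many elements at once. A rough sketch of that idea (the name f_chunk and the chunking via np.array_split are my own, not from the original code; ray.put is used as the question's edit suggests):
import numpy as np
import psutil
import ray

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus)

@ray.remote
def f_chunk(a, b, chunk):
    # one task processes a whole chunk instead of a single element
    return [a * b - c for c in chunk]

my_list = np.arange(1, 10000)
chunks = np.array_split(my_list, num_cpus)

# ray.put places each chunk in the object store instead of copying it into the task arguments
refs = [f_chunk.remote(1.5, 2, ray.put(chunk)) for chunk in chunks]
results = [x for part in ray.get(refs) for x in part]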
Good day
I am trying to speed up a computation that involves many independent integrations. To do this I am using Python's joblib and multiprocessing. So far I have succeeded in parallelizing the inner loop of my computation, but I would like to do the same with the outer loop. Since parallel programming messes with my mind, I am wondering if someone could help me. So far I have:
from joblib import Parallel, delayed
import multiprocessing

N = 10  # Some number
inputs = range(1, N, 2)
num_cores = multiprocessing.cpu_count()

def processInput(n):
    u_1 = lambda x, y: f(x, y) * g(n, m)           # Some function
    Cn = scintegrate.nquad(u_1, [[A, B], [C, D]])  # A number
    return Cn * F(x, y) * G(n, m)

resultsN = []
for m in range(1, N, 2):  # How can this be parallelized?
    add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in inputs)
    resultsN = add + resultsN

resultsN = sum(resultsN)
This has so far produced the correct results. Now I would like to do the same with the outer loop. Does anyone have an idea how I can do this?
I am also wondering whether the u_1 declaration can be done outside processInput; any other suggestions for improvement will be appreciated.
Thanks for any replies.
If I understand correctly, you run your function processInput(n) for a range of n values, and you need to do that m times and add everything together. Here, the index m only keeps count of how many times you want to run your processing function and add the results together, but nothing else. This allows you to do everything with just one layer of parallelism, namely creating a list of inputs which already contains repeated values, and dividing that workload amongst your cores. The quick intuition is that instead of processing inputs [1,2,3,4] in parallel and then doing that a bunch of times, you run in parallel inputs [1,1,1,2,2,2,3,3,3,4,4,4]. Here is what it could look like (I've changed your functions into a simpler function that I can run).
import numpy as np
from joblib import Parallel, delayed
import multiprocessing
from math import ceil

N = 10  # Some number
inputs = range(1, N, 2)
num_cores = multiprocessing.cpu_count()

def processInput(n):  # toy function
    return n

resultsN = []

# your original solution with an additional loop that needs
# to be parallelized
for m in range(1, N, 2):
    add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in inputs)
    resultsN = add + resultsN
resultsN = sum(resultsN)
print(resultsN)

# solution with only one layer of parallelization
ext_inputs = np.repeat(inputs, ceil(m/2.0)).tolist()
add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in ext_inputs)
resultsN = sum(add)
print(resultsN)
The ceil is required because in your original loop m skips every second value.
I have a large-ish array artist_topic_probs (112,312 item rows by ~100 feature columns), and I want to calculate the pairwise cosine similarity between a (large) sample of random pairs of rows from this array. Here are the relevant bits of my current code:
# the number of random pairs to check (10 million here)
random_sample_size = 10000000

# I want to make sure they're unique, and that I'm never comparing a row to itself
# so I generate my set of comparisons like so:
np.random.seed(99)
comps = set()
while len(comps) < random_sample_size:
    a = np.random.randint(0, 112312)
    b = np.random.randint(0, 112312)
    if a != b:
        comp = tuple(sorted([a, b]))
        comps.add(comp)

# convert to list at the end to ensure sort order
# not positive if this is needed...I've seen conflicting opinions
comps = list(sorted(comps))
This generates a list of tuples, where each tuple contains the two rows between which I'll calculate similarity. Then I just use a simple loop to calculate all the similarities:
from scipy.spatial.distance import cosine

c_dists = []
for a, b in comps:
    c_dists.append(cosine(artist_topic_probs[a], artist_topic_probs[b]))
(of course, cosine here gives distance, not a similarity, but we can easily get that with sim = 1.0 - dist. I used similarity in the title because it's the more common term)
This works fine, but isn't too fast, and I need to repeat the procedure many times. I have 32 cores to work with, so parallelization seems like a good bet, but I'm not sure the best way to go about it. My idea was something like:
pool = mp.Pool(processes=32)
c_dists = [pool.apply(cosine, args=(artist_topic_probs[a], artist_topic_probs[b]))
           for a, b in comps]
But testing this approach out on my laptop with some test data hasn't been working (it just hangs, or at least is taking so much longer than the simple loop that I got sick of waiting and killed it). My concern is the indexing of the matrix being some sort of bottleneck, but I'm not sure. Any ideas on how to effectively parallelize this (or otherwise speed up the process)?
First of all, you might want to use itertools.combinations and random.sample to get unique pairs in the future, but it won't work in this case due to memory issues. Then, multiprocessing is not multithreading, i.e. spawning a new process involves huge system overhead. There is little sense in spawning a process for each individual task. A task must be well worth the overhead to rationalise starting a new process, hence you'd better split all work into separate jobs (into as many pieces as the number of cores you want to use). Then, don't forget that multiprocessing implementation serialises the entire namespace and loads it into memory N times, where N is the number of processes. This can lead to intensive swapping if you don't have enough RAM to store N copies of your huge array. So you might want to reduce the number of cores.
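For reference, a minimal sketch of the combinations/sample idea mentioned above (it materialises every pair up front, which is exactly why it does not scale to ~112k rows):
import itertools
import random

n_rows = 1000            # fine at this size; ~112k rows would mean ~6.3 billion pairs
sample_size = 10000

all_pairs = list(itertools.combinations(range(n_rows), 2))  # every unique (a, b) with a < b
comps = random.sample(all_pairs, sample_size)               # unique pairs, never a row with itself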
Updated to restore initial order as you requested.
I made a test data-set of identical vectors, hence cosine must return a vector of zeros.
from __future__ import division, print_function
import math
import multiprocessing as mp
import itertools
from operator import itemgetter
from scipy.spatial.distance import cosine

def worker(enumerated_comps):
    return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b]))
            for ind, (a, b) in enumerated_comps]

def slice_iterable(iterable, chunk):
    """
    Slices an iterable into chunks of size n
    :param chunk: the number of items per slice
    :type chunk: int
    :type iterable: collections.Iterable
    :rtype: collections.Generator
    """
    _it = iter(iterable)
    return itertools.takewhile(
        bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
    )

# Test data
artist_topic_probs = [range(10) for _ in range(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))

n_cores = 2
chunksize = int(math.ceil(len(comps) / n_cores))
jobs = tuple(slice_iterable(comps, chunksize))

pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
c_dists = list(map(itemgetter(1), sorted(itertools.chain(*work_res.get()))))

print(c_dists)
Output:
[2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16]
These values are fairly close to zero.
P.S.
From the multiprocessing.Pool.apply docs
Equivalent of the apply() built-in function. It blocks until the
result is ready, so apply_async() is better suited for performing
work in parallel. Additionally, func is only executed in one of the
workers of the pool.
scipy.spatial.distance.cosine, as you can see by following the link, introduces significant overhead into your computations, because each invocation computes the norms of the two vectors being analyzed. For the size of your sample this amounts to 20 million norm computations; if you memorize the norms of your ~100 thousand vectors in advance, you can save approximately 60% of your computation time, because each distance involves a dot product, u*v, and two norm calculations, and each of these three operations is roughly equivalent in terms of operation count.
Further, you're using explicit loops; if you could put your logic inside a vectorized numpy operation, you could trim another large slice of your computational time.
Finally, you talk about cosine similarity: consider that scipy.spatial.distance.cosine computes the cosine distance instead. The relationship is easy, cs = 1 - cd, but I haven't seen this conversion in your posted code.
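A rough sketch of both suggestions combined (precomputed norms plus a vectorized batch of dot products; the random data and index arrays below stand in for the real artist_topic_probs and comps):
import numpy as np

np.random.seed(99)
artist_topic_probs = np.random.rand(112312, 100)
a = np.random.randint(0, 112312, size=10000)
b = np.random.randint(0, 112312, size=10000)   # paired indices; deduplication omitted for brevity

# compute every row norm once, instead of once per comparison
norms = np.linalg.norm(artist_topic_probs, axis=1)

dots = np.einsum('ij,ij->i', artist_topic_probs[a], artist_topic_probs[b])
cos_sim = dots / (norms[a] * norms[b])   # cosine similarity
cos_dist = 1.0 - cos_sim                 # cosine distance, matching scipy's convention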
I'm using griddata to "mount" an array with a great number of shapes, and
I would like to know if I can calculate the functions (one on each slice) on each of my 4 cores in order to accelerate the process.
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X**2+Y**2+X+Y
array[:,:,3] = X**3+Y**3+X**2+Y**2+X+Y
array[:,:,4] = X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,5] = X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,6] = X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,7] = X**7+Y**7+X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
So here I would like to calculate array[:,:,0] & array[:,:,1] with the first core, then array[:,:,2] & array[:,:,3] with the second core, and so on.
----EDIT LATER---
There is no link between the different "slices"... my different functions are independent:
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X*np.cos(X)+Y*np.sin(Y)
array[:,:,3] = X**3+np.sin(X)+X**2+Y**2+np.sin(Y)
...
You can try with multiprocessing.Pool:
from multiprocessing import Pool
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))

def func(i):  # you need to call a function with Pool
    array_ = np.zeros((Y.shape[0], X.shape[0]))
    for j in range(1, i):
        array_ += X**j + Y**j
    return array_

if __name__ == '__main__':
    p = Pool(4)  # if you have 4 cores in your processor
    result = p.map(func, range(1, 8))
    for i in range(1, 8):
        array[:, :, i] = result[i-1]
Keep in mind that multiprocessing in python does not share memory, that's why you have to create the array_ and add the for-loop at the end of the code.
As your application (with these dimensions) doesn't need a lot of computing time, it is possible that you will be slower with this method. Also, you will create multiple copies of all your variables, which may cause a memory overflow.
You should also double-check the func I wrote, as I didn't completely verify that it does what it is supposed to do :)
If you want to apply a single function over an array of data, then using e.g. a multiprocessing.Pool is a good solution, provided that both the input and output of the calculation are relatively small.
You want to do many different calculations on two input arrays, each of which results in an array being returned.
Since separate processes do not share memory, the X and Y arrays have to be transported to each worker process when it is started. And the result of each calculation (which is also a numpy array of the same size as X and Y) has to be returned to the parent process.
Depending on e.g. the size of the arrays and the number of cores, the overhead from transferring all those arrays between the worker processes and the parent process via interprocess communication ("IPC") will cost time, reducing the advantage of using multiple cores.
Keep in mind that the parent process has to listen for and handle IPC requests from all the worker processes. So you've shifted the bottleneck from calculation to communication.
So it is not a given that multiprocessing will actually improve performance in this case. It depends on the details of the actual problem (number of cores, array size, amount of physical memory et cetera).
You will have to do some careful performance measurements using e.g. Pool or Process with realistic array sizes.
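A minimal measurement harness along those lines might look like this (the per-slice workload and the sizes are placeholders, to be replaced by the real functions and dimensions):
import time
import numpy as np
from multiprocessing import Pool

N = 2000
X = np.arange(N, dtype=float)
Y = np.arange(N, dtype=float)

def slice_func(i):
    # placeholder per-slice workload
    return sum(X**j + Y**j for j in range(1, i + 1))

if __name__ == '__main__':
    indices = range(1, 8)

    t0 = time.time()
    serial = [slice_func(i) for i in indices]
    t1 = time.time()
    with Pool(4) as p:
        parallel = p.map(slice_func, indices)
    t2 = time.time()

    print('serial:   %.3f s' % (t1 - t0))
    print('parallel: %.3f s' % (t2 - t1))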
Three things:
The most important question is: why are you doing this?
Your NumPy build may already be making use of multiple cores. I am not sure off the top of my head how to check, see questions like this, or if absolutely necessary take a look at the Numexpr library https://github.com/pydata/numexpr (a hedged sketch of that route follows the code block below).
About the "Y" in your likely XY problem - you are re-calculating data that you can instead re-use:
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))

array[..., 0] = 0
for i in range(1, size):
    array[..., i] = X ** i + Y ** i + array[..., i - 1]