Double parallel loop with Python Joblib - python

Good day
I am trying to speed up a computation that involves many independent integrations. To do this I am using pythons Joblib and multiprocessing. So far I have succeeded with parallelizing the inner loop of my computation, but I would like to do the same with the outer loop. Since parallel programming messes with my mind, I am wondering if someone could help me. So far I have:
from joblib import Parallel, delayed
import multiprocessing
N = 10 # Some number
inputs = range(1,N,2)
num_cores = multiprocessing.cpu_count()
def processInput(n):
u_1 = lambda x,y: f(x,y)g(n,m) # Some function
Cn = scintegrate.nquad(u_1, [[A,B],[C,D]]) # A number
return Cn*F(x,y)*G(n,m)
resultsN = []
for m in range(1,N,2): # How can this be parallelized?
add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in inputs)
resultsN = add + resultsN
resultsN = sum(resultsN)
This have so far produced the correct results. Now I would like to do the same with outer loop. Does anyone have an idea how I can do this?
I am also wondering if the u_1 declaration can be done outside the processInput, and any other suggestions for improvement will be appreciated.
Thanks for any replies.

If I understand correctly, you run your function processInput(n) for a range of n values, and you need to do that m times and add everything together. Here, the index m only keeps count of how many times you want to run your processing function and add the results together, but nothing else. This allows you to do everything with just one layer of parallelism, namely creating a list of inputs which already contains repeated values, and dividing that workload amongst your cores. The quick intuition is that instead of processing inputs [1,2,3,4] in parallel and then doing that a bunch of times, you run in parallel inputs [1,1,1,2,2,2,3,3,3,4,4,4]. Here is what it could look like (I've changed your functions into a simpler function that I can run).
import numpy as np
from joblib import Parallel, delayed
import multiprocessing
from math import ceil
N = 10 # Some number
inputs = range(1,N,2)
num_cores = multiprocessing.cpu_count()
def processInput(n): # toy function
return n
resultsN = []
# your original solution with an additional loop that needs
# to be parallelized
for m in range(1,N,2):
add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in inputs)
resultsN = add + resultsN
resultsN = sum(resultsN)
print resultsN
# solution with only one layer of parallelization
ext_inputs = np.repeat(inputs,ceil(m/2.0)).tolist()
add = Parallel(n_jobs=num_cores)(delayed(processInput)(n) for n in ext_inputs)
resultsN = sum(add)
print resultsN
The ceil is required because in your original loop m skips every second value.

Related

Does this manipulation in python save more time for my highly inefficient program?

I am using the following code unchanged in form but changed in content:
import numpy as np
import matplotlib.pyplot as plt
import random
from random import seed
from random import randint
import math
from math import *
from random import *
import statistics
from statistics import *
n=1000
T_plot=[0];
X_relm=[0];
class Objs:
def __init__(self, xIn, yIn, color):
self.xIn= xIn
self.yIn = yIn
self.color = color
def yfT(self, t):
return self.yIn*t+self.yIn*t
def xfT(self, t):
return self.xIn*t-self.yIn*t
xi=np.random.uniform(0,1,n);
yi=np.random.uniform(0,1,n);
O1 = [Objs(xIn = i, yIn = j, color = choice(["Black", "White"])) for i,j
in zip(xi,yi)]
X=sorted(O1,key=lambda x:x.xIn)
dt=1/(2*n)
T=20
iter=40000
Black=[]
White=[]
Xrelm=[]
for i in range(1,iter+1):
t=i*dt
for j in range(n-1):
check=X[j].xfT(t)-X[j+1].xfT(t);
if check<0:
X[j],X[j+1]=X[j+1],X[j]
if check<-10:
X[j].color,X[j+1].color=X[j+1].color,X[j].color
if X[j].color=="Black":
Black.append(X[j].xfT(t))
else:
White.append(X[j].xfT(t))
Xrel=mean(Black)-mean(White)
Xrelm.append(Xrel)
plot1=plt.figure(1);
plt.plot(T_plot,Xrelm);
plt.xlabel("time")
plt.ylabel("Relative ")
and it keeps running (I left it for 10 hours) without giving output for some parameters simply because it's too big I guess. I know that my code is not faulty totally (in the sense that it should give something even if wrong) because it does give outputs for fewer time steps and other parameters.
So, I am focusing on trying to optimize my code so that it takes lesser time to run. Now, this is a routine task for coders but I am a newbie and I am coding simply because the simulation will help in my field. So, in general, any inputs of a general nature that give insights on how to make one's code faster are appreciated.
Besides that, I want to ask whether defining a function a priori for the inner loop will save any time.
I do not think it should save any time since I am doing the same thing but I am not sure maybe it does. If it doesn't, any insights on how to deal with nested loops in a more efficient way along with those of general nature are appreciated.
(I have tried to shorten the code as far as I could and still not miss relevant information)
There are several issues in your code:
the mean is recomputed from scratch based on the growing array. Thus, the complexity of mean(Black)-mean(White) is quadratic to the number of elements.
The mean function is not efficient. Using a basic sum and division is much faster. In fact, a manual mean is about 25~30 times faster on my machine.
The CPython interpreter is very slow so you should avoid using loops as much as possible (OOP code does not help either). If this is not possible and your computation is expensive, then consider using a natively compiled code. You can use tools like PyPy, Numba or Cython or possibly rewrite a part in C.
Note that strings are generally quite slow and there is no reason to use them here. Consider using enumerations instead (ie. integers).
Here is a code fixing the first two points:
dt = 1/(2*n)
T = 20
iter = 40000
Black = []
White = []
Xrelm = []
cur1, cur2 = 0, 0
sum1, sum2 = 0.0, 0.0
for i in range(1,iter+1):
t = i*dt
for j in range(n-1):
check = X[j].xfT(t) - X[j+1].xfT(t)
if check < 0:
X[j],X[j+1] = X[j+1],X[j]
if check < -10:
X[j].color, X[j+1].color = X[j+1].color, X[j].color
if X[j].color == "Black":
Black.append(X[j].xfT(t))
else:
White.append(X[j].xfT(t))
delta1, delta2 = sum(Black[cur1:]), sum(White[cur2:])
sum1, sum2 = sum1+delta1, sum2+delta2
cur1, cur2 = len(Black), len(White)
Xrel = sum1/cur1 - sum2/cur2
Xrelm.append(Xrel)
Consider resetting Black and White to an empty list if you do not use them later.
This is several hundreds of time faster. It now takes 2 minutes as opposed to >20h (estimation) for the initial code.
Note that using a compiled code should be at least 10 times faster here so the execution time should be no more than dozens of seconds.
As mentioned in earlier comments, this one is a bit too broad to answer.
To illustrate; your iteration itself doesn't take very long:
import time
start = time.time()
for i in range(10000):
for j in range(10000):
pass
end = time.time()
print (end-start)
On my not-so-great machine that takes ~2s to complete.
So the looping portion is only a tiny fraction of your 10h+ run time.
The detail of what you're doing in the loop is the key.
Whilst very basic, the approach I've shown in the code above could be applied to your existing code to work out which bit(s) are the least performant and then raise a new question with some more specific, actionable detail.

Why Multiprocessing is taking more time than sequential processing?

The below code is taking around 15 seconds to get the result. But when I run a it sequentially it
only takes around 11 seconds. What can be the reason for this ?
import multiprocessing
import os
import time
def square(x):
# print(os.getpid())
return x*x
if __name__=='__main__':
start_time = time.time()
p = multiprocessing.Pool()
r = range(100000000)
p1 = p.map(square,r)
end_time = time.time()
print('time_taken::',end_time-start_time)
Sequential code
start_time = time.time()
d = list(map(square,range(100000000)))
end_time = time.time()
Regarding your code example, there are two important factors which influence runtime performance gains achievable by parallelization:
First, you have to take the administrative overhead into account. This means, that spawning new processes is rather expensive in comparison to simple arithmetic operations. Therefore, you gain performance, when the computation's complexity exceeds a certain threshold. Which was not the case in your example above.
Secondly, you have to think of a "clever way" of splitting your computation into parts which could be independently executed. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self contained package of computations to perform.
E.g., this could be accomplished with the following modifications of your code:
def square(x):
return x ** 2
def square_chunk(i, j):
return list(map(square, range(i, j)))
def calculate_in_parallel(n, c=4):
"""Calculates a list of squares in a parallelized manner"""
result = []
step = math.ceil(n / c)
with Pool(c) as p:
partial_results = p.starmap(
square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
)
for res in partial_results:
result += res
return result
Please note, that I used the operation x**2 (instead of the heavily optimized x*x) to increase the load and underline resulting runtime differences.
Here, the Pool's starmap()-function is used which unpacks arguments of the passed tuples. Using it, we can effectively pass more than one argument to the mapped function. Furthermore, we distribute the workload evenly to the amount of available cores. On each core the range of numbers between (i, min(i + step, n)) is calculated, whereas the step denotes the chunksize, calculated as the maximum_number divided by the count of CPU.
By running the code with different parametrizations, one can clearly see, that the performance gain increases when the maximum number (denoted n) increases. As expected, when more cores are used in parallel the runtime is reduced as well.
Edit:
As #KellyBundy pointed out, parallelism (especially) shines, when you minimize not only the input to the worker processes but the output as well. Performing several measurements calculating the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists, showed an even larger increase in runtime performance as the following figure illustrates.

Python: edit matrix row in parallel

here is my problem:
I would like to define an array of persons and change the entries of this array in a for loop. Since I also would like to see the asymptotics of the resulting distribution, I want to repeat this simulation quiet a lot, thus I'm using a matrix to store the several array in each row. I know how to do this with two for loops:
import random
import numpy as np
nobs = 100
rep = 10**2
steps = 10**2
dmoney = 1
state = np.matrix([[10] * nobs] * rep)
for i in range(steps):
for j in range(rep)
sample = random.sample(range(state.shape[1]),2)
state[j,sample[0]] = state[j,sample[0]] + dmoney
state[j,sample[1]] = state[j,sample[1]] - dmoney
I thought I use the multiprocessing library but I don't know how to do it, because in my simple mind, the workers manipulate the same global matrix in parallel, which I read is not a good idea.
So, how can I do this, to speed up calculations?
Thanks in advance.
OK, so this might not be much use, I haven't profiled it to see if there's a speed-up, but list comprehensions will be a little faster than normal loops anyway.
...
y_ix = np.arange(rep) # create once as same for each loop
for i in range(steps):
# presumably the two locations in the population to swap need refreshing each loop
x_ix = np.array([np.random.choice(nobs, 2) for j in range(rep)])
state[y_ix, x_ix[:,0]] += dmoney
state[y_ix, x_ix[:,1]] -= dmoney
PS what numpy splits over multiple processors depends on what libraries have been included when compiled (BLAS etc). You will be able to find info on line about this.
EDIT I can confirm, after comparing the original with the numpy indexed version above, that the original method is faster!

Python - Multi processing to mount an array

I m using griddata to "mount" array with a great number of shapes and
i would like to know if i can calculate functions (on each slice) on each my 4 cores in order to accelerate the process?
import numpy
size = 8.
Y=(arange(2000))
X=(arange(2000))
(xx,yy)=meshgrid(X,Y)
array=zeros((Y.shape[0],X.shape[0],size))
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X**2+Y**2+X+Y
array[:,:,3] = X**3+Y**3+X**2+Y**2+X+Y
array[:,:,4] = X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,5] = X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,6] = X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,6] = X**7+Y**7+X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
So here i would like to calculate array[:,:,0] & array[:,:,1] with the first core, then array[:,:,2] & array[:,:,3] with the second core...?
----EDIT LATER---
There is no link between different "slices"...My different functions are independent
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X*np.cos(X)+Y*np.sin(Y)
array[:,:,3] = X**3+np.sin(X)+X**2+Y**2+np.sin(Y)
...
You can try with multiprocessing.Pool :
from multiprocessing import Pool
import numpy as np
size = 8.
Y=(np.arange(2000))
X=(np.arange(2000))
(xx,yy)=np.meshgrid(X,Y)
array=np.zeros((Y.shape[0],X.shape[0],size))
def func(i): # you need to call a function with Pool
array_=np.zeros((Y.shape[0],X.shape[0]))
for j in range(1,i):
array_+=X**j+Y**j
return array_
if __name__ == '__main__':
p = Pool(4) # if you have 4 cores in your processor
result=p.map(func, range(1,8))
for i in range(1,8):
array[::,::,i]=result[i-1]
Keep in mind that multiprocessing in python does not share memory, that's why you have to create the array_ and add the for-loop at the end of the code.
As your application (with these dimensions) doesn't need a lot of computing time, it is possible that you will be slower with this method. Also you will create multiple copies of all your variables, wich may cause a memory overflow.
You should also double-check the func I wrote, as I didn't completely verify that it does what it is supposed to do :)
If you want to apply a single function over an array of data, then using e.g. a multiprocessing.Pool is a good solution, provided that both the input and output of the calculation are relatively small.
You want to do many different calculations to two input arrays, which results in an array being returned for every one of those calculations.
Since separate processes do not share memory, the X and Y arrays have to be transported to each worker process when it is are started. And the result of each calculation (which is also a numpy array the same size as X and Y) has to be returned to the parent process.
Depending on e.g. the size of the arrays and the amount of cores, the overhead from the transfer of all those array between worker processes and the parent process via interprocess communication ("IPC") will cost time, reducing the advantages of using multiple cores.
Keep in mind that the parent process has to listen for and handle IPC requests from all the worker processes. So you've shifted the bottleneck from calculation to communication.
So it is not a given that multiprocessing will actually improve performance in this case. It depends on the details of the actual problem (number of cores, array size, amount of physical memory et cetera).
You will have to do some careful performance measurements using e.g. Pool or Process with realistic array sizes.
Three things:
The most important question is why are you doing this?.
Your NumPy build may already be making use of multiple cores. I am not sure off the top of my head how to check, see questions like this or if absolutely necessary take a look at the Numexpr library https://github.com/pydata/numexpr
About the "Y" in your likely XY problem - you are re-calculating data that you can instead re-use:
.
import numpy
size = 8
Y=(arange(2000))
X=(arange(2000))
(xx,yy)=meshgrid(X,Y)
array = zeros((Y.shape[0], X.shape[0], size))
array[..., 0] = 0
for i in range(1, size):
array[..., 1] = X ** i + Y ** i + array[..., i - 1]

Python multiprocessing not yielding the speed-up expected

I am trying to optimize my code using Python's multiprocessing.Pool module, but I am not getting the speed-up results that I would logically expect.
The main method I am doing involves calculating matrix-vector products for a large number of vectors and a fixed large sparse matrix. Below is a toy example which performs what I need, but with random matrices.
import time
import numpy as np
import scipy.sparse as sp
def calculate(vector, matrix = None):
for i in range(50):
v = matrix.dot(vector)
return v
if __name__ == '__main__':
N = 1e6
matrix = sp.rand(N, N, density = 1e-5, format = 'csr')
t = time.time()
res = []
for i in range(10):
res.append(calculate(np.random.rand(N), matrix = matrix))
print time.time() - t
The method terminates in about 30 seconds.
Now, since the calculation of each element of results does not depend on the results of any other calculation, it is natural to think that paralel calculation will speed up the process. The idea is to create 4 processes and if each does some of the calculations, then the time it takes for all the processes to complete should decrease by some factor around 4. To do this, I wrote the following code:
import time
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool
from functools import partial
def calculate(vector, matrix = None):
for i in range(50):
v = matrix.dot(vector)
return v
if __name__ == '__main__':
N = 1e6
matrix = sp.rand(N, N, density = 1e-5, format = 'csr')
t = time.time()
input = []
for i in range(10):
input.append(np.random.rand(N))
mp = partial(calculate, matrix = matrix)
p = Pool(4)
res = p.map(mp, input)
print time.time() - t
My problem is that this code takes slightly above 20 seconds to run, so I did not even improve performance by a factor of 2! Even worse, the performance does not improve even if the pool contains 8 processes! Any idea why the speed-up is not happening?
Note: My actual method takes much longer, and the input vectors are stored in a file. If I split the file in 4 pieces and then run my script in a separate process for each file manually, each process terminates four times as quickly as it would for the whole file (as expected). I am confuzed why this speed-up (which is obviously possible) is not happening with multiprocessing.Pool
Edi: I have just found Multiprocessing.Pool makes Numpy matrix multiplication slower this question which may be related. I have to check, though.
Try:
p = Pool(4)
for i in range(10):
input = np.random.rand(N)
p.apply_async(calculate, args=(input, matrix)) # perform function calculate as new process with arguments input and matrix
p.close()
p.join() # wait for all processes to complete
I suspect that the "partial" object and map are resulting in a blocking behavior. (though I have never used partial, so I'm not familiar with it.)
"apply_async" (or "map_async") are multiprocessing methods that specifically do not block - (see: Python multiprocessing.Pool: when to use apply, apply_async or map?)
Generally, for "embarrassingly parallel problems" like this, apply_async works for me.
EDIT:
I tend to write results to MySQL databases when I'm done - the implementation I provided doesn't work if that's not your approach. "map" is probably the right answer if you want to use order in the list as your way of tracking which entry is which, but I remain suspicious of the "partial" objects.

Categories