Python multiprocessing: best practice for mapping results to scientific NumPy arrays

I don't really understand how to handle multiprocessing in Python when the results of a function have to be mapped into a multidimensional array. Below is a simple example of how I calculate it serially; the parallel version does not work. I often pass a lot of arguments to a function, so building all i,j pairs with a reshaped meshgrid is a very annoying way of doing it. Is there a better way?
import concurrent.futures
import numpy as np
def complex_function(i, j):
    # this is a computationally intense function
    return i, j, i + j
all_i = np.arange(3)
all_j = np.arange(6)
#%% serial
solution = np.empty((len(all_i), len(all_j)), dtype=float)
for i in range(len(all_i)):
    for j in range(len(all_j)):
        solution[i, j] = complex_function(all_i[i], all_j[j])[2]
#%% parallel
solution = np.empty((len(all_i), len(all_j)), dtype=float)
I, J = np.meshgrid(all_i, all_j, sparse=False, indexing='ij')
I = I.reshape(-1)
J = J.reshape(-1)
with concurrent.futures.ProcessPoolExecutor() as executor:
    for i, j, result in executor.map(complex_function, I, J):
        solution[i, j] = result
Okay, now I want to know whether I can use nested functions like
def dummy_function(i, j):
    result = complex_function(i, j)
    return result
and then call dummy_function(i,j) with the executor.
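A wrapper like that works as long as it is defined at module level, so it can be pickled for the worker processes. To sidestep the meshgrid entirely, one option is to parallelize over rows and let each worker fill a whole row; here is a minimal sketch of that idea, where row_function is a hypothetical helper, not part of the original code:
import concurrent.futures
import numpy as np

def complex_function(i, j):
    # stand-in for the computationally intense function
    return i, j, i + j

def row_function(i, all_j):
    # hypothetical helper: compute one full row of the solution for a fixed i
    return [complex_function(i, j)[2] for j in all_j]

all_i = np.arange(3)
all_j = np.arange(6)

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        rows = executor.map(row_function, all_i, [all_j] * len(all_i))
        solution = np.array(list(rows), dtype=float)
Fixed extra arguments can be bound with functools.partial instead of being repeated, so only the varying index is mapped.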

Related

Generate two independent dictionaries in parallel

I want to build a dictionary of function evaluations in a parallel manner, but I am struggling to figure out how to do this efficiently.
Take the case of a randomly constructed matrix:
import functools
import multiprocessing
import numpy as np
import time
#generate random symmetric matrix
N = 500
b = np.random.randint(-2000, 2001, size=(N,N))
b_symm = (b + b.T)/2
#identity matrix
ident = np.eye(N)
# define worker function:
def func(w, b_mat):
    if w != 0:
        L = np.linalg.inv(1j * w * ident - b_mat)
    else:
        L = np.linalg.pinv(-b_mat)
    return L
I now want to sample over many values of w and construct a dictionary of outputs. This is an embarrassingly parallel problem. I can do it with a shared dictionary, using something like this:
def dict_builder(w, d):
    d[w] = func(w, b_symm)
manager = multiprocessing.Manager()
val_dict = manager.dict()
wrange = np.linspace(-10,10,200)
processors=2
pool = multiprocessing.Pool(processors)
st = time.time()
data = pool.map(functools.partial(dict_builder, d=val_dict), wrange, 2)
pool.close()
pool.join()
en = time.time()
print("parallel test took ",en - st," seconds.")
but this seems more complicated than necessary since I am only evaluating the function at unique points, and comes with the overhead of having a shared memory object.
What I would like to do is split wrange into n chunks, where n is the number of processors, build n dictionaries independently, and then combine them into a single dictionary. So two questions: 1) Would this be computationally advantageous? 2) What is the best way to implement this using the multiprocessing module?
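One hedged sketch of a Manager-free version: let the workers return the matrices and assemble the dictionary in the parent, since Pool.map already splits the input into chunks across processes, so there is usually no need to split wrange by hand. This reuses func, b_symm, and np from the snippets above:
import functools
import multiprocessing

if __name__ == '__main__':
    wrange = np.linspace(-10, 10, 200)
    with multiprocessing.Pool(2) as pool:
        # each worker computes func(w, b_symm); results come back in input order
        results = pool.map(functools.partial(func, b_mat=b_symm), wrange)
    val_dict = dict(zip(wrange, results))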

Avoiding memory error when using numpy's argsort

The following source code produces a MemoryError on my machine:
import numpy as np
x = np.random.random([100,100,100])
y = np.random.random([100,100,100])
c_sort = np.argsort(x, axis = 2)
f = y[c_sort]
Do you have a nice and easy idea how to avoid the memory error?
The other way to do this is
x = np.random.random([100,100,100])
y = np.random.random([100,100,100])
f = np.zeros([100,100,100])
for i in range(100):
    for j in range(100):
        f[i,j,:] = y[i,j, np.argsort(x[i,j,:])]
But I wonder why the solutions above do not lead to the same result?
After discussion in the comments, it seems the loopy version is the correct one. The first version fails because y[c_sort] indexes only the first axis of y with a (100,100,100) index array, producing a (100,100,100,100,100) result, hence the MemoryError. To optimize the loop, we can use advanced indexing. Thus, given the argsort indices as idx = np.argsort(x, axis=2), we can build f like so -
m,n = y.shape[:2]
f = y[np.arange(m)[:,None,None], np.arange(n)[:,None], idx]
The generic helper function for advanced indexing, take_along_axis, could be useful here.
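On NumPy 1.15 or newer, np.take_along_axis does exactly this in one call; a quick sketch:
import numpy as np

x = np.random.random([100, 100, 100])
y = np.random.random([100, 100, 100])

# reorder y along the last axis according to the sort order of x
idx = np.argsort(x, axis=2)
f = np.take_along_axis(y, idx, axis=2)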

Python, parallelization with joblib: Delayed with multiple arguments

I am using something similar to the following to parallelize a for loop over two matrices
from joblib import Parallel, delayed
import numpy
def processInput(i, j):
    for k in range(len(i)):
        i[k] = 1
    for t in range(len(b)):
        j[t] = 0
    return i, j
a = numpy.eye(3)
b = numpy.eye(3)
num_cores = 2
(a,b) = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
but I'm getting the following error: Too many values to unpack (expected 2)
Is there a way to return 2 values with delayed? Or what solution would you propose?
Also, a bit off topic: is there a more compact way, like the following (which doesn't actually modify anything), to process the matrices?
from joblib import Parallel, delayed
def processInput(i, j):
    for k in i:
        k = 1
    for t in b:
        t = 0
    return i, j
I would like to avoid the use of has_shareable_memory anyway, to prevent possible bad interactions in the actual script and lower performance(?)
Probably too late, but as an answer to the first part of your question:
Just return a tuple in your delayed function.
return (i,j)
And for the variable holding the output of all your delayed functions
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
Now results is a list of tuples each holding some (i,j) and you can just iterate through results.
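If you then want the two matrices back, as in the original (a,b) = ... attempt, you can transpose the list of tuples with zip; a small sketch, reusing processInput, a, b, and num_cores from the question:
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i, j) for i, j in zip(a, b))
# unzip the list of (i, j) tuples into two sequences of rows
new_a, new_b = zip(*results)
a, b = numpy.vstack(new_a), numpy.vstack(new_b)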

python numpy - optimize chisq function by removing explicit python loop?

I'm trying to evaluate a chi squared function, i.e. compare an arbitrary (blackbox) function to a numpy vector array of data. At the moment I'm looping over the array in python but something like this is very slow:
n = len(array)
sigma = 1.0
chisq = 0.0
for i in range(n):
    data = array[i]
    model = f(i, a, b, c)
    chisq += 0.5*((data - model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec = numpy.zeros(n)
for i in range(n):
    model_vec[i] = f(i, a, b, c)
return numpy.sum(((array - model_vec)/sigma)**2.0)
?
Thanks!
You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
Notes: The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b

def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a, b, x)[i] if you only want the ith value; for operations on the entire array, use g(a, b, x) directly, and it will be much faster:
model_vec = g(a, b, x)
return numpy.sum(((array-model_vec)/sigma)**2.0)
It seems that your code is slow because what is executing in the loop is slow (your model generation); turning it into a one-liner won't speed things up. If you have access to a modern computer with more than one CPU, you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool

if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = [(i, a, b, c) for i in range(n)]
    # starmap unpacks each (i, a, b, c) tuple into a call f(i, a, b, c)
    model_array = pool.starmap(f, inputs)
    pool.close()
    chisq = 0.0
    for i in range(n):
        data = array[i]
        model = model_array[i]
        chisq += 0.5*((data - model)/sigma)**2.0

Python append performance

I'm having some performance problems with 'append' in Python.
I'm writing an algorithm that checks if there are two overlapping circles in a (large) set of circles.
I start by putting the extreme points of the circles (x_i-R_i & x_i+R_i) in a list and then sorting the list.
class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius
In between I generate N random circles and put them in the 'circles' list.
"""
Makes a list with all the extreme points of the circles.
Format = [Extreme, left/right ~ 0/1 extreme, index]
Seperate function for performance reason, python handles local variables faster.
Garbage collect is temporarily disabled since a bug in Python makes list.append run in O(n) time instead of O(1)
"""
def makeList():
"""gc.disable()"""
list = []
append = list.append
for circle in circles:
append([circle.m[0]-circle.r, 0, circles.index(circle)])
append([circle.m[0] + circle.r, 1, circles.index(circle)])
"""gc.enable()"""
return list
When running this with 50k circles it takes over 75 seconds to generate the list. As you can see in the comments I wrote, I disabled garbage collection, put the code in a separate function, and used
append = list.append
append(foo)
instead of just
list.append(foo)
I disabled gc since, after some searching, it seems there's a bug in Python causing append to run in O(n) instead of O(1) time.
So is this way the fastest way or is there a way to make this run faster?
Any help is greatly appreciated.
Instead of
for circle in circles:
    ... circles.index(circle) ...
use
for i, circle in enumerate(circles):
    ... i ...
This could decrease your O(n^2) to O(n).
Your whole makeList could be written as:
sum([[[circle.m[0]-circle.r, 0, i], [circle.m[0]+circle.r, 1, i]] for i, circle in enumerate(circles)], [])
Your performance problem is not in the append() method, but in your use of circles.index(), which makes the whole thing O(n^2).
A further (comparatively minor) improvement is to use a list comprehension instead of list.append():
mylist = [[circle.m[0] - circle.r, 0, i]
          for i, circle in enumerate(circles)]
mylist += [[circle.m[0] + circle.r, 1, i]
           for i, circle in enumerate(circles)]
Note that this will give the data in a different order (which should not matter as you are planning to sort it anyway).
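If the original interleaved order matters after all, a single comprehension with a nested loop preserves it; a small sketch:
# one pass over circles, keeping each circle's left/right points adjacent
mylist = [point
          for i, circle in enumerate(circles)
          for point in ([circle.m[0] - circle.r, 0, i],
                        [circle.m[0] + circle.r, 1, i])]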
I've just tried several tests to improve the speed of "append". They will definitely be helpful for you.
Using Python
Using list(map(lambda ...)) - known to be a bit faster than for+append
Using Cython
Using Numba - jit
CODE CONTENT: taking the numbers from 0 to 9999999, squaring them, and putting them into a new list using append.
Using Python
import timeit
st1 = timeit.default_timer()
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.7 s
Using list(map(lambda ...))
import timeit
st1 = timeit.default_timer()
result = list(map(lambda x : x**2 , range(0,10000000) ))
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.6 s
Using Cython
The code in a .pyx file:
def f1():
    cdef long long i
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
and I compiled it and ran it from a .py file.
In the .py file:
import timeit
from c1 import *
st1 = timeit.default_timer()
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 1.6 s
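For reference, a minimal setup.py that could be used to compile the .pyx module (the module name c1 comes from the import in the .py file above):
# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("c1.pyx"))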
Using Numba - jit
import timeit
from numba import jit
st1 = timeit.default_timer()
@jit(nopython=True, cache=True)
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 0.57 s
CONCLUSION:
As you mentioned above, changing the simple append form boosts the speed a bit, and using Cython is much faster than plain Python. However, it turns out that Numba is the best choice for speeding up 'for+append'!
Try using deque from the collections package to append large numbers of rows without diminishing performance, then convert the deque back to a DataFrame with a list comprehension.
Sample Case:
from collections import deque
import pandas as pd

d = deque()
for row in rows:
    d.append([value_x, value_y])
df = pd.DataFrame({'column_x': [item[0] for item in d],
                   'column_y': [item[1] for item in d]})
This is a real time-saver.
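Since DataFrame accepts any iterable of rows, the conversion can also be a one-liner, assuming two-element rows as above:
# deque is iterable, so DataFrame can consume it directly
df = pd.DataFrame(d, columns=['column_x', 'column_y'])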
If performance were an issue, I would avoid using append. Instead, preallocate an array and then fill it up. I would also avoid using index to find position within the list "circles". Here's a rewrite. It's not compact, but I'll bet it's fast because of the unrolled loop.
def makeList():
    """gc.disable()"""
    mylist = 6*len(circles)*[None]
    for i in range(len(circles)):
        j = 6*i
        mylist[j] = circles[i].m[0] - circles[i].r
        mylist[j+1] = 0
        mylist[j+2] = i
        mylist[j+3] = circles[i].m[0] + circles[i].r
        mylist[j+4] = 1
        mylist[j+5] = i
    return mylist
