Python, parallelization with joblib: Delayed with multiple arguments

I am using something similar to the following to parallelize a for loop over two matrices
from joblib import Parallel, delayed
import numpy
def processInput(i,j):
    for k in range(len(i)):
        i[k] = 1
    for t in range(len(b)):
        j[t] = 0
    return i,j
a = numpy.eye(3)
b = numpy.eye(3)
num_cores = 2
(a,b) = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
but I'm getting the following error: Too many values to unpack (expected 2)
Is there a way to return 2 values with delayed? Or what solution would you propose?
Also, a bit off topic: is there a more compact way, like the following (which doesn't actually modify anything), to process the matrices?
from joblib import Parallel, delayed
def processInput(i,j):
    for k in i:
        k = 1
    for t in b:
        t = 0
    return i,j
I would like to avoid the use of has_shareable_memory anyway, to avoid possible bad interactions with the actual script and lower performance(?)

Probably too late, but as an answer to the first part of your question:
Just return a tuple in your delayed function.
return (i,j)
And for the variable holding the output of all your delayed functions
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
Now results is a list of tuples each holding some (i,j) and you can just iterate through results.
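If you then want the two processed matrices back as arrays (which is what the (a,b) = ... line in the question was after), here is a minimal sketch of one way to do it; the processInput body is condensed, but the setup otherwise mirrors the question:
import numpy
from joblib import Parallel, delayed

def processInput(i, j):
    # condensed version of the per-row work from the question
    i[:] = 1
    j[:] = 0
    return i, j

a = numpy.eye(3)
b = numpy.eye(3)
num_cores = 2

results = Parallel(n_jobs=num_cores)(delayed(processInput)(i, j) for i, j in zip(a, b))
a_rows, b_rows = zip(*results)   # split the list of (i, j) tuples into two sequences
a = numpy.vstack(a_rows)         # stack the processed rows back into 3x3 matrices
b = numpy.vstack(b_rows)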

Related

Python Multiprocessing, Best practice for mapping results to scientific numpy arrays

I don't really understand how to handle multiprocessing in Python when mapping function results to multidimensional arrays. I provide a simple example of how I calculate it serially; the parallel version does not work. Often I pass a lot of arguments to a function, so this would be a very annoying way of doing it. Is there a "better way" than creating all i,j-pairs with a reshaped meshgrid?
import concurrent.futures
import numpy as np
def complex_function(i, j):
    # this is a computationally intense function
    return i, j, i + j

all_i = np.arange(3)
all_j = np.arange(6)

#%% serial
solution = np.empty((len(all_i), len(all_j)), dtype=float)
for i in range(len(all_i)):
    for j in range(len(all_j)):
        solution[i, j] = complex_function(all_i[i], all_j[j])[2]

#%% parallel
solution = np.empty((len(all_i), len(all_j)), dtype=float)
I, J = np.meshgrid(all_i, all_j, sparse=False, indexing='ij')
I = I.reshape(-1)
J = J.reshape(-1)
with concurrent.futures.ProcessPoolExecutor() as executor:
    for i, j, result in executor.map(complex_function, I, J):
        solution[i, j] = result
Okay, now I want to know whether I can use nested functions like
def dummy_function(i, j):
    result = complex_function(i, j)
    return result
and then call dummy_function(i, j) with the executor.
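For what it's worth, a wrapper like that should work as long as it is defined at module level, so the worker processes can pickle it. Here is a rough sketch under that assumption; the extra offset parameter and the function bodies are purely illustrative, and functools.partial binds the arguments you don't want to pass for every (i, j) pair:
import concurrent.futures
from functools import partial
import numpy as np

def complex_function(i, j, offset=0):
    # stand-in for the computationally intense function
    return i, j, i + j + offset

def dummy_function(i, j, offset):
    # module-level wrapper: picklable, so worker processes can import and call it
    return complex_function(i, j, offset)

if __name__ == '__main__':
    all_i = np.arange(3)
    all_j = np.arange(6)
    I, J = np.meshgrid(all_i, all_j, indexing='ij')
    I, J = I.reshape(-1), J.reshape(-1)
    solution = np.empty((len(all_i), len(all_j)), dtype=float)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # bind the fixed argument once instead of passing it for every (i, j) pair
        worker = partial(dummy_function, offset=10)
        for i, j, result in executor.map(worker, I, J):
            solution[i, j] = result
    print(solution)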

How to find last "K" indexes of vector satisfying condition (Python) ? (Analogue of Matlab's "find" )

Consider some vector:
import numpy as np
v = np.arange(10)
Assume we need to find the last 2 indexes satisfying some condition.
For example, in Matlab it would be written as
find(v < 5, 2, 'last')
answer = [3, 4] (Note: Matlab indexes from 1)
Question: What would be the clearest way to do that in Python ?
"Nice" solution should STOP search when it finds 2 desired results, it should NOT search over all elements of vector.
So np.where does not seems to be "nice" in that sense.
We can easyly write that using "for", but is there any alternative way ?
I am afraid using "for" since it might be slow (at least it is very much so in Matlab).
This attempt doesn't use numpy, and it is probably not very idiomatic.
Nevertheless, if I understand it correctly, zip, filter and reversed are all lazy iterators that take only the elements that they really need. Therefore, you could try this:
x = list(range(10))
from itertools import islice
res = reversed(list(map(
    lambda xi: xi[1],
    islice(
        filter(
            lambda xi: xi[0] < 5,
            zip(reversed(x), reversed(range(len(x))))
        ),
        2
    )
)))
print(list(res))
Output:
[3, 4]
What it does (from inside to outside):
create index range
reverse both array and indices
zip the reversed array with indices
filter the (value, index) pairs that satisfy the condition, then take the first two with islice
throw away the values and retain only the indices with map
reverse again
Even though it looks somewhat monstrous, it should all be lazy, and stop after it finds the first two elements that you are looking for. I haven't compared it with a simple loop, maybe just using a loop would be both simpler and faster.
Any solution you'd find will iterate over the list even if the loop is 'hidden' inside a function.
The solution to your problem depends on the assumptions you can make, e.g. is the list sorted?
For the general case I'd iterate over the list starting at the end:
def find(condition, k, v):
    indices = []
    for i, var in enumerate(reversed(v)):
        if condition(var):
            indices.append(len(v) - i - 1)
            if len(indices) >= k:
                break
    return indices
The condition should then be passed as a function, so you can use a lambda:
v = range(10)
find(lambda x: x < 5, 3, v)
will output
[4, 3, 2]
I'm not aware of a "good" numpy solution to short-circuiting.
The most principled way to go would be to use something like Cython, which, to oversimplify brutally, adds fast loops to Python. Once you have that set up, it would be easy.
If you do not want to do that you'd have to employ some gymnastics like:
import numpy as np
def find_last_k(vector, condition, k, minchunk=32):
    if k > minchunk:
        minchunk = k
    l, r = vector.size - minchunk, vector.size
    found = []
    n_found = 0
    while r > 0:
        if l <= 0:
            l = 0
        found.append(l + np.where(condition(vector[l:r]))[0])
        n_found += len(found[-1])
        if n_found >= k:
            break
        l, r = 3 * l - 2 * r, l
    return np.concatenate(found[::-1])[-k:]
This tries balancing loop overhead and numpy "inflexibility" by searching in chunks, which we grow exponentially until enough hits are found.
Not exactly pretty, though.
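A quick usage sketch, assuming the find_last_k helper above and the vector from the question:
import numpy as np

v = np.arange(10)
# last two indices with v < 5, analogous to Matlab's find(v < 5, 2, 'last')
print(find_last_k(v, lambda x: x < 5, 2))   # [3 4]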
This is what I've found that seems to do the job for the example described (argwhere returns all indices that meet the criterion, and then we take the last two of these as a numpy array):
ind = np.argwhere(v<5)
ind[-2:]
This searches through the entire array so is not optimal but is easy to code.

python numpy - optimize chisq function by removing explicit python loop?

I'm trying to evaluate a chi squared function, i.e. compare an arbitrary (blackbox) function to a numpy vector array of data. At the moment I'm looping over the array in python but something like this is very slow:
n = len(array)
sigma = 1.0
chisq = 0.0
for i in range(n):
    data = array[i]
    model = f(i, a, b, c)
    chisq += 0.5*((data-model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec = numpy.zeros(len(data))
for i in range(n):
    model_vec[i] = f(i, a, b, c)
return numpy.sum(((array-model_vec)/sigma)**2.0)
?
Thanks!
You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
Notes: The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b

def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a, b, x)[i] if you only want the ith element; but for operations on the entire array, use g(a, b, x) and it will be much faster.
model_vec = g(a, b, x)
return numpy.sum(((array-model_vec)/sigma)**2.0)
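Putting the pieces together, a minimal end-to-end sketch; the linear model in g and the sample data are only placeholders for the real blackbox function and measurements:
import numpy as np

def g(a, b, x):
    # vectorized model: operates on the whole array of sample positions at once
    x = np.asarray(x)
    return a * x + b

array = np.array([1.2, 2.9, 5.1, 6.8])   # placeholder data
a, b, sigma = 2.0, 1.0, 1.0
x = np.arange(len(array))                # the former loop index i, as an array

model_vec = g(a, b, x)
chisq = 0.5 * np.sum(((array - model_vec) / sigma) ** 2.0)
print(chisq)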
It seems that your code is slow because what is executing in the loop is slow (your model generation). Turning this into a one-liner won't speed things up. If you have access to a modern computer with more than one CPU you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool

if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4)                  # start 4 worker processes
    inputs = [(i, a, b, c) for i in range(n)]
    model_array = pool.starmap(f, inputs)     # evaluate the model for every i in parallel
    chisq = 0.0
    for i in range(n):
        data = array[i]
        model = model_array[i]
        chisq += 0.5*((data-model)/sigma)**2.0

python printing a generator list after vectorization?

I am new with vectorization and generators. So far I have created the following function:
import numpy as np
def ismember(a, b):
    for i in a:
        if len(np.where(b==i)[0]) == 0:
            lv_var = 0
        else:
            lv_var = np.int(np.where(b==i)[0])
        yield lv_var
vect = np.vectorize(ismember)
A = np.array(xrange(700000))
B = np.array(xrange(700000))
lv_result = vect(A,B)
When I try to cast lv_result as a list or loop through the resulting numpy array I get a list of generator objects. I need to somehow get the actual result. How do I print the actual results from this function ? .next() on generator doesn't seem to do the job.
Could someone tell me what is that I am doing wrong or how could I reconfigure the code to achieve the end goal?
---------------------------------------------------
OK so I understand the vectorize part now (thanks Viet Nguyen for the example).
I was also able to print the generator object results. The code has been modified. Please see below.
For the generator part:
What I am trying to do is mimic a MATLAB function called ismember (the one with the signature [Lia,Locb] = ismember(A,B)); I am only trying to get the Locb part.
From Matlab: Locb contains the lowest index in B for each value in A that is a member of B. The output array, Locb, contains 0 wherever A is not a member of B.
One of the main problems is that I need to be able to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and going through its values doesn't seem to be any better performance-wise.
To print the generator, I have created the function f() below.
import numpy as np
def ismember(a, b):
    for i in a:
        index = np.where(b==i)[0]
        if len(index) == 0:
            yield 0
        else:
            yield index

def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = gen_obj.next()
    return my_array
A = np.arange(700000)
B = np.arange(700000)
gen_obj = ismember(A,B)
f(A, gen_obj)
print 'done'
Note: if we were to try the above code with smaller arrays, let's say
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
the result will be the array [4 0 0 4 3].
Just like matlabs function: the goal is to get the lowest index in B for each value in A that is a member of B. The output array, Locb, contains 0 wherever A is not a member of B.
Numpy's intersection function doesn't help me achieve the goal. Also, the returned array needs to be the same size as array A.
So far this process takes forever (for arrays of 700k elements). Unfortunately I haven't been able to find the best solution yet. Any input on how I could reconfigure the code to achieve the end goal, with the best performance, will be much appreciated.
Optimization Problem solved in:
python-run-generator-using-multiple-cores-for-optimization
I believe you've misunderstood the inputs to a numpy.vectorize function. The "vectorized" function operates on the arrays on a per-element basis (see numpy.vectorize reference). Your function ismember seems to presume that the inputs a and b are arrays. Instead, think of the function as something you would use with built-in map().
import numpy as np

def mask(a, b):
    return 1 if a == b else 0

a = np.array([1, 2, 3, 4])
b = np.array([1, 3, 4, 5])
maskv = np.vectorize(mask)
maskv(a, b)
# array([1, 0, 0, 0])
Also, if I'm understanding your intention correctly, NumPy comes with an intersection function.
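As an aside (not part of the original answer), here is one possible vectorized sketch of the Locb behaviour described in the question, using a stable argsort of B plus np.searchsorted; note that, as in the question's spec, 0 means both "not a member" and "found at index 0":
import numpy as np

def ismember_locb(a, b):
    order = np.argsort(b, kind='mergesort')   # stable sort keeps the lowest index first among duplicates
    b_sorted = b[order]
    pos = np.searchsorted(b_sorted, a)        # leftmost candidate position for each value of a
    pos = np.clip(pos, 0, len(b_sorted) - 1)  # guard values larger than max(b)
    hit = b_sorted[pos] == a                  # True where the value really occurs in b
    return np.where(hit, order[pos], 0)       # lowest index in b, or 0 if absent

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
print(ismember_locb(A, B))   # [4 0 0 4 3]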

Python append performance

I'm having some performance problems with 'append' in Python.
I'm writing an algorithm that checks if there are two overlapping circles in a (large) set of circles.
I start by putting the extreme points of the circles (x_i-R_i & x_i+R_i) in a list and then sorting the list.
class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius
In between I generate N random circles and put them in the 'circles' list.
"""
Makes a list with all the extreme points of the circles.
Format = [Extreme, left/right ~ 0/1 extreme, index]
Seperate function for performance reason, python handles local variables faster.
Garbage collect is temporarily disabled since a bug in Python makes list.append run in O(n) time instead of O(1)
"""
def makeList():
"""gc.disable()"""
list = []
append = list.append
for circle in circles:
append([circle.m[0]-circle.r, 0, circles.index(circle)])
append([circle.m[0] + circle.r, 1, circles.index(circle)])
"""gc.enable()"""
return list
When running this with 50k circles it takes over 75 seconds to generate the list. As you might see in the comments I wrote, I disabled garbage collection, put it in a separate function, and used
append = list.append
append(foo)
instead of just
list.append(foo)
I disabled gc since, after some searching, it seems there's a bug in Python causing append to run in O(n) instead of O(1) time.
So is this the fastest way, or is there a way to make this run faster?
Any help is greatly appreciated.
Instead of
for circle in circles:
    ... circles.index(circle) ...
use
for i, circle in enumerate(circles):
    ... i ...
This could decrease your O(n^2) to O(n).
Your whole makeList could be written as:
sum([[[circle.m[0]-circle.r, 0, i], [circle.m[0]+circle.r, 1, i]] for i, circle in enumerate(circles)], [])
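Applied to the question's makeList (assuming the same global circles list), the enumerate-based version might look like this; it keeps the interleaved left/right ordering of the original:
def makeList():
    extremes = []
    append = extremes.append
    for i, circle in enumerate(circles):
        append([circle.m[0] - circle.r, 0, i])   # left extreme
        append([circle.m[0] + circle.r, 1, i])   # right extreme
    return extremes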
Your performance problem is not in the append() method, but in your use of circles.index(), which makes the whole thing O(n^2).
A further (comparatively minor) improvement is to use a list comprehension instead of list.append():
mylist = [[circle.m[0] - circle.r, 0, i]
          for i, circle in enumerate(circles)]
mylist += [[circle.m[0] + circle.r, 1, i]
           for i, circle in enumerate(circles)]
Note that this will give the data in a different order (which should not matter as you are planning to sort it anyway).
I've just tried several tests to improve the speed of "append". They will definitely be helpful for you.
Using Python
Using list(map(lambda ...)) - known to be a bit faster than for + append
Using Cython
Using Numba - jit
CODE CONTENT: take the numbers from 0 to 9,999,999, square them, and put them into a new list using append.
Using Python
import timeit

st1 = timeit.default_timer()

def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result

f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.7 s
Using list(map(lambda ...))
import timeit
st1 = timeit.default_timer()
result = list(map(lambda x : x**2 , range(0,10000000) ))
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.6 s
Using Cython
The code in a .pyx file:
def f1():
    cdef double i
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
and I compiled it and ran it from a .py file.
In the .py file:
import timeit
from c1 import *
st1 = timeit.default_timer()
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 1.6 s
Using Numba - jit
import timeit
from numba import jit

st1 = timeit.default_timer()

@jit(nopython=True, cache=True)
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result

f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 0.57 s
CONCLUSION:
As you mentioned above, changing the simple append form boosted the speed a bit, and using Cython is much faster than plain Python. However, it turned out that using Numba is the best choice in terms of speed improvement for 'for + append'!
Try using deque from the collections package to append large numbers of rows of data without performance degrading, and then convert the deque back to a DataFrame using a list comprehension.
Sample Case:
from collections import deque
import pandas as pd

d = deque()
for row in rows:
    d.append([value_x, value_y])

df = pd.DataFrame({'column_x': [item[0] for item in d],
                   'column_y': [item[1] for item in d]})
This is a real time-saver.
If performance were an issue, I would avoid using append. Instead, preallocate an array and then fill it up. I would also avoid using index to find position within the list "circles". Here's a rewrite. It's not compact, but I'll bet it's fast because of the unrolled loop.
def makeList():
    """gc.disable()"""
    mylist = 6*len(circles)*[None]
    for i in range(len(circles)):
        j = 6*i
        mylist[j] = circles[i].m[0] - circles[i].r
        mylist[j+1] = 0
        mylist[j+2] = i
        mylist[j+3] = circles[i].m[0] + circles[i].r
        mylist[j+4] = 1
        mylist[j+5] = i
    return mylist
