python numpy - optimize chisq function by removing explicit python loop?

I'm trying to evaluate a chi squared function, i.e. compare an arbitrary (blackbox) function to a numpy vector array of data. At the moment I'm looping over the array in python but something like this is very slow:
n=len(array)
sigma=1.0
chisq=0.0
for i in range(n):
    data = array[i]
    model = f(i,a,b,c)
    chisq += 0.5*((data-model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec=numpy.zeros(len(array))
for i in range(n):
    model_vec[i] = f(i,a,b,c)
return numpy.sum(((array-model_vec)/sigma)**2.0)
?
Thanks!

You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
Notes The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b
def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a,b,x)[i] if you only want the ith, but for operations on the entire function, use g(a, b, x) and it will be much faster.
model_vec = g(a, b, x)
return numpy.sum(((array-model_vec)/sigma)**2.0)
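As a minimal sketch of the whole calculation, assume a hypothetical quadratic model (the real f in the question is a black box, so this is only an illustration). The loop index i becomes np.arange(n), and the sum is a single NumPy call. The names model_vectorized, chisq and chisq_loop are made up for this example; the 0.5 factor follows the loop in the question.

```python
import numpy as np

def model_vectorized(idx, a, b, c):
    # Hypothetical model: a quadratic in the index.
    # Works for a scalar index or a whole array of indices.
    return a * idx**2 + b * idx + c

def chisq(array, a, b, c, sigma=1.0):
    idx = np.arange(len(array))                  # replaces `for i in range(n)`
    model_vec = model_vectorized(idx, a, b, c)   # whole model in one call
    return 0.5 * np.sum(((array - model_vec) / sigma) ** 2)

def chisq_loop(array, a, b, c, sigma=1.0):
    # Reference implementation with the explicit Python loop, for comparison.
    total = 0.0
    for i in range(len(array)):
        total += 0.5 * ((array[i] - model_vectorized(i, a, b, c)) / sigma) ** 2
    return total

data = np.array([1.5, 2.5, 5.0, 11.5])
print(chisq(data, 1.0, 0.5, 1.0), chisq_loop(data, 1.0, 0.5, 1.0))
```

Both functions should print the same value; the vectorized one only pays off if the model itself can be written with array operations.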

It seems that your code is slow because what is executing in the loop is slow (your model generation). Turning this into a one-liner won't speed things up. If you have access to a modern computer with more than one CPU, you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool
if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = [(i,a,b,c) for i in range(n)]
    # model must be a module-level function that accepts one (i,a,b,c) tuple
    model_array = pool.map(model, inputs)
    for i in range(n):
        data = array[i]
        model_i = model_array[i]  # renamed so the function `model` is not shadowed
        chisq += 0.5*((data-model_i)/sigma)**2.0

Related

Python Multiprocessing, Best practice for mapping results to scientific numpy arrays

I don't really understand how to handle multiprocessing in Python when mapping the function results to multidimensional arrays. I provide a simple example of how I calculate it serially. The parallel computing does not work. Often, I pass a lot of arguments to a function, so this would be a very annoying way of doing it. Is there a "better way", than creating all i,j-pairs with a reshaped meshgrid?
import concurrent.futures
import numpy as np
def complex_function(i,j):
    # this is a computationally intense function
    return i,j,i+j
all_i = np.arange(3)
all_j = np.arange(6)
#%% serial
solution = np.empty((len(all_i),len(all_j)),dtype=float)
for i in range(len(all_i)):
    for j in range(len(all_j)):
        solution[i,j] = complex_function(all_i[i],all_j[j])[2]
#%% parallel
solution = np.empty((len(all_i),len(all_j)),dtype=float)
I,J = np.meshgrid(all_i, all_j, sparse=False, indexing='ij')
I = I.reshape(-1)
J = J.reshape(-1)
with concurrent.futures.ProcessPoolExecutor() as executor:
    for i, j, result in executor.map(complex_function, I, J):
        solution[i,j] = result
Okay, now I want to know whether I can use nested functions like
def dummy_function(i,j):
    result = complex_function(i,j)
    return result
and then call dummy_function(i,j) with the executor.

Is there a way to get every element of a list without using loops?

I found this task in a book of my prof:
def f(x):
    return log(1 + exp(x))
def problem(M: List):
    return np.array([f(x) for x in M])
How do I implement a solution?

Numpy is all about performing operations on entire arrays. Your professor is expecting you to use that functionality.
Start by converting your list M into array z:
z = np.array(M)
Now you can do elementwise operations like exp and log:
e = np.exp(z)
f = 1 + e
g = np.log(f)
The functions np.exp and np.log are applied to each element of an array. If the input is not an array, it will be converted into one.
Operations like 1 + e work on an entire array as well, in this case using the magic of broadcasting. Since 1 is a scalar, it can be unambiguously expanded to the same shape as e, and added as if by np.add.
Normally, the sequence of operations can be compactified into a single line, similarly to what you did in your initial attempt. You can reduce the number of operations slightly by using np.log1p:
def f(x):
    return np.log1p(np.exp(x))
Notice that I did not convert x to an array first since np.exp will do that for you.
A fundamental problem with this naive approach is that np.exp will overflow for values that we would expect to get reasonable results. This can be solved using the technique in this answer:
def f(x):
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)
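To see the overflow issue concretely, here is a small sketch comparing the two versions on extreme inputs. The names softplus_naive and softplus_stable are illustrative, not from the original answer; np.logaddexp(0, x) is included as another way to evaluate log(1 + exp(x)) with NumPy.

```python
import numpy as np

def softplus_naive(x):
    # Direct formula: np.exp overflows to inf for large x.
    return np.log1p(np.exp(x))

def softplus_stable(x):
    # exp is only ever evaluated on non-positive numbers, so it cannot overflow.
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

x = np.array([-1000.0, 0.0, 1000.0])

with np.errstate(over="ignore"):       # silence the overflow warning for the demo
    naive = softplus_naive(x)          # last entry overflows to inf

stable = softplus_stable(x)
builtin = np.logaddexp(0, x)           # log(exp(0) + exp(x)), same quantity

print(naive)
print(stable)
print(builtin)
```

For moderate inputs the three agree; only the naive version breaks down once exp(x) no longer fits in a float64.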

Transforming a divide and conquer recursive algorithm into an iterative version

I would like to transform a recursive algorithm on an array into an iterative function. It is not a tail recursive algorithm and has two recursive calls followed by some operation.
The algorithm is a divide-and-conquer algorithm where at each step the array is split into two subarrays and some function f is applied to the two previous outcomes. In practice f is complicated, so the iterative algorithm should still call f; for a minimal working example I have used a simple addition.
Below is a minimal working example of the recursive program in python.
import numpy as np
def f(left,right):
    #In practice some complicated function of left and right
    value=left+right
    return value
def recursive(w,i,j):
    if i==j:
        #termination condition when the subarray has size 1
        return w[i]
    else:
        k=(j-i)//2+i
        #split the array into two subarrays between indices i,k and k+1,j
        left=recursive(w,i,k)
        right=recursive(w,k+1,j)
        return f(left,right)
a=np.random.rand(10)
print(recursive(a,0,a.shape[0]-1))
Now if I want to write this iteratively I realize that I need a stack to store intermediate results, and that at each step I need to apply f to the two elements on the top of the stack. I am just not sure how to construct the order in which I put elements in the stack without recursion. Here is an attempt at a solution which is certainly not optimal since it seems there should be a way to remove the first loop and use only one stack:
def iterative(w):
    stack=[]
    stack2=[]
    stack3=[]
    i=0
    j=w.shape[0]-1
    stack.append((i,j))
    while (i,j)!=(w.shape[0]-1,w.shape[0]-1):
        (i,j)=stack.pop()
        stack2.append((i,j))
        if i==j:
            pass
        else:
            k=int(np.floor((j-i)/2)+i)
            stack.append((k+1,j))
            stack.append((i,k))
    while len(stack2)>0:
        (i,j)=stack2.pop()
        if i==j:
            stack3.append(w[i])
        else:
            #stack2 is replayed in reverse, so the left result is on top
            left=stack3.pop()
            right=stack3.pop()
            stack3.append(f(left,right))
    return stack3.pop()
Edit : The real problem I am interested in has as input an array of tensors of different sizes, and the operation f solves a linear program involving these tensors and outputs a new tensor. I cannot iterate simply over the initial array since the size of the output of f grows exponentially in this case. This is why I use this divide and conquer approach, which reduces this size. The recursive program works fine, but slows down dramatically for large size, possibly due to the frames that python opens and keeps track of.

Below I transformed the program to use a continuation (then) and a trampoline (run/recur). It evolves a linear iterative process and it will not overflow the stack. If you're not running into a stack overflow issue, this won't do much to help your specific problem, but it can teach you how to flatten branching computations.
This process of converting a normal function to continuation passing style can be a mechanical one. If you squint your eyes a little bit, you'll see how the program has most of the same elements as yours. Inline comments show code side-by-side -
import numpy as np
def identity (x):
    return x
def recur (*values):
    return (recur, values)
def run (f):
    acc = f ()
    while type (acc) is tuple and acc [0] is recur:
        acc = f (*acc [1])
    return acc
def myfunc (a):
    # def recursive(w,i,j)
    def loop (w = a, i = 0, j = len(a)-1, then = identity):
        if i == j:                    # same
            return then (w[i])        # wrap in `then`
        else:                         # same
            k = (j - i) // 2 + i      # same
            return recur (            # left=recursive(w,i,k)
                w, i, k,
                lambda left:
                    recur (           # right=recursive(w,k+1,j)
                        w, k + 1, j,
                        lambda right:
                            then (f (left, right))  # wrap in `then`
                    )
            )
    return run (loop)
def f (a, b):
    return a + b                      # same
a = np.random.rand(10)                # same
print(a, myfunc(a))                   # recursive(a, 0, a.shape[0]-1)
# [0.5732646 0.88264091 0.37519826 0.3530782 0.83281033 0.50063843 0.59621896 0.50165139 0.05551734 0.53719382]
# 5.208212213881435
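For comparison, here is a sketch of a plain explicit-stack version that uses a single loop. It is not the approach from the answer above, just one common way to flatten a post-order traversal: each work item carries a flag saying whether its children have already been scheduled. The names iterative_single_loop, work and results are illustrative. The combining function is passed in as f, and the demo uses tuple nesting instead of addition so that the shape of the call tree is visible.

```python
def recursive(w, i, j, f):
    # Reference: the recursive version from the question, with f as a parameter.
    if i == j:
        return w[i]
    k = (j - i) // 2 + i
    return f(recursive(w, i, k, f), recursive(w, k + 1, j, f))

def iterative_single_loop(w, f):
    if len(w) == 0:
        raise ValueError("w must contain at least one element")
    work = [(0, len(w) - 1, False)]   # (i, j, children_scheduled)
    results = []                      # finished sub-results, like return values
    while work:
        i, j, children_scheduled = work.pop()
        if i == j:
            results.append(w[i])
        elif not children_scheduled:
            k = (j - i) // 2 + i
            work.append((i, j, True))        # revisit after both children
            work.append((k + 1, j, False))   # right child, processed second
            work.append((i, k, False))       # left child, processed first
        else:
            right = results.pop()            # right finished last, so it is on top
            left = results.pop()
            results.append(f(left, right))
    return results.pop()

def pair(left, right):
    # Order-sensitive stand-in for f: records the tree structure.
    return (left, right)

letters = list("abcde")
print(recursive(letters, 0, len(letters) - 1, pair))
print(iterative_single_loop(letters, pair))
```

Because f is called on exactly the same (left, right) pairs as in the recursion, this should be a drop-in replacement even when f is not commutative or associative.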

Python, parallelization with joblib: Delayed with multiple arguments

I am using something similar to the following to parallelize a for loop over two matrices
from joblib import Parallel, delayed
import numpy
def processInput(i,j):
    for k in range(len(i)):
        i[k] = 1
    for t in range(len(b)):
        j[t] = 0
    return i,j
a = numpy.eye(3)
b = numpy.eye(3)
num_cores = 2
(a,b) = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
but I'm getting the following error: Too many values to unpack (expected 2)
Is there a way to return 2 values with delayed? Or what solution would you propose?
Also, a bit off-topic: is there a more compact way, like the following (which doesn't actually modify anything), to process the matrices?
from joblib import Parallel, delayed
def processInput(i,j):
    for k in i:
        k = 1
    for t in b:
        t = 0
    return i,j
I would like to avoid the use of has_shareable_memory anyway, to avoid possible bad interactions in the actual script and lower performance(?)

Probably too late, but as an answer to the first part of your question:
Just return a tuple in your delayed function.
return (i,j)
And for the variable holding the output of all your delayed functions
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
Now results is a list of tuples each holding some (i,j) and you can just iterate through results.
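If the goal is to end up with two matrices again, the list of tuples can be transposed with zip(*results). The sketch below builds results with a plain list comprehension as a stand-in for the Parallel(...) call so that it runs without joblib; the shape of the data (one (i, j) tuple per row pair, in input order) is the same. The names process_input, a_new and b_new are illustrative.

```python
import numpy as np

def process_input(i, j):
    # Return new rows instead of modifying the inputs in place:
    # worker processes may operate on copies, so in-place edits can be lost.
    return np.ones_like(i), np.zeros_like(j)

a = np.eye(3)
b = np.eye(3)

# Stand-in for:
#   results = Parallel(n_jobs=num_cores)(
#       delayed(process_input)(i, j) for i, j in zip(a, b))
results = [process_input(i, j) for i, j in zip(a, b)]

# results has one (i, j) tuple per row pair, so unpacking it into two names
# fails when there are three rows. Transpose it first:
rows_a, rows_b = zip(*results)
a_new = np.array(rows_a)
b_new = np.array(rows_b)

print(a_new)
print(b_new)
```

This also explains the original error: the three-element result list was being unpacked directly into the two names (a, b).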

NumPy tensordot MemoryError

I have two matrices -- A is 3033x3033, and X is 3033x20. I am running the following lines (as suggested in the answer to another question I asked):
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
return np.tensordot(A.reshape(n, n, -1) * c, c, axes=[(0,1),(0,1)])
On the final line, Python simply stops and says "MemoryError". How can I get around this, either by changing some setting in Python or performing this operation in a more memory-efficient way?

Here is a function that does the calculation without any for loops and without any large temporary array. See the related question for a longer answer, complete with a test script.
def fbest(A, X):
    KA_best = np.tensordot(A.sum(1)[:,None] * X, X, axes=[(0,), (0,)])
    KA_best += np.tensordot(A.sum(0)[:,None] * X, X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(np.dot(A, X), X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(X, np.dot(A, X), axes=[(0,), (0,)])
    return KA_best
I profiled the code with your size arrays:
I love np.einsum by the way. It is a great place to start when speeding up array operations by removing for loops. You can do SOOOO much with one call to np.einsum.
The advantage of np.tensordot is that it links to whatever fast numerical library you have installed (e.g. MKL). So, tensordot will run faster and in parallel when you have the right libraries installed.
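The four tensordot terms come from expanding the product (x_i - x_j)(x_i - x_j)^T weighted by A[i, j]: two terms with row and column sums of A, and two cross terms through A·X. As a sanity check, the sketch below compares that expansion against the direct formula on small random inputs. The sizes, the seed and the helper name direct are assumptions chosen for the test, not part of the original answer.

```python
import numpy as np

def direct(A, X):
    # Straightforward version: builds the full (n, n, d) difference array c.
    n, d = X.shape
    c = X.reshape(n, 1, d) - X.reshape(1, n, d)
    return np.einsum('ij,ijk,ijl->kl', A, c, c)

def fbest(A, X):
    # Expanded version: only (n, d) and (d, d) temporaries.
    AX = np.dot(A, X)
    K = np.tensordot(A.sum(1)[:, None] * X, X, axes=[(0,), (0,)])
    K += np.tensordot(A.sum(0)[:, None] * X, X, axes=[(0,), (0,)])
    K -= np.tensordot(AX, X, axes=[(0,), (0,)])
    K -= np.tensordot(X, AX, axes=[(0,), (0,)])
    return K

rng = np.random.default_rng(0)
A = rng.random((50, 50))     # deliberately not symmetric
X = rng.random((50, 4))

print(np.allclose(direct(A, X), fbest(A, X)))
```

Using a non-symmetric A in the check matters, because the two cross terms only coincide when A is symmetric.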

If you replace the final line with
return np.einsum('ij,ijk,ijl->kl',A,c,c)
you avoid creating the A.reshape(n, n, -1) * c (3033 by 3033 by 20) intermediate that I think is your main problem.
My impression is that the version I give is probably slower (for cases where it doesn't run out of memory), but I haven't rigorously timed it.
It's possible you could go further and avoid creating c, but I can't immediately see how to do it. It'd be a case of writing the whole thing in terms of sums over matrix indices and seeing what it simplifies to.
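A quick back-of-the-envelope calculation shows why that intermediate hurts. The sketch assumes 8-byte float64 elements and the sizes given in the question (A is 3033x3033, X is 3033x20); the variable names are illustrative.

```python
n, d = 3033, 20
bytes_per_float64 = 8            # assumption: default NumPy float dtype

# One (n, n, d) array, e.g. c or A.reshape(n, n, -1) * c
intermediate_bytes = n * n * d * bytes_per_float64

print(intermediate_bytes)            # total bytes for one such array
print(intermediate_bytes / 1e9)      # the same in gigabytes
```

Each array of that shape takes roughly 1.5 GB, and the original expression needs both c and the product with A alive at once, so avoiding even one of them makes a noticeable difference.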

You can use two nested loops iterating along the last dimension of X. That last dimension is only 20, so hopefully this would still be efficient enough and, more importantly, leave a minimal memory footprint. Here's the implementation -
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
out = np.empty((d,d)) # d is a small number: 20
for i in range(d):
    for j in range(d):
        out[i,j] = (A*c[:,:,i]*(c[:,:,j])).sum()
return out
You can replace the assignment inside the loop with np.einsum -
out[i,j] = np.einsum('ij->',A*c[:,:,i]*c[:,:,j])
