numpy matrix population very slow - python

I am writing a program to perform numerical calculations with a Hessian matrix. The Hessian matrix is 500 x 500 and I need to populate it hundreds of times over. I am populating it with two for loops each time. My problem is that this is prohibitively slow. Here is my code:
#create these outside function
hess = np.empty([500,500])
b = np.empty([500])

def hess_h(x):
    #create these first so they aren't calculated every iteration
    for k in range(500):
        b[k] = (1-np.dot(a[k],x))**2
    for i in range(500):
        for j in range(500):
            if i == j:
                #these are values along diagonal
                hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
                            - float(a[i,j]*sum(a[i]))/b[i]
            #the matrix is symmetric so only calculate upper triangle
            elif j > i:
                hess[i,j] = -float(a[i,j]*sum(a[i]))/b[i]
            elif i > j:
                hess[i,j] = hess[j,i]
    return hess
I measured that hess_h(np.zeros(500)) takes about 10.2 seconds to run. That is too long, and I need to figure out another way.

Look for patterns in your calculation, in particular things that you can calculate over the whole range of i and j.
I see, for example, a diagonal where i == j:
hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
            - float(a[i,j]*sum(a[i]))/b[i]
Can you change that to a one-time expression, something like:
(2*(1-x)**2 + 4*x**2)/(1-x**2)**2 - np.diagonal(a)*a.sum(axis=1)/b
The other pieces work with the upper and lower triangular elements. There are functions like np.triu and np.triu_indices that give you those elements or their indices.
I'm trying to give you tools and thought processes for solving this with a few numpy vectorized operations, instead of iterating over all elements of i and j.
Looks like
-a[i,j]*sum(a[i])/b[i]
is used for every element. I assume a is a (500,500) array. Can you use
-a*a.sum(axis=?)/b
b can be 'vectorized'
b[k] = (1-np.dot(a[k],x))**2
with something like:
(1 - np.dot(a, x))**2
or
(1 - np.einsum('kj,ji',a,x))**2
test the details on a smaller a.
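Putting those hints together, a rough vectorized sketch might look like the following (it assumes a is the (500, 500) coefficient array from the question and keeps the same potential division by zero at x[i] == 1 as the loop version):
import numpy as np

def hess_h_vec(x):
    # vectorized form of the first loop: b[k] = (1 - np.dot(a[k], x))**2
    b = (1 - a.dot(x))**2
    row_sums = a.sum(axis=1)              # sum(a[i]) for every row i
    off = -a * (row_sums / b)[:, None]    # -a[i,j]*sum(a[i])/b[i] for all i, j
    upper = np.triu(off, 1)               # keep only the upper triangle ...
    hess = upper + upper.T                # ... and mirror it, as in the loop version
    diag = (2*(1 - x)**2 + 4*x**2)/(1 - x**2)**2 - np.diagonal(a)*row_sums/b
    np.fill_diagonal(hess, diag)
    return hess
As far as I can tell this reproduces the loop version, but as suggested above, test it against the original on a smaller a before trusting it.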

Related

How to speed up this DP function in python with vectorization

So I have this definition here,
DP[i,j] = f[i,j] + min(DP[i-1, j-1], DP[i-1, j], DP[i-1, j+1])
which defines the minimum accrued cost to go from the top of the NxM matrix to the bottom of the matrix. Each cell in f represents a value/cost (1.2, 0, 10, etc.) to travel to that cell from another cell.
The matrix may be large (1500x1500; it's the gradient map of an image), and the DP algorithm I programmed comes out to about a second per run for my matrices. This computation needs to run hundreds of times per execution, so the total program run time comes out to several minutes. This loop is about 99% of my bottleneck, so I am trying to optimize it with Python/NumPy vectorization methods. I only have access to NumPy and SciPy.
Note: I hardly program in Python at all, so the solution may just be obvious, idk.
First attempt: just the straightforward loop; time here is about 2-2.5 seconds per run.
DP = f.copy()
for r in range(2, len(DP) - 1):  # Start at row 2 since row one doesn't change
    for c in range(1, len(DP[0]) - 1):
        DP[r][c] += min(DP[r - 1, c-1:c+2])
Second attempt: I tried to leverage NumPy's "fromiter" to calculate entire rows at a time rather than column by column; time here is about 1-1.5 seconds per run. My goal is to get this at least an order of magnitude faster, but I am stumped on how else I can optimize this.
DP = f.copy()
for r in range(2, len(DP) - 1):
    def foo(arr):
        idx, val = arr
        if idx == 0 or idx == len(DP[0]) - 1:
            return np.inf
        return val + min(DP[r - 1, idx - 1], DP[r - 1, idx], DP[r - 1, idx + 1])
    DP[r, :] = np.fromiter(map(foo, enumerate(DP[r, :])))
As hpaulj stated, since your problem is inherently sequential it will be hard to fully vectorize, although it seems possible (every cell is ultimately determined by values in row r = 2; what changes for each of the following rows is the number of triplets from that row being considered), so perhaps you can find a smart way to do it!
That being said, a quick and half-vectorized solution is to use the neat way of performing sliding windows with fancy indexing proposed by user42541, so we replace the inner loop with a vectorized call:
indexer = np.arange(3)[:,None] + np.arange(DP.shape[1] - 2)[None,:]
for r in range(2, DP.shape[0] - 1):
    DP[r,1:-1] += np.min(DP[r-1,indexer], axis = 0)
This results in a speed-up relative to your double-loop method (your vectorized solution didn't work on my machine) of about two orders of magnitude for a 1500x1500 array of integers.
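If your NumPy is recent enough (1.20 or newer), the same sliding-window minimum can also be written with sliding_window_view instead of building the indexer array by hand; a small sketch under that assumption:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

DP = f.copy()
for r in range(2, DP.shape[0] - 1):
    # minimum over each length-3 window of the previous row
    DP[r, 1:-1] += sliding_window_view(DP[r - 1], 3).min(axis=1)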

Analyzing the complexity matrix path-finding

Recently in my homework, I was assigned to solve the following problem:
Given a matrix of order nxn of zeros and ones, find the number of paths from [0,0] to [n-1,n-1] that pass only through zeros (the paths are not necessarily disjoint), where you may only move down or to the right, never up or left. Return a matrix of the same order where the [i,j] entry is the number of such paths in the original matrix that go through [i,j]; the solution has to be recursive.
My solution in python:
def find_zero_paths(M):
    n,m = len(M),len(M[0])
    dict = {}
    for i in range(n):
        for j in range(m):
            M_top,M_bot = blocks(M,i,j)
            X,Y = find_num_paths(M_top),find_num_paths(M_bot)
            dict[(i,j)] = X*Y
    L = [[dict[(i,j)] for j in range(m)] for i in range(n)]
    return L[0][0],L

def blocks(M,k,l):
    n,m = len(M),len(M[0])
    assert k<n and l<m
    M_top = [[M[i][j] for i in range(k+1)] for j in range(l+1)]
    M_bot = [[M[i][j] for i in range(k,n)] for j in range(l,m)]
    return [M_top,M_bot]

def find_num_paths(M):
    dict = {(1, 1): 1}
    X = find_num_mem(M, dict)
    return X

def find_num_mem(M,dict):
    n, m = len(M), len(M[0])
    if M[n-1][m-1] != 0:
        return 0
    elif (n,m) in dict:
        return dict[(n,m)]
    elif n == 1 and m > 1:
        new_M = [M[0][:m-1]]
        X = find_num_mem(new_M,dict)
        dict[(n,m-1)] = X
        return X
    elif m == 1 and n > 1:
        new_M = M[:n-1]
        X = find_num_mem(new_M, dict)
        dict[(n-1,m)] = X
        return X
    new_M1 = M[:n-1]
    new_M2 = [M[i][:m-1] for i in range(n)]
    X,Y = find_num_mem(new_M1, dict),find_num_mem(new_M2, dict)
    dict[(n-1,m)],dict[(n,m-1)] = X,Y
    return X+Y
My code is based on the idea that the number of paths that go through [i,j] in the original matrix is equal to the product of the number of paths from [0,0] to [i,j] and the number of paths from [i,j] to [n-1,n-1]. Another idea is that the number of paths from [0,0] to [i,j] is the sum of the number of paths from [0,0] to [i-1,j] and from [0,0] to [i,j-1]. Hence I decided to use a dictionary whose keys correspond to submatrices of the form [[M[i][j] for j in range(k)] for i in range(l)] or [[M[i][j] for j in range(k+1,n)] for i in range(l+1,n)] for some 0<=k,l<=n-1, where M is the original matrix, and whose values are the number of paths from the top of that submatrix to the bottom. After analyzing the complexity of my code I arrived at the conclusion that it is O(n^6).
Now, my instructor says this code is exponential (for find_zero_paths); however, I disagree.
The recursion tree (for find_num_paths) has size bounded by the number of submatrices of the form above, which is O(n^2). Also, each time we add a new matrix to the dictionary we do it in polynomial time (only slicing lists), so the total complexity is polynomial (poly * poly = poly). Also, the function 'blocks' runs in polynomial time, and hence 'find_zero_paths' runs in polynomial time (two lists of polynomial size times a function which runs in polynomial time), so all in all the code runs in polynomial time.
My question: is the code polynomial (even if my O(n^6) bound is off), or is it exponential and I am missing something?
Unfortunately, your instructor is right.
There is a lot to unpack here.
Before we start, a quick note: please don't use dict as a variable name. It hurts ^^. dict is the name of Python's built-in dictionary constructor, and shadowing it with your own variable is bad practice.
First, your approach of counting M_top * M_bot would be fine if you only had to compute one cell of the matrix. The way you go about it, you are unnecessarily recomputing some blocks over and over again; that is why I wondered about the recursion requirement. I would use dynamic programming for this: one pass from start to end, one pass from end to start, then compute the products and be done with it. No need for O(n^6) worth of separate computations. Since you have to use recursion, I would recommend caching the partial results and reusing them wherever possible.
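For illustration, here is a rough sketch of that two-pass idea as an iterative helper (hypothetical code, not the asker's recursive version; it assumes M is a list of lists of 0s and 1s):
def count_paths_through(M):
    n, m = len(M), len(M[0])
    fwd = [[0] * m for _ in range(n)]   # paths from (0,0) to (i,j)
    bwd = [[0] * m for _ in range(n)]   # paths from (i,j) to (n-1,m-1)
    for i in range(n):
        for j in range(m):
            if M[i][j] != 0:
                continue
            if i == 0 and j == 0:
                fwd[i][j] = 1
            else:
                fwd[i][j] = (fwd[i-1][j] if i > 0 else 0) + (fwd[i][j-1] if j > 0 else 0)
    for i in reversed(range(n)):
        for j in reversed(range(m)):
            if M[i][j] != 0:
                continue
            if i == n - 1 and j == m - 1:
                bwd[i][j] = 1
            else:
                bwd[i][j] = (bwd[i+1][j] if i < n - 1 else 0) + (bwd[i][j+1] if j < m - 1 else 0)
    # paths from (0,0) to (n-1,m-1) through (i,j) = fwd[i][j] * bwd[i][j]
    return [[fwd[i][j] * bwd[i][j] for j in range(m)] for i in range(n)]
This runs in O(n^2) for an n x n matrix, which is the kind of saving the dynamic-programming route buys you.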
Second, the root of the issue and the cause of your hard-to-see exponential growth: it is hidden in the find_num_mem function. Say you compute the last element of the matrix, the result[N][N] field, and let us consider the simplest case, where the matrix is full of zeros so every possible path exists.
In the first step, your recursion creates the branches [N][N-1] and [N-1][N].
In the second step, it creates [N-1][N-1] and [N][N-2] from one branch, and [N-2][N] and [N-1][N-1] from the other.
In the third step, you once again create two branches from every branch of the previous step, a textbook example of exponential explosion.
Now how to go about it: you will quickly notice that some of the branches are duplicated over and over. Cache the results.

How can I fix this for loop to solve a system of linear equations through the Jacobi iteration method?

I'm trying to write a function that performs the Jacobi iteration method for solving a system of linear equations. I've got most of it down; I just need to figure out how to iterate the last for loop either 1000 times or until the break condition is met. How can I make it so the value of x updates each iteration?
import numpy as np

def Jacobi(A,b,err):
    n = A.shape
    k = np.zeros(n[0])
    D = np.zeros(n)
    U = np.zeros(n)
    L = np.zeros(n)
    for i in range(n[0]):
        for j in range(n[0]):
            if i == j:
                D[i,j] = A[i,j]
            elif i < j:
                U[i,j] = A[i,j]
            else:
                L[i,j] = A[i,j]
    w = []
    for i in range(1000):
        x = np.linalg.inv(D)*(U+L)*x + np.linalg.inv(D)*b
        w.append(x)
        if abs(w[-1] - w[-2]) < err:
            break
    return w[-1]
For reference, the error message says the list index in the if clause is out of range. I assume this is because there's only one element in w, since I don't know how to set up the loop properly. Thanks in advance for any help.
I'm pretty sure you missed the intent of that exercise: if you are allowed to use inv, then you could also just use linalg.inv(A), or better, linalg.solve(A,b). Note that you have sign errors and that the multiplication * is not matrix multiplication between numpy arrays. (Your declaration of the arrays is incompatible with their later use.)
Your specific problem can be solved by adding an additional test
if i>1 and abs(w[-1] - w[-2]) < err:
when the first condition fails the second is not evaluated.
You should consider whether it is a waste of memory to construct the w list when all you ever need is the last two entries.
x_last, x = x, jacobi_step(A,b,x)
would also work to have these available.
The preparation can be reduced to
D = np.diag(A); A_reduced = A - np.diag(D)
and then, using the fact that arithmetic operations apply element-wise by default, the Jacobi step is simply
x_last, x = x, (b - A_reduced.dot(x))/D
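Putting those pieces together, a minimal sketch of the full iteration might look like this (it assumes A is square with a nonzero diagonal and that the Jacobi iteration converges for it, e.g. A is diagonally dominant; the convergence test uses a norm, which is one of several reasonable choices):
import numpy as np

def jacobi(A, b, err, max_iter=1000):
    D = np.diag(A)                      # the diagonal of A as a 1-D array
    A_reduced = A - np.diag(D)          # A with its diagonal removed
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_last, x = x, (b - A_reduced.dot(x)) / D
        if np.linalg.norm(x - x_last) < err:
            break
    return x

# example on a small diagonally dominant system
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
print(jacobi(A, b, 1e-10))              # should be close to np.linalg.solve(A, b)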

Efficient Particle-Pair Interactions Calculation

I have an N-body simulation that generates a list of particle positions, for multiple timesteps in the simulation. For a given frame, I want to generate a list of the pairs of particles' indices (i, j) such that dist(p[i], p[j]) < masking_radius. Essentially I'm creating a list of "interaction" pairs, where the pairs are within a certain distance of each other. My current implementation looks something like this:
interaction_pairs = []
# going through each unique pair (order doesn't matter)
for i in range(num_particles):
    for j in range(i + 1, num_particles):
        if dist(p[i], p[j]) < masking_radius:
            interaction_pairs.append((i,j))
Because of the large number of particles, this process takes a long time (>1 hr per test), which severely limits what I can do with the data. I was wondering if there is a more efficient way to structure the data so that calculating these pairs is cheaper than comparing every possible combination of particles. I was looking into KD-trees, but I couldn't figure out a way to use them to compute this more efficiently. Any help is appreciated, thank you!
Since you are using Python, sklearn has multiple implementations for nearest-neighbour finding:
http://scikit-learn.org/stable/modules/neighbors.html
Both KDTree and BallTree are provided.
With a KD-tree, the main idea is to put all of your particles into the tree and then, for each particle, run the query "give me all particles within range X". A KD-tree usually does this faster than a brute-force search.
You can read more, for example, here: https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/kdtrees.pdf
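For instance, SciPy's cKDTree can return exactly this kind of pair list via query_pairs; a minimal sketch, assuming p is an (N, 3) NumPy array of positions:
import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(p)                                      # build the tree once per frame
interaction_pairs = tree.query_pairs(masking_radius)   # set of (i, j), i < j, within masking_radius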
If you are working in 2D or 3D space, another option is to cut the space into a coarse grid (with a cell size equal to the masking radius) and assign each particle to a grid cell. Then you can find candidate interaction partners just by checking neighboring cells (you still have to do a distance check, but for far fewer particle pairs). A sketch of this follows.
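A rough sketch of that grid-binning idea (the function name and layout are illustrative, not from the question; it assumes p is an (N, 3) NumPy array):
import numpy as np
from collections import defaultdict
from itertools import product

def grid_pairs(p, masking_radius):
    # bin every particle into a cube of side masking_radius
    cell_ids = np.floor(p / masking_radius).astype(int)
    cells = defaultdict(list)
    for i, c in enumerate(map(tuple, cell_ids)):
        cells[c].append(i)
    r2 = masking_radius**2
    offsets = list(product((-1, 0, 1), repeat=3))   # a cell plus its 26 neighbours
    pairs = []
    for i in range(len(p)):
        ci = cell_ids[i]
        for off in offsets:
            for j in cells.get((ci[0] + off[0], ci[1] + off[1], ci[2] + off[2]), ()):
                if j > i and ((p[i] - p[j])**2).sum() < r2:
                    pairs.append((i, j))
    return pairs
Because the cell size equals the masking radius, any pair closer than the radius must sit in the same or an adjacent cell, so no pairs are missed.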
Here's a fairly simple technique using plain Python that can reduce the number of comparisons required.
We first sort the points along either the X, Y, or Z axis (selected by axis in the code below). Let's say we choose the X axis. Then we loop over point pairs like your code does, but when we find a pair whose distance is greater than the masking_radius we test whether the difference in their X coordinates is also greater than the masking_radius. If it is, then we can bail out of the inner j loop because all points with a greater j have a greater X coordinate.
My dist2 function calculates the squared distance. This is faster than calculating the actual distance because computing the square root is relatively slow.
I've also included code that behaves similarly to your code, i.e., it tests every pair of points, for speed-comparison purposes; it also serves to check that the fast code is correct. ;)
from random import seed, uniform
from operator import itemgetter

seed(42)

# Make some fake data
def make_point(hi=10.0):
    return [uniform(-hi, hi) for _ in range(3)]

psize = 1000
points = [make_point() for _ in range(psize)]
masking_radius = 4.0
masking_radius2 = masking_radius ** 2

def dist2(p, q):
    return (p[0] - q[0])**2 + (p[1] - q[1])**2 + (p[2] - q[2])**2

pair_count = 0
test_count = 0
do_fast = 1

if do_fast:
    # Sort the points on one axis
    axis = 0
    points.sort(key=itemgetter(axis))
    # Fast
    for i, p in enumerate(points):
        left, right = i - 1, i + 1
        for j in range(i + 1, psize):
            test_count += 1
            q = points[j]
            if dist2(p, q) < masking_radius2:
                #interaction_pairs.append((i, j))
                pair_count += 1
            elif q[axis] - p[axis] >= masking_radius:
                break
        if i % 100 == 0:
            print('\r {:3} '.format(i), flush=True, end='')
    total_pairs = psize * (psize - 1) // 2
    print('\r {} / {} tests'.format(test_count, total_pairs))
else:
    # Slow
    for i, p in enumerate(points):
        for j in range(i+1, psize):
            q = points[j]
            if dist2(p, q) < masking_radius2:
                #interaction_pairs.append((i, j))
                pair_count += 1
        if i % 100 == 0:
            print('\r {:3} '.format(i), flush=True, end='')

print('\n', pair_count, 'pairs')
output with do_fast = 1
181937 / 499500 tests
13295 pairs
output with do_fast = 0
13295 pairs
Of course, if most of the point pairs are within masking_radius of each other, there won't be much benefit in using this technique. And sorting the points adds a little bit of time, but Python's TimSort is rather efficient, especially if the data is already partially sorted, so if the masking_radius is sufficiently small you should see a noticeable improvement in the speed.

Python vectorization with a constant

I have a series X of length n (= 300,000). Using a window length of w (= 40), I need to implement:
mu(i) = X(i) - X(i-w)
s(i) = sum_{k = i-w}^{i} [X(k) - X(k-1) - mu(i)]^2
I was wondering if there's a way to avoid loops here. The fact that mu(i) is constant in the second equation is what is complicating the vectorization. I did the following so far:
x1=x.shift(1)
xw=x.shift(w)
mu= x-xw
dx=(x-x1-mu)**2 # wrong because mu wouldn't be constant for each i
s=pd.rolling_sum(dx,w)
The approach works (and was working) in a loop setting but takes too long, so any help regarding vectorization or other speed-improvement methods would be appreciated. I posted this on Cross Validated with MathJax formatting, but that doesn't seem to work here.
https://stats.stackexchange.com/questions/241050/python-vectorization-with-a-constant
Also just to clarify, I wasn't using a double loop, just a single one originally:
for i in np.arange(w, len(X)):
    x = X.ix[i-w:i, 0]  # clip a series of size w
    x1 = x.shift(1)
    mu.ix[i] = x.ix[-1] - x.ix[0]
    temp = (x - x1 - mu.ix[i])**2  # returns a series of size w but now mu is constant
    s.ix[i] = temp.sum()
Approach #1 : One vectorized approach would be using broadcasting -
N = X.shape[0]
a = np.arange(N)
k2D = a[:,None] - np.arange(w+1)[::-1]
mu1D = X - X[a-w]
out = ((X[k2D] - X[k2D-1] - mu1D[:,None])**2).sum(-1)
We can further optimize the last step to get squared summations with np.einsum -
subs = X[k2D] - X[k2D-1] - mu1D[:,None]
out = np.einsum('ij,ij->i',subs,subs)
Further improvement is possible with the use of NumPy strides to get X[k2D] and X[k2D-1].
Approach #2 : To save on memory when working very large arrays, we can use one loop instead of two loops used in the original code, like so -
N = X.shape[0]
s = np.zeros((N))
k_idx = np.arange(-w,1)
for i in range(N):
    mu = X[i]-X[i-w]
    s[i] = ((X[k_idx]-X[k_idx-1] - mu)**2).sum()
    k_idx += 1
Again, np.einsum could be used here to compute s[i], like so -
subs = X[k_idx]-X[k_idx-1] - mu
s[i] = np.einsum('i,i->',subs,subs)
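A quick way to convince yourself that Approach #1 matches the original definition is to compare it with a direct per-i loop on a small random series (the sizes here are arbitrary; note that, as in the answer, indices below i = w wrap around to the end of the array):
import numpy as np

w = 5
X = np.random.rand(50)

N = X.shape[0]
a = np.arange(N)
k2D = a[:, None] - np.arange(w + 1)[::-1]
mu1D = X - X[a - w]
out = ((X[k2D] - X[k2D - 1] - mu1D[:, None])**2).sum(-1)

ref = np.array([((X[np.arange(i - w, i + 1)] - X[np.arange(i - w, i + 1) - 1]
                  - (X[i] - X[i - w]))**2).sum() for i in range(N)])
assert np.allclose(out, ref)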
