Speeding up nested loops in Python

How can I speed up this code in python?
while norm_corr > corr_len:
    correlation = 0.0
    for i in xrange(6):
        for j in xrange(6):
            correlation += (p[i] * T_n[j][i]) * ((F[j] - Fbar) * (F[i] - Fbar))
    Integral += correlation
    T_n = np.mat(T_n) * np.mat(TT)
    T_n = T_n.tolist()
    norm_corr = correlation / variance
Here, TT is a fixed 6x6 matrix, p is a fixed 1x6 matrix, and F is a fixed 1x6 matrix. T_n is the nth power of TT.
This while loop might be repeated 10^4 times.

The way to speed these things up is to use Numpy's built-in functions and operators to perform the operations. Numpy is implemented internally in optimized C code, and if you set up your computation properly it will run much faster.
But leveraging Numpy effectively can sometimes be tricky. It's called "vectorizing" your code: you have to figure out how to express the computation in a way that acts on whole arrays, rather than with explicit loops.
For example, in your loop you have p[i] * T_n[j][i], which IMHO can be done with a vector-by-matrix multiplication: if v is 1x6 and m is 6x6, then v.dot(m) is a 1x6 vector whose entries are the dot products of v with the columns of m. You can use transposes and reshapes to work in different dimensions, if necessary.
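For instance, here is a sketch of how the whole double loop collapses to a couple of array expressions. The placeholder values for p, F, TT, Fbar, variance and corr_len are made up only so the snippet runs on its own; substitute the real ones from the question.

import numpy as np

# Placeholder data; replace with the real p, F, TT, Fbar, variance, corr_len.
rng = np.random.default_rng(0)
TT = rng.random((6, 6)) * 0.1
p = rng.random(6)
F = rng.random(6)
Fbar = F.mean()
variance = 1.0
corr_len = 1e-6

d = F - Fbar                 # the (F - Fbar) deviations, used twice in the sum
T_n = TT.copy()              # current power of TT
Integral = 0.0
norm_corr = np.inf
while norm_corr > corr_len:
    # sum_{i,j} p[i] * T_n[j, i] * d[j] * d[i]  ==  (p * d) . (T_n^T d)
    correlation = (p * d) @ (d @ T_n)
    Integral += correlation
    T_n = T_n @ TT           # advance to the next power of TT
    norm_corr = correlation / variance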

Related

solve Ax=b for outrigger A matrix python

I am implementing the Crank-Nicolson 2D finite-difference method.
I get a matrix A which is banded, with one band above and one below the main diagonal, but it also contains two additional bands further from the main diagonal, so it is NOT penta-diagonal.
A picture showing the structure is below. My matrix is the RHS one; the LHS one is easy, it's the penta-diagonal one.
I couldn't find up until now a way to solve Ax = b with A being the RHS matrix from the photo in python.
I could barely find a name for it, in these lecture notes https://ocw.mit.edu/ans7870/2/2.086/F12/MIT2_086F12_notes_unit5.pdf it is called an 'outrigger' matrix (page 403).
At the moment I am using spsolve from scipy.sparse.linalg, into which I feed two arguments, namely sparse.csc_matrix(A) and sparse.csc_array(b), where A and b were defined initially as A = sparse.dok_matrix((size, size), dtype=np.complex64) and b = sparse.dok_array((size, 1), dtype=np.complex64) and then populated with values by iterating through them element by element.
It is extremely slow and I was wondering maybe someone more experienced knows a way to exploit the structure appearing in A.
Thank you!
You should consider using the Gauss-Seidel method.
If your system is diagonally dominant it will converge; if it is not, you can probably make it so by using a higher-resolution grid.
Assume both x and b have shape (N, M) and A has shape (N, N).
Let L = np.diag(np.diag(A)), vL = np.diag(A).reshape(N, 1) and U = A - L.
The inv(L) @ (b - U @ x) iteration can then be written as (b - U @ x) / vL, so each iteration has O(n) complexity if you use sparse matrices.
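A rough sketch of that splitting iteration, using dense NumPy for brevity (with scipy.sparse matrices the same update gives the O(n) per-iteration cost mentioned above; the function name and tolerances are only illustrative):

import numpy as np

def diagonal_splitting_solve(A, b, tol=1e-8, max_iter=10_000):
    # Split A into its diagonal (stored as the column vector vL) and the rest U,
    # then iterate x <- (b - U @ x) / vL until the update stops changing.
    N = A.shape[0]
    vL = np.diag(A).reshape(N, 1)
    U = A - np.diag(np.diag(A))
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_new = (b - U @ x) / vL
        if np.linalg.norm(x_new - x) < tol * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x

# Example: a small diagonally dominant system
A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([[1.0], [2.0], [3.0]])
x = diagonal_splitting_solve(A, b)
print(np.allclose(A @ x, b, atol=1e-6))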
If you want to make it even more efficient you can do the multiplications by sum of rolled diagonal matrices.
np.roll(np.diag(np.roll(A, k, axis=0)) * x[:,0], -k, axis=0).reshape(N, M)
You can precompute the rolled diagonals; then your matrix multiplication is performed by four (or five, if the structure is not symmetric) vector multiplications, plus some additional rolling and adding operations.
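A sketch of that idea, assuming a 1-D x for simplicity (extracting each stored diagonal once and reusing it every iteration is the precomputation step; the band offsets below are made up for illustration):

import numpy as np

def banded_matvec(A, x, offsets):
    # Multiply A @ x using only the diagonals of A at the given offsets.
    # Exact whenever every nonzero of A lies on one of those diagonals.
    N = A.shape[0]
    y = np.zeros_like(x)
    for k in offsets:
        d = np.diag(A, k)                # the k-th diagonal of A
        if k >= 0:
            y[:N - k] += d * x[k:]       # y[i] += A[i, i+k] * x[i+k]
        else:
            y[-k:] += d * x[:N + k]      # y[i] += A[i, i+k] * x[i+k]
    return y

# Example: main diagonal, first off-diagonals, and two 'outrigger' bands at +/-4
N, K = 10, 4
A = np.zeros((N, N))
for k in (0, 1, -1, K, -K):
    A += np.diag(np.random.rand(N - abs(k)), k)
x = np.random.rand(N)
print(np.allclose(A @ x, banded_matvec(A, x, offsets=(0, 1, -1, K, -K))))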

Using numpy to contract indices with delta function

I have two NxM arrays in numpy, a and b. I would like to perform a vectorized operation that does the following:
c = np.zeros(N)
for i in range(N):
    for j in range(M):
        c[i] += a[i, j] * b[i, j]
Stated in a more mathematical way, I have two matrices A and B, and want to compute the diagonal of the matrix A*B (being imprecise with matrix transposition, etc). I've been trying to accomplish something like this with the tensordot function, but haven't had much success. This is an operation that is going to be performed many times, so I would like for it to be efficient (i.e., without literally calculating the matrix AB and just taking the diagonal from that).
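For reference, a sketch of the usual one-liners for this operation: the loop is a row-wise dot product (the diagonal of A @ B.T), so it never needs the full N x N product. The inputs below are made up only to demonstrate the equivalence.

import numpy as np

N, M = 4, 3
a = np.random.rand(N, M)
b = np.random.rand(N, M)

# Loop version from the question
c_loop = np.zeros(N)
for i in range(N):
    for j in range(M):
        c_loop[i] += a[i, j] * b[i, j]

# Vectorized equivalents: one dot product per row
c_sum = (a * b).sum(axis=1)
c_einsum = np.einsum('ij,ij->i', a, b)   # contracts j directly, no N x N intermediate

print(np.allclose(c_loop, c_sum), np.allclose(c_loop, c_einsum))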

Creating a symmetric array with power of an element

I am trying to create an array which is symmetric with elements placed as below
I have written the following code to get this form with parameter being 0.5 and dimension being 4-by-4.
import numpy as np
a = np.eye(4)
for i in range(4):
    for j in range(4):
        a[i, j] = (0.5) ** (np.abs(i - j))
This does what I need, but for large dimensions (1000s) it causes a lot of overhead. Is there any other low-complexity method to get this matrix? Thanks.
We can leverage broadcasting after creating a ranged array to represent the iterator variable and then performing an outer subtraction to simulate the i-j part -
n = 4
p = 0.5
I = np.arange(n)
out = p ** (np.abs(I[:,None]-I))
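A quick sanity check, continuing from the snippet above (small n so the loop version from the question is cheap to run):

ref = np.empty((n, n))
for i in range(n):
    for j in range(n):
        ref[i, j] = p ** abs(i - j)
assert np.allclose(out, ref)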
Optimization #1
We can do a hashing based one with indexing, so that we optimize on expensive power computations, like so -
out = (p**np.arange(n))[(np.abs(I[:,None]-I))]
Optimization #2
We can optimize further to use multi-cores with numexpr -
import numexpr as ne
out = ne.evaluate('p**abs(I2D-I)', {'I2D': I[:,None], 'I': I, 'p': p})

Creating a sparse matrix from lists of sub matrices (Python)

This is my first SO question ever. Let me know if I could have asked it better :)
I am trying to find a way to splice together lists of sparse matrices into a larger block matrix.
I have python code that generates lists of square sparse matrices, matrix by matrix. In pseudocode:
Lx = [Lx1, Lx2, ... Lxn]
Ly = [Ly1, Ly2, ... Lyn]
Lz = [Lz1, Lz2, ... Lzn]
Since each individual Lx1, Lx2 etc. matrix is computed sequentially, they are appended to a list--I could not find a way to populate an array-like object "on the fly".
I am optimizing for speed, and the bottleneck features a computation of Cartesian products item-by-item, similar to the pseudocode:
M += J[i,j] * ( Lxi*Lxj + Lyi*Lyj + Lzi*Lzj )
for all combinations of 0 <= i, j <= n. (J is an n x n matrix of numbers.)
It seems that vectorizing this by computing all the Cartesian products in one step via (pseudocode):
L = [ [Lx1, Lx2, ... Lxn],
      [Ly1, Ly2, ... Lyn],
      [Lz1, Lz2, ... Lzn] ]
product = L.T * L
would be faster. However, options such as np.bmat, np.vstack, np.hstack seem to require arrays as inputs, and I have lists instead.
Is there a way to efficiently splice the three lists of matrices together into a block? Or, is there a way to generate an array of sparse matrices one element at a time and then np.vstack them together?
Reference: Similar MATLAB code, used to compute the Hamiltonian matrix for n-spin NMR simulation, can be found here:
http://spindynamics.org/Spin-Dynamics---Part-II---Lecture-06.php
This is what scipy.sparse.bmat is for:
L = scipy.sparse.bmat([Lx, Ly, Lz], format='csc')
LT = scipy.sparse.bmat(list(zip(Lx, Ly, Lz)), format='csr')  # Note: not equivalent to L.T
product = LT * L
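A minimal sketch with tiny random sparse blocks (the block size, the number of blocks, and the use of scipy.sparse.random are made up purely for illustration):

import numpy as np
from scipy import sparse

n = 3
Lx = [sparse.random(2, 2, density=0.5, format='csr') for _ in range(n)]
Ly = [sparse.random(2, 2, density=0.5, format='csr') for _ in range(n)]
Lz = [sparse.random(2, 2, density=0.5, format='csr') for _ in range(n)]

L = sparse.bmat([Lx, Ly, Lz], format='csc')            # 3 block-rows by n block-columns
LT = sparse.bmat(list(zip(Lx, Ly, Lz)), format='csr')  # n block-rows by 3 block-columns
product = LT @ L                                        # block (i, j) is Lx[i]*Lx[j] + Ly[i]*Ly[j] + Lz[i]*Lz[j]
print(product.shape)                                    # (2*n, 2*n)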
I have a "vectorized" solution, but it's almost twice as slow as the original code. Both the bottleneck shown above, and the final dot product shown in the last line below, take about 95% of the calculation time according to kernprof tests.
# Create the matrix of column vectors from these lists
L_column = bmat([Lx, Ly, Lz], format='csc')
# Create the matrix of row vectors (via a transpose of matrix with
# transposed blocks)
Lx_trans = [x.T for x in Lx]
Ly_trans = [y.T for y in Ly]
Lz_trans = [z.T for z in Lz]
L_row = bmat([Lx_trans, Ly_trans, Lz_trans], format='csr').T
product = L_row * L_column
I was able to get a tenfold speed increase by not using sparse matrices and using an array of arrays.
Lx = np.empty((1, nspins), dtype='object')
Ly = np.empty((1, nspins), dtype='object')
Lz = np.empty((1, nspins), dtype='object')
These are populated with the individual Lx arrays (formerly sparse matrices) as they are generated. Using the array structure allows the transpose and Cartesian product to perform as desired:
Lcol = np.vstack((Lx, Ly, Lz)).real
Lrow = Lcol.T # As opposed to sparse version of code, this works!
Lproduct = np.dot(Lrow, Lcol)
The individual Lx[n] matrices are still "bundled", so Lproduct is an n x n array (of matrices). This means in-place multiplication of the n x n J array with Lproduct works:
scalars = np.multiply(J, Lproduct)
Each matrix element is then added on to the final hamiltonian matrix:
for n in range(nspins):
    for m in range(nspins):
        M += scalars[n, m].real

Optimizing histogram distance metric for two matrices in Python

I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is to compute the chi-square distance between every pair of samples: each row of matrix A is compared to all rows of matrix B, resulting in a final matrix C of size NxN, where C[i,j] is the chi-square distance between the histograms A[i] and B[j].
Here is my python code that does the job:
def chi_square(histA, histB):
    eps = 1.e-10
    d = sum((histA - histB)**2 / (histA + histB + eps))
    return 0.5 * d

def matrix_cost(A, B):
    a, _ = A.shape
    b, _ = B.shape
    C = zeros((a, b))
    for i in xrange(a):
        for j in xrange(b):
            C[i, j] = chi_square(A[i], B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A, B):
    C = np.empty((A.shape[0], B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind, :] = np.sum((row - B)**2 / (row + B), axis=-1) / 2.
    return C
return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
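A quick check that the two vectorized versions agree, assuming new_way from above is defined (sizes are made up, and the rows are kept strictly positive so the denominator never hits zero):

import numpy as np

A = np.random.rand(100, 70) + 1e-3
B = np.random.rand(80, 70) + 1e-3

Abroad = A[:, np.newaxis, :]                               # shape (100, 1, 70), broadcasts against B
C_full = np.sum((Abroad - B)**2 / (Abroad + B), axis=-1) / 2.

print(np.allclose(new_way(A, B), C_full))                  # True; C_full.shape == (100, 80)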
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result
