What is the most efficient method to compute this matrix? - python

I have an N×N matrix B of integers and I want to build the matrix A such that
A[m,n] = sum([B[i,j] for i in range(1,m) for j in range(1,n)])
As B can be quite big, computing A naively coefficient by coefficient takes a lot of time.
What is the most efficient way to compute A?

You're in luck. Two calls to numpy.cumsum (cumulative sum) should do the trick:
import numpy as np
A = np.cumsum(np.cumsum(B, axis=0), axis=1)
https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html
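For reference, a quick check (not part of the original answer) that the double cumsum produces inclusive prefix sums, i.e. A[m, n] equals the sum of the top-left submatrix B[:m+1, :n+1]:
B = np.random.randint(0, 10, size=(5, 5))
A = np.cumsum(np.cumsum(B, axis=0), axis=1)
A_naive = np.array([[B[:m+1, :n+1].sum() for n in range(5)] for m in range(5)])
print(np.array_equal(A, A_naive))  # should print True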

Related

python equivalent for `eigs` in matlab with a matrix function

If I want to calculate the k smallest eigenvalues of the matrix product AA', with A of size 300K by 512 and "'" denoting the transpose, doing this in the traditional way would be infeasible. MATLAB, however, provides a nice facility: a function handle that performs the product, Afun = @(x) A*(A'*x);, can be passed to the eigs function. Then, to find the smallest 6 eigenvalues/eigenvectors we call d = eigs(Afun,300000,6,'smallestabs'), where the second input is the size of the matrix AA'. Is there a function in Python that performs a similar thing?
To my knowledge, there is no such functionality in numpy. However, I don't see anything stopping you from simply using numpy.linalg.eigvals to retrieve an array of the matrix eigenvalues and then finding the N smallest with a sort:
import numpy as np
A = ...  # your matrix here
eigvals = np.linalg.eigvals(A)
eigvals.sort()
smallest_6_eigvals = eigvals[:6]

How to vectorize multiple matrix multiplication

I have a 2D array A of shape (1000, 90) and an array B of shape (90, 90, 1000).
I would like to calculate C of shape (1000, 90):
for i in range(1000):
    C[i, :] = np.matmul(A[i, :], B[:, :, i])
I understand that a vectorized formulation will be faster; it seems like einsum might be the function I'm looking for, but I am having trouble deciphering its syntax. Is it np.einsum('ij,jki->ik', A, B)?
Your einsum is correct. But there is a better way, as pointed out by hpaulj.
Using matmul:
import numpy as np
A = np.random.rand(1000, 90)
B = np.random.rand(90, 90, 1000)
C = A[:, np.newaxis, :] @ B.transpose(2, 0, 1)  # batched matrix multiplication
C = C.reshape(-1, C.shape[2])
np.allclose(C, np.einsum('ij,jki->ik', A, B))  # check that both give the same result (allclose allows for floating-point rounding)
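As an extra sanity check (not part of the original answer), the vectorized result can also be compared against the loop from the question:
C_loop = np.empty((1000, 90))
for i in range(1000):
    C_loop[i, :] = np.matmul(A[i, :], B[:, :, i])
print(np.allclose(C, C_loop))  # should print True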

numpy - Computing "element-wise" difference between two arrays along first axis

Suppose I have two arrays A and B with dimensions (n1,m1,m2) and (n2,m1,m2), respectively. I want to compute the matrix C with dimensions (n1,n2) such that C[i,j] = sum((A[i,:,:] - B[j,:,:])^2). Here is what I have so far:
import numpy as np
A = np.array(range(1,13)).reshape(3,2,2)
B = np.array(range(1,9)).reshape(2,2,2)
C = np.zeros(shape=(A.shape[0], B.shape[0]) )
for i in range(A.shape[0]):
for j in range(B.shape[0]):
C[i,j] = np.sum(np.square(A[i,:,:] - B[j,:,:]))
C
What is the most efficient way to do this? In R I would use a vectorized approach, such as outer. Is there a similar method for Python?
Thanks.
You can use scipy's cdist, which is pretty efficient for such calculations after reshaping the input arrays to 2D, like so -
from scipy.spatial.distance import cdist
C = cdist(A.reshape(A.shape[0],-1),B.reshape(B.shape[0],-1),'sqeuclidean')
The above approach is memory efficient and thus a better choice when working with large data sizes. For small input arrays, one can also leverage NumPy broadcasting together with np.einsum, like so -
diffs = A[:,None]-B
C = np.einsum('ijkl,ijkl->ij',diffs,diffs)
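On the small example arrays from the question, one can verify (not part of the original answer) that both vectorized approaches reproduce the double-loop result:
C_loop = np.array([[np.sum(np.square(A[i] - B[j])) for j in range(B.shape[0])]
                   for i in range(A.shape[0])])
print(np.allclose(C_loop, cdist(A.reshape(A.shape[0], -1), B.reshape(B.shape[0], -1), 'sqeuclidean')))
print(np.allclose(C_loop, np.einsum('ijkl,ijkl->ij', A[:, None] - B, A[:, None] - B)))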

Can I compute (x_i-x_j)^T(x_i-x_j) for rows x_i, x_j of a matrix X with a native numpy function instead of a loop?

I need to compute $(x_i - x_j)^T (x_i - x_j)$ in numpy, where $x_i$ and $x_j$ are rows of a matrix $X$. Right now I am using a loop, which is very slow. Is there any native numpy function that allows such a computation, like einsum:
import numpy as np
def pairwise_sq_dists(X):  # wrapper name added just so the `return` is valid
    n = X.shape[0]
    Y = np.zeros((n, n))
    for i in range(n):
        x = (X - X[i])**2
        x = np.sum(x, axis=1)
        Y[i] = x
    return Y
BTW, I am very confused by einsum. Is there any good introductory material for it? The numpy manual page was not very clear to me.
Approach #1
You can use broadcasting as a vectorized approach -
import numpy as np
Y = np.sum((X - X[:, None, :])**2, axis=2)
This should be efficient with relatively small input arrays.
Approach #2
It seems like you are performing Euclidean distance calculations and getting the squared distances. So, you can use distance.cdist like so -
import numpy as np
from scipy.spatial import distance
Y = distance.cdist(X, X, 'sqeuclidean')
This should be efficient with large input arrays.
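Since the question also asks about einsum: the broadcasting idea from Approach #1 can equivalently be written with np.einsum (shown here purely as an illustration, not part of the original answer):
diffs = X - X[:, None, :]
Y = np.einsum('ijk,ijk->ij', diffs, diffs)  # same result as Approach #1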

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you have to square the csc_matrix element-wise and then use the mean function. To get (E[X])^2 you simply square the result of the mean function applied to the original input.
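A minimal sketch of this formula for a csc_matrix (the example matrix below is made up for illustration; X.multiply(X) squares the entries element-wise):
import numpy as np
from scipy import sparse
X = sparse.random(1000, 50, density=0.1, format='csc')  # made-up example matrix
col_mean = np.asarray(X.mean(axis=0)).ravel()                 # E[X] per column
col_mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()  # E[X^2] per column
col_var = col_mean_sq - col_mean**2                           # E[X^2] - (E[X])^2
col_std = np.sqrt(col_var)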
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower than converting the whole matrix at once):
import numpy as np
# mat is the sparse matrix
cols = mat.shape[1]  # number of columns
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)  # X is the sparse matrix
Then the variances are in the var_ attribute:
X_var = scaler.var_
The curious thing, though, is that when I densified the matrix first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix and then standardize it in the usual way (column-wise, as asked) with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there's no point in keeping the matrix in a sparse format.
