Covariance of large matrix - python

I have a large matrix (70000x784) for which I want to compute the covariance matrix (70000x70000). I tried using numpy.cov(), but I get a memory error because there are too many observations (and yes, I am running a 64-bit version of Python on a 64-bit computer).
I attempted to calculate the covariance matrix with a nested for loop (which is really slow), but I know it's not correct because the resulting covariance matrix is not symmetric (X[i,j] != X[j,i]).
Surely, there must be an easier and quicker way to do this?
Here is my attempt, where the input matrix with dimensions 70000x784 is X_scaled:
Xt = np.transpose(X_scaled)
aveRows = np.mean(Xt, axis=0)
for i, val in enumerate(X_scaled[:, 0]):
    for j, val in enumerate(X_scaled[:, 0]):
        cov_matrix[i, j] = np.mean((X_scaled[i, :] - aveRows[i]) * (X_scaled[j, :] - aveRows[j]), axis=0)
        # increase cov_matrix by one row and one column:
        cov_matrix = np.lib.pad(cov_matrix, ((0, 1), (0, 1)), 'constant', constant_values=(0))
print(cov_matrix.shape)
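For reference, here is a vectorized sketch of what the loop above computes (an addition, not from the original post; it assumes X_scaled from the question and treats each of the 70000 rows as a variable with 784 observations). Note that the 70000x70000 float64 result alone needs roughly 39 GB, so it is written block by block into a memory-mapped file:

import numpy as np

n, p = X_scaled.shape                                   # (70000, 784)
Xc = X_scaled - X_scaled.mean(axis=1, keepdims=True)    # center each row (variable)
cov = np.memmap('cov_matrix.dat', dtype='float64', mode='w+', shape=(n, n))
block = 2000
for start in range(0, n, block):
    stop = min(start + block, n)
    # dividing by p matches the np.mean(...) in the loop; np.cov would divide by p - 1
    cov[start:stop] = Xc[start:stop] @ Xc.T / p
cov.flush()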

Related

Optimal Kernel Calculation of Largely Sparse Matrix

I have a fairly large (1600, 1600) sparse matrix B and I need to find its kernel (note: I know it to be 1-dimensional, i.e. just one vector).
One approach is to "brute-force" it by using scipy.linalg.null_space(). That is a fine way, I would say, but for larger matrices it may not be the most optimal approach. Therefore, I tried to implement my own method using the sparse scipy library and SVD:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import linalg

def kernel(matrix):
    # Convert the matrix to compressed sparse row (CSR) format
    matrix = csr_matrix(matrix)
    # Calculate the singular value decomposition (SVD) of the matrix
    U, s, vh = linalg.svds(matrix, which='SM')
    ind = np.where(s == np.min(s))[0][0]  # get the index of the vector with the lowest singular value
    return vh[ind].T
The problem I am having is that the above code does return a vector (which is nice), but it is far from being the kernel of B: the norm of B times that vector is (firstly, not constant, as it changes after every run) way larger than zero.
Has anyone ever encountered this problem? Does anyone know how to fix it?
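For comparison (an addition, not from the original post), a 1600x1600 matrix is small enough to handle densely, so one robust fallback is a full SVD and taking the right singular vector belonging to the smallest singular value:

import numpy as np
from scipy.linalg import svd

def kernel_dense(B):
    # Work on a dense copy; fine for a 1600 x 1600 matrix.
    B = B.toarray() if hasattr(B, 'toarray') else np.asarray(B)
    # Full SVD: singular values come back in descending order, so the last
    # row of vh spans the (one-dimensional) null space.
    _, s, vh = svd(B)
    return vh[-1]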

Generating invertible matrices in numpy/tensorflow

I would like to generate invertible matrices (specifically those from GL(n), a general linear group of size n) using Tensorflow and/or Numpy for use with my neural network.
How can this be done and what would be the best way of doing so?
I understand there is a way to generate symmetric invertible matrices by computing (A + A.T)/2 for arbitrary square matrices A; however, I would like mine not to be restricted to symmetric matrices.
I happened to find one way which I believe can generate a large variety of random invertible matrices, using diagonal dominance.
The theorem is that, for an nxn matrix, if the absolute value of each diagonal element is larger than the sum of the absolute values of the other (off-diagonal) elements in its row, then the matrix is invertible. (Here is the corresponding Wikipedia article: https://en.wikipedia.org/wiki/Diagonally_dominant_matrix)
Therefore the following code snippet generates a random invertible matrix.
import numpy as np

n = 5  # size of the invertible matrix I wish to generate
m = np.random.rand(n, n)
mx = np.sum(np.abs(m), axis=1)
np.fill_diagonal(m, mx)
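A quick sanity check (an addition for illustration, continuing the snippet above): because each new diagonal entry equals its whole row's absolute sum, it exceeds the sum of the off-diagonal absolute values (almost surely, for entries drawn from [0, 1)), so the matrix is strictly diagonally dominant and hence invertible:

print(np.linalg.matrix_rank(m) == n)                 # full rank
print(np.linalg.det(m) != 0)                         # non-zero determinant
print(np.allclose(m @ np.linalg.inv(m), np.eye(n)))  # inversion succeeds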

Looking for an effective way to calculate affinity matrix

Well, suppose that I have an image with dimensions n by n, which is ordered lexicographically into a column vector with dimensions n^2 by 1.
I want to calculate the following affinity matrix for the image:
Clearly, the matrix is symmetric, and we can reduce the calculations using this property. This is my attempt to implement the above matrix calculation:
def w(R, sigma, x):
    n = x.shape[0]
    w = np.zeros_like(x).astype(np.float64)
    img = x.ravel()
    for i in range(img.shape[0]):
        for j in range(img.shape[0]):
            if (max(abs(i//n - j//n), abs(i%n - j%n)) > R or i == j):
                w[i//n, j%n] = 0
                continue
            w[i//n, j%n] = np.exp((-1*(int(img[i]) - int(img[j]))**2)/sigma)
    return w
where R is the threshold for the norm, sigma is the exponential damping factor and x is the image.
However, this code is very inefficient, since it loops in Python over all pairs of the n^2 pixels. Do you have any better idea for making it faster?
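A vectorized sketch (an addition, not from the original post) of the same computation, assuming the intended output is the full n^2 x n^2 affinity matrix (the loop above writes into an n x n array, which looks like an indexing slip); note it materializes (n^2, n^2) arrays, so it trades memory for speed:

import numpy as np

def w_vectorized(R, sigma, x):
    n = x.shape[0]
    img = x.ravel().astype(np.float64)
    # pairwise intensity differences, shape (n^2, n^2)
    diff = img[:, None] - img[None, :]
    w = np.exp(-diff**2 / sigma)
    # pixel coordinates of every flattened index (lexicographic order)
    rows, cols = np.divmod(np.arange(n * n), n)
    # Chebyshev distance between every pair of pixels
    cheb = np.maximum(np.abs(rows[:, None] - rows[None, :]),
                      np.abs(cols[:, None] - cols[None, :]))
    # zero out pairs that are too far apart, and the diagonal (i == j)
    w[(cheb > R) | (cheb == 0)] = 0.0
    return w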

Spectral norm 2x2 matrix in tensorflow

I've got a 2x2 matrix defined by the variables J00, J01, J10, J11 coming in from other inputs. Since the matrix is small, I was able to compute the spectral norm by first computing the trace and determinant
J_T = tf.reduce_sum([J00, J11])
J_ad = tf.reduce_prod([J00, J11])
J_cb = tf.reduce_prod([J01, J10])
J_det = tf.reduce_sum([J_ad, -J_cb])
and then solving the quadratic
L1 = J_T/2.0 + tf.sqrt(J_T**2/4.0 - J_det)
L2 = J_T/2.0 - tf.sqrt(J_T**2/4.0 - J_det)
spectral_norm = tf.maximum(L1, L2)
This works, but it looks rather ugly and it isn't generalizable to larger matrices. Is there a cleaner way (maybe a method call that I'm missing) to compute spectral_norm?
The spectral norm of a matrix J equals the largest singular value of the matrix.
Therefore you can use tf.svd() to perform the singular value decomposition, and take the largest singular value:
spectral_norm = tf.svd(J,compute_uv=False)[...,0]
where J is your matrix.
Notes:
I use compute_uv=False since we are interested only in singular values, not singular vectors.
J does not need to be square.
This solution works also for the case where J has any number of batch dimensions (as long as the two last dimensions are the matrix dimensions).
The ellipsis ... operation works as in NumPy.
I take the 0 index because we are interested only in the largest singular value.
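As a usage sketch (an addition; TF 1.x style, matching the tf.svd call above, with placeholder constants standing in for the scalars coming from other inputs), the four scalars can be stacked into a 2x2 tensor before applying this:

import tensorflow as tf

# placeholder scalars; in the question these come from other inputs
J00, J01, J10, J11 = [tf.constant(v) for v in (1.0, 2.0, 3.0, 4.0)]

J = tf.stack([tf.stack([J00, J01]), tf.stack([J10, J11])])  # shape (2, 2)
spectral_norm = tf.svd(J, compute_uv=False)[..., 0]         # largest singular value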

What's wrong with my PCA?

My code:
from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues, not the 150 that eig gives me. Am I doing something wrong?
Edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to sum to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I could get the result I want, but I don't understand why. Either way, the proportion of the eigenvalues isn't altered, but they are important to me, so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not of the data itself. The covariance matrix, created from an m x n data matrix (m observations of n variables), will be an n x n symmetric matrix; the closely related correlation matrix additionally has ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first:
data -= NP.mean(data, axis=0)

# calculate the correlation matrix
C = NP.corrcoef(data, rowvar=0)
# returns an n x n matrix (here 5 x 5)

# now get the eigenvalues/eigenvectors of C:
evals, evec = LA.eig(C)
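For completeness (an addition, not part of the original answer), the usual final PCA step is to sort the components by decreasing eigenvalue and project the data onto the leading ones; since C above is a correlation matrix, the data are standardized before projecting:

# continuing the snippet above
order = NP.argsort(evals.real)[::-1]      # eigenvalue indices, largest first (.real guards against round-off)
top_k = evec[:, order[:2]].real           # two leading principal directions
standardized = data / data.std(axis=0)    # data is already mean-centered above
scores = standardized @ top_k             # projected data, shape (8, 2)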
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module--it is a little easier to work with than svd: the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you'll always have a square matrix to decompose, regardless of the form your data is in. This is because the matrix you decompose in PCA is a covariance matrix, which by definition is always square: its rows and columns are indexed by the same set of variables, and each cell holds the covariance of a pair of variables, with each variable's variance down the main diagonal (ones, if the data have been standardized, since a variable is perfectly correlated with itself).
The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a mean-centered dataset A (observations as columns) is 1/(N-1) * AA^T.
Now, when you do PCA via the SVD, you have to square the singular values and divide them by (N-1) (or by N, under the population convention) to get the eigenvalues of the covariance matrix at the correct scale.
In your case, N=150 and you haven't applied this scaling, hence the discrepancy.
This is explained in detail here
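A quick check with the numbers printed in the question (an added illustration): squaring the singular values and dividing by N = 150 reproduces the desired eigenvalues almost exactly:

import numpy as np

s = np.array([20.89551896, 11.75513248, 4.7013819, 1.75816839])
print(s**2 / 150)   # ~ [2.9108, 0.9212, 0.1474, 0.0206]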
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
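A minimal sketch of that fix (an addition; it assumes data is the 150 x 4 array returned by load_iris above): passing rowvar=0, or equivalently transposing, makes cov() treat the four columns as the variables:

import numpy as np

cov_mat = np.cov(data, rowvar=0)    # shape (4, 4) instead of (150, 150)
val, vec = np.linalg.eig(cov_mat)
print(val)                          # four eigenvalues, one per feature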
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
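A small numerical check of these relationships (added for illustration, with a random matrix standing in for the 4 x 150 observation matrix X):

import numpy as np

X = np.random.rand(4, 150)
Q, s, Vt = np.linalg.svd(X, full_matrices=False)       # left singular vectors in Q, s descending
evals, evecs = np.linalg.eigh(X @ X.T)                 # ascending eigenvalues of the symmetric X X^T
print(np.allclose(evals[::-1], s**2))                  # eigenvalues of X X^T equal squared singular values
print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Q)))  # same vectors as Q, up to sign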
Question already addressed: Principal component analysis in Python
