How to calculate covariance matrix of data frame - python

I have read a data frame of sensor data using the pandas read_fwf function.
I need to find the covariance matrix of the resulting 928991 x 8 matrix. Eventually,
I want to find the eigenvectors and eigenvalues of this covariance matrix, as in principal component analysis.

First, convert the pandas DataFrame to a NumPy array using df.values. For example:
A = df.values
It is much easier to compute either the covariance matrix or PCA once your data is in a NumPy array. In more detail:
# import the functions needed to compute the covariance matrix
import pandas as pd
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# assume you load your data using pd.read_fwf into the variable df
# (filepath, col_widths and col_names are placeholders for your own values)
df = pd.read_fwf(filepath, widths=col_widths, names=col_names)
#put dataframe values to a numpy array
A = df.values
#check matrix A's shape, it should be (928991, 8)
print(A.shape)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
Running the example prints the shape of the data matrix, the column means, the centered data, and the covariance matrix, then the eigenvectors and eigenvalues of the covariance matrix, and finally the projection of the centered data onto the eigenvectors.
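Since a covariance matrix is symmetric, np.linalg.eigh can be used in place of eig. The following is only a sketch on synthetic data: the random array stands in for the 928991 x 8 sensor matrix, and sorting the components by decreasing eigenvalue is a common convention rather than something the answer above requires.

```python
import numpy as np

# Synthetic stand-in for the sensor data: 1000 rows, 8 columns (assumption).
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))

# Center the columns and compute the 8 x 8 covariance matrix.
C = A - A.mean(axis=0)
V = np.cov(C, rowvar=False)

# eigh is intended for symmetric matrices; it returns real eigenvalues
# in ascending order, so reorder them to put the largest first.
values, vectors = np.linalg.eigh(V)
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]

# Project the centered data onto the principal axes (one column per component).
P = C @ vectors
```

The covariance matrix of P should then be close to diagonal, with the eigenvalues on the diagonal, which is a quick sanity check for the decomposition.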

Why not just use the pd.DataFrame.cov function?
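As a rough illustration of that suggestion, DataFrame.cov() gives the same matrix as np.cov when the columns are treated as variables. The small frame and its column names below are made up for the sketch.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the sensor columns (assumption).
df = pd.DataFrame({
    "R1": [1.0, 2.0, 3.0, 4.0],
    "R2": [2.0, 1.0, 4.0, 3.0],
    "R3": [0.5, 0.7, 0.2, 0.9],
})

# DataFrame.cov() treats columns as variables and rows as observations.
cov_pandas = df.cov()

# np.cov treats rows as variables by default, so pass rowvar=False.
cov_numpy = np.cov(df.to_numpy(), rowvar=False)
```

The pandas result keeps the column labels on both axes, which makes the matrix easier to read than the bare NumPy array.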

The answer to this question would be as follows:
import pandas as pd
import numpy as np
from numpy.linalg import eig
df_sensor_data = pd.read_csv('HT_Sensor_dataset.dat', sep=r'\s+')
# drop the columns that should not enter the covariance matrix
df_sensor_data = df_sensor_data.drop(columns=['id', 'time', 'Temp.', 'Humidity'])
# cast to float; DataFrame.cov() excludes missing values pairwise
df_sensor_data = df_sensor_data.astype('float64')
covariance_matrix = df_sensor_data.cov()
print(covariance_matrix)
values, vectors = eig(covariance_matrix)
print(values)
print(vectors)

Related

Covariance Matrix showing covariance between each value in vectors

Is there a Python function that computes an n*n autocovariance matrix, showing the covariance between each combination of the entries in a vector [a1,a2,a3...an]? I can't get np.cov to do that...
I want it to look like this:
cov(a1,a1) cov(a1,a2)... cov(a1,an)
cov(a2,a1) cov(a2,a2)...
...
cov(an,a1) ... cov(an,an)
Any help is appreciated!
Cheers,
Lena
You could use the pandas package.
http://pandas.pydata.org/
Dataframes have a covariance method that computes the covariances between all columns in the DataFrame.
This should roughly be the workflow.
import pandas as pd
df = pd.DataFrame(your_vector)
cov = df.cov()
cov is now a DataFrame that contains the covariances between all the columns in your data. You can then do whatever analysis you want on the resulting DataFrame.
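One caveat worth sketching: a covariance between entries needs several observations of each entry. The example below assumes the data can be arranged with one row per observation and one column per entry (a1..an); the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 observations (rows) of a vector with n = 3 entries.
samples = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 0.7],
    [3.0, 4.0, 0.2],
    [4.0, 3.0, 0.9],
    [5.0, 6.0, 0.4],
    [6.0, 5.0, 0.6],
])

# Columns are the entries a1..a3, so the result is a 3 x 3 matrix.
cov_df = pd.DataFrame(samples, columns=["a1", "a2", "a3"]).cov()

# A single vector has only one observation per entry: as one column it
# gives a 1 x 1 matrix holding the variance of the values.
single = pd.DataFrame(samples[:, 0]).cov()
```

If only one vector is available, the n*n layout in the question cannot be estimated from it this way; the single-column case collapses to one variance.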

How to get the norm of the vector corresponding to a particular row in a SciPy sparse matrix?

I have a random sparse matrix created as follows:
import numpy as np
from scipy.sparse import rand
foo = rand(100, 100, density=0.1, format='csr')
I would like to get the norm of the vector corresponding to a particular row:
row = foo.getrow(bar)
print(np.linalg.norm(row))
But this code produces an error:
ValueError: dimension mismatch
One approach would be to extract the non-zero data and then compute its L2 norm -
out = np.linalg.norm(row.data)
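This works because the zero entries add nothing to the sum of squares, so the stored values alone determine the L2 norm. A small sketch, using a hand-made matrix instead of a random one so the results are easy to check (row_norm is a hypothetical helper name):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small hand-made CSR matrix so the norms are easy to verify (assumption).
foo = csr_matrix(np.array([
    [3.0, 0.0, 4.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 2.0],
]))

def row_norm(matrix, i):
    """L2 norm of row i, computed from the stored (non-zero) values only."""
    return np.linalg.norm(matrix.getrow(i).data)

norms = [row_norm(foo, i) for i in range(foo.shape[0])]
```

An all-zero row has no stored values, and the norm of the empty data array is 0.0, so that case needs no special handling.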

Compute numpy array pairwise Euclidean distance except with self

edit: this question is not specifically about calculating distances, rather the most efficient way to loop through a numpy array, specifying that for index i all comparisons should be made with the rest of the array, as long as the second index is not i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy
# Point feature class
fc = "MY_FEATURE_CLASS"
# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc,["SHAPE#X","SHAPE#Y","OID#"])
for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow((row[0]-row2[0]),2)+math.pow((row[1]-row2[1]),2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!
Using SciPy's pdist, you could write something like
import numpy as np
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: np.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
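To leave out the comparison of each point with itself, one option is to mask the diagonal of the square matrix. The sketch below assumes the coordinates are available as a plain 2-D float array with columns (X, Y, ID); the points are invented, and pdist's default Euclidean metric is used on the two coordinate columns instead of a custom lambda.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up points standing in for the feature class: columns are (X, Y, ID).
points = np.array([
    [0.0, 0.0, 1],
    [3.0, 4.0, 2],
    [6.0, 8.0, 3],
])

# Use only the coordinate columns; the default metric is Euclidean.
distances = squareform(pdist(points[:, :2]))

# Mask the diagonal so no point is compared with itself.
np.fill_diagonal(distances, np.nan)

# Distance from each point to its nearest other point.
nearest = np.nanmin(distances, axis=1)
```

Using NaN on the diagonal keeps the matrix shape intact, and the nan-aware reductions (np.nanmin, np.nanmean) then skip the self-comparisons.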
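Each row of X is a 3-dimensional data instance or point.
The output pairwisedist[i, j] is the Euclidean distance between X[i, :] and X[j, :]. The sum b + b.T - 2*X.dot(X.T) gives the squared distances, so take the square root at the end.
import numpy as np
X = np.array([[6,1,7],[10,9,4],[13,9,3],[10,8,15],[14,4,1]])
# squared norm of each row
a = np.sum(X*X,1)
b = np.repeat(a[:,np.newaxis], X.shape[0], axis=1)
# clip tiny negative rounding errors before taking the root
pairwisedist = np.sqrt(np.maximum(b + b.T - 2*X.dot(X.T), 0))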
I wanted to point out that custom-written square roots of sums of squares are prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance:
import math
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a, b: math.hypot(a[0]-b[0], a[1]-b[1])))
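The overflow problem can be seen with a deliberately extreme example. The values below are chosen only to trigger it and are not realistic coordinates.

```python
import numpy as np

# Coordinate differences so large that squaring them overflows float64.
dx = np.array([3e200])
dy = np.array([4e200])

# Naive formula: dx**2 overflows to inf, so the result is inf.
with np.errstate(over="ignore"):
    naive = np.sqrt(dx**2 + dy**2)

# np.hypot avoids the intermediate overflow and returns a finite value.
safe = np.hypot(dx, dy)
```

Here the true length, 5e200, is representable, but the naive formula loses it because the intermediate squares are not.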

easy sampling of vectors from a sparse matrix, and creating a new matrix from the sample (python)

This question has two parts (maybe one solution?):
Sample vectors from a sparse matrix: Is there an easy way to sample vectors from a sparse matrix?
When I try to sample rows using random.sample, I get a TypeError: sparse matrix length is ambiguous.
from random import sample
import numpy as np
from scipy.sparse import lil_matrix
K = 2
m = [[1,2],[0,4],[5,0],[0,8]]
sample(m,K) #works OK
mm = np.array(m)
sample(m,K) #works OK
sm = lil_matrix(m)
sample(sm,K) #throws exception TypeError: sparse matrix length is ambiguous.
My current solution is to sample from the row indices of the matrix and then use getrow(), something like:
indxSampls = sample(range(sm.shape[0]), K)
sampledRows = []
for i in indxSampls:
    sampledRows.append(sm.getrow(i))
Any other efficient/elegant ideas? the dense matrix size is 1000x30000 and could be larger.
Constructing a sparse matrix from a list of sparse vectors: Now imagine I have the list of sampled vectors sampledRows. How can I convert it to a sparse matrix without densifying it, converting it to a list of lists, and then converting that to a lil_matrix?
Try
sm[np.random.sample(sm.shape[0], K, replace=False), :]
This gets you out an LIL-format matrix with just K of the rows (in the order determined by the random.sample). I'm not sure it's super-fast, but it can't really be worse than manually accessing row by row like you're currently doing, and probably preallocates the results.
The accepted answer to this question is outdated and no longer works. With newer versions of numpy, you should use np.random.choice in place of np.random.sample, e.g.:
sm[np.random.choice(sm.shape[0], K, replace=False), :]
as opposed to:
sm[np.random.sample(sm.shape[0], K, replace=False), :]
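Both parts of the question can be sketched together. The example below uses the small matrix from the question and a seeded Generator (the seed is only there to make the sketch repeatable); scipy.sparse.vstack is one way to build a matrix from a list of sparse rows without going through a dense array, offered here as a suggestion rather than as part of the answers above.

```python
import numpy as np
from scipy.sparse import lil_matrix, vstack

K = 2
sm = lil_matrix(np.array([[1, 2], [0, 4], [5, 0], [0, 8]]))

# Pick K distinct row indices.
rng = np.random.default_rng(0)
idx = rng.choice(sm.shape[0], size=K, replace=False)

# Part 1: fancy indexing keeps the result sparse.
sampled = sm[idx, :]

# Part 2: build a matrix from a list of sparse row vectors, without densifying.
rows = [sm.getrow(i) for i in idx]
stacked = vstack(rows, format="lil")
```

Both routes give the same K x 2 matrix; the indexing form is shorter, while vstack is useful when the rows already exist as separate sparse objects.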

Efficient way to create a diagonal sparse matrix

I have the following code in Python using Numpy:
p = np.diag(1.0 / np.array(x))
How can I transform it to get the sparse matrix p2 with the same values as p without creating p first?
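Use scipy.sparse.spdiags (which does a lot, and so may be confusing at first), scipy.sparse.dia_matrix, or scipy.sparse.lil_diags, depending on the format you want the sparse matrix in. (lil_diags is no longer available in current SciPy releases; scipy.sparse.diags is the newer convenience function.)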
E.g. using spdiags:
import numpy as np
import scipy as sp
import scipy.sparse
x = np.arange(10)
# "0" here indicates the main diagonal...
# "y" will be a dia_matrix type of sparse array, by default
y = sp.sparse.spdiags(x, 0, x.size, x.size)
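Using the scipy.sparse module, pass the diagonal values and their offset as a (data, offsets) tuple:
from scipy import sparse
p = sparse.dia_matrix(([1.0 / np.array(x)], [0]), shape=(len(x), len(x)))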
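For the original question (a sparse version of np.diag(1.0 / np.array(x))), scipy.sparse.diags is a compact alternative. This is a minimal sketch with made-up values for x; the dense matrix is built only to compare results and would be skipped in real use.

```python
import numpy as np
from scipy import sparse

# Example values; they must be non-zero because of the division.
x = np.arange(1, 6)

# Sparse diagonal matrix built directly from the diagonal values.
p2 = sparse.diags(1.0 / x, offsets=0, format="csr")

# Dense version, built here only to check the sparse result.
p = np.diag(1.0 / x)
```

The format argument picks the storage format of the result; leaving it out returns a DIA-format matrix.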
