Is there a Python function that allows me to compute an n*n auto-covariance matrix, displaying the covariance between each combination of the entries in a vector [a1, a2, a3, ..., an]? I can't get np.cov to do that...
I want it to look like this:
cov(a1,a1) cov(a1,a2)... cov(a1,an)
cov(a2,a1) cov(a2,a2)...
...
cov(an,a1) ... cov(an,an)
Any help is appreciated!
Cheers,
Lena
You could use the pandas package.
http://pandas.pydata.org/
Dataframes have a covariance method that computes the covariances between all columns in the DataFrame.
This should roughly be the workflow.
import pandas as pd
df = pd.DataFrame(your_vector)
cov = df.cov()
cov is now a DataFrame that contains the covariances between all the columns of your data. You can then do whatever analysis you want on the resulting DataFrame.
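For example, here is a minimal sketch (the data is made up; it assumes your observations of a1...an are stored as the columns of a 2-D array, one row per observation):
import numpy as np
import pandas as pd
# hypothetical data: 100 observations of a vector [a1, a2, a3, a4]
data = np.random.randn(100, 4)
df = pd.DataFrame(data, columns=['a1', 'a2', 'a3', 'a4'])
cov = df.cov()  # 4x4 DataFrame; cov.loc['a1', 'a2'] is cov(a1, a2)
print(cov)
# np.cov gives the same matrix as a plain array when the variables are the columns
print(np.cov(data, rowvar=False))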
Related
I have read a data frame of sensor data using the pandas read_fwf function.
I need to find the covariance matrix of this 928991 x 8 matrix. Eventually,
I want to find the eigenvectors and eigenvalues of that covariance matrix, using the principal component analysis algorithm.
First, you need to convert the pandas DataFrame to a numpy array by using df.values. For example:
A = df.values
It is much easier to compute either the covariance matrix or the PCA once your data is in a numpy array. In more detail:
# import what you need to compute the covariance matrix and its eigendecomposition
import pandas as pd
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# assume you load your data using pd.read_fwf into the variable df
df = pd.read_fwf(filepath, widths=col_widths, names=col_names)
#put dataframe values to a numpy array
A = df.values
#check matrix A's shape, it should be (928991, 8)
print(A.shape)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
Running the example first prints the shape of the matrix, the column means, and the centered matrix, then the covariance matrix with its eigenvectors and eigenvalues, followed finally by the projection of the original data. Here is a link you may find useful for your PCA task.
Why not just use the pd.DataFrame.cov function?
The answer to this question would be as follows:
import pandas as pd
import numpy as np
from numpy.linalg import eig
df_sensor_data = pd.read_csv('HT_Sensor_dataset.dat', delim_whitespace=True)
del df_sensor_data['id']
del df_sensor_data['time']
del df_sensor_data['Temp.']
del df_sensor_data['Humidity']
# drop rows with missing values and make sure everything is numeric
df = df_sensor_data.dropna().astype('float64')
covariance_matrix = df.cov()
print(covariance_matrix)
values, vectors = eig(covariance_matrix)
print(values)
print(vectors)
Suppose I have 2 pandas Series, x1 and x2, which I think of as column vectors in the linear-algebra sense.
I want to do the operation x1 * x2^T, i.e. multiply a column vector by a row vector to produce a matrix (a pandas DataFrame).
What is the best procedure for this?
You want to import numpy and call:
pandas.DataFrame(numpy.outer(x1, x2))
Inside of pandas, you can convert the Series back to DataFrames and use dot to do it, e.g.
x1.to_frame().dot(x2.to_frame().T)
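For instance, a small sketch with made-up Series (both approaches produce the same outer-product matrix):
import numpy as np
import pandas as pd
x1 = pd.Series([1, 2, 3])  # toy column vector
x2 = pd.Series([4, 5])     # toy column vector
outer_np = pd.DataFrame(np.outer(x1, x2))      # 3x2 DataFrame via numpy
outer_pd = x1.to_frame().dot(x2.to_frame().T)  # same values, pandas only
print(outer_np)
print(outer_pd)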
I am using scikit-learn. The nature of my application is such that I do the fitting offline, and then can only use the resulting coefficients online (on the fly) to manually calculate various objectives.
The transform is simple: it is just data * pca.components_, i.e. a simple dot product. However, I have no idea how to perform the inverse transform. Which field of the pca object contains the relevant coefficients for the inverse transform? How do I calculate the inverse transform?
Specifically, I am referring to the PCA.inverse_transform() method call available in the sklearn.decomposition.PCA package: how can I manually reproduce its functionality using various coefficients calculated by the PCA?
1) transform is not data * pca.components_.
Firstly, * is not the dot product for a numpy array. It is element-wise multiplication. To perform the dot product, you need to use np.dot.
Secondly, the shape of PCA.components_ is (n_components, n_features), while the shape of the data to transform is (n_samples, n_features), so you need to transpose PCA.components_ to perform the dot product.
Moreover, the first step of transform is to subtract the mean; therefore, if you do it manually, you also need to subtract the mean first.
The correct way to transform is
data_reduced = np.dot(data - pca.mean_, pca.components_.T)
2) inverse_transform is just the inverse process of transform
data_original = np.dot(data_reduced, pca.components_) + pca.mean_
If your data already has zero mean in each column, you can ignore the pca.mean_ above, for example
import numpy as np
from sklearn.decomposition import PCA
# data: array of shape (n_samples, n_features), assumed to already have zero column means
pca = PCA(n_components=3)
pca.fit(data)
data_reduced = np.dot(data, pca.components_.T)  # transform
data_original = np.dot(data_reduced, pca.components_)  # inverse_transform
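As a quick sanity check (a sketch on made-up, uncentered data, so pca.mean_ is included as described above), you can compare the manual computation against PCA's own transform and inverse_transform:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
data = rng.randn(100, 5)  # toy data, not centered
pca = PCA(n_components=3).fit(data)
manual_reduced = np.dot(data - pca.mean_, pca.components_.T)
manual_original = np.dot(manual_reduced, pca.components_) + pca.mean_
print(np.allclose(manual_reduced, pca.transform(data)))                     # True
print(np.allclose(manual_original, pca.inverse_transform(manual_reduced)))  # True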
I want to calculate the standard deviation of the values below and above the column averages of a matrix with n_par parameters and n_samples samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
    stdleft[jpar] = p[p[:, jpar] < mean[jpar], jpar].std()
where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these lines take ages to run.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this as #SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix into a pandas DataFrame and index the DataFrame with a boolean condition. Something like this:
mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means of the DataFrame.
m = mat.mean()
This creates an array of size n_par holding the column means of mat. Finally, index mat using the < comparison and apply std to the result.
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
Edit: updated based on the comment below. You can do almost the same indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame with a boolean mask of the same size. Doing so in numpy is slightly harder; I guess looping is the best way to achieve it?
stdleft = np.zeros(p.shape[1])
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()
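Alternatively, here is a loop-free numpy sketch (not from the original answer): push the unwanted values to NaN and use np.nanstd, which ignores NaNs:
import numpy as np
p = np.random.random((10, 3))  # toy data: 10 samples, 3 parameters
m = p.mean(axis=0)
below = np.where(p < m, p, np.nan)  # values at or above the column mean become NaN
stdleft = np.nanstd(below, axis=0)  # per-column std of the remaining values
print(stdleft)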
I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you square the csc_matrix element-wise and then use the mean function. To get (E[X])^2 you simply square the result of the mean function applied to the original matrix.
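A minimal sketch of that formula on a sparse matrix (the random matrix here is only for illustration; squaring element-wise keeps the matrix sparse):
import numpy as np
import scipy.sparse
X = scipy.sparse.random(1000, 50, density=0.01, format='csc')  # toy sparse matrix
mean_of_square = np.asarray(X.multiply(X).mean(axis=0)).ravel()  # E[X^2] per column
square_of_mean = np.square(np.asarray(X.mean(axis=0)).ravel())   # (E[X])^2 per column
col_var = mean_of_square - square_of_mean
col_std = np.sqrt(col_var)
print(col_std)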
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np
# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the attribute var_:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As #Sebastian has noted in his comments, standardizing destroys the sparsity structure (introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.