covariance matrix using np.matmul(data.T, data) - python

This is the code I've found online
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
#find the co-variance matrix which is : (A^T * A)/n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T , sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.

This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot product is transposed, not the second. The E[...] means expected value, which you estimate with the mean of your data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, which is why you don't explicitly do so in your dot product.
Note that StandardScaler() also divides by the standard deviation, so it normalizes everything to unit variance. This will affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
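To make that concrete, here is a small check on toy data (made up here, not the MNIST set): the tutorial's formula applied to standardized data reproduces the correlation structure, while np.cov on the raw data gives the unnormalized covariance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 3)) * [1.0, 2.0, 5.0]    # three features with different scales

std = StandardScaler().fit_transform(raw)             # subtract mean, divide by std dev
covar_matrix = np.matmul(std.T, std) / len(std)       # the tutorial's formula

# Because the data were standardized, this is really the correlation matrix of the raw data...
print(np.allclose(covar_matrix, np.corrcoef(raw.T)))  # True
# ...not the covariance of the original, unscaled data:
print(np.cov(raw.T))                                  # actual covariance, as suggested above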

Let's build towards Covariance matrix step by step, first let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is the measure of the joint probability for two random variables. It describes how the two variables change together. Read here.
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. For mean-centred data A with n rows (one sample per row), it can be calculated as

cov(A) = (A^T * A) / n

where entry (i, j) is the covariance between feature i and feature j. That is the equation from the tutorial that you are confused about. If you have any further queries, comment below.
Image source: Wikipedia.
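A minimal numpy sketch of the same progression (variance, then covariance, then the full covariance matrix) on a toy matrix made up for illustration:
import numpy as np

# Toy data: 5 samples (rows), 3 features (columns)
A = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.0],
              [3.0, 3.5, 1.0],
              [4.0, 2.5, 0.5],
              [5.0, 4.0, 1.5]])

Ac = A - A.mean(axis=0)                 # centre each feature on its mean
var_each = (Ac ** 2).mean(axis=0)       # variance of each feature (the diagonal of the matrix)
cov_01 = (Ac[:, 0] * Ac[:, 1]).mean()   # covariance between feature 0 and feature 1
cov_matrix = Ac.T @ Ac / len(A)         # the whole matrix at once: (A^T * A) / n on centred data

print(np.allclose(cov_matrix, np.cov(A.T, bias=True)))  # True: matches numpy's own estimate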

Related

Averaging multivariate normal distributions, extend covariance along a vector

If I have two separate multivariate normal random variables:
from scipy.stats import multivariate_normal
import numpy as np
cov0 = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])
mean0 = np.array([1,1,1])
rv3d_0 = multivariate_normal(mean=mean0, cov=cov0)
cov1 = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])
mean1 = np.array([4,4,4])
rv3d_1 = multivariate_normal(mean=mean1, cov=cov1)
Then I am interested in creating a new random variable that is between these two:
mean_avg = (mean0+mean1)/2
cov_avg = (cov0+cov1)/2
rv3d_avg = multivariate_normal(mean=mean_avg, cov=cov_avg)
# I can then plot the points generated by:
rv3d_0.rvs(1000)
rv3d_1.rvs(1000)
rv3d_avg.rvs(1000)
However, when looking at the points generated, the covariance is predictably the same as that of the two components. What I would like instead is for the covariance to be greater along the vector (mean1-mean0) than along the orthogonal vectors. I think maybe taking the average of the covariances is not the proper technique? Any suggestions welcome, thanks!
This is an interesting problem. Look at it this way: you have some specific directions for the covariance components, namely mean1 - mean0 is one direction and the plane orthogonal to mean1 - mean0 contains the others. In these directions you want to specify the magnitude of the variation, namely it's something (let's say FOO) in the orthogonal plane and a lot more (let's say 100 times FOO) in the direction mean1 - mean0.
You can find a basis for the orthogonal plane via the Gram-Schmidt algorithm or something. At this point you can construct a covariance matrix: let S = columns of the directions you've found (namely mean1 - mean0 plus the basis of the orthogonal plane), and let D = diagonal matrix with 100 FOO, FOO, FOO, ..., FOO on the diagonal. Now S D S^T (where S^T is the matrix transpose) is a positive definite matrix with the desired properties.
You might be able to avoid Gram-Schmidt, but your goal would be the same in any case: specify the properties you want and then construct a matrix to satisfy them.
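Here is a rough numpy sketch of that construction (the FOO magnitude and the 100x factor are just the placeholders from above; a QR decomposition stands in for Gram-Schmidt):
import numpy as np
from scipy.stats import multivariate_normal

mean0 = np.array([1.0, 1.0, 1.0])
mean1 = np.array([4.0, 4.0, 4.0])

d = mean1 - mean0
d = d / np.linalg.norm(d)                      # unit vector along mean1 - mean0

# Orthonormal basis whose first column is d (QR on [d, e1, e2] plays the role of
# Gram-Schmidt; e1, e2 work here because d is not in their span).
S, _ = np.linalg.qr(np.column_stack([d, np.eye(3)[:, :2]]))
S[:, 0] = d                                    # undo a possible sign flip from QR

FOO = 0.5                                      # variance in the orthogonal plane
D = np.diag([100 * FOO, FOO, FOO])             # much larger variance along d

cov = S @ D @ S.T                              # positive definite, elongated along mean1 - mean0
rv3d_stretched = multivariate_normal(mean=(mean0 + mean1) / 2, cov=cov)
samples = rv3d_stretched.rvs(1000)             # points spread mainly along the mean1 - mean0 axis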
I would suggest the following approach:
1- sample a good amount of observations (say 10000) from both distributions: obs0 and obs1
2- create a new array of observations obs_avg which is the sum of obs0 and obs1 divided by 2
3- For the obtained array, calculate the mean and the covariance. The code should look like this:
import numpy as np
obs0 = np.random.multivariate_normal(mean0, cov0, 10000)  # sample from the first 3-D normal
obs1 = np.random.multivariate_normal(mean1, cov1, 10000)  # sample from the second 3-D normal
obs_avg = (obs0 + obs1)/2
mean_avg = np.mean(obs_avg, axis=0)
cov_avg = np.cov(obs_avg.T)
It's an experimental way of generating the mean and covariance of the average distribution, and I think it should give you pretty accurate results if you take a large enough number of observations.

In sklearn.decomposition.PCA, why are components_ negative?

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.
When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?
Update: my (partial) answer below contains some additional info.
Take the following example data:
from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred',
start='2017-01-01', end='2017-02-01').pct_change().dropna())
# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
[-0.43328092 -0.36048659 0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629 0.58614003 0.56194768]
[ 0.43328092 0.36048659 -0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True True True]
[ True True True]
[False False False]]
As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is

X = \sum_{i=1}^{r} s_i u_i v_i^\top

with the s_i ordered in decreasing fashion, then you can change the sign (i.e., "flip") of, say, u_1 and v_1; the two minus signs cancel, so the formula still holds.
This shows that the SVD is unique up to a change in sign in pairs of left and right singular vectors.
Since PCA is just an SVD of X (or an eigenvalue decomposition of X^\top X), there is no guarantee that it returns the same result on the same X every time it is performed. Understandably, the scikit-learn implementation wants to avoid this: they guarantee that the left and right singular vectors returned (stored in U and V) are always the same, by imposing (arbitrarily) that the largest coefficient of u_i in absolute value is positive.
As you can see reading the source: first they compute U and V with linalg.svd(). Then, for each left singular vector u_i (i.e., column of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change u_i to -u_i and the corresponding right singular vector, v_i, to -v_i. As noted earlier, this does not change the SVD formula, since the minus signs cancel out. However, it is now guaranteed that the U and V returned after this processing are always the same, since the sign indeterminacy has been removed.
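Here is a rough numpy sketch of that convention (it mimics what svd_flip does, but it is not sklearn's actual code):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# For each column of U, take the sign of its largest-magnitude entry...
max_rows = np.abs(U).argmax(axis=0)
signs = np.sign(U[max_rows, np.arange(U.shape[1])])

# ...and flip that column of U together with the matching row of Vh.
U_flipped = U * signs
Vh_flipped = Vh * signs[:, None]

# The minus signs cancel, so the decomposition is unchanged:
print(np.allclose((U_flipped * s) @ Vh_flipped, X))   # True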
After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of ambiguity is much more specific: in the source (line 391) for PCA you have:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure. (U, S, V have already been found at this point...). So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. Problem is, the sklearn implementation will get you strong negative loadings to that first principal component.
My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.
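For reference, a bare-bones PCA via SVD with no sign flipping looks roughly like this (a sketch of the approach, not the actual code behind the link above):
import numpy as np

def pca_components(X, n_components=None):
    """Principal axes (rows) and explained variances, with signs exactly as the SVD returns them."""
    Xc = X - X.mean(axis=0)                       # centre the features
    _, s, Vh = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s ** 2 / (len(X) - 1)
    if n_components is not None:
        Vh = Vh[:n_components]
        explained_variance = explained_variance[:n_components]
    return Vh, explained_variance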
With PCA here in 3 dimensions, you basically find, iteratively: 1) the 1-D projection axis that preserves the maximum variance; 2) the maximum-variance-preserving axis perpendicular to the one in 1). The third axis is automatically the one perpendicular to the first two.
The components_ are listed according to the explained variance, so the first one explains the most variance, and so on. Note that, by the definition of the PCA operation, the sign of the projection vector does not matter: let M be your data matrix (in your case of shape (20, 3)), and let v1 be the vector that preserves the maximum variance when the data is projected onto it. If you select -v1 instead of v1, you obtain exactly the same variance (you can check this yourself; see the sketch below). Then, when selecting the second vector, let v2 be the one perpendicular to v1 that preserves the maximum variance; again, selecting -v2 instead of v2 preserves the same amount of variance. v3 can then be selected as either -v3 or v3. The only thing that matters is that v1, v2, v3 constitute an orthonormal basis for the data M. The signs mostly depend on how the algorithm solves the eigenvector problem underlying the PCA operation; eigenvalue decomposition and SVD solutions may differ in sign.
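A quick numerical check of that sign point (toy data; v1 is just the first right singular vector of a centred matrix):
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(20, 3))
M = M - M.mean(axis=0)

_, _, Vh = np.linalg.svd(M, full_matrices=False)
v1 = Vh[0]

# Projecting onto v1 or -v1 preserves exactly the same variance: (-1)^2 = 1.
print(np.var(M @ v1), np.var(M @ -v1))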
This is a short note for those who care about the purpose rather than the math.
Although the sign is opposite for some of the components, that shouldn't be considered a problem. What we actually care about (at least to my understanding) is the directions of the axes. The components are, ultimately, the vectors that identify these axes after transforming the input data with PCA. So no matter which direction each component points in, the new axes that our data lie on will be the same.

How to calculate covariance Matrix with Pandas

I'm trying to figure out how to calculate a covariance matrix with Pandas.
I'm not a data scientist or a finance guy, I'm just a regular dev going a bit out of his league.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(252, 4)), columns=list('ABCD'))
print(df.cov())
So if I do this, I get a 4x4 matrix as output. I find that the numbers are huge, and I was expecting them to be closer to zero. Do I have to calculate the returns before getting the covariance?
Could anyone familiar with this explain it a little, or point me to a good link with an explanation? I couldn't find any "Covariance Matrix For Dummies".
Regards,
Julien
Covariance is a measure of the degree to which returns on two assets (or any two vectors or arrays) move in tandem. A positive covariance means that asset returns move together, while a negative covariance means returns move inversely.
On the other hand we have:
The correlation coefficient is a measure that determines the degree to which two variables' movements are associated. Note that the correlation coefficient measures the linear relationship between two arrays/vectors/assets.
So portfolio managers try to reduce covariance between two assets and keep the correlation coefficient negative to have enough diversification in the portfolio, meaning that a decrease in one asset's return will not cause a decrease in the return of the second asset (that's why we need negative correlation).
Maybe you meant a correlation coefficient close to zero, not covariance.
The fact that you haven't provided a seed for your randomly generated numbers makes reproducing your experiment difficult. However, I tried the code you provided, and the covariance matrix I get is similar in scale.
To understand why the numbers in your cov_matrix are so huge, you should first understand what a covariance matrix is. The covariance matrix is a matrix whose (i, j) element is the covariance between the i-th and j-th elements of a random vector. In particular, the diagonal entries are just the variances of your columns, and integers drawn uniformly from 0-99 have a variance of roughly 833, so values in the hundreds are exactly what you should expect from this data.
A good link you might check is https://en.wikipedia.org/wiki/Covariance_matrix . Understanding the correlation matrix might also help: https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_matrices
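A quick way to see the scale issue on data like yours (this sketch just adds a seed so the numbers are reproducible; it is not meant to change your setup):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(252, 4)), columns=list('ABCD'))

print(df.var())    # diagonal of df.cov(): variances of raw 0-99 integers, around 800
print(df.cov())    # covariances in raw (squared) units -- large numbers are expected
print(df.corr())   # the same information rescaled to [-1, 1] -- these are close to zero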

Don't understand the output of Principal Component Analysis (PCA) in Python

I did a PCA in Python on audio spectrograms and face the following problem: I have a matrix, where each row consists of flattened song features. After applying PCA it's clear to me, that the dimensions are reduced. BUT I can't find those dimensional data in the regular dataset.
import sys
import glob
from scipy.io.wavfile import read
from scipy import signal
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
import pylab
# Read file to get samplerate and numpy array containing the signal
files = glob.glob('../some/*.wav')
song_list = []
for wav in files:
    (fs, x) = read(wav)
    channels = [
        np.array(x[:, 0]),
        np.array(x[:, 1])
    ]
    # Combine channels to make a mono signal out of stereo
    channel = np.mean(channels, axis=0)
    channel = channel[0:1024,]
    # Generate spectrogram
    ## Freqs is the same with different songs, t differs slightly
    Pxx, freqs, t, plot = pylab.specgram(
        channel,
        NFFT=128,
        Fs=44100,
        detrend=pylab.detrend_none,
        window=pylab.window_hanning,
        noverlap=int(128 * 0.5))
    # Magnitude Spectrum to use
    Pxx = Pxx[0:2]
    X_flat = Pxx.flatten()
    song_list.append(X_flat)
song_matrix = np.vstack(song_list)
If I now apply PCA to the song_matrix...
import matplotlib
from matplotlib.mlab import PCA
from sklearn import decomposition
#test = matplotlib.mlab.PCA(song_matrix.T)
pca = decomposition.PCA(n_components=2)
song_matrix_pca = pca.fit_transform(song_matrix.T)
pca.components_ #These components should be most helpful to discriminate between the songs due to their high variance
...the final 2 components are the following:
Final components - two dimensions from 15 wav-files
The problem is that I can't find those two vectors in the original dataset with all its dimensions. What am I doing wrong, or am I misinterpreting the whole thing?
PCA doesn't give you the vectors in your dataset.
From Wikipedia :
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Say you have a column vector V containing ONE flattened spectrogram. PCA will find a matrix M whose columns are orthogonal vectors (think of them as being at right angles to every other column in M).
Multiplying M^T by V gives you a vector T of "scores"; these can be used to determine how much of the variance in the original data each column of M captures, and each successive column of M captures progressively less variance.
Multiplying M'^T (the transpose of the first 2 columns of M) by V produces a 2x1 vector T' representing the "dimension-reduced spectrogram". You could reconstruct an approximation of V by multiplying M' by T'. This would work if you had a matrix of spectrograms, too. Keeping only two principal components produces an extremely lossy compression of your data. A sketch of this is below.
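A small sketch of that reduce/reconstruct step with sklearn (toy "spectrogram" vectors made up here; M' corresponds to the first two rows of pca.components_):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
song_matrix = rng.normal(size=(15, 256))       # 15 songs, 256 flattened spectrogram features each

pca = PCA(n_components=2).fit(song_matrix)
M_prime = pca.components_.T                     # shape (256, 2): the first two principal directions

V = song_matrix[0]                              # one flattened spectrogram
T_prime = M_prime.T @ (V - pca.mean_)           # 2-component score vector ("dimension-reduced spectrogram")
V_approx = pca.mean_ + M_prime @ T_prime        # lossy reconstruction from only 2 components

print(T_prime.shape, np.linalg.norm(V - V_approx))   # (2,) and a large reconstruction error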
But what if you want to add a new song to your dataset? Unless it is very much like the original songs (meaning it introduces little variance to the original data set), there's no reason to think that the vectors of M will describe the new song well. For that matter, even multiplying all the elements of V by a constant would render M useless. PCA is quite data specific, which is why it's not used in image/audio compression.
The good news? You can use a Discrete Cosine Transform to compress your training data. Instead of directions learned from the data, it uses cosines as a fixed descriptive basis, so it doesn't suffer from the data-specific limitation. The DCT is used in JPEG, MP3 and other compression schemes.
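For comparison, a tiny DCT sketch (scipy's dct/idct on a made-up signal; keeping only the first few coefficients is the compression step):
import numpy as np
from scipy.fftpack import dct, idct

rng = np.random.default_rng(0)
signal = np.cos(np.linspace(0, 8 * np.pi, 256)) + 0.1 * rng.normal(size=256)

coeffs = dct(signal, norm='ortho')
coeffs[32:] = 0                                # keep only the 32 lowest-frequency cosines
approx = idct(coeffs, norm='ortho')

print(np.linalg.norm(signal - approx))         # small error, and the cosine basis is fixed, not data specific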

Active Shape Models: matching model points to target points

I have a question regarding Active Shape Models. I am using the paper by T. Cootes (which can be found here).
I have done all of the initial steps (Procrustes Analysis to calculate mean shape, PCA to reduce dimensions) but am stuck on fitting.
This is the situation I am in now: I have calculated the mean shape with points X and have also calculated a new set of points Y that X should move to, to better fit my image.
I am using the following algorithm, which can be found on page 23 of the paper previously linked:
To clarify: x̄ is the mean shape calculated with Procrustes Analysis, and P is the matrix containing the eigenvectors calculated with PCA.
Everything goes well up to step 4. I can calculate the pose parameters and invert the transformation onto the points Y.
However, in step 5, something strange happens. Whatever pose parameters are calculated in step 3 and applied in step 4, step 5 always results in almost exactly the same vector y' with very low values (one of them being 1.17747114e-05, for example). (So whether I calculated a scale of 1/10 or 1000, y' barely changes.)
This results in the algorithm always converging to the same value of b, and thus in the same output shape x, no matter what the input set of target points Y are that I want the model points X to match with.
This sure is not the goal of the algorithm... Could anyone explain this strange behaviour? Somehow, projecting my calculated vector y in step 5 into the "tangent plane" does not take into account any of the changes made in step 4.
Edit: I have done some more reasoning, though I have no explanation or solution yet. If, in step 5, I manually set y' to consist only of zeros, then in step 6, b is equal to the matrix of eigenvectors multiplied with the mean shape. And this results in the same b I always get (since y' is always a vector with very low values).
But these eigenvectors are calculated from the mean shape using PCA... So what's expected is that no change should take place, right?
Something you could check is that your coordinates are scaled properly: the algorithm assumes that all coordinates are scaled so that the mean shape vector has Euclidean norm one. If this is not the case (especially if it is much larger than one), you will get extremely small components for y'. A small sketch of this check is below.
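A sketch of that check and of steps 5-6 as described in the question (helper names are mine; x_mean is the mean shape, P the eigenvector matrix, and y the target points already mapped into the model frame, all as flat arrays):
import numpy as np

def normalised_mean_shape(x_mean):
    # The matching algorithm assumes ||x_mean|| == 1. If it is much larger, the
    # step-5 projection y' = y / (y . x_mean) is shrunk by roughly ||x_mean||**2,
    # so y' barely changes with the pose parameters -- the behaviour described above.
    return x_mean / np.linalg.norm(x_mean)

def update_shape_parameters(x_mean, P, y):
    y_tangent = y / np.dot(y, x_mean)      # step 5: project y into the tangent plane to x_mean
    return P.T @ (y_tangent - x_mean)      # step 6: b = P^T (y' - x_mean)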
