I'm trying to figure out how to calculate a covariance matrix with Pandas.
I'm not a data scientist or a finance guy, I'm just a regular dev going a bit out of his league.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(252, 4)), columns=list('ABCD'))
print(df.cov())
So, if I do this, I get this kind of output:
I find that the numbers are huge, and I was expecting them to be closer to zero. Do I have to calculate the returns before getting the covariance?
Could anyone familiar with this explain it a little bit, or point me to a good link with an explanation? I couldn't find any "Covariance Matrix For Dummies" link.
Regards,
Julien
Covariance is a measure of the degree to which the returns on two assets (or any two vectors or arrays) move in tandem. A positive covariance means that asset returns move together, while a negative covariance means returns move inversely.
On the other hand we have:
The correlation coefficient is a measure that determines the degree to which two variables' movements are associated. Note that the correlation coefficient measures the linear relationship between two arrays/vectors/assets.
So, portfolio managers try to reduce the covariance between two assets and keep the correlation coefficient negative to have enough diversification in the portfolio, meaning that a decrease in one asset's return will not cause a decrease in the return of the second asset (that's why we want negative correlation).
Maybe you meant correlation coefficient close to zero, not covariance.
The fact that you haven't provided a seed for your randomly generated numbers makes the reproducibility of your experiment difficult. However, I tried the code you provided, and the closest covariance matrix I get is this one:
To understand why the numbers in your cov_matrix are so huge, you should first understand what a covariance matrix is. The covariance matrix is a matrix whose element in the (i, j) position is the covariance between the i-th and j-th elements of a random vector.
A good link you might check is https://en.wikipedia.org/wiki/Covariance_matrix . Also, understanding the correlation matrix might help: https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_matrices
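To make that concrete, here is a small sketch (with a fixed seed, so the run is reproducible) comparing the covariance matrix of the raw integers with the correlation matrix:
import numpy as np
import pandas as pd

np.random.seed(0)  # fix the seed so the experiment is reproducible
df = pd.DataFrame(np.random.randint(0, 100, size=(252, 4)), columns=list('ABCD'))

# Covariance scales with the variance of the data (values between 0 and 99),
# so the entries land in the hundreds -- that is expected, not a bug.
print(df.cov())

# The correlation matrix is the normalized version: every entry lies in [-1, 1],
# and for independent random columns the off-diagonal entries are close to zero.
print(df.corr())

# If the columns were prices, you would normally compute returns first,
# e.g. df.pct_change().dropna(), and take the covariance of those.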
This is the code I've found online:
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)   # drop the label column so only the pixel columns remain
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
# find the covariance matrix, which is (A^T * A) / n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
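A small sketch (on synthetic data, not the MNIST file) that makes both points concrete: the tutorial's (A^T * A)/n on standardized data reproduces the correlation matrix, while np.cov() gives the unnormalized covariance:
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 2.0, 3.0], size=(1000, 3))  # synthetic data, one sample per row

Xs = StandardScaler().fit_transform(X)          # subtract the mean, divide by the std of each column
covar_matrix = np.matmul(Xs.T, Xs) / len(Xs)    # the tutorial's formula

print(np.allclose(covar_matrix, np.corrcoef(X, rowvar=False)))  # True: it equals the correlation matrix
print(np.cov(X, rowvar=False))                  # the actual, unnormalized covariance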
Let's build towards the covariance matrix step by step; first, let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is the measure of the joint probability for two random variables. It describes how the two variables change together. Read here.
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with every other feature. It can be calculated as shown below, and the last line is the expression you were confused about; the construction follows the Wikipedia article on covariance matrices. If you have any further queries, leave a comment.
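In symbols (a sketch assuming the data matrix A holds one sample per row and has already been mean-centered, as the StandardScaler step does):
cov_matrix[i, j] = cov(X_i, X_j) = E[(X_i - E[X_i]) * (X_j - E[X_j])]
cov_matrix = (A^T * A) / n,   where n is the number of rows (samples) of A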
I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.
When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?
Update: my (partial) answer below contains some additional info.
Take the following example data:
from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred',
start='2017-01-01', end='2017-02-01').pct_change().dropna())
# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
[-0.43328092 -0.36048659 0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629 0.58614003 0.56194768]
[ 0.43328092 0.36048659 -0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True True True]
[ True True True]
[False False False]]
As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is X = \sum_{i=1}^{r} s_i u_i v_i^\top, with the s_i ordered in decreasing fashion, then you can see that you can change the sign (i.e., "flip") of, say, u_1 and v_1: the minus signs cancel, so the formula still holds.
This shows that the SVD is unique up to a change in sign in pairs of left and right singular vectors.
Since PCA is just an SVD of X (or an eigenvalue decomposition of X^\top X), there is no guarantee that it returns the same result on the same X every time it is performed. Understandably, the scikit-learn implementation wants to avoid this: they guarantee that the left and right singular vectors returned (stored in U and V) are always the same, by imposing the (arbitrary) convention that the largest coefficient of u_i in absolute value is positive.
As you can see by reading the source: first they compute U and V with linalg.svd(). Then, for each left singular vector u_i (i.e., column of U), if its largest element in absolute value is positive, they don't do anything; otherwise, they change u_i to -u_i and the corresponding right singular vector v_i to -v_i. As noted earlier, this does not change the SVD formula, since the minus signs cancel out. However, it guarantees that the U and V returned after this processing are always the same, since the indeterminacy in the sign has been removed.
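Here is a minimal sketch of that convention applied to the manual SVD route (it mirrors, rather than calls, sklearn's internal svd_flip helper): for each singular-vector pair, flip the pair when the largest-magnitude entry of the left singular vector is negative.
import numpy as np

def flip_signs(U, Vh):
    # For each column u_i of U, make its largest-magnitude entry positive and
    # flip the matching row of Vh, so that the product U @ diag(S) @ Vh is unchanged.
    max_abs_rows = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[max_abs_rows, range(U.shape[1])])
    return U * signs, Vh * signs[:, np.newaxis]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)                       # PCA is an SVD of the centered data

U, S, Vh = np.linalg.svd(X, full_matrices=False)
U, Vh = flip_signs(U, Vh)                    # Vh now follows the same sign convention as components_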
After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of ambiguity is much more specific: in the source (line 391) for PCA you have:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure. (U, S, V have already been found at this point...). So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. Problem is, the sklearn implementation will get you strong negative loadings to that first principal component.
My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.
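For illustration only, and not the author's actual code, a bare-bones PCA via SVD with no sign flipping might look roughly like this:
import numpy as np

def pca_via_svd(X):
    # Center the data, take the thin SVD, and keep the rows of Vh as components as-is.
    Xc = X - X.mean(axis=0)
    U, S, Vh = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S ** 2 / (len(X) - 1)
    return Vh, explained_variance

rng = np.random.default_rng(0)
components, explained = pca_via_svd(rng.normal(size=(20, 3)))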
With the PCA here in 3 dimensions, you basically find, iteratively:
1) the 1-D projection axis that preserves the maximum variance;
2) the maximum-variance-preserving axis perpendicular to the one found in 1).
The third axis is automatically the one perpendicular to the first two.
The components_ are listed according to the explained variance, so the first one explains the most variance, and so on. Note that, by the definition of the PCA operation, when you look for the projection vector in the first step (the one that maximizes the preserved variance), the sign of the vector does not matter. Let M be your data matrix (in your case with shape (20, 3)) and let v1 be the vector that preserves the maximum variance when the data is projected onto it. If you select -v1 instead of v1, you obtain the same variance (you can check this with the snippet below). The same goes for the second step: let v2 be the vector perpendicular to v1 that preserves the maximum variance; selecting -v2 instead of v2 preserves the same amount of variance, and v3 can then be selected as either v3 or -v3. The only thing that matters here is that v1, v2, v3 constitute an orthonormal basis for the data M. The signs mostly depend on how the algorithm solves the eigenvector problem underlying the PCA operation; eigenvalue decomposition and SVD solutions may differ in sign.
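A quick numerical check of that claim, on synthetic data of the same shape:
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 3))
M = M - M.mean(axis=0)                       # work with centered data

_, _, Vh = np.linalg.svd(M, full_matrices=False)
v1 = Vh[0]                                   # first principal direction

# Projecting onto v1 or onto -v1 preserves exactly the same variance.
print(np.var(M @ v1), np.var(M @ -v1))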
This is a short note for those who care about the purpose and not the math part at all.
Although the sign is opposite for some of the components, that shouldn't be considered a problem. In fact, what we care about (at least to my understanding) is the axes' directions. The components, ultimately, are vectors that identify these axes after transforming the input data using PCA. Therefore, no matter which direction each component points in, the new axes that our data lie on will be the same.
I am estimating the fundamental matrix and the essential matrix using the built-in functions in OpenCV. I provide input points to the functions by using ORB and a brute-force matcher. These are the problems I am facing:
1. The essential matrix that I compute from the built-in function does not match the one I find from the mathematical computation using the fundamental matrix, E = K.t() * F * K.
2. As I vary the number of points used to compute F and E, their values keep changing. The function uses the RANSAC method. How do I know which value is the correct one?
3. I am also using a built-in function to decompose E and find the correct R and T from the 4 possible solutions. The values of R and T also change with the changing E. More concerning is the fact that the translation direction T changes without a pattern: say it was in the X direction for one value of E; if I change the value of E, it changes to Y or Z. Why is this happening? Has anyone else had the same problem?
How do I resolve this problem? My project involves taking measurements of objects from images.
Any suggestions or help would be welcome!
Both F and E are defined only up to a scale factor. It may help to normalize the matrices, e.g. by dividing by the last element.
RANSAC is a randomized algorithm, so you will get a different result every time. You can test how much it varies by triangulating the points, or by computing the reprojection errors. If the results vary too much, you may want to increase the number of RANSAC trials or decrease the distance threshold, to make sure that RANSAC converges to the correct solution.
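A tiny sketch of the "defined up to scale" point, using the normalization suggested above (the matrices here are made up purely to show the comparison; with real data you would plug in the F or E returned by OpenCV):
import numpy as np

def normalize_up_to_scale(M):
    # Divide by the bottom-right element so matrices that differ only by an
    # overall scale factor become directly comparable.
    return M / M[2, 2]

F1 = np.array([[ 0.0, -0.2,  1.5],
               [ 0.3,  0.0, -2.0],
               [-1.4,  2.1,  1.0]])
F2 = 3.7 * F1                                # same epipolar geometry, different scale

print(np.allclose(normalize_up_to_scale(F1), normalize_up_to_scale(F2)))   # True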
Yes, computing the fundamental matrix gives a different matrix every time, as it is defined only up to a scale factor.
It is a rank-2, 3x3 matrix with 7 degrees of freedom (9 entries, minus 1 for the unknown overall scale, minus 1 for the rank-2 constraint det F = 0).
F33 (3rd row, 3rd column) is commonly fixed to 1 and plays the role of the scale factor.
You may ask why we fix F33 to a constant: the epipolar constraint x_left^T * F * x_right = 0 is a homogeneous equation with infinitely many solutions (any scalar multiple of F satisfies it), so we add a constraint by holding F33 constant.
I have been trying to calculate an autocorrelation function, as defined in statistical mechanics, using numpy. Most of the documentation I found relates to functions like correlate and convolve. However, for a given random variable x these functions just seem to calculate the sum
ACF(dt) = sum_{t=0}^{T} x(t) * x(t+dt)
instead of the average
ACF(dt) = mean[x(t)*x(t+dt)]
so in fact for calculating an autocorrelation function one would need to do something like:
import numpy as np

acf = np.correlate(x, x, mode='full')              # x is the data series
acf_half = acf[acf.size // 2:]                     # keep only the non-negative lags
ldata = len(acf_half)
acf = np.array([v / (ldata - i) for i, v in enumerate(acf_half)])   # divide each lag by its overlap length
Of course we would need to subtract mean(x)**2 from the resulting acf to be correct.
Can anyone confirm that this is correct?
Generally speaking, the autocorrelation, correlation, etc. is the sum (integral). Sometimes it is normalized, but not averaged in the sense as you've written above. This is because they are defined in terms of the mathematical convolution operation, which is simply the integral that you've written as a sum above.
The brackets at the stat mech page indicate a thermal average, which is an ensemble or time average over the 'experiment' taking place many times at many different states at some temperature. This (the finite temperature) causes the fluctuations that give rise to the 'statistical' nature of the problem, and cause the decay of the correlation (loss of long range order). This simply means that you should find the autocorrelation of several datasets, and average those together, but do not take the mean of the function.
As far as I can tell, your code is attempting to weight the correlation at lag dt by the length of the overlap at that lag, but I do not believe that this is correct.
With respect to the subtraction of <s>^2: that's in the case of the spin model, where <s> would be the mean spin (magnetization), so I believe you are correct that you should use mean(x)**2.
As a side-note, I would suggest using mode='same' instead of 'full' so that the domain of your correlation matches the domain of your input without having to look at just one-half of the output (here the output is symmetric, so it doesn't really make a difference).
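Putting the above together, a minimal sketch of the ensemble-averaging idea (on synthetic white-noise data, using mode='same' as suggested):
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_samples = 100, 512

acfs = []
for _ in range(n_datasets):
    x = rng.normal(size=n_samples)                   # one realization of the 'experiment'
    acfs.append(np.correlate(x, x, mode='same'))     # per-dataset autocorrelation (a sum, not a mean)

acf_ensemble = np.mean(acfs, axis=0)                 # average over the ensemble, not over lags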
I'm trying to calculate the Pearson correlation coefficient of two variables, to determine whether there is a relationship between the number of postal codes and a range of distances. So I want to see if the number of postal codes increases or decreases as the distance range changes.
I'll have one list which will count the number of postal codes within a distance range and the other list will have the actual ranges.
Is it OK to have a list that contains a range of distances? Or would it be better to have a list like [50, 100, 500, 1000], where each element represents a range up to that amount? So, for example, the list represents up to 50 km, then from 50 km to 100 km, and so on.
Use scipy:
scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters:
x : 1D array
y : 1D array, the same length as x
Returns:
(Pearson's correlation coefficient, 2-tailed p-value)
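A minimal example in the question's setting (the counts here are made up, purely to show the call):
from scipy import stats

distances = [50, 100, 500, 1000]            # upper bound of each distance range, in km
postal_code_counts = [12, 40, 180, 420]     # hypothetical counts of postal codes per range

r, p_value = stats.pearsonr(distances, postal_code_counts)
print(r, p_value)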
You can also use numpy:
numpy.corrcoef(x, y)
which would give you a correlation matrix that looks like:
[[1 correlation(x, y)]
[correlation(y, x) 1]]
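For example, with the same made-up data as above:
import numpy as np

distances = [50, 100, 500, 1000]
postal_code_counts = [12, 40, 180, 420]

print(np.corrcoef(distances, postal_code_counts))   # 2x2 matrix; the off-diagonal entries are r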
try this:
val=Top15[['Energy Supply per Capita','Citable docs per Capita']].rank().corr(method='pearson')
In Python 3.10, a correlation() function was added to the statistics module of the Python standard library. It can be used directly by importing the statistics module:
import statistics

# words and views are two equal-length sequences of numbers
statistics.correlation(words, views)