I'm trying to calculate the portfolio variance from a covariance matrix and the proportion of each stock.
Here's my code:
for num in range(9):
    result_table2.iloc[num]['Expected Return'] = arith_mean.dot(result_table1.iloc[num])
for num in range(9):
    result_table2.iloc[num]['Standard Deviation'] = (result_table1.iloc[num])@V.values
The first loop worked and the second one failed.
result_table1:
result_table2:
covariance matrix:
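For reference, the quantity the question is after is the quadratic form w' V w for each row of weights, with the standard deviation as its square root. Below is a minimal sketch in plain NumPy, assuming each row of the weights table holds one portfolio's proportions and V is the covariance matrix of the stocks' returns. The numbers and the names `weights`, `variances` and `std_devs` are hypothetical, not the asker's data.

```python
import numpy as np

# Hypothetical data: 3 portfolios (rows) holding 2 stocks (columns).
weights = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])

# Hypothetical covariance matrix of the two stocks' returns.
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])

# Portfolio variance is the quadratic form w @ V @ w, evaluated once per row.
variances = np.einsum("ij,jk,ik->i", weights, V, weights)

# Standard deviation is the square root of the variance.
std_devs = np.sqrt(variances)

print(variances)
print(std_devs)
```

Each entry of `variances` is a scalar, which is what a single cell of the results table can hold. A plain matrix product of a weight row with V yields a vector instead.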
I'm working on an assignment where I am tasked to implement PCA in Python for an online course. Unfortunately, when I try to run a comparison (provided by the course) between my implementation and SKLearn's, my results appear to differ too greatly.
After many hours of review, I am still unsure where it is going wrong. If someone could take a look and determine what step I have coded or interpreted incorrectly, I would greatly appreciate it.
def normalize(X):
    """
    Normalize the given dataset X to have zero mean.
    Args:
        X: ndarray, dataset of shape (N,D)
    Returns:
        (Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
        with mean 0; mean is the sample mean of the dataset.
    Note:
        You will encounter dimensions where the standard deviation is zero.
        For those ones, the process of normalization results in normalized data with NaN entries.
        We can handle this by setting the std = 1 for those dimensions when doing normalization.
    """
    # YOUR CODE HERE
    ### Uncomment and modify the code below
    mu = np.mean(X, axis = 0) # Setting axis = 0 will compute means column-wise. Setting it to 1 will compute the mean across rows.
    std = np.std(X, axis = 0) # Computing the std dev column wise using axis = 0.
    std_filled = std.copy()
    std_filled[std == 0] = 1
    # Compute the normalized data as Xbar
    Xbar = (X - mu)/std_filled
    return Xbar, mu, # std_filled
def eig(S):
    """
    Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors
    Note:
        the eigenvals and eigenvecs should be sorted in descending
        order of the eigen values
    """
    # YOUR CODE HERE
    # Uncomment and modify the code below
    # Compute the eigenvalues and eigenvectors
    # You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
    eigvals, eigvecs = np.linalg.eig(S)
    # The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
    # We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
    # of eigvals that will sort eigvals in ascending order and then find the descending order via [::-1], which reverses the indices
    sort_indices = np.argsort(eigvals)[::-1]
    # Notice that we are sorting the columns (not rows) of eigvecs since the columns represent the eigenvectors.
    return eigvals[sort_indices], eigvecs[:, sort_indices]
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by the columns of `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    Returns:
        P: the projection matrix
    """
    # YOUR CODE HERE
    P = B @ (np.linalg.inv(B.T @ B)) @ B.T
    return P
def select_components(eig_vals, eig_vecs, num_components):
    """
    Selects the n components desired for projecting the data upon.
    Args:
        eig_vals: The eigenvalues sorted in descending order of magnitude.
        eig_vecs: The eigenvectors sorted in order relative to that of the eigenvalues.
        num_components: the number of principal components to use.
    Returns:
        The number of desired components to keep for projection of the data upon.
    """
    principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, range(num_components)]
    return principal_vals, principal_components
def PCA(X, num_components):
    """
    Projects normalized data onto the 'n' desired principal components.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        the reconstructed data, the sample mean of the X, principal values
        and principal components
    """
    # Normalize to have mean 0 and variance 1.
    Z, mean_vec = normalize(X)
    # Calculate the covariance matrix
    S = np.cov(Z, rowvar=False, bias=True) # Set rowvar = False to treat columns as variables. Set bias = True to ensure normalization is done with N and not N-1
    # Calculate the (unit) eigenvectors and eigenvalues of S. Sort them in descending order of importance relative to the magnitude of the eigenvalues.
    eig_vals, eig_vecs = eig(S)
    # Keep only the n largest principal components of the sorted unit eigenvectors.
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Compute the projection matrix using only the n largest principal components of the sorted unit eigenvectors, where n = num_components.
    #P = projection_matrix(eig_vecs[:, :num_components])
    P = projection_matrix(principal_components)
    # Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
    X_reconst = (P @ X.T).T
    return X_reconst, mean_vec, principal_vals, principal_components
And here is the test case I'm supposed to pass:
random = np.random.RandomState(0)
X = random.randn(10, 5)
from sklearn.decomposition import PCA as SKPCA
for num_component in range(1, 4):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver="full")
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
    reconst, _, _, _ = PCA(X, num_component)
    # The difference in the result should be very small (<10^-20)
    print(
        "difference in reconstruction for num_components = {}: {}".format(
            num_component, np.square(reconst - sklearn_reconst).sum()
        )
    )
    np.testing.assert_allclose(reconst, sklearn_reconst)
As far as I can tell, there are a few things wrong with your code.
Your projection matrix is wrong.
If the eigenvectors of your covariance matrix are the columns of B, with dimension D x M where M is the number of components you select and D is the dimension of the original data, then the projection matrix is just B @ B.T.
In a standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be trying to do an approximation of a whitened PCA (ZCA), but even then it looks wrong.
As a quick test, you can compute the normalized data without dividing by the standard deviation, and when you compute the covariance matrix, set bias=False.
You should also subtract the mean from the data before multiplying it by the projection operator, and add it back after that, i.e.,
X_reconst = (P @ (X - mean_vec).T).T + mean_vec.
PCA is essentially just a change of basis, followed by discarding the coordinates that correspond to directions with low variance. The eigenvectors of the covariance matrix form the new orthogonal basis, and the eigenvalues tell you the variance of the data along the direction of the corresponding eigenvectors. P = B @ B.T is just the change of basis to the new basis B (discarding some coordinates), followed by a change back to the original basis.
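The steps described in this answer can be put together as a short sketch. It is one reading of the answer, not the course's reference solution: center only, take the covariance, keep the top eigenvectors as B, project with B @ B.T, and add the mean back. The function name `pca_reconstruct` is hypothetical, and `np.linalg.eigh` is used here because the covariance matrix is symmetric.

```python
import numpy as np

def pca_reconstruct(X, num_components):
    """Project X onto its top principal components and map back (sketch)."""
    mean_vec = X.mean(axis=0)
    Xc = X - mean_vec                        # center only; no scaling by the std
    S = np.cov(Xc, rowvar=False)             # sample covariance (bias=False is the default)
    eigvals, eigvecs = np.linalg.eigh(S)     # symmetric input; eigenvalues come back ascending
    order = np.argsort(eigvals)[::-1]        # indices for descending eigenvalues
    B = eigvecs[:, order[:num_components]]   # (D, M) basis with orthonormal columns
    P = B @ B.T                              # projection onto the span of B
    X_reconst = (P @ Xc.T).T + mean_vec      # project centered data, then add the mean back
    return X_reconst, P

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
X_reconst, P = pca_reconstruct(X, 2)
print(X_reconst.shape)
```

Because `eigh` returns orthonormal eigenvectors, B.T @ B is the identity here, which is why no inverse appears in P.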
Edit
I'm curious to know which online course teaches people to implement PCA this way.
I am implementing an algorithm for k-means clustering. So far it works using Euclidean distances. Switching out Euclidean distances for Mahalanobis distances fails to cluster correctly.
For some reason, the Mahalanobis distance is negative at times. Turns out the covariance matrix has negative eigenvalues, which apparently is not good for covariance matrices.
Here are the functions I'm using:
#takes in data point x, centroid m, covariance matrix sigma
def mahalanobis(x, m, sigma):
    return np.dot(np.dot(np.transpose(x - m), np.linalg.inv(sigma)), x - m)
#takes in centroid m and data (iris in 2d, dimensions: 2x150)
def covar_matrix(m, data):
    d, n = data.shape
    R = np.zeros((d,d))
    for i in range(n):
        R += np.dot(data[:,i:i+1] , np.transpose(data[:,i:i+1]))
    R /= n
    return R - np.dot(m, np.transpose(m))
    #autocorrelation_matrix - centroid*centroid'
How I implemented the algorithm:
1. Set k
2. Randomly choose k centroids
3. Calculate covar_matrix() of each centroid
4. Calculate mahalanobis() of each data point to each centroid and add to closest cluster
5. Start looking for new centroids; for each data point* in each cluster, calculate the sum of mahalanobis() to every other point in the cluster; point with minimum sum becomes new centroid
6. Repeat 3-5 until old centroid and new centroids are the same
*Calculate covar_matrix() with this point
I expect a positive Mahalanobis distance and a positive definite covariance matrix (the latter will fix the former I hope).
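One possible cause, offered as an assumption rather than a confirmed diagnosis: the expression R - m m' equals the covariance only when m is the mean of the data passed in. For an arbitrary centroid it can have negative eigenvalues. The sketch below contrasts it with the average outer product of deviations from m, which is positive semidefinite by construction. The names `covar_about` and `mahalanobis_sq`, the random stand-in data and the centroid value are all hypothetical.

```python
import numpy as np

def covar_about(m, data):
    """Average outer product of deviations from m; data is (d, n), m is (d, 1)."""
    diff = data - m                      # broadcast m across the n columns
    return diff @ diff.T / data.shape[1]

def mahalanobis_sq(x, m, sigma):
    """Squared Mahalanobis distance between column vectors x and m."""
    diff = x - m
    return (diff.T @ np.linalg.solve(sigma, diff)).item()

rng = np.random.RandomState(0)
data = rng.randn(2, 150)                 # random stand-in for the 2 x 150 data
m = np.array([[3.0], [3.0]])             # hypothetical centroid far from the data mean

sigma = covar_about(m, data)
naive = data @ data.T / data.shape[1] - m @ m.T   # formula used in the question

print(np.linalg.eigvalsh(naive))         # contains a negative eigenvalue for this m
print(np.linalg.eigvalsh(sigma))         # all non-negative
print(mahalanobis_sq(data[:, 10:11], m, sigma))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice of this sketch. It gives the same quantity as the `np.linalg.inv` version in the question.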
I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
def classify(inputVector, trainingData,labels,k):
    dataSetSize=trainingData.shape[0]
    diffMat=tile(inputVector,(dataSetSize,1))-trainingData
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}  # vote counter; it was never initialized in the snippet as posted
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])+(A[1]-B[1]))^1/2? I don't follow how tile influences it, specifically how the matrix is built from (dataSetSize,1) and the training data.
I hope the following will explain the working.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Using this function, you create a matrix from the input vector with the same shape as the training data. Subtracting the training data from this matrix gives the element-wise differences, e.g. test[0] - train[0].
Then you square each element with diffMat**2 and sum along axis = 1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html). This gives expressions like (test[0] - train[0])^2 + (test[1] - train[1])^2.
Finally, taking sqDistances**0.5 gives the Euclidean distance.
To calculate Euclidean distance, this might be helpful
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
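The explanation above can be traced step by step with the four training points from createDataSet(). This is a small sketch: the input vector [0, 0] and the snake_case variable names are chosen here purely for illustration.

```python
import numpy as np

input_vector = np.array([0.0, 0.0])
training_data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

# tile repeats the input vector once per training row, giving a (4, 2) matrix.
tiled = np.tile(input_vector, (training_data.shape[0], 1))

# Element-wise differences: row i holds input_vector - training_data[i].
diff_mat = tiled - training_data

# Square every difference, then add up each row: (a0 - b0)^2 + (a1 - b1)^2.
sq_distances = (diff_mat ** 2).sum(axis=1)

# The square root of each row sum is the Euclidean distance to that training point.
distances = sq_distances ** 0.5

# NumPy broadcasting gives the same result without an explicit tile.
distances_broadcast = np.sqrt(((input_vector - training_data) ** 2).sum(axis=1))

print(tiled.shape)
print(distances)
```

The tile call is only there to make the shapes match explicitly. Broadcasting performs the same repetition implicitly.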
I have a co-occurrence symmetric matrix (1877 x 1877).
I treat columns as features and compute the cosine distance between them. Before that, I scale the matrix (center to the mean and component wise scale to unit variance).
from sklearn import preprocessing
from sklearn.metrics import pairwise_distances
X_scaled = preprocessing.scale(mymatrix)
dist = pairwise_distances(X_scaled,metric="cosine")
My questions:
Should I scale the co-occurrence data before computing the cosine distance/similarity? The figure below shows the histograms of the actual matrix. The x-axis represents co-occurrence values in the matrix, and the y-axis indicates the number of times they appear in the matrix.
The code above returns distances > 1 and distances < 0. How can I ensure that the cosine distance values lie between 0 and 1? Should I apply a min-max scaler to the dist matrix?
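For context on the range: cosine distance is 1 minus cosine similarity, so in general it lies in [0, 2]. With non-negative data the similarities are non-negative and the distance stays within [0, 1]. After centering, rows can point in opposite directions, which pushes distances above 1. Slightly negative values are typically floating-point round-off. The sketch below shows this in plain NumPy. The helper `cosine_distance_matrix` and the tiny count matrix are hypothetical and do not reproduce sklearn's implementation.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance between the rows of X, clipped to [0, 2]."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms                       # assumes no all-zero rows
    dist = 1.0 - Xn @ Xn.T               # 1 - cosine similarity
    return np.clip(dist, 0.0, 2.0)       # remove round-off below 0 or above 2

# Hypothetical non-negative co-occurrence counts (4 items, 3 features).
counts = np.array([[5.0, 0.0, 1.0],
                   [4.0, 1.0, 0.0],
                   [0.0, 6.0, 2.0],
                   [1.0, 5.0, 3.0]])

raw = cosine_distance_matrix(counts)
centered = cosine_distance_matrix(counts - counts.mean(axis=0))

print(raw.max())        # stays within [0, 1] for non-negative data
print(centered.max())   # exceeds 1 once the columns are centered
```

Clipping only removes round-off. Whether to center or scale before computing the distance is a separate modeling decision that this sketch does not settle.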
I am trying to calculate the PCA of a matrix.
Sometimes the resulting eigenvalues/eigenvectors are complex, so when I try to project a point onto a lower-dimensional plane by multiplying the eigenvector matrix with the point coordinates, I get the following warning:
ComplexWarning: Casting complex values to real discards the imaginary part
It occurs in this line of code: np.dot(self.u[0:components,:],vector)
The whole code I used to calculate the PCA:
import numpy as np
import numpy.linalg as la
class PCA:
    def __init__(self,inputData):
        data = inputData.copy()
        #m = no of points
        #n = no of features per point
        self.m = data.shape[0]
        self.n = data.shape[1]
        #mean center the data
        data -= np.mean(data,axis=0)
        # calculate the covariance matrix
        c = np.cov(data, rowvar=0)
        # get the eigenvalues/eigenvectors of c
        eval, evec = la.eig(c)
        # u = eigen vectors (transposed)
        self.u = evec.transpose()
    def getPCA(self,vector,components):
        if components > self.n:
            raise Exception("components must be > 0 and <= n")
        return np.dot(self.u[0:components,:],vector)
The covariance matrix is symmetric, and thus has real eigenvalues. You may see a small imaginary part in some eigenvalues due to numerical error. The imaginary parts can generally be ignored.
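Since the covariance matrix is symmetric, one way to avoid the complex output altogether is `np.linalg.eigh`, which is intended for symmetric (Hermitian) matrices and returns real eigenvalues in ascending order. The sketch below uses random stand-in data and adds an explicit descending sort, which the code in the question does not do. Treat it as one possible arrangement, not a drop-in fix.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(50, 4)                 # stand-in: 50 points, 4 features
data -= data.mean(axis=0)               # mean-center the data
c = np.cov(data, rowvar=False)          # symmetric covariance matrix

# eigh is meant for symmetric matrices: real eigenvalues, ascending order.
evals, evecs = np.linalg.eigh(c)

# Reorder so the largest-variance directions come first.
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

u = evecs.T                             # rows are principal directions, as in the question
projected = u[:2, :] @ data[0]          # project one point onto the top 2 components

print(evals.dtype, projected.dtype)
```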
You can use scikits python library for PCA, this is an example of how to use it