I've been asked to use my own PCA implementation to reduce data to two dimensions for a KNN assignment. My PCA code creates an array with the eigenvectors called PCevecs.
def __PCA(data):
    # Normalize data
    data_cent = data - np.mean(data)
    # calculate covariance
    covarianceMatrix = np.cov(data_cent, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sorting the eigenvectors and eigenvalues:
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
The assignment then transforms the training data using the PCA output. The returned PCevecs has shape (88, 88), as shown by print(PCevecs.shape), while the training data has shape (88, 4).
np.dot(trainingFeatures, PCevecs[:, 0:2])
When the code runs I get the error message "ValueError: shapes (88,4) and (88,2) not aligned: 4 (dim 1) != 88 (dim 0)". I can see that the arrays don't match, but I can't see what I've done wrong in the PCA implementation. I've looked at similar problems on Stack Overflow, but I haven't seen anyone sorting the eigenvectors and eigenvalues the way I do.
(EDITED with additional info from the comments)
While the PCA implementation is OK in general, you may want to either compute it on the transposed data, or make sure that you tell np.cov() which axis holds your features via the rowvar parameter.
The following would work as you expect:
import numpy as np
def __PCA_fixed(data, rowvar=False):
    # Normalize data
    data_cent = data - np.mean(data)
    # calculate covariance (pass `rowvar` to `np.cov()`)
    covarianceMatrix = np.cov(data_cent, rowvar=rowvar, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sorting the eigenvectors and eigenvalues:
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
Testing it out with some random numbers:
data = np.random.randint(0, 100, (100, 10))
PCevals, PCevecs = __PCA_fixed(data)
print(PCevecs.shape)
# (10, 10)
Also note that, in more general terms, the singular value decomposition (np.linalg.svd() in NumPy) might be a better approach for principal component analysis; it has a simple relationship to the eigendecomposition used here, modulo transposition.
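For illustration, a minimal sketch of that SVD route (the random data and the names PCevecs_svd / PCevals_svd are purely illustrative; it reproduces the eigendecomposition result up to the signs of the components):
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((88, 4))               # stand-in for the (88, 4) training features

# Center each feature (column)
data_cent = data - data.mean(axis=0)

# SVD of the centered data: the rows of vt are the principal directions
u, s, vt = np.linalg.svd(data_cent, full_matrices=False)

PCevecs_svd = vt.T                       # columns are the principal components
PCevals_svd = s**2 / len(data_cent)      # eigenvalues of the (biased) covariance

# Project onto the first two components, as in the assignment
projected = data_cent @ PCevecs_svd[:, :2]
print(projected.shape)                   # (88, 2)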
As a general coding-style note, it may be a good idea to follow the advice of PEP 8, much of which can be checked automatically with a tool such as autopep8.
Related
I am working on an exercise in remote sensing and have some trouble performing a PCA. I have 103 different spectral bands, meaning the shape of my array is (m, n, 103). I want to reduce this to ten components.
I tried to use PCA from sklearn.decomposition, however my dimensions are wrong, and I don't think I understand how to use the function.
pca = PCA(n_components = 10)
test = feature_scaling(data)
pca_components = pca.fit_transform(test)
Giving me the following error: *** ValueError: Found array with dim 3. Estimator expected <= 2.
I was wondering if I maybe need to extract each of the spectral bands and perform a flattening, but I am not sure what my input to the PCA is supposed to look like.
Does anyone know what I can do?
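A minimal sketch of that flattening idea (the shapes are illustrative, and sklearn's StandardScaler stands in for the feature_scaling function used above):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

m, n, bands = 50, 60, 103               # hypothetical image dimensions
data = np.random.rand(m, n, bands)      # stand-in for the hyperspectral cube

# Flatten the spatial dimensions: each pixel becomes one sample with 103 features
flat = data.reshape(-1, bands)          # shape (m*n, 103)
flat = StandardScaler().fit_transform(flat)

pca = PCA(n_components=10)
components = pca.fit_transform(flat)    # shape (m*n, 10)

# Restore the spatial layout if an image-like result is needed
reduced = components.reshape(m, n, 10)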
I'm confused about sklearn's PCA (here is the documentation) and its relation to singular value decomposition (SVD).
In Wikipedia we have,
The full principal components decomposition of X can, therefore, be given as $T = XW$,
where W is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later once it explains the relationship with SVD, we have:
$X = U \Sigma W^T$
So I assume that the matrix W embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that using the transform method of sklearn's PCA class should give the same result as multiplying the observation matrix by W. However, I checked them and they don't match.
Is there anything wrong that I'm missing or there's a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(
There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug, merely a stability and reproducibility enhancement.
You have pretty much nailed the exact implementation of PCA; however, in order to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the sign indeterminacy of the SVD, i.e. the SVD does not have a unique solution. This can easily be seen from your equation as well: setting U_s = -U and W_s = -W, then U_s and W_s also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of U and W: if we reverse the signs of just the k-th column of U and of W, the equality still holds. You can read more about this issue, for example, here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
sklearn's PCA implementation deals with this problem by enforcing that, for each component, the loading with the largest absolute value is positive; specifically, the helper sklearn.utils.extmath.svd_flip is used. This way, no matter which signs the non-deterministic np.linalg.svd returns, the loadings in absolute value stay the same, i.e. the signs of the matrices are always the same.
Thus, in order for your code to give the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)
To manually perform a transformation, I have to matrix multiply the original data (de-meaned) by PCA().fit(data).components_.T (the transpose of the components) to get the results to match PCA().fit_transform(data).
Does anyone know if this is intentional, and if so why? Looking at the comments in the source code there is a note that it should be X_new = X*V, where V is the components, but I find that it is X_new = XV^T.
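A quick numeric check of that observation (a sketch with made-up data; it relies on components_ having shape (n_components, n_features), i.e. one component per row, which is why the transpose is needed):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((20, 5))

pca = PCA().fit(X)
X_centered = X - pca.mean_

# components_ stores one component per row, so projecting requires the transpose
manual = X_centered @ pca.components_.T
builtin = pca.transform(X)

print(np.allclose(manual, builtin))     # expected True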
This is the code I found online:
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
# find the covariance matrix, which is: (A^T * A)/n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T , sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot product is transposed, not the second. The E[...] means expected value, which in practice is estimated by the sample mean. When you perform StandardScaler().fit_transform(data), you are subtracting out the mean of the data, which is why you don't do so explicitly in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
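A quick numeric check of the point above (a sketch; the variable names are illustrative):
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))

# Standardize: zero mean and unit variance per feature (what StandardScaler does)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# The tutorial's formula: (X^T X) / n, with samples as rows
manual_cov = X.T @ X / len(X)

# NumPy's biased covariance estimate for comparison
builtin_cov = np.cov(X, rowvar=False, bias=True)

print(np.allclose(manual_cov, builtin_cov))  # expected True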
Let's build towards the covariance matrix step by step. First, let's define variance.
The variance of a random variable X is a measure of how much the values of its distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables; it describes how the two variables change together. For two random variables X and Y it is defined as $\operatorname{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$.
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. Its entries are $C_{ij} = E[(X_i - E[X_i])(X_j - E[X_j])]$, and for n samples stored as the rows of a mean-centered data matrix X this is estimated as $C = \frac{1}{n} X^T X$, which is exactly the equation at the bottom of the snippet you were confused about. If you have any further queries, comment below.
(Formula source: Wikipedia.)
I am interested in taking a look at the eigenvalues after performing multidimensional scaling. What function can do that? I looked at the documentation, but it does not mention eigenvalues at all.
Here is a code sample:
mds = manifold.MDS(n_components=100, max_iter=3000, eps=1e-9,
random_state=seed, dissimilarity="precomputed", n_jobs=1)
results = mds.fit(wordDissimilarityMatrix)
# need a way to get the Eigenvalues
I also couldn't find it from reading the documentation. I suspect they aren't performing classical MDS, but something more sophisticated:
“Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997)
“Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
“Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika, 29, (1964)
If you're looking for eigenvalues per classical MDS then it's not hard to get them yourself. The steps are:
Get your distance matrix. Then square it.
Perform double-centering.
Find eigenvalues and eigenvectors
Select top k eigenvalues.
Your i-th principal component is sqrt(eigenvalue_i) * eigenvector_i.
See below for code example:
import numpy as np
import numpy.linalg as la
import pandas as pd
# get some distance matrix
df = pd.read_csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")
A = df.values.T[1:].astype(float)
# square it
A = A**2
# centering matrix: J_c = I - (1/n) * ones(n, n)
n = A.shape[0]
J_c = np.eye(n) - np.ones((n, n)) / n
# perform double centering
B = -0.5*(J_c.dot(A)).dot(J_c)
# find eigenvalues and eigenvectors (B is symmetric, so eigh gives real results)
eigen_val, eigen_vec = la.eigh(B)
eigen_vec = eigen_vec.T
# sort in descending order of eigenvalue so the top components come first
order = np.argsort(eigen_val)[::-1]
eigen_val, eigen_vec = eigen_val[order], eigen_vec[order]
# select top 2 dimensions (for example)
PC1 = np.sqrt(eigen_val[0])*eigen_vec[0]
PC2 = np.sqrt(eigen_val[1])*eigen_vec[1]
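The eigenvalues the question asks about are then simply the entries of eigen_val (in descending order after the sorting step above), so they can be inspected directly, e.g. with print(eigen_val[:10]).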