To manually perform a transformation, I have to matrix-multiply the original (de-meaned) data by PCA().fit(data).components_.T (the transpose of the components) to get a result that matches PCA().fit_transform(data).
Does anyone know if this is intentional, and if so, why? Looking at the comments in the source code, there is a note that it should be X_new = X * V, where V is the components, but I find that it is X_new = X * V^T.
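A minimal sketch of what I mean, with made-up data (the shapes are arbitrary):

import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(20, 5)                      # made-up data, shape (n_samples, n_features)
pca = PCA().fit(data)

# de-mean, then multiply by the transpose of the components
manual = (data - data.mean(axis=0)) @ pca.components_.T
print(np.allclose(manual, pca.transform(data)))   # True, same as fit_transform on the same data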
I've been tasked with implementing my own PCA code to reduce data to a 2D field for a KNN assignment. My PCA code creates an array with the eigenvectors, called PCevecs.
def __PCA(data):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance
    covarianceMatrix = np.cov(data_cent, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
The assignment transforms the training data using the PCA. The returned PCevecs has shape (88, 88), as given by calling print(PCevecs.shape). The shape of the training data is (88, 4).
np.dot(trainingFeatures, PCevecs[:, 0:2])
When the code runs I get the error message "ValueError: shapes (88,4) and (88,2) not aligned: 4 (dim 1) != 88 (dim 0)". I can see that the arrays don't match, but I can't see what I've done wrong in the PCA implementation. I've looked at similar problems on Stack Overflow, but I haven't seen anyone sorting the eigenvectors and eigenvalues the same way.
(EDITED with additional info from the comments)
While the PCA implementation is OK in general, you may want to either compute it on the transposed data, or make sure that you tell np.cov() which axis holds your dimensions via the rowvar parameter.
The following would work as you expect:
import numpy as np

def __PCA_fixed(data, rowvar=False):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance (pass `rowvar` to `np.cov()`)
    covarianceMatrix = np.cov(data_cent, rowvar=rowvar, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
Testing it out with some random numbers:
data = np.random.randint(0, 100, (100, 10))
PCevals, PCevecs = __PCA_fixed(data)
print(PCevecs.shape)
# (10, 10)
Also note that, in more general terms, the singular value decomposition (np.linalg.svd() in NumPy) may be a better approach for principal component analysis; it has a simple relationship with the eigenvalue decomposition you use, as shown in the sketch below.
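For illustration, a small sketch with random data of the same shape as yours (samples in rows): the right singular vectors of the centered data are the eigenvectors of its covariance matrix, and the squared singular values divided by n give the eigenvalues.

import numpy as np

data = np.random.rand(88, 4)
data_cent = data - data.mean(axis=0)

u, s, vt = np.linalg.svd(data_cent, full_matrices=False)
evals_svd = s**2 / len(data_cent)        # eigenvalues, largest first
evecs_svd = vt.T                         # eigenvectors as columns

evals_eigh, evecs_eigh = np.linalg.eigh(np.cov(data_cent, rowvar=False, bias=True))
print(np.allclose(evals_svd, evals_eigh[::-1]))                     # True
print(np.allclose(np.abs(evecs_svd), np.abs(evecs_eigh[:, ::-1])))  # True (eigenvectors match up to sign)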
As a general coding-style note, it may be a good idea to follow the advice of PEP 8, much of which can be checked automatically by tools such as autopep8.
I'm confused by sklearn's PCA (here is the documentation) and its relation to singular value decomposition (SVD).
In Wikipedia we have,
The full principal components decomposition of X can therefore be given as $T = XW$,
where W is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later, when it explains the relationship with SVD, we have:
$X = U \Sigma W^T$
So I assume that the matrix W embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that using the transform method of sklearn's PCA class should give the same result as multiplying the observation matrix by W. However, I checked them and they don't match.
Is there anything wrong that I'm missing or there's a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on the Wikipedia formulas: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(
There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug, merely a stability and reproducibility enhancement.
You have almost nailed the exact implementation of PCA; however, in order to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the fact that the SVD is not unique, i.e. it does not have a single solution. That can easily be seen from your equation as well: setting U_s = -U and W_s = -W, then U_s and W_s also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of U and W: if we reverse the signs of the k-th column of U and the k-th column of W, the equality still holds. You can read more about this issue, for example, here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
The sklearn implementation of PCA deals with this problem by enforcing that the loading values that are highest in absolute value are always positive; specifically, the method sklearn.utils.extmath.svd_flip is used. This way, no matter which signs the resulting vectors get from the sign-indeterminate np.linalg.svd, the absolute loading values remain the same, i.e. the signs of the matrices are made consistent.
Thus in order for your code to have the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on the Wikipedia formulas: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)
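Alternatively, the same sign fix can be applied with the helper that the sklearn implementation uses. This is only a sketch: the exact sign convention PCA follows may vary across scikit-learn versions, so check against your installed version.

from sklearn.utils.extmath import svd_flip

u2, s2, vh2 = np.linalg.svd(x, full_matrices=False)
u2, vh2 = svd_flip(u2, vh2, u_based_decision=True)  # flip signs based on the columns of u
print(np.abs(x @ vh2.T - t_pca).max())              # pretty small value :)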
Let us say I have a df of 20 columns and 10K rows. Since the data has a wide range of values, I use the following code to normalize it:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled now contains both negative and positive values.
Now if I pass this normalized data frame to spectral clustering as follows,
spectral = SpectralClustering(n_clusters=k,
                              n_init=30,
                              affinity='nearest_neighbors',
                              random_state=cluster_seed,
                              assign_labels='kmeans')
clusters = spectral.fit_predict(df_scaled)
I will get the cluster labels.
Here is what confuses me: the official doc says that
"Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."
Questions: Do the normalized negative values of df_scaled affect the clustering result?
OR
Does it depend on the affinity computation I am using, e.g. precomputed or rbf? If so, how can I use the normalized input values with SpectralClustering?
My understanding is that normalizing could improve the clustering results and speed up the computation.
I appreciate any help or tips on how I can approach the problem.
You are passing a data matrix, not a precomputed affinity matrix.
The "nearest neighbors" uses a binary kernel, which is non-negative.
To better understand the inner workings, please have a look at the source code.
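To see this concretely, here is a sketch with made-up data standing in for your df: after fitting, the estimator exposes affinity_matrix_, and with affinity='nearest_neighbors' it is a symmetrized k-nearest-neighbors connectivity graph, so it stays non-negative even though the scaled features contain negative values.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering

X = StandardScaler().fit_transform(np.random.rand(100, 5))
spectral = SpectralClustering(n_clusters=3, n_init=30,
                              affinity='nearest_neighbors',
                              assign_labels='kmeans', random_state=0)
clusters = spectral.fit_predict(X)

print(X.min() < 0)                           # True: the scaled data contains negative values
print(spectral.affinity_matrix_.min() >= 0)  # True: the affinity actually used is non-negative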
This is the code I've found online
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop(columns='label').head(15000)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
#find the co-variance matrix which is : (A^T * A)/n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T , sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot product is transposed, not the second. The E[...] means expected value, i.e. the mean. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
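To make that concrete, a small sketch with random stand-in data (samples in rows): centering and computing (X^T X)/n reproduces np.cov(), while standardizing first gives the correlation matrix instead.

import numpy as np

X = np.random.rand(1000, 5)

# covariance: center, multiply by the transpose, divide by n
X_cent = X - X.mean(axis=0)
manual_cov = np.matmul(X_cent.T, X_cent) / len(X_cent)
print(np.allclose(manual_cov, np.cov(X, rowvar=False, bias=True)))  # True

# after StandardScaler-style scaling, the same product is the correlation matrix
X_std = X_cent / X.std(axis=0)
print(np.allclose(np.matmul(X_std.T, X_std) / len(X_std),
                  np.corrcoef(X, rowvar=False)))                    # True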
Let's build towards the covariance matrix step by step; first, let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables. It describes how the two variables change together. Read here.
So now, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. Element-wise it can be calculated as $C_{ij} = \mathrm{E}[(X_i - \mathrm{E}[X_i])(X_j - \mathrm{E}[X_j])]$, which for mean-centered data with the samples in the rows reduces to $C = \frac{1}{n} X^T X$, and there you can see the equation that you are confused about. If you have any further queries, comment below.
Source: Wikipedia.
I want the correlations between individual variables and principal components in python.
I am using PCA in sklearn. I don't understand how I can obtain the loading matrix after I have decomposed my data. My code is here.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
data, y = iris.data, iris.target
pca = PCA(n_components=2)
transformed_data = pca.fit(data).transform(data)
eigenValues = pca.explained_variance_ratio_  # note: these are explained-variance ratios, not the eigenvalues (those are pca.explained_variance_)
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html doesn't mention how this can be achieved.
Multiply each component by the square root of its corresponding eigenvalue:
pca.components_.T * np.sqrt(pca.explained_variance_)
This should produce your loading matrix.
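A short sketch to check this on the iris data, assuming you standardize first so that the loadings coincide with the variable-to-component correlations asked about in the question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris().data
# standardize with ddof=1 so the covariance of x is exactly the correlation matrix
x = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(x)
scores = pca.transform(x)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # shape (4, 2)

# each loading equals the correlation between a variable and a principal component
corr = np.array([[np.corrcoef(x[:, j], scores[:, i])[0, 1] for i in range(2)]
                 for j in range(4)])
print(np.allclose(loadings, corr))  # True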
I think that @RickardSjogren is describing the eigenvectors, while @BigPanda is giving the loadings. There's a big difference: Loadings vs eigenvectors in PCA: when to use one or another?.
I created this PCA class with a loadings method.
Loadings, as given by pca.components_ * np.sqrt(pca.explained_variance_), are more analogous to coefficients in a multiple linear regression. I don't use .T here because in the PCA class linked above the components are already transposed. numpy.linalg.svd produces u, s, and vt, where vt is the Hermitian transpose, so you first need to recover v with vt.T.
There is also one other important detail: the signs (positive/negative) on the components and loadings in sklearn.PCA may differ from packages such as R.
More on that here:
In sklearn.decomposition.PCA, why are components_ negative?.
According to this blog the rows of pca.components_ are the loading vectors. So:
loadings = pca.components_