Principal Component Analysis: Order of components AFTER transformation - python

I am using the PCA class from sklearn.decomposition to reduce the dimensionality of the feature space in order to plot that feature space.
I am wondering the following: after applying the fit and transform methods of the PCA class, I get back an array X_transformed of shape (n_samples, n_components), as stated in the documentation. Are the columns of X_transformed sorted by the amount of explained variance? The documentation says that PCA.components_ is sorted by explained variance, so I am assuming that the columns of X_transformed are as well, but please correct me if I am wrong.
Little example:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X) # X is an array containing my original features. X.shape=(n_samples, n_features)
X_transformed = pca.transform(X) # X_transformed.shape=(n_samples, n_components). Are X_transformed's columns sorted by explained variance?
Thanks!

Hmm, maybe I just got an idea for how to test that:
from sklearn.decomposition import PCA
import numpy as np
pca_2 = PCA(n_components=2)
X_transformed_2 = pca_2.fit_transform(X)
# X_transformed_2 holds the two components with the most variance explained
pca_10 = PCA(n_components=10)
X_transformed_10 = pca_10.fit_transform(X)
# X_transformed_10 holds the 10 components with the most variance explained
# Hypothesis: If the columns of X_transformed_10 are ordered by explained variance, its first 2 columns should equal X_transformed_2
np.array_equal(X_transformed_2, X_transformed_10[:, :2]) ## returns True
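The ordering can also be checked directly, without comparing two fits, by comparing the sample variance of each transformed column with pca.explained_variance_. This is only a sketch on synthetic data; the array X, its shape, and the scaling factors are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 5 features with deliberately different scales
# (a hypothetical stand-in for the real feature matrix X).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA()
X_transformed = pca.fit_transform(X)

# Sample variance of each transformed column (ddof=1 matches explained_variance_).
column_variances = X_transformed.var(axis=0, ddof=1)

# The column variances coincide with explained_variance_ ...
variances_match = np.allclose(column_variances, pca.explained_variance_)
# ... and they never increase from one column to the next.
is_sorted_descending = bool(np.all(np.diff(column_variances) <= 0))

print(variances_match, is_sorted_descending)
```

If both checks print True, column i of X_transformed corresponds to row i of pca.components_, in order of decreasing explained variance.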

Related

How PCA computes the transformed version in `sklearn`?

I'm confused about sklearn's PCA (here is the documentation) and its relation to Singular Value Decomposition (SVD).
In Wikipedia we have,
The full principal components decomposition of X can, therefore, be given as $T = XW$,
where W is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later once it explains the relationship with SVD, we have:
$X = U \Sigma W^T$
So I assume that the matrix W embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that using the transform method of the PCA class in sklearn should give the same result as multiplying the observation matrix by W. However, I checked them and they don't match.
Am I missing something, or is there a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X@vh.T = u@np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(
There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug; it is a stability and reproducibility enhancement.
You have pretty much nailed the implementation of PCA. However, to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the non-uniqueness of the SVD, i.e. the SVD does not have a unique solution. This is easy to see from your equation by setting $U_s = -U$ and $W_s = -W$; then $U_s$ and $W_s$ also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of U and W: if we reverse the signs of just the k-th column of U and the k-th column of W, the equality still holds. You can read more about this issue, for example, here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
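The sign ambiguity can be written out explicitly. Writing $u_k$ and $w_k$ for the k-th columns of U and W and $\sigma_k$ for the k-th singular value (this column notation is introduced here only for illustration), the SVD is a sum of rank-one terms, and flipping both signs in any single term leaves that term unchanged:

```latex
X = U \Sigma W^T = \sum_{k} \sigma_k \, u_k w_k^T,
\qquad
\sigma_k \, (-u_k)(-w_k)^T = \sigma_k \, u_k w_k^T .
```

The transformed data does change, though: the k-th column of $U \Sigma$ is $\sigma_k u_k$, so flipping the sign of $u_k$ flips the sign of that whole column.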
The PCA implementation deals with this problem by forcing the entry with the largest absolute value in each vector to be positive; specifically, it uses sklearn.utils.extmath.svd_flip. This way, whichever signs np.linalg.svd happens to return, the absolute values are unaffected and the signs of the resulting matrices are always the same.
Thus in order for your code to have the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X@vh.T = u@np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)
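Because the exact sign convention is an implementation detail (and, as an assumption worth checking against your installed version, it may differ between scikit-learn releases), a more robust check is to compare the two results up to a per-column sign. A minimal sketch on the same kind of random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(41)
x = rng.random((20, 10))
x = x - x.mean(axis=0)  # centre the data, as PCA does internally

u, s, vh = np.linalg.svd(x, full_matrices=False)
t_svd = u * s  # same as u @ np.diag(s)
t_pca = PCA().fit(x).transform(x)

# Columns can differ only by a factor of +1 or -1, so absolute values agree.
same_up_to_sign = np.allclose(np.abs(t_svd), np.abs(t_pca))

# Recover the per-column sign and align the plain SVD scores with sklearn's.
col_signs = np.sign(np.sum(t_svd * t_pca, axis=0))
max_diff_after_alignment = np.abs(t_svd * col_signs - t_pca).max()

print(same_up_to_sign, max_diff_after_alignment)
```

This does not depend on which rule the library uses to pick the signs; it only relies on the singular values being distinct, which is the case for random data like this.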

PCA features do not match original features

I am trying to reduce the feature dimensions using PCA. I have been able to apply PCA to my training data, but am struggling to understand why the reduced feature set (X_train_pca) shares no similarities with the original features (X_train).
print(X_train.shape) # (26215, 727)
pca = PCA(0.5)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape) # (26215, 100)
most_important_features_indicies = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_features_indicies[0]
Should the first feature vector in X_train_pca not be just a subset of the first feature vector in X_train? For example, why doesn't the following equal True?
print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False
Furthermore, none of the features from the first feature vector of X_train are in the first feature vector of X_train_pca:
for i in X_train[0]:
    print(i in X_train_pca[0])
# False
# False
# False
# ...
PCA transforms your high dimensional feature vectors into low dimensional feature vectors.
It does not simply determine the least important index in the original space and drop that dimension.
This is normal since the PCA algorithm applies a transformation to your data:
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
(https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction)
Run the following code sample to see the effects of the PCA algorithm on a simple Gaussian data set.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
pca = PCA(2)
X = np.random.multivariate_normal(mean=np.array([0, 0]), cov=np.array([[1, 0.75],[0.75, 1]]), size=(1000,))
X_new = pca.fit_transform(X)
plt.scatter(X[:, 0], X[:, 1], s=5, label='Initial data')
plt.scatter(X_new[:, 0], X_new[:, 1], s=5, label='Transformed data')
plt.legend()
plt.show()
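To make the point concrete: each column of the transformed data is a weighted combination of all the original (centered) features, with the weights given by the rows of pca.components_. A small sketch on synthetic data (the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.75], [0.75, 1]], size=1000)

pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

# Reproduce the transform by hand: centre the data, then project onto the components.
X_manual = (X - pca.mean_) @ pca.components_.T
matches_manual = np.allclose(X_new, X_manual)

# Weights of the first new feature: both original features contribute.
first_component_weights = pca.components_[0]

print(matches_manual, first_component_weights)
```

Since every original feature enters every new feature with some weight, individual values from X generally do not reappear in the transformed array.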

How to choose the features that describe x% of all information in data while using Incremental principal components analysis (IPCA)?

I'd like to use the Incremental principal components analysis (IPCA) to reduce my feature space such that it contains x% of information.
I would use the sklearn.decomposition.IncrementalPCA(n_components=None, whiten=False, copy=True, batch_size=None)
I can leave the n_components=None so that it works on all the features that I have.
But later, once the whole data set has been analyzed, how do I select the features that represent x% of the data, and how do I create a transform() for that number of features?
This idea is taken from this question.
You can get the percentage of explained variance from each of the components of your PCA using explained_variance_ratio_. For example in the iris dataset, the first 2 principal components account for 98% of the variance in the data:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
pca = decomposition.IncrementalPCA()
pca.fit(X)
pca.explained_variance_ratio_
#array([ 0.92461621, 0.05301557, 0.01718514, 0.00518309])
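Building on this, one way to pick the number of components that reaches a target share of variance is to take the cumulative sum of explained_variance_ratio_ and keep the leading columns of the transformed data. This is a sketch rather than the only approach; the 0.95 threshold and the variable names are illustrative assumptions:

```python
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data

ipca = decomposition.IncrementalPCA()  # n_components=None keeps all components
ipca.fit(X)

threshold = 0.95  # hypothetical target: keep 95% of the variance
cumulative = np.cumsum(ipca.explained_variance_ratio_)

# Smallest k such that the first k components reach the threshold.
k = int(np.searchsorted(cumulative, threshold) + 1)

# Components are ordered by explained variance, so keep the first k columns.
X_reduced = ipca.transform(X)[:, :k]

print(k, X_reduced.shape)
```

Slicing the output of transform() avoids refitting; alternatively, a second IncrementalPCA could be fitted with n_components=k once k is known.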

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GaussianMixture  # replaces the GMM class, which was removed from scikit-learn
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GaussianMixture(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
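A small worked example may help to see the difference between the two variants. The labels below are made up for illustration, and the helper name purity is hypothetical; it is the same computation as above with the axis exposed as a parameter:

```python
import numpy as np
from sklearn import metrics

def purity(y_true, y_pred, axis=0):
    # Rows of the contingency matrix are true classes, columns are predicted clusters.
    cm = metrics.cluster.contingency_matrix(y_true, y_pred)
    # axis=0: best class per cluster (purity); axis=1: best cluster per class (inverse purity).
    return np.sum(np.amax(cm, axis=axis)) / np.sum(cm)

# Made-up labels: two true classes, but the first class is split into singleton clusters.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 2, 3, 3, 3]

purity_value = purity(y_true, y_pred, axis=0)          # every cluster is pure
inverse_purity_value = purity(y_true, y_pred, axis=1)  # class 0 is fragmented

print(purity_value, inverse_purity_value)
```

Here every cluster contains a single class, so purity is 1.0, while inverse purity is 4/6 because class 0 is spread over three clusters.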
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
import numpy as np
from sklearn.metrics import accuracy_score
def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters
    Returns:
        float: Purity score
    """
    # work on a copy so the caller's label array is not modified
    y_true = y_true.copy()
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true==labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # We set the number of bin edges to be n_classes+1 so that
    # we count the actual occurrence of classes between two consecutive edges,
    # the bigger being excluded [bin_i, bin_i+1[
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)
    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner
    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a very imbalanced dataset, with 99 examples of one label and 1 example of another label. Then any clustering (e.g. two equal clusters of size 50) will achieve a purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
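The imbalanced example described above can be worked through numerically. The sketch below builds that exact case (99 examples of one label, 1 of another, two clusters of 50) and computes both metrics inline; the particular assignment of examples to clusters is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

# 99 examples of label 0 and 1 example of label 1, split into two clusters of 50.
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 50 + [1] * 50)

cm = metrics.cluster.contingency_matrix(y_true, y_pred)

# Purity: each cluster votes for its majority label independently.
purity = np.sum(np.amax(cm, axis=0)) / np.sum(cm)

# Cluster accuracy: each cluster must be matched to a different label.
row_ind, col_ind = linear_sum_assignment(-cm)
accuracy = cm[row_ind, col_ind].sum() / np.sum(cm)

print(purity, accuracy)
```

Purity comes out at 0.99 because both clusters vote for the majority label, while cluster accuracy is 0.51 because only one cluster may claim that label.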

Factor Loadings using sklearn

I want the correlations between individual variables and principal components in python.
I am using PCA in sklearn. I don't understand how I can obtain the loading matrix after I have decomposed my data. My code is here.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
iris = load_iris()
data, y = iris.data, iris.target
pca = PCA(n_components=2)
transformed_data = pca.fit(data).transform(data)
eigenValues = pca.explained_variance_ratio_
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html doesn't mention how this can be achieved.
Multiply each component by the square root of its corresponding eigenvalue:
pca.components_.T * np.sqrt(pca.explained_variance_)
This should produce your loading matrix.
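Since the question asks for correlations between variables and components, it may help to check how these loadings relate to correlations. In the sketch below (an illustration, with variable names chosen here), dividing each row of the loading matrix by the standard deviation of the corresponding feature reproduces the correlations computed directly with np.corrcoef:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris().data
pca = PCA(n_components=2)
scores = pca.fit_transform(data)

# Loadings: eigenvectors scaled by the square root of the eigenvalues.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Dividing each row by the feature's standard deviation gives correlations.
corr_from_loadings = loadings / data.std(axis=0, ddof=1)[:, np.newaxis]

# Direct computation: correlation of every feature with every component score.
n_features = data.shape[1]
corr_direct = np.corrcoef(data, scores, rowvar=False)[:n_features, n_features:]

print(np.allclose(corr_from_loadings, corr_direct))
```

If the data are standardized before PCA, the feature standard deviations are 1 (up to the n versus n-1 convention), and the loadings are then the correlations themselves.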
I think that @RickardSjogren is describing the eigenvectors, while @BigPanda is giving the loadings. There's a big difference: Loadings vs eigenvectors in PCA: when to use one or another?.
I created this PCA class with a loadings method.
Loadings, as given by pca.components_ * np.sqrt(pca.explained_variance_), are more analogous to coefficients in a multiple linear regression. I don't use .T here because in the PCA class linked above, the components are already transposed. numpy.linalg.svd produces u, s, and vt, where vt is the Hermitian transpose of v, so you first need to get back to v with vt.T.
There is also one other important detail: the signs (positive/negative) on the components and loadings in sklearn.PCA may differ from packages such as R.
More on that here:
In sklearn.decomposition.PCA, why are components_ negative?.
According to this blog the rows of pca.components_ are the loading vectors. So:
loadings = pca.components_
