How does PCA compute the transformed version in `sklearn`? - python

I'm confused about sklearn's PCA (here is the documentation) and its relation to the Singular Value Decomposition (SVD).
In Wikipedia we have,
The full principal components decomposition of X can therefore be given as $T = XW$,
where W is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later, when it explains the relationship with the SVD, we have:
$X = U \Sigma W^T$
So I assume that matrix W embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that using the transform method of the PCA class in sklearn should give the same result as multiplying the observation matrix by W. However, I checked them and they don't match.
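For reference, the identity I am relying on (combining the SVD relation above with $W^T W = I$, since W is orthonormal) is

$$T = XW = U \Sigma W^T W = U \Sigma,$$

which is why I compare both x @ vh.T and u @ np.diag(s) against pca.transform(x) below.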
Is there anything I'm missing, or is there a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on WIKI: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(

There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug, merely a stability and reproducibility enhancement.
You have pretty much nailed the exact implementation of the PCA; however, in order to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the sign indeterminacy of the SVD, i.e. the SVD does not have a unique solution. That can easily be seen from your equation by setting $U_s = -U$ and $W_s = -W$; then $U_s$ and $W_s$ also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of U and W: if we reverse the signs of the k-th column of U and the k-th column of W, the equality still holds. You can read more about this issue, e.g., here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
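A quick numerical illustration of this sign freedom (a minimal standalone sketch, separate from your snippet):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
u, s, vh = np.linalg.svd(X, full_matrices=False)

# flip the sign of the k-th left singular vector (column of u)
# and of the k-th right singular vector (row of vh)
k = 0
u[:, k] *= -1
vh[k, :] *= -1

# the reconstruction is unchanged because the two sign flips cancel
print(np.allclose(X, u @ np.diag(s) @ vh))  # True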
The PCA implementation deals with this problem by enforcing that, in each column, the loading with the largest absolute value is positive; specifically, the method sklearn.utils.extmath.svd_flip is used. This way, no matter which signs the resulting vectors get from the non-deterministic np.linalg.svd, the absolute loading values remain the same, i.e. the signs of the matrices become deterministic.
Thus in order for your code to have the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on WIKI: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)

Related

Scikit learn NMF how to adjust sparseness of resulting factorization?

Nonnegative matrix factorization is lauded for generating sparse basis sets. However, when I run sklearn.decomposition.NMF the factors are not sparse. Older versions of NMF had a 'degree of sparseness' parameter beta. Newer versions do not, but I want my basis matrix W to actually be sparse. What can I do? (Code to reproduce problem is below).
I have toyed around with increasing various regularization parameters (e.g., alpha), but am not getting anything very sparse (like in the paper by Lee and Seung (1999)) when I apply it to the Olivetti faces dataset. They still basically end up looking like eigenfaces.
My NMF output (not very sparse):
Lee and Seung NMF paper output basis columns (looks sparse to me):
Code to reproduce my problem:
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import NMF
faces, _ = fetch_olivetti_faces(return_X_y=True)
# run nmf on the faces data set
num_nmf_components = 50
estimator = NMF(num_nmf_components,
                init='nndsvd',
                tol=5e-3,
                max_iter=1000,
                alpha_W=0.01,
                l1_ratio=0)
H = estimator.fit_transform(faces)
W = estimator.components_
# plot the basis faces
n_row, n_col = 6, 4 # how many faces to plot
image_shape = (64, 64)
n_samples, n_features = faces.shape
plt.figure(figsize=(10,12))
for face_id, face in enumerate(W[:n_row*n_col]):
    plt.subplot(n_row, n_col, face_id+1)
    plt.imshow(face.reshape(image_shape), cmap='gray')
    plt.axis('off')
plt.tight_layout()
Is there some combinations of parameters with sklearn.decomposition.NMF() that lets you dial in sparseness? I have played with different combinations of alpha_W and l1_ratio and even tweaked the number of components. I still end up with eigen-face looking things.
There are a couple of things going on here that we need to disentangle. First, what happened to sparseness? Second, how do you generate sparse faces using the sklearn function?
Where did the sparseness go?
The sklearn.decomposition.NMF function went through a major change from versions 0.16 to 0.19. There are multiple ways to implement nonnegative matrix factorization.
Before 0.16, NMF used projected gradient descent as described in Hoyer 2004, and included a sparseness parameter (which, as the OP noted, let you adjust the sparseness of the resulting W basis).
Because of various limitations outlined in this extremely thorough issue at sklearn's github repo, it was decided to move on to two additional methods:
Release 0.16: coordinate descent (PR here which was in version 0.16)
Release 0.19: multiplicative update (PR here which was in version 0.19)
This was a pretty major undertaking, and the upshot is that we now have a great deal more freedom in terms of error functions, initialization, and regularization. You can read about that at the issue. The objective function is now:
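For reference (paraphrasing the current sklearn documentation; the exact scaling may differ across versions), the regularized objective is roughly

$$L(W, H) = \tfrac{1}{2}\lVert X - WH\rVert_{\mathrm{Fro}}^2 + \alpha_W \,\lambda\, n_{\text{features}} \lVert \mathrm{vec}(W)\rVert_1 + \alpha_H \,\lambda\, n_{\text{samples}} \lVert \mathrm{vec}(H)\rVert_1 + \tfrac{1}{2}\alpha_W (1-\lambda)\, n_{\text{features}} \lVert W\rVert_{\mathrm{Fro}}^2 + \tfrac{1}{2}\alpha_H (1-\lambda)\, n_{\text{samples}} \lVert H\rVert_{\mathrm{Fro}}^2,$$

where $\lambda$ stands for l1_ratio.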
You can read more details/explanation at the docs, but to note a few things relevant to the question:
The solver parameter takes mu for multiplicative update or cd for coordinate descent. The older projected gradient descent method (with the sparseness parameter) is deprecated.
As you can see in the objective function, there are weights for regularizing W and H (alpha_W and alpha_H, respectively). In theory, if you want to rein in W, you should increase alpha_W.
You can regularize using the L1 or L2 norm, and the ratio between the two is set by l1_ratio. The larger you make l1_ratio, the more you weight the L1 norm over the L2 norm. Note: the L1 norm tends to produce sparser parameter sets, while the L2 norm tends to shrink all parameters toward small values, so in theory, if you want sparseness, set your l1_ratio high.
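As a rough sketch of where these knobs live in the current API (the parameter values here are arbitrary placeholders, not a recommendation):

from sklearn.decomposition import NMF

model = NMF(n_components=50,
            solver='mu',      # 'mu' = multiplicative update, 'cd' = coordinate descent
            alpha_W=0.5,      # regularization strength on W
            alpha_H=0.0,      # regularization strength on H
            l1_ratio=0.9,     # 1.0 = pure L1 (sparser), 0.0 = pure L2
            init='nndsvda',
            max_iter=500)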
How to generate sparse faces?
Examining the objective function suggests what to do: crank up alpha_W and l1_ratio. But also note that the Lee and Seung paper used multiplicative update (mu), so if you want to reproduce their results, I would recommend setting solver to mu, setting alpha_W and l1_ratio high, and seeing what happens.
In the OP's question, they implicitly used the cd solver (which is the default), and set alpha_W=0.01 and l1_ratio=0, which I wouldn't necessarily expect to create a sparse basis set.
But things are actually not that simple. I tried some initial runs of coordinate descent with high l1_ratio and alpha_W and found very low sparseness. So to quantify some of this, I did a grid search, and used a sparseness measure.
Quantifying sparseness is itself a cottage industry (e.g., see this post, and the paper cited there). I used Hoyer's measure of sparsity, adapted from the one used in the nimfa package:
def sparseness_hoyer(x):
    """
    The sparseness of array x is a real number in [0, 1], where a sparser array
    has a value closer to 1. Sparseness is 1 iff the vector contains a single
    nonzero component and is equal to 0 iff all components of the vector are
    the same.

    Modified from Hoyer 2004: [sqrt(n) - L1/L2] / [sqrt(n) - 1]
    Adapted from the nimfa package: https://nimfa.biolab.si/
    """
    from math import sqrt  # faster than numpy sqrt
    eps = np.finfo(x.dtype).eps if 'int' not in str(x.dtype) else 1e-9
    n = x.size
    # measure is meant for nmf: things get weird for negative values
    if np.min(x) < 0:
        x -= np.min(x)
    # patch for array of zeros
    if np.allclose(x, np.zeros(x.shape), atol=1e-6):
        return 0.0
    L1 = abs(x).sum()
    L2 = sqrt(np.multiply(x, x).sum())
    sparseness_num = sqrt(n) - (L1 + eps) / (L2 + eps)
    sparseness_den = sqrt(n) - 1
    return sparseness_num / sparseness_den
What this measure actually quantifies is sort of complicated, but roughly: a sparse image is one with only a few pixels active, while a non-sparse image has lots of pixels active. If we run PCA on the faces example from the OP, we can see the sparseness values are low, around 0.04, for the eigenfaces:
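For concreteness, a minimal sketch of that PCA comparison (it assumes the sparseness_hoyer function above and the Olivetti faces loaded as in the OP's snippet):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import numpy as np

faces, _ = fetch_olivetti_faces(return_X_y=True)
pca = PCA(n_components=50).fit(faces)

# sparseness of each eigenface (row of components_); copies avoid mutating components_
eigenface_sparseness = [sparseness_hoyer(comp.copy()) for comp in pca.components_]
print(np.mean(eigenface_sparseness))  # low values, around 0.04 as described above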
Sparsifying using coordinate descent?
If we run NMF using the params from the OP (coordinate descent, with low alpha_W and l1_ratio, except with 200 components), the sparseness values are again low:
If you look at the histogram of sparseness values this is verified:
Different, but not super impressive, compared with PCA.
I next did a grid search through alpha_W and l1_ratio space, varying them between 0 and 1 (in 0.1 step increments). I found that sparsity was not maximized when they were 1. Surprisingly, contrary to theoretical expectations, I found that sparsity was only high when l1_ratio was 0, and it dropped off precipitously above 0. Within this slice of parameters, sparsity was maximized when alpha_W was 0.9:
Intuitively, this is a huge improvement. There is still a lot of variation in the distribution of sparseness values, but they are much higher:
However, maybe in order to replicate the Lee and Seung results, and better control sparseness, we should be using multiplicative update (which is what they used). Let's try that next.
Sparsifying using multiplicative update
For the next attempt, I used multiplicative update, and this behaved much more as expected, with sparse, parts-based representations emerging:
You can see the drastic difference, and this is reflected in the histogram of sparseness values:
Note the code to generate this is below.
One final interesting thing to note: the sparseness values with this method seem to increase with the component number. I plotted sparseness as a function of component, and this is (roughly) borne out, and was borne out consistently over all my runs of the algorithm:
I have not seen this discussed elsewhere, so thought I'd mention it.
Code to generate sparse representation of faces using the mu NMF algorithm:
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import NMF
faces, _ = fetch_olivetti_faces(return_X_y=True)
num_nmf_components = 200
alph_W = 0.9 # cd: .9, mu: .9
L1_ratio = 0.9 # cd: 0, L1_ratio: 0.9
try:
    del estimator
except NameError:
    print("first run")

estimator = NMF(num_nmf_components,
                init='nndsvdar',  # nndsvd
                solver='mu',
                max_iter=50,
                alpha_W=alph_W,
                alpha_H=0,
                l1_ratio=L1_ratio,
                shuffle=True)
H = estimator.fit_transform(faces)
W = estimator.components_
# plot the basis faces
n_row, n_col = 5, 7 # how many faces to plot
image_shape = (64, 64)
n_samples, n_features = faces.shape
plt.figure(figsize=(10,12))
for face_id, face in enumerate(W[:n_row*n_col]):
    plt.subplot(n_row, n_col, face_id+1)
    face_sparseness = sparseness_hoyer(face)
    plt.imshow(face.reshape(image_shape), cmap='gray')
    plt.title(f"{face_sparseness: 0.2f}")
    plt.axis('off')
plt.suptitle('NMF', fontsize=16, y=1)
plt.tight_layout()

covariance matrix using np.matmul(data.T, data)

This is the code I've found online
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)

standardized_data = StandardScaler().fit_transform(data)

# find the co-variance matrix, which is (A^T * A)/n
sample_data = standardized_data

# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
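If you want to convince yourself of this numerically, here is a minimal sketch with random data standing in for the MNIST sample:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 4))

standardized = StandardScaler().fit_transform(data)
covar_matrix = np.matmul(standardized.T, standardized) / len(standardized)

# np.cov divides by n-1 rather than n, so rescale before comparing
ref = np.cov(standardized, rowvar=False) * (len(standardized) - 1) / len(standardized)
print(np.allclose(covar_matrix, ref))  # True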
Let's build toward the covariance matrix step by step; first, let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables. It describes how the two variables change together. Read here.
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. It can be calculated as follows:
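In symbols (the population form, dividing by $n$ as the tutorial's code does), the entry for features $X_i$ and $X_j$, with sample means $\bar{x}_i$ and $\bar{x}_j$ over $n$ observations, is

$$\mathrm{cov}(X_i, X_j) = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j),$$

and the covariance matrix collects these values, with the variances on the diagonal.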
and there you can see the equation you were confused about take shape. If you have any further questions, comment below.
Image Source: Wikipedia.

In sklearn.decomposition.PCA, why are components_ negative?

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.
When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?
Update: my (partial) answer below contains some additional info.
Take the following example data:
from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred',
start='2017-01-01', end='2017-02-01').pct_change().dropna())
# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
[-0.43328092 -0.36048659 0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629 0.58614003 0.56194768]
[ 0.43328092 0.36048659 -0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True True True]
[ True True True]
[False False False]]
As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is

$X = \sum_{i=1}^{r} s_i u_i v_i^\top$

with the $s_i$ ordered in decreasing fashion, then you can see that you can change the sign (i.e., "flip") of, say, $u_1$ and $v_1$; the minus signs will cancel, so the formula will still hold.
This shows that the SVD is unique up to a change in sign in pairs of left and right singular vectors.
Since PCA is just an SVD of X (or an eigenvalue decomposition of $X^\top X$), there is no guarantee that it does not return different results on the same X each time it is performed. Understandably, the scikit-learn implementation wants to avoid this: they guarantee that the left and right singular vectors returned (stored in U and V) are always the same, by imposing (arbitrarily) that the largest coefficient of $u_i$ in absolute value is positive.
As you can see by reading the source: first they compute U and V with linalg.svd(). Then, for each vector $u_i$ (i.e., column of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change $u_i$ to $-u_i$ and the corresponding right singular vector, $v_i$, to $-v_i$. As noted earlier, this does not change the SVD formula, since the minus signs cancel out. However, it is now guaranteed that the U and V returned after this processing are always the same, since the sign indeterminacy has been removed.
After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of ambiguity is much more specific: in the source (line 391) for PCA you have:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure (U, S, V have already been found at this point...). So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. The problem is, the sklearn implementation will give you strong negative loadings on that first principal component.
My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.
With PCA here in 3 dimensions, you basically find, iteratively: 1) the 1D projection axis with the maximum variance preserved; 2) the maximum-variance-preserving axis perpendicular to the one in 1). The third axis is automatically the one perpendicular to the first two.
The components_ are listed according to the explained variance, so the first one explains the most variance, and so on. Note that, by the definition of the PCA operation, when you try to find the projection vector in the first step that maximizes the preserved variance, the sign of the vector does not matter. Let M be your data matrix (in your case with shape (20, 3)), and let v1 be the vector preserving the maximum variance when the data is projected onto it. When you select -v1 instead of v1, you obtain the same variance (you can check this out; see the sketch below). Then, when selecting the second vector, let v2 be the one which is perpendicular to v1 and preserves the maximum variance. Again, selecting -v2 instead of v2 preserves the same amount of variance. v3 can then be selected as either -v3 or v3. Here, the only thing that matters is that v1, v2, v3 constitute an orthonormal basis for the data M. The signs mostly depend on how the algorithm solves the eigenvector problem underlying the PCA operation. Eigenvalue decomposition or SVD solutions may differ in signs.
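You can check the sign-invariance of the preserved variance in a couple of lines (a minimal sketch, using random data in place of the (20, 3) rates matrix):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 3))

pca = PCA().fit(M)
v1 = pca.components_[0]

# projecting onto v1 or onto -v1 preserves exactly the same variance
print(np.var(M @ v1), np.var(M @ -v1))  # identical values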
This is a short note for those who care about the purpose and not the math at all.
Although the sign is opposite for some of the components, that shouldn't be considered a problem. In fact, what we care about (at least to my understanding) is the axes' directions. The components, ultimately, are vectors that identify these axes after transforming the input data using PCA. Therefore, no matter which direction each component points in, the new axes that our data lie on will be the same.

Factor Loadings using sklearn

I want the correlations between individual variables and principal components in python.
I am using PCA in sklearn. I don't understand how I can obtain the loading matrix after I have decomposed my data. My code is here.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
data, y = iris.data, iris.target
pca = PCA(n_components=2)
transformed_data = pca.fit(data).transform(data)
eigenValues = pca.explained_variance_ratio_
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html doesn't mention how this can be achieved.
Multiply each component by the square root of its corresponding eigenvalue:
pca.components_.T * np.sqrt(pca.explained_variance_)
This should produce your loading matrix.
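A minimal end-to-end sketch on the iris data from the question, under this definition of loadings:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data, y = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(data)

# one row per original variable, one column per principal component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.shape)  # (4, 2): four iris features, two components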
I think that @RickardSjogren is describing the eigenvectors, while @BigPanda is giving the loadings. There's a big difference: Loadings vs eigenvectors in PCA: when to use one or another?.
I created this PCA class with a loadings method.
Loadings, as given by pca.components_ * np.sqrt(pca.explained_variance_), are more analogous to coefficients in a multiple linear regression. I don't use .T here because in the PCA class linked above, the components are already transposed. numpy.linalg.svd produces u, s, and vt, where vt is the Hermitian (conjugate) transpose, so you first need to get back to v with vt.T.
There is also one other important detail: the signs (positive/negative) on the components and loadings in sklearn.PCA may differ from packages such as R.
More on that here:
In sklearn.decomposition.PCA, why are components_ negative?.
According to this blog the rows of pca.components_ are the loading vectors. So:
loadings = pca.components_

Scipy's sparse eigsh() for small eigenvalues

I'm trying to write a spectral clustering algorithm using NumPy/SciPy for larger (but still tractable) systems, making use of SciPy's sparse linear algebra library. Unfortunately, I'm running into stability issues with eigsh().
Here's my code:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg as SLA
import sklearn.utils.graph as graph
W = self._sparse_rbf_kernel(self.X_, self.datashape)
D = scipy.sparse.csc_matrix(np.diag(np.array(W.sum(axis = 0))[0]))
L = graph.graph_laplacian(W) # D - W
vals, vects = SLA.eigsh(L, k = self.k, M = D, which = 'SM', sigma = 0, maxiter = 1000)
The sklearn library refers to the scikit-learn package, specifically this method for calculating a graph laplacian from a sparse SciPy matrix.
_sparse_rbf_kernel is a method I wrote to compute pairwise affinities of the data points. It operates by creating a sparse affinity matrix from image data, specifically by only computing pairwise affinities for the 8-neighborhoods around each pixel (instead of pairwise for all pixels with scikit-learn's rbf_kernel method, which for the record doesn't fix this either).
Since the laplacian is unnormalized, I'm looking for the smallest eigenvalues and corresponding eigenvectors of the system. I understand that ARPACK is ill-suited for finding small eigenvalues, but I'm trying to use shift-invert to find these values and am still not having much success.
With the above arguments (specifically, sigma = 0), I get the following error:
RuntimeError: Factor is exactly singular
With sigma = 0.001, I get a different error:
scipy.sparse.linalg.eigen.arpack.arpack.ArpackNoConvergence: ARPACK error -1: No convergence (1001 iterations, 0/5 eigenvectors converged)
I've tried all three different values for mode with the same result. Any suggestions for using the SciPy sparse library for finding small eigenvalues of a large system?
You should use which='LM': in shift-invert mode, this parameter refers to the transformed eigenvalues (as explained in the documentation).
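A minimal sketch of the shift-invert call (a standalone toy matrix rather than the OP's graph Laplacian, which is singular at sigma = 0):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 200
A = sp.random(n, n, density=0.05, format='csc', random_state=42)
A = (A + A.T) * 0.5 + 10.0 * sp.identity(n, format='csc')  # symmetric and nonsingular

# In shift-invert mode, 'which' refers to the transformed eigenvalues 1/(lambda - sigma),
# so the eigenvalues nearest sigma are the LARGEST transformed ones: request 'LM', not 'SM'.
vals, vecs = eigsh(A, k=5, sigma=0.0, which='LM')
print(np.sort(vals))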
