How to interpret Scikit-learn's PCA when n_components are None?

How to interpret Scikit-learn's PCA when n_components are None? - python

I'm confused with the problem mentioned in the title. Does n_components=None mean that no transformation has been made in input, or that it has been transformed to new dimensional space but instead of the usual "reduction" (keeping few components with high eigenvalues) with keeping all the new synthetic features? The documentation suggests the former for me:
Hence, the None case results in: n_components == min(n_samples, n_features) - 1
But this is not entirely clear, and additionally, if it indeed means keeping all the components, why on earth the number of these equals to n_components == min(n_samples, n_features) - 1, why not to n_features?
However, I find the other alternative (in case of None, dropping the whole PCA step), I have never heard about applying PCA without omitting some eigenvectors...

As per official documentation -
If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
So it depends upon the type of solver (which can be set via the parameter) being used for the eigenvectors.
If arpack : run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
As to your second query about dropping the whole PCA step, it depends totally upon what you are trying to solve. Since PCA components explain the variation of the data in decreasing order (1st component explains max variance, the last component explains the least variance), it can be useful for specific tasks to have some features that explain more variance.

Related

normalize_y in GPR

I'm trying to use GaussianProcessRegressor in sklearn to predict values of unknown.
The target values are typically between 1000-10000.
Since they are not 0-mean prior, I set the model with normalize_y = False, which is a default setup.
from sklearn.gaussian_process import GaussianProcessRegressor
gpr = GaussianProcessRegressor(kernel = RBF, random_state=0, alpha=1e-10, normalize_y = False)
when I predicted unknown with the gpr model, the returned std values are unrealistically too small, like in the scale of 0.1, which is 0.001% of the predicted values.
When I changed the setting to normalize_y = True, the returned std values are more realistic, about 500ish.
Can someone explain exactly what normalize_y does here, and if I set it to True or False in this case?

I found the closest answer HERE: https://github.com/scikit-learn/scikit-learn/issues/15612
"OK I think I know what might be going on here. It's a bit tricky to see but I think that none of the kernels have a vertical length scale parameter, so kernel(x,x) is always equal to 1. All the diagonal elements of K are equal to 1 (before we add the ridge to it), for example.
We can then see that the variance of the predictions can only be between 0 and 1. For example, if we're predicting at a point far from the training data (so kernel(X, x_new) is a vector of zeros) then the variance is just
sigma^2 = kernel(x_new, x_new) = 1
I think the real problem here is that the prior is for data with unit variance, but the data doesn't have unit variance. The solution would be to normalise the data so that it has unit variance after it 'enters' the GP, conduct the GP analysis, and then 'unnormalise' it back again at the end. The code already removes the mean automatically, so I think we just need to divide by the standard deviation at the same point and it would work OK.
So could just need a few extra lines!"
For this reason, changing the length_scale_bounds parameter of your kernel should fix this issue!
I hope this helps those who land here as I faced the same issue!

covariance matrix using np.matmul(data.T, data)

This is the code I've found online
d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label').head(15000)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
#find the co-variance matrix which is : (A^T * A)/n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T , sample_data) / len(sample_data)
How does multiplying the same data gives np.matmul(sample_data.T, sample_data) covariance matrix? What is the co-variance matrix according to this tutorial I found online? The last step is what I don't understand.

This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.

Let's build towards Covariance matrix step by step, first let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is the measure of the joint probability for two random variables. It describes how the two variables change together. Read here.
So now armed with that you can understand that Co-variance matrix is a matrix which shows how each feature varies with changes in other features. Which can be calculated as
and there you can see the equation that you are confused about formed at the bottom. If you have any further queries, comment down.
Image Source: Wikipedia.

In sklearn.decomposition.PCA, why are components_ negative?

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.
When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?
Update: my (partial) answer below contains some additional info.
Take the following example data:
from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred',
start='2017-01-01', end='2017-02-01').pct_change().dropna())
# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
[-0.43328092 -0.36048659 0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629 0.58614003 0.56194768]
[ 0.43328092 0.36048659 -0.82602486]
[-0.68674084 0.72559581 -0.04356302]]
# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True True True]
[ True True True]
[False False False]]

As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is \sum_1^r \s_i u_i v_i^\top :
with the s_i ordered in decreasing fashion, then you can see that you can change the sign (i.e., "flip") of say u_1 and v_1, the minus signs will cancel so the formula will still hold.
This shows that the SVD is unique up to a change in sign in pairs of left and right singular vectors.
Since the PCA is just a SVD of X (or an eigenvalue decomposition of X^\top X), there is no guarantee that it does not return different results on the same X every time it is performed. Understandably, scikit learn implementation wants to avoid this: they guarantee that the left and right singular vectors returned (stored in U and V) are always the same, by imposing (which is arbitrary) that the largest coefficient of u_i in absolute value is positive.
As you can see reading the source: first they compute U and V with linalg.svd(). Then, for each vector u_i (i.e, row of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change u_i to - u_i and the corresponding left singular vector, v_i, to - v_i. As told earlier, this does not change the SVD formula since the minus sign cancel out. However, now it is guaranteed that the U and V returned after this processing are always the same, since the indetermination on the sign has been removed.

After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of ambiguity is much more specific: in the source (line 391) for PCA you have:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure. (U, S, V have already been found at this point...). So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. Problem is, the sklearn implementation will get you strong negative loadings to that first principal component.
My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.

With the PCA here in 3 dimensions, you basically find iteratively: 1) The 1D projection axis with the maximum variance preserved 2) The maximum variance preserving axis perpendicular to the one in 1). The third axis is automatically the one which is perpendicular to first two.
The components_ are listed according to the explained variance. So the first one explains the most variance, and so on. Note that by the definition of the PCA operation, while you are trying to find the vector for projection in the first step, which maximizes the variance preserved, the sign of the vector does not matter: Let M be your data matrix (in your case with the shape of (20,3)). Let v1 be the vector for preserving the maximum variance, when the data is projected on. When you select -v1 instead of v1, you obtain the same variance. (You can check this out). Then when selecting the second vector, let v2 be the one which is perpendicular to v1 and preserves the maximum variance. Again, selecting -v2 instead of v2 will preserve the same amount of variance. v3 then can be selected either as -v3 or v3. Here, the only thing which matters is that v1,v2,v3 constitute an orthonormal basis, for the data M. The signs mostly depend on how the algorithm solves the eigenvector problem underlying the PCA operation. Eigenvalue decomposition or SVD solutions may differ in signs.

This is a short notice for those who care about the purpose and not the math part at all.
Although the sign is opposite for some of the components, that shouldn't be considered as a problem. In fact what we do care about (at least to my understanding) is the axes' directions. The components, ultimately, are vectors that identify these axes after transforming the input data using pca. Therefore no matter what direction each component is pointing to, the new axes that our data lie on will be the same.

Active Shape Models: matching model points to target points

I have a question regarding Active Shape Models. I am using the paper of T. Coots (which can be found here.)
I have done all of the initial steps (Procrustes Analysis to calculate mean shape, PCA to reduce dimensions) but am stuck on fitting.
This is the situation I am in now: I have calculated the mean shape with points X and have also calculated a new set of points Y that X should move to, to better fit my image.
I am using the following algorithm, which can be found on page 23 of the paper previously linked:
To clarify: is the mean shape calculated with Procrustes Analysis, and the is the matrix containing the eigenvectors calculated with PCA.
Everything goes well up to step 4. I can calculate the pose parameters and invert the transformation onto the points Y.
However, in stap 5, something strange happens. Whatever the pose parameters are calculated in stap 3 and applied in stap 4, stap 5 always results in almost exactly the same vector y' with very low values (one of them being 1.17747114e-05 for example). (So whether i calculated a scale of 1/10 or 1000, y' barely changes).
This results in the algorithm always converging to the same value of b, and thus in the same output shape x, no matter what the input set of target points Y are that I want the model points X to match with.
This sure is not the goal of the algorithm... Could anyone explain this strange behaviour? Somehow, projecting my calculated vector y in step 5 into the "tangent plane" does not take into account any of the changes made in step 4.
Edit: I have some more reasoning, though no explanation or solution. If, in step 5, i manually set y' to consist only of zeros, then in step 6, b is equal to the matrix of eigenvectors multiplicated with the meanshape. And this results in the same b I always get (since y' is always a vector with very low values).
But these eigenvectors are calculated from the meanshape using PCA... So what's expected, is that no change should take place, right?

Something you could check is that your coordinates are scaled properly: the algorithm assumes that all coordinates are scaled so that the mean shape vector has Euclidean norm one. If this is not the case (especially if it is much larger than one, you will get extremely small components for y).

Why does numpy std() give a different result to matlab std()?

I try to convert matlab code to numpy and figured out that numpy has a different result with the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?

The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some of (but probably not all of) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by #hbaderts gives further mathematical details.

The standard deviation is the square root of the variance. The variance of a random variable X is defined as
An estimator for the variance would therefore be
where denotes the sample mean. For randomly selected , it can be shown that this estimator does not converge to the real variance, but to
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
which will converge to . The correction term is also called Bessel's correction.
Now by default, MATLABs std calculates the unbiased estimator with the correction term n-1. NumPy however (as #ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof allows to set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB allows to add a second parameter w, which specifies the "weighing scheme". The default, w=0, results in the correction term n-1 (unbiased estimator), while for w=1, only n is used as correction term (biased estimator).

For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population
The DDOF is included for samples in order to counterbalance bias that can occur in the numbers.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.