I'm a Matlab user and I'm learning Python with the sklearn library. I want to translate this Matlab code
[coeff,score] = pca(X)
For coeff I have tried this in Python:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(X)
coeff = np.transpose(pca.components_)
print(coeff)
I don't know whether or not it's right; for score I have no idea.
Could anyone enlighten me about the correctness of coeff and how to obtain score?
The sklearn PCA has a score method as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Try pca.score(X) or pca.score_samples(X), depending on whether you want a single score for all samples (the former) or a score for each sample (the latter).
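For example, a minimal sketch on toy data (the array X below is just for illustration; any fitted PCA works the same way):
import numpy as np
from sklearn.decomposition import PCA

# Toy data, only for illustration.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=1)
pca.fit(X)

print(pca.score(X))          # average log-likelihood over all samples (one number)
print(pca.score_samples(X))  # log-likelihood per sample, shape (n_samples,)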
The PCA score in sklearn is different from MATLAB's score output.
In sklearn, pca.score() or pca.score_samples() gives the log-likelihood of samples under a probabilistic PCA model, whereas MATLAB's score contains the principal component scores (the data projected into the principal component space).
From sklearn Documentation:
Return the log-likelihood of each sample.
Parameters:
X : array, shape (n_samples, n_features)
The data.
Returns:
ll : array, shape (n_samples,)
Log-likelihood of each sample under the current model
From matlab documentation:
[coeff,score,latent] = pca(___) also returns the principal component
scores in score and the principal component variances in latent. You
can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.
The principal component variances are the eigenvalues of the
covariance matrix of X.
Now, the equivalent of MATLAB's score in sklearn is fit_transform() or transform():
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> matlab_equi_score = pca.fit_transform(X)
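Putting the pieces together, a rough mapping between the two APIs might look like the sketch below (continuing from the snippet above; note that the sign of each component may differ between MATLAB and sklearn, since the sign of a principal direction is arbitrary):
coeff = pca.components_.T            # ~ MATLAB coeff: columns are the principal directions
score = matlab_equi_score            # ~ MATLAB score: data projected onto the components
latent = pca.explained_variance_     # ~ MATLAB latent: variances (eigenvalues) of the components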
After fitting my data as follows:
X = my data
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
Now X_pca has one dimension.
When I perform the inverse transformation, isn't it by definition supposed to return the original data, that is, the 2-D array X?
When I do
X_ori = pca.inverse_transform(X_pca)
I get the same dimensions but different numbers.
Also, if I plot both X and X_ori, they are different.
When I perform the inverse transformation, isn't it by definition supposed to return the original data?
No, you can only expect this if the number of components you specify is the same as the dimensionality of the input data. For any n_components less than this, you will get different numbers than the original dataset after applying the inverse PCA transformation: the following diagrams give an illustration in two dimensions.
It cannot do that, since by reducing the dimensions with PCA you have lost information (check pca.explained_variance_ratio_ for the fraction of variance you still retain). However, it maps back to the original space as well as it can, see the picture below
(generated with
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(1)
X_orig = np.random.rand(10, 2)
X_re_orig = pca.inverse_transform(pca.fit_transform(X_orig))
plt.scatter(X_orig[:, 0], X_orig[:, 1], label='Original points')
plt.scatter(X_re_orig[:, 0], X_re_orig[:, 1], label='InverseTransform')
[plt.plot([X_orig[i, 0], X_re_orig[i, 0]], [X_orig[i, 1], X_re_orig[i, 1]]) for i in range(10)]
plt.legend()
plt.show()
)
If you had kept n_components the same (set pca = PCA(2)), you would recover the original points (the new points lie on top of the original ones).
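For intuition, when whiten=False the inverse transform is just the projection mapped back through the components plus the mean; a minimal sketch of doing the reconstruction by hand (random toy data, as in the plot above):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_orig = rng.rand(10, 2)

pca = PCA(1)
X_pca = pca.fit_transform(X_orig)

# With whiten=False, inverse_transform(X_pca) computes X_pca @ components_ + mean_
X_manual = X_pca @ pca.components_ + pca.mean_
X_re_orig = pca.inverse_transform(X_pca)

print(np.allclose(X_manual, X_re_orig))  # True: same (lossy) reconstruction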
I am using following code to do principal component analysis of iris data:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
dat = pd.DataFrame(data=iris.data, columns=['sl', 'sw', 'pl', 'pw'])
from sklearn.preprocessing import scale
stddat = scale(dat)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pc_out = pca.fit_transform(stddat)
pcdf = pd.DataFrame(data=pc_out, columns=['PC-1', 'PC-2'])
print(pcdf.head())
Output:
PC-1 PC-2
0 -2.264542 0.505704
1 -2.086426 -0.655405
2 -2.367950 -0.318477
3 -2.304197 -0.575368
4 -2.388777 0.674767
Now I want to determine PC-1 for a new set of values of 'sl', 'sw', 'pl' and 'pw', say 4.8, 3.1, 1.3, 0.2. How can I do this? I could not find any way to do this using the sklearn library.
Edit: as mentioned in comments, I can get PC values for new data with command pca.transform(new_data). However, I am interested in getting variable loadings so that I can use these numbers to determine PC values later and from anywhere, rather than just in current environment.
By loadings I mean "the weight by which each standardized original variable should be multiplied to get the component score" (from https://en.wikipedia.org/wiki/Principal_component_analysis ). I cannot find a method to do this on the documentation page: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Here's the transform method from the sklearn source:
def transform(self, X):
    """Apply dimensionality reduction to X.

    X is projected on the first principal components previously extracted
    from a training set.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        New data, where n_samples is the number of samples
        and n_features is the number of features.

    Returns
    -------
    X_new : array-like, shape (n_samples, n_components)

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.decomposition import IncrementalPCA
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> ipca = IncrementalPCA(n_components=2, batch_size=3)
    >>> ipca.fit(X)
    IncrementalPCA(batch_size=3, copy=True, n_components=2, whiten=False)
    >>> ipca.transform(X) # doctest: +SKIP
    """
    check_is_fitted(self, ['mean_', 'components_'], all_or_any=all)

    X = check_array(X)
    if self.mean_ is not None:
        X = X - self.mean_
    X_transformed = np.dot(X, self.components_.T)
    if self.whiten:
        X_transformed /= np.sqrt(self.explained_variance_)
    return X_transformed
The variable loadings are the components, which you get from pca.components_. Be sure that mean_ is (approximately) 0 and whiten is False, and then you can simply take that matrix and use it wherever you want to transform your matrices/vectors.
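As a minimal sketch of that idea (assuming the standardization from the question, i.e. scale() with per-column mean and population standard deviation; the names means, stds and new_sample are just for illustration), you could store the loadings and the scaling parameters and reproduce PC values anywhere:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
dat = pd.DataFrame(data=iris.data, columns=['sl', 'sw', 'pl', 'pw'])

# Parameters used by scale(dat): column means and (population) standard deviations.
means = dat.values.mean(axis=0)
stds = dat.values.std(axis=0)

pca = PCA(n_components=2)
pca.fit((dat.values - means) / stds)

# Loadings: one row per component, one column per standardized variable.
loadings = pca.components_

# New raw observation: standardize it the same way, subtract pca.mean_
# (essentially 0 here, since the training data were standardized), then project.
new_sample = np.array([4.8, 3.1, 1.3, 0.2])
new_std = (new_sample - means) / stds
pc_values = (new_std - pca.mean_) @ loadings.T
print(pc_values)  # PC-1 and PC-2 for the new observation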
I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
The documentation says the output is sorted by explained variance, so yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print explained_variance_ (or explained_variance_ratio_) along with components_ to double-check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
So in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both. Then simply take the first N columns of X_reduced after finding what N explains the share of the variance you want.
Be aware of the shapes: pca.components_ has shape [n_components, n_features], so the first 20 principal directions are pca.components_[:20, :], while the first 20 columns of the transformed data are X_reduced[:, :20], as you suggested.
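As a minimal sketch of choosing N from the cumulative explained variance (this reuses pca and X_reduced from your snippet; the 0.95 threshold is just an example):
import numpy as np

cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest N such that the first N components explain at least 95% of the variance.
N = int(np.searchsorted(cumulative, 0.95)) + 1

X_final = X_reduced[:, :N]               # first N columns of the transformed data
top_directions = pca.components_[:N, :]  # the corresponding principal directions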
I perform SVD with sklearn.decomposition.PCA
From the SVD equation
A = U * S * V_t
where V_t is the transpose of V: if I want the matrices U, S, and V, how can I get them when I use sklearn.decomposition.PCA?
First of all, depending on the size of your matrix, sklearn's implementation of PCA will not always compute the full SVD decomposition. The following is taken from PCA's source in the GitHub repository:
svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
auto :
the solver is selected by a default policy based on `X.shape` and
`n_components`: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient 'randomized'
method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.
full :
run exact full SVD calling the standard LAPACK solver via
`scipy.linalg.svd` and select the components by postprocessing
arpack :
run SVD truncated to n_components calling ARPACK solver via
`scipy.sparse.linalg.svds`. It requires strictly
0 < n_components < X.shape[1]
randomized :
run randomized SVD by the method of Halko et al.
In addition, it also performs some manipulations on the data (see here).
Now, if you want to get the U, S, V that are used inside sklearn.decomposition.PCA, you can use pca._fit(X) (note that _fit is a private method, so it may change between versions).
For example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]])
pca = PCA(n_components=2)
print(pca._fit(X))
prints
(array([[ -3.55731195e-01, 5.05615563e-01],
[ 2.88830295e-04, -3.68261259e-01],
[ 7.10884729e-01, -2.74708608e-01],
[ -5.68187889e-01, -4.43103380e-01],
[ 2.12745524e-01, 5.80457684e-01]]),
array([ 9.950385 , 0.76800941]),
array([[ 0.69988535, 0.71425521],
[ 0.71425521, -0.69988535]]))
However, if you just want the SVD decomposition of the original data, I would suggest to use scipy.linalg.svd
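For example, a minimal sketch of that comparison (keeping in mind that sklearn centers the data before its internal SVD, and that the signs of individual components may be flipped between the two results):
import numpy as np
from scipy.linalg import svd
from sklearn.decomposition import PCA

X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]], dtype=float)

# SVD of the centered data, which is what PCA decomposes internally.
Xc = X - X.mean(axis=0)
U, S, Vt = svd(Xc, full_matrices=False)

pca = PCA(n_components=2).fit(X)
print(S)                     # singular values from scipy
print(pca.singular_values_)  # should match up to floating point (recent sklearn versions)
print(Vt)
print(pca.components_)       # same rows, possibly with flipped signs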
I have been utilizing PCA implemented in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of both in the docs.
Secondly, can these eigenvalues and eigenvectors themselves be utilized as features for classification purposes?
I am assuming here that by eigenvectors you mean the eigenvectors of the covariance matrix.
Let's say you have n data points in a p-dimensional space, and X is a p x n matrix of your points; then the directions of the principal components are the eigenvectors of the covariance matrix XX^T. You can obtain the directions of these eigenvectors from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:
from sklearn.decomposition import PCA
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print(pca.components_)
This gives an output like
[[ 0.83849224 0.54491354]
[ 0.54491354 -0.83849224]]
where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an eigenvector of the covariance matrix of the centered data, XX^T.
As far as the eigenvalues go, there is no straightforward way to get them from the PCA object. The PCA object does have an attribute called explained_variance_ratio_ which gives the percentage of the variance explained by each component. These numbers are proportional to the eigenvalues. In the case of our toy example, we get the following if we print the explained_variance_ratio_ attribute:
[ 0.99244289 0.00755711]
This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
If your understanding of the basic mathematics of PCA is clear, then a better way to get the eigenvectors and eigenvalues is to use numpy.linalg.eig on the centered covariance matrix. If your data matrix is a p x n matrix X (p features, n points), then you can use the following code:
import numpy as np

# Center each feature (row), then form the scatter matrix; divide by (n - 1)
# for the sample covariance (the eigenvectors are the same either way).
centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)
Coming to your second question: these eigenvalues and eigenvectors cannot be used by themselves for classification. For classification you need features for each data point. The eigenvectors and eigenvalues you generate are derived from the entire covariance matrix, XX^T. For dimensionality reduction you could use the projections of your original points (in the p-dimensional space) onto the principal components obtained from PCA. However, this is not always useful either, because PCA does not take into account the labels of your training data. I would recommend you look into LDA for supervised problems.
Hope that helps.
The docs say explained_variance_ will give you
"The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.", new in version 0.18.
Seems a little questionable since the first and second sentences do not seem to agree.
sklearn PCA documentation
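A quick sketch to check that claim on the toy data used earlier in this thread (with n_components=2 all eigenvalues are kept):
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
pca = PCA(n_components=2).fit(X)

# np.cov uses the 1/(n-1) convention, matching sklearn's explained_variance_.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]

print(pca.explained_variance_)  # should equal eigvals up to floating point
print(eigvals)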