I am using scikit-learn. The nature of my application is such that I do the fitting offline, and can then only use the resulting coefficients online (on the fly) to manually calculate various objectives.
The transform is simple: it is just data * pca.components_, i.e. a simple dot product. However, I have no idea how to perform the inverse transform. Which field of the pca object contains the relevant coefficients for the inverse transform, and how do I calculate it?
Specifically, I am referring to the PCA.inverse_transform() method of sklearn.decomposition.PCA: how can I manually reproduce its functionality using the various coefficients calculated by the PCA?
1) transform is not data * pca.components_.
Firstly, * is not the dot product for a numpy array; it is element-wise multiplication. To perform a dot product, you need to use np.dot.
Secondly, the shape of PCA.components_ is (n_components, n_features), while the shape of the data to transform is (n_samples, n_features), so you need to transpose PCA.components_ to perform the dot product.
Moreover, the first step of transform is to subtract the mean, so if you do it manually, you also need to subtract the mean first.
The correct way to transform is
data_reduced = np.dot(data - pca.mean_, pca.components_.T)
2) inverse_transform is just the inverse process of transform
data_original = np.dot(data_reduced, pca.components_) + pca.mean_
If your data already has zero mean in each column, you can omit pca.mean_ above. For example:
import numpy as np
from sklearn.decomposition import PCA

# data: array of shape (n_samples, n_features), assumed to have zero column means
pca = PCA(n_components=3)
pca.fit(data)
data_reduced = np.dot(data, pca.components_.T)          # transform
data_original = np.dot(data_reduced, pca.components_)   # inverse_transform
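As a sanity check (a minimal sketch, not from the original answer; the random data and variable names are illustrative), the manual formulas above can be compared against sklearn's own methods:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(10, 5)
pca = PCA(n_components=3).fit(data)

manual_reduced = np.dot(data - pca.mean_, pca.components_.T)
manual_original = np.dot(manual_reduced, pca.components_) + pca.mean_

print(np.allclose(manual_reduced, pca.transform(data)))                      # True
print(np.allclose(manual_original, pca.inverse_transform(manual_reduced)))   # True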
Related
Let's say this is how I do my PCA with sklearn's sklearn.decomposition.PCA:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def doPCA(arr):
    scaler = StandardScaler()
    scaler.fit(arr)
    arr = scaler.transform(arr)
    pca = PCA(n_components=2)
    X = pca.fit_transform(arr)
    return X
My current understanding is that I get an output array of the same length, but each sample is now of dimension 2.
Now, I am interested where a value in my original array arr ended up after the PCA.
My question is:
Can I assume that X[i] corresponds to arr[i]?
What you obtain as X in your code, which is U[:, :n_components] * S[:n_components], are the PCA loadings on the first n_components components. To understand why X[i] should correspond to arr[i], let us see what loadings mean.
Loadings
Imagine the eigenvectors as basis vectors for a new space with n_components dimensions. The loadings define where each data point lies in this new space; in other words, they are the original data points from the full feature space projected onto the reduced-dimensional space. They are the coefficients of the linear combination (np.dot(X, pca.components_)) that reconstructs the original full set of features from the (standardized) components.
So you can assume that X[i] corresponds to arr[i].
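A quick way to convince yourself (a minimal sketch, not from the original answer; the random array and index are illustrative) is to project a single standardized row and compare it with the corresponding row of X:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
arr = rng.rand(6, 4)

scaler = StandardScaler().fit(arr)
pca = PCA(n_components=2).fit(scaler.transform(arr))
X = pca.transform(scaler.transform(arr))

i = 3
row_projection = pca.transform(scaler.transform(arr[i:i+1]))
print(np.allclose(X[i], row_projection))   # True: X[i] corresponds to arr[i]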
I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me to convert X_normalized back to X again?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes (subtract the mean, divide by the population standard deviation) to the same data set:
[-1, 1]
The mapping is not a bijection: it's many-to-one. Thus, it's not uniquely invertible.
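The same many-to-one behavior shows up with the l2 row normalization from your code; a quick illustration (a sketch with values chosen for this example):
import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[3.0, 4.0]])
b = np.array([[6.0, 8.0]])
print(normalize(a, norm='l2'))   # [[0.6 0.8]]
print(normalize(b, norm='l2'))   # [[0.6 0.8]]  -- two different inputs, one output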
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, that easy).
Store the normalization parameters: the mean and standard deviation (or its square, the variance). This gives you the linear equation that transforms each original element into a normalized element; it's trivial to invert that equation (see the sketch below).
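A minimal sketch of that second technique with z-score scaling (illustrative data, not from the original answer):
import numpy as np

X = np.array([[-1.0, 2.0, 1.0],
              [4.0, 1.0, 2.0]])
mean = X.mean(axis=0)
std = X.std(axis=0)

X_scaled = (X - mean) / std        # forward transform
X_back = X_scaled * std + mean     # invert using the stored parameters
print(np.allclose(X_back, X))      # True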
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
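If you need a DataFrame back, you can rebuild it from the original df's columns and index (a small sketch continuing the snippet above):
import pandas as pd

unscaled_df = pd.DataFrame(unscaled, columns=df.columns, index=df.index)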
Reference: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
Singular value decomposition of a matrix M of size (m, n) means factoring M = U * Sigma * V^T, where U is an (m, m) orthogonal matrix, Sigma is an (m, n) diagonal matrix of singular values, and V is an (n, n) orthogonal matrix.
How can I obtain all three matrices using the scikit-learn and numpy packages?
I think I can obtain Sigma with PCA model:
import numpy as np
from sklearn.decomposition import PCA
model = PCA(N, copy=True, random_state=0)
model.fit(X)
singular_values = model.singular_values_
Sigma = np.diag(singular_values)
What about other matrices?
You can get these matrices using numpy.linalg.svd as follows:
a=np.array([[1,2,3],[4,5,6],[7,8,9]])
U, S, V = np.linalg.svd(a, full_matrices=True)
S is a 1-D array containing the diagonal entries of Sigma, and U and V are the corresponding matrices from the decomposition. Note that np.linalg.svd actually returns V transposed: the rows of the third output are the right singular vectors, so for this square example a equals U @ np.diag(S) @ V.
By the way, note that when you use PCA, the data is centered before the SVD is applied (unlike numpy.linalg.svd, where the SVD is applied directly to the matrix itself; see lines 409-410 of the sklearn source).
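A sketch of that relationship (illustrative data and names, not from the original answer): centering the data yourself and applying numpy's SVD reproduces the PCA singular values exactly, and the components up to sign.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(8, 4)

pca = PCA(n_components=4).fit(X)

Xc = X - X.mean(axis=0)                                   # PCA centers the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(S, pca.singular_values_))               # True
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))   # True (rows match up to sign)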
Can't comment on Miriam's answer because I don't have enough reputation, but from looking at Miriam's link, sklearn actually calls scipy's linalg.svd, which doesn't seem to be the same as np.linalg.svd (discussion here).
So it may be better to use U, S, V = scipy.linalg.svd(a, full_matrices=True)
I have been utilizing the PCA implemented in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of either in the docs.
Secondly, can these eigenvalues and eigenvectors themselves be utilized as features for classification purposes?
I am assuming here that by eigenvectors you mean the eigenvectors of the covariance matrix.
Let's say that you have n data points in a p-dimensional space, and X is a p x n matrix of your points; then the directions of the principal components are the eigenvectors of the covariance matrix XX^T. You can obtain the directions of these eigenvectors from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:
from sklearn.decomposition import PCA
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print(pca.components_)
This gives an output like
[[ 0.83849224  0.54491354]
 [ 0.54491354 -0.83849224]]
where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an eigenvector of the covariance matrix XX^T of the centered data.
As far as the eigenvalues go, there is no straightforward way to get them from the PCA object. The PCA object does have an attribute called explained_variance_ratio_ which gives the percentage of the variance explained by each component. These numbers are proportional to the eigenvalues. In the case of our toy example, we get the following if we print the explained_variance_ratio_ attribute:
[ 0.99244289 0.00755711]
This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
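In newer scikit-learn versions (0.18+, per the documentation note quoted further below) you do not need to go through the ratios at all: explained_variance_ holds the eigenvalues of the covariance matrix directly, and they can also be recovered from singular_values_. A short sketch:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
pca = PCA().fit(X)

eigenvalues = pca.explained_variance_                    # eigenvalues of the covariance matrix
print(np.allclose(eigenvalues,
                  pca.singular_values_ ** 2 / (X.shape[0] - 1)))   # True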
If your understanding of the basic mathematics of PCA is clear, then a better way to get the eigenvectors and eigenvalues is to use numpy.linalg.eig on the covariance matrix of the centered data. If your data matrix is a p x n matrix X (p features, n points), then you can use the following code:
import numpy as np

centered_matrix = X - X.mean(axis=1)[:, np.newaxis]   # subtract each feature's (row's) mean
cov = np.dot(centered_matrix, centered_matrix.T)      # scatter matrix; divide by n-1 for the covariance
eigvals, eigvecs = np.linalg.eig(cov)
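One caveat worth adding (not in the original answer): np.linalg.eig does not return the eigenvalues in any particular order, so sort them in descending order before matching them to principal components. Continuing the snippet above:
order = np.argsort(eigvals)[::-1]      # indices of eigenvalues, largest first
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]            # reorder eigenvector columns to match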
Coming to your second question: these eigenvalues and eigenvectors cannot themselves be used for classification. For classification you need features for each data point. The eigenvectors and eigenvalues that you generate are derived from the entire covariance matrix, XX^T. For dimensionality reduction you could use the projections of your original points (in the p-dimensional space) onto the principal components obtained from PCA. However, this is also not always useful, because PCA does not take into account the labels of your training data. I would recommend you look into LDA for supervised problems.
Hope that helps.
The docs say explained_variance_ will give you
"The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.", new in version 0.18.
Seems a little questionable since the first and second sentences do not seem to agree.
sklearn PCA documentation
I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would have to square the csc_matrix element-wise and then use the mean function. To get (E[X])^2 you simply square the result of the mean function applied to the original input.
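A column-wise sketch of that formula for a sparse matrix (illustrative data; mat is assumed to be a scipy.sparse.csc_matrix):
import numpy as np
from scipy.sparse import csc_matrix

mat = csc_matrix(np.array([[1.0, 0.0, 2.0],
                           [0.0, 3.0, 4.0]]))

mean = np.asarray(mat.mean(axis=0)).ravel()                      # E[X] per column
mean_of_sq = np.asarray(mat.multiply(mat).mean(axis=0)).ravel()  # E[X^2] per column
var = mean_of_sq - mean ** 2                                     # E[X^2] - (E[X])^2
std = np.sqrt(var)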
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np

# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())   # densify one column and take its variance
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the attribute var_:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
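If you only need the scaling step (which keeps the matrix sparse, since no mean is subtracted when with_mean=False), the same fitted scaler can apply it; a small sketch continuing the snippet above:
X_scaled = scaler.transform(X)   # divides each column by its std, output stays sparse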
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()          # densify
X -= X.mean(axis=0)      # subtract each column's mean
X /= X.std(axis=0)       # divide by each column's std
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (the subtraction step introduces lots of non-zero elements), so there's no use keeping the matrix in a sparse format.