pca.inverse_transform in sklearn - python

After fitting PCA to my data with
X = my data
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
Now X_pca has one dimension.
By definition, isn't the inverse transformation supposed to return the original data, that is X, a 2-D array?
When I do
X_ori = pca.inverse_transform(X_pca)
I get the same dimensions as X but different numbers.
Also, if I plot both X and X_ori, they are different.

When I perform inverse transformation by definition isn't it supposed to return to original data
No, you can only expect this if the number of components you specify is the same as the dimensionality of the input data. For any n_components less than this, you will get different numbers than the original dataset after applying the inverse PCA transformation: the following diagrams give an illustration in two dimensions.

It cannot do that: by reducing the dimensions with PCA, you have lost information (check pca.explained_variance_ratio_ for the fraction of variance you still have). However, it maps the points back to the original space as well as it can; see the picture below
(generated with
import numpy as np
import matplotlib.pyplot as plt  # needed for the plotting calls below
from sklearn.decomposition import PCA
pca = PCA(1)
X_orig = np.random.rand(10, 2)
# project onto one component, then map back to the original 2-D space
X_re_orig = pca.inverse_transform(pca.fit_transform(X_orig))
plt.scatter(X_orig[:, 0], X_orig[:, 1], label='Original points')
plt.scatter(X_re_orig[:, 0], X_re_orig[:, 1], label='InverseTransform')
# connect each original point to its reconstruction
for i in range(10):
    plt.plot([X_orig[i, 0], X_re_orig[i, 0]], [X_orig[i, 1], X_re_orig[i, 1]])
plt.legend()
plt.show()
)
If you had kept the number of dimensions the same (set pca = PCA(2)), you would recover the original points (in the plot, the new points lie on top of the original ones).
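The difference between the two cases can also be checked numerically rather than visually. The following is a minimal sketch on random data (the seed and array size are arbitrary choices, not from the original posts):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)      # arbitrary seed, for repeatability only
X = rng.random((10, 2))             # 10 points in 2-D

# One component: information is lost, so the round trip is only approximate.
pca1 = PCA(n_components=1)
X_back1 = pca1.inverse_transform(pca1.fit_transform(X))

# Two components (same as the input dimensionality): the round trip is exact
# up to floating-point error.
pca2 = PCA(n_components=2)
X_back2 = pca2.inverse_transform(pca2.fit_transform(X))

print(pca1.explained_variance_ratio_)   # fraction of variance kept by 1 component
print(np.allclose(X_back1, X))          # round trip with 1 component
print(np.allclose(X_back2, X))          # round trip with 2 components
```

With one component, the reconstructed points all lie on the line spanned by the first principal axis, which is why they differ from X while keeping the same shape.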

Related

Scipy and the hierarchical clustering input

When performing hierarchical clustering with scipy, the docs say that scipy.cluster.hierarchy.linkage takes a 1-D condensed distance matrix or a 2-D array of observation vectors as input. However, I generated a simple (symmetric) similarity matrix as a pandas DataFrame, and scipy took that as input with no problem at all; the resulting dendrogram looks just fine.
Can someone explain how this is possible? Do I have outdated docs or...?
The docs are accurate; they just don't tell you what will happen if you actually try to use an uncondensed distance matrix.
The function raises a warning but still runs because it first tries to convert the input into a numpy array. This creates a 2-D array from your 2-D DataFrame, which is then treated as a set of observation vectors (one per row), while at the same time the function recognizes from the array dimensions and symmetry that this likely isn't the input you intended.
Depending on the complexity of your input data (e.g. cluster separation, number of clusters, distribution of data across clusters), the clustering may still look like it succeeds in generating a suitable dendrogram, as you noted. This makes sense conceptually because the result is a clustering of n similarity vectors, which may be well separated in simple cases.
For example, here is some synthetic data with 150 observations and 2 clusters:
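import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import cosine, pdist, squareform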
np.random.seed(42) # for repeatability
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
obs_df = pd.DataFrame(np.concatenate((a, b),), columns=['x', 'y'])
obs_df.plot.scatter(x='x', y='y')
Z = linkage(obs_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
If you generate a similarity matrix, this is an n x n matrix that could still be clustered as if it were n vectors. I can't plot 150-D vectors, but plotting the magnitude of each vector and then the dendrogram seems to confirm a similar clustering.
def similarity_func(u, v):
    return 1-cosine(u, v)
dists = pdist(obs_df, similarity_func)
sim_df = pd.DataFrame(squareform(dists), columns=obs_df.index, index=obs_df.index)
sim_array = np.asarray(sim_df)
sim_lst = []
for vec in sim_array:
    mag = np.linalg.norm(vec,ord=1)
    sim_lst.append(mag)
pd.Series(sim_lst).plot.bar()
Z = linkage(sim_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
What we're really clustering here is a vector whose components are similarity measures of each of the 150 points. We're clustering a collection of each point's intra- and inter-cluster similarity measures. Since the two clusters are different sizes, a point in one cluster will have a rather different collection of intra- and inter-cluster similarities relative to a point in the other cluster. Hence, we get two primary clusters that are proportionate to the number of points in each cluster just as we did in the first step.
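If the square matrix is meant to be a distance matrix, one way to pass it as such is to condense it first. The sketch below uses made-up points and the 'average' method (both are illustrative assumptions) to contrast the two input forms: a square matrix is clustered row by row as observation vectors, while the condensed 1-D form from squareform is used as distances directly.

```python
import warnings
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)           # arbitrary seed
points = rng.normal(size=(6, 2))         # hypothetical observations

condensed = pdist(points)                # 1-D condensed distances, length n*(n-1)/2
square = squareform(condensed)           # 2-D symmetric matrix with zero diagonal

# Passing the square matrix: each row is treated as an observation vector.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Z_rows = linkage(square, "average")
for w in caught:
    print(w.category.__name__, w.message)

# Passing the condensed form: the values are used as pairwise distances.
Z_dist = linkage(squareform(square), "average")
```

squareform converts in both directions, so calling it on a square distance matrix returns the condensed vector that linkage expects.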

How to inverse PCA without one component?

I want to denoise signals by applying PCA, then deleting one component and inverting the PCA to get the denoised signals back.
Here's what I tried:
reduced = pca.fit_transform(signals)
denoised = np.delete(reduced, 0, 1)
result = pca.inverse_transform(denoised)
But I get the error:
ValueError: shapes (11,4) and (5,5756928) not aligned: 4 (dim 1) != 5 (dim 0)
How can I invert the PCA?
To remove noise, first fit the PCA with a number of components (e.g. pca = PCA(n_components=2)). Then look at the eigenvalues and identify the components that are noise.
After identifying these noisy components (note which ones they are), transform the whole dataset.
Example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
eigenvalues = pca.explained_variance_
print(eigenvalues)
#[7.93954312 0.06045688] # I assume that the 2nd component is noise due to λ=0.06 << 7.93
X_reduced = pca.transform(X)
#Since the 2nd component is considered noise, keep only the projections on the first component
X_reduced_selected = X_reduced[:,0]
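And to invert, zero out the noisy component and map the scores back to the original feature space:
X_denoised = X_reduced.copy()
X_denoised[:, 1] = 0  # drop the contribution of the 2nd (noisy) component
X_restored = pca.inverse_transform(X_denoised)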
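The ValueError in the question arises because np.delete removes a column from the scores while pca.components_ still has one row per fitted component. A minimal sketch of two ways around this, using random stand-in data with 11 samples and 5 components (the data and the feature count are assumptions; whitening is assumed to be off, which is the default):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signals = rng.normal(size=(11, 8))        # hypothetical stand-in for the real signals

pca = PCA(n_components=5)
reduced = pca.fit_transform(signals)      # shape (11, 5)

# Option 1: keep the shape and zero the scores of the unwanted component.
zeroed = reduced.copy()
zeroed[:, 0] = 0.0
result_zeroed = pca.inverse_transform(zeroed)

# Option 2: delete the column AND the matching row of components_,
# then reconstruct by hand (scores @ axes + mean).
kept_scores = np.delete(reduced, 0, axis=1)          # shape (11, 4)
kept_axes = np.delete(pca.components_, 0, axis=0)    # shape (4, 8)
result_manual = kept_scores @ kept_axes + pca.mean_

print(np.allclose(result_zeroed, result_manual))
```

Both routes give the same reconstruction; the first is shorter, the second makes the shapes involved explicit.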

Equivalent of Matlab's PCA in Python sklearn

I'm a Matlab user and I'm learning Python with the sklearn library. I want to translate this Matlab code
[coeff,score] = pca(X)
For coeff I have tried this in Python:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(X)
coeff = np.transpose(pca.components_)  # print() returns None, so assign first
print(coeff)
I don't know whether or not it's right; for score I have no idea.
Could anyone enlighten me about the correctness of coeff and the feasibility of score?
The sklearn PCA has a score method, as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Try pca.score(X) or pca.score_samples(X), depending on whether you want a single score for all samples (the former) or a score for each sample (the latter).
The PCA score in sklearn is different from the one in MATLAB.
In sklearn, pca.score() and pca.score_samples() give the log-likelihood of the samples, whereas MATLAB's score gives the principal component scores.
From sklearn Documentation:
Return the log-likelihood of each sample.
Parameters:
X : array, shape(n_samples, n_features)
The data.
Returns:
ll : array, shape (n_samples,)
Log-likelihood of each sample under the current model
From matlab documentation:
[coeff,score,latent] = pca(___) also returns the principal component
scores in score and the principal component variances in latent. You
can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.
The principal component variances are the eigenvalues of the
covariance matrix of X.
So the sklearn equivalent of MATLAB's score is the output of fit_transform() or transform():
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> matlab_equi_score = pca.fit_transform(X)
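Putting the pieces together, the three MATLAB outputs can be sketched as follows. The names coeff, score and latent are borrowed from the MATLAB signature for illustration; individual columns may differ from MATLAB's output by a sign, since the direction of each principal axis is a convention.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

pca = PCA()                          # keep all components
score = pca.fit_transform(X)         # rows: observations, columns: components
coeff = pca.components_.T            # columns: principal axes
latent = pca.explained_variance_     # variance along each component

# score is the centered data projected onto the principal axes
print(np.allclose(score, (X - X.mean(axis=0)) @ coeff))
# latent matches the sample variance (ddof=1) of each score column
print(np.allclose(latent, score.var(axis=0, ddof=1)))
```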

how to get original data from normalized array

I have a simple piece of code given below which normalize array in terms of row.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)  # np.float was removed in NumPy 1.24
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes (zero mean, unit standard deviation) to the same data set:
[-1, 1]
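The mapping is not a bijection: it is many-to-one, so it is not uniquely invertible. The same holds for the row-wise L2 normalization in the question: [3, 4] and [6, 8] both normalize to [0.6, 0.8].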
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, that easy).
Store the normalization parameters: mean and standard deviation (or its square, the variance). This gives you the linear equation that transforms each original element into a normalized element; it's trivial to invert that equation.
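For the row-wise L2 normalization used in the question, the parameter to store is the norm of each row. A minimal sketch, assuming a dense array and using the return_norm option of preprocessing.normalize:

```python
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)

# return_norm=True also returns the per-row norms used for scaling
X_normalized, norms = preprocessing.normalize(X, norm='l2', return_norm=True)

# multiply each row by its stored norm to undo the normalization
X_restored = X_normalized * norms[:, np.newaxis]

print(np.allclose(X_restored, X))
```

Without the stored norms there is no way to tell which of the many possible originals produced X_normalized.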
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) by default returns a numpy.ndarray, not a pandas.DataFrame.
Reference: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
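Since the scaler output is a plain array, the column names and index have to be reattached if a DataFrame is wanted back. A small sketch with a made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 30.0, 20.0]})  # hypothetical data

scaler = MinMaxScaler()
# wrap the returned array so the labels are preserved
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
unscaled = pd.DataFrame(scaler.inverse_transform(scaled), columns=df.columns, index=df.index)

print(np.allclose(unscaled.to_numpy(), df.to_numpy()))
```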

PCA: Get Top 20 Most Important Dimensions

I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
According to the documentation, the output is sorted by explained variance. So yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print the explained_variance_ attribute (or even explained_variance_ratio_) along with components_ to double-check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
So in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both (these are attributes of the fitted pca object, not of the X_reduced array). Then simply take the first N that you want from pca.components_ after finding what N explains y% of your variance.
Be aware of the dimensions here. pca.components_ has shape (n_components, n_features), so if you want the first 20 components you should use pca.components_[:20, :]. The transformed data X_reduced has shape (n_samples, n_components), so X_reduced[:, :20] selects its first 20 columns.
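One way to choose N from the explained variance is to accumulate the ratios and stop at a target. This is a sketch on synthetic data; the 90% threshold and the data shape are arbitrary assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical stand-in data: 200 samples, 30 features with decreasing scale
X = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.1, 30)

pca = PCA()
X_reduced = pca.fit_transform(X)

ratios = pca.explained_variance_ratio_    # sorted from largest to smallest
cumulative = np.cumsum(ratios)

# smallest number of leading components whose cumulative ratio reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
X_final = X_reduced[:, :n_keep]

print(n_keep, X_final.shape)
```

Because the columns of X_reduced follow the same order as explained_variance_ratio_, slicing the first n_keep columns keeps exactly the components counted in the cumulative sum.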
