How to invert PCA without one component? - python

I want to denoise signals by applying PCA, deleting one component, and then inverting the PCA to get the denoised signals back.
Here's what I tried:
reduced = pca.fit_transform(signals)
denoised = np.delete(reduced, 0, 1)
result = pca.inverse_transform(denoised)
But I get this error:
ValueError: shapes (11,4) and (5,5756928) not aligned: 4 (dim 1) != 5 (dim 0)
How can I invert the PCA?

To remove noise, first fit a PCA with a chosen number of components (pca = PCA(n_components=2)). Then look at the eigenvalues and identify which components are noise.
After identifying these noisy components (note their indices), transform the whole dataset.
Example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
eigenvalues = pca.explained_variance_
print(eigenvalues)
#[7.93954312 0.06045688] # I assume that the 2nd component is noise due to λ=0.06 << 7.93
X_reduced = pca.transform(X)
#Since the 2nd component is considered noise, keep only the projections on the first component
X_reduced_selected = X_reduced[:,0]
And to invert, use this:
pca.inverse_transform(X_reduced)[:,0]
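If the goal is specifically the asker's denoising setup (reconstruct the signals with one component removed), a shape-safe alternative is to zero out that component's scores instead of deleting the column, so inverse_transform still sees the number of columns it expects. A rough sketch, with made-up stand-ins for the signals and the index of the noisy component:
import numpy as np
from sklearn.decomposition import PCA
signals = np.random.rand(11, 5)              # stand-in for the asker's signals
noisy_idx = 0                                # hypothetical index of the component judged to be noise
pca = PCA()                                  # keep all components so shapes match on the way back
scores = pca.fit_transform(signals)
scores[:, noisy_idx] = 0                     # remove that component's contribution
denoised = pca.inverse_transform(scores)     # back to the original space, without the noisy component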

Related

Defining custom layer for tensorflow model

I'm trying to build a model in tensorflow that should take a number of points with n dimensions and find a set of hyperplanes that form a hull around one set of points while including as little of another set of points as possible.
To do this I would input a matrix of size [n, np], with n denoting the dimensions and np the number of points, each defined in n dimensions. Like:
[[ 0.04370488 -0.09842589 -0.01787493]
[ 0.1415032 0.05342565 0.63025913]
[-0.84298323 -0.91433908 -0.9716289 ]
[ 0.19159608 -0.68356499 0.55441537]
[ 0.34797942 0.55592542 -0.74667198]]
As a last layer I would like to have n+1 hyperplanes that are each defined by two vectors, one of them pointing to a point on the hyperplane, the other being the normal vector of the hyperplane. In three dimensions this would give me 4 hyperplanes each defined by 2 vectors with 3 dimensions. So I guess this would be a 4x2x3 matrix or 24 values. Like:
[[0, 0, 0] [1, 0, 0]]
[[0, 0, 0] [0, 1, 0]]
[[0, 0, 0] [0, 0, 1]]
[[5, 5, 5] [-1, -1, -1]]
I was thinking of this layer either being the output of the model, OR being used to calculate whether a point is on the inside or outside of the hull, which could just be encoded as 0 or 1.
For now I have a barebones model where I managed to input a matrix with the correct size, but I couldn't yet manage to write a loss function or custom layer that makes it possible to evaluate whether a point is inside or outside of the hull.
The ys array is an (800, 1) array containing labels for each point, saying whether it should be inside or outside the hull.
from tensorflow import keras
import numpy as np

def in_convex_hull(point, plane_point, plane_normal):
    # the sign of this dot product says which side of the plane the point lies on;
    # here "inside" is taken as the non-positive side of the outward normal
    if np.dot(plane_normal, point - plane_point) <= 0:
        return True
    return False

def custom_loss(actual, pred):
    # TODO: penalise points that end up on the wrong side of the predicted hull
    loss = 0
    return loss

def custom_layer():
    # TODO: turn the 24 raw outputs into 4 (point, normal) pairs
    return

model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[800, 3])])
model.add(keras.layers.Dense(1000))
model.add(keras.layers.Dense(24))
model.compile(optimizer='Adam', loss='BinaryCrossentropy', metrics=["accuracy"])

xs = np.array([np.random.rand(800, 3) for i in range(1)])   # one batch of 800 random 3D points
ys = np.array([np.eye(1)[np.random.choice(1, 800)]])        # dummy labels of shape (1, 800, 1)
history = model.fit(xs, ys, epochs=10, batch_size=1, verbose=1)
Any pointers on how this setup could be achieved are greatly appreciated.
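One possible direction for the loss function: the in/out test can be written with differentiable tensor operations so that it can sit inside a custom loss. The sketch below is only my own rough idea, not a known recipe; soft_inside_hull and sharpness are made-up names, and it assumes the 24 outputs are reshaped to [4, 2, 3] with outward-pointing normals:
import tensorflow as tf
def soft_inside_hull(points, planes, sharpness=10.0):
    # points: [num_points, n]; planes: [n+1, 2, n] (row 0 = point on the plane, row 1 = outward normal)
    plane_points = planes[:, 0, :]                                      # [n+1, n]
    normals = tf.math.l2_normalize(planes[:, 1, :], axis=-1)            # [n+1, n]
    diffs = points[:, None, :] - plane_points[None, :, :]               # [num_points, n+1, n]
    signed_dist = tf.reduce_sum(diffs * normals[None, :, :], axis=-1)   # [num_points, n+1]
    # a point counts as inside when it lies on the non-positive side of every plane
    worst = tf.reduce_max(signed_dist, axis=-1)                         # [num_points]
    return tf.sigmoid(-sharpness * worst)                               # ~1 inside, ~0 outside
This could then feed something like tf.keras.losses.binary_crossentropy(labels, soft_inside_hull(points, tf.reshape(pred, [4, 2, 3]))) inside custom_loss, so the whole thing stays differentiable.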

pca.inverse_transform in sklearn

After fitting my data with
X = my data
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
Now X_pca has one dimension.
When I perform the inverse transformation, isn't it by definition supposed to return the original data, i.e. X, a 2-D array?
When I do
X_ori = pca.inverse_transform(X_pca)
I get the same dimensions but different numbers.
Also, if I plot both X and X_ori, they are different.
When I perform inverse transformation by definition isn't it supposed to return to original data
No, you can only expect this if the number of components you specify is the same as the dimensionality of the input data. For any n_components less than this, you will get different numbers than the original dataset after applying the inverse PCA transformation: the following diagrams give an illustration in two dimensions.
It cannot do that, since by reducing the dimensions with PCA you've lost information (check pca.explained_variance_ratio_ for the % of variance you still have). However, inverse_transform goes back to the original space as well as it can; see the picture below
(generated with
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(1)
X_orig = np.random.rand(10, 2)
X_re_orig = pca.inverse_transform(pca.fit_transform(X_orig))
plt.scatter(X_orig[:, 0], X_orig[:, 1], label='Original points')
plt.scatter(X_re_orig[:, 0], X_re_orig[:, 1], label='InverseTransform')
[plt.plot([X_orig[i, 0], X_re_orig[i, 0]], [X_orig[i, 1], X_re_orig[i, 1]]) for i in range(10)]
plt.legend()
plt.show()
)
If you had kept the number of dimensions the same (pca = PCA(2)), you would recover the original points exactly (the reconstructed points land on top of the original ones).
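A quick way to confirm that full-rank claim (my own check, not part of the original answer):
import numpy as np
from sklearn.decomposition import PCA
X_orig = np.random.rand(10, 2)
pca_full = PCA(n_components=2)                               # as many components as features
X_back = pca_full.inverse_transform(pca_full.fit_transform(X_orig))
print(np.allclose(X_orig, X_back))                           # True: full-rank PCA loses nothing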

Equivalent of Matlab's PCA in Python sklearn

I'm a Matlab user and I'm learning Python with the sklearn library. I want to translate this Matlab code
[coeff,score] = pca(X)
For coeff I have tried this in Python:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(X)
coeff = print(np.transpose(pca.components_))
I don't know whether or not it's right; for score I have no idea.
Could anyone enlighten me about the correctness of coeff and the feasibility of score?
The sklearn PCA has a score method as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Try pca.score(X) or pca.score_samples(X), depending on whether you want a score for each sample (the latter) or a single score for all samples (the former).
The PCA score in sklearn is different from matlab's.
In sklearn, pca.score() or pca.score_samples() gives the log-likelihood of samples, whereas matlab's score gives the principal component scores (the projections of the data).
From sklearn Documentation:
Return the log-likelihood of each sample.
Parameters:
X : array, shape(n_samples, n_features)
The data.
Returns:
ll : array, shape (n_samples,)
Log-likelihood of each sample under the current model
From matlab documentation:
[coeff,score,latent] = pca(___) also returns the principal component
scores in score and the principal component variances in latent. You
can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.
The principal component variances are the eigenvalues of the
covariance matrix of X.
Now, the equivalent of matlab's score in sklearn's PCA is fit_transform() or transform():
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> matlab_equi_score = pca.fit_transform(X)
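For completeness (my addition, not part of the answer above), the other matlab outputs map onto sklearn attributes roughly like this, up to sign conventions:
>>> matlab_equi_coeff = pca.components_.T          # loadings, one column per component, like matlab's coeff
>>> matlab_equi_latent = pca.explained_variance_   # component variances, like matlab's latent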

PCA: Get Top 20 Most Important Dimensions

I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
The documentation says the output is sorted by explained variance. So yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print explained_variance_ (or explained_variance_ratio_) along with components_ to double-check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
so in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both. Then simply take the first N that you want from pca.components_ after finding what N explains y% of your variance.
Be aware! In your suggested solution you mix up the dimensions. pca.components_ has shape [n_components, n_features], so if you want the first 20 components you should use pca.components_[:20, :], I believe.
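A small sketch tying the above together, under my own assumptions (df_normalized here is a random stand-in for the asker's normalised data):
import numpy as np
from sklearn.decomposition import PCA
df_normalized = np.random.rand(2208, 200)              # stand-in for the asker's normalised data
pca = PCA(n_components=0.98)                           # keep enough components for 98% of the variance
X_reduced = pca.fit_transform(df_normalized)
X_final = X_reduced[:, :20]                            # columns are already ordered by explained variance
print(np.cumsum(pca.explained_variance_ratio_[:20]))   # how much variance those first 20 columns retain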

How to run and interpret Fisher's Linear Discriminant Analysis from scikit-learn

I am trying to run a Fisher's LDA (1, 2) to reduce the number of features of a matrix.
Basically, correct me if I am wrong: given n samples classified into several classes, Fisher's LDA tries to find an axis such that projecting onto it maximizes the value J(w), which is the ratio of the total sample variance to the sum of the variances within the separate classes.
I think this can be used to find the most useful features for each class.
I have a matrix X of m features and n samples (m rows, n columns).
I have a sample classification y, i.e. an array of n labels, each one for each sample.
Basing on y I want to reduce the number of features to, for example, 3 most representative features.
Using scikit-learn, I tried it this way (following this documentation):
>>> import numpy as np
>>> from sklearn.lda import LDA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LDA(n_components=3)
>>> clf.fit_transform(X, y)
array([[ 4.],
[ 4.],
[ 8.],
[-4.],
[-4.],
[-8.]])
At this point I am a bit confused: how do I obtain the most representative features?
The features you are looking for are in clf.coef_ after you have fitted the classifier.
Note that n_components=3 doesn't make sense here, since X.shape[1] == 2, i.e. your feature space only has two dimensions.
You do not need to invoke fit_transform in order to obtain coef_, calling clf.fit(X, y) will suffice.
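As a side note (my addition, not part of the original answer): sklearn.lda is gone in current scikit-learn releases, where the class lives in sklearn.discriminant_analysis. The same example would then look like this:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LinearDiscriminantAnalysis(n_components=1)   # at most min(n_classes - 1, n_features) components
X_proj = clf.fit(X, y).transform(X)
print(clf.coef_)                                   # one weight per original feature; larger magnitude = more discriminative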
