I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
According to the documentation, the output is sorted by explained variance. So yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print explained_variance_ (or explained_variance_ratio_) along with components_ to double-check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
So in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both (these are attributes of the fitted pca object, not of the X_reduced array). Then simply take the first N components after finding what N explains y% of your variance.
Be aware of the orientation of each array. pca.components_ has the shape [n_components, n_features], so if you want the first 20 components you should use pca.components_[:20, :]. The transformed data X_reduced has the shape [n_samples, n_components], so X_reduced[:, :20] is the right slice there.
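To make the "find N, then slice" step concrete, here is a minimal sketch. It assumes a synthetic array in place of df_normalized and an arbitrary 90% target; both are illustrative choices, not values from the question.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for df_normalized (assumption: 200 samples, 30 features
# with decreasing spread, so a few directions dominate the variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.1, 30)

pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(X)

# explained_variance_ratio_ is sorted from largest to smallest.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest N whose leading components explain at least `target` of the variance.
target = 0.90  # hypothetical threshold
n_keep = int(np.searchsorted(cumulative, target) + 1)

X_final = X_reduced[:, :n_keep]         # rows = samples, columns = components
top_axes = pca.components_[:n_keep, :]  # rows = components, columns = features
print(n_keep, X_final.shape, top_axes.shape)
```

Note the two different slices: columns for the transformed data, rows for components_.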
Here is the idea:
There is a huge 2D dataset (250,000 datapoints).
I need to get rid of 90% of the datapoints without hurting the data structure, which (I believe) means getting rid of the closest ones. Density must decrease...
Since we need to keep the structure, we can't just randomly delete 90%, as this might cause bias. There may be a small element of randomness in this, but not too much.
I can put the data in a 2D matrix and divide it into cells. Some cells will then have more datapoints, some will have fewer, and some will have none.
I need an algorithm that will group those datapoints, or the cells in my matrix, into segments that all contain a relatively similar number of datapoints. Those segments, or cells in the "new" matrix, can be different sizes (which I believe is the point of this algorithm).
I've drawn a picture. It is not accurate but I hope it will make idea a bit clearer.
Also I code in python :^)
Thank you!!
The algorithm you are looking for is an unsupervised learning method (clustering); the best-known one is k-means, which is available in Python through scikit-learn.
You can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Here is a code example for an array:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
If you have to adjust it for a dataframe (df), it looks like this:
from sklearn.cluster import KMeans
X = df[['column A',..., 'column D']]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
The output labels tell you which cluster each datapoint belongs to.
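One possible way to turn the clusters into a thinned dataset is to keep a single real point per cluster. The sketch below is only an illustration under assumptions: the data is synthetic, the 10% fraction comes from the question, and choosing the point nearest each centroid is my own suggestion rather than something the answer prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
# Hypothetical 2D data: one dense blob and one sparse blob.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(1500, 2)),
    rng.normal(loc=(5, 5), scale=2.0, size=(500, 2)),
])

keep_fraction = 0.10
n_keep = int(len(points) * keep_fraction)

# One cluster per point we want to keep.
kmeans = KMeans(n_clusters=n_keep, n_init=1, random_state=0).fit(points)

# For each centroid, find the index of the closest original point.
nearest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, points)

# np.unique guards against two centroids sharing the same nearest point.
kept = points[np.unique(nearest_idx)]
print(kept.shape)
```

For 250,000 points with this many clusters, plain KMeans may be slow; sklearn.cluster.MiniBatchKMeans is a drop-in alternative that might be worth trying.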
After fitting my data with:
X = my data
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
Now X_pca has one dimension.
When I perform inverse transformation by definition isn't it supposed to return to original data, that is X, 2-D array?
When I do
X_ori = pca.inverse_transform(X_pca)
I get the same dimensions but different numbers.
Also, if I plot both X and X_ori, they are different.
When I perform inverse transformation by definition isn't it supposed to return to original data
No, you can only expect this if the number of components you specify is the same as the dimensionality of the input data. For any n_components less than this, you will get different numbers than the original dataset after applying the inverse PCA transformation: the following diagrams give an illustration in two dimensions.
It cannot do that: by reducing the dimensions with PCA, you've lost information (check pca.explained_variance_ratio_ for the fraction of variance you still have). However, it tries its best to go back to the original space as well as it can; see the picture below
(generated with
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(1)
X_orig = np.random.rand(10, 2)
# Project onto one component, then map back to the original 2D space
X_re_orig = pca.inverse_transform(pca.fit_transform(X_orig))
plt.scatter(X_orig[:, 0], X_orig[:, 1], label='Original points')
plt.scatter(X_re_orig[:, 0], X_re_orig[:, 1], label='InverseTransform')
# Connect each original point to its reconstruction
for i in range(10):
    plt.plot([X_orig[i, 0], X_re_orig[i, 0]], [X_orig[i, 1], X_re_orig[i, 1]])
plt.legend()
plt.show()
)
If you had kept the number of dimensions the same (set pca = PCA(2)), you would recover the original points (the new points are on top of the original ones):
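The same comparison can be checked numerically without a plot. This is a small sketch on random data; the variable names are mine.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_orig = rng.random((10, 2))

# One component: the round trip loses the variance along the dropped axis.
pca_1 = PCA(n_components=1)
X_lossy = pca_1.inverse_transform(pca_1.fit_transform(X_orig))

# Two components (same as the input dimensionality): the round trip is exact
# up to floating-point error.
pca_2 = PCA(n_components=2)
X_exact = pca_2.inverse_transform(pca_2.fit_transform(X_orig))

print("1 component recovers X:", np.allclose(X_orig, X_lossy))
print("2 components recover X:", np.allclose(X_orig, X_exact))
print("variance kept with 1 component:", pca_1.explained_variance_ratio_)
```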
I'm a Matlab user and I'm learning Python with the sklearn library. I want to translate this Matlab code
[coeff,score] = pca(X)
For coeff I have tried this in Python:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(X)
coeff = print(np.transpose(pca.components_))
I don't know whether or not it's right; for score I have no idea.
Could anyone enlighten me about the correctness of coeff and the feasibility of score?
The sklearn PCA has a score method as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Try: pca.score(X) or pca.score_samples(X) depending whether you wish a score for each sample (the latter) or a single score for all samples (the former)
The PCA score in sklearn is different from matlab.
In sklearn, pca.score() or pca.score_samples() gives the log-likelihood of samples whereas matlab gives the principal components.
From sklearn Documentation:
Return the log-likelihood of each sample.
Parameters:
X : array, shape(n_samples, n_features)
The data.
Returns:
ll : array, shape (n_samples,)
Log-likelihood of each sample under the current model
From matlab documentation:
[coeff,score,latent] = pca(___) also returns the principal component
scores in score and the principal component variances in latent. You
can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.
The principal component variances are the eigenvalues of the
covariance matrix of X.
Now, the equivalent of matlab score in pca is fit_transform() or transform() :
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> matlab_equi_score = pca.fit_transform(X)
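Putting the pieces together, a rough sketch of the MATLAB outputs in scikit-learn might look like the following. The names coeff, score and latent mirror the MATLAB call; the mapping is my reading of the two documentations, and the signs of individual components may differ between the two tools. Note that coeff = print(...) in the question would store None, because print returns nothing.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

pca = PCA()
score = pca.fit_transform(X)     # rows = observations, columns = components
coeff = pca.components_.T        # columns = principal axes
latent = pca.explained_variance_ # variance along each component

# Sanity check: the scores are the centered data projected onto coeff.
manual_score = (X - pca.mean_) @ coeff
print(np.allclose(score, manual_score))
```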
I have a simple piece of code given below which normalize array in terms of row.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1,2,1],
[4,1,2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these normalizes to the same data set:
[-1, 1]
The mapping is not a bijection: it's many-to-one. Thus, it's not uniquely invertible.
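The three data sets above collapse to [-1, 1] under mean and standard deviation normalization (z-scoring), which appears to be the normalization this answer has in mind. A quick check, using a helper name of my own:

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Three different inputs, one normalized result.
for data in ([3, 4], [-18, 20], [0, 0.0001]):
    print(data, "->", standardize(data))
```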
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, that easy).
Store the normalization parameters: mean and standard deviation (or its square, the variance). This gives you the linear equation that transforms each original element into a normalized element; it's trivial to invert that equation.
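For the row-wise L2 normalization in the question, the parameter to store is each row's norm rather than a mean and standard deviation. A minimal sketch, assuming you compute the norms yourself with NumPy before normalizing:

```python
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)

# Keep the L2 norm of every row; this is what normalize() divides by.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

X_normalized = preprocessing.normalize(X, norm='l2')

# Multiplying back by the stored norms undoes the normalization.
X_recovered = X_normalized * row_norms
print(np.allclose(X, X_recovered))
```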
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
[Reference][1]
[1]: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
I want to perform principal component analysis for dimension reduction and data integration.
I have 3 features(variables) and 5 samples like below. I want to integrate them into 1-dimensional(1 feature) output by transforming them(computing 1st PC). I want to use transformed data for further statistical analysis, because I believe that it displays the 'main' characteristics of 3 input features.
I first wrote some test code in Python using scikit-learn, shown below. It is the simple case where the values of the 3 features are all identical. In other words, I applied PCA to three identical vectors, [0, 1, 2, 1, 0].
Code
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
samples = np.array([[0,0,0],[1,1,1],[2,2,2],[1,1,1],[0,0,0]])
pc1 = pca.fit_transform(samples)
print (pc1)
Output
[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
1. Is taking the 1st PC after dimension reduction a proper approach for data integration?
1-2. For example, suppose the features are [power rank, speed rank] in a 2-feature case, and power is roughly negatively correlated with speed. I want to find the samples that have both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs [power 3, speed 3].
So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the '1st PC'. Is this kind of approach still proper?
2. In this case, I think the output should also be [0, 1, 2, 1, 0], which is the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?
Yes. It is also called data projection (to the lower dimension).
The result is correct. PCA centers the data using the mean of the training data and then projects it onto a unit-length direction, so the output is shifted and rescaled relative to the input.
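The numbers in the question can be reproduced by hand. With three identical features, the first principal direction is along (1, 1, 1)/sqrt(3), so each centered value is multiplied by sqrt(3). A short check (the sign of a principal component is arbitrary, so the comparison uses absolute values):

```python
import numpy as np
from sklearn.decomposition import PCA

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]],
                   dtype=float)

pca = PCA(n_components=1)
pc1 = pca.fit_transform(samples).ravel()

# Center one feature (mean is 0.8) and scale by sqrt(3), the length of (1, 1, 1).
centered = samples[:, 0] - samples[:, 0].mean()
by_hand = centered * np.sqrt(3)

print(pc1)
print(by_hand)
print(np.allclose(np.abs(pc1), np.abs(by_hand)))
```

For example, (0 - 0.8) * sqrt(3) is about -1.3856, which matches the first value in the question's output.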
With only 5 samples, I don't think it is wise to run any statistical methods. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1, and then you can disregard the other dimensions.
There is no need to use PCA for a dataset this small. Also, for PCA your array should be scaled.
In any case, you have only 3 dimensions: you can plot the points and inspect them visually, or you can calculate distances (for example, with some kind of nearest-neighbors algorithm).