I have a dataset that consists of 13 columns and I wanted to use PCA for data reduction, i.e. to remove unwanted columns. My problem is that PCA doesn't really show column names, only PC1, PC2, etc. I found out that an extra trees classifier does something similar but does indicate the importance of each column. I just wanted to make sure whether they both have the same objective, or whether they differ in their outcome. Also, would anyone suggest a better method for data reduction?
My last question: I have code for an extra trees classifier and wanted to confirm whether it is correct or not.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier

df = pd.read_csv('D:\\Project\\database\\5-FINAL2\\Final After Simple Filtering.csv')

# ExtraTreesClassifier is supervised, so it needs a target as well as the features.
# 'target' is a placeholder name here - replace it with your actual label column.
X = df.drop(columns=['target'])
y = df['target']

extra_tree_forest = ExtraTreesClassifier(n_estimators=500, criterion='entropy')
extra_tree_forest.fit(X, y)

# Mean impurity-based importance of each feature across the forest.
feature_importance = extra_tree_forest.feature_importances_
# Spread of the per-tree importances (a variability measure, not a normalisation).
feature_importance_std = np.std(
    [tree.feature_importances_ for tree in extra_tree_forest.estimators_], axis=0)

plt.bar(X.columns, feature_importance, yerr=feature_importance_std)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
Thank You.
The two methods are very different.
PCA doesn't show you the feature names because dimensionality reduction with PCA doesn't really have anything to do with the relative importance of the individual features. PCA takes the original data and transforms it into a space where each new 'feature' (principal component) is uncorrelated with the others, and you can tell how important each principal component is for faithfully representing the data from its corresponding eigenvalue. Removing the least important principal components reduces dimensionality in principal-component space, but not in the original feature space - so you need to apply the same PCA transformation to all future data, too, and then perform all your classification on the (shortened) principal-component vectors.
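For concreteness, here is a minimal sketch of that workflow (the array names and the choice of 5 components are placeholders, not something taken from your data):
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 13 numeric columns, as in your dataset.
X_train = np.random.rand(100, 13)
X_future = np.random.rand(10, 13)

pca = PCA(n_components=5)                  # keep the 5 strongest components
X_train_pc = pca.fit_transform(X_train)    # learn the rotation from the training data
X_future_pc = pca.transform(X_future)      # apply the SAME rotation to all future data

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)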
An extra-trees classifier trains an entire classifier on your data, so it does much more than just dimensionality reduction. However, it does seem closer to what you're looking for, since the feature importances directly tell you how relevant each feature is when making a classification.
Note that in PCA, the principal components with the highest eigenvalues contribute the most to accurately reconstructing the data. This is not the same as contributing the most to accurately classifying the data. The extra tree classifier is the reverse: it tells you what features are most important when classifying the data, not when reconstructing it.
Basically, if you think you have a representative dataset right now and are comfortable only storing variables that are relevant to classifying the data you already have, dimensionality reduction with extra trees is a good choice for you. If you just want to faithfully represent the data with less space without being overly concerned about the effects on classification, PCA is the better choice. Dimensionality reduction with PCA will often also help remove irrelevant features from the original data, but that's not what it's optimized for.
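If keeping a subset of the original columns is what you are after, one common pattern is to let the extra-trees importances drive the selection, for instance with SelectFromModel (this is my suggestion, not something from your code; a sketch under the assumption that you have a feature DataFrame X and labels y):
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Stand-in data with named columns; replace with your own X and y.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 13)), columns=[f"col_{i}" for i in range(13)])
y = rng.integers(0, 2, size=200)

# Keep only the columns whose importance is above the median importance.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=500, random_state=0),
    threshold="median",
).fit(X, y)

kept_columns = X.columns[selector.get_support()]
print(kept_columns)   # original column names that survive the cut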
I recently started working in the field of machine learning with Python. Today I'm working on a dataset where I would like to apply dimensionality reduction and then evaluate my model's score. This dataset has 30 features.
I started with a simple algorithm, logistic regression, but before applying the logistic regression I want to do a PCA.
To determine the best number of components I used GridSearchCV, tuning only the C parameter of the logistic regression and the number of components of the PCA.
The result I got is that the more components I use for the PCA, the better the precision score. For example, with n_components=30 I get a precision score of 0.81.
The problem is that I thought PCA was used for dimensionality reduction (i.e. working with fewer features) and that it could help increase the score. Is there something I don't understand?
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pca = PCA()
logistic = LogisticRegression()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'logistic__C': [0.01, 0.1, 1, 10, 100]
}

# X_train / y_train come from my train/test split
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='precision')
search.fit(X_train, y_train)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
Output:
Best parameter (CV score=0.881):
{'logistic__C': 0.01, 'pca__n_components': 30}
Thanks in advance for your reply
EDIT: I added this screenshot for more information on how the score varies with the number of components.
In general, when you do dimensionality reduction, you lose some information. It is therefore not surprising that you get a higher score with the full set of PCA features. Working with fewer features can indeed increase the score, but not necessarily; there are also other good reasons for using PCA for dimension reduction. Here are the main advantages of PCA:
PCA is one good technique for dimension reduction (with its own limitations) in the sense that it concentrates the variance of the dataset in the first dimensions of the computed new space. Hence, dropping the last features is done at a minimal cost in terms of information carried by the dataset (under certain hypotheses). Using PCA for dimension reduction mitigates the risk of overfitting by limiting the number of features, while losing a minimal amount of information. In this sense, fewer features can increase the score by avoiding overfitting, but that is not always true.
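To make that "minimal cost" concrete, a quick way to see how much variance you keep for a given number of components (just a sketch, reusing the X_train from your code) is:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)                               # keep all 30 components for inspection
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)                                      # variance retained by 1, 2, ... components

# Smallest number of components that keeps at least 95% of the variance.
print(int(np.argmax(cumulative >= 0.95)) + 1)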
Dimension reduction with PCA can also be useful when working with noisy data. PCA will not directly eliminate the noise, but the first few features will have a higher signal-to-noise ratio since the variance of the dataset is concentrated there. The last features may be then dominated by noise and dropped.
Since PCA projects the dataset onto a new orthonormal basis, the new features are all uncorrelated with each other. Many machine learning algorithms benefit from this property to achieve optimal performance.
Of course, PCA should not be used in every case, as it has its own hypotheses and limitations. Here are what I consider the main ones (non-exhaustive):
PCA is sensitive to the scaling of the variables. For example, if you have a temperature column in your dataset, you will get a different transformation depending on whether you use Celsius or Fahrenheit as the unit, because their scales are different. When the variables have different scales, PCA is a bit arbitrary. This can be corrected by scaling all variables to unit variance (see the small sketch below), but at the cost of modifying (compressing or expanding) the fluctuations of the variables in all dimensions.
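In scikit-learn that rescaling is typically done by putting a StandardScaler in front of the PCA; a minimal sketch (the number of components here is arbitrary):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Each feature is scaled to zero mean and unit variance before the PCA rotation,
# so no single variable dominates the components because of its units.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=10))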
PCA captures linear correlations between the features but fails to capture non-linear correlations.
What would be interesting in your case would be to compare the score obtained with and without the PCA transformation. You would see then if there is a benefit in using it.
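A simple way to run that comparison (a sketch only, reusing X_train/y_train from your code; the scaler, the number of components and the value of C are arbitrary choices here):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

with_pca = make_pipeline(StandardScaler(), PCA(n_components=20),
                         LogisticRegression(C=0.01, max_iter=1000))
without_pca = make_pipeline(StandardScaler(),
                            LogisticRegression(C=0.01, max_iter=1000))

# Same classifier, same folds, with and without the PCA step.
print(cross_val_score(with_pca, X_train, y_train, cv=5, scoring='precision').mean())
print(cross_val_score(without_pca, X_train, y_train, cv=5, scoring='precision').mean())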
Last but not least, your plot shows something interesting: the gain in score between 20 and 30 features is very low (about 1%?). You may wonder whether it is worth keeping ten additional features for such a small gain. Indeed, keeping more features increases the risk of ending up with a model that has a lower ability to generalize. Cross-validation already mitigates this risk, but there is no guarantee that when you apply the model to unseen data, that unseen data will have exactly the same properties as your training dataset.
I know feature scaling is required for the KMeans algorithm defined under
sklearn.cluster.KMeans
My question is whether it needs to be done manually before using KMeans, or whether KMeans performs feature scaling automatically. If it is automatic, please show me where it is specified in the KMeans algorithm, as I am unable to find it in the documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
By the way, some people say that KMeans itself takes care of feature scaling.
If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should standardize the variables, of course. Even if variables are of the same units but show quite different variances, it is still a good idea to standardize before K-means or any other multivariate analysis. You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation, leaving the variances unequal is equivalent to putting more weight on variables with smaller variance, so clusters will tend to be separated along variables with greater variance.
Another thing worth remembering is that K-means clustering results are potentially sensitive to the order of objects in the dataset [1]. A justified practice would be to run the analysis several times, randomizing the object order; then average the cluster centres of those runs and input those centres as initial ones for one final run of the analysis.
[1] Specifically, (1) some methods of centre initialization are sensitive to case order; (2) even when the initialization method isn't sensitive, results might sometimes depend on the order in which the initial centres are introduced to the program (in particular, when there are tied, equal distances within the data); (3) the so-called running-means version of the k-means algorithm is naturally sensitive to case order (in this version - which is not often used apart from maybe online clustering - recalculation of the centroids takes place after each individual case is re-assigned to another cluster).
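For the scikit-learn implementation specifically, part of this multiple-runs advice maps onto the n_init parameter, which repeats the initialisation several times and keeps the run with the lowest inertia; a small illustrative sketch (random data, not from the question):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))            # purely illustrative data

# 25 different random initialisations; the partition with the lowest
# within-cluster sum of squares (inertia) is the one that is kept.
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)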
As far as I know, K-means does not automatically perform feature scaling. Anyway, it's a simple process and requires just two additional lines of code. I would recommend using StandardScaler for the feature scaling. Here is a good example of how to do it:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
iris = datasets.load_iris()
X = iris.data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
clt = KMeans(n_clusters=3, random_state=0)  # note: the n_jobs argument was removed from KMeans in scikit-learn 1.0
model = clt.fit(X_std)
I am trying to predict a specific movement with my smartphone. For that I developed an application which creates a dataset containing acceleration, gyroscope, magnetic field, etc.
The problem is that I don't really know which features are good, which is why I tried to use PCA.
So far, no problems:
from sklearn.decomposition import PCA

pca = PCA(0.95)   # I don't want to lose too much information
# ... split recorded data into train and test samples ...
pc_train = pca.fit_transform(data_train)   # fit the PCA on the training data only
pc_test = pca.transform(data_test)         # reuse the same transformation for the test data
Then I fit the data to a random forest, ridge regression, etc.
But now I have the problem that all my trained classifiers only work on PCA-transformed data.
This means I would have to do PCA on my phone to make my intended prediction.
Is this the correct way to proceed, or have I missed something?
I thought of PCA as a one-time analytics tool.
First, I do not think it is always a good idea to fix a static variance ratio such as 0.95 from the beginning. Keeping as much information as possible (up to all the dimensions you had originally) does not always lead to the best result/model, given that the point of trying PCA here is to drop some of it. I would try a series of variance ratios, for example:
import numpy as np
from sklearn.decomposition import PCA

n_s = np.linspace(0.65, 0.85, num=21)      # candidate variance ratios
for n in n_s:
    pca = PCA(n_components=n)              # a float between 0 and 1 is treated as a variance ratio
    pc_train = pca.fit_transform(data_train)
    pc_test = pca.transform(data_test)
    # ... train and score your classifier on pc_train / pc_test ...
and look at the results. Then you can fix the variance ratio / number of components to whichever value gives the highest accuracy for your model; this is an important point in ML. To your question: most likely you are not going to fit the PCA or train the model on your phone; you only apply the already fitted transformation and the resulting model at prediction time. You want your training dataset to be as large as your computing hardware allows (that usually leads to better accuracy), and that "superior" hardware is not going to be your mobile phone.
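A rough sketch of that split between offline training and on-device prediction (the variable names, the joblib file and the random-forest choice are illustrative assumptions, not something from the question; on a phone you would typically convert the saved model to whatever format your app framework expects):
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Offline: fit the PCA and the classifier together and persist them as one object,
# so the PCA rotation learned on the training data travels with the model.
model = make_pipeline(PCA(n_components=0.95), RandomForestClassifier(n_estimators=200))
model.fit(data_train, labels_train)            # your recorded sensor features and labels
joblib.dump(model, "motion_model.joblib")

# At prediction time: load the pipeline and call predict on raw sensor features;
# the stored PCA transformation is applied automatically before the forest.
loaded = joblib.load("motion_model.joblib")
prediction = loaded.predict(new_sensor_window)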
I have a large dataset of n-hundred-dimensional lists of triplets consisting of numbers, mostly integers.
[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....
I'm looking at sklearn to write a simple generative model that learns the patterns in these data and generates a likely list of triplets, but my background is rather weak, which makes the documentation rather difficult to follow.
Is there example sklearn code that does a similar job that I could take a look at?
I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.
First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.
Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.
At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf for each variable independently (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing values from each of the three distributions separately and combining these values in a single tuple.
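A minimal sketch of that independent-variables case, using scipy's gaussian_kde for the density estimates (seaborn only plots KDEs, it does not sample from them; the tiny data array is just a stand-in):
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in data: one row per triplet, one column per variable.
data = np.array([(50, 100, 0.5), (20, 35, 1.0), (70, 80, 0.3), (30, 45, 2.0)], dtype=float)

# One univariate KDE per column, assuming the three variables are independent.
kdes = [gaussian_kde(data[:, i]) for i in range(data.shape[1])]

# Generate 5 new triplets by sampling each column separately and stacking the draws.
samples = np.column_stack([kde.resample(5)[0] for kde in kdes])
print(samples)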
If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.
Here is a small code example of fitting and predicting with sklearn (note that this is a discriminative model, not a generative one):
import numpy as np
from sklearn.linear_model import LinearRegression
#generate random numpy array of the size 10,3
X_train = np.random.random((10,3))
y_train = np.random.random((10,3))
X_test = np.random.random((10,3))
#define the regression
clf = LinearRegression()
#fit & predict (predict returns numpy array of the same dimensions)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Otherwise here are more examples:
http://scikit-learn.org/stable/auto_examples/index.html
For a generative model, look at sklearn.mixture.GaussianMixture (available since version 0.18).
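A minimal sketch of that generative route (random stand-in data; the number of mixture components is an arbitrary choice you would normally tune, e.g. with the BIC):
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: 100 triplets, one row per (a, b, c) triplet.
X = np.random.random((100, 3))

# Fit a mixture of Gaussians, then draw new, likely triplets from it.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
new_triplets, component_labels = gmm.sample(20)
print(new_triplets[:5])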
I'm trying to learn how to use some of the helper features in sklearn but am struggling with understanding how to use FeatureUnion.
One part of the documentation states this:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller's responsibility.)
However, an example on the Iris dataset shows this:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
How is it ensured that the pca and SelectKBest functions don't select the same feature, or in other words how can the user ensure that the two selections are disjoint?
http://scikit-learn.org/dev/modules/pipeline.html#feature-union
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
I think you pretty much answered your own question with that quote from the docs:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
The FeatureUnion does not ensure features are different.
In the example of the Iris dataset it is possible (though highly unlikely) that the PCA and the feature-selection process will generate identical features. In that case, you simply end up with the same feature twice in the output of the FeatureUnion.
This is usually not a huge deal, though if you can avoid it, it's probably cleaner to do so (for instance, a random forest model would be biased towards a feature that is present several times, as it would have a higher probability of being picked as a candidate for splitting a node).
To be a bit clearer, I don't think there's a lot you can do about it beyond avoiding combining, in a FeatureUnion, feature-extraction processes that obviously create duplicate features.
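If you want to check empirically whether the stacked outputs overlap, one rough way (a sketch only, not part of the FeatureUnion API) is to look at the correlations between the combined output columns:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)
combined = FeatureUnion([("pca", PCA(n_components=2)), ("univ_select", SelectKBest(k=1))])
X_features = combined.fit_transform(X, y)

# Correlation matrix of the stacked columns; an off-diagonal entry close to +/-1
# means two of the combined features carry essentially the same information.
print(np.corrcoef(X_features, rowvar=False).round(2))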