How to ensure disjoint set of features when using Feature Union

How to ensure disjoint set of features when using Feature Union - python

I'm trying to learn how to use some of the helper features in sklearn but am struggling with understanding how to use FeatureUnion
One part of the documentation states this
(A FeatureUnion has no way of checking whether two transformers might
produce identical features. It only produces a union when the feature
sets are disjoint, and making sure they are is the caller’s
responsibility.)
However an example on the Iris dataset shows this
X, y = iris.data, iris.target
# This dataset is way to high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features where good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
How is it ensured that the pca and SelectKBest functions don't select the same feature, or in other words how can the user ensure that the two selections are disjoint?
http://scikit-learn.org/dev/modules/pipeline.html#feature-union
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

I think you pretty much answered your own question with that quote from the docs:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
The FeatureUnion does not ensure features are different.
In the example of the Iris dataset it is possible (though highly unlikely) that the PCA and the feature selection process will generate identical features. In that case, you just have twice the same feature in the output of the FeatureUnion.
This is usually not a huge deal, though if you can avoid it it's probably cleaner to do so (for instance a random forest model would be biased towards a feature that is present several times, as it would have a higher probability to be picked as a candidate to split a node).
To be a bit clearer, I don't think there's a lot you can do about it beyond avoiding to combine feature extraction processes that obviously create duplicate features in a FeatureUnion.

Related

Principal Component Analysis (PCA) vs. Extra Tree Classifier for Data Reduction

I have a dataset that consists of 13 columns and I wanted to use PCA for data reduction to remove unwanted columns. My problem is PCA doesn't really show columns names but PC1 PC2 etc. I found out extra tree classifier does the same thing but does indicate the variation of each column. I just wanted to make sure if they both have the same objective or are they different in their outcome. Also would anyone suggest a better methods for Data Reduction?
My last question is that I have a code for Extra tree classifier and wanted to confirm if it was correct or not?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('D:\\Project\\database\\5-FINAL2\\Final After Simple Filtering.csv')
extra_tree_forest = ExtraTreesClassifier(n_estimators = 500,
criterion ='entropy', max_features = 'auto')
extra_tree_forest.fit(df)
feature_importance = extra_tree_forest.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in
extra_tree_forest.estimators_],
axis = 0)
plt.bar(df.columns, feature_importance_normalized)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
Thank You.

The two methods are very different.
PCA doesn't show you the feature names because dimensionality reduction with PCA doesn't really have anything to do with the relative importance of the features. PCA takes the original data and transforms it into a space where each new 'feature' (principal component) is independent of the others, and you can tell how important each principal component is to faithfully representing the data based on its corresponding eigenvalue. Removing the least important principal components reduces dimensionality in principal component space, but not in the original feature space - so you need to do PCA on all future data, too, and then perform all your classification on the (shortened) principal component vectors.
An extra tree classifier trains an entire classifier on your data, so it's much more powerful than just dimensionality reduction. However, it does seem closer to what you're looking for, since the feature importance does directly tell you how relevant each feature is when making a classification.
Note that in PCA, the principal components with the highest eigenvalues contribute the most to accurately reconstructing the data. This is not the same as contributing the most to accurately classifying the data. The extra tree classifier is the reverse: it tells you what features are most important when classifying the data, not when reconstructing it.
Basically, if you think you have a representative dataset right now and are comfortable only storing variables that are relevant to classifying the data you already have, dimensionality reduction with extra trees is a good choice for you. If you just want to faithfully represent the data with less space without being overly concerned about the effects on classification, PCA is the better choice. Dimensionality reduction with PCA will often also help remove irrelevant features from the original data, but that's not what it's optimized for.

Scikit learn-Classification

Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using KNeighbors classifer, SVC-Linear, MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly? I can view the confusion matrix but I would like to see specific documents to see what features are causing the misclassification.
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
counts = tfidf_vectorizer.fit_transform(test['text'].values).toarray()
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to traing the classifier.

Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifer) come with the .feature_importances_ attribute which will score each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes for the size of coefficients, meaning that the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of individual features) which can be accessed using the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
II. There is no out of the box sklearn method to provide the mis-classified records but if you are using a pandas DataFrame (which you should) to feed the model it can be accomplished in a few lines of code like this.
import pandas as pd
from sklearn.linear_model import RandomForestClassifier
df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict']!=df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!

scikit-learn to learn and generate list of numbers

I have a large data of n-hundred-dimensional list of triplets consisting of numbers, mostly integers.
[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....
I'm looking at sklearn to write a simple generative model that learns the patterns from these data, and generate a likely list of triplets, but my background is rather weak, without which the documentation is rather difficult to follow.
Is there an example sklearn code that does the similar job where I can take a look at?

I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.
First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.
Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.
At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf for each variable independently (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing values from each of the three distributions separately and combining these values in a single tuple.
If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.

Here is a small code example that does exactly that (discriminative model):
import numpy as np
from sklearn.linear_model import LinearRegression
#generate random numpy array of the size 10,3
X_train = np.random.random((10,3))
y_train = np.random.random((10,3))
X_test = np.random.random((10,3))
#define the regression
clf = LinearRegression()
#fit & predict (predict returns numpy array of the same dimensions)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Otherwise here are more examples:
http://scikit-learn.org/stable/auto_examples/index.html
The generative model would be sklearn.mixture.GaussianMixture (works only in version 0.18)

Ensemble feature selection from feature sets

I have a question about ensemble feature selection.
My data set is consist of 1000 samples with about 30000 features, and they are classified into label A or label B.
What I want to do is picking of some features which can classify the label efficiently.
I used three type of methods, univariate method(Pearson's coefficient), lasso regression and SVM-RFE(recursive feature elimination), so I got three feature sets from them. I used python scikit-learn for feature selection.
Then I am thinking of ensemble feature selection approach, because the size of features were so large. In this case, what is the way to make integrated subset with 3 feature sets?
What can I think is taking union of the sets and using lasso regression or SVM-RFE again, or just take the intersection of the sets.
Can anyone give an idea?

I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently" one thing you can do is to use your classification algorithm (i.e. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of features from the previous three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, thus the CV proposal.

Sklearn: How to make an ensemble for two binary classifiers?

I have two classifiers for a multimedia dataset. One for visual material and one for textual material. I want to combine the predictions of these classifiers to make a final prediction. I have been reading about bagging, boosting and stacking ensembles and all seem useful and I would like to try them. However, I can only seem to find rather theoretical examples for my specific problem, nothing concrete enough for me to understand how to actually implement it (in python with scikit-learn). My two classifiers both use 10 KFold CV with SVM classification. Both outputting a list of n_samples = 1000 with predictions (either 1's or 0's). Also, I made them both produce a list of probabilities on which the predictions are based, looking like this:
[[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]
....
[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]]
How would I go about combining these in an ensemble. What should I use as input? Ive tried concatenating the label predictions horizontally and input them as features, but with no luck (same for the probabilities).

If you're looking for combining strictly, I recomend using brew because it is built on top of sklearn (meaning that you can use your sklearn classifiers), and, last time I checked, sklearn was good for creating ensembles (Bagging, AdaBoost, RandomForest ...), but not many combining rules were provided for your own custom ensemble (such as hybrid ensembles).
https://github.com/viisar/brew
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = your_list_of_classifiers # [clf1, clf2]
ens = Ensemble(classifiers = clfs)
# create your Combiner
# the rules can be 'majority_vote', 'max', 'min', 'mean' or 'median'
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)

It depends entirely on the ensemble method you want to implement. Have you taken a look at the sklearn-ensemble documentation?
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

There is a classifier called 'VotingClassifier' in sklearn.ensemble which can be used to club multiple classifiers and the predicted labels will be based on voting from the enlisted classifiers. Here is the example:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.