I have two classifiers for a multimedia dataset: one for visual material and one for textual material. I want to combine the predictions of these classifiers to make a final prediction. I have been reading about bagging, boosting, and stacking ensembles; all seem useful, and I would like to try them. However, I can only find rather theoretical examples for my specific problem, nothing concrete enough for me to understand how to actually implement it (in Python with scikit-learn). My two classifiers both use 10-fold cross-validation with SVM classification, and both output a list of n_samples = 1000 predictions (either 1's or 0's). I also made them both produce the list of probabilities on which the predictions are based, which looks like this:
[[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]
....
[ 0.96761819 0.03238181]
[ 0.96761819 0.03238181]]
How would I go about combining these in an ensemble? What should I use as input? I've tried concatenating the label predictions horizontally and feeding them in as features, but with no luck (same for the probabilities).
If you're looking strictly for combining predictions, I recommend using brew because it is built on top of sklearn (meaning that you can use your sklearn classifiers). Last time I checked, sklearn was good at creating ensembles (Bagging, AdaBoost, RandomForest, ...), but provided few combination rules for your own custom ensemble (such as hybrid ensembles).
https://github.com/viisar/brew
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = your_list_of_classifiers  # e.g. [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# create your Combiner
# the rules can be 'majority_vote', 'max', 'min', 'mean' or 'median'
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
It depends entirely on the ensemble method you want to implement. Have you taken a look at the sklearn-ensemble documentation?
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
There is a classifier called VotingClassifier in sklearn.ensemble which can be used to combine multiple classifiers, with the predicted labels based on voting across the enlisted classifiers.
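Here is a minimal sketch (the make_classification toy data stands in for your real features; note that VotingClassifier refits all estimators on one shared feature matrix, so for truly separate visual and textual feature sets you would concatenate the features first, or average the two probability lists directly):
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
# Toy binary data standing in for the 1000-sample multimedia features
X, y = make_classification(n_samples=1000, random_state=0)
# Two probability-producing SVMs standing in for the visual and textual classifiers
clf1 = SVC(kernel='linear', probability=True)
clf2 = SVC(kernel='rbf', probability=True)
# voting='soft' averages the predicted probabilities; voting='hard' takes a majority vote
eclf = VotingClassifier(estimators=[('visual', clf1), ('textual', clf2)],
                        voting='soft')
eclf.fit(X, y)
predictions = eclf.predict(X)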
I have a multiclass classifier, trained using the LinearSVC model provided by the sklearn library.
This model provides a decision_function method, which I use with numpy functions to interpret the result set correctly.
But I don't understand why this method always tries to distribute the total probability (which in my case is 1) among all of the possible classes.
I expected a different behavior of my classifier.
I mean, for example, suppose that I have a short piece of text like this:
"There are a lot of types of virus and bacterias that cause disease."
But my classifier was trained with three types of texts, let's say "maths", "history" and "technology".
So, I think it makes a lot of sense for each of the three subjects to have a probability very close to zero (and therefore far from summing to 1) when I try to classify that text.
Is there a more appropriate method or model to obtain the results that I just described?
Am I using decision_function the wrong way?
Sometimes you may have text that has nothing to do with any of the subjects used to train a classifier; or, vice versa, it could have a probability of about 1 for more than one subject.
I think I need some light shed on these issues (text classification, non-binary classification, etc.).
Many thanks in advance for any help!
There are multiple parts to your question; I will try to answer as much as I can.
I don't understand why this method always tries to distribute the total probability.
That is the nature of most ML models out there: a given example has to be put into some class, every model has some mechanism to compute the probability that a given data point belongs to each class, and whichever class ends up with the highest probability is the one you predict.
To address your problem, i.e. examples that don't belong to any of the classes, you could always create a pseudo-class called others when you train the model. That way, even if a data point doesn't correspond to any of your actual classes (e.g. maths, history and technology, as per your example), it will be binned into the others class.
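A quick sketch of that idea (the toy texts and labels below are purely illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
# Toy training data with an explicit "others" bucket for off-topic text
texts = ["integrals and derivatives", "the french revolution",
         "new smartphone processors", "my cat sleeps all day"]
labels = ["maths", "history", "technology", "others"]
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
# off-topic input now has a class to land in instead of being
# forced into maths/history/technology
print(model.predict(["there are many types of virus and bacteria"]))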
Addressing the problem that your data point could possibly belong to multiple classes: this is what multi-label classification is typically used for.
What you are looking for is a multi-label classification model. Refer to the scikit-learn documentation on multiclass and multilabel classification to understand the setting and to see the list of models that support the multi-label classification task.
Simple example to demonstrate multi-label classification:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.preprocessing import OneHotEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

categories = ['sci.electronics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

X = newsgroups_train.data
# One-hot encode the targets so each class becomes an independent binary label
# (use sparse_output=False instead of sparse=False on scikit-learn >= 1.2)
y = OneHotEncoder(sparse=False).fit_transform(
    [[newsgroups_train.target_names[i]] for i in newsgroups_train.target])

model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      MultiOutputClassifier(LinearSVC()))
model.fit(X, y)

print(newsgroups_train.target_names)
# ['sci.electronics', 'sci.space', 'talk.religion.misc']
print(model.predict(['religion followers of jesus']))
# [[0. 0. 1.]]
print(model.predict(['Upper Atmosphere Satellite Research ']))
# [[0. 1. 0.]]
print(model.predict(['There are a lot of types of virus and bacterias that cause disease.']))
# [[0. 0. 0.]]
A common way of dealing with this is to try and cast your text sample into some kind of vector space and measure the "distance" between that and some archetypical positions within that same vector space that represent classifications.
This model of a classifier is convenient because if you collapse your text sample into a vector of vocabulary frequencies, it almost trivially can be expressed as a vector - where the dimensions are defined by the number of vocabulary features you choose to track.
By cluster-analysis of a wider text corpus, you can try and determine central points that commonly occur within clusters, and you can describe these in terms of the vector-positions at which they are located.
And finally, with a handful of cluster centres defined, you can simply pythagoras your way into finding which of these topic clusters your chosen sample lies closest to. You also have at your fingertips the relative distances between your sample and all the other cluster centres, so it's less a probabilistic and more a spatial measure.
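A rough sketch of that spatial view, using a tiny illustrative corpus (in practice you would cluster a much larger one):
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
corpus = ["algebra equations and integrals",
          "the roman empire and medieval kings",
          "processors networks and software"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Find cluster centres in the same vector space as the documents
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance from a new sample to every cluster centre: a spatial measure,
# not a probability, so the values need not sum to anything in particular
sample = vectorizer.transform(["virus and bacteria cause disease"])
print(euclidean_distances(sample, kmeans.cluster_centers_))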
I have 6 different classes which I am doing multiclass classification on using both XGBoost and RandomForest.
What I want is to analyze which features are most important for a sample belonging to each class.
I know that there are two different ways of getting feature importances with xgboost. The first is the built-in feature_importances_ attribute, which returns an array with an importance score for every feature.
Another way is to from xgboost import plot_importance, which you can use to produce a plot via
plot_importance(xgb, max_num_features=20)
But both of these methods only return the most important features overall, without respect to classes.
Is there a way to get the most important features for a sample belonging to each class?
Or would the solution be to create 6 different binary classifiers and try to analyze those feature importances instead?
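(As a hedged sketch of that one-vs-rest idea, with make_classification toy data standing in for the real features: fit one binary model per class and read its feature_importances_.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Toy 6-class data standing in for the real problem
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=6, random_state=0)
for cls in np.unique(y):
    # One-vs-rest: relabel the problem as "this class vs everything else"
    clf = RandomForestClassifier(random_state=0).fit(X, y == cls)
    top = np.argsort(clf.feature_importances_)[::-1][:3]
    print("class", cls, "-> top feature indices:", top)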
Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using KNeighbors classifer, SVC-Linear, MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly. I can view the confusion matrix, but I would like to see the specific documents to understand what features are causing the misclassification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tfidf_vectorizer = TfidfVectorizer()
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
# the test set must be transformed with the already-fitted vectorizer (not
# re-fitted) so that it shares the training feature space
test_counts = tfidf_vectorizer.transform(test['text'].values).toarray()
predictions = classifier.predict(test_counts)
EDIT: I have added the code snippet above, where I create a tfidf vectorizer and use it to train the classifier.
Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining the top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with the .feature_importances_ attribute, which scores each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes the size of coefficients, meaning that the coefficient sizes are to some extent a reflection of feature importance (although you need to keep in mind the numeric scales of individual features); these can be accessed using the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
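To make part I concrete for a tf-idf setup like the one in the question, here is a minimal sketch (the toy texts and labels are purely illustrative; with a one-vs-rest linear model such as LinearSVC, each row of .coef_ lines up with one entry of .classes_):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
# Toy corpus standing in for data['text'] / data['class']
texts = ["cheap pills online now", "meeting at noon tomorrow",
         "win money fast", "lunch with the team", "limited offer buy now"]
labels = ["spam", "ham", "spam", "ham", "promo"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)
# use get_feature_names() instead on older scikit-learn versions
feature_names = np.array(vectorizer.get_feature_names_out())
for class_label, coefs in zip(clf.classes_, clf.coef_):
    # the largest coefficients push predictions toward this class
    top = np.argsort(coefs)[::-1][:3]
    print(class_label, feature_names[top])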
II. There is no out-of-the-box sklearn method to list the misclassified records, but if you are feeding the model from a pandas DataFrame (which you should be), it can be accomplished in a few lines of code like this:
import pandas as pd
# RandomForestClassifier lives in sklearn.ensemble, not sklearn.linear_model
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict'] != df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!
I want to perform bagging using python scikit-learn.
I want to combine it with RFE(), the recursive feature elimination algorithm.
The steps are as below.
Make 30 subsets, sampling with replacement (bagging)
Perform RFE on each subset
Get the output of each classification
Find the top 5 features from each output
I tried the BaggingClassifier approach below, but it takes a lot of time and does not seem to work. Using RFE alone works without problems (rfe.fit()).
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.ensemble import BaggingClassifier

cf1 = LinearSVC()
rfe = RFE(estimator=cf1)
bagging = BaggingClassifier(rfe, n_estimators=30)
bagging.fit(trainx, trainy)
Also, step 4 may be difficult, because BaggingClassifier does not offer an attribute like the ranking_ of RFE().
Is there some other good ways to achieve those 4 steps?
Without bagging, one would access the ranking given by RFE with the following line:
rfe.ranking_
This order can be used to sort the feature names and take the first five of them. See the documentation for sklearn RFE for an example of this attribute.
With bagging, you want access to each of your 30 estimators. Based on the documentation for sklearn BaggingClassifier, you can access them with:
bagging.estimators_
So: for each estimator in bagging.estimators_, get the ranking, sort the features based on this ranking, and take the first five elements!
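Something like this sketch (assuming the fitted bagging object from the question; feature_names is a hypothetical list naming the columns of trainx):
import numpy as np

for rfe_estimator in bagging.estimators_:
    # ranking_[i] == 1 means feature i survived elimination; lower is better
    order = np.argsort(rfe_estimator.ranking_)
    top_five = np.array(feature_names)[order][:5]
    print(top_five)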
Hope this helps.
I'm trying to learn how to use some of the helper features in sklearn, but I am struggling to understand how to use FeatureUnion.
One part of the documentation states this:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller's responsibility.)
However, an example on the Iris dataset shows this:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
How is it ensured that pca and SelectKBest don't select the same feature? In other words, how can the user ensure that the two selections are disjoint?
http://scikit-learn.org/dev/modules/pipeline.html#feature-union
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
I think you pretty much answered your own question with that quote from the docs:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
The FeatureUnion does not ensure features are different.
In the example on the Iris dataset, it is possible (though highly unlikely) that PCA and the feature selection process will generate identical features. In that case, you simply get the same feature twice in the output of the FeatureUnion.
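A quick way to see that the union just concatenates its blocks, reusing the example above:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("univ_select", SelectKBest(k=1))])
X_features = combined.fit(X, y).transform(X)
# 2 PCA components + 1 selected feature = 3 columns, overlapping or not:
# FeatureUnion never checks whether the blocks duplicate each other
print(X_features.shape)  # (150, 3)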
This is usually not a huge deal, though if you can avoid it, it's probably cleaner to do so (for instance, a random forest model would be biased towards a feature that is present several times, as it would have a higher probability of being picked as a candidate to split a node).
To be a bit clearer: I don't think there's much you can do about it beyond avoiding combining feature extraction processes that obviously create duplicate features in a FeatureUnion.