Best way to combine probabilistic classifiers in scikit-learn

Best way to combine probabilistic classifiers in scikit-learn - python

I have a logistic regression and a random forest and I'd like to combine them (ensemble) for the final classification probability calculation by taking an average.
Is there a built-in way to do this in sci-kit learn? Some way where I can use the ensemble of the two as a classifier itself? Or would I need to roll my own classifier?

NOTE: The scikit-learn Voting Classifier is probably the best way to do this now
OLD ANSWER:
For what it's worth I ended up doing this as follows:
class EnsembleClassifier(BaseEstimator, ClassifierMixin):
def __init__(self, classifiers=None):
self.classifiers = classifiers
def fit(self, X, y):
for classifier in self.classifiers:
classifier.fit(X, y)
def predict_proba(self, X):
self.predictions_ = list()
for classifier in self.classifiers:
self.predictions_.append(classifier.predict_proba(X))
return np.mean(self.predictions_, axis=0)

Given the same problem, I used a majority voting method.
Combing probabilities/scores arbitrarily is very problematic, in that the performance of your different classifiers can be different, (For example, an SVM with 2 different kernels , + a Random forest + another classifier trained on a different training set).
One possible method to "weigh" the different classifiers, might be to use their Jaccard score as a "weight".
(But be warned, as I understand it, the different scores are not "all made equal", I know that a Gradient Boosting classifier I have in my ensemble gives all its scores as 0.97, 0.98, 1.00 or 0.41/0 . I.E. it's very overconfident..)

What about the sklearn.ensemble.VotingClassifier?
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
Per the description:
The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

Now scikit-learn has StackingClassifier which can be used to stack multiple estimators.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('lg', LogisticRegression()))
]
clf = StackingClassifier(
estimators=estimators, final_estimator=LogisticRegression()
)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train)
clf.predict_proba(X_test)

Related

Can I use GridSearchCV with KNeighboursRegressor?

I have a data set with some float column features (X_train) and a continuous target (y_train).
I want to run KNN regression on the data set, and I want to (1) do a grid search for hyperparameter tuning and (2) run cross validation on the training.
I wrote this code:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=3,
random_state=999)
# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14,15], 'weights': ['uniform','distance'],'p':[1,2,5]}]
# Search for best hyperparameters
grid = GridSearchCV(estimator=KNeighborsRegressor(),
param_grid=hp_candidates,
cv=cv_method,
verbose=1,
scoring='accuracy',
return_train_score=True)
grid.fit(X_train,y_train)
The error I get is:
Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I understand the error, that I can only do this method for KNN in classification, not regression.
But what I can't find is how to edit this code to make it suitable for KNN regression? Can someone explain to me how this could be done?
(So the ultimate aim is I have a data set, I want to tune the parameters, do cross validation, and output the best model based on above and get back some accuracy scores, ideally scores that have comparable scores in other algorithms and are not specific to KNN, so I can compare accuracy).
Also just to mention, this is my first attempt at KNN in scikitlearn, so all comments/critic is welcome.

Yes you can use GridSearchCV with the KNeighboursRegressor.
As you have a metric choice problem,
you can read the metrics documentation here : https://scikit-learn.org/stable/modules/model_evaluation.html
The metrics appropriate for a regression problem are different than from classification problems, and you have the list here for appropritae regression metrics:
‘explained_variance’
‘max_error’
‘neg_mean_absolute_error’
‘neg_mean_squared_error’
‘neg_root_mean_squared_error’
‘neg_mean_squared_log_error’
‘neg_median_absolute_error’
‘r2’
‘neg_mean_poisson_deviance’
‘neg_mean_gamma_deviance’
‘neg_mean_absolute_percentage_error’
So you can chose one to replace "accuracy" and test it.

How can i plot the ADABoost Classifier decision boundary of each base classifiers in python?

I searched in everywhere but couldn't find a way to do that. I used make_moons() data in my code and run a logistic regression model. After that i created ADABoost Classifier with 4 base classifiers and used logistic regression model for base estimator. My next task is to plot the decision boundary of each base classifiers The output should include 4 decision boundary. How can i plot the decision boundary of each base classifiers?
My code so far :
from sklearn import datasets
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
X, y = make_moons(n_samples=100, noise=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
clf= AdaBoostClassifier(logisticRegr,n_estimators=4)
clf.fit(X, y)

The fitted AdaBoostClassifier (clf here) exposes its fitted base estimators in the attribute estimators_, and also defines itself as an iterable (source), so you can just loop over clf to get the base estimators. Then you can just use their decision methods to plot.
I happen to have previously make a notebook that did almost this for a CV.SE question; you can probably get a good start from that.

Recursive feature elimination combined with nested (leave one group out) cross-validation in scikit

I want to do a binary classification for 30 groups of subjects having 230 samples by 150 features. I founded it very hard to implement especially when doing feature selection, parameters tunning through nested leave one group out cross-validation and report the accuracy using two classifiers the SVM and random forest and to see which features been selected.
I'm new to this and I'm sure the following code is not correct:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
X= the data (230 samples * 150 features)
y= [1,0,1,0,0,0,1,1,1..]
groups = [1,2...30]
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0]}]
inner_cross_validation = LeaveOneGroupOut().split(X, y, groups)
outer_cross_validation = LeaveOneGroupOut().split(X, y, groups)
estimator = SVC(kernel="linear")
selector = RFE(estimator, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y,cv=outer_cross_validation)
I don't know where to set "the random forest classifier" in the above because I want to compare the accuracies between SVM and random forest.
Thank you very much for reading and hope that someone can help me.
Best regards

You should call the tree in the same way that you call the SVM
#your libraries
from sklearn.tree import DecisionTreeClassifier
#....
estimator = SVC(kernel="linear")
estimator2 = DecisionTreeClassifier( ...parameters here...)
selector = RFE(estimator, step=1)
selector2 = RFE(estimator2, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search = GridSearchCV(selector2, ..greed for the tree here.., cv=inner_cross_validation)
Please note that this procedure will lead to two different set of selected features: one for the SVM and one for the Decision Tree

Feature Importance using Imbalanced-learn library

The imblearn library is a library used for unbalanced classifications. It allows you to use scikit-learn estimators while balancing the classes using a variety of methods, from undersampling to oversampling to ensembles.
My question is however, how can I get feature improtance of the estimator after using BalancedBaggingClassifier or any other sampling method from imblearn?
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_classes=2, class_sep=2,weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)
bbc = BalancedBaggingClassifier(random_state=42,base_estimator=DecisionTreeClassifier(criterion=criteria_,max_features='sqrt',random_state=1),n_estimators=2000)
bbc.fit(X_train,y_train)

Not all estimators in sklearn allow you to get feature importances (for example, BaggingClassifier doesn't). If the estimator does, it looks like it should just be stored as estimator.feature_importances_, since the imblearn package subclasses from sklearn classes. I don't know what estimators imblearn has implemented, so I don't know if there are any that provide feature_importances_, but in general you should look at the sklearn documentation for the corresponding object to see if it does.
You can, in this case, look at the feature importances for each of the estimators within the BalancedBaggingClassifier, like this:
for estimator in bbc.estimators_:
print(estimator.steps[1][1].feature_importances_)
And you can print the mean importance across the estimators like this:
print(np.mean([est.steps[1][1].feature_importances_ for est in bbc.estimators_], axis=0))

There is a shortcut around this, however it is not very efficient. The BalancedBaggingClassifieruses the RandomUnderSampler successively and fits the estimator on top. A for-loop with RandomUnderSampler can be one way of going around the pipeline method, and then call the Scikit-learn estimator directly. This will also allow to look at feature_importance:
from imblearn.under_sampling import RandomUnderSampler
rus=RandomUnderSampler(random_state=1)
my_list=[]
for i in range(0,10): #random under sampling 10 times
X_pl,y_pl=rus.sample(X_train,y_train,)
my_list.append((X_pl,y_pl)) #forming tuples from samples
X_pl=[]
Y_pl=[]
for num in range(0,len(my_list)): #Creating the dataframes for input/output
X_pl.append(pd.DataFrame(my_list[num][0]))
Y_pl.append(pd.DataFrame(my_list[num][1]))
X_pl_=pd.concat(X_pl) #Concatenating the DataFrames
Y_pl_=pd.concat(Y_pl)
RF=RandomForestClassifier(n_estimators=2000,criterion='gini',max_features=25,random_state=1)
RF.fit(X_pl_,Y_pl_)
RF.feature_importances_

According to scikit learn documentation, you can use impurity-based feature importance on classifications, that don't have their own using some sort of ForestClassifier.
Here my classifier doesn't have feature_importances_, I'm adding it directly.
classifier.fit(x_train, y_train)
...
...
forest = ExtraTreesClassifier(n_estimators=classifier.n_estimators,
random_state=classifier.random_state)
forest.fit(x_train, y_train)
classifier.feature_importances_ = forest.feature_importances_

Cross Validation with ROC?

I use the code to run cross validation, returning ROC scores.
rf = RandomForestClassifier(n_estimators=1000,oob_score=True,class_weight = 'balanced')
scores = cross_val_score ( rf, X,np.ravel(y), cv=10, scoring='roc_auc')
How can I return the ROC based on
roc_auc_score(y_test,results.predict(X_test))
rather than
roc_auc_score(y_test,results.predict_proba(X_test))

ROC AUC is only useful if you can rank order your predictions. Using .predict() will just give the most probable class for each sample, and so you won't be able to do that rank ordering.
In the example below, I fit a random forest on a randomly generated dataset and tested it on a held out sample. The blue line shows the proper ROC curve done using .predict_proba() while the green shows the degenerate one with .predict() where it only really knows of the one cutoff point.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
rf = RandomForestClassifier()
data, target = make_classification(n_samples=4000, n_features=2, n_redundant=0, flip_y=0.4)
train, test, train_t, test_t = train_test_split(data, target, train_size=0.9)
rf.fit(train, train_t)
plt.plot(*roc_curve(test_t, rf.predict_proba(test)[:,1])[:2])
plt.plot(*roc_curve(test_t, rf.predict(test))[:2])
plt.show()
EDIT: While there's nothing stopping you from calculating an roc_auc_score() on .predict(), the point of the above is that it's not really a useful measurement.
In [5]: roc_auc_score(test_t, rf.predict_proba(test)[:,1]), roc_auc_score(test_t, rf.predict(test))
Out[5]: (0.75502749115010925, 0.70238005573548234)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Best way to combine probabilistic classifiers in scikit-learn - python

Related

Can I use GridSearchCV with KNeighboursRegressor?

How can i plot the ADABoost Classifier decision boundary of each base classifiers in python?

Recursive feature elimination combined with nested (leave one group out) cross-validation in scikit

Feature Importance using Imbalanced-learn library

Cross Validation with ROC?

Categories

Resources