How to properly use SMOTE in classification models - Python

I am using SMOTE to balance the output (y) for model training only, but I want to test the model on the original data, since it doesn't make sense to test the model on SMOTE-generated outputs. Please ask for clarification if I didn't explain it well; this is my start on Stack Overflow.
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)
# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=42)
Here I applied the random forest classifier to my data:
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y,y_pred))
RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If I apply this, X also contains the data we used for training. How can we exclude the data that was already used to train the model?
y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))

I have used SMOTE in the past and found it suboptimal; researchers have recently pointed out flaws in the distribution generated by the Synthetic Minority Oversampling Technique (SMOTE). I know that sometimes we have no choice with imbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, where you can set an appropriate class_weight to handle the class imbalance.
Check the scikit-learn documentation for RandomForestClassifier.
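For example, a minimal sketch of the class_weight approach, assuming X and y are the original (imbalanced) features and labels from the question:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Split the original data; no resampling is needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# "balanced" weights classes inversely proportional to their frequencies;
# an explicit dict such as {0: 1, 1: 10} also works
RF = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
RF.fit(X_train, y_train)
print(classification_report(y_test, RF.predict(X_test)))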

I agree with razimbres about using class_weight.
Another option for you would be to split the dataset into train and test first. Then, keep the test set aside. Use only the training set from here on:
X_sm, y_sm = oversample.fit_resample(X_train, y_train)
...
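Putting that together, here is a minimal sketch of the full workflow, assuming X and y are the original features and labels from the question:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# 1. Split the original data first and keep the test set untouched
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# 2. Oversample only the training set
oversample = SMOTE(random_state=42)
X_train_sm, y_train_sm = oversample.fit_resample(X_train, y_train)
# 3. Fit on the resampled training data
RF = RandomForestClassifier(n_estimators=100, random_state=42)
RF.fit(X_train_sm, y_train_sm)
# 4. Evaluate on the original, untouched test set
print(classification_report(y_test, RF.predict(X_test)))
This way the test set contains only original samples, so the model is never evaluated on synthetic points or on data it was trained on.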

What is the expected_value field of TreeExplainer for a Random Forest?

I used SHAP to explain my random forest regressor:
import shap
from sklearn.ensemble import RandomForestRegressor
RF_best_parameters = RandomForestRegressor(random_state=24, n_estimators=100)
RF_best_parameters.fit(X_train, y_train.values.ravel())
shap_explainer_model = shap.TreeExplainer(RF_best_parameters)
The TreeExplainer class has an attribute expected_value.
My first guess was that this field is the mean of the predicted y over X_train (I also read that here).
But it is not.
The output of the command:
shap_explainer_model.expected_value
is 0.2381.
The output of the command:
RF_best_parameters.predict(X_train).mean()
is 0.2389.
As we can see, the values are not the same.
So what is the meaning of the expected_value here?
This is due to a peculiarity of the method when used with the Random Forest algorithm; quoting from the response in the relevant GitHub thread, "shap explainer expected_value is different from model expected value":
It is because of how sklearn records the training samples in the tree models it builds. Random forests use a random subsample of the data to train each tree, and it is that random subsample that is used in sklearn to record the leaf sample weights in the model. Since TreeExplainer uses the recorded leaf sample weights to represent the training dataset, it will depend on the random sampling used during training. This will cause small variations like the ones you are seeing.
We can actually verify that this behavior is not present with other algorithms, say Gradient Boosting Trees:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
But for RF, we get indeed a "small variation" as mentioned by the main SHAP developer in the thread above:
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
rf_explainer = shap.TreeExplainer(rf)
rf_explainer.expected_value
# array([-11.59166808])
mean_pred_rf = np.mean(rf.predict(X_train))
mean_pred_rf
# -11.280125877556388
np.isclose(mean_pred_rf, rf_explainer.expected_value)
# array([False])
Just try:
shap_explainer_model = shap.TreeExplainer(RF_best_parameters, data=X_train, feature_perturbation="interventional", model_output="raw")
Then the shap_explainer_model.expected_value should give you the mean prediction of your model on train data.
Otherwise, TreeExplainer uses feature_perturbation="tree_path_dependent"; according to the documentation:
The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.
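Continuing the rf example above, a quick sketch of that check (hedged - exact numbers depend on your data and library versions): with the interventional mode and X_train passed as background data, expected_value should line up with the mean prediction over X_train.
rf_explainer_int = shap.TreeExplainer(rf, data=X_train, feature_perturbation="interventional", model_output="raw")
np.isclose(np.mean(rf.predict(X_train)), rf_explainer_int.expected_value)
# expected to be True (up to numerical tolerance)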

StandardScaler to whole training dataset or to individual folds for Cross Validation

I'm currently using cross_val_score and KFold to assess the impact of applying StandardScaler at different points in data pre-processing: specifically, whether scaling the entire training dataset prior to cross validation introduces data leakage, and what the effect of this is compared to scaling the data from within a Pipeline (and therefore only fitting the scaler on the training folds).
My current process is as follows:
Experiment A
Import the Boston housing dataset from sklearn.datasets and split it into data (X) and target (y)
Create a Pipeline (sklearn.pipeline) that applies StandardScaler before linear regression
Specify the cross validation method as KFold with 5 folds
Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score
Experiment B
Use the same Boston housing data as above
Apply StandardScaler's fit_transform to the entire dataset
Use cross_val_score to perform cross validation, again with 5 folds, but this time pass LinearRegression directly rather than a pipeline
Compare the scores here to Experiment A
The scores obtained are identical (to around 13 decimal places), which I question, as surely Experiment B introduces data leakage during cross validation.
I've seen posts stating that it doesn't matter whether scaling is done on the entire training set before cross validation. If this is true I'd like to understand why, and if it isn't, I'd like to understand why the scores can still be so similar despite the data leakage.
See my test code below:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression
np.set_printoptions(15)
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]
scalar = StandardScaler()
clf = LinearRegression()
class StScaler(StandardScaler):
    def fit_transform(self, X, y=None):
        print('Length of Data on which scaler is fit on =', len(X))
        output = super().fit(X, y)
        # print('mean of scalar =', output.mean_)
        output = super().transform(X)
        return output
pipeline = Pipeline([('sc', StScaler()), ('estimator', clf)])
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle=True is required when random_state is set
cross_val_score(pipeline, X, y, cv=cv)
# Now fitting Scaler on whole train data
scaler_2 = StandardScaler()
clf_2 = LinearRegression()
X_ss = scaler_2.fit_transform(X)
cross_val_score(clf_2, X_ss, y, cv=cv)
Thanks!

How to use GridSearchCV for tuning parameters with train_test_split strategy?

I am trying to fine-tune my sklearn models using a train_test_split strategy. I am aware of GridSearchCV's ability to perform parameter tuning; however, it is tied to a cross validation strategy. I would like to use a train_test_split strategy for the parameter search: training speed is important in my case, so I prefer a simple train_test_split over cross-validation.
I could try to write my own for loop, but it would be inefficient because it wouldn't take advantage of the built-in parallelization used in GridSearchCV.
Does anyone know how to use GridSearchCV for this, or can you suggest an alternative that isn't too slow?
Yes, you can use ShuffleSplit for this.
ShuffleSplit is a cross validation strategy like KFold, but unlike KFold, where you have to train K models, here you can control how many times to do the train/test split - even just once if you prefer.
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(n_splits=1, test_size=.25)
n_splits defines how many times to repeat this splitting and training routine.
Now you can use it like this:
GridSearchCV(clf, param_grid={}, cv=shuffle_split)
I would like to add on to Shihab Shahriar's answer by providing a code sample.
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier
# Load iris dataset
iris = datasets.load_iris()
# Prepare X and y as dataframe
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.DataFrame(data=iris.target, columns=['Species'])
# Train test split
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.3)
# This is equivalent to:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# But, it is usable for GridSearchCV
# Grid search with a single train/test split instead of K-fold CV
params = { 'n_estimators': [16, 32] }
clf = RandomForestClassifier()
grid_search = GridSearchCV(clf, param_grid=params, cv=shuffle_split)
grid_search.fit(X, y.values.ravel())  # ravel avoids a column-vector shape warning
This should help anyone facing a similar problem.

Feature Importance using Imbalanced-learn library

The imblearn library is used for imbalanced classification problems. It allows you to use scikit-learn estimators while balancing the classes using a variety of methods, from undersampling to oversampling to ensembles.
My question, however, is: how can I get the feature importances of the estimator after using BalancedBaggingClassifier or any other sampling method from imblearn?
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
bbc = BalancedBaggingClassifier(
    random_state=42,
    base_estimator=DecisionTreeClassifier(criterion=criteria_, max_features='sqrt', random_state=1),
    n_estimators=2000,
)
bbc.fit(X_train,y_train)
Not all estimators in sklearn allow you to get feature importances (for example, BaggingClassifier doesn't). If the estimator does, it looks like it should just be stored as estimator.feature_importances_, since the imblearn package subclasses from sklearn classes. I don't know what estimators imblearn has implemented, so I don't know if there are any that provide feature_importances_, but in general you should look at the sklearn documentation for the corresponding object to see if it does.
You can, in this case, look at the feature importances for each of the estimators within the BalancedBaggingClassifier, like this:
for estimator in bbc.estimators_:
    print(estimator.steps[1][1].feature_importances_)
And you can print the mean importance across the estimators like this:
import numpy as np
print(np.mean([est.steps[1][1].feature_importances_ for est in bbc.estimators_], axis=0))
There is a shortcut around this, although it is not very efficient. The BalancedBaggingClassifier uses the RandomUnderSampler repeatedly and fits the estimator on top. A for-loop with RandomUnderSampler is one way of going around the pipeline method; you can then call the scikit-learn estimator directly. This also lets you look at feature_importances_:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
rus = RandomUnderSampler(random_state=1)
my_list = []
for i in range(0, 10):  # random under-sampling 10 times
    X_pl, y_pl = rus.fit_resample(X_train, y_train)  # older imblearn versions used rus.sample()
    my_list.append((X_pl, y_pl))  # forming tuples from samples
X_pl = []
Y_pl = []
for num in range(0, len(my_list)):  # creating the DataFrames for input/output
    X_pl.append(pd.DataFrame(my_list[num][0]))
    Y_pl.append(pd.DataFrame(my_list[num][1]))
X_pl_ = pd.concat(X_pl)  # concatenating the DataFrames
Y_pl_ = pd.concat(Y_pl)
# note: max_features must not exceed the number of features in your data
RF = RandomForestClassifier(n_estimators=2000, criterion='gini', max_features=25, random_state=1)
RF.fit(X_pl_, Y_pl_.values.ravel())
RF.feature_importances_
According to the scikit-learn documentation, you can use impurity-based feature importance from some sort of forest classifier for classifiers that don't have their own. Here my classifier doesn't have feature_importances_, so I'm adding it directly:
from sklearn.ensemble import ExtraTreesClassifier
classifier.fit(x_train, y_train)
...
...
forest = ExtraTreesClassifier(n_estimators=classifier.n_estimators,
                              random_state=classifier.random_state)
forest.fit(x_train, y_train)
classifier.feature_importances_ = forest.feature_importances_

Restricted Boltzmann Machine in Scikit-learn: Iris Classification

I'm working on an example of applying a Restricted Boltzmann Machine to the Iris dataset. Essentially, I'm trying to make a comparison between RBM and LDA. LDA seems to produce reasonably correct output, but the RBM doesn't. Following a suggestion, I binarized the feature inputs using sklearn.preprocessing.Binarizer, and also tried different threshold parameter values. I tried several different ways to apply binarization, but none seemed to work for me.
Below is my modified version of the code, based on the version by the user covariance.
Any helpful comments are greatly appreciated.
from sklearn import linear_model, datasets, preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA  # formerly sklearn.lda.LDA
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:,:2] # we only take the first two features.
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
# Models we will use
rbm = BernoulliRBM(random_state=0, verbose=True)
binarizer = preprocessing.Binarizer(threshold=0.01,copy=True)
X_binarized = binarizer.fit_transform(X_train)
hidden_layer = rbm.fit_transform(X_binarized, Y_train)
logistic = linear_model.LogisticRegression()
logistic.coef_ = hidden_layer
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
lda = LDA(n_components=2)  # at most min(n_classes - 1, n_features) = 2 components here
#########################################################################
# Training RBM-Logistic Pipeline
logistic.fit(X_train, Y_train)
classifier.fit(X_binarized, Y_train)
#########################################################################
# Get predictions
print "The RBM model:"
print "Predict: ", classifier.predict(X_test)
print "Real: ", Y_test
print
print "Linear Discriminant Analysis: "
lda.fit(X_train, Y_train)
print "Predict: ", lda.predict(X_test)
print "Real: ", Y_test
RBM and LDA are not directly comparable, as RBM doesn't perform classification on its own. Though you are using it as a feature engineering step with logistic regression at the end, LDA is itself a classifier - so the comparison isn't very meaningful.
The BernoulliRBM in scikit-learn only handles binary inputs. The Iris dataset has no sensible binarization, so you aren't going to get any meaningful outputs.
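For reference, here is a minimal sketch of how a BernoulliRBM feature extractor is usually combined with a classifier. This is not the asker's exact code: it uses the digits dataset, whose pixel intensities scaled to [0, 1] are much closer to the binary-like inputs the RBM expects, and it fits the whole pipeline instead of assigning logistic.coef_ manually.
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
# Digits: pixel intensities scaled to [0, 1] behave like Bernoulli probabilities
digits = datasets.load_digits()
X = digits.data / 16.0
Y = digits.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
logistic = linear_model.LogisticRegression(max_iter=1000)
# Fit the pipeline end to end; the RBM learns features and the logistic
# regression learns to classify them
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, Y_train)
print("Pipeline test accuracy:", classifier.score(X_test, Y_test))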
