I am working on a dataset of shape (41188, 58) to build a binary classifier. The data is highly imbalanced. Initially, I intend to do feature selection with RFECV, and this is the code I am using, which has been borrowed from here:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct classifications
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(5),
              scoring='accuracy',
              min_features_to_select=min_features_to_select)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.plot(range(min_features_to_select,
               len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()
I got the following result:
Then I changed the code to cv=StratifiedKFold(2) and min_features_to_select = 20, and this time I got:
In neither case was resampling applied. Since resampling should be applied only to the training data, and I am using cross-validation here, each training fold should be resampled (e.g. with SMOTE) as well. How can I integrate resampling and feature selection into RFECV?
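For reference, this is roughly the per-fold behaviour I have in mind, sketched with imbalanced-learn's sampler-aware Pipeline wrapped around a plain RFE with a fixed number of features (the SMOTE settings, n_features_to_select=20 and balanced-accuracy scoring are my own assumptions, not a confirmed solution):

from imblearn.pipeline import Pipeline           # sampler-aware pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE only runs when the pipeline is fitted, i.e. on the training part of
# each CV fold; the held-out fold is left untouched.
pipe = Pipeline(steps=[
    ("smote", SMOTE(random_state=0)),
    ("rfe", RFE(estimator=SVC(kernel="linear"), n_features_to_select=20)),
])

# X and y are the same arrays passed to RFECV above
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring="balanced_accuracy")
print(scores.mean())

This fixes the number of features instead of cross-validating it, which is exactly the part I don't know how to reconcile with RFECV.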
I am trying to perform k-fold cross-validation and GridSearchCV to optimise my Gradient Boosting model, following this link:
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train/test split? If I change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my sklearn 80/20 train/test split with no CV gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be ROC-AUC rather than AUC? My sklearn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV score so much lower than the AUC (Train) score? It looks like they both use roc_auc (my sklearn model gives 0.77 for the ROC AUC).
import numpy as npy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(ddf[predictors], ddf['APPROVED'])

    # Predict training set:
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]

    # Perform cross-validation:
    if performCV:
        cv_score = cross_val_score(alg, ddf[predictors], ddf['APPROVED'],
                                   cv=cv_folds, scoring='roc_auc')

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g"
              % (npy.mean(cv_score), npy.std(cv_score), npy.min(cv_score), npy.max(cv_score)))

    # Print feature importance:
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except the target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
The main reason your cv_score appears low is that comparing it to the training accuracy isn't a fair comparison. Your training accuracy is calculated using the same data that was used to fit the model, whereas the cv_score is the average score over the testing folds within your cross-validation. As you can imagine, a model will perform better when making predictions on data it has already been trained on than when it has to make predictions on new data it has never seen before.
Your accuracy_score and AUC calculations appear fixed because you always use the same inputs (ddf["APPROVED"], ddf_predictions and ddf_predprob) in the calculations. The performCV section doesn't actually transform any of those datasets, so if you're using the same model, model parameters, and input data, you'll get the same predictions going into the calculations.
Based on your comments, there are a number of reasons why the cv_score could be lower than the accuracy on your full test set. One of the main reasons is that you're allowing your model to access more data for training when you use the full training set, as opposed to using only a subset of the training data within each CV fold. This is especially true if your data size isn't all that large; when the data set is small, every extra sample matters for training and can yield better performance.
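To make the gap concrete, here is a small self-contained sketch on synthetic data (make_classification is just a stand-in for your dataset, so the numbers are only illustrative) comparing the accuracy a model reports on its own training data with its cross-validated accuracy:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=10)

gbm = GradientBoostingClassifier(random_state=10)
gbm.fit(X_demo, y_demo)

# Accuracy on the same data the model was fitted on (optimistic)
train_acc = accuracy_score(y_demo, gbm.predict(X_demo))

# Average accuracy over held-out folds (more realistic)
cv_acc = cross_val_score(gbm, X_demo, y_demo, cv=5, scoring='accuracy').mean()

print("Train accuracy: %.4f | Mean CV accuracy: %.4f" % (train_acc, cv_acc))

The first number will typically be noticeably higher than the second, for the reasons above.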
I am using a random forest classifier for feature selection. I have 70 features in all and I want to select the most important ones out of the 70. The code below trains the classifier and lists the features from most significant to least significant.
Code:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = data.columns[1:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Rank features by importance, most significant first
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my given dataset?
from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)
When I try threshold=0.15 and then try to train my model, I get an error saying the data is too noisy or the selection is too strict.
But if I use threshold=0.015, I am able to train my model on the newly selected features. So how can I decide this threshold value?
I would try the following approach:
1) Start with a low threshold, for example 1e-4.
2) Reduce your features using SelectFromModel's fit and transform.
3) Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
4) Increase the threshold and repeat steps 2) and 3).
Using this approach you can estimate the best threshold for your particular data and estimator, as sketched below.
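A rough sketch of that loop (the threshold grid and the synthetic data from make_classification are my own assumptions, just to keep the example self-contained; substitute your own X_train and y_train):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=1000, n_features=70,
                                        n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

for threshold in [1e-4, 1e-3, 1e-2, 0.015, 0.05]:
    # prefit=True reuses the already-fitted forest for the selection step
    sfm = SelectFromModel(clf, threshold=threshold, prefit=True)
    X_selected = sfm.transform(X_train)
    score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            X_selected, y_train, cv=5, scoring='accuracy').mean()
    print("threshold=%g -> %d features, CV accuracy=%.4f"
          % (threshold, X_selected.shape[1], score))

You can then pick the threshold that gives the best trade-off between the number of features kept and the cross-validated score.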
Is there a way to get a similar, nice output for scikit-learn logistic regression models as in statsmodels, with all the p-values, std. errors etc. in one table?
As you and others have pointed out, this is a limitation of scikit-learn. Before discussing a scikit-learn approach to your question below, the "best" option is to use statsmodels as follows:
import statsmodels.api as sm
smlog = sm.Logit(y,sm.add_constant(X)).fit()
smlog.summary()
X represents your input feature/predictor matrix and y represents the outcome variable. statsmodels works well if X lacks highly correlated features, lacks low-variance features, no feature generates "perfect/quasi-perfect separation", and any categorical features are reduced to "n-1" levels, i.e. dummy-coded (and not "n" levels, i.e. one-hot encoded, as described here: dummy variable trap).
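As a small illustration of the n-1 dummy coding mentioned above (the 'colour' column is purely hypothetical):

import pandas as pd

df_demo = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# drop_first=True keeps n-1 indicator columns and avoids the dummy variable trap
X_dummies = pd.get_dummies(df_demo, columns=["colour"], drop_first=True)
print(X_dummies)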
However, if the above isn't feasible or practical, one scikit-learn approach is coded below for fairly equivalent results, in terms of feature coefficients/odds with their standard errors and 95% CI estimates. Essentially, the code generates these results from distinct logistic regression scikit-learn models trained against distinct test-train splits of your data. Again, make sure categorical features are dummy-coded to n-1 levels (or your scikit-learn coefficients will be incorrect for categorical features).
import random
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Instantiate logistic regression model with regularization turned OFF
log_nr = LogisticRegression(fit_intercept=True, penalty="none")

## Generate 5 distinct random numbers - as random seeds for 5 test-train splits
randomlist = random.sample(range(1, 10000), 5)

## Create features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])

## Assemble coefficients over logistic regression models on 5 random data splits
# iterate over random states while keeping track of `i`
for i, state in enumerate(randomlist):
    train_x, test_x, train_y, test_y = train_test_split(X, y, stratify=y,
                                                        test_size=0.3,
                                                        random_state=state)  # 5 test-train splits
    log_nr.fit(train_x, train_y)  # fit logistic model
    coeff_table[f"coefficients_{i+1}"] = np.transpose(log_nr.coef_)

## Calculate mean and std error for model coefficients (columns 1..5 hold the 5 coefficient sets)
coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)

# Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96 * coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]
Finally, you can (optionally) convert the coefficients to odds by exponentiating, as follows. Odds ratios are my favorite output from logistic regression, and these are appended to your dataframe using the code below.
#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])
This answer builds upon a somewhat related reply by @pciunkiewicz, available here: Collate model coefficients across multiple test-train splits from sklearn
I have a dataframe X comprised of 60 features and ~450k observations. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, which I would like to score by the area under an ROC curve. features_selected is a list of all features.
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm

# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.70)

# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()

# Use RFECV to pick the best features, using stratified k-fold CV
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(n_splits=10), scoring='roc_auc')

# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one time it returned 5 features, another run it returned 9; one is not a subset of the other). Why would this be?
b) Does this imply an unstable solution? Using the same seed for StratifiedKFold should solve this problem, but does this mean I need to reconsider the approach entirely?
c) In general, how do I approach tuning? E.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples. Therefore, it's not surprising to get different results every time you execute the algorithm. Source
One approach is to use Pearson's correlation coefficient: calculate the correlation coefficient between each pair of features and remove features that are highly correlated with each other. This method could be considered a more stable solution to this kind of problem. Source
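A minimal sketch of that correlation filter with pandas (the 0.9 cut-off is an arbitrary choice, and Xtrain is assumed to be a DataFrame as in your code):

import numpy as np

# Absolute pairwise Pearson correlations between features
corr = Xtrain.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose correlation exceeds the cut-off
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
Xtrain_reduced = Xtrain.drop(columns=to_drop)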
I'm using cross-validation to evaluate the performance of a classifier with scikit-learn, and I want to plot the precision-recall curve. I found an example on scikit-learn's website to plot the PR curve, but it doesn't use cross-validation for the evaluation.
How can I plot the Precision-Recall curve in scikit learn when using cross-validation?
I did the following, but I'm not sure if it's the correct way to do it (pseudocode):
for each k-fold:
    precision, recall, _ = precision_recall_curve(y_test, probs)
    mean_precision += precision
    mean_recall += recall

mean_precision /= num_folds
mean_recall /= num_folds

plt.plot(mean_recall, mean_precision)
What do you think?
Edit:
It doesn't work because the sizes of the precision and recall arrays are different after each fold.
Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the held-out (out-of-fold) predictions and compute precision and recall.
## let test_samples[k] = test samples for the kth fold (list of lists)
## let train_samples[k] = training samples for the kth fold (list of lists)
for k in range(0, num_folds):
    model = train(parameters, train_samples[k])
    predictions_fold[k] = predict(model, test_samples[k])

# collect predictions
predictions_combined = [p for preds in predictions_fold for p in preds]

## let predictions = rearranged predictions s.t. they are in the original order
## use predictions and labels to compute lists of TP, FP, FN
## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation
Under a single, complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.
(Note: these predictions are different from training predictions, because the predictor makes the prediction for each sample without having previously seen it.)
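If it helps, here is my reading of the same pooling idea in concrete scikit-learn code; cross_val_predict and average_precision_score are my own choices (average precision avoids the linear-interpolation issue discussed below), not part of the original pseudocode:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import matplotlib.pyplot as plt

# Imbalanced synthetic data as a stand-in for the real dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One out-of-fold probability per sample, returned in the original order
probs = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

precision, recall, _ = precision_recall_curve(y, probs)
print("Average precision:", average_precision_score(y, probs))

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()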
Unless you are using leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. Combining precision-recall curves from different rounds, however, is not straightforward, since you cannot use simple linear interpolation between precision-recall points, unlike with ROC curves (see Davis and Goadrich 2006).
I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross validation.
For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.
There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.
For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:
Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points with possible smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoid method for numerical integration.
This is currently the best way to plot a precision-recall curve for an sklearn classifier using cross-validation. The best part is that it plots the PR curves for ALL classes, so you get multiple neat-looking curves as well.
from scikitplot.classifiers import plot_precision_recall_curve
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

clf = LogisticRegression()
plot_precision_recall_curve(clf, X, y)
plt.show()
The function automatically takes care of cross-validating the given dataset, concatenating all out-of-fold predictions, and calculating the PR curves for each class plus an averaged PR curve. It's a one-line function that takes care of it all for you.
[Image: precision-recall curves for each class and the averaged curve]
Disclaimer: Note that this uses the scikit-plot library, which I built.