I am experiencing a problem where tuning the hyperparameters with GridSearchCV doesn't really improve my classifiers. I figured the improvement should be bigger than that: the biggest gain I've gotten with my current code is around ±0.03. I have a dataset with eight columns and an unbalanced binary outcome. For scoring I use f1, and I use KFold with 10 splits. Could someone spot something that is off and that I should look at? Thank you!
I use the following code:
model_parameters = {
    "GaussianNB": {},
    "DecisionTreeClassifier": {
        'min_samples_leaf': range(5, 9),
        'max_depth': [None, 0, 1, 2, 3, 4]
    },
    "KNeighborsClassifier": {
        'n_neighbors': range(1, 10),
        'weights': ["distance", "uniform"]
    },
    "SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0, 15, 30)
    },
    "LogisticRegression": {
        'C': np.linspace(0, 15, 30),
        'penalty': ["l1", "l2", "elasticnet", "none"]
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
n_splits = 10
scoring_method = make_scorer(lambda true_target, prediction: f1_score(true_target, prediction, average="micro"))
cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)

for model_name, parameters in model_parameters.items():
    # models is a dict with the 5 classifiers
    model = models[model_name]
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, scoring=scoring_method, verbose=False).fit(X_train, y_train)
    cvScore = cross_val_score(grid_search.best_estimator_, X_test, y_test, cv=cv, scoring='f1').mean()
    classDict[model_name] = cvScore
If your classes are unbalanced, then when you build the K folds you should preserve the proportion between the two targets. Unbalanced folds can lead to very poor results, so check out the Stratified K-Folds cross-validator:
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns
stratified folds. The folds are made by preserving the percentage of
samples for each class.
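A minimal sketch of swapping it in for the plain KFold above (reusing the question's variable names; nothing else needs to change):

from sklearn.model_selection import StratifiedKFold

# StratifiedKFold keeps roughly the same class proportions in every fold
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

# the rest stays the same, e.g. GridSearchCV(model, parameters, cv=cv, ...)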
There are also a lot of techniques to handle an unbalanced dataset. Depending on the context:
upsampling the minority class (for example with resample from sklearn; see the sketch after this list)
undersampling the majority class (libraries such as imbalanced-learn have useful tools for both under- and over-sampling)
handling the imbalance within your specific ML model
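For the first option, a minimal sketch using sklearn.utils.resample (the DataFrame and column names are placeholders, not from the original post):

import pandas as pd
from sklearn.utils import resample

# df is assumed to be the *training* data with a binary "target" column;
# resampling only the training split avoids leaking duplicates into the test set
majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

minority_upsampled = resample(minority,
                              replace=True,             # sample with replacement
                              n_samples=len(majority),  # match the majority class size
                              random_state=42)

df_balanced = pd.concat([majority, minority_upsampled])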
As for the last option: in SVC, for example, there is a class_weight argument you can set when you create the model,
clf_3 = SVC(kernel='linear',
            class_weight='balanced',  # penalize
            probability=True)
which will penalize errors on the minority class more heavily.
You can change your config accordingly (note that in a GridSearchCV parameter grid the values must be lists):
"SVM": {
    'kernel': ["poly"],
    'C': np.linspace(0, 15, 30),
    'class_weight': ['balanced']
}
For LogisticRegression you can instead set explicit weights reflecting the proportions of your classes:
LogisticRegression(class_weight={0: 1, 1: 10})  # if the problem is a binary one
changing the grid-search dict like this:
"LogisticRegression": {
    'C': np.linspace(0, 15, 30),
    'penalty': ["l1", "l2", "elasticnet", "none"],
    'class_weight': [{0: 1, 1: 10}]
}
In any case, the best approach depends on the model you use. With a neural network, for example, you can weight the loss function to penalize errors on the minority class more heavily (the same idea as the class weights in logistic regression).
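As a hedged illustration of that idea, assuming a Keras model (the original answer does not name a framework; the architecture and weights are placeholders):

import tensorflow as tf

# a tiny binary classifier; the layer sizes are arbitrary placeholders
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# weight the loss so errors on the minority class (label 1) count ten times as much
model.fit(X_train, y_train, epochs=10, class_weight={0: 1.0, 1: 10.0})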
Related
How to use RandomizedSearchCV or GridSearchCV for only 30% of data in order to speed up the process.
My X.shape is (94456, 100) and I'm trying to use RandomizedSearchCV or GridSearchCV, but it's taking a very long time. I've been running my code for several hours but still have no results.
My code looks like this:
# Random Forest
param_grid = [
    {'n_estimators': np.arange(2, 25), 'max_features': [2, 5, 10, 25],
     'max_depth': np.arange(10, 50), 'bootstrap': [True, False]}
]
clf = RandomForestClassifier()
grid_search_forest = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search_forest.fit(X, y)
rf_best_model = grid_search_forest.best_estimator_
# Decision Tree
param_grid = {'max_depth': np.arange(1, 50), 'min_samples_split': [20, 30, 40]}
grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search_dec_tree.fit(X, y)
dt_best_model = grid_search_dec_tree.best_estimator_
# K Nearest Neighbor
knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid_search_knn = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid_search_knn.fit(X, y)
knn_best_model = grid_search_knn.best_estimator_
You can always sample a part of your data to fit your models. Although not designed for this purpose, train_test_split can be useful here (it can take care of shuffling, stratification etc, which in a manual sampling you would have to take care of by yourself):
from sklearn.model_selection import train_test_split
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)
By asking for test_size=0.70, your training data X_train will now be 30% of your initial set X.
You should now replace all the .fit(X, y) statements in your code with .fit(X_train, y_train).
On a more general level, all these np.arange() statements in your grid look like overkill - I would suggest selecting some representative values in a list instead of going through the grid in that much detail. Random Forests in particular are notoriously insensitive to the number of trees n_estimators, and adding one tree at a time is hardly useful - go for something like 'n_estimators': [50, 100]...
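For example, a much coarser grid along those lines might look like this (the values are illustrative, not prescriptive):

param_grid = [
    {'n_estimators': [50, 100],        # RFs are fairly insensitive to this beyond a point
     'max_features': [2, 5, 10, 25],
     'max_depth': [10, 20, 40, None],
     'bootstrap': [True, False]}
]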
ShuffleSplit fits this problem very well. You can define your cv as:
cv = ShuffleSplit(n_splits=1, test_size=.3)
This means setting aside and using 30% of your training data for validating each hyper-parameter setting. cv=5, on the other hand, carries out a 5-fold cross-validation, which means going through 5 fit-and-predict rounds for each hyper-parameter setting.
So this also requires very minimal change to your code: just replace those cv=5 or cv=10 inside GridSearchCV with cv = ShuffleSplit(n_splits=1, test_size=.3) and you're done.
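Applied to the Random Forest search from the question, that would look something like this (a sketch reusing the question's variable names):

from sklearn.model_selection import ShuffleSplit, GridSearchCV

# one random 70/30 split instead of 5-fold CV for each candidate
cv = ShuffleSplit(n_splits=1, test_size=.3)
grid_search_forest = GridSearchCV(clf, param_grid, cv=cv, scoring='accuracy')
grid_search_forest.fit(X, y)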
I am currently working with a small dataset of 20x300. Since I have so few datapoints, I was wondering if I could use an approach similar to leave-one-out cross-validation but for testing.
Here's what I was thinking:
1. Train/test split the data, with only one data point in the test set.
2. Train the model on the training data, potentially with grid search / cross-validation.
3. Use the best model from step 2 to make a prediction on the one data point and save the prediction in an array.
4. Repeat the previous steps until all the data points have been in the test set.
5. Calculate your preferred metric of choice (accuracy, f1-score, auc, etc.) using these predictions.
The pros of this approach would be:
You don't have to "reserve/waste" data for testing, so you can train with more data points.
The cons would be:
This approach suffers from potential(?) data leakage.
You are calculating an accuracy metric from a bunch of predictions that potentially came from different models, due to the grid searches, so I'm not sure how accurate it is going to be.
I have tried the standard train/test splitting approaches, but since I need to take out at least 5 points for testing, I don't have enough points for training and the ROC AUC becomes very bad.
Here's some code so you can see what I'm talking about.
y_preds = []  # collect one predicted probability per left-out point
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    param_grid = {'C': [1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5, 5e6, 1e6],
                  'gamma': [0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
                  'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                  'kernel': ['rbf', 'linear', 'poly'],
                  'class_weight': ['balanced', None]}

    model = SVC(probability=True)
    skf = StratifiedKFold(n_splits=num_cv_folds)
    grid_search = GridSearchCV(model, param_grid,
                               n_jobs=-1,
                               cv=skf,
                               scoring='roc_auc',
                               verbose=0,
                               refit=True,
                               iid=False)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    y_pred_test = best_model.predict_proba(X_test)
    y_preds.append(y_pred_test[0][0])

fpr, tpr, thresholds = roc_curve(y, y_preds, pos_label=1)
auc_roc1 = auc(fpr, tpr)
I would really appreciate some feedback about whether this approach is actually feasible or not, and why.
I am trying to predict the incidence of diabetes with scikit-learn in Python. My data was extremely imbalanced, so I resampled it using SMOTE at 100%, balancing the classes. I am using a random forest for classification and I wanted to extract the feature importances after I built the model. However, the resampled datasets are now numpy.ndarray objects and I am not able to see the feature importance results. Please help!
In the full dataset, diabetic (1) = 704 and non-diabetic (0) = 4777. I resampled the training sample using SMOTE at 100%, bringing the minority class up to the size of the majority class.
I found the best parameters by grid search and 5-fold cross-validation.
When I tried to see the feature importances I got the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
So I understand that as soon as I resampled the data it was coerced into an ndarray, but I don't know what to do next.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sm = SMOTE(random_state=2)
X_res, y_res = sm.fit_sample(X_train, y_train.ravel())

cl = RandomForestClassifier()
params = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
}
clf = GridSearchCV(estimator=cl, param_grid=params, cv=5, verbose=0)
bestmodel = clf.fit(X_res, y_res)

cl = RandomForestClassifier(bootstrap=False, criterion='gini',
                            max_depth=8, max_features='sqrt', n_estimators=300)
forestSMOTE = cl.fit(X_res, y_res)

feature_importances = pd.DataFrame(forestSMOTE.feature_importances_,
                                   index=X_res.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
If someone has suggestions about what to do in order to be able to plot important features when I am using resampled sets, let me know PLEASE.
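One possible fix, as a minimal sketch (an assumption on my part, not from the original post): SMOTE keeps the same feature columns, so the column names of the pre-SMOTE DataFrame can be reused:

# X_train was a DataFrame before SMOTE, so its column names still describe X_res
feature_importances = pd.DataFrame(forestSMOTE.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)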
Basically, I want to perform binary classification using SVM (SVC) from sk-learn. Since I do not have separate training and testing data, I use cross-validation to evaluate the effectiveness of the feature set that I use.
Then, I use GridSearchCV to find the best estimator and set the cross-validation parameter to 10. Because I want to analyze the prediction result, I use the best estimator to perform cross-validation using the same dataset (of course I use 10-fold cross-validation).
However, when I print the performance scores (precision, recall, f-measure, and accuracy), it produces different scores. Why do you think this happens?
I am also wondering: in sk-learn, should I specify which label is the positive one? In my dataset, I have already labelled the positive cases as 1.
Lastly, the following text is the snippet for my code.
tuned_parameters = [{'kernel': ['linear','rbf'], 'gamma': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10], 'C': [0.1, 1, 5, 10, 50, 100, 1000]}]
scoring = ['f1_macro', 'precision_macro', 'recall_macro', 'accuracy']
clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=10, scoring= scoring, refit='f1_macro')
clf.fit(feature, label)
param_C = clf.cv_results_['param_C']
param_gamma = clf.cv_results_['param_gamma']
P = clf.cv_results_['mean_test_precision_macro']
R = clf.cv_results_['mean_test_recall_macro']
F1 = clf.cv_results_['mean_test_f1_macro']
A = clf.cv_results_['mean_test_accuracy']
#print clf.best_estimator_
print clf.best_score_
scoring2 = ['f1', 'precision', 'recall', 'accuracy']
scores = cross_validate(clf.best_estimator_, feature, label, cv=n, scoring=scoring2, return_train_score=True)
print scores
scores_f1 = np.mean(scores['test_f1'])
scores_p = np.mean(scores['test_precision'])
scores_r = np.mean(scores['test_recall'])
scores_a = np.mean(scores['test_accuracy'])
print '\t'.join([str(scores_f1), str(scores_p), str(scores_r),str(scores_a)])
It may be because the cross-validation splits used in cross_validate and GridSearchCV are different, due to randomness. The effect of this randomness becomes even larger since your dataset is very small (93 samples) and the number of folds is large (10). A possible fix is to feed a fixed set of train/test splits into cv, and to reduce the number of folds to reduce the variance, i.e.
kfolds = list(StratifiedKFold(n_splits=3).split(feature, label))  # materialize the splits so they can be reused
...
clf = GridSearchCV(..., cv=kfolds, ...)
...
scores = cross_validate(..., cv=kfolds, ...)
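Put together, a minimal sketch of the idea (reusing the question's variable names; the scoring choice here is an assumption):

from sklearn import svm
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate

# materialize the splits once so GridSearchCV and cross_validate see exactly the same folds
kfolds = list(StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(feature, label))

clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=kfolds, scoring='f1_macro', refit=True)
clf.fit(feature, label)

scores = cross_validate(clf.best_estimator_, feature, label, cv=kfolds, scoring=['f1_macro'])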
I would like to ensure that the order of operations for my machine learning is right:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.grid_search import GridSearchCV

# 1. Initialize model
model = RandomForestClassifier(5000)

# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator

# 4. Cross-validate model on the important features
k_fold = KFold(n=len(data), n_folds=10, shuffle=True)
for k, (train, test) in enumerate(k_fold):
    self.model.fit(data[train], target[train])

# 5. Grid search for best parameters
param_grid = {
    'n_estimators': [1000, 2500, 5000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [3, 5, data.shape[1]]
}
gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_

# Now the model can be used for prediction
Please let me know if this order looks good or if something can be done to improve it.
--EDIT, clarifying to reduce downvotes.
Specifically,
1. Should the SelectFromModel be done after cross validation?
2. Should grid search be done before cross validation?
The main problem with your approach is that you are confusing the feature selection transformer with the final estimator. What you will need to do is create two stages, the transformer first:
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
Then you need a second stage, where you train a classifier on the reduced feature set.
clf = RandomForestClassifier(5000)
Once you have your phases, you can build a pipeline to combine the two into a final model.
model = Pipeline([
('fs', feat_selection),
('clf', clf),
])
Now you can perform a GridSearch on your model. Keep in mind you have two stages, so each parameter must be prefixed with its stage name, fs or clf. For the feature selection stage, you can also reach the base estimator using fs__estimator. Below is an example of how to search parameters on any of the three objects.
params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
}
gs = GridSearchCV(model, params, ...)
gs.fit(X,y)
You can then make predictions with gs directly or using gs.best_estimator_.
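For instance (a sketch, with X_new standing in for some unseen data):

# the fitted pipeline applies feature selection and then the classifier end to end
predictions = gs.predict(X_new)                  # delegates to gs.best_estimator_
predictions = gs.best_estimator_.predict(X_new)  # equivalent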