I am currently working with a small dataset of 20x300 (20 samples, 300 features). Since I have so few data points, I was wondering if I could use an approach similar to leave-one-out cross-validation, but for testing.
Here's what I was thinking:
1. Train/test split the data, with only one data point in the test set.
2. Train the model on the training data, potentially with grid search/cross-validation.
3. Use the best model from step 2 to make a prediction on the one test data point and save the prediction in an array.
4. Repeat the previous steps until all the data points have been in the test set.
5. Calculate your preferred metric (accuracy, F1-score, AUC, etc.) using these predictions.
The pros of this approach would be:
- You don't have to "reserve/waste" data for testing, so you can train with more data points.
The cons would be:
- This approach suffers from potential(?) data leakage.
- You are calculating an accuracy metric from a bunch of predictions that potentially came from different models, due to the grid searches, so I'm not sure how accurate it is going to be.
I have tried the standard approach of train/test splitting, but since I need to take out at least 5 points for testing, I don't have enough points left for training and the ROC AUC becomes very bad.
Here's some code so you can see what I'm talking about.
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve, auc

loo = LeaveOneOut()
y_preds = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    param_grid = {'C': [1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6],
                  'gamma': [0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
                  'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                  'kernel': ['rbf', 'linear', 'poly'],
                  'class_weight': ['balanced', None]}

    model = SVC(probability=True)
    skf = StratifiedKFold(n_splits=num_cv_folds)
    grid_search = GridSearchCV(model, param_grid,
                               n_jobs=-1,
                               cv=skf,
                               scoring='roc_auc',
                               verbose=0,
                               refit=True)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    y_pred_test = best_model.predict_proba(X_test)
    # keep the predicted probability of the positive class for this held-out point
    y_preds.append(y_pred_test[0][1])

fpr, tpr, thresholds = roc_curve(y, y_preds, pos_label=1)
auc_roc1 = auc(fpr, tpr)
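For reference, the same nested scheme (LOO outer loop, grid-searched inner CV, pooled per-point probabilities) can also be written more compactly with cross_val_predict. This is just a sketch reusing param_grid and num_cv_folds from above, not a different method:
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, GridSearchCV, cross_val_predict
from sklearn.metrics import roc_curve, auc

inner_search = GridSearchCV(SVC(probability=True), param_grid,
                            cv=StratifiedKFold(n_splits=num_cv_folds),
                            scoring='roc_auc', n_jobs=-1)

# One probability row per sample, each produced by a model that never saw that sample.
probs = cross_val_predict(inner_search, X, y, cv=LeaveOneOut(), method='predict_proba')

# Column 1 is the probability of the positive class, assuming labels {0, 1}.
fpr, tpr, thresholds = roc_curve(y, probs[:, 1], pos_label=1)
print(auc(fpr, tpr))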
I would really appreciate some feedback about whether this approach is actually feasible or not, and why.
I would like to use grid search in the code below to fine-tune my SVM model. I have copied this code from other GitHub repositories and it has been working perfectly fine for my cross-validation folds.
from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X = Corpus.drop(['text', 'ManipulativeTag', 'compound'], axis=1).values  # !!! this drops compound because of Naive Bayes
y = Corpus['ManipulativeTag'].values

kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Create splits
splits = kf.split(X)

# Access the training and validation indices of splits
kfold_accuracy = {}
kfold_precision = {}
kfold_f = {}
kfold_recall = {}

for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Set up the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    # print("training:", train_index, "validations:", val_index)
    X_val, y_val = X[val_index], y[val_index]

    SVM = svm.SVC(C=1.0, kernel='linear', random_state=1111, probability=True)  ### the base estimator
    SVM.fit(X_train, y_train)

    # Predict the labels on the validation dataset
    predictions = SVM.predict(X_val)

    # Score this fold with the standard metric functions
    kfold_accuracy[i] = accuracy_score(y_val, predictions)
    kfold_precision[i] = precision_score(y_val, predictions)
    kfold_f[i] = f1_score(y_val, predictions)
    kfold_recall[i] = recall_score(y_val, predictions)
However, when trying to implement grid search, most of the articles I ran into use train_test_split() rather than my kf.split(), so I am having trouble finding the right place to put the GridSearchCV() call:
GridSearchCV(estimator=classifier,
             param_grid=grid_param,
             scoring='accuracy',
             cv=5,
             n_jobs=-1)
I found my solution here: Grid search and cross validation SVM
I have copied this from the post:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}]
And I have kept everything from my code, only making changes inside the loop by adding the GridSearchCV() call:
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Set up the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]

    # this is where I put GridSearchCV()
    # here cv cannot be 1, so I put 2 instead
    SVM = GridSearchCV(SVC(), tuned_parameters, cv=2, scoring='accuracy')
    SVM.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(SVM.best_params_)
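To keep the per-fold metrics from the earlier version of the loop, the tuned search can be evaluated on the validation fold in the same place where the plain SVC was evaluated before. A minimal sketch, reusing the kfold_* dicts defined above (GridSearchCV refits the best parameter combination on the whole training fold by default, so SVM.predict already uses the tuned model):
    # Still inside the fold loop, right after SVM.fit(X_train, y_train):
    predictions = SVM.predict(X_val)
    kfold_accuracy[i] = accuracy_score(y_val, predictions)
    kfold_precision[i] = precision_score(y_val, predictions)
    kfold_f[i] = f1_score(y_val, predictions)
    kfold_recall[i] = recall_score(y_val, predictions)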
I have a data set with 100 columns of continuous features and a continuous label, and I want to run SVR: extracting the relevant features, tuning the hyperparameters, and then cross-validating the model fit to my data.
I wrote this code:
from sklearn.model_selection import train_test_split, RepeatedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel', fs), ('svr', model)])

# define the grid
grid = dict()
# how many features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1] + 1)]

# define the grid search
# search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
    pipeline,
    # estimator=SVR(kernel='rbf'),
    param_grid={
        'estimator__svr__C': [0.1, 1, 10, 100, 1000],
        'estimator__svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'estimator__svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

for param in search.get_params().keys():
    print(param)

# perform the search
results = search.fit(X_train, y_train)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))
I get the error:
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
steps=[('sel',
SelectKBest(k=10,
score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
When I print estimator.get_params().keys(), as suggested in the error message, I get:
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__sel
estimator__svr
estimator__sel__k
estimator__sel__score_func
estimator__svr__C
estimator__svr__cache_size
estimator__svr__coef0
estimator__svr__degree
estimator__svr__epsilon
estimator__svr__gamma
estimator__svr__kernel
estimator__svr__max_iter
estimator__svr__shrinking
estimator__svr__tol
estimator__svr__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Fitting 5 folds for each of 405 candidates, totalling 2025 fits
But when I change the line:
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
to:
pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])
I get the error:
ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']
Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step into the GridSearchCV?
As a side note, if I comment out pipeline in the GridSearchCV and uncomment estimator=SVR(kernel='rbf'), the cell runs without issue, but in that case I presume I am not incorporating the feature selection, as it's not called anywhere. I have seen some previous SO questions, e.g. here, but they don't seem to answer this specific question.
Is there a cleaner way to write this?
The first error message is about the pipeline parameters, not the search parameters, and indicates that your param_grid is bad, not the pipeline step names. Running pipeline.get_params().keys() should show you the right parameter names. Your grid should be:
param_grid={
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
    'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
},
I don't know how substituting the plain SVR for the pipeline runs; your parameter grid doesn't specify the right things there either...
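For completeness, a minimal sketch of the corrected search, assuming the same 'sel' and 'svr' step names as in the question (the 'sel__k' values here are just placeholders; the question's full range(1, X_train.shape[1]+1) would work the same way):
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

pipeline = Pipeline(steps=[('sel', SelectKBest(score_func=mutual_info_regression)),
                           ('svr', SVR())])

search = GridSearchCV(
    pipeline,
    param_grid={
        # Pipeline parameters are addressed as <step name>__<parameter name>.
        'sel__k': [5, 10, 20],  # placeholder values for illustration
        'svr__C': [0.1, 1, 10, 100, 1000],
        'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    cv=RepeatedKFold(n_splits=10, n_repeats=3, random_state=1),
    n_jobs=-1)

search.fit(X_train, y_train)  # X_train, y_train as defined in the question
print(search.best_params_)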
I am trying to fix the randomization in my code, but every time I run it I get a different best score and different best parameters. The results are not too far apart, but how can I fix things so that I get the same best score and parameters on every run?
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

clf = DecisionTreeClassifier(random_state=None)

parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 50],
                  'max_features': [10, 20, 30, 40, 50]
                  }

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(X_train, y_train)

grid_search = GridSearchCV(clf, param_grid=parameter_grid, cv=skf, scoring='precision')
grid_search.fit(X_train, y_train)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred), "\n")
print(classification_report(y_test, y_pred), "\n")
In order to get reproducible results, every source of randomness in your code must be explicitly seeded (and even then, you must be careful that the implicit assumption of all other being equal actually holds - see Why does the importance parameter influence performance of Random Forest in R? for a case where it does not).
There are three parts in your code that inherently include a random element:
- train_test_split
- DecisionTreeClassifier
- StratifiedKFold
You correctly seed the first one (using random_state=27), but you fail to do so for the other two, leaving random_state=None in both of them.
What you should do is simply replace the two cases of random_state=None in your code with an explicit seed, as you have done for train_test_split; it doesn't have to be any specific number, or even the same for all cases, it just needs to be explicitly set.
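A minimal sketch of those two changes (the seed values themselves are arbitrary; note that for StratifiedKFold the random_state only has an effect, and in recent scikit-learn versions is only accepted, when shuffle=True):
clf = DecisionTreeClassifier(random_state=42)  # explicit seed instead of None

# StratifiedKFold is only random when it shuffles; with the default shuffle=False
# the splits are deterministic and random_state should be left unset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)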
I am experiencing a problem where fine-tuning the hyperparameters using GridSearchCV doesn't really improve my classifiers. I figured the improvement should be bigger than that; the biggest improvement I've gotten for a classifier with my current code is around ±0.03. I have a dataset with eight columns and an unbalanced binary outcome. For scoring I use F1, and I use KFold with 10 splits. I was hoping someone could spot something that is off and that I should look at. Thank you!
I use the following code:
model_parameters = {
    "GaussianNB": {},
    "DecisionTreeClassifier": {
        'min_samples_leaf': range(5, 9),
        'max_depth': [None, 0, 1, 2, 3, 4]
    },
    "KNeighborsClassifier": {
        'n_neighbors': range(1, 10),
        'weights': ["distance", "uniform"]
    },
    "SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0, 15, 30)
    },
    "LogisticRegression": {
        'C': np.linspace(0, 15, 30),
        'penalty': ["l1", "l2", "elasticnet", "none"]
    }
}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

n_splits = 10
scoring_method = make_scorer(lambda true_target, prediction: f1_score(true_target, prediction, average="micro"))
cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)

for model_name, parameters in model_parameters.items():
    # models is a dict with the 5 classifiers
    model = models[model_name]
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, scoring=scoring_method, verbose=False).fit(X_train, y_train)
    cvScore = cross_val_score(grid_search.best_estimator_, X_test, y_test, cv=cv, scoring='f1').mean()
    classDict[model_name] = cvScore
If your classes are unbalanced, then when you do KFold you should keep the proportion between the two targets in every fold.
Unbalanced folds can lead to very poor results.
Check the Stratified K-Folds cross-validator:
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns
stratified folds. The folds are made by preserving the percentage of
samples for each class.
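A minimal sketch of that swap in the loop above, keeping the same number of splits, shuffle setting and seed as the original KFold line:
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold preserves the class proportions of y in every fold and is a
# drop-in replacement for the KFold object passed to GridSearchCV/cross_val_score.
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)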
There are also a lot of techniques to handle an unbalanced dataset. Depending on the context:
- upsampling the minority class (using, for example, resample from sklearn)
- undersampling the majority class (this lib also has some useful tools to do both under/over sampling)
- handling the imbalance within your specific ML model
For example, in SVC there is an argument you can pass when you create the model, class_weight='balanced':
clf_3 = SVC(kernel='linear',
            class_weight='balanced',  # penalize
            probability=True)
which will penalize errors on the minority class more heavily.
You can change your config as follows (note that every value in a GridSearchCV param_grid has to be a list of candidates):
"SVM": {
    'kernel': ["poly"],
    'C': np.linspace(0, 15, 30),
    'class_weight': ['balanced']
}
For LogisticRegression you can set the weights instead, reflecting the proportion of your classes
LogisticRegression(class_weight={0:1, 1:10}) # if problem is a binary one
changing the grid-search dict in the same way:
"LogisticRegression": {
    'C': np.linspace(0, 15, 30),
    'penalty': ["l1", "l2", "elasticnet", "none"],
    'class_weight': [{0: 1, 1: 10}]
}
Anyway, the approach depends on the model used. For a neural network, for example, you can change the loss function to penalize the minority class with a weighted calculation (the same idea as for logistic regression).
I use xgboost to do a multi-class classification of spectrogram images (data link: automotive target classification). The number of classes is 5, the training data includes 20000 samples (5000 per class), the test data includes 5000 samples (1000 per class), and the original image size is 144*400. This is my code snippet:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)

scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

cv_params = {'n_estimators': [100, 200, 300, 400, 500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1, 'n_estimators': 100,
                'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
                'reg_alpha': 0, 'reg_lambda': 1,
                'objective': 'multi:softmax', 'num_class': 5}

model = XGBClassifier(**other_params)
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))
During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators with GridSearchCV (n_estimators in [100, 200, 300, 400, 500]) using the training data, then tested with the test data. Then I tried GridSearchCV with both 'n_estimators' and 'learning_rate' as well.
The best hyperparameters are n_estimators=500 and learning_rate=0.1 with best_score_=0.83. When I use this best estimator to classify, I get 100% correct results on the training data, but on the test data I only get precision of [0.864 0.777 0.895 0.856 0.882] and recall of [0.941 0.919 0.764 0.874 0.753]. I guess n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.
For reducing dimensionality, I tried PCA, but more than n_components=3500 would be needed to achieve 95% variance, so I use downsampling instead, as shown in the code.
Sorry for the incomplete info; I hope it is clear this time. Many thanks!
Why not try Optuna for XGBoost hyperparameter tuning, with pruning and with the early_stopping_rounds parameter of XGBoost?
Here is a notebook of mine as a guide only. The XGBoost version must be 1.6+, though, as early_stopping_rounds is handled differently (passed to the fit() method) in XGBoost versions below 1.6.
https://www.kaggle.com/code/josephramon/sba-optuna-xgboost
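As a rough illustration only (not the notebook's exact code), an Optuna objective for a multi-class XGBClassifier like the one above might look like the sketch below; the search ranges and the validation split are assumptions, pruning is omitted for brevity, and early_stopping_rounds is set on the estimator in the XGBoost 1.6+ style:
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out part of the training data as a validation set for early stopping
# (the 80/20 split is an arbitrary choice for illustration).
X_tr, X_val, y_tr, y_val = train_test_split(train_data, train_label,
                                            test_size=0.2, random_state=27)

def objective(trial):
    params = {
        'n_estimators': 1000,  # upper bound; early stopping picks the effective number
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'objective': 'multi:softmax',
        'num_class': 5,
        'early_stopping_rounds': 50,  # XGBoost >= 1.6: set on the estimator, not in fit()
    }
    model = XGBClassifier(**params)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    return accuracy_score(y_val, model.predict(X_val))

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)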