I am trying to fix the randomization in my code, but every time I run it I get a different best score and different best parameters. The results are not too far apart, but how can I fix things so that I get the same best score and parameters on every run?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 27)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
clf = DecisionTreeClassifier(random_state=None)
parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 50],
                  'max_features': [10, 20, 30, 40, 50]
                  }
skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(X_train, y_train)
grid_search = GridSearchCV(clf, param_grid=parameter_grid, cv=skf, scoring='precision')
grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test,y_pred),"\n")
print(classification_report(y_test,y_pred),"\n")
In order to get reproducible results, every source of randomness in your code must be explicitly seeded (and even then, you must be careful that the implicit assumption of all other things being equal actually holds - see Why does the importance parameter influence performance of Random Forest in R? for a case where it does not).
There are three parts in your code that inherently include a random element:
train_test_split
DecisionTreeClassifier
StratifiedKFold
You correctly seed the first one (using random_state=27), but you fail to do so for the other two, leaving random_state=None in both of them.
What you should do is simply replace the two cases of random_state=None in your code with an explicit seed, as you have done for train_test_split; it doesn't have to be any specific number, or even the same for all cases, it just needs to be explicitly set.
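A minimal sketch of that change (the seed values here are arbitrary; note also that recent scikit-learn versions only accept a random_state for StratifiedKFold when shuffle=True):
clf = DecisionTreeClassifier(random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)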
Related
I have a custom scorer function whose inputs depend on the specific train and validation fold; additionally, the estimator's .predict_survival_function output is also needed. To give a more concrete example:
I am trying to run a GridSearch for a Random Survival Forest (scikit-survival package) with the Integrated Brier Score (IBS)
as the scoring method. The challenge is that the domain of the IBS is data- (and therefore fold-) specific, as it relies on the Kaplan-Meier estimate at some point. Moreover, the .predict_survival_function method needs to be called every time during the scoring evaluation step and not only at the end of it.
It seems that I managed to deal with the first issue by creating the following function:
def IB_time_interval(y_train, y_test):
    y_times_tr = [i[2] for i in y_train]
    y_times_te = [i[2] for i in y_test]
    T1 = np.percentile(y_times_tr, 5, interpolation='higher')
    T2 = np.percentile(y_times_tr, 95, interpolation='lower')
    T3 = np.percentile(y_times_te, 5, interpolation='higher')
    T4 = np.percentile(y_times_te, 95, interpolation='lower')
    return np.linspace(np.maximum(T1, T3), np.minimum(T2, T4))
That is robust enough to work throughout all the folds. However, I am unable to retrieve the estimator's predictions during the grid search phase, as a non-fitted copy of it seems to be passed instead every time the custom scorer function is called. The workaround I tried is to re-fit the estimator inside the scoring function, but not only is this conceptually wrong, it also raises errors.
The custom scorer function looks like the following:
def IB_scorer(y_true, y_pred, times=times_linspace, y=y, clf=rsf):
    rsf.fit(X_train, y_train)  # <--- conceptually wrong
    survs_test = rsf.predict_survival_function(X_test, return_array=False)  # <---
    T1, T2 = survs_test[0].x.min(), survs_test[0].x.max()
    mask2 = np.logical_or(times >= T2, times < T1)  # mask the outer interval
    times = times[~mask2]
    # preds_test has shape (n_samples, n_times)
    preds_test = np.asarray([[fn(t) for t in times] for fn in survs_test])
    return integrated_brier_score(y, y_true, preds_test, times)
and I create the scoring object immediately afterwards:
trial_IB_scorer = make_scorer(IB_scorer, greater_is_better=False)
Any suggestions? It would be great to be able to use GridSearch with more complex scoring functions, especially for the survival analysis case!
PS. I will paste the rest of the minimal working code here:
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import integrated_brier_score
from sksurv.datasets import load_breast_cancer
X, y = load_breast_cancer()
X = X.drop(["er", "grade"], axis=1)
y_cens = np.array([i[0] for i in y])  # censoring status, 1 or 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True,
                                                    random_state=0,
                                                    stratify=y_cens)
param_grid = {
    'max_depth': [4, 20, None],
    'max_features': ["sqrt", None]
}
rsf = RandomSurvivalForest(n_jobs=1, random_state=0)
times_linspace = IB_time_interval(y_train, y_test)
clf = GridSearchCV(rsf, param_grid, refit=True, n_jobs=1,
                   scoring=trial_IB_scorer)
clf.fit(X_train, y_train)
print("final score clf", clf.score(X_train, y_train))
print(clf.best_params_)
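One direction worth trying (an assumption on my part, not something stated in the post above): GridSearchCV also accepts a plain callable with the signature scorer(estimator, X, y) as its scoring argument, and that callable receives the clone that has already been fitted on the training part of each fold, so no re-fitting is needed. A rough sketch along the lines of the IB_scorer above, with the score negated because GridSearchCV maximizes the returned value (IB_fold_scorer is a hypothetical name; the defaults mirror the globals used in the original scorer):
def IB_fold_scorer(estimator, X_val, y_val, times=times_linspace, y_ref=y):
    # estimator is the clone GridSearchCV has already fitted on the training part of the fold
    survs_val = estimator.predict_survival_function(X_val, return_array=False)
    T1, T2 = survs_val[0].x.min(), survs_val[0].x.max()
    times = times[np.logical_and(times >= T1, times < T2)]  # keep times inside the supported range
    preds_val = np.asarray([[fn(t) for t in times] for fn in survs_val])
    return -integrated_brier_score(y_ref, y_val, preds_val, times)

clf = GridSearchCV(rsf, param_grid, refit=True, n_jobs=1, scoring=IB_fold_scorer)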
When I do something like:
scoring = ["accuracy", "balanced_accuracy", "f1", "precision", "recall", "roc_auc"]
scores = cross_validate(SVC(), my_x, my_y, scoring = scoring , cv=5, verbose=3, return_train_score=True, return_estimator=True)
how can I get a confusion matrix of a single validation run, e.g. the first one or ideally the best one?
I don't need a plot or something beautiful, only the numbers. If I could see at least the split, then I could recalculate it.
If you want to do something quite specific during each cross-validation iteration, it is often easiest to drive the loop yourself with a CV splitter like StratifiedKFold:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
svm = SVC()
kf = StratifiedKFold(n_splits=5)
scores = []
results = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))  # use another scoring metric if preferred
    results.append(confusion_matrix(y_test, y_pred))
This will compute the confusion matrix for each of the five iterations and store them in results. If you want to get the confusion matrix of the best validation round, you can additionally compute the scoring metric in the loop as well (see the scores list) and retrieve the corresponding confusion matrix.
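For instance, a small follow-up sketch:
best_round = scores.index(max(scores))  # index of the best-scoring validation round
print(results[0])           # confusion matrix of the first round
print(results[best_round])  # confusion matrix of the best round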
How can I use RandomizedSearchCV or GridSearchCV on only 30% of the data in order to speed up the process?
My X.shape is (94456, 100) and I'm trying to use RandomizedSearchCV or GridSearchCV, but it's taking a very long time. I've been running my code for several hours but still have no results.
My code looks like this:
# Random Forest
param_grid = [
    {'n_estimators': np.arange(2, 25), 'max_features': [2, 5, 10, 25],
     'max_depth': np.arange(10, 50), 'bootstrap': [True, False]}
]
clf = RandomForestClassifier()
grid_search_forest = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search_forest.fit(X, y)
rf_best_model = grid_search_forest.best_estimator_
# Decision Tree
param_grid = {'max_depth': np.arange(1, 50), 'min_samples_split': [20, 30, 40]}
grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search_dec_tree.fit(X, y)
dt_best_model = grid_search_dec_tree.best_estimator_
# K Nearest Neighbor
knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid_search_knn = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid_search_knn.fit(X, y)
knn_best_model = grid_search_knn.best_estimator_
You can always sample a part of your data to fit your models. Although not designed for this purpose, train_test_split can be useful here (it takes care of shuffling, stratification etc., which you would otherwise have to handle yourself in a manual sampling):
from sklearn.model_selection import train_test_split
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)
By asking for test_size=0.70, your training data X_train will now be 30% of your initial set X.
You should now replace all the .fit(X, y) statements in your code with .fit(X_train, y_train).
On a more general level, all these np.arange() statements in your grid look like overkill - I would suggest selecting some representative values in a list instead of going through a grid search in that detail. Random Forests in particular are notoriously insensitive to the number of trees n_estimators, and adding one tree at a time is hardly useful - go for something like 'n_estimators': [50, 100]...
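For illustration only (the exact values are placeholders, not a recommendation tuned to your data), a leaner grid might look like:
param_grid = [
    {'n_estimators': [50, 100],
     'max_features': [5, 10, 25],
     'max_depth': [10, 30, None],
     'bootstrap': [True, False]}
]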
ShuffleSplit fits very well to this problem. You can define your cv as:
cv = ShuffleSplit(n_splits=1, test_size=.3)
This means setting aside and using 30% of your training data for validating each hyper-parameter setting. cv=5, on the other hand, carries out a 5-fold cross-validation, which means 5 fit/predict cycles for each hyper-parameter setting.
So, this also requires very minimal change to your code. Just replace those cv=5 or cv=10 inside GridSearchCV with cv = ShuffleSplit(n_splits=1, test_size=.3) and you're done.
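For example, for the decision tree search (adding a random_state to the splitter is my own addition, purely so the single split is repeatable):
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=1, test_size=.3, random_state=0)
grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=cv, scoring='accuracy')
grid_search_dec_tree.fit(X, y)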
I want to use a Random Forest Classifier on imbalanced data, where X is a np.array representing the features and y is a np.array representing the labels (90% of the labels are 0, 10% are 1). As I was not sure how to do stratification within cross-validation, and whether it makes a difference, I also cross-validated manually with StratifiedKFold. I would expect not identical but somewhat similar results; as this is not the case, I guess that I am using one of the methods incorrectly, but I don't understand which one. Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score
rfc = RandomForestClassifier(n_estimators=200,
                             criterion="gini",
                             max_depth=None,
                             min_samples_leaf=1,
                             max_features="auto",
                             random_state=42,
                             class_weight="balanced")
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify=y)
I also tried the classifier without the class_weight argument. From here I proceed to compare both methods with the F1 score:
cv = cross_val_score(estimator=rfc,
                     X=X_train_val,
                     y=y_train_val,
                     cv=10,
                     scoring="f1")
print(cv)
The 10 f1-scores from cross validation are all around 65%.
Now the StratifiedKFold:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X_train_val, y_train_val):
    X_train, X_val = X_train_val[train_index], X_train_val[test_index]
    y_train, y_val = y_train_val[train_index], y_train_val[test_index]
    rfc.fit(X_train, y_train)
    rfc_predictions = rfc.predict(X_val)
    print("F1-Score: ", round(f1_score(y_val, rfc_predictions), 3))
The 10 F1 scores from StratifiedKFold are around 90%. This is where I get confused, as I don't understand the large deviation between the two methods. If I just fit the classifier to the training data and apply it to the test data, I also get F1 scores of around 90%, which leads me to believe that my way of applying cross_val_score is not correct.
One possible reason for the difference is that cross_val_score uses StratifiedKFold with the default shuffle=False parameter, whereas in your manual cross-validation using StratifiedKFold you have passed shuffle=True. Therefore it could just be an artifact of the way your data is ordered that cross-validating without shuffling produces worse F1 scores.
Try passing shuffle=False when creating the skf instance to see whether its scores then match those from cross_val_score; if you do want shuffling together with cross_val_score, you can manually shuffle the training data before applying it.
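Alternatively (this goes a bit beyond the answer above, but is standard scikit-learn behaviour), cross_val_score accepts a CV splitter object as its cv argument, so you can pass the shuffled StratifiedKFold directly and have both runs use exactly the same folds:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv = cross_val_score(estimator=rfc, X=X_train_val, y=y_train_val, cv=skf, scoring="f1")
print(cv)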
Each time I run this code, I get a different value for the print statement. I'm confused why it does that, because I specifically included the random_state parameter for the train/test split. (On a side note, I hope I'm supposed to encode the data; it was giving "ValueError: could not convert string to float" otherwise.)
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data',
                 names=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety', 'acceptability'])
# encode the categorical strings as integers (the algorithms can't handle raw strings)
df = df.apply(LabelEncoder().fit_transform)
print(df)
X = df.reindex(columns=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety'])
y = df['acceptability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
# decision trees classification
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)
y_true = y_test
y_pred = clf.predict(X_test)
print(math.sqrt(mean_squared_error(y_true, y_pred)))
DecisionTreeClassifier also takes a random_state param: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
All you did was ensure that the train/test splits are repeatable, but the classifier also needs its own seed to be the same on each run.
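For example (a one-line sketch; the seed value 0 is arbitrary):
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)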
Update
Thanks to @Chester VonWinchester for pointing out https://github.com/scikit-learn/scikit-learn/issues/8443: due to sklearn's implementation choice, the tree construction can be non-deterministic even with max_features=None, although that setting should mean that all features are considered.
There is further information and discussion in the link above.