Nested cross-validation for predictions using groups - python

I'm not able to do something and I would like to know whether it's a bug or the expected behaviour.
I was trying to run a nested cross-validation on a dataset in which every sample belongs to a patient. To avoid training and testing on the same patient, I've seen that you implement a "group" mechanism, and GroupKFold seems to be the right one in my case.
As my classifier takes different parameters, I use GridSearchCV to set the hyperparameters of my model. In the same way, I suppose the training and testing folds have to come from different patients.
(For those interested in nested cross-validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html )
I proceed this way:
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_predict

pipe = Pipeline([('pca', PCA()),
                 ('clf', SVC()),
                 ])

# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param, cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)

# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels, cv=GroupKFold(n_splits=5), groups=groups)
And I get:
File "_split.py", line 489, in _iter_test_indices
raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.
Looking at the code, it appears that groups is not passed on by the _fit_and_predict() method to the estimator, so the groups that are needed cannot be used.
Can I have some clues on this?
Have a nice day,
Best regards

I had the same problem and I couldn't find any other way than implementing it in a more hands-on fashion:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)
nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    # The inner CV also has to respect the groups of the outer training fold
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids], groups=groups[train_ids])
    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100,
                                   cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model on the outer training fold
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)
    # Score the tuned model on the outer test fold
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))
print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))
I hope this helps.
Best regards,
Felix

Related

I get one set of feature importances after LOOCV - is it the result of all the LOOCV models or of one mean model? (random forest)

I'm not used to LOOCV yet, and I've been curious about the problem in the title.
I did leave-one-out cross-validation for my random forest model.
My code is like this.
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
y_tests, y_preds = [], []
for train_index, test_index in loo.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestRegressor()
    model.fit(x_train, y_train.values.ravel())
    y_pred = model.predict(x_test)
    #y_pred = [np.round(x) for x in y_pred]
    y_tests += y_test.values.tolist()[0]
    y_preds += list(y_pred)

rr = metrics.r2_score(y_tests, y_preds)
ms_error = metrics.mean_squared_error(y_tests, y_preds)**0.5
After that, I wanted to get the feature importances of my model like this.
features = x.columns
sorted_idx = model.feature_importances_.argsort()
It's pretty different from what I expected.
In the LOOCV process, my computer builds many different models using different train/test splits of my original data - as many models as there are rows in the original data.
So I was thinking that the feature importances should be multiple, as many as the length of the original data, because the test set of each LOOCV iteration is just one sample.
But the result was not multiple importances, just one.
It was only one set of feature importances, as if it were calculated for only one split (as if LOOCV hadn't been added to my code at all).
Then why did I get only one importance? I want to understand the reason for it.
Thank you for reading my question.
I want to know the reason why I got only one feature importance even though LOOCV was added to my code.
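If it helps, here is a rough sketch of what collecting one importance vector per LOOCV model could look like, reusing x and y from the code above (fold_importances and mean_importances are just illustrative names):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

# In the loop above, `model` is overwritten in each iteration, so after the loop
# model.feature_importances_ only reflects the model fitted in the last fold.
fold_importances = []
for train_index, _ in LeaveOneOut().split(x):
    fold_model = RandomForestRegressor()
    fold_model.fit(x.iloc[train_index], y.iloc[train_index].values.ravel())
    fold_importances.append(fold_model.feature_importances_)

mean_importances = np.mean(fold_importances, axis=0)  # one averaged importance per feature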

Using Keras Multi Layer Perceptron with Cross Validation prediction [duplicate]

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    ...

def create_model():
    # create your model using this function
    ...

def train_evaluate(model, x_train, y_train, x_test, y_test):
    # fit and evaluate here.
    ...

if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that the knowledge representation of a neural network is in its synaptic weights, and that during training the weights are updated so as to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
For better training and assessment of neural network performance, a commonly used method is cross-validation, which returns partitions of the data set for training and evaluation of the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly your purpose is for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV, which is often only implied (and frequently missed by beginners), is that once you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train your model again, this time on the entire available data.
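For concreteness, a minimal sketch of that final refit step, reusing the create_model placeholder from the question (final_model is just an illustrative name):

# After CV has given you a satisfactory performance estimate,
# go back and fit a fresh model on ALL the available data.
final_model = create_model()
final_model.fit(X, Y)  # no held-out fold here; the CV above already provided the estimate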
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model

estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # Mean MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking" or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
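One way to keep that constraint automatic is to put the feature extraction step inside a scikit-learn Pipeline, so PCA is re-fitted on each training fold and only applied to the matching test fold. A minimal sketch (the estimator and the number of components here are just illustrative placeholders, with X and y assumed to be your full feature matrix and labels):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('clf', SVC())])
# During CV, PCA.fit happens on the training fold only; the test fold is only transformed.
scores = cross_val_score(pipe, X, y, cv=5)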
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful - it lets every data point in your whole dataset be used as a simulation of new data.
And now, to answer your question: every cross-validation should follow this pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because after all, you'll first train your model and then use it on new data. In your second approach you cannot treat it as a mimicry of the training process, because e.g. in the second fold your model would keep information from the first fold - which is not equivalent to your training procedure.
Of course, you could apply a training procedure which uses 10 folds of consecutive training in order to fine-tune the network. But then it is not cross-validation - you'll still need to evaluate this procedure using some kind of schema like the one above.
The commented-out function bodies make this a little less obvious, but the idea is to keep track of your model's performance as you iterate through your folds and, at the end, to report either those per-fold performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
import numpy as np

def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1
print(scores)
print(scores.mean())
So yes, you do want to create a new model for each fold, as the purpose of this exercise is to determine how your model, as designed, performs on all segments of the data - not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train models with varying hyperparameters using the cross-validation splits and keep track of the performance per split and overall. In the end you will get a much better idea of which hyperparameters allow the model to perform best. For a much more in-depth explanation, see sklearn Model Selection and pay particular attention to the sections on Cross-validation and Grid Search.
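A minimal sketch of that combination, reusing the create_model placeholder from the question via the Keras scikit-learn wrapper (the parameter grid here is only illustrative):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

estimator = KerasClassifier(build_fn=create_model, verbose=0)
param_grid = {'epochs': [50, 100], 'batch_size': [10, 32]}   # illustrative hyperparameter grid
search = GridSearchCV(estimator, param_grid, scoring='accuracy', cv=StratifiedKFold(n_splits=10))
search.fit(X, Y)                       # runs the full CV loop for every parameter combination
print(search.best_params_, search.best_score_)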

Splitting a data set for K-fold Cross Validation in Sci-Kit Learn

I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict, as I believe this is the function I am going to need.
What I am having trouble with, is the splitting of the data set. As far as I am aware, in the usual case, the train_test_split() method is used to split the data set into 2 - the train and the test. From my understanding, for K-fold validation you need to further split the train set into K-number of parts.
My question is: do I need to split the data set at the beginning into train and test, or not?
It depends. My personal opinion is yes, you have to split your dataset into a training and a test set, then you can do cross-validation on your training set with K folds. Why? Because it is important to test, after training and fine-tuning your model, on unseen examples.
But some people just do a cross-val. Here is the workflow I often use:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# Data partition
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on multiple models to see which one gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the score you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then I tune the hyperparameters of the best (or top-n best) models using another cross-val
for param in my_param:
    model = model_with_param
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test your tuned model on the test set
y_pred = model.predict(X_test)
plot_or_print_metric(y_pred, Y_test)
Short answer: NO
Long answer.
If you use K-fold cross-validation, you do not usually split initially into train/test.
There are a lot of ways to evaluate a model. The simplest one is to use a train/test split, fit the model on the train set and evaluate it on the test set.
If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
It's up to you what to choose but I would go with K-Folds or LOOCV.
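In code, the "no initial split" option could look roughly like this (clf stands for whatever estimator you are evaluating, with X and Y your full feature matrix and labels):

from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# K-Folds: each sample is used for testing exactly once across the 5 folds
kfold_scores = cross_val_score(clf, X, Y, cv=KFold(n_splits=5, shuffle=True, random_state=21))
print(kfold_scores.mean(), kfold_scores.std())

# LOOCV: one sample held out per iteration (more expensive, but trains on almost all data each time)
loocv_scores = cross_val_score(clf, X, Y, cv=LeaveOneOut())
print(loocv_scores.mean())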
K-Folds procedure is summarised in the figure (for K=5):

Using a fitted model in make_scorer for GridSearchCV

I am creating a custom scorer for a GridSearchCV object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of the dataframes. The other dataframe is needed to get probabilities, and these probabilities will be used in the scoring function.
I had considered concatenating the dataframes, but there is no ground truth to one of the dataframes. This would create an issue with passing the y_true.
I had also tried to pass the model to the custom score function, but I got a traceback that the model was not fit. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True, clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
    ...
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. The approach there is to write a function of the form scorer(estimator, X, y), but I think this will still have the issue that the model will be trained on all of the data. Is there any way to pass the estimator to the custom score function used by GridSearchCV?
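For what it's worth, scikit-learn accepts a plain callable with the signature scorer(estimator, X, y) as the scoring argument, and during the search that callable receives the estimator clone that GridSearchCV has already fitted on the current training fold only, together with the held-out fold. A rough sketch under that assumption (my_metric_from_probas is a placeholder for your own scoring logic, grid is the grid from the snippet above, and X_info is the second dataframe from your fit method, assumed to be accessible in scope):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def custom_scorer(estimator, X_val, y_val):
    # estimator is already fitted on the current training fold by GridSearchCV
    proba_val = estimator.predict_proba(X_val)     # probabilities on the held-out fold
    proba_info = estimator.predict_proba(X_info)   # probabilities on the second dataframe (no ground truth needed)
    return my_metric_from_probas(y_val, proba_val, proba_info)  # placeholder metric, higher = better

model = GridSearchCV(estimator=GradientBoostingClassifier(),
                     param_grid=grid,
                     scoring=custom_scorer,
                     cv=3)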

Why should we perform K-fold cross-validation on the test set?

I was working on a k-nearest neighbours problem set. I couldn't understand why they are performing K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performed on the entire test data, rather than doing a cross-validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, Y, test_size=0.33, random_state=42)
k = np.arange(20) + 1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)

def computeTestScores(test_x, test_y, clf, cv):
    kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
    scores = []
    for _, test_index in kFolds:
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores

scores = computeTestScores(test_x=X_test, test_y=Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators, but it is a useful one to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases, partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fitted to a particular validation set. K-fold cross-validation is one strategy.
Another use for splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data?'. The hope is that this is more representative of what you might see as real-world performance on novel data. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.
