Using Keras Multi Layer Perceptron with Cross Validation prediction [duplicate] - python

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold
def load_data():
    # load your data using this function
    ...
def create_model():
    # create your model using this function
    ...
def train_evaluate(model, x_train, y_train, x_test, y_test):
    # fit and evaluate here
    ...
if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that a network's knowledge is represented in its synaptic weights, and that during training the weights are updated to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
To train and assess a neural network more reliably, a common method is cross-validation, which partitions the data set into subsets for training and for evaluating the model.
My question is the following.
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is the code written like this:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or is my theory wrong?

If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly your purpose is for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2, etc.), which is essentially just training on your whole data set; it is certainly not cross-validation.
That said, a final step after the CV, which is often only implied (and frequently missed by beginners), is that once you are satisfied with your chosen hyperparameters and/or the model performance given by your CV procedure, you go back and retrain your model, this time on the entire available data.
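As a rough illustration of the two steps described above (cross-validate with a fresh model per fold, then retrain once on everything), here is a minimal sketch. It is not the asker's code: the synthetic data and the scikit-learn MLPClassifier are stand-ins chosen so the example runs without Keras, and create_model is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: the real X, Y would come from load_data()).
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)

def create_model():
    # Stand-in for the Keras network: a small scikit-learn MLP.
    return MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)

# Step 1: cross-validation, with a fresh model for every fold.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train, test in kfold.split(X, Y):
    model = create_model()
    model.fit(X[train], Y[train])
    scores.append(model.score(X[test], Y[test]))
print("CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))

# Step 2: once satisfied with the estimate, retrain one final model on all the data.
final_model = create_model()
final_model.fit(X, Y)
```

The CV scores describe the training procedure, not any single fitted network; the model that gets deployed is final_model.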

You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Note: recent TensorFlow releases no longer ship this wrapper; the SciKeras package provides a replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model
estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 CPUs
results.mean()  # mean score over all folds (this wrapper scores with the negated loss, i.e. -MSE)

I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking", or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this can go wrong is feature extraction with something like PCA or ICA. In that case you must fit the PCA on your training set only, and then apply the transformation matrix learned from the training set to the test set.
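One common way to get this behaviour in scikit-learn is to put the PCA step inside a Pipeline, so that each CV fold refits the PCA on its own training part only. The sketch below assumes synthetic data and a logistic regression classifier purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic placeholder data (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# PCA is part of the estimator, so cross_val_score fits it on the training
# part of each fold and only applies (transforms) it to the held-out part.
pipe = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Fitting the PCA on the full dataset before calling cross_val_score would let information from the held-out folds leak into the features.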

The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful: it lets every data point in your whole dataset be used as a simulation of new data.
And now, to answer your question: every cross-validation should follow this pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because, after all, you'll first train your model and then use it on new data. Your second approach cannot be treated as a mimicry of the training process because, e.g., in the second fold your model would keep information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure which uses 10 folds of consecutive training in order to fine-tune the network. But that is not cross-validation; you would still need to evaluate this procedure using a schema like the one above.
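To make the "information kept from the first fold" point concrete, the following sketch only looks at the indices produced by KFold. It counts, for each fold, how many of that fold's test samples a single reused model would already have trained on in earlier folds. The ten-sample array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten placeholder samples
kfold = KFold(n_splits=5)

seen = set()   # indices a single, reused model has trained on so far
leaked = []    # per fold: test samples that were already trained on
for train, test in kfold.split(X):
    leaked.append(len(seen.intersection(test.tolist())))
    seen.update(train.tolist())

print(leaked)  # [0, 2, 2, 2, 2]
```

From the second fold onward every test sample has already been used for training, so the scores no longer simulate new data. Creating a new model inside the loop resets seen to empty for each fold.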

The commented-out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and, at the end, report either those per-fold metrics or an averaged overall performance. For example:
The train_evaluate function would ideally return some accuracy score for each split, and the scores can be combined at the end.
import numpy as np
from sklearn.model_selection import StratifiedKFold
def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    # scikit-learn style scoring; a Keras model would use model.evaluate(x_test, y_test) instead
    return model.score(x_test, y_test)
X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
for idx, (train, test) in enumerate(kFold.split(X, Y)):
    model = create_model()  # a fresh model for every fold
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
print(scores)
print(scores.mean())
So yes, you do want to create a new model for each fold, because the purpose of this exercise is to determine how your model, as designed, performs on all segments of the data, not just on one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.
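A minimal sketch of combining cross-validation with a grid search is shown below. The data is synthetic, the scikit-learn MLPClassifier stands in for the Keras network, and the parameter grid is an arbitrary illustration rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic placeholder data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: 2 x 2 = 4 hyperparameter combinations.
param_grid = {"hidden_layer_sizes": [(8,), (16,)], "alpha": [1e-4, 1e-2]}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)  # every combination is cross-validated with fresh models

print(search.best_params_, search.best_score_)
```

With the default refit=True, GridSearchCV also retrains the best combination on all the data passed to fit, which is available as search.best_estimator_.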

Related

Keras prediction incorrect with scaler and feature selection

I built an application that trains a Keras binary classifier model (0 or 1) at a regular interval (hourly, daily) on the new data. The data preparation, training and testing work well, or at least as expected. It tests different features and scales them with MinMaxScaler (some values are negative).
On live data predictions with a single data point, the values are unrealistic (around 0.9987 to 1 most of the time, which is inaccurate). Since the result should be how close to "1" the prediction is, getting such high numbers constantly raises alerts.
The code for live prediction is as follows.
current_df is a pandas dataframe that contains the single row of data pulled live, plus the column headers. We select the "features" (we load the features from the db because we implement dynamic feature selection when training the model, which could mean that every model has different features).
Get the features as a list:
# Convert literal str to list
features = ast.literal_eval(features)
Then select only the features that I need in the dataframe:
# Select the features
selected_df = current_df[features]
Get the values as a list:
# Get the values of the df
current_list = selected_df.values.tolist()[0]
Then I reshape it:
# Reshape to allow scaling and predicting
current_list = np.reshape(current_list, (-1, 1))
If I call "transform" instead of "fit_transform" in the line above, I get the following error: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Reshape again:
# Reshape to be able to scale
current_list = np.reshape(current_list, (1, -1))
Load the model using Keras (model_location is a Path) and predict:
# Loads the model from the local folder
reconstructed_model = keras.models.load_model(model_location)
prediction = reconstructed_model.predict(current_list)
prediction = prediction.flat[0]
Updated
The data gets scaled using fit_transform and transform (MinMaxScaler although it can be Standard Scaler):
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
And this is run when training the model (the "model" config is not shown):
# Compile the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['binary_accuracy'])
# build the model
model.fit(X_train, y_train, epochs=epochs, verbose=0)
# Evaluate using Keras built-in function
scores = model.evaluate(X_test, y_test, verbose=0)
testing_accuracy = scores[1]
# create model with sklearn KerasClassifier for evaluation
eval_model = KerasClassifier(model=model, epochs=epochs, batch_size=10, verbose=0)
# Evaluate model using RepeatedStratifiedKFold
accuracy = ML.evaluate_model_KFold(eval_model, X_test, y_test)
# Predict testing data
pred_test= model.predict(X_test, verbose=0)
pred_test = pred_test.flatten()
# extract the predicted class labels
y_predicted_test = np.where(pred_test > 0.5, 1, 0)
Regarding feature selection, the features are not always the same: I use either SelectKBest (10 or 15 features) or RFECV, and select the trained model with the highest accuracy, meaning the features can be different.
Is there anything I'm doing wrong here? I'm thinking maybe the scaling should be done before the feature selection or there's some issue with the scaling (since maybe some values might be 0 when training and 100 when using it and the features are not necessarily the same when scaling).
The issue seems to stem from the StandardScaler / MinMaxScaler. The following example shows how to apply the former. Note that if separate scripts handle learning and prediction, the scaler also needs to be serialized and loaded at prediction time.
Set up a classification problem:
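from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)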
Fit a StandardScaler instance on the training set and use the same parameters to .transform the test set:
import pickle
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# Train time: Serialize the scaler to a pickle file.
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)
# Test time: Load the scaler and apply to the test set.
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)
X_test = new_scaler.transform(X_test)
Which means that the model should be fit on features with similar distributions:
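import numpy as np
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(100),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid")])  # sigmoid output: a probability in [0, 1]
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
model.fit(X_train, y_train, epochs=25)
y_pred = np.where(model.predict(X_test)[:, 0] > 0.5, 1, 0)
print(accuracy_score(y_test, y_pred))
# e.g. 0.8708 (the exact value varies between runs)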
Alexander's answer is correct; I think there is just some confusion between testing and live prediction. What he said regarding the testing step is equally applicable to the live prediction step. After you've called scaler.fit_transform on your training set, add the following code to save the scaler:
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)
Then, during the live prediction step, you don't call fit_transform. Instead, you load the scaler saved during training and call transform:
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)
# Load features, reshape them, etc
# Scaling step
current_list = new_scaler.transform(current_list)
# Features are scaled properly now, put the rest of your prediction code here
You call fit_transform only once per model, during the training step on your training pool. After that (during testing, or when calculating predictions after model deployment) you never call it again; you only call transform. Treat the scaler as part of the model. Naturally, you fit the model on the training set and then during testing and live prediction you use the same model, never refitting it. The same should be true for the scaler.
If you call scaler.fit_transform on the live prediction features, it creates a new scaler that has no prior knowledge of the feature distribution in the training set.
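The difference can be seen on a single live row. In the sketch below the training data and the live row are made up, and the scaler is serialized in memory with pickle.dumps rather than to a file, just to keep the example self-contained.

```python
import pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training features, some of them negative (assumption).
rng = np.random.default_rng(0)
X_train = rng.uniform(-50, 100, size=(200, 3))

# Training time: fit the scaler once and serialize it.
scaler = MinMaxScaler().fit(X_train)
blob = pickle.dumps(scaler)

# Live prediction: a single row of shape (1, n_features).
live_row = np.array([[10.0, -20.0, 75.0]])

# Correct: reuse the training-time scaler and only transform.
good = pickle.loads(blob).transform(live_row)

# Wrong: refitting on the live row. With one sample, min == max for every
# feature, so MinMaxScaler maps every value to 0.
bad = MinMaxScaler().fit_transform(live_row)

print(good)
print(bad)
```

Whatever scaler is refit on live data, the model ends up receiving inputs on a different scale from the one it was trained on, which is consistent with predictions that saturate near one value.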

Performing voting for classification tasks

I am wondering whether it is possible to do voting for classification tasks. I have seen plenty of blogs explaining how to use voting for regression, as shown below.
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# training all the models on the training dataset
model_1.fit(X_train, y_target)
model_2.fit(X_train, y_target)
model_3.fit(X_train, y_target)
# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)
# final prediction: the average of the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
That can be done.
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
# initializing all the model objects with default parameters
model_1 = svm.SVC(kernel='rbf')
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()
# Making the final model using a voting classifier
final_model = VotingClassifier(estimators=[('svc', model_1), ('xgb', model_2), ('rf', model_3)], voting='hard')
# applying 10-fold cross-validation
scores = cross_val_score(final_model, X_all, y, cv=10, scoring='accuracy')
print(scores)
print('Model accuracy score : {0:0.4f}'.format(scores.mean()))
You can add more than three machine learning models if necessary.
Note that I have applied cross-validation here to get the accuracy.
Of course you can use the same approach for classes; only the voting uses a different function. This is how Random Forests arrive at their prediction (the individual decision trees in the forest "vote" for a common prediction). You can, for example, take a majority vote over all classifiers. Or you can use the individual predictions to formulate a probability for your prediction: for example, each class could get the fraction of votes it received as its output.
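Both ideas, the majority vote and the fraction of votes per class, can be sketched with plain NumPy. The prediction matrix below is made up for illustration; in practice each row would come from one fitted classifier's predict call.

```python
import numpy as np

# Hypothetical hard predictions: 3 classifiers (rows) x 5 samples (columns).
preds = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 0],
])
n_classes = 3

# votes[i, c] = number of classifiers that predicted class c for sample i
votes = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)], axis=1)

# Majority vote: the class with the most votes for each sample.
majority = votes.argmax(axis=1)

# Fraction of votes per class, usable as a rough class probability.
fractions = votes / preds.shape[0]

print(majority)      # [0 1 1 0 2]
print(fractions[0])  # two of three votes for class 0, one for class 1
```

Note that argmax breaks ties in favour of the lowest class index, which may or may not be the tie rule you want.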

Cross validation and logistic regression

I am analyzing a dataset from kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test sets, and want to use cross-validation to ensure the highest accuracy possible. I did some pre-processing and applied the dummy function to the categorical features, got to a certain point in the code, and I don't know how to proceed. I can't figure out how to use the results of the cross-validation; it's not so straightforward.
This is what I got so far:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler
X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values # features
Y = data_Enco.iloc[:, 6] # labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')
# evaluate model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good: I got the results of the model for each of the 10 splits. But now what? How do I build a confusion matrix? How do I calculate the recall, precision, etc.? I have the right code without cross-validation; I just don't know how to adapt it. How do I use the scores returned by the cross_val_score function?
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train) # Train the model
predictions_log = logisticModel.predict(scaled_X_test)
## Scoring the model
logisticModel.score(scaled_X_test, Y_test)
## Confusion Matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the logistic regression model:\n')
print('Predictions: ', Y_pred, '\n\n')
print('Real Data: ', real_data, '\n')
cmtx = pd.DataFrame(
    confusion_matrix(real_data, Y_pred, labels=[0, 1]),
    index=['real 0: ', 'real 1:'], columns=['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ', accuracy_score(real_data, Y_pred))
print('Precision score is: ', precision_score(real_data, Y_pred))
print('Recall Score is: ', recall_score(real_data, Y_pred))
print('F1 Score is: ', f1_score(real_data, Y_pred))
The performance of a model on the training dataset is not a good estimator of the performance on new data because of overfitting.
Cross-validation is used to obtain an estimation of the performance of your model on new data, i.e. without overfitting. And you correctly applied it to compute the mean and variance of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used to do model selection. Say you have two logistic regression models that use different sets of independent variables. E.g., one is using only age and gender while the other one is using age, gender, and bmi. Or you want to compare logistic regression with an SVM model.
I.e. you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training dataset accuracies of all the models because those are spoiled by overfitting. And if you use the performance on the test dataset for choosing the best model, the test dataset becomes part of the training, you will have leakage, and thus the performance on the test dataset cannot be used anymore for a final, untainted performance measure. That is why cross-validation is used which creates those splits that contain different versions of validation sets.
So the idea is to
apply cross-validation to each of your candidate models,
use the scores of those cross-validations to choose the best model,
retrain that best model on the complete training dataset to get a final version of your best model, and
finally apply this final version to the test dataset to obtain an untainted evaluation.
But note that those four steps are for model selection. You have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, and maybe use CV to get an additional estimate of the performance on new data, but that's it. And you have already done this. There is no "selection of the best model"; that only applies if you have several models as described above, e.g. logistic regression and SVM.
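The model-selection steps listed above can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic (not the stroke dataset), the two candidates are a logistic regression and an SVM, and the scaler is placed in a pipeline so that it is refit inside each CV fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced placeholder data (assumption).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

candidates = {
    "logreg": make_pipeline(MinMaxScaler(), LogisticRegression(class_weight="balanced", max_iter=1000)),
    "svm": make_pipeline(MinMaxScaler(), SVC(class_weight="balanced")),
}

# Steps 1 and 2: cross-validate every candidate on the training set, pick the best.
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Step 3: retrain the chosen candidate on the complete training set.
best_model = candidates[best_name].fit(X_train, y_train)

# Step 4: a single, untainted evaluation on the held-out test set.
y_pred = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(best_name, cv_means)
print(cm)
print("recall:", recall_score(y_test, y_pred))
```

The confusion matrix and recall are computed once, on the test set, after the choice has been made; the CV scores are only used to make that choice.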

Splitting a data set for K-fold Cross Validation in Sci-Kit Learn

I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict as I believe that this is the module I am going to need.
What I am having trouble with, is the splitting of the data set. As far as I am aware, in the usual case, the train_test_split() method is used to split the data set into 2 - the train and the test. From my understanding, for K-fold validation you need to further split the train set into K-number of parts.
My question is: do I need to split the data set at the beginning into train and test, or not?
It depends. My personal opinion is yes: split your dataset into a training and a test set, then do K-fold cross-validation on the training set. Why? Because it is useful to test the model on unseen examples after training and fine-tuning it.
But some people only do cross-validation. Here is the workflow I often use:
# Data partition (model, metric, my_param and the model_with_* names below are placeholders)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)
# Cross-validation on multiple models to see which model gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the score you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))
# Then I tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))
# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)
# And finally test the tuned model on the test set
Y_pred = model.predict(X_test)
plot_or_print_metric(Y_pred, Y_test)
Short answer: NO
Long answer.
If you want to use K-fold validation, you do not usually split into train/test first.
There are a lot of ways to evaluate a model. The simplest one is to use train/test splitting, fit the model on the train set and evaluate using the test.
If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
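For the task in the question (a decision tree, with accuracy on the training set and with 10-fold cross-validation), a minimal sketch could look like this. The iris dataset is only a stand-in for the assigned data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder dataset (assumption)
tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the training set: fit on all the data, score on the same data.
train_acc = tree.fit(X, y).score(X, y)

# 10-fold cross-validation: cross_val_score does the splitting, fitting and
# scoring for each fold itself, so no manual train/test split is needed here.
cv_scores = cross_val_score(tree, X, y, cv=10)

print("training accuracy:", train_acc)
print("10-fold CV accuracy:", cv_scores.mean())
```

The training accuracy is typically the more optimistic of the two numbers, because an unpruned tree can memorize its training data.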
It's up to you what to choose but I would go with K-Folds or LOOCV.
K-Folds procedure is summarised in the figure (for K=5):

Using cross_val_predict against test data set

I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(n_splits=3)  # 3 folds, no shuffling
predictions = cross_val_predict(lr, train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
cross_val_score and cross_val_predict do not leave you with a fitted model: they fit copies of the estimator on the fly, one per fold, and the estimator you pass in stays unfitted. If you look at the documentation (section 3.1.1.1), you'll see that they never mention fit anywhere.
As @DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with the method that does not apply CV. The CV validation accuracy is computed on X_train and y_train. The other method fits the model using X_train and y_train and is tested on X_test and y_test. So the comparison is not fair, since they are on different datasets.
What you can do is use the estimators returned by cross_validate:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
# cross_validate returns a dict; cv_results["estimator"] holds one fitted model per fold
fold_model = cv_results["estimator"][0]
y_pred = fold_model.predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()
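Putting the answers together, a hedged end-to-end sketch: cross_val_predict yields out-of-fold predictions for the training rows only, and a separate fit on the training data is what produces a model usable on the test set. The synthetic data and the train/test split below are assumptions standing in for train_df and test_df.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic stand-ins for train_df / test_df (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = RandomForestClassifier(n_estimators=25, max_depth=6,
                             class_weight="balanced", random_state=1)
kf = KFold(n_splits=3, shuffle=True, random_state=1)

# Out-of-fold predictions: each training row is predicted by a copy of clf
# that did not see that row during fitting.
oof_pred = cross_val_predict(clf, X_train, y_train, cv=kf)
cv_accuracy = (oof_pred == y_train).mean()

# clf itself is still unfitted, because cross_val_predict works on clones.
print(hasattr(clf, "estimators_"))  # False

# To predict on the test set, fit on the full training data first.
clf.fit(X_train, y_train)
test_accuracy = (clf.predict(X_test) == y_test).mean()
print(cv_accuracy, test_accuracy)
```

The CV accuracy estimates how the procedure generalizes; the test accuracy measures the one model that was actually fitted on all the training data.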
