Performing voting for classification tasks - python

I am wondering whether it is possible to do voting for classification tasks. I have seen plenty of blog posts explaining how to use voting for regression, as given below.
# assumed imports for this snippet
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# training all the models on the training dataset
model_1.fit(X_train, y_target)
model_2.fit(X_train, y_target)
model_3.fit(X_train, y_target)
# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)
# final prediction after averaging the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0
# printing the mean squared error between the real and predicted values
print(mean_squared_error(y_test, pred_final))

Yes, that can be done with scikit-learn's VotingClassifier:
# assumed imports for this snippet
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# initializing all the model objects with default parameters
model_1 = svm.SVC(kernel='rbf')
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()
# building the final model using a voting classifier
final_model = VotingClassifier(estimators=[('svc', model_1), ('xgb', model_2), ('rf', model_3)], voting='hard')
# applying 10-fold cross-validation
scores = cross_val_score(final_model, X_all, y, cv=10, scoring='accuracy')
print(scores)
print('Model accuracy score : {0:0.4f}'.format(scores.mean()))
You can add more than three machine learning models if necessary.
Note that I have applied cross-validation here and reported the mean accuracy.

Of course you can use the same idea for classes; only your voting will use a different function. This is how Random Forests arrive at their prediction (the individual decision trees in the forest "vote" for a common prediction). You can, for example, employ a majority vote over all classifiers, or use the individual predictions to form a probability for your prediction: each class could get the fraction of votes it received as its output.
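As a concrete illustration, here is a minimal, hand-rolled sketch of both ideas, assuming the three fitted classifiers from the snippet above and a feature matrix X_test (this is not how VotingClassifier is implemented internally, just the idea):

import numpy as np

# Collect each classifier's predicted labels; shape (n_models, n_samples).
preds = np.array([m.predict(X_test) for m in (model_1, model_2, model_3)])

classes = np.unique(preds)
# vote_fraction[i, j] = fraction of models that predicted classes[j] for sample i
vote_fraction = np.stack([(preds == c).mean(axis=0) for c in classes], axis=1)

# Majority ("hard") vote: the class with the largest vote fraction per sample
majority_vote = classes[vote_fraction.argmax(axis=1)]

Ties simply go to whichever class appears first in classes; VotingClassifier with voting='soft' works similarly but averages the classifiers' predict_proba outputs instead of counting label votes.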


Cross validation and logistic regression

I am analyzing a dataset from kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test sets and want to use cross-validation to ensure the highest accuracy possible. I did some pre-processing and used the dummy function over categorical features, got to a certain point in the code, and I don't know how to proceed. I can't figure out how to use the results of the cross-validation; it's not so straightforward.
This is what I got so far:
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values  # features
Y = data_Enco.iloc[:, 6]  # labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')

# evaluate the model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good... I got the results of the model for each of the 10 splits. But now what? How do I build a confusion matrix? How do I calculate the recall, precision, and so on? I have the right code without performing cross-validation, I just don't know how to adapt it. How do I use the scores from the cross_val_score function?
import pandas as pd
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train)  # train the model
predictions_log = logisticModel.predict(scaled_X_test)

## Scoring the model
logisticModel.score(scaled_X_test, Y_test)

## Confusion matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the logistic regression model:\n')
print('Predictions: ', Y_pred, '\n\n')
print('Real Data: ', real_data, '\n')

cmtx = pd.DataFrame(
    confusion_matrix(real_data, Y_pred, labels=[0, 1]),
    index=['real 0:', 'real 1:'], columns=['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ', accuracy_score(real_data, Y_pred))
print('Precision score is: ', precision_score(real_data, Y_pred))
print('Recall score is: ', recall_score(real_data, Y_pred))
print('F1 score is: ', f1_score(real_data, Y_pred))
The performance of a model on the training dataset is not a good estimator of the performance on new data because of overfitting.
Cross-validation is used to obtain an estimate of the performance of your model on new data, i.e. without overfitting, and you correctly applied it to compute the mean and standard deviation of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used to do model selection. Say you have two logistic regression models that use different sets of independent variables. E.g., one is using only age and gender while the other one is using age, gender, and bmi. Or you want to compare logistic regression with an SVM model.
I.e., you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training-set accuracies of all the models, because those are inflated by overfitting. And if you use the performance on the test dataset for choosing the best model, the test dataset becomes part of the training, you get leakage, and the performance on the test dataset can no longer serve as a final, untainted performance measure. That is why cross-validation is used: it creates splits of the training data that act as different validation sets.
So the idea is to:
1. apply cross-validation to each of your candidate models,
2. use the scores of those cross-validations to choose the best model,
3. retrain that best model on the complete training dataset to get a final version of your best model, and
4. finally apply this final version to the test dataset to obtain an untainted evaluation.
But note that those steps are for model selection. You, however, have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, maybe use CV to get an additional estimate of the performance on new data, and that's it. You have already done this. There is no "selection of the best model"; that only applies when you have several models, as described above, e.g. logistic regression and an SVM.
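If all you need are fold-level metrics or a confusion matrix under cross-validation (rather than model selection), two convenient options are cross_validate with several scorers and cross_val_predict for out-of-fold predictions. A minimal sketch, reusing the question's logisticModel, scaled_X_train, Y_train and cv:

from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

# Option 1: several metrics per fold in a single call
cv_results = cross_validate(logisticModel, scaled_X_train, Y_train, cv=cv,
                            scoring=['accuracy', 'precision', 'recall', 'f1'])
print('mean precision:', cv_results['test_precision'].mean())
print('mean recall   :', cv_results['test_recall'].mean())

# Option 2: out-of-fold predictions, from which a confusion matrix can be built
Y_oof = cross_val_predict(logisticModel, scaled_X_train, Y_train, cv=cv)
print(confusion_matrix(Y_train, Y_oof))
print(classification_report(Y_train, Y_oof))

(For strictness, the MinMaxScaler could also go inside a Pipeline, so that it is refit on each fold's training portion rather than once on the whole training set.)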

Using Keras Multi Layer Perceptron with Cross Validation prediction [duplicate]

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    ...

def create_model():
    # create your model using this function
    ...

def train_evaluate(model, data_train, labels_train, data_test, labels_test):
    # fit and evaluate here
    ...

if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies of neural networks, I learned that a network's knowledge is represented in its synaptic weights, and that during training the weights are updated so as to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
For better training and assessment of a neural network's performance, a commonly used method is cross-validation, which returns partitions of the dataset for training and evaluating the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train, and evaluate a new neural network for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is the code written like this:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after CV, which is often only implied (and frequently missed by beginners), is that after you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train your model again, this time on the entire available data.
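In code, that final step is just a fresh fit on everything; a sketch using the question's own create_model, X and Y (the epochs and batch_size values are arbitrary placeholders, not taken from the question):

# After CV has given you an acceptable performance estimate, discard the
# per-fold models and train one fresh model on the entire dataset.
final_model = create_model()
final_model.fit(X, Y, epochs=50, batch_size=32, verbose=0)  # placeholder training settings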
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model

estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # Mean MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking", or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this can be problematic is if you are running something like PCA or ICA for feature extraction. If you do something like this, you must be sure to fit PCA on your training set only and then apply the resulting transformation to the test set.
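A simple way to guarantee this is to put the feature-extraction step inside a Pipeline, so that on every CV fold it is fit on the training portion only and merely applied to the held-out portion. A sketch with scikit-learn estimators standing in for the Keras model (the PCA size and the classifier here are illustrative choices, not from the question):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# PCA is refit on the training part of every fold; the held-out part is only transformed.
pipe = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, Y, cv=StratifiedKFold(n_splits=10))
print(scores.mean(), scores.std())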
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data that was not used during the training process, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client or application once the model is deployed.
That's why cross-validation is so powerful: it lets every data point in your whole dataset play the role of unseen, new data.
And now - to answer your question - every cross-validation should follow the following pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because, after all, you'll first train your model and then use it on new data. With your second approach you cannot treat it as a mimicry of the training process, because, e.g., in the second fold your model would retain information from the first fold, which is not equivalent to your actual training procedure.
Of course, you could apply a training procedure that uses 10 folds of consecutive training in order to fine-tune the network, but that is then not cross-validation; you would need to evaluate that procedure with some kind of scheme like the one above.
The stubbed-out functions make this a little less obvious, but the idea is to keep track of your model's performance as you iterate through the folds and, at the end, report either those fold-level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1

print(scores)
print(scores.mean())
So yes, you do want to create a new model for each fold: the purpose of this exercise is to determine how your model, as designed, performs on all segments of the data, not just on one particular segment that may or may not allow it to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross-validation splits and keep track of the performance per split and overall. In the end you will get a much better idea of which hyperparameters allow the model to perform best. For a much more in-depth explanation, see sklearn Model Selection and pay particular attention to the sections on Cross Validation and Grid Search.
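For completeness, a minimal grid-search sketch over the same kind of stratified splits (a scikit-learn classifier stands in for the Keras model here, and the parameter grid is purely illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# GridSearchCV fits a fresh model per fold for every hyperparameter combination
# and reports the mean CV score of each, i.e. the loop above, automated.
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring='accuracy')
search.fit(X, Y)
print(search.best_params_, search.best_score_)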

Problem of doing preprocessing for testing set in GridSearchCV

I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up 5-fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As far as I know, GridSearchCV uses the mean of the 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because to avoid data leakage, we should use only the training set to fit the transformer, but in this case, we have 5 different training sets and I don't know which one the GridSearchCV function uses to fit and transform the validation and testing set.
My code is given below:

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, mean_squared_error

X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
kf = KFold(n_splits=4, shuffle=True, random_state=i)

pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", ML_algo)])
grid = GridSearchCV(estimator=pipe, param_grid=param_grid,
                    scoring=make_scorer(mean_squared_error, greater_is_better=False, squared=False),
                    cv=kf, return_train_score=True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)

test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
Short answer: there is no data leakage; the test set is not used (and should not be used) for training the model in your code.
Long answer: k-fold cross-validation randomly divides your X_other and y_other (the training set) into k splits. In each iteration of cross-validation, k-1 folds are used to train the model, and the model is then evaluated on the remaining fold using the metric you specified in scoring= (see the diagram in the sklearn documentation for details: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
After GridSearchCV() has found the best set of hyperparameters, the entire training set is used to refit a final model (the whole pipeline, including the preprocessor) with those hyperparameters; X_test is then transformed and predicted by this fitted pipeline. Note that in this process X_test and y_test are not used, and should not be used, for anything other than the final evaluation.
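To make the mechanics explicit, here is a short illustration using the names from the question's code (assuming GridSearchCV's default refit=True):

# grid.best_estimator_ is the whole pipeline (preprocessor + model), refit on
# ALL of X_other with the best hyperparameters once the search has finished.
best_pipe = grid.best_estimator_
print(best_pipe.named_steps['preprocessor'])  # this transformer was fit on X_other only

# grid.predict(X_test) is equivalent to best_pipe.predict(X_test): X_test is
# pushed through the already-fitted preprocessor and then the model, so the
# test set is only ever transformed, never used for fitting.
y_test_pred = best_pipe.predict(X_test)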

Automatic selection of best model in function that compared ML model performance

I am trying to have this function select the model with the higher validation accuracy as the final model, retrain the selected final model on the training+validation set, and then calculate the predictions on the test set and the accuracy of those test-set predictions. I think I have everything I need to compare the models, but I cannot think of the most appropriate way to select the best model and continue as described above, all within the function.
def compare_models(X, y, model1, model2, test_size, val_size, random_state=0):
    # Split data first into training and testing, using 15% of the data for the test set
    X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=0, test_size=0.15)
    # Now split the training set again into training and validation, using 15% of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, random_state=0, test_size=0.15)
    # Compare the performance of the two models using the validation set
    model1.fit(X_train, y_train)
    val_preds_model1 = model1.predict(X_val)
    model2.fit(X_train, y_train)
    val_preds_model2 = model2.predict(X_val)
    # Calculate the validation accuracy of each model
    acc_val_model1 = sum(val_preds_model1 == y_val) / len(y_val)
    acc_val_model2 = sum(val_preds_model2 == y_val) / len(y_val)
    # Train our selected model on the training plus validation sets
    XXX.fit(X_train_full, y_train_full)
    # Evaluate its performance on the test set
    preds_test = XXX.predict(X_test)
    acc_test = sum(preds_test == y_test) / len(y_test)
    return acc_test
Calling fit again on a model refits the model, so something like this should do it:
XXX = model1 if acc_val_model1 > acc_val_model2 else model2
But maybe call it best_model or something, not XXX :P
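Putting that together, one possible completion of the question's function could look like the sketch below (it also uses the test_size/val_size arguments instead of the hard-coded 0.15 and assumes scikit-learn-style estimators):

from sklearn.model_selection import train_test_split

def compare_models(X, y, model1, model2, test_size, val_size, random_state=0):
    # Hold out a test set, then carve a validation set out of the remainder
    X_train_full, X_test, y_train_full, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_full, y_train_full, test_size=val_size, random_state=random_state)

    # Fit both candidates on the training set and score them on the validation set
    model1.fit(X_train, y_train)
    model2.fit(X_train, y_train)
    acc_val_model1 = (model1.predict(X_val) == y_val).mean()
    acc_val_model2 = (model2.predict(X_val) == y_val).mean()

    # Keep the better model and refit it on training + validation data
    best_model = model1 if acc_val_model1 > acc_val_model2 else model2
    best_model.fit(X_train_full, y_train_full)

    # Final, untouched evaluation on the test set
    preds_test = best_model.predict(X_test)
    acc_test = (preds_test == y_test).mean()
    return acc_test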

K-fold Cross Validation Queries

I am trying to perform K-fold cross-validation and GridSearchCV to optimise my Gradient Boosting model, following this link:
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the Model Report output produced by the code below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train/test split? If you change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my scikit-learn model with no CV and an 80/20 train/test split gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be ROC-AUC rather than AUC? My scikit-learn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC (Train) Score? It looks like they are both using roc_auc (my scikit-learn model gives 0.77 for the ROC AUC).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(ddf[predictors], ddf['APPROVED'])
    # Predict on the training set
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]
    # Perform cross-validation
    if performCV:
        cv_score = cross_val_score(alg, ddf[predictors], ddf['APPROVED'], cv=cv_folds, scoring='roc_auc')
    # Print the model report
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" %
              (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
    # Plot feature importance
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except the target and ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
The main reason your cv_score appears low is that comparing it to the training accuracy isn't a fair comparison. Your training accuracy is calculated on the same data that was used to fit the model, whereas the cv_score is the average score over the testing folds within your cross-validation. As you can imagine, a model will perform better making predictions on data it has already been trained on than on new data it has never seen before.
Your accuracy_score and AUC calculations appear fixed because you are always feeding the same inputs (ddf['APPROVED'], ddf_predictions and ddf_predprob) into the calculations. The performCV section doesn't actually transform any of those datasets, so if you're using the same model, model parameters, and input data you'll get the same predictions and hence the same numbers.
Based on your comments, there are several reasons the cv_score accuracy could be lower than the accuracy on your full testing set. One of the main ones is that you're allowing your model to access more data for training when you use the full training set, as opposed to the subset of the training data used in each CV fold. This matters especially if your dataset isn't large: with little data, every extra training example carries more weight and can noticeably improve performance.
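If you want to see that gap directly, a quick sketch (reusing df, predictors and the target column from the question; the split parameters are arbitrary) is to hold out part of the data and compare training accuracy with held-out accuracy:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(df[predictors], df['APPROVED'],
                                          test_size=0.2, random_state=10)
gbm = GradientBoostingClassifier(random_state=10)
gbm.fit(X_tr, y_tr)
print("train accuracy   :", accuracy_score(y_tr, gbm.predict(X_tr)))  # optimistic, like the report
print("held-out accuracy:", accuracy_score(y_te, gbm.predict(X_te)))  # closer to the CV score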
