Do the two kinds of XGBoost interfaces work completely the same? - python

I'm currently working on an in-class competition on Kaggle.
I have read the official Python API reference, and I'm somewhat confused about the two kinds of interfaces, especially regarding grid search, cross-validation and early stopping.
With the native XGBoost API, I can use xgb.cv(), which splits the whole dataset into train/validation parts for cross-validation, to tune good hyperparameters and then get the best_iteration.
Thus I can set num_boost_round to the best_iteration. To make maximal use of the data, I then train on the whole dataset again with the well-tuned hyperparameters, and use that model to classify. The only drawback is that I have to write the grid-search code myself.
ATTENTION: the cross-validation set changes at each fold, so the training result has no particular bias toward any part of the data.
But with the sklearn wrapper, it seems that I cannot get best_iteration from clf.fit() the way I do with the native xgb model. The fit() method does have early_stopping_rounds and eval_set to implement early stopping. Many people implement the code like this:
X_train, X_test, y_train, y_test = train_test_split(train, target_label, test_size=0.2, random_state=0)
clf = GridSearchCV(xgb_model, para_grid, scoring='roc_auc', cv=5,
                   verbose=True, refit=True, return_train_score=False)
clf.fit(X_train, y_train, early_stopping_rounds=30, eval_set=[(X_test, y_test)])
....
clf.predict(something)
But the problem is that I have split the data into two parts at the start. The evaluation set used for early stopping is not changed at each fold, so the result may be biased toward this one random slice of the whole dataset. The same problem also occurs in the grid search: the final parameters may tend to fit X_test and y_test more.
I'm fond of GridSearchCV in sklearn, but I also want the eval_set to change at each fold, just like xgb.cv does. I believe that would make full use of the data while preventing overfitting.
How should I do this?
I have thought of two ways:
using the native XGBoost API and writing the grid search myself;
using the sklearn API and changing the eval_set manually at each fold.
Are there any more convenient methods?

As you have summarised, both approaches have advantages and disadvantages.
xgb.cv will use the left-out fold for early stopping, so you do not need an additional train/validation split to determine when to trigger early stopping.
GridSearchCV (or maybe RandomizedSearchCV) will handle the parameter grid and the optimal choice for you.
Note that it is not a problem to use a fixed sub-sample for early stopping in all CV folds, so I do not think you have to do anything like "change the eval_set manually at each fold". The evaluation sample used for early stopping does not directly affect the model parameters; it is only used to decide when the evaluation metric on a hold-out sample stops improving. For the final model you can drop early stopping: with the optimal hyper-parameters, use the aforementioned split to see at which round the model stops, and then use that number of trees as a fixed parameter in the final model fit.
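For illustration, a minimal sketch of that workflow with the native API might look like the following, assuming X and y hold the full training data; the parameter values here are placeholders, not a recommendation:

import xgboost as xgb

# Hypothetical, already-tuned parameters and the full training data.
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6, "eval_metric": "auc"}
dtrain = xgb.DMatrix(X, label=y)

# Cross-validation with early stopping: each fold's left-out part serves as its eval set.
cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                    early_stopping_rounds=30, seed=0)

# With early stopping, the number of rows returned equals the chosen number of boosting rounds.
best_num_rounds = len(cv_results)

# Final model: train on all the data with that fixed round count, no early stopping needed.
final_model = xgb.train(params, dtrain, num_boost_round=best_num_rounds)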
So at the end it is a matter of taste, as in both cases you will need to compromise on something. IMO, the sklearn API is the optimal choice, as it allows you to use the rest of the sklearn tools (e.g. for data pre-processing) in a natural way in a pipeline during CV, and it gives a homogeneous interface to model training across various approaches. But in the end it is up to you.
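To illustrate that last point, a hedged sketch of the sklearn-style usage in a pipeline could look like this; the scaler, the grid values and the variable names are just examples, reusing X_train and y_train from the question's split:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Pre-processing and the model live in one pipeline, so every CV fold
# fits the scaler only on that fold's training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("xgb", XGBClassifier(n_estimators=200)),
])

param_grid = {"xgb__max_depth": [3, 6], "xgb__learning_rate": [0.05, 0.1]}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5, refit=True)
search.fit(X_train, y_train)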

Related

How to use GridSearchCV, cross_val_score and a model

I need to find the best hyperparameters for an ANN and then run predictions with the best model. I use KerasRegressor. I find conflicting examples and advice. Please help me understand the right sequence and which parameters to use when.
I split my data into Train and Test datasets
I look for the best hyperparameters using GridSearchCV on the Train dataset
GridSearchCV.fit(X_Train, Y_Train)
I take GridSearchCV.best_estimator_ and use it in cross_val_score on the Test dataset, i.e.
cross_val_score(model.best_estimator_, X_Test, Y_Test, scoring='r2')
I'm not sure if I need this step. In theory, it should show r2 scores similar to what GridSearchCV reported for this best_estimator_, shouldn't it?
I use model.best_estimator_.predict(X_Test, Y_Test) on Test data to predict the results, i.e. I pass best_estimator_ from GridSearchCV to run the actual prediction.
Is this correct?
Do I need to fit model.best_estimator_ again on Train data before doing a prediction, or does it keep all the weights found during GridSearchCV?
Do I need to save the weights to be able to reuse them later?
Usually when you use GridSearchCV on your training set, you end up with an object which contains the best trained model with the best parameters.
gs = GridSearchCV(estimator, param_grid).fit(X_train, y_train)
This is also evident from running gs.best_params_, which will print out the best parameters of the model after cross-validation. Now, you can make predictions on your test set directly by running gs.predict(X_test), which will use the best selected model to predict on your test set.
For question 3, you don't need to use cross_val_score again, as it is a helper function that performs cross-validation on your dataset and returns the score of each fold of the data split.
For question 4, I believe this answer is quite explanatory: https://stats.stackexchange.com/a/26535
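For illustration, a minimal sketch of the sequence described above, with a plain sklearn regressor standing in for KerasRegressor (the grid values and data names are placeholders):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
gs = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, scoring="r2", cv=5, refit=True)
gs.fit(X_train, y_train)          # cross-validation happens only on the training data

print(gs.best_params_)            # parameters chosen by the grid search
y_pred = gs.predict(X_test)       # refit=True means gs already holds the refitted best model
print(gs.score(X_test, y_test))   # one-time evaluation on the held-out test set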

How can I split my dataset into training and validation without using and splitting a test set?

I know this is the wrong way to split into training and validation sets, but you can see here what I really need. I want to use just a training set and a validation set; I don't need any test set.
# Data split
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.976, random_state=0)
The test set is the validation set:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
x_test and y_test are your validation set or test set; they are the same thing here. It is a small slice of the total x, y samples used to validate your model on data it hasn't been trained on.
By using random_state you get reproducible results. In other words, you get the same sets each time you run the script.
The terms validation set and test set are sometimes used interchangeably, and sometimes to mean slightly different things. #Sy Ker's point is correct: the sklearn function you're using does provide you with a validation set, though the term used there is test. Effectively, what you're doing is getting data for training and data for evaluation, regardless of the term used. I'm adding this answer to point out that you might, in fact, need a form of test set.
Using train_test_split will give you a pair of sets that allow you to train a model (with a proportion specified in the test_size argument, which should generally be something like 10-25% to ensure that the held-out part is a representative subsample). But I would suggest thinking of the process a little more broadly.
Splitting data for testing and model evaluation can be done simply (and, likely, incorrectly) by just taking some percentage of the rows from the bottom of a dataset. If normalization/standardization is being done, make sure you fit it on the training set and then apply it to the evaluation set, so that the same treatment is applied to both.
sklearn and others have also made it possible to do cross-validation very simply, and in this case "validation" sets should be thought of a little differently. Cross-validation will take a portion of your data and subdivide it into smaller groups for repeated training-and-testing passes. You might start with a split of data like that from train_test_split, and keep the "test" set as a total holdout, meaning that the cross-validation procedure never uses (or "sees") that data during its train/test process.
The test set that you got from train_test_split can then serve as a good check of how the model performs on data it has never seen. You might see this referred to as a "holdout" set, or again as some version of "test" and/or "validation".
This link has a quick, but intuitive, description of cross-validation and holdout sets.
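To make the described setup concrete, here is a hedged sketch, assuming x and y are the full feature matrix and labels; the 80/20 split, the 5 folds and the classifier are illustrative choices only:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Keep a total holdout that the cross-validation procedure never sees.
x_rest, x_holdout, y_rest, y_holdout = train_test_split(x, y, test_size=0.2, random_state=0)

# Cross-validation on the remaining data handles train/validation splitting internally.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, x_rest, y_rest, cv=5)

# Final check on the holdout, after fitting on everything that was not held out.
model.fit(x_rest, y_rest)
holdout_score = model.score(x_holdout, y_holdout)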

Training a model by looping through the train_test_split and training without looping

I am new to Python and Keras, so please bear with my question.
I recently created a model in Keras, trained it and computed the mean squared error (MSE) after prediction. I used the train_test_split function on the dataset.
Next I created a loop with 50 iterations and applied it to the above model. I kept the train_test_split call (random_state not specified) inside the loop, so that in every iteration I would get a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their mean and standard deviation.
My question is: did I do the right thing by placing train_test_split inside the loop? Does it affect my goal, which is to see the spread of MSE values generated for my dataset?
If I had placed train_test_split outside the loop and done the same thing, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

MSE = np.zeros(50)
for i in range(50):
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(predictors_train, target_train, validation_data=(predictors_test, target_test), epochs=50, verbose=0)
    model.evaluate(predictors_test, target_test, verbose=0)
    target_predicted = model.predict(predictors_test)
    MSE[i] = metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle:{}".format(i+1, MSE[i]))
The method you are implementing is a form of cross-validation; it gives your evaluation a better "view" of your data and reduces the chance that your particular training split was "too perfect" or "too noisy".
So putting train_test_split inside the loop will generate new training splits from your original data, and by averaging the outputs you will get what you want.
If you put train_test_split outside the loop, the training split will remain the same for the whole loop, which risks the overfitting you describe.
However, train_test_split is random, so you can get two random splits that are very similar, which makes this method suboptimal.
A better way is to use k-fold cross-validation:
from sklearn.model_selection import StratifiedKFold

# Note: StratifiedKFold expects class labels; for a continuous regression target, plain KFold would be used instead.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
MSE = []
for fold, (train, test) in enumerate(kfold.split(X, y)):
    model = regression_model()
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), epochs=50, verbose=0)
    model.evaluate(X[test], y[test], verbose=0)
    target_predicted = model.predict(X[test])
    MSE.append(metrics.mean_squared_error(y[test], target_predicted))
    print("Test set MSE for {} cycle: {}".format(fold + 1, MSE[fold]))
print("Mean MSE for {}-fold cross validation: {}".format(len(MSE), np.mean(MSE)))
This method will create 10 folds of your training data and will fit your model using a different held-out fold at each iteration.
You can have more info here : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
Hope this will help you!
EDIT FOR CLARIFICATION
Indeed, don't use this method on your TEST data, but only on your VALIDATION data!
Your model must never see your TEST data beforehand!
You don't want to use the test set during training at all. You would be tweaking the model to the point where it starts "overfitting" even the test set, and your error estimates would be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training and that can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the training set or not. If it is overfitting, you should address it by tweaking your model (making it less complex, adding regularization, early stopping...).
But don't train your model on the same data you use for testing. Training on the validation set is a different story, and it is normally done when implementing k-fold cross-validation.
So the general steps to follow are:
split your dataset into a test set and the "other" set; put the test set away and don't show it to your model until you are ready for final testing => only when you have already trained and tuned your model
choose whether you want to implement k-fold cross-validation or not. If not, then split your data into a training and a validation set and use them throughout the whole training => training set for training and validation set for validating
if you want to implement k-fold cross-validation, then follow step 2, measure the error metric that you want to track, pick the "other" set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate
tune your model and repeat steps 2 and 3 until you are happy with the results
measure the error of your final model on the test set to see whether it generalizes well
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
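A hedged sketch of those steps with sklearn utilities; X and y are assumed to be numpy arrays, build_model() is a placeholder for your model constructor, and the 80/20 split and 5 folds are illustrative choices:

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error

# Step 1: hold out a test set that the model never sees until the very end.
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-3: k-fold cross-validation on the "other" set for tuning.
errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_other):
    model = build_model()  # placeholder model constructor
    model.fit(X_other[train_idx], y_other[train_idx])
    errors.append(mean_squared_error(y_other[val_idx], model.predict(X_other[val_idx])))
print("Average validation MSE:", np.mean(errors))

# Step 5: final, one-time evaluation on the untouched test set.
final_model = build_model()
final_model.fit(X_other, y_other)
print("Test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))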

Do I give cross_val_score() the entire dataset or just the training-set?

The documentation is not very clear to me, and I don't understand what values to give it.
cross_val_score(estimator, X, y=None)
This is my code:
clf = LinearSVC(random_state=seed, **params)
cvscore = cross_val_score(clf, features, labels)
I am not sure if this is correct or if I need to give X_train and y_train instead of features and labels.
Thanks
It is always a good idea to separate the test set and the training set, even while using cross_val_score. The reason behind this is data leakage: when you use both the training and test sets, you are leaking information from the test set into your model, thereby making your model biased and its evaluation overly optimistic.
Here is a detailed blog post on the same issue.
References:
Reddit post on cross-validation
Cross_val_Score example showing correct way of using it
A similar question on stats.stackexchange
I assume you were referring to the below documentation:
sklearn.model_selection.cross_val_score
The purpose of cross-validation is to check that your model does not have particularly high variance, fitting well on one split but poorly on another. It is generally used for model validation. With this in mind, you should be passing the training set (X_train, y_train) and seeing how your model performs.
Your question was focused on:
"Can I pass in the whole data-set into cross validation?"
The answer is yes, but it is conditional on whether or not you are satisfied with your ML output. Say, for example, I have the situation below:
I have used a random forest model and am happy with my overall model fit and score.
In this case, I have a hold-out set.
Once I drop this hold-out set and give my model the whole dataset, I would get an even higher score, since I am giving my model more information (and as such, the CV scores will also tend to be higher).
An example of calling the method might be as such:
probabilistic_scores = cross_val_score(model, X_train, y_train, cv=5)
Generally, 5-fold cross-validation is preferred.
If you wish to go higher than 5 folds, please note that as the number of folds increases, the computational resources required also increase and processing will take longer.
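Putting the two answers together, a minimal sketch of the recommended usage might look like this; LinearSVC, seed, features and labels mirror the question, while the 80/20 split and 5 folds are illustrative:

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Hold back a test set so no information from it leaks into model selection.
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=seed)

clf = LinearSVC(random_state=seed)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)   # CV on the training data only
print(cv_scores.mean(), cv_scores.std())

# Fit on the full training set and evaluate once on the held-out test set.
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))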

Machine learning procedure splitting the data into 3 sets

While reading documentation and procedures for machine learning techniques, for both classification and regression, I came across a topic that is new to me. It seems that a recommended procedure for splitting the data before training and testing is to split it into three different sets: training, validation and testing. Since this procedure makes sense to me, I was wondering how I should proceed with it. Let's say we split the data into these three sets, since I came across this while reading sklearn approaches and tips.
If we follow an interesting approach like the one I found here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier, actually). The procedure, as far as I understand, should be something like this, right?
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
When one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in. Should the validation set that was split off before be used for calculating accuracy, or for validating somehow using k-fold CV instead? For instance:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint about the procedure with these three sets would be really appreciated. What I was thinking was that the validation set should be used together with the training set, but I don't really know in which way.
