make xgboost train the same way every time - python

I am trying to decide which variables I will use for training my xgboost classifier.
I fix the hyperparameters: n_estimators, max_depth, learning_rate, min_child_weight, reg_alpha. random_state in XGBClassifier, and sklearn.model_selection.train_test_split is also set to a fixed int every time. However, my model comes out quite different each time I train. The area under the ROC can go between 0.87 to 0.91. This makes it a little hard to compare if removing a variable actually made the model better/worse or if the difference in area was just due to the model training differently.
Is there a way to make xgboost train the same every single time?
If not, I am also thinking about training with the same variables 10 times, averaging the ROC curves, and then comparing. But this also has a problem, as sklearn.metrics.roc_curve returns a different length array every time, which makes it hard to compute an average roc_curve. If this is the way I need to go, I will make another thread about how to make sklearn.metrics.roc_curve return a fixed length result.

According to the API, you can set the seed parameter of XGBClassifier so that the classifier uses the same random numbers, and therefore trains the same way, on every run:
xgboost.XGBClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    seed=42
)
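If the results still vary between runs, the averaging idea from the question is workable. roc_curve returns arrays whose length depends on the number of distinct thresholds, but each curve can be interpolated onto a fixed grid of false-positive rates and then averaged. A minimal sketch, assuming synthetic labels and scores in place of real model output; mean_roc is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def mean_roc(y_true_runs, y_score_runs, n_points=101):
    """Average several ROC curves on a common false-positive-rate grid."""
    fpr_grid = np.linspace(0.0, 1.0, n_points)
    tprs = []
    for y_true, y_score in zip(y_true_runs, y_score_runs):
        fpr, tpr, _ = roc_curve(y_true, y_score)
        # roc_curve returns a different number of points per run;
        # interpolation puts every curve on the same grid.
        interp_tpr = np.interp(fpr_grid, fpr, tpr)
        interp_tpr[0] = 0.0  # every ROC curve starts at (0, 0)
        tprs.append(interp_tpr)
    return fpr_grid, np.mean(tprs, axis=0)


# Synthetic stand-in for 10 training runs (assumption: real scores would
# come from something like model.predict_proba(X_test)[:, 1]).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
score_runs = [y_true + rng.normal(scale=0.8, size=200) for _ in range(10)]

fpr_grid, mean_tpr = mean_roc([y_true] * 10, score_runs)
print(len(fpr_grid), len(mean_tpr))
print("AUC of the averaged curve:", auc(fpr_grid, mean_tpr))
```

Averaging the AUC values themselves (one number per run) is simpler still when only the area, not the curve, is being compared.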

Related

Training a model by looping through the train_test_split and training without looping

I am new to python and Keras please bear with my question.
I recently created a model in Keras, trained it and got the 'mean square error MSE' post prediction. I used the train_test_split function on the data set used.
Next I created a loop with 50 iterations and applied it to the above said model. However I kept the train_test_split function (random_state not specified) within the loop, such that in every iteration I would have a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their 'mean' and 'standard deviation'.
My query is: did I do the right thing by placing the train_test_split function within the loop? Does it affect my goal, which was to see the different MSE values generated for my data set?
If I had placed the train_test_split function outside my loop and performed the above said activity, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all of my 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
MSE=np.zeros(50)
for i in range(50):
    predictors_train,predictors_test,target_train,target_test=train_test_split(predictors,target,test_size=0.3)
    model=regression_model()
    model.fit(predictors_train,target_train,validation_data=(predictors_test,target_test),epochs=50,verbose=0)
    model.evaluate(predictors_test,target_test, verbose=0)
    target_predicted=model.predict(predictors_test)
    MSE[i]=metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle:{}".format(i+1,MSE[i]))
The method you are implementing is a form of cross-validation (repeated random sub-sampling). It gives your model a better "view" of your data and reduces the chance that a single training split was "too perfect" or "too noisy".
So putting train_test_split in the loop will generate new training batches from your original data, and by averaging the outputs you will get what you want.
If you put train_test_split outside the loop, the batch of training data will remain the same for every iteration, resulting in overfitting like you said.
However, train_test_split is random, so two random batches can end up very similar, which means this method is not optimal.
A better way is to use k-fold cross-validation:
from sklearn.model_selection import KFold
from sklearn import metrics
import numpy as np
# KFold is used rather than StratifiedKFold: stratification needs class labels,
# and the target of a regression model is continuous
kfold = KFold(n_splits=10, shuffle=True, random_state=42)  # any fixed integer works as the seed
MSE = []
for i, (train, test) in enumerate(kfold.split(X, y)):
    model = regression_model()
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), epochs=50, verbose=0)
    model.evaluate(X[test], y[test], verbose=0)
    # predict on the held-out fold, not on a split made elsewhere
    target_predicted = model.predict(X[test])
    MSE.append(metrics.mean_squared_error(y[test], target_predicted))
    print("Test set MSE for {} cycle:{}".format(i+1, MSE[i]))
print("Mean MSE for {}-fold cross validation : {}".format(len(MSE), np.mean(MSE)))
This method will create 10 folds of your data and, at each iteration, fit your model on nine of them and validate on the remaining one.
You can have more info here : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
Hope this will help you !
EDIT FOR PRECISION
Indeed, don't use this method on your TEST data, only on your VALIDATION data!
Your model must never see your TEST data beforehand!
You don't want to use test set during training at all. You will be tweaking the model to the point where it will start "overfitting" even the test set and your error estimates will be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training and it can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the training set or not. If it is overfitting, you should solve it by tweaking your model (making it less complex, implementing regularization, early stopping...).
But don't train your model on the same data you use for testing. Training your model on the validation set is a different story, and it is normally done when implementing k-fold cross-validation.
So the general steps to follow are:
1. Split your dataset into a test set and the "other" set. Put the test set away and don't show it to your model until you are ready for final testing, that is, only when you have already trained and tuned your model.
2. Choose whether you want to implement k-fold cross-validation or not. If not, split the other set into a training and a validation set and use them throughout the whole training: the training set for training and the validation set for validating.
3. If you want to implement k-fold cross-validation, follow step 2, measure the error metric that you want to track, then take the other set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate.
4. Tune your model and repeat steps 2 and 3 until you are happy with the results.
5. Measure the error of your final model on the test set to see whether it generalizes well.
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
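The general steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, and scikit-learn's Ridge stands in for the Keras regression_model() from the question so that the example runs quickly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic regression data (assumption: stands in for predictors/target).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

# Step 1: put a test set aside; it is not touched until the very end.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-3: k-fold cross-validation on the remaining data only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
val_mse = []
for train_idx, val_idx in kfold.split(X_other):
    model = Ridge(alpha=1.0)  # stand-in for regression_model()
    model.fit(X_other[train_idx], y_other[train_idx])
    pred = model.predict(X_other[val_idx])
    val_mse.append(mean_squared_error(y_other[val_idx], pred))
print("validation MSE: mean={:.3f}, std={:.3f}".format(
    np.mean(val_mse), np.std(val_mse)))

# Step 5: once tuning is finished, refit on all non-test data and
# evaluate a single time on the held-out test set.
final_model = Ridge(alpha=1.0).fit(X_other, y_other)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("test MSE: {:.3f}".format(test_mse))
```

Step 4 (tuning) would wrap the cross-validation loop: each candidate setting is scored by its mean validation MSE, and the test set is only used after the best setting has been chosen.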

How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and X_estimated.
My task is to make my classifier learn as accurately as it can, and then predict a vector of results over X_estimated that is as close as possible to Y_estimated (which I have now, and I have to be as precise as I can). If I split my training data 75/25 to train and test, I can get the accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But I am losing that 25% of the samples, which would make my predictions more accurate.
Is there any way I could learn using 100% of the data and still be able to see an accuracy score (or percentage), so I can predict many times and save the best (%) result?
I am using a random forest with 500 estimators, and usually get about 90% accuracy. I want to save the best possible prediction vector for my task, without splitting any data (not wasting anything), but still be able to calculate accuracy (so I can save the best prediction vector) from multiple attempts (random forest always shows different results).
Thank you
Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and still get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but I guess the simple answer to your question would be no.
Given your requirement, you can try K-fold cross-validation. Training on 100% of the data is not possible, because you need unseen data to validate how good your model is. Suppose you split 90|10 into train|test. K-fold CV then uses all of your training data: in each fold, a different part of the training data is held out for validation and the model is trained on the rest. The accuracy is finally computed by averaging over all the folds. At the end you can test the accuracy on the 10% that was set aside.
You can read more here and here
K Fold Cross Validation
scikit-learn provides simple methods for performing K-fold cross-validation. You simply pass the number of folds to the method. But remember: the more folds, the more time it takes to train the model. You can check more here
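As a rough illustration of how little code this takes, cross_val_score accepts the number of folds through its cv argument. The sketch below assumes a synthetic dataset from make_classification in place of X_train and Y_train, and uses a smaller forest than the 500 estimators in the question to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five folds: every row is used for validation exactly once
# and for training in the other four fits.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores)
print("mean accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))

# Once the settings are chosen, the model can be refit on all rows.
clf.fit(X, y)
```

The cross-validated mean is the estimate to report; the final fit on all rows has no score of its own, which is the trade-off the answers describe.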
It is not necessary to do a 75|25 split of your data all the time. 75|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.
In general splitting your data set is critical for evaluation. So I would recommend you always do that.
That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, that is, of the generalization accuracy.
One particularly prominent method is leveraging the out-of-bag samples of models based on bootstrapping, e.g. random forests.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)
If you are doing classification, always go with stratified k-fold CV (https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
If you're doing regression, go with simple k-fold CV, or divide the target into bins and do stratified k-fold CV on the bins. This way you can use your data completely in model training.
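The binning idea for regression can be sketched like this. It is only an illustration on synthetic data: the continuous target is cut into quantile bins, and the bin index is passed to StratifiedKFold as if it were a class label, while the model would still be trained on the original continuous target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data with a continuous target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)

# Cut the target into 5 quantile bins; the bin index acts as a pseudo-class.
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(y, edges)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_bins)):
    # A regressor would be fit on X[train_idx], y[train_idx] and
    # validated on X[val_idx], y[val_idx]; only the split uses the bins.
    counts = np.bincount(y_bins[val_idx], minlength=5)
    fold_counts.append(counts)
    print("fold", fold, "validation samples per bin:", counts)
```

Each validation fold then covers the whole range of the target, which matters most when the target distribution is skewed or multimodal.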

low training (~64%) and test accuracy (~14%) with 5 different models

I'm struggling to find a learning algorithm that works for my dataset.
I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a high non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I discarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and xgb regressor. The training dataset returns about 64% accuracy and the test data returns 11%-14%. Both numbers scare me haha. I tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid = {}):
    grid_search = GridSearchCV(estimator = model, param_grid = param_grid, cv = 3, n_jobs = -1, verbose =2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    # mean absolute percentage error; undefined if test_labels contains zeros
    mape = 100*np.mean(errors/test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees. '.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%. '.format(accuracy))
I expected the output to be at least, you know, acceptable, but instead I got 64% on the training data and 12-14% on the testing data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
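Both drawbacks are easy to see numerically. The sketch below uses made-up numbers and two small helper functions (mape and mae are written here for illustration, not taken from a library); it also shows why 100 - MAPE cannot be read as an accuracy.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


y_true = np.array([100.0, 100.0])
too_low = np.array([0.0, 0.0])       # lowest possible non-negative forecast
too_high = np.array([400.0, 400.0])  # over-forecasts have no such limit

print(mape(y_true, too_low))          # 100.0
print(mape(y_true, too_high))         # 300.0
print(100 - mape(y_true, too_high))   # -200.0, clearly not an "accuracy"
print(mae(y_true, too_high))          # 300.0

# A single zero in the actual values makes MAPE blow up.
with np.errstate(divide="ignore"):
    print(mape([0.0, 10.0], [1.0, 10.0]))  # inf
```

MAE stays finite and interpretable in every one of these cases, which is one reason it (or MSE, or R^2) is the more usual choice for reporting regression performance.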
It is an overfitting problem. You are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data (not features).
Try a less complex model like decision trees, since highly complex models (like random forests, neural networks etc.) fit the hypothesis well on the training data.
Cross-validation: it allows you to tune hyperparameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
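I would suggest you use a Pipeline, since it lets you chain several steps and tune their parameters together.
An example of that:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# pca and logistic are assumed to be defined beforehand, e.g. a PCA() transformer
# and a linear classifier that exposes an alpha parameter
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)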
I would suggest improving the preprocessing of your data. Try to manually remove the outliers, and look into Cook's distance to find points that have a strong negative influence on your model. You could also scale the data differently than with standard scaling: use log scaling if values in your data are very large or very small. Or use feature transformations such as a DCT or SVD transform.
Or, simplest of all, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as two features in a stock price prediction, you can create a new feature for the percentage difference between them, which could help your accuracy a lot.
Do some linear regression analysis to find the beta values, to better understand which features contribute more to the target value. You can use feature_importances_ in random forests for the same purpose, and then try to improve those features as much as possible so that the model understands them better.
This is just the tip of the iceberg of what could be done. I hope this helps.
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
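To make the effect of these parameters concrete, here is a small sketch comparing an unconstrained random forest with a constrained one. Everything in it is an assumption made for illustration: the data is synthetic (800 rows and 6 features, to mirror the question), and the specific values max_depth=4 and min_samples_split=20 are arbitrary starting points, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy, non-linear regression data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(800, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=800)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

settings = {
    "unconstrained": {},
    "regularized": {"max_depth": 4, "min_samples_split": 20},
}
results = {}
for name, params in settings.items():
    model = RandomForestRegressor(n_estimators=100, random_state=0, **params)
    model.fit(X_train, y_train)
    # score() returns R^2 for regressors.
    results[name] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print("{}: train R^2={:.3f}, validation R^2={:.3f}".format(
        name, *results[name]))
```

The number to watch is the gap between the training and validation scores: a constrained forest typically fits the training set less closely, and the setting to keep is whichever scores best on the validation set.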

Different results on the same dataset in machine learning

I use the scikit-learn library for the machine learning (with text data). It looks like this:
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=nltk.word_tokenize, stop_words=stop_words).fit(train)
matr_train = vectorizer.transform(train)
X_train = matr_train.toarray()
matr_test = vectorizer.transform(test)
X_test = matr_test.toarray()
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_predict = rfc.predict(X_test)
When I run it for the first time, the result for the test dataset is 0.17 for the recall and 1.00 for the precision. Ok.
But when I run it a second time on the same test and training datasets, the result is different: 0.23 for the recall and 1.00 for the precision. On later runs the result differs again. At the same time, the precision and the recall for the training dataset stay the same.
Why does it happen? Maybe this fact refers to something about my data?
Thanks.
A random forest fits a number of decision tree classifiers on various sub-samples of the dataset. Every time you fit the classifier, the sub-samples are randomly generated, and thus the results differ. In order to control this you need to set a parameter called random_state.
rfc = RandomForestClassifier(random_state=137)
Note that random_state is the seed used by the random number generator. You can use any integer to set this parameter. Whenever you change the random_state value the results are likely to change. But as long as you use the same value for random_state you will get the same results.
The random_state parameter is used in various other classifiers as well. For example in Neural Networks we use random_state in order to fix initial weight vectors for every run of the classifier. This helps in tuning other hyper-parameters like learning rate, weight decay etc. If we don't set the random_state, we are not sure whether the performance change is due to the change in hyper-parameters or due to change in initial weight vectors. Once we tune the hyper-parameters we can change the random_state to further improve the performance of the model.
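A quick way to check this behaviour is to fit the same forest twice with one seed and once with another, then compare the predicted probabilities. The sketch assumes synthetic data from make_classification; fit_predict is a hypothetical helper defined only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the text features in the question.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]


def fit_predict(seed):
    """Fit a fresh forest with the given seed and return test probabilities."""
    rfc = RandomForestClassifier(n_estimators=10, random_state=seed)
    return rfc.fit(X_train, y_train).predict_proba(X_test)


same_a = fit_predict(137)
same_b = fit_predict(137)
other = fit_predict(7)

print("same seed, identical output:", np.array_equal(same_a, same_b))
print("different seed, identical output:", np.array_equal(same_a, other))
```

With the seed fixed, any change in the metrics between two runs can be attributed to a change in the data or the hyperparameters rather than to the random number generator.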
The clue is (at least partly) in the name.
A Random Forest uses randomised decision trees, and as such, each time you fit, the result will change.
https://www.quora.com/How-does-randomization-in-a-random-forest-work

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. Problem is, when I try to create the model my computer freezes completely, so what I want to try is running the model every 50,000 rows but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X,Y)
#what I want is
model = rfc.fit(X.iloc[0:50000],Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000],Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000],Y.iloc[0:150000])
#... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
# .iloc is used for positional slicing (.ix was removed in pandas 1.0)
# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000],Y.iloc[0:50000])
# add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000],Y.iloc[0:100000])
# and so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000],Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then either simply average the predictions from all classifiers or merge the trees into one big random forest, as described here.
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifier's one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single python script. If you use this method and run into that issue, there's a work-around that I posted in the linked question.
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
iris = load_iris()
# iris is ordered by class, so shuffle first: every chunk passed to a
# warm_start fit should contain samples from all classes
X, y = shuffle(iris.data, iris.target, random_state=0)
### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
Below is the difference between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
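For contrast, this is roughly what partial_fit looks like on an estimator that supports it. The sketch uses SGDClassifier (a linear model, so not a drop-in replacement for a random forest) on synthetic data, with a chunk size of 500 chosen arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit at once.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
chunk_size = 500
for start in range(0, len(X), chunk_size):
    X_chunk = X[start:start + chunk_size]
    y_chunk = y[start:start + chunk_size]
    # The full list of classes must be given on the first call, because a
    # single mini-batch is not guaranteed to contain every label.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

print("accuracy on all rows:", clf.score(X, y))
```

Here the hyperparameters stay fixed and only the mini-batch changes between calls, which is the opposite of the warm_start pattern, where n_estimators is increased between calls to fit.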
Some algorithms in scikit-learn implement 'partial_fit()' methods, which is what you are looking for. There are random forest algorithms that do this, however, I believe the scikit-learn algorithm is not such an algorithm.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn
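The workaround amounts to pooling the fitted trees of several forests into one. A minimal sketch of that idea follows; combine_forests is a hypothetical helper, it relies on the estimators_ and n_estimators attributes of fitted forests rather than on a documented merging API, and it assumes every chunk has the same features and contains all classes. The data is synthetic.

```python
import copy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def combine_forests(forests):
    """Merge fitted forests by pooling their trees into a copy of the first."""
    combined = copy.deepcopy(forests[0])
    for forest in forests[1:]:
        combined.estimators_ += copy.deepcopy(forest.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined


# Synthetic stand-in for a large dataset processed in three chunks.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

forests = []
for i, start in enumerate(range(0, len(X), 1000)):
    chunk = slice(start, start + 1000)
    rf = RandomForestClassifier(n_estimators=10, random_state=i)
    forests.append(rf.fit(X[chunk], y[chunk]))

big_forest = combine_forests(forests)
print("trees in combined forest:", len(big_forest.estimators_))
print("accuracy on all rows:", big_forest.score(X, y))
```

Only one chunk needs to be in memory while its forest is being fitted, which is the point of the approach; the cost is that each tree has seen only a fraction of the data.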
