I'm using XGBoost with default parameters to forecast an hourly time series (my test set is 720 hours). I always run the model 10 times and analyse the 10 runs at the end. With XGBoost I'm getting exactly the same predictions, for all 720 hours, in all 10 runs, without any change. I'm using the default configuration of XGBoost and have already tried changing the seed (putting a different seed number in each run) and changing the random_state too, all without success. The only thing that makes the runs differ is setting subsample = 0.99, but my professor told me he wants the whole training dataset used, and with subsample = 1 I get the same problem of 10 identical sets of predictions. I've already checked my code and there is no error in the way I'm predicting the time series.
I'm using XGBoost this way:
import xgboost as xg

# One fresh model per run, fitted on the same training data every time
for run in range(10):
    model = xg.XGBRegressor().fit(X_train, y_train)
Thanks for the help.
Related
I am trying to decide which variables I will use for training my xgboost classifier.
I fix the hyperparameters n_estimators, max_depth, learning_rate, min_child_weight and reg_alpha, set random_state in XGBClassifier, and pass a fixed integer to sklearn.model_selection.train_test_split every time. However, my model comes out quite different each time I train it. The area under the ROC curve can range from 0.87 to 0.91. This makes it hard to tell whether removing a variable actually made the model better or worse, or whether the difference in area was just due to the model training differently.
Is there a way to make xgboost train the same every single time?
If not, I am also thinking about training with the same variables 10 times, averaging the ROC for each, and then comparing. But this also has a problem: sklearn.metrics.roc_curve returns an array of a different length every time, which makes it hard to compute an average ROC curve. If this is the way I need to go, I will open another thread about how to make sklearn.metrics.roc_curve return a fixed-length result.
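If you do end up averaging runs, one common workaround (just a sketch, not from the original question) is to interpolate every run's ROC curve onto a fixed grid of false-positive rates, so the arrays all have the same length. Here y_true and y_score_runs are hypothetical placeholders for the test labels and the per-run predicted scores:

import numpy as np
from sklearn.metrics import roc_curve, auc

mean_fpr = np.linspace(0.0, 1.0, 100)           # common grid shared by all runs
interp_tprs = []
for y_score in y_score_runs:                    # hypothetical: scores from each run
    fpr, tpr, _ = roc_curve(y_true, y_score)    # hypothetical: y_true = test labels
    interp_tprs.append(np.interp(mean_fpr, fpr, tpr))  # now every run has length 100

mean_tpr = np.mean(interp_tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)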
From the API, you can set the XGBClassifier seed parameter to ensure that the classifier produces the same results on every run, as follows:
xgboost.XGBClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    seed=42
)
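As a side note (a sketch, not part of the original answer): in more recent versions of the xgboost scikit-learn wrapper, random_state is the preferred name for this parameter, with seed kept only as a deprecated alias, and the value only has an effect when the booster actually uses randomness, e.g. subsample or colsample_bytree below 1.

import xgboost

# random_state plays the same role as seed in recent xgboost releases;
# it only changes anything when the booster uses randomness (e.g. subsample < 1).
clf = xgboost.XGBClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    random_state=42,
)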
I got the error “Allocation of 73138176 exceeds 10% of system memory” when running image classification code with a CNN. I tried different solutions to get rid of it, but each one changed the model accuracy.
Model accuracy here was 0.6761.
model.fit(X, y, batch_size=32, epochs=9, validation_split=0.3)
Then, when I lowered batch_size to 2, the accuracy increased to 0.8451. It also did not give any errors related to the allocation problem.
model.fit(X, y, batch_size=2, epochs=9, validation_split=0.3)
I was also curious about another piece of code that solved the allocation problem as well. However, this time the model accuracy was 0.7183. The code is:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # suppress TensorFlow log output

model.fit(X, y, batch_size=32, epochs=9, validation_split=0.3)
My question is: which of these approaches do you suggest I follow? Also, could you please enlighten me as to why the accuracy changes each time?
Thank you for every help and suggestion.
If you want exactly repeatable training results, you need to eliminate all sources of randomness. For a typical model training run, the main sources are: (1) your dataset, i.e. the randomisation of the test/train split and of the order in which batches are generated; and (2) the model initialisation, since if you want to train the same model every time you need to start from the same initial parameters every time. How you ensure that you get 'the same random numbers' on every training run varies by framework; it was unreasonably painful the last time I tried, years ago in TF, but it can be done, and a search for how to fix the random seed in TF should turn up the details.
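As a concrete illustration (a sketch for TensorFlow/Keras, with seed values chosen arbitrarily), the usual recipe is to pin every random number generator before any data splitting or model construction:

import os
import random

import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'  # affects hash-based operations; set it early
random.seed(42)                     # Python's built-in RNG
np.random.seed(42)                  # NumPy: shuffling, train/test splits
tf.random.set_seed(42)              # TensorFlow: weight initialisation, dropout masks

# Some GPU kernels remain non-deterministic; recent TF versions can force
# deterministic ops at a speed cost:
# tf.config.experimental.enable_op_determinism()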
However, fixing the random seed may not be what you are actually interested in. For doing repeatable experiments, it is what you want; but as far as the production qualities of your model are concerned, that's a different matter. If you find that the eventual model you end up with behaves rather differently depending on the seed (and many problems will intrinsically have this property, where multiple 'equally valid' but rather different interpretations exist), training an ensemble of such models, each with a different random seed, is a useful thing to do; in this way you gain an explicit awareness of how much 'room for interpretation' your model and dataset leave open.
I have my model and a fixed dataset on which I run train_test_split twice: once to get the train and test sets, and a second time to also get a validation set.
I have to reuse the same network, on the same data, in two different modules, but every time I do that I get different results.
Is there a way to fix it?
I have the weights fixed and random_state = 42 in order to eliminate every source of randomness, but it still does not seem to be enough.
The optimizer I used is Adam and the loss function is the mean absolute error.
Do you train and evaluate (predict) the model in the same script and process?
Please check the official Keras guide on how to obtain reproducible results during development.
In addition, you can try saving and loading your model (in another file) to check that the predictions match.
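A rough sketch of that check (assuming a trained Keras model called model, some held-out inputs X_val, and a hypothetical file name):

import numpy as np
from tensorflow import keras

model.save('my_model.keras')                          # hypothetical file name
reloaded = keras.models.load_model('my_model.keras')

preds_original = model.predict(X_val)                 # X_val: hypothetical held-out inputs
preds_reloaded = reloaded.predict(X_val)
print(np.allclose(preds_original, preds_reloaded))    # should print True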
I use the scikit-learn library for machine learning (with text data). The code looks like this:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Fit the TF-IDF vocabulary on the training texts only
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=nltk.word_tokenize, stop_words=stop_words).fit(train)
matr_train = vectorizer.transform(train)
X_train = matr_train.toarray()
matr_test = vectorizer.transform(test)
X_test = matr_test.toarray()

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_predict = rfc.predict(X_test)
When I run it for the first time, the result for the test dataset is 0.17 for the recall and 1.00 for the precision. Ok.
But when I run it a second time on the same test and training datasets, the result is different: 0.23 for the recall and 1.00 for the precision. And every subsequent run gives yet another result. At the same time, the precision and the recall for the training dataset stay exactly the same.
Why does this happen? Does it perhaps point to something about my data?
Thanks.
A random forest fits a number of decision tree classifiers on various sub-samples of the dataset. Every time you fit the classifier, the sub-samples are generated randomly anew, and thus the results differ. In order to control this you need to set a parameter called random_state.
rfc = RandomForestClassifier(random_state=137)
Note that random_state is the seed used by the random number generator. You can use any integer to set this parameter. Whenever you change the random_state value the results are likely to change. But as long as you use the same value for random_state you will get the same results.
The random_state parameter is used in various other classifiers as well. For example, in neural networks we use random_state in order to fix the initial weight vectors for every run of the classifier. This helps in tuning other hyper-parameters like the learning rate, weight decay, etc.: if we don't set the random_state, we cannot be sure whether a change in performance is due to the change in hyper-parameters or to a change in the initial weight vectors. Once we have tuned the hyper-parameters, we can vary the random_state to try to further improve the performance of the model.
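As an illustration (not part of the answer above), scikit-learn's MLPClassifier exposes the same parameter, so fixing it makes every fit start from the same initial weights; the hyper-parameter values below are arbitrary and X_train, y_train are hypothetical:

from sklearn.neural_network import MLPClassifier

# With random_state fixed, the initial weight vectors and the mini-batch
# shuffling are the same on every run, so repeated fits give the same model.
mlp = MLPClassifier(hidden_layer_sizes=(50,), learning_rate_init=0.01,
                    max_iter=500, random_state=137)
# mlp.fit(X_train, y_train)   # hypothetical training data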
The clue is (at least partly) in the name.
A Random Forest uses randomised decision trees, and as such, each time you fit one, the result will change.
https://www.quora.com/How-does-randomization-in-a-random-forest-work
I am using the Random Forest Regressor from Python's scikit-learn module to predict some values. I used joblib.dump for saving the models. There are 24 joblib.dump files, and each weighs 45 megabytes (the sum of all the files is 931 MB). My problem is:
I want to load all 24 files in one program to predict 24 values, but I cannot do it; it gives a MemoryError. How can I load all 24 joblib files in one program without any errors?
Thanks in advance...
There are a few options, depending on where exactly you are running out of memory.
Since you are predicting 24 different values based on the same input data, you can do the predictions sequentially, so that you keep only one RFR in memory at a time.
e.g.:
import joblib

predictions = []
for regressor_file in all_regressors:
    # Load one saved regressor at a time, predict, then let it be garbage collected
    regressor = joblib.load(regressor_file)
    predictions.append(regressor.predict(X))
(This next point might not apply to your case, but the problem is very common.)
You might be running out of memory when loading a large batch of input data. To solve this, you can split your input data and run the prediction on sub-batches, as sketched below. That helped us when we moved from running predictions locally to EC2. Try running your code on a smaller input dataset to test whether this helps.
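A rough sketch of that idea, reusing the names from the loop above (the chunk size is an arbitrary, illustrative choice):

import numpy as np

# Predict on slices of the input so only one chunk is held in memory at a time.
chunks = np.array_split(X, max(1, len(X) // 10000))
y_pred = np.concatenate([regressor.predict(chunk) for chunk in chunks])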
You may also want to optimise the parameters of the RFR. You may find that you get the same predictive power with shallower trees or fewer trees (or both). It is very easy to build a random forest that is just unnecessarily big. This is, of course, problem specific. I had to reduce the number of trees and make the trees smaller to make the model run efficiently in production; in my case, the AUC was the same before and after the optimisation. This last step of model tuning is sometimes omitted from tutorials.
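As an illustration of that last point (the parameter values are arbitrary and the accuracy impact has to be checked on your own data), capping the number and depth of trees usually shrinks the serialised model a lot:

import joblib
from sklearn.ensemble import RandomForestRegressor

# Fewer, shallower trees serialise to a much smaller file; whether the
# predictive power survives is problem specific.
small_rfr = RandomForestRegressor(n_estimators=50, max_depth=12, n_jobs=-1)
# small_rfr.fit(X_train, y_train)                          # hypothetical training data
# joblib.dump(small_rfr, 'rfr_small.joblib', compress=3)   # compression shrinks the file further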