I am using xgboost with Python to perform a binary classification in which class 0 appears roughly 9 times more frequently than class 1. I am, of course, using scale_pos_weight=9. However, when I predict on the test data after training the model (split with train_test_split), I obtain a y_pred with twice as many elements assigned to class 1 as there should be (20% instead of 10%). How can I correct this output? I thought scale_pos_weight=9 would be enough to inform the model of the expected proportion.
Your question is a bit unclear: what exactly is y_pred?
Also, remember that you are better off running a grid search or a Bayesian optimizer to find the best hyperparameter values.
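For example, something along these lines (just a sketch; the grid values are arbitrary and X_train, y_train stand in for your training split):

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Search over scale_pos_weight (and anything else you tune); with a classifier,
    # cv=5 uses stratified folds, which matters for the 9:1 class imbalance.
    grid = GridSearchCV(
        XGBClassifier(),
        param_grid={"scale_pos_weight": [1, 3, 9], "max_depth": [3, 6]},
        scoring="f1",
        cv=5,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)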
I have two questions that I would like to clear up, if possible:
I've recently come to understand (I hope) the random forest classification algorithm, and have tried to apply it using sklearn in Python on a rather large dataset of pixels derived from satellite images (the features being the different bands, and the labels being classes that I delineated myself, e.g., vegetation, cloud, etc.). I then wanted to check whether the model was suffering from a variance problem, so the first thought that came to mind was to compare the errors on the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How the CV error should or should not be used compared with the out-of-bag (OOB) error
How, by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting the model on the training data and using it to predict on that same training data), which seems to be the case regardless of tree depth
Regarding point 2, it seems that I can never meaningfully compare my training and test errors, as the former will always be low, so I decided to use the OOB error as the 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error, as it essentially evaluates each tree on points it was not trained on (when bootstrapping is used), and so I defaulted to the CV error being my new 'representative' training error for the entire model.
Looking back at my usage of the CV error, I had initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc.), and so I was again doubting whether I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each one answers only a small part of the question and they sometimes contradict one another. So, would anyone kindly help me with my predicament: what should I use as my official training error to compare against my test error?
My second question revolves around how the OOB error might be a pseudo test error based on data points not selected during bootstrapping. If that is true, would it be fair to say it no longer holds if bootstrapping is disabled (the algorithm is technically still a random forest, as features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
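For concreteness, this is the kind of configuration I mean (just a sketch; the hyperparameter values are placeholders, not my actual settings):

    from sklearn.ensemble import RandomForestClassifier

    # Default behaviour: bootstrap=True, so each tree sees a bootstrap sample and
    # OOB points exist (oob_score=True uses them for a built-in estimate).
    rf_bootstrap = RandomForestClassifier(
        n_estimators=200, oob_score=True, bootstrap=True, random_state=0
    )

    # bootstrap=False: every tree sees the full training set, so there are no OOB
    # points; features are still randomly subsampled at each split (max_features).
    rf_no_bootstrap = RandomForestClassifier(
        n_estimators=200, bootstrap=False, max_features="sqrt", random_state=0
    )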
Thank you!!!!
Generally, you want to split a dataset into distinct training, validation, and test sets. Training data is fed into the model, validation data is used to monitor the model's progress as it learns, and test data is used to see how well your model generalizes to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For a random forest, if you want to forgo having a distinct validation set and just use the OOB error to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may also need to be careful about how you split it (e.g., stratify the splits if the class labels are imbalanced).
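As a rough sketch of that kind of split (placeholder names and fractions; here X and y stand in for your pixel features and labels):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Carve off a test set first, then split the remainder into train/validation.
    # stratify=y keeps the class proportions similar in every split.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

    # oob_score=True gives an OOB estimate "for free" if you skip the validation set.
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
    rf.fit(X_train, y_train)
    print("OOB score:       ", rf.oob_score_)
    print("Validation score:", rf.score(X_val, y_val))
    print("Test score:      ", rf.score(X_test, y_test))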
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see whether you're overfitting, so that you can change hyperparameters to generalize better, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to train a number of models under a CV scheme such as stratified CV to obtain a more reliable test evaluation.
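For the small-dataset case, a stratified CV estimate looks something like this (again just a sketch, assuming X and y are already loaded):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Each fold trains on k-1 parts and evaluates on the held-out part; averaging
    # the fold scores gives a more stable estimate than a single split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=300, random_state=0),
        X, y, cv=cv, scoring="accuracy")
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))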
I have my model and a fixed dataset on which I run train_test_split twice: once to get the train and test sets, and a second time to get a validation set as well.
I have to reuse the same network, on the same data, twice in two different modules, but every time I do that I get different results.
Is there a way to fix it?
I have fixed the weights and set random_state=42 so as to eliminate every source of randomness, but it still does not seem to be enough.
The optimizer I used is Adam and the loss function is the mean absolute error.
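For reference, the relevant parts look roughly like this (simplified; X, y and the layer sizes are placeholders rather than my exact code):

    from sklearn.model_selection import train_test_split
    from tensorflow import keras

    # First split: train vs. test; second split: carve a validation set out of train.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42)

    # The same network is rebuilt in both modules: Adam optimizer, MAE loss.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mean_absolute_error")
    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, verbose=0)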
Do you train and evaluate (predict) the model in the same script and process?
Please check the official Keras guide on how to obtain reproducible results during development.
In addition, you can try saving and loading your model (in another file) to check that the predictions match.
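A rough outline of both suggestions (the exact seeding calls depend on your TensorFlow/Keras version, so treat this as a sketch rather than the official recipe; `model` and `X_val` are placeholders for your own objects):

    import os
    import random
    import numpy as np
    import tensorflow as tf

    # Seed every source of randomness before building the model.
    os.environ["PYTHONHASHSEED"] = "42"
    random.seed(42)
    np.random.seed(42)
    tf.random.set_seed(42)  # on TF 1.x: tf.set_random_seed(42)

    # ... build, compile, and fit `model` here ...

    # Save the trained model, then reload it (e.g. in another file) and compare.
    model.save("model.h5")
    reloaded = tf.keras.models.load_model("model.h5")
    assert np.allclose(model.predict(X_val), reloaded.predict(X_val))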
I'm running a Keras model for binary classification on two separate computers. The first runs Python 2.7.5 with TensorFlow 1.0.1 and Keras 2.0.2 and computes on the CPU; the second runs Python 2.7.5 with TensorFlow 1.2.1 and Keras 2.0.6 and uses the GPU.
My model is modified from the siamese network model at https://gist.github.com/mmmikael/0a3d4fae965bdbec1f9d. I added regularization (activity_regularizer=keras.regularizers.l1 in the Dense layers), but otherwise I'm using the same structure as the MNIST example.
I use the exact same code and training data on both computers, but the first one gives me a classification accuracy of 86% and recall of 88% on the test set, while the other gives me an accuracy of 52% and recall of 100% (it classifies every test sample as "positive"). These results are reproducible across separate initializations.
I'm not even sure where to start looking for why the performance is so vastly different. I've been reading through the Keras/TensorFlow release notes to see if any of the changes pertain to something in my model, but I don't see anything that looks relevant. It doesn't make sense to me that a version change in TensorFlow/Keras would cause that much of a difference in performance. Any help figuring this out would be greatly appreciated.
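For reference, the regularized layers look something like this (a simplified excerpt; the layer sizes and regularization strength are placeholders, not the exact values from my network):

    from keras import regularizers
    from keras.layers import Dense
    from keras.models import Sequential

    input_dim = 784  # placeholder input size

    # Stand-in for the base network: Dense layers with L1 activity regularization
    # added on top of the structure from the MNIST siamese example.
    base_network = Sequential([
        Dense(128, activation="relu", input_shape=(input_dim,),
              activity_regularizer=regularizers.l1(1e-4)),
        Dense(128, activation="relu",
              activity_regularizer=regularizers.l1(1e-4)),
    ])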
I have a support vector machine trained on ~300,000 examples; it takes roughly 1.5-2 hours to train this model, and I pickled (serialized) it. Currently, I want to add/remove a couple of the parameters of the model. Is there a way to do this without having to retrain the entire model? I am using sklearn in Python.
If you are using SVC from sklearn, then the answer is no; there is no way to do it, as this implementation is purely batch-training based. If you are training a linear SVM using SGDClassifier from sklearn, then the answer is yes, as you can simply start the optimization from the previous solution (when removing a feature, drop the corresponding weight; when adding one, initialize the new weight to any value).
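A rough sketch of the SGDClassifier route (hypothetical variable names; X_train and y_train are your existing data, and coef_ holds one weight per feature):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Linear SVM trained with stochastic gradient descent (hinge loss).
    clf = SGDClassifier(loss="hinge", warm_start=True, random_state=0)
    clf.fit(X_train, y_train)

    # Removing a feature: drop its column and the corresponding weight, then keep
    # training from the previous solution instead of starting from scratch.
    drop = 3  # index of the removed feature (placeholder)
    X_reduced = np.delete(X_train, drop, axis=1)
    clf.coef_ = np.delete(clf.coef_, drop, axis=1)
    clf.fit(X_reduced, y_train)  # warm_start=True reuses the current weights as the starting point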