I have made a classification model, which has been saved using
bst.save_model('final_model.model')
In another file I load the model and test it on my test data using:
bst = xgb.Booster() # init model
bst.load_model('final_model.model') # load model
ypred = bst.predict(dtest) # make prediction
Since I use k-fold cross-validation in my training process but need to use the whole test file for testing (so no k-fold splitting), I cannot verify that I still get exactly the same results when loading the model in a new file. This made me curious whether there is a way to print the loaded model's hyperparameters. After a lot of googling I found a way to do this in R with xgb.parameters(bst) or maybe xgb.attr(bst), but I have found no way to do this in Python. Since I do not use R I have not tested those lines, but from the documentation they seem to do what I need: output the hyperparameters of a loaded model. So, can this be done in Python with xgboost?
EDIT: I can see that if I instead write ypred = bst.predict(dtest, ntree_limit=bst.best_iteration) I get the error 'Booster' object has no attribute 'best_iteration'. So it seems that the loaded model does not remember all my hyperparameters. If I write bst.attributes() I can get it to output the number of the best iteration and its eval score, but I don't see how to output the actual hyperparameters used.
If you had used an xgboost.sklearn.XGBModel (e.g. XGBClassifier), you could use its get_xgb_params() method, but there is no equivalent in the base xgboost.Booster class. Remember that a Booster is the BASE model of xgboost, containing the low-level routines for training, prediction and evaluation. You can find more information here
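A minimal sketch of the difference, assuming the scikit-learn wrapper was used for training (X_train and y_train are placeholders); recent xgboost versions also expose Booster.save_config(), which dumps the internal learner configuration, including its parameters, as a JSON string:
import xgboost as xgb

# With the scikit-learn wrapper, the hyperparameters stay accessible:
clf = xgb.XGBClassifier(max_depth=4, n_estimators=100)
clf.fit(X_train, y_train)        # X_train / y_train are placeholders
print(clf.get_xgb_params())      # dict of booster parameters

# A bare Booster only exposes the attributes that were explicitly set
# (e.g. best_iteration from early stopping), not the training parameters:
bst = xgb.Booster()
bst.load_model('final_model.model')
print(bst.attributes())          # e.g. best_iteration, best_score
print(bst.save_config())         # recent versions: full config as JSON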
I need to train my own model with word2vec and fasttext. By reading different sources I found different information.
So I did the model and trained it like this:
model = FastText(all_words, size=300, min_count=3, sg=1)
model = Word2Vec(all_words, min_count=3, sg=1, size=300)
So I read that this should be enough to create and train the model. But then I saw that some people do it separately:
model = FastText(size=4, window=3, min_count=1) # instantiate
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10) # train
Now I am confused and don't know if what I did is correct. Can somebody help me clear it up?
Thank you
It's perfectly acceptable to supply your training corpus – all_words – when you instantiate the model object. In that case, the model will automatically perform all steps needed to train the model, using that data. So you can do this:
model = Word2Vec(all_words, ...) # where '...' is your non-default params
It's also acceptable to not provide the corpus when instantiating the model - but then the model is extremely minimal, with just your initial parameters. It still needs to discover the relevant vocabulary (which requires a single pass over the training data), then allocate some very large internal structures to accommodate those words, then do the actual training (which requires multiple additional passes over the training data).
So if you don't provide the corpus when the model is instantiated, you should do two extra method calls:
model = Word2Vec(...) # where '...' is your non-default params
model.build_vocab(all_words) # discover vocabulary & allocate model
# now train, with #-of-passes & #-of-texts set by earlier steps
model.train(all_words, epochs=model.iter, total_examples=model.corpus_count)
These two code blocks I've shown are equivalent. The top does the usual steps for you; the bottom breaks the steps out into your explicit control.
(The code you excerpted in your question, showing only a .train() call after a bare instantiation, would error for a number of reasons. The .build_vocab() step is necessary to have a fully-allocated model, and the call to .train() must explicitly state the desired epochs and an accurate total_examples count of the number of items in the corpus. But you can, and typically should, re-use values that were already cached into the model by the two previous steps.)
It's your choice which approach to use. Generally people only use the 3-separate-steps process if they want to do other output/logging between the steps, or something advanced between the steps that might tamper with the model state.
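The same two patterns apply to FastText; here is a sketch mirroring the parameters from the question (note that the exact argument names, e.g. size vs. vector_size and iter vs. epochs, depend on your gensim version):
from gensim.models import FastText

# One step: corpus supplied at construction, so vocabulary building and
# training run automatically (as in the question).
model = FastText(all_words, size=300, min_count=3, sg=1)

# Equivalent explicit steps, with each phase under your control:
model = FastText(size=300, min_count=3, sg=1)
model.build_vocab(all_words)
model.train(all_words, total_examples=model.corpus_count, epochs=model.epochs)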
I have my model and a fixed dataset on which I do the train_test_split twice: once for getting train and test sets and the second time for getting a validation set too.
I have to reuse the same network on the same data in two different modules, but every time I do that I get different results.
Is there a way to fix it?
I have fixed the weights and set random_state = 42 to eliminate every form of randomness, but it still does not seem to be enough.
The optimizer I used is Adam and the loss function is the mean absolute error.
Do you train and evaluate (predict) the model in the same script and process?
Please check the official Keras guide on how to obtain reproducible results during development.
In addition, you can try to save and load your model (in another file) to check that the predictions match.
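A minimal sketch of the usual seeding steps, assuming a TensorFlow backend (the exact calls depend on your Keras/TensorFlow version, and full determinism may additionally require single-threaded execution or avoiding GPU nondeterminism); X_val is a placeholder for your own validation data:
import os
import random
import numpy as np
import tensorflow as tf

# Seed every source of randomness before building the model.
os.environ['PYTHONHASHSEED'] = '0'
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)          # tf.set_random_seed(42) on TF 1.x

# ... build, compile and fit the model as usual ...

# Then save the trained model, reload it in the other module,
# and compare predictions on the same held-out data:
# model.save('model.h5')
# reloaded = tf.keras.models.load_model('model.h5')
# np.testing.assert_allclose(model.predict(X_val), reloaded.predict(X_val))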
If I use GridSearchCV in the scikit-learn library to find the best model, what will it actually return as the final model? That is, for each set of hyper-parameters we train CV-many (say 3) models. Will the function return the best of those 3 models for the best setting of parameters?
GridSearchCV will return an object with quite a lot of information. It does give you the model that performed best on the left-out data:
best_estimator_ : estimator or dict
Estimator that was chosen by the search, i.e. estimator which gave
highest score (or smallest loss if specified) on the left out data.
Not available if refit=False.
Note that, with the default refit=True, this is the best parameter setting refit on the entire training data you passed to fit(), not one of the individual CV-fold models. If you set refit=False, no final model is refit at all, and once you are confident that this is the model you want, you will need to retrain it on the entire data yourself.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
This is given in sklearn:
“The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.”
So you don't need to fit the model again. You can directly get the best model from the best_estimator_ attribute.
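A short sketch of the behaviour both answers describe; the SVC estimator, the iris data and the parameter grid are arbitrary choices for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# refit=True (the default) refits the best parameter setting
# on the whole of X, y after the cross-validated search.
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)
search.fit(X, y)

print(search.best_params_)        # winning hyper-parameters
print(search.best_estimator_)     # SVC refit on all of X, y
print(search.predict(X[:5]))      # delegates to best_estimator_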
Rather than having my model retrain every time I run my code, I just want to test how the classifier responds to certain inputs. Is there a way in SKLearn I can "export" my classifier, save it somewhere and then use it to predict over and over in the future?
Yes. You can serialize your model and save it to a file.
This is documented here.
Keep in mind that there may be problems if you reload a model that was trained with a different version of scikit-learn. Usually you will see a warning in that case.
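A minimal sketch using joblib, which the scikit-learn documentation recommends for model persistence (plain pickle works too); the classifier choice and the X_train / y_train / X_new data are placeholders:
import joblib
from sklearn.ensemble import RandomForestClassifier

# Train once and serialize the fitted classifier to disk.
clf = RandomForestClassifier()
clf.fit(X_train, y_train)        # X_train / y_train are placeholders
joblib.dump(clf, 'classifier.joblib')

# In a later run, load it and predict without retraining.
clf = joblib.load('classifier.joblib')
print(clf.predict(X_new))        # X_new is a placeholder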
I am working on a project to classify snippets of text using the python nltk module and the naivebayes classifier. I am able to train on corpus data and classify another set of data but would like to feed additional training information into the classifier after initial training.
If I'm not mistaken, there doesn't appear to be a way to do this, in that the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?
I'm open to suggestions including other classifiers that can accept new training data over time.
There are two options that I know of:
1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain & reload the classifier. This is probably the simplest solution.
2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in; the classifier would use the new data immediately. You'd have to look at the train method for details on how to update the probability distributions (a sketch follows below).
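Here is a simplified sketch of option 2: keep the raw frequency counts around and re-derive the probability distributions whenever new data arrives. It mirrors what NaiveBayesClassifier.train does internally, but omits its handling of feature values that are unseen for some labels, so results can differ slightly from the stock train():
from collections import defaultdict
from nltk.classify import NaiveBayesClassifier
from nltk.probability import ELEProbDist, FreqDist

label_freqdist = FreqDist()                 # counts per label
feature_freqdist = defaultdict(FreqDist)    # (label, fname) -> counts over values
feature_values = defaultdict(set)           # fname -> set of observed values

def absorb(labeled_featuresets):
    # Fold new (featureset, label) pairs into the running counts.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            feature_freqdist[label, fname][fval] += 1
            feature_values[fname].add(fval)

def build_classifier():
    # Re-derive the probability distributions from the current counts.
    label_probdist = ELEProbDist(label_freqdist)
    feature_probdist = {
        (label, fname): ELEProbDist(freqdist, bins=len(feature_values[fname]))
        for (label, fname), freqdist in feature_freqdist.items()
    }
    return NaiveBayesClassifier(label_probdist, feature_probdist)

absorb([({'word': 'tubular'}, 't'), ({'word': 'speedy'}, 's')])
clf = build_classifier()
absorb([({'word': 'special'}, 's')])        # new data arrives later
clf = build_classifier()                    # rebuild without re-featurizing a corpus
print(clf.classify({'word': 'special'}))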
I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.
There is an update() method on the NaiveBayesClassifier instance (the one from textblob.classifiers, which builds on NLTK), and it appears to add to the training data:
from textblob.classifiers import NaiveBayesClassifier
train = [
    ('training test totally tubular', 't'),
]
cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])
print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))
This prints out:
t t
s s
As Jacob said, the second method is the right way. Hopefully someone will write up the code for it; in the meantime, see:
https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/