Pre-trained XGBoost model does not reproduce results if loaded from file - python

I use XGBClassifier (XGBoost version 1.6.0) for a simple binary classification model in Google Colab, and I save the model to a file for later use. Within the same session, the model loaded from the file reproduces results on the validation set exactly. But if the session ends and I connect to Colab from scratch, the same model loaded from the same file performs far worse on the same validation set and has to be retrained to reproduce the original results.
I tried three different ways to save and load the model:
native:
xgb_model.save_model('xgb_native_save.model')
joblib:
joblib.dump(xgb_model, 'xgb_joblib.model')
pickle:
with open('xgb_pickle.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)
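For completeness, the corresponding loads look like this (a sketch; file names as above):
import pickle
import joblib
import xgboost as xgb

# native: load into a fresh classifier object
loaded = xgb.XGBClassifier()
loaded.load_model('xgb_native_save.model')

# joblib
loaded = joblib.load('xgb_joblib.model')

# pickle
with open('xgb_pickle.pkl', 'rb') as f:
    loaded = pickle.load(f)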
Same result with all three methods: the results on the validation set are not even close to those the model produced before it was saved.
random_state is fixed.
Any thoughts on where the problem might be?

I've run into the same issue with tree-based models for binary classification (XGBoost, LightGBM, ...).
When you restart the kernel and load the saved model, the feature order the booster expects can differ from the column order of the DataFrame you pass at prediction time.
This is how I solved it (XGBoost):
# reorder the prediction columns to match the feature order stored in the booster
lst_vars_in_model = model.get_booster().feature_names
model.predict_proba(df[lst_vars_in_model])
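A minimal end-to-end sketch of this fix, assuming the native save from the question (X_val stands in for the validation DataFrame):
import xgboost as xgb

# load the natively saved model into a fresh classifier
model = xgb.XGBClassifier()
model.load_model('xgb_native_save.model')

# align validation columns with the feature order the booster was trained on
feature_order = model.get_booster().feature_names
proba = model.predict_proba(X_val[feature_order])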

Related

Connecting untrained python predictive model to backend

I already have a predictive model written in Python, but currently it is executed by hand and operates on a single data file. I am hoping to generalize the model so that it can read different datasets from my backend, each time effectively producing a different model since the training data differs as well. How can I hook the model up to my backend?
Store the model as a pickle and read it from your backend whenever you need it, analogous to how you handle your training data.
But you might want to check out MLflow for an integrated model-handling solution. It is possible to run it on-prem. With MLflow you can easily implement a proper ML lifecycle: you can store your training stats and keep the history of your trained models.
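A minimal sketch of the pickle approach (the function names and paths are illustrative):
import pickle

def store_model(model, path):
    # persist the fitted model after training on a given dataset
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model_for_prediction(path):
    # read the fitted model back in the backend when predictions are needed
    with open(path, 'rb') as f:
        return pickle.load(f)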

Why does "load_model" cause RAM memory problems while predicting?

I trained a neural network (transformer architecture) and saved it using:
model.save(directory + args.name, save_format="tf")
After that, I want to load the model in another script to test it by making iterative predictions:
from keras.models import load_model
model = load_model(args.model)
for i in range(very_big_number):
    out, _ = model(something, training=False)
However, I have noticed that RAM usage increases with each prediction, and I don't know why. At some point the program stops because there is no more memory available.
If I use the same architecture but only load the weights of the model with model.load_weights(...), I do not have the problem.
My question now is: why does load_model seem to cause this, and how do I solve the problem?
I'm using TensorFlow 2.5.0.
Edit:
As I was not able to solve the problem and the answers did not help either, I simply used the load_weights method: I created a new model and loaded the weights of the saved model like this:
model = myModel()
# load only the weights stored inside the SavedModel directory
model.load_weights(args.model + "/variables/variables")
In this way, RAM usage remained constant. Nevertheless, a non-optimal solution in my opinion.
There is a fundamental difference between load_model and load_weights. When you save a model using save_model, you save the following things:
A Keras model consists of multiple components:
The architecture, or configuration, which specifies what layers the model contains and how they're connected.
A set of weights values (the "state of the model").
An optimizer (defined by compiling the model).
A set of losses and metrics (defined by compiling the model or calling add_loss() or add_metric()).
However, when you save the weights using save_weights, you save only the weights. That is enough for inference, whereas to resume training you need the full model object, which is why everything is bundled into the saved model. When you just want to predict and get results, save_weights is sufficient. To learn more, check the documentation on saving and loading models.
So, as you can see, load_model has much more to load than load_weights; that extra overhead accounts for the higher RAM usage.
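A minimal sketch of the weights-only route for inference (build_model, inputs, and args.model are placeholders; build_model must recreate the exact trained architecture):
# rebuild the architecture in code rather than deserializing the whole model
model = build_model()

# load only the variable values stored inside the SavedModel directory
model.load_weights(args.model + "/variables/variables")

# inference only: no optimizer state or loss/metric objects are restored
out = model(inputs, training=False)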

Loading pre-trained resnet model to cleverhans model format

I am trying to load a pre-trained ResNet model from the MadryLab CIFAR-10 challenge into CleverHans to compute transfer attacks.
However, restoring the saved models into the model_zoo.madry_lab_challenges.cifar10_model.ResNet object does not work. It appears to restore fine initially, but when I try to actually use the model, I get an error such as:
Attempting to use uninitialized value
ResNet/unit_3_1/residual_only_activation/BatchNorm/moving_mean
The easiest way to reproduce this error is to actually just run the provided attack_model.py example included in CleverHans here:
https://github.com/tensorflow/cleverhans/blob/master/examples/madry_lab_challenges/cifar10/attack_model.py
It encounters the same error after loading the model, when it tries to use it, with both the adv_trained and naturally_trained models.
Is there a workaround for this problem?
It seems the other option is to use cleverhans.model.CallableModelWrapper instead, but I haven't been able to find an example of how to use it.

How can I save my trained SVM model to retrieve it later for time saving in python?

I'm new to Python and working on machine learning. I have trained a LinearSVC from sklearn.svm, and training takes quite a long time, mostly because of stemming (7-8 minutes). I want to know whether it is possible to save the trained model in some format that can be fed straight back to Python when the application runs, to avoid retraining on every run.
My answer:
Pickle or joblib can be used to save a trained model.
For your reference, check out the link given below.
Reference Link
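A minimal sketch with joblib (the variable and file names are illustrative):
from joblib import dump, load

# after the expensive training run, persist the fitted model once
dump(svc_model, 'linear_svc.joblib')

# in later runs, load the fitted model instead of retraining
svc_model = load('linear_svc.joblib')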

How to use a model that is already trained in SKLearn?

Rather than having my model retrain every time I run my code, I just want to test how the classifier responds to certain inputs. Is there a way in SKLearn to "export" my classifier, save it somewhere, and then use it to predict over and over in future?
Yes. You can serialize your model and save it to a file.
This is documented here.
Keep in mind that there may be problems if you reload a model that was trained with a different version of scikit-learn. Usually you will see a warning in that case.
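A minimal sketch with pickle (clf, X_new, and the file name are placeholders):
import pickle

# persist the fitted classifier once
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

# reload it in a later run and predict without retraining
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)
predictions = clf.predict(X_new)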
