Updating an LGBM model with new data - python

I want to use additional data to 'update' an already trained Light Gradient Boosting Model (LGBM). Is there a way to do that?
I am looking for an approach that uses the Sklearn API and thus can be used in a pipeline.

An LGBM model in Python can be fitted both with the original model API and with the Sklearn API.
I couldn't find any examples of using the Sklearn API for continued training.
Regardless of that, you can fit a model either way, and the result is compatible with the .train() function from the original API.
It can be saved with save_model() or with joblib.dump().
None of this affects its compatibility with sklearn's Pipeline() - it remains perfectly compatible.

Related

Connecting untrained python predictive model to backend

I already have a built predictive model written in Python, however currently it is executed by hand and functions on a single data file. I am hoping to generalize the model so that it can read in different datasets from my backend, each time effectively producing a different model since we are using different data for training as well. How would I be able to add the model onto my backend then?
Store the model as a pickle file and load it from your backend whenever you need it, analogous to how you read in your training data.
But you might want to check out MLFlow for an integrated model-handling solution. It is possible to run it on prem. With MLFlow you can easily implement a proper ML lifecycle: you can store your training stats and keep the history of your trained models.
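A minimal sketch of the pickle approach (the model and data here are placeholders for your own):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical training data; in practice each dataset comes from your backend.
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression().fit(X, y)

# Store the fitted model as a pickle ...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and load it again in the backend when predictions are needed.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

preds = restored.predict(X)
```

Retraining on a different dataset and dumping under a different filename gives you one stored model per dataset.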

Calibrating AutoML models in H2O

From what I can see in the docs, H2O supports calibration for GBM, DRF, XGBoost models only and has to be specified prior to the training phase.
I find it confusing. If calibration is a post-processing step and is model agnostic, shouldn't it be possible to calibrate any model trained using H2O, even after the training process is finished?
Currently, I'm dealing with a model that I've trained using AutoML. Even though it is a GBM model, I'm not able to easily calibrate it by providing a calibrate_model parameter as it is not supported by AutoML. I don't see any option to calibrate it after it's trained either.
Does anyone know an easy way to calibrate already-trained H2O models? Is it necessary to "manually" calibrate them using algorithms such as Platt scaling or is there a way to do it without using any extra libraries?
Thanks
I find it confusing. If calibration is a post-processing step
The reason why it is part of model training right now is to have it in MOJO (our deployment artifact).
and is model agnostic, shouldn't it be possible to calibrate any model trained using H2O, even after the training process is finished?
Calibrating a model ex-post makes a lot of sense; all the code is already in place - it “just” needs to be exposed to users. We created a ticket for this here.
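Until that is exposed, "manual" Platt scaling is straightforward. The sketch below uses a plain sklearn GBM as a stand-in for an H2O AutoML model (the data and model are hypothetical): fit a logistic regression on the model's raw scores over a held-out calibration set, then use it to map scores to calibrated probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data; the GBM stands in for an already-trained AutoML model.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Platt scaling: a logistic regression fitted on the model's raw scores
# over a held-out calibration set.
scores = model.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(scores, y_cal)

# Calibrated probabilities for (here, the same) data:
calibrated = platt.predict_proba(scores)[:, 1]
```

With an H2O model you would export its predictions on the calibration frame and fit the same one-feature logistic regression on them, so no extra libraries beyond sklearn are needed.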

NotFittedError when saving a scikit-learn model with a custom transformer in the pipeline and using it to predict in a different file

I am in the same situation as was presented in this question: I want to save a model pipeline (a ColumnTransformer) in which there are some custom transformers I created, to use it to predict in another file placed in a different folder.
I managed to import the model in the new file through joblib as explained in the question linked above, i.e. creating a file with the custom transformers inside and importing them. However, when I use the predict() method of the loaded model, I get the following:
NotFittedError: This ColumnTransformer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
This makes sense since my custom transformers are just taken from a "new file" and therefore are not fitted yet.
However, I do not understand what the best practice would be to solve this issue. Is there a better way than refitting the model to the dataset?
I saw that in some answers someone was suggesting to use cloudpickle, could it be an option?
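The usual fix is to dump the fitted pipeline object itself rather than re-instantiating the transformers in the predicting file; the loaded copy then keeps its fitted state. A minimal sketch under that assumption (the transformer and data are hypothetical; in the real cross-file setup the transformer class must live in an importable module shared by both files):

```python
import joblib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical custom transformer; in the real setup it would be defined
# in its own module so the predicting file can import it.
class DoubleColumns(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) * 2

X = np.random.default_rng(0).normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("ct", ColumnTransformer([("dbl", DoubleColumns(), [0, 1, 2])])),
    ("clf", LogisticRegression()),
]).fit(X, y)

# Dump the *fitted* pipeline, not a freshly constructed one:
joblib.dump(pipe, "pipeline.joblib")

# The loaded pipeline predicts without refitting.
restored = joblib.load("pipeline.joblib")
preds = restored.predict(X)
```

If plain pickle/joblib still cannot locate the custom class, cloudpickle (which serializes the class definition along with the object) can indeed be an option.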

Do you need to store your old data to refit a model in sklearn?

I am trying to use sklearn to build an Isolation Forest machine learning program to go through a ton of data. I can only store the past 10 days of data, so I was wondering:
When I use the "fit" function on new data that comes in, does it update the model that was fitted on the old data, even though that data is no longer available? Or does it rebuild the model from scratch?
In general, only the estimators implementing the partial_fit method are able to do this. Unfortunately, IsolationForest is not one of them.
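A quick way to check, and a sketch of what incremental fitting looks like for an estimator that does support it (SGDClassifier here; the batches are made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import SGDClassifier

# Estimators that support incremental learning expose partial_fit;
# IsolationForest does not, so fit() on new data retrains from scratch.
print(hasattr(SGDClassifier(), "partial_fit"))    # True
print(hasattr(IsolationForest(), "partial_fit"))  # False

rng = np.random.default_rng(0)
clf = SGDClassifier()

# Each day's batch updates the same model; old batches need not be kept.
for day in range(3):
    X_batch = rng.normal(size=(100, 4))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])
```

For IsolationForest specifically, calling fit() again simply rebuilds the forest from whatever data you pass it, so the last 10 days of data are all it will ever see.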

How to use a model that is already trained in SKLearn?

Rather than having my model retrain every time I run my code, I just want to test how the classifier responds to certain inputs. Is there a way in SKLearn to "export" my classifier, save it somewhere, and then use it to predict over and over in the future?
Yes. You can serialize your model and save it to a file.
This is documented here.
Keep in mind that there may be problems if you reload a model that was trained with a different version of scikit-learn. Usually you will see a warning then.
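A minimal sketch with joblib (the classifier and data are placeholders for your own):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train once ...
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# ... serialize the fitted classifier to disk ...
joblib.dump(clf, "classifier.joblib")

# ... and in any later run, load it and predict without retraining.
restored = joblib.load("classifier.joblib")
preds = restored.predict(X[:5])
```

The loading step is all a future script needs, so the training code never has to run again.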
