In the IO section of the Kedro API docs I could not find any functionality for storing trained models (e.g. .pkl, .joblib, ONNX, PMML). Have I missed something?
There is a pickle dataset in kedro.io that you can use to save trained models, or anything else that is pickle-serialisable (models being a common case). It accepts a backend argument that defaults to pickle but can be set to joblib if you want to use joblib instead.
A quick note: Kedro is moving its non-core datasets out of kedro.io and into kedro.extras.datasets. You might want to look at kedro.extras.datasets and, from Kedro 0.16 onwards, at pickle.PickleDataSet, which has joblib support.
The Kedro spaceflights tutorial in the documentation actually saves the trained linear regression model using the pickle dataset, if you want to see an example. The relevant section is here.
There is PickleDataSet in https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pickle.PickleDataSet.html, and joblib support in PickleDataSet lands in the next release (see https://github.com/quantumblacklabs/kedro/blob/develop/RELEASE.md).
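For illustration, a minimal sketch of using PickleDataSet directly from Python. The file path and the fitted model object are placeholders, and the backend="joblib" argument assumes the 0.16+ release mentioned above:

from kedro.extras.datasets.pickle import PickleDataSet

# Save a fitted model through Kedro's dataset abstraction
model_ds = PickleDataSet(filepath="data/06_models/model.pkl", backend="joblib")
model_ds.save(model)        # serialize the trained model
reloaded = model_ds.load()  # load it back later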
Related
First of all, let me introduce myself. I am a young researcher interested in machine learning. I created a model, then trained, tested and validated it. Now I would like to know if there is a way to save my trained model.
I am also interested in knowing whether the model is saved in its trained state.
Finally, is there a way to use the saved (and trained) model with new data without having to train the model again?
I work with Python!
Welcome to the community.
Yes, you can save the trained model and reuse it later. There are several ways to do so, and I will introduce a couple of them here. However, note which library you used to build your model and use a method appropriate for that library.
Pickle: Pickle is the standard way of serializing objects in Python.
import pickle

with open(filename, 'wb') as f:    # save the trained model
    pickle.dump(model, f)
with open(filename, 'rb') as f:    # load it back later
    loaded_model = pickle.load(f)
Joblib: Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs.
import joblib

joblib.dump(model, filename)          # save the trained model
loaded_model = joblib.load(filename)  # load it back later
Finally, as suggested by others, if you used a library such as TensorFlow to build and train your model, note that such libraries have their own extensive ways to save and load the built model. Please check the following information:
TensorFlow Save and Load model
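For completeness, a minimal sketch of the Keras-level save/load calls (this assumes a compiled tf.keras model named model; the path is arbitrary):

import tensorflow as tf

model.save("saved_model_dir")                           # saves architecture, weights and optimizer state
loaded = tf.keras.models.load_model("saved_model_dir")  # restores a ready-to-use model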
There may be a better way to do this, but this is how I have done it before, with Python.
So you have an ML model that you have trained. That model is basically just a set of parameters. Depending on what module you are using, you can save those parameters to a file and import them later to regenerate your model.
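For example, here is a hypothetical sketch of saving only the learned parameters of a fitted scikit-learn LinearRegression to JSON and rebuilding an equivalent estimator from them (the file name and the fitted model object are placeholders):

import json
import numpy as np
from sklearn.linear_model import LinearRegression

# Save just the learned parameters
params = {"coef": model.coef_.tolist(), "intercept": float(model.intercept_)}
with open("params.json", "w") as f:
    json.dump(params, f)

# Regenerate an equivalent model later
with open("params.json") as f:
    p = json.load(f)
rebuilt = LinearRegression()
rebuilt.coef_ = np.array(p["coef"])
rebuilt.intercept_ = p["intercept"]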
Perhaps a simpler way is to save the model object entirely in a file using pickling.
https://docs.python.org/3/library/pickle.html
You can dump the object into a file, and load it back when you want to run it again.
I am curious to find out whether there is an accepted solution for saving sklearn objects to JSON instead of pickling them.
I'm interested in this because saving to JSON should take up much less storage and make saving the objects to databases like Redis much more straightforward.
In particular, for something like a ColumnTransformer, all I need is the mean and standard deviation for a specific feature. With those, I can easily rebuild the transformer, but when reconstructing a transformer object from the saved JSON, I have to manually set learned and private attributes, which feels hacky.
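To make the kind of hand-reconstruction I mean concrete, here is a hypothetical sketch for a StandardScaler whose mean and scale were saved earlier (the statistics are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

stats = {"mean": [0.5, 2.0], "scale": [1.2, 0.7]}  # pretend these were loaded from JSON

scaler = StandardScaler()
scaler.mean_ = np.array(stats["mean"])    # manually setting learned attributes
scaler.scale_ = np.array(stats["scale"])  # relies on sklearn internals

X = np.array([[1.0, 2.0], [2.0, 3.0]])
print(scaler.transform(X))                # works, but feels fragile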
The closest thing I've found is this article: https://stackabuse.com/scikit-learn-save-and-restore-models/
Is this how others are going about this?
What is stopping sklearn from building this functionality into the library?
Thanks!
I think this package is what you are looking for: https://pypi.org/project/sklearn-json/
Export scikit-learn model files to JSON for sharing or deploying predictive models with peace of mind.
This code snippet, lightly adapted from the link above, shows how to export sklearn models to JSON:
import sklearn_json as skljson
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)  # toy data so the snippet runs
file_name = 'model.json'

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)
skljson.to_json(model, file_name)                  # serialize the fitted model to JSON
deserialized_model = skljson.from_json(file_name)  # rebuild the estimator
deserialized_model.predict(X)
Furthermore, to answer the JSON vs. pickle question, this might be helpful:
Pickle or json?
Does anyone know which function I should use if I want to use the pre-trained doc2vec models on this website: https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but is there a similar function in gensim for loading pre-trained doc2vec models?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main file, in separate files with extra extensions, and all of those files must be kept together to reload a fully functional model. (You still specify only the main save file; the auxiliary files will be discovered at the expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about two years ago; the files might not load in standard gensim or in later gensim versions
it's not completely clear what parameters were used to train those models (though if you succeed in loading them, you could inspect them as properties of the model), how much meta-optimization was used and for which purposes, and whether those purposes match your own project's
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (min_count=1 is usually bad for Doc2Vec) or with the apparent model size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word tokens in the 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g

# Point to the downloaded pre-trained doc2vec model
model_path = "model_folder/doc2vec.bin"

# Load the model
m = g.Doc2Vec.load(model_path)
joblib.dump() seems to be the intended method for storing a trained sklearn model for later loading and use. I like the compression option and the ease of use, but later loading with joblib.load() is slow: it takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, the recommended compromise) takes 100MB as a dumped file.
For my own use (analysis), I needn't worry about loading speed, but I have one model I'd like to share with colleagues, and I would like to make it as easy and quick as possible for them. I find examples of alternatives to joblib, e.g. pickle, JSON or HDF5. All are basically the same idea: dump a binary object to disk.
I came across langid.py as an example that I suspect was created in a similar manner to a sklearn model. It looks like the whole model is encoded as some kind of string (base64?) and pasted directly into the script itself. Is this a solution? How is this done?
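To make the question concrete, the technique I suspect would look roughly like this sketch (the model object is a placeholder, and whether langid.py also compresses the payload is my assumption):

import base64
import pickle

# Encode a pickled model as an ASCII string that can be pasted into a script
encoded = base64.b64encode(pickle.dumps(model)).decode("ascii")

# ...later, inside the distributed script, where `encoded` is a string literal:
restored = pickle.loads(base64.b64decode(encoded))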
Are there any other strategies that might be viable?
When storing a classifier trained with sklearn, I have a choice between pickle (or cPickle) and joblib.dump().
Is there any benefit apart from performance to using joblib.dump()? Can a classifier saved by pickle produce worse results than one saved with joblib?
They actually use the same protocol (i.e. joblib uses pickle). Check out the documentation for joblib.dump; you can specify the level of pickle compression using its arguments.
joblib works especially well with NumPy arrays, which sklearn uses heavily, so depending on the classifier type you might see performance and size benefits from using joblib.
Otherwise pickle works correctly, so saving a trained classifier and loading it again will produce the same results regardless of which serialization library you use. See also sklearn's docs on this topic.
Please note that joblib is bundled with sklearn (as sklearn.externals.joblib in older versions; newer versions expect joblib to be installed separately).
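As a side-by-side illustration, here is a minimal sketch of both options for a fitted classifier clf (the file names are arbitrary):

import pickle
import joblib

# Plain pickle
with open("clf.pkl", "wb") as f:
    pickle.dump(clf, f)

# joblib, with optional compression (levels 0-9)
joblib.dump(clf, "clf.joblib", compress=3)
clf_restored = joblib.load("clf.joblib")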