I am curious whether there is an accepted solution for saving sklearn objects to JSON instead of pickling them.
I'm interested in this because saving to JSON would take up much less storage and make saving the objects to databases like Redis much more straightforward.
In particular, for something like a ColumnTransformer, all I need is the mean and std for a specific feature. With those, I can easily rebuild the transformer, but reconstructing a transformer object from the saved JSON means manually setting learned and private attributes (roughly as in the sketch below), which feels hacky.
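For illustration, here is a minimal sketch of the kind of manual reconstruction I mean, using a StandardScaler whose learned mean_ and scale_ attributes are the statistics being saved (the training data here is just a random placeholder):

import json
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 3)  # stand-in for the real training data

# Fit once, then keep only the learned statistics as JSON.
scaler = StandardScaler().fit(X_train)
blob = json.dumps({"mean": scaler.mean_.tolist(), "scale": scaler.scale_.tolist()})

# Rebuilding later means setting the learned attributes by hand.
state = json.loads(blob)
restored = StandardScaler()
restored.mean_ = np.asarray(state["mean"])
restored.scale_ = np.asarray(state["scale"])
restored.var_ = restored.scale_ ** 2
restored.n_features_in_ = restored.mean_.shape[0]

assert np.allclose(restored.transform(X_train), scaler.transform(X_train))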
The closest thing I've found is this article: https://stackabuse.com/scikit-learn-save-and-restore-models/
Is this how others are going about this?
What is stopping sklearn from building this functionality into the library?
Thanks!
I think this package is what you are looking for: https://pypi.org/project/sklearn-json/
Export scikit-learn model files to JSON for sharing or deploying predictive models with peace of mind.
This code snippet is from the link above and shows how to export an sklearn model to JSON:
import sklearn_json as skljson
from sklearn.ensemble import RandomForestClassifier

# X, y are the training data; file_name is the path of the JSON file to write.
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)

# Serialize the fitted model to JSON, then load it back and predict with it.
skljson.to_json(model, file_name)
deserialized_model = skljson.from_json(file_name)
deserialized_model.predict(X)
Furthermore, to answer the JSON vs. pickle question, this might be helpful:
Pickle or json?
In the IO section of the Kedro API docs I could not find functionality for storing trained models (e.g. .pkl, .joblib, ONNX, PMML). Have I missed something?
There is a pickle dataset in kedro.io that you can use to save trained models, or anything else you want to pickle that is serialisable (models being a common case). It accepts a backend argument that defaults to pickle but can be set to joblib if you want to use joblib instead.
I'm just going to quickly note that Kedro is moving its datasets to kedro.extras.datasets and away from having non-core datasets in kedro.io. You might want to look at kedro.extras.datasets and, from Kedro 0.16 onwards, pickle.PickleDataSet with joblib support.
The Kedro spaceflights tutorial in the documentation actually saves the trained linear regression model using the pickle dataset if you want to see an example of it. The relevant section is here.
There is PickleDataSet in https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pickle.PickleDataSet.html and joblib support in PickleDataSet is in the next release (see https://github.com/quantumblacklabs/kedro/blob/develop/RELEASE.md)
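For reference, a rough sketch of using it from Python, assuming a Kedro version where the dataset lives under kedro.extras.datasets (the file path and the model here are placeholders, not taken from the tutorial):

from kedro.extras.datasets.pickle import PickleDataSet
from sklearn.linear_model import LinearRegression
import numpy as np

# Placeholder model; in the spaceflights tutorial this would be the fitted regressor.
model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))

# backend="pickle" is the default; set it to "joblib" to use joblib instead.
data_set = PickleDataSet(filepath="data/06_models/regressor.pkl", backend="pickle")
data_set.save(model)
reloaded = data_set.load()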
I pickled a model and want to expose only the prediction API, written in Flask. However, when I write a Dockerfile to build an image without sklearn in it, I get the error ModuleNotFoundError: No module named 'sklearn.xxxx' (where xxxx refers to one of sklearn's ML algorithm modules) at the point where I load the model with pickle, i.e. classifier = pickle.load(f).
When I rewrite the Dockerfile to build an image that includes sklearn, I don't get the error, even though the API never imports sklearn.
My understanding of pickling is very simple: it serializes the classifier class with all of its data, so when we unpickle it, the classifier already has a predict attribute and we can just call it. Why do I need to have sklearn in the environment?
You have a misconception of how pickle works.
It does not serialize anything except the instance state (__dict__ by default, or a custom implementation via __getstate__/__setstate__). When unpickling, it just tries to create an instance of the corresponding class (this is where your import error comes from) and sets the pickled state on it.
There's a reason for this: you don't know beforehand which methods will be used after loading, so you cannot pickle just the needed implementation. Nor can you, at pickling time, build some AST to see which methods/modules will be needed after deserializing; the main reason is the dynamic nature of Python, where the implementation that runs can actually vary depending on the input.
After all, even if we theoretically had a smart, self-contained pickle serialization, it would just be the actual model plus sklearn bundled into a single file, with no proper way to manage it.
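To make that concrete, here is a small demonstration (not specific to sklearn) using pickletools to print the pickle opcodes; it shows that the pickled bytes contain only a reference to the class plus the instance state, never the method code:

import pickle
import pickletools

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}"

blob = pickle.dumps(Greeter("world"))

# The disassembly shows an opcode naming __main__ / Greeter and the state dict
# {'name': 'world'}; nothing about the body of greet(). Unpickling therefore has
# to import the class again, which is why sklearn must be installed in the container.
pickletools.dis(blob)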
The pickle is just the representation of the data inside the model. You still need the code to use it; that's why you need to have sklearn inside the container.
joblib.dump() seems to be the intended method for storing a trained sklearn model for later loading and use. I like the compression option and the ease of use, but loading with joblib.load() later is slow: it takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, as a recommended compromise) takes up 100 MB as a dumped file. Roughly what I do is sketched below.
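(The filename and the unfitted placeholder estimator here are just for illustration; in the real pipeline the SVM is fitted on the text features first.)

import joblib
from sklearn.svm import SVC

model = SVC()  # stands in for the classifier fitted on ~10k texts
# ... model.fit(X, y) in the real pipeline ...

joblib.dump(model, "svm_model.joblib", compress=3)  # ~100 MB on disk in my case
model = joblib.load("svm_model.joblib")             # this load takes ~20 s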
For my own use (analysis), I needn't worry about the loading speed, but I have one model I'd like to share with colleagues, and I would like to make it as easy and quick as possible for them. I've found examples of alternatives to joblib, e.g. pickle, json or hdfs, but all are basically the same idea: dump a binary object to disk.
I've also found langid.py as an example of something I suspect was created in a similar manner to a sklearn model. It looks to me like the whole model is encoded as some kind of string (base64?) and simply pasted into the script itself. Is this a viable solution? How is it done?
Are there any other strategies that might be viable?
When storing a classifier trained with sklearn, I have a choice between pickle (or cPickle) and joblib.dump().
Is there any benefits apart from performance to using joblib.dump()? Can a classifier saved by pickle produce worse results than the one saved with joblib?
They actually use the same protocol (i.e. joblib uses pickle). Check out the documentation for joblib.dump - you can specify the level of pickle compression using arguments to joblib.
joblib works especially well with NumPy arrays, which are used heavily by sklearn, so depending on the classifier type you might see performance and size benefits from using joblib.
Otherwise pickle works correctly too: saving a trained classifier and loading it again will produce the same results no matter which of the two serialization libraries you use. See also the sklearn docs on this topic.
Please note that older sklearn versions bundled joblib as sklearn.externals.joblib; in recent versions joblib is installed and imported as a standalone package.
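A minimal sketch comparing the two approaches (the filenames are placeholders; the iris data just gives something to fit):

import pickle
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Plain pickle: works on any picklable Python object.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# joblib: same pickle protocol underneath, but more efficient for objects that
# carry large NumPy arrays, with built-in compression (levels 0-9).
joblib.dump(clf, "model.joblib", compress=3)

restored = joblib.load("model.joblib")
assert (restored.predict(X) == clf.predict(X)).all()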
I have a trained scikit-learn model and now I want to use it in my Python code.
Is there a way I can re-use the same model instance?
The simple way is to load the model whenever I need it, but since I need it frequently, I want to load the model once and reuse it.
Is there a way I can achieve this in python?
Here is the code for one thread in prediction.py:
clf = joblib.load('trainedsgdhuberclassifier.pkl')
clf.predict(userid)
Now, for another user, I don't want to start prediction.py again and spend time loading the model. Is there a way I can simply write:
new_recommendations = prediction(userid)
Is it multiprocessing that I should be using here? I am not sure!
As per the scikit-learn documentation, the following code may help you:
from sklearn import svm
from sklearn import datasets

# Train a small example classifier on the iris data.
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

import pickle

# Serialize the fitted classifier to a byte string and restore it.
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])  # predict expects a 2D array, hence the slice
In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but can only pickle to disk and not to a string:
import joblib  # in older sklearn versions this was sklearn.externals.joblib
joblib.dump(clf, 'filename.pkl')
Later you can load back the pickled model (possibly in another Python process) with:
clf = joblib.load('filename.pkl')
Once you have loaded your model again, you can re-use it without retraining it:
clf.predict(X[0:1])
Source: http://scikit-learn.org/stable/modules/model_persistence.html
First, you should check how much of a bottleneck this is and whether it is really worth avoiding the IO. An SGDClassifier is usually quite small. You can easily reuse the model, but I would say the question is not really how to reuse the model, but how to get new user instances to the classifier.
I would imagine userid is a feature vector, not an ID, right?
To make the model predict on new data, you need some kind of event-based processing that calls the model when a new input arrives.
I am by no means an expert here, but I think one simple solution might be to expose an HTTP interface using a lightweight server like Flask.
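A rough sketch of that idea, assuming the pickled classifier from the question and a JSON payload carrying one feature vector (the route and the payload shape are made up for illustration):

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once, when the server process starts, not on every request.
clf = joblib.load("trainedsgdhuberclassifier.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # one feature vector
    prediction = clf.predict([features])       # predict expects a 2D array
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)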