When storing the classifier trained with sklearn I have a choice between pickle (or cPickle) and joblib.dump().
Is there any benefits apart from performance to using joblib.dump()? Can a classifier saved by pickle produce worse results than the one saved with joblib?
They actually use the same protocol (i.e. joblib uses pickle under the hood). Check out the documentation for joblib.dump; you can specify the pickle compression level through its arguments.
joblib works especially well with the large NumPy arrays that sklearn estimators hold internally, so depending on the classifier type you use you may see performance and file-size benefits with joblib.
Otherwise, pickle works correctly too: saving a trained classifier and loading it again will produce the same results regardless of which serialization library you use. See also sklearn's own docs on model persistence.
Please note that joblib is included in sklearn.
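As a rough illustration, both approaches look like this (a minimal sketch; the estimator, file names and compression level are arbitrary, and X, y are assumed to be your training data):
import pickle
import joblib
from sklearn.linear_model import LogisticRegression  # any sklearn estimator works the same way

clf = LogisticRegression().fit(X, y)  # X, y assumed to be defined

# plain pickle
with open('clf.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('clf.pkl', 'rb') as f:
    clf_restored = pickle.load(f)

# joblib, optionally with compression (0-9)
joblib.dump(clf, 'clf.joblib', compress=3)
clf_restored = joblib.load('clf.joblib')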
Related
I am curious to find if there is an accepted solution to saving sklearn objects to json, instead of pickling them.
I'm interested in this because saving to json would take up much less storage and make saving the objects to databases like Redis much more straightforward.
In particular, for something like a ColumnTransformer, all I need is the mean and std for a specific feature. With that, I can easily rebuild the transformer, but when reconstructing a transformer object from the saved json object, I have to manually set learned and private attributes, which feels hacky.
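To make the "hacky" part concrete, here is roughly what I mean for a StandardScaler (just a sketch; mean_, scale_ etc. are sklearn's fitted attributes, the file name is made up, and X is assumed to be the training data):
import json
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)  # X assumed to be defined

# save only the learned statistics as json
with open('scaler.json', 'w') as f:
    json.dump({'mean': scaler.mean_.tolist(), 'scale': scaler.scale_.tolist()}, f)

# rebuild: set the fitted attributes by hand, which is the part that feels hacky
with open('scaler.json') as f:
    params = json.load(f)
restored = StandardScaler()
restored.mean_ = np.array(params['mean'])
restored.scale_ = np.array(params['scale'])
restored.var_ = restored.scale_ ** 2
restored.n_features_in_ = restored.mean_.shape[0]
# restored.transform(X) should now match scaler.transform(X)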
The closest thing I've found is this article: https://stackabuse.com/scikit-learn-save-and-restore-models/
Is this how others are going about this?
What is stopping sklearn from building this functionality into the library?
Thanks!
I think this package is what you are looking for: https://pypi.org/project/sklearn-json/
Export scikit-learn model files to JSON for sharing or deploying predictive models with peace of mind.
This code snippet is from the link above and shows how to export sklearn models to json:
import sklearn_json as skljson
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)  # X, y: your training data

skljson.to_json(model, file_name)  # file_name: path of the json file to write
deserialized_model = skljson.from_json(file_name)
deserialized_model.predict(X)
Furthermore, to answer the json vs. pickle question, this might be helpful:
Pickle or json?
In the IO section of the Kedro API docs I could not find any functionality for storing trained models (e.g. .pkl, .joblib, ONNX, PMML). Have I missed something?
There is a pickle dataset in kedro.io that you can use to save trained models, or anything else you want to pickle that is serialisable (models being a common case). It accepts a backend argument that defaults to pickle but can be set to joblib if you want to use joblib instead.
A quick note: Kedro is moving its non-core datasets out of kedro.io and into kedro.extras.datasets. You might want to look at kedro.extras.datasets, and from Kedro 0.16 onwards at pickle.PickleDataSet with joblib support.
The Kedro spaceflights tutorial in the documentation actually saves the trained linear regression model using the pickle dataset if you want to see an example of it. The relevant section is here.
There is PickleDataSet in https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pickle.PickleDataSet.html and joblib support in PickleDataSet is in the next release (see https://github.com/quantumblacklabs/kedro/blob/develop/RELEASE.md)
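In code, using that dataset looks roughly like this (a sketch assuming Kedro 0.16+; the file path is arbitrary, and backend="joblib" needs the release that adds joblib support):
from sklearn.linear_model import LinearRegression
from kedro.extras.datasets.pickle import PickleDataSet

model = LinearRegression().fit(X, y)  # X, y assumed to be defined

# backend defaults to "pickle"; "joblib" becomes available once joblib support lands
data_set = PickleDataSet(filepath="data/06_models/regressor.pkl", backend="joblib")
data_set.save(model)
reloaded = data_set.load()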
joblib.dump() seems to be the intended method for storing a trained sklearn model for later loading and use. I like the compression option and the ease of use, but loading with joblib.load() later is slow: it takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, as a recommended compromise) takes 100MB as a dumped file.
For my own use (analysis), I needn't worry about the speed of loading a model, but I have one I'd like to share with colleagues, and I would like to make it as easy and quick as possible for them. I find examples of alternatives to joblib, e.g. pickle, json or hdf5. All are basically the same idea: dump a binary object to disk.
I came across langid.py as an example of what I suspect was created in a similar manner to a sklearn model. To me it looks like the whole model is encoded as some kind of string (base64?) and simply pasted into the script itself. Is this a viable solution? How is it done?
Are there any other strategies that might be viable?
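To clarify what I mean by the embedded-string approach, something along these lines (just my guess at the technique, not how langid.py actually does it; clf stands for a trained model):
import base64
import pickle
import zlib

# serialize, compress and encode the trained model as a plain text string
encoded = base64.b64encode(zlib.compress(pickle.dumps(clf))).decode('ascii')

# that string could then be pasted into a script and decoded at import time
clf_restored = pickle.loads(zlib.decompress(base64.b64decode(encoded)))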
There seem to be a few options for exporting PMML models out of scikit-learn, such as sklearn2pmml, but a lot less information about going in the other direction. My case is an XGBoost model previously built in R and saved to PMML using r2pmml, which I would like to use in Python. scikit-learn normally uses pickle to save/load models, but is it also possible to import models into scikit-learn using PMML?
You can't connect different specialized representations (such as R and Scikit-Learn native data structures) over a generalized representation (such as PMML). You may have better luck trying to translate R data structures to Scikit-Learn data structures directly.
XGBoost is really an exception to the above rule, because its R and Scikit-Learn implementations are just thin wrappers around the native XGBoost library. Inside a trained R XGBoost object there is a raw blob, which is the model in its native XGBoost representation. Save it to a file, and load it in Python using the xgb.Booster.load_model(fname) method.
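A minimal sketch of the Python side (assuming the R side has written the booster out in XGBoost's native format, e.g. via xgb.save; the file name and X_new are placeholders):
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("xgb_model.bin")  # native XGBoost format exported from R, not PMML

# predictions go through a DMatrix; X_new is hypothetical new data with the same features
preds = booster.predict(xgb.DMatrix(X_new))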
If you know that you need to deploy the XGBoost model in Scikit-Learn, then why train it in R in the first place?
I have standardized my data in sklearn using preprocessing.StandardScaler. The question is: how can I save this locally for later use?
Thanks
If I understand you correctly, you want to save your fitted scaler/model so it can be loaded again later, correct?
There are two methods: Python's pickle, and joblib. The recommended method is joblib, since it handles the large NumPy arrays held inside fitted sklearn objects more efficiently and typically produces a much smaller file than a plain pickle:
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')  # clf is your fitted estimator, e.g. the StandardScaler
# then load it later; remember to import joblib again
clf = joblib.load('filename.pkl')
See the online docs
Note: sklearn.externals.joblib is deprecated. Install and use the standalone joblib package instead.
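With the standalone joblib package, saving and loading the fitted scaler from the question looks roughly like this (a minimal sketch; the file name is arbitrary and X is assumed to be your data):
import joblib
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)  # X assumed to be defined

joblib.dump(scaler, 'scaler.joblib')

# later, in another session
scaler = joblib.load('scaler.joblib')
X_scaled = scaler.transform(X)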