I am trying to save a large classifier; unfortunately, I keep getting a memory error due to its size. I have tried pickle and joblib. Does anyone know if streaming pickle would work? Are scikit-learn model objects iterable?
I am curious to find out whether there is an accepted solution for saving sklearn objects to JSON instead of pickling them.
I'm interested in this because saving to JSON will take up much less storage and make saving the objects to databases like Redis much more straightforward.
In particular, for something like a ColumnTransformer, all I need is the mean and std for a specific feature. With that, I can easily rebuild the transformer, but when reconstructing a transformer object from the saved JSON, I have to manually set learned and private attributes, which feels hacky.
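For illustration, here is roughly the hacky reconstruction I mean, using a StandardScaler (where the learned state is just mean_ and scale_) as a stand-in; X_train here is assumed to be whatever data the transformer was fitted on:
import json
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit once, then persist only the learned statistics as JSON
scaler = StandardScaler().fit(X_train)
state = {"mean": scaler.mean_.tolist(), "scale": scaler.scale_.tolist()}
with open("scaler.json", "w") as f:
    json.dump(state, f)

# Rebuild the transformer by manually restoring its fitted (underscore) attributes
with open("scaler.json") as f:
    state = json.load(f)
restored = StandardScaler()
restored.mean_ = np.asarray(state["mean"])
restored.scale_ = np.asarray(state["scale"])
restored.n_features_in_ = restored.mean_.shape[0]
# restored.transform(new_data) now behaves like the original scaler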
The closest thing I've found is this article: https://stackabuse.com/scikit-learn-save-and-restore-models/
Is this how others are going about this?
What is stopping sklearn from building this functionality into the library?
Thanks!
I think this package is what you are looking for: https://pypi.org/project/sklearn-json/
Export scikit-learn model files to JSON for sharing or deploying predictive models with peace of mind.
This code snippet is from the link above and shows how to export sklearn models to JSON:
import sklearn_json as skljson
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Example data and output path so the snippet runs end to end
X, y = make_classification(n_samples=100, random_state=0)
file_name = 'model.json'

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)

# Serialize the fitted model to JSON, then load it back and predict
skljson.to_json(model, file_name)
deserialized_model = skljson.from_json(file_name)
deserialized_model.predict(X)
Furthermore, to answer the JSON vs. pickle question, this might be helpful: Pickle or json?
I am a beginner in machine learning, and I have a small question.
I have been working on machine learning code where I am using TensorFlow to predict future values. The code uses a dataset that applies one-hot encoding first; the one-hot-encoded columns are then combined and the estimator is created. For the estimator model, I used DNNRegressor. No Keras is used anywhere in the code.
model = tf.estimator.DNNRegressor(
    hidden_units=[100, 100, 100],
    feature_columns=feature_columns,
    optimizer=tf.optimizers.Adam(learning_rate=0.01),
    activation_fn=tf.nn.relu,
)
Now, I tried using Pickle for saving. However, I get this error:
AttributeError: Can't pickle local object 'DNNRegressorV2.__init__.<locals>._model_fn'
I tried the same using joblib, but I got the following issue:
PicklingError: Can't pickle <function DNNRegressorV2.__init__.<locals>._model_fn at 0x00000175F1A0E948>: it's not found as tensorflow_estimator.python.estimator.canned.dnn.DNNRegressorV2.__init__.<locals>._model_fn
Following this, I tried this code:
tf.keras.models.save_model(
    model, filepath, overwrite=True, include_optimizer=True,
    save_format=None, signatures=None, options=None,
)
But I got the error:
AttributeError: 'DNNRegressorV2' object has no attribute 'built'
I have also tried other methods such as model.save() and model.to_json(), and also tried saving using the API - none of them worked out.
Can someone help me out with this?
I pickled a model and want to expose only the prediction API, written in Flask. However, when I write a Dockerfile to build an image without sklearn in it, I get the error ModuleNotFoundError: No module named 'sklearn.xxxx' (where xxxx refers to sklearn's ML algorithm classes) at the point where I load the model with pickle, i.e. classifier = pickle.load(f).
When I rewrite the Dockerfile to build an image that includes sklearn, I don't get the error, even though the API never imports sklearn.
My mental model of pickling is simple: it serializes the classifier class with all of its data. So when we unpickle it, since the classifier class already has a predict attribute, we can just call it. Why do I need to have sklearn in the environment?
You have a misconception of how pickle works.
It does not serialize anything except the instance state (__dict__ by default, or whatever a custom __getstate__ returns). When unpickling, it simply tries to import and instantiate the corresponding class (this is where your import error comes from) and then sets the pickled state on it.
There's a reason for this: you don't know beforehand which methods will be used after loading, so you cannot pickle the implementation. Moreover, at pickling time you cannot build an AST to see which methods/modules will be needed after deserializing, mainly because of the dynamic nature of Python: the implementation can actually vary depending on the input.
After all, even if we assume a theoretical, smart, self-contained pickle serialization, it would end up being the actual model plus sklearn in a single file, with no proper way to manage it.
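A quick way to see this in practice (a minimal sketch, using a small fitted LogisticRegression purely for demonstration): the pickle stream only records the import path of the class plus the fitted state, not the code.
import pickle
import pickletools
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
payload = pickle.dumps(clf)

# Disassembling the stream shows an opcode that references the class by its
# module path (e.g. sklearn.linear_model._logistic LogisticRegression);
# that import is exactly what fails when sklearn is missing in the container.
pickletools.dis(payload)

# What actually gets stored is just the instance state:
print(clf.__dict__.keys())   # coef_, intercept_, classes_, ...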
The pickle is just a representation of the data inside the model. You still need the code to use it; that's why you need to have sklearn inside the container.
joblib.dump() seems to be the intended method for storing a trained sklearn model for later loading and use. I like the compression option and the ease of use, but loading with joblib.load() later is slow: it takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, as a recommended compromise) takes 100 MB as a dumped file.
For my own use (analysis), I needn't worry about loading speed, but I have one model I'd like to share with colleagues, and I would like to make it as easy and quick as possible for them. I find examples of alternatives to joblib, e.g. pickle, json or hdfs. All are basically the same idea: dump a binary object to disk.
I found langid.py as an example of what I suspect is created in a similar manner to a sklearn model. To me it looks like the whole model is encoded as some kind of string (base64?) and just pasted into the script itself. Is this a solution? How is this done?
Are there any other strategies that might be viable?
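For illustration, here is roughly what I imagine that embedding trick looks like (my own sketch, assuming pickle + zlib + base64; not the actual langid.py code):
import base64
import pickle
import zlib

def embed_model(model):
    """Serialize, compress and base64-encode a model so the resulting
    string can be pasted into a .py file as a literal."""
    return base64.b64encode(zlib.compress(pickle.dumps(model))).decode("ascii")

def load_embedded_model(blob):
    """Recreate the model from the embedded string, e.g. at import time."""
    return pickle.loads(zlib.decompress(base64.b64decode(blob)))
Note that this only changes where the bytes live; sklearn still has to be installed wherever the string is decoded.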
When storing a classifier trained with sklearn I have a choice between pickle (or cPickle) and joblib.dump().
Are there any benefits apart from performance to using joblib.dump()? Can a classifier saved with pickle produce worse results than one saved with joblib?
They actually use the same protocol (i.e., joblib uses pickle under the hood). Check out the documentation for joblib.dump: you can specify the level of pickle compression via its arguments.
joblib works especially well with NumPy arrays, which are used heavily by sklearn, so depending on the classifier type you may see performance and size benefits from using joblib.
Otherwise, pickle works correctly too, so saving a trained classifier and loading it again will produce the same results regardless of which serialization library you use. See also the sklearn docs on this topic.
Please note that older versions of sklearn bundled joblib (as sklearn.externals.joblib); in recent versions it has to be installed and imported separately.
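A minimal sketch of both options side by side (clf here stands for an already-fitted estimator):
import pickle
import joblib

# joblib: one call, optional compression, efficient with large NumPy arrays
joblib.dump(clf, "clf.joblib", compress=3)
clf_from_joblib = joblib.load("clf.joblib")

# plain pickle: same protocol underneath, no built-in compression
with open("clf.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("clf.pkl", "rb") as f:
    clf_from_pickle = pickle.load(f)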