I have standardized my data in sklearn using preprocessing.StandardScaler. The question is: how can I save this locally for later use?
Thanks
If I understand you correctly, you want to save your trained model so it can be loaded again, correct?
There are two methods: using Python's pickle, or using joblib. The recommended method is joblib, as this will result in a much smaller file than a pickle, which dumps a string representation of your object:
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
#then load it later, remember to import joblib of course
clf = joblib.load('filename.pkl')
See the online docs
Note: sklearn.externals.joblib is deprecated. Install and use the standalone joblib package instead.
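On recent scikit-learn versions, the equivalent with the standalone joblib package would look roughly like this (the file name is just an example):
import joblib  # pip install joblib
joblib.dump(clf, 'filename.pkl')  # save the fitted estimator to disk
clf = joblib.load('filename.pkl')  # load it back later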
Related
I was given a pickle file with a trained gradient boosting model that was trained by someone else on another machine. I realised that I could not load this pickle file on my machine using
import pickle
with open('gb_model.pickle', 'rb') as f:
    gbmodel = pickle.load(f)
My current version is scikit-learn==0.24.2. I got the error ModuleNotFoundError: No module named 'sklearn.ensemble.gradient_boosting'. I then tried installing other versions of sklearn, but I keep getting other errors related to sklearn. I also tried using joblib but got the same results:
from joblib import load
clf = load('gb_model.pickle')
I realised I need to load the pickled file with the same sklearn version it was trained with. I saw here that one is able to check the version after loading it, but it seems like I can't even load the pickle file. Is there another way of doing this? I want to end up being able to load the pickled model. According to the official documentation, metadata should ideally be saved along with the pickled model, but I was not provided with this. Is there a way to obtain it from the pickle file alone?
If you trained the model with sklearn version 0.18 or higher, then try:
import pickle
clf = pickle.load(open('gb_model.pickle', 'rb'))
clf.__getstate__()['_sklearn_version']  # the sklearn version the model was pickled with
However, there is no module called gradient_boosting inside sklearn.ensemble in your version, which is what's causing the problem. The closest thing would be the sklearn.ensemble.GradientBoostingClassifier class, or this module from OpenML (which I had never heard of).
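One possible workaround, sketched here under the assumption that the model was trained with an older scikit-learn: create a separate environment with a pre-0.22 scikit-learn (where the sklearn.ensemble.gradient_boosting module path still exists), load the pickle there, and inspect the training version:
# assumes an environment with an older scikit-learn, e.g. pip install "scikit-learn<0.22"
import pickle
with open('gb_model.pickle', 'rb') as f:
    gbmodel = pickle.load(f)
print(gbmodel.__getstate__()['_sklearn_version'])  # version the model was pickled with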
I am curious to find out whether there is an accepted solution for saving sklearn objects to JSON instead of pickling them.
I'm interested in this because saving to JSON will take up much less storage and make saving the objects to DBs like Redis much more straightforward.
In particular, for something like a ColumnTransformer, all I need is the mean and std for a specific feature. With that I can easily rebuild the transformer, but when reconstructing a transformer object from the saved JSON, I have to manually set learned and private attributes, which feels hacky.
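For concreteness, here is a rough sketch of the kind of manual reconstruction I mean, using a plain StandardScaler rather than a full ColumnTransformer (X stands in for my training data; mean_, scale_ and var_ are the fitted attributes scikit-learn sets internally):
import json
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)  # X: training data
state = json.dumps({'mean': scaler.mean_.tolist(),
                    'scale': scaler.scale_.tolist(),
                    'var': scaler.var_.tolist()})
# later: rebuild the scaler by manually setting the learned (underscore) attributes
loaded = json.loads(state)
restored = StandardScaler()
restored.mean_ = np.array(loaded['mean'])
restored.scale_ = np.array(loaded['scale'])
restored.var_ = np.array(loaded['var'])
restored.n_features_in_ = restored.mean_.shape[0]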
The closest thing I've found is this article: https://stackabuse.com/scikit-learn-save-and-restore-models/
Is this how others are going about this?
What is stopping sklearn from building this functionality into the library?
Thanks!
I think this package is what you are looking for: https://pypi.org/project/sklearn-json/
Export scikit-learn model files to JSON for sharing or deploying predictive models with peace of mind.
This code snippet is from the link above and shows how to export sklearn models to json:
import sklearn_json as skljson
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)  # X, y: your training data and labels
skljson.to_json(model, file_name)  # file_name: path for the output JSON file
deserialized_model = skljson.from_json(file_name)
deserialized_model.predict(X)
Furthermore, to answer the JSON vs. pickle question, this might be helpful: Pickle or json?
In the IO section of the Kedro API docs I could not find any functionality for storing trained models (e.g. .pkl, .joblib, ONNX, PMML). Have I missed something?
There is the pickle dataset in kedro.io, which you can use to save trained models and/or anything else you want to pickle, as long as it is serialisable (models being a common case). It accepts a backend argument that defaults to pickle but can be set to joblib if you want to use joblib instead.
I'm just going to quickly note that Kedro is moving its datasets to kedro.extras.datasets and moving away from having non-core datasets in kedro.io. You might want to look at kedro.extras.datasets and, from Kedro 0.16 onwards, pickle.PickleDataSet with joblib support.
The Kedro spaceflights tutorial in the documentation actually saves the trained linear regression model using the pickle dataset if you want to see an example of it. The relevant section is here.
There is PickleDataSet in https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pickle.PickleDataSet.html, and joblib support in PickleDataSet is in the next release (see https://github.com/quantumblacklabs/kedro/blob/develop/RELEASE.md).
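A minimal sketch of what saving and loading a trained model with it might look like (assuming a Kedro version where kedro.extras.datasets.pickle.PickleDataSet and its joblib backend are available; the file path and variable names are only illustrative):
from kedro.extras.datasets.pickle import PickleDataSet
model_dataset = PickleDataSet(filepath="data/06_models/regressor.pkl", backend="joblib")
model_dataset.save(regressor)  # regressor: your fitted sklearn estimator
regressor = model_dataset.load()  # later, load it back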
When storing a classifier trained with sklearn, I have a choice between pickle (or cPickle) and joblib.dump().
Is there any benefits apart from performance to using joblib.dump()? Can a classifier saved by pickle produce worse results than the one saved with joblib?
They actually use the same protocol (i.e. joblib uses pickle). Check out the documentation for joblib.dump - you can specify the level of pickle compression using arguments to joblib.
joblib works especially well with NumPy arrays, which are used by sklearn, so depending on the classifier type you use, you might see performance and size benefits with joblib.
Otherwise pickle does work correctly, so saving a trained classifier and loading it again will produce the same results no matter which of the two serialization libraries you use. See also the sklearn docs on this topic.
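For example, a quick sketch of the compression argument mentioned above (the file name and compression level are only illustrative):
import joblib
# compress accepts 0-9 (or a (method, level) tuple); higher means smaller files but slower dump/load
joblib.dump(clf, 'classifier.joblib', compress=3)
clf = joblib.load('classifier.joblib')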
Please note that joblib used to be bundled with sklearn (as sklearn.externals.joblib), but that has been deprecated in favour of installing the standalone joblib package.
I have trained a scikit-learn model and now I want to use it in my Python code.
Is there a way I can re-use the same model instance?
In a simple way, I can load the model again whenever I need it, but as my needs are more frequent, I want to load the model once and reuse it.
Is there a way I can achieve this in python?
Here is the code for one thread in prediction.py:
import joblib
clf = joblib.load('trainedsgdhuberclassifier.pkl')
clf.predict(userid)
Now, for another user, I don't want to start prediction.py again and spend time loading the model. Is there a way I can simply write:
new_recommendations = prediction(userid)
Is it multiprocessing that I should be using here? I am not sure!
As per the scikit-learn documentation, the following code may help you:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])  # predict expects a 2D array, so slice rather than index a single sample
In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but which can only pickle to disk and not to a string:
from sklearn.externals import joblib  # deprecated; on recent versions use: import joblib
joblib.dump(clf, 'filename.pkl')
Later you can load back the pickled model (possibly in another Python process) with:
clf = joblib.load('filename.pkl')
Once you have loaded your model again, you can re-use it without retraining it.
clf.predict(X[0:1])
Source: http://scikit-learn.org/stable/modules/model_persistence.html
First, you should check how much of a bottleneck this really is and whether it is worth avoiding the IO. An SGDClassifier is usually quite small. You can easily reuse the model, but I would say the question is not really how to reuse the model, but how to get the new user instances to the classifier.
I would imagine userid is a feature vector, not an ID, right?
To make the model do predictions on new data, you need some kind of event-based processing that calls the model when a new input arrives.
I am by far no expert here, but I think one simple solution might be to use an HTTP interface with a lightweight server like Flask.
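A rough sketch of that idea (the route, payload shape, and model file name are just placeholders; this assumes Flask and joblib are installed):
import joblib
from flask import Flask, jsonify, request
app = Flask(__name__)
clf = joblib.load('trainedsgdhuberclassifier.pkl')  # loaded once at startup, reused for every request
@app.route('/predict', methods=['POST'])
def predict():
    features = request.get_json()['features']  # expects a JSON body like {"features": [...]}
    prediction = clf.predict([features])
    return jsonify({'prediction': prediction.tolist()})
if __name__ == '__main__':
    app.run()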