I have trained a prediction model using scikit-learn, and used pickle to save it to hard drive. The pickle file is 58M, which is quite sizable.
To use the model, I wrote something like this:
import pickle

def loadModel(pkl_fn):
    # Pickle data is binary, so open the file in 'rb' mode
    with open(pkl_fn, 'rb') as f:
        return pickle.load(f)

if __name__ == "__main__":
    import sys
    feature_vals = read_features(sys.argv[1])
    model = loadModel("./model.pkl")
    # predict
    # model.predict(feature_vals)
I am wondering about the efficiency when running the program many times in command line.
Pickle files are supposed to be fast to load, but is there any way to speed this up further? Can I compile the whole thing into a binary executable?
If you are worried about loading time, you can use joblib.dump and joblib.load, which are more efficient than pickle for scikit-learn models that hold large NumPy arrays.
For a full (pretty straightforward) example see the docs or this related answer from ogrisel:
Save classifier to disk in scikit-learn
Related
First of all, let me introduce myself. I am a young researcher interested in machine learning. I created a model and trained, tested, and validated it. Now I would like to know if there is a way to save my trained model.
I would also like to know whether the saved model keeps its trained state.
Finally, is there a way to use the saved (and trained) model with new data without having to train the model again?
I work with Python!
Welcome to the community.
Yes, you may save the trained model and reuse it later. There are several ways to do so and I will introduce you to a couple of them here. However, please note which library you used to build your model and use a method for that library.
Pickle: Pickle is the standard way of serializing objects in Python.
import pickle
# Use context managers so the files are closed after writing and reading
with open(filename, 'wb') as f:
    pickle.dump(model, f)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
Joblib: Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs.
import joblib
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Finally, as suggested by others, if you used a library such as TensorFlow to build and train your model, note that it has its own extensive facilities for saving and loading models. Please check the following information:
Tensorflow Save and Load model
There may be a better way to do this but this is how I have done it before, with python.
So you have a ML model that you have trained. That model is basically just a set of parameters. Depending on what module you are using, you can save those parameters in a file, and import them to regenerate your model later.
Perhaps a simpler way is to save the model object entirely in a file using pickling.
https://docs.python.org/3/library/pickle.html
You can dump the object into a file, and load it back when you want to run it again.
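To illustrate the "model is basically just a set of parameters" idea, here is a minimal sketch that stores the parameters of a toy linear model as JSON and rebuilds the model from them. `LinearModel`, `save_params`, and `load_params` are hypothetical names invented for this example and are not part of any library; a real library will usually have its own way of exporting parameters.

```python
import json
import os
import tempfile


class LinearModel:
    """Toy stand-in for a trained model: y = w . x + b (hypothetical)."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias


def save_params(model, path):
    # Only the learned numbers are written, not the Python object itself.
    with open(path, "w") as f:
        json.dump({"weights": model.weights, "bias": model.bias}, f)


def load_params(path):
    # Rebuild an equivalent model from the stored parameters.
    with open(path) as f:
        params = json.load(f)
    return LinearModel(params["weights"], params["bias"])


if __name__ == "__main__":
    trained = LinearModel([0.5, -1.0], 2.0)  # pretend this came from training
    path = os.path.join(tempfile.mkdtemp(), "params.json")
    save_params(trained, path)
    restored = load_params(path)
    print(restored.predict([4.0, 1.0]) == trained.predict([4.0, 1.0]))
```

Compared with pickling the whole object, this keeps the file readable and independent of the class definition, at the cost of writing the save and load code yourself.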
I am teaching myself to code Convolutional Neural Networks. In particular I am looking at the "Dogs vs. Cats" challenge (https://medium.com/#mrgarg.rajat/kaggle-dogs-vs-cats-challenge-complete-step-by-step-guide-part-2-e9ee4967b9). I am using PyCharm.
In PyCharm, is there a way of using the trained model to make a prediction on the test data without having to run the entire file each time (and thus retrain the model each time)? Additionally, is there a way to skip the part of the script that prepares the data for input into the CNN? In a similar manner, does PyCharm store variables? Can I print individual variables after the script has been run?
Would it be better if I used a different IDE?
You can use joblib to save the trained model to a file and load it later for predictions.
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
# Save the model as a pickle in a file
joblib.dump(knn, 'filename.pkl')
# Load the model from the file
knn_from_joblib = joblib.load('filename.pkl')
# Use the loaded model to make predictions
knn_from_joblib.predict(X_test)
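One way to avoid retraining on every run is to check for a saved model at the top of the script and train only when none exists. The sketch below shows that pattern with plain pickle; `train_model`, `get_model`, and `MODEL_PATH` are hypothetical names, and the dictionary is only a stand-in for a fitted model. For a neural network library you would likely use that library's own save and load functions instead.

```python
import os
import pickle
import tempfile

# Hypothetical location for the saved model.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "example_model.pkl")


def train_model():
    # Placeholder for the slow part: data preparation plus training.
    print("training...")
    return {"threshold": 0.5}  # stand-in for a fitted model object


def get_model(path=MODEL_PATH):
    # Reuse the saved model when it exists; otherwise train once and save.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_model()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


if __name__ == "__main__":
    model = get_model()  # the first run trains; later runs just load the file
    print(model)
```

Deleting the saved file forces a retrain on the next run, which is a simple way to refresh the model after changing the training code.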
I have trained a scikit-learn model and now I want to use it in my Python code.
Is there a way I can re-use the same model instance?
The simple approach is to load the model again whenever I need it, but since I need it frequently, I want to load the model once and reuse it.
Is there a way I can achieve this in python?
Here is the code for one thread in prediction.py:
clf = joblib.load('trainedsgdhuberclassifier.pkl')
clf.predict(userid)
Now, for another user, I don't want to start prediction.py again and spend time loading the model. Is there a way I can simply write:
new_recommendations = prediction(userid)
Is it multiprocessing that I should be using here? I am not sure.
As per the Scikit-learn documentation the following code may help you:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])  # predict expects a 2D array, so slice rather than index
In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on objects that carry large NumPy arrays internally, as is often the case for fitted scikit-learn estimators, but can only pickle to disk and not to a string:
import joblib  # formerly sklearn.externals.joblib, which newer scikit-learn releases no longer ship
joblib.dump(clf, 'filename.pkl')
Later you can load back the pickled model (possibly in another Python process) with:
clf = joblib.load('filename.pkl')
Once you have loaded your model again, you can reuse it without retraining:
clf.predict(X[0:1])
Source: http://scikit-learn.org/stable/modules/model_persistence.html
First, you should check how much of a bottleneck this really is and whether it is worth avoiding the I/O. An SGDClassifier is usually quite small. You can easily reuse the model, but I would say the question is not really about how to reuse the model; it is about how to get the new user instances to the classifier.
I would imagine userid is a feature vector, not an ID, right?
To make the model predict on new data, you need some kind of event-based processing that calls the model when a new input arrives.
I am by no means an expert here, but I think one simple solution might be to expose an HTTP interface using a lightweight server such as Flask.
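Whatever transport delivers the new inputs, the "load once, reuse many times" part can be sketched inside a single long-running process by caching the loader. In the sketch below, `get_model` and `prediction` are hypothetical names and a small dictionary of weights stands in for the real classifier (the actual code would load trainedsgdhuberclassifier.pkl with joblib). A Flask route or any other event handler could then call `prediction` for each request.

```python
import functools
import os
import pickle
import tempfile


@functools.lru_cache(maxsize=None)
def get_model(path):
    # The body runs only on the first call per path; later calls
    # return the same in-memory object without touching the disk.
    with open(path, "rb") as f:
        return pickle.load(f)


def prediction(features, path):
    # Stand-in scoring rule: a linear score thresholded at zero.
    model = get_model(path)
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1 if score > 0 else 0


if __name__ == "__main__":
    # Write a stand-in "model" to disk so the example is self-contained.
    path = os.path.join(tempfile.mkdtemp(), "model.pkl")
    with open(path, "wb") as f:
        pickle.dump({"weights": [1.0, -2.0], "bias": 0.5}, f)

    print(prediction([0.2, 0.9], path))   # loads the model from disk
    print(prediction([1.0, 0.1], path))   # reuses the cached model
    print(get_model.cache_info().misses)  # number of actual loads
```

The cache lives only as long as the process does, so this helps a server or worker that stays running, not a script that is launched afresh for each user.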
I have trained a LinearSVC in a python script named main.py, for training an image classification algorithm. The model looks like this.
lin_clf = svm.LinearSVC()
lin_clf.fit(feature,g)
I need to use this trained model to predict image classes in another piece of code. How do I export the generated model, i.e. lin_clf, to the other code?
Thank you in advance.
I understand that your "other code" is another python script.
In this case, you can certainly use the pickle or shelve modules to write lin_clf to disk in main.py, and to read it from disk in the script that will use the model.
Here is an example showing how to write the lin_clf object to disk using shelve:
import shelve
a = shelve.open("output")
a['lin_clf'] = lin_clf
a.close()
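For completeness, reading the object back in the second script might look like the sketch below. A plain dictionary stands in for lin_clf so the example runs without scikit-learn, and the temporary path is only for illustration; in practice both scripts would open the same "output" shelf.

```python
import os
import shelve
import tempfile

# Temporary location used only for this illustration.
shelf_path = os.path.join(tempfile.mkdtemp(), "output")

# Writer side (what main.py does), with a stand-in for lin_clf.
with shelve.open(shelf_path) as db:
    db["lin_clf"] = {"coef": [0.1, 0.2], "intercept": -0.3}

# Reader side (the other script): open the same shelf and fetch the object.
with shelve.open(shelf_path) as db:
    lin_clf = db["lin_clf"]

print(lin_clf)
```

Since shelve stores its values with pickle, the reading script needs the same libraries importable as the script that wrote the shelf.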
I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data using this Python code:
import pickle
f = open('classifier.pickle', 'rb')  # open in binary mode, as pickle requires in Python 3
classifier = pickle.load(f)
f.close()
My problem is that whenever new test data arrives, I have to load this classifier into memory again, which takes a long time (2-3 minutes) because it is large. Also, if I run two instances of the same sentiment analysis program, they use twice the RAM, since each program loads the classifier separately. My question is: is there a technique for keeping this classifier in memory so that the sentiment analysis programs can read it directly from memory whenever needed, or is there some other way to minimize the classifier's load time?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or you can store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained using different training data, or are you using the same classifier in parallel? It sounds like the latter from your usage of "two instances", and in that case you may want to look into threading to allow the same classifier to work with two sets of data (some parallelism may be achieved by classifying some of the data, then doing other work such as results processing while the other thread classifies, and repeating).
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.
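As a rough sketch of the threading idea, the snippet below creates one classifier object and lets several worker threads share it, so only one copy sits in memory. `ToyClassifier` and `classify_batch` are hypothetical stand-ins for the NLTK classifier and its calling code, and the sketch assumes that classification only reads the model's state, which should be verified for the real classifier. Threads share memory only within one process, so two separately launched programs would still each load their own copy.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyClassifier:
    """Stand-in for a large classifier loaded once (hypothetical)."""

    def __init__(self):
        self.positive_words = {"good", "great", "love"}

    def classify(self, text):
        # Read-only use of shared state, so concurrent calls do not interfere.
        words = set(text.lower().split())
        return "pos" if words & self.positive_words else "neg"


classifier = ToyClassifier()  # created a single time, shared by all threads


def classify_batch(texts, workers=2):
    # Each worker thread calls the same classifier object; map keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classifier.classify, texts))


if __name__ == "__main__":
    print(classify_batch(["I love this", "terrible service", "great value"]))
```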