I am doing a machine learning project, and for it I am using Python's pickle module.
Basically, I am parsing a huge data set that cannot be processed in one execution, so I need to save the classifier object and update it in the next execution.
So my question is: when I run the program again with a new data set, will the already-created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?
import pickle
with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)
Unpickling your classifier object will re-create it in the same state that it was when you pickled it, so you can proceed to update it with fresh data from your data set. And at the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea to not overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
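As a starting point for that kind of experiment, here is a minimal sketch that uses a plain dictionary of word counts as a stand-in for the classifier. The names `run_once` and `counter.pickle` are made up for illustration. Each call simulates one program run: load the saved state if it exists, update it, and save it again.

```python
import os
import pickle
import tempfile


def run_once(path, words):
    """Simulate one program run: load the saved state (if any), update it, save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            counts = pickle.load(f)
    else:
        counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    with open(path, "wb") as f:
        pickle.dump(counts, f)
    return counts


# A temporary directory keeps the experiment from leaving files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "counter.pickle")
    first_run = dict(run_once(demo_path, ["spam", "eggs"]))
    second_run = dict(run_once(demo_path, ["spam"]))

print(first_run)   # state after the first run
print(second_run)  # state after the second run, built on the first
```

The second run starts from whatever the first run saved, which is exactly the behaviour you want from the classifier file.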
Here's a rough sketch of how to update the pickled classifier data.
import pickle
import os
from os.path import exists
# other imports required for nltk ...
picklename = "naivebayes.pickle"
# stuff to set up featuresets ...
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]
# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    # Note: check that this call really updates the loaded object; in NLTK,
    # NaiveBayesClassifier.train is a classmethod that returns a new classifier.
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)
# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)
# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)
The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and apply more training data to it.
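If you want the "series of backups" mentioned earlier rather than a single .bak file, one possible approach is a small numbered rotation. This is only a sketch: the function name `save_with_backups`, the `.N.bak` naming scheme, and the `keep` parameter are my own choices, not part of any library.

```python
import os
import pickle
import tempfile


def save_with_backups(obj, path, keep=3):
    """Pickle obj to path, keeping up to `keep` older versions as path.N.bak."""
    # Shift existing backups up by one: .2.bak -> .3.bak, then .1.bak -> .2.bak.
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}.bak"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}.bak")
    # The current file becomes the newest backup.
    if os.path.exists(path):
        os.replace(path, f"{path}.1.bak")
    # Write to a temporary name first, so a crash mid-write cannot leave
    # a truncated file under the real name.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "naivebayes.pickle")
    for version in range(5):
        save_with_backups({"version": version}, target, keep=3)
    saved_files = sorted(os.listdir(tmpdir))
    with open(target, "rb") as f:
        current = pickle.load(f)
    with open(target + ".1.bak", "rb") as f:
        newest_backup = pickle.load(f)
```

After five saves the directory holds the current file plus three backups, and the oldest versions have been rotated out.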
BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing
import pickle
with
import cPickle as pickle
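If the same script has to run under both Python 2 and Python 3, a common pattern is to try the cPickle import and fall back to pickle. A small sketch (the sample dictionary is just illustrative data):

```python
try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

data = {"labels": ["pos", "neg"], "count": 2}
# Protocol 2 is the highest protocol that Python 2 can read.
restored = pickle.loads(pickle.dumps(data, protocol=2))
```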
Related
I want to cache my model results in order to make predictions without redoing the clustering.
I read that I can do that with memory parameter in HDBSCAN.
Instead, I did the following, because I wanted to save the file in the same directory as my script rather than in '/tmp/joblib', as described in "HDBSCAN cluster caching and persistance":
import hdbscan
import joblib
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True).fit(data)
# save the model to disk
filename = 'finalized_model.joblib'
joblib.dump(clusterer, filename)
I then tried to load the model in a different file:
from joblib import load
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = model.approximate_predict(model, test_points)
But I got this error: AttributeError: 'HDBSCAN' object has no attribute 'approximate_predict'
Last time I got this error, it was because prediction_data was not set to True, but what's the problem now?
approximate_predict() is a function of the hdbscan package itself, not a method of an HDBSCAN object.
Here's what you need to do:
from joblib import load
import hdbscan
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = hdbscan.approximate_predict(model, test_points)
API Reference:
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict
Official documents state that "It is not recommended to use pickle or cPickle to save a Keras model."
However, my need for pickling Keras model stems from hyperparameter optimization using sklearn's RandomizedSearchCV (or any other hyperparameter optimizers). It's essential to save the results to a file, since then the script can be executed remotely in a detached session etc.
Essentially, I want to:
trial_search = RandomizedSearchCV( estimator=keras_model, ... )
pickle.dump( trial_search, open( "trial_search.pickle", "wb" ) )
As of now, Keras models are pickle-able. But we still recommend using model.save() to save model to disk.
This works like a charm http://zachmoshe.com/2017/04/03/pickling-keras-models.html:
import types
import tempfile
import keras.models
def make_keras_picklable():
    def __getstate__(self):
        model_str = ""
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            model_str = fd.read()
        d = { 'model_str': model_str }
        return d
    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            fd.write(state['model_str'])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__
    cls = keras.models.Model
    cls.__getstate__ = __getstate__
    cls.__setstate__ = __setstate__
make_keras_picklable()
PS. I had some problems because my model.to_json() raised TypeError('Not JSON Serializable:', obj) due to a circular reference. The code above somehow swallowed this error, which resulted in the pickle function running forever.
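The Keras patch above relies on pickle's `__getstate__` and `__setstate__` hooks. To see the mechanism without Keras, here is a library-free sketch. `ToyModel` is a hypothetical class, and instead of serializing to an HDF5 file it simply drops an attribute that pickle cannot handle and rebuilds it on load.

```python
import pickle
import threading


class ToyModel:
    """Hypothetical model holding one attribute that pickle cannot serialize."""

    def __init__(self, weights):
        self.weights = weights
        self.lock = threading.Lock()  # lock objects are not picklable

    def __getstate__(self):
        # Called by pickle.dump: return only what can be serialized.
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        # Called by pickle.load: restore attributes and rebuild the dropped one.
        self.__dict__.update(state)
        self.lock = threading.Lock()


original = ToyModel([0.1, 0.2, 0.3])
clone = pickle.loads(pickle.dumps(original))
```

Without the two hooks, `pickle.dumps(original)` would fail on the lock; with them, the clone comes back with the same weights and a fresh lock.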
Use get_weights and set_weights to save and load the model, respectively.
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")
# For heavy model architectures, the .h5 file is unsupported.
import pickle
weigh = model.get_weights()
pklfile = "D:/modelweights.pkl"
# Binary mode ('wb') works on both Python 2 and Python 3.
with open(pklfile, 'wb') as fpkl:
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
You can Pickle a Keras neural network by using the deploy-ml module which can be installed via pip
pip install deploy-ml
Full training and deployment of a Keras neural network using the deploy-ml wrapper looks like this:
import pandas as pd
from deployml.keras import NeuralNetworkBase
# load data
train = pd.read_csv('example_data.csv')
# define the model
NN = NeuralNetworkBase(hidden_layers = (7, 3),
                       first_layer=len(train.keys())-1,
                       n_classes=len(train.keys())-1)
# define data for the model
NN.data = train
# define the column in the data you're trying to predict
NN.outcome_pointer = 'paid'
# train the model, scale means that it's using a standard
# scaler to scale the data
NN.train(scale=True, batch_size=100)
NN.show_learning_curve()
# display the recall and precision
NN.evaluate_outcome()
# Pickle your model
NN.deploy_model(description='Keras NN',
                author="maxwell flitton", organisation='example',
                file_name='neural.sav')
The Pickled file contains the model, the metrics from the testing, a list of variable names and their order in which they have to be inputted, the version of Keras and python used, and if a scaler is used it will also be stored in the file. Documentation is here. Loading and using the file is done by the following:
import pickle
# use pickle to load the model
loaded_model = pickle.load(open("neural.sav", 'rb'))
# use the scaler to scale your data you want to input
input_data = loaded_model['scaler'].transform([[1, 28, 0, 1, 30]])
# get the prediction
loaded_model['model'].predict(input_data)[0][0]
I appreciate that the training can be a bit restrictive. Deploy-ml supports importing your own model for Sk-learn but it's still working on this support for Keras. However, I've found that you can create a deploy-ml NeuralNetworkBase object, define your own Keras neural network outside of Deploy-ml, and assign it to the deploy-ml model attribute and this works just fine:
NN = NeuralNetworkBase(hidden_layers = (7, 3),
                       first_layer=len(train.keys())-1,
                       n_classes=len(train.keys())-1)
NN.model = neural_network_you_defined_yourself
I am using CatBoostRegressor in Python version of the Catboost library.
According to documentation, it's possible to use overfitting detector, which I am doing, like this:
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
# this code was never executed
model.save_model(model_name)
However, once overfitting is detected, my Python script gets interrupted (prematurely stopped, pick any phrase you want), and the save-model part is never executed, which leads to a lot of wasted time and no results in the end. I don't get any stack trace.
Is there any possibility to handle it in CatBoost and save hours of fitting work?
Use this code. The finally block saves the model whether or not fit() raises an exception.
try:
    model.fit(X, y)
finally:
    model.save_model(model_name)
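To see what try/finally does and does not guarantee, here is a small sketch with a made-up `ToyModel` class standing in for the real regressor (it is not CatBoost, and `fail_at` exists only to simulate an interruption). The finally block runs when fit() raises, but keep in mind that it cannot run if the interpreter process itself is killed from outside, for example by the operating system.

```python
class ToyModel:
    """Stand-in for a real model; it only tracks progress and where it was saved."""

    def __init__(self):
        self.iterations_done = 0
        self.saved_to = None

    def fit(self, n_iterations, fail_at=None):
        for i in range(n_iterations):
            if i == fail_at:
                raise RuntimeError("training interrupted")
            self.iterations_done += 1

    def save_model(self, fname):
        self.saved_to = fname


toy = ToyModel()
caught = None
try:
    try:
        toy.fit(100, fail_at=40)
    finally:
        # Runs even though fit() raised, so the partial work is kept.
        toy.save_model("partial.model")
except RuntimeError as exc:
    caught = str(exc)
```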
Well, I don't know how CatBoost works, but I would like to share a different way to save/store your trained model; maybe it will help.
import pickle
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
#----To store model----------
filename = 'final_model' # name to store model
pickle.dump(model, open(filename, 'wb')) # pickling
#-----To load model------------
loaded_model = pickle.load(open(filename, 'rb'))
You can do it with pickle: just train your model and dump it using pickle.
pickle.dump(regr, open("models/svrrbf.sav",'wb'))
Later you can load that model to test your inputs.
Hope it helps
I want to apply the sklearn.preprocessing.scale function that scikit-learn offers to center a dataset that I will use to train an SVM classifier.
How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?
I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?
I think that the best way is to pickle it post fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:
scale_ : ndarray, shape (n_features,)
Per feature relative scaling of the data.
New in version 0.17: scale_ is recommended instead of deprecated std_.
mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
The following short snippet illustrates this:
from sklearn import preprocessing
import numpy as np
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)
>>> s.mean_, s.scale_
(array([ 2.5]), array([ 1.11803399]))
Scale with standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
save mean_ and var_ for later use
means = scaler.mean_
vars = scaler.var_
(you can print and copy paste means and vars or save to disk with np.save....)
Later use of saved parameters
def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds
scale_new_data = scale_data(new_data)
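For the "save to disk" step, one option that avoids both pickle and copy-pasting is a small JSON file. The sketch below uses only the standard library and toy data so it runs anywhere. It is not the np.save route mentioned above, and the names `params`, `scale_rows`, and `scaler_params.json` are made up for illustration.

```python
import json
import os
import statistics
import tempfile

# Toy training data: rows are samples, columns are features.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
columns = list(zip(*train))

# Population standard deviation (pstdev) matches what StandardScaler computes.
params = {
    "mean": [statistics.mean(col) for col in columns],
    "scale": [statistics.pstdev(col) for col in columns],
}


def scale_rows(rows, params):
    """Apply (x - mean) / scale feature by feature."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["scale"])]
        for row in rows
    ]


with tempfile.TemporaryDirectory() as tmpdir:
    params_path = os.path.join(tmpdir, "scaler_params.json")
    with open(params_path, "w") as f:
        json.dump(params, f)
    # Later (for example in another run): reload and reuse without refitting.
    with open(params_path) as f:
        loaded = json.load(f)

scaled = scale_rows([[2.5, 25.0]], loaded)
```

A row equal to the per-feature means scales to zeros, which is a quick sanity check that the reloaded parameters are the ones you fitted.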
You can use the joblib module to store the parameters of your scaler.
from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')
Later you can load the scaler.
from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)
Pickle introduces a security vulnerability: it allows attackers to execute arbitrary code on your servers. The conditions are:
the attacker can replace the pickle file with another pickle file on the server (when the pickle is not audited, i.e. no signature validation or hash comparison is performed)
the same, but on a developer PC (the attacker has compromised a dev PC)
If your server-side applications run as root (or as root inside Docker containers), then this definitely deserves your attention.
Possible solution:
Model training should be done in a secure environment
Trained models should be signed by the key from another secure environment, which is not loaded to the gpg-agent (otherwise the attacker can quite easily replace the signature)
CI should test the models in an isolated environment (quarantine)
Use Python 3.8 or later, which added security (audit) hooks that help guard against code injection techniques
or just avoid pickle:)
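As an illustration of the "signature validation or hash comparison" idea, here is a minimal sketch that checks an HMAC-SHA256 tag before unpickling. This is a simpler, symmetric-key stand-in for the GPG signing described above, not the same thing. `SECRET_KEY`, `dumps_signed`, and `loads_verified` are hypothetical names, and a real key must of course not live in the source code.

```python
import hashlib
import hmac
import pickle

# Hypothetical key for the demo only; load real keys from a secure store.
SECRET_KEY = b"replace-with-a-real-secret"


def sign(payload, key=SECRET_KEY):
    """Return the hex HMAC-SHA256 tag of the pickled bytes."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def dumps_signed(obj, key=SECRET_KEY):
    payload = pickle.dumps(obj)
    return payload, sign(payload, key)


def loads_verified(payload, signature, key=SECRET_KEY):
    # Verify BEFORE unpickling: pickle.loads itself is the dangerous step.
    if not hmac.compare_digest(sign(payload, key), signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


payload, signature = dumps_signed({"weights": [1, 2, 3]})
model = loads_verified(payload, signature)

# A modified file no longer matches its signature and is rejected.
try:
    loads_verified(payload + b"x", signature)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

This only helps if the attacker cannot also read the key or replace the stored signature, which is why the key should come from a separate, secure environment.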
Some links:
https://docs.python.org/3/library/pickle.html
Python: can I safely unpickle untrusted data?
https://github.com/pytorch/pytorch/issues/52596
https://www.python.org/dev/peps/pep-0578/
Possible approach to avoid pickling:
import numpy as np
# scaler is a fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)
# some unscaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])
scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])
This is in the context of building our own search engine in web2py.
I want to load the pickled index only once and then keep reusing it for every request that comes into my web2py application.
Is there a way to do that in a way that doesn't impact per request performance?
Loading it in model doesn't work because the model is executed on every request.
Similarly doing it in a module also will execute the code in the module on every request.
So I tried to load it in shell.py in gluon module exec_environment definition, and put the following code in it.
from gluon import current
fp = open(file_name, "r")
tree = pickle.load(fp)
fp.close()
current.tree = tree
And to use the tree, in the module I have written
from gluon import current
tree = current.tree
But there is no increase in performance: it is still very slow, the same as loading the pickle every time. Normally the search time for a query is very low, but here it still takes too much time.
Am I missing something, is what I have done incorrect, or is there a correct and better way of doing it?
In /modules/search_index.py:
import pickle
with open(file_name, "rb") as fp:  # pickle files should be opened in binary mode
    tree = pickle.load(fp)
In your app code (i.e., in model or controller):
from search_index import tree
[do something with tree]
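If you would rather not load the file at import time, a lazily cached loader gives the same "load once, reuse on every request" behaviour. This is a generic Python sketch, not web2py-specific: `get_tree` and `load_count` are made-up names, and it only helps as long as the worker process stays alive between requests and the module is not reloaded.

```python
import functools
import os
import pickle
import tempfile

load_count = 0  # counts real disk loads, only to make the caching visible


@functools.lru_cache(maxsize=None)
def get_tree(file_name):
    """Unpickle the index on the first call; later calls reuse the cached object."""
    global load_count
    load_count += 1
    with open(file_name, "rb") as fp:
        return pickle.load(fp)


with tempfile.TemporaryDirectory() as tmpdir:
    index_path = os.path.join(tmpdir, "index.pickle")
    with open(index_path, "wb") as fp:
        pickle.dump({"python": [1, 4], "pickle": [2]}, fp)
    first = get_tree(index_path)   # loads from disk
    second = get_tree(index_path)  # served from the in-process cache
```

Both calls return the very same object, and the file is read only once per process.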