Official documents state that "It is not recommended to use pickle or cPickle to save a Keras model."
However, my need to pickle a Keras model stems from hyperparameter optimization using sklearn's RandomizedSearchCV (or any other hyperparameter optimizer). It's essential to save the results to a file, since the script can then be run remotely in a detached session, etc.
Essentially, I want to:
trial_search = RandomizedSearchCV( estimator=keras_model, ... )
pickle.dump( trial_search, open( "trial_search.pickle", "wb" ) )
As of now, Keras models are pickle-able. But we still recommend using model.save() to save model to disk.
This works like a charm http://zachmoshe.com/2017/04/03/pickling-keras-models.html:
import types
import tempfile
import keras.models

def make_keras_picklable():
    def __getstate__(self):
        model_str = ""
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            model_str = fd.read()
        d = { 'model_str': model_str }
        return d

    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            fd.write(state['model_str'])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__

    cls = keras.models.Model
    cls.__getstate__ = __getstate__
    cls.__setstate__ = __setstate__

make_keras_picklable()
PS. I had some problems because my model.to_json() raised TypeError('Not JSON Serializable:', obj) due to a circular reference. This error was somehow swallowed by the code above, which resulted in the pickle function running forever.
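The patch above routes pickling through a temporary file: `__getstate__` saves the object to disk and returns the file's bytes, and `__setstate__` writes those bytes back out and reloads from them. Below is a minimal sketch of the same mechanism on a toy class, with no Keras involved. `FileBackedModel`, `save_to` and `load_from` are hypothetical stand-ins for a model and for `keras.models.save_model` / `load_model`; `tempfile.mkstemp` is used here (my choice, not the linked post's) so that the file can be reopened by name on any platform.

```python
import json
import os
import pickle
import tempfile


class FileBackedModel:
    """Hypothetical stand-in for a model that can only be saved to a file path."""

    def __init__(self, weights):
        self.weights = weights

    def save_to(self, path):
        # Stand-in for keras.models.save_model(model, path)
        with open(path, "w") as f:
            json.dump(self.weights, f)

    @classmethod
    def load_from(cls, path):
        # Stand-in for keras.models.load_model(path)
        with open(path) as f:
            return cls(json.load(f))

    def __getstate__(self):
        # Save to a temporary file, then hand pickle the raw bytes.
        fd, path = tempfile.mkstemp(suffix=".json")
        os.close(fd)
        try:
            self.save_to(path)
            with open(path, "rb") as f:
                return {"model_bytes": f.read()}
        finally:
            os.remove(path)

    def __setstate__(self, state):
        # Write the bytes back to a temporary file and reload from it.
        fd, path = tempfile.mkstemp(suffix=".json")
        os.close(fd)
        try:
            with open(path, "wb") as f:
                f.write(state["model_bytes"])
            restored = type(self).load_from(path)
        finally:
            os.remove(path)
        self.__dict__ = restored.__dict__


original = FileBackedModel([0.5, -1.25, 3.0])
restored_copy = pickle.loads(pickle.dumps(original))
```

After the round trip, `restored_copy` is a distinct object carrying the same weights, which is what RandomizedSearchCV needs when the whole search object is pickled.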
Use get_weights and set_weights to save and load the model, respectively.
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")
# for heavy model architectures, the .h5 file is unsupported.
import pickle

weights = model.get_weights()
pklfile = "D:/modelweights.pkl"
# binary mode ('wb') works on both Python 2 and Python 3
with open(pklfile, 'wb') as fpkl:
    pickle.dump(weights, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
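For the loading half of this approach: weights pickled this way carry no architecture (and no optimizer state), so the model has to be rebuilt in code first and then given the weights via `set_weights`. Here is a small sketch of the round trip, using a hypothetical `TinyModel` stand-in that mimics only the `get_weights` / `set_weights` methods of a Keras model. With a real model you would call your own model-building code instead.

```python
import os
import pickle
import tempfile


class TinyModel:
    """Hypothetical stand-in exposing get_weights/set_weights like a Keras model."""

    def __init__(self):
        # A freshly built model starts with its own initial weights.
        self._weights = [[0.0, 0.0], [0.0]]

    def get_weights(self):
        return [list(w) for w in self._weights]

    def set_weights(self, weights):
        # The weights must fit the architecture; mimic that check here.
        if [len(w) for w in weights] != [len(w) for w in self._weights]:
            raise ValueError("weight shapes do not match the architecture")
        self._weights = [list(w) for w in weights]


trained = TinyModel()
trained.set_weights([[0.3, -0.7], [1.5]])

weights_path = os.path.join(tempfile.mkdtemp(), "modelweights.pkl")

# Save: only the weights go into the pickle file.
with open(weights_path, "wb") as f:
    pickle.dump(trained.get_weights(), f, protocol=pickle.HIGHEST_PROTOCOL)

# Load: rebuild the same architecture, then restore the weights.
rebuilt = TinyModel()
with open(weights_path, "rb") as f:
    rebuilt.set_weights(pickle.load(f))
```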
You can pickle a Keras neural network by using the deploy-ml module, which can be installed via pip:
pip install deploy-ml
Full training and deployment of a Keras neural network using the deploy-ml wrapper looks like this:
import pandas as pd
from deployml.keras import NeuralNetworkBase
# load data
train = pd.read_csv('example_data.csv')
# define the model
NN = NeuralNetworkBase(hidden_layers = (7, 3),
first_layer=len(train.keys())-1,
n_classes=len(train.keys())-1)
# define data for the model
NN.data = train
# define the column in the data you're trying to predict
NN.outcome_pointer = 'paid'
# train the model, scale means that it's using a standard
# scaler to scale the data
NN.train(scale=True, batch_size=100)
NN.show_learning_curve()
# display the recall and precision
NN.evaluate_outcome()
# Pickle your model
NN.deploy_model(description='Keras NN',
author="maxwell flitton", organisation='example',
file_name='neural.sav')
The Pickled file contains the model, the metrics from the testing, a list of variable names and their order in which they have to be inputted, the version of Keras and python used, and if a scaler is used it will also be stored in the file. Documentation is here. Loading and using the file is done by the following:
import pickle
# use pickle to load the model
loaded_model = pickle.load(open("neural.sav", 'rb'))
# use the scaler to scale your data you want to input
input_data = loaded_model['scaler'].transform([[1, 28, 0, 1, 30]])
# get the prediction
loaded_model['model'].predict(input_data)[0][0]
I appreciate that the training can be a bit restrictive. Deploy-ml supports importing your own model for Sk-learn but it's still working on this support for Keras. However, I've found that you can create a deploy-ml NeuralNetworkBase object, define your own Keras neural network outside of Deploy-ml, and assign it to the deploy-ml model attribute and this works just fine:
NN = NeuralNetworkBase(hidden_layers = (7, 3),
first_layer=len(train.keys())-1,
n_classes=len(train.keys())-1)
NN.model = neural_network_you_defined_yourself
I want to cache my model results in order to make predictions without redoing the clustering.
I read that I can do that with memory parameter in HDBSCAN.
I did the following instead, because I wanted to save the file in the same directory as my script rather than in '/tmp/joblib' as described here (HDBSCAN cluster caching and persistence):
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True).fit(data)
# save the model to disk
filename = 'finalized_model.joblib'
joblib.dump(clusterer, filename)
I then tried to load the model in a different file:
from joblib import load
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = model.approximate_predict(model, test_points)
But I got this error: AttributeError: 'HDBSCAN' object has no attribute 'approximate_predict'
Last time I got this error, it was because prediction_data was not set to True, but what's the problem now?
approximate_predict() is a function of the hdbscan package itself, not a method of the HDBSCAN object.
Here's what you need to do:
from joblib import load
import hdbscan
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = hdbscan.approximate_predict(model, test_points)
API Reference:
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict
I uploaded a pretrained scikit learn classification model to Vertex AI and ran a batch prediction on 5 samples. It just returned a list of false predictions with no confidence score. I don't see anywhere in the SDK documentation or Google console for how to get batch predictions to include the confidence scores. Is that something Vertex AI can do?
My intent is to automate a batch prediction pipeline using the following code.
# Predict
# "csv", "bigquery", "tf-record", "tf-record-gzip", or "file-list"
batch_prediction_job = model.batch_predict(
job_display_name = job_display_name,
gcs_source = input_path,
instances_format = "", # jsonl, csv, bigquery,
gcs_destination_prefix = output_path,
starting_replica_count = 1,
max_replica_count = 10,
sync = True,
)
batch_prediction_job.wait()
return batch_prediction_job.resource_name
I tried it out in google console as a test to make sure my input data was properly formatted.
I don't think so; the stock sklearn container provided by vertex doesn't provide such a score I guess. You might need to write a custom container.
You can now do this with the custom prediction routines. Here are a couple good e2e examples
Official google
One of mine - focuses on batch prediction with predict_proba()
Here's an example of the interface for the predictor.py:
%%writefile src/predictor.py
import joblib
import numpy as np
import pickle
from google.cloud import storage
from google.cloud.aiplatform.prediction.sklearn.predictor import SklearnPredictor
import json

class CprPredictor(SklearnPredictor):
    def __init__(self):
        return

    def load(self, gcs_artifacts_uri: str):
        """Loads the model artifact from Cloud Storage."""
        gcs_client = storage.Client()
        # Download the serialized model to a local file
        with open("model.joblib", 'wb') as gcs_model:
            gcs_client.download_blob_to_file(
                gcs_artifacts_uri + "/model.joblib", gcs_model
            )
        # Load the model from the downloaded file
        with open("model.joblib", "rb") as f:
            self._model = joblib.load(f)

    def predict(self, instances):
        # Return class probabilities instead of hard labels
        outputs = self._model.predict_proba(instances)
        return outputs
Note that you currently have to use an experimental branch of the SDK; this will likely move into the official release.
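`predict_proba` returns one row of class probabilities per instance, so a confidence score still has to be picked out of each row. One possible post-processing step is sketched below with plain lists. The helper name `to_labelled_predictions` and the output keys are illustrative and not part of the Vertex AI SDK; `class_names` would come from the fitted model's `classes_` attribute.

```python
def to_labelled_predictions(probabilities, class_names):
    """Turn rows of class probabilities into label/confidence records.

    Illustrative helper: `probabilities` is a list of rows such as
    predict_proba(...).tolist(); `class_names` lists the classes in column order.
    """
    results = []
    for row in probabilities:
        # Index of the most probable class in this row
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": class_names[best], "confidence": row[best]})
    return results


rows = [[0.9, 0.1], [0.2, 0.8]]
predictions = to_labelled_predictions(rows, [False, True])
```

Plain lists and dicts like these are JSON-serializable, which makes them easier to return from a prediction routine than a raw NumPy array.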
I am using CatBoostRegressor in Python version of the Catboost library.
According to documentation, it's possible to use overfitting detector, which I am doing, like this:
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
# this line was never executed
model.save_model(model_name)
However, after overfitting occurs, my Python script gets interrupted (stopped prematurely, pick any phrase you want) and the save-model part is never executed, which leads to a lot of wasted time and no results in the end. I didn't get any stack trace.
Is there any possibility to handle it in CatBoost and save hours of fitting work?
Use this code. It will save the model no matter what happens in the try block.
try:
    model.fit(X, y)
finally:
    # save_model needs an output file name
    model.save_model(model_name)
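A `finally` block runs when `fit` raises a Python exception (including `KeyboardInterrupt`), but not when the process itself is killed from outside, for example by the operating system, which the missing stack trace may point to. The sketch below shows the first case with a hypothetical `FlakyModel` stand-in rather than CatBoost. Whether a partially trained CatBoost model can actually be saved at that point is a separate question that this sketch does not answer.

```python
class FlakyModel:
    """Hypothetical stand-in whose fit() fails part-way through training."""

    def __init__(self):
        self.iterations_done = 0
        self.saved_to = None

    def fit(self, n_iterations, fail_at):
        for i in range(n_iterations):
            if i == fail_at:
                raise RuntimeError("training interrupted")
            self.iterations_done += 1

    def save_model(self, path):
        # A real model would write to disk; here we only record the call.
        self.saved_to = path


model = FlakyModel()
error = None
try:
    try:
        model.fit(n_iterations=100, fail_at=40)
    finally:
        # Runs whether fit() returned normally or raised.
        model.save_model("model.cbm")
except RuntimeError as exc:
    # The exception still propagates after the finally block has run.
    error = exc
```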
Well, I don't know how CatBoost works, but I would like to share a different way to save/store your trained model; maybe it could help.
import pickle
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
#----To store model----------
filename = 'final_model' # name to store model
pickle.dump(model, open(filename, 'wb')) # pickling
#-----To load model------------
loaded_model = pickle.load(open(filename, 'rb'))
You can do it with pickle: just train your model and dump it using pickle.
pickle.dump(regr, open("models/svrrbf.sav",'wb'))
You can then load that model to test your inputs.
Hope it helps
(I'm using tensorflow 1.0 and Python 2.7)
I'm having trouble getting an Estimator to work with queues. Indeed, if I use the deprecated SKCompat interface with custom data files and a given batch size, the model trains properly. I'm trying to use the new interface with an input_fn that batches features out of TFRecord files (equivalent to my custom data files). The script runs properly but the loss value doesn't change after 200 or 300 steps. It seems that the model is looping on a small input batch (this would explain why the loss converges so fast).
I have a 'run.py' script that looks like the following:
import tensorflow as tf
from tensorflow.contrib import learn, metrics
#[...]
evalMetrics = {'accuracy':learn.MetricSpec(metric_fn=metrics.streaming_accuracy)}
runConfig = learn.RunConfig(save_summary_steps=10)
estimator = learn.Estimator(model_fn=myModel,
params=myParams,
modelDir='/tmp/myDir',
config=runConfig)
session = tf.Session(graph=tf.get_default_graph())
with session.as_default():
    tf.global_variables_initializer()
    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=session,coord=coordinator)
    estimator.fit(input_fn=lambda: inputToModel(trainingFileList),steps=10000)
    estimator.evaluate(input_fn=lambda: inputToModel(evalFileList),steps=10000,metrics=evalMetrics)
    coordinator.request_stop()
    coordinator.join(threads)
session.close()
My inputToModel function looks like this:
import tensorflow as tf
def inputToModel(fileList):
    features = {'rawData': tf.FixedLenFeature([100],tf.float32),
                'label': tf.FixedLenFeature([],tf.int64)}
    tensorDict = tf.contrib.learn.read_batch_record_features(fileList,
                                                             batch_size=100,
                                                             features=features,
                                                             randomize_input=True,
                                                             reader_num_threads=4,
                                                             num_epochs=1,
                                                             name='inputPipeline')
    tf.local_variables_initializer()
    data = tensorDict['rawData']
    labelTensor = tensorDict['label']
    inputTensor = tf.reshape(data,[-1,10,10,1])
    return inputTensor,labelTensor
Any help or suggestions are welcome!
Try to use: tf.global_variables_initializer().run()
I wanna do a similar thing but I do not know how to use Estimator API with multi-threading. There is an Experiment class for serving too - might be useful
Delete the lines session = tf.Session(graph=tf.get_default_graph()) and session.close(), and try:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
I am doing a project in Machine Learning and for that I am using the pickle module of Python.
Basically, I am parsing a huge data set, which is not possible in one execution; that is why I need to save the classifier object and update it in the next execution.
So my question is: when I run the program again with a new data set, will the already created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?
save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()
Unpickling your classifier object will re-create it in the same state that it was when you pickled it, so you can proceed to update it with fresh data from your data set. And at the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea to not overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
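A minimal experiment along those lines, using a plain dict in place of a classifier (the file name and the counts are made up for illustration). It shows that the file on disk only changes when `pickle.dump` is called again; updating the unpickled object in memory does not touch it.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "counts.pickle")

# First run: build some state and save it.
counts = {"good": 2, "bad": 1}
with open(path, "wb") as f:
    pickle.dump(counts, f)

# Second run: load the state and update it in memory only.
with open(path, "rb") as f:
    counts = pickle.load(f)
counts["good"] += 1

# The file still holds the old state ...
with open(path, "rb") as f:
    on_disk_before = pickle.load(f)

# ... until the updated object is dumped again.
with open(path, "wb") as f:
    pickle.dump(counts, f)
with open(path, "rb") as f:
    on_disk_after = pickle.load(f)
```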
Here's a rough sketch of how to update the pickled classifier data.
import pickle
import os
from os.path import exists
# other imports required for nltk ...

picklename = "naivebayes.pickle"

# stuff to set up featuresets ...
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    # Caveat: in NLTK, NaiveBayesClassifier.train is a classmethod that
    # returns a new classifier, so check that this call really updates
    # the loaded one for the classifier type you use.
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)

# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)

# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)
The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and apply more training data to it.
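To keep a series of backups rather than a single `.bak` file, one option is to move the existing file to a numbered backup name before overwriting it. This is a sketch under that assumption: the helper name `save_with_backup` and the `.bak1`, `.bak2`, ... naming scheme are illustrative choices, not something pickle provides.

```python
import os
import pickle
import tempfile


def save_with_backup(obj, path):
    """Pickle obj to path, first moving any existing file to the next free .bakN name."""
    if os.path.exists(path):
        n = 1
        while os.path.exists("{}.bak{}".format(path, n)):
            n += 1
        os.rename(path, "{}.bak{}".format(path, n))
    with open(path, "wb") as f:
        pickle.dump(obj, f)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "naivebayes.pickle")

# Three successive "runs", each saving a newer state.
for state in ({"seen": 100}, {"seen": 200}, {"seen": 300}):
    save_with_backup(state, target)

saved_files = sorted(os.listdir(workdir))
```

The oldest state ends up in `.bak1` and the newest in the main file, so going back to a known good state is a matter of loading the right backup.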
BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing
import pickle
with
import cPickle as pickle