I am using CatBoostRegressor in the Python version of the CatBoost library.
According to the documentation, it's possible to use the overfitting detector, which I am doing like this:
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
# this code didn't get executed
model.save_model(model_name)
However, once overfitting is detected, my Python script gets interrupted (prematurely stopped, pick any phrase you want) and the model-saving part never gets executed, which leads to a lot of wasted time and no results in the end. I don't get any stack trace.
Is there any possibility to handle it in CatBoost and save hours of fitting work?
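Use this code. The finally block saves the model whether fit() completes normally or raises a Python exception. (It cannot help if the interpreter process itself is killed.)
try:
    model.fit(X, y)
finally:
    model.save_model(model_name)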
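To illustrate what try/finally does and does not cover, here is a minimal sketch using a hypothetical stand-in class (FakeModel is not part of CatBoost; it only mimics the fit/save_model call shape). It assumes training fails with an ordinary Python exception; if the process is killed outright, no finally block runs.

```python
import os
import tempfile


class FakeModel:
    """Hypothetical stand-in for a model; NOT the CatBoost API."""

    def __init__(self):
        self.trees = 0

    def fit(self, n_iterations, fail_at):
        # Pretend to train, then fail part-way through.
        for i in range(n_iterations):
            if i == fail_at:
                raise RuntimeError("training interrupted")
            self.trees += 1

    def save_model(self, path):
        with open(path, "w") as f:
            f.write(str(self.trees))


def train_and_always_save(model, path):
    """Run fit() and save whatever state exists, even if fit() raises."""
    try:
        model.fit(n_iterations=10, fail_at=4)
    finally:
        # Runs on normal completion and on Python-level exceptions alike.
        model.save_model(path)


model_path = os.path.join(tempfile.mkdtemp(), "partial_model.txt")
error_message = None
try:
    train_and_always_save(FakeModel(), model_path)
except RuntimeError as err:
    # The exception still propagates after the finally block has run.
    error_message = str(err)

with open(model_path) as f:
    saved_trees = int(f.read())
```

Note that the exception is not swallowed: the file is written first, and the error then continues up to the caller.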
Well, I don't know how CatBoost works, but I would like to share a different way to save/store your trained model; maybe it could help.
import pickle
model = CatBoostRegressor(iterations=iters, learning_rate=0.03, depth=depth, verbose=True, od_pval=1, od_type='IncToDec', od_wait=20)
model.fit(train_pool, eval_set=validation_pool)
#----To store model----------
filename = 'final_model'  # name to store model
with open(filename, 'wb') as f:
    pickle.dump(model, f)  # pickling
#-----To load model------------
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
You can do it with pickle: just train your model and dump it using pickle.
pickle.dump(regr, open("models/svrrbf.sav",'wb'))
Afterwards, you can use that model to test your inputs.
Hope it helps
I want to cache my model results in order to make predictions without redoing the clustering.
I read that I can do that with the memory parameter in HDBSCAN.
Instead, I did the following, because I wanted to save the file in the same directory as my script rather than in '/tmp/joblib' as described here (HDBSCAN cluster caching and persistance):
import hdbscan
import joblib
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True).fit(data)
# save the model to disk
filename = 'finalized_model.joblib'
joblib.dump(clusterer, filename)
I then tried to load the model in a different file:
from joblib import load
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = model.approximate_predict(model, test_points)
But I got this error: AttributeError: 'HDBSCAN' object has no attribute 'approximate_predict'
Last time I got this error, it was because prediction_data was not set to True, but what's the problem now?
approximate_predict() is a function of the hdbscan package itself, not a method of the HDBSCAN object.
Here's what you need to do:
from joblib import load
import hdbscan
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = hdbscan.approximate_predict(model, test_points)
API Reference:
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict
I have recently been using a RobertaLarge model, on which I perform downstream training using the "Trainer" class.
All goes well: I see the loss going down, and I manually compared some results with the validation dataset.
The problem arises when I try to save the model and reload it afterwards.
I keep seeing this warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an explanation of this problem and so far couldn't find a solution. Some claim this is just a warning and nothing is wrong; however, being suspicious, I did some manual checks, and indeed the model seems... untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload the model.
However, the results show me that the model is clearly not loading correctly.
Training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
# accumulators for the evaluation loop
predictions, reals, eval_loss = [], [], 0
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in a loss of 0.9 to 1.2 (varying randomly after each reload).
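I found out what was wrong.
I'll share it here, since others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
The wrapper stores the real model under a .module attribute, which is why every key in the warning above starts with 'module.'. So it seems that Trainer can't save the wrapped model properly and get it back the usual way.
As a workaround: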
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards, on another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
I still think that this should be handled differently.
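If a checkpoint has already been saved with the 'module.' prefix on its keys, one possible alternative to re-saving is to rename the keys before loading them. The following is only a sketch on a plain dictionary: strip_module_prefix is a hypothetical helper, and the values are placeholders standing in for real tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from keys."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Placeholder values; in a real checkpoint these would be tensors.
wrapped = OrderedDict([
    ("module.roberta.encoder.layer.0.output.dense.bias", 0.1),
    ("module.classifier.weight", 0.2),
])
unwrapped = strip_module_prefix(wrapped)
unwrapped_keys = list(unwrapped.keys())
```

With a real PyTorch checkpoint, the cleaned mapping could then be passed to the unwrapped model's load_state_dict(). Saving model.module in the first place, as in the workaround above, avoids the renaming step entirely.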
I have an MXNet MultilayerPerceptron inside a class MyModel.
I first load the trained weights from a file.
I am performing prediction with the MLP like this:
class MyModel:
    ...
    def predict(self, X):
        data_iterator = mx.io.NDArrayIter(data=X,
                                          batch_size=self.model.data_shapes[0].shape[0], shuffle=False)
        predictions_npa = self.model.predict(data_iterator).asnumpy()
where X is a numpy array of shape (1, 777).
Now, the first time I call MyModel.predict, this works perfectly.
I then store the MyModel instance in a functools.LRUCache and try to perform the prediction a second time with the exact same input.
Every time I do that, my Python process just stops doing anything: no logs, no actions, and it doesn't exit either. I just know that when I try to inspect the result of self.model.predict(data_iterator) in my PyCharm debugger, I get a loading error.
So I'm a bit confused about what's happening there. If anyone has an idea, it would be a great help!
Thanks
This may be because you have to recreate data_iterator. A data iterator is exhausted once it has finished, and a further .next() call will raise an error.
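The general behaviour of an exhausted iterator can be seen with a plain Python generator. This is only an analogy: make_batches is a hypothetical helper, not an MXNet iterator, and MXNet's own classes may behave differently in detail.

```python
def make_batches(data, batch_size):
    """Yield consecutive slices of data (a plain generator, not an MXNet iterator)."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = [1, 2, 3, 4, 5]

batches = make_batches(data, batch_size=2)
first_pass = list(batches)    # consumes the iterator
second_pass = list(batches)   # already exhausted, so nothing is produced

# Recreating the iterator gives a fresh pass over the same data.
fresh_pass = list(make_batches(data, batch_size=2))
```

As far as I know, MXNet data iterators also provide a reset() method that rewinds them, which may be an alternative to building a new one.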
Official documents state that "It is not recommended to use pickle or cPickle to save a Keras model."
However, my need for pickling Keras model stems from hyperparameter optimization using sklearn's RandomizedSearchCV (or any other hyperparameter optimizers). It's essential to save the results to a file, since then the script can be executed remotely in a detached session etc.
Essentially, I want to:
trial_search = RandomizedSearchCV( estimator=keras_model, ... )
pickle.dump( trial_search, open( "trial_search.pickle", "wb" ) )
As of now, Keras models are pickle-able. But we still recommend using model.save() to save the model to disk.
This works like a charm http://zachmoshe.com/2017/04/03/pickling-keras-models.html:
import types
import tempfile
import keras.models
def make_keras_picklable():
    def __getstate__(self):
        model_str = ""
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            model_str = fd.read()
        d = { 'model_str': model_str }
        return d
    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix='.hdf5', delete=True) as fd:
            fd.write(state['model_str'])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__
    cls = keras.models.Model
    cls.__getstate__ = __getstate__
    cls.__setstate__ = __setstate__
make_keras_picklable()
PS. I had some problems because my model.to_json() raised TypeError('Not JSON Serializable:', obj) due to a circular reference. This error was somehow swallowed by the code above, which resulted in the pickle function running forever.
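The code above relies on pickle's __getstate__ and __setstate__ hooks. For readers unfamiliar with them, here is a minimal standard-library illustration on a hypothetical class (Counter is invented for this example and has nothing to do with Keras): the unpicklable part is dropped when saving and rebuilt when loading.

```python
import pickle
import threading


class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and recreate the lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


counter = Counter()
counter.increment()
counter.increment()

restored = pickle.loads(pickle.dumps(counter))
restored.increment()  # uses the lock recreated in __setstate__
restored_value = restored.value
original_value = counter.value
```

The Keras version follows the same idea, except that the state it saves is the bytes of a temporary HDF5 file rather than a filtered copy of the instance dictionary.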
Use get_weights and set_weights to save and load the model weights, respectively.
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")
import pickle
# for heavy model architectures, .h5 file is unsupported.
weigh = model.get_weights()
pklfile = "D:/modelweights.pkl"
# binary mode ('wb') is what pickle needs on both Python 2 and Python 3
with open(pklfile, 'wb') as fpkl:
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
You can pickle a Keras neural network by using the deploy-ml module, which can be installed via pip:
pip install deploy-ml
Full training and deployment of a Keras neural network using the deploy-ml wrapper looks like this:
import pandas as pd
from deployml.keras import NeuralNetworkBase
# load data
train = pd.read_csv('example_data.csv')
# define the model
NN = NeuralNetworkBase(hidden_layers = (7, 3),
                       first_layer=len(train.keys())-1,
                       n_classes=len(train.keys())-1)
# define data for the model
NN.data = train
# define the column in the data you're trying to predict
NN.outcome_pointer = 'paid'
# train the model, scale means that it's using a standard
# scaler to scale the data
NN.train(scale=True, batch_size=100)
NN.show_learning_curve()
# display the recall and precision
NN.evaluate_outcome()
# Pickle your model
NN.deploy_model(description='Keras NN',
                author="maxwell flitton", organisation='example',
                file_name='neural.sav')
The pickled file contains the model, the metrics from the testing, a list of variable names and the order in which they have to be inputted, the versions of Keras and Python used, and, if a scaler is used, the scaler itself. Documentation is here. Loading and using the file is done as follows:
import pickle
# use pickle to load the model
loaded_model = pickle.load(open("neural.sav", 'rb'))
# use the scaler to scale your data you want to input
input_data = loaded_model['scaler'].transform([[1, 28, 0, 1, 30]])
# get the prediction
loaded_model['model'].predict(input_data)[0][0]
I appreciate that the training can be a bit restrictive. Deploy-ml supports importing your own model for Sk-learn but it's still working on this support for Keras. However, I've found that you can create a deploy-ml NeuralNetworkBase object, define your own Keras neural network outside of Deploy-ml, and assign it to the deploy-ml model attribute and this works just fine:
NN = NeuralNetworkBase(hidden_layers = (7, 3),
                       first_layer=len(train.keys())-1,
                       n_classes=len(train.keys())-1)
NN.model = neural_network_you_defined_yourself
I am doing a project in machine learning, and for that I am using Python's pickle module.
Basically, I am parsing a huge data set, which is not possible in one execution; that is why I need to save the classifier object and update it in the next execution.
So my question is: when I run the program again with a new data set, will the already created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?
save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()
Unpickling your classifier object will re-create it in the same state that it was when you pickled it, so you can proceed to update it with fresh data from your data set. And at the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea to not overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
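A minimal experiment of that kind might look like the following. The dictionary and the file name are made up for illustration; the three steps stand in for three separate program runs.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "experiment.pickle")

# First "run": create an object and pickle it.
word_counts = {"good": 3, "bad": 1}
with open(path, "wb") as f:
    pickle.dump(word_counts, f)

# Second "run": unpickle, update with new data, and pickle again.
with open(path, "rb") as f:
    loaded = pickle.load(f)
loaded["good"] += 2
loaded["ugly"] = 1
with open(path, "wb") as f:
    pickle.dump(loaded, f)

# Third "run": the file now holds the updated state.
with open(path, "rb") as f:
    final_state = pickle.load(f)
```

The file on disk only changes when you dump to it again; modifying the unpickled object in memory does not touch the file by itself.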
Here's a rough sketch of how to update the pickled classifier data.
import pickle
import os
from os.path import exists
# other imports required for nltk ...
picklename = "naivebayes.pickle"
# stuff to set up featuresets ...
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]
# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    # Caveat: this only updates the classifier if its train() method works in place.
    # NLTK's NaiveBayesClassifier.train is a classmethod that returns a new classifier.
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)
# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)
# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)
The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and apply more training data to it (this assumes the classifier's train() method updates it in place).
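The sketch above keeps a single '.bak' file. To keep a series of backups instead, one option is numbered backup files with a limit on how many are retained. The helper below is hypothetical (rotate_backup and its naming scheme are my own choices, not part of pickle or nltk), and the loop simulates five program runs.

```python
import os
import shutil
import tempfile


def rotate_backup(path, keep=3):
    """Copy path to the next numbered backup and keep only the newest ones."""
    folder = os.path.dirname(path) or "."
    prefix = os.path.basename(path) + "."
    # Collect the numbers of existing backups named <file>.<N>.bak
    numbers = []
    for name in os.listdir(folder):
        if name.startswith(prefix) and name.endswith(".bak"):
            middle = name[len(prefix):-len(".bak")]
            if middle.isdigit():
                numbers.append(int(middle))
    numbers.sort()
    next_number = numbers[-1] + 1 if numbers else 1
    backup = "{}.{}.bak".format(path, next_number)
    shutil.copy2(path, backup)
    numbers.append(next_number)
    # Remove the oldest backups beyond the keep limit.
    for old in numbers[:-keep]:
        os.remove("{}.{}.bak".format(path, old))
    return backup


folder = tempfile.mkdtemp()
picklename = os.path.join(folder, "naivebayes.pickle")
for run in range(5):
    # Back up the previous state (if any) before overwriting it.
    if os.path.exists(picklename):
        rotate_backup(picklename, keep=3)
    with open(picklename, "wb") as f:
        f.write(b"state after run %d" % run)

remaining = sorted(os.listdir(folder))
```

Copying rather than renaming leaves the current file in place until the new pickle is written, at the cost of some extra disk I/O.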
BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing
import pickle
with
import cPickle as pickle