tensorflow models load successfully individually, but not sequentially? Garbage collection? - python

Sorry about the vague title but I'm not sure exactly how to describe it.
I am currently running tests on a model written in tensorflow.compat.v1. When it is used for inference, it must be restored as follows:
class Model:
    def __init__(self, filepath):
        ...
        self.sess, self.saver = self.setup_tf()
        self.merge = tf.compat.v1.summary.merge(
            [tf.compat.v1.summary.scalar('loss', self.loss)])

    def setup_tf(self):
        sess = tf.compat.v1.Session()  # TF session
        saver = tf.compat.v1.train.Saver(max_to_keep=1)
        latest_snapshot = tf.train.latest_checkpoint(join("../", self.model_dir))
        saver.restore(sess, latest_snapshot)
        return sess, saver
I have 2 of these tests, each involving restoring a model and performing some inference with it. After the first model loads and runs inference successfully, the second test fails to restore its model. The specific error is

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

However, when I run each test individually (commenting the other out), there is no problem restoring the model. Even when I swap the order in which the models are loaded, the first one always succeeds and the second one fails.
I figure that I should be using setUp and tearDown for each test, but I'm confused about what exactly I should be 'tearing down'. Is it garbage collection? I have tried gc.collect(), but the same error occurs.
Here is the testing class, if it helps:

class TestInference(unittest.TestCase):
    def setUp(self):
        pass

    def tearDown(self):
        pass

    def test1(self):
        model = Model(filepath)
        model.infer()

    def test2(self):
        model = Model(filepath)
        model.infer()
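For reference, a hedged sketch of one thing tearDown could reset, under the assumption (not confirmed in the question) that both Model instances are being built into the same default graph, so the second restore collides with the first model's variables:

    def tearDown(self):
        # Assumption: drop the accumulated default graph so the next test
        # builds and restores its Model into a fresh, empty graph.
        tf.compat.v1.reset_default_graph()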

Related

Loading Model only once in fastAPI

I have a deep learning model and I want to load that model only once when my API starts.
Right now the model is loaded for every request made to my API service with the test image, and it takes a fair bit of time to print an output.
I tried loading my model with FastAPI's @app.on_event('startup'):
main.py
@app.on_event('startup')
def init_data():
    print("init call")
    model = load_model(model_path)
    return model
and I want to import this model variable into another Python file, classification.py:

# classification.py
pred = model.predict(image)
I am not sure if this is the correct way to do this. Any help regarding this will be appreciated.
Thanks
So what you need is just to encapsulate this in a function and pass the model along.
# main.py
import classification
# Other imports...

model = load_model(model_path)
# Model is global, so it is loaded once.
# ...

@app.get('/whatever/page')
def predict(image):
    global model  # We make sure model is not local
    # We call classification's predict function
    result = classification.predict(image, model)
    # ...

# classification.py
# import ...

# The function predict takes the model as a param, and the image to predict.
def predict(image, model):
    pred = model.predict(image)
    # ...
    return pred
You could do this more rigorously with type annotations on the parameters, etc., but that gives an idea of the data flow. Another approach would be to load the model at the root of classification.py so you don't need to pass it along. That code will be executed only once, when the module is first imported.
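A minimal sketch of that second approach (load_model and model_path are carried over from the question, not a confirmed API):

# classification.py
# Module-level load: this line runs once, on first import.
model = load_model(model_path)

def predict(image):
    # The model is picked up from the module namespace.
    return model.predict(image)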

Loading Gensim FastText Model with Callbacks Fails

After creating a FastText model using Gensim, I want to load it but am running into errors seemingly related to callbacks.
The code used to create the model is
TRAIN_EPOCHS = 30
WINDOW = 5
MIN_COUNT = 50
DIMS = 256

vocab_model = gensim.models.FastText(sentences=model_input,
                                     size=DIMS,
                                     window=WINDOW,
                                     iter=TRAIN_EPOCHS,
                                     workers=6,
                                     min_count=MIN_COUNT,
                                     callbacks=[EpochSaver("./ftchkpts/")])
vocab_model.save('ft_256_min_50_model_30eps')
and the callback EpochSaver is defined as
import os
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters'''
    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, f"ft256_{self.epoch}e")
        model.save(savepath)
        print(f"Epoch saved: {self.epoch + 1}")
        if os.path.isfile(os.path.join(self.savedir, f"ft256_{self.epoch - 1}e")):
            os.remove(os.path.join(self.savedir, f"ft256_{self.epoch - 1}e"))
            print("Previous model deleted")
        self.epoch += 1
Aside from the type of model, this is identical to my process for Word2Vec, which worked without issue. However, when I open another file and try to load the model with
from gensim.models import FastText
vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')
I'm greeted with the error
AttributeError: Can't get attribute 'EpochSaver' on <module '__main__'>
What can I do to get the vocabulary to load so I can create the embedding layer for my keras model? If it's relevant, this is happening in JupyterLab.
This extra difficulty loading models with custom callbacks is a known, open issue (at least through gensim-3.8.1 and October 2019).
You can see discussions of possible workarounds and fixes there – and the gensim team is considering simply disabling the auto-saving of callbacks altogether, requiring them to be re-specified for each later train()/etc. call that needs them.
You may be able to load existing models saved with your custom callbacks by importing those same callback classes, as the same names, into the code context where you're doing a load().
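For example (a sketch, assuming the EpochSaver definition has been moved into an importable module, hypothetically named training_callbacks; importing the class into the loading script puts the name back where the unpickler looks for it):

# loading script
from training_callbacks import EpochSaver  # same class definition used at training time
from gensim.models import FastText

vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')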
You could save callback-free versions of your trained models by blanking the model's callbacks property to its empty default value just before you save(), e.g.:
model.callbacks = ()
model.save(save_path)
Then, you wouldn't need to do any special importing of custom classes before a load(). (Of course if you again needed callback functionality on the re-loaded model, they'd then have to be explicitly reestablished after load()).

How many times are DoFn's constructed?

I'm using the apache beam python SDK and Dataflow to write an inference pipeline for making predictions with TensorFlow models. I have the prediction step in a DoFn, but I don't want to have to load the model every time I process a bundle because that's very expensive. From the docs here, "If required, a fresh instance of the argument DoFn is created on a worker, and the DoFn.Setup method is called on this instance. This may be through deserialization or other means. A PipelineRunner may reuse DoFn instances for multiple bundles. A DoFn that has terminated abnormally (by throwing an Exception) will never be reused." I've noticed that if I write my code like this
class StatefulGetEmbeddingsDoFn(beam.DoFn):
    def __init__(self, model_dir):
        self.model = None  # initialize
        self.model_dir = model_dir

    def process(self, element):
        if not self.model:  # load model if model hasn't been loaded yet
            global i
            i += 1
            logging.info('Getting model: {}'.format(i))
            self.model = Model(saved_model_dir=self.model_dir)
        ids, b64 = element
        embeddings = self.model.predict(b64)
        res = [
            {
                'image': _id,
                'embeddings': embedding.tolist()
            } for _id, embedding in zip(ids, embeddings)
        ]
        return res
It seems like the model is being loaded more than once on every worker (I've got a cluster of ~30-40 machines). Is there a way of preventing the model from being loaded more than once? I would've expected this DoFn to only be constructed once on every machine but from the logs, it seems like that's not the case...
I know this is an older question, but my initial thoughts are to use the setup and start_bundle methods.
https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.setup
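A minimal sketch of that idea, reusing the Model class from the question; setup() is called once per DoFn instance on a worker, before any bundles are processed, so the expensive load no longer happens inside process():

class GetEmbeddingsDoFn(beam.DoFn):
    def __init__(self, model_dir):
        self.model_dir = model_dir
        self.model = None

    def setup(self):
        # Runs once per DoFn instance, not once per bundle or element.
        self.model = Model(saved_model_dir=self.model_dir)

    def process(self, element):
        ids, b64 = element
        embeddings = self.model.predict(b64)
        return [{'image': _id, 'embeddings': e.tolist()}
                for _id, e in zip(ids, embeddings)]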

How can I fix an AttributeError while loading checkpoint?

I am working on Project 2 of a course with Udacity (Artificial Intelligence with Python Programming).
I have trained a model and saved it in checkpoint.pth, and I want to load checkpoint.pth so I can rebuild the model.
I have written the code to save checkpoint.pth and also to load checkpoint.
model.class_to_idx = image_datasets['train_dir'].class_to_idx
model.cpu()

checkpoint = {'input_size': 25088,
              'output_size': 102,
              'hidden_layers': 4096,
              'epochs': epochs,
              'optimizer': optimizer.state_dict(),
              'state_dict': model.state_dict(),
              'class_to_index': model.class_to_idx
              }
torch.save(checkpoint, 'checkpoint.pth')
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = checkpoint.Network(checkpoint['input_size'],
                               checkpoint['output_size'],
                               checkpoint['hidden_layers'],
                               checkpoint['epochs'],
                               checkpoint['optimizer'],
                               checkpoint['class_to_index']
                               )
    model.load_state_dict(checkpoint['state_dict'])
    return model

model = load_checkpoint('checkpoint.pth')
While loading checkpoint.pth, I get an error:
AttributeError: 'dict' object has no attribute 'Network'
I want to load the checkpoint successfully.
Thank you
UPDATE: With the full code visible, I think the issue is in the implementation. torch.load deserializes the dict that was saved to the file, so inside the function, checkpoint is just that original dict object, with only the keys and values you put into it.
In this instance, I think what you are actually looking to do is call load on the file saved as checkpoint.pth, and the first call might not be necessary:
def load_checkpoint(filepath):
    model = torch.load(filepath)
    return model
The other possibility is that you meant to load the saved weights into an already constructed model, and then it would be just a small adjustment (note that load_state_dict is a method on the model, not on torch):

def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    # Assumes `model` has already been constructed elsewhere.
    model.load_state_dict(checkpoint['state_dict'])
    return model
The most likely problem is that you are calling on the Network class, which is not contained within the checkpoint dictionary object.
I can't speak to the actual lesson or other nuances within it, but the simplest solution might be to just call the Network class definition with the values already in the checkpoint dictionary, like so:
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = Network(checkpoint['input_size'],
                    checkpoint['output_size'],
                    checkpoint['hidden_layers'],
                    checkpoint['epochs'],
                    checkpoint['optimizer'],
                    checkpoint['class_to_index'])
    model.load_state_dict(checkpoint['state_dict'])
    return model
The checkpoint dict may only have the values you expect ('input_size', 'output_size', etc.), but this is just the most obvious issue I see.

Saving and restoring functions in TensorFlow

I am working on a VAE project in TensorFlow where the encoder/decoder networks are built in functions. The idea is to be able to save and then load the trained model and do sampling, using the encoder function.
After restoring the model, I am having trouble getting the decoder function to run and give me back the restored, trained variables; I get an "Uninitialized value" error. I assume it is because the function is either creating a new one, overwriting the existing one, or something similar, but I cannot figure out how to solve this. Here is some code:
class VAE(object):
    def __init__(self, restore=True):
        self.session = tf.Session()
        if restore:
            self.restore_model()
        self.build_decoder = tf.make_template('decoder', self._build_decoder)

    @staticmethod
    def _build_decoder(z, output_size=768, hidden_size=200,
                       hidden_activation=tf.nn.elu, output_activation=tf.nn.sigmoid):
        x = tf.layers.dense(z, hidden_size, activation=hidden_activation)
        x = tf.layers.dense(x, hidden_size, activation=hidden_activation)
        logits = tf.layers.dense(x, output_size, activation=output_activation)
        return distributions.Independent(distributions.Bernoulli(logits), 2)

    def sample_decoder(self, n_samples):
        prior = self.build_prior(self.latent_dim)
        samples = self.build_decoder(prior.sample(n_samples), self.input_size).mean()
        return self.session.run([samples])

    def restore_model(self):
        print("Restoring")
        self.saver = tf.train.import_meta_graph(os.path.join(self.save_dir, "turbolearn.meta"))
        self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
        self._restored = True
I want to run samples = vae.sample_decoder(5).
In my training routine, I run:
if self.checkpoint:
    self.saver.save(self.session, os.path.join(self.save_dir, "myvae"), write_meta_graph=True)
UPDATE
Based on the suggested answer below, I changed the restore method
self.saver = tf.train.Saver()
self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
But now I get a ValueError when it creates the Saver() object:
ValueError: No variables to save
tf.train.import_meta_graph restores the graph, meaning it rebuilds the network architecture that was stored in the file. The call to tf.train.Saver.restore, on the other hand, only restores the variable values from the file into the current graph in the session (this naturally fails if some values in the file belong to variables that do not exist in the currently active graph).
So if you already build the network layers in the code, you don't need to call tf.train.import_meta_graph. Otherwise this might be causing you problems.
Not sure what the rest of your code looks like, but here are some suggestions: first build the graph, then create the session, and finally restore if applicable. Your __init__ might then look like this:
def __init__(self, restore=True):
    self.build_decoder = tf.make_template('decoder', self._build_decoder)
    self.session = tf.Session()
    if restore:
        self.restore_model()
However, if you are only restoring the encoder and building the decoder anew, you might build the decoder last. But then don't forget to initialize its variables before use.
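A hedged sketch of that last step, using the common TF1 pattern for initializing only the variables the restore did not cover (this assumes everything lives in the default graph, as in the question):

# Initialize only the variables that were not restored from the checkpoint.
uninit_names = set(self.session.run(tf.report_uninitialized_variables()))
uninit_vars = [v for v in tf.global_variables()
               if v.name.split(':')[0].encode() in uninit_names]
self.session.run(tf.variables_initializer(uninit_vars))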
