PyTorch Lightning: include some Tensor objects in checkpoint file - python

Since PyTorch Lightning provides automatic saving of model checkpoints, I use it to save the top-k best models. Specifically, in the Trainer setup:
checkpoint_callback = ModelCheckpoint(
    monitor='val_acc',
    dirpath='checkpoints/',
    filename='{epoch:02d}-{val_acc:.2f}',
    save_top_k=5,
    mode='max',
)
This works well, but it does not save some attributes of the model object. My model stores a Tensor at the end of every training epoch, like so:
class SampleNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(100, 1)
        self.loss = torch.nn.CrossEntropyLoss()
        self.some_data = None  # Initialize as None

    def training_step(self, batch, batch_idx):
        x, t = batch
        out = self.layer(x)
        loss = self.loss(out, t)
        results = {'loss': loss}
        return results

    def training_epoch_end(self, outputs):
        self.some_data = some_tensor_object
This is a simplified example, but I want the checkpoint file produced by the checkpoint_callback above to remember the attribute self.some_data. However, when I load the model from a checkpoint, it is always reset to None, even though I confirmed that it is successfully updated during training.
I tried not initializing it as None in __init__, but then the attribute disappears entirely when loading the model.
Saving the attribute as a separate .pt file is something I want to avoid, because it is tied to the model configuration, so I would have to manually match each file with its corresponding checkpoint file later.
Would it be possible to include such a tensor attribute in the checkpoint file?

Simply use the model class hooks on_save_checkpoint() and on_load_checkpoint() for all sorts of objects that you want to save alongside the default attributes.
def on_save_checkpoint(self, checkpoint) -> None:
    "Objects to include in checkpoint file"
    checkpoint["some_data"] = self.some_data

def on_load_checkpoint(self, checkpoint) -> None:
    "Objects to retrieve from checkpoint file"
    self.some_data = checkpoint["some_data"]
See module docs
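A minimal usage sketch (the checkpoint filename below is hypothetical, following the filename pattern from the question's ModelCheckpoint): once the two hooks above are defined on the module, loading from a checkpoint should bring the attribute back.
# Hypothetical checkpoint path based on the '{epoch:02d}-{val_acc:.2f}' pattern above
model = SampleNet.load_from_checkpoint("checkpoints/epoch=07-val_acc=0.92.ckpt")
print(model.some_data)  # restored by on_load_checkpoint, no longer None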

It doesn't seem to be possible directly, since nn.Module.state_dict() is most likely used to extract the parameters.
This method only extracts the values of tensors that are actually registered as parameters. So in this case a workaround would be to save your data as a parameter (see docs):
self.some_data = torch.nn.parameter.Parameter(your_data)
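A short sketch of this workaround (assuming your_data is a plain torch.Tensor; requires_grad=False is my addition here so the optimizer does not try to train it):
# Sketch: a Parameter is picked up by state_dict() and therefore lands in the checkpoint;
# requires_grad=False keeps it out of training.
self.some_data = torch.nn.Parameter(your_data, requires_grad=False)
print("some_data" in self.state_dict())  # True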

Related

Loading a modified pretrained model using strict=False in PyTorch

I want to use a pretrained model as the encoder part of my model. Here is a version of my model:
class MyClass(nn.Module):
    def __init__(self, pretrained=False):
        super(MyClass, self).__init__()
        self.encoder = S3D_featureExtractor_multi_output()
        if pretrained:
            weight_dict = torch.load(os.path.join('models', 'weights.pt'))
            model_dict = self.encoder.state_dict()
            list_weight_dict = list(weight_dict.items())
            list_model_dict = list(model_dict.items())
            for i in range(len(list_model_dict)):
                assert list_model_dict[i][1].shape == list_weight_dict[i][1].shape
                model_dict[list_model_dict[i][0]].copy_(weight_dict[list_weight_dict[i][0]])
            for i in range(len(list_model_dict)):
                assert torch.all(torch.eq(model_dict[list_model_dict[i][0]], weight_dict[list_weight_dict[i][0]].to('cpu')))
            print('Loading finished!')

    def forward(self, x):
        a, b = self.encoder(x)
        return a, b
Because I modified some parts of the code of this pretrained model, based on this post I need to apply strict=False to avoid an error, but given how I load the pretrained weights, I cannot find a place in the code to apply strict=False. How can I apply it, or how can I change the way the pretrained model is loaded so that it becomes possible to apply strict=False?
strict=False is an argument you specify when you use the load_state_dict() method. state_dicts are just Python dictionaries that help you save and load model weights.
(for more details, see https://pytorch.org/tutorials/recipes/recipes/what_is_state_dict.html)
If you use strict=False in load_state_dict, you inform PyTorch that the target model and the original model are not identical, so it only initialises the weights of layers that are present in both and ignores the rest.
(see https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=load_state_dict#torch.nn.Module.load_state_dict)
So you need to specify the strict argument when you load the pretrained model weights; load_state_dict can be called at that step.
If the model whose weights must be loaded is self.encoder, and the state_dict can be retrieved from the file you just loaded, you can simply do this:
loaded_weights = torch.load(os.path.join('models','weights.pt'))
self.encoder.load_state_dict(loaded_weights, strict=False)
for more details and a tutorial, see https://pytorch.org/tutorials/beginner/saving_loading_models.html .
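Put together with the question's code, a sketch of the __init__ branch might look like this (path and file name are taken from the question; in recent PyTorch versions load_state_dict returns the lists of missing and unexpected keys, which can be handy to inspect):
if pretrained:
    weight_dict = torch.load(os.path.join('models', 'weights.pt'))
    # strict=False skips keys that do not match between the file and self.encoder
    result = self.encoder.load_state_dict(weight_dict, strict=False)
    print('Loading finished!')
    print('Missing keys:', result.missing_keys)
    print('Unexpected keys:', result.unexpected_keys)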

Loading Gensim FastText Model with Callbacks Fails

After creating a FastText model using Gensim, I want to load it but am running into errors seemingly related to callbacks.
The code used to create the model is
TRAIN_EPOCHS = 30
WINDOW = 5
MIN_COUNT = 50
DIMS = 256

vocab_model = gensim.models.FastText(sentences=model_input,
                                     size=DIMS,
                                     window=WINDOW,
                                     iter=TRAIN_EPOCHS,
                                     workers=6,
                                     min_count=MIN_COUNT,
                                     callbacks=[EpochSaver("./ftchkpts/")])
vocab_model.save('ft_256_min_50_model_30eps')
and the callback EpochSaver is defined as
import os
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters'''
    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, f"ft256_{self.epoch}e")
        model.save(savepath)
        print(f"Epoch saved: {self.epoch + 1}")
        if os.path.isfile(os.path.join(self.savedir, f"ft256_{self.epoch-1}e")):
            os.remove(os.path.join(self.savedir, f"ft256_{self.epoch-1}e"))
            print("Previous model deleted")
        self.epoch += 1
Aside from the type of model, this is identical to my process for Word2Vec which worked without issue. However when I open another file and try to load the model with
from gensim.models import FastText
vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')
I'm greeted with the error
AttributeError: Can't get attribute 'EpochSaver' on <module '__main__'>
What can I do to get the vocabulary to load so I can create the embedding layer for my keras model? If it's relevant, this is happening in JupyterLab.
This extra difficulty loading models with custom callbacks is a known, open issue (at least through gensim-3.8.1 and October 2019).
You can see discussions of possible workarounds and fixes there – and the gensim team is considering simply disabling the auto-saving of callbacks at all, requiring them to be re-specified for each later train()/etc call that needs them.
You may be able to load existing models saved with your custom callbacks by importing those same callback classes, as the same names, into the code context where you're doing a load().
You could save callback-free versions of your trained models by blanking the model's callbacks property to its empty default value, just before you save(), eg:
model.callbacks = ()
model.save(save_path)
Then, you wouldn't need to do any special importing of custom classes before a load(). (Of course if you again needed callback functionality on the re-loaded model, they'd then have to be explicitly reestablished after load()).
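A sketch of the import workaround mentioned above (the module name mycallbacks is hypothetical; the point is simply that a class named EpochSaver must be importable in the session doing the load):
# Make the same callback class visible before loading; the module name is assumed here.
from mycallbacks import EpochSaver  # or re-run the cell defining EpochSaver in JupyterLab
from gensim.models import FastText

vocab = FastText.load(r'vocab/ft_256_min_50_model_30eps')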

How can I fix an AttributeError while loading checkpoint?

I am working on Project 2 of a course with Udacity (Artificial Intelligence with Python Programming).
I have trained a model and saved it in checkpoint.pth, and I want to load checkpoint.pth so I can rebuild the model.
I have written the code to save checkpoint.pth and also to load checkpoint.
model.class_to_idx = image_datasets['train_dir'].class_to_idx
model.cpu()
checkpoint = {'input_size': 25088,
              'output_size': 102,
              'hidden_layers': 4096,
              'epochs': epochs,
              'optimizer': optimizer.state_dict(),
              'state_dict': model.state_dict(),
              'class_to_index': model.class_to_idx
              }
torch.save(checkpoint, 'checkpoint.pth')
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = checkpoint.Network(checkpoint['input_size'],
                               checkpoint['output_size'],
                               checkpoint['hidden_layers'],
                               checkpoint['epochs'],
                               checkpoint['optimizer'],
                               checkpoint['class_to_index']
                               )
    model.load_state_dict(checkpoint['state_dict'])
    return model

model = load_checkpoint('checkpoint.pth')
While loading checkpoint.pth, I get an error:
AttributeError: 'dict' object has no attribute 'Network'
I want to load the checkpoint successfully.
Thank you
UPDATE: With the full code visible, I think the issue is in the implementation. torch.load deserializes the dict that was saved to the file, so inside the function, checkpoint is just that original dict object; it has no Network attribute.
In this instance, I think what you actually want is to load the file saved as checkpoint.pth, and the first call might not even be necessary:
def load_checkpoint(filepath):
    model = torch.load(filepath)
    return model
The other possibility is that you only need the stored state_dict, in which case it has to be loaded into an already constructed model instance, which would be just a small adjustment:
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['state_dict'])  # model must already be constructed
    return model
The most likely problem is that you are calling the Network class on the checkpoint, and Network is not contained within the checkpoint dictionary object.
I can't speak to the actual lesson or its other nuances, but the simplest solution might be to call the Network class definition directly with the values already in the checkpoint dictionary, like so:
def load_checkpoint(filepath):
    checkpoint = torch.load(filepath)
    model = Network(checkpoint['input_size'],
                    checkpoint['output_size'],
                    checkpoint['hidden_layers'],
                    checkpoint['epochs'],
                    checkpoint['optimizer'],
                    checkpoint['class_to_index'])
    model.load_state_dict(checkpoint['state_dict'])
    return model
The checkpoint dict may only have the values you expect ('input_size', 'output_size', etc.), but this is just the most obvious issue I see.

Saving and restoring functions in TensorFlow

I am working on a VAE project in TensorFlow where the encoder/decoder networks are built in functions. The idea is to be able to save the trained model, then load it and do sampling, using the encoder function.
After restoring the model, I am having trouble getting the decoder function to run and give me back the restored, trained variables; I get an "Uninitialized value" error. I assume it is because the function is either creating new variables, overwriting the existing ones, or something similar, but I cannot figure out how to solve this. Here is some code:
class VAE(object):
    def __init__(self, restore=True):
        self.session = tf.Session()
        if restore:
            self.restore_model()
        self.build_decoder = tf.make_template('decoder', self._build_decoder)

    @staticmethod
    def _build_decoder(z, output_size=768, hidden_size=200,
                       hidden_activation=tf.nn.elu, output_activation=tf.nn.sigmoid):
        x = tf.layers.dense(z, hidden_size, activation=hidden_activation)
        x = tf.layers.dense(x, hidden_size, activation=hidden_activation)
        logits = tf.layers.dense(x, output_size, activation=output_activation)
        return distributions.Independent(distributions.Bernoulli(logits), 2)

    def sample_decoder(self, n_samples):
        prior = self.build_prior(self.latent_dim)
        samples = self.build_decoder(prior.sample(n_samples), self.input_size).mean()
        return self.session.run([samples])

    def restore_model(self):
        print("Restoring")
        self.saver = tf.train.import_meta_graph(os.path.join(self.save_dir, "turbolearn.meta"))
        self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
        self._restored = True
I want to run samples = vae.sample_decoder(5).
In my training routine, I run:
if self.checkpoint:
    self.saver.save(self.session, os.path.join(self.save_dir, "myvae"), write_meta_graph=True)
UPDATE
Based on the suggested answer below, I changed the restore method
self.saver = tf.train.Saver()
self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
But now get a value error when it creates the Saver() object:
ValueError: No variables to save
tf.train.import_meta_graph restores the graph, meaning it rebuilds the network architecture that was stored in the file. The call to tf.train.Saver.restore, on the other hand, only restores the variable values from the file into the current graph in the session (this naturally fails if some values in the file belong to variables that do not exist in the currently active graph).
So if you already build the network layers in your code, you don't need to call tf.train.import_meta_graph; otherwise this might be causing your problems.
I am not sure what the rest of your code looks like, but here are some suggestions. First build the graph, then create the session, and finally restore if applicable. Your __init__ might then look like this:
def __init__(self, restore=True):
    self.build_decoder = tf.make_template('decoder', self._build_decoder)
    self.session = tf.Session()
    if restore:
        self.restore_model()
However if you are only restoring the encoder, and building the decoder anew, you might build the decoder last. But then don't forget to initialize its variables before usage.
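A rough sketch of that last point (TF1-style; it assumes the decoder's variables live under the 'decoder' scope created by tf.make_template and that the remaining variables have already been restored):
# Initialize only the decoder's still-uninitialized variables after restoring the rest.
decoder_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='decoder')
self.session.run(tf.variables_initializer(decoder_vars))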

How do I save a trained model in PyTorch?

How do I save a trained model in PyTorch? I have read that:
torch.save()/torch.load() is for saving/loading a serializable object.
model.state_dict()/model.load_state_dict() is for saving/loading model state.
Found this page on their github repo:
Recommended approach for saving a model
There are two main approaches for serializing and restoring a model.
The first (recommended) saves and loads only the model parameters:
torch.save(the_model.state_dict(), PATH)
Then later:
the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))
The second saves and loads the entire model:
torch.save(the_model, PATH)
Then later:
the_model = torch.load(PATH)
However in this case, the serialized data is bound to the specific classes and the exact directory structure used, so it can break in various ways when used in other projects, or after some serious refactors.
See also: Save and Load the Model section from the official PyTorch tutorials.
It depends on what you want to do.
Case # 1: Save the model to use it yourself for inference: You save the model, you restore it, and then you change the model to evaluation mode. This is done because you usually have BatchNorm and Dropout layers that by default are in train mode on construction:
torch.save(model.state_dict(), filepath)
#Later to restore:
model.load_state_dict(torch.load(filepath))
model.eval()
Case # 2: Save model to resume training later: If you need to keep training the model that you are about to save, you need to save more than just the model. You also need to save the state of the optimizer, epochs, score, etc. You would do it like this:
state = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    ...
}
torch.save(state, filepath)
To resume training you would do things like: state = torch.load(filepath), and then, to restore the state of each individual object, something like this:
model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])
Since you are resuming training, DO NOT call model.eval() once you restore the states when loading.
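Putting the resume path together, a minimal sketch (names follow the snippet above; the +1 on the epoch is just one common convention for continuing from the next epoch):
state = torch.load(filepath)
model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])
start_epoch = state['epoch'] + 1  # continue training from the next epoch
# ...resume the training loop from start_epoch, without calling model.eval()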
Case # 3: Model to be used by someone else with no access to your code:
In TensorFlow you can create a .pb file that defines both the architecture and the weights of the model. This is very handy, especially when using TensorFlow Serving. The equivalent way to do this in PyTorch would be:
torch.save(model, filepath)
# Then later:
model = torch.load(filepath)
This way is still not bulletproof, and since PyTorch is still undergoing a lot of changes, I wouldn't recommend it.
The pickle Python library implements binary protocols for serializing and de-serializing a Python object.
When you import torch (or when you use PyTorch) it will import pickle for you and you don't need to call pickle.dump() and pickle.load() directly, which are the methods to save and to load the object.
In fact, torch.save() and torch.load() will wrap pickle.dump() and pickle.load() for you.
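For illustration, a tiny sketch of that relationship (not something you would normally do; torch.save()/torch.load() already handle this and add their own conveniences):
import pickle
import torch

t = torch.randn(3)
with open('tensor.pkl', 'wb') as f:
    pickle.dump(t, f)          # what torch.save builds on
with open('tensor.pkl', 'rb') as f:
    t2 = pickle.load(f)        # what torch.load builds on
print(torch.equal(t, t2))      # True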
A state_dict the other answer mentioned deserves just a few more notes.
What state_dict do we have inside PyTorch?
There are actually two state_dicts.
A PyTorch model is a torch.nn.Module, which has a model.parameters() call to get the learnable parameters (w and b).
These learnable parameters, once randomly set, will update over time as we learn.
Learnable parameters are the first state_dict.
The second state_dict is the optimizer state dict. You recall that the optimizer is used to improve our learnable parameters, but the optimizer state_dict itself contains nothing learnable; it holds the optimizer's internal state and the hyperparameters used.
Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.
Let's create a super simple model to explain this:
import torch
import torch.optim as optim

model = torch.nn.Linear(5, 2)

# Initialize optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("Model weight:")
print(model.weight)

print("Model bias:")
print(model.bias)

print("---")
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])
This code will output the following:
Model's state_dict:
weight torch.Size([2, 5])
bias torch.Size([2])
Model weight:
Parameter containing:
tensor([[ 0.1328, 0.1360, 0.1553, -0.1838, -0.0316],
[ 0.0479, 0.1760, 0.1712, 0.2244, 0.1408]], requires_grad=True)
Model bias:
Parameter containing:
tensor([ 0.4112, -0.0733], requires_grad=True)
---
Optimizer's state_dict:
state {}
param_groups [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [140695321443856, 140695321443928]}]
Note that this is a minimal model. You may try a stack of sequential layers:
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Conv2d(A, B, C),
    torch.nn.Linear(H, D_out),
)
Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (batchnorm layers) have entries in the model's state_dict.
Non-learnable things belong to the optimizer object state_dict, which contains information about the optimizer's state, as well as the hyperparameters used.
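A quick check of that claim (sketch): a BatchNorm layer contributes both its learnable parameters and its registered buffers to state_dict, while a parameter-free layer like ReLU contributes nothing.
bn = torch.nn.BatchNorm1d(4)
print(list(bn.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']
print(list(torch.nn.ReLU().state_dict().keys()))
# []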
The rest of the story is the same: in the inference phase (the phase where we use the model after training), we predict based on the parameters we learned, so for inference we only need to save model.state_dict().
torch.save(model.state_dict(), filepath)
And to use later
model.load_state_dict(torch.load(filepath))
model.eval()
Note: don't forget the last line, model.eval(); it is crucial after loading the model.
Also, don't try to save with torch.save(model.parameters(), filepath); model.parameters() is just a generator object.
On the other hand, torch.save(model, filepath) saves the model object itself, but keep in mind that the model does not hold the optimizer's state_dict. Check the other excellent answer by @Jadiel de Armas for saving the optimizer's state dict.
A common PyTorch convention is to save models using either a .pt or .pth file extension.
Save/Load Entire Model
Save:
path = "username/directory/lstmmodelgpu.pth"
torch.save(model, path)
Load:
(Model class must be defined somewhere)
model = torch.load(path)
model.eval()
If you want to save the model and resume training later:
Single GPU:
Save:
state = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}
savepath = 'checkpoint.t7'
torch.save(state, savepath)
Load:
checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
Multiple GPU:
Save:
state = {
    'epoch': epoch,
    'state_dict': model.module.state_dict(),
    'optimizer': optimizer.state_dict(),
}
savepath = 'checkpoint.t7'
torch.save(state, savepath)
Load:
checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
# Don't call DataParallel before loading the model, otherwise you will get an error
model = nn.DataParallel(model)  # ignore this line if you want to load on a single GPU
Saving locally
How you save your model depends on how you want to access it in the future. If you can call a new instance of the model class, then all you need to do is save/load the weights of the model with model.state_dict():
# Save:
torch.save(old_model.state_dict(), PATH)
# Load:
new_model = TheModelClass(*args, **kwargs)
new_model.load_state_dict(torch.load(PATH))
If you cannot for whatever reason (or prefer the simpler syntax), then you can save the entire model (actually a reference to the file(s) defining the model, along with its state_dict) with torch.save():
# Save:
torch.save(old_model, PATH)
# Load:
new_model = torch.load(PATH)
But since this is a reference to the location of the files defining the model class, this code is not portable unless those files are also ported in the same directory structure.
Saving to cloud - TorchHub
If you wish your model to be portable, you can easily allow it to be imported with torch.hub. If you add an appropriately defined hubconf.py file to a github repo, this can be easily called from within PyTorch to enable users to load your model with/without weights:
hubconf.py (in github.com/repo_owner/repo_name):
dependencies = ['torch']

from my_module import mymodel as _mymodel

def mymodel(pretrained=False, **kwargs):
    return _mymodel(pretrained=pretrained, **kwargs)
Loading model:
new_model = torch.hub.load('repo_owner/repo_name', 'mymodel')
new_model_pretrained = torch.hub.load('repo_owner/repo_name', 'mymodel', pretrained=True)
pip install pytorch-lightning
Make sure your parent model uses pl.LightningModule instead of nn.Module.
Saving and loading checkpoints using pytorch lightning
import pytorch_lightning as pl

model = MyLightningModule(hparams)
trainer = pl.Trainer()
trainer.fit(model)
trainer.save_checkpoint("example.ckpt")
new_model = MyLightningModule.load_from_checkpoint(checkpoint_path="example.ckpt")
These days everything is written in the official tutorial:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
You have several options on how to save and what to save and all is explained in that tutorial.
I use this approach; hope it will be useful for you.
num_labels = len(test_label_cols)
robertaclassificationtrain = '/dbfs/FileStore/tables/PM/TC/roberta_model'
robertaclassificationpath = "/dbfs/FileStore/tables/PM/TC/ROBERTACLASSIFICATION"

model = RobertaForSequenceClassification.from_pretrained(robertaclassificationpath,
                                                         num_labels=num_labels)
model.cuda()
model.load_state_dict(torch.load(robertaclassificationtrain))
model.eval()
Here 'roberta_model' is the path where I already saved my trained model. To save the trained model:
torch.save(model.state_dict(), '/dbfs/FileStore/tables/PM/TC/roberta_model')
