I'm working with Pytorch on Mask RCNN. I get training results and losses data, so now I want to plot on graph after each epoch done like this:
for epoch in range(num_epochs):
# train for one epoch, printing every 10 iterations
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
# update the learning rate
lr_scheduler.step()
# evaluate on the test dataset
evaluate(model, data_loader_test, device=device)
In engine.py is code for train_one_epoch
def train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq):
model.train()
metric_logger = utils.MetricLogger(delimiter=" ")
metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
header = 'Epoch: [{}]'.format(epoch)
lr_scheduler = None
if epoch == 0:
warmup_factor = 1. / 1000
warmup_iters = min(1000, len(data_loader) - 1)
lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
images = list(image.to(device) for image in images)
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
# reduce losses over all GPUs for logging purposes
loss_dict_reduced = utils.reduce_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
loss_value = losses_reduced.item()
if not math.isfinite(loss_value):
print("Loss is {}, stopping training".format(loss_value))
print(loss_dict_reduced)
sys.exit(1)
optimizer.zero_grad()
losses.backward()
optimizer.step()
if lr_scheduler is not None:
lr_scheduler.step()
metric_logger.update(loss=losses_reduced, **loss_dict_reduced)
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
return metric_logger
Here this function train_one_epoch returns metric_logger , so can I use that variable to somehow plot losses and accuracy after each epoch on graph to get something like this:
Related
I am following a Pytorch code on deep learning. Where I saw model evaluation taking place within the training epoch!
Q) Should the torch.no_grad and model.eval() be out of the training epoch loop?
Q) And how to determine that, which parameter (weight) are getting optimised by the optimiser during the back-propagation?
...
for l in range(1):
model = GTN(num_edge=A.shape[-1],
num_channels=num_channels,w_in = node_features.shape[1],w_out = node_dim,
num_class=num_classes,num_layers=num_layers,norm=norm)
if adaptive_lr == 'false':
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=0.001)
else:
optimizer = torch.optim.Adam([{'params':model.weight},{'params':model.linear1.parameters()},{'params':model.linear2.parameters()},
{"params":model.layers.parameters(), "lr":0.5}], lr=0.005, weight_decay=0.001)
loss = nn.CrossEntropyLoss()
# Train & Valid & Test
best_val_loss = 10000
best_train_loss = 10000
best_train_f1 = 0
best_val_f1 = 0
for i in range(epochs):
print('Epoch: ',i+1)
model.zero_grad()
model.train()
loss,y_train,Ws = model(A, node_features, train_node, train_target)
train_f1 = torch.mean(f1_score(torch.argmax(y_train.detach(),dim=1), train_target, num_classes=num_classes)).cpu().numpy()
print('Train - Loss: {}, Macro_F1: {}'.format(loss.detach().cpu().numpy(), train_f1))
loss.backward()
optimizer.step()
model.eval()
# Valid
with torch.no_grad():
val_loss, y_valid,_ = model.forward(A, node_features, valid_node, valid_target)
val_f1 = torch.mean(f1_score(torch.argmax(y_valid,dim=1), valid_target, num_classes=num_classes)).cpu().numpy()
if val_f1 > best_val_f1:
best_val_loss = val_loss.detach().cpu().numpy()
best_train_loss = loss.detach().cpu().numpy()
best_train_f1 = train_f1
best_val_f1 = val_f1
print('---------------Best Results--------------------')
print('Train - Loss: {}, Macro_F1: {}'.format(best_train_loss, best_train_f1))
print('Valid - Loss: {}, Macro_F1: {}'.format(best_val_loss, best_val_f1))
final_f1 += best_test_f1
For each epoch, you are doing train, followed by validation/test.
For validation/test you are moving the model to evaluation model
using model.eval() and then doing forward propagation with
torch.no_grad() which is correct. Again, you are moving back the
model back to train model using model.train() at the start of
train. There is no issue with the code and you are using the model
modes correctly.
In your code, if adaptive_lr if False then you are optimizing the parameters given by model.parameters() and when adaptive_lr
is True then you are optimizing:
model.weight
model.linear1.parameters()
model.linear2.parameters()
model.layers.parameters()
I need to write a code which trains a network given one single batch of training data and computes the loss on the complete validation set for each epoch as well. Set batch_size = 64.
Also, need to provide the graph the training and validation loss over epochs.
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.net_layer = Sequential(
nn.Flatten(),
nn.Linear(64*64,30),
nn.Sigmoid())
def foward(self, x):
x = self.net_layer(x)
return x
model = Net()
nepochs = 2
losses = np.zeros(nepochs)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for epoch in range(nepochs): # loop over the dataset multiple times
# initialise variables for mean loss calculation
running_loss = 0.0
n = 0
for data in train_loader:
inputs, labels = data
# Zero the parameter gradients to remove accumulated gradient from a previous iteration.
optimizer.zero_grad()
# Forward, backward, and update parameters
outputs = model(inputs) # running network
loss = loss_fn(outputs, labels) # calculating loss function
loss.backward() # backpropogating network
optimizer.step() # update model parameters with gradient decsent
# accumulate loss and increment minibatches
running_loss += loss.item()`enter code here`
n += 1
# record the mean loss for this epoch and show progress
losses[epoch] = running_loss / n
print(f"epoch: {epoch+1} loss: {losses[epoch] : .3f}")
I got this far and getting the following error:
error message
Any idea what I am doing wrong?
I am training my conditional GAN network using the code from TensorFlow tutorial that uses a written from scratch training loop
def fit(train_ds, epochs, test_ds):
for epoch in range(epochs):
start = time.time()
display.clear_output(wait=True)
for example_input, example_target in test_ds.take(1):
generate_images(generator, example_input, example_target)
print("Epoch: ", epoch)
# Train
for n, (input_image, target) in train_ds.enumerate():
print('.', end='')
if (n+1) % 100 == 0:
print()
train_step(input_image, target, epoch)
print()
# saving (checkpoint) the model every 20 epochs
if (epoch + 1) % 20 == 0:
checkpoint.save(file_prefix = checkpoint_prefix)
print ('Time taken for epoch {} is {} sec\n'.format(epoch + 1,
time.time()-start))
checkpoint.save(file_prefix = checkpoint_prefix)
and the train step is defined like this
#tf.function
def train_step(input_image, target, epoch):
with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
gen_output = generator(input_image, training=True)
disc_real_output = discriminator([input_image, target], training=True)
disc_generated_output = discriminator([input_image, gen_output], training=True)
gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(disc_generated_output, gen_output, target)
disc_loss = discriminator_loss(disc_real_output, disc_generated_output)
generator_gradients = gen_tape.gradient(gen_total_loss,
generator.trainable_variables)
discriminator_gradients = disc_tape.gradient(disc_loss,
discriminator.trainable_variables)
generator_optimizer.apply_gradients(zip(generator_gradients,
generator.trainable_variables))
discriminator_optimizer.apply_gradients(zip(discriminator_gradients,
discriminator.trainable_variables))
with summary_writer.as_default():
tf.summary.scalar('gen_total_loss', gen_total_loss, step=epoch)
tf.summary.scalar('gen_gan_loss', gen_gan_loss, step=epoch)
tf.summary.scalar('gen_l1_loss', gen_l1_loss, step=epoch)
tf.summary.scalar('disc_loss', disc_loss, step=epoch)
now my question is for the summary writer is it saving the loss of the batch only or the average across the whole dataset and if it is for the batch which batch loss is it saving and how can I get the average across the whole dataset if the batches are not the same size ?
I assumed it was the average because I got the code from a tensorflow tutorial so I trusted it but when I think about it I am not sure that that is the case.
If you want Tensorboard to get only the loss per epoch, you need to save the values to Tensorboard at the end of every epoch, not at every batch.
First, create variables for mean values for every epoch:
for epoch in range(epochs):
mean_epoch_loss = tf.metrics.Mean()
# etc...
And in train_step, update this value with the corresponding loss:
#tf.function
def train_step(input_image, target, epoch):
# etc...
mean_epoch_loss.update_state(epoch_loss)
And at the end of every epoch, save this value to Tensorboard:
for epoch in range(epochs):
# etc...
with summary_writer.as_default():
tf.summary.scalar('mean_epoch_loss', mean_epoch_loss, step=epoch)
train_losses=[]
val_losses=[]
for epoch in range(EPOCH):
train_loss = 0
valid_loss = 0
# train steps
net.train()
for batch_index, (data, target) in enumerate(train_loader):
target = target.reshape(-1,1)
# clears gradients
optimizer.zero_grad()
# forward pass
output, dense2_output = net(data)
eps = 1e-6
#loss in batch
loss = torch.sqrt(criterion(output, target.double())+eps)
# backward pass for loss gradient
loss.backward()
# update paremeters/weights
optimizer.step()
# update training loss
#train_loss += math.pow(loss.item(),2)*data.size(0)
train_loss += loss.item()*data.size(0)
# average loss calculations
#train_loss = math.sqrt(train_loss/len(train_loader.sampler))
train_loss = train_loss/len(train_loader.sampler)
# Display loss statistics
print(f'Current Epoch: {epoch}\n Training Loss: {round(train_loss, 6)}')
train_losses.append(train_loss)
I am Calculating Average RMSE Loss in Pytorch Neural Network regression model. Is it the correct way ?
Please help me understand if there is any error with my approach.
Actually i want to know if there is any bug in my code
I am very new in pytorch and implementing my own network of image classifier. However I see for each epoch training accuracy is very good but validation accuracy is 0.i noted till 5th epoch. I am using Adam optimizer and have learning rate .001. also resampling the whole data set after each epoch into training n validation set. Please help where I am going wrong.
Here is my code:
### where is data?
data_dir_train = '/home/sup/PycharmProjects/deep_learning/CNN_Data/training_set'
data_dir_test = '/home/sup/PycharmProjects/deep_learning/CNN_Data/test_set'
# Define your batch_size
batch_size = 64
allData = datasets.ImageFolder(root=data_dir_train,transform=transformArr)
# We need to further split our training dataset into training and validation sets.
def split_train_validation():
# Define the indices
num_train = len(allData)
indices = list(range(num_train)) # start with all the indices in training set
split = int(np.floor(0.2 * num_train)) # define the split size
#train_idx, valid_idx = indices[split:], indices[:split]
# Random, non-contiguous split
validation_idx = np.random.choice(indices, size=split, replace=False)
train_idx = list(set(indices) - set(validation_idx))
# define our samplers -- we use a SubsetRandomSampler because it will return
# a random subset of the split defined by the given indices without replacement
train_sampler = SubsetRandomSampler(train_idx)
validation_sampler = SubsetRandomSampler(validation_idx)
#train_loader = DataLoader(allData,batch_size=batch_size,sampler=train_sampler,shuffle=False,num_workers=4)
#validation_loader = DataLoader(dataset=allData,batch_size=1, sampler=validation_sampler)
return (train_sampler,validation_sampler)
Training
from torch.optim import Adam
import torch
import createNN
import torch.nn as nn
import loadData as ld
from torch.autograd import Variable
from torch.utils.data import DataLoader
# check if cuda - GPU support available
cuda = torch.cuda.is_available()
#create model, optimizer and loss function
model = createNN.ConvNet(class_num=2)
optimizer = Adam(model.parameters(),lr=.001,weight_decay=.0001)
loss_func = nn.CrossEntropyLoss()
if cuda:
model.cuda()
# function to save model
def save_model(epoch):
torch.save(model.load_state_dict(),'imageClassifier_{}.model'.format(epoch))
print('saved model at epoch',epoch)
def exp_lr_scheduler ( epoch , init_lr = args.lr, weight_decay = args.weight_decay, lr_decay_epoch = cf.lr_decay_epoch):
lr = init_lr * ( 0.5 ** (epoch // lr_decay_epoch))
def train(num_epochs):
best_acc = 0.0
for epoch in range(num_epochs):
print('\n\nEpoch {}'.format(epoch))
train_sampler, validation_sampler = ld.split_train_validation()
train_loader = DataLoader(ld.allData, batch_size=30, sampler=train_sampler, shuffle=False)
validation_loader = DataLoader(dataset=ld.allData, batch_size=1, sampler=validation_sampler)
model.train()
acc = 0.0
loss = 0.0
total = 0
# train model with training data
for i,(images,labels) in enumerate(train_loader):
# if cuda then move to GPU
if cuda:
images = images.cuda()
labels = labels.cuda()
# Variable class wraps a tensor and we can calculate grad
images = Variable(images)
labels = Variable(labels)
# reset accumulated gradients for each batch
optimizer.zero_grad()
# pass images to model which returns preiction
output = model(images)
#calculate the loss based on prediction and actual
loss = loss_func(output,labels)
# backpropagate the loss and compute gradient
loss.backward()
# update weights as per the computed gradients
optimizer.step()
# prediction class
predVal , predClass = torch.max(output.data, 1)
acc += torch.sum(predClass == labels.data)
loss += loss.cpu().data[0]
total += labels.size(0)
# print the statistics
train_acc = acc/total
train_loss = loss / total
print('Mean train acc = {} over epoch = {}'.format(epoch,acc))
print('Mean train loss = {} over epoch = {}'.format(epoch, loss))
# Valid model with validataion data
model.eval()
acc = 0.0
loss = 0.0
total = 0
for i,(images,labels) in enumerate(validation_loader):
# if cuda then move to GPU
if cuda:
images = images.cuda()
labels = labels.cuda()
# Variable class wraps a tensor and we can calculate grad
images = Variable(images)
labels = Variable(labels)
# reset accumulated gradients for each batch
optimizer.zero_grad()
# pass images to model which returns preiction
output = model(images)
#calculate the loss based on prediction and actual
loss = loss_func(output,labels)
# backpropagate the loss and compute gradient
loss.backward()
# update weights as per the computed gradients
optimizer.step()
# prediction class
predVal, predClass = torch.max(output.data, 1)
acc += torch.sum(predClass == labels.data)
loss += loss.cpu().data[0]
total += labels.size(0)
# print the statistics
valid_acc = acc / total
valid_loss = loss / total
print('Mean train acc = {} over epoch = {}'.format(epoch, valid_acc))
print('Mean train loss = {} over epoch = {}'.format(epoch, valid_loss))
if(best_acc<valid_acc):
best_acc = valid_acc
save_model(epoch)
# at 30th epoch we save the model
if (epoch == 30):
save_model(epoch)
train(20)
I think you did not take into account that acc += torch.sum(predClass == labels.data) returns a tensor instead of a float value. Depending on the version of pytorch you are using I think you should change it to:
acc += torch.sum(predClass == labels.data).cpu().data[0] #pytorch 0.3
acc += torch.sum(predClass == labels.data).item() #pytorch 0.4
Although your code seems to be working for old pytorch version, I would recommend you to upgrade to the 0.4 version.
Also, I mentioned other problems/typos in your code.
You are loading the dataset for every epoch.
for epoch in range(num_epochs):
print('\n\nEpoch {}'.format(epoch))
train_sampler, validation_sampler = ld.split_train_validation()
train_loader = DataLoader(ld.allData, batch_size=30, sampler=train_sampler, shuffle=False)
validation_loader = DataLoader(dataset=ld.allData, batch_size=1, sampler=validation_sampler)
...
That should not happen, it should be enough loading it once
train_sampler, validation_sampler = ld.split_train_validation()
train_loader = DataLoader(ld.allData, batch_size=30, sampler=train_sampler, shuffle=False)
validation_loader = DataLoader(dataset=ld.allData, batch_size=1, sampler=validation_sampler)
for epoch in range(num_epochs):
print('\n\nEpoch {}'.format(epoch))
...
In the training part you have (this does not happen in the validation):
train_acc = acc/total
train_loss = loss / total
print('Mean train acc = {} over epoch = {}'.format(epoch,acc))
print('Mean train loss = {} over epoch = {}'.format(epoch, loss))
Where you are printing acc instead of train_acc
Also, in the validation part I mentioned that you are printing print('Mean train acc = {} over epoch = {}'.format(epoch, valid_acc)) when it should be something like 'Mean val acc'.
Changing this lines of code, using a standard model I created and CIFAR dataset the training seems to converge, accuracy increases at every epoch while mean loss value decreases.
I Hope I could help you!