Use Adam optimizer for LSTM network vs LBFGS - python

I have modified the PyTorch tutorial on LSTMs (sine-wave prediction: given sine values [0:N], predict values [N:2N]) to use the Adam optimizer instead of the LBFGS optimizer. However, the model does not train well and cannot predict the sine wave correctly. Since Adam is the usual choice for RNN training, I wonder how this issue can be resolved. I also wonder whether the sequence-in, sequence-out segment (done with a loop: for input_t in input.split(1, dim=1)) can be replaced by a PyTorch module or function.
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt

class Sequence(nn.Module):
    def __init__(self):
        super(Sequence, self).__init__()
        self.lstm1 = nn.LSTMCell(1, 51)
        self.lstm2 = nn.LSTMCell(51, 51)
        self.linear = nn.Linear(51, 1)

    def forward(self, input, future=0):
        outputs = []
        h_t = torch.zeros(input.size(0), 51, dtype=torch.double)
        c_t = torch.zeros(input.size(0), 51, dtype=torch.double)
        h_t2 = torch.zeros(input.size(0), 51, dtype=torch.double)
        c_t2 = torch.zeros(input.size(0), 51, dtype=torch.double)
        for input_t in input.split(1, dim=1):
            h_t, c_t = self.lstm1(input_t, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs += [output]
        for i in range(future):  # if we should predict the future
            h_t, c_t = self.lstm1(output, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs += [output]
        outputs = torch.cat(outputs, dim=1)
        return outputs
if __name__ == '__main__':
    # set random seed to 0
    np.random.seed(0)
    torch.manual_seed(0)
    # load data and make training set
    data = torch.load('traindata.pt')
    input = torch.from_numpy(data[3:, :-1])
    target = torch.from_numpy(data[3:, 1:])
    test_input = torch.from_numpy(data[:3, :-1])
    test_target = torch.from_numpy(data[:3, 1:])
    print("input.size", input.size())
    print("target.size", target.size())
    # build the model
    seq = Sequence()
    seq.double()
    criterion = nn.MSELoss()
    # use Adam as optimizer (the original tutorial uses LBFGS, which can work on the whole dataset at once)
    optimizer = optim.Adam(seq.parameters(), lr=0.005)
    # begin to train
    for i in range(15):
        print('STEP: ', i)
        seq.train()

        def run1step():
            optimizer.zero_grad()
            out = seq(input)
            loss = criterion(out, target)
            print('train loss:', loss.item())
            loss.backward()
            return loss

        run1step()
        optimizer.step()
        # begin to predict, no need to track gradient here
        seq.eval()
        with torch.no_grad():
            future = 1000
            pred = seq(test_input, future=future)
            loss = criterion(pred[:, :-future], test_target)
            print('test loss:', loss.item())
            y = pred.detach().numpy()

        # draw the result
        def draw(yi, color):
            plt.figure(figsize=(30, 10))
            plt.title('Predict future values for time sequences\n(Dashlines are predicted values)', fontsize=30)
            plt.xlabel('x', fontsize=20)
            plt.ylabel('y', fontsize=20)
            plt.xticks(fontsize=20)
            plt.yticks(fontsize=20)
            plt.plot(np.arange(input.size(1)), yi[:input.size(1)], color, linewidth=2.0)
            plt.plot(np.arange(input.size(1), input.size(1) + future), yi[input.size(1):], color + ':', linewidth=2.0)
            plt.show()

        if i == 14:
            draw(y[0], 'r')
            draw(y[1], 'g')
            draw(y[2], 'b')
            plt.savefig('predict_LSTM%d.pdf' % i)
            #plt.close()

I've just executed your code and the original code. I think the problem is that you didn't train with Adam long enough. Your training loss is still decreasing at step 15, so I changed the number of steps from 15 to 45; this is the figure generated after step 40:
The original code reached a loss of 4e-05 after step 4, but after that the loss somehow exploded. Your code with Adam reduces the loss across all 45 steps, but the final loss is around 0.001. I hope I ran both programs correctly.
Oh, regarding your second question:
I also wonder if the code segment regarding sequence-in-sequence-out ...
Yes, you can write a function or define a module with two LSTMs to do that. But it doesn't buy you much here, since your network contains only two LSTMs; you have to do this "wiring" work at some point anyway.
If your network contained several such blocks, you could write a module with two LSTMs and use it as a primitive module, e.g. self.BigLSTM = BigLSTM(...), just like you define self.lstm1 = nn.LSTMCell(...).
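For the teacher-forced part specifically, nn.LSTM with num_layers=2 can consume the whole input sequence in one call, replacing the explicit LSTMCell loop. A minimal sketch of that idea (my illustration; dtype handling is omitted, and the autoregressive "future" loop would still need to be written by hand):

import torch
import torch.nn as nn

class SequenceStacked(nn.Module):
    def __init__(self):
        super(SequenceStacked, self).__init__()
        # two stacked LSTM layers, same shapes as lstm1 + lstm2 above
        self.lstm = nn.LSTM(input_size=1, hidden_size=51, num_layers=2, batch_first=True)
        self.linear = nn.Linear(51, 1)

    def forward(self, input):
        # input: (batch, seq_len) -> add a feature dimension -> (batch, seq_len, 1)
        out, _ = self.lstm(input.unsqueeze(-1))  # out: (batch, seq_len, 51)
        return self.linear(out).squeeze(-1)      # (batch, seq_len)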

Related

No gradients provided for any variable - Custom loss function with random weights depending on the Softmax output

I have difficulties writing a custom loss function that makes use of some random weights generated according to the class/state predicted by the Softmax output. The desired properties are:
The model is a simple feedforward neural network with input dimension 1 and output dimension 6.
The activation function of the output layer is Softmax, and the actual class/state is meant to be estimated with Argmax.
Note that the training data only consist of X (there is no Y).
The loss function is defined according to random weights (i.e., from a Weibull distribution) sampled based on the predicted state number for each input sample X.
Below, I provide a minimal example for illustration. For simplicity, I only define the loss function based on the random weights for state/class 1. I get: "ValueError: No gradients provided for any variable: ['dense_41/kernel:0', 'dense_41/bias:0', 'dense_42/kernel:0', 'dense_42/bias:0']."
As indicated in the post below, I found out that argmax is not differentiable, and that a softargmax function would help (as I implemented in the following code). However, I still get the same error.
Getting around tf.argmax which is not differentiable
import sys
import time
from tqdm import tqdm
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
from scipy.stats import weibull_min

###############################################################################################
# Generate Dataset
lb = np.array([2.0])    # Left boundary
ub = np.array([100.0])  # Right boundary
# Data Points - uniformly distributed
N_r = 50
X_r = np.linspace(lb, ub, N_r)

###############################################################################################
# Define Model
class DGM:
    # Initialize the class
    def __init__(self, X_r):
        # Normalize training input data
        self.Xmean, self.Xstd = np.mean(X_r), np.std(X_r)
        X_r = (X_r - self.Xmean) / self.Xstd
        self.X_r = X_r
        # Input and output variable dimensions
        self.X_dim = 1; self.Y_dim = 6
        # Define tensors
        self.X_r_tf = tf.convert_to_tensor(X_r, dtype=tf.float32)
        # Learning rate
        self.LEARNING_RATE = 1e-4
        # Feedforward neural network model
        self.modelTest = self.test_model()

    ###############################################
    # Initialize network weights and biases
    def test_model(self):
        input_shape = self.X_dim
        dimensionality = self.Y_dim
        model = tf.keras.Sequential()
        model.add(layers.Input(shape=input_shape))
        model.add(layers.Dense(64, kernel_initializer='glorot_uniform', bias_initializer='zeros'))
        model.add(layers.Activation('tanh'))
        model.add(layers.Dense(dimensionality))
        model.add(layers.Activation('softmax'))
        return model

    ##############################################
    def compute_loss(self):
        # Define optimizer
        gen_opt = tf.keras.optimizers.Adam(lr=self.LEARNING_RATE, beta_1=0.0, beta_2=0.9)
        with tf.GradientTape() as test_tape:
            ###### calculate loss
            generated_u = self.modelTest(self.X_r_tf, training=True)
            # number of data
            n_data = generated_u.shape[0]
            # initialize random weights assuming state-1 at all input samples
            wt1 = np.zeros((n_data, 1), dtype=np.float32)  # initialize weights
            for b in range(n_data):
                wt1[b] = weibull_min.rvs(c=2, loc=0, scale=4, size=1)
            wt1 = tf.reshape(tf.convert_to_tensor(wt1, dtype=tf.float32), shape=(n_data, 1))
            #print('-----------sampling done-----------')
            # determine the actual state using softargmax
            idst = self.softargmax(generated_u)
            idst = tf.reshape(tf.cast(idst, tf.float32), shape=(n_data, 1))
            # index state-1
            id1 = tf.constant(0., dtype=tf.float32)
            # assign weights if predicted state is state-1
            wt1_final = tf.cast(tf.equal(idst, id1), dtype=tf.float32) * wt1
            # final loss
            test_loss = tf.reduce_mean(tf.square(wt1_final))
            #print('-----------test loss calculated-----------')
        gradients_of_modelTest = test_tape.gradient(test_loss,
                                                    [self.modelTest.trainable_variables])
        gen_opt.apply_gradients(zip(gradients_of_modelTest[0], self.modelTest.trainable_variables))
        return test_loss

    # reference: Getting around tf.argmax which is not differentiable
    # https://stackoverflow.com/questions/46926809/getting-around-tf-argmax-which-is-not-differentiable
    def softargmax(self, x, beta=1e10):
        x = tf.convert_to_tensor(x)
        x_range = tf.range(x.shape.as_list()[-1], dtype=x.dtype)
        return tf.reduce_sum(tf.nn.softmax(x * beta, axis=1) * x_range, axis=-1)

    ##############################################
    def train(self, training_steps=100):
        train_start_time = time.time()
        for step in tqdm(range(training_steps), desc='Training'):
            start = time.time()
            test_loss = self.compute_loss()
            if (step + 1) % 10 == 0:
                elapsed_time = time.time() - train_start_time
                sec_per_step = elapsed_time / step
                mins_left = ((training_steps - step) * sec_per_step)
                tf.print("\nStep # ", step, "/", training_steps,
                         output_stream=sys.stdout)
                tf.print("Current time:", elapsed_time, " time left:",
                         mins_left, output_stream=sys.stdout)
                tf.print("Test Loss: ", test_loss, output_stream=sys.stdout)

###############################################################################################
# Define and train the model
model = DGM(X_r)
model.train(training_steps=100)
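For what it's worth, the gradient problem can be reproduced in isolation: the hard comparison tf.cast(tf.equal(idst, id1)) used to build wt1_final is piecewise constant, so no gradient flows through it even when idst comes from softargmax. A standalone sketch of the difference between a hard mask and a soft weighting (my illustration, not from the original post):

import tensorflow as tf

# toy demonstration: a hard argmax-based mask has no gradient, a soft one does
x = tf.Variable([[2.0, 1.0, 0.5]])
with tf.GradientTape(persistent=True) as tape:
    probs = tf.nn.softmax(x, axis=-1)
    # hard mask, analogous to wt1_final above: piecewise constant in x
    loss_hard = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(probs, axis=-1), 0), tf.float32))
    # soft alternative: weight by the class probability itself
    loss_soft = tf.reduce_mean(probs[:, 0])

print(tape.gradient(loss_hard, x))  # None: no gradient reaches the weights
print(tape.gradient(loss_soft, x))  # a finite gradient tensor
del tape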

How to use SMAC for hyperparameter optimization of a Convolutional Neural Network?

Note: long post, please bear with me.
I have implemented a convolutional neural network in PyTorch on the KMNIST dataset. I need to use SMAC to optimize the learning rate and momentum of the CNN's Stochastic Gradient Descent. I am new to hyperparameter optimization, and what I learnt from the SMAC documentation is:
SMAC evaluates the algorithm to be optimized by invoking it through a Target Algorithm Evaluator (TAE).
We need a Scenario object to configure the optimization process.
The run_obj parameter in the Scenario object specifies what SMAC is supposed to optimize.
My ultimate goal is to get good accuracy or low loss.
This is what I have done so far:
Convolutional Neural Network
import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.autograd import Variable
from datasets import *
import torch.utils.data
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Create the model class
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()  # to inherit the features of nn.Module
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=1)
        # in_channels = 1 because of grey scale image
        # kernel_size = feature_size
        # padding = 1 because for same padding: padding = (filter_size - 1) / 2
        # the output size of the 8 feature maps is (input_size - filter_size + 2*padding)/stride + 1
        # Batch Normalization
        self.batchnorm1 = nn.BatchNorm2d(8)
        # ReLU
        self.relu = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        # After maxpooling, the output of each feature map is 28/2 = 14
        self.cnn2 = nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1, padding=2)
        # Batch Normalization
        self.batchnorm2 = nn.BatchNorm2d(32)
        # ReLU
        #self.relu = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        # After maxpooling, the output of each feature map is 14/2 = 7
        # Flatten the feature maps. You have 32 feature maps, each of size 7x7 --> 32*7*7 = 1568
        self.fc1 = nn.Linear(in_features=1568, out_features=600)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(in_features=600, out_features=10)

    def forward(self, x):
        out = self.cnn1(x)
        #out = F.relu(self.cnn1(x))
        out = self.batchnorm1(out)
        out = self.relu(out)
        out = self.maxpool1(out)
        out = self.cnn2(out)
        out = self.batchnorm2(out)
        out = self.relu(out)
        out = self.maxpool2(out)
        # Now we have to flatten the output. This is where we apply the
        # feed forward neural network as learned before!
        # It will then take the shape (batch_size, 1568) = (100, 1568)
        out = out.view(-1, 1568)
        # Then we forward through our fully connected layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        return out

def train(model, train_loader, optimizer, epoch, CUDA, loss_fn):
    model.train()
    cum_loss = 0
    iter_count = 0
    for i, (images, labels) in enumerate(train_loader):
        if CUDA:
            images = Variable(images.cuda())
            images = images.unsqueeze(1)
            images = images.type(torch.FloatTensor)
            images = images.cuda()
            labels = Variable(labels.cuda())
            labels = labels.type(torch.LongTensor)
            labels = labels.cuda()
        else:
            images = Variable(images)
            images = images.unsqueeze(1)
            images = images.type(torch.DoubleTensor)
            labels = Variable(labels)
            labels = labels.type(torch.DoubleTensor)
        optimizer.zero_grad()
        outputs = model(images)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        cum_loss += loss
        if (i + 1) % batch_size == 0:
            correct = 0
            total = 0
            acc = 0
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            if CUDA:
                correct += (predicted.cpu() == labels.cpu()).sum()
            else:
                correct += (predicted == labels).sum()
            accuracy = 100 * correct / total
        if i % len(train_loader) == 0:
            iter_count += 1
    ave_loss = cum_loss / batch_size
    return ave_loss

batch_size = 100
epochs = 5
e = range(epochs)
#print(e)

# Load datasets
variable_name = KMNIST()
train_images = variable_name.images
train_images = torch.from_numpy(train_images)
#print(train_images.shape)
#print(type(train_images))
train_labels = variable_name.labels
train_labels = torch.from_numpy(train_labels)
#print(train_labels.shape)
#print(type(train_labels))
train_dataset = torch.utils.data.TensorDataset(train_images, train_labels)
# Make the dataset iterable
train_load = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
print('There are {} images in the training set'.format(len(train_dataset)))
print('There are {} batches in the loaded training set'.format(len(train_load)))

def net(learning_rate, Momentum):
    model = CNN()
    CUDA = torch.cuda.is_available()
    if CUDA:
        model = model.cuda()
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=Momentum, nesterov=True)
    iteration = 0
    total_loss = []
    for epoch in range(epochs):
        ave_loss = train(model, train_load, optimizer, epoch, CUDA, loss_fn)
        total_loss.append(ave_loss)
    return optimizer, loss_fn, model, total_loss

optimizer, loss_fn, model, total_loss = net(learning_rate=0.01, Momentum=0.09)

# Print model's state_dict
print("---------------")
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
print("---------------")
#print("Optimizer's state_dict:")
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])

torch.save(model.state_dict(), "kmnist_cnn.pt")
plt.plot(e, (np.array(total_loss)))
plt.xlabel("# Epoch")
plt.ylabel("Loss")
plt.show()
print('Done!')
print('Done!')
SMAC hyperparameter optimization:
from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter, \
    UniformFloatHyperparameter, UniformIntegerHyperparameter
from smac.configspace.util import convert_configurations_to_array
#from ConfigSpace.conditions import InCondition
# Import SMAC-utilities
from smac.tae.execute_func import ExecuteTAFuncDict
from smac.scenario.scenario import Scenario
from smac.facade.smac_facade import SMAC

# Build Configuration Space which defines all parameters and their ranges
cs = ConfigurationSpace()
# We define the two hyperparameters and add them to our cs
lr = UniformFloatHyperparameter('learning_rate', 1e-4, 1e-1, default_value='1e-2')
momentum = UniformFloatHyperparameter('Momentum', 0.01, 0.1, default_value='0.09')
cs.add_hyperparameters([lr, momentum])

def kmnist_from_cfg(cfg):
    cfg = {k: cfg[k] for k in cfg if cfg[k]}
    print('Config is', cfg)
    #optimizer, loss_fn, model, total_loss = net(**cfg)
    #optimizer, loss_fn, model, total_loss = net(learning_rate=cfg["learning_rate"], Momentum=cfg["Momentum"])
    optimizer, loss_fn, model, total_loss = net(learning_rate=0.02, Momentum=0.05)
    return optimizer, loss_fn, model, total_loss

# Scenario object
scenario = Scenario({"run_obj": "quality",   # we optimize quality (alternatively runtime)
                     "runcount-limit": 200,  # maximum function evaluations
                     "cs": cs,               # configuration space
                     "deterministic": "true"
                     })

#def_value = kmnist_from_cfg(cs.get_default_configuration())
#print("Default Value: %.2f" % (def_value))

# Optimize, using a SMAC-object
print("Optimizing! Depending on your machine, this might take a few minutes.")
smac = SMAC(scenario=scenario, tae_runner=kmnist_from_cfg)  #rng=np.random.RandomState(42)
smac.solver.intensifier.tae_runner.use_pynisher = False
print("SMAC", smac)
incumbent = smac.optimize()
inc_value = kmnist_from_cfg(incumbent)
print("Optimized Value: %.2f" % (inc_value))
When I give loss as the run_obj parameter, I get the error message:
ArgumentError: argument --run-obj/--run_obj: invalid choice: 'total_loss' (choose from 'runtime', 'quality')
To be honest, I do not know what "quality" means. Anyway, when I give quality as the run_obj parameter, I get the error message:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
If I understood correctly, the above error message is obtained when an int is expected but a str is given. To check whether the problem was with the configuration space, I tried:
optimizer, loss_fn, model, total_loss = net(learning_rate=0.02, Momentum=0.05)
instead of these:
optimizer, loss_fn, model, total_loss = net(**cfg)
optimizer, loss_fn, model, total_loss = net(learning_rate=cfg["learning_rate"], Momentum=cfg["Momentum"])
but the error remains the same.
Any ideas on how to use SMAC to optimize the hyperparameters of a CNN, and why I get this error message? I tried looking for similar problems online, and this post was a little helpful. Unfortunately, since there is no example of applying SMAC to a neural network (at least I did not find one), I cannot figure out the solution. I have run out of ideas.
Any help, ideas, or useful links are appreciated.
Thank you!
I believe the tae_runner (kmnist_from_cfg in your case) has to be a callable that takes a configuration space point, which you correctly provide, and outputs a single number. You output a tuple of things. Perhaps return only the total_loss on the validation set? I am basing this on the SVM example in the SMAC GitHub repo at https://github.com/automl/SMAC3/blob/master/examples/svm.py.
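A minimal sketch of that change, reusing the question's net function (my illustration; the key point is returning a single scalar):

def kmnist_from_cfg(cfg):
    # SMAC passes a Configuration; keep only the keys that are set
    cfg = {k: cfg[k] for k in cfg if cfg[k]}
    optimizer, loss_fn, model, total_loss = net(learning_rate=cfg["learning_rate"],
                                                Momentum=cfg["Momentum"])
    # return ONE number for SMAC to minimize, e.g. the final epoch's average loss
    return float(total_loss[-1])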

Why can't I learn the XOR function with this network and these constraints?

Let's say I have the following network and constraints:
The architecture is fixed (see this image) (note that there are no biases).
The activation function for the hidden layer is ReLU.
There's no activation function for the output layer (it should just return the sum of the inputs it receives).
I tried to implement this in PyTorch with various initialization schemes and different data sets, but I failed (the code is at the bottom).
My questions are:
Is there anything wrong with my NN training process?
Is this a feasible problem? If yes, how?
If this is doable, can we still achieve it by constraining the weights to be in the set {-1, 0, 1}?
Code:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data_utils
import numpy as np

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(2, 2, bias=False)
        self.fc2 = nn.Linear(2, 1, bias=False)
        self.rl = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.rl(x)
        x = self.fc2(x)
        return x

# create an XOR data set to train
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype('int32')
# test data set
X_test = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

train = data_utils.TensorDataset(torch.from_numpy(X).float(),
                                 torch.from_numpy(y).float())
train_loader = data_utils.DataLoader(train, batch_size=50, shuffle=True)
test = torch.from_numpy(X_test).float()

# training the network
num_epoch = 10000
net = Network()
net.fc1.weight.data.clamp_(min=-1, max=1)
net.fc2.weight.data.clamp_(min=-1, max=1)
# define loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters())

for epoch in range(num_epoch):
    running_loss = 0  # loss per epoch
    for (X, y) in train_loader:
        # make the grads zero
        optimizer.zero_grad()
        # forward propagate
        out = net(X)
        # calculate loss and update
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.data
    if epoch % 500 == 0:
        print("Epoch: {0} Loss: {1}".format(epoch, running_loss))
The loss doesn't improve; it gets stuck at some value after a few epochs (I'm not sure how to make this reproducible, as I get different values every time).
net(test) returns a set of predictions that are nowhere close to the XOR output.
You need to use a nonlinear activation function such as sigmoid in your hidden and output layers, because XOR is not linearly separable. Biases are also required.
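A minimal sketch along those lines (my illustration: the same 2-2-1 shape, but with biases enabled and sigmoid activations):

import torch
import torch.nn as nn
import torch.optim as optim

# the four XOR points and their labels
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

net = nn.Sequential(
    nn.Linear(2, 2, bias=True),  # biases let each unit shift its threshold
    nn.Sigmoid(),
    nn.Linear(2, 1, bias=True),
    nn.Sigmoid(),
)
optimizer = optim.Adam(net.parameters(), lr=0.1)
criterion = nn.MSELoss()

for epoch in range(5000):
    optimizer.zero_grad()
    loss = criterion(net(X), y)
    loss.backward()
    optimizer.step()

# should approach [0, 1, 1, 0]; with an unlucky init it can still land in a
# local minimum, in which case re-running with a different seed helps
print(net(X).detach())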

Why does the torch.nn package not support inputs that are a single sample?

I am trying to understand deep learning with PyTorch. I read the PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html), and it says:
"torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.
For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension."
I am not sure I understand what this means.
Indeed, I have made a simple feed-forward neural network (see the code below) that uses a really small dataset (the idea is to learn how it works without mini-batches first, not really to have anything useful), and hence doesn't need mini-batches, so I feed all the samples in directly at each epoch. If I understand correctly, I should add train_data = train_data.unsqueeze(0). But I am not sure where, as it seems to change the data size to 1. Also, it works without adding this line, so why should I really use it?
Any help would be highly appreciated!
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd
import os
import numpy as np

# Download data
#...

# Construct network
len_input = len(data[0])
len_output = nbr_class
print('There are %d classes used for classification' % nbr_class)

# defining a new class, Net, that extends nn.Module
# set up the "skeleton" of our network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # creation of fully connected layers. A fully connected neural network layer is
        # represented by the nn.Linear object, with the first argument being the number
        # of nodes in layer l and the second argument the number of nodes in layer l+1
        self.fc1 = nn.Linear(len_input, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, len_output)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x)

# Create initial network
epochs = 3000
model = Net()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Train & test network
def train(data, target, epoch):
    model.train()  # set the model in "training mode"
    # run the main training loop
    # no need for a batch size with such a small number of samples; this way the process is more exact
    # zero our gradients before calling .backward(), which is necessary for a fresh sum of gradients
    optimizer.zero_grad()
    target_pred = model(data)
    loss = criterion(target_pred, target)
    # propagate the gradients back through the network
    loss.backward()
    # update the weights: tell our optimizer to "step", meaning the weights will be
    # updated using the calculated gradients according to our rule
    optimizer.step()
    # for graphing purposes
    loss_array.append(loss.data[0])
    if epoch in list(range(0, 20000, 100)):
        print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, loss.data[0]))

def test(epoch, test_data, test_target):
    # eval mode to turn Dropout and BatchNorm off
    model.eval()
    test_loss = 0
    correct = 0
    test_target_pred = model(test_data)
    criterion = nn.NLLLoss()
    # sum up batch loss
    test_loss += criterion(test_target_pred, test_target).data[0]
    pred = test_target_pred.data.max(1)[1]  # get the index of the max log-probability
    correct += pred.eq(test_target.data).sum()
    if epoch in list(range(0, 20000, 100)):
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_target), 100. * correct / len(test_target)))

if __name__ == '__main__':
    for epoch in range(epochs):
        train(data=train_data, target=train_target, epoch=epoch)
        test(epoch, test_data, test_target)
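To illustrate the quoted rule, here is a standalone sketch (my illustration, not from the question) of what unsqueeze(0) does:

import torch
import torch.nn as nn

fc = nn.Linear(10, 5)
sample = torch.randn(10)             # one sample, shape (10,)
batch = sample.unsqueeze(0)          # shape (1, 10): a "mini-batch" containing one sample
print(batch.shape)                   # torch.Size([1, 10])
print(fc(batch).shape)               # torch.Size([1, 5])

conv = nn.Conv2d(1, 8, kernel_size=3)
img = torch.randn(1, 28, 28)         # one image: channels x height x width
print(conv(img.unsqueeze(0)).shape)  # torch.Size([1, 8, 26, 26]): the 4D input the tutorial describes

Note that if train_data already has shape (num_samples, num_features), it already is a batch of num_samples samples, which is why the code runs without unsqueeze(0).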

CTC loss goes down and stops

I'm trying to train a captcha recognition model. Model details: ResNet pretrained CNN layers + bidirectional LSTM + a fully connected layer. It reached 90% sequence accuracy on captchas generated by the Python library captcha. The problem is that these generated captchas seem to place each character in a similar location. When I randomly add spaces between characters, the model no longer works. So I wonder: does the LSTM learn segmentation during training? Then I tried to use CTC loss. At first, the loss goes down pretty quickly, but it stays at about 16 without a significant drop later. I tried different numbers of LSTM layers and different numbers of units. 2 layers of LSTM reach a lower loss but still do not converge; 3 layers behave just like 2 layers. The loss curve:
#encoding:utf8
import os
import sys
import torch
import warpctc_pytorch
import traceback
import torchvision
from torch import nn, autograd, FloatTensor, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import MultiStepLR
from tensorboard import SummaryWriter
from pprint import pprint

from net.utils import decoder
from logging import getLogger, StreamHandler
logger = getLogger(__name__)
handler = StreamHandler(sys.stdout)
logger.addHandler(handler)

from dataset_util.utils import id_to_character
from dataset_util.transform import rescale, normalizer
from config.config import MAX_CAPTCHA_LENGTH, TENSORBOARD_LOG_PATH, MODEL_PATH

class CNN_RNN(nn.Module):
    def __init__(self, lstm_bidirectional=True, use_ctc=True, *args, **kwargs):
        super(CNN_RNN, self).__init__(*args, **kwargs)
        model_conv = torchvision.models.resnet18(pretrained=True)
        for param in model_conv.parameters():
            param.requires_grad = False
        modules = list(model_conv.children())[:-1]  # delete the last fc layer.
        for param in modules[8].parameters():
            param.requires_grad = True
        self.resnet = nn.Sequential(*modules)  # CNN with fixed parameters from resnet as feature extractor
        self.lstm_input_size = 512 * 2 * 2
        self.lstm_hidden_state_size = 512
        self.lstm_num_layers = 2
        self.chracter_space_length = 64
        self._lstm_bidirectional = lstm_bidirectional
        self._use_ctc = use_ctc
        if use_ctc:
            self._max_captcha_length = int(MAX_CAPTCHA_LENGTH * 2)
        else:
            self._max_captcha_length = MAX_CAPTCHA_LENGTH
        if lstm_bidirectional:
            self.lstm_hidden_state_size = self.lstm_hidden_state_size * 2  # so that the hidden size for one direction of the bidirectional lstm is the same as for the vanilla lstm
            self.lstm = nn.LSTM(self.lstm_input_size, self.lstm_hidden_state_size // 2, dropout=0.5, bidirectional=True, num_layers=self.lstm_num_layers)
        else:
            self.lstm = nn.LSTM(self.lstm_input_size, self.lstm_hidden_state_size, dropout=0.5, bidirectional=False, num_layers=self.lstm_num_layers)  # dropout doesn't work for a one-layer lstm
        self.ouput_to_tag = nn.Linear(self.lstm_hidden_state_size, self.chracter_space_length)
        self.tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_PATH)
        # self.dropout_lstm = nn.Dropout()

    def init_hidden_status(self, batch_size):
        if self._lstm_bidirectional:
            self.hidden = (autograd.Variable(torch.zeros((self.lstm_num_layers * 2, batch_size, self.lstm_hidden_state_size // 2))),
                           autograd.Variable(torch.zeros((self.lstm_num_layers * 2, batch_size, self.lstm_hidden_state_size // 2))))  # number of layers, batch size, hidden dimension
        else:
            self.hidden = (autograd.Variable(torch.zeros((self.lstm_num_layers, batch_size, self.lstm_hidden_state_size))),
                           autograd.Variable(torch.zeros((self.lstm_num_layers, batch_size, self.lstm_hidden_state_size))))  # number of layers, batch size, hidden dimension

    def forward(self, image):
        '''
        :param image: # batch_size, CHANNEL, HEIGHT, WIDTH
        :return:
        '''
        features = self.resnet(image)  # [batch_size, 512, 2, 2]
        batch_size = image.shape[0]
        features = [features.view(batch_size, -1) for i in range(self._max_captcha_length)]
        features = torch.stack(features)
        self.init_hidden_status(batch_size)
        output, hidden = self.lstm(features, self.hidden)
        # output = self.dropout_lstm(output)
        tag_space = self.ouput_to_tag(output.view(-1, output.size(2)))  # [MAX_CAPTCHA_LENGTH * BATCH_SIZE, CHARACTER_SPACE_LENGTH]
        tag_space = tag_space.view(self._max_captcha_length, batch_size, -1)
        if not self._use_ctc:
            tag_score = F.log_softmax(tag_space, dim=2)  # [MAX_CAPTCHA_LENGTH, BATCH_SIZE, CHARACTER_SPACE_LENGTH]
        else:
            tag_score = tag_space
        return tag_score

    def train_net(self, data_loader, eval_data_loader=None, learning_rate=0.008, epoch_num=400):
        try:
            if self._use_ctc:
                loss_function = warpctc_pytorch.warp_ctc.CTCLoss()
            else:
                loss_function = nn.NLLLoss()
            # optimizer = optim.SGD(filter(lambda p: p.requires_grad, self.parameters()), momentum=0.9, lr=learning_rate)
            # optimizer = MultiStepLR(optimizer, milestones=[10,15], gamma=0.5)
            # optimizer = optim.Adadelta(filter(lambda p: p.requires_grad, self.parameters()))
            optimizer = optim.Adam(filter(lambda p: p.requires_grad, self.parameters()))
            self.tensorboard_writer.add_scalar("learning_rate", learning_rate)
            tensorbard_global_step = 0
            if os.path.exists(os.path.join(TENSORBOARD_LOG_PATH, "resume_step")):
                with open(os.path.join(TENSORBOARD_LOG_PATH, "resume_step"), "r") as file_handler:
                    tensorbard_global_step = int(file_handler.read()) + 1
            for epoch_index, epoch in enumerate(range(epoch_num)):
                for index, sample in enumerate(data_loader):
                    optimizer.zero_grad()
                    input_image = autograd.Variable(sample["image"])  # batch_size, 3, 255, 255
                    tag_score = self.forward(input_image)
                    if self._use_ctc:
                        tag_score, target, tag_score_sizes, target_sizes = self._loss_preprocess_ctc(tag_score, sample)
                        loss = loss_function(tag_score, target, tag_score_sizes, target_sizes)
                        loss = loss / tag_score.size(1)
                    else:
                        target = sample["padded_label_idx"]
                        tag_score, target = self._loss_preprocess(tag_score, target)
                        loss = loss_function(tag_score, target)
                    print("Training loss: {}".format(float(loss)))
                    self.tensorboard_writer.add_scalar("training_loss", float(loss), tensorbard_global_step)
                    loss.backward()
                    optimizer.step()
                    if index % 250 == 0:
                        print(u"Processing batch: {} of {}, epoch: {}".format(index, len(data_loader), epoch_index))
                        self.evaluate(eval_data_loader, loss_function, tensorbard_global_step)
                    tensorbard_global_step += 1
                self.save_model(MODEL_PATH + "_epoch_{}".format(epoch_index))
        except KeyboardInterrupt:
            print("Exit for KeyboardInterrupt, save model")
            self.save_model(MODEL_PATH)
            with open(os.path.join(TENSORBOARD_LOG_PATH, "resume_step"), "w") as file_handler:
                file_handler.write(str(tensorbard_global_step))
        except Exception as excp:
            logger.error(str(excp))
            logger.error(traceback.format_exc())

    def predict(self, image):
        # TODO ctc version
        '''
        :param image: [batch_size, channel, height, width]
        :return:
        '''
        tag_score = self.forward(image)
        # TODO ctc
        # if self._use_ctc:
        #     tag_score = F.softmax(tag_score, dim=-1)
        #     decoder.decode(tag_score)
        confidence_log_probability, indexes = tag_score.max(2)
        predicted_labels = []
        for batch_index in range(indexes.size(1)):
            label = ""
            for character_index in range(self._max_captcha_length):
                if int(indexes[character_index, batch_index]) != 1:
                    label += id_to_character[int(indexes[character_index, batch_index])]
            predicted_labels.append(label)
        return predicted_labels, tag_score

    def predict_pil_image(self, pil_image):
        try:
            self.eval()
            processed_image = normalizer(rescale({"image": pil_image}))["image"].view(1, 3, 255, 255)
            result, tag_score = self.predict(processed_image)
            self.train()
        except Exception as excp:
            logger.error(str(excp))
            logger.error(traceback.format_exc())
            return [""], None
        return result, tag_score

    def evaluate(self, eval_dataloader, loss_function, step=0):
        total = 0
        sequence_correct = 0
        character_correct = 0
        character_total = 0
        loss_total = 0
        batch_size = eval_dataloader.batch_size
        true_predicted = {}
        self.eval()
        for sample in eval_dataloader:
            total += batch_size
            input_images = sample["image"]
            predicted_labels, tag_score = self.predict(input_images)
            for predicted, true_label in zip(predicted_labels, sample["label"]):
                if predicted == true_label:  # dataloader is making label a list, use batch_size=1
                    sequence_correct += 1
                for index, true_character in enumerate(true_label):
                    character_total += 1
                    if index < len(predicted) and predicted[index] == true_character:
                        character_correct += 1
                true_predicted[true_label] = predicted
            if self._use_ctc:
                tag_score, target, tag_score_sizes, target_sizes = self._loss_preprocess_ctc(tag_score, sample)
                loss_total += float(loss_function(tag_score, target, tag_score_sizes, target_sizes) / batch_size)
            else:
                tag_score, target = self._loss_preprocess(tag_score, sample["padded_label_idx"])
                loss_total += float(loss_function(tag_score, target))  # averaged over batch index
        print("True captcha to predicted captcha: ")
        pprint(true_predicted)
        self.tensorboard_writer.add_text("eval_true_to_predicted", str(true_predicted), global_step=step)
        accuracy = float(sequence_correct) / total
        avg_loss = float(loss_total) / (total / batch_size)
        character_accuracy = float(character_correct) / character_total
        self.tensorboard_writer.add_scalar("eval_sequence_accuracy", accuracy, global_step=step)
        self.tensorboard_writer.add_scalar("eval_character_accuracy", character_accuracy, global_step=step)
        self.tensorboard_writer.add_scalar("eval_loss", avg_loss, global_step=step)
        self.zero_grad()
        self.train()

    def _loss_preprocess(self, tag_score, target):
        '''
        :param tag_score: value returned by self.forward
        :param target: sample["padded_label_idx"]
        :return: (processed_tag_score, processed_target) ready for the NLLLoss function
        '''
        target = target.transpose(0, 1)
        target = target.contiguous()
        target = target.view(target.size(0) * target.size(1))
        tag_score = tag_score.view(-1, self.chracter_space_length)
        return tag_score, target

    def _loss_preprocess_ctc(self, tag_score, sample):
        target_2d = [
            [int(ele) for ele in sample["padded_label_idx"][row, :] if int(ele) != 0 and int(ele) != 1]
            for row in range(sample["padded_label_idx"].size(0))]
        target = []
        for ele in target_2d:
            target.extend(ele)
        target = autograd.Variable(torch.IntTensor(target))
        # tag_score = F.softmax(F.sigmoid(tag_score), dim=-1)
        tag_score_sizes = autograd.Variable(torch.IntTensor([self._max_captcha_length] * tag_score.size(1)))
        target_sizes = autograd.Variable(sample["captcha_length"].int())
        return tag_score, target, tag_score_sizes, target_sizes

    # def visualize_graph(self, dataset):
    #     '''Since pytorch uses a dynamic graph, an input is required to visualize the graph in tensorboard'''
    #     # warning: Do not run this, the graph is too large to visualize...
    #     sample = dataset[0]
    #     input_image = autograd.Variable(sample["image"].view(1, 3, 255, 255))
    #     tag_score = self.forward(input_image)
    #     self.tensorboard_writer.add_graph(self, tag_score)

    def save_model(self, model_path):
        self.tensorboard_writer.close()
        self.tensorboard_writer = None  # can't be pickled
        torch.save(self, model_path)
        self.tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_PATH)

    @classmethod
    def load_model(cls, model_path=MODEL_PATH, *args, **kwargs):
        net = cls(*args, **kwargs)
        if os.path.exists(model_path):
            model = torch.load(model_path)
            if model:
                model.tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_PATH)
                net = model
        return net

    def __del__(self):
        if self.tensorboard_writer:
            self.tensorboard_writer.close()

if __name__ == "__main__":
    from dataset_util.dataset import dataset, eval_dataset
    data_loader = DataLoader(dataset, batch_size=2, shuffle=True)
    eval_data_loader = DataLoader(eval_dataset, batch_size=2, shuffle=True)
    net = CNN_RNN.load_model()
    net.train_net(data_loader, eval_data_loader=eval_data_loader)
    # net.predict(dataset[0]["image"].view(1, 3, 255, 255))
    # predict_pil_image test code
    # from config.config import IMAGE_PATHS
    # import glob
    # from PIL import Image
    #
    # image_paths = glob.glob(os.path.join(IMAGE_PATHS.get("EVAL"), "*.png"))
    # for image_path in image_paths:
    #     pil_image = Image.open(image_path)
    #     predicted, score = net.predict_pil_image(pil_image)
    #     print("True value: {}, predicted: {}".format(os.path.split(image_path)[1], predicted))
    print("Done")
The code above is the main part. If you need the other components to make it run, leave a comment. I've been stuck here for quite a long time. Any advice on training a CRNN + CTC is appreciated.
I've been training with CTC loss and encountered the same problem. I know this is a rather late answer, but hopefully it'll help someone else researching this. After trial and error and a lot of research, there are a few things worth knowing when it comes to training with CTC (if your model is set up correctly):
The quickest way for the model to lower the cost is to predict only blanks. This is noted in a few papers and blogs: see http://www.tbluche.com/ctc_and_blank.html
The model learns to predict only blanks first, then it starts picking up the error signal with regard to the correct underlying labels. This is also explained in the above link. In practice, I noticed that my model started to learn the real underlying labels/targets after a couple hundred epochs, and the loss started decreasing dramatically again, similar to what is shown for the toy example here: https://thomasmesnard.github.io/files/CTC_Poster_Mesnard_Auvolat.pdf
These parameters have a great impact on whether your model converges or not: learning rate, batch size, and epoch number.
You have a few questions, so I will try to answer them one by one.
First, why does adding spaces to the captcha break the model?
A neural network learns to deal with the data it is trained on. If you change the distribution of the data (for example by adding spaces between characters), there is no guarantee that the network will generalize, as you hint at in your question. It is possible that the captchas you train on always have the characters in the same positions, or at the same distance from one another, so your model learns to exploit this by looking in those positions. If you want your network to generalize to a specific scenario, you should explicitly train on that scenario. So in your case, you should add random spaces during training too.
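For instance, a sketch of what that augmentation could look like (my illustration, assuming the python captcha library's ImageCaptcha API that the question mentions):

import random
from captcha.image import ImageCaptcha

def generate_spaced_captcha(text):
    # insert a space after each character with 50% probability, so character
    # positions vary between training samples
    spaced = "".join(c + (" " if random.random() < 0.5 else "") for c in text)
    return ImageCaptcha().generate_image(spaced)  # returns a PIL image

generate_spaced_captcha("AB12").save("ab12_spaced.png")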
Second, why does the loss not go below 16?
Clearly, from the fact that your training loss is also stalled at 16 (like your validation loss), the problem is that your model simply doesn't have the capacity to deal with the complexity of the problem. In other words, your model is underfitting. You had the right reflex in trying to increase the capacity of your network. You tried increasing the capacity of the LSTM and it didn't help. Thus, the next logical step is that the convolutional part of your network is not powerful enough. So here are a few things you might want to try, from most likely to succeed (in my opinion) to least likely:
Make the convnet trainable: I notice that you are using a pretrained convnet and that you are not fine-tuning its weights. That could be a problem. Whatever your convnet was trained on, it might not develop the features required to deal with captchas. You should try learning the weights of the convnet too, in order to develop useful features for captchas (see the sketch after this list).
Use a deeper convnet: This is the naive thing to do. Your convnet doesn't produce good enough features; try a more powerful, deeper one. (But you should definitely try this only after you've made the convnet trainable.)
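A minimal sketch of the first suggestion, based on the model setup in the question (simply stop freezing the pretrained parameters):

import torchvision
import torch.nn as nn

model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = True  # fine-tune the whole convnet instead of freezing it

resnet = nn.Sequential(*list(model_conv.children())[:-1])
# the optimizer filter in train_net, filter(lambda p: p.requires_grad, ...),
# will now include all the conv weights as well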
From my experience, training an RNN model with CTC loss is not an EASY task. The model may not converge at all if the training is not carefully set up. Here are my suggestions:
Check the CTC loss output along training. For a model that will converge, the CTC loss at each batch fluctuates notably. If you observe that the CTC loss shrinks almost monotonically to a stable value, then the model is most likely stuck at a local minimum.
Use short samples to pretrain your model. Though we have advanced RNN structures like LSTM and GRU, it's still hard to back-propagate through an RNN over long steps.
Enlarge sample variety. You can even add artificial samples to help your model escape from local minima.
F.Y.I., we've just open-sourced a new deep learning framework, Dandelion, which has a built-in CTC objective and an interface pretty much like PyTorch's. You can try your model with Dandelion and compare it with your current implementation.
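For reference, newer PyTorch versions also ship a built-in torch.nn.CTCLoss that can stand in for warpctc_pytorch. A minimal sketch with shapes like the model above (my illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

T, N, C = 40, 2, 64   # time steps, batch size, character-space size as in the model above
S = 8                 # maximum target length

log_probs = F.log_softmax(torch.randn(T, N, C), dim=2)  # CTC expects log-probabilities, shape (T, N, C)
targets = torch.randint(low=2, high=C, size=(N, S))     # label indices; 0 is reserved for blank here
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(low=1, high=S + 1, size=(N,))

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())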
