I am new to PyTorch and LSTMs and I am trying to train a classification model that takes a sentence in which each word is encoded via word2vec (pre-trained vectors) and outputs one class after it has seen the full sentence. I have four different classes. The sentences have variable length.
My code is running without errors, but it always predicts the same class, no matter how many epochs I train my model. So I think the gradients are not properly backpropagated. Here is my code:
class LSTM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim).to(device),
                torch.zeros(1, 1, self.hidden_dim).to(device))

    def forward(self, sentence):
        lstm_out, self.hidden = self.lstm(sentence.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
EMBEDDING_DIM = len(training_data[0][0][0])
HIDDEN_DIM = 256
model = LSTM(EMBEDDING_DIM, HIDDEN_DIM, 4)
model.to(device)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in tqdm(range(n_epochs)):
    for sentence, tag in tqdm(training_data):
        model.zero_grad()
        model.hidden = model.init_hidden()
        sentence_in = torch.tensor(sentence, dtype=torch.float).to(device)
        targets = torch.tensor([label_to_idx[tag]], dtype=torch.long).to(device)
        tag_scores = model(sentence_in)
        res = torch.tensor(tag_scores[-1], dtype=torch.float).view(1,-1).to(device)
        # I THINK THIS IS WRONG???
        print(res) # tensor([[-10.6328, -10.6783, -10.6667, -0.0001]], device='cuda:0', grad_fn=<CopyBackwards>)
        print(targets) # tensor([3], device='cuda:0')
        loss = loss_function(res, targets)
        loss.backward()
        optimizer.step()
The code is largely inspired by https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
The difference is that they have a sequence-to-sequence model and I have a sequence-to-ONE model.
I am not sure what the problem is, but I guess that the scores returned by the model contain a score for each tag and my ground truth only contains the index of the correct class? How would this be handled correctly?
Or is the loss function maybe not the correct one for my use case? Also I am not sure if this is done correctly:
res = torch.tensor(tag_scores[-1], dtype=torch.float).view(1,-1).to(device)
By taking tag_scores[-1] I want to get the scores after the last word has been given to the network because tag_scores contains the scores after each step, if I understand correctly.
And this is how I evaluate:
with torch.no_grad():
    preds = []
    gts = []
    for sentence, tag in tqdm(test_data):
        inputs = torch.tensor(sentence, dtype=torch.float).to(device)
        tag_scores = model(inputs)
        # find index with max value (this is the class to be predicted)
        pred = [j for j,v in enumerate(tag_scores[-1]) if v == max(tag_scores[-1])][0]
        print(pred, idx_to_label[pred], tag)
        preds.append(pred)
        gts.append(label_to_idx[tag])
print(f1_score(gts, preds, average='micro'))
print(classification_report(gts, preds))
EDIT:
When shuffling the data before training it seems to work. But why?
EDIT 2:
I think the reason why shuffling is needed is that my training data contains samples for each class in groups. So when training them each after the other, the model will only see the same class in the last N iterations and therefore it will only predict this class. Another reason might also be that I am currently using mini-batches of only one sample because I haven't figured out yet how to use other sizes.
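The grouping effect described in EDIT 2 can be sketched with plain Python. The toy training_data list and its labels are made up for illustration; only the idea of reshuffling once per epoch is the point here.

```python
import random

random.seed(0)

# Hypothetical toy data, grouped by class like the data described above.
training_data = [(f"sentence_{label}_{i}", label) for label in "ABCD" for i in range(5)]

# Without shuffling, the last five updates of an epoch all come from one class.
tail_labels_unshuffled = {tag for _, tag in training_data[-5:]}

n_epochs = 3
for epoch in range(n_epochs):
    random.shuffle(training_data)  # new order each epoch, shuffled in place
    for sentence, tag in training_data:
        pass  # the per-sample training step would go here
```

Shuffling changes only the order of the samples, so every epoch still sees the full data set.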
Because you want to classify the whole sentence, only the output after the last word should go to the classifier. So the following line:
self.hidden2tag(lstm_out.view(len(sentence), -1))
should be changed so that it passes only the features of the final time step to the classifier:
self.hidden2tag(lstm_out[-1].view(1, -1))
But I am not completely sure, since I am not very familiar with LSTMs.
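A minimal sketch of the sequence-to-one setup, assuming a batch size of one and random vectors standing in for the word2vec embeddings. The class name SentenceClassifier and all sizes are illustrative, not taken from the question. The scores are passed to the loss directly; as far as I know, wrapping an existing tensor in torch.tensor(...) in current PyTorch versions returns a copy that is detached from the graph, so it is safer to avoid that step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class SentenceClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, num_classes)

    def forward(self, sentence):
        # sentence: (seq_len, embedding_dim); add a batch dimension of 1
        lstm_out, _ = self.lstm(sentence.unsqueeze(1))  # (seq_len, 1, hidden_dim)
        last_step = lstm_out[-1]                        # (1, hidden_dim)
        return F.log_softmax(self.hidden2tag(last_step), dim=1)  # (1, num_classes)

model = SentenceClassifier(embedding_dim=8, hidden_dim=16, num_classes=4)
sentence = torch.randn(5, 8)   # stand-in for 5 word vectors
target = torch.tensor([3])     # index of the correct class

scores = model(sentence)
loss = nn.NLLLoss()(scores, target)  # class scores plus a class index is what NLLLoss expects
loss.backward()

all_params_have_grad = all(p.grad is not None for p in model.parameters())
```

Because the hidden state is not stored on the module here, nn.LSTM starts from zeros on every call and no manual init_hidden step is needed.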
For my model I'm using a roberta transformer model and the Trainer from the Huggingface transformer library.
I calculate two losses:
lloss is a cross-entropy loss and dloss calculates the loss between hierarchy layers.
The total loss is the sum of lloss and dloss. (Based on this)
When calling total_loss.backward(), however, I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed
Any idea why that happens? Can I force it to only call backward once? Here is the loss calculation part:
dloss = calculate_dloss(prediction, labels, 3)
lloss = calculate_lloss(prediction, labels, 3)
total_loss = lloss + dloss
total_loss.backward()
def calculate_lloss(predictions, true_labels, total_level):
    '''Calculates the layer loss.
    '''
    loss_fct = nn.CrossEntropyLoss()
    lloss = 0
    for l in range(total_level):
        lloss += loss_fct(predictions[l], true_labels[l])
    return self.alpha * lloss

def calculate_dloss(predictions, true_labels, total_level):
    '''Calculate the dependence loss.
    '''
    dloss = 0
    for l in range(1, total_level):
        current_lvl_pred = torch.argmax(nn.Softmax(dim=1)(predictions[l]), dim=1)
        prev_lvl_pred = torch.argmax(nn.Softmax(dim=1)(predictions[l-1]), dim=1)
        D_l = self.check_hierarchy(current_lvl_pred, prev_lvl_pred, l) #just a boolean tensor
        l_prev = torch.where(prev_lvl_pred == true_labels[l-1], torch.FloatTensor([0]).to(self.device), torch.FloatTensor([1]).to(self.device))
        l_curr = torch.where(current_lvl_pred == true_labels[l], torch.FloatTensor([0]).to(self.device), torch.FloatTensor([1]).to(self.device))
        dloss += torch.sum(torch.pow(self.p_loss, D_l*l_prev)*torch.pow(self.p_loss, D_l*l_curr) - 1)
    return self.beta * dloss
There is nothing wrong with having a loss that is the sum of two individual losses. Here is a small proof of principle adapted from the docs:
import torch
import numpy
from sklearn.datasets import make_blobs
class Feedforward(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Feedforward, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.fc1 = torch.nn.Linear(self.input_size, self.hidden_size)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(self.hidden_size, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        hidden = self.fc1(x)
        relu = self.relu(hidden)
        output = self.fc2(relu)
        output = self.sigmoid(output)
        return output

def blob_label(y, label, loc): # assign labels
    target = numpy.copy(y)
    for l in loc:
        target[y == l] = label
    return target
x_train, y_train = make_blobs(n_samples=40, n_features=2, cluster_std=1.5, shuffle=True)
x_train = torch.FloatTensor(x_train)
y_train = torch.FloatTensor(blob_label(y_train, 0, [0]))
y_train = torch.FloatTensor(blob_label(y_train, 1, [1,2,3]))
x_test, y_test = make_blobs(n_samples=10, n_features=2, cluster_std=1.5, shuffle=True)
x_test = torch.FloatTensor(x_test)
y_test = torch.FloatTensor(blob_label(y_test, 0, [0]))
y_test = torch.FloatTensor(blob_label(y_test, 1, [1,2,3]))
model = Feedforward(2, 10)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)
model.eval()
y_pred = model(x_test)
before_train = criterion(y_pred.squeeze(), y_test)
print('Test loss before training' , before_train.item())
model.train()
epoch = 20
for epoch in range(epoch):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_train)
    # Compute loss
    lossCE = criterion(y_pred.squeeze(), y_train)
    lossSQD = (y_pred.squeeze()-y_train).pow(2).mean()
    loss = lossCE + lossSQD
    print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
    # Backward pass
    loss.backward()
    optimizer.step()
There must be a real second time that you call backward, directly or indirectly, on some variable that then traverses through your graph. It is a bit too much to ask for the complete code here, so only you can check this or at least reduce it to a minimal example (while doing so, you might already find the issue). Apart from that, I would start checking:
Does it already occur in the first iteration of training? If not: are you reusing any calculation results for the second iteration without a detach?
What happens when you call backward on your losses individually, lloss.backward(retain_graph=True) followed by dloss.backward()? (Because gradients are accumulated, this has the same effect as adding the losses together first; retain_graph=True is needed on the first call if both losses share part of the graph.) This will let you track down for which of the two losses the error occurs.
After backward(), your computation graph is freed, so for a second backward you need to create a new graph by running the forward pass on the inputs again. If you want to go through the same graph again after backward (for some reason), you need to set the retain_graph flag of backward to True.
P.S. As the summation of Tensors is automatically differentiable, summing the losses would not cause any issue in the backward.
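The points above can be checked on a toy graph. This is only a sketch with made-up tensors (w, x and the two_losses helper are hypothetical and not related to the model in the question): it compares summing two losses against calling backward on each separately, and shows the error appearing once the shared part of the graph has been freed.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

def two_losses():
    # Both losses depend on the same intermediate y, so they share part of the graph.
    y = (w * x).sum()
    return y.pow(2), 3.0 * y

# 1) Summing the losses and calling backward once.
a, b = two_losses()
(a + b).backward()
grad_summed = w.grad.clone()
w.grad = None

# 2) Separate backward calls; the first one has to keep the shared graph alive.
a, b = two_losses()
a.backward(retain_graph=True)
b.backward()
grad_separate = w.grad.clone()
w.grad = None

# 3) Without retain_graph, the second call runs into the freed buffers.
a, b = two_losses()
a.backward()
try:
    b.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True
w.grad = None
```

If the error shows up in a real training loop, the usual suspect is a tensor carried over from an earlier iteration that still points into an old graph, which is what the detach question above is getting at.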
I’m trying to solve an NLP classification problem with an LSTM. The code for the model is defined here:
class LSTM(nn.Module):
    def __init__(self, hidden_size, embedding_size=66 ):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first = True, bidirectional = True)
        self.fc = nn.Linear(2*hidden_size,2)

    def forward(self, input_seq):
        output, (hidden_state, cell_state) = self.lstm(input_seq)
        hidden_state = torch.cat((hidden_state[-1,:], hidden_state[-2,:]), -1)
        logits = self.fc(hidden_state)
        return nn.LogSoftmax(dim=1)(logits)
And the function I’m using to train this model is here:
def train_loop(dataloader, model, loss_fn, optimizer):
    loss_fn = loss_fn
    size = len(dataloader.dataset)
    model.train()
    zeros = 0
    for batch, (X, y) in enumerate(dataloader):
        # Transform string into tensor
        tensor = torch.zeros(1,len(X[0]),66)
        for i in range(len(X[0])):
            tensor[0][i][ctoi[X[0][i]]] = 1
        pred = model(tensor)
        target = torch.zeros(2, dtype=torch.long)
        target[y] = 1
        if batch % 100 == 0:
            print(pred.squeeze(), target)
        loss = loss_fn(pred.squeeze(), target)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if pred.squeeze().argmax() == 0:
            zeros += 1
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
    print(f'In training predicted {zeros} zeroes out of {size} samples')
The X’s are still strings, which is why I need to convert them to tensors before running them through the model. The y’s are either 0 or 1 (since it’s a binary classification problem), and I convert them to a tensor of shape (2,) to run through the loss function.
For some reason I keep getting the same class predicted for every input. The classes are not even that unbalanced (~45% to 55%), and I’ve tried changing the class weights in the loss function with no improvement: it converges to predicting either always 0 or always 1. Most of the time it converges to always predicting 0, which makes even less sense because class 0 usually has fewer samples than class 1.
Since you're training a binary classification model, your output dim should be 1 (corresponding to a single probability P(y=1|x)). This means that the y you're retrieving from your dataloader should be used directly as the target in your loss function (assuming a binary cross-entropy loss), without turning it into a tensor of shape (2,). The predicted class is therefore y_hat = round(pred) (i.e., whether the prediction is >= 0.5).
As a point of clarity, it would be much easier to follow your logic if the one-hot encoding happened within your dataset (either in __getitem__ or __iter__). It's also worth noting that you don't use embeddings, so the code of your classifier is a bit misleading.
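A minimal sketch of the single-output variant, under the assumption that the label is used directly as a float target. The class name BinaryLSTM, the sizes, and the toy input are made up; nn.BCEWithLogitsLoss is swapped in for a sigmoid followed by nn.BCELoss, which is a common choice but not something the answer above prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class BinaryLSTM(nn.Module):
    def __init__(self, input_size=66, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 1)  # a single output: the logit of P(y=1|x)

    def forward(self, x):
        _, (h, _) = self.lstm(x)                    # h: (2, batch, hidden_size)
        feats = torch.cat((h[-2], h[-1]), dim=-1)   # forward and backward final states
        return self.fc(feats).squeeze(-1)           # (batch,)

model = BinaryLSTM()
loss_fn = nn.BCEWithLogitsLoss()

# Toy one-hot encoded "string" of 5 characters (indices are arbitrary).
x = torch.zeros(1, 5, 66)
x[0, torch.arange(5), torch.tensor([1, 4, 2, 7, 3])] = 1
y = torch.tensor([1.0])   # the label from the dataloader, used directly as a float

logit = model(x)
loss = loss_fn(logit, y)
loss.backward()

pred_class = (torch.sigmoid(logit) >= 0.5).long()  # threshold the probability at 0.5
```

If two outputs with LogSoftmax and nn.NLLLoss are kept instead, the target would be the class index itself rather than a one-hot vector.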
This is my first time posting on Stack Overflow, so forgive me if I make any mistakes.
I have 10,000 samples, and each sample has a label of 0 or 1. I want to perform classification using an LSTM, as this is time series data.
input_dim = 1
hidden_dim = 32
num_layers = 2
output_dim = 1
# Here we define our model as a class
class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        #Initialize hidden layer and cell state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).requires_grad_()
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).requires_grad_()
        # We need to detach as we are doing truncated backpropagation through time (BPTT)
        # If we don't, we'll backprop all the way to the start even after going through another batch
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
        # Index hidden state of last time step
        # out.size() --> 100, 32, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # For binomial Classification
        m = torch.sigmoid(out)
        return m
model = LSTM(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_layers=num_layers)
loss = nn.BCELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=0.00001, weight_decay=0.00006)
num_epochs = 100
# Number of steps to unroll
seq_dim =look_back-1
for t in range(num_epochs):
    y_train_class = model(x_train)
    output = loss(y_train_class, y_train)
    # Zero out gradient, else they will accumulate between epochs
    optimiser.zero_grad(set_to_none=True)
    # Backward pass
    output.backward()
    # Update parameters
    optimiser.step()
This is an example of what the result looks like
This code is originally from Kaggle; I edited it for classification. Can you please tell me what I am doing wrong?
EDIT 1:
Add dataloader
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
x_train = torch.from_numpy(x_train).type(torch.Tensor)
y_train = torch.from_numpy(y_train).type(torch.Tensor)
x_test = torch.from_numpy(x_test).type(torch.Tensor)
y_test = torch.from_numpy(y_test).type(torch.Tensor)
train_dataloader = DataLoader(TensorDataset(x_train, y_train), batch_size=128, shuffle=True)
test_dataloader = DataLoader(TensorDataset(x_test, y_test), batch_size=128, shuffle=True)
I realized I had forgotten to invert the transformation before checking the result. When I did that, I got different values from the classifier. However, all values are in the range 0.001-0.009, so when I round them the result is the same: label 0 for every sample.
A common phenomenon in NN training is that they will initially converge to a very naive solution to the problem where they output a constant prediction that minimizes the error on the training data. My guess is that in your training data, the ratio between 0 and 1 classes is close to 0.5423. Depending on whether your model is of sufficient complexity, it might learn to make more specific predictions based on the input when given more learning steps.
While increasing the number of epochs could help, there is something better you can do with your current setup. Currently, you are only performing a single optimizer step per epoch. Typically, you would want a step per batch and loop over your data in (mini)batches of, say, 32 inputs for example. To do this, it would be best to use a DataLoader where you can define a batch size, and loop over the dataloader inside your epoch loop similar to this example.
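A minimal sketch of one optimizer step per mini-batch. The data here is random stand-in data, and the model is a reduced version of the one in the question (TinyLSTM, the sizes, and the learning rate are all illustrative choices, not recommendations).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 256 sequences of length 10 with one feature each.
x_train = torch.randn(256, 10, 1)
y_train = (x_train.sum(dim=(1, 2)) > 0).float().unsqueeze(1)  # labels, shape (256, 1)

class TinyLSTM(nn.Module):
    def __init__(self, input_dim=1, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return torch.sigmoid(self.fc(out[:, -1, :]))  # one probability per sample

model = TinyLSTM()
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)

steps = 0
for epoch in range(2):
    for x_batch, y_batch in loader:      # 8 batches of 32 samples per epoch
        optimiser.zero_grad()
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()
        optimiser.step()                 # one parameter update per batch
        steps += 1
```

With 256 samples and a batch size of 32, each epoch performs 8 updates instead of the single update of the full-batch loop.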
I am trying to use an RNN to do binary classification. But when my model is training, it gets stuck at loss.backward().
Here is my model:
class RNN2(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=2, num_layers=1):
        super(RNN2, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers)
        self.reg = nn.Linear(hidden_size, output_size)
        #self.softmax = nn.LogSoftmax(dim=1)

    def forward(self,x):
        x, hidden = self.rnn(x)
        return self.reg(x[:,2])
rnn = RNN2(13,10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
for e in range(10):
    out = rnn(train_X)
    optimizer.zero_grad()
    print(out[0])
    print(out.shape)
    print(train_Y.shape)
    loss = criterion(out, train_Y)
    print(loss)
    loss.backward()
    print("1")
    optimizer.step()
    print("2")
The shape of train_X is 420000*3*13 and the shape of train_Y is 420000
So it can print the loss. Can anyone tell me why it gets stuck at loss.backward()? It never prints 1.
You have to know that in RNNs, computing the backward pass for a sequence of length 420000 is extremely slow. If you run your code on a machine with a GPU (or Google Colab) and add the following lines before the for loop, your code finishes executing in less than two minutes.
rnn = rnn.cuda()
train_X = train_X.cuda()
train_Y = train_Y.cuda()
Note that by default, the second dimension of the input passed to the RNN is treated as the batch size, and the first as the sequence length. Therefore, if 420000 is the number of samples rather than the sequence length, pass batch_first=True to the RNN constructor.
self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
This would significantly speed up the process (less than one second in Google Colab). However, if that is not the case and 420000 really is the sequence length, you should try chunking the sequences into smaller parts and increasing the batch size from 3 to a larger value.
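The effect of batch_first can be seen from the shape of the final hidden state, which is always (num_layers, batch, hidden_size). A small sketch with 8 samples standing in for the 420000 in the question (the tensor x is random stand-in data):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(8, 3, 13)  # intended meaning: 8 samples, 3 time steps, 13 features

# Default layout is (seq_len, batch, features): this reads x as 3 sequences of length 8.
rnn_default = nn.RNN(13, 10)
out_default, h_default = rnn_default(x)

# batch_first=True reads x as (batch, seq_len, features): 8 sequences of length 3.
rnn_batch_first = nn.RNN(13, 10, batch_first=True)
out_bf, h_bf = rnn_batch_first(x)

last_step = out_bf[:, -1]  # features of the last time step for each sample

print(h_default.shape)  # batch dimension is 3
print(h_bf.shape)       # batch dimension is 8
```

With batch_first=True the recurrence only has to be unrolled over 3 steps, and the samples are processed in parallel, which is where the speed-up comes from.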
I have been following the ants and bees transfer learning tutorial from the official PyTorch Docs (http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). I am trying to finetune a VGG19 model by changing the final layer to predict one of two classes. I am able to modify the last fc layer using the following code.
But I get an error when executing the train_model function. The error is “size mismatch at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMathBlas.cu:243”. Any idea what the issue is ?
model_conv = torchvision.models.vgg19(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False
model_conv = nn.Sequential(*list(model_conv.classifier.children())[:-1] +
                           [nn.Linear(in_features=4096, out_features=2)])
if use_gpu:
    model_conv = model_conv.cuda()
criterion = nn.CrossEntropyLoss()
optimizer_conv = optim.SGD(model_conv._modules['6'].parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)
model_conv = train_model(model_conv, criterion, optimizer_conv, exp_lr_scheduler, num_epochs=25)
When you define your model, you keep only the classifier, which is just the fully connected part of the network. When you then feed a 224*224*3 image to the model, it goes straight into a linear layer that expects about 25K input features (512*7*7 = 25088 for VGG19), which causes the size mismatch. To solve it, you need to put the convolutional part back in front of the classifier. To do so, redefine the model like this:
class newModel(nn.Module):
    def __init__(self, old_model):
        super(newModel, self).__init__()
        self.features = old_model.features
        self.classifier = nn.Sequential(*list(old_model.classifier.children())[:-1] +
                                        [nn.Linear(in_features=4096, out_features=2)])

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model_conv = newModel(model_conv)
Now you also need to tell the optimizer which parameters to optimize. If you just want to train the last layer (the newly added one), do:
optimizer_conv = optim.SGD(model_conv.classifier._modules['6'].parameters(), lr=0.001, momentum=0.9)
The rest of the code remains the same.
Hope it helps!