My custom BERT model's architecture:
class BertArticleClassifier(nn.Module):
    def __init__(self, n_classes, freeze_bert_weights=False):
        super(BertArticleClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained('bert-base-uncased')
        if freeze_bert_weights:
            for param in self.bert.parameters():
                param.requires_grad = False
        self.dropout = nn.Dropout(0.1)
        self.fc_1 = nn.Linear(768, 256)
        self.leaky_relu = nn.LeakyReLU()
        self.fc_out = nn.Linear(256, n_classes)

    def forward(self, input_ids, attention_mask):
        output = self.bert(input_ids, attention_mask)
        return self.fc_out(self.leaky_relu(self.fc_1(self.dropout(output['pooler_output']))))
self.bert is a model from the transformers library.
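For context, a minimal sketch of how this classifier might be instantiated and called; the tokenizer choice matches the 'bert-base-uncased' backbone, and the class count and input text are placeholders:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # matches the backbone above
model = BertArticleClassifier(n_classes=10)  # hypothetical number of classes
model.eval()

encoded = tokenizer("example article text", return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = model(encoded['input_ids'], encoded['attention_mask'])  # shape: (1, n_classes)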
Training script:
def train_my_model(model, optimizer, criterion, scheduler, epochs, dataloader_train, dataloader_validation, device, pretrained_weights=None):
    if pretrained_weights:
        torch.save(model.state_dict(), pretrained_weights)
    for epoch in tqdm(range(1, epochs + 1)):
        model.train()
        loss_train_total = 0
        progress_bar = tqdm(dataloader_train, desc=f'Epoch {epoch:1d}', leave=False, disable=False)
        for batch in progress_bar:
            optimizer.zero_grad()
            batch = tuple(batch[b].to(device) for b in batch)
            input_ids, mask, labels = batch
            predictions = model(input_ids, mask)
            loss = criterion(predictions, labels)
            loss.backward()
            loss_train_total += loss.item()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item() / len(batch))})
        torch.save(model.state_dict(), f'models_data/bert_my_model/finetuned_BERT_epoch_{epoch}.model')
        tqdm.write(f'\nEpoch {epoch}')
        loss_train_avg = loss_train_total / len(dataloader_train)
        tqdm.write(f'Training loss: {loss_train_avg}')
        val_loss, predictions, true_vals = evaluate(model, dataloader_validation, criterion, device)
        val_f1 = f1_score_func(predictions, true_vals)
        tqdm.write(f'Validation loss: {val_loss}')
        tqdm.write(f'F1 Score (Weighted): {val_f1}')
Optimizer and Criterion:
optimizer = AdamW(model.parameters(),
                  lr=1e-4,
                  eps=1e-6)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights).to(device)
After 5 epochs I get the same validation loss, ~3.1. I know my data is preprocessed correctly, because the model does learn if I train the transformers BertForSequenceClassification model instead; the problem with that approach is that I cannot tweak its loss function to accept class weights, which is why I created my own custom model.
As you can see in the model's forward method, I extract the output['pooler_output'] piece and disregard the loss that is returned alongside it. My best guess at the problem is that when I call loss.backward() in the training loop, the model's weights may not be updating, because transformers BERT models return their own loss as an output.
What am I doing wrong?
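One way to check the hypothesis above is to inspect gradients right after loss.backward() and before optimizer.step(); if the classifier head and BERT parameters receive non-zero gradients, the backward pass is reaching them. A small diagnostic sketch (not a fix), assuming the training loop shown above:

# diagnostic only: print a few gradient norms after loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is not None:
        if 'fc_out' in name or 'bert.pooler' in name:
            print(name, param.grad.norm().item())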
So I am training a GNN in PyTorch and, after training it, I want to train it further on a separate dataset using its saved weights. When retraining with the new dataset I don't want the weights to be reset; I want the weights to continue from my last training session. Currently, my training code looks like this:
def train(data, model):
    train_loader, val_loader, test_loader, feature_len = data
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    epoch = 17
    print('start training\n')
    evaluate(model, 'train', train_loader)
    evaluate(model, 'val', val_loader)
    evaluate(model, 'test', test_loader)
    for i in range(epoch):
        print('epoch %d:' % i)
        model.train()
        for graph1, graph2, target in train_loader:
            pred = torch.squeeze(model(graph1, graph2))
            loss = loss_fn(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        evaluate(model, 'train', train_loader)
        evaluate(model, 'val', val_loader)
        evaluate(model, 'test', test_loader)
        print()
At the moment, I create my model object outside of the function and then train it using the code above (I've left out my evaluate function to keep the question focused). My question is: if, after using this train method, I decide to train again on more data, does having the optimizer definition inside the method mean it will train from scratch again? If so, should I just define my optimizer outside the train method to avoid this? I'm slightly confused about retraining my model with saved weights, and the PyTorch tutorials didn't help.
You can define the optimizer and model wherever you want (inside or outside the train() method), as long as you load the weights correctly before the training loop. What you are probably missing is loading the weights!
From the PyTorch tutorial:
Defining model and optimizer:
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)
Loading the weights:
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()
# - or -
model.train()
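For completeness, a sketch of the matching save step from the same tutorial pattern, assuming you save these keys at the end of (or during) the first training run:

torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, PATH)

Saving the optimizer state matters here because Adam keeps per-parameter moment estimates; restoring it lets the second training run continue from where the first one stopped rather than from scratch.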
I want the batch normalization running statistics (mean and variance) to converge at the end of training, which requires increasing the batch norm momentum from some initial value to 1.0. I managed to change the momentum using a custom Callback, but it works only if my model is compiled in eager mode. Toy example (it sets momentum=1.0 after epoch zero, after which moving_mean should stop updating):
import tensorflow as tf  # version 2.3.1
import tensorflow_datasets as tfds

ds_train, ds_test = tfds.load("mnist", split=["train", "test"], shuffle_files=True, as_supervised=True)
ds_train = ds_train.batch(128)
ds_test = ds_test.batch(128)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(10),
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    # run_eagerly=True,
)

class BatchNormMomentumCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        last_bn_layer = None
        for layer in self.model.layers:
            if isinstance(layer, tf.keras.layers.BatchNormalization):
                if epoch == 0:
                    layer.momentum = 0.99
                else:
                    layer.momentum = 1.0
                last_bn_layer = layer
        if last_bn_layer:
            tf.print("Momentum=" + str(last_bn_layer.moving_mean[-1].numpy()))  # Should not change after epoch 1

batchnorm_decay = BatchNormMomentumCallback()
model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[batchnorm_decay], verbose=0)
Output (obtained when run_eagerly=False):
Momentum=0.0
Momentum=-102.20184
Momentum=-106.04614
Momentum=-116.36204
Momentum=-129.995
Momentum=-123.70443
Expected output (obtained when run_eagerly=True):
Momentum=0.0
Momentum=-5.9038606
Momentum=-5.9038606
Momentum=-5.9038606
Momentum=-5.9038606
Momentum=-5.9038606
I guess this happens because in graph mode TF compiles the model as a graph with the momentum defined as 0.99, and then uses that value in the graph (so the momentum is not updated by BatchNormMomentumCallback).
Question:
Is there a way to update that compiled momentum value inside the graph while training? I want to update the momentum without eager mode (i.e. keeping run_eagerly=False), because training efficiency is important.
I would recommend simply using a custom training loop for your use case. You will have all the flexibility you need:
import tensorflow as tf  # version 2.3.1
import tensorflow_datasets as tfds

ds_train, ds_test = tfds.load("mnist", split=["train", "test"], shuffle_files=True, as_supervised=True)
ds_train = ds_train.batch(128)
ds_test = ds_test.batch(128)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(10),
    ]
)

optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_acc_metric = tf.keras.metrics.SparseCategoricalAccuracy()
batch_norm_layer = model.layers[2]

@tf.function
def train_step(epoch, model, batch):
    if epoch == 0:
        batch_norm_layer.momentum = 0.99
    else:
        batch_norm_layer.momentum = 1.0
    with tf.GradientTape() as tape:
        x_batch_train, y_batch_train = batch
        logits = model(x_batch_train, training=True)
        loss_value = loss_fn(y_batch_train, logits)
        train_acc_metric.update_state(y_batch_train, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

epochs = 6
for epoch in range(epochs):
    tf.print("\nStart of epoch %d" % (epoch,))
    tf.print("Momentum = ", batch_norm_layer.moving_mean[-1], summarize=-1)
    for batch in ds_train:
        train_step(epoch, model, batch)
    train_acc = train_acc_metric.result()
    tf.print("Training acc over epoch: %.4f" % (float(train_acc),))
    train_acc_metric.reset_states()
Start of epoch 0
Momentum = 0
Training acc over epoch: 0.9158
Start of epoch 1
Momentum = -20.2749767
Training acc over epoch: 0.9634
Start of epoch 2
Momentum = -20.2749767
Training acc over epoch: 0.9755
Start of epoch 3
Momentum = -20.2749767
Training acc over epoch: 0.9826
Start of epoch 4
Momentum = -20.2749767
Training acc over epoch: 0.9876
Start of epoch 5
Momentum = -20.2749767
Training acc over epoch: 0.9915
A simple test shows that the function with the tf.function decorator performs way better:
import tensorflow as tf  # version 2.3.1
import tensorflow_datasets as tfds
import timeit

ds_train, ds_test = tfds.load("mnist", split=["train", "test"], shuffle_files=True, as_supervised=True)
ds_train = ds_train.batch(128)
ds_test = ds_test.batch(128)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(10),
    ]
)

optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_acc_metric = tf.keras.metrics.SparseCategoricalAccuracy()
batch_norm_layer = model.layers[2]

@tf.function
def train_step(epoch, model, batch):
    if epoch == 0:
        batch_norm_layer.momentum = 0.99
    else:
        batch_norm_layer.momentum = 1.0
    with tf.GradientTape() as tape:
        x_batch_train, y_batch_train = batch
        logits = model(x_batch_train, training=True)
        loss_value = loss_fn(y_batch_train, logits)
        train_acc_metric.update_state(y_batch_train, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

def train_step_without_tffunction(epoch, model, batch):
    if epoch == 0:
        batch_norm_layer.momentum = 0.99
    else:
        batch_norm_layer.momentum = 1.0
    with tf.GradientTape() as tape:
        x_batch_train, y_batch_train = batch
        logits = model(x_batch_train, training=True)
        loss_value = loss_fn(y_batch_train, logits)
        train_acc_metric.update_state(y_batch_train, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

epochs = 6
for epoch in range(epochs):
    tf.print("\nStart of epoch %d" % (epoch,))
    tf.print("Momentum = ", batch_norm_layer.moving_mean[-1], summarize=-1)
    test = True
    for batch in ds_train:
        train_step(epoch, model, batch)
        if test:
            tf.print("TF function:", timeit.timeit(lambda: train_step(epoch, model, batch), number=10))
            tf.print("Eager function:", timeit.timeit(lambda: train_step_without_tffunction(epoch, model, batch), number=10))
            test = False
    train_acc = train_acc_metric.result()
    tf.print("Training acc over epoch: %.4f" % (float(train_acc),))
    train_acc_metric.reset_states()
Start of epoch 0
Momentum = 0
TF function: 0.02285163299893611
Eager function: 0.11109527599910507
Training acc over epoch: 0.9229
Start of epoch 1
Momentum = -88.1852188
TF function: 0.024091466999379918
Eager function: 0.1109461480009486
Training acc over epoch: 0.9639
Start of epoch 2
Momentum = -88.1852188
TF function: 0.02331122400210006
Eager function: 0.11751473100230214
Training acc over epoch: 0.9756
Start of epoch 3
Momentum = -88.1852188
TF function: 0.02656845700039412
Eager function: 0.1121610670015798
Training acc over epoch: 0.9830
Start of epoch 4
Momentum = -88.1852188
TF function: 0.02821972700257902
Eager function: 0.15709391699783737
Training acc over epoch: 0.9877
Start of epoch 5
Momentum = -88.1852188
TF function: 0.02441513300072984
Eager function: 0.10921925399816246
Training acc over epoch: 0.9917
Another option is to declare the momentum as a variable:
momentum = tf.Variable(0.99, trainable=False)
# pass into the BN layer
tf.keras.layers.BatchNormalization(momentum=momentum)
Then you can have a callback that updates the momentum:
class BNMomentumUpdate(tf.keras.callbacks.Callback):
    def __init__(self, momentum):
        super().__init__()
        self.momentum = momentum

    def on_epoch_end(self, epoch, logs=None):
        if epoch > 0:
            self.momentum.assign(1.)
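For completeness, a minimal sketch of how these pieces might be wired together, assuming the Sequential model and ds_train dataset from the question (the epoch count is a placeholder):

momentum = tf.Variable(0.99, trainable=False)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(momentum=momentum),  # pass the variable, not a float
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(10),
    ]
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

model.fit(ds_train, epochs=6, callbacks=[BNMomentumUpdate(momentum)], verbose=0)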
Looking for tips on building a simple image classifier for CAPTCHA images of text where there are only two possible fonts per letter. Here's an example image:
My approach thus far has been to break the image up into 6 equal-size slices to get individual character images, and to build a classifier for those (example below).
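For illustration, a minimal sketch of that slicing step, assuming a fixed-width image with 6 evenly spaced characters and PIL (the file name is a placeholder, and a real CAPTCHA may need per-character boundary detection):

from PIL import Image

def split_captcha(path, n_chars=6):
    img = Image.open(path)
    w, h = img.size
    char_w = w // n_chars
    # crop n_chars equal-width slices; assumes characters do not overlap
    return [img.crop((i * char_w, 0, (i + 1) * char_w, h)) for i in range(n_chars)]

chars = split_captcha("captcha.png")  # hypothetical file name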
Is there a simpler way to go about this? Any tips on how to design the actual model? (A relatively simple CNN should suffice here perhaps?)
Edit: questions on building a suitable model below.
I've tried to build a cursory model on top of resnet50 to subpar effect... this seems like the kind of image classification task that should be relatively trivial.
Any tips on model design greatly appreciated.
Code below:
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

model.fc = nn.Sequential(nn.Linear(2048, 512),
                         nn.ReLU(),
                         nn.Dropout(0.2),
                         nn.Linear(512, 26),
                         nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.003)
model.to(device)

epochs = 10
steps = 0
running_loss = 0
print_every = 10
train_losses, test_losses = [], []

for epoch in range(epochs):
    for inputs, labels in train_loader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logps = model.forward(inputs)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in val_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    test_loss += batch_loss.item()

                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

            train_losses.append(running_loss / len(train_loader))
            test_losses.append(test_loss / len(val_loader))
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(val_loader):.3f}.. "
                  f"Test accuracy: {accuracy/len(val_loader):.3f}")
            running_loss = 0
            model.train()
Results from the above hit under 50% accuracy.
I'm creating a simple LSTM model to predict some sales data. I am trying to train it on a GPU, but there seems to be a problem with casting the hidden state tensor to cuda.
I get the following error:
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu.
How can I train the model on a GPU? I cast the training data, initial hidden states, and the model to cuda, yet I still get the error.
Here's my code:
# Convert train_norm from an array to a tensor
train_norm = torch.FloatTensor(train_norm).view(-1).cuda()

# define a window size
window_size = 12

# Define function to create seq/label tuples
def input_data(seq, ws):  # ws is window size
    out = []
    L = len(seq)
    for i in range(L - ws):
        window = seq[i:i+ws]
        label = seq[i+ws:i+ws+1]
        out.append((window, label))
    return out

# Apply the input_data function to train_norm
train_data = input_data(train_norm, window_size)

class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=100, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        # Add an LSTM layer:
        self.lstm = nn.LSTM(input_size, hidden_size)
        # Add a fully connected linear layer:
        self.linear = nn.Linear(hidden_size, output_size)
        # Initialize h0 and c0:
        self.hidden = (torch.zeros(1, 1, hidden_size).cuda(), torch.zeros(1, 1, hidden_size).cuda())

    def forward(self, seq):
        lstm_out, self.hidden = self.lstm(seq.view(len(seq), 1, -1), self.hidden)
        pred = self.linear(lstm_out.view(len(seq), -1))
        return pred[-1]  # get only the last value

model = LSTM().cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 200
import time
start_time = time.time()

for epoch in range(epochs):
    # Extract the sequence and label from the training data
    for seq, y_train in train_data:
        # Reset the parameters and hidden states
        optimizer.zero_grad()
        hidden = (torch.zeros(1, 1, model.hidden_size),
                  torch.zeros(1, 1, model.hidden_size))
        model.hidden = hidden

        # Predict the values
        y_pred = model(seq)

        # Calculate loss and perform backpropagation
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()

    print(f'epoch: {epoch+1:2} loss: {loss.item():10.8f}')
print(f'Training took {time.time() - start_time:.0f} seconds')
First of all, you are initializing hidden when there is no need to: if hidden isn't passed to the LSTM layer, it defaults to zeros (see the documentation). This gives us the following model:
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=100, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        # Add an LSTM layer:
        self.lstm = nn.LSTM(input_size, hidden_size)
        # Add a fully connected linear layer:
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, seq):
        lstm_out, _ = self.lstm(seq.view(len(seq), 1, -1))
        return self.linear(lstm_out.view(len(seq), -1))
Your pred[-1] is probably wrong as well, since you are only returning the last element of the batch from the linear layer...
Also, your training loop should look like this (hidden is removed and .cuda() is added to seq and y_train):
for epoch in range(epochs):
    # Extract the sequence and label from the training data
    for seq, y_train in train_data:
        # Reset the parameters and hidden states
        optimizer.zero_grad()

        # Predict the values
        # Add cuda to sequence
        y_pred = model(seq.cuda())

        # Calculate loss and perform backpropagation
        loss = criterion(y_pred, y_train.cuda())
        loss.backward()
        optimizer.step()

    print(f'epoch: {epoch+1:2} loss: {loss.item():10.8f}')
print(f'Training took {time.time() - start_time:.0f} seconds')
This alleviates the cuda problems (hardcoding it everywhere you possibly can is not a real solution...) and makes your code more readable.
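A tidier alternative, sketched here under the assumption that the rest of the script stays the same, is to pick the device once and move everything with .to(device) instead of sprinkling .cuda() calls:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LSTM().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(epochs):
    for seq, y_train in train_data:
        optimizer.zero_grad()
        y_pred = model(seq.to(device))          # move input to the same device as the model
        loss = criterion(y_pred, y_train.to(device))
        loss.backward()
        optimizer.step()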
I am trying to train a model in pytorch.
input: 686-array
first layer: 64-array
second layer: 2-array
output: prediction, either 1 or 0
this is what I have so far:
class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder_softmax = nn.Sequential(
            nn.Linear(686, 256),
            nn.ReLU(True),
            nn.Linear(256, 2),
            nn.Softmax()
        )

    def forward(self, x):
        x = self.encoder_softmax(x)
        return x
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

net = net.to(device)

iterations = 10
learning_rate = 0.98

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    net.parameters(), lr=learning_rate, weight_decay=1e-5)

for epoch in range(iterations):
    loss = 0.0
    print("train_dl len: ", len(train_dl))
    # net.train()
    for i, data in enumerate(train_dl, 0):
        inputs, labels, vectorize = data
        labels = labels.long().to(device)
        inputs = inputs.float().to(device)

        optimizer.zero_grad()
        outputs = net(inputs)
        train_loss = criterion(outputs, labels)
        train_loss.backward()
        optimizer.step()
        loss += train_loss.item()
    loss = loss / len(train_dl)
but when I train the model, the loss is not going down. What am I doing wrong?
You're using nn.CrossEntropyLoss as the loss function, which applies log-softmax, but you also apply softmax in the model:
self.encoder_softmax = nn.Sequential(
    nn.Linear(686, 256),
    nn.ReLU(True),
    nn.Linear(256, 2),
    nn.Softmax()  # <- needs to be removed
)
The output of your model should be the raw logits, without the nn.Softmax.
You should also lower the learning rate; a learning rate of 0.98 is very high, which makes training much less stable, and you'll likely see the loss oscillate. A more appropriate learning rate would be on the order of 0.01 or 0.001.
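Putting both fixes together, a sketch of the corrected model and optimizer, keeping the question's layer sizes but dropping the final Softmax and lowering the learning rate (the class name is just a suggestion):

class Classifier(nn.Module):  # renamed from "autoencoder" for clarity
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(686, 256),
            nn.ReLU(True),
            nn.Linear(256, 2),  # raw logits; CrossEntropyLoss applies log-softmax itself
        )

    def forward(self, x):
        return self.encoder(x)

net = Classifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001, weight_decay=1e-5)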