I want to use the BERT model to do multi-label classification with TensorFlow.
To do so, I want to adapt the example run_classifier.py from the BERT GitHub repository, which shows how to use BERT for simple classification using the pre-trained weights provided by Google Research (for example BERT-Base, Cased).
I have X different labels, each with a value of either 0 or 1, so I want to add to the original BERT model a new dense layer of size X and use the sigmoid_cross_entropy_with_logits loss function.
So, for the theoretical part, I think I am OK.
The problem is that I don't know how I can append a new output layer and retrain only this new layer with my dataset, using the existing BertModel class.
Here is the original create_model() function from run_classifier.py, where I guess I have to make my modifications. But I am a bit lost about what to do.
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
And here is the same function with some of my modifications, but where some things are missing (and possibly wrong too?):
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids)

  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
The other things I have adapted in the code, and for which I had no problems, are:
a DataProcessor to load and parse my custom dataset
changing the type of the labels variable from numerical values to arrays everywhere it is used
So, if anyone knows what I should do to resolve my problem, or can even point out an obvious mistake I may have made, I would be glad to hear it.
Notes:
I found this article that corresponds pretty well to what I am trying to do, but it uses PyTorch and I cannot translate it into TensorFlow.
You want to replace the softmax, which models a single distribution over the possible outputs (all scores sum up to one), with a sigmoid, which models an independent distribution for each class (there is a yes/no distribution for each output).
So, you correctly changed the loss function, but you also need to change how you compute the probabilities. It should be:
probabilities = tf.sigmoid(logits)
In this case, you don't need the log_probs.
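To make the change concrete, here is a minimal sketch of how the end of create_model() might look with these modifications. It is only a sketch, under the assumption that labels is a float32 tensor of shape [batch_size, num_labels] holding the 0/1 targets:

with tf.variable_scope("loss"):
    if is_training:
        # I.e., 0.1 dropout
        output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    # Independent per-label probabilities instead of a softmax distribution
    probabilities = tf.sigmoid(logits)

    # Element-wise sigmoid cross entropy: shape [batch_size, num_labels]
    per_label_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # Sum over labels to get one loss per example, then average over the batch
    per_example_loss = tf.reduce_sum(per_label_loss, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)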
Related
I am using PyTorch in order to get my neural network to recognize digits from the MNIST database.
import torch
import torchvision
I'd like to implement a very simple design similar to what is shown in 3Blue1Brown's video series about neural networks. The following design in particular achieved an error rate of 1.6%.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)

    def forward(self, x):
        x = torch.sigmoid(self.layer1(x))
        x = torch.sigmoid(self.layer2(x))
        return x
The data is gathered using torchvision and organised in mini batches containing 32 images each.
batch_size = 32
training_set = torchvision.datasets.MNIST("./", download=True, transform=torchvision.transforms.ToTensor())
training_loader = torch.utils.data.DataLoader(training_set, batch_size=32)
I am using the mean squared error as the loss function and stochastic gradient descent with a learning rate of 0.001 as my optimization algorithm.
net = Net()
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
Finally the network gets trained and saved using the following code:
for images, labels in training_loader:
    optimizer.zero_grad()

    for i in range(batch_size):
        output = net(torch.flatten(images[i]))
        desired_output = torch.tensor([float(j == labels[i]) for j in range(10)])
        loss = loss_function(output, desired_output)
        loss.backward()

    optimizer.step()

torch.save(net.state_dict(), "./trained_net.pth")
However, here are the outputs of some test images:
tensor([0.0978, 0.1225, 0.1018, 0.0961, 0.1022, 0.0885, 0.1007, 0.1077, 0.0994,
0.1081], grad_fn=<SigmoidBackward>)
tensor([0.0978, 0.1180, 0.1001, 0.0929, 0.1006, 0.0893, 0.1010, 0.1051, 0.0978,
0.1067], grad_fn=<SigmoidBackward>)
tensor([0.0981, 0.1227, 0.1018, 0.0970, 0.0979, 0.0908, 0.1001, 0.1092, 0.1011,
0.1088], grad_fn=<SigmoidBackward>)
tensor([0.1061, 0.1149, 0.1037, 0.1001, 0.0957, 0.0919, 0.1044, 0.1022, 0.0997,
0.1052], grad_fn=<SigmoidBackward>)
tensor([0.0996, 0.1137, 0.1005, 0.0947, 0.0977, 0.0916, 0.1048, 0.1109, 0.1013,
0.1085], grad_fn=<SigmoidBackward>)
tensor([0.1008, 0.1154, 0.0986, 0.0996, 0.1031, 0.0952, 0.0995, 0.1063, 0.0982,
0.1094], grad_fn=<SigmoidBackward>)
tensor([0.0972, 0.1235, 0.1013, 0.0984, 0.0974, 0.0907, 0.1032, 0.1075, 0.1001,
0.1080], grad_fn=<SigmoidBackward>)
tensor([0.0929, 0.1258, 0.1016, 0.0978, 0.1006, 0.0889, 0.1001, 0.1068, 0.0986,
0.1024], grad_fn=<SigmoidBackward>)
tensor([0.0982, 0.1207, 0.1040, 0.0990, 0.0999, 0.0910, 0.0980, 0.1051, 0.1039,
0.1078], grad_fn=<SigmoidBackward>)
As you can see the network seems to approach a state where the answer for every input is:
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
This neural network is not better than just guessing. Where did I go wrong in my design or code?
Here are a few points that would be useful for you:
At first glance your model is not learning, since your predictions are as good as a random guess. A first step would be to monitor your loss; here you only train for a single epoch. At the very least you could evaluate your model on unseen data:
validation_set = torchvision.datasets.MNIST(
    "./", download=True, train=False, transform=torchvision.transforms.ToTensor())
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=32)
You are using an MSE loss (the L2 norm) to train a classification task, which is not the right tool for this kind of task. You could instead use the negative log-likelihood. PyTorch offers nn.CrossEntropyLoss, which combines a log-softmax and the negative log-likelihood loss in one module. This change can be implemented by adding in:
loss_function = torch.nn.CrossEntropyLoss()
and using the right target shapes when applying loss_function (see the shape example below). Since the loss function applies a log-softmax internally, you shouldn't have an activation function on your model's output.
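To make those shapes concrete, here is a tiny illustrative snippet; the 32/10 sizes are just the batch size and class count from the question, and loss_function is assumed to be the CrossEntropyLoss defined above:

# CrossEntropyLoss expects raw logits of shape [batch_size, num_classes]
# and integer class indices of shape [batch_size] (not one-hot vectors).
logits = torch.randn(32, 10)             # one mini-batch of raw model outputs
targets = torch.randint(0, 10, (32,))    # class indices 0..9
loss = loss_function(logits, targets)
print(loss.item())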
You are using sigmoid as the activation function everywhere; for the intermediate non-linearities, ReLU generally works better (see the related post). A sigmoid is more suited to a binary classification task. Again, since we are using nn.CrossEntropyLoss, we have to remove the activation after layer2.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.flatten = torch.nn.Flatten()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)  # raw logits, no activation here
        return x
A less crucial point is that you could run inference on a whole batch at once instead of looping through each batch one element at a time. A typical training loop for one epoch would look like:
for images, labels in training_loader:
    optimizer.zero_grad()
    output = net(images)
    loss = loss_function(output, labels)
    loss.backward()
    optimizer.step()
With these kinds of modifications, you can expect a validation accuracy of around 80% after a single epoch.
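If it helps, here is a rough sketch of how that validation accuracy could be measured, assuming the validation_loader defined above and the modified Net (this is just one way to do it):

correct = 0
total = 0
with torch.no_grad():
    for images, labels in validation_loader:
        logits = net(images)
        predictions = logits.argmax(dim=1)          # most likely class per image
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

print(f"validation accuracy: {correct / total:.2%}")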
I'm using a convolutional neural network to classify images into 3 labels. I did all the training and testing and got an accuracy of 60%. Then I saved this model, and now I want to load a single image and classify it into one of those labels. The code I'm using:
X_new = process_data()  # That's my input image after some processing
pred = convolutional_neural_network(x)  # That's my CNN

with tf.Session() as sess:
    # Here I restore the trained model
    saver = tf.train.import_meta_graph('modelo.meta')
    saver.restore(sess, 'modelo')
    print('Model loaded')
    sess.run(tf.initialize_all_variables())

    # Here I'm trying to predict the label of my image
    c = sess.run(pred, feed_dict={x: X_new})
    print(c)
When I print c, it returns something like this:
[[ 1.5495030e+07 -2.3345528e+08 -1.5847101e+08]]
But I couldn't find out what that means and what I should do with it.
Anyway, what I'm trying to do is to get a percentage of how much an image belongs to a given label. If anyone can help me with that, I will be very grateful! I'm new to TensorFlow and I'm having difficulties.
Thank you so much!
EDIT:
The convolutional_neural_network method:
def convolutional_neural_network(x):
    weights = {'W_conv1': tf.Variable(tf.random_normal([3, 3, 3, 1, 32])),
               'W_conv2': tf.Variable(tf.random_normal([3, 3, 3, 32, 64])),
               'W_fc': tf.Variable(tf.random_normal([54080, 1024])),
               'out': tf.Variable(tf.random_normal([1024, n_classes]))}

    biases = {'b_conv1': tf.Variable(tf.random_normal([32])),
              'b_conv2': tf.Variable(tf.random_normal([64])),
              'b_fc': tf.Variable(tf.random_normal([1024])),
              'out': tf.Variable(tf.random_normal([n_classes]))}

    x = tf.reshape(x, shape=[-1, IMG_PX_SIZE, IMG_PX_SIZE, HM_SLICES, 1])

    conv1 = tf.nn.relu(conv3d(x, weights['W_conv1']) + biases['b_conv1'])
    conv1 = maxpool3d(conv1)

    conv2 = tf.nn.relu(conv3d(conv1, weights['W_conv2']) + biases['b_conv2'])
    conv2 = maxpool3d(conv2)

    fc = tf.reshape(conv2, [-1, 54080])
    fc = tf.nn.relu(tf.matmul(fc, weights['W_fc']) + biases['b_fc'])
    fc = tf.nn.dropout(fc, keep_rate)

    output = tf.matmul(fc, weights['out']) + biases['out']
    return output
It depends on the type of activation used at the last layer of the model. I am assuming softmax, as there are 3 labels to predict. Based on the values, I think the network was not able to predict the label with high certainty, hence the values you see. If you just want to know the label of the image, use the argmax function of the NumPy library like this:
import numpy as np
c=np.argmax(c)
print(c)
Hope this helps.
Right now you seem to just have a linear regression output layer (or logits layer), so it is hard to interpret the outputs. We would also need to see your loss function to understand the outputs you are seeing.
However, for classification into one of 3 labels you would normally want to terminate your network with a softmax layer; then the outputs will be interpretable as probabilities for each label (multiply by 100 if you need percentages).
probabilities = tf.nn.softmax(output)
You would still train your classifier using your output/logits layer with a softmax cross-entropy loss against your ground-truth 'y' values.
losses = tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y)
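To tie that to the question's code, here is a rough sketch of the prediction step. It assumes pred is the logits tensor returned by convolutional_neural_network(x), and that sess, x and X_new are the restored session, placeholder and input from the question:

import numpy as np

probabilities = tf.nn.softmax(pred)              # per-label probabilities, shape [1, 3]

probs = sess.run(probabilities, feed_dict={x: X_new})
predicted_label = np.argmax(probs, axis=1)[0]    # index of the most likely label
confidence = probs[0, predicted_label] * 100     # as a percentage

print(predicted_label, confidence)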
I am trying to teach a neural net a simple function (f(t) = 2t) and then compute its derivative with respect to the input (df/dt = 2). I use a net with one dense layer and no activation:
model = Sequential()
model.add(Dense(output_dim=1, input_shape=(1,), bias_initializer='ones'))
opt = RMSprop(lr=0.01, rho=0.9, epsilon=None, decay=0.0)
model.compile(optimizer=opt, loss='mse', metrics=['mae'])#optimizer=opt,
model.summary()
My data consist of pairs t -> f(t), where t is chosen randomly on [0, 1]. To compute df/dt of my net I found this code:
https://groups.google.com/forum/#!msg/keras-users/g2JmncAIT9w/36MJZI7NBQAJ
and https://colab.research.google.com/drive/1l9YdIa2N40Fj3Y09qb3r3RhqKPXoaVJC
This is my full code on colab:
model.fit(train_x ,train_y, epochs=100,validation_data=(test_x, test_y),shuffle=False, batch_size=32)
model.layers[0].get_weights()# this displays 2.0069, quite right
outputTensor = model.output
listOfVariableTensors = model.inputs[0]
gradients = k.gradients(outputTensor, listOfVariableTensors)
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
evaluated_gradients = sess.run(gradients,feed_dict={model.input:np.array([[10]])})
evaluated_gradients # this displays kinda random number
model.layers[0].get_weights() # this displays same number as above
I believe my model performs a simple w*t + b transform, so its derivative should be just w. But the code I found gives wrong results and breaks the trained weights. I actually think it resets them to their initial values, because if I initialize the dense layer weights with kernel_initializer="ones", the code returns 1 as the derivative.
So, I need help with correctly computing the derivative of a neural net with respect to its input.
sess = tf.InteractiveSession() # run this before anything happens
#....
model.fit(train_x ,train_y, epochs=100,validation_data=(test_x, test_y),shuffle=False, batch_size=32)
model.layers[0].get_weights()# this displays 2.0069, quite right
outputTensor = model.output
listOfVariableTensors = model.inputs[0]
gradients = k.gradients(outputTensor, listOfVariableTensors)
evaluated_gradients = sess.run(gradients,feed_dict={model.input:np.array([[10]])})
evaluated_gradients
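The key change above is that the session is created before the model is built and trained, and the sess.run(tf.initialize_all_variables()) call is dropped, so the trained variable values are not overwritten by re-initialization. As an alternative sketch (assuming TF 1.x and a model built with tf.keras; with standalone Keras the import would be from keras import backend as K), you could reuse the session Keras already trained in:

import numpy as np
from tensorflow.keras import backend as K

# Reuse the session Keras trained the model in, instead of creating a new one
# and re-initializing the (already trained) variables.
sess = K.get_session()

gradients = K.gradients(model.output, model.inputs)
evaluated_gradients = sess.run(gradients, feed_dict={model.input: np.array([[10.0]])})
print(evaluated_gradients)  # should be close to the learned weight w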
I'm trying to implement some custom GRU cells using TensorFlow. I need to stack those cells, and I wanted to inherit from tensorflow.keras.layers.GRU. However, when looking at the source code, I noticed that you can only pass a units argument to the __init__ of GRU, while RNN has an argument that accepts a list of RNN cells and leverages it to stack those cells by calling StackedRNNCells. Meanwhile, GRU only creates one GRUCell.
For the paper I'm trying to implement, I actually need to stack GRUCell. Why are the implementations of RNN and GRU different?
While searching for the documentation for these classes to add links, I noticed something that may be tripping you up: there are (currently, just before the official TF 2.0 release) two GRUCell implementations in TensorFlow! There is a tf.nn.rnn_cell.GRUCell and a tf.keras.layers.GRUCell. It looks like the one from tf.nn.rnn_cell is deprecated, and the Keras one is the one you should use.
From what I can tell, the GRUCell has the same __call__() method signature as tf.keras.layers.LSTMCell and tf.keras.layers.SimpleRNNCell, and they all inherit from Layer. The RNN documentation gives some requirements on what the __call__() method of the objects you pass to its cell argument must do, but my guess is that all three of these should meet those requirements. You should be able to just use the same RNN framework and pass it a list of GRUCell objects instead of LSTMCell or SimpleRNNCell.
I can't test this right now, so I'm not sure if you pass a list of GRUCell objects or just GRU objects into RNN, but I think one of those should work.
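If it helps, here is a minimal, untested sketch of that idea, assuming the Keras GRUCell and the generic tf.keras.layers.RNN wrapper (the layer sizes and input shape are arbitrary):

import tensorflow as tf

# Stack two GRU cells by handing a list of cells to the generic RNN wrapper.
cells = [tf.keras.layers.GRUCell(64), tf.keras.layers.GRUCell(64)]
stacked_gru = tf.keras.layers.RNN(cells, return_sequences=True)

inputs = tf.keras.Input(shape=(None, 32))   # (time steps, features)
outputs = stacked_gru(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()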
train_graph = tf.Graph()
with train_graph.as_default():
    # Initialize input placeholders
    input_text = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')

    # Calculate text attributes
    vocab_size = len(int_to_vocab)
    input_text_shape = tf.shape(input_text)

    # Build the RNN cell
    lstm = tf.contrib.rnn.BasicLSTMCell(num_units=rnn_size)
    drop_cell = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop_cell] * num_layers)

    # Set the initial state
    initial_state = cell.zero_state(input_text_shape[0], tf.float32)
    initial_state = tf.identity(initial_state, name='initial_state')

    # Create word embedding as input to RNN
    embed = tf.contrib.layers.embed_sequence(input_text, vocab_size, embed_dim)

    # Build RNN
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype=tf.float32)
    final_state = tf.identity(final_state, name='final_state')

    # Take RNN output and make logits
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn=None)

    # Calculate the probability of generating each word
    probs = tf.nn.softmax(logits, name='probs')

    # Define loss function
    cost = tf.contrib.seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_text_shape[0], input_text_shape[1]])
    )

    # Learning rate optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate)

    # Gradient clipping to avoid exploding gradients
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
I tried to write my own estimator model_fn() for a GCP ML Engine package. I decoded a sequence of outputs using embedding_rnn_decoder as shown below:
outputs, state = tf.contrib.legacy_seq2seq.embedding_rnn_decoder(
    decoder_inputs=decoder_inputs,
    initial_state=curr_layer,
    cell=tf.contrib.rnn.GRUCell(hidden_units),
    num_symbols=n_classes,
    embedding_size=embedding_dims,
    feed_previous=False)
I know that outputs is "A list of the same length as decoder_inputs of 2D Tensors" but I am wondering how I can use this list to calculate the loss function for the entire sequence?
I know that if I grab outputs[0] (i.e. grab only the first sequence output), then I could compute a loss as follows:
logits = tf.layers.dense(
    outputs[0],
    n_classes)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels))
Is it appropriate to generate a loss value for each of the items in outputs and then pass these all to tf.reduce_mean? This feels inefficient, especially for long sequences -- are there any other ways to calculate the softmax at each step of the sequence that would be more efficient?
I think you're looking for sequence_loss (currently in contrib, a kind of incubation area for new code).
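In case it helps, here is a rough sketch of how the full list of decoder outputs could be fed to sequence_loss. It is only a sketch: it assumes labels is an integer tensor of shape [batch_size, sequence_length] and that a single shared projection to n_classes is acceptable:

# Stack the per-step decoder outputs into [batch_size, sequence_length, hidden_units]
stacked_outputs = tf.stack(outputs, axis=1)

# Project every time step to class logits with one shared dense layer
logits = tf.layers.dense(stacked_outputs, n_classes)    # [batch, time, n_classes]

# sequence_loss averages the per-step cross entropy over the batch and sequence;
# the weights tensor can mask out padding steps if needed (all ones here).
weights = tf.ones_like(labels, dtype=tf.float32)         # [batch, time]
loss = tf.contrib.seq2seq.sequence_loss(
    logits=logits, targets=labels, weights=weights)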
It looks like the solution to my problem is to use sequence_loss_by_example