I am new to TensorFlow. I am using the Universal Sentence Encoder for text similarity, and I would like to fine-tune USE with my own corpus.
I currently have:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
embed = hub.Module(module_url, trainable=True)
According to the documentation, setting trainable=True will "expose the variables as trainable". However, I have no clue what these trainable variables are or how I can use them to fine-tune USE with my own corpus.
Please, any guidance or direction would be greatly appreciated.
To fine-tune a pre-trained model is to allow its weights to be updated in the downstream training task.
So you have 2 options:
trainable=False
this option trains more quickly, but the pretrained model's weights are never updated. A sentence embedding will look identical before and after your own training; only your own model layers will have their weights changed by training.
trainable=True
this adds a computational burden to your training loop, but allows the weights of the embedder to be updated according to your task and training data, which may result in a more accurate final model.
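For concreteness, here is a minimal sketch (TF1-style, matching the hub.Module call above) of a downstream classifier whose gradients flow back into the USE weights. The placeholder shapes, the two-class head, and the learning rate are assumptions for illustration:

import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
embed = hub.Module(module_url, trainable=True)

sentences = tf.placeholder(tf.string, shape=[None])
labels = tf.placeholder(tf.int64, shape=[None])

embeddings = embed(sentences)                  # [batch, 512] USE vectors
logits = tf.layers.dense(embeddings, units=2)  # your own task head
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
# Because trainable=True, the module's variables sit in the default trainable
# collection, so this train_op updates both the head and the encoder.
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # sess.run(train_op, feed_dict={sentences: batch_texts, labels: batch_labels})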
I have trained a pretrained ResNet18 model on my custom dataset in PyTorch and wondered whether I could transfer my model file to train another one with a different architecture, e.g. ResNet50. I know I have to save my model accordingly (explained well in another post here), but this is a question I had never thought about before.
I was planning to use more advanced models like Vision Transformers (ViT), but I couldn't figure out whether I had to start from an already-pretrained ViT or whether I could take my previous model file and use it as the pretrained model for training a ViT.
Example Scenario: ResNet18 --> ResNet50 --> Inception v3 --> ViT
My best guess is that it's not possible due to the differing numbers of weights, neurons, and layer structures, but I would love to hear it if I'm missing a crucial point here. Thanks!
Between models that differ only in their number of layers (ResNet-18 and ResNet-50), people have initialized some layers of the larger model from the weights of the smaller model's layers. Conversely, you can truncate a larger model by taking a subset of regularly spaced layers and use them to initialize a smaller model. In both cases, you need to retrain everything at the end if you hope to achieve semi-decent performance.
The whole point of using architectures that vastly differ (vision transformers vs. CNNs) is to learn different features from the inputs and unlock new levels of semantic understanding. Recent models like BEiT also use new self-supervised training schemes that have nothing to do with classic ImageNet pretraining. Using trained weights from another model would defeat that purpose.
Having said that, if you want to use a ViT, why not start from the pretrained weights available on HuggingFace and fine-tune on the data you used to train your ResNet50?
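For example, a hedged sketch using the transformers library; the checkpoint name and the label count below are placeholders for your own setup:

from transformers import ViTForImageClassification

# Start from publicly available pretrained ViT weights instead of a ResNet
# checkpoint; a fresh classification head is attached automatically.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,  # replace with the number of classes in your dataset
)
# Fine-tune `model` on the same data you used for your ResNet50, e.g. with
# the transformers Trainer or a plain PyTorch training loop.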
If a pretrained model such as ResNet101 was trained on the ImageNet dataset, and I then change some layers inside it, can I still use the pretrained weights on a different ABC dataset?
Let's say this is a ResNet34 model, pretrained on ImageNet and saved as a ResNet.pt file.
Suppose I change some layers inside it, let's say I make it deeper by introducing some layers into conv4_x (check the image):
model = Resnet34()  # I have changed some layers inside this ResNet34()
optimizer = optim.Adam(model.parameters(), lr=0.00005)
model.load_state_dict(torch.load('Resnet.pt')['state_dict'])  # the pretrained ResNet weights saved before my changes
optimizer.load_state_dict(torch.load('Resnet.pt')['optimizer'])
Can I do this, or is there another method?
You can do anything you like - the question is: would it be better than training from scratch?
Here are a few issues you might encounter:
1. A mismatch between the weights saved in ResNet.pt (the trained weights of the original ResNet34) and the state_dict of your modified model.
You would probably need to manually make sure that the old weights are correctly assigned to the original layers and that only the new layers are left uninitialized (see the sketch after this list).
2. Initializing the weights of the new layer.
Since you are training a ResNet, you can take advantage of the residual connections and initialize the new layer's weights so that it initially makes no contribution to the predicted value and simply passes the input through to the output via the residual link.
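A minimal sketch of both points, assuming the modified Resnet34 class and the 'Resnet.pt' checkpoint from the question; it copies only the weights whose names and shapes still match, leaving the new layers at their fresh initialization:

import torch

model = Resnet34()  # the modified architecture with the extra conv4_x layers
pretrained = torch.load('Resnet.pt')['state_dict']

# Keep only entries that exist in the new model with an identical shape.
own_state = model.state_dict()
compatible = {k: v for k, v in pretrained.items()
              if k in own_state and v.shape == own_state[k].shape}
own_state.update(compatible)
model.load_state_dict(own_state)

# For point 2: if a new layer sits inside a residual block, zeroing the scale
# of the block's last BatchNorm makes the block start out as an identity map,
# e.g. (hypothetical module path for one of the new blocks):
# torch.nn.init.zeros_(model.conv4_x[2].bn2.weight)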
I am solving a multi-class classification problem using Keras, but I suspect the accuracy is poor because of bad word embeddings for my data (domain-specific data).
Keras has its own Embedding layer, which is trained in a supervised fashion as part of the model.
So I have 2 questions regarding this:
Can I use word2vec embeddings in the Embedding layer of Keras, since word2vec is a form of unsupervised/self-supervised learning?
If yes, can I then use transfer learning on a pre-trained word2vec model to add extra knowledge of my domain-specific features?
You can initialize the embedding layer with word2vec or any other pre-trained embeddings (maybe FastText?) by manually constructing the embedding matrix, i.e., loading all the numbers from the word2vec file and making an np.array of them. Then you create a constant initializer and pass it as an argument to the embedding layer's constructor.
If you don't want the embeddings to get updated during training, just set trainable to False on the layer object.
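A minimal sketch, assuming gensim-format word2vec vectors and a word_index dict (word -> integer id) built by your own tokenizer; the file name and vocab_size are placeholders:

import numpy as np
from gensim.models import KeyedVectors
from tensorflow import keras

w2v = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
embedding_dim = w2v.vector_size

# Build the embedding matrix row by row; words missing from word2vec stay zero.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]

embedding_layer = keras.layers.Embedding(
    vocab_size,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,  # flip to True to let the vectors adapt to your domain
)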
I am looking for some pointers on training a conventional neural network model with BERT embeddings that are generated dynamically (BERT's contextualized embeddings produce different vectors for the same word depending on its context).
In a normal neural network model, we would initialize the model with GloVe or fastText embeddings like:
import torch.nn as nn
embed = nn.Embedding(vocab_size, vector_size)
embed.weight.data.copy_(some_variable_containing_vectors)
Instead of copying static vectors like this and using them for training, I want to pass every input through a BERT model, generate embeddings for the words on the fly, and feed those to the model for training.
So should I change the model's forward function to incorporate those embeddings?
Any help would be appreciated!
If you are using PyTorch, you can use https://github.com/huggingface/pytorch-pretrained-BERT, which is the most popular BERT implementation for PyTorch (it is also a pip package!). Here I'm just going to outline how to use it properly.
For this particular problem there are 2 approaches (in both of which you obviously cannot use an Embedding layer):
You can incorporate generating BERT embeddings into your data preprocessing pipeline. You will need to use BERT's own tokenizer and word-to-id dictionary. The repo's README has examples on preprocessing.
You can write a loop that generates BERT embeddings for your strings in batches, like this (batching, because BERT consumes a lot of GPU memory):
(Note: to be more proper you should also add attention masks - LongTensors of 1s and 0s that mask out the padding beyond each sentence's length.)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

batch_size = 32
X_train, y_train = samples_from_file('train.csv')  # Put your own data loading function here

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
X_train = [tokenizer.tokenize('[CLS] ' + sent + ' [SEP]') for sent in X_train]  # Adding [CLS] and [SEP] tokens - this probably can be done in a cleaner way

bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model = bert_model.cuda()

X_train_tokens = [tokenizer.convert_tokens_to_ids(sent) for sent in X_train]
# Note: sequences must be padded to equal length for the LongTensor below.
results = torch.zeros((len(X_train_tokens), bert_model.config.hidden_size))

with torch.no_grad():
    for stidx in range(0, len(X_train_tokens), batch_size):
        X = X_train_tokens[stidx:stidx + batch_size]
        X = torch.LongTensor(X).cuda()
        _, pooled_output = bert_model(X)
        results[stidx:stidx + batch_size, :] = pooled_output.cpu()
After this you obtain the results tensor, which contains the calculated embeddings and can be used as input to your model.
The full (and more proper) code for this is provided here
This method has the advantage of not having to re-calculate these embeddings every epoch.
With this method, e.g. for classification, your model need only consist of a Linear(bert_model.config.hidden_size, num_labels) layer; the inputs to the model are the results tensor from the code above.
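As a sketch (reusing results, bert_model, and num_labels from above), the whole downstream model can literally be:

import torch.nn as nn

classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
logits = classifier(results)  # feed logits into e.g. nn.CrossEntropyLoss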
Second, and arguably cleaner, method: if you check out the repo, you will find there are wrappers for various tasks (e.g. BertForSequenceClassification). It should also be easy to implement custom classes that inherit from BertPretrainedModel and utilize the various BERT classes from the repo.
For example, you can use:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)  # where num_labels is the number of labels you need to classify
After that you can continue with the preprocessing, up until generating token ids. Then you can train the entire model (but with a low learning rate, e.g. Adam with 3e-5 for batch_size = 32).
With this you can fine-tune BERT's embeddings themselves, or use techniques like freezing BERT for a few epochs to train only the classifier and then unfreezing to fine-tune everything, etc. But it is also more computationally expensive.
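The freeze/unfreeze schedule is just a matter of toggling requires_grad on the encoder's parameters; a sketch using the model from above (the .bert attribute is how BertForSequenceClassification exposes the encoder):

for param in model.bert.parameters():
    param.requires_grad = False  # first few epochs: train only the classifier head
# ...train for a few epochs, then:
for param in model.bert.parameters():
    param.requires_grad = True   # unfreeze and fine-tune end-to-end at a low LR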
An example for this is also provided in the repo
What is the simplest way to use a tf.estimator-trained model A during the training of another model B?
The weights in model A are fixed. In model B, I would like to take some inputs, do some computation, feed the results into model A, then do some more computation on its output.
A simple example:
ModelA returns tf.matmul(input, weights)
In ModelB, I would like to do the following:
x1 = tf.matmul(new_inputs,new_weights1)
x2 = modelA(x1) # with fixed weights
return tf.matmul(x2,new_weights2)
But with more complicated models A and B, each of which is trained as a tf.estimator (though I'm happy to not use estimators if there's another easy solution -- I'm using them because I would like to use ML Engine).
This question is related, but the proposed solution does not work for training model B, because the gradients of tf.py_func are [None]. I have tried registering a gradient for tf.py_func, but this fails with
Unsupported object type Tensor
I have also tried tf.import_graph_def for model A, but this seems to load the pretrained graph without the actual weights.
For model composability, Keras works a whole lot better. You can convert a Keras model to an estimator:
https://cloud.google.com/blog/products/gcp/new-in-tensorflow-14-converting-a-keras-model-to-a-tensorflow-estimator
So you can still train on ML Engine.
With Keras, it is then just a matter of loading the intermediate layers' weights and biases from a checkpoint and making those layers non-trainable (a sketch follows after the link). See:
Is it possible to save a trained layer to use layer on Keras?
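A minimal sketch of the composition from the question, assuming model A was exported (or rebuilt) as a Keras model; the file name and the dimension variables are placeholders:

import tensorflow as tf

model_a = tf.keras.models.load_model('model_a.h5')
model_a.trainable = False  # keep A's weights fixed while training B

inputs = tf.keras.Input(shape=(input_dim,))
x1 = tf.keras.layers.Dense(hidden_dim)(inputs)   # plays the role of new_weights1
x2 = model_a(x1)                                 # frozen model A in the middle
outputs = tf.keras.layers.Dense(output_dim)(x2)  # plays the role of new_weights2
model_b = tf.keras.Model(inputs, outputs)
model_b.compile(optimizer='adam', loss='mse')

# Still trainable on ML Engine via the estimator API:
estimator_b = tf.keras.estimator.model_to_estimator(keras_model=model_b)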