How to pass word2vec embedding as a Keras Embedding layer? - python

I am solving a multi-class classification problem using Keras, but I suspect the accuracy is low because of poor word embeddings for my data (domain-specific data).
Keras has its own Embedding layer, which is trained in a supervised way as part of the model.
So I have two questions regarding this:
Can I use word2vec embeddings in the Embedding layer of Keras, since word2vec is a form of unsupervised/self-supervised learning?
If yes, can I then use transfer learning on a pre-trained word2vec model to add extra knowledge about my domain-specific features?

You can initialize the Embedding layer with word2vec or any other pre-trained embeddings (maybe FastText?) by manually constructing the embedding matrix, i.e., just load all the numbers from the word2vec files and make an np.array out of them. Then you create a constant initializer and pass it as an argument to the Embedding layer constructor.
If you don't want the embeddings to get updated during training, just set trainable to False on the layer object.
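A minimal sketch of that approach, assuming gensim 4 (kv.key_to_index), a hypothetical word2vec.bin file, and that your tokenizer assigns word IDs consistent with the matrix rows:

import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# load the pre-trained word2vec vectors (path and binary flag are assumptions)
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

vocab_size = len(kv.key_to_index) + 1      # +1 keeps index 0 free for padding/OOV
embedding_dim = kv.vector_size

# build the embedding matrix: row i holds the vector for word ID i
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in kv.key_to_index.items():
    embedding_matrix[idx + 1] = kv[word]   # shift by 1 so index 0 stays all zeros

# constant initializer + trainable=False gives frozen pre-trained embeddings
embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)

If you want to adapt the vectors to your domain data instead, set trainable=True so the matrix is updated during training.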

Related

Finetuning the Universal Sentence Encoder

I am new to TensorFlow. I am using the Universal Sentence Encoder (USE) for text similarity, and I would like to finetune USE with my own corpus.
I currently have:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
embed = hub.Module(module_url, trainable=True)
According to here, setting trainable=True will "expose the variables as trainable". However, I have no clue what these trainable variables are and how I can use them to finetune the USE with my own corpus.
Please, any guidance or direction would be greatly appreciated.
To finetune a pre-trained model is to allow its weights to be updated in the downstream training task.
So you have 2 options:
trainable=False
This option will train quicker, but the pre-trained model's weights will never be updated. A sentence embedding will look identical before and after your own training; only your own model layers will have their weights changed by training.
trainable=True
This adds a computational burden to your training loop, but it allows the weights of the embedder to be updated according to your task and training data, which may result in a more accurate final model.
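As a rough TF2-style illustration of the trainable=True route (not from the original answer), wrapping USE as a hub.KerasLayer inside a downstream classifier; the module URL and layer sizes are assumptions:

import tensorflow as tf
import tensorflow_hub as hub

# TF2 SavedModel version of USE (the question above used v2 via hub.Module)
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"

# trainable=True exposes the encoder's variables to the optimizer, so they are
# fine-tuned together with the task-specific layers below
use_layer = hub.KerasLayer(module_url, trainable=True, input_shape=[], dtype=tf.string)

model = tf.keras.Sequential([
    use_layer,                                      # raw strings -> 512-dim sentence embeddings
    tf.keras.layers.Dense(64, activation="relu"),   # task head; sizes are illustrative
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(sentences, labels, ...) will then update the USE weights as well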

Use Glove vectors without Embedding layers in LSTM

I want to use GloVe vectors for language modeling. The problem is that if I use an Embedding layer in the model, I can't predict the output vector and match it to a word. What I mean is: I want to give the GloVe vector representations of my sentences as the input, get vectors out of the LSTM layer, and match them against the GloVe vectors, i.e., use the GloVe vectors without an Embedding layer. Can someone propose a method to do this? I am using Keras and Python 3.
What I want is to use the embedding step as one model (model1), take its output vectors, and feed them to a separate LSTM model (model2) as input, which then gives back the index of the word vector.
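A minimal sketch of the setup described above, assuming a hypothetical glove dict mapping words to 100-dimensional NumPy vectors; sequences are embedded outside the model and fed straight into an LSTM that predicts a vector in GloVe space:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

EMBED_DIM = 100   # assumed GloVe dimensionality
MAX_LEN = 20      # assumed fixed sequence length (pad/truncate to this)

def sentence_to_glove(tokens, glove):
    # look up GloVe vectors outside the model; unknown words become zero vectors
    vecs = [glove.get(tok, np.zeros(EMBED_DIM)) for tok in tokens[:MAX_LEN]]
    vecs += [np.zeros(EMBED_DIM)] * (MAX_LEN - len(vecs))   # pad to fixed length
    return np.stack(vecs)

# no Embedding layer: the model consumes pre-embedded sequences directly and
# regresses the GloVe vector of the next word, which can then be matched to the
# nearest GloVe vector (e.g. by cosine similarity) to recover the word index
model = Sequential([
    LSTM(128, input_shape=(MAX_LEN, EMBED_DIM)),
    Dense(EMBED_DIM),   # output lives in GloVe space
])
model.compile(optimizer="adam", loss="mse")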

Using trainable word embedding layer with LSTM and dynamic RNN: AdamOptimizer expected float_ref instead of float

I'm using an RNN on sequences of word embeddings to classify sentences. At first I was feeding pre-trained word embeddings and everything worked fine. I made the embeddings matrix a tf.placeholder with dimension (Vocab size, Embedding size) and fed some pre-trained embeddings from GloVe. I also use tf.nn.embedding_lookup to translate my inputs (which are sequences of word IDs) into sequences of embeddings.
Then I wanted to allow the model to train the embeddings as well, so I made the embedding matrix a tf.Variable instead of a placeholder. Now TensorFlow gives me this error -- apparently the AdamOptimizer can't handle the embedding lookup. Any idea what's up or how to fix this?
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node
Adam/update_embeddings/AssignSub was passed float from _recv_embeddings_0:0
incompatible with expected float_ref.
You cannot feed a value to a variable and optimize it at the same time. Instead, you must first run a tf.assign on that variable to initialize it to the fed value, and then run the optimizer. Or, more easily, you can just pass the GloVe vectors as the initializer of the variable and run tf.global_variables_initializer.
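A brief TF1-style sketch of the second, simpler route (passing the GloVe matrix as the variable's initial value); the file name and the surrounding graph code are assumptions:

import numpy as np
import tensorflow as tf  # TF1-style graph mode

# pre-trained GloVe matrix of shape (vocab_size, embed_dim); file name is an assumption
glove_matrix = np.load("glove_matrix.npy").astype(np.float32)

# a Variable (not a placeholder) initialized from GloVe, so the optimizer can update it
embeddings = tf.Variable(glove_matrix, name="embeddings", trainable=True)

word_ids = tf.placeholder(tf.int32, shape=[None, None])         # batches of word-ID sequences
embedded_inputs = tf.nn.embedding_lookup(embeddings, word_ids)  # (batch, time, embed_dim)

# ... build the RNN, loss and train_op = tf.train.AdamOptimizer().minimize(loss) on top ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # loads the GloVe values into the variable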

Using pre-trained word embeddings in tensorflow's seq2seq function

I am building a seq2seq model using functions in TensorFlow's seq2seq.py, which has a function like this:
embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
                      num_encoder_symbols, num_decoder_symbols,
                      embedding_size, output_projection=None,
                      feed_previous=False, dtype=dtypes.float32,
                      scope=None)
However, it seems that this function does not take pre-trained embeddings as input. Is there any way to use pre-trained word embeddings with this function?
There is no parameter you can simply hand them to. Read in your embeddings (make sure the vocabulary IDs match). Then, once you have initialized all variables, find the embedding tensor (iterate through tf.all_variables to find its name) and use tf.assign to overwrite the randomly initialized embeddings with your own.
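A rough sketch of that overwrite step, assuming an existing session sess and a NumPy array pretrained_embeddings whose rows line up with the model's vocabulary IDs and whose shape matches the embedding variable:

import tensorflow as tf  # TF1-style graph mode

# after building the seq2seq graph, initialize all variables as usual
sess.run(tf.global_variables_initializer())

# locate the embedding variable created inside embedding_rnn_seq2seq by its name,
# then overwrite its random initialization with the pre-trained matrix
for var in tf.all_variables():
    if "embedding" in var.name:
        sess.run(tf.assign(var, pretrained_embeddings))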

Python: Train a word2vec model using vectors as input

I am using Python to train a word2vec model and get embeddings for each word in the vocabulary. I have used gensim to do this before, and I have also noticed that such a model can be trained with tools like TensorFlow, Theano, and so on.
However, in these training processes the inputs are just texts, basically in string format, and the words are then mapped to indices for training. In my case, I want to input arrays for training. These arrays can be one-hot encoded vectors or other vectors produced by some designed manipulation.
So, is there an existing tool that trains a word2vec model from vector inputs? If there is no such tool, any recommendations on what to learn so that I can write my own code?
