As I understand it, embedding layers are simply lookup matrices whose weights are learned during optimisation.
Suppose my dataset contains a single categorical variable. For example, I would like to autoencode a sentence of words to itself in order to learn a sentence representation.
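# example model (layer sizes are placeholders, not tuned values)
import tensorflow as tf
inputs = tf.keras.layers.Input(shape=(None,))  # variable-length sequence of integer word ids
embed = tf.keras.layers.Embedding(input_dim=99, output_dim=16)(inputs)  # 99-word vocabulary
# return_sequences=True keeps one output per time step so the next LSTM receives a sequence
encoder = tf.keras.layers.LSTM(32, return_sequences=True)(embed)
decoder = tf.keras.layers.LSTM(16, return_sequences=True)(encoder)  # same width as embed
model = tf.keras.models.Model(inputs, decoder)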
The loss will minimise the difference between the embed and decoder outputs.
However, since the embeddings are themselves learned under the same objective, I think I will end up learning a trivial representation: for example, every word in the embedding matrix is just a vector of ones (or even zeros) and the autoencoder simply returns ones, giving me 100% accuracy in training.
What I would like to do is to learn a meaningful representation of categorical variables.
I have a set of vectors that represent words, and each vector has 300 features, meaning that there are 300 floats per vector. My goal is to reduce the dimensionality, e.g. to 50, so that I can save some space.
How can I apply dimensionality reduction to this vector set using e.g. tensorflow? I couldn't find a method or implementation that takes a list of vectors as input and reduces it.
You might want to look into convolutional neural networks for text processing. CNNs in general are known for dimensionality reduction of the input vectors. They are usually used for image classification but also work on text and sentence classification. What you are looking for is the embedding of an input vector. Quote:
Now that our words have been replaced by numbers, we could simply do one-hot encoding but that would result in an extremely wide input — there are thousands of unique words in the titles dataset. A better approach is to reduce the dimensionality of the input — this is done through an embedding layer (see full code here):
This quote is from TowardsDataScience.
Another resource: AnalyticsVidhya
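For vectors that already exist, a learned embedding layer is not the only option. The sketch below uses plain PCA via NumPy's SVD, which is a different technique from the CNN and embedding-layer approach mentioned above, to project 300-d vectors down to 50-d. The function name `reduce_dimensionality` and the random stand-in data are assumptions for illustration only.

```python
import numpy as np

def reduce_dimensionality(vectors, target_dim=50):
    """Project row vectors onto their top `target_dim` principal components."""
    X = np.asarray(vectors, dtype=np.float64)
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    components = vt[:target_dim]
    return X_centered @ components.T, components, mean

# Hypothetical stand-in for 1000 word vectors with 300 features each.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 300))
reduced, components, mean = reduce_dimensionality(word_vectors, target_dim=50)
```

Keeping `components` and `mean` allows new vectors to be projected the same way later with `(new_vectors - mean) @ components.T`.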
Let's say I have a dataset consisting of a review column with exactly 100 words in each review. Then it may be easy to train my model, as I can simply tokenize each of the 100 words of each review, convert them into a numerical array, and feed it into a Sequential model with input_shape=(1,100). But in the real world, reviews are never the same size. If I use a function such as CountVectorizer, then the structure of the sentence is not preserved, and one-hot encoding may not be efficient enough.
So what is the proper way to preprocess this particular dataset so that I can feed it into a trainable NN?
A common way to represent text as vectors is to use word embeddings. The main idea is that a large text corpus is used to compute vector representations of all words occurring in that corpus. For each review, you could then run the following algorithm to compute its vector representation:
1. For each word in the review, check whether a word embedding exists (in other words, whether that word occurred in the large training corpus), and if it does, add its vector representation to the representation of the review.
2. Once you have summed up the vector representations of all words, compute the average embedding by dividing the summed review vector by the number of words in the document. This gives the final vector representation for that document.
3. This vector can now be fed into a trainable NN.
Before performing steps 1-3, you could also apply more preprocessing: remove stop words such as "and", "or", etc., as they usually carry little meaning, convert words to lower case, and apply other standard NLP (natural language processing) techniques, all of which can affect the vector representation of the reviews. But the key idea is to sum up the word vectors of a review and use the averaged vector as the representation of the review. Because of the averaging, the length of the review no longer matters. Similarly, the dimensionality of the word vectors is fixed for a given embedding model (100D, 200D, ...), so you can experiment to find the most suitable dimensionality.
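The averaging steps above can be sketched roughly as follows. The tiny `embeddings` dictionary, the `STOP_WORDS` set and the function name `review_vector` are hypothetical; in practice the vectors would come from a pretrained model. One assumption to note: this sketch divides by the number of words that actually had an embedding, rather than by the total word count of the review.

```python
import numpy as np

# Hypothetical toy embeddings; real ones would come from a pretrained model.
embeddings = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "boring": np.array([-1.0, 0.0, 1.0]),
}
STOP_WORDS = {"and", "or", "a", "the"}

def review_vector(review, embeddings, dim):
    """Average the embeddings of all known, non-stop words in a review."""
    total = np.zeros(dim)
    count = 0
    for word in review.lower().split():
        if word in STOP_WORDS or word not in embeddings:
            continue  # skip stop words and out-of-vocabulary words
        total += embeddings[word]
        count += 1
    # Assumption: divide by the number of words that contributed a vector.
    return total / count if count else total

short = review_vector("Great movie", embeddings, dim=3)
long = review_vector("A great and great movie or the unknownword movie", embeddings, dim=3)
```

Both reviews map to a vector of the same size even though they differ in length, which is what makes the result usable as a fixed-size network input.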
Note that there are many different models available that compute word embeddings, so you could choose any of them. One that is nicely integrated into Python is word2vec.
And a state-of-the-art model that is currently being used by Google is called BERT.
I was reading the BERT paper and was not clear regarding the inputs to the transformer encoder and decoder.
For learning masked language model (Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Since this is the case, what are the inputs to the transformer encoder and decoder?
Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?
Further, how is the output loss computed? Is it a softmax over only the masked locations? And is the same linear layer used for all masked tokens?
Ah, but you see, BERT does not include a Transformer decoder.
It is only the encoder part, with a classifier added on top.
For masked word prediction, the classifier acts as a decoder of sorts, trying to reconstruct the true identities of the masked words.
Non-masked tokens are not included in the classification task and do not affect the loss.
BERT is also trained to predict whether the second sentence of a pair really follows the first or not.
I do not remember how the two losses are weighted.
I hope this draws a clearer picture.
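To illustrate the point about the loss, here is a minimal NumPy sketch of a cross-entropy computed only at masked positions, with one linear layer shared across all positions. It is not BERT's actual implementation; the sizes, the random `hidden` and `W` arrays, and the function name `masked_lm_loss` are assumptions for illustration.

```python
import numpy as np

def masked_lm_loss(logits, target_ids, is_masked):
    """Mean cross-entropy over masked positions only.

    logits:     (seq_len, vocab_size) scores from a shared output layer
    target_ids: (seq_len,) true token ids
    is_masked:  (seq_len,) boolean, True where the input token was masked
    """
    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(target_ids)), target_ids]
    # Non-masked positions are dropped here, so they cannot affect the loss.
    return token_nll[is_masked].mean()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))   # hypothetical encoder outputs: 6 tokens, width 8
W = rng.normal(size=(8, 20))       # one linear layer for every position, vocabulary of 20
logits = hidden @ W
target_ids = np.array([3, 7, 1, 0, 19, 4])
is_masked = np.array([False, True, False, False, True, False])
loss = masked_lm_loss(logits, target_ids, is_masked)
```

Changing the logits at the non-masked positions leaves `loss` unchanged, which is the behaviour described in the answer.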
I'm trying to build a neural network using pytorch-nlp (https://pytorchnlp.readthedocs.io/en/latest/).
My intent is to build a network like this:
Embedding layer (uses pytorch standard layer and from_pretrained method)
Encoder with LSTM (also uses standard nn.LSTM)
Attention mechanism (uses torchnlp.nn.Attention)
Decoder with LSTM (as encoder)
Linear layer standard
I'm encountering a major problem with the dimensions of the input sentences (each word is a vector), but most importantly with the attention layer: I don't know how to declare it because I need the exact dimensions of the output from the encoder, but the sequences have varying lengths (since sentences have different numbers of words).
I've tried to look at torch.nn.utils.rnn.pad_packed_sequence and torch.nn.utils.rnn.pack_padded_sequence, since they're supported by LSTM, but I cannot find the solution.
Can anyone help me?
EDIT
I thought about padding all sequences to a specific dimension, but I don't want to truncate longer sequences because I want to keep all the information.
You are on the right track with padding all sequences to a specific dimension. You will have to pick a dimension that is larger than "most" of your sentences, but you will need to cut off some of the longest ones. This blog article should help.
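If truncation is not acceptable, one alternative is to pad each batch only up to its own longest sentence and let the packing utilities mentioned in the question handle the padded steps. The sketch below assumes the words are already embedding vectors; the sizes and variable names are hypothetical, and only standard torch calls are used (the torchnlp attention layer is not shown).

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
embed_dim, hidden_dim = 50, 32  # hypothetical sizes

# Three "sentences" of 5, 3 and 7 words; each word is already an embedding vector.
sentences = [torch.randn(n, embed_dim) for n in (5, 3, 7)]
lengths = torch.tensor([s.size(0) for s in sentences])

# Pad to the longest sentence in this batch only, so nothing is truncated.
padded = pad_sequence(sentences, batch_first=True)  # shape (3, 7, 50)

# Packing tells the LSTM to skip the padded time steps.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
packed_out, _ = encoder(packed)
encoder_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 7, 32)

# True where a position holds a real word; usable to mask attention scores.
mask = torch.arange(encoder_out.size(1))[None, :] < lengths[:, None]
```

The last dimension of `encoder_out` is the hidden size whatever the sentence length, so a layer that is declared by feature dimension does not need to know the sequence length in advance; `mask` marks which positions an attention step should ignore.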
Traditionally, it seems that RNNs use logits to predict the next time step in the sequence. In my case I need the RNN to output a word2vec (50 depth) vector prediction. This means that the cost function has to be based on 2 vectors: Y, the actual vector of the next word in the series, and Y_hat, the network prediction.
I've tried using a cosine distance cost function, but the network does not seem to learn (I've let it run for over 10 hours on an AWS P3 and the cost is always around 0.7).
Is such a model possible at all? If so, what cost function should be used?
Cosine distance in TF:
cosine_distance = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(targets, 2), axis=2)
Update:
I am trying to predict a word2vec vector so that during sampling I can pick the next word based on the closest neighbors of the predicted vector.
What is the reason that you want to predict a word embedding? Where are you getting the "ground truth" word embeddings from? With word2vec models you typically re-use the trained word embeddings in future models. If you trained a word2vec model with an embedding size of 50, then you would have 50-d embeddings that you could save and use in future models. If you just want to re-create an existing ground truth word2vec model, then you could just use those values. A typical word2vec setup trains with regular softmax outputs via continuous-bag-of-words or skip-gram and then saves the resulting word embeddings.
If you really do have a reason for building a model that tries to match word2vec, then here are a few suggestions about your loss function. I do not believe that you should be normalizing your outputs or your targets; you probably want those to remain unaffected (the targets are no longer the "ground truth" targets if you have normalized them). Also, it appears you are using dim=0, which has been deprecated and replaced with axis. Did you try different values for dim? This should be the dimension along which the cosine distance is computed, and I think the 0th dimension would be the wrong one (it is most likely the batch dimension). I would try axis=-1 (the last dimension) or axis=1 and see if you observe any difference.
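To make the axis point concrete, here is a small NumPy sketch (not the TensorFlow loss itself) of cosine distance computed along a chosen axis. The shapes and the function name `cosine_distance` are assumptions. Note that the form 1 - sum(a * b) equals the cosine distance only for unit-length vectors, which is why this sketch normalizes inside the function; whether to normalize before calling a library loss depends on what that loss assumes about its inputs.

```python
import numpy as np

def cosine_distance(outputs, targets, axis=-1, eps=1e-8):
    """1 - cosine similarity, computed along `axis`."""
    out_unit = outputs / (np.linalg.norm(outputs, axis=axis, keepdims=True) + eps)
    tgt_unit = targets / (np.linalg.norm(targets, axis=axis, keepdims=True) + eps)
    return 1.0 - np.sum(out_unit * tgt_unit, axis=axis)

rng = np.random.default_rng(0)
# Hypothetical shapes: batch of 4, 10 time steps, 50-d word vectors.
targets = rng.normal(size=(4, 10, 50))
outputs = rng.normal(size=(4, 10, 50))

per_step = cosine_distance(outputs, targets, axis=-1)   # one distance per word vector
wrong_axis = cosine_distance(outputs, targets, axis=0)  # compares across batch items instead
perfect = cosine_distance(3.0 * targets, targets, axis=-1)  # same direction, different length
```

With the last axis, each of the 4 x 10 predicted vectors gets its own distance, and a prediction pointing in exactly the target direction scores approximately zero regardless of its length. With axis 0 the result has one value per (time step, feature) pair, which is not a per-word distance.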
Separately, what is your optimizer/learning rate? If the learning rate is too small then you may not actually be able to move enough in the right direction.