I want to build a recurrent neural network (RNN) in TensorFlow that predicts the next word in a sequence of words. I have looked at several tutorials, e.g. the one of TensorFlow. I know that each word in the training text(s) is mapped to an integer index. However there are still a few things about the input that I don't get:
Networks are trained with batches, e.g. with 128 examples at the same time. Let's say we have 10.000 words in our vocabulary. Is the input to the network a matrix of size (128, sequence_length) or a one-hot encoded tensor (128, sequence_length, 10.000)?
How large is the second dimension, i.e. the sequence length? Do I use one sentence in each row of the batch, padding the sentences that are shorter than others with zeros?
Or can a row correspond to multiple sentences? E.g. can a row stand for "This is a test sentence. How are"? If so, where does the second sentence continue? In the next row of the same batch? Or in the same row in the next batch? How do I guarantee that TensorFlow continues the sentence correctly?
I wasn't able to find answers to these questions even if they are quite simple. I hope someone can help!
Yes. It's 3-dimensional vector (128, sequence_length, 10.000)
Yes. you should pad your sentences to make them same length. AND you can use tf.nn.dynamic_rnn and it can handle sentences of variable length base on tf.while.
There is great article dealt with this problem.
https://danijar.com/variable-sequence-lengths-in-tensorflow/
you can check more detail in Whats the difference between tensorflow dynamic_rnn and rnn?
Possible. but network doesn't know the sentence is connected or not. it just consider one row as one sentence. So, the result will be meaningless.
I hope this answer would help you.
Related
i'm trying to build a neural network using pytorch-nlp (https://pytorchnlp.readthedocs.io/en/latest/).
My intent is to build a network like this:
Embedding layer (uses pytorch standard layer and from_pretrained method)
Encoder with LSTM (also uses standard nn.LSTM)
Attention mechanism (uses torchnlp.nn.Attention)
Decoder siwth LSTM (as encoder)
Linear layer standard
I'm encountering a major problem with the dimensions of the input sentences (each word is a vector) but most importantly with the attention layer : I don't know how to declare it because i need the exact dimensions of the output from the encoder, but the sequences have varying dimensions (corresponding to the fact that sentences have different number of words).
I've tried to look at torch.nn.utils.rnn.pad_packed_sequence and torch.nn.utils.rnn.pack_padded_sequence since they're supported by LSTM, but i cannot find the solution.
Can anyone help me?
EDIT
I thought about padding all sequences to a specific dimension, but I don't want to truncate longer sequences because I want to keep all the information.
You are on the right track with padding all sequences to a specific dimension. You will have to pick a dimension that is larger than "most" of your sentences but you will need to cutoff some sentences. This blog article should help.
If I have to use pretrained word vectors as embedding layer in Neural Networks (eg. say CNN), How do I deal with index 0?
Detail:
We usually start with creating a zero numpy 2D array. Later we fill in the indices of words from the vocabulary.
The problem is, 0 is already the index of another word in our vocabulary (say, 'i' is index at 0). Hence, we are basically initializing the whole matrix filled with 'i' instead of empty words. So, how do we deal with padding all the sentences of equal length?
One easy pop-up in mind is we can use the another digit=numberOfWordsInVocab+1 to pad. But wouldn't that take more size? [Help me!]
One easy pop-up in mind is we can use the another digit=numberOfWordsInVocab+1 to pad. But wouldn't that take more size?
Nope! That's the same size.
a=np.full((5000,5000), 7)
a.nbytes
200000000
b=np.zeros((5000,5000))
b.nbytes
200000000
Edit: Typo
If I have to use pretrained word vectors as embedding layer in Neural
Networks (eg. say CNN), How do I deal with index 0?
Answer
In general, empty entries can be handled via a weighted cost of the model and the targets.
However, when dealing with words and sequential data, things can be a little tricky and there are several things that can be considered. Let's make some assumptions and work with that.
Assumptions
We begin with a pre-trained word2vec model.
We have sequences with varying lengths, with at most max_lenght words.
Details
Word2Vec is a model that learns a mapping (embedding) from discrete variables (word token = word unique id) to a continuous vector space.
The representation in the vector space is such that the cost function (CBOW, Skip-gram, essentially it is predicting word from context in bi-directional way) is minimized on the corpus.
Reading basic tutorials (like Google's word2vec tutorial on Tensorflow tutorials) reveals some details on the algorithm, including negative sampling.
The implementation is a lookup table. It is faster than the alternative one-hot encoding technique, since the dimensions of a one-hot encoded matrix are huge (say 10,000 columns for 10,000 words, n row for n sequential words). So the lookup (hash) table is significantly faster, and it selects rows from the embedding matrix (for row vectors).
Task
Add missing entries (no words) and use it in the model.
Suggestions
If there is some use for the cost of missing data, such as using a prediction from that entry and there is a label for that entry, you can add a new value as suggested (can be the 0 index, but all indexes must move i=i+1 and the embedding matrix should have new row at position 0).
Following the first suggestion, you need to train the added row. You can use negative sampling for the NaN class vs all. I do not suggest it for handling missing values. It is a good trick to handle an "Unknown word" class.
You can weight the cost of those entries by constant 0 for each sample that is shorter that max_length. That is, if we have a sequence of word tokens [0,5,6,2,178,24,0,NaN,NaN], the corresponding weight vector is [1,1,1,1,1,1,1,0,0]
You should worry about re-indexing the words and the cost of it. In memory, there is almost no difference (1 vs N words, N is large). In complexity, it is something that can be later incorporated in the initial tokenize function. The predictions and model complexity is a larger issue and more important requirement from the system.
There are numerous ways to tackle varying lengths (LSTM, RNNs, now we try CNNs and costs tricks). Read state-of-the-art literature on that issue, I'm sure there is much work. For example, see A Convolutional Neural Network for Modelling Sentences paper.
My dataset consists of sentences. Each sentence has a variable length and is initially encoded as a sequence of vocabulary indexes, ie. a tensor of shape [sentence_len]. The batch size is also variable.
I have grouped sentences of similar lengths into buckets and padded where necessary, to bring each sentence in a bucket to the same length.
How could I deal with having both an unknown sentence length AND batch size?
My data provider would tell me what the sentence length is at every batch, but I don't know how to feed that -> the graph is already built at that point. The input is represented with a placeholder x = tf.placeholder(tf.int32, shape=[batch_size, sentence_length], name='x'). I can turn batch_size or sentence_length to None, but not both.
UPDATE: in fact, interestingly, I can set both to None, but I get Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.Note: the next layer is an embedding_lookup.
I'm not sure what this means and how to avoid it. I assume it has something to do with using tf.gather later, which I need to use.
Alternatively is there any other way to achieve what I need?
Thank you.
Unfortunately there is no workaround here unless you provide a tf.Variable() (which is not possible in your case) to the parameter of tf.nn.embedding_lookup()/tf.gather().
This is happening because, When you declare them with a placeholder of shape [None, None], from tf.gather() function tf.IndexedSlices() become a sparse tensor.
I have already done projects facing this warning. What I can tell you that if there is a tf.nn.dynamic_rnn() next to the embedding_lookup then make the parameter named swap_memory of tf.nn.dynamic_rnn() to True. Also to avoid OOM or Resource Exhausted error make the batch size smaller (test for different batch size).
There are already some good explanation on this. Please refer to the following Question of the Stackoverflow.
Tensorflow dense gradient explanation?
Traditionally it seems that RNNs use logits to predict next time step in the sequence. In my case I need the RNN to output a word2vec (50 depth) vector prediction. This means that the cost function has be based off 2 vectors: Y the actual vector of the next word in the series and Y_hat, the network prediction.
I've tried using a cosine distance cost function but the network does not seem to learn (I've let it run other 10 hours on a AWS P3 and the cost is always around 0.7)
Is such a model possible at all ? If so what cost function should be used ?
Cosine distance in TF:
cosine_distance = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(targets, 2), axis=2)
Update:
I am trying to predict a word2vec so during sampling I could pick next word based on the closest neighbors of the predicted vector.
What is the reason that you want to predict a word embedding? Where are you getting the "ground truth" word embeddings from? For word2vec models you typically will re-use the trained word-embeddings in future models. If you trained a word2vec model with an embedding size of 50, then you would have 50-d embeddings that you could save and use in future models. If you just want to re-create an existing ground truth word2vec model, then you could just use those values. Typical word2vec would be having regular softmax outputs via continuous-bag-of-words or skip-gram and then saving the resulting word embeddings.
If you really do have a reason for trying to generate a model that creates tries to match word2vec, then looking at your loss function here are a few suggestions. I do not believe that you should be normalizing your outputs or your targets -- you probably want those to remain unaffected (the targets are no longer the "ground truth" targets if you have normalized them. Also, it appears you are using dim=0 which has now been deprecated and replaced with axis. Did you try different values for dim? This should represent the dimension along which to compute the cosine distance and I think that the 0th dimension would be the wrong dimension (as this likely should be the batch size. I would try with values of axis=-1 (last dimension) or axis=1 and see if you observe any difference.
Separately, what is your optimizer/learning rate? If the learning rate is too small then you may not actually be able to move enough in the right direction.
I want to create a Neural Network in Keras for converting handwriting into computer letters.
My first step is to convert a sentence into an Array. My Array has the shape (1, number of letters,27). Now I want to input it in my Deep Neural Network and train.
But how do I input it properly if the dimension doesn't fit those from my image? And how do I achieve that my predict function gives me an output array of (1, number of letters,27)?
Seems like you are attempting to do Handwritten Recognition or similarly Optical Character Recognition or OCR. This is quite a broad field, and there are many ways to proceed. Even though, one approach I suggest is the following:
It is commonly known that Neural Networks have fixed size inputs, that is if you build it to take, say, inputs of shape (28,28,1) then the model will expect that shape as their inputs. Therefore, having a dimension in your samples that depends on the number of letters in a sentence (something variable) is not recommended, as you will not be able to train a model in such way with NNs.
Training such a model could be possible if you design it to predict one character at a time, instead a whole sentence that can have different lengths, and then group the predicted characters. The steps you could try to achieve this could be:
Obtain training samples for the characters you wish to recognize (like the MNIST database for example), and design and train your model to predict one character at a time.
Take the image with writing to classify and pass a Sliding Window over it that matches your expected input size (say a 28x28 window). Then, classify each of those windows to a character. Instead of Sliding Window, you could try isolating your desired features somehow and just classify those 28x28 segments instead.
Group the predicted characters somehow so you get words (probably grouping those separated by empty spaces) or do whatever you want with the predictions.
You can also try searching for tutorials or guides for Handwriting recognition like this one I have found quite useful. Hope this helps you get on track, good luck.