How to structure an LSTM neural network for classification - python

I have data consisting of various conversations between two people. Each sentence has some type of classification. I am attempting to use an NLP net to classify each sentence of the conversation. I tried a convolutional net and got decent results (not groundbreaking, though). I figured that since this is a back-and-forth conversation, an LSTM net may produce better results, because what was previously said may have a large impact on what follows.
If I follow the structure above, I would assume that I am doing a many-to-many. My data looks like.
X_train = [[sentence 1],
[sentence 2],
[sentence 3]]
Y_train = [[0],
[1],
[0]]
Data has been processed using word2vec. I then design my network as follows:
model = Sequential()
model.add(Embedding(len(vocabulary), embedding_dim,
                    input_length=X_train.shape[1]))
model.add(LSTM(88))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, Y_train, verbose=2, nb_epoch=3, batch_size=15)
I assume that this setup will feed one batch of sentences in at a time. However, if shuffle is not set to False in model.fit, it's receiving shuffled batches, so why is an LSTM net even useful in this case? From my research on the subject, to achieve a many-to-many structure one would need to change the LSTM layer to
model.add(LSTM(88, return_sequences=True))
and the output layer would need to be...
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
When switching to this structure I get an error on the input size. I'm unsure of how to reformat the data to meet this requirement, and also how to edit the embedding layer to receive the new data format.
Any input would be greatly appreciated. Or if you have any suggestions on a better method, I am more than happy to hear them!

Your first attempt was good. The shuffling takes place between sentences: it only shuffles the training samples so that they don't always come in the same order. The words inside the sentences are not shuffled.
Or maybe I didn't understand the question correctly?
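To illustrate the shuffling point, here is a minimal NumPy sketch (not Keras itself) of what sample-level shuffling amounts to. The toy arrays and the fixed seed are assumptions made up for the example: rows change places, but the word order inside each row stays intact and each label travels with its sentence.

```python
import numpy as np

# Toy data: 4 "sentences" of 3 word ids each (made-up values).
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])
y = np.array([0, 1, 0, 1])

rng = np.random.default_rng(0)
order = rng.permutation(len(X))  # a new order for the samples (rows)

# The same permutation is applied to inputs and labels,
# so only whole sentences move; words inside a sentence do not.
X_shuffled = X[order]
y_shuffled = y[order]
```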
EDIT :
After a better understanding of the question, here is my proposition.
Data preparation : You slice your corpus in blocks of n sentences (they can overlap).
You should then have a shape like (number_blocks_of_sentences, n, number_of_words_per_sentence), so basically a list of 2D arrays, each containing a block of n sentences. n shouldn't be too big, because LSTMs struggle with very long sequences during training (vanishing gradient).
Your targets should be an array of shape (number_blocks_of_sentences, n, 1) so also a list of 2D arrays containing the class of each sentence in your block of sentences.
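The data preparation described above could be sketched like this in NumPy. The helper name make_blocks, the stride parameter, and the toy corpus are assumptions for illustration only. The sketch also assumes all sentences come from one conversation and are already padded to the same length.

```python
import numpy as np

def make_blocks(X, y, n, stride=1):
    """Slice consecutive sentences into (possibly overlapping) blocks of n."""
    starts = range(0, len(X) - n + 1, stride)
    X_blocks = np.stack([X[s:s + n] for s in starts])
    y_blocks = np.stack([y[s:s + n] for s in starts])
    # Add a trailing axis so targets have shape (blocks, n, 1).
    return X_blocks, y_blocks[..., np.newaxis]

# Toy corpus: 10 sentences, each padded to 6 word ids, one binary label each.
X = np.arange(60).reshape(10, 6)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

X_blocks, y_blocks = make_blocks(X, y, n=4, stride=2)
print(X_blocks.shape)  # (4, 4, 6)
print(y_blocks.shape)  # (4, 4, 1)
```

With stride smaller than n the blocks overlap, which gives more training samples from the same corpus. With several conversations you would probably want to build the blocks per conversation so that no block crosses a boundary.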
Model :
n_sentences = X_train.shape[1] # number of sentences in a sample (n)
n_words = X_train.shape[2] # number of words in a sentence
model = Sequential()
# Reshape the input because Embedding only accepts shape (batch_size, input_length) so we just transform list of sentences in huge list of words
model.add(Reshape((n_sentences * n_words,),input_shape = (n_sentences, n_words)))
# Embedding layer - output shape will be (batch_size, n_sentences * n_words, embedding_dim) so each sample in the batch is a big 2D array of words embedded
model.add(Embedding(len(vocabulary), embedding_dim, input_length = n_sentences * n_words ))
# Recreate the sentence shaped array
model.add(Reshape((n_sentences, n_words, embedding_dim)))
# Encode each sentence - output shape is (batch_size, n_sentences, 88)
model.add(TimeDistributed(LSTM(88)))
# Go over lines and output hidden layer which contains info about previous sentences - output shape is (batch_size, n_sentences, hidden_dim)
model.add(LSTM(hidden_dim, return_sequences=True))
# Predict output binary class - output shape is (batch_size, n_sentences, 1)
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
...
This should be a good start.
I hope this helps

Related

What are the input and the output of the Transformer?

I have questions about the Google implementation of the Transformer here.
In the train_step(input, tar) function, inp is a 256*40 tensor and the transformer returns a 256*39*8089 tensor. Is each row in inp a sentence? I expected the Transformer to take a batch of sentences (a batch of 2D matrices in which each row is a word), calculate attention weights and outputs at once, and then pass them to the decoder (see here). However, I cannot see that being implemented in the code.
In the train_step(input, tar) function, "predictions" is a 256*39*8089 tensor. Is it [batch size, max number of words in a sentence, target vocab size]? How does loss_function calculate the loss when this format differs from tar_real, which is [256 * 39]?
In def evaluate(inp_sentence): why does each iteration send the Transformer the entire encoder input? What I expect is that the encoder calculates the attention weights and output once, and then inside the for loop we send the output of the attention and the predictions so far.
Thank you
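On the second question, a rank-3 prediction tensor and a rank-2 integer target are compatible if the loss is a sparse cross-entropy: at each position the target id selects one entry along the vocabulary axis. The sketch below is a NumPy illustration of that idea, not the linked implementation's loss_function. The name sparse_cross_entropy, the pad_id=0 convention and the masking are assumptions.

```python
import numpy as np

def sparse_cross_entropy(logits, targets, pad_id=0):
    """Mean cross-entropy over non-padding positions.

    logits:  (batch, seq_len, vocab_size) unnormalised scores
    targets: (batch, seq_len) integer token ids
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct token at every position.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # Ignore padding positions when averaging (assumed convention).
    mask = targets != pad_id
    return -(picked * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
batch, seq_len, vocab = 4, 5, 11
logits = rng.normal(size=(batch, seq_len, vocab))
targets = rng.integers(1, vocab, size=(batch, seq_len))
targets[:, -1] = 0  # pretend the last position is padding
loss = sparse_cross_entropy(logits, targets)
```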

How do you make predictions with a stateful LSTM?

Okay, so I trained a stateful LSTM characterwise on https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt. It didn't seem to do too badly in terms of accuracy, but now I want to generate my own Shakespeare works.
The question is, how do I go about actually generating predictions from it?
In particular, the model's batch input shape is (128, 128, 63) and the output shape is (128, 128, 63). (The first number is the batch size, the second number is the length of the prediction input and output, and the third number is the number of distinct characters in the text.)
For example, I would like to:
Generate various predictions starting from empty text
Generate predictions starting from a small starting text (such as "PYRULEZ:")
This should be possible given how LSTMs work.
Here's a snippet of the code used to generate and fit the model:
model = Sequential()
model.add(LSTM(dataY.shape[2], batch_input_shape=(128, dataX.shape[1], dataX.shape[2]), return_sequences = True, stateful=True, activation = "softmax"))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['acc'])
model.fit(dataX, dataY, epochs = 1, batch_size = 128, verbose=1, shuffle = False)
Looking at other code samples, it appears I'll need to modify this somehow, but I'm not sure how, specifically.
I can include the whole code sample if that would be helpful. It is self contained.
Simple. Put your input into model.predict() with appropriate parameters (see documentation), concatenate input and output (the model predicts on progressively longer chains). Depending on how you organised training, output will add one character at a time. To be more precise, if you train sequence to sequence shifted by one, your output sequence will ideally be your input sequence shifted by one element; PYRULEZ -> YRULEZ* Hence you need to take the last character of the output and add it to your prior (input) sequence.
If you want long lines of text, you might want to limit the length of your sequence to some number of characters in the loop. Most of the long-term dependencies in the text are carried through the state vector of the LSTM cell anyway (not something you interact with directly).
Pseudocode-ish:
for counter in range(output_length):
    output = model.predict(input_)
    # slice with -1: to keep the time axis so the shapes match for concatenation
    input_ = np.concatenate((input_, output[:, -1:, :]), axis=1)
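A slightly fuller sketch of the same loop, runnable without a trained model. Everything here is a stand-in: fake_predict is a hypothetical replacement for model.predict that always predicts "current character id + 1", and n_chars and max_context are made-up values. It shows the two ideas from the answer: take only the last time step of the output, and cap the context length that is fed back in.

```python
import numpy as np

n_chars = 5      # size of the toy alphabet (assumption)
max_context = 8  # cap on how many characters are fed back in

def fake_predict(x):
    """Hypothetical stand-in for model.predict.

    Takes (batch, time, n_chars) one-hot input and returns the same shape,
    predicting "next character id = current id + 1" at every step.
    """
    ids = x.argmax(axis=-1)
    return np.eye(n_chars)[(ids + 1) % n_chars]

def generate(seed_ids, length, predict=fake_predict):
    ids = list(seed_ids)
    for _ in range(length):
        # One-hot encode the most recent characters: (1, time, n_chars).
        context = np.eye(n_chars)[ids[-max_context:]][np.newaxis]
        probs = predict(context)
        # Only the last time step is the prediction for the next character.
        ids.append(int(probs[0, -1].argmax()))
    return ids

print(generate([0, 1], length=6))  # [0, 1, 2, 3, 4, 0, 1, 2]
```

With a real model you might sample from the predicted distribution instead of taking the argmax, which tends to give more varied text. A model built with a fixed batch_input_shape would also need inputs of exactly that shape.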

Input to Bidirectional LSTM in tensorflow

Normally all inputs fed to BiLSTM are of shape [batch_size, time_steps, input_size].
However, I'm working on a problem of Automatic Grading of an Essay in which there's an extra dimension called number of sentences in each essay. So in my case, a typical batch after embedding using word2vec, is of shape [2,16,25,300].
Here, there are 2 essays in each batch (batch_size=2), each essay has 16 sentences, each sentence is 25 words long(time_step=25) and I'm using word2vec of 300 dimensions (input_size=300).
So clearly I need to loop this batch over dimension 16 somehow such that the shape of the input becomes [2,25,300] in each iteration. I have tried for a long time but I haven't been able to find a way to do it. For example, if you make a loop over tf.nn.bidirectional_dynamic_rnn(), it gives an error in the second iteration saying that the tf.nn.bidirectional_dynamic_rnn() kernel already exists. I can't directly make a for loop over sentence_dimension because those are tensors of shape [None,None,None,300]; I gave values just for the sake of simplicity. Is there any other way to do it? Thanks. Please note that I am not using Keras or any other framework.
Here's a sample encoding layer for reference.
def bidirectional_lstm(input_data):
    cell = tf.nn.rnn_cell.LSTMCell(num_units=200, state_is_tuple=True)
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=cell,
                                                      cell_bw=cell,
                                                      dtype=tf.float32,
                                                      inputs=input_data)
    return tf.concat(outputs, 2)
embedded is of shape [2,16,25,300].
And here's a sample input
for batch_i, texts_batch in enumerate(get_batches(X_train, batch_size)):  ## get_batches() is yielding batches one by one
    ## X_train is of shape [2,16,25] ([batch_size,sentence_length,num_words])
    ## word2vec
    embeddings = tf.nn.embedding_lookup(word_embedding_matrix, texts_batch)
    ## embeddings shape is now [2,16,25,300] ([batch_size,sentence_length,num_words,word2vec_dim])
    ## Now I need some kind of loop here to loop it over sentence dimension. Can't use for loop since these are all placeholders with dimensions None
    ## ??? output = bidirectional_lstm(embeddings,3,200,0.7) ??? This would be correct if there was no sentence dimension.
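One common way to avoid the loop altogether is to merge the essay and sentence axes, run the encoder once on a 3-D tensor, and split the axes again afterwards. The NumPy sketch below only demonstrates the reshaping. encode_sentences is a hypothetical stand-in for bidirectional_lstm, and the concrete sizes are the ones from the question. In TensorFlow the same idea would presumably use tf.reshape with dynamic sizes taken from tf.shape.

```python
import numpy as np

def encode_sentences(x):
    """Hypothetical stand-in for a sentence encoder such as bidirectional_lstm.

    Maps (batch, time_steps, input_size) to an array of the same shape.
    """
    return x * 2.0

essays, sentences, words, dim = 2, 16, 25, 300
embeddings = np.random.default_rng(0).normal(size=(essays, sentences, words, dim))

# Merge the essay and sentence axes so every sentence becomes one batch item.
flat = embeddings.reshape(-1, words, dim)                 # (32, 25, 300)
encoded = encode_sentences(flat)                          # (32, 25, 300)
# Split the merged axis again to recover the essay structure.
restored = encoded.reshape(essays, sentences, words, -1)  # (2, 16, 25, 300)
```

Because the reshape keeps sentences in order, item 16 of the flat batch is the first sentence of the second essay, so nothing gets mixed between essays.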

Keras Bidirectional LSTM - Layer grouping

While working to implement a paper (Dialogue Act Sequence Labeling using Hierarchical encoder with CRF) using Keras, I need to implement a specific Bidirectional LSTM architecture.
I have to train the network on the concept of a Conversation. Conversations are composed of Utterances, and Utterances are composed of Words. Words are N-dimensional vectors. The model represented in the paper first reduces each Utterance to a single M-dimensional vector. To achieve this, it uses a Bidirectional LSTM layer. Let's call this layer A.
(For simplicity, let's assume that each Utterance has a length of |U| and each Conversation has a length of |C|)
Each Utterance is input to a Bi-LSTM layer with U timesteps, and the output of the last timestep is taken. The input size is (|U|, N), and the output size is (1, M).
This Bi-LSTM layer should be applied separately/simultaneously to each Utterance in the Conversation. Note that, since the network takes as input the entire Conversation, the dimensions for a single input to the network would be (|C|, |U|, N).
As the paper describes, I intend to take each utterance (i.e. each (|U|, N) slice) of that input and feed it to a Bi-LSTM layer with |U| units. As there are |C| Utterances in a Conversation, this implies that there should be a total of |C| x |U| Bi-LSTM units, grouped into |C| different partitions, one for each Utterance. There should be no connection between the |C| groups of units. Once processed, the output of each of those |C| groups of Bidirectional LSTM units will then be fed into another Bi-LSTM layer, say B.
How is it possible to feed specific portions of the input only to specific portions of the layer A, and make sure that they are not interconnected? (i.e. the portion of Bi-LSTM units used for an Utterance should not be connected to the Bi-LSTM units used for another Utterance)
Is it possible to achieve this through keras.models.Sequential, or is there a specific way to achieve this using Functional API?
Here is what I have tried so far:
# ...
model = Sequential()
model.add(Bidirectional(LSTM(C * U), input_shape = (C, U, N),
merge_mode='concat'))
model.add(GlobalMaxPooling1D())
model.add(Bidirectional(LSTM(n, return_sequences = True), merge_mode='concat'))
# ...
model.compile(loss = loss_function,
optimizer = optimizer,
metrics=['accuracy'])
However, this code is currently receiving the following error:
ValueError: Input 0 is incompatible with layer bidirectional_1: expected ndim=3, found ndim=4
More importantly, the code above obviously does not do the grouping I mentioned. I am looking for a way to enhance the model as I described above.
Finally, below is the figure of the model I described above. It may possibly help clarify some of the written content narrated above. The layer tagged as "Utterance layer" is what I called the layer A. As you can see in the figure, each utterance u_i in the figure is composed of words w_j, which are N-dimensional vectors. (You may omit the embedding layer for the purposes of this question) Assuming, for simplicity, that each u_i has equal number of Words, then each group of Bidirectional LSTM nodes in the Utterance Layer will have an input size of (|U|, N). Yet, since there are |C| such utterances u_i in a Conversation, the dimensions of the entire input will be (|C|, |U|, N).
I'll create a net for what I see in the picture. For now I'm ignoring the "units" part I mentioned in my comment to your question.
This model does exactly what is shown in the picture. All utterances are completely separate from start to end.
model = Sequential()
#You have an extra time dimension that should be kept as is
#So we add a `TimeDistributed` wrapper to the first layers
model.add(TimeDistributed(Embedding(dictionaryLength,N), input_shape=(C,U)))
#This is the utterance layer. It works in "word steps", keeping "utterance steps" untouched
model.add(TimeDistributed(Bidirectional(LSTM(M//2, return_sequences=False))))
#Is the pooling really demanded by the article?
#Or was it an attempt to remove one of the time dimensions?
#Not adding it here because I used `return_sequences=False`
model.add(Bidirectional(LSTM(someSize//2,return_sequences=True)))
model.add(Dense(anotherSize)) #is this a CRF layer???
model.summary()
Notice that in every Bidirectional layer, I divided the output size by two, so it's important that M and someSize are even numbers.
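To make the shapes concrete, here is a small NumPy trace of the utterance layer. It is only a sketch: a plain tanh RNN stands in for the LSTM, and the weights and sizes are made up. It shows the two points above: the forward and backward halves of size M//2 are concatenated into M, and the same weights are applied to each utterance independently, which is what the TimeDistributed wrapper does.

```python
import numpy as np

def rnn_last_state(x, w_in, w_rec):
    """Run a plain tanh RNN over x (steps, features) and return the last state.

    A simplified stand-in for an LSTM, used only to trace shapes.
    """
    h = np.zeros(w_rec.shape[0])
    for step in x:
        h = np.tanh(w_in @ step + w_rec @ h)
    return h

def bidirectional_encode(utterance, params_fw, params_bw):
    fw = rnn_last_state(utterance, *params_fw)        # reads left to right
    bw = rnn_last_state(utterance[::-1], *params_bw)  # reads right to left
    return np.concatenate([fw, bw])                   # M//2 + M//2 = M values

C, U, N, M = 3, 5, 4, 6  # utterances, words per utterance, word dim, output dim
rng = np.random.default_rng(0)

def make_params():
    return rng.normal(size=(M // 2, N)), rng.normal(size=(M // 2, M // 2))

params_fw, params_bw = make_params(), make_params()
conversation = rng.normal(size=(C, U, N))

# Same weights, applied to every utterance separately.
encoded = np.stack([bidirectional_encode(u, params_fw, params_bw)
                    for u in conversation])
print(encoded.shape)  # (3, 6)
```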

Keras sequence prediction with multiple simultaneous sequences

My question is very similar to what it seems this post is asking, although that post doesn't pose a satisfactory solution. To elaborate, I am currently using keras with tensorflow backend and a sequential LSTM model. The end goal is I have n time-dependent sequences with equal time steps (the same number of points on each sequence and the points are all the same time apart) and I would like to feed all n sequences into the same network so it can use correlations between the sequences to better predict the next step for each sequence. My ideal output would be an n-element 1-D array with array[0] corresponding to the next-step prediction for sequence_1, array[1] for sequence_2, and so on.
My inputs are sequences of single values, so each of n inputs can be parsed into a 1-D array.
I was able to get a working model for each sequence independently using the code at the end of this guide by Jakob Aungiers, although my difficulty is adapting it to accept multiple sequences at once and correlate between them (i.e. be analyzed in parallel). I believe the issue is related to the shape of my input data, which is currently in the form of a 4-D numpy array because of how Jakob's Guide splits the inputs into sub-sequences of 30 elements each to analyze incrementally, although I could also be completely missing the target here. My code (which is mostly Jakob's, not trying to take credit for anything that isn't mine) presently looks like this:
As-is this complains with "ValueError: Error when checking target: expected activation_1 to have shape (None, 4) but got array with shape (4, 490)", I'm sure there are plenty of other issues but I'd love some direction on how to achieve what I'm describing. Anything stick out immediately to anyone? Any help you could give will be greatly appreciated.
Thanks!
-Eric
Keras is already prepared to work with batches containing many sequences, there is no secret at all.
There are two possible approaches, though:
You input your entire sequences (all steps at once) and predict n results
You input only one step of all sequences and predict the next step in a loop
Suppose:
nSequences = 30
timeSteps = 50
features = 1 #(as you said: single values per step)
outputFeatures = 1
First approach: stateful=False:
inputArray = arrayWithShape((nSequences,timeSteps,features))
outputArray = arrayWithShape((nSequences,outputFeatures))
input_shape = (timeSteps,features)
#use layers like this:
LSTM(units) #if the first layer in a Sequential model, add the input_shape
#if you want to return the same number of steps (like a new sequence parallel to the input), use return_sequences=True
Train like this:
model.fit(inputArray,outputArray,....)
Predict like this:
newStep = model.predict(inputArray)
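For the first approach, the arrays could be built as in the NumPy sketch below. The sine-wave data is invented purely for illustration. Each sequence contributes all but its last point as input and its last point as the value to predict.

```python
import numpy as np

nSequences, timeSteps, features = 30, 50, 1

# Toy data: one sine wave per sequence, each with its own phase,
# timeSteps + 1 points long so the final point can serve as the target.
t = np.arange(timeSteps + 1)
phases = np.linspace(0, np.pi, nSequences)[:, np.newaxis]
series = np.sin(0.1 * t + phases)        # (30, 51)

inputArray = series[:, :-1, np.newaxis]  # (30, 50, 1): all but the last point
outputArray = series[:, -1:]             # (30, 1): the point to predict
```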
Second approach: stateful=True:
inputArray = sameAsBefore
outputArray = inputArray[:,1:] #one step after input array
inputArray = inputArray[:,:-1] #eliminate the last step
batch_input = (nSequences, 1, features) #stateful layers require the batch size
#use layers like this:
LSTM(units, stateful=True) #if the first layer in a Sequential model, add input_shape
Train like this:
model.reset_states() #you need this in stateful=True models
#if you don't reset states,
#the stateful model will think that your inputs are new steps of the same previous sequences
for step in range(inputArray.shape[1]): #for each time step
    model.fit(inputArray[:,step:step+1], outputArray[:,step:step+1],shuffle=False,...)
Predict like this:
model.reset_states()
predictions = np.empty(inputArray.shape)
for step in range(inputArray.shape[1]): #for each time step
    predictions[:,step] = model.predict(inputArray[:,step:step+1])
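The slicing used in the second approach can be checked without a model. This NumPy sketch, with random data as a placeholder, builds the shifted input/target arrays and the per-step slices that the loops above pass to model.fit and model.predict.

```python
import numpy as np

nSequences, timeSteps, features = 30, 50, 1
full = np.random.default_rng(0).normal(size=(nSequences, timeSteps, features))

outputArray = full[:, 1:]  # targets: steps 1 .. end
inputArray = full[:, :-1]  # inputs:  steps 0 .. end-1

# One (input, target) pair per time step, each shaped (nSequences, 1, features),
# matching the batch shape a stateful layer would expect here.
pairs = [(inputArray[:, s:s + 1], outputArray[:, s:s + 1])
         for s in range(inputArray.shape[1])]
```

Each target slice is the next input slice, which is why the model's state has to be carried from one step to the next and reset only when a new set of sequences starts.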
