I've trained an LSTM model (built with Keras and TF) on multiple batches of 7 samples with 3 features each, with a shape the like below sample (numbers below are just placeholders for the purpose of explanation), each batch is labeled 0 or 1:
Data:
[
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
...
]
i.e: batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m73))
Target:
[
[1]
[0]
[1]
...
]
On my production environment data is a stream of samples with 3 features ([1,2,3],[1,2,3]...). I would like to stream each sample as it arrives to my model and get the intermediate probability without waiting for the entire batch (7) - see the animation below.
One of my thoughts was padding the batch with 0 for the missing samples,
[[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[1,2,3]] but that seems to be inefficient.
Will appreciate any help that will point me in the right direction of both saving the LSTM intermediate state in a persistent way, while waiting for the next sample and predicting on a model trained on a specific batch size with partial data.
Update, including model code:
opt = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=10e-8, decay=0.001)
model = Sequential()
num_features = data.shape[2]
num_samples = data.shape[1]
first_lstm = LSTM(32, batch_input_shape=(None, num_samples, num_features),
return_sequences=True, activation='tanh')
model.add(first_lstm)
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(LSTM(16, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LeakyReLU())
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=opt,
metrics=['accuracy', keras_metrics.precision(),
keras_metrics.recall(), f1])
Model Summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 32) 6272
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 100, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 100, 16) 3136
_________________________________________________________________
dropout_2 (Dropout) (None, 100, 16) 0
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 100, 16) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1600) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 1601
=================================================================
Total params: 11,009
Trainable params: 11,009
Non-trainable params: 0
_________________________________________________________________
I think there might be an easier solution.
If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True
Warning: your model has layers that act on the length dimension !!
The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.
So, before applying my answer below, fix your model to not use the Flatten layer. Instead, it can just remove the return_sequences=True for the last LSTM layer.
The following code fixed that and also prepares a few things to be used with the answer below:
def createModel(forTraining):
#model for training, stateful=False, any batch size
if forTraining == True:
batchSize = None
stateful = False
#model for predicting, stateful=True, fixed batch size
else:
batchSize = 1
stateful = True
model = Sequential()
first_lstm = LSTM(32,
batch_input_shape=(batchSize, num_samples, num_features),
return_sequences=True, activation='tanh',
stateful=stateful)
model.add(first_lstm)
model.add(LeakyReLU())
model.add(Dropout(0.2))
#this is the last LSTM layer, use return_sequences=False
model.add(LSTM(16, return_sequences=False, stateful=stateful, activation='tanh'))
model.add(Dropout(0.2))
model.add(LeakyReLU())
#don't add a Flatten!!!
#model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
if forTraining == True:
compileThisModel(model)
With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.
The usage of a stateful model as a solution for your question
First, train this new model again, because it has no Flatten layer:
trainingModel = createModel(forTraining=True)
trainThisModel(trainingModel)
Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but marking stateful=True in all its LSTM layers. And we should copy the weights from the trained model.
Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.
predictingModel = createModel(forTraining=False)
predictingModel.set_weights(trainingModel.get_weights())
And voilà. Just predict the outputs of the model with a single step:
pseudo for loop as samples arrive to your model:
prob = predictingModel.predict_on_batch(sample)
#where sample.shape == (1, 1, 3)
When you decide that you reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model thinking it should be mended at the end of the previous one.
Saving and loading states
Just get and set them, saving with h5py:
def saveStates(model, saveName):
f = h5py.File(saveName,'w')
for l, lay in enumerate(model.layers):
#if you have nested models,
#consider making this recurrent testing for layers in layers
if isinstance(lay,RNN):
for s, stat in enumerate(lay.states):
f.create_dataset('states_' + str(l) + '_' + str(s),
data=K.eval(stat),
dtype=K.dtype(stat))
f.close()
def loadStates(model, saveName):
f = h5py.File(saveName, 'r')
allStates = list(f.keys())
for stateKey in allStates:
name, layer, state = stateKey.split('_')
layer = int(layer)
state = int(state)
K.set_value(model.layers[layer].states[state], f.get(stateKey))
f.close()
Working test for saving/loading states
import h5py, numpy as np
from keras.layers import RNN, LSTM, Dense, Input
from keras.models import Model
import keras.backend as K
def createModel():
inp = Input(batch_shape=(1,None,3))
out = LSTM(5,return_sequences=True, stateful=True)(inp)
out = LSTM(2, stateful=True)(out)
out = Dense(1)(out)
model = Model(inp,out)
return model
def saveStates(model, saveName):
f = h5py.File(saveName,'w')
for l, lay in enumerate(model.layers):
#if you have nested models, consider making this recurrent testing for layers in layers
if isinstance(lay,RNN):
for s, stat in enumerate(lay.states):
f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))
f.close()
def loadStates(model, saveName):
f = h5py.File(saveName, 'r')
allStates = list(f.keys())
for stateKey in allStates:
name, layer, state = stateKey.split('_')
layer = int(layer)
state = int(state)
K.set_value(model.layers[layer].states[state], f.get(stateKey))
f.close()
def printStates(model):
for l in model.layers:
#if you have nested models, consider making this recurrent testing for layers in layers
if isinstance(l,RNN):
for s in l.states:
print(K.eval(s))
model1 = createModel()
model2 = createModel()
model1.predict_on_batch(np.ones((1,5,3))) #changes model 1 states
print('model1')
printStates(model1)
print('model2')
printStates(model2)
saveStates(model1,'testStates5')
loadStates(model2,'testStates5')
print('model1')
printStates(model1)
print('model2')
printStates(model2)
Considerations on the aspects of the data
In your first model (if it is stateful=False), it considers that each sequence in m is individual and not connected to the others. It also considers that each batch contains unique sequences.
If this is not the case, you might want to train the stateful model instead (considering that each sequence is actually connected to the previous sequence). And then you would need m batches of 1 sequence. -> m x (1, 7 or None, 3).
If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m*7*3)).
In any Keras RNN you can set the
return_sequences flag to True to become the intermediate states, i.e., for every batch, instead of the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.
But you would be getting all at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the throughput whilst the batch is being processed. This may be even more constrained if you are using any of the CUDNN-optimized variants. What you can do is basically to regard your batch as 7 succesive batches of shape (m*1*3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set statefulto True and let the object keep track of it.
The following Python2+Keras example should exactly represent what you want. Specifically:
allowing to save the whole LSTM intermediate state in a persistent way
while waiting for the next sample
and predicting on a model trained on a specific batch size that may be arbitrary and unknown.
For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and from which you don't know much about. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me in the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.
from __future__ import print_function
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
import numpy as np
# globals
SEQ_LEN = 7
HID_DIMS = 32
OUTPUT_DIMS = 3 # outputs are assumed to be scalars
##############################################################################
# define the model to be trained on a fixed batch size:
# assume many-to-one training setup (otherwise set return_sequences=True)
TRAIN_BATCH_SIZE = 20
x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
dense = Dense(OUTPUT_DIMS, activation='linear')
m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
m_train.summary()
# a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned
# train the model on a fixed batch size and save it
print(">> INFERECE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
m_train.save("trained_lstm.h5")
print(">> INFERECE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
##############################################################################
# Now, although we aren't training anymore, we want to do step-wise predictions
# that do alter the inner state of the model, and keep track of that.
m_trained = load_model("trained_lstm.h5")
print(">> INFERECE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
# now define an analogous model that allows a flexible batch size for inference:
x_in = Input(shape=[SEQ_LEN, 3])
h_in = Input(shape=[HID_DIMS])
c_in = Input(shape=[HID_DIMS])
pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])
# Let's confirm that this model is able to load the trained parameters:
# first, check that the performance from scratch is not good:
print(">> INFERENCE BEFORE SWAPPING MODEL:")
predictions, hs, zs, cs = m_inference.predict([batch123,
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
batch_size=1)
print(predictions)
# import state from the trained model state and check that it works:
print(">> INFERENCE AFTER SWAPPING MODEL:")
for layer in m_trained.layers:
if "lstm" in layer.name:
m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
elif "dense" in layer.name:
m_inference.get_layer("dense_infer").set_weights(layer.get_weights())
predictions, _, _, _ = m_inference.predict([batch123,
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
batch_size=1)
print(predictions)
# finally perform granular predictions while keeping the recurrent activations. Starting the sequence with zeros is a common practice, but depending on how you trained, you might have an <END_OF_SEQUENCE> character that you might want to propagate instead:
h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
for i in range(len(batch123)):
# about output shape: https://keras.io/layers/recurrent/#rnn
# h,z,c hold the network's throughput: h is the proper LSTM output, c is the accumulator and cc is (probably) the candidate
current_input = batch123[i:i+1] # the length of this feed is arbitrary, doesn't have to be 1
pred, h, cc, c = m_inference.predict([current_input, h, c])
print("input:", current_input)
print("output:", pred)
print(h.shape, cc.shape, c.shape)
raw_input("do something with your prediction and hidden state and press any key to continue")
Additional information:
Since we have two forms of state persistency:
1. The saved/trained parameters of the model that are the same for each sequence
2. The a, c states that evolve throughout the sequences and may be "restarted"
It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c weights are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what do they mean. They can be inspected as follows:
for w in lstm.weights:
print(w.name, w.shape)
In our case (32 hidden states) returns the following:
lstm_1/kernel:0 (3, 128)
lstm_1/recurrent_kernel:0 (32, 128)
lstm_1/bias:0 (128,)
We observe a dimensionality of 128. Why is that? this link describes the Keras LSTM implementation as follows:
The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.
Which explains the 128=32*4 being the parameters for the affine transformation happening inside each one of the 4 gates, concatenated:
The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element
The matrix of shape (32, 128) (named recurrent_kernel) handles the input for the last recurrent state h.
The vector of shape (128,) (named bias), as usual in any other NN setup.
Note: This answer assumes that your model in training phase is not stateful. You must understand what an stateful RNN layer is and make sure that the training data has the corresponding properties of statefulness. In short it means there is a dependency between the sequences, i.e. one sequence is the follow-up to another sequence, which you want to consider in your model. If your model and training data is stateful then I think other answers which involve setting stateful=True for the RNN layers from the beginning are simpler.
Update: No matter the training model is stateful or not, you can always copy its weights to the inference model and enable statefulness. So I think solutions based on setting stateful=True are shorter and better than mine. Their only drawback is that the batch size in these solutions must be fixed.
Note that the output of a LSTM layer over a single sequence is determined by its weight matrices, which are fixed, and its internal states which depends on the previous processed timestep. Now to get the output of LSTM layer for a single sequence of length m, one obvious way is to feed the entire sequence to the LSTM layer in one go. However, as I stated earlier, since its internal states depends on the previous timestep, we can exploit this fact and feed that single sequence chunk by chunk by getting the state of LSTM layer at the end of processing a chunk and pass it to the LSTM layer for processing the next chunk. To make it more clear, suppose the sequence length is 7 (i.e. it has 7 timesteps of fixed-length feature vectors). As an example, it is possible to process this sequence like this:
Feed the timesteps 1 and 2 to the LSTM layer; get the final state (call it C1).
Feed the timesteps 3, 4 and 5 and state C1 as the initial state to the LSTM layer; get the final state (call it C2).
Feed the timesteps 6 and 7 and state C2 as the initial state to the LSTM layer; get the final output.
That final output is equivalent to the output produced by the LSTM layer if we had feed it the entire 7 timesteps at once.
So to realize this in Keras, you can set the return_state argument of LSTM layer to True so that you can get the intermediate state. Further, don't specify a fixed timestep length when defining the input layer. Instead use None to be able to feed the model with sequences of arbitrary length which enables us to process each sequence progressively (it's fine if your input data in training time are sequences of fixed-length).
Since you need this chuck processing capability in inference time, we need to define a new model which shares the LSTM layer used in training model and can get the initial states as input and also gives the resulting states as output. The following is a general sketch of it could be done (note that the returned state of LSTM layer is not used when training the model, we only need it in test time):
# define training model
train_input = Input(shape=(None, n_feats)) # note that the number of timesteps is None
lstm_layer = LSTM(n_units, return_state=True)
lstm_output, _, _ = lstm_layer(train_input) # note that we ignore the returned states
classifier = Dense(1, activation='sigmoid')
train_output = classifier(lstm_output)
train_model = Model(train_input, train_output)
# compile and fit the model on training data ...
# ==================================================
# define inference model
inf_input = Input(shape=(None, n_feats))
state_h_input = Input(shape=(n_units,))
state_c_input = Input(shape=(n_units,))
# we use the layers of previous model
lstm_output, state_h, state_c = lstm_layer(inf_input,
initial_state=[state_h_input, state_c_input])
output = classifier(lstm_output)
inf_model = Model([inf_input, state_h_input, state_c_input],
[output, state_h, state_c]) # note that we return the states as output
Now you can feed the inf_model as much as the timesteps of a sequence are available right now. However, note that initially you must feed the states with vectors of all zeros (which is the default initial value of states). For example, if the sequence length is 7, a sketch of what happens when new data stream is available is as follows:
state_h = np.zeros((1, n_units,))
state_c = np.zeros((1, n_units))
# three new timesteps are available
outputs = inf_model.predict([timesteps, state_h, state_c])
out = output[0,0] # you may ignore this output since the entire sequence has not been processed yet
state_h = outputs[0,1]
state_c = outputs[0,2]
# after some time another four new timesteps are available
outputs = inf_model.predict([timesteps, state_h, state_c])
# we have processed 7 timesteps, so the output is valid
out = output[0,0] # store it, pass it to another thread or do whatever you want to do with it
# reinitialize the state to make them ready for the next sequence chunk
state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))
# to be continued...
Of course you need to do this in some kind of loop or implement a control flow structure to process the data stream, but I think you get what the general idea looks like.
Finally, although your specific example is not a sequence-to-sequence model, but I highly recommend to read the official Keras seq2seq tutorial which I think one can learn a lot of ideas from it.
As far as I know, because of the static graph in Tensorflow, there is no efficient way to feed inputs with different length from the training input length.
Padding is the official way to work around with that, but it is less efficient and memory consuming. I suggest you look into Pytorch, which will be trivial to fix your problem.
There are a lot of great posts to build lstm with Pytorch, and you will understand the benefit of dynamic graph once you see them.
Related
I have a keras model that is trained on a sequence of data with a single label. I'm assuming a categorically encoded feature which passes through an embedding layer before a GRU layer.
samples, timesteps, features = 2000, 10, 1
inputs_1 = np.random.randint(1, 50, [samples, timesteps, features]).astype(np.float32)
labels = np.random.randint(0, 2, [samples, 1])
# Input
input_ = Input(shape=(None,))
# Embeddings
emb = Embedding(input_dim=int(50),
output_dim=20,
input_length=(None,),
mask_zero=False,
name="cat_feat_0" + "_emb")(input_)
gru = GRU(32,
activation="tanh",
dropout=0,
recurrent_dropout=0,
go_backwards=False,
return_sequences=False,
name="gru_cat")(emb)
y = Dense(10, activation = "tanh")(gru)
y = Dropout(0.4)(y)
y = Dense(1, activation = "sigmoid")(y)
model = Model(inputs=input_, outputs=y)
model.compile(loss=BCE_Last_Event,
optimizer=Adam(beta_1=0.9, beta_2=0.999),
metrics=["accuracy"])
model.predict(inputs_1).shape
When I predict my data, the output shape is (2000,1) given that it predicts a single label for the sequence. Would it be possible to output the scores for every event in the sequence such that the model returns predictions of shape (2000, 10, 1)?
I know I can return the sequence in the GRU layer which will be propagated. However, I still only have a single label so the loss function would be erroneous. My current thinking is either:
Create a new model which returns the sequences using the same weights as the trained model
Wrap the model in a TimeDistributed layer such that it predicts every event in the sequence.
I am concerned that the second solution will be error-prone as it will only take as input a single event throughout the entire length of the sequence, rather than the entire sequence for its prediction. Is this thinking correct?
What are the best solutions?
The problem is the following. I have a categorical prediction task of vocabulary size 25K. On one of them (input vocab 10K, output dim i.e. embedding 50), I want to introduce a trainable weight matrix for a matrix multiplication between the input embedding (shape 1,50) and the weights (shape(50,128)) (no bias) and the resulting vector score is an input for a prediction task along with other features.
The crux is, I think that the trainable weight matrix varies for each input, if I simply add it in. I want this weight matrix to be common across all inputs.
I should clarify - by input here I mean training examples. So all examples would learn some example specific embedding and be multiplied by a shared weight matrix.
After every so many epochs, I intend to do a batch update to learn these common weights (or use other target variables to do multiple output prediction)
LSTM? Is that something I should look into here?
With the exception of an Embedding layer, layers apply to all examples in the batch.
Take as an example a very simple network:
inp = Input(shape=(4,))
h1 = Dense(2, activation='relu', use_bias=False)(inp)
out = Dense(1)(h1)
model = Model(inp, out)
This a simple network with 1 input layer, 1 hidden layer and an output layer. If we take the hidden layer as an example; this layer has a weights matrix of shape (4, 2,). At each iteration the input data which is a matrix of shape (batch_size, 4) is multiplied by the hidden layer weights (feed forward phase). Thus h1 activation is dependent on all samples. The loss is also computed on a per batch_size basis. The output layer has a shape (batch_size, 1). Given that in the forward phase all the batch samples affected the values of the weights, the same is true for backdrop and gradient updates.
When one is dealing with text, often the problem is specified as predicting a specific label from a sequence of words. This is modelled as a shape of (batch_size, sequence_length, word_index). Lets take a very basic example:
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
sequence_length = 80
emb_vec_size = 100
vocab_size = 10_000
def make_model():
inp = Input(shape=(sequence_length, 1))
emb = Embedding(vocab_size, emb_vec_size)(inp)
emb = Reshape((sequence_length, emb_vec_size))(emb)
h1 = Dense(64)(emb)
recurrent = LSTM(32)(h1)
output = Dense(1)(recurrent)
model = Model(inp, output)
model.compile('adam', 'mse')
return model
model = make_model()
model.summary()
You can copy and paste this into colab and see the summary.
What this example is doing is:
Transform a sequence of word indices into a sequence of word embedding vectors.
Applying a Dense layer called h1 to all the batches (and all the elements in the sequence); this layer reduces the dimensions of the embedding vector. It is not a typical element of a network to process text (in isolation). But this seemed to match your question.
Using a recurrent layer to reduce the sequence into a single vector per example.
Predicting a single label from the "sentence" vector.
If I get the problem correctly you can reuse layers or even models inside another model.
Example with a Dense layer. Let's say you have 10 Inputs
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a List with (X,) shape
inputs = [Input(shape = (X,),name='input_{}'.format(k)) for k in
range(10)]
# defining a common Dense layer
D = Dense(64, name='one_layer_to_rule_them_all')
nets = [D(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
This code is not going to work if the inputs have different shapes. The first call to D defines its properties. In this example, outputs are set directly to nets. But of course you can concatenate, stack, or whatever you want.
Now if you have some trainable model you can use it instead of the D:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a List with (X,) shape
inputs = [Input(shape = (X,),name='input_{}'.format(k)) for k in
range(10)]
# defining a shared model with the same weights for all inputs
nets = [special_model(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
The weights of this model are shared among all inputs.
I have created a stacked keras decoder model using the following loop:
# Create the encoder
# Define an input sequence.
encoder_inputs = keras.layers.Input(shape=(None, num_input_features))
# Create a list of RNN Cells, these are then concatenated into a single layer with the RNN layer.
encoder_cells = []
for hidden_neurons in hparams['encoder_hidden_layers']:
encoder_cells.append(keras.layers.GRUCell(hidden_neurons,
kernel_regularizer=regulariser,
recurrent_regularizer=regulariser,
bias_regularizer=regulariser))
encoder = keras.layers.RNN(encoder_cells, return_state=True)
encoder_outputs_and_states = encoder(encoder_inputs)
# Discard encoder outputs and only keep the states. The outputs are of no interest to us, the encoder's job is to create
# a state describing the input sequence.
encoder_states = encoder_outputs_and_states[1:]
print(encoder_states)
if hparams['encoder_hidden_layers'][-1] != hparams['decoder_hidden_layers'][0]:
encoder_states = Dense(hparams['decoder_hidden_layers'][0])(encoder_states[-1])
# Create the decoder, the decoder input will be set to zero
decoder_inputs = keras.layers.Input(shape=(None, 1))
decoder_cells = []
for hidden_neurons in hparams['decoder_hidden_layers']:
decoder_cells.append(keras.layers.GRUCell(hidden_neurons,
kernel_regularizer=regulariser,
recurrent_regularizer=regulariser,
bias_regularizer=regulariser))
decoder = keras.layers.RNN(decoder_cells, return_sequences=True, return_state=True)
# Set the initial state of the decoder to be the output state of the encoder. his is the fundamental part of the
# encoder-decoder.
decoder_outputs_and_states = decoder(decoder_inputs, initial_state=encoder_states)
# Only select the output of the decoder (not the states)
decoder_outputs = decoder_outputs_and_states[0]
# Apply a dense layer with linear activation to set output to correct dimension and scale (tanh is default activation for
# GRU in Keras
decoder_dense = keras.layers.Dense(num_output_features,
activation='linear',
kernel_regularizer=regulariser,
bias_regularizer=regulariser)
decoder_outputs = decoder_dense(decoder_outputs)
model = keras.models.Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
model.compile(optimizer=optimiser, loss=loss)
model.summary()
This setup works when I have a single layer encoder and a single layer decoder where the number of neurons is the same. However it does not work when the number of layers of the decoder is more than one.
I get the following error message:
ValueError: An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 48), ndim=2)]; however `cell.state_size` is (48, 58)
My decoder_layers list contains the entries [48, 58]. Therefore my RNN layer that the decoder is comprised of, is a stacked GRU where the first GRU contains 48 neurons and the second contains 58. I would like to set the initial state of the first GRU. I run the states through a Dense layer so that the shape is compatible with the first layer of the decoder. The error message indicates that I am trying to set the initial state of both the first layer and the second layer when I pass the initial state keyword to the decoder RNN layer. Is this correct behaviour? Normally I would set the initial state of the first decoder layer (not built using a cell structure like this) which then would just feed it's inputs into subsequent layers. Is there a way to achieve such behaviour in keras by default when creating a keras.layers.RNN from a list of GRUCell of LSTMCells?
In my own experiments, your intial_states should have batch_size as its first dimension. In other words, each element in one batch may have a different initial state. From your code, I think you missed this dimension.
Although not new to Machine Learning, I am still relatively new to Neural Networks, more specifically how to implement them (In Keras/Python). Feedforwards and Convolutional architectures are fairly straightforward, but I am having trouble with RNNs.
My X data consists of variable length sequences, each data-point in that sequence having 26 features. My y data, although of variable length, each pair of X and y have the same length, e.g:
X_train[0].shape: (226,26)
y_train[0].shape: (226,)
X_train[1].shape: (314,26)
y_train[1].shape: (314,)
X_train[2].shape: (189,26)
y_train[2].shape: (189,)
And my objective is to classify each item in the sequence into one of 39 categories.
What I can gather thus far from reading example code, is that we do something like the following:
encoder_inputs = Input(shape=(None, 26))
encoder = GRU(256, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)
decoder_inputs = Input(shape=(None, 39))
decoder_gru= GRU(256, return_sequences=True)
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(39, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
Which makes sense to me, because each of the sequences have different lengths.
So with a for loop that loops over all sequences, we use None in the input shape of the first GRU layer because we are unsure what the sequence length will be, and then return the hidden state state_h of that encoder. With the second GRU layer returning sequences, and the initial state being the state returned from the encoder, we then pass the outputs to a final softmax activation layer.
Obviously something is flawed here because I get:
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
File "/usr/local/lib/python3.6/dist-
packages/tensorflow/python/framework/ops.py", line 458, in __iter__
"Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is
enabled. To iterate over this tensor use tf.map_fn.
This link points to a proposed solution, but I don't understand why you would add encoder states to a tuple for as many layers you have in the network.
I'm really looking for help in being able to successfully write this RNN to do this task, but also understanding. I am very interested in RNNs and want to understand them more in depth so I can apply them to other problems.
As an extra note, each sequence is of shape (sequence_length, 26), but I expand the dimension to be (1, sequence_length, 26) for X and (1, sequence_length) for y, and then pass them in a for loop to be fit, with the decoder_target_data one step ahead of the current input:
for idx in range(X_train.shape[0]):
X_train_s = np.expand_dims(X_train[idx], axis=0)
y_train_s = np.expand_dims(y_train[idx], axis=0)
y_train_s1 = np.expand_dims(y_train[idx+1], axis=0)
encoder_input_data = X_train_s
decoder_input_data = y_train_s
decoder_target_data = y_train_s1
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
epochs=50,
validation_split=0.2)
With other networks I have wrote (FeedForward and CNN), I specify the model by adding layers on top of Keras's Sequential class. Because of the inherent complexity of RNNs I see the general format of using Keras's Input class like above and retrieving hidden states (and cell states for LSTM) etc... to be logical, but I have also seen them built from using Keras's Sequential Class. Although these were many to one type tasks, I would be interested in how you would write it that way too.
The problem is that the decoder_gru layer does not return its state, therefore you should not use _ as the return value for the state (i.e. just remove , _):
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
Since the input and output lengths are the same and there is a one to one mapping between the elements of input and output, you can alternatively construct the model this way:
inputs = Input(shape=(None, 26))
gru = GRU(64, return_sequences=True)(inputs)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Now you can make this model more complex (i.e. increase its capacity) by stacking multiple GRU layers on top of each other:
inputs = Input(shape=(None, 26))
gru = GRU(256, return_sequences=True)(inputs)
gru = GRU(128, return_sequences=True)(gru)
gru = GRU(64, return_sequences=True)(gru)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Further, instead of using GRU layers, you can use LSTM layers which has more representational capacity (of course this may come at the cost of increasing computational cost). And don't forget that when you increase the capacity of the model you increase the chance of overfitting as well. So you must keep that in mind and consider solutions that prevent overfitting (e.g. adding regularization).
Side note: If you have a GPU available, then you can use CuDNNGRU (or CuDNNLSTM) layer instead, which has been optimized for GPUs so it runs much faster compared to GRU.
I am brand new to Deep-Learning so I'm reading though Deep Learning with Keras by Antonio Gulli and learning a lot. I want to start using some of the concepts. I want to try and implement a neural network with a 1-dimensional convolutional layer that feeds into a bidirectional recurrent layer (like the paper below). All the tutorials or code snippets I've encountered do not implement anything remotely similar to this (e.g. image recognition) or use an older version of keras with different functions and usage.
What I'm trying to do is a variation of this paper:
(1) convert DNA sequences to one-hot encoding vectors; ✓
(2) use a 1 dimensional convolutional neural network; ✓
(3) with max pooling; ✓
(4) send the output to a bidirectional RNN; ⓧ
(5) classify the input;
I cannot figure out how to get the shapes to match up on the Bidirectional RNN. I can't even get an ordinary RNN to work at this stage. How can I restructure the incoming layers to work with a Bidirectional RNN?
Note:
The original code came from https://github.com/uci-cbcl/DanQ/blob/master/DanQ_train.py but I simplified the output layer to just do binary classification. This processed was described (kind of) in https://github.com/fchollet/keras/issues/3322 but I cannot get it to work with the updated keras. The original code (and the 2nd link) work on a very large dataset so I am generating some fake data to illustrate the concept. They are also using an older version of keras where key functionality changes have been made since then.
# Imports
import tensorflow as tf
import numpy as np
from tensorflow.python.keras._impl.keras.layers.core import *
from tensorflow.python.keras._impl.keras.layers import Conv1D, MaxPooling1D, SimpleRNN, Bidirectional, Input
from tensorflow.python.keras._impl.keras.models import Model, Sequential
# Set up TensorFlow backend
K = tf.keras.backend
K.set_session(tf.Session())
np.random.seed(0) # For keras?
# Constants
NUMBER_OF_POSITIONS = 40
NUMBER_OF_CLASSES = 2
NUMBER_OF_SAMPLES_IN_EACH_CLASS = 25
# Generate sequences
https://pastebin.com/GvfLQte2
# Build model
# ===========
# Input Layer
input_layer = Input(shape=(NUMBER_OF_POSITIONS,4))
# Hidden Layers
y = Conv1D(100, 10, strides=1, activation="relu", )(input_layer)
y = MaxPooling1D(pool_size=5, strides=5)(y)
y = Flatten()(y)
y = Bidirectional(SimpleRNN(100, return_sequences = True, activation="tanh", ))(y)
y = Flatten()(y)
y = Dense(100, activation='relu')(y)
# Output layer
output_layer = Dense(NUMBER_OF_CLASSES, activation="softmax")(y)
model = Model(input_layer, output_layer)
model.compile(optimizer="adam", loss="categorical_crossentropy", )
model.summary()
# ~/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/layers/recurrent.py in build(self, input_shape)
# 1049 input_shape = tensor_shape.TensorShape(input_shape).as_list()
# 1050 batch_size = input_shape[0] if self.stateful else None
# -> 1051 self.input_dim = input_shape[2]
# 1052 self.input_spec[0] = InputSpec(shape=(batch_size, None, self.input_dim))
# 1053
# IndexError: list index out of range
You don't need to restructure anything at all to get the output of a Conv1D layer into an LSTM layer.
So, the problem is simply the presence of the Flatten layer, which destroys the shape.
These are the shapes used by Conv1D and LSTM:
Conv1D: (batch, length, channels)
LSTM: (batch, timeSteps, features)
Length is the same as timeSteps, and channels is the same as features.
Using the Bidirectional wrapper won't change a thing either. It will only duplicate your output features.
Classifying.
If you're going to classify the entire sequence as a whole, your last LSTM must use return_sequences=False. (Or you may use some flatten + dense instead after)
If you're going to classify each step of the sequence, all your LSTMs should have return_sequences=True. You should not flatten the data after them.