Sequence labeling with Keras - ignore first K predictions - python

I'm trying to create a sequence labeler for some long sequences.
Due to the nature of the problem, I don't expect the network to perform very well at the beginning of the sequence, due to a lack of historical data.
How can I train the network to ignore the first k predictions?
My network structure is as follows:
model = Sequential()
model.add(LSTM(10,return_sequences = True, input_shape = (None, 5)))
model.add(LSTM(10,return_sequences = True))
model.add(LSTM(10,return_sequences = True))
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
model.compile(optimizer = adam,loss = 'binary_crossentropy',metrics = ['acc'])
And I train it by doing: model.fit(X,y) where X.shape is (m,n,5) and y.shape is (m,n,1) where m is number of sequences and n is the sequence length
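One common way to approach this is with per-timestep weights that are zero for the first k steps, so those predictions contribute nothing to the loss. The NumPy sketch below only illustrates the idea; the helper names `temporal_weights` and `weighted_bce` are made up for this example, and the averaging it uses is not necessarily the exact reduction Keras applies.

```python
import numpy as np

def temporal_weights(m, n, k):
    """Per-timestep weights: 0 for the first k steps, 1 afterwards."""
    w = np.ones((m, n), dtype=np.float32)
    w[:, :k] = 0.0
    return w

def weighted_bce(y_true, y_pred, w, eps=1e-7):
    """Binary cross-entropy averaged over the weighted timesteps only."""
    y_true = y_true.reshape(w.shape)
    p = np.clip(y_pred.reshape(w.shape), eps, 1 - eps)
    bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float((bce * w).sum() / w.sum())

m, n, k = 4, 12, 3
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(m, n, 1)).astype(np.float32)
pred = rng.uniform(0.05, 0.95, size=(m, n, 1))
w = temporal_weights(m, n, k)

loss = weighted_bce(y, pred, w)

# Changing the first k predictions leaves the loss untouched.
pred_changed = pred.copy()
pred_changed[:, :k] = 0.5
loss_changed = weighted_bce(y, pred_changed, w)
```

In Keras, an array like `w` can be passed as `sample_weight` to `model.fit`; older Keras versions also need `sample_weight_mode='temporal'` in `model.compile` for weights of shape (m, n).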

Related

Image sequence detection with Keras, Convolutional and Stateful Neural Network

I am trying to write a pretty complicated neural network (at least for me) in keras that needs to combine both a common CNN structure and an LSTM/GRU layer.
Basically, I have a dataset of climatological maps of the Mediterranean sea, each map details the wind, pressure and other parameters. I am studying Medicanes (Mediterranean hurricanes) and my goal is to create a neural network that can classify each map with a label zero if there is no trace of such hurricanes or one if the map contains one.
In order to achieve that I need a network with two parts:
feature extractor (normal CNN).
temporal layer (LSTM/GRU).
The main cause of this is that each map is correlated with the previous one because the formation and life cycle of a Medicane can take several days to complete.
Important note: the dataset is too big to be uploaded all at once so I have to work one batch at a time.
I am working with Keras and I found it pretty challenging to adapt its standard framework to my needs so I have come up with some peculiar flow to feed my data into the network.
In particular, I found it hard to pass both my batch size and my time-step parameter to the GRU layer using a more standard alternative.
This is what I tried:
I am positively sure I have overcomplicated the task, but, as I said I am not very proficient with Keras and TensorFlow.
The main problem was that I could not find a way to import the data both in a batch (for RAM reasons) and in a sequence of 10-15 pictures (to be used as the time steps in the GRU layer).
I solved this problem by importing batches of 120 maps in order (no shuffle) and I created a way to turn these batches into the sequence of images I needed then I proceeded to re-batch the sequences and feed them to the model manually.
Data Import
batch_size=120
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
"./Figures_1/Train",
validation_split=None,
subset=None,
labels="inferred",
label_mode="binary",
color_mode="rgb",
interpolation='bilinear',
batch_size=batch_size,
image_size=(600, 600),
shuffle=False,
seed=123
)
Get a sequence of Images
Here, I break down the 120 map batches into sequences of 60 observations, and I return each sequence one at a time.
sequence_lengh=60
def sequence_x(train_dataset):
x_numpy = np.asarray(list(map(lambda x: x[0], tfds.as_numpy(train_dataset))),dtype=object)
for element in range(0,x_numpy.shape[0]):
for i in range(0, x_numpy.shape[0],sequence_lengh):
x_seq = x_numpy[element][i:i+sequence_lengh]
yield x_seq
def sequence_y(train_dataset):
y_numpy = np.asarray(list(map(lambda x: x[1], tfds.as_numpy(train_dataset))),dtype=object)
for element in range(0,y_numpy.shape[0]):
for i in range(0, y_numpy.shape[0],sequence_lengh):
y_seq = y_numpy[element][i:i+sequence_lengh]
yield y_seq
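If I read the two generators above correctly, the inner loop steps over the number of batches (`x_numpy.shape[0]`) rather than the length of each batch, which may not be what was intended. A minimal sketch of splitting one batch into consecutive fixed-length sequences is shown below, with tiny stand-in images; `split_into_sequences` is a hypothetical helper, not part of Keras or TensorFlow.

```python
import numpy as np

def split_into_sequences(batch, sequence_length):
    """Yield consecutive, non-overlapping slices of `batch` along axis 0."""
    for start in range(0, len(batch), sequence_length):
        yield batch[start:start + sequence_length]

# Stand-in for one batch of 120 maps; tiny 2x2x3 "images" keep it cheap.
batch = np.arange(120 * 2 * 2 * 3, dtype=np.float32).reshape(120, 2, 2, 3)
sequences = list(split_into_sequences(batch, 60))
```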
CNN Model
I build the CNN model based on a pre-trained DenseNet
from keras.layers import TimeDistributed, GRU
def build_convnet(shape=(600, 600, 3)):
inputs = keras.Input(shape = shape)
x = inputs
# preprocessing
x = keras.applications.densenet.preprocess_input(x)
#Convbase
x = convBase(x)
x = layers.Flatten()(x)
# Fine tuning
x = keras.layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = keras.layers.Dense(512, activation='relu')(x)
x = keras.layers.GlobalMaxPool2D()
return x
GRU Model
I build the time part of the network with a GRU layer
def action_model(shape=(15, 600, 600, 3), nbout=15):
# Create our convnet with (112, 112, 3) input shape
convnet = build_convnet(shape[1:]) #[1:]
# then create our final model
model = keras.Sequential()
# add the convnet with (5, 112, 112, 3) shape
model.add(TimeDistributed(convnet, input_shape=shape))
# here, you can also use GRU or LSTM
model.add(GRU(64))
# and finally, we make a decision network
model.add(Dense(1024, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(15, activation='softmax'))
return model
Transfer Learning
I retrain a part of the GRU
convBase = DenseNet121(include_top=False, weights=None, input_shape=(600,600,3), pooling="avg")
for layer in convBase.layers:
if 'conv5' in layer.name:
layer.trainable = True
for layer in convBase.layers:
if 'conv4' in layer.name:
layer.trainable = True
Model Compile
Model compilation ( image size= 600x600x3)
INSHAPE=(15, 600, 600, 3) # (5, 112, 112, 3)
model = action_model(INSHAPE, 1)
optimizer = keras.optimizers.Adam(0.001)
model.compile(
optimizer,
'categorical_crossentropy',
metrics='accuracy'
)
Model Fit
Here I manually batch my data. I turn a (60, 600, 600, 3) array into a (4, 15, 600, 600, 3) array, meaning 4 batches, each one containing a 15-map-long sequence.
epochs = 10
for value in range(0, epochs):
train_x, train_y = sequence_x(train_ds), sequence_y(train_ds)
val_x, val_y = sequence_x(validation_ds), sequence_y(validation_ds)
for i in range(0,278): #
x = next(train_x, "none")
y = next(train_y, "none")
if (x!="none" or y!="none"):
if (np.any(x) and np.any(y)):
x_stack = np.stack((x[:15], x[15:30], x[30:45], x[45:]))
y_stack = np.stack((y[:15], y[15:30], y[30:45], y[45:]))
y_stack=y_stack.reshape(4,15)
model.fit(x=x_stack, y=y_stack,
validation_data=None,
batch_size=None,
shuffle=False
)
else:
continue
else:
continue
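The manual stacking in the loop above can probably be expressed as a single reshape, as long as the sequence length divides the number of maps evenly. A small sketch with stand-in shapes (2x2x3 instead of 600x600x3):

```python
import numpy as np

# Small stand-in: 60 "maps" of 2x2x3 instead of 600x600x3.
x = np.arange(60 * 2 * 2 * 3, dtype=np.float32).reshape(60, 2, 2, 3)
y = np.arange(60, dtype=np.float32).reshape(60, 1)

steps = 15
x_stack = x.reshape(-1, steps, *x.shape[1:])   # (4, 15, 2, 2, 3)
y_stack = y.reshape(-1, steps)                 # (4, 15)

# Same result as stacking the four slices by hand.
x_manual = np.stack((x[:15], x[15:30], x[30:45], x[45:]))
```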
The idea is to get a model that, when presented with a sequence of images, can categorize each one of them with a 0 or a 1 if they have a Medicane or not.
The model does compile without any errors, but the results it provides are horrible.
What am I doing incorrectly? Is there a more effective way to write all of this?

Keras Many-to-one predicting the entire sequence

I have a keras model that is trained on a sequence of data with a single label. I'm assuming a categorically encoded feature which passes through an embedding layer before a GRU layer.
samples, timesteps, features = 2000, 10, 1
inputs_1 = np.random.randint(1, 50, [samples, timesteps, features]).astype(np.float32)
labels = np.random.randint(0, 2, [samples, 1])
# Input
input_ = Input(shape=(None,))
# Embeddings
emb = Embedding(input_dim=int(50),
output_dim=20,
input_length=(None,),
mask_zero=False,
name="cat_feat_0" + "_emb")(input_)
gru = GRU(32,
activation="tanh",
dropout=0,
recurrent_dropout=0,
go_backwards=False,
return_sequences=False,
name="gru_cat")(emb)
y = Dense(10, activation = "tanh")(gru)
y = Dropout(0.4)(y)
y = Dense(1, activation = "sigmoid")(y)
model = Model(inputs=input_, outputs=y)
model.compile(loss=BCE_Last_Event,
optimizer=Adam(beta_1=0.9, beta_2=0.999),
metrics=["accuracy"])
model.predict(inputs_1).shape
When I predict my data, the output shape is (2000,1) given that it predicts a single label for the sequence. Would it be possible to output the scores for every event in the sequence such that the model returns predictions of shape (2000, 10, 1)?
I know I can return the sequence in the GRU layer which will be propagated. However, I still only have a single label so the loss function would be erroneous. My current thinking is either:
Create a new model which returns the sequences using the same weights as the trained model
Wrap the model in a TimeDistributed layer such that it predicts every event in the sequence.
I am concerned that the second solution will be error-prone as it will only take as input a single event throughout the entire length of the sequence, rather than the entire sequence for its prediction. Is this thinking correct?
What are the best solutions?
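To illustrate the first option (same weights, but returning the whole sequence), here is a NumPy sketch in which a plain tanh RNN stands in for the GRU and all weights are random placeholders rather than trained values. It shows that applying the same output head to every hidden state gives one score per event, each based on the whole prefix up to that event, and that the last score equals the many-to-one output.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, units = 10, 4, 8

# Hypothetical "trained" weights of a plain tanh RNN plus a sigmoid head.
W = rng.normal(size=(features, units))
U = rng.normal(size=(units, units)) * 0.1
b = np.zeros(units)
w_out = rng.normal(size=(units, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(x, return_sequences):
    """x: (timesteps, features). Returns the last state or all states."""
    h = np.zeros(units)
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U + b)
        states.append(h)
    return np.stack(states) if return_sequences else h

x = rng.normal(size=(timesteps, features))
score_last = sigmoid(run_rnn(x, return_sequences=False) @ w_out)  # shape (1,)
score_all = sigmoid(run_rnn(x, return_sequences=True) @ w_out)    # shape (10, 1)
```

Keep in mind that only the final score was ever supervised during training, so the intermediate scores are a by-product and may be poorly calibrated.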

Extract Keras concatenated layer of 3 embedding layers, but it's an empty list

I am constructing a Keras Classification model with Multiple Inputs (3 actually) to predict one single output. Specifically, my 3 inputs are:
Actors
Plot Summary
Relevant Movie Features
Output:
Genre tags
Python Code (create the multiple input keras)
def kera_multy_classification_model():
sentenceLength_actors = 15
vocab_size_frequent_words_actors = 20001
sentenceLength_plot = 23
vocab_size_frequent_words_plot = 17501
sentenceLength_features = 69
vocab_size_frequent_words_features = 20001
model = keras.Sequential(name='Multy-Input Keras Classification model')
actors = keras.Input(shape=(sentenceLength_actors,), name='actors_input')
plot = keras.Input(shape=(sentenceLength_plot,), name='plot_input')
features = keras.Input(shape=(sentenceLength_features,), name='features_input')
emb1 = layers.Embedding(input_dim = vocab_size_frequent_words_actors + 1,
# based on keras documentation input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
output_dim = Keras_Configurations_model1.EMB_DIMENSIONS,
# int >= 0. Dimension of the dense embedding
embeddings_initializer = 'uniform',
# Initializer for the embeddings matrix.
mask_zero = False,
input_length = sentenceLength_actors,
name="actors_embedding_layer")(actors)
encoded_layer1 = layers.LSTM(100)(emb1)
emb2 = layers.Embedding(input_dim = vocab_size_frequent_words_plot + 1,
output_dim = Keras_Configurations_model2.EMB_DIMENSIONS,
embeddings_initializer = 'uniform',
mask_zero = False,
input_length = sentenceLength_plot,
name="plot_embedding_layer")(plot)
encoded_layer2 = layers.LSTM(100)(emb2)
emb3 = layers.Embedding(input_dim = vocab_size_frequent_words_features + 1,
output_dim = Keras_Configurations_model3.EMB_DIMENSIONS,
embeddings_initializer = 'uniform',
mask_zero = False,
input_length = sentenceLength_features,
name="features_embedding_layer")(features)
encoded_layer3 = layers.LSTM(100)(emb3)
merged = layers.concatenate([encoded_layer1, encoded_layer2, encoded_layer3])
layer_1 = layers.Dense(Keras_Configurations_model1.BATCH_SIZE, activation='relu')(merged)
output_layer = layers.Dense(Keras_Configurations_model1.TARGET_LABELS, activation='softmax')(layer_1)
model = keras.Model(inputs=[actors, plot, features], outputs=output_layer)
print(model.output_shape)
print(model.summary())
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'])
Model's Structure
My problem:
After successfully fitting and training the model on some training data, I would like to extract the embeddings of this model for later use. My main approach before using a multiple input keras model, was to train 3 different keras models and extract 3 different embedding layers of shape 100. Now that I have the multiple input keras model, I want to extract the concatenated embedding layer with output shape (None, 300).
Although, when I try to use this python command:
embeddings = model_4.layers[9].get_weights()
print(embeddings)
or
embeddings = model_4.layers[9].get_weights()[0]
print(embeddings)
I get either an empty list (1st code sample) or an IndexError: list index out of range (2nd code sample).
Thank you in advance for any advice or help on this matter. Feel free to ask on the comments any additional information that I may have missed, to make this question more complete.
Note: Python code and model's structure have been also presented to this previously answered question
The Concatenate layer does not have any weights (it has no trainable parameters, as you can see from your model summary), hence your get_weights() output comes back empty. Concatenation is an operation.
For your case you can get weights of your individual embedding layers after training.
model.layers[3].get_weights() # similarly for layer 4 and 5
Alternatively if you want to store your embedding in (None, 300) you can use numpy to concatenate weights.
out_concat = np.concatenate([model.layers[3].get_weights()[0], model.layers[4].get_weights()[0], model.layers[5].get_weights()[0]], axis=-1)
Although you can get output tensor of concatenate layer:
out_tensor = model.layers[9].output
# <tf.Tensor 'concatenate_3_1/concat:0' shape=(?, 300) dtype=float32>
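To make the point concrete: the (None, 300) values are activations computed per input sample, not stored parameters. The NumPy sketch below uses random stand-ins for the three LSTM(100) outputs and shows what the concatenation step computes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 5

# Stand-ins for the three LSTM(100) outputs of one batch.
enc_actors = rng.normal(size=(batch, 100))
enc_plot = rng.normal(size=(batch, 100))
enc_features = rng.normal(size=(batch, 100))

# What a concatenate layer computes: no parameters, just joining on the last axis.
merged = np.concatenate([enc_actors, enc_plot, enc_features], axis=-1)
```

In Keras itself, a usual way to obtain these values for real inputs is a second model that ends at that layer, for example `keras.Model(inputs=model.inputs, outputs=model.layers[9].output)`, followed by `predict`; the layer index 9 is taken from the question and may differ in your model.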

Stateful LSTM and stream predictions

I've trained an LSTM model (built with Keras and TF) on multiple batches of 7 samples with 3 features each, shaped like the sample below (the numbers are just placeholders for the purpose of explanation); each batch is labeled 0 or 1:
Data:
[
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
...
]
i.e. batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so a batch has shape (m*7*3))
Target:
[
[1]
[0]
[1]
...
]
On my production environment data is a stream of samples with 3 features ([1,2,3],[1,2,3]...). I would like to stream each sample as it arrives to my model and get the intermediate probability without waiting for the entire batch (7) - see the animation below.
One of my thoughts was padding the batch with 0 for the missing samples,
[[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[1,2,3]] but that seems to be inefficient.
Will appreciate any help that will point me in the right direction of both saving the LSTM intermediate state in a persistent way, while waiting for the next sample and predicting on a model trained on a specific batch size with partial data.
Update, including model code:
opt = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=10e-8, decay=0.001)
model = Sequential()
num_features = data.shape[2]
num_samples = data.shape[1]
first_lstm = LSTM(32, batch_input_shape=(None, num_samples, num_features),
return_sequences=True, activation='tanh')
model.add(first_lstm)
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(LSTM(16, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LeakyReLU())
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=opt,
metrics=['accuracy', keras_metrics.precision(),
keras_metrics.recall(), f1])
Model Summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 32) 6272
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 100, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 100, 16) 3136
_________________________________________________________________
dropout_2 (Dropout) (None, 100, 16) 0
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 100, 16) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1600) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 1601
=================================================================
Total params: 11,009
Trainable params: 11,009
Non-trainable params: 0
_________________________________________________________________
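As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch below assumes the usual LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. Note that the printed summary shows 100 timesteps, and the first count only matches if there are 16 input features, so it seems to come from a different data shape than the 7x3 example in the text.

```python
def lstm_params(units, input_dim):
    """4 gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(units, input_dim):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

steps = 100
lstm_1 = lstm_params(32, 16)           # 6272, matches lstm_1 if there are 16 input features
lstm_2 = lstm_params(16, 32)           # 3136, matches lstm_2
dense_1 = dense_params(1, steps * 16)  # 1601, Flatten turns 100 steps x 16 units into 1600 inputs
total = lstm_1 + lstm_2 + dense_1      # 11009
```

The Dense count is the only one that depends on the number of timesteps, because Flatten folds the time dimension into the feature dimension.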
I think there might be an easier solution.
If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True
Warning: your model has layers that act on the length dimension !!
The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.
So, before applying my answer below, fix your model so that it does not use the Flatten layer. Instead, you can simply remove return_sequences=True from the last LSTM layer.
The following code fixed that and also prepares a few things to be used with the answer below:
def createModel(forTraining):
    #model for training, stateful=False, any batch size
    if forTraining == True:
        batchSize = None
        stateful = False
        steps = num_samples
    #model for predicting, stateful=True, fixed batch size
    else:
        batchSize = 1
        stateful = True
        #leave the steps dimension open so a single step can be fed at a time
        steps = None
    model = Sequential()
    first_lstm = LSTM(32,
                      batch_input_shape=(batchSize, steps, num_features),
                      return_sequences=True, activation='tanh',
                      stateful=stateful)
    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))
    #this is the last LSTM layer, use return_sequences=False
    model.add(LSTM(16, return_sequences=False, stateful=stateful, activation='tanh'))
    model.add(Dropout(0.2))
    model.add(LeakyReLU())
    #don't add a Flatten!!!
    #model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    if forTraining == True:
        compileThisModel(model)
    #return the model so the callers below can use it
    return model
With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.
The usage of a stateful model as a solution for your question
First, train this new model again, because it has no Flatten layer:
trainingModel = createModel(forTraining=True)
trainThisModel(trainingModel)
Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but marking stateful=True in all its LSTM layers. And we should copy the weights from the trained model.
Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.
predictingModel = createModel(forTraining=False)
predictingModel.set_weights(trainingModel.get_weights())
And voilà. Just predict the outputs of the model with a single step:
pseudo for loop as samples arrive to your model:
prob = predictingModel.predict_on_batch(sample)
#where sample.shape == (1, 1, 3)
When you decide that you have reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model treating it as a continuation of the previous one.
Saving and loading states
Just get and set them, saving with h5py:
def saveStates(model, saveName):
    f = h5py.File(saveName,'w')
    for l, lay in enumerate(model.layers):
        #if you have nested models,
        #consider making this recurrent testing for layers in layers
        if isinstance(lay,RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s),
                                 data=K.eval(stat),
                                 dtype=K.dtype(stat))
    f.close()

def loadStates(model, saveName):
    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())
    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)
        K.set_value(model.layers[layer].states[state], f.get(stateKey))
    f.close()
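The same save/load idea can be tried without Keras or h5py. The sketch below swaps in NumPy's `.npz` format for HDF5 and uses a plain dictionary of made-up state arrays in place of a model, keeping the `states_<layer>_<state>` key scheme from the functions above; `save_states` and `load_states` here are illustrative names only.

```python
import io
import numpy as np

def save_states(states, fileobj):
    """states: {(layer_index, state_index): array}. Keys mirror 'states_<l>_<s>'."""
    arrays = {"states_%d_%d" % (l, s): v for (l, s), v in states.items()}
    np.savez(fileobj, **arrays)

def load_states(fileobj):
    """Rebuild the {(layer_index, state_index): array} dictionary from the file."""
    restored = {}
    with np.load(fileobj) as data:
        for key in data.files:
            _, layer, state = key.split("_")
            restored[(int(layer), int(state))] = data[key]
    return restored

# Hypothetical states of two LSTM layers (h and c each), batch size 1.
states = {(1, 0): np.full((1, 5), 0.1), (1, 1): np.full((1, 5), 0.2),
          (2, 0): np.full((1, 2), 0.3), (2, 1): np.full((1, 2), 0.4)}

buffer = io.BytesIO()          # stands in for a file on disk
save_states(states, buffer)
buffer.seek(0)
restored = load_states(buffer)
```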
Working test for saving/loading states
import h5py, numpy as np
from keras.layers import RNN, LSTM, Dense, Input
from keras.models import Model
import keras.backend as K
def createModel():
inp = Input(batch_shape=(1,None,3))
out = LSTM(5,return_sequences=True, stateful=True)(inp)
out = LSTM(2, stateful=True)(out)
out = Dense(1)(out)
model = Model(inp,out)
return model
def saveStates(model, saveName):
f = h5py.File(saveName,'w')
for l, lay in enumerate(model.layers):
#if you have nested models, consider making this recurrent testing for layers in layers
if isinstance(lay,RNN):
for s, stat in enumerate(lay.states):
f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))
f.close()
def loadStates(model, saveName):
f = h5py.File(saveName, 'r')
allStates = list(f.keys())
for stateKey in allStates:
name, layer, state = stateKey.split('_')
layer = int(layer)
state = int(state)
K.set_value(model.layers[layer].states[state], f.get(stateKey))
f.close()
def printStates(model):
for l in model.layers:
#if you have nested models, consider making this recurrent testing for layers in layers
if isinstance(l,RNN):
for s in l.states:
print(K.eval(s))
model1 = createModel()
model2 = createModel()
model1.predict_on_batch(np.ones((1,5,3))) #changes model 1 states
print('model1')
printStates(model1)
print('model2')
printStates(model2)
saveStates(model1,'testStates5')
loadStates(model2,'testStates5')
print('model1')
printStates(model1)
print('model2')
printStates(model2)
Considerations on the aspects of the data
In your first model (if it is stateful=False), it considers that each sequence in m is individual and not connected to the others. It also considers that each batch contains unique sequences.
If this is not the case, you might want to train the stateful model instead (considering that each sequence is actually connected to the previous sequence). And then you would need m batches of 1 sequence. -> m x (1, 7 or None, 3).
If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m*7*3)).
In any Keras RNN you can set the return_sequences flag to True to obtain the intermediate outputs, i.e., for every batch, instead of only the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.
But you would be getting all of them at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the intermediate outputs while the batch is being processed. This may be even more constrained if you are using any of the CUDNN-optimized variants. What you can do is basically regard your batch as 7 successive batches of shape (m*1*3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set stateful to True and let the object keep track of it.
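Regarding a batch of shape (m, 7, 3) as 7 successive batches of shape (m, 1, 3) is just slicing along the time axis. A small NumPy sketch with m = 4 as an arbitrary example value:

```python
import numpy as np

m, steps, features = 4, 7, 3
batch = np.arange(m * steps * features, dtype=np.float32).reshape(m, steps, features)

# One (m, 1, 3) slice per timestep, in order.
step_batches = [batch[:, t:t + 1, :] for t in range(steps)]

# Gluing the slices back together recovers the original batch.
rebuilt = np.concatenate(step_batches, axis=1)
```

Slicing with `t:t + 1` rather than `t` keeps the time axis, so each slice still has the 3-D shape an RNN layer expects.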
The following Python2+Keras example should exactly represent what you want. Specifically:
allowing to save the whole LSTM intermediate state in a persistent way
while waiting for the next sample
and predicting on a model trained on a specific batch size that may be arbitrary and unknown.
For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and from which you don't know much about. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me in the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.
from __future__ import print_function
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
import numpy as np
# globals
SEQ_LEN = 7
HID_DIMS = 32
OUTPUT_DIMS = 3 # each output is assumed to be a 3-dimensional vector
##############################################################################
# define the model to be trained on a fixed batch size:
# assume many-to-one training setup (otherwise set return_sequences=True)
TRAIN_BATCH_SIZE = 20
x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
dense = Dense(OUTPUT_DIMS, activation='linear')
m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
m_train.summary()
# a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned
# train the model on a fixed batch size and save it
print(">> INFERENCE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
m_train.save("trained_lstm.h5")
print(">> INFERENCE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
##############################################################################
# Now, although we aren't training anymore, we want to do step-wise predictions
# that do alter the inner state of the model, and keep track of that.
m_trained = load_model("trained_lstm.h5")
print(">> INFERENCE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
# now define an analogous model that allows a flexible batch size for inference:
x_in = Input(shape=[SEQ_LEN, 3])
h_in = Input(shape=[HID_DIMS])
c_in = Input(shape=[HID_DIMS])
pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])
# Let's confirm that this model is able to load the trained parameters:
# first, check that the performance from scratch is not good:
print(">> INFERENCE BEFORE SWAPPING MODEL:")
predictions, hs, zs, cs = m_inference.predict([batch123,
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
batch_size=1)
print(predictions)
# import state from the trained model state and check that it works:
print(">> INFERENCE AFTER SWAPPING MODEL:")
for layer in m_trained.layers:
    if "lstm" in layer.name:
        m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
        m_inference.get_layer("dense_infer").set_weights(layer.get_weights())
predictions, _, _, _ = m_inference.predict([batch123,
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
batch_size=1)
print(predictions)
# finally perform granular predictions while keeping the recurrent activations. Starting the sequence with zeros is a common practice, but depending on how you trained, you might have an <END_OF_SEQUENCE> character that you might want to propagate instead:
h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # with return_state=True the LSTM returns [output, hidden state, cell state]:
    # h is the proper LSTM output, cc is the hidden state (the same values as h when return_sequences=False) and c is the cell state (the accumulator)
    current_input = batch123[i:i+1] # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press any key to continue")
Additional information:
Since we have two forms of state persistency:
1. The saved/trained parameters of the model that are the same for each sequence
2. The a, c states that evolve throughout the sequences and may be "restarted"
It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c states (named h and c in the code) are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what they mean. They can be inspected as follows:
for w in lstm.weights:
print(w.name, w.shape)
In our case (32 hidden states) returns the following:
lstm_1/kernel:0 (3, 128)
lstm_1/recurrent_kernel:0 (32, 128)
lstm_1/bias:0 (128,)
We observe a dimensionality of 128. Why is that? This link describes the Keras LSTM implementation as follows:
The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.
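For reference, the standard LSTM update written in this notation is given below. This is reconstructed from the symbol definitions, not copied from the linked page; the gate names i, f, o, the candidate c̃ and the biases b are my own labels.

```latex
\begin{aligned}
i_t &= g(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= g(W_f x_t + U_f h_{t-1} + b_f) \\
\tilde{c}_t &= p(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= g(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * p(c_t)
\end{aligned}
```

There are four affine transformations (for i, f, c̃ and o), each with its own W, U and b.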
Which explains the 128=32*4 being the parameters for the affine transformation happening inside each one of the 4 gates, concatenated:
The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element
The matrix of shape (32, 128) (named recurrent_kernel) handles the input for the last recurrent state h.
The vector of shape (128,) (named bias), as usual in any other NN setup.
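The concatenation can be undone by splitting each array into four equal blocks along its last axis. The sketch below uses random stand-ins with the shapes printed above; the gate order (input, forget, cell candidate, output) is my understanding of how Keras lays out the blocks and is worth checking against your version.

```python
import numpy as np

units, input_dim = 32, 3
rng = np.random.default_rng(0)

# Stand-ins with the same shapes as the weights printed above.
kernel = rng.normal(size=(input_dim, 4 * units))        # (3, 128)
recurrent_kernel = rng.normal(size=(units, 4 * units))  # (32, 128)
bias = np.zeros(4 * units)                              # (128,)

# Assumed gate order along the last axis: input, forget, cell candidate, output.
W_i, W_f, W_c, W_o = np.split(kernel, 4, axis=1)
U_i, U_f, U_c, U_o = np.split(recurrent_kernel, 4, axis=1)
b_i, b_f, b_c, b_o = np.split(bias, 4)
```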
Note: This answer assumes that your model in the training phase is not stateful. You must understand what a stateful RNN layer is and make sure that the training data has the corresponding properties of statefulness. In short, it means there is a dependency between the sequences, i.e. one sequence is the follow-up to another sequence, which you want to consider in your model. If your model and training data are stateful, then I think other answers which involve setting stateful=True for the RNN layers from the beginning are simpler.
Update: Whether or not the training model is stateful, you can always copy its weights to the inference model and enable statefulness. So I think solutions based on setting stateful=True are shorter and better than mine. Their only drawback is that the batch size in these solutions must be fixed.
Note that the output of an LSTM layer over a single sequence is determined by its weight matrices, which are fixed, and its internal states, which depend on the previously processed timesteps. Now, to get the output of the LSTM layer for a single sequence of length m, one obvious way is to feed the entire sequence to the LSTM layer in one go. However, as I stated earlier, since its internal states depend on the previous timestep, we can exploit this fact and feed that single sequence chunk by chunk, getting the state of the LSTM layer at the end of processing a chunk and passing it to the LSTM layer for processing the next chunk. To make it clearer, suppose the sequence length is 7 (i.e. it has 7 timesteps of fixed-length feature vectors). As an example, it is possible to process this sequence like this:
Feed the timesteps 1 and 2 to the LSTM layer; get the final state (call it C1).
Feed the timesteps 3, 4 and 5 and state C1 as the initial state to the LSTM layer; get the final state (call it C2).
Feed the timesteps 6 and 7 and state C2 as the initial state to the LSTM layer; get the final output.
That final output is equivalent to the output produced by the LSTM layer if we had fed it the entire 7 timesteps at once.
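The equivalence can be checked numerically. The sketch below uses a minimal hand-written LSTM in NumPy with random weights (not Keras, and not a trained model) and compares processing 7 timesteps at once against processing chunks of 2, 3 and 2 timesteps while carrying the state along.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(x, h, c, W, U, b):
    """Run a minimal LSTM over x (timesteps, features); return final (h, c)."""
    n = h.shape[0]
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f = sigmoid(z[:n]), sigmoid(z[n:2 * n])
        g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_feats, n_units = 3, 5
W = rng.normal(size=(n_feats, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)
sequence = rng.normal(size=(7, n_feats))
zeros = np.zeros(n_units)

# Whole sequence in one go.
h_full, c_full = lstm_run(sequence, zeros, zeros, W, U, b)

# Same sequence in chunks of 2, 3 and 2 timesteps, carrying the state along.
h, c = zeros, zeros
for chunk in (sequence[:2], sequence[2:5], sequence[5:]):
    h, c = lstm_run(chunk, h, c, W, U, b)
```

Both routes apply the same per-timestep update in the same order, so the final `h` and `c` agree.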
So to realize this in Keras, you can set the return_state argument of LSTM layer to True so that you can get the intermediate state. Further, don't specify a fixed timestep length when defining the input layer. Instead use None to be able to feed the model with sequences of arbitrary length which enables us to process each sequence progressively (it's fine if your input data in training time are sequences of fixed-length).
Since you need this chunk-processing capability at inference time, we need to define a new model which shares the LSTM layer used in the training model, takes the initial states as input, and also gives the resulting states as output. The following is a general sketch of how it could be done (note that the returned state of the LSTM layer is not used when training the model; we only need it at test time):
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# n_feats and n_units are assumed to be defined already
# define training model
train_input = Input(shape=(None, n_feats)) # note that the number of timesteps is None
lstm_layer = LSTM(n_units, return_state=True)
lstm_output, _, _ = lstm_layer(train_input) # note that we ignore the returned states
classifier = Dense(1, activation='sigmoid')
train_output = classifier(lstm_output)
train_model = Model(train_input, train_output)
# compile and fit the model on training data ...
# ==================================================
# define inference model
inf_input = Input(shape=(None, n_feats))
state_h_input = Input(shape=(n_units,))
state_c_input = Input(shape=(n_units,))
# we use the layers of previous model
lstm_output, state_h, state_c = lstm_layer(inf_input,
                                           initial_state=[state_h_input, state_c_input])
output = classifier(lstm_output)
inf_model = Model([inf_input, state_h_input, state_c_input],
                  [output, state_h, state_c]) # note that we return the states as output
Now you can feed inf_model as many timesteps of a sequence as are available right now. However, note that initially you must feed the states with vectors of all zeros (which is the default initial value of the states). For example, if the sequence length is 7, a sketch of what happens when a new data stream is available is as follows:
state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))
# three new timesteps are available (timesteps has shape (1, 3, n_feats))
outputs = inf_model.predict([timesteps, state_h, state_c]) # a list: [output, state_h, state_c]
out = outputs[0][0, 0] # you may ignore this output since the entire sequence has not been processed yet
state_h = outputs[1]
state_c = outputs[2]
# after some time another four new timesteps are available (timesteps has shape (1, 4, n_feats))
outputs = inf_model.predict([timesteps, state_h, state_c])
# we have processed 7 timesteps, so the output is valid
out = outputs[0][0, 0] # store it, pass it to another thread or do whatever you want to do with it
# reinitialize the states to make them ready for the next sequence chunk
state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))
# to be continued...
Of course you need to do this in some kind of loop or implement a control flow structure to process the data stream, but I think you get what the general idea looks like.
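One possible shape for such a loop is sketched below. The helper `process_stream`, its `predict_fn` argument and the stand-in `dummy_predict` are hypothetical names introduced here so that the sketch runs without Keras. It also assumes that chunk boundaries always line up with sequence boundaries.

```python
import numpy as np


def process_stream(predict_fn, chunks, n_units, seq_len):
    """Feed chunks to predict_fn, carrying the states until seq_len timesteps are seen.

    predict_fn(chunk, state_h, state_c) must return (output, state_h, state_c).
    Assumes a chunk never spans two different sequences.
    """
    results = []
    state_h = np.zeros((1, n_units))
    state_c = np.zeros((1, n_units))
    seen = 0
    for chunk in chunks:                      # chunk shape: (1, timesteps, n_feats)
        output, state_h, state_c = predict_fn(chunk, state_h, state_c)
        seen += chunk.shape[1]
        if seen >= seq_len:                   # a full sequence has been processed
            results.append(output)
            state_h = np.zeros((1, n_units))  # reset for the next sequence
            state_c = np.zeros((1, n_units))
            seen = 0
    return results


def dummy_predict(chunk, state_h, state_c):
    """Stand-in for the real model: the state is just a running total."""
    state_h = state_h + chunk.sum()           # broadcast over all units
    state_c = state_c + chunk.shape[1]        # running timestep count
    return state_h[:, :1], state_h, state_c


# two sequences of 7 timesteps: the first arrives as 3 + 4, the second in one piece
chunks = [np.ones((1, 3, 2)), np.ones((1, 4, 2)), np.ones((1, 7, 2))]
results = process_stream(dummy_predict, chunks, n_units=3, seq_len=7)
```

With the real model, something like `lambda x, h, c: inf_model.predict([x, h, c])` could be passed as `predict_fn`, since `predict` returns the three outputs as a list.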
Finally, although your specific example is not a sequence-to-sequence model, I highly recommend reading the official Keras seq2seq tutorial, from which I think one can learn a lot of ideas.
As far as I know, because of the static graph in TensorFlow, there is no efficient way to feed inputs whose length differs from the training input length.
Padding is the official workaround, but it is less efficient and consumes more memory. I suggest you look into PyTorch, where fixing your problem would be trivial.
There are a lot of great posts on building LSTMs with PyTorch, and you will understand the benefit of a dynamic graph once you see them.
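For reference, the padding workaround mentioned in this answer could look roughly like the following sketch in plain NumPy. The helper name `pad_batch` and the toy sequence lengths are assumptions for illustration; Keras ships its own `pad_sequences` utility and a `Masking` layer for the same purpose.

```python
import numpy as np


def pad_batch(sequences, pad_value=0.0):
    """Pad a list of (timesteps, n_feats) arrays to a common length.

    Returns the padded batch and a boolean mask marking the real timesteps.
    """
    max_len = max(len(s) for s in sequences)
    n_feats = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, n_feats), pad_value, dtype=float)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s                 # copy the real timesteps
        mask[i, :len(s)] = True               # everything else stays padding
    return batch, mask


# three sequences with 2, 5 and 4 timesteps of 3 features each
seqs = [np.ones((2, 3)), np.ones((5, 3)), np.ones((4, 3))]
batch, mask = pad_batch(seqs)
```

The padded batch stores 3 x 5 timestep slots for only 11 real timesteps, which is the memory overhead the answer refers to.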

Creating a neural network in keras to multiply two input integers

I am playing around with Keras v2.0.8 in Python v2.7 (Tensorflow backend) to create small neural networks that calculate simple arithmetic functions (add, subtract, multiply, etc.), and am a bit confused. The code below is my network, together with a function that generates a random training dataset of integers with the corresponding labels (the two inputs added together):
def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] + b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data
X, y = create_data(0, 500, 10000)
model = Sequential()
model.add(Dense(3, input_dim=2))
model.add(Dense(5, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=10)
test_data, _ = create_data(0, 500, 10)
results = model.predict(test_data, batch_size=2)
sq_error = []
for i in range(0, len(test_data)):
    print 'test value:', test_data[i], 'result:', results[i][0], 'error:',\
        '%.2f' %(results[i][0] - (test_data[i][0] + test_data[i][1]))
    sq_error.append((results[i][0] - (test_data[i][0] + test_data[i][1])))
print '\n total rmse error: ', sqrt(np.sum(np.array(sq_error)))
This trains perfectly well and produces no unexpected results. However, when I create the training data by multiplying the two inputs together the model's loss for each epoch stays around 7,000,000,000 and the model does not converge at all. The data creation function for this is as follows:
def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] * b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data
I also had the same problem when I had training data of a single input integer and created the label by squaring the input data. However, it worked fine when I only multiplied the single input by a constant value or added/subtracted by a constant.
I have two questions:
1) Why is this the case? I assume it has something to do with the fundamentals of neural networks, but I can't work it out.
2) How could I adapt this code to train a model that multiplies two input numbers together?
The network architecture (2 - 3 - 5 - 3 - 5 - 1) is fairly random right now. I've tried lots of different ones varying in layers and neurons, this one just happened to be on my screen as I write this and got an accuracy of 100% for adding two inputs.
It is due to large gradient updates caused by the large numbers in the training data. When using a neural network, you should first ensure that the training data falls in a small range (usually [-1,1] or [0,1]) to help the optimization process and prevent disruptive gradient updates. Therefore, you should first normalize the data. In this case, one good candidate would be log-normalization.
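As a rough illustration of why log-normalization suits this task, the sketch below takes the logarithm of both inputs and labels. This is one possible reading of the suggestion, not a complete training recipe. It assumes the inputs start at 1 rather than 0, because log(0) is undefined, and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# assumption: inputs are drawn from [1, 500) instead of [0, 500) to avoid log(0)
a = rng.integers(low=1, high=500, size=1000)
b = rng.integers(low=1, high=500, size=1000)

X = np.stack([a, b], axis=1).astype(float)
y = (a * b).astype(float)

log_X = np.log(X)   # inputs now lie between 0 and log(499), about 6.2
log_y = np.log(y)   # labels now lie between 0 and 2 * log(499), about 12.4

# in log space the product becomes a plain sum: log(a * b) = log(a) + log(b)
sum_of_logs = log_X.sum(axis=1)

# a prediction made in log space is mapped back to the original scale with exp
recovered = np.exp(sum_of_logs)
```

A network trained on `log_X` and `log_y` would only have to learn an addition, which the question reports already works, and its predictions would be passed through `np.exp` to get the product back.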
Further, 'accuracy' as a metric in Keras is meant for classification problems. In a regression problem it does not make sense; it's better to use a relevant metric such as mean absolute error ('mae').
