Understanding the summary of an LSTM model - python

I have the following LSTM model. Can somebody help me understand the summary of the model?
a) How are the param # values calculated?
b) Why do some output shapes show no value (None)?
c) Why is the param # next to the dropout layer 0?
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.regularizers import l2

model = Sequential()
model.add(LSTM(64, return_sequences=True, recurrent_regularizer=l2(0.0015),
               input_shape=(timesteps, input_dim)))
model.add(Dropout(0.5))
model.add(LSTM(64, recurrent_regularizer=l2(0.0015), input_shape=(timesteps, input_dim)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))
model.summary()
The following are the values of timesteps, input_dim, and X_train:
timesteps=100
input_dim= 6
X_train=1120
The summary is:
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 64) 18176
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 64) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 64) 33024
_________________________________________________________________
dense_1 (Dense) (None, 64) 4160
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 6) 390
=================================================================
Total params: 59,910
Trainable params: 59,910
Non-trainable params: 0

Part of your question is answered here.
https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model
Simply put, the reason there are so many parameters in an LSTM model is that each LSTM layer carries weights for both its input and its recurrent state, for every one of its four gates, so many weights need to be trained to fit the model.
Dropout layers don't have parameters because there are no weights in a dropout layer. All a dropout layer does is randomly drop each unit's output with some probability during training; in this case, you've chosen 50%. Beyond that rate, there is nothing to configure in a dropout layer.

How are the parameters calculated?
The input dimension is 6, and the first LSTM layer has 64 hidden neurons.
At each timestep the first LSTM layer sees the 64-dimensional hidden state from t-1 concatenated with the 6-dimensional input at t, so its effective input dimension is 64 + 6 = 70.
Now the calculation:
number of params per gate = (hidden state dimension + input dimension) * hidden neurons + biases
= [64 (hidden state at t-1) + 6 (input at t)] * 64 (hidden neurons) + 64 (one bias per hidden neuron)
= (64 + 6) * 64 + 64
= 4544 for one feed-forward network (gate)
But an LSTM cell contains 4 such feed-forward networks (the input, forget, cell candidate, and output gates), so simply multiply by 4:
Total trainable params = 4 * 4544
= 18176
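The same calculation as a small Python sketch (the sizes are taken from the model above):
input_dim = 6    # features per timestep
units = 64       # hidden neurons in the first LSTM layer
params_per_gate = (units + input_dim) * units + units   # (64 + 6) * 64 + 64 = 4544
total_params = 4 * params_per_gate                       # 4 gates in an LSTM cell
print(total_params)  # 18176, matching lstm_1 in model.summary()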
Dropout layer does not have any parameters.
I am not sure which value you are talking about.

Related

Simple questions about LSTM networks from Keras

Consider this LSTM-based RNN:
# Instantiating the model
model = Sequential()
# Input layer
model.add(LSTM(30, activation="softsign", return_sequences=True, input_shape=(30, 1)))
# Hidden layers
model.add(LSTM(12, activation="softsign", return_sequences=True))
model.add(LSTM(12, activation="softsign", return_sequences=True))
# Final Hidden layer
model.add(LSTM(10, activation="softsign"))
# Output layer
model.add(Dense(10))
Is each output unit of the final hidden layer connected to each of the 12 output units of the preceding hidden layer? (10 * 12 = 120 connections)
Is each of the 10 outputs of the Dense layer connected to each of the 10 units of the final hidden layer? (10 * 10 = 100 connections)
Would there be a difference in terms of connections between the input layer and the 1st hidden layer if return_sequences were set to False (for both layers or for one)?
Thanks a lot for your help
Aymeric
Here is how I picture the RNN; please tell me if it's wrong.
Note about the picture:
X = one training example, i.e. a vector of 30 bitcoin (BTC) values (each value represents one day, 30 days total)
Output vector = 10 values that are supposed to be the next 10 values of bitcoin (the next 10 days)
Let's take a look at the model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 30, 30) 3840
_________________________________________________________________
lstm_1 (LSTM) (None, 30, 12) 2064
_________________________________________________________________
lstm_2 (LSTM) (None, 30, 12) 1200
_________________________________________________________________
lstm_3 (LSTM) (None, 10) 920
_________________________________________________________________
dense (Dense) (None, 10) 110
=================================================================
Total params: 8,134
Trainable params: 8,134
Non-trainable params: 0
_________________________________________________________________
Since you don't use return_sequences=True, the default is return_sequences=False, which means only the last output from the final LSTM layer is used by the Dense layer.
Yes. But it is actually 110 because you have a bias: (10 + 1) * 10.
There would not. The difference between return_sequences=True and return_sequences=False is that when it is set to False, only the final output is sent to the next layer. So if I have time-series data with 30 events (shape (1, 30, 30)), only the output from the 30th event is passed along to the next layer. The computations are the same, so there is no difference in weights. Do be aware that there may be shape mismatches if you simply set some of these to False out of the box.
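To make the shape behaviour concrete, here is a small sketch with arbitrary layer sizes (the exact import path depends on your Keras/TensorFlow version); the two layers have identical weights, only the output rank differs:
from keras.models import Sequential
from keras.layers import LSTM

seq = Sequential([LSTM(12, return_sequences=True, input_shape=(30, 1))])
print(seq.output_shape)    # (None, 30, 12): one output per timestep

last = Sequential([LSTM(12, return_sequences=False, input_shape=(30, 1))])
print(last.output_shape)   # (None, 12): only the last timestep's output

print(seq.layers[0].count_params(), last.layers[0].count_params())  # 672 672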

How many bidirectional LSTM layers to use, and how many is too many? Any advice on a very imbalanced dataset?

Would much appreciate your help.
I am new to RNNs and I am trying to implement an RNN architecture to classify protein sequences; essentially they are one-hot encoded NumPy arrays.
I have an issue that the data is very imbalanced:
Examples:
Total: 34909
Positive: 282 (0.81% of total)
Therefore I am planning to weight the different classes by passing the class_weight parameter when the model is fitted (a sketch of one way to compute such weights is shown below).
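For reference, a minimal sketch of computing balanced class weights from the counts above (compute_class_weight is scikit-learn's helper; y_train is a placeholder for your 0/1 label array):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_train)
class_weight = {0: weights[0], 1: weights[1]}
# roughly {0: 0.50, 1: 61.9} for 34627 negatives and 282 positives
# model.fit(X_train, y_train, class_weight=class_weight, ...)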
I am also planning to use the F1 score on the validation set as a metric instead of accuracy or loss, as I am not interested in the true negatives.
Moreover, I am planning to use transfer learning: I have datasets with more positive data and datasets with only a few points, so I plan to pretrain a general model and use its weights to further train on the specific problem.
I have come up with this model architecture, but I am not sure whether adding 4 bidirectional LSTM layers is a wise choice:
from keras import regularizers
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dropout, Dense
from keras.initializers import Constant
from keras.optimizers import Adam
from keras.losses import BinaryCrossentropy

if output_bias is not None:
    output_bias = Constant(output_bias)
model = Sequential()
# First LSTM layer
model.add(Bidirectional(LSTM(units=50, return_sequences=True, recurrent_dropout=0.1), input_shape=(timesteps, features)))
model.add(Dropout(0.5))
# Second LSTM layer
model.add(Bidirectional(LSTM(units=50, return_sequences=True)))
model.add(Dropout(0.5))
# Third LSTM layer
model.add(Bidirectional(LSTM(units=50, return_sequences=True)))
model.add(Dropout(0.5))
# Fourth LSTM layer
model.add(Bidirectional(LSTM(units=50, return_sequences=False)))
model.add(Dropout(0.5))
# First Dense layer
model.add(Dense(units=128, kernel_initializer='he_normal', activation='relu'))
model.add(Dropout(0.5))
# Adding the output layer
if output_bias is None:
    model.add(Dense(units=1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.001)))
else:
    model.add(Dense(units=1, activation='sigmoid',
                    bias_initializer=output_bias, kernel_regularizer=regularizers.l2(0.001)))
model.compile(optimizer=Adam(lr=1e-3), loss=BinaryCrossentropy(), metrics=metrics)  # metrics list defined elsewhere
model.build()
How do I know how many LSTM layers I should add? Is it just trial and error?
Is there anything else I should include in the layers?
model.summary():
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional_13 (Bidirectio (None, 5, 100) 28400
_________________________________________________________________
dropout_16 (Dropout) (None, 5, 100) 0
_________________________________________________________________
bidirectional_14 (Bidirectio (None, 5, 100) 60400
_________________________________________________________________
dropout_17 (Dropout) (None, 5, 100) 0
_________________________________________________________________
bidirectional_15 (Bidirectio (None, 5, 100) 60400
_________________________________________________________________
dropout_18 (Dropout) (None, 5, 100) 0
_________________________________________________________________
bidirectional_16 (Bidirectio (None, 100) 60400
_________________________________________________________________
dropout_19 (Dropout) (None, 100) 0
_________________________________________________________________
dense_7 (Dense) (None, 128) 12928
_________________________________________________________________
dropout_20 (Dropout) (None, 128) 0
_________________________________________________________________
dense_8 (Dense) (None, 1) 129
=================================================================
Total params: 222,657
Trainable params: 222,657
Non-trainable params: 0
I have built this model by going through multiple tutorials such as https://www.tensorflow.org/tutorials/text/text_classification_rnn
and https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
I would appreciate it if you could point me in the right direction.
Thanks!

TimeDistributed Dense Layer after GRU (return_sequences=True) layers causing error with dimensions

I'm currently trying to take my first steps with Keras on top of TensorFlow to classify time-series data. I was able to get a fairly simple model running, but after some feedback it was recommended that I use multiple GRU layers in a row and add the TimeDistributed wrapper around my Dense layers. Here is the model I tried:
model = Sequential()
model.add(GRU(100, input_shape=(n_timesteps, n_features), return_sequences=True, dropout=0.5))
model.add(GRU(100, return_sequences=True, go_backwards=True, dropout=0.5))
model.add(GRU(100, return_sequences=True, go_backwards=True, dropout=0.5))
model.add(GRU(100, return_sequences=True, go_backwards=True, dropout=0.5))
model.add(GRU(100, return_sequences=True, go_backwards=True, dropout=0.5))
model.add(GRU(100, return_sequences=True, go_backwards=True, dropout=0.5))
model.add(TimeDistributed(Dense(units=100, activation='relu')))
model.add(TimeDistributed(Dense(n_outputs, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I am receiving the following error message when trying to fit the model with the input having a shape of (2357, 128, 11) (2357 samples, 128 timesteps, 11 features):
ValueError: Error when checking target: expected time_distributed_2 to have 3 dimensions, but got array with shape (2357, 5)
This is the output for model.summary():
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_1 (GRU) (None, 128, 100) 33600
_________________________________________________________________
gru_2 (GRU) (None, 128, 100) 60300
_________________________________________________________________
gru_3 (GRU) (None, 128, 100) 60300
_________________________________________________________________
gru_4 (GRU) (None, 128, 100) 60300
_________________________________________________________________
gru_5 (GRU) (None, 128, 100) 60300
_________________________________________________________________
gru_6 (GRU) (None, 128, 100) 60300
_________________________________________________________________
time_distributed_1 (TimeDist (None, 128, 100) 10100
_________________________________________________________________
time_distributed_2 (TimeDist (None, 128, 5) 505
=================================================================
Total params: 345,705
Trainable params: 345,705
Non-trainable params: 0
So what is the correct way to put multiple GRU layers in a row and add the TimeDistributed wrapper to the following Dense layers? I will be very grateful for any helpful input.
If you set return_sequences=False in your last GRU layer, the code will work.
You only need return_sequences=True when the output of one RNN layer is fed as the input to another RNN layer, i.e. to preserve the time dimension. When you set return_sequences=False, the output is only the last hidden state (instead of the hidden state at every timestep), and the time dimension disappears.
That is why, when you set return_sequences=False, the output loses one dimension (e.g. it goes from (batch, timesteps, units) to (batch, units)).
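A minimal sketch of the tail of the model under that suggestion. My reading of the fix, given targets of shape (2357, 5), is that the TimeDistributed wrappers are dropped as well, since without return_sequences there is no time axis left to distribute over:
# ... the earlier GRU layers stay unchanged, all with return_sequences=True ...
model.add(GRU(100, return_sequences=False, go_backwards=True, dropout=0.5))  # last recurrent layer
model.add(Dense(units=100, activation='relu'))
model.add(Dense(n_outputs, activation='softmax'))  # output shape (None, 5) matches the targets
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])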

Keras: How to connect a CNN model with a decision tree

I want to train a model to predict a person's emotion from physical signals. I have one physical signal and I am using it as the input feature:
ecg (electrocardiography)
I want to use a CNN architecture to extract features from the data, and then use these extracted features to feed a classical Decision Tree Classifier. Below you can see my CNN approach without the decision tree:
model = Sequential()
model.add(Conv1D(15,60,padding='valid', activation='relu',input_shape=(18000,1), strides = 1, kernel_regularizer=regularizers.l1_l2(l1=0.1, l2=0.1)))
model.add(MaxPooling1D(2,data_format='channels_last'))
model.add(Dropout(0.6))
model.add(BatchNormalization())
model.add(Conv1D(30, 60, padding='valid', activation='relu',kernel_regularizer = regularizers.l1_l2(l1=0.1, l2=0.1), strides=1))
model.add(MaxPooling1D(4,data_format='channels_last'))
model.add(Dropout(0.6))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(3, activation = 'softmax'))
I want to edit this code so that, in the output layer, there is a working decision tree instead of model.add(Dense(3, activation = 'softmax')). I have tried to save the outputs of the last convolutional layer like this:
output = model.layers[-6].output
And when I printed the output variable, the result was this:
THE OUTPUT: Tensor("conv1d_56/Relu:0", shape=(?, 8971, 30), dtype=float32)
I guess the output variable holds the extracted features. Now, how can I feed my decision tree classifier with the data stored in the output variable? Here is the decision tree from scikit-learn:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = 'entropy')
dtc.fit()
How should I feed the fit() method? Thanks in advance.
To extract a vector of features that you can pass on to another algorithm, you need a fully connected layer before your softmax layer. Something like this will add in a 128 dimensional layer just before your softmax layer:
model = Sequential()
model.add(Conv1D(15,60,padding='valid', activation='relu',input_shape=(18000,1), strides = 1, kernel_regularizer=regularizers.l1_l2(l1=0.1, l2=0.1)))
model.add(MaxPooling1D(2,data_format='channels_last'))
model.add(Dropout(0.6))
model.add(BatchNormalization())
model.add(Conv1D(30, 60, padding='valid', activation='relu',kernel_regularizer = regularizers.l1_l2(l1=0.1, l2=0.1), strides=1))
model.add(MaxPooling1D(4,data_format='channels_last'))
model.add(Dropout(0.6))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation = 'softmax'))
If you then run model.summary() you can see the name of the layers:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_9 (Conv1D) (None, 17941, 15) 915
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 8970, 15) 0
_________________________________________________________________
dropout_10 (Dropout) (None, 8970, 15) 0
_________________________________________________________________
batch_normalization_9 (Batch (None, 8970, 15) 60
_________________________________________________________________
conv1d_10 (Conv1D) (None, 8911, 30) 27030
_________________________________________________________________
max_pooling1d_10 (MaxPooling (None, 2227, 30) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 2227, 30) 0
_________________________________________________________________
batch_normalization_10 (Batc (None, 2227, 30) 120
_________________________________________________________________
flatten_6 (Flatten) (None, 66810) 0
_________________________________________________________________
dense_7 (Dense) (None, 128) 8551808
_________________________________________________________________
dropout_12 (Dropout) (None, 128) 0
_________________________________________________________________
dense_8 (Dense) (None, 3) 387
=================================================================
Total params: 8,580,320
Trainable params: 8,580,230
Non-trainable params: 90
_________________________________________________________________
Once your network has been trained you can create a new model where the output layer becomes 'dense_7' and it'll generate 128 dimensional feature vectors:
feature_vectors_model = Model(model.input, model.get_layer('dense_7').output)
dtc_features = feature_vectors_model.predict(your_X_data) # fit your decision tree on this data
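The fit() call then takes those feature vectors together with the corresponding labels; a minimal sketch (your_X_data, your_y_labels, and new_X_data are placeholders for your own arrays):
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(criterion='entropy')
dtc.fit(dtc_features, your_y_labels)   # one 128-dimensional feature vector per sample
new_features = feature_vectors_model.predict(new_X_data)
predictions = dtc.predict(new_features)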

Keras model.summary() result - Understanding the # of Parameters

I have a simple NN model for detecting hand-written digits from 28x28 px images, written in Python using Keras (Theano backend):
model0 = Sequential()
#number of epochs to train for
nb_epoch = 12
#amount of data each iteration in an epoch sees
batch_size = 128
model0.add(Flatten(input_shape=(1, img_rows, img_cols)))
model0.add(Dense(nb_classes))
model0.add(Activation('softmax'))
model0.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
model0.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
verbose=1, validation_data=(X_test, Y_test))
score = model0.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
This runs well and I get ~90% accuracy. I then call print(model0.summary()) to get a summary of my network's structure, which outputs the following:
Layer (type) Output Shape Param # Connected to
=====================================================================
flatten_1 (Flatten) (None, 784) 0 flatten_input_1[0][0]
dense_1 (Dense) (None, 10) 7850 flatten_1[0][0]
activation_1 (None, 10) 0 dense_1[0][0]
======================================================================
Total params: 7850
I don't understand how they get to 7850 total params and what that number actually means.
The number of parameters is 7850 because every output unit has 784 input weights plus one bias weight. This means that every unit gives you 785 parameters. You have 10 units, so it sums up to 7850.
The role of this additional bias term is really important. It significantly increases the capacity of your model. You can read details e.g. here Role of Bias in Neural Networks.
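As a quick sanity check of that arithmetic:
inputs, units = 784, 10
assert units * (inputs + 1) == 7850  # one bias per unit on top of the 784 input weights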
I feed a 514 dimensional real-valued input to a Sequential model in Keras.
My model is constructed in following way :
predictivemodel = Sequential()
predictivemodel.add(Dense(514, input_dim=514, W_regularizer=WeightRegularizer(l1=0.000001,l2=0.000001), init='normal'))
predictivemodel.add(Dense(257, W_regularizer=WeightRegularizer(l1=0.000001,l2=0.000001), init='normal'))
predictivemodel.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
When I print model.summary() I get following result:
Layer (type) Output Shape Param # Connected to
================================================================
dense_1 (Dense) (None, 514) 264710 dense_input_1[0][0]
________________________________________________________________
activation_1 (None, 514) 0 dense_1[0][0]
________________________________________________________________
dense_2 (Dense) (None, 257) 132355 activation_1[0][0]
================================================================
Total params: 397065
________________________________________________________________
For the dense_1 layer, the number of params is 264710.
This is obtained as: 514 (input values) * 514 (neurons in the first layer) + 514 (bias values)
For the dense_2 layer, the number of params is 132355.
This is obtained as: 514 (input values) * 257 (neurons in the second layer) + 257 (bias values for the neurons in the second layer)
For Dense Layers:
output_size * (input_size + 1) == number_parameters
For Conv Layers:
output_channels * (input_channels * window_size + 1) == number_parameters
Consider the following example:
model = Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
Conv2D(64, (3, 3), activation='relu'),
Conv2D(128, (3, 3), activation='relu'),
Dense(num_classes, activation='softmax')
])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 222, 222, 32) 896
_________________________________________________________________
conv2d_2 (Conv2D) (None, 220, 220, 64) 18496
_________________________________________________________________
conv2d_3 (Conv2D) (None, 218, 218, 128) 73856
_________________________________________________________________
dense_9 (Dense) (None, 218, 218, 10) 1290
=================================================================
Calculating params,
assert 32 * (3 * (3*3) + 1) == 896
assert 64 * (32 * (3*3) + 1) == 18496
assert 128 * (64 * (3*3) + 1) == 73856
assert num_classes * (128 + 1) == 1290
The "none" in the shape means it does not have a pre-defined number. For example, it can be the batch size you use during training, and you want to make it flexible by not assigning any value to it so that you can change your batch size. The model will infer the shape from the context of the layers.
To get nodes connected to each layer, you can do the following:
for layer in model.layers:
    print(layer.name, layer.inbound_nodes, layer.outbound_nodes)
The number of parameters is the number of values in the model that can be changed during training. Mathematically, this is the number of dimensions of your optimization problem. For you as a programmer, each of these parameters is a floating-point number, which typically takes 4 bytes of memory, allowing you to predict the size of the model once it is saved.
The formula for this number is different for each neural network layer type, but for a Dense layer it is simple: each neuron has one bias parameter and one weight per input:
N = n_neurons * (n_inputs + 1)
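A small sketch of that size estimate, assuming 32-bit floats (4 bytes per parameter); count_params() is the standard Keras method that reports the same total as model.summary():
n_params = predictivemodel.count_params()  # 397065 for the model above
approx_size_bytes = n_params * 4            # float32 weights
print(approx_size_bytes / 1024 ** 2, "MB")  # roughly 1.5 MB of saved weights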
The easiest way to calculate the number of neurons in one layer is:
Param value / (number of units * 4)
The number of units is in predictivemodel.add(Dense(514,...)
The Param value is the Param # shown by model.summary()
For example, in Paul Lo's answer, the number of neurons in one layer is 264710 / (514 * 4) = 130
