I have simple CNN for the MNIST data problem.
cnn_model = tf.keras.Sequential([
tf.keras.layers.Conv2D(filters=24, kernel_size=(3,3), activation='relu'),
tf.keras.layers.Conv2D(filters=36, kernel_size=(3,3), activation='relu'),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation=tf.nn.relu),
tf.keras.layers.Dense(10, activation='softmax')
])
and this is how summary looks like:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_12 (Conv2D) (None, 26, 26, 24) 240
_________________________________________________________________
conv2d_13 (Conv2D) (None, 24, 24, 36) 7812
_________________________________________________________________
flatten_13 (Flatten) (None, 20736) 0
_________________________________________________________________
dense_26 (Dense) (None, 128) 2654336
_________________________________________________________________
dense_27 (Dense) (None, 10) 1290
=================================================================
Total params: 2,663,678
Trainable params: 2,663,678
Non-trainable params: 0
_________________________________________________________________
I have skipped pool layers in the problem for the simplicity of the question.
The first convolution layer has 240 parameters which is easy to calculate: (kernel size + bias) * number of filters: (3*3+1)*24.
Please explain me why second convolution layer has 7812 parameters (36 * 217).
The flatten layer has size of 20736. Which is the number of pixels produced by 36 filters of the previous layer: 24 * 24 * 36.
But how can we obtain 36 images by 36 filters from 24 images of the previous layer? Shouldn’t the size of the flatten layer be 36 * 24 * 24 * 24 which is the number of filters from the previous layer * size of the bitmap from the previous layer * number of filters from the first convolution layer?
The number of parameters for a convolutional layer is
(filter_height * filter_width * in_channels * out_channels) + out_channels
In your case, that's
(3 * 3 * 24 * 36) + 36 = 7,812
The output shape of such convolution is
(n_samples, remaining_height, remaining_width, n_filters)
Related
I have the following LSTM model. Can somebody helps me understand the summary of the model?
a) How the param# are calculated?
b) We have no value?
c) the param# near the dropoout why is 0?
model = Sequential()
model.add(LSTM(64, return_sequences=True, recurrent_regularizer=l2(0.0015), input_shape=(timestamps,
input_dim)))
model.add(Dropout(0.5))
model.add(LSTM(64, recurrent_regularizer=l2(0.0015), input_shape=(timesteps,input_dim)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))
model.summary()
The following are the input, timestamps, and x_train
timesteps=100
input_dim= 6
X_train=1120
The summary is:
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 64) 18176
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 64) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 64) 33024
_________________________________________________________________
dense_1 (Dense) (None, 64) 4160
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 6) 390
=================================================================
Total params: 59,910
Trainable params: 59,910
Non-trainable params: 0
Part of your question is answered here.
https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model
Simply put, the reason there are so many parameters for an LSTM model is because you have tons of data in your model and many weights need to be trained to fit the model.
Dropout layers don't have parameters because there are no weights in a dropout layer. All a dropout layer does is give a % chance that a neuron won't be included during testing. In this case, you've chosen 50%. Beyond that, there is nothing to configure in a dropout layer.
How parameters are calculated?
well!!. the input dimension is 6 and the hidden neurons in the first LSTM layer is 64.
so the first LSTM layer takes input [64 (initialized hidden state) + 6 (input)] in this form. so we can say the input dimension is 70 [64 (hidden state at t-1) + 6 current input at t].
Now the calculation part.
no of parms = input dimension * hidden units + bias.
= [64 (randomly initialized hidden state dimension) + 6 (input dimension)]*64( hidden neurons ) + 64 ( bias 1 for each hidden neurons)
= (64+6)*64+64
for one FFNN = 4544
But LSTM has 4 FFNN, so simply multiply it by 4.
Total trainable params = 4 * 4544
= 18176
Dropout layer does not have any parameters.
I am not sure which value you are talking about.?
If I have:
self.model.add(LSTM(lstm1_size, input_shape=(seq_length, feature_dim), return_sequences=True))
self.model.add(BatchNormalization())
self.model.add(Dropout(0.2))
then my seq_length specifies how many slices of data I want to process at once. If it matters, my model is a sequence-to-sequence (same size).
But if I have:
self.model.add(Bidirectional(LSTM(lstm1_size, input_shape=(seq_length, feature_dim), return_sequences=True)))
self.model.add(BatchNormalization())
self.model.add(Dropout(0.2))
then is that doubling the sequence size? Or at each time step, is it getting seq_length / 2 before and after that timestep?
Using a bidirectional LSTM layer has no effect on the sequence length.
I tested this with the following code:
from keras.models import Sequential
from keras.layers import Bidirectional,LSTM,BatchNormalization,Dropout,Input
model = Sequential()
lstm1_size = 50
seq_length = 128
feature_dim = 20
model.add(Bidirectional(LSTM(lstm1_size, input_shape=(seq_length, feature_dim), return_sequences=True)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
batch_size = 32
model.build(input_shape=(batch_size,seq_length, feature_dim))
model.summary()
This resulted in the following output for bidirectional
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional_1 (Bidirection (32, 128, 100) 28400
_________________________________________________________________
batch_normalization_1 (Batch (32, 128, 100) 400
_________________________________________________________________
dropout_1 (Dropout) (32, 128, 100) 0
=================================================================
Total params: 28,800
Trainable params: 28,600
Non-trainable params: 200
No bidirectional layer:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 128, 50) 14200
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 50) 200
_________________________________________________________________
dropout_1 (Dropout) (None, 128, 50) 0
=================================================================
Total params: 14,400
Trainable params: 14,300
Non-trainable params: 100
_________________________________________________________________
I've a sample tiny CNN implemented in both Keras and PyTorch. When I print summary of both the networks, the total number of trainable parameters are same but total number of parameters and number of parameters for Batch Normalization don't match.
Here is the CNN implementation in Keras:
inputs = Input(shape = (64, 64, 1)). # Channel Last: (NHWC)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 1))(inputs)
model = BatchNormalization(momentum=0.15, axis=-1)(model)
model = Flatten()(model)
dense = Dense(100, activation = "relu")(model)
head_root = Dense(10, activation = 'softmax')(dense)
And the summary printed for above model is:
Model: "model_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_9 (InputLayer) (None, 64, 64, 1) 0
_________________________________________________________________
conv2d_10 (Conv2D) (None, 64, 64, 32) 320
_________________________________________________________________
batch_normalization_2 (Batch (None, 64, 64, 32) 128
_________________________________________________________________
flatten_3 (Flatten) (None, 131072) 0
_________________________________________________________________
dense_11 (Dense) (None, 100) 13107300
_________________________________________________________________
dense_12 (Dense) (None, 10) 1010
=================================================================
Total params: 13,108,758
Trainable params: 13,108,694
Non-trainable params: 64
_________________________________________________________________
Here's the implementation of the same model architecture in PyTorch:
# Image format: Channel first (NCHW) in PyTorch
class CustomModel(nn.Module):
def __init__(self):
super(CustomModel, self).__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(3, 3), padding=1),
nn.ReLU(True),
nn.BatchNorm2d(num_features=32),
)
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(in_features=131072, out_features=100)
self.fc2 = nn.Linear(in_features=100, out_features=10)
def forward(self, x):
output = self.layer1(x)
output = self.flatten(output)
output = self.fc1(output)
output = self.fc2(output)
return output
And following is the output of summary of the above model:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 64, 64] 320
ReLU-2 [-1, 32, 64, 64] 0
BatchNorm2d-3 [-1, 32, 64, 64] 64
Flatten-4 [-1, 131072] 0
Linear-5 [-1, 100] 13,107,300
Linear-6 [-1, 10] 1,010
================================================================
Total params: 13,108,694
Trainable params: 13,108,694
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 4.00
Params size (MB): 50.01
Estimated Total Size (MB): 54.02
----------------------------------------------------------------
As you can see in above results, Batch Normalization in Keras has more number of parameters than PyTorch (2x to be exact). So what's the difference in above CNN architectures? If they are equivalent, then what am I missing here?
Keras treats as parameters (weights) many things that will be "saved/loaded" in the layer.
While both implementations naturally have the accumulated "mean" and "variance" of the batches, these values are not trainable with backpropagation.
Nevertheless, these values are updated every batch, and Keras treats them as non-trainable weights, while PyTorch simply hides them. The term "non-trainable" here means "not trainable by backpropagation", but doesn't mean the values are frozen.
In total they are 4 groups of "weights" for a BatchNormalization layer. Considering the selected axis (default = -1, size=32 for your layer)
scale (32) - trainable
offset (32) - trainable
accumulated means (32) - non-trainable, but updated every batch
accumulated std (32) - non-trainable, but updated every batch
The advantage of having it like this in Keras is that when you save the layer, you also save the mean and variance values the same way you save all other weights in the layer automatically. And when you load the layer, these weights are loaded together.
I have following program taken from Internet
def my_model(input_shape):
# Define the input placeholder as a tensor with shape input_shape. Think of this as your input image!
X_input = Input(input_shape)
# Zero-Padding: pads the border of X_input with zeroes
X = ZeroPadding2D((3, 3))(X_input)
# CONV -> BN -> RELU Block applied to X
X = Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0')(X)
X = BatchNormalization(axis = 3, name = 'bn0')(X)
X = Activation('relu')(X)
# MAXPOOL
X = MaxPooling2D((2, 2), name='max_pool')(X)
# FLATTEN X (means convert it to a vector) + FULLYCONNECTED
X = Flatten()(X)
X = Dense(1, activation='sigmoid', name='fc')(X)
# Create model. This creates your Keras model instance, you'll use this instance to train/test the model.
model = Model(inputs = X_input, outputs = X, name='myModel')
return model
mymodel = my_model((64,64,3))
mymodel.summary()
Here output of summary is shown as below
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 64, 64, 3) 0
_________________________________________________________________
zero_padding2d_3 (ZeroPaddin (None, 70, 70, 3) 0
_________________________________________________________________
conv0 (Conv2D) (None, 64, 64, 32) 4736
_________________________________________________________________
bn0 (BatchNormalization) (None, 64, 64, 32) 128
_________________________________________________________________
activation_2 (Activation) (None, 64, 64, 32) 0
_________________________________________________________________
max_pool (MaxPooling2D) (None, 32, 32, 32) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 32768) 0
_________________________________________________________________
fc (Dense) (None, 1) 32769
=================================================================
Total params: 37,633
Trainable params: 37,569
Non-trainable params: 64
My question is from which layer this non-trainable params are taken i.e., 64.
Another question is how batch normalization has parameters 128?
Request your help how above numbers we got from model defined above. Thanks for the time and help.
BatchNormalization layer is composed of [gamma weights, beta weights, moving_mean(non-trainable), moving_variance(non-trainable)] and for each parameter there is one value for each element in the last axis (by default in keras, but you can change the axis if you want to).
In your code you have a size 32 in the last dimension before the BatchNormalization layer, so 32*4=128 parameters and since there are 2 non-trainable parameters there are 32*2=64 non-trainable parameters
I have a simple NN model for detecting hand-written digits from a 28x28px image written in python using Keras (Theano backend):
model0 = Sequential()
#number of epochs to train for
nb_epoch = 12
#amount of data each iteration in an epoch sees
batch_size = 128
model0.add(Flatten(input_shape=(1, img_rows, img_cols)))
model0.add(Dense(nb_classes))
model0.add(Activation('softmax'))
model0.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
model0.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
verbose=1, validation_data=(X_test, Y_test))
score = model0.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
This runs well and I get ~90% accuracy. I then perform the following command to get a summary of my network's structure by doing print(model0.summary()). This outputs the following:
Layer (type) Output Shape Param # Connected to
=====================================================================
flatten_1 (Flatten) (None, 784) 0 flatten_input_1[0][0]
dense_1 (Dense) (None, 10) 7850 flatten_1[0][0]
activation_1 (None, 10) 0 dense_1[0][0]
======================================================================
Total params: 7850
I don't understand how they get to 7850 total params and what that actually means?
The number of parameters is 7850 because with every hidden unit you have 784 input weights and one weight of connection with bias. This means that every hidden unit gives you 785 parameters. You have 10 units so it sums up to 7850.
The role of this additional bias term is really important. It significantly increases the capacity of your model. You can read details e.g. here Role of Bias in Neural Networks.
I feed a 514 dimensional real-valued input to a Sequential model in Keras.
My model is constructed in following way :
predictivemodel = Sequential()
predictivemodel.add(Dense(514, input_dim=514, W_regularizer=WeightRegularizer(l1=0.000001,l2=0.000001), init='normal'))
predictivemodel.add(Dense(257, W_regularizer=WeightRegularizer(l1=0.000001,l2=0.000001), init='normal'))
predictivemodel.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
When I print model.summary() I get following result:
Layer (type) Output Shape Param # Connected to
================================================================
dense_1 (Dense) (None, 514) 264710 dense_input_1[0][0]
________________________________________________________________
activation_1 (None, 514) 0 dense_1[0][0]
________________________________________________________________
dense_2 (Dense) (None, 257) 132355 activation_1[0][0]
================================================================
Total params: 397065
________________________________________________________________
For the dense_1 layer , number of params is 264710.
This is obtained as : 514 (input values) * 514 (neurons in the first layer) + 514 (bias values)
For dense_2 layer, number of params is 132355.
This is obtained as : 514 (input values) * 257 (neurons in the second layer) + 257 (bias values for neurons in the second layer)
For Dense Layers:
output_size * (input_size + 1) == number_parameters
For Conv Layers:
output_channels * (input_channels * window_size + 1) == number_parameters
Consider following example,
model = Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
Conv2D(64, (3, 3), activation='relu'),
Conv2D(128, (3, 3), activation='relu'),
Dense(num_classes, activation='softmax')
])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 222, 222, 32) 896
_________________________________________________________________
conv2d_2 (Conv2D) (None, 220, 220, 64) 18496
_________________________________________________________________
conv2d_3 (Conv2D) (None, 218, 218, 128) 73856
_________________________________________________________________
dense_9 (Dense) (None, 218, 218, 10) 1290
=================================================================
Calculating params,
assert 32 * (3 * (3*3) + 1) == 896
assert 64 * (32 * (3*3) + 1) == 18496
assert 128 * (64 * (3*3) + 1) == 73856
assert num_classes * (128 + 1) == 1290
The "none" in the shape means it does not have a pre-defined number. For example, it can be the batch size you use during training, and you want to make it flexible by not assigning any value to it so that you can change your batch size. The model will infer the shape from the context of the layers.
To get nodes connected to each layer, you can do the following:
for layer in model.layers:
print(layer.name, layer.inbound_nodes, layer.outbound_nodes)
Number of parameters is the amount of numbers that can be changed in the model. Mathematically this means number of dimensions of your optimization problem. For you as a programmer, each of this parameters is a floating point number, which typically takes 4 bytes of memory, allowing you to predict the size of this model once saved.
This formula for this number is different for each neural network layer type, but for Dense layer it is simple: each neuron has one bias parameter and one weight per input:
N = n_neurons * ( n_inputs + 1).
The easiest way to calculate number of neurons in one layer is:
Param value / (number of units * 4)
Number of units is in predictivemodel.add(Dense(514,...)
Param value is Param in model.summary() function
For example in Paul Lo's answer , number of neurons in one layer is 264710 / (514 * 4 ) = 130