What is the architecture behind the Keras LSTM Layer implementation? - python

How do the input dimensions get converted to the output dimensions for the LSTM layer in Keras? From reading Colah's blog post, it seems as though the number of "timesteps" (AKA the input_dim or the first value in the input_shape) should equal the number of neurons, which should equal the number of outputs from this LSTM layer (delineated by the units argument for the LSTM layer).
From reading this post, I understand the input shapes. What I am baffled by is how Keras plugs the inputs into each of the LSTM "smart neurons".
Keras LSTM reference
Example code that baffles me:
from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
model.add(Dense(2))
From this, I would think that the LSTM layer has 10 neurons and each neuron is fed a vector of length 64. However, it seems it has 32 neurons and I have no idea what is being fed into each. I understand that for the LSTM to connect to the Dense layer, we can just plug all 32 outputs to each of the 2 neurons. What confuses me is the InputLayer to the LSTM.
(similar SO post but not quite what I need)

Revisited and updated in 2020: I was partially correct! The architecture is 32 neurons. The 10 represents the timestep value. Each neuron is being fed a 64 length vector (maybe representing a word vector), representing 64 features (perhaps 64 words that help identify a word) over 10 timesteps.
The 32 represents the number of neurons. It represents how many hidden states there are for this layer and also represents the output dimension (since we output a hidden state at the end of each LSTM neuron).
Lastly, the 32-dimensional output vector generated from the 32 neurons at the last timestep is then fed to a Dense layer of 2 neurons, which basically means plug the 32 length vector to both neurons, with weights on the input and activation.
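To make the data flow concrete, here is a rough plain-Python sketch of one LSTM layer walking over the sequence. It is not Keras's actual implementation: the weights are random placeholders, and the names lstm_forward, make_gate and run_gate are made up for illustration. It only shows that each of the 10 timesteps feeds a 64-length vector to all 32 units, and that the last hidden state is a 32-length vector.

```python
import math
import random


def lstm_forward(sequence, units, seed=0):
    """Run one LSTM layer over `sequence` and return the last hidden state.

    sequence: list of timesteps, each a list of `features` floats.
    Weights are random placeholders; only the shapes matter here.
    """
    rng = random.Random(seed)
    features = len(sequence[0])

    def make_gate():
        # One (weights, bias) pair per unit; the weights cover the input
        # vector followed by the previous hidden state.
        return [
            ([rng.uniform(-0.1, 0.1) for _ in range(features + units)], 0.0)
            for _ in range(units)
        ]

    gates = {name: make_gate() for name in ("input", "forget", "cell", "output")}

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def run_gate(gate, vec, activation):
        return [
            activation(sum(w * v for w, v in zip(weights, vec)) + bias)
            for weights, bias in gate
        ]

    h = [0.0] * units  # hidden state, one value per unit
    c = [0.0] * units  # cell (memory) state, one value per unit
    for x_t in sequence:
        z = x_t + h  # concatenation: [current input, previous hidden state]
        i = run_gate(gates["input"], z, sigmoid)
        f = run_gate(gates["forget"], z, sigmoid)
        g = run_gate(gates["cell"], z, math.tanh)
        o = run_gate(gates["output"], z, sigmoid)
        c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
        h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h


# 10 timesteps, each a 64-length feature vector, fed to 32 units.
sequence = [[random.random() for _ in range(64)] for _ in range(10)]
last_hidden = lstm_forward(sequence, units=32)
print(len(last_hidden))  # 32
```

Changing the number of timesteps changes how many times the loop runs, not the size of the weights or of the output.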
More reading with somewhat helpful answers:
Understanding Keras LSTMs
What exactly am I configuring when I create a stateful LSTM layer with N units
Initializing LSTM hidden states with Keras

I don't think you are right. Actually, the number of timesteps does not affect the number of parameters in an LSTM.
from keras.layers import LSTM
from keras.models import Sequential
time_step = 13
feature = 5
hidden_feature = 10
# First model: 13 timesteps
model = Sequential()
model.add(LSTM(hidden_feature, input_shape=(time_step, feature)))
model.summary()
# Second model: same layer, but 100 timesteps
time_step = 100
model2 = Sequential()
model2.add(LSTM(hidden_feature, input_shape=(time_step, feature)))
model2.summary()
The result:
Using TensorFlow backend.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 10) 640
=================================================================
Total params: 640
Trainable params: 640
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_2 (LSTM) (None, 10) 640
=================================================================
Total params: 640
Trainable params: 640
Non-trainable params: 0
_________________________________________________________________
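The 640 in both summaries can be reproduced by hand. A small sketch, assuming the standard LSTM layout with four gates, each having an input kernel, a recurrent kernel and a bias (lstm_param_count is a helper name made up here, not a Keras function):

```python
def lstm_param_count(features, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    # Per gate: input kernel + recurrent kernel + bias.
    per_gate = units * features + units * units + units
    return 4 * per_gate  # input, forget, cell and output gates


for time_step in (13, 100):
    # time_step is deliberately unused: it never enters the formula.
    print(time_step, lstm_param_count(features=5, units=10))
```

Both lines print 640, matching the two summaries above.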

@Sticky, you are wrong in your interpretation.
The input tensor has shape (batch_size, sequence_length/timesteps, feature_size), and input_shape gives the last two of these. So each input sample is 10x64 (like 10 words with 64 features each, just like a word embedding). The 32 is the number of neurons, which makes the output vector size 32.
The output will have shape structure:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of "units".
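As a quick illustration of the two cases, here is a tiny shape helper. It is only bookkeeping written for this answer (lstm_output_shape is a hypothetical name, not part of Keras):

```python
def lstm_output_shape(input_shape, units, return_sequences=False):
    """Output shape of an LSTM layer for a (batch, steps, features) input."""
    batch, steps, _features = input_shape
    if return_sequences:
        return (batch, steps, units)  # one hidden state per timestep
    return (batch, units)  # only the last hidden state


print(lstm_output_shape((None, 10, 64), units=32))
# (None, 32)
print(lstm_output_shape((None, 10, 64), units=32, return_sequences=True))
# (None, 10, 32)
```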

Related

Understanding the structure of my LSTM model

I'm trying to solve the following problem:
I have time series data from a number of devices. Each device recording is of length 3000. Every datapoint captured has 4 measurements, so my data is shaped (number of device recordings, 3000, 4).
I'm trying to produce a vector of length 3000 where each data point is one of 3 labels (y1, y2, y3), so my desired output dim is (number of device recordings, 3000, 1). I have labeled data for training.
I'm trying to use an LSTM model for this, as 'classification as I move along time series data' seems like an RNN type of problem.
I have my network set up like this:
from keras.models import Sequential
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(3, input_shape=(3000, 4), return_sequences=True))
model.add(LSTM(3, activation = 'softmax', return_sequences=True))
model.summary()
and the summary looks like this:
Model: "sequential_23"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_29 (LSTM) (None, 3000, 3) 96
_________________________________________________________________
lstm_30 (LSTM) (None, 3000, 3) 84
=================================================================
Total params: 180
Trainable params: 180
Non-trainable params: 0
_________________________________________________________________
All looks good and well in the output space, as I can use the result from each unit to determine which of my three categories belongs to that particular time step (I think).
But I only have 180 trainable parameters, so I'm guessing that I am doing something horribly wrong.
Can someone help me understand why I have so few trainable parameters? Am I misinterpreting how to set up this LSTM? Am I just worrying over nothing?
Do those 3 units mean I only have 3 LSTM 'blocks', and that it can only look back 3 observations?
From a simplistic viewpoint, you can consider an LSTM layer as an augmented Dense layer with a memory (hence enabling efficient processing of sequences). So the concept of "units" is also the same for both: the number of neurons or feature units of these layers, or in other words, the number of distinctive features these layers can extract from the input.
Therefore, when you specify the number of units to 3 for the LSTM layer, more or less it means that this layer can only extract 3 distinctive features from the input timesteps (note that the number of units has nothing to do with the length of input sequence, i.e. the entire input sequence will be processed by the LSTM layer no matter what the number of units or the length of input sequence is).
Usually, this might be sub-optimal (though, it really depends on the difficulty of the specific problem and dataset you are working on; i.e. maybe 3 units might be enough for your problem/dataset, and you should experiment to find out). Therefore, often a higher number is chosen for the number of units (common choices: 32, 64, 128, 256), and also the classification task is delegated to a dedicated Dense layer (or sometimes called "softmax layer") at the top of the model.
For example, considering the description of your problem, a model with 3 stacked LSTM layers and a Dense classification layer at the top might look like this:
from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(3000, 4)))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(Dense(3, activation = 'softmax'))
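As for why the original model has only 180 parameters: the count depends on the number of input features and units, not on the 3000 timesteps. A back-of-the-envelope sketch, assuming the standard four-gate LSTM layout with biases (the helper names are made up for this answer):

```python
def lstm_params(features, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (features + units) + units)


def dense_params(inputs, units):
    # Dense acts on the last axis only: one kernel plus one bias per unit.
    return inputs * units + units


# The two-layer model from the question: LSTM(3) on 4 features, then LSTM(3).
question_layers = [lstm_params(4, 3), lstm_params(3, 3)]
print(question_layers, sum(question_layers))  # [96, 84] 180

# The stacked model suggested above.
suggested_layers = [
    lstm_params(4, 64),
    lstm_params(64, 64),
    lstm_params(64, 32),
    dense_params(32, 3),
]
print(suggested_layers, sum(suggested_layers))
```

So 180 is simply what two 3-unit layers add up to; it is small because the layers are narrow, not because something is set up wrongly.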

how does input_shape in keras.applications work?

I have been through the Keras documentation but I am still unable to figure out how the input_shape parameter works and why it does not change the number of parameters for my DenseNet model when I pass it my custom input shape. An example:
import keras
from keras import applications
from keras.layers import Conv3D, MaxPool3D, Flatten, Dense
from keras.layers import Dropout, Input, BatchNormalization
from keras import Model
# define model 1
INPUT_SHAPE = (224, 224, 1) # used to define the input size to the model
n_output_units = 2
activation_fn = 'sigmoid'
densenet_121_model = applications.densenet.DenseNet121(include_top=False, weights=None, input_shape=INPUT_SHAPE, pooling='avg')
inputs = Input(shape=INPUT_SHAPE, name='input')
model_base = densenet_121_model(inputs)
output = Dense(units=n_output_units, activation=activation_fn)(model_base)
model = Model(inputs=inputs, outputs=output)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) (None, 224, 224, 1) 0
_________________________________________________________________
densenet121 (Model) (None, 1024) 7031232
_________________________________________________________________
dense_1 (Dense) (None, 2) 2050
=================================================================
Total params: 7,033,282
Trainable params: 6,949,634
Non-trainable params: 83,648
_________________________________________________________________
# define model 2
INPUT_SHAPE = (512, 512, 1) # used to define the input size to the model
n_output_units = 2
activation_fn = 'sigmoid'
densenet_121_model = applications.densenet.DenseNet121(include_top=False, weights=None, input_shape=INPUT_SHAPE, pooling='avg')
inputs = Input(shape=INPUT_SHAPE, name='input')
model_base = densenet_121_model(inputs)
output = Dense(units=n_output_units, activation=activation_fn)(model_base)
model = Model(inputs=inputs, outputs=output)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) (None, 512, 512, 1) 0
_________________________________________________________________
densenet121 (Model) (None, 1024) 7031232
_________________________________________________________________
dense_2 (Dense) (None, 2) 2050
=================================================================
Total params: 7,033,282
Trainable params: 6,949,634
Non-trainable params: 83,648
_________________________________________________________________
Ideally with an increase in the input shape the number of parameters should increase, however as you can see they stay exactly the same. My questions are thus:
Why does the number of parameters not change with a change in the input_shape?
I have only defined one channel in my input_shape, what would happen to my model training in this scenario? The documentation says the following:
input_shape: optional shape tuple, only to be specified if include_top
is False (otherwise the input shape has to be (224, 224, 3) (with
'channels_last' data format) or (3, 224, 224) (with 'channels_first'
data format). It should have exactly 3 inputs channels, and width and
height should be no smaller than 32. E.g. (200, 200, 3) would be one
valid value.
However when I run the model with this configuration it runs without any problems. Could there be something that I am missing out?
Using Keras 2.2.4 with Tensorflow 1.12.0 as backend.
1. In the convolutional layers the input size does not influence the number of weights, because the number of weights is determined by the kernel matrix dimensions. A larger input size leads to a larger output size, but not to an increasing number of weights.
This means that the output size of the convolutional layers of the second model will be larger than for the first model, which would increase the number of weights in a following dense layer. However, if you take a look at the architecture of DenseNet you notice that there's a global average pooling layer (GlobalAveragePooling2D, selected here by pooling='avg') after all the convolutional layers, which averages all the values for each output channel. That's why the output of DenseNet will be of size 1024, whatever the input shape.
2. Yes, the model will still work. I'm not entirely sure about that, but my guess is that the single channel will be broadcast (duplicated) to fill all three channels. That's at least how these things are usually handled (see for example tensorflow or numpy).
The DenseNet is composed of two parts, the convolution part, and the global pooling part.
The number of the convolution part's trainable weights doesn't depend on the input shape.
Usually, a classification network should employ fully connected layers to infer the classification, however, in DenseNet, global pooling is used and doesn't bring any trainable weights.
Therefore, the input shape doesn't affect the number of weights of the entire network.
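A small plain-Python sketch of both points, using a made-up single convolution (3x3 kernel, 4 filters, 1 input channel) rather than the real DenseNet layers: the weight count only involves the kernel size and channel counts, while the spatial output size follows the input, and global average pooling then collapses each channel to one number.

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Weights of a 2D convolution; the image height/width never appear."""
    weights = kernel_size * kernel_size * in_channels * filters
    return weights + (filters if use_bias else 0)


def conv2d_output_size(input_size, kernel_size, stride=1):
    """Spatial output size of a 'valid' (unpadded) convolution."""
    return (input_size - kernel_size) // stride + 1


def global_average_pool(feature_maps):
    """Collapse each channel's 2D map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in channel) / sum(len(row) for row in channel)
        for channel in feature_maps
    ]


results = []
for side in (224, 512):
    out = conv2d_output_size(side, kernel_size=3)
    params = conv2d_param_count(kernel_size=3, in_channels=1, filters=4)
    # Stand-in feature maps: 4 channels of size out x out.
    maps = [[[1.0] * out for _ in range(out)] for _ in range(4)]
    pooled = global_average_pool(maps)
    results.append((side, out, params, len(pooled)))
    print(side, out, params, len(pooled))
```

The feature maps grow from 222x222 to 510x510, but the parameter count (40) and the pooled vector length (4) stay the same, so a Dense layer placed after the pooling sees the same input size in both cases.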

What strategy should I use in my CNN to go from a 3D volume to a 2D plane?

What strategy should I use in my CNN to go from a 3D volume to a 2D plane as the output layer? Can I even have a 2D layer as output?
I am trying to develop a network whose input is a 320x320x3 image and whose output should be 68x2.
I know one way to do it would be to start from 320x320x3 and, after a few layers, flatten my 3D layers and then shorten the result down to a 1D array of 136. But I am trying to understand if I could somehow go down to a desired 2D dimension at the final layer.
Thanks,
Shubham
Edit: I might have misread your question initially. If your intention is to have 136 output nodes that can be arranged in a 68x2 matrix (and not to have a 68x68x2 image in the output, as I thought at first), then you can use a Reshape layer after your final dense layer with 136 units:
import keras
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense, Reshape
model = Sequential()
model.add(Conv2D(32, 3, input_shape=(320, 320, 3)))
model.add(Flatten())
model.add(Dense(136))
model.add(Reshape((68, 2)))
model.summary()
This will give you the following model, with the desired shape in the output:
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 318, 318, 32) 896
_________________________________________________________________
flatten_2 (Flatten) (None, 3235968) 0
_________________________________________________________________
dense_2 (Dense) (None, 136) 440091784
_________________________________________________________________
reshape_1 (Reshape) (None, 68, 2) 0
=================================================================
Total params: 440,092,680
Trainable params: 440,092,680
Non-trainable params: 0
Make sure to provide your training labels in the same shape when fitting the model.
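If your labels are currently stored as flat 136-length vectors, a row-major regrouping like the sketch below puts them in the same (68, 2) layout that the Reshape layer produces. This is plain Python for illustration; to_rows is a made-up helper, and the assumption that values are stored pair by pair (first row's 2 values, then the second row's, and so on) is mine, not something stated in the question.

```python
def to_rows(flat, rows=68, cols=2):
    """Regroup a flat list into `rows` lists of `cols` values (row-major)."""
    if len(flat) != rows * cols:
        raise ValueError(f"expected {rows * cols} values, got {len(flat)}")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


flat_label = list(range(136))  # stand-in for one flat training label
label = to_rows(flat_label)
print(len(label), len(label[0]))  # 68 2
print(label[0], label[-1])  # [0, 1] [134, 135]
```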
(original answer, might still be relevant)
Yes, this is commonly done in semantic segmentation models, where the inputs are images and the outputs are tensors of the same height and width of the images, and with the number of channels equal to the number of classes in the output. If you want to do this in TensorFlow or Keras, you can look up existing implementations, for instance of U-Net architectures.
A core feature of these models is that these networks are fully convolutional: they only consist of convolutional layers. Typically, the feature maps in these models go from 'wide and shallow' (big feature maps in the spatial dimensions with few channels) at first, to 'small and deep' (small spatial dimensions, high-dimensional channel dimension) and back to the desired output dimension. Hence the U shape.
There are a lot of ways to go from 320x320x3 to 68x2 with a fully convolutional network, but the input and output of your model would basically look like this:
import keras
from keras import Sequential
from keras.layers import Conv2D
model = Sequential()
model.add(Conv2D(32, 3, activation='relu', input_shape=(320,320,3)))
# Include more convolutional layers, pooling layers, upsampling layers etc
...
# At the end of the model, add your final Conv2D layer with 2 filters
# and the required activation function
model.add(Conv2D(2, 3, activation='softmax'))

Keras Dense layer's input is not flattened

This is my test code:
from keras import layers
input1 = layers.Input((2,3))
output = layers.Dense(4)(input1)
print(output)
The output is:
<tf.Tensor 'dense_2/add:0' shape=(?, 2, 4) dtype=float32>
But what happened?
The documentation says:
Note: if the input to the layer has a rank greater than 2, then it is
flattened prior to the initial dot product with kernel.
So why is the output not flattened?
Currently, contrary to what is stated in the documentation, the Dense layer is applied on the last axis of the input tensor:
Contrary to the documentation, we don't actually flatten it. It's
applied on the last axis independently.
In other words, if a Dense layer with m units is applied on an input tensor of shape (n_dim1, n_dim2, ..., n_dimk), it would have an output shape of (n_dim1, n_dim2, ..., m); only the last axis, n_dimk, is replaced by m.
As a side note: this makes TimeDistributed(Dense(...)) and Dense(...) equivalent to each other.
Another side note: be aware that this has the effect of shared weights. For example, consider this toy network:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(10, input_shape=(20, 5)))
model.summary()
The model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20, 10) 60
=================================================================
Total params: 60
Trainable params: 60
Non-trainable params: 0
_________________________________________________________________
As you can see the Dense layer has only 60 parameters. How? Each unit in the Dense layer is connected to the 5 elements of each row in the input with the same weights, therefore 10 * 5 + 10 (bias params per unit) = 60.
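The weight sharing can be sketched in plain Python. This is an illustration with random placeholder weights, not what Keras runs internally, and dense_last_axis is a made-up name: the same 5x10 kernel and 10 biases are reused for each of the 20 rows.

```python
import random


def dense_last_axis(rows, kernel, bias):
    """Apply one Dense layer to every row independently, reusing the weights."""
    return [
        [
            sum(x * kernel[i][j] for i, x in enumerate(row)) + bias[j]
            for j in range(len(bias))
        ]
        for row in rows
    ]


rng = random.Random(0)
in_dim, units, n_rows = 5, 10, 20
kernel = [[rng.uniform(-1, 1) for _ in range(units)] for _ in range(in_dim)]
bias = [0.0] * units
rows = [[rng.random() for _ in range(in_dim)] for _ in range(n_rows)]

output = dense_last_axis(rows, kernel, bias)
param_count = in_dim * units + units  # independent of n_rows
print(len(output), len(output[0]), param_count)  # 20 10 60
```

Because every row goes through the same kernel, two identical rows always produce identical outputs, wherever they sit in the input.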

Keras Sequential Dense input layer - and MNIST: why do images need to be reshaped?

I'm asking this because I feel I'm missing something fundamental.
By now almost everyone knows that the MNIST images are 28x28 pixels. The Keras documentation tells me this about Dense:
Input shape nD tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).
So a newbie like me would assume that the images could be fed to the model as a 28*28 matrix. Yet every tutorial I found goes through various gymnastics to convert the images to a single 784-long feature.
Sometimes by
num_pixels = X_train.shape[1] * X_train.shape[2]
model.add(Dense(num_pixels, input_dim=num_pixels, activation='...'))
or
num_pixels = np.prod(X_train.shape[1:])
model.add(Dense(512, activation='...', input_shape=(num_pixels,)))
or
model.add(Dense(units=10, input_dim=28*28, activation='...'))
history = model.fit(X_train.reshape((-1,28*28)), ...)
or even:
model = Sequential([Dense(32, input_shape=(784,)), ...])
So my question is simply - why? Can't Dense just accept an image as-is or, if necessary, just process it "behind the scenes", as it were? And if, as I suspect, this processing has to be done, is any of these methods (or others) inherently preferable?
As requested by OP (i.e. Original Poster), I will mention the answer I gave in my comment and elaborate more.
Can't Dense just accept an image as-is or, if necessary, just process
it "behind the scenes", as it were?
Simply no! That's because currently the Dense layer is applied on the last axis. Therefore, if you feed it an image of shape (height, width) or (height, width, channels), Dense layer would be only applied on the last axis (i.e. width or channels). However, when the image is flattened, all the units in the Dense layer would be applied on the whole image and each unit is connected to all the pixels with different weights. To further clarify this consider this model:
from keras import layers, models
model = models.Sequential()
model.add(layers.Dense(10, input_shape=(28*28,)))
model.summary()
Model summary:
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 10) 7850
=================================================================
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
_________________________________________________________________
As you can see, there are 7850 parameters in the Dense layer: each unit is connected to all the pixels (28*28*10 + 10 bias params = 7850). Now consider this model:
model = models.Sequential()
model.add(layers.Dense(10, input_shape=(28,28)))
model.summary()
Model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 28, 10) 290
=================================================================
Total params: 290
Trainable params: 290
Non-trainable params: 0
_________________________________________________________________
In this case there are only 290 parameters in the Dense layer. Here each unit in the Dense layer is connected to all the pixels as well, but the difference is that the weights are shared across the first axis (28*10 + 10 bias params = 290). It is as though the features are extracted from each row of the image compared to the previous model which extracted features across the whole image. And therefore this (i.e. weight sharing) may or may not be useful for your application.
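The two parameter counts can be checked by hand. A minimal sketch (the helper names are made up for this answer), where the only difference between the two cases is the size of the last axis that Dense sees:

```python
def dense_param_count(last_axis_size, units):
    """One weight per (input feature, unit) pair, plus one bias per unit."""
    return last_axis_size * units + units


def flatten(image):
    """Turn a list of rows into a single list of pixels."""
    return [pixel for row in image for pixel in row]


image = [[0.0] * 28 for _ in range(28)]  # stand-in for one MNIST image

flat_params = dense_param_count(len(flatten(image)), units=10)
row_params = dense_param_count(len(image[0]), units=10)
print(flat_params)  # 7850: each unit sees all 784 pixels
print(row_params)   # 290: each unit sees one 28-pixel row at a time
```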
