I am new to deep learning and working with Keras, so I want to know what Dense means when we have code like the one below.
I read https://keras.io/getting-started/sequential-model-guide/
and I also found explanations like: "Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True)",
which didn't help me much!
model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])
Another name for a dense layer is fully-connected layer: every neuron in it is connected to every neuron of the previous layer. It implements the operation output = X * W + b, where X is the input to the layer, and W and b are the weights and bias of the layer. W and b are the things you are actually trying to learn. If you want a more detailed explanation, please refer to this article.
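For instance, with the model from the question, the learned weight shapes look like this (a minimal sketch, using nothing beyond what the question already defines):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Same model as in the question
model = Sequential([
    Dense(32, input_shape=(784,)),  # W has shape (784, 32), b has shape (32,)
    Activation('relu'),
    Dense(10),                      # W has shape (32, 10), b has shape (10,)
    Activation('softmax'),
])

W, b = model.layers[0].get_weights()
print(W.shape, b.shape)  # (784, 32) (32,)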
A dense layer is a fully-connected layer, i.e. every neuron of layer N is connected to every neuron of layer N+1.
The code you wrote is not for an LSTM; it is a simple neural network of two fully connected layers, also known as dense layers. Here Sequential means that the output of one layer is passed directly to the next layer; it is not sequence learning like an LSTM.
I have a question regarding one-dimensional convolutional neural networks (1D CNNs).
Can we have a dense layer between Conv layers in the architecture, just like what I have done in the following example?
Note: It is working correctly with CSV files for classification problems.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout

model = Sequential()
# First Convolutional Layer
model.add(Conv1D(128, 5, input_shape=(20, 1), strides=2, padding='same'))
model.add(Dense(256, activation="relu"))
model.add(MaxPooling1D())
# Second Convolutional Layer
model.add(Conv1D(128, 3, strides=1, padding='same'))
model.add(Dense(64, activation="relu"))
model.add(MaxPooling1D())
# Passing to Fully Connected Layers
model.add(Flatten())
model.add(Dense(32, activation='relu'))
# model.add(Dropout(0.02))
# Output Layer
model.add(Dense(2, activation='sigmoid'))
# Model Compilation
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# Summary of the Model
model.summary()
Thank you very much!
Yes, you can certainly do that. It is not usual at all and not very advisable from a theoretical perspective, but it is possible.
Why is it not advisable (in theory)? With convolutions one tries to capture spatial features (i.e. information): values next to each other should have an influence, while values far away from that point (in time, in the case of time series data) should have less influence. That is the whole idea of CNNs. To a fully connected NN, the order in which the input is presented does not matter; it looks at all inputs at the same time since it is equally connected to all of them. So you lose the spatial information. By the way, that is also the reason why it is plausible to do a global pooling before feeding the output of the CNN part of a model to the fully-connected part (i.e. the dense layers).
Now if you do convolution, you care about spatial information. If you then apply a dense layer, you kind of say "I cared enough about the spatial info". If you then apply convolution again on the output vector of a dense layer it becomes totally irrational.
Feasibility
Nonetheless, such a network would be feasible. You would just need to make sure that the dense layer outputs a vector (or matrix) again, on which you can apply convolution.
However, your code lacks a proper adapter from the output of the convolution layer to the dense layer. You should apply some type of global pooling operation to create a vector that serves as the input to the dense layer. That would also save you the Flatten() step. Again, it should work your way anyway; it is just about style, since right now you are sending mixed signals: Flatten concatenates all spatial positions, but the NN then ignores the spatial information...
I don't get the point of applying MaxPooling1D after the Dense layer. One could simply reduce the number of outputs of the Dense layer. And you definitely don't need a second Flatten after a Dense layer, as it returns a vector by definition (and pooling won't add a dimension to it).
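As a rough sketch of that suggestion, a purely convolutional variant of your network could look like the code below (same input shape as your example; I swapped the final sigmoid for a softmax since you compile with sparse_categorical_crossentropy, but that is my assumption about your intent, not a requirement):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

model = Sequential()
model.add(Conv1D(128, 5, input_shape=(20, 1), strides=2, padding='same', activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(128, 3, strides=1, padding='same', activation='relu'))
# Global pooling over the time axis yields a plain vector, so no Flatten is needed
model.add(GlobalAveragePooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])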
What is the difference between the activation kwarg and the Activation layer in TensorFlow?
Here's an example.
activation kwarg:
model.add(Dense(64,activation="relu"))
Activation layer:
model.add(Dense(64))
model.add(Activation("relu"))
PS: I'm new to TensorFlow.
In Dense(64, activation="relu"), the relu activation function becomes part of the Dense layer and will be called automatically whenever this Dense layer is called.
In Activation("relu"), the relu activation function is a layer itself and is decoupled from the Dense layer. This is necessary if you want a reference to the tensor after Dense but before the activation for, say, branching purposes:
from tensorflow.keras.layers import Input, Dense, Activation
from tensorflow.keras.models import Model

input_tensor = Input((10,))
intermediate_tensor = Dense(64)(input_tensor)
branch_1_tensor = Activation('relu')(intermediate_tensor)
branch_2_tensor = Dense(64)(intermediate_tensor)
final_tensor = branch_1_tensor + branch_2_tensor
model = Model(inputs=input_tensor, outputs=final_tensor)
However, your model is a Sequential model, so your two samples are effectively equal: the relu activation function will be called automatically. To obtain a reference to the tensor before the Activation in this case, you can go through model.layers and get the output of the Dense layer from there.
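For example, a minimal sketch of that last point (the layer sizes are taken from your samples; the input shape is made up):
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Activation

model = Sequential([
    Dense(64, input_shape=(10,)),
    Activation('relu'),
])

# The tensor after Dense but before Activation, pulled out of the built Sequential model
pre_activation = model.layers[0].output
probe = Model(inputs=model.inputs, outputs=pre_activation)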
Consider transfer learning, i.e. using a pretrained model in Keras/TensorFlow. For each old layer, the trainable parameter is set to False so that its weights are not updated during training, whereas the last layer(s) have been substituted with new layers that must be trained. In particular, two fully connected hidden layers with 512 and 1024 neurons and a relu activation function have been added. After these layers a Dropout layer is used with rate 0.2, which means that during each training step 20% of the neurons are randomly discarded.
Which layers does this Dropout layer affect? Does it affect the whole network, including the pretrained layers for which layer.trainable = False has been set, or does it affect only the newly added layers? Or does it affect only the previous layer (i.e., the one with 1024 neurons)?
In other words, which layer(s) do the neurons that are turned off by the dropout at each step belong to?
import os
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.optimizers import RMSprop

local_weights_file = 'weights.h5'

pre_trained_model = InceptionV3(input_shape=(150, 150, 3),
                                include_top=False,
                                weights=None)
pre_trained_model.load_weights(local_weights_file)

for layer in pre_trained_model.layers:
    layer.trainable = False

# pre_trained_model.summary()

last_layer = pre_trained_model.get_layer('mixed7')
last_output = last_layer.output

# Flatten the output layer to 1 dimension
x = layers.Flatten()(last_output)
# Add two fully connected layers with 512 and 1,024 hidden units and ReLU activation
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(1024, activation='relu')(x)
# Add a dropout rate of 0.2
x = layers.Dropout(0.2)(x)
# Add a final sigmoid layer for classification
x = layers.Dense(1, activation='sigmoid')(x)

model = Model(pre_trained_model.input, x)
model.compile(optimizer=RMSprop(lr=0.0001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
The dropout layer will affect the output of the previous layer.
If we look at the specific part of your code:
x = layers.Dense(1024, activation='relu')(x)
# Add a dropout rate of 0.2
x = layers.Dropout(0.2)(x)
# Add a final sigmoid layer for classification
x = layers.Dense(1, activation='sigmoid')(x)
In your case, 20% of the output of the layer defined by x = layers.Dense(1024, activation='relu')(x) will be dropped at random, before being passed to the final Dense layer.
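A quick way to see this behaviour (a standalone sketch, not part of your model) is to call a Dropout layer in training mode:
import tensorflow as tf

x = tf.ones((1, 8))                  # one sample with 8 "neuron" outputs
drop = tf.keras.layers.Dropout(0.2)
print(drop(x, training=True))        # some entries are zeroed at random, the rest are scaled by 1/(1 - 0.2)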
Only the previous layer's neurons are "turned off", but all layers are "affected" in terms of backprop.
Later layers: Dropout's output is input to the next layer, so next layer's outputs will change, and so will next-next's, etc.
Previous layers: as the "effective output" of the pre-Dropout layer is changed, so will gradients to it, and thus any subsequent gradients. In the extreme case of Dropout(rate=1), zero gradient will flow.
Also, note that whole neurons are only dropped if input to Dense is 2D (batch_size, features); Dropout applies a random uniform mask to all dimensions (equivalent to dropping whole neurons in 2D case). To drop whole neurons, set Dropout(.2, noise_shape=(batch_size, 1, features)) (3D case). To drop same neurons across all samples, use noise_shape=(1, 1, features) (or (1, features) for 2D).
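A sketch of the noise_shape point, with made-up 3D shapes (a batch of 2 samples, 4 timesteps, 8 features):
import tensorflow as tf

x = tf.ones((2, 4, 8))  # (batch_size, timesteps, features)
# Drop whole feature "neurons" per sample: the mask is broadcast across the time axis
drop = tf.keras.layers.Dropout(0.2, noise_shape=(2, 1, 8))
print(drop(x, training=True))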
The dropout technique is not applied to every single layer within a neural network; it is commonly used on the neurons of the last few layers of the network.
The technique works by randomly reducing the number of interconnecting neurons within a neural network: at every training step, each neuron has a chance of being left out, or rather, dropped out of the collated contribution from connected neurons.
There is some debate as to whether the dropout should be placed before or after the activation function. As a rule of thumb, place the dropout after the activation function for all activation functions other than relu.
You can add dropout after every hidden layer, and generally it affects only the previous layer (in your case it will affect x = layers.Dense(1024, activation='relu')(x)). In the original paper that proposed dropout layers, by Hinton (2012), dropout (with p=0.5) was used on each of the fully connected (dense) layers before the output; it was not used on the convolutional layers. This became the most commonly used configuration.
I am adding some resource links that might help you:
https://towardsdatascience.com/understanding-and-implementing-dropout-in-tensorflow-and-keras-a8a3a02c1bfa
https://towardsdatascience.com/dropout-on-convolutional-layers-is-weird-5c6ab14f19b2
https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab
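As a minimal sketch of that commonly used configuration (the layer sizes and input shape here are hypothetical), it amounts to a dropout layer after each fully connected layer before the output:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='relu', input_shape=(2048,)),  # hypothetical feature size
    Dropout(0.5),
    Dense(1024, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),  # hypothetical number of classes
])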
I was just wondering if there is any significant difference between the use and purpose of
Dense(activation='relu')
and
keras.layers.ReLU
How and where can the latter one be used? My best guess is in a Functional API use case, but I don't know how.
Creating a layer instance and passing the activation as a parameter, i.e. activation='relu', is the same as creating that layer instance and then adding an activation layer, e.g. a ReLU instance. ReLU() is a layer which applies K.relu() to its inputs:
class ReLU(Layer):
    .
    .
    .
    def call(self, inputs):
        return K.relu(inputs,
                      alpha=self.negative_slope,
                      max_value=self.max_value,
                      threshold=self.threshold)
From the Keras documentation:
Usage of activations
Activations can either be used through an Activation layer, or through the activation argument supported by all forward layers:
from keras.layers import Activation, Dense
model.add(Dense(64))
model.add(Activation('tanh'))
This is equivalent to:
model.add(Dense(64, activation='tanh'))
You can also pass an element-wise TensorFlow/Theano/CNTK function as an activation:
from keras import backend as K
model.add(Dense(64, activation=K.tanh))
Update:
Answering the OP's additional question (How and where can the latter one be used?):
You can use it after a layer which doesn't accept an activation parameter, e.g. tf.keras.layers.Add, tf.keras.layers.Subtract, etc., when you want a rectified output of such a layer as a result:
added = tf.keras.layers.Add()([x1, x2])
relu = tf.keras.layers.ReLU()(added)
The most obvious use case is when you need to put a ReLU without a Dense layer. For example, when implementing a ResNet, the design requires a ReLU activation after summing the residual connection, as shown here:
x = layers.add([x, shortcut])
x = layers.Activation('relu')(x)
return x
It is also useful when you want to put a BatchNormalization layer between the pre-activation of a Dense layer and the ReLU activation. When using a GlobalAveragePooling classifier (such as in the SqueezeNet architecture), you need to put a softmax activation after the GAP using Activation("softmax"), and there are no Dense layers in the network.
There are probably more cases, these are just samples.
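For instance, a rough sketch of the Dense -> BatchNormalization -> ReLU pattern mentioned above (the shapes and sizes are made up):
from tensorflow.keras import Input, Model, layers

inputs = Input((32,))
x = layers.Dense(64, use_bias=False)(inputs)  # pre-activation; the bias is redundant before BatchNorm
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = Model(inputs, outputs)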
I'm trying to get to grips with the basics of neural networks and am struggling to understand keras layers.
Take the following code from tensorflow's tutorials:
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
So this network has 3 layers? The first is just the 28*28 nodes representing the pixel values. The second is a hidden layer which takes weighted sums from the first, applies relu and then sends these to the 10 output nodes, which are softmaxed?
But then this model seems to require different inputs to the layers:
model = keras.Sequential([
layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
layers.Dense(64, activation=tf.nn.relu),
layers.Dense(1)
])
Why does the input layer now have both an input_shape and a value 64? I read that the first parameter specifies the number of nodes in the second layer, but that doesn't seem to fit with the code in the first example. Also, why does the input layer have an activation? Is this just relu-ing the values before they enter the network?
Also, with regard to activation functions, why are softmax and relu treated as alternatives? I thought relu applied to all the inputs of a single node, whereas softmax acted on the outputs of all the nodes across a layer?
Any help is really appreciated!
First example is from: https://www.tensorflow.org/tutorials/keras/basic_classification
Second example is from: https://www.tensorflow.org/tutorials/keras/basic_regression
Basically you have two types of API in Keras: the Sequential and the Functional API (https://keras.io/getting-started/sequential-model-guide/).
In the Sequential API you don't explicitly refer to an Input layer (https://keras.io/layers/core/#input).
That is why you need to add an input_shape argument to specify the dimensions of the data fed to the first layer. The 64 is still the number of units of that first Dense layer, not the size of the input.
More information: https://jovianlin.io/keras-models-sequential-vs-functional/
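A sketch of the same model in both styles (assuming 9 input features as a stand-in for len(train_dataset.keys())):
from tensorflow.keras import Input, Model, Sequential, layers

# Sequential API: no explicit Input layer, so the first Dense also carries input_shape
seq_model = Sequential([
    layers.Dense(64, activation='relu', input_shape=(9,)),  # 64 units, 9 input features
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])

# Functional API: the input is its own explicit layer
inputs = Input(shape=(9,))
x = layers.Dense(64, activation='relu')(inputs)  # still 64 units; the input size comes from `inputs`
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1)(x)
func_model = Model(inputs, outputs)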