This is code I picked up somewhere on the Internet.
It is a very simple GAN+CNN model, specifically the discriminator, written in Keras with Python 3.6.
It works fine, but there is something I don't understand.
from keras.models import Sequential
from keras.layers import Conv2D, LeakyReLU, Flatten, Dense, Dropout, Activation

def __init__(self):
    self.img_rows = 28
    self.img_cols = 28
    self.channels = 1

def build_discriminator(self):
    img_shape = (self.img_rows, self.img_cols, self.channels)
    model = Sequential()
    model.add(Conv2D(64, (5, 5), strides=(2, 2),
                     padding='same', input_shape=img_shape))
    model.add(LeakyReLU(0.2))
    model.add(Conv2D(128, (5, 5), strides=(2, 2)))
    model.add(LeakyReLU(0.2))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(LeakyReLU(0.2))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    return model
There are several activation functions in this model, but why aren't they all the same?
If the final output is 'sigmoid' here, wouldn't it be better for the other layers to use sigmoid as well?
Why is LeakyReLU used in the middle? Thanks.
I guess they didn't use sigmoid for the rest of the layers because sigmoid has a big problem with vanishing gradients in deep networks.
The reason is that the sigmoid function "flattens out" on both sides away from zero, so its gradient is small almost everywhere. Loosely speaking, by the chain rule the gradient at a layer far from the output is a product of the gradients of the layers above it, so the early layers tend to receive very small gradients and thus very small weight updates. If you have just a few sigmoid layers you might get away with it, but as soon as you chain several of them, they make the gradients unstable.
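To make that concrete, here is a tiny sketch of my own (not from the original answer): the derivative of the sigmoid is at most 0.25, so the factor by which a gradient shrinks across a chain of sigmoid layers falls off exponentially, while LeakyReLU keeps a slope of 1 on the positive side and a small constant (0.2 in the code above) on the negative side.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z = 0

# Rough back-of-the-envelope: the backpropagated gradient picks up one
# sigmoid-derivative factor per layer, so even in the best case (z = 0)
# it shrinks like 0.25 ** n_layers.
for n_layers in (1, 3, 5, 10):
    print(n_layers, sigmoid_grad(0.0) ** n_layers)
# 1 -> 0.25, 3 -> ~0.016, 5 -> ~0.001, 10 -> ~1e-06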
It's too complex for me to explain in full here, but if you want more detail, you can read about it in a chapter of an online book.
By the way, this book is really great and worth reading beyond that chapter. To understand the chapter, you will probably need to read chapter 1 of the book first if you don't already know how backpropagation works.
The output and hidden layer activation functions do not have to be identical. Hidden layer activations are part of the mechanism that learns features, so it's important that they do not suffer from vanishing gradients (as sigmoid does), while the output layer activation is determined by the output task, for example softmax for multi-class classification.
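As a small, hedged sketch of that separation (mine, not from the original answer), the same hidden stack can be paired with different output activations depending on the task; the layer sizes here are arbitrary:

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

def build_head(task, input_dim=100, n_classes=10):
    """Same hidden layers, different output activation per task."""
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim))
    model.add(LeakyReLU(0.2))            # hidden activation: avoids vanishing gradients
    if task == 'binary':                 # e.g. a GAN discriminator: real vs. fake
        model.add(Dense(1, activation='sigmoid'))
    elif task == 'multiclass':           # mutually exclusive classes
        model.add(Dense(n_classes, activation='softmax'))
    else:                                # regression: unbounded real-valued output
        model.add(Dense(1, activation='linear'))
    return model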
I am using kernel_initializer='normal' and optimizer='adam' to find an optimum regression solution. I am getting close to 0.94 accuracy on the training data. I would like to test a few other combinations of kernel_initializer, activation function and optimizer, but I am not sure which kernel_initializer and activation function work well for regression. Please suggest.
# create model
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
Well, it might not be a good idea. You're after a fairly small margin in performance, and by "fishing" for good results you're essentially exploiting your validation set as a training set, relying on small variations in the data to inform model design.
A few tips:
The Glorot initializer (the default) is usually the best. However, the difference is really small, especially in such a tiny model.
relu activation is helpful to fight vanishing gradients. With three layers in the model, you probably won't have that problem. Here it really depends on the nature of the data; even a linear activation might make sense.
For a regular regression (i.e. predicting a number, not binary output(s)), you probably want a linear activation on the output layer. It is the default, but it's better to make it explicit (see the sketch after these tips).
Other optimizers might improve the rate of convergence, but they usually don't improve the performance. Adam sounds like a reasonable choice - sgd will do the same but slower; ftrl works best on sparse data such as language input.
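For example, here is a hedged sketch of the model from the question with these defaults written out explicitly (the layer sizes are the ones in the question; everything else is illustrative, not a tuned recommendation):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# glorot_uniform is already the Keras default initializer; spelled out here for clarity
model.add(Dense(13, input_dim=13, kernel_initializer='glorot_uniform', activation='relu'))
model.add(Dense(6, kernel_initializer='glorot_uniform', activation='relu'))
# regression output: one unit with an explicit linear activation
model.add(Dense(1, activation='linear'))
# mean squared error is the standard built-in regression loss
model.compile(loss='mean_squared_error', optimizer='adam')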
I saw in the deep learning course by Andrew Ng a way to localize a single object in an image: https://www.youtube.com/watch?v=GSwYGkTfOKk .
As I understand it, you can, for example, bind a point to a specific part of the object, take its coordinates x, y as the label y, and train a CNN.
I wanted to train a CNN to localize my eyes (not classification). I took 200 photos of myself: 60x60 pixels in grayscale. I labeled the left and right eye, and each coordinate of a labeled eye was normalized to 0-1. The y label is: [x of eye1, y of eye1, x of eye2, y of eye2]. I used the SGD optimizer with MSE loss and a sigmoid function in the output layer.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(64, (3,3), input_shape= (60,60, 1)))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model.add(tf.keras.layers.Conv2D(32, (3,3)))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(4, activation='sigmoid'))
sgd= tf.keras.optimizers.SGD(lr = 0.01)
model.compile(loss = 'mean_squared_error', optimizer=sgd, metrics=['accuracy'])
model.fit(x,y, batch_size=3, epochs=15, validation_split=0.2)
It didn't work for this task, so what is the way to solve this problem? I saw somewhere the idea of applying a CNN to the image (I suppose without dense layers), then running linear regression for each x/y coordinate on the flattened data from the CNN (multivariate linear regression). Is this a solution? As I understand it, I would feed each image into the Conv and MaxPool layers, then Flatten, and then feed the data to the linear regression and train it, but I have no idea how to do this in Keras. I am new to this field, so any idea helps me.
First of all, a couple of observations with regard to your code.
Since the last layer contains more than 2 neurons, the activation function that you have to use is softmax, not sigmoid (note that this is in the case of classification, not regression).
You should only use sigmoid when you are doing binary classification, not when you have more than two classes (note that you can also use softmax for 2 classes, although it is not necessarily recommended because of a small computational overhead).
Your problem is both a regression and a classification one!
The first layer of your convolutional neural network contains 64 feature maps, each with a 3x3 kernel. Although the way you feed the images to your neural network is correct, you only feed the grayscale image, not the x1, x2, y1, y2 coordinates.
For an ANN with regression, take a look at this tutorial: https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/.
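If you just want to treat the four normalized coordinates as a plain regression target with your current architecture, a hedged sketch (my own adaptation, not taken from the tutorial; hyperparameters are illustrative) could look like this:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(60, 60, 1)),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    # regression head: 4 coordinates; linear output (sigmoid also works since labels are in 0-1)
    tf.keras.layers.Dense(4, activation='linear'),
])
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              metrics=['mae'])   # 'accuracy' is not meaningful for regression
# model.fit(x, y, batch_size=32, epochs=50, validation_split=0.2)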
Your intuition is correct; object detection neural networks replace fully connected layers with convolutional ones. Yann LeCun even states that fully connected layers should not be a part of CNNs.
Since you are new to this field, I would recommend adopting the following pipeline.
1) Find a model on GitHub written in your preferred deep learning library (Keras/PyTorch/TensorFlow etc.).
2) Follow the instructions/tutorial in order to reproduce the results obtained by that GitHub user.
3) While doing so, also work through the code until you have a good intuitive grasp of it.
4) Adapt the model to the problem you need to solve.
One place where you could start is here (this is object detection: detecting multiple objects, possibly of different categories): https://github.com/pierluigiferrari/ssd_keras.
If you have further questions, please write them down, I would be glad to be of assistance!
So I am working on the MNIST and Boston_Housing datasets using keras, and I was wondering how I would determine the optimal number of layers and activation functions for each layer.
Now, I am not asking what the optimal number of layers/activation functions are, but rather the process I should go through to determine these parameters.
I am evaluating my model using mean squared error and mean absolute error.
Here is what my current model looks like:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, kernel_initializer='glorot_uniform', activation='selu'))
model.add(layers.Dense(64,activation = 'softplus'))
model.add(layers.Dense(1))
model.compile(optimizer = 'rmsprop',
loss='mse',
metrics=['mae'])
I have a mean absolute error of 3.5 and a mean squared error of 27.
For choosing the activation function:
Modern neural networks mainly use ReLU or LeakyReLU in the hidden layers.
For classification, a softmax activation is used at the output layer.
For regression, a linear activation is used at the output layer.
For choosing the number of layers:
It depends entirely on your problem.
More layers are helpful when the data is complex, as they can approximate the function between the input and output more efficiently.
Sometimes, for smaller problems like MNIST, even a net with 2 hidden layers works well (a sketch applying these rules follows below).
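For instance, here is a hedged sketch applying those rules to the Boston Housing regression from the question (the layer widths are the ones already used there; the rest is illustrative):

from keras import models, layers

# ReLU in the hidden layers, a linear output for regression
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(13,)))  # 13 Boston Housing features
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='linear'))   # single real-valued prediction
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

# For MNIST classification, the output layer would instead be
# layers.Dense(10, activation='softmax') with loss='categorical_crossentropy'.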
When creating a Sequential model in Keras, I understand you provide the input shape in the first layer. Does this input shape then make an implicit input layer?
For example, the model below explicitly specifies 2 Dense layers, but is this actually a model with 3 layers consisting of one input layer implied by the input shape, one hidden dense layer with 32 neurons, and then one output layer with 10 possible outputs?
model = Sequential([
Dense(32, input_shape=(784,)),
Activation('relu'),
Dense(10),
Activation('softmax'),
])
Well, it actually is an implicit input layer indeed, i.e. your model is an example of a "good old" neural net with three layers - input, hidden, and output. This is more explicitly visible in the Keras Functional API (check the example in the docs), in which your model would be written as:
inputs = Input(shape=(784,)) # input layer
x = Dense(32, activation='relu')(inputs) # hidden layer
outputs = Dense(10, activation='softmax')(x) # output layer
model = Model(inputs, outputs)
Actually, this implicit input layer is the reason why you have to include an input_shape argument only in the first (explicit) layer of the model in the Sequential API - in subsequent layers, the input shape is inferred from the output of the previous ones (see the comments in the source code of core.py).
You may also find the documentation on tf.contrib.keras.layers.Input enlightening.
It depends on your perspective :-)
Rewriting your code in line with more recent Keras tutorial examples, you would probably use:
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=784))
model.add(Dense(10, activation='softmax'))
...which makes it much more explicit that you only have 2 Keras layers. And this is exactly what you do have (in Keras, at least) because the "input layer" is not really a (Keras) layer at all: it's only a place to store a tensor, so it may as well be a tensor itself.
Each Keras layer is a transformation that outputs a tensor, possibly of a different size/shape to the input. So while there are 3 identifiable tensors here (input, outputs of the two layers), there are only 2 transformations involved corresponding to the 2 Keras layers.
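One way to check this concretely (my own quick sketch, not part of the original answer) is to count model.layers in both versions; in recent Keras releases the Sequential model reports only the two Dense layers, while the functional Model also lists an explicit InputLayer:

from keras.models import Sequential, Model
from keras.layers import Dense, Input

seq = Sequential()
seq.add(Dense(32, activation='relu', input_dim=784))
seq.add(Dense(10, activation='softmax'))
print(len(seq.layers))    # 2 -- the "input layer" is not a Keras layer here

inputs = Input(shape=(784,))
x = Dense(32, activation='relu')(inputs)
outputs = Dense(10, activation='softmax')(x)
func = Model(inputs, outputs)
print(len(func.layers))   # typically 3 -- the functional API exposes the InputLayer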
On the other hand, graphically, you might represent this network with 3 (graphical) layers of nodes, and two sets of lines connecting the layers of nodes. Graphically, it's a 3-layer network. But "layers" in this graphical notation are bunches of circles that sit on a page doing nothing, whereas layers in Keras transform tensors and do actual work for you. Personally, I would get used to the Keras perspective :-)
Note finally that for fun and/or simplicity, I substituted input_dim=784 for input_shape=(784,) to avoid the syntax that Python uses to both confuse newcomers and create a 1-D tuple: (<value>,).
I'm training a neural net using Keras in Python for time-series climate data (predicting value X at time t=T), and tried adding a (20%) dropout layer on the inputs, which seemed to limit overfitting and cause a slight increase in performance. However, after I added a new and particularly useful feature (the value of the response variable at time of prediction t=0), I found massively increased performance by removing the dropout layer. This makes sense to me, since I can imagine how the neural net would "learn" the importance of that one feature and base the rest of its training around adjusting that value (i.e, "how do these other features affect how the response at t=0 changes by time t=T").
In addition, there are a few other features that I think should be present for all epochs. That said, I am still hopeful that a dropout layer could improve model performance; it just needs to not drop out certain features, like X at t_0. In other words, I need a dropout layer that will only drop out some of the features.
I have searched for examples of doing this, and read the Keras documentation here, but can't seem to find a way to do it. I may be missing something obvious, as I'm still not familiar with how to manually edit layers. Any help would be appreciated. Thanks!
Edit: sorry for any lack of clarity. Here is the code where I define the model (p is the number of features):
def create_model(p):
model = Sequential()
model.add(Dropout(0.2, input_shape=(p,))) # % of features dropped
model.add(Dense(1000, input_dim=p, kernel_initializer='normal'
, activation='sigmoid'))
model.add(Dense(30, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal',activation='linear'))
model.compile(loss=cost_fn, optimizer='adam')
return model
The best way I can think of applying dropout only to specific features is to simply separate the features in different layers.
For that, I suggest you simply divide your inputs in essential features and droppable features:
from keras.layers import *
from keras.models import Model
def create_model(essentialP,droppableP):
essentialInput = Input((essentialP,))
droppableInput = Input((droppableP,))
dropped = Dropout(0.2)(droppableInput) # % of features dropped
completeInput = Concatenate()([essentialInput, dropped])
output = Dense(1000, kernel_initializer='normal', activation='sigmoid')(completeInput)
output = Dense(30, kernel_initializer='normal', activation='relu')(output)
output = Dense(1, kernel_initializer='normal',activation='linear')(output)
model = Model([essentialInput,droppableInput],output)
model.compile(loss=cost_fn, optimizer='adam')
return model
Train the model using two inputs. You have to manage your inputs before training:
model.fit([essential_train_data,droppable_train_data], predictions, ...)
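For completeness, here is a hedged sketch of what that input management could look like, assuming create_model (and the asker's cost_fn) defined above are in scope; the column indices and toy data are made up for illustration:

import numpy as np

# Toy stand-ins for the real data (only the shapes matter here)
n_samples, p = 500, 10
X = np.random.rand(n_samples, p)
y = np.random.rand(n_samples)

essential_cols = [0, 3]   # e.g. the column holding X at t_0, plus another must-keep feature
droppable_cols = [c for c in range(p) if c not in essential_cols]

essential_train_data = X[:, essential_cols]
droppable_train_data = X[:, droppable_cols]

model = create_model(len(essential_cols), len(droppable_cols))
model.fit([essential_train_data, droppable_train_data], y,
          epochs=10, batch_size=32, validation_split=0.2)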
I don't see any harm in using dropout in the input layer. The usage and effect would be a little different than normal, of course. The effect would be similar to adding synthetic noise to an input signal, except that the feature/pixel/whatever would be entirely unknown (zeroed out) instead of noisy. And injecting synthetic noise into the input is one of the oldest ways to improve robustness; it is certainly not bad practice as long as you think about whether it makes sense for your data set.
This question already has an accepted answer, but it seems to me you are using dropout in a bad way.
Dropout is only for the hidden layers, not for the input layer!
Dropout acts as a regularizer and prevents complex co-adaptation in the hidden layers; quoting the Hinton paper: "Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging" (http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf).
Dropout can be seen as training several different models on your data and averaging their predictions at test time. If you prevent your models from seeing all the inputs during training, they will perform badly, especially if one input is crucial. What you actually want is to avoid overfitting, meaning you prevent overly complex models during the training phase (so each of your models will select the most important features first) before testing.
It is common practice to drop some of the features in ensemble learning, but there it is controlled, not stochastic like dropout. It also works for neural networks because hidden layers (often) have far more neurons than there are inputs, so dropout follows the law of large numbers, whereas with a small number of inputs you can, in an unlucky case, have almost all of your inputs dropped.
In conclusion: it is bad practice to use dropout in the input layer of a neural network.