How to determine optimal number of layers and activation function(s) - python

So I am working on the MNIST and Boston_Housing datasets using keras, and I was wondering how I would determine the optimal number of layers and activation functions for each layer.
Now, I am not asking what the optimal number of layers/activation functions are, but rather the process I should go through to determine these parameters.
I am evaluating my model using mean squared error and mean absolute error.
Here is what my current model looks like:
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu'))
# 'init' is the old Keras 1 argument name; in Keras 2 it is 'kernel_initializer',
# and the activation can be passed as a plain string.
model.add(layers.Dense(64, kernel_initializer='glorot_uniform', activation='selu'))
model.add(layers.Dense(64, activation='softplus'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mae'])
I get a mean absolute error of 3.5 and a mean squared error of 27.

For choosing the activation function:
Modern neural networks mainly use ReLU or LeakyReLU in the hidden layers.
For classification, a softmax activation is used at the output layer.
For regression, a linear activation is used at the output layer.
For choosing the number of layers:
It depends entirely on your problem.
More layers help when the data is complex, since a deeper network can approximate the function between input and output more efficiently.
Sometimes, for smaller problems like MNIST, even a net with two hidden layers works well, as in the sketch below.
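As a minimal sketch of those heuristics (the layer widths, depth, and input shapes here are illustrative assumptions, not recommendations), this is what the two output-layer choices look like in Keras:
from tensorflow import keras
from tensorflow.keras import layers

# Regression (e.g. Boston Housing): ReLU hidden layers, linear output, MSE loss.
reg_model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(13,)),  # 13 = Boston feature count
    layers.Dense(64, activation='relu'),
    layers.Dense(1)                          # linear activation by default
])
reg_model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

# Classification (e.g. MNIST): ReLU hidden layers, softmax output, cross-entropy loss.
clf_model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,)),  # 28*28 flattened pixels
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')   # one probability per digit class
])
clf_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])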

Related

Tensorflow Model Underpredicts Values with Dropout

I am having a problem implementing dropout as a regularization method in my dense NN model. It appears that adding any dropout value above 0 just scales down the predicted value, which makes me think something is not being accounted for correctly after individual weights are set to zero. I'm sure I am implementing something incorrectly, but I can't seem to figure out what.
The code to build this model was taken directly from a TensorFlow tutorial (https://www.tensorflow.org/tutorials/keras/overfit_and_underfit), but the issue occurs no matter what architecture I use to build the model.
import tensorflow as tf
from tensorflow.keras import layers

# X_train is the training data from the question's context
model = tf.keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[len(X_train[0])]),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
Any help would be much appreciated!
It's perfectly normal for training-set accuracy to decrease when you add Dropout. You usually accept this as a trade-off for higher accuracy on unseen data (the test set) and thus better generalization.
However, try decreasing the Dropout rate to 0.10 or 0.20; you will get better results. Also, unless you are dealing with hundreds of millions of examples, try reducing the number of neurons per layer, say from 512 to 128 (see the sketch below). If the net is too complex, the backpropagation gradients won't reach an optimum; if it is too simple, the gradients will saturate and it won't learn either.
Another point: you may want to apply pd.get_dummies to your output (Y), increase the last layer to Dense(2), and normalize your input data.
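A minimal sketch of the first two suggestions applied to the model above (128 units and a 0.2 dropout rate are the values suggested to try, not tuned results; X_train is assumed from the question):
import tensorflow as tf
from tensorflow.keras import layers

# Narrower layers (512 -> 128) and a lower dropout rate (0.5 -> 0.2)
model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=[len(X_train[0])]),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1)
])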

Optimal permutations of kernel_initializer, activation function and optimizer for Regression

I am using kernel_initializer='normal' and optimizer='adam' to find an optimal regression solution. I am getting close to 0.94 accuracy on the training data. I would like to test a few other kernel_initializer, activation function, and optimizer combinations, but I am not sure which kernel_initializer and activation function work well for regression. Please suggest.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# create model
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model ('root_mean_squared_error' is not a built-in Keras loss string;
# 'mse' is the closest built-in, and RMSE can be tracked as a metric instead)
model.compile(loss='mse', optimizer='adam')
Well, it might not be a good idea. You're chasing a fairly small margin in performance, and by "fishing" for good results you're essentially using your validation set as a training set, relying on small variations in the data to inform model design.
A few tips:
The Glorot initializer (the default) is usually the best. The difference is really small, though, especially in such a tiny model.
relu activation helps fight vanishing gradients; with only three layers in the model, you probably won't have that problem. Here it really depends on the nature of the data, and even linear activation might make sense.
For regular regression (i.e. predicting a number, not a binary output), you probably want a linear activation on the output layer. It is the default, but it's better to make it explicit.
Other optimizers might improve the rate of convergence, but they usually don't improve the final performance. Adam sounds like a reasonable choice; sgd will do the same but more slowly, and ftrl works best on sparse data such as language input.
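A minimal sketch pulling those tips together (the layer sizes mirror the question; spelling out glorot_uniform and linear just makes the Keras defaults explicit, and tracking RMSE as a metric is one assumed way around the invalid loss string):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.metrics import RootMeanSquaredError

model = Sequential([
    # glorot_uniform is the Keras default initializer, written out for clarity
    Dense(13, input_dim=13, kernel_initializer='glorot_uniform', activation='relu'),
    Dense(6, kernel_initializer='glorot_uniform', activation='relu'),
    Dense(1, activation='linear'),  # explicit linear output for regression
])
# 'mse' is a built-in loss; RMSE can be monitored as a metric
model.compile(loss='mse', optimizer='adam', metrics=[RootMeanSquaredError()])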

Why aren't all the activation functions identical?

This is code I picked up somewhere on the Internet.
It is a very simple GAN+CNN model, specifically the discriminator model, written in Keras for Python 3.6.
It works pretty well, but there is something that isn't clear to me.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, LeakyReLU, Flatten,
                                     Dense, Dropout, Activation)

# The two methods below belong to the GAN class the snippet was taken from;
# the class wrapper is included only to make the snippet self-contained.
class GAN:
    def __init__(self):
        self.img_rows = 28
        self.img_cols = 28
        self.channels = 1

    def build_discriminator(self):
        img_shape = (self.img_rows, self.img_cols, self.channels)
        model = Sequential()
        model.add(Conv2D(64, (5, 5), strides=(2, 2),
                         padding='same', input_shape=img_shape))
        model.add(LeakyReLU(0.2))
        model.add(Conv2D(128, (5, 5), strides=(2, 2)))
        model.add(LeakyReLU(0.2))
        model.add(Flatten())
        model.add(Dense(256))
        model.add(LeakyReLU(0.2))
        model.add(Dropout(0.5))
        model.add(Dense(1))
        model.add(Activation('sigmoid'))
        return model
Several different activation functions appear here, but why aren't they all identical?
If the very last output is 'sigmoid', shouldn't the rest be the same function?
Why is LeakyReLU used in the middle layers? Thanks.
I guess they didn't use sigmoid for the rest of the layers because, with sigmoid, you have a big problem of vanishing gradients in deep networks.
The reason is that the sigmoid function "flattens out" on both sides away from zero, which gives the layers far from the output a tendency to receive very small gradients and thus learn very slowly: loosely speaking, the gradient at an early layer is a product of factors contributed by all the layers after it, as a result of the chain rule. So if you have just a few sigmoid layers you might get away with it, but as soon as you chain several of them, they make the gradients unstable.
It's too complex for me to explain in full here, but if you want more detail you can read about it in a chapter of an online book.
By the way, that book is really great and worth reading further. To understand that chapter, you probably need to read chapter 1 of the book first if you don't know how backpropagation works.
The output and hidden layer activation functions do not have to be identical. Hidden layer activations are part of the mechanism that learns features, so it's important that they do not have vanishing-gradient issues (like sigmoid has), while the output layer activation function is tied to the output task, for example a softmax activation for classification.
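To make the "product of gradients" intuition concrete, here is a tiny numeric sketch (an illustration added here, not part of either answer): the sigmoid's derivative never exceeds 0.25, so chaining several sigmoid layers multiplies factors of at most 0.25, while ReLU/LeakyReLU contribute a factor of 1 on their active side.
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # maximum value is 0.25, reached at x == 0

# Best case for sigmoid: every pre-activation sits at the steepest point (x = 0).
# Even then, the activation part of the chain-rule product shrinks geometrically.
for depth in (2, 5, 10):
    print(depth, sigmoid_grad(0.0) ** depth)
# 2 -> 0.0625, 5 -> ~0.00098, 10 -> ~9.5e-07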

How to decide number of layers to add in a sequential model to solve a multiple linear regression problem using Tensorflow?

I was working on the problem defined in Predict Fuel Efficiency, and I didn't get clarity on how one should decide the number of (hidden) layers needed to solve a multiple linear regression problem.
In the problem defined above, three Dense layers are used, i.e.:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# train_dataset comes from the tutorial's data-loading code
myModel = keras.Sequential([
    layers.Dense(32, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(32, activation=tf.nn.relu),
    layers.Dense(1),
])
For a similar problem, how should I decide the number of layers to add to the sequential model?

Does applying a Dropout Layer after the Embedding Layer have the same effect as applying the dropout through the LSTM dropout parameter?

I am slightly confused on the different ways to apply dropout to my Sequential model in Keras.
My model is the following:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout  # Dropout used in the variant below

model = Sequential()
model.add(Embedding(input_dim=64, output_dim=64, input_length=498))
model.add(LSTM(units=100, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Assume that I added an extra Dropout layer after the Embedding layer in the below manner:
model = Sequential()
model.add(Embedding(input_dim=64,output_dim=64, input_length=498))
model.add(Dropout(0.25))
model.add(LSTM(units=100,dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Will this make any difference since I then specified that the dropout should be 0.5 in the LSTM parameter specifically, or am I getting this all wrong?
When you add a Dropout layer, you're adding dropout to the output of the previous layer only; in your case, you are adding dropout to the output of your embedding layer.
An LSTM cell is more complex than a single-layer neural network: when you specify dropout in the LSTM cell, you are actually applying dropout to 4 different sub-network operations inside the cell.
Below is a visualization of an LSTM cell from Colah's blog on LSTMs (the best visualization of LSTMs/RNNs out there, http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The yellow boxes represent 4 fully connected network operations (each with their own weights) which occur under the hood of the LSTM; this is neatly wrapped up in the LSTM cell wrapper, though it's not really so hard to code by hand.
When you specify dropout=0.5 in the LSTM cell, what you are doing under the hood is applying dropout to each of these 4 neural network operations. This is effectively adding a Dropout(0.5) operation 4 times, once after each of the 4 yellow blocks you see in the diagram, within the internals of the LSTM cell.
I hope that short discussion makes it clearer how the dropout applied inside the LSTM wrapper, which affects what are effectively 4 sub-networks within the LSTM, differs from the dropout you applied once in the sequence after your embedding layer. And to answer your question directly: yes, these two dropout definitions are very much different.
Notice, as a further example to help elucidate the point: if you were to define a simple 5-layer fully connected neural network, you would need to define dropout after each layer, not once. model.add(Dropout(0.25)) is not a global setting; it adds a dropout operation to a pipeline of operations. If you have 5 layers, you need to add 5 dropout operations (see the sketch below).
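A minimal sketch of that last point (the layer widths and input shape are arbitrary placeholders): each Dropout call regularizes only the layer directly before it, so it is repeated once per hidden layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Hypothetical 5-hidden-layer fully connected net: one Dropout per hidden layer,
# because Dropout(0.25) only affects the output of the preceding layer.
model = Sequential([
    Dense(64, activation='relu', input_shape=(100,)),  # input size is illustrative
    Dropout(0.25),
    Dense(64, activation='relu'),
    Dropout(0.25),
    Dense(64, activation='relu'),
    Dropout(0.25),
    Dense(64, activation='relu'),
    Dropout(0.25),
    Dense(64, activation='relu'),
    Dropout(0.25),
    Dense(1, activation='sigmoid'),
])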
