Optimal permutations of kernel_initializer, activation function and optimizer for Regression - python

I am using kernel_initializer='normal' and optimizer='adam' to find an optimal regression solution. I am getting close to 0.94 accuracy on the training data. I would like to test a few other kernel_initializer, activation function and optimizer combinations, but I am not sure which kernel_initializer and activation function work well for regression. Please suggest.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# create model
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
# note: Keras has no built-in 'root_mean_squared_error' loss string; 'mean_squared_error' is the usual choice
model.compile(loss='root_mean_squared_error', optimizer='adam')

Well, it might not be a good idea. You're after a fairly small margin in performance, and by "fishing" for good results you're essentially exploiting your validation set as a training set, relying on small variations in the data to inform model design.
A few tips:
The Glorot initializer (the default) is usually the best. However, the difference is really small, especially in such a tiny model.
The relu activation is helpful to fight vanishing gradients. With three layers in the model, you probably won't have them. Here it really depends on the nature of the data; even a linear activation might make sense.
For regular regression (i.e. predicting a number, not binary output(s)), you probably want a linear activation on the output layer. It is the default, but it's better to make it explicit.
Other optimizers might improve the rate of convergence, but they usually don't improve the final performance. Adam sounds like a reasonable choice - sgd will do the same but more slowly, and ftrl works best on sparse data such as language input.
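For what it's worth, here is a minimal sketch of those suggestions put together. The layer sizes are simply the ones from the question, and tracking RMSE as a metric (rather than as the loss) is my own assumption, not part of the original answer:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# glorot_uniform is the Keras default initializer; written out here only to be explicit
model.add(Dense(13, input_dim=13, kernel_initializer='glorot_uniform', activation='relu'))
model.add(Dense(6, kernel_initializer='glorot_uniform', activation='relu'))
# explicit linear activation on the output layer for plain regression
model.add(Dense(1, activation='linear'))
# MSE as the loss; RMSE tracked as a metric, since Keras has no RMSE loss string
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])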

Related

How could I increase the accuracy on the training set?

I'm working on a classification problem (human activity classification) and I used a CNN. The code of the model is:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Dropout, Flatten

model = Sequential()
model.add(Conv2D(100, (2, 2), activation='relu', input_shape=X_train[0].shape))
model.add(Dropout(0.1))
# adding pooling layer
model.add(MaxPool2D(2, 2))
# note: this Dense layer comes before Flatten, so it acts on the last axis of the feature map
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))
Compiling and fitting:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test), verbose=1)
The accuracy curve looked like this (plot not included here).
How could I increase the final accuracy value? And why does the curve increase so fast?
There are a few avenues you can pursue here, specifically finding answers to the following questions for your particular problem. There's a great video on this; it's not about TensorFlow, but I think the question you are asking is general enough for it to apply.
What is the right amount of time to train for? The answer is likely somewhere between 20 and 90 epochs; more specifically, it's where the two series in your plot start to diverge. In other words, your model starts to memorize the training data at the point of divergence. TensorFlow has early-stopping mechanisms to help with this (a minimal sketch follows below).
What is the performance of a naïve guesser? Is the complexity of your model proportional to the complexity/dimensionality of the problem?
What is the human insight that you can bring to the problem? Are there things you can do to the features that will help the model create separability in higher dimensions? For example, let's say your model is going to predict what activity a person is going to do at a given point in time. In this case, information related to people might be separate from time and activity data. You can create features that represent combinations of other features (assuming you have a lot of data), and encode this and feed it to your model. You can create embeddings in your model to get your model to deal with the sparsity that occurs when you combine such categorical features.
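On the early-stopping point above, here is a minimal sketch using Keras' built-in callback, reusing the model and variable names from the question; the patience value and epoch cap are just illustrative choices:

from tensorflow.keras.callbacks import EarlyStopping

# stop once validation loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

history = model.fit(X_train, y_train,
                    epochs=90,                      # upper bound; the callback decides when to stop
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop],
                    verbose=1)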
Another aspect of this that I think is very important to answer is "Why am I solving this problem?". In some cases, the answer might be "I want to learn X", in which case you might approach it differently. For example, if it's all tabular data, you might have more interpretable/better results using something like scikit-learn using a tree based model. It also, of course, depends on the amount and type of data you have. Nested cross-validation can give you great insight into what are the combinations of hyperparameters and features that will produce a model that generalizes, and also about the variation you can expect to see on unseen data.
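If you do go the scikit-learn route, nested cross-validation can be sketched roughly like this; the estimator, parameter grid, and the X/y names are placeholders for your own data, not a recommendation for this specific problem:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: hyperparameter search; outer loop: unbiased performance estimate
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
inner_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
outer_scores = cross_val_score(inner_search, X, y, cv=5)

# mean and spread of the outer folds give an idea of expected generalization
print(outer_scores.mean(), outer_scores.std())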
Best of luck!

Tensorflow Model Underpredicts Values with Dropout

I am having a problem implementing dropout as a regularization method in my dense NN model. It appears that adding a dropout value above 0 just scales down the predicted value, in a way that makes me think something is not being accounted for correctly after individual weights are set to zero. I'm sure I am implementing something incorrectly, but I can't seem to figure out what.
The code to build this model was taken directly from a TensorFlow tutorial page (https://www.tensorflow.org/tutorials/keras/overfit_and_underfit), but the issue occurs no matter what architecture I use to build the model.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[len(X_train[0])]),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
Any help would be much appreciated!
It's perfectly normal for training-set accuracy to decrease when adding Dropout. You usually accept this as a trade-off to increase accuracy on unseen data (the test set) and thus improve generalization.
However, try decreasing the Dropout rate to 0.10 or 0.20; you will get better results. Also, unless you are dealing with hundreds of millions of examples, try reducing the number of neurons in your network, e.g. from 512 to 128. With an overly complex network the backpropagation gradients won't reach an optimum; with a network that is too simple, the gradients saturate and it won't learn either.
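As a concrete illustration of that suggestion (lower dropout rate, fewer units), a hedged sketch of the question's architecture scaled down; the exact widths and rates are only starting points to experiment with:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=[len(X_train[0])]),
    layers.Dropout(0.2),          # milder dropout than 0.5
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1)
])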
One other point: you may want to apply pd.get_dummies to your output (Y), increase the last layer to Dense(2), and normalize your input data.

How do I best optimize my parameters, choice of activation, optimizer etc. in an LSTM?

I'm training an LSTM neural network to predict volatility (a time series) in Keras. At the moment, my network is specified as follows:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import l2

model = Sequential()
model.add(LSTM(10, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
model.add(Dense(1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=16)
Here, I have a lot of parameters I could cross validate:
units in LSTM?
More layers?
regularizer (L1 or l2, and amount)?
Activation function?
Optimizer?
Batch size?
However, CV on all these parameters would result in a huge computational cost, so how do I determine the correct specifications for all of them?
As far as I know, a grid search is probably the best approach. However, you can shrink the search space by examining your data. If you don't have much data, go for a smaller model rather than a big one (otherwise it will overfit); this alone trims the search space. Some say fewer layers with more units work well for low-resource data, but even that is not guaranteed.
A regularizer can be good or bad depending on the task. You'll never know whether a setting is right unless you experiment with it.
For the batch size, it is common to experiment with values from 16 to 512 (or higher if you can afford it). The larger the batch size, the faster training is and the more memory it consumes. A smaller batch size also means the model "walks" more randomly; in other words, the loss decreases at a noisier pace.
For the optimizer, if you want to grid-search, just use Adam. It is quite good for most tasks.
All in all, no one can guarantee that tuning a given hyperparameter will yield a performance gain. It all needs to be experimented with and recorded; that's why there are so many research papers on hyperparameter tuning.
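To make the grid-search idea concrete, here is a rough sketch of a small manual search over two of the hyperparameters listed in the question (LSTM units and batch size), reusing the variable names from the question. The value ranges are only examples, and each configuration should ideally be repeated a few times because of random initialization:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import l2

results = {}
for units in [5, 10, 20]:
    for batch_size in [16, 64, 256]:
        model = Sequential([
            LSTM(units, input_shape=(1, 1), kernel_regularizer=l2(0.0001)),
            Dense(1)
        ])
        model.compile(loss='mean_squared_error', optimizer='adam')
        history = model.fit(X_train, y_train, validation_split=0.2,
                            epochs=100, batch_size=batch_size, verbose=0)
        # keep the best validation loss seen for this configuration
        results[(units, batch_size)] = min(history.history['val_loss'])

best = min(results, key=results.get)
print('best (units, batch_size):', best)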

Why aren't all the activation functions identical?

This is something I picked up from somewhere on the Internet.
It is a very simple GAN+CNN model, specifically the discriminator model, written in Keras with Python 3.6.
It works pretty well, but there is something I'm not clear about.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, LeakyReLU, Activation

# methods excerpted from a larger GAN class (the class definition is omitted in the question)
def __init__(self):
    self.img_rows = 28
    self.img_cols = 28
    self.channels = 1

def build_discriminator(self):
    img_shape = (self.img_rows, self.img_cols, self.channels)
    model = Sequential()
    model.add(Conv2D(64, (5, 5), strides=(2, 2),
                     padding='same', input_shape=img_shape))
    model.add(LeakyReLU(0.2))
    model.add(Conv2D(128, (5, 5), strides=(2, 2)))
    model.add(LeakyReLU(0.2))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(LeakyReLU(0.2))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    return model
Several different activation functions appear here, but why aren't they all identical?
If the very last output is 'sigmoid', shouldn't the rest be the same function as well?
Why is LeakyReLU used in the middle layers? Thanks.
I guess they didn't use sigmoid for the rest of the layers because with sigmoid you have a big problem of vanishing gradients in deep networks.
The reason is that the sigmoid function "flattens out" on both sides away from zero, which gives the layers closer to the input a tendency to receive very small gradients and thus tiny updates, because, loosely speaking, the gradient that reaches an early layer is a product of the gradients of all the later layers as a result of the chain rule of differentiation. So if you have just a few sigmoid layers you might get away with it, but as soon as you chain several of them they make the gradients unstable.
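To make that chain-rule argument concrete, here is a tiny numeric sketch (the 10-layer depth and the LeakyReLU slope of 0.2 are just illustrative): the sigmoid derivative is at most 0.25, so a product of such factors across many layers shrinks very quickly, whereas LeakyReLU keeps the per-layer factor close to 1 on the active side:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 -- the largest value the sigmoid derivative can take
print(0.25 ** 10)          # ~9.5e-07: best-case gradient factor after 10 sigmoid layers
print(1.0 ** 10, 0.2 ** 10)  # LeakyReLU: factor 1 on the active side, alpha (here 0.2) otherwise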
It's too complex for me to explain in an answer here, but if you want to know it in more detail, you can read about it in a chapter of an online book.
By the way, this book is really great and worth reading further. To understand that chapter, you will probably have to read chapter 1 of the book first if you don't already know how backpropagation works.
The output and hidden layer activation functions do not have to be identical. Hidden-layer activations are part of the mechanism that learns features, so it's important that they do not have vanishing-gradient issues (like sigmoid has), while the output-layer activation function is more related to the output task, for example a softmax activation for classification.
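A minimal sketch of that split, assuming flattened 28x28 inputs; the layer widths are illustrative only: LeakyReLU in the hidden layers to keep gradients flowing, and an output activation that matches the task, here sigmoid for the binary real/fake decision:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential([
    Dense(256, input_shape=(784,)),
    LeakyReLU(0.2),                    # hidden layers: LeakyReLU avoids vanishing gradients
    Dense(128),
    LeakyReLU(0.2),
    Dense(1, activation='sigmoid')     # output: sigmoid for the binary real/fake probability
])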

How to determine optimal number of layers and activation function(s)

So I am working on the MNIST and Boston_Housing datasets using keras, and I was wondering how I would determine the optimal number of layers and activation functions for each layer.
Now, I am not asking what the optimal number of layers/activation functions are, but rather the process I should go through to determine these parameters.
I am evaluating my model using mean squared error and mean absolute error.
Here is what my current model looks like:
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu'))
# kernel_initializer replaces the old Keras 1 'init' argument
model.add(layers.Dense(64, kernel_initializer='glorot_uniform', activation='selu'))
model.add(layers.Dense(64, activation='softplus'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mae'])
I have a mean absolute error of 3.5 and a mean squared error of 27.
For choosing the activation function,
Modern neural networks mainly use ReLU or leakyReLU in the hidden layers
For classification, a softmax activation is used at the output layer.
For regression, a linear activation is used at the output layer.
For choosing the number of layers,
Totally depends on your problem.
More layers are helpful when the data is complex, as they can approximate the function between the input and output more efficiently.
Sometimes, for smaller problems like MNIST, even a net with 2 hidden layers works well.
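A hedged sketch of those recommendations applied to the two datasets mentioned in the question; the layer widths and input shapes (13 features for Boston Housing, 784 for flattened MNIST) are illustrative assumptions, not tuned values:

from tensorflow.keras import models, layers

# regression (Boston Housing): ReLU hidden layers, linear output, MSE loss
reg_model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(13,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)                          # linear output for a single predicted value
])
reg_model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

# classification (MNIST): ReLU hidden layers, softmax output, cross-entropy loss
clf_model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')   # one probability per digit class
])
clf_model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])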
