I've written an LSTM model that predicts sequential data.
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'),
               beta_1=get_deep(config, 'hp.beta_1'))

    # Ragged input so variable-length sequences can be batched together.
    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'),
                  activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)

    # Configurable stack of dense blocks.
    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)

    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model
I've tuned some of the layers' params using AWS SageMaker. While validating the model I ran a specific configuration several times. Most of the time the results are similar; however, one run was much better than the others, which led me to think that the initial state of the model is probably crucial for getting the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure which ranges I should tune.
Update:
As suggested in some of the comments/answers, I'm using a fixed seed to "lock" the model results:
import random
import numpy as np
import tensorflow as tf

# Set the `python` built-in pseudo-random generator to a fixed value
random.seed(seed_value)

# Set the `numpy` pseudo-random generator to a fixed value
np.random.seed(seed_value)

# Set the `tensorflow` pseudo-random generator to a fixed value
tf.random.set_seed(seed_value)
The results now replicate for each new training run; however, different seeds can produce much better results than others. So how do I find/tune the best seed?
... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Firstly, apart from the state or weight initializer, all the other factors mentioned in that video, such as the learning rate, schedule, optimizer, batch size, loss function, model depth, etc., are things you should play with to find the best set (we will talk about the role of the seed later). Normally, we don't need to tune the default weight or state initializer, as the current defaults are already among the best choices; state initialization in particular is still a research problem.
Secondly, in Keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you will find them. According to the doc:
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may already have noticed that this initializer inherits from VarianceScaling; and, just like GlorotUniform, the others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal and HeUniform also inherit from it. The supported parameters of VarianceScaling are listed in its documentation. For example, technically, the following two are the same:
# In case you want to try various initializers,
# use VarianceScaling and pass the proper parameters,
# i.e. tf.keras.layers.LSTM(..., kernel_initializer=initializer),
# but it's recommended to stick with glorot_uniform (the default).
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))

initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
def conv_kernel_initializer(shape, dtype=None):
    # He-style normal initialization scaled by the conv kernel's fan-out.
    kernel_height, kernel_width, _, out_filters = shape
    fan_out = int(kernel_height * kernel_width * out_filters)
    return tf.random.normal(
        shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)

def dense_kernel_initializer(shape, dtype=None):
    # Uniform initialization scaled by the dense kernel's output size (shape[1]).
    init_range = 1.0 / np.sqrt(shape[1])
    return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
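The same can be done by subclassing tf.keras.initializers.Initializer; a minimal sketch that wraps the same uniform rule as dense_kernel_initializer above (the class name and seed handling are my own additions):

import numpy as np
import tensorflow as tf

class ScaledUniform(tf.keras.initializers.Initializer):
    # Same scaling rule as dense_kernel_initializer, as a reusable class.
    def __init__(self, seed=None):
        self.seed = seed

    def __call__(self, shape, dtype=None):
        init_range = 1.0 / np.sqrt(shape[1])
        return tf.random.uniform(shape, -init_range, init_range,
                                 dtype=dtype or tf.float32, seed=self.seed)

    def get_config(self):  # makes the initializer serializable with the model
        return {'seed': self.seed}

# usage: Dense(64, kernel_initializer=ScaledUniform(seed=101))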
Here is one good article about initializing weights that you may enjoy reading. But again, it's better to go with the defaults.
Thirdly, regarding setting different seed values, different sets of hyper-parameters, etc., I'll leave one of my old answers here; the first diagram in it will probably come in handy for your experiments. One approach that I follow is to keep my seed the same (say, for the first 5 experiments), change one other factor at a time, and log the results. After 5 iterations, hopefully, we will have found a good set and can continue from there.
Update
Find/Tune Seed. Before searching for a method to find the best seed, one must understand that the seed is not a hyper-parameter to be tuned along with other hyper-parameters such as the learning rate, scheduler, optimizer, etc.
Here is one scenario: let's say you split the data randomly into two parts with seed 42, a train set (70%) and a test set (30%); after training on the train set, you evaluate your model on the test set and get a score of 80. Then you change your seed to 101, do the same again, and now get a score of 50. This doesn't mean that picking seed 42 is better; it simply means your model is unstable and most likely won't do well on unseen data. This is actually a well-known issue when someone randomly splits their data set for training and testing. Why does it happen? Because when you split the data randomly, there can be a mismatch in the class distribution between the splits (a stratified split, sketched after the links below, avoids this). Please check the following two very related discussions on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
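One common fix for that class-distribution mismatch is a stratified split; a minimal sketch with scikit-learn, where X and y stand in for your own features and labels:

from sklearn.model_selection import train_test_split

# Stratification keeps the class proportions identical in both splits, so the
# test score depends far less on which seed was used for splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)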
I don't think there is a "one shoe fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and on the data we are using to solve that problem. All we can do is point you towards a good resource where you can check which of the approaches mentioned fits your problem.
The following article is a good resource that not only gives you a detailed understanding of how and why to initialize weights but also points towards peer-reviewed research that can help build an academic understanding.
Maybe you should look into an exponentially decaying learning rate.
Let me explain:
your first epoch, for example, sometimes has a loss of 3000 or 4000, and sometimes just 500.
If you run a model often, you probably recognize a "real barrier", beyond which you don't say "that's because of the initial state" anymore.
You want to get there fast, but you don't want to keep the bad side effects of a high learning rate (e.g. 1e-3); you'd rather end up around 1e-5.
That's where exponential decay comes in:
create an instance, e.g. myLr = tf.train.exponential_decay(...), and pass it instead of the numerical learning-rate parameter to your optimizer,
for example Adam(myLr).
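Note that tf.train.exponential_decay is the TF 1.x API; in TF 2.x / tf.keras the equivalent is a learning-rate schedule object. A minimal sketch, with the decay numbers chosen purely for illustration:

import tensorflow as tf

# Start at 1e-3 and multiply the learning rate by 0.9 every 10,000 steps,
# heading towards the small learning rate you want at the end of training.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10000,
    decay_rate=0.9,
    staircase=True)

opt = tf.keras.optimizers.Adam(learning_rate=lr_schedule)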
Indeed, the initial state of the model is crucial for getting the best performance. Deep learning works by optimizing a non-convex loss function in order to find a good local minimum.
The initial weights define the starting location of the optimization, as illustrated in the picture below: the starting point is set by the initial weights, and training moves the model towards a local minimum. As you can see, there is a starting weight configuration that allows reaching the global minimum.
It is sometimes possible to get a better weight initialization with transfer learning, i.e., reusing the weights of a model trained on another task for your downstream task (for example VGG-16 in image classification, or BERT in NLP).
In your case, you should not try to fine-tune the weight initialization, as it is meant to be random. Changing the architecture of your neural network, or its hyper-parameters, is much more likely to lead to a performance improvement.
Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on exploring this for yourself, I recommend using NumPy's default_rng object as the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
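A minimal sketch of that idea, deriving all per-library seeds from one master seed (the value 42 is arbitrary):

import random
import numpy as np
import tensorflow as tf

master_seed = 42
rng = np.random.default_rng(master_seed)

# Derive the per-library seeds deterministically from the single master seed.
py_seed, np_seed, tf_seed = (int(s) for s in rng.integers(0, 2**31 - 1, size=3))

random.seed(py_seed)
np.random.seed(np_seed)
tf.random.set_seed(tf_seed)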
Related
I implemented training and evaluation for binary classification on image data through transfer learning with the Keras API. I'd like to compare the performance of several models (ResNet, Inception, Xception, VGG, EfficientNet). The datasets consist of train (approx. 2000 samples), valid (approx. 250 samples) and test (approx. 250 samples).
But I'm facing a situation that is unfamiliar to me, so I'm asking a couple of questions here.
As shown below, the validation accuracy and loss fluctuate heavily from epoch to epoch.
I wonder which one is the problem and what needs to be changed.
[figures: accuracy and loss per epoch]
If I want to express the validation accuracy as a single number, what should I report in the above case?
The average, the maximum, or the minimum?
This is being done with Keras (TensorFlow), and there are many examples in the API for training and validation, but code for the test (evaluation?) step is hard to find. When reporting performance, is it normal to stop at validation, or do I need to show an evaluation result on the test set?
Now I use the Keras API for transfer learning and set this:
include_top=False
conv_base.trainable=False
Summary
I wonder whether transfer learning has any effect without including the top, and if not,
whether there is a way to freeze or train from a specific layer of conv_base.
I'm a beginner and don't have much experience, so these could be silly questions, but please give kind advice.
Thanks a lot in advance.
It's hard to figure out the problem without the code/model structure. From your loss graph I can see that your model is underfitting (or it has a lot of dropout). Common mistakes that make models underfit are: a very high learning rate and a too-primitive structure (so the model can't figure out the dependencies in your data). And you should never forget the principle "garbage in, garbage out", so double-check your data for any structural roughness.
Well, the validation accuracy in your model training logs is the mean accuracy over the validation set. Validation is based on statistics: you take a random N% out of your set for validation, so the average is the better number to report if we're talking about multiple experiments (or cross-validation).
I'm not sure I've understood your question correctly here, but if you want to evaluate your model after the training process (the fit() call) with the metric you've specified for it, you should use model.evaluate(val_x, val_y). Or you may use model.predict(val_x) and compare its results to val_y via your metric function.
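For instance (val_x / val_y stand in for whatever held-out split you want to score, and the snippet assumes the model was compiled with metrics=['accuracy']):

# evaluate() returns the loss plus every metric passed to compile(), in order.
val_loss, val_acc = model.evaluate(val_x, val_y, verbose=0)
print(f"loss: {val_loss:.4f}, accuracy: {val_acc:.4f}")

# Equivalent manual route: predict probabilities and score them yourself.
probs = model.predict(val_x)
preds = (probs > 0.5).astype(int)   # 0.5 threshold for binary classification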
If you are using the default weights of the Keras pretrained models (ImageNet weights) and want to use your own fully-connected part with them, you can use ONLY the pretrained feature extractor (the conv blocks); that is what include_top=False specifies. Of course there will be some positive effect (I'd say significant compared to randomly initialized weights), because the conv blocks have parameters that were trained to extract useful features from images. I would also recommend the so-called "fine-tuning" technique here: freeze all layers in the pretrained part except a few at its end (maybe a few layers or even 2-3 conv blocks). Here's an example of fine-tuning EfficientNetB0:
effnet = EfficientNetB0(weights="imagenet", include_top=False, input_shape=(540, 960, 3))
effnet.trainable = True
for layer in effnet.layers:
    # Freeze every layer except the last conv block ('block7a') and the top.
    if 'block7a' not in layer.name and 'top' not in layer.name:
        layer.trainable = False
Here I freeze all pretrained weights except those of the last conv block. I looked into the model with effnet.summary() and picked the names of the blocks that I want to unfreeze.
I was trying to use average ensembling on a group of models I trained earlier (I'm creating a new model in the ensemble for each pre-trained model I'm using and then loading the trained weights onto it; it's inefficient this way, I know, but I'm just learning about it, so it doesn't really matter). I mistakenly changed some of the network's parameters when loading the models in the ensemble code, like using ReLU instead of the LeakyReLU I used in training, and a different value for the L2 regularizer in the dense layer in one of the models. This, however, gave me a better testing accuracy for the ensemble. Can you please explain to me if/how this is incorrect, and if it is normal, can I use this method to further enhance the accuracy of the ensemble?
I believe it is NOT correct to change a model's parameters after training it. By parameters I mean the trainable parameters, like the weights in a Dense layer, not hyper-parameters like the learning rate.
What is training?
Training is essentially a loop that keeps changing, or updating, the parameters. It updates the parameters in a way that it believes will reduce the loss. It is like moving your point in a hyper-space to a location where the loss function gives a small loss.
A smaller loss generally means higher accuracy.
Changing Weights
So, changing your parameter values, by mistake or on purpose, is like moving that point somewhere else, BUT you have no logical reason to believe that such a move will give you a smaller loss. You are just randomly wandering around that hyper-space, and in your case you were simply lucky to land on a point that happened to give a smaller loss, i.e. a better testing accuracy. It is purely luck.
Changing activation function
Also, altering the activation function from LeakyReLU to ReLU is like randomly altering the shape of your hyper-space. Even though you are at the same point, the landscape changes, and you still have no logical reason to believe that such a change of landscape will give you a smaller loss at that point.
When you change the model manually, you need to retrain it.
Although you changed the network's parameters when loading the models, it is not incorrect to alter the hyper-parameters of your ensemble's underlying models. In some cases, the models used in an ensemble method require individual tuning, which can, as you mentioned, give "you a better testing accuracy for the ensemble model."
To answer your second question: yes, you can use this method to further enhance the accuracy of the ensemble. You can also use Bayesian optimization, GridSearch and RandomSearch if you prefer more automated means of tuning your hyper-parameters.
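If you want a ready-made implementation of those search strategies, the keras-tuner package (assumed installed separately as keras_tuner) provides RandomSearch, BayesianOptimization and Hyperband; a minimal sketch with an illustrative search space:

import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # Search space: number of units and learning rate are tuned.
    model = keras.Sequential([
        keras.layers.Dense(hp.Int('units', 32, 256, step=32), activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('lr', [1e-2, 1e-3, 1e-4])),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=10)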
I'm trying to build a convolution-based model. I trained two different structures, shown below. As you can see, for the single-layer version there isn't any obvious change over the epochs. The bi-layer Conv2D version shows improving accuracy and loss on the training set, but the validation curves are a tragedy.
Given that I can't increase my dataset, what should I do to improve the validation behaviour?
I've tried L1 & L2 regularizers, but they didn't affect my model.
1) You can use an adaptive learning rate (exponential decay or a step-dependent schedule may work for you). Furthermore, you can try extremely high learning rates when your model gets stuck in a local minimum.
2) If you are training with images, you can flip, rotate or otherwise transform them to increase your dataset size, and other augmentation techniques might also work for your case (see the sketch after this list).
3) Try changing the model: deeper, shallower, wider, narrower.
4) If you are building a classification model, make sure you are not using sigmoid as the final activation function unless you are doing binary classification.
5) Always check the condition of your dataset before a training session:
Your train-test split may not be suitable for your case.
There might be extreme noise in your data.
Some of your data might be corrupted.
Note: I will update this answer whenever a new idea comes to mind. Also, I didn't want to repeat the comments and the other answers; both contain valuable information for your case.
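For point 2, a minimal augmentation sketch with the Keras preprocessing layers (available in recent TF 2.x releases; the transform factors are only illustrative):

import tensorflow as tf
from tensorflow.keras import layers

# Random transforms applied on the fly; they are active only during training.
augment = tf.keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

# e.g. as the first block of your model:
# inputs = layers.Input(shape=(height, width, channels))
# x = augment(inputs)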
The validation becomes a tragedy because the model is overfitting on the training data. You can try whether any of these help:
1) Batch normalisation would be a good option to go with.
2) Try reducing the batch size.
I tried a variety of models known to work well on small datasets, but as I suspected, and as is my ultimate verdict - it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like an SVM; matters are exacerbated by having eight separate classes. Your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification, I'd say you need anywhere from 50-200x your current data to get reasonable results. Mind also that your validation performance is bound to be much worse on a bigger validation set, which is accounted for in this estimate.
For readers, OP shared his dataset with me; shapes are: X = (1152, 1024, 1), y = (1152, 8)
So I am dealing with a simple neural network with 10 inputs and one output. I can have as many hidden layers as suggested, but I am using 2. I am also using the "mean_squared_error" loss function and the RMSProp optimizer.
Anyhow, the question I have is this: suppose my output values look like this:
[0,0,3,0,0,0,5,0,0,2,0...] etc. Note that the value 0 repeats far more often than the others. What I would love to do is force the neural network to learn better on the cases with non-zero output values, i.e. give more "importance" to those values.
Because if I use 'mean_squared_error', training will optimize over the entire dataset, which will mostly optimize the cases where 0 is the output value.
EDIT:
The problem I am dealing with could be simple modeling of a physical system. Let's say we have a black-box system with known inputs and a single output (say, temperature). Based on our inputs and the corresponding outputs, we could model the system with a neural network as a "black box" and then use the trained NN to predict the temperature.
EDIT:
So I am now using a different training/validation set; I suspected that there was something wrong with the previous one.
Now I get something like the image above (please note the immediate spike).
What could cause that?
Keep in mind that I am not experienced with NNs, so literally any feedback is welcome :)
There are two important concepts in ML,
"underfitting" and "overfitting", and in your case I think it's underfitting.
To overcome this problem there are a few options:
make your model more complex by adding more layers and units
if you are using regularization terms, decrease their values
use more features (if there are any)
Hope this helps.
If your outputs are integers [0,0,3,0,0,0,5,0,0,2,0...], i.e., classes, you will probably want to do classification, so your loss should be categorical_crossentropy. In this case, there are two ways of doing what you want:
1- You can use SMOTE, the Synthetic Minority Oversampling TEchnique, so that the non-zero classes get the same weight as the zero class. For binary classes:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

sm = SMOTEENN()
# note: newer versions of imbalanced-learn renamed fit_sample to fit_resample
x, y = sm.fit_sample(X, Y)
2- You can also adjust the Keras class weights:
class_weight = {0: 1., 1: 30.}
# `nb_epoch` is the legacy Keras 1 name; use `epochs` in current Keras
model.fit(X, Y, epochs=1000, batch_size=16, class_weight=class_weight)
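If you would rather keep the regression framing with mse instead of switching to classification, Keras fit() also accepts per-sample weights; a minimal sketch (the weight of 10 for non-zero targets is arbitrary):

import numpy as np

Y = np.array([0, 0, 3, 0, 0, 0, 5, 0, 0, 2], dtype='float32')

# Give non-zero targets a larger weight so their squared errors count more.
sample_weight = np.where(Y != 0, 10.0, 1.0)

# model.fit(X, Y, epochs=1000, batch_size=16, sample_weight=sample_weight)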
I am using the LSTM cell in Tensorflow.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
I was wondering how the weights and states are initialized, or rather, what the default initializer is for LSTM cells (states and weights) in TensorFlow.
And is there an easy way to manually set an Initializer?
Note: For tf.get_variable() the glorot_uniform_initializer is used as far as I could find out from the documentation.
First of all, we need to distinguish between the weights of an LSTM (the usual parameter set of an ANN), which are by default initialized with the Glorot, also known as Xavier, initializer (as mentioned in the question),
and the cell state and the initial recurrent input to the LSTM. Those are initialized by a matrix usually denoted as initial_state.
That leaves us with the question of how to initialize this initial_state:
Zero State Initialization is good practice if the impact of initialization is low
The default approach to initializing the state of an RNN is to use a zero state. This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small.
Zero State Initialization in each batch can lead to overfitting
Zero Initialization for each batch will lead to the following: Losses at the early steps of a sequence-to-sequence model (i.e., those immediately after a state reset) will be larger than those at later steps, because there is less history. Thus, their contribution to the gradient during learning will be relatively higher. But if all state resets are associated with a zero-state, the model can (and will) learn how to compensate for precisely this. As the ratio of state resets to total observations increases, the model parameters will become increasingly tuned to this zero state, which may affect performance on later time steps.
Do we have other options?
One simple solution is to make the initial state noisy (to decrease the loss for the first time step). Look here for details and other ideas
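A minimal sketch of that noisy-initial-state idea with the Keras functional API (the sizes and noise scale are purely illustrative): the initial hidden and cell states become extra model inputs that you feed with small Gaussian noise instead of zeros.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

units, num_features = 32, 8

# Sequence input plus two extra inputs for the LSTM's initial hidden/cell state.
seq_in = Input(shape=(None, num_features))
h0 = Input(shape=(units,))
c0 = Input(shape=(units,))

x = LSTM(units)(seq_in, initial_state=[h0, c0])
out = Dense(1)(x)
model = Model([seq_in, h0, c0], out)
model.compile(loss='mse', optimizer='adam')

# At training time, feed small Gaussian noise instead of zeros as the initial state.
batch, timesteps = 16, 20
X = np.random.rand(batch, timesteps, num_features).astype('float32')
y = np.random.rand(batch, 1).astype('float32')
noisy_h = np.random.normal(0.0, 0.1, size=(batch, units)).astype('float32')
noisy_c = np.random.normal(0.0, 0.1, size=(batch, units)).astype('float32')
model.fit([X, noisy_h, noisy_c], y, epochs=1, verbose=0)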
I don't think you can initialize an individual cell, but when you execute the LSTM with tf.nn.static_rnn or tf.nn.dynamic_rnn, you can set the initial_state argument to a tensor containing the LSTM's initial values.
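For example, with the TF 1.x-style API the question uses (BasicLSTMCell and dynamic_rnn are deprecated in TF 2.x):

import tensorflow as tf  # TF 1.x-style API, matching the question

lstm_units, batch_size, num_features = 64, 32, 10

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(lstm_units)

# The conventional all-zero initial state...
init_state = lstm_cell.zero_state(batch_size, dtype=tf.float32)
# ...but any LSTMStateTuple of tensors with the right shapes works, e.g.
# trainable variables or noise: tf.nn.rnn_cell.LSTMStateTuple(c=..., h=...)

inputs = tf.placeholder(tf.float32, [batch_size, None, num_features])
outputs, final_state = tf.nn.dynamic_rnn(lstm_cell, inputs,
                                         initial_state=init_state)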