I've written an LSTM model that predicts the sequential data.
def get_model(config, num_features, output_size):
opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))
inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
layers = BatchNormalization()(layers)
if 'dropout_rate' in config['hp']:
layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)
for layer in get_deep(config, 'hp.dense_layers'):
layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
layers = BatchNormalization()(layers)
if 'dropout_rate' in layer:
layers = Dropout(layer['dropout_rate'])(layers)
layers = Dense(output_size, activation='sigmoid')(layers)
model = Model(inputs, layers)
model.compile(loss='mse', optimizer=opt, metrics=['mse'])
model.summary()
return model
I've tuned some of the layer's params using AWS SageMaker. While validating the model I've run a model with a specific configuration several times. Most of the time the results are similar, however, one run was much better than others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Update:
As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:
# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)
The results replicate for each new train, however, different seeds can produce much better results than others. So how do I find/tune the best seed?
... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
..... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Firstly, in that video, apart from the state or weights initializer, all the other factors such as learning rate, schedule, optimizer, batch size, loss function, model depth, etc are something you should play with them to find the best set (we will talk about the role of seed later). Normally, we don't need to tune the default weight or state initializer as those are currently the best; and as usual, this state initialization is a research problem.
Secondly, in keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM is glorot_uniform, also known as Xavier uniform initializer. And the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you would find them. About it, according to the doc
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may already notice that this initializer is inherited from the VarianceScaling; and same as GlorotUniform, others like GlorotNormal, LecunNormal, LecunUniform, HeNormal, HeUniform are also inheriting it. Regarding the VarianceScaling, here is listed the supported parameter. For example, technically, the following two are the same.
# in case if you want to try various initializer -
# use VarianceScaling by passing proper parameter.
# ie. tf.keras.layers.LSTM(..., kernel_initializer=initializer)
# bur recommended to stick with glorot_uniform (default)
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
mode='fan_avg', seed=101,
distribution='uniform')
print(initializer(shape=(2, 2)))
initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
def conv_kernel_initializer(shape, dtype=None):
kernel_height, kernel_width, _, out_filters = shape
fan_out = int(kernel_height * kernel_width * out_filters)
return tf.random.normal(
shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)
def dense_kernel_initializer(shape, dtype=None):
init_range = 1.0 / np.sqrt(shape[1])
return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
Here is one good article about initializing the weights, you may enjoy reading. Butt again, better to go with default ones.
Thirdly, for setting different seed values and different sets of hyper-parameter, etc, I better leave one of my old answers here, mostly the first diagram probably come in handy to your experiments. One of an approach that I follow is to keep my seed same (let's say for first 5 experiments) and change another factor and log the results. And after 5 iterations, hopefully, we would get some best set and approach further.
Update
Find/Tune Seed. Before searching the method to find the best seed, one must understand that seed is not a hyper-parameter that needs to be tuned with other hyperparameters such as learning rate, scheduler, optimizer, etc.
Here are one scenario, let's say you split the data randomly into two parts with seed 42: train set (70%) and test set (30%) and after training on the train set, you evaluate on the test set of your model and received score 80. Then you change your seed to 101, and again do the same but now you got score 50. Now, this doens't mean picking seed 42 is better; but it simply means your model is unstable and most like won't do well on the unseen data. This is actually a well-known issue if someone randomly split their data set for training and testing. Why it happens? Because, when you split the data randomly, it's possible that there would be a mismatch in class-distribution. Please, check the following two very related discussion on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
I dont think there is a "one shoe fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and the data that we are using to solve that problem. All we can do is point you towards a good resource from where you can try to see which of the approaches mentioned fits your problem.
The following article is a good resource that not only provides you with a detailed understanding of how and why to initialize weights but also points towards peer reviewed research that can help build an academic understanding.
maybe you search for exponential decay learning rate.
let me explain
for example you first epoch has sometimes a loss of 3000, 4000 , sometimes just 500.
if you run a model often, you probably recognize a "real barrier", where you dont say "thats because of the initial state" anymore.
you want to go fast there, but dont keep the bad side effects of high learning rate (e.g. 1E-3) , you more want 1E-5.
there the exponential decay come in place.
call an instance of myLr=tf.train.exponential_decay(...) and pass it instead of the numerical learning rate parameter to your optimizer
for example Adam(myLr)
Indeed, the initial state of the model is crucial in order to get the best performance. Deep Learning works by optimizing a non-convex loss function in order to find the best local minima.
The initial weights will define the starting location of the optimization. As defined in the picture below. The starting point is defined by the initial weights and training the model will make it reach the local minima. As you see there is a starting weight configuration that allows reaching the global minima.
It is sometimes possible to have better weights initialization with Transfer Learning which is reusing the weights of a trained model on a downstream task. (For example VGG-16 in image classification, or NLP with BERT).
In your case, You should not try to finetune the weight initialization as this is meant to be random. Changing the architecture of your neural network, or its hyperparameter will certainly lead to better performance improvement.
Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
I have 8160 dataset, input's shape is (1500,1), output's shape is (2500,1). input's values are between -80 to 3. Outpur values vary between 0 or 1.
I thought that, normalization is applied to the input by using global maximum and global minimum. Then i realize, normalization is used for every specific input seperately by using local maximum and local minimum value of that input. Then i try both of them, and local normalization is gives much much better result.
I can't understand this. for example, by using local normalization, '0' corresponds to -20 for input 1, '0' corresponds to -80 for input 2. I think, usefull and comparative information about input 1 and input 2 is lost. Can someone help me to understand this, How can local normalization give better result? or is something wrong with my model? Thank you.
Making it short: the optimisation of neural networks is usually performed using a gradient decent algorithm. Normalising or standardising your variables helps the algorithm to converge, as the size of the input features impacts the state size. A big variance in the features sizes
For more detail have a look here and here.
As the author of the second link puts it:
Using such unprocessed data as input features for a linear regression system might slow down the gradient descent algorithm to a crawl. It happens because [...] such not normalized data warps the cost function the gradient descent has to process, making the minimum point really difficult to reach.
I've coded a simple neural network for XOR in python. While there is loads of information online about how to program this, there isn't much on how to feed the data through it. I've tested the change in weights after one cycle for inputs [1,1] to compare my results with my lecture slides and it's 100% the same, so I believe the code works. I can train the network for that same input, but when I change the input (and corresponding target) every cycle the error doesn't go down.
Should I allow changing the weights and inputs after every cycle or should I run through all the possible inputs first, get an average error and then change the weights? (But changing weights are dependent on the output, so what output would I use then)
I can share my code, if needed, but I'm pretty certain it's correct.
Please give me some advice? Thank you in advance.
So, you're saying you implemented a neural network on your own ?
well in this case, basically each neuron on the input layer must be assigned with a feature of a certain row, than just iterate through each layer and each neuron in that layer and calculate as instructed.
I'm sure you are familiar with the back-propagation algorithm so you'll know when to stop.
once you're done with that row, do it again to the next row, assign each feature to each of the input neurons and start the iterations again.
once youre done with all records, thats an Epoch.
I hope that answers your question.
also, I would recommend you to try out Keras, its easy to use and a good tool to be experienced in.
So I am dealing with a simple neural network with 10 inputs and one output. I can have as many hidden layers as suggested, however I am using 2. I am also using "mean_squared_error" loss function and RMSProp optimizer.
Anyhow, the question I have is, lets suppose my output values are like this:
[0,0,3,0,0,0,5,0,0,2,0...] etc. Note, that value 0 repeats more often. So What I would love to do, is to try to force Neural Network to learn better in case "non zero values on the output side". To give more of an "importance" to those values.
Because if I use 'mean_squared_error', the training will try to optimize according to entire dataset, this will lead mostly to optimization of cases, where 0 is an output value.
EDIT:
The problem I am dealing with, could be simple modeling of physical system. Let us say, we have a black-box system with known inputs. This black-box has a single outputs (let us say temperature). Based on our inputs and corresponding outputs, we could model the system using Neural Network as a "black-box" and then use the trained NN to predict temperature.
EDIT:
So I am now using different training/validation set. I was suspecting that there is something wrong with the previous one.
Now I got something like the image above (please see the immediate spike)
What could cause that?
Keep in mind, I am not experienced in NNs, so literally any feedback are welcomed :)
there are two important concepts in ML.
"underfitting" and "overfitting", which in your case I think it's underfitting.
to overcome this problem there are some ways:
make your model more complex by adding more layers and units
if you are using regularization terms, decrease their values
use more features (if there is any)
hope this help you.
If your outputs are integers [0,0,3,0,0,0,5,0,0,2,0...], i.e., classes, you will probably do a classification. So, your loss should be categorical_crossentopy. In this case, there are two ways of doing what you want:
1- You can use SMOTE, Synthetic Minority Oversampling technique so that the non-zero classes get the same weight as the zero-class. For binary classes:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
sm = SMOTEENN()
x, y = sm.fit_sample(X, Y)
2- You can also adjust Keras class weights:
class_weight = {0: 1.,1: 30.}
model.fit(X, Y, nb_epoch=1000, batch_size=16, class_weight=class_weight)
I am trying to write a simple neural network that can come up with weights to for, say, the y=x function. Here's my code:
http://codepad.org/rPdZ7fOz
As you can see, the error level never really goes down much. I tried changing the momentum and learning rate but it did not help much. Is my number of input, hidden and output correct for what I want to do? If not, what should it be? If so, what else could be wrong?
You're attempting to train the network to give output values 1,2,3,4 as far as I understood. Yet, at the output you use a sigmoid (math.tanh(..)) whose values are always between -1 and 1.
So the output of your Neural network is always between -1 and 1 and thus you always get a large error when trying to fit output values outside that range.
(I just checked that when scaling your input and output values by 0.1, there seems to be a nice training progress and I get at the end:
error 0.00025
)
The Neural Network you're using is useful if you want to do classification (e.g. assign the data point to class A if the NN output is < 0 or B if it is > 0). It looks like what you want to do is regression (fit a real-valued function).
You can remove the sigmoid at the output node but you will have to slightly modify your backpropagation procedure to take this into account.