Tensorflow neural network loss value NaN

Tensorflow neural network loss value NaN - python

I'm trying to build a simple multilayer perceptron model on a large data set but I'm getting the loss value as nan. The weird thing is: after the first training step, the loss value is not nan and is about 46 (which is oddly low. when i run a logistic regression model, the first loss value is about ~3600). But then, right after that the loss value is constantly nan. I used tf.print to try and debug it as well.
The goal of the model is to predict ~4500 different classes - so it's a classification problem. When using tf.print, I see that after the first training step (or feed forward through MLP), the predictions coming out from the last fully connected layer seem right (all varying numbers between 1 and 4500). But then, after that the outputs from the last fully connected layer go to either all 0's or some other constant number (0 0 0 0 0).
For some information about my model:
3 layer model. all fully connected layers.
batch size of 1000
learning rate of .001 (i also tried .1 and .01 but nothing changed)
using CrossEntropyLoss (i did add an epsilon value to prevent log0)
using AdamOptimizer
learning rate decay is .95
The exact code for the model is below: (I'm using the TF-Slim library)
input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize = 20, first_n = 10)
return {"predictions": output}
Any help would be greatly appreciated! Thank you so much!

Two (possibly more) reasons why it doesn't work:
You skipped or inappropriately applied feature scaling of your
inputs and outputs. Consequently, data may be difficult to handle
for Tensorflow.
Using ReLu, which is a discontinuous function, may raise issues. Try using other activation functions, such as tanh or sigmoid.

For some reasons, your training process has diverged, and you may have infinite values in your weights, wich gives NaN losses. The reasons can be many, try changing your training parameters (use smaller batchs for test).
Also, using a relu for the last output in a classifier is not the usual method, try using a sigmoid.

From my understanding Relu doesn't put a cap on the upper bound for Neural Networks so its more likely to deconverge depending upon its implementation.
Try switching all the activation functions to tanh or sigmoid. Relu is generally used for convolution in cnns.
Its also difficult to determine if your deconverging due to cross entropy as we don't know how you effected it with your epsilon value. Try just using the residual its much simpler but still effective.
Also a 5000-5000-4500 neural network is huge. Its unlikely you actually need a network that large.

Related

Keras - Hyper Tuning the initial state of the model

I've written an LSTM model that predicts the sequential data.
def get_model(config, num_features, output_size):
opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))
inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
layers = BatchNormalization()(layers)
if 'dropout_rate' in config['hp']:
layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)
for layer in get_deep(config, 'hp.dense_layers'):
layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
layers = BatchNormalization()(layers)
if 'dropout_rate' in layer:
layers = Dropout(layer['dropout_rate'])(layers)
layers = Dense(output_size, activation='sigmoid')(layers)
model = Model(inputs, layers)
model.compile(loss='mse', optimizer=opt, metrics=['mse'])
model.summary()
return model
I've tuned some of the layer's params using AWS SageMaker. While validating the model I've run a model with a specific configuration several times. Most of the time the results are similar, however, one run was much better than others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Update:
As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:
# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)
The results replicate for each new train, however, different seeds can produce much better results than others. So how do I find/tune the best seed?

... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
..... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Firstly, in that video, apart from the state or weights initializer, all the other factors such as learning rate, schedule, optimizer, batch size, loss function, model depth, etc are something you should play with them to find the best set (we will talk about the role of seed later). Normally, we don't need to tune the default weight or state initializer as those are currently the best; and as usual, this state initialization is a research problem.
Secondly, in keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM is glorot_uniform, also known as Xavier uniform initializer. And the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you would find them. About it, according to the doc
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may already notice that this initializer is inherited from the VarianceScaling; and same as GlorotUniform, others like GlorotNormal, LecunNormal, LecunUniform, HeNormal, HeUniform are also inheriting it. Regarding the VarianceScaling, here is listed the supported parameter. For example, technically, the following two are the same.
# in case if you want to try various initializer -
# use VarianceScaling by passing proper parameter.
# ie. tf.keras.layers.LSTM(..., kernel_initializer=initializer)
# bur recommended to stick with glorot_uniform (default)
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
mode='fan_avg', seed=101,
distribution='uniform')
print(initializer(shape=(2, 2)))
initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
def conv_kernel_initializer(shape, dtype=None):
kernel_height, kernel_width, _, out_filters = shape
fan_out = int(kernel_height * kernel_width * out_filters)
return tf.random.normal(
shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)
def dense_kernel_initializer(shape, dtype=None):
init_range = 1.0 / np.sqrt(shape[1])
return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
Here is one good article about initializing the weights, you may enjoy reading. Butt again, better to go with default ones.
Thirdly, for setting different seed values and different sets of hyper-parameter, etc, I better leave one of my old answers here, mostly the first diagram probably come in handy to your experiments. One of an approach that I follow is to keep my seed same (let's say for first 5 experiments) and change another factor and log the results. And after 5 iterations, hopefully, we would get some best set and approach further.
Update
Find/Tune Seed. Before searching the method to find the best seed, one must understand that seed is not a hyper-parameter that needs to be tuned with other hyperparameters such as learning rate, scheduler, optimizer, etc.
Here are one scenario, let's say you split the data randomly into two parts with seed 42: train set (70%) and test set (30%) and after training on the train set, you evaluate on the test set of your model and received score 80. Then you change your seed to 101, and again do the same but now you got score 50. Now, this doens't mean picking seed 42 is better; but it simply means your model is unstable and most like won't do well on the unseen data. This is actually a well-known issue if someone randomly split their data set for training and testing. Why it happens? Because, when you split the data randomly, it's possible that there would be a mismatch in class-distribution. Please, check the following two very related discussion on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?

I dont think there is a "one shoe fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and the data that we are using to solve that problem. All we can do is point you towards a good resource from where you can try to see which of the approaches mentioned fits your problem.
The following article is a good resource that not only provides you with a detailed understanding of how and why to initialize weights but also points towards peer reviewed research that can help build an academic understanding.

maybe you search for exponential decay learning rate.
let me explain
for example you first epoch has sometimes a loss of 3000, 4000 , sometimes just 500.
if you run a model often, you probably recognize a "real barrier", where you dont say "thats because of the initial state" anymore.
you want to go fast there, but dont keep the bad side effects of high learning rate (e.g. 1E-3) , you more want 1E-5.
there the exponential decay come in place.
call an instance of myLr=tf.train.exponential_decay(...) and pass it instead of the numerical learning rate parameter to your optimizer
for example Adam(myLr)

Indeed, the initial state of the model is crucial in order to get the best performance. Deep Learning works by optimizing a non-convex loss function in order to find the best local minima.
The initial weights will define the starting location of the optimization. As defined in the picture below. The starting point is defined by the initial weights and training the model will make it reach the local minima. As you see there is a starting weight configuration that allows reaching the global minima.
It is sometimes possible to have better weights initialization with Transfer Learning which is reusing the weights of a trained model on a downstream task. (For example VGG-16 in image classification, or NLP with BERT).
In your case, You should not try to finetune the weight initialization as this is meant to be random. Changing the architecture of your neural network, or its hyperparameter will certainly lead to better performance improvement.

Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.

How to handle LSTMs with many features in python?

I have a binary classification problem. I use the following keras model to do my classification.
input1 = Input(shape=(25,6))
x1 = LSTM(200)(input1)
input2 = Input(shape=(24,6))
x2 = LSTM(200)(input2)
input3 = Input(shape=(21,6))
x3 = LSTM(200)(input3)
input4 = Input(shape=(20,6))
x4 = LSTM(200)(input4)
x = concatenate([x1,x2,x3,x4])
x = Dropout(0.2)(x)
x = Dense(200)(x)
x = Dropout(0.2)(x)
output = Dense(1, activation='sigmoid')(x)
However, the results I get is extremely bad. I thought the reason is that I have too many features, thus, needs have more improved layers after the concatenate.
I was also thinking if it would be helpful to used a flatten() layer after the concatenate.
anyway, since I am new to deep learning, I am not so sure how to make this a better model.
I am happy to provide more details if needed.

Here is what I can suggest
Remove every things that prevent overfitting, such as Dropout and regularizer. What can happen is that your model may not be able to capture the complexity of your data using given layer, so you need to make sure that your model is able to overfit first before adding regularizer.
Now try increase number of Dense layer and number of neuron in each layer until you can see some improvement. There is also a possibility that your data is too noisy or you have only few data to train the model so you can't even produce a useful predictions.
Now if you are LUCKY and you can see overfitting, you can add Dropout and regularizer.
Because every neural network is a gradient base algorithm, you may end up at local minimum. You may also need to run the algorithm multiple times with different initial weight before you can get a good result or You can change your loss function so that you have a convex problem where local minimum is global minimum.
If you can't achieve better result
You may need to try different topology because LSTM is just trying to model a system that assume to have Markov property. you can look at nested-LSTM or something like that, which model the system in the way that next time step is not just depend on current time step.

The Dropout right before the output layer could be problematic. I would suggest removing both Dropout layers and evaluating performance, then re-introduce regularization once the model is performing well on the the training set.

Python Keras LSTM learning converges too fast on high loss

This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.
I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).
The thing is, the learning always converges too fast on high loss (both training and testing). I've tried all possible hyperparameters, and I have a feeling it's a local minima issue that causes the model's high bias.
My questions are basically :
How to initialize weights and bias given this problem?
Which optimizer to use?
How deep I should extend the network (I'm afraid that if I use a very deep network, the training time will be unbearable and the model variance will grow)
Should I add more training data?
Input and output are normalized with minmax.
I am using SGD with momentum, currently 3 LSTM layers (126,256,128) and 2 dense layers (200 and 1 output neuron)
I have printed the weights after few epochs and noticed that many weights
are zero and the rest are basically have the value of 1 (or very close to it).
Here are some plots from tensorboard :

Faster convergence with a very high loss could possibly mean you are facing an exploding gradients problem. Try to use a much lower learning rate like 1e-5 or 1e-6. You can also try techniques like gradient clipping to limit your gradients in case of high learning rates.
Answer 1
Another reason could be initialization of weights, try the below 3 methods:
Method described in this paper https://arxiv.org/abs/1502.01852
Xavier initialization
Random initialization
For many cases 1st initialization method works the best.
Answer 2
You can try different optimizers like
Momentum optimizer
SGD or Gradient descent
Adam optimizer
The choice of your optimizer should be based on the choice of your loss function. For example: for a logistic regression problem with MSE as a loss function, gradient based optimizers will not converge.
Answer 3
How deep or wide your network should be is again fully dependent on which type of network you are using and what the problem is.
As you said you are using a sequential model using LSTM, to learn sequence on text. No doubt your choice of model is good for this problem you can also try 4-5 LSTMs.
Answer 4
If your gradients are going either 0 or infinite, it is called vanishing gradients or it simply means early convergence, try gradient clipping with proper learning rate and the first weight initialization technique.
I am sure this will definitely solve your problem.

Consider reducing your batch_size.
With large batch_size, it could be that your gradient at some point couldn't find any more variation in your data's stochasticity and for that reason it convergences earlier.

Replicating results with different artificial neural network frameworks (ffnet, tensorflow)

I'm trying to model a technical process (a number of nonlinear equations) with artificial neural networks. The function has a number of inputs and a number of outputs (e.g. 50 inputs, 150 outputs - all floats).
I have tried the python library ffnet (wrapper for a fortran library) with great success. The errors for a certain dataset are well below 0.2%.
It is using a fully connected graph and these additional parameters.
Basic assumptions and limitations:
Network has feed-forward architecture.
Input units have identity activation function, all other units have sigmoid activation function.
Provided data are automatically normalized, both input and output, with a linear mapping to the range (0.15, 0.85). Each input and output is treated separately (i.e. linear map is unique for each input and output).
Function minimized during training is a sum of squared errors of each output for each training pattern.
I am using one input layer, one hidden layer (size: 2/3 of input vector + size of output vector) and an output layer. I'm using the scipy conjugate gradient optimizer.
The downside of ffnet is the long training time and the lack of functionality to use GPUs. Therefore i want to switch to a different framework and have chosen keras with TensorFlow as the backend.
I have tried to model the previous configuration:
model = Sequential()
model.add(Dense(n_hidden, input_dim=n_in))
model.add(BatchNormalization())
model.add(Dense(n_hidden))
model.add(Activation('sigmoid'))
model.add(Dense(n_out))
model.add(Activation('sigmoid'))
model.summary()
model.compile(loss='mean_squared_error',
optimizer='Adamax',
metrics=['accuracy'])
However the results are far worse, the error is up to 0.5% with a few thousand (!) epochs of training. The ffnet training was automatically canceled at 292 epochs. Furthermore the differences between the network response and the validation target are not centered around 0, but mostly negative.
I have tried all optimizers and different loss functions. I have also skipped the BatchNormalization and normalized the data manually in the same way that ffnet does it. Nothing helps.
Does anyone have a suggestion to obtain better results with keras?

I understand you are trying to re-train the same architecture from scratch, with a different library. The first fundamental issue to keep in mind here is that neural nets are not necessarily reproducible, when weights are initialized randomly.
For example, here is the default constructor parameter for Dense in Keras:
init='glorot_uniform'
But even before trying to evaluate the convergence of Keras optimizations, I would recommend trying to port the weights for which you got good results, from ffnet, into your Keras model. You can do so either with the kwarg Dense(..., weights=) of each layer, or globally at the end model.set_weights(...)
Using the same weights must yield the exact same result between the two libs. Unless you run into some floating point rounding issues. I believe that as long as porting the weights is not consistent, working on the optimization is unlikely to help.

What activation function to use or modifications to make when neural network gives same output on regression with PyBrain?

I have a neural network with one input, three hidden neurons and one output. I have 720 input and corresponding target values, 540 for training, 180 for testing.
When I train my network using Logistic Sigmoid or Tan Sigmoid function, I get the same outputs while testing, i.e. I get same number for all 180 output values. When I use Linear activation function, I get NaN, because apparently, the value gets too high.
Is there any activation function to use in such a case? Or any improvements to be done? I can update the question with details and code if required.

Neural nets are not stable when fed input data on arbitrary scales (such as between approximately 0 and 1000 in your case). If your output units are tanh they can't even predict values outside the range -1 to 1 or 0 to 1 for logistic units!
You should try recentering/scaling the data (making it have mean zero and unit variance - this is called standard scaling in the datascience community). Since it is a lossless transformation you can revert back to your original scale once you've trained the net and predicted on the data.
Additionally, a linear output unit is probably the best as it makes no assumptions about the output space and I've found tanh units to do much better on recurrent neural networks in low dimensional input/hidden/output nets.

Newmu is right that the scaling is probably the issue here; you need to scale your inputs to lie in the valid range. (Standardization to zero mean, unit variance, as they suggest, though, isn't a great choice since that means about a third of your data will like outside [-1, 1]....) I don't know about pybrain, but in scikit-learn you'd want sklearn.preprocessing.MinMaxScaler.
But, also, in the comments you said your dataset looks like this:
where the horizontal axis is inputs, vertical is targets. So, when you see an input of 200, you have one training example saying it's 80 and one saying it's 320; what do you want it to say then? An "optimal" neural network (which may be hard to achieve) would predict 200 or so.
You may need to think about how to reframe your learning problem to be a more-consistent function from inputs to targets.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow neural network loss value NaN - python

Related

Keras - Hyper Tuning the initial state of the model

How to handle LSTMs with many features in python?

Python Keras LSTM learning converges too fast on high loss

Replicating results with different artificial neural network frameworks (ffnet, tensorflow)

What activation function to use or modifications to make when neural network gives same output on regression with PyBrain?

Categories

Resources