TensorFlow ReLU doesn't work? - python

I have written a convolutional network in TensorFlow with ReLU as the activation function; however, it is not learning (the loss stays constant for both the training and evaluation data sets).
With other activation functions everything works as it should.
Here is the code where the network is created:
def _create_nn(self):
    current = tf.layers.conv2d(self.input, 20, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    self.descriptor = current = tf.layers.conv2d(current, 32, 5, activation=self.activation)
    if not self.drop_conv:
        current = tf.layers.conv2d(current, self.layer_7_filters_n, 3, activation=self.activation)
    if self.add_conv:
        current = tf.layers.conv2d(current, 48, 2, activation=self.activation)
    self.descriptor = current

    last_conv_output_shape = current.get_shape().as_list()
    self.descr_size = last_conv_output_shape[1] * last_conv_output_shape[2] * last_conv_output_shape[3]

    current = tf.layers.dense(tf.reshape(current, [-1, self.descr_size]), 100, activation=self.activation)
    current = tf.layers.dense(current, 50, activation=self.last_activation)

    return current
self.activation is set to tf.nn.relu and self.last_activation is set to tf.nn.softmax.
The loss function and the optimizer are created here:
self._nn = self._create_nn()
self._loss_function = tf.reduce_sum(tf.squared_difference(self._nn, self.Y), 1)
optimizer = tf.train.AdamOptimizer()
self._train_op = optimizer.minimize(self._loss_function)
I tried changing the variable initialization by passing tf.random_normal_initializer(0.1, 0.1) as the initializer, but it did not result in any change in the loss.
I would be grateful for help in making this neural network work with ReLU.
Edit: Leaky ReLU has the same problem.
Edit: A small example where I managed to reproduce the same error:
x = tf.constant([[3., 211., 123., 78.]])
v = tf.Variable([0.5, 0.5, 0.5, 0.5])
h_d = tf.layers.Dense(4, activation=tf.nn.leaky_relu)
h = h_d(x)
y_d = tf.layers.Dense(4, activation=tf.nn.softmax)
y = y_d(h)
d = tf.constant([[.5, .5, 0, 0]])
The gradients (as calculated with tf.gradients) for the h_d and y_d kernels and biases are either equal to 0 or close to 0.

In a (perhaps improbable) case, all the pre-activations in some layer can be negative for all samples. ReLU sets them to zero, and there is no learning progress because the gradient is zero in the negative part of the ReLU.
Things that make this more probable are a small dataset, odd scaling of the input features, inappropriate weight initialization, and/or few channels in the intermediate layers.
Here you use random_normal_initializer with mean=0.1, so if your inputs are all negative, they may get mapped to negative pre-activations. Try mean=0, or rescale the input features.
You can also try a Leaky ReLU. The learning rate may also be too small or too large.
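As a minimal sketch of the initializer suggestion (the input placeholder shape here is an assumption, not taken from the question):

import tensorflow as tf

# Hypothetical first convolution with a zero-mean initializer instead of
# tf.random_normal_initializer(0.1, 0.1).
inputs = tf.placeholder(tf.float32, shape=(None, 64, 64, 1))  # assumed input shape
init = tf.random_normal_initializer(mean=0.0, stddev=0.1)
current = tf.layers.conv2d(inputs, 20, 3, activation=tf.nn.relu,
                           kernel_initializer=init)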

Looks like the problem was with the scale of the input data. With values between 0 and 255, that scale was more or less preserved in the subsequent layers, so the pre-activation outputs of the last layer had differences large enough to drive the softmax gradient to (almost) 0.
It was observable only with ReLU-like activation functions because others, like sigmoid or softsign, kept the value ranges in the network smaller, on the order of magnitude of 1 instead of tens or hundreds.
The solution was simply to multiply the input to rescale it to 0-1; in the case of bytes, by 1/255.
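A minimal sketch of that rescaling (assuming the images arrive as a NumPy array of bytes named x_train, which is not shown in the question):

import numpy as np

x_train = np.random.randint(0, 256, size=(8, 64, 64, 1))   # stand-in for the real byte-valued images
x_train = x_train.astype(np.float32) / 255.0               # rescale from [0, 255] to [0, 1]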

Related

Keras functional API: My model is only optimizing for one loss function instead of 2 loss functions while training

I have 2 loss functions in my model - Cross Entropy and Mean Squared Error.
I want my model to minimize both losses, but the model is only minimizing mean squared error during training.
def buildGenerator(dmodel, batch=100):
    inputs = Input(shape=(256, 256, 1))
    x = Conv2D(
        filters = 32,
        kernel_size = 3,
        padding = 'same',
        strides = 1
    )(inputs)
    x = BatchNormalization(momentum = 0.9)(x)
    x = LeakyReLU(alpha=0.2)(x)
    .........................
    ...........................
    outputs1 = Conv2D(
        filters = 2,
        kernel_size = 3,
        padding = 'same',
        strides = 1
    )(x)
    outputs2 = dmodel(outputs1)
    model = Model(inputs = inputs, outputs = [outputs2, outputs1], name = 'functional_model')
    model.compile(
        loss = ['binary_crossentropy', 'mse'],
        optimizer = 'Adam',
        loss_weights = [1.0, 0.6],
        metrics=['accuracy', 'mse']
    )
    return model
In this code, dmodel is another model. I am using dmodel to classify outputs1 generated by the model, and then computing the cross-entropy between the input labels and the output labels.
This is how I am training:
dmodel = buildDiscriminator()
dmodel.load_weights('./GAN/discriminator')
dmodel.trainable = False
x, y1 = getGeneratorData()
y2 = np.ones((batch, 1))
model = buildGenerator(dmodel)
model.fit(x,[y2, y1],epochs=1)
I tried a lot of things, like changing loss_weights and changing loss functions, but nothing is working. My model is only minimizing the MSE function.
I don't understand what I am doing wrong.
I think using the discriminator model inside the generator is the issue, but I am not sure.
I do not know whether there is a simple syntax for combining different loss functions, but you can try to define your own loss class. In another thread I found this code snippet, which defines a custom loss class that combines two other loss functions:
rho = 0.05

class loss_with_KLD(losses.Loss):
    def __init__(self, rho):
        super(loss_with_KLD, self).__init__()
        self.rho = rho
        self.kl = losses.KLDivergence()
        self.mse = losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.SUM)

    def call(self, y_true, y_pred):
        mse = self.mse(y_true, y_pred)
        kl = self.kl(self.rho, y_pred)
        return mse + kl
If you just replace the KLDivergence with binary cross entropy, this should work. Additionally, you would need to alter the call() function, since this implementation applies two loss functions to the same predicted y value, while you actually predict two different y values. In that case, your y_true and y_pred would both contain two values, and you would need to apply each loss function to only one of them. I do not know if it is easily possible to take a single element from a vector (in the style of y_true[0]), but if it is not, you could work around this by applying a "mask" to your vectors, multiplying them by [0, 1] or [1, 0] depending on the value you need. With that done, you can use the reduce_sum() function to get a single value and apply each loss function to your new y_true and y_pred.
This is a little more complicated, but it should get the job done.
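A minimal sketch of that idea using slicing along the last axis (how the two targets are packed into one tensor is an assumption, not something shown in the question):

import tensorflow as tf
from tensorflow.keras import losses

class combined_loss(losses.Loss):
    """Hypothetical combined loss: binary cross entropy on component 0, MSE on component 1."""
    def __init__(self):
        super(combined_loss, self).__init__()
        self.bce = losses.BinaryCrossentropy()
        self.mse = losses.MeanSquaredError()

    def call(self, y_true, y_pred):
        # Each loss sees only its own slice of the packed targets/predictions.
        bce = self.bce(y_true[..., 0:1], y_pred[..., 0:1])
        mse = self.mse(y_true[..., 1:2], y_pred[..., 1:2])
        return bce + mse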
When you specify 2 loss functions, they apply to your 2 different outputs.
That is, in your example binary_crossentropy applies to outputs2, which has a y_true value of all ones and is the output of a non-trainable model.
It seems likely that you want to return a single value from the model, since you do not seem to have labels for outputs2. While you could define your own custom loss function that combines both losses on the same value, I would advise against it. If the output value is a single class prediction (i.e. pixel on/off), then binary_crossentropy makes sense; if it is supposed to be a continuous value, then mse makes sense.
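For reference, a short annotated restatement of how the loss list in the question's compile() call is matched to the outputs (the matching is purely positional; this reuses the question's own symbols):

# outputs=[outputs2, outputs1]  pairs positionally with  loss=['binary_crossentropy', 'mse'],
# so binary_crossentropy is computed on outputs2 (the frozen dmodel's prediction against y2)
# and mse is computed on outputs1 (the generated image against y1).
model = Model(inputs = inputs, outputs = [outputs2, outputs1], name = 'functional_model')
model.compile(
    optimizer = 'Adam',
    loss = ['binary_crossentropy', 'mse'],   # [loss for outputs2, loss for outputs1]
    loss_weights = [1.0, 0.6]                # [weight for outputs2, weight for outputs1]
)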

Keras Recurrent Neural Networks For Multivariate Time Series

I have been reading about Keras RNN models (LSTMs and GRUs), and authors seem to largely focus on language data or univariate time series that use training instances composed of previous time steps. The data I have is a bit different.
I have 20 variables measured every year for 10 years for 100,000 persons as input data, and the 20 variables measured for year 11 as output data. What I would like to do is predict the value of one of the variables (not the other 19) for the 11th year.
I have my data structured as X.shape = [persons, years, variables] = [100000, 10, 20] and Y.shape = [persons, variable] = [100000, 1]. Below is my Python code for a LSTM model.
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh',
input_shape = (X.shape[1], X.shape[2])))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X, Y, epochs = 25, batch_size = 128)
I have four (related) questions, please:
Have I coded the Keras model correctly for the data structure I have? The performance I get from a fully-connected network (using flattened data) and from the LSTM, GRU, and 1D CNN models is nearly identical, and I don't know if I have made an error in Keras or if a recurrent model is simply not helpful in this case.
Should I have Y as a series with shape Y.shape = [persons, years] = [100000, 11], rather than including the variable in X, which would then have shape X.shape = [persons, years, variables] = [100000, 10, 19]? If so, how can I get the RNN to output the predicted sequence? When I use return_sequences = True, Keras returns an error.
Is this the best way to predict with the data I have? Are there better option choices available in the Keras RNN models, or even other models?
How could I simulate data resembling the data structure I have so that an RNN model would outperform a fully-connected network?
UPDATE:
I have tried a simulation, with what I hope is a very simple case where an RNN should be expected to outperform an FNN.
While the LSTM tends to outperform the FNN when both are small (4 hidden units), the performance becomes identical with more hidden units (8+). Can anyone think of a better simulation where an RNN would be expected to outperform an FNN with a similar data structure?
from keras import models
from keras import layers
from keras.layers import Dense, LSTM
import numpy as np
import matplotlib.pyplot as plt
The code below simulates data for 10,000 instances, 10 time steps, and 2 variables. If the second variable has a 0 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 3. If the second variable has a 1 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 9.
My hope was that the RNN would keep the value of the second variable at the very first time step in memory and use that to know which value (3 or 9) to multiply the first variable at the very last time step by.
## Simulate data.
instances = 10000
sequences = 10
X = np.zeros((instances, sequences * 2))
X[:int(instances / 2), 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()

Y = np.zeros((instances, 1))

for i in range(len(Y)):
    if X[i, 1] == 0:
        Y[i] = X[i, -2] * 3
    if X[i, 1] == 1:
        Y[i] = X[i, -2] * 9
Below is code for a FNN:
## Densely connected model.
# Define model.
network_dense = models.Sequential()
network_dense.add(layers.Dense(4, activation = 'relu',
input_shape = (X.shape[1],)))
network_dense.add(Dense(1, activation = None))
# Compile model.
network_dense.compile(optimizer = 'rmsprop', loss = 'mean_absolute_error')
# Fit model.
history_dense = network_dense.fit(X, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_dense.predict(X[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_dense.predict(X[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Below is code for a LSTM:
## Structure X data for LSTM.
X_lstm = X.reshape(X.shape[0], X.shape[1] // 2, 2)
X_lstm.shape
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(4, activation = 'relu',
input_shape = (X_lstm.shape[1], 2)))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X_lstm, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_lstm.predict(X_lstm[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_lstm.predict(X_lstm[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Yes, the code used is correct for what you are trying to do. Ten years is the time window used to predict the following year, so that should be the number of inputs into your model for each of the 20 variables. The sample size of 100,000 observations is not relevant to the input shape of your model.
The way you originally shaped the dependent variable Y is correct. You are predicting a window of 1 year for 1 variable, and you have 100,000 observations. The keyword argument return_sequences=True will cause an error to be thrown because you only have a single LSTM layer. Set this parameter to True if you are implementing multiple LSTM layers and the layer in question is followed by another LSTM layer.
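A minimal sketch of that stacked case, reusing the question's models/layers imports and X (the 128/64 unit counts are arbitrary placeholders, not a recommendation):

# return_sequences=True only on an LSTM layer that feeds another LSTM layer.
network_stacked = models.Sequential()
network_stacked.add(layers.LSTM(128, activation = 'tanh', return_sequences = True,
                                input_shape = (X.shape[1], X.shape[2])))
network_stacked.add(layers.LSTM(64, activation = 'tanh'))  # last LSTM: return_sequences defaults to False
network_stacked.add(layers.Dense(1, activation = None))
network_stacked.compile(optimizer = 'adam', loss = 'mean_squared_error')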
I wish I could offer some guidance on 3, but without actually having your dataset I don't know if it's possible to answer this with any certainty.
I will say that LSTMs were designed to address what is known as the long-term dependency problem present in regular RNNs. What this boils down to is that as the gap grows between when the relevant information is observed and the point where that information becomes useful, a standard RNN has a harder time learning the relationship between them. Think of predicting a stock price based on 3 days of activity versus an entire year.
This leads into number 4. If I use the term 'resembling' loosely and stretch your time window further out, to say 50 years as opposed to 10, the advantages gained from using an LSTM should become more apparent. Although I'm sure that someone more experienced will be able to offer a better answer, and I look forward to seeing it.
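As a rough sketch of that suggestion, one could simply stretch the window in the question's own simulation (50 is an arbitrary choice; the Y-generation loop stays the same):

# Same simulation as above, but with 50 time steps between the informative flag
# at the first step and the value it modulates at the last step.
sequences = 50
X = np.zeros((instances, sequences * 2))
X[:int(instances / 2), 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()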
I found this page helpful for understanding LSTMs:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

How can I make TensorFlow RNN training more robust?

I am training an RNN on a time series. I subclassed RNNCell and I use it in dynamic_rnn. The topology of the RNNCell is as follows:
input (shape [15, 100, 3])
1x3 convolution (5 kernels), ReLu (shape [15, 98, 5])
1x(remaining) convolution (20 kernels), ReLu (shape [15, 1, 20])
concatenate previous output (shape [15, 1, 21])
squeeze and 1x1 convolution (1 kernel), ReLu (shape [15, 1])
squeeze and softmax (shape [15])
The batch size for dynamic_rnn is around 100 (not the same 100 as in the description above; that is the number of time periods in a window of data). Epochs are made of about 200 batches.
I would like to experiment with hyperparameters and regularization, but too often what I try stops the learning entirely and I don't understand why. These are some of the weird things that happen:
Adagrad works, but if I use Adam or Nadam the gradients are all zero.
I am forced to set a huge learning rate (~1.0) to see learning from epoch to epoch.
If I try to add dropout after any of the convolutions, even if I set keep_prob to 1.0 it stops learning.
If I tweak the number of kernels in the convolutions, for some choices that would seem just as good (e.g. 5, 25, 1 vs 5, 20, 1) again the network stops learning entirely.
Why is this model so fragile? Is it the topology of the RNNCell?
EDIT:
This is the code of the RNNCell:
class RNNCell(tf.nn.rnn_cell.RNNCell):

    def __init__(self):
        super(RNNCell, self).__init__()
        self._output_size = 15
        self._state_size = 15

    def __call__(self, X, prev_state):
        network = X
        # ------ 2 convolutional layers ------
        network = tflearn.layers.conv_2d(network, 5, [1, 3], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        width = network.get_shape()[2]
        network = tflearn.layers.conv_2d(network, 20, [1, width], [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        # ------ concatenate the previous state ------
        _, height, width, features = network.get_shape()
        network = tf.reshape(network, [-1, int(height), 1, int(width * features)])
        network = tf.concat([network, prev_state[..., None, None]], axis=3)
        # ------ last convolution and softmax ------
        network = tflearn.layers.conv_2d(network, 1, [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        network = network[:, :, 0, 0]
        predictions = tflearn.layers.core.activation(network, activation="softmax")
        return predictions, predictions

    @property
    def output_size(self):
        return self._output_size

    @property
    def state_size(self):
        return self._state_size
Most probably you are facing the vanishing gradient problem.
Potentially the instability is caused by using ReLU in combination with a pretty small number of parameters to tune. As far as I understand from the description, there are only 1x3x5 = 15 trainable parameters in the first layer, for instance. If we suppose that the initialization is around zero, then the gradients of roughly 50% of the parameters will always stay zero. Generally speaking, ReLU on small networks is trouble, especially in the case of RNNs.
Try Leaky ReLU (though you can then face exploding gradients).
Try tanh, but check that the initial values of the parameters are really around zero; otherwise your gradients will vanish very quickly as well.
Retrieve the results of the untrained, just-initialized network at step 0 (see the sketch after this list). With the right initialization and network construction you should get roughly normally distributed values around 0.5. If you get strictly ones, zeros, or a mix of them, your architecture is wrong. Values that are all strictly 0.5 are also bad.
Consider a more robust approach such as LSTM.
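A minimal sketch of that step-0 sanity check (the names inputs, predictions, and X_batch are placeholders for whatever the real graph uses; they are not taken from the question):

# Run the freshly initialized graph once, before any training step,
# and look at the distribution of the softmax outputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(predictions, feed_dict={inputs: X_batch})
    print("min %.4f  mean %.4f  max %.4f" % (out.min(), out.mean(), out.max()))
    # Outputs pinned at exactly 0 or 1 right after initialization point to
    # saturated or dead units rather than a healthy starting point.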

Can't seem to implement L2 regularization correctly in Python — low accuracy scores

I'm trying to add regularization to my Mnist digits NN classifier, which I've created using numpy and vanilla Python. I'm currently using Sigmoid activations with Cross Entropy cost function.
Without using the regularizer, I get 97% accuracy.
However, once I add the regularizer, I'm only getting about 11% despite playing around with different hyperparameters. I've tried different learning rates:
.001, .1, 1
and different lambd values such as:
.5, .8, 1.0, 2.0 etc.
I can't seem to figure out what mistake I'm making. I feel like I'm missing a step maybe?
The only changes I've made are to the derivatives of the weights. I've implemented the gradients as follows:
def calculate_gradients(self, x, y, lambd):
    '''calculate all gradients with respect to
    cost. Here our cost function is cross_entropy
    last_layer_z_error = dC/dZ (z is logit)
    All weight gradients also include regularization gradients
    x.shape[0] = len of sample size
    '''
    ##### First we calculate the output layer gradients #########
    gradients, activations, zs = self.gather_backprop_data(x, y)

    #gradient of cost with respect to Z of last layer
    last_layer_z_error = ((activations[-1] - y))

    #updating the weight_derivatives of final layer
    gradients['w' + str(self.num_layers - 1)] = np.dot(activations[-2].T, last_layer_z_error)/x.shape[0] + (lambd/x.shape[0])*(self.parameters['w' + str(self.num_layers - 1)])
    gradients['b' + str(self.num_layers - 1)] = np.mean(last_layer_z_error, axis=0)
    gradients['b' + str(self.num_layers - 1)] = np.expand_dims(gradients['b' + str(self.num_layers - 1)], 0)

    ###HIDDEN LAYER GRADIENTS###
    z_previous_layer = last_layer_z_error
    for i in reversed(range(1, self.num_layers - 1)):
        z_previous_layer = np.dot(z_previous_layer, self.parameters['w' + str(i+1)].T) * \
            (sigmoid_derivative(zs[i-1]))
        gradients['w' + str(i)] = np.dot((activations[i-1].T), z_previous_layer)/x.shape[0] + (lambd/x.shape[0])*(self.parameters['w' + str(i)])
        gradients['b' + str(i)] = np.mean(z_previous_layer, axis=0)
        gradients['b' + str(i)] = np.expand_dims(gradients['b' + str(i)], 0)

    return gradients
I've uploaded the entire notebook to GitHub if needed:
https://github.com/moondra2017/Neural-Networks-from-scratch/blob/master/Neural%20Network%20from%20scratch-Testing%20expanded%20Mnist-Sigmoid%20with%20cross-entroupy-with%20L2%20regularization.ipynb

Cost function always returning zero for a binary classification in tensorflow

I have written the following binary classification program in TensorFlow, and it is buggy. The cost is always zero, no matter what the input is. I am trying to debug a larger program which does not learn anything from the data. I have narrowed down at least one bug to the cost function always returning zero. The given program uses some random inputs and has the same problem. self.X_train and self.y_train are originally supposed to be read from files, and the function self.predict() has more layers forming a feedforward neural network.
import numpy as np
import tensorflow as tf
class annClassifier():

    def __init__(self):
        with tf.variable_scope("Input"):
            self.X = tf.placeholder(tf.float32, shape=(100, 11))
        with tf.variable_scope("Output"):
            self.y = tf.placeholder(tf.float32, shape=(100, 1))
        self.X_train = np.random.rand(100, 11)
        self.y_train = np.random.randint(0, 2, size=(100, 1))

    def predict(self):
        with tf.variable_scope('OutputLayer'):
            weights = tf.get_variable(name='weights',
                                      shape=[11, 1],
                                      initializer=tf.contrib.layers.xavier_initializer())
            bases = tf.get_variable(name='bases',
                                    shape=[1],
                                    initializer=tf.zeros_initializer())
            final_output = tf.matmul(self.X, weights) + bases
        return final_output

    def train(self):
        prediction = self.predict()
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=self.y))
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            print(sess.run(cost, feed_dict={self.X: self.X_train, self.y: self.y_train}))

with tf.Graph().as_default():
    classifier = annClassifier()
    classifier.train()
If someone could please figure out what I am doing wrong in this, I can try making the same change in my original program. Thanks a lot!
The only problem is the invalid cost being used. softmax_cross_entropy_with_logits should only be used if you have more than two classes, as the softmax of a single output always returns 1, since it is defined as:
softmax(x)_i = exp(x_i) / SUM_j exp(x_j)
so for a single number (a one-dimensional output)
softmax(x) = exp(x) / exp(x) = 1
Furthermore, for softmax output TF expects one-hot encoded labels, so if you provide only 0 or 1, there are two possibilities:
True label is 0, so the cost is -0*log(1) = 0
True label is 1, so the cost is -1*log(1) = 0
TensorFlow has a separate function to handle binary classification, which applies a sigmoid instead (note that the same function for more than one output would apply the sigmoid independently to each dimension, which is what multi-label classification expects):
tf.nn.sigmoid_cross_entropy_with_logits
Just switch to this cost and you are good to go; you no longer have to one-hot encode anything either, as this function is designed solely for your use case.
The only missing bit is that your code does not have an actual training routine: you need to define an optimiser, ask it to minimise the loss, and then run the train op in a loop. In your current setup you just predict over and over with a network that never changes.
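A minimal sketch of such a train() method, using the sigmoid cross entropy discussed above (the learning rate and step count are arbitrary placeholders):

def train(self):
    prediction = self.predict()
    # Sigmoid cross entropy handles a single-logit binary output correctly.
    cost = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction, labels=self.y))
    train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(1000):
            _, c = sess.run([train_op, cost],
                            feed_dict={self.X: self.X_train, self.y: self.y_train})
        print(c)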
In particular, please refer to the Cross Entropy Jungle question on SO, which provides a more detailed description of all these different helper functions in TF (and other libraries), along with their different requirements/use cases.
softmax_cross_entropy_with_logits is basically a stable implementation of these 2 parts:
softmax = tf.nn.softmax(prediction)
cost = -tf.reduce_mean(labels * tf.log(softmax), 1)
Now in your example, prediction is a single value, so when you apply softmax to it, it is always going to be 1 irrespective of the value (exp(prediction)/exp(prediction) = 1), and so the tf.log(softmax) term becomes 0. That's why you always get a cost of zero.
Either apply a sigmoid to get probabilities between 0 and 1, or, if you want to use softmax, encode the labels as [1, 0] for class 0 and [0, 1] for class 1.
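A minimal sketch of that second option inside the question's predict()/train() structure (the two-logit shapes are the only change; np.eye is just one way to one-hot the 0/1 targets):

# Two logits and one-hot labels make softmax_cross_entropy_with_logits meaningful.
self.y = tf.placeholder(tf.float32, shape=(100, 2))              # one-hot labels
weights = tf.get_variable(name='weights', shape=[11, 2],
                          initializer=tf.contrib.layers.xavier_initializer())
bases = tf.get_variable(name='bases', shape=[2],
                        initializer=tf.zeros_initializer())
final_output = tf.matmul(self.X, weights) + bases                # shape (100, 2)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=final_output, labels=self.y))
# Feed e.g. np.eye(2)[self.y_train.ravel()] to turn the 0/1 targets into one-hot rows.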
