The title may be too specific; this probably applies to other cost functions as well. Basically, I have the dummy setup below (my real model has a very different architecture, but the problem is always reproduced):
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1_l2
from keras.optimizers import Adam

hidden_units = 10000
l2_sparsity = 5e-7
l1_sparsity = 1e-8

mod = Sequential([
    Dense(hidden_units, input_shape=(1000,), activation="relu",
          kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
    Dense(hidden_units, activation="relu",
          kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
    Dense(1000, activation="sigmoid"),
])
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0)
mod.compile(loss='binary_crossentropy', optimizer=adam, metrics=['binary_accuracy'])
x = np.array([np.array([random.random() for i in range(1000)], dtype=np.float64) for i in range(10)])
classes = (x + 0.5).astype(np.uint32)
def replica_cross_entropy_loss(predictions, truth):
    eps = 10e-8
    predictions = np.clip(predictions, eps, 1.0 - eps)
    x = np.log(1.0 * predictions / (1.0 - predictions))
    return np.mean(np.clip(x, 0, None) - x * truth + np.log(1 + np.exp(-abs(x))))
mod.fit(x[:2],classes[:2], batch_size=128, epochs=1)
preds = mod.predict(x[:2])
print replica_cross_entropy_loss(preds,classes[:2])
print mod.test_on_batch(x[:2],classes[:2])[0]
print 100*(mod.test_on_batch(x[:2],classes[:2])[0] /
replica_cross_entropy_loss(preds,classes[:2]) - 1)
What I am doing here is comparing the built-in Keras loss to an (I believe) faithful reconstruction in numpy. Experimenting with the number of hidden units in the MLP outlined above yields some interesting results. Specifically:
The keras cost function is systematically higher than the numpy
equivalent, with the difference related to the number of neurons in
the hidden layer.
As an illustration, here is a simple graph plotting the number of neurons in the hidden layer of this MLP against the percentage by which the Keras cost function overestimates:
I should say that this seems to be backend agnostic, affecting both Theano and TensorFlow.
Judging from this behaviour, it seems likely to be a precision issue. Do you know of a way to mitigate this? For my real model the cost function is consistently 5-10% different from the numpy answer.
There was a regularisation term difference between the two cost functions: the Keras loss includes the L1/L2 kernel penalties while the numpy replica does not, which is arguably not that useful when evaluating performance.
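For completeness, here is a rough numpy sketch (assuming the exact setup above, and that only the kernel regularisers contribute) of how that penalty can be reconstructed and added to the replica loss so the two numbers line up:
# Rough sketch: rebuild the L1/L2 kernel penalty from the trained weights and
# add it to the numpy replica; the sum should then track the Keras loss closely.
reg_penalty = 0.0
for layer in mod.layers:
    if getattr(layer, 'kernel_regularizer', None) is not None:
        W = layer.get_weights()[0]  # kernel only; biases are not regularised here
        reg_penalty += l1_sparsity * np.abs(W).sum() + l2_sparsity * np.square(W).sum()
print(replica_cross_entropy_loss(preds, classes[:2]) + reg_penalty)  # compare to test_on_batch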
I have some questions about the Dropout layer and the BatchNormalization layer. Basically, I have made a simple DNN with Dropout and BatchNormalization layers and trained it; that's fine.
The simple structure of the DNN model, for example:
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(6, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1, activation='softmax'),
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)
But now I would like to use the trained model's weights and biases of all layers in my custom prediction model (forget about the other way).
# Predictions for test
test_logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
test_relu_1 = tf.nn.relu(test_logits_1)
test_logits_2 = tf.matmul(test_relu_1, weights_2) + biases_2
test_relu_2 = tf.nn.relu(test_logits_2)
test_logits_3 = tf.matmul(test_relu_2, weights_3) + biases_3
test_relu_3 = tf.nn.relu(test_logits_3)
test_logits_4 = tf.matmul(test_relu_3, weights_4) + biases_4
test_prediction = tf.nn.softmax(test_logits_4)
Now the question: do I need to add the Dropout layer and BatchNormalization layer (and the batch size) in the prediction model? If yes, why, and how do I extract all the details of those layers and use them in my custom prediction model?
@Dr. Snoopy, thanks for pointing out that BatchNormalization has parameters, but to my knowledge they are not the normalization weights (the weights being normalized), based on what I was able to deduce from the docs and a little research.
The docs say the following (quoted text below), and based on the description it is clear that beta and gamma are trainable variables, which tallies with the output from TensorFlow.
During training (i.e. when using fit() or when calling the layer/model with the argument training=True), the layer normalizes its output using the mean and standard deviation of the current batch of inputs. That is to say, for each channel being normalized, the layer returns (batch - mean(batch)) / (var(batch) + epsilon) * gamma + beta, where:
epsilon is small constant (configurable as part of the constructor arguments)
gamma is a learned scaling factor (initialized as 1), which can be disabled by passing scale=False to the constructor.
beta is a learned offset factor (initialized as 0), which can be disabled by passing center=False to the constructor.
But that is not the end of the story, as the model summary indicates more parameters than beta and gamma alone account for.
A factor of 4 can be observed here, i.e. the number of parameters in a BatchNormalization layer is 4 times the number of features the layer operates on.
These additional parameters are the moving_mean and moving_variance values, which can be seen in the following output.
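Since that output isn't reproduced here, here is a minimal sketch of how to inspect it (assuming the Sequential model above, where index 2 is the first BatchNormalization layer):
# The first BatchNormalization layer sits after Dense(10) and Dropout, so it operates
# on 10 features -> 4 * 10 = 40 parameters (20 trainable, 20 non-trainable).
bn = model.layers[2]
for w in bn.weights:
    print(w.name, w.shape)  # gamma, beta, moving_mean, moving_variance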
Coming back to the original question and concern of OP, "What parameters should i worry about?", the parameters that are needed for inference are moving_mean, moving_variance, beta, and gamma values.
The way to use these values/parameters is again easily deduced from the docs, which I quote here again:
During inference (i.e. when using evaluate() or predict() or when calling the layer/model with the argument training=False (which is the default), the layer normalizes its output using a moving average of the mean and standard deviation of the batches it has seen during training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.
self.moving_mean and self.moving_var are non-trainable variables that are updated each time the layer is called in training mode, as such:
moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
moving_var = moving_var * momentum + var(batch) * (1 - momentum)
As such, the layer will only normalize its inputs during inference after having been trained on data that has similar statistics as the inference data.
So, assuming the moving_mean, moving_variance, beta, and gamma values are available for every BatchNormalization layer, I think the following piece of code needs to be added after the first activation:
# epsilon is just to avoid ZeroDivisionError, so the default value should be okay
test_BN_1 = (test_relu_1 - moving_mean_1) / (moving_var_1 + epsilon_1) * gamma_1 + beta_1
EDIT:
It turns out that the documentation seems to be wrong, but the implementation seems to be right, based on what I could deduce from the source code on GitHub.
If you follow these links you'll see that in the call method of the BatchNormalization class here https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L1227 the calculation is actually done by the Keras backend normalization function batch_normalization here https://github.com/keras-team/keras/blob/35146d00b44ca645fbf4ad0b007faa07632c6f9e/keras/backend.py#L2963. The backend function's docstring agrees with what is mentioned in the reference paper and the picture you've posted.
So that means you should divide by the square root of the variance, not by the variance itself.
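Putting it together, here is a hedged sketch of the corrected normalisation step; the layer index and variable names are assumptions based on the model and code above:
# get_weights() returns the parameters in this order when both scale and center are enabled.
gamma_1, beta_1, moving_mean_1, moving_var_1 = model.layers[2].get_weights()
epsilon_1 = model.layers[2].epsilon  # defaults to 1e-3

# Per the EDIT: divide by the square root of (variance + epsilon), not by the variance.
test_BN_1 = (test_relu_1 - moving_mean_1) / (moving_var_1 + epsilon_1) ** 0.5 * gamma_1 + beta_1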
I wrote this little model using the Keras Functional API to find the similarity of a dialogue between two individuals. I am using Gensim's Doc2Vec embeddings to transform text data into vectors (vocab size: 4117). My data is roughly evenly split into 56 positive cases and 64 negative cases. (Yes, I know the dataset is small, but that's all I have for the time being.)
from keras import backend as K
from keras.models import Model
from keras.layers import (Input, Embedding, Conv2D, LSTM, TimeDistributed, Bidirectional,
                          Activation, Subtract, Multiply, Lambda, Concatenate,
                          GlobalMaxPooling1D, Dense)

def euclidean_distance(vects):
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
ch_inp = Input(shape=(38, 200))
csr_inp = Input(shape=(38, 200))
inp = Input(shape=(38, 200))
net = Embedding(int(vocab_size), 16)(inp)
net = Conv2D(16, 1, activation='relu')(net)
net = TimeDistributed(LSTM(8, return_sequences=True))(net)
out = Activation('relu')(net)
sia = Model(inp, out)
x = sia(csr_inp)
y = sia(ch_inp)
sub = Subtract()([x, y])
mul = Multiply()([sub, sub])
mul_x = Multiply()([x, x])
mul_y = Multiply()([y, y])
sub_xy = Subtract()([x, y])
euc = Lambda(euclidean_distance)([x, y])
z = Concatenate(axis=-1)([euc, sub_xy, mul])
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Activation('relu')(z)
z = GlobalMaxPooling1D()(z)
z = Dense(2, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)
model = Model([ch_inp, csr_inp], out)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
The problem is: my accuracy won't improve beyond 60.87%; I ran 10 epochs and the accuracy remains constant. Is there something in my code that's causing that? Or perhaps it's an issue with my data?
I also did K-Fold Validation for some Sklearn models and got these results from the dataset:
Additionally, an overview of my dataset is attached below:
I'm definitely struggling with this one - so literally any help here would be appreciated. Thanks!
UPDATE:
I increased my data size to 1875 training samples. The accuracy improved to 70.28%, but it's still constant over all iterations.
I see two things that may be important there.
You're using 'relu' after the LSTM. An LSTM in Keras already has 'tanh' as its default activation. So, although you're not locking your model, you're making it harder for it to learn, with an activation that constrains the results to a small range plus one that cuts off the negative values.
You're using 'relu' with very few units! Relu with few units, bad initialization, big learning rates and bad luck will get stuck in the zero region without any gradients.
If your loss completely freezes, it's most probably due to the second point above. And even if it doesn't freeze, it may be using just one unit from the 2 Dense units, for instance, making the layer very poor.
You should do one of the following:
Your model is small, so quit using 'relu' and use 'tanh' instead. This will give your model the expected power it should have.
Otherwise, you should definitely increase the number of units, both for the LSTM and for the Dense, so 'relu' doesn't get easily stuck.
You can add a BatchNormalization layer after Dense and before 'relu'; this way you guarantee that a good number of units will always be above zero (see the sketch after this list).
In any case, don't use 'relu' after the LSTM.
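Here is a minimal sketch of the third option above, assuming BatchNormalization is imported from keras.layers; the layer sizes are just the ones from your model, the placement is the point:
# Sketch: BatchNormalization between the Dense layer and the 'relu' activation,
# so a good share of the units stays above zero.
z = Dense(2)(z)  # no activation here
z = BatchNormalization()(z)
z = Activation('relu')(z)
out = Dense(1, activation='sigmoid')(z)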
The other approach would be making the model more powerful.
For instance:
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Conv1D(10, 3, activation = 'tanh')(z) #or 'relu' maybe
z = MaxPooling1D()(z)
z = Conv1D(15, 3, activation = 'tanh')(z) #or 'relu' maybe
z = Flatten()(z) #unless the length is variable, then GlobalAveragePooling1D()(z)
z = Dense(10, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)
I have a fully connected feed-forward network implemented with Keras. Initially, I used binary cross-entropy as the loss and the metric, and the Adam optimizer, as follows:
adam = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['binary_crossentropy'])
This model trains well and gives good results. In order to get better results I want to use a different loss function and metric as below,
import keras.backend as K

def soft_bit_error_loss(yTrue, yPred):
    loss = K.pow(1 - yPred, yTrue) * K.pow(yPred, 1 - yTrue)
    return K.mean(loss)

def ber(yTrue, yPred):
    x_hat_train = K.cast(K.greater(yPred, 0.5), 'uint8')
    train_errors = K.cast(K.not_equal(K.cast(yTrue, 'uint8'), x_hat_train), 'float32')
    train_ber = K.mean(train_errors)
    return train_ber
I use it to compile my model as below
model.compile(optimizer=adam, loss=soft_bit_error_loss, metrics=[ber])
However, when I do that, the loss and the metric diverge after some training, every time, as in the following pictures.
What can be the cause of this?
Your loss function is very unstable; written out it is (1 - x)^c * x^(1 - c), where I replaced y_pred (the variable) with x and y_true (the constant) with c for simplicity.
As your predictions approach zero, at least one operation will tend to 1/0, which is infinite. Although by the limits theory you can know the result is ok, Keras doesn't know the "whole" function as one, it calculates derivatives based on the basic operations used.
So, one easy solution is the one pointed out by @today (written with an elementwise K.equal so the comparison works on tensors):
loss = K.switch(K.equal(yTrue, 1), 1 - yPred, yPred)
It's exactly the same function (difference only when c is not zero or 1).
Also, even easier, for c=0 or c=1, it's just a plain loss='mae'.
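Here is a minimal sketch of that fix as a drop-in loss function (same signature as your original; K.equal makes the comparison elementwise on tensors):
def soft_bit_error_loss_stable(yTrue, yPred):
    # Elementwise switch instead of the pow-based form, avoiding the
    # unstable gradients as yPred approaches 0 or 1.
    return K.mean(K.switch(K.equal(yTrue, 1), 1 - yPred, yPred))

model.compile(optimizer=adam, loss=soft_bit_error_loss_stable, metrics=[ber])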
I am trying to understand why regularization syntax in Keras looks the way that it does.
Roughly speaking, regularization is a way to reduce overfitting by adding a penalty term to the loss function proportional to some function of the model weights. Therefore, I would expect that regularization would be defined as part of the specification of the model's loss function.
However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:
input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units = 250, activation='tanh', name='dense_1', kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(input)
x = Dense(units = 28, name='dense_2',kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel= Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
I would have expected that the regularization arguments in the Dense layer were not needed and I could just write the last line more like:
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')
This is obviously wrong syntax, but I was hoping someone could elaborate a bit on why the regularizers are defined this way and what actually happens when I use layer-level regularization.
The other thing I don't understand is under what circumstances I would use each (or all) of the three regularization options: kernel_regularizer, activity_regularizer, bias_regularizer.
Let's break down the components of your question:
Your expectation of regularisation is probably in line with a feed-forward network, where yes, the penalty term is applied to the weights of the overall network. But this is not necessarily the case when you have RNNs mixed with CNNs etc., so Keras opts to give fine-grained control. Perhaps for easy setup, a regularisation at model level could be added to the API for all weights.
When you use layer regularisation, the base Layer class actually adds the regularising term to the loss, which at training time penalises the corresponding layer's weights etc.
Now in Keras you can often apply regularisation to 3 different things, as in the Dense layer. Every layer has different kernels (recurrent ones, etc.), so for this question let's look at the ones you are interested in, but roughly the same applies to all layers (see the sketch after this list):
kernel: this applies to the actual weights of the layer; in Dense it is the W of Wx+b.
bias: this is the bias vector of the layer, so you can apply a different regulariser for it; the b in Wx+b.
activity: is applied to the output vector, the y in y = f(Wx + b).
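Here is a short sketch tying this together (tensorflow.keras names and the constants are just placeholders, not a recommendation): the penalties created by a layer show up in its losses collection and are added to the training loss automatically, so nothing extra goes into compile().
from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    8,
    kernel_regularizer=regularizers.l2(1e-4),    # penalises W in y = f(Wx + b)
    bias_regularizer=regularizers.l2(1e-4),      # penalises b
    activity_regularizer=regularizers.l1(1e-5),  # penalises the output y
)
dense.build((None, 16))
print(dense.losses)  # kernel and bias penalties; the activity penalty is added per call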
I'm trying to understand how to implement neural networks, so I made my own dataset. Xtrain is numpy.random floats. Ytrain is sign(sin(1/x^3)).
My attempt to implement a neural network gave me very poor results: 30% accuracy. A Random Forest with 100 trees gives 97%. But I heard that a NN can approximate any function. What is wrong in my understanding?
import numpy as np
import keras
import math
from sklearn.ensemble import RandomForestClassifier as RF
train = np.random.rand(100000)
test = np.random.rand(100000)
def g(x):
    if math.sin(2 * 3.14 * x) > 0:
        if math.cos(2 * 3.14 * x) > 0:
            return 0
        else:
            return 1
    else:
        if math.cos(2 * 3.14 * x) > 0:
            return 2
        else:
            return 3

def f(x):
    x = (1 / x) ** 3
    res = [0, 0, 0, 0]
    res[g(x)] = 1
    return res
ytrain = np.array([f(x) for x in train])
ytest = np.array([f(x) for x in test])
train = np.array([[x] for x in train])
test = np.array([[x] for x in test])
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
model = Sequential()
model.add(Dense(100, input_dim=1))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(4))
model.add(Activation('softmax'))
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
P.S. I tried out many layers, activation functions, loss functions, optimizers, but never got more than 30% accuracy :(
I suspect that the 30% accuracy is a combination of a small learning rate setting and a small training-step setting.
I ran your code snippet with model.fit(train, ytrain, nb_epoch=5, batch_size=32); after 5 epochs of training it yields about 28% accuracy. With the same setting but increasing the training steps to nb_epoch=50, the loss drops to ~1.157 and the accuracy rises to 40%. Further increasing the training steps should lead the model to converge further. Other than that, you can also try to configure the model with a larger learning rate, which could make convergence faster:
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.1, momentum=0.9, nesterov=True), metrics=['accuracy'])
Be careful not to set the learning rate too large, though, otherwise your loss could blow up.
EDIT:
NNs are known for having the potential to model extremely complex functions; however, whether or not the model actually produces good performance is a matter of how the model is designed and trained, and of many other matters related to the specific application.
Zhongyu Kuang's answer is correct in stating that you may need to train it longer or with a different learning rate.
I'll add that the deeper your network, the longer you'll need to train it before it converges. For a relatively simple function like sign(sin(1/x^3)), you may be able to get away with a smaller network than the one you're using.
Additionally, softmax probably isn't the best output layer. You just need to yield -1 or 1. A single tanh unit seems like it would do well. softmax is generally used when you want to learn a probability distribution over a finite set. (You'll probably want to switch your error function from cross entropy to mean square error for similar reasons.)
Try a network with one sigmoidal hidden layer and an output layer with just one tanh unit. Then play around with the layer size and learning rate. Maybe add a second hidden layer if you can't get results with just one, but I wouldn't be surprised if it's unnecessary.
Addendum: In this approach, you'll replace f(x) with a direct calculation of the target function instead of the one-hot vector you're using currently.
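Here is a minimal sketch of that suggestion; the layer size, epochs, and optimizer settings are guesses, not tuned values:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(100000, 1)
y = np.sign(np.sin(1.0 / x ** 3))  # direct +/-1 target instead of the one-hot vector

model = Sequential([
    Dense(100, input_dim=1, activation='sigmoid'),
    Dense(1, activation='tanh'),
])
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, epochs=50, batch_size=32)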