I have a fully connected feed-forward network implemented in Keras. Initially, I used binary cross-entropy as both the loss and the metric, and the Adam optimizer, as follows:
adam = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['binary_crossentropy'])
This model trains well and gives good results. In order to get better results, I want to use a different loss function and metric, as below:
import keras.backend as K

def soft_bit_error_loss(yTrue, yPred):
    loss = K.pow(1 - yPred, yTrue) * K.pow(yPred, 1 - yTrue)
    return K.mean(loss)

def ber(yTrue, yPred):
    x_hat_train = K.cast(K.greater(yPred, 0.5), 'uint8')
    train_errors = K.cast(K.not_equal(K.cast(yTrue, 'uint8'), x_hat_train), 'float32')
    train_ber = K.mean(train_errors)
    return train_ber
I use them to compile my model as below:
model.compile(optimizer=adam, loss=soft_bit_error_loss, metrics=[ber])
However, when I do that, the loss and the metric diverge after some training, every time, as in the following pictures.
What can be the cause of this?
Your loss function is very unstable. Written out, with y_pred (the variable) replaced by x and y_true (the constant) replaced by c for simplicity, it is:

loss = (1 - x)^c * x^(1 - c)

As your predictions approach zero, at least one of the operations will tend towards 1/0, which is infinite. Although limit theory tells you the result is fine, Keras doesn't see the "whole" function as one expression; it calculates derivatives based on the basic operations used.
So, one easy solution is the one pointed out by @today:
loss = K.switch(K.equal(yTrue, 1), 1 - yPred, yPred)
It's exactly the same function when c is 0 or 1 (it differs only when c is strictly between 0 and 1).
Also, even simpler: for c = 0 or c = 1, it's just plain loss='mae'.
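Putting the fix together, a minimal sketch (reusing the model, adam optimizer, and ber metric from the question) could be:
import keras.backend as K

def soft_bit_error_loss(yTrue, yPred):
    # Same values as the original loss for 0/1 labels, but without the unstable powers.
    loss = K.switch(K.equal(yTrue, 1), 1 - yPred, yPred)
    return K.mean(loss)

model.compile(optimizer=adam, loss=soft_bit_error_loss, metrics=[ber])
# ...or, since the targets are 0/1 bits, equivalently:
# model.compile(optimizer=adam, loss='mae', metrics=[ber])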
I am new to TensorFlow.
In part of the code for a TensorFlow session, there is:
loss = tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=net, labels=self.out_placeholder, name='cross_entropy')
self.loss = tf.reduce_mean(loss, name='mean_squared_error')
I want to use the mean_squared_error loss function for this purpose. I found this loss function on the TensorFlow website:
tf.losses.mean_squared_error(
    labels,
    predictions,
    weights=1.0,
    scope=None,
    loss_collection=tf.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
I need this loss function for a regression problem.
I tried:
loss = tf.losses.mean_squared_error(predictions=net, labels=self.out_placeholder)
self.loss = tf.reduce_mean(loss, name='mean_squared_error')
Where net = tf.matmul(input_tensor, weights) + biases
However, I'm not sure if it's the correct way.
First of all, keep in mind that cross-entropy is mainly used for classification, while MSE is used for regression.
In your case, cross-entropy measures the difference between two distributions (the real occurrences, called labels, and your predictions).
So while the first loss function works on the result of the softmax layer (which can be seen as a probability distribution), the second one works directly on the floating-point output of your network (which is not a probability distribution); therefore, they cannot simply be exchanged.
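For the regression case, a minimal sketch (TF 1.x graph style; the placeholder shapes and the Adam learning rate are illustrative assumptions) could look like the following. Note that tf.losses.mean_squared_error already reduces to a scalar by default (reduction=SUM_BY_NONZERO_WEIGHTS), so the extra tf.reduce_mean is not needed:
import tensorflow as tf

input_tensor = tf.placeholder(tf.float32, [None, 10])
out_placeholder = tf.placeholder(tf.float32, [None, 1])

weights = tf.Variable(tf.random_normal([10, 1]))
biases = tf.Variable(tf.zeros([1]))
net = tf.matmul(input_tensor, weights) + biases  # raw linear output, no softmax

# Already reduced to a scalar, so no extra tf.reduce_mean is needed.
loss = tf.losses.mean_squared_error(labels=out_placeholder, predictions=net)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)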
I am trying to understand why regularization syntax in Keras looks the way that it does.
Roughly speaking, regularization is a way to reduce overfitting by adding a penalty term to the loss function, proportional to some function of the model weights. Therefore, I would expect regularization to be defined as part of the specification of the model's loss function.
However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:
input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units=250, activation='tanh', name='dense_1',
          kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(input)
x = Dense(units=28, name='dense_2',
          kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel = Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
I would have expected that the regularization arguments in the Dense layer were not needed and I could just write the last line more like:
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')
This is obviously wrong syntax, but I was hoping someone could elaborate a bit on why the regularizers are defined this way and what actually happens when I use layer-level regularization.
The other thing I don't understand is under what circumstances I would use each, or all, of the three regularization options: kernel_regularizer, activity_regularizer, bias_regularizer.
Let's break down the components of your question:
Your expectation of regularisation is probably in line with a feed-forward network, where, yes, the penalty term is applied to the weights of the overall network. But this is not necessarily the case when you have RNNs mixed with CNNs etc., so Keras opts to give fine-grained control. Perhaps, for easy setup, a model-level regularisation applied to all weights could be added to the API.
When you use layer regularisation, the base Layer class actually adds the regularising term to the loss, which at training time penalises the corresponding layer's weights etc.
Now, in Keras you can apply regularisation to three different things, as in the Dense layer. Every layer has different kernels (recurrent layers also have a recurrent kernel, for instance), so for this question let's look at the ones you are interested in, but roughly the same applies to all layers (a short sketch follows the list):
kernel: this applies to the actual weights of the layer; in Dense it is the W of Wx + b.
bias: this is the bias vector of the layer, so you can apply a different regulariser to it; the b in Wx + b.
activity: this is applied to the output vector, the y in y = f(Wx + b).
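As an illustration, a hypothetical Dense layer using all three hooks might look like this (the 0.01 factors are arbitrary example values):
from keras import regularizers
from keras.layers import Dense

layer = Dense(64,
              kernel_regularizer=regularizers.l2(0.01),    # penalises W in y = f(Wx + b)
              bias_regularizer=regularizers.l2(0.01),      # penalises b
              activity_regularizer=regularizers.l1(0.01))  # penalises the output y
These penalties are collected by the layers (they show up in model.losses) and added to the main loss at training time.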
I'm trying to build a simple MNIST GAN and, needless to say, it didn't work. I've searched a lot and fixed most of my code, though I can't really understand how the loss functions are working.
This is what I did:
loss_d = -tf.reduce_mean(tf.log(discriminator(real_data))) # maximise
loss_g = -tf.reduce_mean(tf.log(discriminator(generator(noise_input), trainable=False)))  # maximise, because we use d(g) instead of 1 - d(g)
loss = loss_d + loss_g
train_d = tf.train.AdamOptimizer(learning_rate).minimize(loss_d)
train_g = tf.train.AdamOptimizer(learning_rate).minimize(loss_g)
I get -0.0 as my loss value. Can you explain how to deal with loss functions in GANs?
It seems you are trying to sum the generator and discriminator losses together, which is completely wrong!
Since the discriminator trains on both real and generated data, you have to create two distinct losses: one for real data and one for the noise (generated) data that you pass into the discriminator network.
Try to change your code as follows:
1)
loss_d_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=discriminator(real_data),
    labels=tf.ones_like(discriminator(real_data))))
2)
loss_d_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=discriminator(generator(noise_input)),
    labels=tf.zeros_like(discriminator(generator(noise_input)))))
Then the discriminator loss will be loss_d = loss_d_real + loss_d_fake.
Now create the loss for your generator:
3)
loss_g = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=discriminator(generator(noise_input)),
    labels=tf.ones_like(discriminator(generator(noise_input)))))
Maryam seems to have identified the cause of your spurious loss values (i.e. summing the generator and discriminator losses). Just wanted to add that you should probably opt for the stochastic gradient descent optimizer for the discriminator in lieu of Adam; doing so provides stronger theoretical guarantees of the network's convergence when playing the minimax game (source: https://github.com/soumith/ganhacks).
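Pulling the two answers together, a minimal end-to-end sketch (TF 1.x graph mode; the tiny fully connected generator and discriminator, the sizes, and the learning rates are illustrative assumptions, not the original networks) might look like this:
import tensorflow as tf

def generator(z):
    with tf.variable_scope('generator', reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(z, 128, activation=tf.nn.relu, name='g_hidden')
        return tf.layers.dense(h, 784, activation=tf.nn.tanh, name='g_out')

def discriminator(x):
    with tf.variable_scope('discriminator', reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(x, 128, activation=tf.nn.relu, name='d_hidden')
        return tf.layers.dense(h, 1, name='d_out')  # raw logits; the loss op applies the sigmoid

real_data = tf.placeholder(tf.float32, [None, 784])
noise_input = tf.placeholder(tf.float32, [None, 100])

d_logits_real = discriminator(real_data)
d_logits_fake = discriminator(generator(noise_input))

# Two separate losses for the discriminator, one for the generator.
loss_d_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
loss_d_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
loss_d = loss_d_real + loss_d_fake
loss_g = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))

# Each optimizer updates only its own network; per the GAN-hacks note, plain SGD
# for the discriminator and Adam for the generator.
d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='discriminator')
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='generator')
train_d = tf.train.GradientDescentOptimizer(1e-3).minimize(loss_d, var_list=d_vars)
train_g = tf.train.AdamOptimizer(1e-3).minimize(loss_g, var_list=g_vars)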
The title may be too specific, and this may apply to other cost functions as well. Basically I have the dummy setup below (in my case I have a vastly different architecture, but the problem is always replicated):
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1_l2
from keras.optimizers import Adam

hidden_units = 10000
l2_sparsity = 5e-7
l1_sparsity = 1e-8

mod = Sequential([
    Dense(hidden_units, input_shape=(1000,), activation="relu",
          kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
    Dense(hidden_units, activation="relu",
          kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
    Dense(1000, activation="sigmoid")
])

adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0)
mod.compile(loss='binary_crossentropy', optimizer=adam, metrics=['binary_accuracy'])

x = np.array([np.array([random.random() for i in range(1000)], dtype=np.float64) for i in range(10)])
classes = (x + 0.5).astype(np.uint32)
def replica_cross_entropy_loss(predictions, truth):
    eps = 10e-8
    predictions = np.clip(predictions, eps, 1.0 - eps)
    x = np.log(1.0 * predictions / (1.0 - predictions))
    return np.mean(np.clip(x, 0, None) - x * truth + np.log(1 + np.exp(-abs(x))))

mod.fit(x[:2], classes[:2], batch_size=128, epochs=1)
preds = mod.predict(x[:2])
print(replica_cross_entropy_loss(preds, classes[:2]))
print(mod.test_on_batch(x[:2], classes[:2])[0])
print(100 * (mod.test_on_batch(x[:2], classes[:2])[0] /
             replica_cross_entropy_loss(preds, classes[:2]) - 1))
What I am doing here is comparing the built-in Keras loss to a (I believe) faithful reconstruction using numpy. Now, experimenting with the number of hidden units in the MLP architecture outlined above yields some interesting results. Specifically:
The Keras cost function is systematically higher than the numpy equivalent, with the difference related to the number of neurons in the hidden layer.
As an illustration, here is a simple graph that plots the number of neurons in the hidden layer of this MLP against the percentage overestimate of the Keras cost function:
I should say that this seems to be backend-agnostic, affecting both Theano and TensorFlow.
Judging from this behaviour, it seems likely to be a precision issue. Do you know if there is a way to mitigate this? For my real model, the cost function is consistently 5-10% different from the numpy answer.
It turned out there was a regularisation term difference between the two cost functions: the loss Keras reports includes the l1_l2 kernel penalties added by the layers, while the numpy replica computes only the cross-entropy term. That extra term is arguably not that useful in evaluating performance.
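One way to check this (a sketch, reusing mod, preds, x, and classes from the snippet above) is to evaluate the regularisation penalties that Keras collects in mod.losses and add them to the numpy replica:
import keras.backend as K

reg_penalty = K.eval(sum(mod.losses))  # total l1_l2 penalty added by the Dense layers
print(replica_cross_entropy_loss(preds, classes[:2]) + reg_penalty)
print(mod.test_on_batch(x[:2], classes[:2])[0])  # should now agree much more closely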
I'm trying to understand how to implement neural networks, so I made my own dataset. Xtrain is numpy.random floats. Ytrain is sign(sin(1/x^3)).
Trying to implement a neural network gave me very poor results: 30% accuracy. A Random Forest with 100 trees gives 97%. But I heard that a NN can approximate any function. What is wrong in my understanding?
import numpy as np
import keras
import math
from sklearn.ensemble import RandomForestClassifier as RF

train = np.random.rand(100000)
test = np.random.rand(100000)

def g(x):
    if math.sin(2*3.14*x) > 0:
        if math.cos(2*3.14*x) > 0:
            return 0
        else:
            return 1
    else:
        if math.cos(2*3.14*x) > 0:
            return 2
        else:
            return 3

def f(x):
    x = (1/x) ** 3
    res = [0, 0, 0, 0]
    res[g(x)] = 1
    return res

ytrain = np.array([f(x) for x in train])
ytest = np.array([f(x) for x in test])
train = np.array([[x] for x in train])
test = np.array([[x] for x in test])

from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM

model = Sequential()
model.add(Dense(100, input_dim=1))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(4))
model.add(Activation('softmax'))
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
P.S. I tried out many layers, activation functions, loss functions, optimizers, but never got more than 30% accuracy :(
I suspect that the 30% accuracy is a combination of a small learning rate and a small number of training steps.
I ran your code snippet with model.fit(train, ytrain, nb_epoch=5, batch_size=32); after 5 epochs of training it yields about 28% accuracy. With the same setting but increasing the training steps to nb_epoch=50, the loss drops to roughly 1.157 and the accuracy rises to 40%. Increasing the training steps further should let the model converge further. Other than that, you can also configure the model with a larger learning rate, which could make convergence faster:
from keras.optimizers import SGD
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1, momentum=0.9, nesterov=True),
              metrics=['accuracy'])
Be careful not to set the learning rate too large, though, otherwise your loss could blow up.
EDIT:
A NN is known for its potential to model extremely complex functions. However, whether the model actually produces good performance depends on how the model is designed and trained, and on many other matters related to the specific application.
Zhongyu Kuang's answer is correct in stating that you may need to train it longer or with a different learning rate.
I'll add that the deeper your network, the longer you'll need to train it before it converges. For a relatively simple function like sign(sin(1/x^3)), you may be able to get away with a smaller network than the one you're using.
Additionally, softmax probably isn't the best output layer. You just need to yield -1 or 1, so a single tanh unit seems like it would do well. softmax is generally used when you want to learn a probability distribution over a finite set. (You'll probably want to switch your error function from cross-entropy to mean squared error for similar reasons.)
Try a network with one sigmoidal hidden layer and an output layer with just one tanh unit. Then play around with the layer size and learning rate. Maybe add a second hidden layer if you can't get results with just one, but I wouldn't be surprised if it's unnecessary.
Addendum: In this approach, you'll replace f(x) with a direct calculation of the target function instead of the one-hot vector you're using currently.
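As a concrete (hypothetical) starting point for that suggestion, here is a sketch with the target recomputed directly as sign(sin((1/x)^3)) in {-1, 1}; the layer sizes, learning rate, and epoch count are arbitrary:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

train = np.random.rand(100000)
ytrain = np.sign(np.sin((1.0 / train) ** 3))  # direct +/-1 target instead of a one-hot vector

model = Sequential()
model.add(Dense(100, input_dim=1, activation='sigmoid'))
model.add(Dense(1, activation='tanh'))
model.compile(loss='mean_squared_error',
              optimizer=SGD(lr=0.1, momentum=0.9, nesterov=True))
model.fit(train.reshape(-1, 1), ytrain, epochs=50, batch_size=32)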