I have an LSTM predicting time-series values in TensorFlow. The model works with MSE as its loss function.
However, I'd like to be able to create a custom loss function where one of the error values is multiplied by two (therefore producing a higher error value).
In my batch of size 10, I want the 3rd value of the first input to be multiplied by 2, but because this is time series, this corresponds to the second value in the second input and the first value in the third input.
The error I get is:
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients
How do I provide gradients for this loss function?
import tensorflow as tf

def loss_function(y_true, y_pred, peak_value=3, weight=2):
    # peak_value is where the multiplication happens in the first input
    # weight is how much the error is multiplied by
    all_dif = tf.squared_difference(y_true, y_pred)  # should be shape=[10, 10]
    peak = [peak_value] * 10
    listy = range(0, 10)
    c = [(i - j) % 10 for i, j in zip(peak, listy)]
    for i in range(0, 10):
        indices = [[i, c[i]]]
        values = [1.0]
        shape = [10, 10]
        delta = tf.SparseTensor(indices, values, shape)
        all_dif = all_dif + tf.sparse_tensor_to_dense(delta)
    return tf.reduce_sum(all_dif)
I believe the pseudo code would look something like this:

@tf.custom_gradient
def loss_function(y_true, y_pred, peak_value=3, weight=2):
    ## your code
    def grad(dy):
        return dy * partial_derivative
    return loss, grad
Where partial_derivative is the analytically evaluated partial derivative of your loss function. If your loss function is a function of more than one variable, it will require a partial derivative with respect to each variable, I believe.
If you need more information, the documentation is good: https://www.tensorflow.org/api_docs/python/tf/custom_gradient
And I've yet to find an example of this functionality embedded in a model that's not a toy.
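Here is a minimal toy sketch of that pattern (my own, not from the original answer), assuming a simple weighted squared error whose gradients with respect to both inputs are written out by hand:

import tensorflow as tf

@tf.custom_gradient
def weighted_squared_error(y_true, y_pred):
    # hypothetical loss: every squared difference is weighted by 2
    weight = 2.0
    loss = tf.reduce_sum(weight * tf.square(y_true - y_pred))

    def grad(dy):
        # analytic partial derivatives of the loss w.r.t. y_true and y_pred
        d_true = dy * 2.0 * weight * (y_true - y_pred)
        d_pred = dy * -2.0 * weight * (y_true - y_pred)
        return d_true, d_pred

    return loss, grad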
I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y * y_hat))
    return value

def d_cross_entropy(y, y_hat):
    return -y / y_hat * np.log(2)
I'm a lot more confused on getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect
def softmax(x):
    exp = np.exp(x)
    return exp / np.sum(exp)

def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0]) * softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
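For comparison, here is a small sketch (my own, not part of the original post) that keeps the full 10x10 Jacobian and contracts it against the upstream gradient via the chain rule, instead of summing the Jacobian; with natural-log cross-entropy and a one-hot target this collapses to the familiar y_hat - y:

import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))          # shifted for numerical stability
    return exp / np.sum(exp)

def softmax_jacobian(x):
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)   # dA_i/dX_j, a 10x10 matrix for 10 classes

x = np.random.randn(10)
y = np.zeros(10)
y[3] = 1.0                               # one-hot target
y_hat = softmax(x)

dL_dyhat = -y / y_hat                    # upstream gradient of -log(y . y_hat)
dL_dx = softmax_jacobian(x) @ dL_dyhat   # chain rule: Jacobian-vector product

print(np.allclose(dL_dx, y_hat - y))     # True: the well-known shortcut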
I wanted to code my implementation of polynomial regression, but my model's gradients either exploded or my model didn't fit the data well enough.
For testing purposes, my dataset is just the function x^2 and my model is a second-degree polynomial ax^2 + bx + c. I trained it for 50 epochs using batch gradient descent.
I noticed that the model explodes with a learning rate >= 0.001 and underfits with a learning rate <= 0.0001.
To visualize the model, at the end of each epoch, I plot the model's predictions with the labels. So, in the ideal case, these lines should be indistinguishable.
(The orange line is the labels and the blue one is the model's predictions.)
[Plot: the model exploding]
[Plot: the model underfitting]
One interesting thing is that even though the model's predictions are way too big, the line still resembles the correct polynomial. And the picture where the predictions go into negatives is also correct, just flipped/mirrored.
I wrote the code in Python. This is my main.py:
from decimal import Decimal
from matplotlib.pyplot import plot, draw, pause, clf
from model import PolynomialRegression

POLYNOMIAL_FUNCTION = [0, 1, 2]
LEARNING_RATE = Decimal(0.0001)
DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
LABELSET = [0, 1, 4, 9, 16, 32, 64, 128]
EPOCHS = 50

model = PolynomialRegression(POLYNOMIAL_FUNCTION, LEARNING_RATE)

for _ in range(EPOCHS):
    for data, label in zip(DATASET, LABELSET):
        # train the model
        model.train(data, label)

    # update the model
    model.update()

    # predict the dataset
    predictions = [model.predict(data) for data in DATASET]

    # plot predictions and labels
    plot(predictions)
    plot(LABELSET)
    draw()
    pause(0.1)
    clf()

    print(model.parameters)

    # erase the stored gradients
    model.clear_grad()
And this is my model.py:
from decimal import Decimal


class PolynomialRegression:
    """
    Polynomial regression model.
    """

    def __init__(self, polynomial_function: list, learning_rate: Decimal) -> None:
        # the structure of the polynomial function (the exponents)
        self.polynomial_function = polynomial_function
        # parameters of the model set to be 1
        self.parameters = [Decimal(1)] * len(polynomial_function)
        self.learning_rate = learning_rate
        # stored gradients to update the model
        self.gradients = []

    def predict(self, x: Decimal) -> Decimal:
        """
        Make a prediction based on the input.

        Args:
            x (Decimal): Input to the model.

        Returns:
            Decimal: A prediction.
        """
        y = Decimal(0)
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute a term and add it to the final output
            y += param * (x ** exponent)
        return y

    def train(self, x: Decimal, y: Decimal) -> Decimal:
        """
        Compute a gradient from a given input and target output.

        Args:
            x (Decimal): Input for the model.
            y (Decimal): Target/Desired output.

        Returns:
            Decimal: An MSE loss.
        """
        prediction = self.predict(x)
        error = prediction - y
        loss = error ** 2

        gradient = []
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute the gradient for a single parameter
            param_gradient = error * (x ** exponent) * self.learning_rate
            # add the parameter gradient to the gradient list
            gradient.append(param_gradient)

        # add the gradient to a list
        self.gradients.append(gradient)

        return loss

    def __sum_gradients(self) -> Decimal:
        """
        Return a sum of gradients along the 0 axis.
        (equivalent of numpy.sum(x, axis=0))

        Returns:
            list: List of summed Decimals.
        """
        result = [Decimal(0)] * len(self.parameters)
        # iterate through the y axis
        for gradient in self.gradients:
            # iterate through the x axis
            for i, param_gradient in enumerate(gradient):
                result[i] += param_gradient
        return result

    def update(self) -> None:
        """
        Update the model's parameters based on the stored gradients.
        """
        summed_gradients = self.__sum_gradients()
        # fraction used to calculate the average for every gradient
        averaging_fraction = Decimal(1) / len(self.gradients)

        for param_index, grad in enumerate(summed_gradients):
            self.parameters[param_index] -= averaging_fraction * grad

    def clear_grad(self) -> None:
        """
        Clear/Reset the stored gradients.
        """
        self.gradients = []
I think the problem lies somewhere in my gradient descent calculations, but it may also be something unexpected and silly.
First, your dataset consists of only 8 data points. This is too little data for a model to generalize from, which means you are probably overfitting.
The second thing I see is that you do not normalize the x data. The model is not very complex, so I guess it doesn't really matter in this context. But if you had a more complex model with n features, and one feature is very small while another is very big, the feature with the bigger values would influence the result much more than the smaller one, which might result in a badly performing model.
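As a small illustration of that normalization point (my own sketch, not from the original answer), a simple min-max rescaling of the inputs to a comparable range would look like this:

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
print(min_max_normalize(DATASET))  # inputs now lie in [0, 1]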
But your last plot doesn't look like underfitting to me. You have to realize that an ML model will always have some error. In my opinion, for 8 data points, a model with only one layer, and 50 epochs, that looks fine. You could probably improve the results by training longer, but that would mean overfitting the model even more. To be honest, if your goal is to emulate a mathematical function with ML, this should be okay. You could also add a new layer.
The fact that your learning rate has to be that small to keep the results from blowing up tells me you are correct: there is something wrong with the gradient descent, and you might want to look into this behavior.
An easy way to evaluate this is to build your model in PyTorch and use its optimizer to update the weights. If you get the same problem, it was your gradient descent; if not, the problem lies somewhere else. But I strongly believe it is your gradient descent. Maybe step into that function with a debugger and look at the actual values you are subtracting.
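As a rough sketch of that cross-check (my own code, not from the answer; the names and hyperparameters simply mirror the question), the same ax^2 + bx + c model can be fit with PyTorch's autograd and plain SGD, and its parameters compared against the hand-rolled version:

import torch

x = torch.tensor([0., 1., 2., 3., 4., 5., 6., 7.])
y = torch.tensor([0., 1., 4., 9., 16., 32., 64., 128.])   # labels from the question

params = torch.ones(3, requires_grad=True)                 # [c, b, a], all initialized to 1
optimizer = torch.optim.SGD([params], lr=1e-4)

for epoch in range(50):
    pred = params[0] + params[1] * x + params[2] * x ** 2  # c + b*x + a*x^2
    loss = torch.mean((pred - y) ** 2)                     # MSE over the whole batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(params)  # compare against model.parameters from the hand-rolled version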
I am new to machine learning, Python, and TensorFlow. I am used to coding in C++ or C#, and it is difficult for me to use tf.backend.
I am trying to write a custom loss function for an LSTM network that tries to predict whether the next element of a time series will be positive or negative. My code runs nicely with the binary_crossentropy loss function. I now want to improve my network with a loss function that adds the value of the next time-series element if the predicted probability is greater than 0.5 and subtracts it if the probability is less than or equal to 0.5.
I tried something like this:
def customLossFunction(y_true, y_pred):
    temp = 0.0
    for i in range(0, len(y_true)):
        if y_pred[i] > 0:
            temp += y_true[i]
        else:
            temp -= y_true[i]
    return temp
Obviously, the dimensions are wrong, but since I cannot step into my function while debugging, it is very hard to get a grasp of the dimensions here.
Can you please tell me if I can use an element-by-element function? If yes, how? And if not, could you help me with tf.backend?
Thanks a lot
From the Keras backend functions, you have the function greater that you can use:
import keras.backend as K

def customLossFunction(yTrue, yPred):
    greater = K.greater(yPred, 0.5)
    greater = K.cast(greater, K.floatx())  # has zeros and ones
    multiply = (2 * greater) - 1            # has -1 and 1
    modifiedTrue = multiply * yTrue

    # here, it's important to know which dimension you want to sum
    return K.sum(modifiedTrue, axis=?)
The axis parameter should be used according to what you want to sum.
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> time steps dimension (if you're using return_sequences = True until the end)
axis=2 -> predictions for each step
Now, if you have only a 2D target:
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> predictions for each sequence
If you simply want to sum everything for every sequence, then just don't put the axis parameter.
Important note about this function:
Since it contains only values from yTrue, it cannot backpropagate to change the weights. This will lead to a "none values not supported" error or something very similar.
Although yPred (the one that is connected to the model's weights) is used in the function, it's used only for getting a true/false condition, which is not differentiable.
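One possible way around that caveat (my own sketch, not part of the original answer) is to replace the hard threshold with a smooth approximation such as a steep sigmoid, so that gradients can flow through yPred:

import keras.backend as K

def customLossFunctionSoft(yTrue, yPred):
    # smooth stand-in for the hard 0.5 threshold: a steep sigmoid is ~0 below
    # 0.5 and ~1 above it, but stays differentiable with respect to yPred
    soft_greater = K.sigmoid(10.0 * (yPred - 0.5))
    multiply = (2.0 * soft_greater) - 1.0   # ~-1 and ~1
    modifiedTrue = multiply * yTrue
    return K.sum(modifiedTrue, axis=-1)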
At the moment I'm looking at the CIFAR-10 example, and I noticed the function _variable_with_weight_decay(...) in the file cifar10.py. The code is as follows:
def _variable_with_weight_decay(name, shape, stddev, wd):
    """Helper to create an initialized Variable with weight decay.

    Note that the Variable is initialized with a truncated normal distribution.
    A weight decay is added only if one is specified.

    Args:
        name: name of the variable
        shape: list of ints
        stddev: standard deviation of a truncated Gaussian
        wd: add L2Loss weight decay multiplied by this float. If None, weight
            decay is not added for this Variable.

    Returns:
        Variable Tensor
    """
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    var = _variable_on_cpu(
        name,
        shape,
        tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
    if wd is not None:
        weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var
I'm wondering if this function does what it says. It is clear that when a weight decay factor is given (wd not None) the decay value (weight_decay) is computed. But is it ever applied? At the end, the unmodified variable (var) is returned, or am I missing something?
My second question would be how to fix this. As I understand it, the value of the scalar weight_decay must be subtracted from each element in the weight matrix, but I'm unable to find a TensorFlow op that can do that (adding/subtracting a single value from every element of a tensor). Is there any op like this?
As a workaround, I thought it might be possible to create a new tensor initialized with the value of weight_decay and use tf.subtract(...) to achieve the same result. Or is this the right way to go anyway?
Thanks in advance.
The code does what it says. You are supposed to sum everything in the 'losses' collection (which the weight decay term is added to in the second to last line) for the loss that you pass to the optimizer. In the loss() function in that example:
tf.add_to_collection('losses', cross_entropy_mean)
[...]
return tf.add_n(tf.get_collection('losses'), name='total_loss')
so what the loss() function returns is the classification loss plus everything that was in the 'losses' collection before.
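A minimal sketch of that collection mechanism (my own, written against the TF1-style graph API that the example uses, with made-up names and a 0.004 decay factor):

import tensorflow as tf

# create a variable and register its weight decay term in the 'losses' collection
w = tf.Variable(tf.truncated_normal([3, 3], stddev=0.1), name='w')
weight_decay = tf.multiply(tf.nn.l2_loss(w), 0.004, name='weight_loss')
tf.add_to_collection('losses', weight_decay)

# ... later, cross_entropy_mean is added to the same collection ...

# the total loss handed to the optimizer is the sum of everything in 'losses',
# so minimizing it is what actually applies the decay to w
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')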
As a side note, weight decay does not mean you subtract the value of wd from every value in the tensor as part of the update step; rather, it multiplies the value by (1 - learning_rate * wd) (in plain SGD). To see why this is so, recall that l2_loss computes
output = sum(t_i ** 2) / 2
with t_i being the elements of the tensor. This means that the derivative of l2_loss with regard to each tensor element is the value of that tensor element itself, and since you scaled l2_loss with wd the derivative is scaled as well.
Since the update step (again, in plain SGD) is (forgive me for omitting the time step indexes)
w := w - learning_rate * dL/dw
you get, if you only had the weight decay term
w := w - learning_rate * wd * w
or
w := w * (1 - learning_rate * wd)
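A quick numeric check of that equivalence (my own sketch, not from the answer):

import numpy as np

w = np.array([0.5, -1.2, 3.0])
lr, wd = 0.1, 0.01

# gradient of wd * sum(w**2) / 2 with respect to w is wd * w
grad_decay = wd * w
updated_via_gradient = w - lr * grad_decay
updated_via_shrinkage = w * (1 - lr * wd)

print(np.allclose(updated_via_gradient, updated_via_shrinkage))  # True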
I've compared extensively against existing tutorials, but I can't figure out why my weights don't update. Here is the function that returns the list of updates:
def get_updates(cost, params, learning_rate):
    updates = []
    for param in params:
        updates.append((param, param - learning_rate * T.grad(cost, param)))
    return updates
It is defined at the top level, outside of any classes. This is standard gradient descent for each param. The 'params' parameter here is fed in as mlp.params, which is simply the concatenated lists of the param lists for each layer. I removed every layer except for a logistic regression one to isolate the reason as to why my cost was not decreasing. The following is the definition of mlp.params in MLP's constructor. It follows the definition of each layer and their respective param lists.
self.params = []
for layer in self.layers:
    self.params += layer.params
The following is the train function, which I call for each minibatch during each epoch:
train = theano.function(
    [minibatch_index], cost,
    updates=get_updates(cost, mlp.params, learning_rate),
    givens={
        x: train_set_x[minibatch_index * batch_size : (minibatch_index + 1) * batch_size],
        y: train_set_y[minibatch_index * batch_size : (minibatch_index + 1) * batch_size]
    })
If you require further details, the entire file is available here: http://pastebin.com/EeNmXfGD
I don't know how many people use Theano (it doesn't seem like many); if you've read this far, thank you.
Fixed: I've determined that I can't use average squared error as the cost function. It works as usual after replacing it with a negative log-likelihood.
This behavior is caused by a few things, but it comes down to the cost not being computed properly. In your implementation, the output of the LogisticRegression layer is the predicted class for every input digit (obtained with the argmax operation), and you take the squared difference between it and the expected prediction.
This gives you gradients of 0 with respect to every parameter in your model, because the gradient of the output of argmax (the predicted class) with respect to the input of argmax (the class probabilities) is 0.
Instead, the LogisticRegression should output the probabilities of the classes:
def output(self, input):
    input = input.flatten(2)
    self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
    return self.p_y_given_x
And then in the MLP class, you compute the cost. You can use mean squared error between the desired probabilities for each class and the probabilities computed by the model, but people tend to use the negative log likelihood of the expected classes; you can implement it in the MLP class like this:
def neg_log_likelihood(self, x, y):
    p_y_given_x = self.output(x)
    return -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
Then you can use this function to compute your cost, and the model trains:
cost = mlp.neg_log_likelihood(x_, y)
A few additional things:
At line 215, when you print your cost, you format it as an integer value but it is a floating point value; this will lose precision in the monitoring.
Initializing all the weights to 0s, as you do in your LogisticRegression class, is often not recommended. Weights should differ in their initial values to help break symmetry.
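As an illustration of symmetry-breaking initialization (my own sketch in the style of the Theano deep learning tutorials, not code from the question), a hidden layer's weight matrix can be drawn from a small uniform range instead of zeros:

import numpy as np
import theano

rng = np.random.RandomState(1234)
n_in, n_out = 784, 500                      # hypothetical layer sizes
bound = np.sqrt(6. / (n_in + n_out))        # Glorot/Xavier-style range
W_values = np.asarray(
    rng.uniform(low=-bound, high=bound, size=(n_in, n_out)),
    dtype=theano.config.floatX,
)
W = theano.shared(value=W_values, name='W', borrow=True)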