Backpropagation for cross entropy and softmax? - python

I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y*y_hat))
    return value

def d_cross_entropy(y, y_hat):
    return -y/y_hat*np.log(2)
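As a sanity check on the derivative, here is a minimal sketch in plain Python (no NumPy, so it runs anywhere). It assumes the usual definition of cross entropy, -sum(y * log2(y_hat)), which for a one-hot y equals -log2(sum(y * y_hat)), so it differs from the snippet above by a leading minus sign. It also parenthesizes the denominator, because -y/y_hat*np.log(2) multiplies by ln 2 rather than dividing by it. The numeric_gradient helper and the example values are made up for illustration.

```python
import math

def cross_entropy(y, y_hat):
    """Cross entropy in bits between a one-hot target y and probabilities y_hat."""
    return -sum(t * math.log2(p) for t, p in zip(y, y_hat))

def d_cross_entropy(y, y_hat):
    """Gradient of cross_entropy with respect to each entry of y_hat."""
    return [-t / (p * math.log(2)) for t, p in zip(y, y_hat)]

def numeric_gradient(y, y_hat, eps=1e-6):
    """Central finite differences, used only to check the formula above."""
    grads = []
    for i in range(len(y_hat)):
        up = list(y_hat)
        down = list(y_hat)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(y, up) - cross_entropy(y, down)) / (2 * eps))
    return grads

# Illustrative 3-class example (values are made up).
y = [0.0, 1.0, 0.0]
y_hat = [0.2, 0.5, 0.3]
analytic = d_cross_entropy(y, y_hat)
numeric = numeric_gradient(y, y_hat)
```

Here analytic and numeric should agree to several decimal places, and the only non-zero entry sits at the target class.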
I'm a lot more confused about getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because A_i depends not just on X_i but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix, which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect.
def softmax(x):
    exp = np.exp(x)
    return exp/np.sum(exp)

def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0])*softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
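One way to handle the 10x10 Jacobian, sketched here in plain Python under the assumption that the loss is the natural-log cross entropy: keep the full matrix and contract it with the upstream gradient dL/dA instead of summing it. Every row of the softmax Jacobian sums to zero, so summing discards the information. The names softmax_jacobian and grad_wrt_logits and the example values are illustrative, not taken from the article.

```python
import math

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """J[i][j] = dA_i/dX_j = A_i * (delta_ij - A_j)."""
    a = softmax(x)
    n = len(a)
    return [[a[i] * ((1.0 if i == j else 0.0) - a[j]) for j in range(n)]
            for i in range(n)]

def grad_wrt_logits(x, dL_dA):
    """Chain rule: dL/dX_j = sum_i dL/dA_i * dA_i/dX_j (a vector, not a matrix)."""
    jac = softmax_jacobian(x)
    n = len(x)
    return [sum(dL_dA[i] * jac[i][j] for i in range(n)) for j in range(n)]

# Illustrative 3-class example (values are made up).
x = [1.0, 2.0, 0.5]
y = [0.0, 1.0, 0.0]
a = softmax(x)
dL_dA = [-t / p for t, p in zip(y, a)]   # gradient of natural-log cross entropy
dL_dX = grad_wrt_logits(x, dL_dA)
```

Under these assumptions dL_dX works out to a - y, i.e. y_hat - y. With the base-2 logarithm used in the question, the same vector is scaled by 1/ln(2).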

Related

What is the alternative of elementwise_grad of autograd in JAX?

I want to solve a second-order differential equation with a neural network. For automatic differentiation I am using the JAX library. To compute the first- and second-order derivatives of my target variable 'u', i.e. du/dx and d2u/dx2, an example I found uses elementwise_grad. What is its alternative in JAX?
For example, the neural network function that evaluates 'u' is defined as below:
def u(params, inputs):
    for Weights, biases in params:
        outputs = np.dot(inputs, Weights) + biases
        inputs = sigmoid(outputs)
    return outputs
u has two arguments: params is the set of weights and biases, and inputs is the range of x values with respect to which u will be differentiated.
Suppose x has a length of 50, so the output u will also have size 50*1.
Now I need to differentiate all 50 values of u at once. Which JAX functions should I use to calculate du/dx and d2u/dx2? The grad function is not working:
dudx = grad(u,1)(x)
d2udx2 = grad(grad(u,1)(x))(x)
These give errors.
This isn't really a function that has a meaningful elementwise gradient. It's mapping one vector space to another vector space, and the appropriate derivative for this kind of operation is a jacobian:
dudx = jax.jacobian(u, 1)(params, x)
The result is a matrix whose entries are the derivative of the ith output with respect to the jth input.
Note that if you had a truly element-wise function and wanted to compute the element-wise gradient, you could do so with vmap; for example:
def f(x):
    return jnp.exp(x) - 1

df_dx = jax.vmap(jax.grad(f))(x)
That doesn't work for your function, because the mapping to the output vector space is determined by the contents of params, and vmap cannot easily account for that.
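To illustrate the distinction drawn in the answer without relying on JAX, here is a small finite-difference sketch in plain Python. numerical_jacobian, elementwise, and mixing are hypothetical helpers. For the element-wise function the Jacobian comes out diagonal, and its diagonal is the element-wise derivative. A function that mixes its inputs has off-diagonal entries that a single gradient vector cannot represent.

```python
import math

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: jac[i][j] = d f(x)[i] / d x[j]."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac

def elementwise(x):
    # Output i depends only on input i, like f in the answer.
    return [math.exp(v) - 1 for v in x]

def mixing(x):
    # Hypothetical function in which every output depends on several inputs.
    weights = [[1.0, 2.0, 0.0],
               [0.0, 1.0, 3.0],
               [4.0, 0.0, 1.0]]
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

x = [0.1, -0.3, 0.7]
jac_elementwise = numerical_jacobian(elementwise, x)  # diagonal matrix
jac_mixing = numerical_jacobian(mixing, x)            # has off-diagonal entries too
```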

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
    <etc>
This just returns an array of losses, one per sample; it measures the mismatch between the true label and the label your model predicts.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which is itself the product dL/da * da/dz (i.e. the derivative of the loss wrt the activation times the derivative of the activation wrt the linear function).
The link you posted is the derivative of the activation function wrt the linear function. This blog does a decent job of explaining how all the parts fit together; the activation function they use is a sigmoid, but the overall pieces fit together the same way.
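To connect this to the delta = output - target from the question, here is the chain rule dL/dz = dL/da * da/dz worked out under two assumptions: the activation is a softmax, a_i = e^{z_i} / sum_k e^{z_k}, and the loss is L = -sum_k t_k log a_k with targets t that sum to 1.

```latex
\begin{aligned}
\frac{\partial L}{\partial a_k} &= -\frac{t_k}{a_k}, \qquad
\frac{\partial a_k}{\partial z_i} = a_k\,(\delta_{ki} - a_i), \\
\frac{\partial L}{\partial z_i}
  &= \sum_k \frac{\partial L}{\partial a_k}\,\frac{\partial a_k}{\partial z_i}
   = -\sum_k t_k\,(\delta_{ki} - a_i)
   = -t_i + a_i \sum_k t_k
   = a_i - t_i .
\end{aligned}
```

In a framework with automatic differentiation the backward pass is derived from the forward expression, so a hand-written output - target does not have to appear in the loss code.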

PyTorch: Calculate Hessian only for a subset of parameters?

I am implementing the Elastic Weight Consolidation method, and for that I need to compute the Fisher matrix. As I understand it, the Fisher matrix is just the Hessian of the likelihood with respect to the weights of the neural network. There is a convenient function for this: torch.autograd.functional.hessian(func, inputs, create_graph=False, strict=False)
So I want to compute hessian(loss, weights), where loss = torch.nn.CrossEntropyLoss(). I also flattened the weights of the network into one long 1D tensor, so that I can simply take the diagonal elements of the Hessian, like this:
def flat_param(model_param = yann_lecun.parameters()):
    ans_data = []
    ans_data = torch.tensor(ans_data, requires_grad=True)
    ans_data = ans_data.to(device)
    for p in model_param:
        temp_data = p.data.flatten()
        ans_data = torch.cat((ans_data,temp_data))
    return ans_data

ans = flat_param(yann_lecun.parameters())
Then I tried hessian(loss, inputs = ans), but the problem is that loss also takes targets, and I don't want to compute the Hessian with respect to them. The task is MNIST classification, so the targets are integers 0,...,9,
and if I add y_train to the parameters like this: hessian(loss, inputs = (ans, y_train_01))
it crashes with the message "can't take gradient from integer". I also tried setting y_train_01.requires_grad = False, but it didn't help. I understand that loss also depends on y_train_01, but is there any way to specify that the targets are constants in my case?
You can create a new 'wrapper' function where the targets are fixed:
def original_model(features, targets):
    ...
    return ...

def feature_differentiable_model(features):
    fixed_targets = ...
    return original_model(features, fixed_targets)
And then call:
hessian(feature_differentiable_model, features_vals)
The second order partial derivatives from this will be equivalent to the analogous ones of the full Hessian product at the location (features_vals, fixed_targets).
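The wrapper idea can be tried out without PyTorch. The sketch below is a plain-Python stand-in, not the torch API. numerical_hessian is a hypothetical finite-difference routine playing the role of hessian, loss is a toy softmax cross entropy with an integer target, and functools.partial fixes the feature and target so that only the weights are differentiated.

```python
import math
from functools import partial

def loss(weights, feature, target):
    """Softmax cross entropy of a toy linear model: logit_k = weights[k] * feature.

    `target` is an integer class index and is never differentiated.
    """
    logits = [w * feature for w in weights]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[target]

def numerical_hessian(func, w, eps=1e-4):
    """Finite-difference stand-in for an autodiff Hessian routine; it only sees w."""
    n = len(w)

    def shifted(i, si, j, sj):
        v = list(w)
        v[i] += si * eps
        v[j] += sj * eps
        return func(v)

    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * eps * eps)
             for j in range(n)] for i in range(n)]

# Fix the non-differentiable arguments, leaving a function of the weights only.
weights_only_loss = partial(loss, feature=2.0, target=1)
hess = numerical_hessian(weights_only_loss, [0.3, -0.1, 0.5])
```

The integer target never reaches the differentiation routine, which is the point of the wrapper: only the floating-point weights are perturbed.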

XGBoost Custom Objective Function - Squared Normalized Error

I am trying to use Squared Normalized Error as my objective function for XGBoostRegressor, following the documentation hints here: https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html. My objective function equation is:
(prediction - observation) / standard_deviation(observations)
While trying to develop it I encountered the following issues:
I am wondering if such objective function is allowed, since standard deviation contains information about all observations (labels) while loss is calculated for each training example individually.
If my approach is correct, I am wondering how to calculate hessian and gradient of this objective function. I analyzed the squared error loss function here: Creating a Custom Objective Function in for XGBoost.XGBRegressor, but failed to understand why x=(predictions - observations) is treated as one parameter. In other words, why do we use loss function as x^2 instead of (x-y)^2? x and y correspond to predictions and observations respectively.
EDIT: I use XGBoost for the task of photovoltaic (PV) yield forecasting and I make predictions for multiple systems using one model. I would like to have low percentage error for all systems, despite their size. However, squared error makes training focus on the largest systems, as their error is naturally the largest. I changed the objective function to:
(prediction - observation) / system_size
and made system_size a global variable, as adding new input variables to gradient and hessian functions is not allowed. The code compiles without errors, but predictions are within very small range. Gradient can be divided by system_sizes, as dividing by constant does not change the derivative. Code I managed to develop so far:
def gradient_sne(predt: np.ndarray, dtrain: DMatrix) -> np.ndarray:
    #Compute the gradient squared normalized error.
    y = dtrain.get_label()
    return 2*(predt - y)/system_sizes

def hessian_sne(predt: np.ndarray, dtrain: DMatrix) -> np.ndarray:
    #Compute the hessian for squared error
    y = dtrain.get_label()
    return 0*y + 2

def custom_sne(y_pred, y_true):
    #squared error objective. A simplified version of MSNE used as
    #objective function.
    grad = gradient_sne(y_pred, y_true)
    hess = hessian_sne(y_pred, y_true)
    return grad, hess

# Customized metric
def nrmse(predt: np.ndarray, dtrain: DMatrix):
    ''' Root mean squared normalized error metric. '''
    y = dtrain.get_label()
    predt[predt < 0] = 0 # all negative predictions are zero
    std_dev = np.std(y)
    elements = np.power(((y - predt) / std_dev), 2)
    return 'RMSNE', float(np.sqrt(np.sum(elements) / len(y)))
I use python 3.7.5 and xgboost 1.0.2. I would appreciate your help very much.
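On the gradient and Hessian: assuming the intended per-sample loss is the square of the expression above, ((prediction - observation) / scale)^2, with scale (the system size or the standard deviation of the labels) treated as a constant that does not depend on the prediction, both derivatives pick up a factor 1/scale^2. Below is a minimal plain-Python sketch with made-up numbers. The function names are illustrative and not part of XGBoost.

```python
def sne_loss(pred, obs, scale):
    """Squared normalized error for one sample: ((pred - obs) / scale) ** 2."""
    return ((pred - obs) / scale) ** 2

def sne_gradient(pred, obs, scale):
    """First derivative of sne_loss with respect to pred."""
    return 2.0 * (pred - obs) / scale ** 2

def sne_hessian(pred, obs, scale):
    """Second derivative of sne_loss with respect to pred (constant in pred)."""
    return 2.0 / scale ** 2

# Made-up numbers: three systems of very different size, each 10% off.
preds = [90.0, 9.0, 0.9]
obs = [100.0, 10.0, 1.0]
scales = [100.0, 10.0, 1.0]   # per-system constant, e.g. the system size
grads = [sne_gradient(p, y, s) for p, y, s in zip(preds, obs, scales)]
hesses = [sne_hessian(p, y, s) for p, y, s in zip(preds, obs, scales)]
```

In a custom objective these two quantities would be evaluated for every row and returned as the gradient and Hessian arrays.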

Hessian matrix always singular in Theano

I'm trying to implement training an MLP with Newton's method in Theano. I've managed to find out how to obtain the inverse hessian (it's pretty easy, just needed to remove some asserts in the nlinalg module). However, when I try to update my weight vectors with my gradient_descent function, I get the following error:
numpy.linalg.linalg.LinAlgError: Singular matrix
Apply node that caused the error: MatrixInverse(InplaceDimShuffle{x,0,1}.0)
Obviously, if my hessian is singular, it can't be inverted. But the weights are random, so how come it's always singular? I don't know much about matrix computation, but it seems to me that there must be a flaw in my process. Here's the relevant part of my code:
def gradient_descent(error, weights, w_flat, learning_rate=0.1):
    # decrease flattened weights by the product of the inverse Hessian and the gradient
    hess = T.hessian(cost=error, wrt=[w_flat])
    grad = T.grad(error, wrt=[w_flat])
    mult = T.nlinalg.matrix_inverse(hess) * grad
    update = w_flat - learning_rate * mult
    return T.flatten(update)

#initialize weight matrices
w_h_flat = theano.shared(np.array(np.random.randn(11,6), dtype=theano.config.floatX).flatten())
w_hidden = w_h_flat.reshape((11,6))
w_o_flat = theano.shared(np.array(np.random.randn(7,1), dtype=theano.config.floatX).flatten())
w_output = w_o_flat.reshape((7,1))
# ... other mlp stuff, define the cost, etc ...
train = theano.function(inputs=[x, y], outputs=cost, updates=[(w_h_flat, gradient_descent(cost, w_hidden, w_h_flat)), (w_o_flat, gradient_descent(cost, w_output, w_o_flat))])
Would really appreciate any pointers as to what I'm doing wrong.
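For reference, here is a minimal sketch of a Newton update in plain Python, independent of Theano. It assumes the Hessian is a square 2-D matrix and the gradient a 1-D vector. It differs from the snippet above in two ways: the step is a matrix-vector product, obtained with a linear solve, where the snippet uses an element-wise `*`, and an optional damping term (a hypothetical parameter, in the spirit of Levenberg-Marquardt) can be added to the diagonal when the Hessian is singular or nearly so. Whether either point explains the error above is not established here.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def newton_step(w, grad, hess, learning_rate=1.0, damping=0.0):
    """w_new = w - lr * (H + damping * I)^-1 @ grad, using a solve, not an inverse."""
    n = len(w)
    damped = [[hess[i][j] + (damping if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    direction = solve(damped, grad)
    return [wi - learning_rate * di for wi, di in zip(w, direction)]

# Toy quadratic cost 0.5 * w^T A w - b^T w: gradient is A w - b, Hessian is A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
w0 = [5.0, -4.0]
grad0 = [sum(A[i][j] * w0[j] for j in range(2)) - b[i] for i in range(2)]
w1 = newton_step(w0, grad0, A)
```

On a quadratic cost a single undamped step with learning_rate=1.0 lands on the minimizer, which makes the toy problem a convenient check before moving to a real network.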
