In TensorFlow the documentation for SparseCategoricalCrossentropy states that using from_logits=True and therefore excluding the softmax operation in the last model layer is more numerically stable for the loss calculation.
Why is this the case?
A bit late to the party, but I think the numerical stability has to do with the precision of floats and overflows.
Say you want to calculate np.exp(2000): it gives you an overflow error. However, np.log(np.exp(2000)) can be simplified to just 2000.
By working with logits you can circumvent large numbers in the intermediate steps, avoiding overflows and loss of precision.
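To make this concrete, here is a minimal NumPy sketch (mine, not TensorFlow's actual implementation) contrasting the naive softmax-then-log route with the log-sum-exp trick that logit-based losses rely on internally:

import numpy as np

logits = np.array([2000.0, 0.0, -500.0])

# Naive route: softmax first, then log. np.exp(2000.0) overflows to inf,
# so the probabilities and their logs turn into nan / -inf.
probs = np.exp(logits) / np.sum(np.exp(logits))   # RuntimeWarning: overflow
naive_log_probs = np.log(probs)                   # [nan, -inf, -inf]

# Stable route: stay in logit space and shift by the max (log-sum-exp trick).
shifted = logits - np.max(logits)                 # largest exponent is now 0
log_probs = shifted - np.log(np.sum(np.exp(shifted)))
print(log_probs)                                  # [0., -2000., -2500.] -- all finite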
First of all, I think a good explanation of whether you should worry about numerical stability at all can be found in this answer; in general, you most likely should not need to care about it.
To answer your question "Why is this the case?", let's take a look at the source code:
def sparse_categorical_crossentropy(target, output, from_logits=False, axis=-1):
    """ ...
    """
    ...
    # Note: tf.nn.sparse_softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output)
    ...
You can see that if from_logits is False, the output value is clipped to the range [epsilon, 1 - epsilon].
That means that if a probability moves slightly outside these bounds, the loss will not react to the change.
However, to my knowledge it is a fairly exotic situation in which this really matters.
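As a small illustration of that clipping effect (plain NumPy here, standing in for the Keras backend; 1e-7 is Keras' default epsilon()):

import numpy as np

eps = 1e-7                    # Keras' default epsilon()
p_a, p_b = 1e-9, 1e-12        # two very different (but tiny) probabilities

# After clipping, both collapse to eps, so the loss sees the same value
# and cannot react to the difference between them.
print(np.log(np.clip(p_a, eps, 1 - eps)))  # -16.118...
print(np.log(np.clip(p_b, eps, 1 - eps)))  # -16.118... (identical)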
Context
I'm reading through part II of Hands-On ML and am looking for some clarity on when to use "outputs" and when to use "state" in the loss calculation for an RNN.
In the book (p.396 for those that have the book), the author says, "Note that the fully connected layer is connected to the states tensor, which contains only the final states of the RNN," referring to a sequence classifier that is unrolled over 28 steps. Since the states variable will have len(states) == <number_of_hidden_layers>, when building a deep RNN I have been using states[-1] to only connect to the final state of the final layer. For example:
# hidden_layer_architecture = list of ints defining n_neurons in each layer
# example: hidden_layer_architecture = [100 for _ in range(5)]
layers = []
for layer_id, n_neurons in enumerate(hidden_layer_architecture):
    hidden_layer = tf.contrib.rnn.BasicRNNCell(n_neurons,
                                               activation=tf.nn.tanh,
                                               name=f'hidden_layer_{layer_id}')
    layers.append(hidden_layer)

recurrent_hidden_layers = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(recurrent_hidden_layers,
                                    X_, dtype=tf.float32)
logits = tf.layers.dense(states[-1], n_outputs, name='outputs')
This works as expected given the author's previous statement. However, I don't understand when one would use the outputs variable (the first output of tf.nn.dynamic_rnn()).
I have looked at this question, which does a pretty good job of answering the minutiae and mentions that, "If you are only interested in the last output of the cell, you can just slice the time dimension to pick just the last element (e.g. outputs[:, -1, :])." I inferred this to mean something along the lines of states[-1] == outputs[:, -1, :], which, when tested, was false. If the outputs are the outputs of the cell at each time step, why wouldn't this be the case? In general...
Question
When does one use the outputs variable from tf.nn.dynamic_rnn() in the loss function and when would one use the states variable? How does this change the abstracted architecture of the network?
Any clarity would be greatly appreciated.
This basically breaks it down:
outputs: The full sequence of outputs of the top level of the RNN. This means that, should you be using MultiRNNCell, this will only be the top cell; nothing from the lower cells is in here.
In general, with custom RNNCell implementations this could be pretty much anything, but nearly all the standard cells will return the sequence of states here. You could also write a custom cell yourself that does something to the state sequence (e.g. a linear transformation) before returning it as outputs.
state (note that this is what the docs call it, not states) is the full state of the last time step. One important difference is that, in the case of MultiRNNCell, this will contain the final states of all cells in the stack, not just the top one! Also, the precise format/type of this output varies heavily depending on the RNNCell used (e.g. it could be a tensor, or a tuple of tensors...).
As such, if all you care about is the top-most state of the last time step in a MultiRNNCell, you really have two options that should be identical, coming down to personal preference/"clarity":
outputs[:, -1, :] (assuming batch-major format) extracts only the last time-step from the sequence of top-level states.
state[-1] extracts only the top-level state from the tuple of final states for all layers.
There are other scenarios where you might not have this choice:
If you actually need the full sequence output, you need to use outputs.
If you need the final states from lower layers in a MultiRNNCell, you need to use state.
As for why the equality check fails: if you actually used ==, I believe this checks for equality of the tensor objects, which are obviously different. You could instead inspect the values of the two objects for some simple toy scenario (tiny state size/sequence length) -- they should be the same.
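Here is a minimal sketch of that value comparison (TF 1.x APIs as in the question; the shapes are made up). For plain cells such as BasicRNNCell, whose output equals their state, it should print True:

import numpy as np
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 5, 3])   # batch x time x features
cells = [tf.contrib.rnn.BasicRNNCell(4) for _ in range(2)]
outputs, state = tf.nn.dynamic_rnn(tf.contrib.rnn.MultiRNNCell(cells),
                                   X, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.random.rand(2, 5, 3).astype(np.float32)
    top_last_step, top_final_state = sess.run([outputs[:, -1, :], state[-1]],
                                              feed_dict={X: x})
    # Same values, even though the tensor objects compare unequal with ==
    print(np.allclose(top_last_step, top_final_state))  # True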
I am trying to train a neural network in Python 3.7. For this, I am using Numpy to perform calculations and do my matrix multiplications. I find this error
RuntimeWarning: overflow encountered in multiply (when I am multiplying matrices)
This, in turn, results in nan values, which raises errors like
RuntimeWarning: invalid value encountered in multiply
RuntimeWarning: invalid value encountered in sign
Now, I have seen many answers related to this question, all explaining why this happens. But I want to know, "How do I solve this problem?". I have tried using the default math module, but that still doesn't work and raises errors like
TypeError: only size-1 arrays can be converted to Python scalars
I know I can use for loops to do the multiplications, but that is computationally very expensive, and it also lengthens and complicates the code a lot. Is there any solution to this problem, like doing something with NumPy (I am aware that there are ways to handle exceptions, but not to solve them)? And if not, is there perhaps an alternative to NumPy that doesn't require me to change my code much?
I don't really mind if the precision of my data is compromised a bit. (If it helps, the dtype of the matrices is float64.)
EDIT:
Here is a dummy version of my code:
import numpy as np

network = np.array([np.ones(10), np.ones(5)])

for i in range(100000):
    for lindex, layer in enumerate(network):
        network[lindex] *= abs(np.random.random(len(layer))) * 200000
I think the overflow error occurs when I am adding large values to the network.
This is a problem I too have faced with my neural network while using ReLU activations, because of their infinite range on the positive side. There are two solutions to this problem:
A) Use another activation function: atan, tanh, sigmoid, or any other one with a limited range.
However, if you do not find those suitable:
B) Dampen the ReLU activations. This can be done by scaling down all values of the ReLU and ReLU-prime functions. Here's the difference in code:
## Normal code
def ReLu(x, derivative=False):
    if derivative:
        return 0 if x < 0 else 1
    return 0 if x < 0 else x

## Adjusted code
def ReLu(x, derivative=False):
    scaling_factor = 0.001
    if derivative:
        return 0 if x < 0 else scaling_factor
    return 0 if x < 0 else scaling_factor * x
Since you are willing to compromise on precision, this is a perfect solution for you! At the end, you can multiply by the inverse of the scaling_factor to get an approximate solution -- approximate because of rounding discrepancies.
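Since the question operates on whole NumPy arrays (the math-module TypeError comes from passing arrays to scalar functions), here is a hedged, vectorized variant of the same idea:

import numpy as np

def relu(x, derivative=False, scaling_factor=0.001):
    # np.where applies the scalar logic above element-wise to an array
    x = np.asarray(x, dtype=np.float64)
    if derivative:
        return np.where(x < 0, 0.0, scaling_factor)
    return np.where(x < 0, 0.0, scaling_factor * x)

a = np.array([-2.0, 0.5, 3.0])
print(relu(a))                   # [0.     0.0005 0.003 ]
print(relu(a, derivative=True))  # [0.    0.001  0.001]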
Problem
I'm running a deep neural network on MNIST, where the loss is defined as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))
The program seems to run correctly until I get a nan loss somewhere past the 10,000th minibatch. Sometimes the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error.
This is strange, because the code just contains mul and add operations.
Possible Solution
Maybe I can use:
if cost == "nan":
    optimizer = an empty optimizer
else:
    ...
    optimizer = real optimizer
But I cannot find the type of nan. How can I check whether a variable is nan or not?
How else can I solve this problem?
I found a similar problem here: TensorFlow cross_entropy NaN problem.
Thanks to the author user1111929:
Writing the cross-entropy by hand as -tf.reduce_sum(y_*tf.log(y_conv)) (rather than using tf.nn.softmax_cross_entropy_with_logits) is actually a horrible way of computing it. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem, since you're not interested in those, but in the way the cross-entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
Or
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solves the NaN problem.
The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. But to answer your specific question about detecting NaN, Python has a built-in capability to test for NaN in the math module. For example:
import math

val = float('nan')

if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # break into the debugger to look around
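Note that math.isnan only handles Python scalars; for NumPy arrays (e.g. values pulled out of a session), the element-wise check is the usual tool. A small sketch:

import numpy as np

a = np.array([1.0, float('nan'), 3.0])
print(np.isnan(a))         # [False  True False] -- element-wise check
print(np.isnan(a).any())   # True -- at least one NaN somewhere in the array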
Check your learning rate. The bigger your network, the more parameters there are to learn, which means you also need to decrease the learning rate.
I don't have your code or data, but tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement. An analogous problem was also discussed here, which would lead you to either:
Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):
epsilon = tf.constant(value=0.00001, shape=shape)  # shape: the shape of your logits tensor
logits = logits + epsilon
softmax = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
Update your data so that it does have a valid probability distribution
EDIT: looks like this was already answered before here
It didn't show up in my searches because I didn't know the right nomenclature. I'll leave the question here for now in case someone arrives here because of the constraints.
I'm trying to optimize a function which is flat at almost all points (a "step function", but in a higher dimension).
The objective is to optimize a set of weights, that must sum to one, and are the parameters of a function which I need to minimize.
The problem is that, as the function is flat at most points, gradient techniques fail because they immediately converge on the starting "guess".
My hypothesis is that this could be solved with (a) simulated annealing or (b) genetic algorithms. SciPy sends me to basinhopping. However, I cannot find any way to specify the constraint (the weights must sum to 1) or the bounds (weights must be between 0 and 1) using SciPy.
Actual question: How can I solve a minimization problem without gradients, and also use constraints and ranges for the input variables?
The following is a toy example (evidently this one could be solved using the gradient):
# import minimize
from scipy.optimize import minimize

# define a toy function to minimize
def my_small_func(g):
    x = g[0]
    y = g[1]
    return x**2 - 2*y + 1

# define the starting guess
start_guess = [.5, .5]

# define the acceptable ranges (for [g1, g2] respectively)
my_ranges = ((0, 1), (0, 1))

# define the constraint (they must always sum to 1)
def constraint(g):
    return g[0] + g[1] - 1

cons = {'type': 'eq', 'fun': constraint}

# minimize
minimize(my_small_func, x0=start_guess, method='SLSQP',
         bounds=my_ranges, constraints=cons)
I usually use R, so maybe this is a bad answer, but anyway, here goes.
You can solve optimization problems like this using a global optimizer; an example is differential evolution. The linked method does not use gradients. As for constraints, I usually build them into the objective function manually. That looks something like this:
# some dummy function to minimize
def objective_function(params):
    a, b = params
    if a + b != 1:
        # constraint not met: return a very high value,
        # indicating a very bad fit
        return 1e90
    # otherwise, do the actual computation of interest
    return fit_value  # placeholder for the real fit computation
Then you simply feed this function to the differential evolution package and that should do the trick. Methods like differential evolution are made to solve, in particular, very high-dimensional problems. However, the constraint you mentioned can be a problem, as it will likely result in very many invalid parameter configurations. This is not necessarily a problem for the algorithm, but it simply means you need to do a lot of tweaking and should expect long running times. Depending on your problem, you could also try optimizing weights/parameters in blocks: optimize the parameters given a set of weights, then optimize the weights given the previous set of parameters, and repeat many times. A SciPy sketch of the penalty approach follows below.
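Since the question is SciPy-based, here is a hedged sketch of the same penalty idea with scipy.optimize.differential_evolution; objective() is a made-up stand-in for the real fit function, and a graded penalty tends to guide the optimizer better than a single huge constant:

import numpy as np
from scipy.optimize import differential_evolution

def objective(w):
    return np.sum((w - 0.25) ** 2)   # hypothetical stand-in fit function

def penalized(w):
    # soft penalty that grows with the size of the constraint violation
    return objective(w) + 1e6 * abs(np.sum(w) - 1.0)

bounds = [(0.0, 1.0)] * 4            # each weight must lie in [0, 1]
result = differential_evolution(penalized, bounds, seed=0)
print(result.x, result.x.sum())      # weights summing to ~1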
Hope this helps :)
I'm using scipy's least-squares optimization to fit an exponentially modified Gaussian distribution to a set of reaction time measurements. In general it works well, but sometimes the optimization goes off the rails and chooses a crazy value for a parameter; the resulting plot clearly doesn't fit the data very well. In general, the problems seem to arise from floating-point precision errors -- we head off into 0, inf, or NaN-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the error will even be generally the same. But getting the initial parameter fits in there: I'm not quite sure if it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450.,  560.,  692.,  730.,  758.,  723.,  486.,  596.,  716.,
        695.,  757.,  522.,  535.,  419.,  478.,  666.,  637.,  569.,
        859.,  883.,  551.,  652.,  378.,  801.,  718.,  479.,  544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: If X contains at least one nan, np.sum(X) == nan when X is a numpy array but not when X is a pandas Series. So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!
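For anyone else bitten by this, a two-line demonstration of the difference:

import numpy as np
import pandas as pd

x = np.array([1.0, np.nan, 3.0])
print(np.sum(x))           # nan -- NaN propagates through a NumPy sum
print(pd.Series(x).sum())  # 4.0 -- pandas skips NaN by default (skipna=True)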
This should have been a comment, but I ran out of space.
I think a maximum-likelihood fit is probably the most appropriate approach here. The ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of the normal distribution by calling scipy.stats.norm.fit, and the MLE of the exponential distribution in a similar way. Combining the two resulting sets of MLE parameters should give you a pretty good starting parameter for the ex-Gaussian ML fit. In fact, I would imagine most of your data is quite nicely normally distributed; if that is the case, the ML parameter estimates for the normal distribution alone should give you a pretty good starting parameter.
Since the ex-Gaussian only has 3 parameters, I don't think an ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it will be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x):  # defines the PDF
        m = p[0]
        s = p[1]
        l = p[2]
        return 0.5*l*exp(0.5*l*(2*m + l*s*s - 2*x)) * sse.erfc((m + l*s*s - x)/(sqrt(2)*s))
>>> xo = array([450., 560., 692., 730., 758., 723., 486., 596., 716.,
                695., 757., 522., 535., 419., 478., 666., 637., 569.,
                859., 883., 551., 652., 378., 801., 718., 479., 544.])
>>> sss.norm.fit(xo) #get the starting parameter vector form the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x):  # defines the negative log-likelihood function
        return -sum(log(f(p, x)))
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([ 6.14003407e+02, 1.31843250e+02, 9.79425845e-02])
>>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so, we increase the number of function call uplimit
Optimization terminated successfully.
Current function value: 170.195924
Iterations: 376
Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
170.19592431051217
>>> przt
array([ 6.14007039e+02, 1.31844654e+02, 9.78934519e-02])
The optimizer used here (fmin, the Nelder-Mead simplex algorithm) does not use any gradient information and usually works much more slowly than optimizers that do. It appears that the derivative of the negative log-likelihood function of the exponential Gaussian can be written in closed form fairly easily. If so, optimizers that utilize the gradient/derivative will be a better and more efficient choice (such as fmin_bfgs); a quick sketch follows.
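A hedged sketch of that swap, continuing the session above (no fprime is supplied here, so the gradient is still approximated numerically; pass the closed-form derivative as fprime to get the full benefit):

>>> so.fmin_bfgs(llh, przt, args=(eg_pdf, xo))  # refine the simplex result with BFGS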
The other thing to consider is parameter constraints. By definition, sigma and lambda have to be positive for the exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize over transformed parameters:
>>> def eg_pdf2(p, x):  # same PDF, but with log-transformed scale parameters
        m = p[0]
        s = exp(p[1])
        l = exp(p[2])
        return 0.5*l*exp(0.5*l*(2*m + l*s*s - 2*x)) * sse.erfc((m + l*s*s - x)/(sqrt(2)*s))
Due to the functional invariance property of the MLE, the MLE of this function should be the same as that of the original eg_pdf. There are other transformations you can use, besides exp(), to map (-inf, +inf) onto (0, +inf). For instance:
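A hedged continuation of the session above, reusing the normal-fit estimates (log-transformed for the positive parameters) as the start and mapping the result back afterwards:

>>> p0 = array([624.22222222222217, log(132.23977474531389), log(1e-6)])
>>> przt2 = so.fmin(llh, p0, (eg_pdf2, xo), maxfun=1000)
>>> m, s, l = przt2[0], exp(przt2[1]), exp(przt2[2])  # back to the natural scale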
And you can also consider http://en.wikipedia.org/wiki/Lagrange_multiplier.