Numpy Overflow in calculations disrupting code

Numpy Overflow in calculations disrupting code - python

I am trying to train a neural network in Python 3.7. For this, I am using Numpy to perform calculations and do my matrix multiplications. I find this error
RuntimeWarning: overflow encountered in multiply (when I am multiplying matrices)
This, in turn, results in nan values, which raises errors like
RuntimeWarning: invalid value encountered in multiply
RuntimeWarning: invalid value encountered in sign
Now, I have seen many answers related to this question, all explaining why this happens. But I want to know, "How do I solve this problem?". I have tried using the default math module, but that still doesn't work and raises errors like
TypeError: only size-1 arrays can be converted to Python scalars
I know I can use for loops to do the multiplications, but that is computationally very expensive, and also lengthens and complicates the code a lot. Is there any solution to this problem? Like doing something with Numpy (I am aware that there are ways to handle exceptions, but not solve them), and if not, then perhaps alternative to Numpy, which doesn't require me to change my code much?
I don't really mind if the precision of my data is compromised a bit. (If it helps the dtype for the matrices is float64)
EDIT:
Here is a dummy version of my code:
import numpy as np
network = np.array([np.ones(10), np.ones(5)])
for i in range(100000):
for lindex, layer in enumerate(network):
network[lindex] *= abs(np.random.random(len(layer)))*200000
I think the overflow error occurs when I am adding large values to the network.

This is a problem I too have faced with my neural network while using ReLu activators because of the infinite range on the positive side. There are two solutions to this problem:
A) Use another activation function: atan,tanh,sigmoid or any other one with limited range
However if you do not find those suitable:
B)Dampen the ReLu activations. This can be done by scaling down all values of the ReLu and ReLu prime function. Here's the difference in code:
##Normal Code
def ReLu(x,derivative=False):
if derivative:
return 0 if x<0 else 1
return 0 if x<0 else x
##Adjusted Code
def ReLu(x,derivative=False):
scaling_factor = 0.001
if derivative:
return 0 if x<0 else scaling_factor
return 0 if x<0 else scaling_factor*x
Since you are willing to compromise on the precision this is a perfect solution for you! In the ending, you can multiply by the inverse of the scaling_factor to get the approximate solution- approximate because of rounding discrepancies.

Related

scipy's fsolve gives answers that aren't roots

I am trying to find roots of a function in python using fsolve:
import math
import scipy
def f(a):
eq=-2*a**2 - 2*a**2*(math.sin(25*a**(1/4)))**2 - 2*a**2*(math.cos(25*a**(1/4)))**2 - 2*math.exp(-25*a**(1/4))*a**2*math.cos(25*a**(1/4)) - 2*math.exp(25*a**(1/4))*a**2*math.cos(25*a**(1/4))
return eq
print(f(scipy.optimize.fsolve(f,10)))
and it returns the following value:
[1234839.75468454]
That doesn't seem very close to 0 to me... Does it simply lack the computational power to calculate more decimals for the root? If so, what would be a good alternative for fsolve that could also calculate roots, just more accurately?

To better understand what happens, a first step would be to look at the infos of the run, which you can get by setting the full_output argument to True (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html for more details).
When starting with an initial point of 10 as you do, the algorithm claims to converge and do so in less evaluations than the maximum allocated, so it is not a problem of computing power.

Why is computing the loss from logits more numerically stable?

In TensorFlow the documentation for SparseCategoricalCrossentropy states that using from_logits=True and therefore excluding the softmax operation in the last model layer is more numerically stable for the loss calculation.
Why is this the case?

A bit late to the party, but I think the numerical stability has something to do with precision of floats and overflows.
Say you want to calculate np.exp(2000), it gives you an overflow error. However, calculating np.log(np.exp(2000)) can be simplified to 2000.
By using logits you can circumvent large numbers in intermediate steps, avoiding overflows and low precision.

First of all here I think a good explanation about should you worry about numerical stability or not. Check this answer but in general most likely you should not care about it.
To answer your question "Why is this the case?" let's take a look on source code:
def sparse_categorical_crossentropy(target, output, from_logits=False, axis=-1):
""" ...
"""
...
# Note: tf.nn.sparse_softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
if not from_logits:
_epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.log(output)
...
You could see that if from_logits is False then output value is clipped to epsilon and 1-epsilon.
That means that if the value is slightly changing outside of this bounds the result will not react on it.
However in my knowledge it's quite exotic situation when it really matters.

python sum on array of tensors vs tf.add_n

So I've got some code
tensors = [] //its filled with 3D float tensors
total = sum(tensors)
if I change that last line to
total = tf.add_n(tensors)
then the code produces the same output but runs much more slowly and soon causes
an out-of-memory exception. Whats going on here? Can someone explain how pythons built in sum function and tf.add_n interact with an array of tensors respectively and why pythons sum would seemingly just be a better version?

When you use sum, you call a standard python algorithm that call __add__ recursively on the elements of the array. Since __add__ (or +) indeed is overloaded on tensorflow's tensors, it works as expected: it creates a graph that can be executed during a session. It is not optimal, however, because you add as many operation as there are elements in your list; also, you are enforcing the order of the operation (add the first two elements, then the third to the result, and so on), which is also not optimal.
By contrast, add_n is a specialized operation to do just that. Looking at the graph is really telling I think:
import tensorflow as tf
with tf.variable_scope('sum'):
xs = [tf.zeros(()) for _ in range(10)]
sum(xs)
with tf.variable_scope('add_n'):
xs = [tf.zeros(()) for _ in range(10)]
tf.add_n(xs)
However – contrary to what I thought earlier – add_n takes up more memory because it waits – and store – for all incoming inputs before storing them. If the number of inputs is large, then the difference can be substantial.
The behavior I was expecting from add_n, that is, summation of inputs as they are available, is actually achieved by tf.accumulate_n. This should be the superior alternative, as it takes less memory than add_n but does not enforce the order of summation like sum.
Why did the authors of tensorflow-wavenet used sum instead of tf.accumulate_n? Certainly because before this function is not differentiable on TF < 1.7. So if you have to support TF < 1.7 and be memory efficient, good old sum is actually quite a good option.

The sum() built-in only takes iterables and therefor would seem to gain the advantage of using generators in regards to memory profile.
the add_n() function for tensor takes a list of tensors and seem to retain that data structure throughout handling based on it's requirement for shape comparison.
In [29]: y = [1,2,3,4,5,6,7,8,9,10]
In [30]: y.__sizeof__()
Out[30]: 120
In [31]: x = iter(y)
In [32]: x.__sizeof__()
Out[32]: 32

fsolve problems with the starting point

I'm using fsolve in order to solve a non linear equation. My problem is that, depending on the starting point the solutions change and I am not sure that the ones that I found are the most reasonable.
This is the code
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve, brentq,newton
A = np.arange(0.05,0.95,0.01)
PHI = np.deg2rad(np.arange(0,90,1))
def f(b):
return np.angle((1+3*a**4-3*a**2)+(a**4-a**6)*(np.exp(2j*b)+2*np.exp(-1j*b))+(a**2-2*a**4+a**6)*(np.exp(-2j*b)+2*np.exp(1j*b)))-Phi
B = np.zeros((len(A),len(PHI)))
for i in range(len(A)):
for j in range(len(PHI)):
a = A[i]
Phi = PHI[j]
b = fsolve(f, 1)
B[i,j]= b
I fixed x0 = 1 because it seems to give the more reasonable values. But sometimes, I think the method doesn't converge and the resulting values are too big.
What can I do to find the best solution?
Many thanks!

The eternal issue with turning non-linear solvers loose is having a really good understanding of your function, your initial guess, the solver itself, and the problem you are trying to address.
I note that there are many (a,Phi) combinations where your function does not have real roots. You should do some math, directed by the actual problem you are trying to solve, and determine where the function should have roots. Not knowing the actual problem, I can't do that for you.
Also, as noted on a (since deleted) answer, this is cyclical on b, so using a bounded solver (such as scipy.optimize.minimize using method='L-BFGS-B' might help to keep things under control. Note that to find roots with a minimizer you use the square of your function. If the found minimum is not close to zero (for you to define based on the problem), the real minima might be a complex conjugate pair.
Good luck.

Fitting a distribution to data: how to penalize "bad" parameter estimates?

I'm using using scipy's least-squares optimization to fit an exponentially-modified gaussian distribution to a set of reaction time measurements. In general, it works well, but sometimes, the optimization goes off the rails and chooses a crazy value for a parameter -- the resulting plot clearly doesn't fit the data very well. In general, it looks like the problems arise from floating-point precision errors -- we head off into 0 or inf or nan-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the error will even be generally the same. But getting the initial parameter fits in there: I'm not quite sure if it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: If X contains at least one nan, np.sum(X) == nan when X is a numpy array but not when X is a pandas Series. So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!

This should have been a comment but I run out of space.
I think a Maximum Likelihood fit is probably the most appropriate approach here. ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of normal distribution by calling scipy.stats.norm.fit and find the MLE of exponential distribution in a similar way. Combining these two resulting MLE parameters should give you a pretty good starting parameter for Ex-Gaussian ML fit. In fact I would imaging most of your data is quite nicely Normally distributed. If that is the case, the ML parameter estimates for Normal distribution alone should give you a pretty good starting parameter.
Since Ex-Gaussian only has 3 parameters, I don't think a ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it will be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x): #defines the PDF
m=p[0]
s=p[1]
l=p[2]
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
>>> xo=array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
>>> sss.norm.fit(xo) #get the starting parameter vector form the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x): #defines the negative log-likelihood function
return -sum(log(f(p,x)))
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([ 6.14003407e+02, 1.31843250e+02, 9.79425845e-02])
>>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so, we increase the number of function call uplimit
Optimization terminated successfully.
Current function value: 170.195924
Iterations: 376
Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
170.19592431051217
>>> przt
array([ 6.14007039e+02, 1.31844654e+02, 9.78934519e-02])
The optimizer used here (fmin, or Nelder-Mead simplex algorithm) does not use any information from gradient and usually works much slower than the optimizer that does. It appears that the derivative of the negative log-likelihood function of Exponential Gaussian may be written in a close form easily. If so, optimizers that utilize gradient/derivative will be better and more efficient choice (such as fmin_bfgs).
The other thing to consider is parameter constrains. By definition, sigma and lambda has to be positive for Exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize for:
>>> def eg_pdf2(p, x): #defines the PDF
m=p[0]
s=exp(p[1])
l=exp(p[2])
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
Due to the functional invariance property of MLE, the MLE of this function should be the same as same as the original eg_pdf. There are other transformation that you can use, besides exp(), to project (-inf, +inf) to (0, +inf).
And you can also consider http://en.wikipedia.org/wiki/Lagrange_multiplier.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.