Log overflow for no reason? [duplicate] - python

I am using numpy.log10 to calculate the log of an array of probability values. There are some zeros in the array, and I am trying to get around it using
result = numpy.where(prob > 0.0000000001, numpy.log10(prob), -10)
However, RuntimeWarning: divide by zero encountered in log10 still appeared and I am sure it is this line caused the warning.
Although my problem is solved, I am confused why this warning appeared again and again?

You can turn it off with seterr
numpy.seterr(divide = 'ignore')
and back on with
numpy.seterr(divide = 'warn')

numpy.log10(prob) calculates the base 10 logarithm for all elements of prob, even the ones that aren't selected by the where. If you want, you can fill the zeros of prob with 10**-10 or some dummy value before taking the logarithm to get rid of the problem. (Make sure you don't compute prob > 0.0000000001 with dummy values, though.)

Just use the where argument in np.log10
import numpy as np
np.random.seed(0)
prob = np.random.randint(5, size=4) /4
print(prob)
result = np.where(prob > 0.0000000001, prob, -10)
# print(result)
np.log10(result, out=result, where=result > 0)
print(result)
Output
[1. 0. 0.75 0.75]
[ 0. -10. -0.12493874 -0.12493874]

I solved this by finding the lowest non-zero number in the array and replacing all zeroes by a number lower than the lowest :p
Resulting in a code that would look like:
def replaceZeroes(data):
min_nonzero = np.min(data[np.nonzero(data)])
data[data == 0] = min_nonzero
return data
...
prob = replaceZeroes(prob)
result = numpy.where(prob > 0.0000000001, numpy.log10(prob), -10)
Note that all numbers get a tiny fraction added to them.

Just specify where to calculate log10 as follows:
result = np.log10(prob,where=prob>0)
Here is a demo:

This solution worked for me, use numpy.sterr to turn warnings off followed by where
numpy.seterr(divide = 'ignore')
df_train['feature_log'] = np.where(df_train['feature']>0, np.log(df_train['feature']), 0)

Related

Using Python to use Euler's formula into matrix approximation

I am trying to use Python and NumPy to use Euler’s formula e^i(π) represented as a matrix as e^A where
A = [0 -π]
[π 0]
,and then apply it to the Maclaurin series for an exponential function e^x as
SUMMATION(n=0, infinity) x^n/n! = 1 + x + x^2/2! + x^3/3! +...
So I am trying to compute an approximation matrix S^N+1 and print the matrix and it's four entries.
I have tried emulating euler's and maclaurin's series, which i think the final approximation matrix for this will be when N = 20, but currently my values do not add up. I am also trying to use np.linalg.norm to compute a 2 norm as well.
import math
import numpy as np
n = 0
A = np.eye(2)
A = math.pi * np.rot90(A)
A[0,1] = -A[0,1]
A
mac_series = 0
while n < 120:
print(n)
n += 1
mac_series = (A**n) / (math.factorial(n))
print("\n",mac_series)
np.linalg.norm(mac_series)
The main problem here is that you are confusing A**3 with A#A#A.
Just look at case n=0.
A**0
#array([[1., 1.],
# [1., 1.]])
I am pretty sure, you were expecting A⁰ to be identity (that is only that way that this thinking of x+iy ⇔ np.array([[x,-y],[y,x]]) makes sense)
In numpy, you have np.linalg.matrix_power for that (or you could just accumulate power your self)
sum(np.linalg.matrix_power(A,i) / math.factorial(i) for i in range(20))
is
array([[-1.00000000e+00, 5.28918267e-10],
[-5.28918267e-10, -1.00000000e+00]])
for example. Pretty sure that is what you were expecting (that is the matrix that represents real -1 using the same logic. And whole point of Euler identity is e^(iπ) = -1).
By comparison,
sum(A**i / math.factorial(i) for i in range(20))
returns
array([[ 1. , 0.04321392],
[23.14069263, 1. ]])
Which is just the maclaurin series computed for all four elements of the matrix. In other words, since your matrix is [[0,-π],[π,0]], you are evaluating using a MacLauring series [[e⁰, exp(-π)], [exp(π), e⁰]]. And it works. e⁰=1, obviously. exp(π) is 23.140692632779267, so we got a very good approximation in our result. And exp(-π) is the inverse, 0.04321391826377226. We also got a good approximation.
So it works. Just not at all to do what you obviously intend to do: prove Euler identity's in matrix form; compute exp(iπ) not just exp(π).
Without matrix_power, and with a code closer to your initial code, you could
n=0
mac_series = 0
Apowern=np.eye(2) # A⁰=Id for now
while n < 20:
print(n)
mac_series += Apowern / (math.factorial(n))
Apowern = Apowern # A # # is the matrix multiplication operator
n+=1
Note that I've also moved n+=1 which was misplaced in your code. You were stacking Aⁿ⁺¹/(n+1)! not Aⁿ/n! with your code (in other words, your sum misses the A⁰/0!=Id term).
With this, I get the expected result
>>> mac_series
array([[-1.00000000e+00, 5.28918724e-10],
[-5.28918724e-10, -1.00000000e+00]])
Last problem, more subtle: you may have noticed that I do only 20 iterations, not 120. That is because after 20, you start to have a numerical problem. Apowern (or np.linalg.matrix_power(A,n), it is the same problem for both methods) becomes to big. Since it is divided by n! in the stacking, that doesn't prevent convergence. But it does prevent numeric convergence. And, in practice, after a while, numpy change the type of Apowern.
So, we should not have big matrix divided by big number, and try to iterate things that stay small enough. Like this for example
n=0
mac_series = 0
NthTerm=np.eye(2) # Aⁿ/n!. A⁰/0!=Id for now
while n < 120: # 120 is no longer a problem
print(n)
mac_series += NthTerm
n += 1
NthTerm = (NthTerm # A) / n # so if nthterm was
# Aⁿ/n!, now it becomes Aⁿ/n! # A/(n+1) = Aⁿ⁺¹/(n+1)!
Result
>>> mac_series
array([[-1.00000000e+00, -2.34844612e-16],
[ 2.34844612e-16, -1.00000000e+00]])
tl;dr
You have 4 problems
The one already mentioned by Roy: you are not accumulating the Aⁿ/n!, just replacing them, and eventually keeping only the last. In other words, you need a += instead of =
A**n is not Aⁿ. It is just A, with all the elements to the power n. Said otherwise [[x,-y],[y,x]]**n is not [[x,-y],[y,x]]ⁿ it is [[xⁿ,(-y)ⁿ],[yⁿ,xⁿ]]. So you'll end up computing [[e⁰, 1/e^π], [e^π, e⁰]] ≈ [[1, 0.0432], [23.14, 1]] which is irrelevant.
n+=1 is misplaced
The numerical problem due to Aⁿ becoming huge (even if you intend to divide it by a even huger n!, so it does not theoretically/mathematically pose a problem, but numerically it does, since intermediate result is to big for computer)

`numpy.nanpercentile` is extremely slow

numpy.nanpercentile is extremely slow.
So, I wanted to use cupy.nanpercentile; but there is not cupy.nanpercentile implemented yet.
Do someone have solution for it?
I also had a problem with np.nanpercentile being very slow for my datasets. I found a wokraround that lets you use the standard np.percentile. And it can also be applied to many other libs.
This one should solve your problem. And it also works alot faster than np.nanpercentile:
arr = np.array([[np.nan,2,3,1,2,3],
[np.nan,np.nan,1,3,2,1],
[4,5,6,7,np.nan,9]])
mask = (arr >= np.nanmin(arr)).astype(int)
count = mask.sum(axis=1)
groups = np.unique(count)
groups = groups[groups > 0]
p90 = np.zeros((arr.shape[0]))
for g in range(len(groups)):
pos = np.where (count == groups[g])
values = arr[pos]
values = np.nan_to_num (values, nan=(np.nanmin(arr)-1))
values = np.sort (values, axis=1)
values = values[:,-groups[g]:]
p90[pos] = np.percentile (values, 90, axis=1)
So instead of taking the percentile with the nans, it sorts the rows by the amount of valid data, and takes the percentile of those rows separated. Then adds everything back together. This also works for 3D-arrays, just add y_pos and x_pos instead of pos. And watch out for what axis you are calculating over.
def testset_gen(num):
init=[]
for i in range (num):
a=random.randint(65,122) # Dummy name
b=random.randint(1,100) # Dummy value: 11~100 and 10% of nan
if b<11:
b=np.nan # 10% = nan
init.append([a,b])
return np.array(init)
np_testset=testset_gen(30000000) # 468,751KB
def f1_np (arr, num):
return np.percentile (arr[:,1], num)
# 55.0, 0.523902416229248 sec
print (f1_np(np_testset[:,1], 50))
def cupy_nanpercentile (arr, num):
return len(cp.where(arr > num)[0]) / (len(arr) - cp.sum(cp.isnan(arr))) * 100
# 55.548758317136446, 0.3640251159667969 sec
# 43% faster
# If You need same result, use int(). But You lose saved time.
print (cupy_nanpercentile(cp_testset[:,1], 50))
I can't imagine How test result takes few days. With my computer, It seems 1 Trillion line of data or more. Because of this, I can't reproduce same problem due to lack of resource.
Here's an implementation with numba. After it's been compiled it is more than 7x faster than the numpy version.
Right now it is set up to take the percentile along the first axis, however it could be changed easily.
#numba.jit(nopython=True, cache=True)
def nan_percentile_axis0(arr, percentiles):
"""Faster implementation of np.nanpercentile
This implementation always takes the percentile along axis 0.
Uses numba to speed up the calculation by more than 7x.
Function is equivalent to np.nanpercentile(arr, <percentiles>, axis=0)
Params:
arr (np.array): Array to calculate percentiles for
percentiles (np.array): 1D array of percentiles to calculate
Returns:
(np.array) Array with first dimension corresponding to
values as passed in percentiles
"""
shape = arr.shape
arr = arr.reshape((arr.shape[0], -1))
out = np.empty((len(percentiles), arr.shape[1]))
for i in range(arr.shape[1]):
out[:,i] = np.nanpercentile(arr[:,i], percentiles)
shape = (out.shape[0], *shape[1:])
return out.reshape(shape)

numpy division with RuntimeWarning: invalid value encountered in double_scalars

I wrote the following script:
import numpy
d = numpy.array([[1089, 1093]])
e = numpy.array([[1000, 4443]])
answer = numpy.exp(-3 * d)
answer1 = numpy.exp(-3 * e)
res = answer.sum()/answer1.sum()
print res
But I got this result and with the error occurred:
nan
C:\Users\Desktop\test.py:16: RuntimeWarning: invalid value encountered in double_scalars
res = answer.sum()/answer1.sum()
It seems to be that the input element were too small that python turned them to be zeros, but indeed the division has its result.
How to solve this kind of problem?
You can't solve it. Simply answer1.sum()==0, and you can't perform a division by zero.
This happens because answer1 is the exponential of 2 very large, negative numbers, so that the result is rounded to zero.
nan is returned in this case because of the division by zero.
Now to solve your problem you could:
go for a library for high-precision mathematics, like mpmath. But that's less fun.
as an alternative to a bigger weapon, do some math manipulation, as detailed below.
go for a tailored scipy/numpy function that does exactly what you want! Check out #Warren Weckesser answer.
Here I explain how to do some math manipulation that helps on this problem. We have that for the numerator:
exp(-x)+exp(-y) = exp(log(exp(-x)+exp(-y)))
= exp(log(exp(-x)*[1+exp(-y+x)]))
= exp(log(exp(-x) + log(1+exp(-y+x)))
= exp(-x + log(1+exp(-y+x)))
where above x=3* 1089 and y=3* 1093. Now, the argument of this exponential is
-x + log(1+exp(-y+x)) = -x + 6.1441934777474324e-06
For the denominator you could proceed similarly but obtain that log(1+exp(-z+k)) is already rounded to 0, so that the argument of the exponential function at the denominator is simply rounded to -z=-3000. You then have that your result is
exp(-x + log(1+exp(-y+x)))/exp(-z) = exp(-x+z+log(1+exp(-y+x))
= exp(-266.99999385580668)
which is already extremely close to the result that you would get if you were to keep only the 2 leading terms (i.e. the first number 1089 in the numerator and the first number 1000 at the denominator):
exp(3*(1089-1000))=exp(-267)
For the sake of it, let's see how close we are from the solution of Wolfram alpha (link):
Log[(exp[-3*1089]+exp[-3*1093])/([exp[-3*1000]+exp[-3*4443])] -> -266.999993855806522267194565420933791813296828742310997510523
The difference between this number and the exponent above is +1.7053025658242404e-13, so the approximation we made at the denominator was fine.
The final result is
'exp(-266.99999385580668) = 1.1050349147204485e-116
From wolfram alpha is (link)
1.105034914720621496.. × 10^-116 # Wolfram alpha.
and again, it is safe to use numpy here too.
You can use np.logaddexp (which implements the idea in #gg349's answer):
In [33]: d = np.array([[1089, 1093]])
In [34]: e = np.array([[1000, 4443]])
In [35]: log_res = np.logaddexp(-3*d[0,0], -3*d[0,1]) - np.logaddexp(-3*e[0,0], -3*e[0,1])
In [36]: log_res
Out[36]: -266.99999385580668
In [37]: res = exp(log_res)
In [38]: res
Out[38]: 1.1050349147204485e-116
Or you can use scipy.special.logsumexp:
In [52]: from scipy.special import logsumexp
In [53]: res = np.exp(logsumexp(-3*d) - logsumexp(-3*e))
In [54]: res
Out[54]: 1.1050349147204485e-116

Normalize Small Probabilities in Python

I have a list of probabilities, which I need to normalize to equal 1.0.
e.g. probs = [0.01,0.03,0.005]
I realize that this is done by dividing each probability by the sum of probs. However, if the probabilities become really small, Python will tell me that sum(probs)=0.0. I understand that this is an underflow issue. I suppose I should use the log of each probability. How would I do this?
The sum of even very small floating point values will never truly be 0; they may be close to zero, but can never be exactly zero.
Just divide 1 by their sum, and multiply the probabilities by that factor:
def normalize(probs):
prob_factor = 1 / sum(probs)
return [prob_factor * p for p in probs]
Some probabilities may make up but a very small percentage in the total sum, of course, and that percentage may approach zero. But this just means that when normalising you may end up with normalized probabilities that are either very close to zero, or if smaller than the smallest representable floating point value, equal to zero. The latter only happens if there are probabilities in the list that are so much smaller than the others that they no longer represent anything close to something that'll ever occur.
Demo:
>>> def normalize(probs):
... prob_factor = 1 / sum(probs)
... return [prob_factor * p for p in probs]
...
>>> normalize([0.0000000001,0.000000000003,0.000000000000005])
[0.9708266589000533, 0.029124799767001597, 4.854133294500266e-05]
And the extreme case:
>>> import sys
>>> normalize([sys.float_info.max, sys.float_info.min])
[0.9999999999999999, 0.0]
>>> normalize([sys.float_info.max, sys.float_info.min])[-1] == 0
True
You can always use a scale factor to avoid the underflow problem, either manually entered or automatically calculated, e.g.:
import math
no_z = ([x for x in probs if x > 0.0])
if len(no_z) == 0:
print "Unable to calculate with 0.0 as all the probabilities"
order = int(-math.log10(min(no_z)))
if order > 0:
order = 0
sf = 10**order
scaled = [x * sf for x in probs]
tot = sum(scaled)
norm = [x/tot for x in scaled]
Of course you would probably be better off just using bigfloat or numpy and doing high precision maths.

Why is sin(180) not zero when using python and numpy?

Does anyone know why the below doesn't equal 0?
import numpy as np
np.sin(np.radians(180))
or:
np.sin(np.pi)
When I enter it into python it gives me 1.22e-16.
The number π cannot be represented exactly as a floating-point number. So, np.radians(180) doesn't give you π, it gives you 3.1415926535897931.
And sin(3.1415926535897931) is in fact something like 1.22e-16.
So, how do you deal with this?
You have to work out, or at least guess at, appropriate absolute and/or relative error bounds, and then instead of x == y, you write:
abs(y - x) < abs_bounds and abs(y-x) < rel_bounds * y
(This also means that you have to organize your computation so that the relative error is larger relative to y than to x. In your case, because y is the constant 0, that's trivial—just do it backward.)
Numpy provides a function that does this for you across a whole array, allclose:
np.allclose(x, y, rel_bounds, abs_bounds)
(This actually checks abs(y - x) < abs_ bounds + rel_bounds * y), but that's almost always sufficient, and you can easily reorganize your code when it's not.)
In your case:
np.allclose(0, np.sin(np.radians(180)), rel_bounds, abs_bounds)
So, how do you know what the right bounds are? There's no way to teach you enough error analysis in an SO answer. Propagation of uncertainty at Wikipedia gives a high-level overview. If you really have no clue, you can use the defaults, which are 1e-5 relative and 1e-8 absolute.
One solution is to switch to sympy when calculating sin's and cos's, then to switch back to numpy using sp.N(...) function:
>>> # Numpy not exactly zero
>>> import numpy as np
>>> value = np.cos(np.pi/2)
6.123233995736766e-17
# Sympy workaround
>>> import sympy as sp
>>> def scos(x): return sp.N(sp.cos(x))
>>> def ssin(x): return sp.N(sp.sin(x))
>>> value = scos(sp.pi/2)
0
just remember to use sp.pi instead of sp.np when using scos and ssin functions.
Faced same problem,
import numpy as np
print(np.cos(math.radians(90)))
>> 6.123233995736766e-17
and tried this,
print(np.around(np.cos(math.radians(90)), decimals=5))
>> 0
Worked in my case. I set decimal 5 not lose too many information. As you can think of round function get rid of after 5 digit values.
Try this... it zeros anything below a given tiny-ness value...
import numpy as np
def zero_tiny(x, threshold):
if (x.dtype == complex):
x_real = x.real
x_imag = x.imag
if (np.abs(x_real) < threshold): x_real = 0
if (np.abs(x_imag) < threshold): x_imag = 0
return x_real + 1j*x_imag
else:
return x if (np.abs(x) > threshold) else 0
value = np.cos(np.pi/2)
print(value)
value = zero_tiny(value, 10e-10)
print(value)
value = np.exp(-1j*np.pi/2)
print(value)
value = zero_tiny(value, 10e-10)
print(value)
Python uses the normal taylor expansion theory it solve its trig functions and since this expansion theory has infinite terms, its results doesn't reach exact but it only approximates.
For e.g
sin(x) = x - x³/3! + x⁵/5! - ...
=> Sin(180) = 180 - ... Never 0 bout approaches 0.
That is my own reason by prove.
Simple.
np.sin(np.pi).astype(int)
np.sin(np.pi/2).astype(int)
np.sin(3 * np.pi / 2).astype(int)
np.sin(2 * np.pi).astype(int)
returns
0
1
0
-1

Categories