Resolving Zeros in Product of items in list - python

Given that we can easily convert between the product of the items in a list and the sum of the logarithms of the items if there are no 0s in the list, e.g.:
>>> import math
>>> from functools import reduce  # a builtin in Python 2
>>> from operator import mul
>>> pn = [0.4, 0.3, 0.2, 0.1]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.22133638394006433
>>> math.exp(sum(0.25 * math.log(p) for p in pn))
0.22133638394006436
How should we handle cases where there are 0s in the list, in Python (in a programmatically and mathematically correct way)?
More specifically, how should we handle cases like:
>>> pn = [0.4, 0.3, 0, 0]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.0
>>> math.exp(sum(1./len(pn) * math.log(p) for p in pn))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
ValueError: math domain error
Is returning 0 really the right way to handle this? What is an elegant solution that takes the 0s in the list into account without simply returning 0?
Since this is a kind of geometric average (a product over the list), it's not very useful to return 0 just because there is a single 0 in the list.
Spill-over from Math StackExchange:
https://math.stackexchange.com/questions/1727497/resolving-zeros-in-product-of-items-in-list. No answer from the math people; maybe the Python/code Jedis have better ideas for resolving this.

TL;DR: Yes, returning 0 is the only right way.
(But see Conclusion.)
Mathematical background
In real analysis (i.e. not for complex numbers), when logarithms are considered, we traditionally assume the domain of log is the positive real numbers. We have the identity:
x = exp(log(x)), for x>0.
It can be naturally extended to x=0 since the limit of the right hand side expression is well defined as x->0+ and equal to 0. Moreover, it's legit to set log(0)=-inf and exp(-inf)=0 (again: only for real, not complex, numbers). Formally, we extend the set of real numbers by adding two elements -inf, +inf and defining consistent arithmetic etc. (For our purposes, we need inf + x = inf, x * inf = inf for a real x, inf + inf = inf, etc.)
The other identity x = log(exp(x)) is less troublesome and holds for all real numbers (and even x=-inf or +inf).
Geometric mean
The geometric mean can be defined for nonnegative numbers (possibly equal to zeros). For two numbers a, b (it naturally generalizes to more numbers, so I'll be using only two further on), it is
gm(a,b) = sqrt(a*b), for a,b >= 0.
Certainly, gm(0,b)=0. Taking log, we get:
log(gm(a,b)) = (log(a) + log(b))/2
and it is well defined if a or b is zero. (We can plug in log(0) = -inf and the identity still holds true thanks to the extended arithmetic we defined earlier.)
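A minimal Python sketch of this extended arithmetic (note that math.log(0) raises ValueError rather than returning -inf, so the mapping to -inf is done by hand here):
import math

def geometric_mean(pn):
    # Sum of logs with log(0) mapped to -inf; exp(-inf) is 0.0, so a
    # single zero in the list drives the mean to 0, as expected.
    log_sum = sum(math.log(p) if p > 0 else float('-inf') for p in pn)
    return math.exp(log_sum / len(pn))

print(geometric_mean([0.4, 0.3, 0.2, 0.1]))  # ~0.2213
print(geometric_mean([0.4, 0.3, 0, 0]))      # 0.0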
Interpretation
Not surprisingly, the notion of the geometric mean hails from geometry and was originally (in ancient Greece) used for strictly positive numbers.
Suppose we have a rectangle with sides of lengths a and b. Find a square whose area equals the area of the rectangle. It is easy to see that the side of the square is the geometric mean of a and b.
Now, if we take a = 0, then we don't really have a rectangle, and this geometric interpretation breaks. Similar problems can arise with other interpretations. We can mitigate this by considering, for example, degenerate rectangles and squares, but it may not always be a plausible approach.
Conclusion
It's up to a user (mathematician, engineer, programmer) how she understands the meaning of a geometric mean being zero. If it causes serious problems with interpretation of the results or breaks a computer program, then in the first place, maybe the choice of the geometric mean was not justified as a mathematical model.
Python
As already mentioned in the other answers, Python has infinity implemented. NumPy raises a runtime warning (divide by zero) when executing np.log(0), but the result of np.exp(np.log(0)) is correct: 0.0.
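For instance (a small sketch; np.errstate is used here only to silence the expected warning):
import numpy as np

with np.errstate(divide='ignore'):  # np.log(0) warns: divide by zero
    print(np.log(0))                # -inf
    print(np.exp(np.log(0)))        # 0.0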

Whether or not 0 is the correct result depends on what you're trying to accomplish. ptrj did a great job with their answer, so I will only add one thing to consider.
You may want to consider using an epsilon-adjusted geometric mean. Whereas a standard geometric mean is of the form (a_1*a_2*...*a_n)^(1/n), the epsilon-adjusted geometric mean is of the form ( (a_1+e)*(a_2+e)*...*(a_n+e) )^(1/n) - e. The appropriate value for epsilon (e) depends again on your task.
Epsilon-adjusted geometric means are sometimes used in information retrieval, where a 0 in the set shouldn't cause a record's score to vanish entirely, though it should still penalize the record's score. See, for example, Score Aggregation Techniques in Retrieval Experimentation.
For example, with your data and an epsilon adjustment of 0.01:
>>> from functools import reduce
>>> from operator import mul
>>> pn=[0.4, 0.3, 0, 0]
>>> e=0.01
>>> pow(reduce(mul, [x+e for x in pn], 1), 1./len(pn)) - e
0.04970853116594962

You should return -math.inf in Python 3.5+, or -float('inf') in older versions. This is because the logarithm of numbers very close to 0 goes to negative infinity. This float value will preserve the correct inequalities between the sums of logs of different lists; for instance, one would expect that
sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1])
This inequality holds if you return negative infinity.
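A sketch of such a sumlog helper (the name comes from this answer; the implementation is an assumption):
import math

def sumlog(pn):
    # log(0) is taken to be -inf, so a single zero drives the sum to -inf
    return sum(math.log(p) if p > 0 else -math.inf for p in pn)

print(sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1]))  # True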

You can try using list comprehensions in Python. They can be very powerful for customising the way your data is handled. This example uses a list comprehension and an error number of -999.
>>> import math
>>> pn = [0.4, 0.3, 0, 0]
>>> [math.log(i) if i > 0 else -999 for i in pn]
[-0.916290731874155, -1.2039728043259361, -999, -999]
If you're only using the if and not the else, then the if goes after the for i in pn part, as shown below.
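For instance, the filter-only form drops the zeros instead of substituting for them:
>>> [math.log(i) for i in pn if i > 0]
[-0.916290731874155, -1.2039728043259361]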

Related

When I take an nth root in Python and NumPy, which of the n existing roots do I actually get?

The fundamental theorem of algebra entails the existence of n complex roots for the equation z^n = a, where a is a real number, n is a positive integer, and z is a complex number. Some roots may also be real in addition to complex (i.e. a+bi with b=0).
One example where there are multiple real roots is z^2 = 1, where we obtain z = ±sqrt(1) = ±1. The solution z = 1 is immediate. The solution z = -1 is obtained from z = sqrt(1) = sqrt(-1 * -1) = i * i = -1, where i is the imaginary unit.
In Python/NumPy (as well as many other programming languages and packages) only a single value is returned. Here are two examples for 5^(1/3), which has 3 roots.
>>> 5 ** (1 / 3)
1.7099759466766968
>>> import numpy as np
>>> np.power(5, 1/3)
1.7099759466766968
It is not a problem for my use case that only one of the possible roots is returned, but it would be informative to know 'which' root is systematically calculated in the contexts of Python and NumPy. Perhaps there is an (ISO) standard stating which root should be returned, or perhaps there is a commonly-used algorithm that happens to return a specific root. I could imagine a convention such as "the maximum of the real-valued solutions", but I do not know.
Question: When I take an nth root in Python and NumPy, which of the n existing roots do I actually get?
Since typically the identity xᵃ = exp(a⋅log(x)) is used to define the general power, you'll get the root corresponding to the chosen branch cut of the complex logarithm.
With regards to this, the numpy documentation says:
For real-valued input data types, log always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, log is a complex analytical function that has a branch cut [-inf, 0] and is continuous from above on it. log handles the floating-point negative zero as an infinitesimal negative number, conforming to the C99 standard.
So for example, np.power(-1 +0j, 1/3) = 0.5 + 0.866j = np.exp(np.log(-1+0j)/3).
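A short sketch contrasting real and complex input (the printed values are approximate):
import numpy as np

print(np.power(5.0, 1/3))       # ~1.70998, the real principal root
with np.errstate(invalid='ignore'):
    print(np.power(-1.0, 1/3))  # nan: a negative real has no real log
print(np.power(-1 + 0j, 1/3))   # ~(0.5+0.866j): principal branch, not -1.0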

Logarithm over x

Since the following expansion for the logarithm holds:
log(1-x)=-x-x^2/2-x^3/3-...
one can calculate the following functions, which have removable singularities at x = 0:
log(1-x)/x=-1-x/2-...
(log(1-x)/x+1)/x=-1/2-x/3-...
((log(1-x)/x+1)/x+1/2)/x=-1/3-x/4-...
I am trying to use NumPy for these calculations, and specifically the log1p function, which is accurate near x=0. However, convergence for the aforementioned functions is still problematic.
Do you have any ideas for existing functions implementing these formulas, or should I write one myself using the previous expansions (which would be less efficient)?
The simplest thing to do is something like
import numpy as np

def logf(x, eps=1e-6):
    # Near zero, use the truncated series; elsewhere, the exact formula.
    if abs(x) < eps:
        return -0.5 - x/3.
    else:
        return (1. + np.log1p(-x)/x) / x
and play a bit with the threshold eps.
If you want a numpy-like, vectorized solution, replace the if with np.where:
>>> np.where(np.abs(x) > eps, (1. + np.log1p(-x)/x) / x, -0.5 - x/3.)
Why not successively square the candidate, after initially extracting the exponent component? Whenever the square results in a number greater than 2, divide it by two and set the bit in the mantissa of your result that corresponds to the iteration. This is a much quicker and simpler way of determining log base 2, which can then be transformed to base e or base 10 with a single multiplication.
Some predefined functions don't work at singularity points. One simple-minded solution is to compute the series by adding the terms yourself.
For your example, the partial sum for log(1-x) would be:
s = 0.0
for k in range(1, n + 1):
    s += x**k / k
s = -s   # log(1-x) = -(x + x**2/2 + ... + x**n/n + ...)
Then you keep adding terms until the last term is below a small threshold.
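A self-contained sketch of that stopping rule (the function name and tolerance are assumptions; the series converges only for |x| < 1):
def log1m_series(x, tol=1e-12, max_terms=1000000):
    # Approximate log(1 - x) by summing -x**k / k until a term is tiny.
    s = 0.0
    for k in range(1, max_terms + 1):
        term = x**k / k
        s -= term
        if abs(term) < tol:
            break
    return s

print(log1m_series(0.1))  # ~-0.10536, i.e. log(0.9)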

Extremely low values from NumPy

I am attempting to do a few different operations in Numpy (mean and interp), and with both operations I am getting the result 2.77555756156e-17 at various times, usually when I'm expecting a zero. Even attempting to filter these out with array[array < 0.0] = 0.0 fails to remove the values.
I assume there's some sort of underlying data type or environment error that's causing this. The data should all be float.
Edit: It's been helpfully pointed out that I was only filtering out the values of -2.77555756156e-17 but still seeing positive 2.77555756156e-17. The crux of the question is what might be causing these wacky values to appear when doing simple functions like interpolating values between 0-10 and taking a mean of floats in the same range, and how can I avoid it without having to explicitly filter the arrays after every statement.
You're running into numerical precision, which is a huge topic in numerical computing; when you do any computation with floating point numbers, you run the risk of running into tiny values like the one you've posted here. What's happening is that your calculations are resulting in values that can't quite be expressed with floating-point numbers.
Floating-point numbers are expressed with a fixed amount of information (in Python, this amount defaults to 64 bits). You can read more about how that information is encoded on the very good Floating point Wikipedia page. In short, some calculation that you're performing in the process of computing your mean produces an intermediate value that cannot be precisely expressed.
This isn't a property of numpy (and it's not even really a property of Python); it's really a property of the computer itself. You can see this in normal Python by playing around in the REPL:
>>> repr(3.0)
'3.0'
>>> repr(3.0 + 1e-10)
'3.0000000001'
>>> repr(3.0 + 1e-18)
'3.0'
For the last result, you would expect 3.000000000000000001, but that number can't be expressed in a 64-bit floating point number, so the computer uses the closest approximation, which in this case is just 3.0. If you were trying to average the following list of numbers:
[3., -3., 1e-18]
Depending on the order in which you summed them, you could get 1e-18 / 3., which is the "correct" answer, or zero. You're in a slightly stranger situation; two numbers that you expected to cancel didn't quite cancel out.
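For instance, a quick sketch of that order dependence:
>>> sum([3.0, -3.0, 1e-18])   # the large values cancel first
1e-18
>>> sum([3.0, 1e-18, -3.0])   # 1e-18 is absorbed into 3.0 and lost
0.0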
This is just a fact of life when you're dealing with floating point mathematics. The common way of working around it is to eschew the equals sign entirely and to only perform "numerically tolerant comparison", which means equality-with-a-bound. So this check:
a == b
Would become this check:
abs(a - b) < TOLERANCE
For some tolerance amount. The tolerance depends on what you know about your inputs and the precision of your computer; if you're using a 64-bit machine, you want this to be at least 1e-10 times the largest amount you'll be working with. For example, if the biggest input you'll be working with is around 100, it's reasonable to use a tolerance of 1e-8.
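Python's standard library has this built in as math.isclose (the default rel_tol is 1e-9):
>>> import math
>>> 0.1 + 0.2 == 0.3
False
>>> math.isclose(0.1 + 0.2, 0.3)
True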
You can round your values to 15 digits:
a = a.round(15)
Now the array a should show you 0.0 values.
Example:
>>> a = np.array([2.77555756156e-17])
>>> a.round(15)
array([ 0.])
This is most likely the result of floating point arithmetic errors. For instance:
In [3]: 0.1 + 0.2 - 0.3
Out[3]: 5.551115123125783e-17
Not what you would expect? NumPy has a built-in isclose() function that can deal with these things. Also, you can see the machine precision with
eps = np.finfo(np.float64).eps
So, perhaps something like this could work too:
a = np.array([[-1e-17, 1.0], [1e-16, 1.0]])
a[np.abs(a) <= eps] = 0.0
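For example (np.isclose uses a default absolute tolerance of 1e-8):
>>> import numpy as np
>>> np.isclose(0.1 + 0.2 - 0.3, 0.0)
True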

Why is the value of 1**Inf equal to 1, not NaN?

Why is 1**Inf == 1 ?
I believe it should be NaN, just like Inf-Inf or Inf/Inf.
How is exponentiation implemented on floats in Python?
exp(y*log(x)) would give the 'correct' result :/
You are right: mathematically, the limit form 1^∞ is indeterminate.
However, Python doesn't follow the maths exactly in this case. The documentation of math.pow says:
math.pow(x, y)
Return x raised to the power y. Exceptional cases follow Annex ‘F’ of the C99 standard as far as possible. In particular, pow(1.0, x) and pow(x, 0.0) always return 1.0, even when x is a zero or a NaN.
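A quick sketch of those edge cases:
>>> import math
>>> math.pow(1.0, float('inf'))
1.0
>>> math.pow(float('nan'), 0.0)
1.0
>>> 1 ** float('inf')
1.0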
Floating-point arithmetic is not real-number arithmetic. Notions of "correct" informed by real analysis do not necessarily apply to floating-point.
In this case, however, the trouble is just that pow fundamentally represents two similar but distinct functions:
Exponentiation with an integer power, which is naturally a function ℝ × ℤ → ℝ (or ℝ × ℕ → ℝ).
The two-variable complex function given by pow(x,y) = exp(y * log(x)) restricted to the real line.
These functions agree for normal values, but differ in their edge cases at zero, infinity, and along the negative real axis (which is traditionally the branch cut for the second function).
These two functions are sometimes divided up to make the edge cases more reasonable; when that's done the first function is called pown and the second is called powr; as you have noticed pow is a conflation of the two functions, and uses the edge cases for these values that come from pown.
Technically, 1^inf can be read as lim(1^x, x->inf). Since 1^x = 1 for any finite x, this is lim(1, x->inf) = 1, not NaN.

In what contexts do programming languages make real use of an Infinity value?

So in Ruby there is a trick to specify infinity:
1.0/0
=> Infinity
I believe in Python you can do something like this
float('inf')
These are just examples though, I'm sure most languages have infinity in some capacity. When would you actually use this construct in the real world? Why would using it in a range be better than just using a boolean expression? For instance
(0..1.0/0).include?(number) == (number >= 0) # True for all values of number
=> true
To summarize, what I'm looking for is a real world reason to use Infinity.
EDIT: I'm looking for real world code. It's all well and good to say this is when you "could" use it, when have people actually used it.
Dijkstra's Algorithm typically assigns infinity as the initial distance estimate for every node in the graph. This doesn't have to be "infinity", just some arbitrarily large constant, but in Java I typically use Double.POSITIVE_INFINITY. I assume Ruby could be used similarly.
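A minimal Python sketch of that initialization (the adjacency format {node: [(neighbor, weight), ...]} is an assumption, not any particular library's API):
import heapq

def dijkstra(graph, source):
    # Every distance starts at infinity, meaning "not yet reachable".
    dist = {node: float('inf') for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist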
Off the top of my head, it can be useful as an initial value when searching for a minimum value.
For example:
min_val = float('inf')
for x in somelist:
    if x < min_val:
        min_val = x
which I prefer to initializing the minimum with the first value of somelist.
Of course, in Python, you should just use the min() built-in function in most cases.
There seems to be an implied "Why does this functionality even exist?" in your question. And the reason is that Ruby and Python are just giving access to the full range of values that one can specify in floating point form as specified by IEEE.
This page seems to describe it well:
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
As a result, you can also have NaN (Not-a-number) values and -0.0, while you may not immediately have real-world uses for those either.
In some physics calculations you can normalize irregularities (i.e., infinite quantities) of the same order against each other, cancelling them both and allowing an approximate result to come through.
When you deal with limits, calculations like (infinity / infinity) approaching a finite number can be achieved. It's useful for the language to be able to override the regular divide-by-zero error.
Use Infinity and -Infinity when implementing a mathematical algorithm that calls for it.
In Ruby, Infinity and -Infinity have nice comparative properties so that -Infinity < x < Infinity for any real number x. For example, Math.log(0) returns -Infinity, extending to 0 the property that x > y implies that Math.log(x) > Math.log(y). Also, Infinity * x is Infinity if x > 0, -Infinity if x < 0, and 'NaN' (not a number; that is, undefined) if x is 0.
For example, I use the following bit of code in part of the calculation of some log likelihood ratios. I explicitly reference -Infinity to define a value even if k is 0 or n AND x is 0 or 1.
Infinity = 1.0/0.0

def Similarity.log_l(k, n, x)
  unless x == 0 or x == 1
    k * Math.log(x.to_f) + (n - k) * Math.log(1.0 - x)
  else
    -Infinity
  end
end
Alpha-beta pruning
I use it to specify the mass and inertia of a static object in physics simulations. Static objects are essentially unaffected by gravity and other simulation forces.
In Ruby, infinity can be used to implement lazy lists. Say I want N numbers starting at 200 which get successively larger by 100 units each time:
Inf = 1.0 / 0.0
(200..Inf).step(100).take(N)
More info here: http://banisterfiend.wordpress.com/2009/10/02/wtf-infinite-ranges-in-ruby/
I've used it for cases where you want to define ranges of allowed values.
For example, in 37signals apps you have a limit on the number of projects:
Infinity = 1 / 0.0
FREE = 0..1
BASIC = 0..5
PREMIUM = 0..Infinity
then you can do checks like
if PREMIUM.include? current_user.projects.count
# do something
end
I used it for representing camera focus distance, and to my surprise, in Python:
>>> float("inf") is float("inf")
False
>>> float("inf") == float("inf")
True
I wonder why that is. (Presumably because is compares object identity, and each call to float("inf") creates a new float object, while == compares the values, which are equal.)
I've used it in the minimax algorithm. When I'm generating new moves, if the min player wins on that node then the value of the node is -∞. Conversely, if the max player wins then the value of that node is +∞.
Also, if you're generating nodes/game states and then trying out several heuristics, you can initialize all the node values to -∞/+∞, whichever makes sense, and then when you're running a heuristic it's easy to update the node value:
node_val = float('-inf')
node_val = max(heuristic1(node), node_val)
node_val = max(heuristic2(node), node_val)
node_val = max(heuristic3(node), node_val)
I've used it in a DSL similar to Rails' has_one and has_many:
has 0..1 :author
has 0..INFINITY :tags
This makes it easy to express concepts like Kleene star and plus in your DSL.
I use it when I have a Range object where one or both ends need to be open
I've used symbolic values for positive and negative infinity in dealing with range comparisons to eliminate corner cases that would otherwise require special handling:
Given two ranges A=[a,b) and C=[c,d) do they intersect, is one greater than the other, or does one contain the other?
A > C iff a >= d
A < C iff b <= c
etc...
If you have values for positive and negative infinity that respectively compare greater than and less than all other values, you don't need to do any special handling for open-ended ranges. Since floats and doubles already implement these values, you might as well use them instead of trying to find the largest/smallest values on your platform. With integers, it's more difficult to use "infinity" since it's not supported by hardware.
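A small Python sketch of that trick (the function name and half-open convention are assumptions):
INF = float('inf')

def entirely_greater(a_lo, a_hi, c_lo, c_hi):
    # A = [a_lo, a_hi) is entirely greater than C = [c_lo, c_hi) iff a_lo >= c_hi
    return a_lo >= c_hi

# Open-ended ranges need no special cases:
print(entirely_greater(5, INF, -INF, 5))   # True:  [5, inf) > (-inf, 5)
print(entirely_greater(-INF, 0, 0, INF))   # False: (-inf, 0) < [0, inf)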
I ran across this because I'm looking for an "infinite" value to set for a maximum, if a given value doesn't exist, in an attempt to create a binary tree. (Because I'm selecting based on a range of values, and not just a single value, I quickly realized that even a hash won't work in my situation.)
Since I expect all numbers involved to be positive, the minimum is easy: 0. Since I don't know what to expect for a maximum, though, I would like the upper bound to be Infinity of some sort. This way, I won't have to figure out what "maximum" I should compare things to.
Since this is a project I'm working on at work, it's technically a "real world problem". It may be kind of rare, but like a lot of abstractions, it's convenient when you need it!
Also, to those who say that this (and other examples) are contrived, I would point out that all abstractions are somewhat contrived; that doesn't mean they aren't useful when you need them.
When working in a problem domain where trig is used (especially tangent) infinity is an answer that can come up. Trig ends up being used heavily in graphics applications, games, and geospatial applications, plus the obvious math applications.
I'm sure there are other ways to do this, but you could use Infinity to check for reasonable inputs in a String-to-Float conversion. In Java, at least, the Float.isNaN() static method will return false for numbers with infinite magnitude, indicating they are valid numbers, even though your program might want to classify them as invalid. Checking against the Float.POSITIVE_INFINITY and Float.NEGATIVE_INFINITY constants solves that problem. For example:
// Some sample values to test our code with
String[] stringValues = {
    "-999999999999999999999999999999999999999999999",
    "12345",
    "999999999999999999999999999999999999999999999"
};

// Loop through each string representation
for (String stringValue : stringValues) {
    // Convert the string representation to a Float representation
    Float floatValue = Float.parseFloat(stringValue);
    System.out.println("String representation: " + stringValue);
    System.out.println("Result of isNaN: " + floatValue.isNaN());

    // Check the result for positive infinity, negative infinity, and
    // "normal" float numbers (within the defined range for Float values).
    if (floatValue == Float.POSITIVE_INFINITY) {
        System.out.println("That number is too big.");
    } else if (floatValue == Float.NEGATIVE_INFINITY) {
        System.out.println("That number is too small.");
    } else {
        System.out.println("That number is jussssst right.");
    }
}
Sample Output:
String representation: -999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too small.
String representation: 12345
Result of isNaN: false
That number is jussssst right.
String representation: 999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too big.
It is used quite extensively in graphics. For example, any pixel in a 3D image that is not part of an actual object is marked as infinitely far away. So that it can later be replaced with a background image.
I'm using a network library where you can specify the maximum number of reconnection attempts. Since I want mine to reconnect forever:
my_connection = ConnectionLibrary(max_connection_attempts = float('inf'))
In my opinion, it's more clear than the typical "set to -1 to retry forever" style, since it's literally saying "retry until the number of connection attempts is greater than infinity".
Some programmers use Infinity or NaNs to show a variable has never been initialized or assigned in the program.
If you want the largest number from user input, the user might enter very large negatives. Starting the running maximum at -inf means that even -13543124321.431 is correctly recorded as the largest number seen so far, since it's bigger than -inf.
initial_value = float('-inf')
while True:
    try:
        x = input('gimmee a number or type the word, stop ')
    except KeyboardInterrupt:
        print("we done - by yo command")
        break
    if x == "stop":
        print("we done")
        break
    try:
        x = float(x)
    except ValueError:
        print('not a number')
        continue
    if x > initial_value:
        initial_value = x
print("The largest number is: " + str(initial_value))
You can also use:
import decimal
decimal.Decimal("Infinity")
or:
from decimal import *
Decimal("Infinity")
For sorting
I've seen it used as a sort value, to say "always sort these items to the bottom".
To specify a non-existent maximum
If you're dealing with numbers, nil represents an unknown quantity, and should be preferred to 0 for that case. Similarly, Infinity represents an unbounded quantity, and should be preferred to (arbitrarily_large_number) in that case.
I think it can make the code cleaner. For example, I'm using Float::INFINITY in a Ruby gem for exactly that: the user can specify a maximum string length for a message, or they can specify :all. In that case, I represent the maximum length as Float::INFINITY, so that later when I check "is this message longer than the maximum length?" the answer will always be false, without needing a special case.
