I'm defining a function that calculates the standard deviation of a list. Sometimes the quantity my function takes the square root of comes out negative, so it raises an error.
This seems simple, but I can't figure it out. I want to add a conditional to my function: if the value is negative, multiply it by -1, since the square root of a negative number cannot be taken.
How can I write this statement?
def stdevValue(lst):
    """calculates the standard deviation of a list of numbers
    input: list of numbers
    output: float that is the standard deviation
    """
    stdev = 0
    stdevCalc = ((sum(lst) - meanValue(lst)) / (len(lst) - 1)) ** 0.5
    stdev += stdevCalc
    return stdev
You appear to have misapplied the formula for standard deviation. You shouldn't need to handle the case of square root of negative numbers at all. You need to square each difference between the value and the mean before summing, like this:
def stdevValue(lst):
    m = meanValue(lst)  # wherever this comes from
    return (sum((x - m) ** 2 for x in lst) / len(lst)) ** 0.5
This ensures that the sum is nonnegative, so you can take the square root without being concerned about negative values. (If you want sample standard deviation, divide by (len(lst) - 1)).
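For example, a minimal sketch of the sample version (with the mean inlined rather than taken from a meanValue helper; the name sampleStdevValue is just for illustration):

def sampleStdevValue(lst):
    m = sum(lst) / len(lst)  # mean, inlined here instead of meanValue()
    # Bessel's correction: divide by len(lst) - 1 instead of len(lst)
    return (sum((x - m) ** 2 for x in lst) / (len(lst) - 1)) ** 0.5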
See the Wikipedia article on Standard Deviation for more information and examples.
Ignoring the context of the question, the answer is to use the built-in abs. But if you have a mathematical expression of the form y = sqrt(x), and y cannot be complex, then x cannot be negative. Any negative x signals a problem, which could be rounding, wraparound, or, as in your case, an incorrect formula. Simply multiplying by -1, or taking abs, will not fix the problem; it will give you the wrong answer. You should consider how to deal with these cases (although I appreciate that for standard deviation such errors are unlikely to arise).
If you want to really get creative, you can square and square-root:
>>> import math
>>> x = -5
>>> math.sqrt(x ** 2)
5.0
The fundamental theorem of algebra entails the existence of n complex roots of the equation z^n = a, where a is a real number, n is a positive integer, and z is a complex number. Some of the roots may also be real in addition to complex (i.e. of the form a+bi with b=0).
One example where there are multiple real roots is z^2 = 1, where we obtain z = ±sqrt(1) = ±1. The solution z = 1 is immediate. The solution z = -1 is obtained from z = sqrt(1) = sqrt(-1 * -1) = i * i = -1, where i is the imaginary unit.
In Python/NumPy (as well as many other programming languages and packages) only a single value is returned. Here are two examples for 5^(1/3), which has 3 roots.
>>> 5 ** (1 / 3)
1.7099759466766968
>>> import numpy as np
>>> np.power(5, 1/3)
1.7099759466766968
It is not a problem for my use case that only one of the possible roots is returned, but it would be informative to know which root is systematically calculated in Python and NumPy. Perhaps there is an (ISO) standard stating which root should be returned, or perhaps there is a commonly used algorithm that happens to return a specific root. I've imagined an equivalence class such as "the maximum of the real-valued solutions", but I do not know.
Question: When I take an nth root in Python and NumPy, which of the n existing roots do I actually get?
Since typically the identity xᵃ = exp(a⋅log(x)) is used to define the general power, you'll get the root corresponding to the chosen branch cut of the complex logarithm.
With regards to this, the numpy documentation says:
For real-valued input data types, log always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, log is a complex analytical function that has a branch cut [-inf, 0] and is continuous from above on it. log handles the floating-point negative zero as an infinitesimal negative number, conforming to the C99 standard.
So for example, np.power(-1 +0j, 1/3) = 0.5 + 0.866j = np.exp(np.log(-1+0j)/3).
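A small sketch (using only np.power, np.log, and np.exp) shows that both routes land on the same principal root:

import numpy as np

# Both expressions give the principal cube root of -1, approximately
# 0.5 + 0.866j, rather than the real root -1:
print(np.power(-1 + 0j, 1/3))
print(np.exp(np.log(-1 + 0j) / 3))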
I calculate the first derivative using the following code:
import numpy as np

h = 1e-5  # step size; the problem appears when this is made very small

def f(x):
    return np.exp(x)

def dfdx(x):
    Df = (f(x + h) - f(x - h)) / (2 * h)
    return Df
For example, for x == 10 this works fine. But when I set h to around 10E-14 or below, Df starts to get values that are really far away from the expected value f(10), and the relative error between the expected value and Df becomes huge.
Why is that? What is happening here?
The evaluation of f(x) has, at best, a rounding error of |f(x)|*mu, where mu is the machine epsilon of the floating point type. The total error of the central difference formula is thus approximately
2*|f(x)|*mu/(2*h) + |f'''(x)|/6 * h^2
In the present case, the exponential function is equal to all of its derivatives, so that the error is proportional to
mu/h + h^2/6
which has a minimum at h = (3*mu)^(1/3), which for the double format with mu=1e-16 is around h=1e-5.
The precision is increased if, instead of 2*h, the actual difference (x+h)-(x-h) between the evaluation points is used in the denominator. A log-log plot of the distance to the exact derivative against h makes both effects visible.
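A minimal sketch that reproduces this behavior, printing the relative error of the central difference for a range of step sizes (the minimum should appear near h = 1e-5):

import numpy as np

def f(x):
    return np.exp(x)

x = 10.0
exact = np.exp(x)  # the derivative of exp is exp itself
for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    df = (f(x + h) - f(x - h)) / (2 * h)
    print(f"h = {h:.0e}  relative error = {abs(df - exact) / exact:.2e}")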
You are probably encountering numerical instability: for x = 10 and h ≈ 1e-13, the argument of np.exp is very close to 10 whether h is added or subtracted, so the small approximation errors in the value of np.exp are scaled up significantly by the division by the very small 2 * h.
In addition to the answer by @LutzL, I will add some info from the great book Numerical Recipes, 3rd Edition: The Art of Scientific Computing, chapter 5.7 on numerical derivatives, especially about the choice of the optimal h value for a given x:
Always choose h so that h and x differ by an exactly representable number. Funny stuff like 1/3 should be avoided, except when x is equal to something along the lines of 14.3333333.
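A sketch of the trick the book describes (the helper name is mine): route h through a temporary so that the difference between the evaluation points is exactly representable:

def representable_step(x, h):
    # Force h to an exactly representable value relative to x, so that
    # (x + h) - x == h holds exactly in floating point.
    temp = x + h
    return temp - x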
Round-off error is approximately epsilon * |f(x) / h|, where epsilon is the floating point accuracy; Python represents floating point numbers with double precision, so it's 1e-16. It may differ for more complicated functions (where precision errors arise further), though that's not your case.
Choice of optimal h: without getting into details, it would be sqrt(epsilon) * x for the simple forward case, except when x is near zero (you will find more information in the book); in such cases you may want to use larger x values, and a complementary answer has already been provided. In the case of f(x+h) - f(x-h), as in your example, it amounts to epsilon ** (1/3) * x, so approximately 5e-6 times x, which can be a little difficult to choose for small values of x. This is quite close (if one can say so, bearing floating point arithmetic in mind...) to the practical results posted by @LutzL.
You may also use derivative formulas other than the symmetric one you are using: forward or backward evaluation, if the function is costly to evaluate and you have calculated f(x) beforehand, or, if your function is cheap to evaluate, higher-order methods that evaluate it at more points to reduce the truncation error (see the five-point stencil on Wikipedia, as provided in the comment to your question); a sketch of the latter follows below.
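As an example of such a higher-order method, a sketch of the five-point stencil (coefficients as given on the Wikipedia page):

def dfdx_stencil(f, x, h):
    # Five-point stencil: truncation error is O(h**4) instead of O(h**2)
    return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12*h)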
This Python tutorial explains the reason behind the limited precision. In summary, decimals are ultimately represented in binary and the precision is about 17 significant digits. So, you are right that it gets fuzzy beyond 10E-14.
Since the following expansion for the logarithm holds:
log(1-x)=-x-x^2/2-x^3/3-...
one can calculate the following functions, which have removable singularities at x = 0:
log(1-x)/x=-1-x/2-...
(log(1-x)/x+1)/x=-1/2-x/3-...
((log(1-x)/x+1)/x+1/2)/x=-1/3-x/4-...
I am trying to use NumPy for these calculations, and specifically the log1p function, which is accurate near x=0. However, convergence for the aforementioned functions is still problematic.
Do you have any ideas for existing functions implementing these formulas, or should I write one myself using the expansions above, even though that will not be as efficient?
The simplest thing to do is something like
In [17]: from numpy import log1p

In [18]: def logf(x, eps=1e-6):
    ...:     if abs(x) < eps:
    ...:         return -0.5 - x/3.
    ...:     else:
    ...:         return (1. + log1p(-x)/x)/x
and play a bit with the threshold eps.
If you want a numpy-like, vectorized solution, replace the if with np.where:
>>> np.where(np.abs(x) < eps, -0.5 - x/3., (1. + np.log1p(-x)/x) / x)
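One caveat: np.where evaluates both branches over the whole array, so the division can still warn near x = 0. A sketch that silences the spurious warnings with np.errstate:

import numpy as np

x = np.array([0.0, 1e-9, 1e-3, 0.5])
eps = 1e-6
# Both branches are computed for every element, so guard the division:
with np.errstate(divide='ignore', invalid='ignore'):
    y = np.where(np.abs(x) < eps, -0.5 - x/3., (1. + np.log1p(-x)/x) / x)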
Why not successively square the candidate, after initially extracting the exponent component? When the square results in a number greater than 2, divide it by two and set the bit in the mantissa of your result that corresponds to the iteration. This is a much quicker and simpler way of determining log base 2, which can then be transformed to base e or base 10 with a single multiplication.
Some predefined functions don't work at singularity points. One simple-minded solution is to compute the series by summing its terms directly.
For your example, the loop for log(1-x) would be:
s = 0
for k in range(1, n + 1):
    s += x**k / k
s = -s
You keep adding terms either up to some fixed count n or until the last term falls below a small threshold.
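A runnable version of that idea, with a stopping threshold (the name and defaults are my own choice):

def log1m_series(x, tol=1e-16, max_terms=10000):
    # Approximate log(1 - x) for |x| < 1 by summing -(x + x^2/2 + x^3/3 + ...),
    # stopping once the last term drops below tol.
    s = 0.0
    for k in range(1, max_terms + 1):
        term = x**k / k
        s += term
        if abs(term) < tol:
            break
    return -s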
Given that we can easily convert between the product of the items in a list and the sum of the logarithms of the items if there are no 0s in the list, e.g.:
>>> import math
>>> from functools import reduce
>>> from operator import mul
>>> pn = [0.4, 0.3, 0.2, 0.1]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.22133638394006433
>>> math.exp(sum(0.25 * math.log(p) for p in pn))
0.22133638394006436
How should we handle cases where there are 0s in the list, in Python (in a programmatically and mathematically correct way)?
More specifically, how should we handle cases like:
>>> pn = [0.4, 0.3, 0, 0]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.0
>>> math.exp(sum(1./len(pn) * math.log(p) for p in pn))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
ValueError: math domain error
Is returning 0 really the right way to handle this? And what would be an elegant solution that takes the 0s in the list into account without ending up with 0?
Since this is a sort of geometric average (of the product of the list), returning 0 just because there is a single 0 in the list is not exactly useful.
Spill-over from Math StackExchange: https://math.stackexchange.com/questions/1727497/resolving-zeros-in-product-of-items-in-list. No answer from the math people; maybe the Python/code Jedis have better ideas for resolving this.
TL;DR: Yes, returning 0 is the only right way.
(But see Conclusion.)
Mathematical background
In real analysis (i.e. not for complex numbers), when logarithms are considered, we traditionally assume the domain of log is the set of positive real numbers. We have the identity:
x = exp(log(x)), for x>0.
It can be naturally extended to x=0 since the limit of the right hand side expression is well defined at x->0+ and equal to 0. Moreover, it's legit to set log(0)=-inf and exp(-inf)=0 (again: only for real, not complex, numbers). Formally, we extend the set of real numbers by adding two elements -inf, +inf and defining a consistent arithmetic, etc. (For our purposes, we need inf + x = inf for real x, x * inf = inf for positive real x, inf + inf = inf, etc.)
The other identity x = log(exp(x)) is less troublesome and holds for all real numbers (and even x=-inf or +inf).
Geometric mean
The geometric mean can be defined for nonnegative numbers (possibly equal to zeros). For two numbers a, b (it naturally generalizes to more numbers, so I'll be using only two further on), it is
gm(a,b) = sqrt(a*b), for a,b >= 0.
Certainly, gm(0,b)=0. Taking log, we get:
log(gm(a,b)) = (log(a) + log(b))/2
and it is well defined if a or b is zero. (We can plug in log(0) = -inf and the identity still holds true thanks to the extended arithmetic we defined earlier.)
Interpretation
Not surprisingly, the notion of the geometric mean hails from geometry and was originally (in ancient Greece) used for strictly positive numbers.
Suppose we have a rectangle with sides of lengths a and b. Find a square whose area equals the area of the rectangle. It is easy to see that the side of that square is the geometric mean of a and b.
Now, if we take a = 0, we don't really have a rectangle, and this geometric interpretation breaks down. Similar problems can arise with other interpretations. We can mitigate this by considering, for example, degenerate rectangles and squares, but it may not always be a plausible approach.
Conclusion
It's up to a user (mathematician, engineer, programmer) how she understands the meaning of a geometric mean being zero. If it causes serious problems with interpretation of the results or breaks a computer program, then in the first place, maybe the choice of the geometric mean was not justified as a mathematical model.
Python
As already mentioned in the other answers, Python has infinity implemented. NumPy raises a runtime warning (divide by zero) when executing np.exp(np.log(0)), but the result of the operation is correct.
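For example (the exact warning text may vary between NumPy versions):

>>> import numpy as np
>>> np.exp(np.log(0))
__main__:1: RuntimeWarning: divide by zero encountered in log
0.0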
Whether or not 0 is the correct result depends on what you're trying to accomplish. ptrj did a great job with their answer, so I will only add one thing to consider.
You may want to consider using an epsilon-adjusted geometric mean. Whereas a standard geometric mean is of the form (a_1*a_2*...*a_n)^(1/n), the epsilon-adjusted geometric mean is of the form ( (a_1+e)*(a_2+e)*...*(a_n+e) )^(1/n) - e. The appropriate value for epsilon (e) depends again on your task.
Epsilon-adjusted geometric means are sometimes used in data retrieval where a 0 in the set shouldn't cause a record's score to vanish entirely, though it should still penalize the record's score. See for example Score Aggregation Techniques in Retrieval Experimentation.
For example, with your data and an epsilon adjustment of 0.01
>>> from functools import reduce
>>> from operator import mul
>>> pn=[0.4, 0.3, 0, 0]
>>> e=0.01
>>> pow(reduce(mul, [x+e for x in pn], 1), 1./len(pn)) - e
0.04970853116594962
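Wrapped up as a reusable helper (the function name is mine):

from functools import reduce
from operator import mul

def eps_geometric_mean(values, e=0.01):
    # Epsilon-adjusted geometric mean: ((a_1+e)*...*(a_n+e))**(1/n) - e
    return pow(reduce(mul, (v + e for v in values), 1), 1./len(values)) - e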
You should return -math.inf in Python 3.5+, or -float('inf') in older versions. This is because the logarithm of numbers approaching 0 goes to negative infinity. This float value will preserve the correct inequalities between the sums of logs of different lists; for instance, one would expect that
sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1])
This inequality holds if you return negative infinity.
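A sketch of such a sumlog (the name matches the expression above):

import math

def sumlog(pn):
    # Map zeros to -inf so that orderings between lists are preserved
    return sum(math.log(p) if p > 0 else -math.inf for p in pn)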
You can try using list comprehensions in Python. They can be very powerful for customising the way your data is handled. This example uses list comprehension and an error number of -999.
>>> [math.log(i) if i > 0 else -999 for i in pn]
[-0.916290731874155, -1.2039728043259361, -999, -999]
If you're only using the if and not the else, then the if goes after the for i in pn part.
Is there a built-in way to calculate the correctly rounded n-th root of a Python 3 decimal object?
According to the documentation, there is a function power(x, y):

With two arguments, compute x**y. If x is negative then y must be integral. The result will be inexact unless y is integral and the result is finite and can be expressed exactly in 'precision' digits. The result should always be correctly rounded, using the rounding mode of the current thread's context.
This implies that power(x, Decimal(1)/Decimal(n)) should give you what you want (note that the exponent should itself be a Decimal; the decimal context operations accept Decimals and ints, not floats).
You can also take the nth root with
nthRoot = Decimal(x) ** (Decimal(1) / Decimal(n))
Not sure if you consider either of these "built in", as you have to compute the reciprocal of n explicitly to get the nth root.
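For instance, a minimal sketch (raising the working precision above what the final result needs is a common precaution, not something the docs mandate):

from decimal import Decimal, getcontext

getcontext().prec = 50  # work at higher precision than the final result needs
x, n = Decimal(2), 5
root = x ** (Decimal(1) / Decimal(n))  # 5th root of 2
print(root)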