Dealing with floating points, with NumPy in particular [duplicate]

I wanted values like 1, 1.02, 1.04, 1.06, 1.08, and so on, so I used NumPy in Python:
y = [x for x in numpy.arange(1,2,0.02)]
I got these values:
1.0,
1.02,
1.04,
1.0600000000000001,
1.0800000000000001,
I have three questions here:
How do I get exactly the values 1, 1.02, 1.04, 1.06, 1.08, and so on?
Why do I get the correct values for 1.02 and 1.04, but not for 1.06 and 1.08?
How reliable can our programs be when we can't trust such basic operations, in programs that can run into thousands of lines of code and perform many calculations? How do we cope with such scenarios?
There are very similar questions that address problems with floating point in general and with the numpy library in particular:
Is floating point math broken?
Why Are Floating Point Numbers Inaccurate?
While they address why this happens, here I'm more concerned with how to deal with such scenarios in everyday programming, particularly with NumPy in Python. Hence these questions.

First pitfall: do not confuse accuracy with printing policies.
In your example:
In [6]: [x.as_integer_ratio() for x in arange(1,1.1,0.02)]
Out[6]:
[(1, 1),
(2296835809958953, 2251799813685248),
(1170935903116329, 1125899906842624),
(2386907802506363, 2251799813685248),
(607985949695017, 562949953421312),
(2476979795053773, 2251799813685248)]
shows that only 1 has an exact float representation.
In [7]: ['{:1.18f}'.format(f) for f in arange(1,1.1,.02)]
Out[7]:
['1.000000000000000000',
'1.020000000000000018',
'1.040000000000000036',
'1.060000000000000053',
'1.080000000000000071',
'1.100000000000000089']
shows the internal accuracy.
In [8]: arange(1,1.1,.02)
Out[8]: array([ 1. , 1.02, 1.04, 1.06, 1.08, 1.1 ])
shows how NumPy handles printing: values are rounded for display (8 digits of precision by default) and trailing zeros are discarded.
In [9]: [f for f in arange(1,1.1,.02)]
Out[9]: [1.0, 1.02, 1.04, 1.0600000000000001, 1.0800000000000001, 1.1000000000000001]
shows how Python itself prints floats: repr() produces the shortest decimal string (up to 17 significant digits) that round-trips back to the same float.
Some advice for "How do I deal with such scenarios in everyday programming?":
Keep in mind that every operation on floats can degrade accuracy further. Native float64 precision is roughly 1e-16 (relative), which is sufficient for a lot of applications. Subtraction is the most common source of precision loss, as in this example where the exact result is 0:
In [5]: [((1+10**(-exp))-1)/10**(-exp)-1 for exp in range(0,24,4)]
Out[5]:
[0.0,
-1.1013412404281553e-13,
-6.07747097092215e-09,
8.890058234101161e-05,
-1.0,
-1.0]
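As a direct answer to question 1: binary floats cannot store most decimal fractions exactly, so you either round/format the values for display, or switch to decimal arithmetic when you genuinely need exact base-10 values. A minimal sketch of both options (the variable names are just for illustration):
import numpy as np
from decimal import Decimal

# Display-level fix: the stored values are still binary approximations,
# but rounding to 2 decimals makes them print as expected.
y = np.round(np.arange(1, 1.1, 0.02), 2)
print(y.tolist())   # [1.0, 1.02, 1.04, 1.06, 1.08, 1.1]

# Exact base-10 arithmetic: build the grid from decimal string literals.
grid = [Decimal('1') + k * Decimal('0.02') for k in range(6)]
print(grid)   # [Decimal('1.00'), Decimal('1.02'), ..., Decimal('1.10')]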


Behaviour of `__str__` method of type `complex`

So I was typing random stuff in my Python shell, and saw this:
>>> print(-1j)
(-0-1j)
This troubled me, so I continued and tried something else and saw:
>>> print((-0-1j))
-1j
(I do know the extra parentheses are redundant; they're there for comic effect.)
Looking at this immediately reminded me of all the JavaScript type-casting memes and the related questions on SO, so I posted it as a meme on reddit.
But I am still puzzled by this behaviour, which I believe is caused by the complex.__str__ method.
I couldn’t find any documentation related to it, so can someone please explain to me what really is happening?
Python does not have a separate imaginary number type, or a separate Gaussian integer type. All imaginary numbers are represented as complex numbers, and all complex number components are floating point.
IEEE 754 floating point has a -0.0 value, which can be a complex number component. complex.__repr__ will omit a real component of regular 0.0, but not -0.0, and will format integer-valued components without a .0. Unfortunately, that leads to the output you see.
-1j is a - operator applied to 1j, and 1j is a complex number with real part 0.0 and imaginary part 1.0. In other words, 1j is complex(0.0, 1.0).
Negating 1j produces a real part of -0.0 - floating point negative zero - and an imaginary part of -1.0. In other words, -1j is complex(-0.0, -1.0), which displays as -0-1j, displaying the -0.0 but dropping the .0s.
However, integers don't have negative zero, so -0 is just 0. -0-1j subtracts 1j from 0. After converting 0 to complex, both real components are 0.0, so the subtraction produces a real component of 0.0 - 0.0, which is regular 0.0 instead of -0.0. The result is complex(0.0, -1.0), which displays as -1j.
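A quick interactive check (not part of the original answer, just illustrative) confirms the components involved:
>>> z = -1j
>>> z.real, z.imag
(-0.0, -1.0)
>>> -1j
(-0-1j)
>>> complex(0.0, -1.0)
-1j
>>> 0 - 1j   # real part is 0.0 - 0.0 == +0.0, so it is omitted
-1j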

How to avoid bad decimals in np.cumsum? [duplicate]

I have the following list
alist = [0.141, 0.29, 0.585, 0.547, 0.233]
and I find the cumulative summation with
np.cumsum(alist)
array([ 0.141, 0.431, 1.016, 1.563, 1.796])
When I convert this array to a list, long trails of decimal digits show up. How can I avoid them?
list(np.cumsum(alist))
[0.14099999999999999,
0.43099999999999994,
1.016,
1.5630000000000002,
1.7960000000000003]
This may be a duplicate, but I couldn't find the answer.
It's important to understand that floating point numbers are not stored in base 10 decimal format.
Therefore, you have to be crystal clear why you want to remove "the extra decimal places" that you see:
Change formatting to make the numbers look prettier / consistent.
Work with precision as if you are performing base-10 arithmetic.
If the answer is (1), then use np.round.
If the answer is (2), then use Python's decimal module (a sketch follows the np.round example below).
The below example demonstrates that np.round does not change the underlying representation of floats.
import numpy as np
alist = [0.141, 0.29, 0.585, 0.547, 0.233]
lst = np.round(np.cumsum(alist), decimals=3)
print(lst)
# [ 0.141 0.431 1.016 1.563 1.796]
np.set_printoptions(precision=20)
print(lst)
# [ 0.14099999999999998646 0.43099999999999999423 1.01600000000000001421
# 1.56299999999999994493 1.79600000000000004086]
np.set_printoptions(precision=3)
print(lst)
# [ 0.141 0.431 1.016 1.563 1.796]
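And for case (2), a sketch of the decimal-module route, assuming the inputs can be written as decimal strings; itertools.accumulate produces the running sum:
from decimal import Decimal
from itertools import accumulate

alist = [Decimal(s) for s in ('0.141', '0.29', '0.585', '0.547', '0.233')]
print(list(accumulate(alist)))
# [Decimal('0.141'), Decimal('0.431'), Decimal('1.016'),
#  Decimal('1.563'), Decimal('1.796')]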

Matlab range in Python

I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it in Python I simply use :
np.arange(7,8+1)
array([7, 8])
But if I have, let's say :
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic :
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea :
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. Each parameter of these ranges can change depending on the inputs of the function I am working on, so the step can sometimes be 0.2 and sometimes 0.3. So basically, is there a numpy/scipy (or other) function that really behaves like MATLAB's range, or do I have to add a little code myself to make sure my Python range ends at the same number as MATLAB's?
Thanks!
You don't actually need to add your entire step size to the upper limit of np.arange; adding just a very tiny number is enough to make sure that the maximum is included. For example, the machine epsilon:
eps = np.finfo(np.float32).eps
Adding eps gives you the same result as MATLAB in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
The MATLAB docs for linspace say:
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
The numpy arange docs give similar advice:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use linspace for these cases.
End of interval. The interval does not include this value, except
in some cases where step is not an integer and floating point
round-off affects the length of out.
So differences in how the step size gets translated into a number of steps can produce differences in the length of the output. If you need consistency between the two codebases, linspace is the better choice (in both).
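If you want something closer to a drop-in replacement for MATLAB's start:step:stop, one possible sketch (the helper name mrange and the 1e-10 tolerance are choices made for this example, not a library function) computes the number of points first and lets linspace place the endpoints:
import numpy as np

def mrange(start, step, stop):
    # Number of points, MATLAB-style: floor of the span over the step,
    # with a small tolerance so round-off doesn't drop the last point.
    n = int(np.floor((stop - start) / step + 1e-10)) + 1
    return np.linspace(start, start + (n - 1) * step, n)

print(mrange(7, 0.3, 8))   # [7.  7.3 7.6 7.9]
print(mrange(7, 0.2, 8))   # [7.  7.2 7.4 7.6 7.8 8. ]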

Extremely low values from NumPy

I am attempting to do a few different operations in Numpy (mean and interp), and with both operations I am getting the result 2.77555756156e-17 at various times, usually when I'm expecting a zero. Even attempting to filter these out with array[array < 0.0] = 0.0 fails to remove the values.
I assume there's some sort of underlying data type or environment error that's causing this. The data should all be float.
Edit: It's been helpfully pointed out that I was only filtering out the values of -2.77555756156e-17 but still seeing positive 2.77555756156e-17. The crux of the question is what might be causing these wacky values to appear when doing simple operations like interpolating values between 0 and 10 and taking the mean of floats in the same range, and how I can avoid it without having to explicitly filter the arrays after every statement.
You're running into numerical precision, which is a huge topic in numerical computing; when you do any computation with floating point numbers, you run the risk of running into tiny values like the one you've posted here. What's happening is that your calculations are resulting in values that can't quite be expressed with floating-point numbers.
Floating-point numbers are expressed with a fixed amount of information (in Python, a float is 64 bits). You can read more about how that information is encoded on the very good Floating point Wikipedia page. In short, some calculation that you're performing in the process of computing your mean produces an intermediate value that cannot be precisely expressed.
This isn't a property of numpy (and it's not even really a property of Python); it's really a property of the computer itself. You can see this in normal Python by playing around in the REPL:
>>> repr(3.0)
'3.0'
>>> repr(3.0 + 1e-10)
'3.0000000001'
>>> repr(3.0 + 1e-18)
'3.0'
For the last result, you would expect 3.000000000000000001, but that number can't be expressed in a 64-bit floating point number, so the computer uses the closest approximation, which in this case is just 3.0. If you were trying to average the following list of numbers:
[3., -3., 1e-18]
Depending on the order in which you summed them, you could get 1e-18 / 3., which is the "correct" answer, or zero. You're in a slightly stranger situation; two numbers that you expected to cancel didn't quite cancel out.
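You can see the order dependence of the sum directly in the REPL (illustrative only):
>>> (3. + -3.) + 1e-18    # cancel the large terms first
1e-18
>>> (3. + 1e-18) + -3.    # the tiny term is absorbed by 3.0 first
0.0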
This is just a fact of life when you're dealing with floating point mathematics. The common way of working around it is to eschew the equals sign entirely and to only perform "numerically tolerant comparison", which means equality-with-a-bound. So this check:
a == b
Would become this check:
abs(a - b) < TOLERANCE
For some tolerance amount. The tolerance depends on what you know about your inputs and the precision of your computer; if you're using a 64-bit machine, you want this to be at least 1e-10 times the largest amount you'll be working with. For example, if the biggest input you'll be working with is around 100, it's reasonable to use a tolerance of 1e-8.
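With NumPy arrays the same idea is usually spelled with np.isclose, or by snapping near-zero entries to exactly zero; the tolerance of 1e-10 below is just an example value:
import numpy as np

a = np.array([2.77555756156e-17, 0.5, 1.0])
print(np.isclose(a, 0.0, atol=1e-10))   # [ True False False]
a[np.abs(a) < 1e-10] = 0.0              # snap near-zero entries to exactly 0.0
print(a)                                # [0.  0.5 1. ]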
You can round your values to 15 digits:
a = a.round(15)
Now the array a should show you 0.0 values.
Example:
>>> a = np.array([2.77555756156e-17])
>>> a.round(15)
array([ 0.])
This is most likely the result of floating point arithmetic errors. For instance:
In [3]: 0.1 + 0.2 - 0.3
Out[3]: 5.551115123125783e-17
Not what you would expect? NumPy has a built-in isclose() function that can deal with these things. Also, you can see the machine precision with
eps = np.finfo(np.float64).eps
So, perhaps something like this could work too:
a = np.array([[-1e-17, 1.0], [1e-16, 1.0]])
a[np.abs(a) <= eps] = 0.0

Normalization using Numpy vs hard coded

import numpy as np
import math

def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""

normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?
As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / div. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / div, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
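A quick check of that claim in Python 3 (illustrative; math.fsum and the statistics module both keep the small terms):
import math
import statistics

data = [1., 2., 1e200, -1e200]
print(sum(data) / len(data))         # 0.0   (1 + 2 is absorbed by 1e200)
print(math.fsum(data) / len(data))   # 0.75
print(statistics.mean(data))         # 0.75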
You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by abarnert), but since the second input does not, and is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in your discrepancy with the results of NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.
