I have a matrix with float values, and I am trying to compute the sums of its rows and columns. The matrix is symmetric.
>>> np.sum(n2[1,:]) #summing second row
0.80822400592582844
>>> np.sum(n2[:,1]) #summing second col
0.80822400592582844
>>> np.sum(n2, axis=0)[1]
0.80822400592582899
>>> np.sum(n2, axis=1)[1]
0.80822400592582844
It gives different results. Why?
The numbers numpy uses are doubles, with roughly 15–16 significant decimal digits of precision. It looks like the differences happen around the 16th decimal place, with the rest of the digits being equal. If you don't need that much accuracy, you could use the rounding function np.around(), or you could try the np.longdouble type for a higher degree of precision.
You can check the accuracy of the types using np.finfo:
>>> print np.finfo(np.double).precision
15
Some numpy functions won't accept long doubles, I believe, and will cast them down to doubles, truncating the extra digits.
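For example, something along these lines (a sketch using the n2 matrix from the question; how much extra precision np.longdouble actually gives over a plain double depends on your platform):
rounded = np.around(np.sum(n2, axis=0), decimals=10)  # drop the unstable trailing digits
n2_ld = n2.astype(np.longdouble)                      # extended precision on most x86 builds
col_sums = np.sum(n2_ld, axis=0)                      # sums accumulated in long double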
I am attempting to do a few different operations in Numpy (mean and interp), and with both operations I am getting the result 2.77555756156e-17 at various times, usually when I'm expecting a zero. Even attempting to filter these out with array[array < 0.0] = 0.0 fails to remove the values.
I assume there's some sort of underlying data type or environment error that's causing this. The data should all be float.
Edit: It's been helpfully pointed out that I was only filtering out the values of -2.77555756156e-17 but still seeing positive 2.77555756156e-17. The crux of the question is what might be causing these wacky values to appear when doing simple functions like interpolating values between 0-10 and taking a mean of floats in the same range, and how can I avoid it without having to explicitly filter the arrays after every statement.
You're running into the limits of numerical precision, which is a huge topic in numerical computing; any computation with floating-point numbers risks producing tiny spurious values like the one you've posted here. What's happening is that your calculations are producing values that can't quite be expressed exactly as floating-point numbers.
Floating-point numbers are expressed with a fixed amount of information (in Python, this amount defaults to 64 bits). You can read more about how that information is encoded on the very good Floating point Wikipedia page. In short, some calculation that you're performing in the process of computing your mean produces an intermediate value that cannot be precisely expressed.
This isn't a property of numpy (and it's not even really a property of Python); it's really a property of the computer itself. You can see this in normal Python by playing around in the repl:
>>> repr(3.0)
'3.0'
>>> repr(3.0 + 1e-10)
'3.0000000001'
>>> repr(3.0 + 1e-18)
'3.0'
For the last result, you would expect 3.000000000000000001, but that number can't be expressed in a 64-bit floating point number, so the computer uses the closest approximation, which in this case is just 3.0. If you were trying to average the following list of numbers:
[3., -3., 1e-18]
Depending on the order in which you summed them, you could get 1e-18 / 3., which is the "correct" answer, or zero. You're in a slightly stranger situation; two numbers that you expected to cancel didn't quite cancel out.
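You can see the order dependence directly in the repl:
>>> nums = [3., -3., 1e-18]
>>> (nums[0] + nums[1]) + nums[2]   # cancel the big values first
1e-18
>>> (nums[0] + nums[2]) + nums[1]   # the tiny value is absorbed by 3.0 and lost
0.0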
This is just a fact of life when you're dealing with floating point mathematics. The common way of working around it is to eschew the equals sign entirely and to only perform "numerically tolerant comparison", which means equality-with-a-bound. So this check:
a == b
Would become this check:
abs(a - b) < TOLERANCE
For some tolerance amount. The tolerance depends on what you know about your inputs and the precision of your computer; if you're using a 64-bit machine, you want this to be at least 1e-10 times the largest amount you'll be working with. For example, if the biggest input you'll be working with is around 100, it's reasonable to use a tolerance of 1e-8.
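One way to write such a check, with the tolerance scaled to the size of the inputs (the function name and the constants here are illustrative, not canonical):
def approx_equal(a, b, rel=1e-10, abs_tol=1e-12):
    # relative tolerance scaled by the larger input, with an absolute floor
    return abs(a - b) <= max(rel * max(abs(a), abs(b)), abs_tol)

approx_equal(0.1 + 0.2, 0.3)   # True, even though (0.1 + 0.2) == 0.3 is False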
You can round your values to 15 digits:
a = a.round(15)
Now the array a should show you 0.0 values.
Example:
>>> a = np.array([2.77555756156e-17])
>>> a.round(15)
array([ 0.])
This is most likely the result of floating point arithmetic errors. For instance:
In [3]: 0.1 + 0.2 - 0.3
Out[3]: 5.551115123125783e-17
Not what you would expect? Numpy has a built-in isclose() function that can deal with these things. Also, you can see the machine precision with
eps = np.finfo(np.float).eps
So, perhaps something like this could work too:
a = np.array([[-1e-17, 1.0], [1e-16, 1.0]])
a[np.abs(a) <= eps] = 0.0
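For completeness, np.isclose also treats the value from the first example as zero under its default tolerances:
>>> np.isclose(0.1 + 0.2 - 0.3, 0.0)
True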
The following code causes the print statements to be executed:
import numpy as np
import math
foo = np.array([1/math.sqrt(2), 1/math.sqrt(2)], dtype=np.complex_)
total = complex(0, 0)
one = complex(1, 0)
for f in foo:
    total = total + pow(np.abs(f), 2)

if(total != one):
    print str(total) + " vs " + str(one)
    print "NOT EQUAL"
However, my input of [1/math.sqrt(2), 1/math.sqrt(2)] should make the total equal to one, yet the output is:
(1+0j) vs (1+0j)
NOT EQUAL
Is it something to do with mixing NumPy with Python's complex type?
When using floating point numbers, it is important to keep in mind that working with them is never exact, so computations are always subject to rounding errors. This is a consequence of the design of floating-point arithmetic, which is currently the most practical way to do approximate real-number mathematics on computers with limited resources. You can't compute exactly using floats (and in practice you have no alternative), because your numbers have to be cut off somewhere to fit in a reasonable amount of memory (in most cases at most 64 bits); this cut-off is done by rounding (see below for an example).
To deal correctly with these shortcomings you should never compare floats for equality, but for closeness. Numpy provides two functions for that: np.isclose for comparing single values (or an item-wise comparison of arrays) and np.allclose for whole arrays. The latter is equivalent to np.all(np.isclose(a, b)), so you get a single boolean for the whole array.
>>> np.isclose(np.float32('1.000001'), np.float32('0.999999'))
True
But sometimes the rounding works in our favour and matches our analytical expectation; see for example:
>>> np.float(1) == np.square(np.sqrt(1))
True
After squaring, the value is rounded to fit in the available memory, and in this case it rounds to exactly what we would expect.
These two functions have built-in absolute and relative tolerances (you can also pass them as parameters) that are used to compare the two values. By default they are rtol=1e-05 and atol=1e-08.
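For example, you can pass stricter tolerances explicitly (the numbers below are purely illustrative):
>>> np.isclose(1.0, 1.0 + 1e-6)                         # within the default rtol=1e-05
True
>>> np.isclose(1.0, 1.0 + 1e-6, rtol=1e-09, atol=0.0)   # stricter tolerances
False
>>> np.allclose([1.0, 2.0], [1.0 + 1e-6, 2.0])          # one answer for the whole array
True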
Also, don't mix different packages with their types. If you use Numpy, use Numpy-Types and Numpy-Functions. This will also reduce your rounding errors.
By the way: rounding errors have even more impact when working with numbers whose exponents differ widely.
I guess the same considerations as for real numbers apply: never assume floating-point values will be exactly equal, but rather check that they are close enough:
eps = 0.000001
if abs(a - b) < eps:
    print "Equal"
I need to use a module that does some math on integers, however my input is in floats.
What I want to achieve is to convert a generic float value into a corresponding integer value and lose as little data as possible.
For example:
val : 1.28827339907e-08
result : 128827339906934
Which is achieved after multiplying by 1e22.
Unfortunately the range of values can change, so I cannot always multiply them by the same constant. Any ideas?
ADDED
To put it in other words, I have a matrix of values < 1, let's say from 1.323224e-8 to 3.457782e-6.
I want to convert them all into integers and lose as little data as possible.
The answers that suggest multiplying by a power of ten cause unnecessary rounding.
Multiplication by a power of the base used in the floating-point representation has no error in IEEE 754 arithmetic (the most common floating-point implementation) as long as there is no overflow or underflow.
Thus, for binary floating-point, you may be able to achieve your goal by multiplying the floating-point number by a power of two and rounding the result to the nearest integer. The multiplication will have no error. The rounding to integer may have an error up to .5, obviously.
You might select a power of two that is as large as possible without causing any of your numbers to exceed the bounds of the integer type you are using.
The most common conversion of floating-point to integer truncates, so that 3.75 becomes 3. I am not sure about Python semantics. To round instead of truncating, you might use a function such as round before converting to integer.
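A minimal sketch of that idea in Python (the function name and the headroom choice are mine; it assumes the scaled values should fit comfortably in a 64-bit signed integer):
import math

def scale_by_power_of_two(values, max_int=2**62):
    # Pick a power of two so the largest magnitude stays below max_int.
    largest = max(abs(v) for v in values)
    shift = math.frexp(max_int)[1] - math.frexp(largest)[1] - 1
    # ldexp multiplies by 2**shift exactly; only the final rounding loses information.
    ints = [int(round(math.ldexp(v, shift))) for v in values]
    return ints, shift                    # keep shift so the scaling can be undone

ints, shift = scale_by_power_of_two([1.323224e-8, 3.457782e-6])
recovered = [math.ldexp(i, -shift) for i in ints]   # approximate round-trip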
If you want to preserve the values for operations on matrices, I would choose some value to multiply them all by.
For Example:
1.23423
2.32423
4.2324534
Multiply them all by 10000000 and you get
12342300
23242300
42324534
You can perform your multiplications, additions, etc. with your matrices. Once you have performed all your calculations, you can convert them back to floats by dividing by the appropriate value, depending on the operation you performed.
Mathematically it makes sense because
(Scalar multiplication)
M1' = M1 * 10000000
M2' = M2 * 10000000
Result = M1'.M2'
Result = (M1 x 10000000).(M2 x 10000000)
Result = (10000000 x 10000000) x (M1.M2)
So in the case of multiplication you would divide your result by 10000000 x 10000000.
If it's addition or subtraction, then you simply divide by 10000000.
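A small numpy sketch of that bookkeeping (10000000 is the example scale from above; the matrices are made up):
import numpy as np

scale = 10000000
M1 = np.array([[1.23423, 2.32423], [4.2324534, 1.23423]])
M2 = np.array([[2.32423, 1.23423], [1.23423, 4.2324534]])

M1s = np.rint(M1 * scale).astype(np.int64)    # scaled integer matrices
M2s = np.rint(M2 * scale).astype(np.int64)

product = M1s.dot(M2s) / float(scale) ** 2    # matrix product: divide by scale twice
summed = (M1s + M2s) / float(scale)           # addition: divide by scale once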
You can either choose the value to multiply by through your knowledge of what decimals you expect to find or by scanning the floats and generating the value yourself at runtime.
Hope that helps.
EDIT: If you are worried about going over the maximum capacity of integers - then you would be happy to know that python automatically (and silently) converts integers to longs when it notices overflow is going to occur. You can see for yourself in a python console:
>>> i = 3423
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'long'>
If you are still worried about overflow, you can always choose a lower constant, with a compromise of slightly less accuracy (since you will be losing some digits toward the end of the decimal expansion).
Also, the method proposed by Eric Postpischil seems to make sense, but I have not tried it out myself. I gave you a solution from a more mathematical perspective, which also seems to be more "pythonic".
Perhaps consider counting the number of places after the decimal for each value to determine the value (x) of your exponent (1ex). Roughly something like what's addressed here. Cheers!
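A rough sketch of that counting approach (decimal_places is a helper name I made up; it just inspects the decimal expansion of the float's repr):
from decimal import Decimal

def decimal_places(val):
    # number of digits after the decimal point in the shortest repr of val
    d = Decimal(repr(val))
    return max(0, -d.as_tuple().exponent)

decimal_places(1.323224e-8)   # 14, so 1e14 would be one candidate scale factor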
Here's one solution:
def to_int(val):
    return int(repr(val).replace('.', '').split('e')[0])
Usage:
>>> to_int(1.28827339907e-08)
128827339907
For 1-D numpy arrays, these two expressions should yield the same result (theoretically):
(a*b).sum()/a.sum()
dot(a, b)/a.sum()
The latter uses dot() and is faster. But which one is more accurate? Why?
Some context follows.
I wanted to compute the weighted variance of a sample using numpy.
I found the dot() expression in another answer, with a comment stating that it should be more accurate. However no explanation is given there.
Numpy's dot is one of the routines that calls the BLAS library you link against at compile time (or the fallback implementation numpy builds itself). The importance of this is that the BLAS library can make use of multiply–accumulate operations (usually fused multiply–add), which limit the number of roundings that the computation performs.
Take the following:
>>> a=np.ones(1000,dtype=np.float128)+1E-14
>>> (a*a).sum()
1000.0000000000199948
>>> np.dot(a,a)
1000.0000000000199948
Not exact, but close enough.
>>> a=np.ones(1000,dtype=np.float64)+1E-14
>>> np.dot(a,a)
1000.0000000000176 #off by 2.3948e-12
>>> (a*a).sum()
1000.0000000000059 #off by 1.40948e-11
The np.dot(a, a) will be the more accurate of the two, as it uses approximately half the number of floating point roundings that the naive (a*a).sum() does.
A book by Nvidia has the following example for 4 digits of precision, where rn stands for rounding to the nearest 4 digits:
x = 1.0008
x^2 = 1.00160064                  # true value
rn(x^2 − 1) = 1.6006 × 10^−4      # fused multiply–add
rn(rn(x^2) − 1) = 1.6000 × 10^−4  # multiply, then add
Of course real doubles are rounded in binary rather than at the 16th decimal place in base 10, but you get the idea.
Placing np.dot(a,a) in the above notation with some additional pseudo code:
out = 0
for x in a:
    out = rn(x*x + out)   # fused multiply-add
While (a*a).sum() is:
arr = np.zeros(a.shape[0])
for x in range(len(arr)):
    arr[x] = rn(a[x]*a[x])

out = 0
for x in arr:
    out = rn(x + out)
From this it's easy to see that the values are rounded twice as many times using (a*a).sum() compared to np.dot(a,a). These small differences, summed, can change the answer minutely. Additional examples can be found here.
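Translated back to the weighted-average expressions from the question, a quick sanity check might look like this (exact digits will vary with platform and BLAS build):
a = np.ones(1000, dtype=np.float64) + 1e-14   # weights
b = np.ones(1000, dtype=np.float64) + 1e-14   # values
w_naive = (a*b).sum() / a.sum()
w_dot = np.dot(a, b) / a.sum()
# w_dot is typically the closer of the two to the analytic result 1.0 + 1e-14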
I have a numpy array of floats in Python.
When I print the array, the first value is:
[7.14519700e+04, ....
If, however, I print out just the first value on its own, the printout reads:
71451.9699799
Obviously these numbers should be identical, so I just wondered, is the array just showing me a rounded version of the element? The second number here has 12 significant figures, and the first only has 9.
I guess I just wonder why these numbers are different?
It's just in the printing, not in the storage. The only confusion might occur because the first example uses numpy's print precision settings, while the second uses Python's general print settings.
You can adjust the numpy precision and print by
numpy.set_printoptions(precision=20)
print myarray
(adjust precision to your needs), or select the number of significant figures in standard python formatted print:
print ('%.20f' % myarray[0])
The internal representation of the number is always the same.
The types in a numpy array are well defined. You can see how the values are stored by inspecting the dtype attribute of the array.
For example:
import numpy
a = numpy.zeros(10)
print a.dtype
will show float64, that is a 64-bit floating point number.
You can specify the type of the array explicitly, either through the dtype argument that most constructors accept or by casting with one of numpy's scalar type objects:
a = numpy.zeros(10, dtype='float32')   # a 32-bit floating point array
b = numpy.longdouble(a)                # create a long-double array from a
Regarding the printing, this is just a formatting issue. You can twiddle how numpy prints an array using numpy.set_printoptions:
>>> a = numpy.random.randn(4) # for interest, randn annoyingly doesn't support the dtype arg
>>> print a
[ 0.12584756 0.73540009 -0.17108244 -0.96818512]
>>> numpy.set_printoptions(precision=3)
>>> print a
[ 0.126 0.735 -0.171 -0.968]