I am attempting to do a few different operations in Numpy (mean and interp), and with both operations I am getting the result 2.77555756156e-17 at various times, usually when I'm expecting a zero. Even attempting to filter these out with array[array < 0.0] = 0.0 fails to remove the values.
I assume there's some sort of underlying data type or environment error that's causing this. The data should all be float.
Edit: It's been helpfully pointed out that I was only filtering out values of -2.77555756156e-17 while still seeing positive 2.77555756156e-17. The crux of the question is what might be causing these wacky values to appear when doing simple operations like interpolating values between 0-10 and taking the mean of floats in the same range, and how I can avoid it without having to explicitly filter the arrays after every statement.
You're running into the limits of numerical precision, which is a huge topic in numerical computing; any computation with floating-point numbers risks producing tiny stray values like the one you've posted here. What's happening is that your calculations are producing values that can't quite be expressed with floating-point numbers.
Floating-point numbers are expressed with a fixed amount of information (in Python, this amount defaults to 64 bits). You can read more about how that information is encoded on the very good Floating point Wikipedia page. In short, some calculation that you're performing in the process of computing your mean produces an intermediate value that cannot be precisely expressed.
This isn't a property of numpy (and it's not even really a property of Python); it's really a property of the computer itself. You can see this in normal Python by playing around in the REPL:
>>> repr(3.0)
'3.0'
>>> repr(3.0 + 1e-10)
'3.0000000001'
>>> repr(3.0 + 1e-18)
'3.0'
For the last result, you would expect 3.000000000000000001, but that number can't be expressed in a 64-bit floating point number, so the computer uses the closest approximation, which in this case is just 3.0. If you were trying to average the following list of numbers:
[3., -3., 1e-18]
Depending on the order in which you summed them, you could get 1e-18 / 3., which is the "correct" answer, or zero. You're in a slightly stranger situation; two numbers that you expected to cancel didn't quite cancel out.
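You can reproduce that order dependence in plain Python; the following sketch just spells out the two summation orders described above:
nums = [3.0, -3.0, 1e-18]
cancel_first = (3.0 + -3.0) + 1e-18   # big values cancel exactly first, leaving 1e-18
absorb_first = (3.0 + 1e-18) + -3.0   # 1e-18 is absorbed into 3.0, leaving 0.0
print(cancel_first / 3)   # ~3.3e-19, the "correct" mean
print(absorb_first / 3)   # 0.0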
This is just a fact of life when you're dealing with floating point mathematics. The common way of working around it is to eschew the equals sign entirely and to only perform "numerically tolerant comparison", which means equality-with-a-bound. So this check:
a == b
Would become this check:
abs(a - b) < TOLERANCE
For some tolerance amount. The tolerance depends on what you know about your inputs and the precision of your computer; if you're using a 64-bit machine, you want this to be at least 1e-10 times the largest amount you'll be working with. For example, if the biggest input you'll be working with is around 100, it's reasonable to use a tolerance of 1e-8.
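As a sketch of what that looks like in practice (roughly_equal is a hypothetical helper, and the 1e-10 factor is just the rule of thumb above, not a universal constant):
def roughly_equal(a, b, rel=1e-10):
    # Scale the tolerance by the magnitude of the inputs so the check
    # behaves sensibly for both small and large values.
    tolerance = rel * max(abs(a), abs(b), 1.0)
    return abs(a - b) < tolerance

print(roughly_equal(0.1 + 0.2, 0.3))        # True
print(roughly_equal(2.77555756156e-17, 0))  # True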
You can round your values to 15 digits:
a = a.round(15)
Now the array a should show you 0.0 values.
Example:
>>> a = np.array([2.77555756156e-17])
>>> a.round(15)
array([ 0.])
This is most likely the result of floating point arithmetic errors. For instance:
In [3]: 0.1 + 0.2 - 0.3
Out[3]: 5.551115123125783e-17
Not what you would expect? NumPy has a built-in isclose() function that can deal with these things. Also, you can see the machine precision with
eps = np.finfo(float).eps  # np.float was removed in newer NumPy; plain float means float64 here
So, perhaps something like this could work too:
a = np.array([[-1e-17, 1.0], [1e-16, 1.0]])
a[np.abs(a) <= eps] = 0.0
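For comparison, np.isclose handles the same sort of check without hand-picking a threshold (its defaults are rtol=1e-05 and atol=1e-08):
>>> np.isclose(0.1 + 0.2 - 0.3, 0.0)
True
>>> np.isclose([0.1 + 0.2 - 0.3, 1e-5], 0.0)
array([ True, False])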
I am trying to implement Gensim's most_similar function by hand but calculate the similarity between the query word and just one other word (avoiding the time to calculate it for the query word with all other words). So far I use
cossim = (np.dot(a, b)
/ np.linalg.norm(a)
/ np.linalg.norm(b))
and this matches the similarity result between a and b. I find this works almost exactly, but some precision is lost; for example
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
model_gigaword = api.load("glove-wiki-gigaword-300")
a = 'france'
b = 'chirac'
cossim1 = model_gigaword.most_similar(a)
import numpy as np
cossim2 = (np.dot(model_gigaword[a], model_gigaword[b])
/ np.linalg.norm(model_gigaword[a])
/ np.linalg.norm(model_gigaword[b]))
print(cossim1)
print(cossim2)
Output:
[('french', 0.7344760894775391), ('paris', 0.6580672264099121), ('belgium', 0.620672345161438), ('spain', 0.573593258857727), ('italy', 0.5643460154533386), ('germany', 0.5567398071289062), ('prohertrib', 0.5564222931861877), ('britain', 0.5553334355354309), ('chirac', 0.5362644195556641), ('switzerland', 0.5320892333984375)]
0.53626436
So the most_similar function gives 0.53626441955... (rounds to 0.53626442) and the calculation with numpy gives 0.53626436. Similarly, you can see differences between the values for 'paris' and 'italy' (in similarity compared to 'france'). These differences suggest that the calculation is not being done to full precision (but it is in Gensim). How can I fix it and get the output for a single similarity to higher precision, exactly as it comes from most_similar?
TL/DR - I want to use function('france', 'chirac') and get 0.5362644195556641, not 0.53626436.
Any idea what's going on?
UPDATE: I should clarify, I want to know and replicate how most_similar does the computation, but for only one (a,b) pair. That's my priority, rather than finding out how to improve the precision of my cossim calculation above. I just assumed the two were equivalent.
To increase accuracy you can try the following:
a = np.array(model_gigaword[a]).astype('float128')  # note: np.float128 is unavailable on some platforms (e.g. Windows)
b = np.array(model_gigaword[b]).astype('float128')
cossim = (np.dot(a, b)
/ np.linalg.norm(a)
/ np.linalg.norm(b))
The vectors are likely stored as lower-precision floats, and hence there is a loss of precision in the calculations.
However, the results I got are somewhat different to what model_gigaword.most_similar offers for you:
model_gigaword.similarity: 0.5362644
float64: 0.5362644263010196
float128: 0.53626442630101950744
You may want to check what you get on your machine and with your version of Python and gensim.
Because floating-point numbers (like the np.float32-typed values in these vector models) are represented using an imprecise binary approximation, none of the numbers you're working with, or displaying, are the exact decimal numbers you think they are.
The number you're seeing as 0.53626436 isn't exactly that - but some binary floating-point number very close to it. Similarly, the number you're seeing as 0.5362644195556641 isn't exactly that - but some other binary floating-point number, very close to that.
Further, these tiny imprecisions can mean that mathematical expressions which should, under ideal circumstances, give identical results no matter the order of evaluation, instead give slightly different results for different orders of evaluation. For example, we know that mathematically, a * (b + c) is always equal to a*b + a*c. However, if a, b, & c are floating-point numbers with limited precision, the results of doing the addition then the multiplication, versus doing two multiplications then one addition, might vary - because the interim values are approximated slightly differently.
But: for nearly all domains in which these numbers are used, this tiny amount of noise shouldn't make any difference. The right policy is to ignore it, and write code that's robust to this small 'jitter' in extremely-low-significance digits - especially when printing or comparing results.
So really you should only be printing/comparing these numbers to a level of significance where they reliably agree, say, 4 digits after the decimal:
0.53626436
0.5362644195556641
(In fact, your output already makes it look like you may have changed the default level of display-precision in numpy or python, because it wouldn't be typical for the results of most_similar() to display with those 16 digits after the decimal.)
If you really, really wanted, as an exploration, to match the most_similar() results exactly, you could look at its source code. Then, perform the exact same steps, in the exact same order, using the exact same library routines, on your inputs.
(Here's the source for most_similar() in the current gensim-4.0.0beta prerelease: https://github.com/RaRe-Technologies/gensim/blob/4.0.0beta/gensim/models/keyedvectors.py#L690)
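For instance, here is a sketch of that idea for a single pair, assuming gensim 4.x, where KeyedVectors.get_vector(key, norm=True) returns the unit-normalized float32 vector; even this may not match bit-for-bit across gensim versions, since the exact float32 operation order inside most_similar() can change between releases:
import numpy as np

def pair_similarity(kv, a, b):
    # Mimic most_similar()'s arithmetic for one pair: dot the
    # unit-normalized float32 vectors in float32, instead of
    # normalizing afterwards with float64 divisions.
    va = kv.get_vector(a, norm=True)
    vb = kv.get_vector(b, norm=True)
    return np.dot(va, vb)

# pair_similarity(model_gigaword, 'france', 'chirac')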
But: insisting on such exact correspondence is usually unwise, & creates more-fragile code, given the inherent imprecision in floating-point math.
See also: another answer covering some similar issues, which also points out a way to change the default displayed precision.
The following code causes the print statements to be executed:
import numpy as np
import math
foo = np.array([1/math.sqrt(2), 1/math.sqrt(2)], dtype=np.complex_)
total = complex(0, 0)
one = complex(1, 0)
for f in foo:
    total = total + pow(np.abs(f), 2)
if total != one:
    print str(total) + " vs " + str(one)
    print "NOT EQUAL"
However, my input of [1/math.sqrt(2), 1/math.sqrt(2)] should make the total exactly one, and the printed values even look identical:
(1+0j) vs (1+0j)
NOT EQUAL
Is it something to do with mixing NumPy with Python's complex type?
When using floating-point numbers it is important to keep in mind that working with them is never exact, so computations are always subject to rounding errors. This is inherent in the design of floating-point arithmetic, which is currently the most practical way to do fast approximate real-number mathematics on computers with limited resources; you have practically no alternative. Your numbers have to be cut off somewhere to fit in a reasonable amount of memory (in most cases at most 64 bits), and this cut-off is done by rounding (see below for an example).
To deal correctly with these shortcomings you should never compare floats for equality, but for closeness. NumPy provides two functions for that: np.isclose for comparing single values (or an element-wise comparison for arrays) and np.allclose for whole arrays. The latter is equivalent to np.all(np.isclose(a, b)), so you get a single boolean for a whole array.
>>> np.isclose(np.float32('1.000001'), np.float32('0.999999'))
True
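And np.allclose collapses the element-wise check into a single answer for the whole array; these pairs illustrate the interplay of the relative and absolute tolerances:
>>> np.allclose([1e10, 1e-7], [1.00001e10, 1e-8])
False
>>> np.allclose([1e10, 1e-8], [1.00001e10, 1e-9])
True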
But sometimes the rounding works in our favour and matches the analytical expectation; see for example:
>>> np.float64(1) == np.square(np.sqrt(1))
True
After squaring, the value is rounded again to fit into the given memory, and in this case it rounds to exactly what we would expect.
These two functions have built-in absolute and relative tolerances (you can also pass them as parameters) that are used to compare the two values. By default they are rtol=1e-05 and atol=1e-08.
Also, don't mix different packages with their types. If you use Numpy, use Numpy-Types and Numpy-Functions. This will also reduce your rounding errors.
Btw: Rounding errors have even more impact when working with numbers which differ in their exponent widely.
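A quick demonstration of that exponent effect: near 1e16 the spacing between adjacent 64-bit floats is already 2.0, so adding 1.0 simply vanishes:
>>> (1e16 + 1.0) - 1e16
0.0
>>> (1e8 + 1.0) - 1e8
1.0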
I guess the same considerations as for real numbers apply: never assume they can be exactly equal, but rather close enough:
eps = 0.000001
if abs(a - b) < eps:
    print "Equal"
I need to use a module that does some math on integers, however my input is in floats.
What I want to achieve is to convert a generic float value into a corresponding integer value and lose as little data as possible.
For example:
val : 1.28827339907e-08
result : 128827339906934
Which is achieved after multiplying by 1e22.
Unfortunately the range of values can change, so I cannot always multiply them by the same constant. Any ideas?
ADDED
To put it in other words, I have a matrix of values < 1, let's say from 1.323224e-8 to 3.457782e-6.
I want to convert them all into integers and lose as little data as possible.
The answers that suggest multiplying by a power of ten cause unnecessary rounding.
Multiplication by a power of the base used in the floating-point representation has no error in IEEE 754 arithmetic (the most common floating-point implementation) as long as there is no overflow or underflow.
Thus, for binary floating-point, you may be able to achieve your goal by multiplying the floating-point number by a power of two and rounding the result to the nearest integer. The multiplication will have no error. The rounding to integer may have an error up to .5, obviously.
You might select a power of two that is as large as possible without causing any of your numbers to exceed the bounds of the integer type you are using.
The most common conversion of floating-point to integer truncates, so that 3.75 becomes 3. I am not sure about Python semantics. To round instead of truncating, you might use a function such as round before converting to integer.
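As a sketch of that recipe (scale_to_int is a hypothetical helper, not a library function), math.frexp exposes the binary exponent, which lets you pick a power of two that keeps every value inside a 64-bit signed integer:
import math

def scale_to_int(values, int_bits=63):
    # Multiplying by 2**k is exact in binary floating point (barring
    # overflow/underflow), so the only rounding is the final round().
    largest = max(abs(v) for v in values)
    _, exp = math.frexp(largest)        # largest == m * 2**exp, 0.5 <= |m| < 1
    k = (int_bits - 1) - exp            # keeps largest * 2**k below 2**(int_bits - 1)
    return [int(round(v * 2.0 ** k)) for v in values], k

ints, k = scale_to_int([1.323224e-8, 3.457782e-6])  # recover each value as i / 2**k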
If you want to preserve the values for operations on matrices I would choose some value to multiply them all by.
For Example:
1.23423
2.32423
4.2324534
Multiply them all by 10000000 and you get
12342300
23242300
42324534
You can perform your multiplications, additions etc. with your matrices. Once you have performed all your calculations you can convert them back to floats by dividing them all by the appropriate value, depending on the operation you performed.
Mathematically it makes sense because
(Scalar multiplication)
M1' = M1 * 10000000
M2' = M2 * 10000000
Result = M1' . M2'
Result = (M1 * 10000000) . (M2 * 10000000)
Result = (10000000 * 10000000) * (M1 . M2)
So in the case of multiplication you would divide your result by 10000000 x 10000000.
If it's addition / subtraction then you simply divide by 10000000.
You can either choose the value to multiply by through your knowledge of what decimals you expect to find or by scanning the floats and generating the value yourself at runtime.
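A small sketch of that round trip, using the example scale factor from above:
import numpy as np

S = 10 ** 7                            # the example scale factor
M1 = np.array([[1.23423, 2.32423]])
M2 = np.array([[4.2324534], [1.0]])

I1 = np.rint(M1 * S).astype(np.int64)  # scaled integer copies
I2 = np.rint(M2 * S).astype(np.int64)

# A matrix product picks up a factor of S*S; a sum only a factor of S.
print(I1.dot(I2) / (S * S))            # close to M1.dot(M2)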
Hope that helps.
EDIT: If you are worried about going over the maximum capacity of integers - then you would be happy to know that Python 2 automatically (and silently) converts ints to longs when it notices overflow is about to occur (Python 3 has a single arbitrary-precision int type, so this is not a concern there at all). You can see for yourself in a Python 2 console:
>>> i = 3423
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'long'>
If you are still worried about overflow, you can always choose a lower constant, with a compromise of slightly less accuracy (since you will be losing some digits towards the end of the decimal).
Also, the method proposed by Eric Postpischil seems to make sense - but I have not tried it out myself. I gave you a solution from a more mathematical perspective, which also seems to be more "pythonic".
Perhaps consider counting the number of places after the decimal point for each value to determine the exponent x of your scale factor (1ex). Roughly something like what's addressed here. Cheers!
Here's one solution:
def to_int(val):
    # Drop the decimal point, then discard any exponent from the repr.
    return int(repr(val).replace('.', '').split('e')[0])
Usage:
>>> to_int(1.28827339907e-08)
128827339907
So I'm trying to run a program that has two variables, when one variable is equal to another, it performs a function. In this case, printing spam. However, for some reason, when I run this program, I'm not getting any output even though I know they are equal.
g = 0.0
b = 3.0
while g < 30.0:
    if g == b:
        print "Hi"
    g += .1
    print g, b
You are assuming that adding .1 enough times to 0.0 will produce 3.0. These are floating-point numbers, and they are inexact; rounding errors make it so that the value is never exactly equal to 3.0. You should almost never use == to test floating-point numbers.
A good way to do this is to count with integer values (e.g., loop with i from 0 to 300 by 1) and scale the counter only when the float value is used (e.g., set f = i * .1). When you do this, the loop counter is always exact, so you get exactly the iterations you want, and there is only one floating-point rounding, which does not accumulate from iteration to iteration.
The loop counter is most commonly an integer type, so that addition is easily seen to be exact (until overflow is reached). However, the loop counter may also be a floating-point type, provided you are sure that the values and operations for it are exact. (The common 32-bit floating-point format represents integers exactly from -2^24 to +2^24. Outside that, it does not have the precision to represent integers exactly. It does not represent .1 exactly, so you cannot count with increments of .1. But you could count with increments of .5, .25, .375, or other small multiples of moderate powers of two, which are represented exactly.)
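Applying that advice to the loop above, a sketch (written in Python 3) with an exact integer counter; only the derived float is rounded, and the equality test runs on exact integers:
b = 3.0
for i in range(300):      # i counts exactly; g = i * .1 spans 0.0 .. 29.9
    g = i * 0.1           # one rounding per iteration, no accumulated drift
    if i == 30:           # 30 counts of .1 correspond to g == 3.0
        print("Hi")
    print(g, b)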
To expand on Karoly Horvath's comment, what you can do to test near-equality is choose some value (let's call it epsilon) that is very, very small relative to the minimum increment. Let's say epsilon is 1.0 * 10^-6, five orders of magnitude smaller than your increment. (It should probably be based on the average rounding error of your floating point representation, but that varies, and this is simply an example).
What you then do is check if g and b are less than epsilon different - if they are close enough that they are practically equal, the difference between practically and actually being the rounding error, which you're approximating with epsilon.
Check for
abs(g - b) < epsilon
and you'll have your almost-but-not-quite equality check, which should be good enough for most purposes.
First off, I'm not a math guy, so large number precision rarely filters into my daily work. Please be gentle. ;)
Using NumPy to generate a matrix with values equally divided from 1:
>>> m = numpy.matrix([(1.0 / 1000) for x in xrange(1000)]).T
>>> m
matrix([[ 0.001 ],
[ 0.001 ],
...
[ 0.001 ]])
On 64-bit Windows with Python 2.6, summing rarely works out to 1.0. math.fsum() does give 1.0 with this matrix, but it doesn't if I change the matrix to use smaller numbers.
>>> numpy.sum(m)
1.0000000000000007
>>> math.fsum(m)
1.0
>>> sum(m)
matrix([[ 1.]])
>>> float(sum(m))
1.0000000000000007
On 32-bit Linux (Ubuntu) with Python 2.6, summing always works out to 1.0.
>>> numpy.sum(m)
1.0
>>> math.fsum(m)
1.0
>>> sum(m)
matrix([[ 1.]])
>>> float(sum(m))
1.0000000000000007
I can add an epsilon to my code when assessing whether the matrix sums to 1 (e.g. 1 - epsilon < sum(m) < 1 + epsilon) but I want to first understand what the cause of the difference is within Python, and whether there's a better way to determine the sum correctly.
My understanding is that the sums process the machine representation of the numbers (floats) differently from how they're displayed, and that the internal representation is what gets summed. However, looking at the three methods I used to calculate the sum, it's not clear why they're all different, or the same between the platforms.
What's the best way to correctly calculate the sum of a matrix?
If you're looking for a more interesting matrix, this simple change will have smaller matrix numbers:
>>> m = numpy.matrix([(1.0 / 999) for x in xrange(999)]).T
Thanks in advance for any help!
Update
I think I figured something out. If I correct the value being stored to a 32-bit float the results match the 32-bit Linux sum'ing.
>>> m = numpy.matrix([(numpy.float32(1.0) / 1000) for x in xrange(1000)]).T
>>> m
matrix([[ 0.001 ],
[ 0.001 ],
...
[ 0.001 ]])
>>> numpy.sum(m)
1.0
This makes the matrix store 32-bit floats instead of the 64-bit floats from my Windows test, and the result sums to 1.0. Why is 0.001 not the same machine number on a 32-bit and a 64-bit system? I would expect them to differ only if I were trying to store very small numbers with lots of decimal places.
Does anyone have any thoughts on this? Should I explicitly switch to 32-bit floats in this case, or is there a 64-bit sum'ing method? Or am I back to adding an epsilon? Sorry if I sound dumb, I'm interested in opinions. Thanks!
It's because you're comparing 32-bit floats to 64-bit floats, as you've already found out.
If you specify a 32-bit or 64-bit dtype on both machines, you'll see the same result.
Numpy's default floating point dtype (the numerical type for a numpy array) is the same as the machine precision. This is why you're seeing different results on different machines.
E.g.
The 32-bit version:
m = numpy.ones(1000, dtype=numpy.float32) / 1000
print repr(m.sum())
and the 64-bit version:
m = numpy.ones(1000, dtype=numpy.float64) / 1000
print repr(m.sum())
Will be different due to the differing precision, but you'll see the same results on different machines. (However, the 64-bit operation will be much slower on a 32-bit machine)
If you just specify numpy.float, this will be either a float32 or a float64 depending on the machine's native architecture.
I'd say that the most accurate way (not the most efficient) is to use the decimal module:
>>> from decimal import Decimal
>>> m = numpy.matrix([(Decimal(1) / 1000) for x in xrange(1000)])
>>> numpy.sum(m)
Decimal('1.000')
>>> numpy.sum(m) == 1.0
True
First, if you use numpy to store values, you should use numpy's methods, if provided, to work with the array/matrix. That is, if you want to trust the extremely capable people that have put numpy together.
Now, the 64-bit answer of numpy's sum() cannot sum up to exactly 1 because of how floating-point numbers are handled in computers (murgatroid99 provided you with a link; there are hundreds more out there).
Therefore, the only safe way (and one that is also very helpful for understanding the mathematical treatment in your code, and therefore your problem per se) is to use an epsilon value to cut off at a certain precision.
Why do I think it is helpful? Because computational science needs to deal with errors as much as experimental science does, and by deliberately dealing with the errors here (meaning: determining them), you have already taken the first step in dealing with the computational errors of your code.
So, there may be other ways to deal with it, but most of the time I would use an epsilon to determine the precision I require for a given problem.