Can someone help me unpack what exactly is going on under the hood here?
>>> 1e16 + 1.
1e+16
>>> 1e16 + 1.1
1.0000000000000002e+16
I'm on 64-bit Python 2.7. For the first, I would assume that since there is only a precision of 15 for float that it's just round-off error. The true floating-point answer might be something like
10000000000000000.999999....
And the decimal just gets lopped of. But the second result makes me question this understanding and can't 1 be represented exactly? Any thoughts?
[Edit: Just to clarify. I'm not in any way suggesting that the answers are "wrong." Clearly, they're right, because, well they are. I'm just trying to understand why.]
It's just rounding as close as it can.
1e16 in floating hex is 0x4341c37937e08000.
1e16+2 is 0x4341c37937e08001.
At this level of magnitude, the smallest difference in precision that you can represent is 2. Adding 1.0 exactly rounds down (because typically IEEE floating point math will round to an even number). Adding values larger than 1.0 will round up to the next representable value.
10^16 = 0x002386f26fc10000 is exactly representable as a double precision floating point number. The next representable number is 1e16+2. 1e16+1 is correctly rounded to 1e16, and 1e16+1.1 is correctly rounded to 1e16+2. Check the output of this C program:
#include <stdio.h>
#include <math.h>
#include <stdint.h>
int main()
{
uint64_t i = 10000000000000000ULL;
double a = (double)i;
double b = nextafter(a,1.0e20); // next representable number
printf("I=0x%016llx\n",i); // 10^16 in hex
printf("A=%a (%.4f)\n",a,a); // double representation
printf("B=%a (%.4f)\n",b,b); // next double
}
Output:
I=0x002386f26fc10000
A=0x1.1c37937e08p+53 (10000000000000000.0000)
B=0x1.1c37937e08001p+53 (10000000000000002.0000)
Let's decode some floats, and see what's actually going on! I'm going to use Common Lisp, which has a handy function for getting at the significand (a.k.a mantissa) and exponent of a floating-point number without needing to twiddle any bits. All floats used are IEEE double-precision floats.
> (integer-decode-float 1.0d0)
4503599627370496
-52
1
That is, if we consider the value stored in the significand as an integer, it is the maximum power of 2 available (4503599627370496 = 2^52), scaled down (2^-52). (It isn't stored as 1 with an exponent of 0 because it's simpler for the significand to never have zeros on the left, and this allows us to skip representing the leftmost 1 bit and have more precision. Numbers not in this form are called denormal.)
Let's look at 1e16.
> (integer-decode-float 1d16)
5000000000000000
1
1
Here we have the representation (5000000000000000) * 2^1. Note that the significand, despite being a nice round decimal number, is not a power of 2; this is because 1e16 is not a power of 2. Every time you multiply by 10, you are multiplying by 2 and 5; multiplying by 2 is just incrementing the exponent, but multiplying by 5 is an "actual" multiplication, and here we've multiplied by 5 16 times.
5000000000000000 = 10001110000110111100100110111111000001000000000000000 (base 2)
Observe that this is a 53-bit binary number, as it should be since double floats have a 53-bit significand.
But the key to understanding the situation is that the exponent is 1. (The exponent being small is an indication that we are getting close to the limits of precision.) This means that the float value is 2^1 = 2 times this significand.
Now, what happens when we try to represent adding 1 to this number? Well, we need to represent 1 at the same scale. But the smallest change we can make in this number is exactly 2, because the least significant bit of the significand has value 2!
That is, if we increment the significand, making the smallest possible change, we get
5000000000000001 = 10001110000110111100100110111111000001000000000000001 (base 2)
and when we apply the exponent, we get 2 * 5000000000000001 = 10000000000000002, which is exactly the value you observed. You can only have either 10000000000000000 or 10000000000000002, and 10000000000000001.1 is closer to the latter.
(Note that the issue here isn't even that decimal numbers aren't exact in binary! There's no binary "repeating decimals" here, and there's plenty of 0 bits on the right end of the significand — it's just that your input neatly falls just beyond the lowest bit.)
With numpy, you can see the next larger and smaller representable IEEE floating point number:
>>> import numpy as np
>>> huge=1e100
>>> tiny=1e-100
>>> np.nextafter(1e16,huge)
10000000000000002.0
>>> np.nextafter(1e16,tiny)
9999999999999998.0
So:
>>> (np.nextafter(1e16,huge)-np.nextafter(1e16,tiny))/2.0
2.0
And:
>>> 1.1>2.0/2
True
Therefore 1e16 + 1.1 is correctly rounded to the next larger IEEE representable number of 10000000000000002.0
As is:
>>> 1e16+1.0000000000000005
1.0000000000000002e+16
and 1e16-(something slightly larger than 1) is rounded down by 2 to the next smaller IEEE number:
>>> 1e16-1.0000000000000005
9999999999999998.0
Keep in mind that 32 bit vs 64 bit Python is irrelevant. It is the size of the IEEE format used that matters. Also keep in mind that the larger the magnitude of the number, the epsilon value (the spread between the two next larger and smaller IEEE values basically) changes.
You can see this in bits as well:
>>> def f_to_bits(f): return struct.unpack('<Q', struct.pack('<d', f))[0]
...
>>> def bits_to_f(bits): return struct.unpack('<d', struct.pack('<Q', bits))[0]
...
>>> bits_to_f(f_to_bits(1e16)+1)
1.0000000000000002e+16
>>> bits_to_f(f_to_bits(1e16)-1)
9999999999999998.0
Related
I'm pretty new to python, and I've made a table which calculates T=1+2^-n-1 and C=2^n, which both give the same values from n=40 to n=52, but for n=52 to n=61 I get 0.0 for T, whereas C gives me progressively smaller decimals each time - why is this?
I think I understand why T becomes 0.0, because of python using binary floating point and because of the machine epsilon value - but I'm slightly confused as to why C doesn't also become 0.0.
import numpy as np
import math
t=np.zeros(21)
c=np.zeros(21)
for n in range(40,61):
m=n-40
t[m]=1+2**(-n)-1
c[m]=2**(-n)
print (n,t[m],c[m])
The "floating" in floating point means that values are represented by storing a fixed number of leading digits and a scale factor, rather than assuming a fixed scale (which would be fixed point).
2**-53 only takes one (binary) digit to represent (not including the scale), but 1+2**-53 would take 54 to represent exactly. Python floats only have 53 binary digits of precision; 2**-53 can be represented exactly, but 1+2**-53 gets rounded to exactly 1, and subtracting 1 from that gives exactly 0. Thus, we have
>>> 2**-53
1.1102230246251565e-16
>>> 1+(2**-53)-1
0.0
Postscript: you might wonder why 2**-53 displays as a value not equal to the exact mathematical value when I said it was exact. That's due to the float->string conversion logic, which only keeps enough decimal digits to reconstruct the original float (instead of printing a bunch of digits at the end that are usually just noise).
The difference between both is indeed due to floating-point representation. Indeed, if you perform 1 + X where X is a very very small number, then the floating-point representation sets its exponent value to 0 and the precision is ensured by the mantissa, which is 52-bit on a 64-bit computer. Therefore, 1 + 2^(-X) if X > 52 is equal to 1. However, even 2^-100 can be represented in double-precision floating-point, so you can see C decrease for a larger number of samples.
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
I have read that minimal float value that Python support is something in power -308.
Exact number doesn't matter because:
>>> -1.42108547152e-14 + 360.0 == 360.0
True
How is that? I have CPython 2.7.3 on Windows.
It cause me errors. My problem will be fixed if I will compare my value -1.42108547152e-14 (computed somehow) to some "delta" and do this:
if v < delta:
v = 0
What delta should I choose? In other words, with values smaller than what this effect will occur?
Please note that NumPy is not available.
An (over)simplified explanation is: A (normal) double-precision floating point number holds (what is equivalent to) approximately 16 decimal digits. Let's try to do your addition by hand:
360.0000000000000000000000000
- 0.0000000000000142108547152
______________________________
359.9999999999999857891452848
If you round this to 16 figures (3 before the point and 13 after), you get 360.
Now, in reality this is done in binary. The "16 decimal digits" is therefore not a precise rule. In reality, the precision here (between 256.0 and 512.0) is 44 binary digits for the fractional part of the number. So the number closest to 360 which can be represented is 360 minus {2 to the -44th power}, which gives:
359.9999999999999431565811391 (truncated)
But since our result before was closer to 360.0 than to this number, 360.0 is what you get.
Most processors use IEEE 754 binary floating-point arithmetic. In this format, numbers are represented as a sign s, a fraction f, and an exponent e. The fraction is also called a significand.
The sign s is a bit 0 or 1 representing + or –, respectively.
In double precision, the significand f is a 53-bit binary numeral with a radix point after the first bit, such as 1.10100001001001100111000110110011001010100000001000112.
In double precision, the exponent e is an integer from –1022 to +1023.
The combined value represented by the sign, significand, and exponent is (-1)s•2e•f.
When you add two numbers, the processor figures out what exponent to use for the result. Then, given that exponent, it figures out what fraction to use for the result. When you add a large number and a small number, the entire result will not fit into the significand. So, the processor must round the mathematical result to something that will fit into the significand.
In the cases you ask about, the second added number is so small that the rounding produces the same value as the first number. In order to change the first number, you must add a value that is at least half the value of the lowest bit in the significand of the first number. (When rounding, if part that does not fit in the significand is more than half the lowest bit of the significand, it is rounded up. If it is exactly half, it is rounded up if that makes the lowest bit zero or down if that makes the lowest bit zero.)
There are additional issues in floating-point, such as subnormal numbers, infinities, how the exponent is stored, and so on, but the above explains the behavior you asked about.
In the specific case you ask about, adding to 360, adding any value greater than 2-45 will produce a sum greater than 360. Adding any positive value less than or equal to 2-45 will produce exactly 360. This is because the highest bit in 360 is 28, so the lowest bit in its significand is 2-44.
The built-in Python str() function outputs some weird results when passing in floats with many decimals. This is what happens:
>>> str(19.9999999999999999)
>>> '20.0'
I'm expecting to get:
>>> '19.9999999999999999'
Does anyone know why? and maybe workaround it?
Thanks!
It's not str() that rounds, it's the fact that you're using floats in the first place. Float types are fast, but have limited precision; in other words, they are imprecise by design. This applies to all programming languages. For more details on float quirks, please read "What Every Programmer Should Know About Floating-Point Arithmetic"
If you want to store and operate on precise numbers, use the decimal module:
>>> from decimal import Decimal
>>> str(Decimal('19.9999999999999999'))
'19.9999999999999999'
A float has 32 bits (in C at least). One of those bits is allocated for the sign, a few allocated for the mantissa, and a few allocated for the exponent. You can't fit every single decimal to an infinite number of digits into 32 bits. Therefore floating point numbers are heavily based on rounding.
If you try str(19.998), it will probably give you something at least close to 19.998 because 32 bits have enough precision to estimate that, but something like 19.999999999999999 is too precise to estimate in 32 bits, so it rounds to the nearest possible value, which happens to be 20.
Please note that this is a problem of understanding floating point (fixed-length) numbers. Most languages do exactly (or very similar to) what Python does.
Python float is IEEE 754 64-bit binary floating point. It is limited to 53 bits of precision i.e. slightly less than 16 decimal digits of precision. 19.9999999999999999 contains 18 decimal digits; it cannot be represented exactly as a float. float("19.9999999999999999") produces the nearest floating point value, which happens to be the same as float("20.0").
>>> float("19.9999999999999999") == float("20.0")
True
If by "many decimals" you mean "many digits after the decimal point", please be aware that the same "weird" results happen when there are many decimal digits before the decimal point:
>>> float("199999999999999999")
2e+17
If you want the full float precision, don't use str(), use repr():
>>> x = 1. / 3.
>>> str(x)
'0.333333333333'
>>> str(x).count('3')
12
>>> repr(x)
'0.3333333333333333'
>>> repr(x).count('3')
16
>>>
Update It's interesting how often decimal is prescribed as a cure-all for float-induced astonishment. This is often accompanied by simple examples like 0.1 + 0.1 + 0.1 != 0.3. Nobody stops to point out that decimal has its share of deficiencies e.g.
>>> (1.0 / 3.0) * 3.0
1.0
>>> (Decimal('1.0') / Decimal('3.0')) * Decimal('3.0')
Decimal('0.9999999999999999999999999999')
>>>
True, float is limited to 53 binary digits of precision. By default, decimal is limited to 28 decimal digits of precision.
>>> Decimal(2) / Decimal(3)
Decimal('0.6666666666666666666666666667')
>>>
You can change the limit, but it's still limited precision. You still need to know the characteristics of the number format to use it effectively without "astonishing" results, and the extra precision is bought by slower operation (unless you use the 3rd-party cdecimal module).
For any given binary floating point number, there is an infinite set of decimal fractions that, on input, round to that number. Python's str goes to some trouble to produce the shortest decimal fraction from this set; see GLS's paper http://kurtstephens.com/files/p372-steele.pdf for the general algorithm (IIRC they use a refinement that avoids arbitrary-precision math in most cases). You happened to input a decimal fraction that rounds to a float (IEEE double) whose shortest possible decimal fraction is not the same as the one you entered.
Python's math module contain handy functions like floor & ceil. These functions take a floating point number and return the nearest integer below or above it. However these functions return the answer as a floating point number. For example:
import math
f=math.floor(2.3)
Now f returns:
2.0
What is the safest way to get an integer out of this float, without running the risk of rounding errors (for example if the float is the equivalent of 1.99999) or perhaps I should use another function altogether?
All integers that can be represented by floating point numbers have an exact representation. So you can safely use int on the result. Inexact representations occur only if you are trying to represent a rational number with a denominator that is not a power of two.
That this works is not trivial at all! It's a property of the IEEE floating point representation that int∘floor = ⌊⋅⌋ if the magnitude of the numbers in question is small enough, but different representations are possible where int(floor(2.3)) might be 1.
To quote from Wikipedia,
Any integer with absolute value less than or equal to 224 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 253 can be exactly represented in the double precision format.
Use int(your non integer number) will nail it.
print int(2.3) # "2"
print int(math.sqrt(5)) # "2"
You could use the round function. If you use no second parameter (# of significant digits) then I think you will get the behavior you want.
IDLE output.
>>> round(2.99999999999)
3
>>> round(2.6)
3
>>> round(2.5)
3
>>> round(2.4)
2
Combining two of the previous results, we have:
int(round(some_float))
This converts a float to an integer fairly dependably.
That this works is not trivial at all! It's a property of the IEEE floating point representation that int∘floor = ⌊⋅⌋ if the magnitude of the numbers in question is small enough, but different representations are possible where int(floor(2.3)) might be 1.
This post explains why it works in that range.
In a double, you can represent 32bit integers without any problems. There cannot be any rounding issues. More precisely, doubles can represent all integers between and including 253 and -253.
Short explanation: A double can store up to 53 binary digits. When you require more, the number is padded with zeroes on the right.
It follows that 53 ones is the largest number that can be stored without padding. Naturally, all (integer) numbers requiring less digits can be stored accurately.
Adding one to 111(omitted)111 (53 ones) yields 100...000, (53 zeroes). As we know, we can store 53 digits, that makes the rightmost zero padding.
This is where 253 comes from.
More detail: We need to consider how IEEE-754 floating point works.
1 bit 11 / 8 52 / 23 # bits double/single precision
[ sign | exponent | mantissa ]
The number is then calculated as follows (excluding special cases that are irrelevant here):
-1sign × 1.mantissa ×2exponent - bias
where bias = 2exponent - 1 - 1, i.e. 1023 and 127 for double/single precision respectively.
Knowing that multiplying by 2X simply shifts all bits X places to the left, it's easy to see that any integer must have all bits in the mantissa that end up right of the decimal point to zero.
Any integer except zero has the following form in binary:
1x...x where the x-es represent the bits to the right of the MSB (most significant bit).
Because we excluded zero, there will always be a MSB that is one—which is why it's not stored. To store the integer, we must bring it into the aforementioned form: -1sign × 1.mantissa ×2exponent - bias.
That's saying the same as shifting the bits over the decimal point until there's only the MSB towards the left of the MSB. All the bits right of the decimal point are then stored in the mantissa.
From this, we can see that we can store at most 52 binary digits apart from the MSB.
It follows that the highest number where all bits are explicitly stored is
111(omitted)111. that's 53 ones (52 + implicit 1) in the case of doubles.
For this, we need to set the exponent, such that the decimal point will be shifted 52 places. If we were to increase the exponent by one, we cannot know the digit right to the left after the decimal point.
111(omitted)111x.
By convention, it's 0. Setting the entire mantissa to zero, we receive the following number:
100(omitted)00x. = 100(omitted)000.
That's a 1 followed by 53 zeroes, 52 stored and 1 added due to the exponent.
It represents 253, which marks the boundary (both negative and positive) between which we can accurately represent all integers. If we wanted to add one to 253, we would have to set the implicit zero (denoted by the x) to one, but that's impossible.
If you need to convert a string float to an int you can use this method.
Example: '38.0' to 38
In order to convert this to an int you can cast it as a float then an int. This will also work for float strings or integer strings.
>>> int(float('38.0'))
38
>>> int(float('38'))
38
Note: This will strip any numbers after the decimal.
>>> int(float('38.2'))
38
math.floor will always return an integer number and thus int(math.floor(some_float)) will never introduce rounding errors.
The rounding error might already be introduced in math.floor(some_large_float), though, or even when storing a large number in a float in the first place. (Large numbers may lose precision when stored in floats.)
Another code sample to convert a real/float to an integer using variables.
"vel" is a real/float number and converted to the next highest INTEGER, "newvel".
import arcpy.math, os, sys, arcpy.da
.
.
with arcpy.da.SearchCursor(densifybkp,[floseg,vel,Length]) as cursor:
for row in cursor:
curvel = float(row[1])
newvel = int(math.ceil(curvel))
Since you're asking for the 'safest' way, I'll provide another answer other than the top answer.
An easy way to make sure you don't lose any precision is to check if the values would be equal after you convert them.
if int(some_value) == some_value:
some_value = int(some_value)
If the float is 1.0 for example, 1.0 is equal to 1. So the conversion to int will execute. And if the float is 1.1, int(1.1) equates to 1, and 1.1 != 1. So the value will remain a float and you won't lose any precision.