I'm trying to work with big numbers in R; in my opinion they aren't even that big. I asked R for the remainder of dividing 6001532020609003100 by 97 and got 1; doing the same calculation in Python gives 66.
Can someone tell me what's going on?
R doesn't have the same kind of "magic", arbitrary-length integers that Python does: its base integer type is 32-bit, which maxes out at .Machine$integer.max == 2147483647. When confronted with a number greater than this value, R automatically converts it to double-precision floating point; the %% operator then gets messed up by floating-point imprecision. (If you try to insist that the input is an integer by entering 6001532020609003100L (the L suffix indicates an integer), R still converts it to a float, but warns you ...)
@JonSpring is right that you can do completely arbitrary-length integer computation (up to your computer's memory capacity) with Rmpfr, but you can also use the bit64 package for 64-bit integers, which your example just fits into:
library(bit64)
x <- as.integer64("6001532020609003100")
x %% 97
## [1] 66
But doubling this value puts you out of the integer-64 range: 2*x gives an overflow error.
Honestly, if you want to do a lot of big-integer calculation I'd say that Python is more convenient ...
library(Rmpfr)
as.integer(mpfr("6001532020609003100") %% 97)
[1] 66
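For comparison, the same computation in Python needs no extra packages, since Python integers are arbitrary precision. A small sketch (the second line just illustrates the rounding that R's conversion to double runs into; the printed difference is whatever 53-bit rounding produces):
x = 6001532020609003100
print(x % 97)              # 66: exact integer arithmetic at any size
print(int(float(x)) - x)   # nonzero: a 53-bit double cannot hold all 19 digits exactly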
I was trying to process some rather large numbers in python and came across an overflow error. I decided to investigate a little bit more and came across an inequality I cannot explain.
When I evaluate 10^26 I get:
>>> 10**26
100000000000000000000000000
Which is perfectly logical. However when I evaluate 10e26 and convert it to an int I get:
>>> int(10e26)
1000000000000000013287555072
Why is this?
Do I not understand the e notation properly? (From what I know 10e26 is 10*10^26 as seen in this answer: 10e notation used with variables?)
10^26 is way past the max integer size, so I was also wondering whether there is any mechanism in Python for working with numbers in scientific format (without spelling out all those zeros), so that operations on numbers past the max size are still possible.
The short answer is that 10e26 and 10**26 do not represent identical values.
10**26, with both operands being int values, evaluates to an int. As int represents integers with arbitrary precision, its value is exactly 10^26 as intended.
10e26, on the other hand, is a float literal, and as such the resulting value is subject to the limited precision of the float type on your machine. The result of int(10e26) is the integer value of the float closest to the real number 10^27.
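A quick interactive check of that type difference (Python 3 shown; the large values match those reported in the question):
>>> 10 ** 26                 # int ** int: exact
100000000000000000000000000
>>> 10e26                    # float literal: ten times 10**26, rounded to a double
1e+27
>>> int(10e26)               # the double nearest 10**27, converted back to an int
1000000000000000013287555072
>>> type(10 ** 26), type(10e26)
(<class 'int'>, <class 'float'>)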
10e26 represents ten times ten to the power of 26, which is 10^27.
10**26 represents ten to the power of 26, which is 10^26.
Obviously, these are different, so 10e26 == 10**26 is false.
However, if we correct the mistake so we compare 1e26 and 10**26 by evaluating 1e26 == 10**26, we get false for a different reason:
1e26 is evaluated in a limited-precision floating-point format, producing 100000000000000004764729344 in most implementations. (Python is not strict about the floating-point format.) 100000000000000004764729344 is the closest one can get to 10^26 using 53 significant bits.
10**26 is evaluated with integer arithmetic, producing 100000000000000000000000000.
Comparing them reports they are different.
(I am uncertain of Python semantics, but I presume it converts the floating-point value to an extended-precision integer for the comparison. If we instead convert the integer to floating-point, with float(10**26) == 1e26, the conversion of 100000000000000000000000000 to float produces the same value, 100000000000000004764729344, and the comparison returns true.)
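Spelled out as a session (Python 3, assuming IEEE-754 doubles):
>>> 1e26 == 10 ** 26          # float vs int is compared exactly, and the values differ
False
>>> int(1e26)                 # the double nearest 10**26
100000000000000004764729344
>>> float(10 ** 26) == 1e26   # rounding 10**26 to a double gives that same value
True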
I am trying to simulate a fixed-point filter implementation. I want to capture low-level hardware features like two's-complement wraparound/overflow and fixed register widths. Some of the register widths are set by hardware features at unusual and long widths (i.e. 72 bits).
I've been making some progress using the built-in integers. The infinite width is incredibly useful... but I find myself fighting Python a lot because it sometimes wants to interpret a bit pattern as a positive integer, and sometimes it seems to want to interpret a very similar bit pattern as a negative two's-complement number. For example:
>>> a = 0b11111 # sign-extended -1
>>> b = 0b0011
>>> print("{0:b}".format(a*b))
1011101
>>> print("{0:b}".format((a*b)&a)) # Truncate to correct product length
11101 # == -3 in two's complement. Great!
>>> print("{0:b}".format(~((a*b)&a)+1)) # Actually perform the two's complement
-11101 # Arrrrggggghhh
>>> print("{0:b}".format((~((a*b)&a)&a)+1)) # Truncate with extreme prejudice
11 # OK. Fine.
I guess if I think hard enough I can figure out why all this works the way it does, but if I could just do it all in unsigned space without worrying about Python adding sign bits, it would make things easier and less error-prone. Does anyone know if there's a relatively easy way to do this? I considered bit strings, but I have to do a lot of adds & multiplies in this application and built-in integer arithmetic is really useful for that.
~x on arbitrary-precision integers is literally defined as -(x+1). It does not do width-dependent bit inversion: inverting the bits of 0 would give 255 for one-byte integers, 65535 for two-byte integers, 1023 for 10-bit integers, and so on, so defining ~ via bit inversion on stretchy integers is useless.
If a defines the fixed width of your integers (with 0b11111 saying you are working with five-bit numbers), bit inversion is as simple as a^x.
print("{0:b}".format(a ^ b)
# => 11100
Two's-complement negation, meanwhile, is easiest done as a+1-b, or equivalently (a^b)+1:
print("{0:b}".format((a + 1) - b))
# => 11101
print("{0:b}".format((a ^ b) + 1))
# => 11101
tl;dr: Don't use ~ if you want to stay unsigned.
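If you do a lot of this, it may be less error-prone to wrap the masking in a couple of helpers. A minimal sketch under the same 5-bit assumption (the helper names are made up for illustration):
WIDTH = 5
MASK = (1 << WIDTH) - 1            # 0b11111, the 'a' used as a mask above

def wrap(x, mask=MASK):
    # Truncate x to the register width (unsigned view).
    return x & mask

def negate(x, mask=MASK):
    # Two's-complement negation within the register width.
    return (mask + 1 - x) & mask   # equivalently ((x ^ mask) + 1) & mask

def to_signed(x, width=WIDTH):
    # Reinterpret an unsigned register value as a signed number.
    x &= (1 << width) - 1
    return x - (1 << width) if x & (1 << (width - 1)) else x

a, b = 0b11111, 0b0011             # -1 and 3 in 5-bit two's complement
prod = wrap(a * b)
print(format(prod, "b"), to_signed(prod))   # 11101 -3
print(format(negate(prod), "b"))            # 11, i.e. +3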
I'm looking at _math.c in git (line 25):
#if !defined(HAVE_ACOSH) || !defined(HAVE_ASINH)
static const double ln2 = 6.93147180559945286227E-01;
static const double two_pow_p28 = 268435456.0; /* 2**28 */
and I noticed that the ln2 value is different from the WolframAlpha value for ln2 (the trailing digits are where they differ):
ln2 = 0.693147180559945286227 (cpython)
ln2 = 0.6931471805599453094172321214581 (wolframalpha)
ln2 = 0.693147180559945309417232121458 (wikipedia)
so my question is why there is a difference? what am I missing?
As user2357112 noted, this code came from FDLIBM. That was carefully written for IEEE-754 machines, where C doubles have 53 bits of precision. It doesn't really care what the actual log of 2 is, but cares a whole lot about the best 53-bit approximation to log(2).
To reproduce the intended 53-bit-precise value, 17 decimal digits would have sufficed.
So why did they use 21 decimal digits instead? My guess: 21 decimal digits is the minimum needed to guarantee that the converted result will be correct to 64 bits of precision. Which may have been an issue at the time, if a compiler somehow decided to convert the literal to a Pentium's 80-bit float format (which has 64 bits of precision).
So they displayed the 53-bit result with enough decimal digits so that if it were converted to a binary float format with 64 bits of precision, the trailing 11 bits (=64-53) would all be zeroes, thus ensuring they'd be working with the 53-bit value they intended from the start.
>>> import mpmath
>>> x = mpmath.log(2)
>>> x
mpf('0.69314718055994529')
>>> mpmath.mp.prec = 64
>>> y = mpmath.mpf("0.693147180559945286227")
>>> x == y
True
>>> y
mpf('0.693147180559945286227')
In English, x is the 53-bit precise value of log(2), and y is the result of converting the decimal string in the code to a binary float format with 64 bits of precision. They're identical.
In current reality, I expect all compilers now convert the literal to the native IEEE-754 double format, with 53 bits of precision.
Either way, the code ensures the best 53-bit approximation to log(2) will be used.
Up to the precision of binary64 floating-point representation, these values are equal:
In [21]: 0.6931471805599453094172321214581 == 0.693147180559945286227
Out[21]: True
0.693147180559945286227 is what you get if you store the most accurate representable approximation of ln(2) into a 64-bit float and then print it to that many digits. Trying to stuff more digits in a float just gets the result rounded to the same value:
In [23]: '%.21f' % 0.6931471805599453094172321214581
Out[23]: '0.693147180559945286227'
This code came from FDLIBM. As for why they wrote 0.693147180559945286227 in the code, you'd have to ask the people who wrote FDLIBM at Sun back in 1993.
Python's value seems wrong, although I'm not sure whether it is an oversight or has a deeper meaning. BlackJack's explanation seems reasonable, but I don't understand why they would give additional digits that are wrong.
You can check this yourself by using the formula under "More efficient series". In Mathematica, you can calculate it up to i = 70 (35 summands) with
log2 = 2*Sum[1/i*(1/3)^i, {i, 1, 70, 2}]
(*
79535292197135923776615186805136682215642574454974413288086/
114745171628462663795273979107442710223059517312975273318225
*)
With N[log2,30] you get the correct digits
0.693147180559945309417232121458
which supports the correctness of Wikipedia and W|A. If you like, you can do the same calculation with machine-precision numbers; in Mathematica, this usually means double precision.
logC = Compile[{{z, _Real, 0}},
2.0*Sum[1/i*((z - 1)/(z + 1))^i, {i, 1, 100, 2}]
]
Note that this code gets compiled down to a plain iteration and does not use any error-reducing summation scheme, so there is no magical compiled Sum function. On my machine this gives:
logC[2]//FullForm
(* 0.6931471805599451` *)
and is correct up to the digits you pointed out. This matches the precision suggested by BlackJack:
$MachinePrecision
(* 15.9546 *)
Edit
As pointed out in comments and answers, the value you see in _math.c might be the 53-bit representation:
digits = RealDigits[log2, 2, 53];
N[FromDigits[digits, 2], 21]
(* 0.693147180559945286227 *)
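As a cross-check in Python, mpmath can reproduce the same 53-bit rounding and print it to 21 digits (a sketch; assumes mpmath is available):
>>> import mpmath
>>> mpmath.mp.prec = 53               # IEEE-754 double precision
>>> mpmath.nstr(mpmath.log(2), 21)    # the 53-bit value, shown to 21 decimal digits
'0.693147180559945286227'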
I am trying to write a program in Python 2.7 that will first see if one number divides another evenly, and if it does, get the result of the division.
However, I am getting some interesting results when I use large numbers.
Currently I am using:
from __future__ import division
import math
a=82348972389472433334783
b=2
if a/b == math.trunc(a/b):
    answer = a/b
    print 'True'  # to quickly see if the if block was entered
When I run this I get:
True
But 82348972389472433334783 is clearly not even.
Any help would be appreciated.
That's a crazy way to do it. Just use the remainder operator.
if a % b == 0:
    # then b divides a evenly
    quotient = a // b
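With the number from the question, the check is immediate (a is odd, so 2 does not divide it):
>>> a = 82348972389472433334783
>>> a % 2
1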
The true division implicitly converts the inputs to floats, which don't provide enough precision to store the value of a accurately. E.g. on my machine
>>> int(1E15+1)
1000000000000001
>>> int(1E16+1)
10000000000000000
hence you lose precision. A similar thing happens with your big number (compare int(float(a))-a).
Now, if you check your division, you see the result "is" actually found to be an integer
>>> (a/b).is_integer()
True
which is again not really expected beforehand.
The math.trunc function does something similar (from the docs):
Return the Real value x truncated to an Integral (usually a long integer).
The duck-typing nature of Python allows comparing the long integer with the float; see
Checking if float is equivalent to an integer value in python and
Comparing a float and an int in Python.
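A short demonstration of that precision loss with the asker's value, using true division as in the question (the exact difference depends only on double rounding; the point is that it is not zero):
>>> a = 82348972389472433334783
>>> int(float(a)) - a != 0       # the nearest double is not a itself
True
>>> (a / 2).is_integer()         # the float result looks integral even though a is odd
True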
Why don't you use the modulus operator instead to check if a number can be divided evenly?
n % x == 0
So, CPython (2.4) has some interesting behaviour when the length of something gets near to 1<<32 (the limit of a 32-bit int).
r = xrange(1<<30)
assert len(r) == 1<<30
is fine, but:
r = xrange(1<<32)
assert len(r) == 1<<32
ValueError: xrange object size cannot be reported (__len__() should return 0 <= outcome)
Alex's wowrange has this behaviour as well. wowrange(1<<32).l is fine, but len(wowrange(1<<32)) is bad. I'm guessing there is some floating-point behaviour going on here (the size being read as negative).
What exactly is happening here? (this is pretty well-solved below!)
How can I get around it? Longs?
(My specific application is random.sample(xrange(1<<32), ABUNCH), if people want to tackle that question directly!)
CPython assumes that lists fit in memory. This extends to objects that behave like lists, such as xrange. Essentially, the len function expects the __len__ method to return something that is convertible to size_t, which won't happen if the number of logical elements is too large, even if those elements don't actually exist in memory.
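A hedged illustration of that limit on a modern CPython (the exact message wording varies by version):
class Huge(object):
    # Pretends to have more elements than a C size integer can report.
    def __len__(self):
        return 1 << 63

try:
    len(Huge())
except OverflowError as exc:
    print(exc)    # e.g. "cannot fit 'int' into an index-sized integer"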
You'll find that
xrange((1 << 31) - 1)
is the last one that behaves as you want. This is because the maximum signed (32-bit) integer is 2^31 - 1.
1 << 32 is not a positive signed 32-bit integer (Python's int datatype), so that's why you're getting that error.
In Python 2.6, I can't even do xrange(1 << 32) or xrange(1 << 31) without getting an error, much less len on the result.
Edit If you want a little more detail...
1 << 31 represents the number 0x80000000 which in 2's complement representation is the lowest representable negative number (-1 * 2^31) for a 32-bit int. So yes, due to the bit-wise representation of the numbers you're working with, it's actually becoming negative.
For a 32-bit 2's complement number, 0x7FFFFFFF is the highest representable integer (2^31 - 1) before you "overflow" into negative numbers.
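You can see that reinterpretation directly by masking to 32 bits and applying the two's-complement rule by hand (a small illustrative helper, not part of the question's code):
def as_int32(x):
    # Reinterpret the low 32 bits of x as a signed two's-complement value.
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

print(as_int32(0x7FFFFFFF))   # 2147483647: the highest representable 32-bit signed value
print(as_int32(1 << 31))      # -2147483648: 0x80000000 wraps to the lowest negative value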
Further reading, if you're interested.
Note that when you see something like 2147483648L in the prompt, the "L" at the end signifies that it's now being represented as a "long integer", which in Python 2 is an arbitrary-precision type rather than a fixed machine width.
1<<32, when treated as a signed integer, is negative.
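For the random.sample use case specifically, one workaround is to draw the distinct values yourself instead of handing the whole range to sample; randrange copes with huge bounds fine. A rough sketch (the function name is made up; it is only efficient when the sample is much smaller than the range):
import random

def sample_big_range(n, k):
    # Draw k distinct integers from [0, n) without building the range
    # or calling len() on anything that large.
    if k > n:
        raise ValueError("sample larger than population")
    picked = set()
    while len(picked) < k:
        picked.add(random.randrange(n))   # uniform over [0, n); duplicates are retried
    return list(picked)

# e.g. a replacement for random.sample(xrange(1 << 32), ABUNCH):
# sample_big_range(1 << 32, ABUNCH)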