Python, len, and size of ints

So, CPython (2.4) has some interesting behaviour when the length of something gets near 1<<32 (the limit of a 32-bit int).
r = xrange(1<<30)
assert len(r) == 1<<30
is fine, but:
r = xrange(1<<32)
assert len(r) == 1<<32
ValueError: xrange object size cannot be reported
__len__() should return 0 <= outcome
Alex's wowrange has this behaviour as well. wowrange(1<<32).l is fine, but len(wowrange(1<<32)) is bad. I'm guessing there is some overflow behaviour (the length being read as negative) going on here.
What exactly is happening here? (this is pretty well-solved below!)
How can I get around it? Longs?
(My specific application is random.sample(xrange(1<<32), ABUNCH), if people want to tackle that question directly!)

CPython assumes that lists fit in memory. This extends to objects that behave like lists, such as xrange. Essentially, the len function expects the __len__ method to return something convertible to a signed C size (a C int in Python 2.4, Py_ssize_t in later versions), which won't happen if the number of logical elements is too large, even if those elements don't actually exist in memory.
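For the specific random.sample(xrange(1<<32), ABUNCH) application, one way around this is to never ask for the length at all and draw the values yourself. A minimal sketch, assuming the sample size is much smaller than 1<<32 so collisions stay rare (big_sample is a made-up name, not a library function):
import random

def big_sample(bits, k):
    # Sample k distinct integers from [0, 1 << bits) without calling len().
    seen = set()
    while len(seen) < k:
        seen.add(random.getrandbits(bits))  # uniform, works for any width
    return list(seen)

sample = big_sample(32, 1000)  # e.g. ABUNCH == 1000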

You'll find that
xrange((1 << 31) - 1)
is the last one that behaves as you want. (Note the parentheses: plain 1 << 31 - 1 parses as 1 << 30, because - binds tighter than <<.) This is because the maximum signed (32-bit) integer is 2^31 - 1.
1 << 32 is not a positive signed 32-bit integer (Python's int datatype), so that's why you're getting that error.
In Python 2.6, I can't even do xrange(1 << 32) or xrange(1 << 31) without getting an error, much less len on the result.
Edit: If you want a little more detail...
1 << 31 represents the number 0x80000000 which in 2's complement representation is the lowest representable negative number (-1 * 2^31) for a 32-bit int. So yes, due to the bit-wise representation of the numbers you're working with, it's actually becoming negative.
For a 32-bit 2's complement number, 0x7FFFFFFF is the highest representable integer (2^31 - 1) before you "overflow" into negative numbers.
Further reading, if you're interested.
Note that when you see something like 2147483648L in the prompt, the "L" at the end signifies that it's now being represented as a "long integer", which in Python is arbitrary-precision rather than a fixed 64 bits.
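For example, on a 32-bit CPython 2 build (the exact promotion point depends on the build; 64-bit builds promote at 1 << 63 instead):
>>> 1 << 30   # still a plain int
1073741824
>>> 1 << 31   # auto-promoted to long; note the trailing L
2147483648L
>>> type(1 << 31)
<type 'long'>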

1<<32 doesn't fit in a signed 32-bit integer at all, and anything from 1<<31 upward, when treated as signed, reads as negative.

Related

What is the exact definition of bitwise not in Python, given arbitrary length integers?

Bitwise not (~) is well-defined in languages that define a specific bit length and format for ints. Since in Python 3 ints can be any length, they by definition have a variable number of bits. Internally, I believe CPython uses at least 28 bytes to store an int, but of course those aren't the bits that bitwise not is defined on.
How does Python define bitwise not:
Is the bit length a function of the int size, the native platform, or something else?
Does Python sign extend, zero extend, or do something else?
Python integers emulate 2's-complement representation with "an infinite" number of sign bits (all 0 bits for >= 0, all 1 bits for < 0).
All bitwise integer operations maintain this useful illusion.
For example, 0 is an infinite number of 0 bits, and ~0 is an infinite number of 1 bits - which is -1 in 2's-complement notation.
>>> ~0
-1
It's generally true that ~i = -i-1, which, in English, can be read as "the 1's-complement of i is the 2's-complement of i minus 1".
For right shifts of integers, Python fills with more copies of the sign bit.
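A quick interactive check of both rules:
>>> all(~i == -i - 1 for i in range(-100, 100))
True
>>> -8 >> 1   # ...11111000 shifted right: copies of the sign bit come in
-4
>>> 8 >> 1    # non-negative, so 0 bits come in
4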

How to adjust the numerical precision for integers

I'm trying to work with big numbers in R; in my opinion they aren't even that big. I asked R for the remainder of dividing 6001532020609003100 by 97 and got the answer 1; doing the same calculation in Python, I got 66.
Can someone tell me what's going on?
R doesn't have the same kind of "magic", arbitrary-length integers that Python does: its base integer type is 32 bit, which maxes out at .Machine$integer.max == 2147483647. When confronted with a number greater than this value R automatically converts it to double-precision floating point; then the %% operator gets messed up by floating-point imprecision. (If you try to insist that the input is an integer by entering 6001532020609003100L (L indicates integer) R still converts it to float, but warns you ...)
@JonSpring is right that you can do completely arbitrary-length integer computation (up to your computer's memory capacity) with Rmpfr, but you can also use the bit64 package for 64-bit integers, which your example just fits into:
library(bit64)
x <- as.integer64("6001532020609003100")
x %% 97
## [1] 66
But doubling this value puts you out of the integer-64 range: 2*x gives an overflow error.
Honestly, if you want to do a lot of big-integer calculation I'd say that Python is more convenient ...
library(Rmpfr)
as.integer(mpfr("6001532020609003100") %% 97)
## [1] 66
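For comparison, the rounding R performs is easy to reproduce from the Python side by forcing the integer through a double first (a quick sanity check, not an Rmpfr feature):
>>> 6001532020609003100 % 97           # exact, arbitrary-precision
66
>>> int(float(6001532020609003100))    # nearest representable double
6001532020609003520
>>> float(6001532020609003100) % 97    # the rounded value mod 97, as in R
1.0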

Unsigned ints of arbitrary length in Python

I am trying to simulate a fixed-point filter implementation. I want to capture low-level hardware features like 2's-complement wraparound/overflow and fixed register widths. Some of the register widths are set by hardware features at unusual and long widths (i.e. 72 bits).
I've been making some progress using the built-in integers. The infinite width is incredibly useful... but I find myself fighting Python a lot, because it sometimes wants to interpret a bit pattern as a positive integer, and sometimes it seems to want to interpret a very similar pattern as negative 2's complement. For example:
>>> a = 0b11111 # sign-extended -1
>>> b = 0b0011
>>> print("{0:b}".format(a*b))
1011101
>>> print("{0:b}".format((a*b)&a)) # Truncate to correct product length
11101 # == -3 in 2s complement. Great!
>>> print("{0:b}".format(~((a*b)&a)+1)) # Actually perform the 2's complement
-11101 # Arrrrggggghhh
>>> print("{0:b}".format((~((a*b)&a)&a)+1)) # Truncate with extreme prejudice
11 # OK. Fine.
I guess if I think hard enough I can figure out why all this works the way it does, but if I could just do it all in unsigned space without worrying about Python adding sign bits, it would make things easier and less error-prone. Does anyone know if there's a relatively easy way to do this? I considered bit strings, but I have to do a lot of adds and multiplies in this application, and built-in integer arithmetic is really useful for that.
~x on Python's arbitrary-precision integers is literally defined as -(x+1); it does not do fixed-width bit arithmetic. The result of bit inversion depends on the width: ~0 is 255 for one-byte integers, 65535 for two-byte integers, 1023 for 10-bit integers, etc., so defining ~ via bit inversion on stretchy integers is useless.
If a defines the fixed width of your integers (with 0b11111 saying you are working with five-bit numbers), bit inversion is as simple as a^x.
print("{0:b}".format(a ^ b)
# => 11100
Two's complement is meanwhile easiest done as a + 1 - b, or equivalently (a ^ b) + 1 (note the parentheses; ^ binds less tightly than +):
print("{0:b}".format((a + 1) - b))
# => 11101
print("{0:b}".format((a ^ b) + 1))
# => 11101
tl;dr: Don't use ~ if you want to stay unsigned.
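Putting the masking discipline together, a minimal sketch of a fixed-width toolkit (mask and to_signed are hypothetical helper names, not from the question):
def mask(x, bits):
    # Truncate x to an unsigned bits-wide value; overflow wraps around.
    return x & ((1 << bits) - 1)

def to_signed(x, bits):
    # Reinterpret an unsigned bits-wide value as 2's complement, for display.
    sign = 1 << (bits - 1)
    return (x ^ sign) - sign

a, b = 0b11111, 0b00011        # -1 and 3 in 5-bit 2's complement
prod = mask(a * b, 5)          # stay unsigned at every step
print(format(prod, "05b"), to_signed(prod, 5))  # prints: 11101 -3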

Reverse bits in Python

Reverse the bits of a given 32-bit unsigned integer.
For example, given input 43261596 (represented in binary as
00000010100101000001111010011100), return 964176192 (represented in
binary as 00111001011110000010100101000000).
This does not work:
def reverseBits(self, n):
    return int(bin(n)[:1:-1], 2)
Your problem is in assuming Python's bin produces a 32-bit aligned output. It doesn't; it outputs the smallest number of bits possible. Python 3's int type has an unbounded number of bits, and even in Python 2, int will auto-promote to long if it overflows the bounds of int (which are tied to C's long, not C's int).
If you want it to act like a specific width, the easiest way is to use formatting tools with more control (which will also simplify your slice operation).
For example, by formatting to a fixed 32 characters wide, padding with zeroes, you get your desired result:
>>> int('{:032b}'.format(43261596)[::-1], 2)
964176192
The answer is in the output of bin():
>>> bin(12345)
'0b11000000111001'
As you can see, it only outputs 14 ones and zeros. This is because it removes any leading zeros. Why does it do this? Well, Python doesn't use a fixed size for integers like many other languages. The ints might be any number of bytes in practice, depending on the system and implementation.
So instead of 00000000000000001111111111111111 becoming 11111111111111110000000000000000, it becomes 1111111111111111.
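If you'd rather avoid string round-trips entirely, the classic shift-and-or loop does the same job (a standard technique, not taken from either answer above):
def reverse_bits_32(n):
    result = 0
    for _ in range(32):
        result = (result << 1) | (n & 1)  # push n's lowest bit onto result
        n >>= 1
    return result

assert reverse_bits_32(43261596) == 964176192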

How can I get the sign bit of an integer in Python?

I want to be able to access the sign bit of a number in python. I can do something like n >> 31 in C since int is represented as 32 bits.
I can't make use of the conditional operator or the comparison operators < and >.
In Python 3, integers don't have a fixed size and aren't represented using the internal CPU representation (which allows them to handle very large numbers without trouble).
So the best way is
signbit = 1 if n < 0 else 0
or
signbit = int(n < 0)
EDIT: if you cannot use < or > (which is ludicrous, but so be it), you could use the fact that a-b will be positive if a is greater than b, so you could check whether
abs(a-b) == a-b
which doesn't use < or > in its text (abs uses one internally, but you can trust me).
I would argue that in Python there is not really a concept of a sign bit. As far as the programmer is concerned, an int is just a type with certain behavior. You don't get access to the low-level representation. Even bin(-3) returns a "negative" binary representation: '-0b11'
However, there are ways to get the sign, or the bigger of two integers, without comparisons. The following approach abuses floating-point math to avoid comparisons.
def sign(a):
    try:
        return (1 - int(a / (a**2)**0.5)) // 2
    except ZeroDivisionError:
        return 0

def return_bigger(a, b):
    s = sign(b - a)
    return a * s + b * (1 - s)

assert sign(-33) == 1
assert sign(33) == 0
assert return_bigger(10, 15) == 15
assert return_bigger(25, 3) == 25
assert return_bigger(42, 42) == 42
(a**2)**0.5 could be replaced with abs but I bet internally this is implemented with a comparison.
The try/except is not needed if you don't care about 0 or equal integers (or there may be another horrible math workaround).
Finally, I'd like to point out that I have absolutely no idea why on earth anybody would want to implement something like that, except for the hell of it.
Conceptually, the bit representation of a negative integer is padded with an infinite number of 1 bits to the left (just like a non-negative number is regarded as padded with an infinite number of 0 bits). The operation n >> 31 does work (given that n is in the range of signed 32-bit numbers) in the sense that it places the sign bit (or if you prefer, one of the left-padding bits) in the lowest bit position. You just need to get rid of the rest of the left-padding bits, which you can do with a bitwise and operation like this:
n >> 31 & 1
Or you can make use of the fact that all one bits is how −1 is represented, and simply negate the result:
-(n >> 31)
Alternatively, you can cut off all but the lowest 32 1 bits before you do the shift, like this:
(n & 0xffffffff) >> 31
This is all under the assumption that you are working with numbers that fit into a signed 32-bit int. If n might need a 64-bit representation, shift by 63 places instead of 31, and if it's just 16-bit numbers, shifting by 15 places is enough. (If you use the (n & 0xffffffff) >> 31 variant, adjust the number of fs accordingly.)
On machine code level, and-ing/negating and shifting is potentially much more efficient than using comparison. The former is just a couple of machine instructions, while the latter would usually boil down to a branch. Branching doesn't just take more machine instructions, it also has a bad influence on the pipelining and out-of-order execution of modern CPUs. But Python execution takes place in a higher-level layer than machine code execution, and therefore it's difficult to say anything about the performance impact in Python: it may depend on the context – as it would in machine code – and may therefore also be difficult to test generally. (Caveat: I don't know much about how low-level execution happens in CPython, or in Python in general. For someone who does, this might not be so difficult to say.)
If you don't know how big n is (in Python, an integer is not required to fit into any specific number of bits), you can use bit_length() to find out. This will work for integers of any size:
-(n >> n.bit_length())
The bit_length() operation might boil down to a single machine instruction, or it might actually need a loop to find the result, depending on the implementation and the underlying machine architecture. In the latter case, this should be noticeably more costly than shifting by a constant.
Final remark: in C, n >> 31 is actually not guaranteed to work like you assume, because the C language leaves it implementation-defined whether >> on a negative value does logical right shift (like you assume) or arithmetic right shift (like Python does). In some languages, these are different operators, which makes it clear what you get. In Java for instance, logical shift right is >>>, and arithmetic shift right is >>.
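All of the variants above are easy to verify interactively (the fixed-width ones assume n fits in a signed 32-bit int):
>>> n = -5
>>> n >> 31 & 1
1
>>> -(n >> 31)
1
>>> (n & 0xffffffff) >> 31
1
>>> -(n >> n.bit_length())   # the width-independent variant
1
>>> -(5 >> (5).bit_length()) # and for a non-negative number
0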
How about this?
def is_negative(num, places):
    return not long(num * 10**places) & 0xFFFFFFFF == long(num * 10**places)
Not efficient, but not using < or > definitely restricts you to weirdness. Note that zero will evaluate as positive.
