Unsigned ints of arbitrary length in python - python

I am trying to simulate a fixed-point filter implementation. I want to capture low-level hardware features like 2s-complement wraparound/overflow and fixed register widths. Some of the registers widths are set by hardware features at unusual and long widths (ie 72b).
I've been making some progress using the built-in integers. The infinite width is incredibly useful... but I find myself fighting Python a lot because it sometimes wants to interpret a binary as a positive integer, and sometimes it seems to want to interpret a very similar binary as a negative 2's complement. For example:
>> a = 0b11111 # sign-extended -1
>> b = 0b0011
>> print("{0:b}".format(a*b))
5f
>> print("{0:b}".format((a*b)&a)) # Truncate to correct product length
11101 # == -3 in 2s complement. Great!
>> print("{0:b}".format(~((a*b)&a)+1)) # Actually perform the 2's complement
-11101 # Arrrrggggghhh
>> print("{0:b}".format((~((a*b)&a)&a)+1)) # Truncate with extreme prejudice
11 # OK. Fine.
I guess if I think hard enough I can figure out why all this works the way it does, but if I could just do it all in unsigned space without worrying about python adding sign bits it would make things easier and less error-prone. Anyone know if there's a relatively easy way to do this? I considered bit strings, but I have to do a lot of adds & multiplies in this application and built-in integer arithmetic is really useful for that.

~x is literally defined on arbitrary precision integers as -(x+1). It does not do bit arithmetic: ~0 is 255 in one-byte integers, 65535 in two-byte integers, 1023 for 10-bit integers etc; so defining ~ via bit inversion on stretchy integers is useless.
If a defines the fixed width of your integers (with 0b11111 saying you are working with five-bit numbers), bit inversion is as simple as a^x.
print("{0:b}".format(a ^ b)
# => 11100
Two's complement is meanwhile easiest done as a+1-b, or equivalently a^b+1:
print("{0:b}".format((a + 1) - b))
# => 11101
print("{0:b}".format((a ^ b) + 1))
# => 11101
tl;dr: Don't use ~ if you want to stay unsigned.

Related

How to adjust the numerical precision for integers

I'm trying to work with big numbers in R, in my opinion they aren't even that big. I asked R to return me the module of the division of 6001532020609003100 by 97, I got answer 1; when doing the same calculation in Python I got answer 66.
Can someone tell me what's going on?
R doesn't have the same kind of "magic", arbitrary-length integers that Python does: its base integer type is 32 bit, which maxes out at .Machine$integer.max == 2147483647. When confronted with a number greater than this value R automatically converts it to double-precision floating point; then the %% operator gets messed up by floating-point imprecision. (If you try to insist that the input is an integer by entering 6001532020609003100L (L indicates integer) R still converts it to float, but warns you ...)
#JonSpring is right that you can do completely arbitrary-length integer computation (up to your computer's memory capability) with Rmpfr, but you can
also use the bit64 package for 64-bit integers, which your example just fits into:
library(bit64)
x <- as.integer64("6001532020609003100")
x %% 97
## [1] 66
But doubling this value puts you out of the integer-64 range: 2*x gives an overflow error.
Honestly, if you want to do a lot of big-integer calculation I'd say that Python is more convenient ...
library(Rmpfr)
as.integer(mpfr("6001532020609003100") %% 97)
[1] 66

Reverse bit in python

Reverse bits of a given 32 bits unsigned integer.
For example, given input 43261596 (represented in binary as
00000010100101000001111010011100), return 964176192 (represented in
binary as 00111001011110000010100101000000).
This does not work
def reverseBits(self, n):
return int(bin(n)[:1:-1], 2)
Your problem is in assuming Python's bin produces a 32 bit aligned output. It doesn't; it outputs the smallest number of bits possible. Python 3's int type has an unbounded number of bits, and even in Python 2, int will auto-promote to long if it overflows the bounds of int (which is not related to the limits of C's int).
If you want it to act like a specific width, the easiest way is to use formatting tools with more control (which will also simplify your slice operation).
For example, by formatting to a fixed 32 characters wide, padding with zeroes, you get your desired result:
>>> int('{:032b}'.format(43261596)[::-1], 2)
964176192
The answer is in the output of bin():
>>> bin(12345)
'0b11000000111001'
As you can see, it only outputs the first 14 ones and zeros. This is because it removes any leading zeros. Why does it do this? Well, python doesn't use a fixed size for integers like many other languages. The ints might be any number of bytes in practice, depending on the system and implementation.
So instead of 00000000000000001111111111111111 becoming 11111111111111110000000000000000, it becomes 1111111111111111 instead

How can I get sign bit of an integer in python?

I want to be able to access the sign bit of a number in python. I can do something like n >> 31 in C since int is represented as 32 bits.
I can't make use of the conditional operator and > <.
in python 3 integers don't have a fixed size, and aren't represented using the internal CPU representation (which allows to handle very large numbers without trouble).
So the best way is
signbit = 1 if n < 0 else 0
or
signbit = int(n < 0)
EDIT: if you cannot use < or > (which is ludicrious but so be it) you could use the fact that a-b will be positive if a is greater than b, so you could do
abs(a-b) == a-b
that doesn't use < or > (at least in the text, because abs uses it you can trust me)
I would argue that in Python there is not really a concept of a sign bit. As far as the programmer is concerned, an int is just a type with certain behavior. You don't get access to the low-level representation. Even bin(-3) returns a "negative" binary representation: '-0b11'
However, there are ways to get the sign or the bigger if two integers without comparisons. The following approach abuses floating point math to avoid comparisons.
def sign(a):
try:
return (1 - int(a / (a**2)**0.5)) // 2
except ZeroDivisionError:
return 0
def return_bigger(a, b):
s = sign(b - a)
return a * s + b * (1 - s)
assert sign(-33) == 1
assert sign(33) == 0
assert return_bigger(10, 15) == 15
assert return_bigger(25, 3) == 25
assert return_bigger(42, 42) == 42
(a**2)**0.5 could be replaced with abs but I bet internally this is implemented with a comparison.
The try/except is not needed if you don't care about 0 or equal integers (or there may be another horrible math workaround).
Finally, I'd like to point out that I have absolutely no idea why on earth anybody would want to implement something like that, except for the hell of it.
Conceptually, the bit representation of a negative integer is padded with an infinite number of 1 bits to the left (just like a non-negative number is regarded as padded with an infinite number of 0 bits). The operation n >> 31 does work (given that n is in the range of signed 32-bit numbers) in the sense that it places the sign bit (or if you prefer, one of the left-padding bits) in the lowest bit position. You just need to get rid of the rest of the left-padding bits, which you can do with a bitwise and operation like this:
n >> 31 & 1
Or you can make use of the fact that all one bits is how −1 is represented, and simply negate the result:
-(n >> 31)
Alternatively, you can cut off all but the lowest 32 1 bits before you do the shift, like this:
(n & 0xffffffff) >> 31
This is all under the assumption that you are working with numbers that fit into a signed 32-bit int. If n might need a 64 bit representation, shift by 63 places instead of 31, and if it's just 16 bits numbers, shifting by 15 places is enoough. (If you use the (n & 0xffffffff) >> 31 variant, adjust the number of fs accordingly).
On machine code level, and-ing/negating and shifting is potentially much more efficient than using comparison. The former is just a couple of machine instructions, while the latter would usually boil down to a branch. Branching doesn't just take more machine instructions, it also has a bad influence on the pipelining and out-of-order execution of modern CPUs. But Python execution takes place in a higher-level layer than machine code execution, and therefore it's difficult to say anything about the performance impact in Python: it may depend on the context – as it would in machine code – and may therefore also be difficult to test generally. (Caveat: I don't know much about how low-level execution happens in CPython, or in Python in general. For someone who does, this might not be so difficult to say.)
If you don't know how big n is (in Python, an integer is not required to fit into any specific number of bits), you can use bit_length() to find out. This will work for integers of any size:
-(n >> n.bit_length())
The bit_length() operation might boil down to a single machine instruction, or it might actually need a loop to find the result, depending on the implementation and the underlying machine architecture. In the latter case, this should be noticeably more costly than using a constant
Final remark: in C, n >> 31 is actually not guaranteed to work like you assume, because the C language leaves it undefined whether >> does logical right shift (like you assume) or arithmetic shift right (like Python does). In some languages, these are different operators, which makes it clear what you get. In Java for instance, logical shift right is >>>, and arithmetic shift right is >>.
How about this?
def is_negative(num, places):
return not long(num * 10**places) & 0xFFFFFFFF == long(num * 10**places)
Not efficient, but not using < or > definitely restricts you to weirdness. Note that zero will evaluate as positive.

Reversible hash function?

I need a reversible hash function (obviously the input will be much smaller in size than the output) that maps the input to the output in a random-looking way. Basically, I want a way to transform a number like "123" to a larger number like "9874362483910978", but not in a way that will preserve comparisons, so it must not be always true that, if x1 > x2, f(x1) > f(x2) (but neither must it be always false).
The use case for this is that I need to find a way to transform small numbers into larger, random-looking ones. They don't actually need to be random (in fact, they need to be deterministic, so the same input always maps to the same output), but they do need to look random (at least when base64encoded into strings, so shifting by Z bits won't work as similar numbers will have similar MSBs).
Also, easy (fast) calculation and reversal is a plus, but not required.
I don't know if I'm being clear, or if such an algorithm exists, but I'd appreciate any and all help!
None of the answers provided seemed particularly useful, given the question. I had the same problem, needing a simple, reversible hash for not-security purposes, and decided to go with bit relocation. It's simple, it's fast, and it doesn't require knowing anything about boolean maths or crypo algorithms or anything else that requires actual thinking.
The simplest would probably be to just move half the bits left, and the other half right:
def hash(n):
return ((0x0000FFFF & n)<<16) + ((0xFFFF0000 & n)>>16)
This is reversible, in that hash(hash(n)) = n, and has non-sequential pairs {n,m}, n < m, where hash(m) < hash(n).
And to get a much less sequential looking implementation, you might also want to consider an interlace reordering from [msb,z,...,a,lsb] to [msb,lsb,z,a,...] or [lsb,msb,a,z,...] or any other relocation you feel gives an appropriately non-sequential sequence for the numbers you deal with, or even add a XOR on top for peak desequential'ing.
(The above function is safe for numbers that fit in 32 bits, larger numbers are guaranteed to cause collisions and would need some more bit mask coverage to prevent problems. That said, 32 bits is usually enough for any non-security uid).
Also have a look at the multiplicative inverse answer given by Andy Hayden, below.
Another simple solution is to use multiplicative inverses (see Eri Clippert's blog):
we showed how you can take any two coprime positive integers x and m and compute a third positive integer y with the property that (x * y) % m == 1, and therefore that (x * z * y) % m == z % m for any positive integer z. That is, there always exists a “multiplicative inverse”, that “undoes” the results of multiplying by x modulo m.
We take a large number e.g. 4000000000 and a large co-prime number e.g. 387420489:
def rhash(n):
return n * 387420489 % 4000000000
>>> rhash(12)
649045868
We first calculate the multiplicative inverse with modinv which turns out to be 3513180409:
>>> 3513180409 * 387420489 % 4000000000
1
Now, we can define the inverse:
def un_rhash(h):
return h * 3513180409 % 4000000000
>>> un_rhash(649045868) # un_rhash(rhash(12))
12
Note: This answer is fast to compute and works for numbers up to 4000000000, if you need to handle larger numbers choose a sufficiently large number (and another co-prime).
You may want to do this with hexidecimal (to pack the int):
def rhash(n):
return "%08x" % (n * 387420489 % 4000000000)
>>> rhash(12)
'26afa76c'
def un_rhash(h):
return int(h, 16) * 3513180409 % 4000000000
>>> un_rhash('26afa76c') # un_rhash(rhash(12))
12
If you choose a relatively large co-prime then this will seem random, be non-sequential and also be quick to calculate.
What you are asking for is encryption. A block cipher in its basic mode of operation, ECB, reversibly maps a input block onto an output block of the same size. The input and output blocks can be interpreted as numbers.
For example, AES is a 128 bit block cipher, so it maps an input 128 bit number onto an output 128 bit number. If 128 bits is good enough for your purposes, then you can simply pad your input number out to 128 bits, transform that single block with AES, then format the output as a 128 bit number.
If 128 bits is too large, you could use a 64 bit block cipher, like 3DES, IDEA or Blowfish.
ECB mode is considered weak, but its weakness is the constraint that you have postulated as a requirement (namely, that the mapping be "deterministic"). This is a weakness, because once an attacker has observed that 123 maps to 9874362483910978, from then on whenever she sees the latter number, she knows the plaintext was 123. An attacker can perform frequency analysis and/or build up a dictionary of known plaintext/ciphertext pairs.
Basically, you are looking for 2 way encryption, and one that probably uses a salt.
You have a number of choices:
TripleDES
AES
Here is an example:" Simple insecure two-way "obfuscation" for C#
What language are you looking at? If .NET then look at the encryption namespace for some ideas.
Why not just XOR with a nice long number?
Easy. Fast. Reversible.
Or, if this doesn't need to be terribly secure, you could convert from base 10 to some smaller base (like base 8 or base 4, depending on how long you want the numbers to be).

Python, len, and size of ints

So, cPython (2.4) has some interesting behaviour when the length of something gets near to 1<<32 (the size of an int).
r = xrange(1<<30)
assert len(r) == 1<<30
is fine, but:
r = xrange(1<<32)
assert len(r) == 1<<32
ValueError: xrange object size cannot be reported`__len__() should return 0 <= outcome
Alex's wowrange has this behaviour as well. wowrange(1<<32).l is fine, but len(wowrange(1<<32)) is bad. I'm guessing there is some floating point behaviour (being read as negative) action going on here.
What exactly is happening here? (this is pretty well-solved below!)
How can I get around it? Longs?
(My specific application is random.sample(xrange(1<<32),ABUNCH)) if people want to tackle that question directly!)
cPython assumes that lists fit in memory. This extends to objects that behave like lists, such as xrange. essentially, the len function expects the __len__ method to return something that is convertable to size_t, which won't happen if the number of logical elements is too large, even if those elements don't actually exist in memory.
You'll find that
xrange(1 << 31 - 1)
is the last one that behaves as you want. This is because the maximum signed (32-bit) integer is 2^31 - 1.
1 << 32 is not a positive signed 32-bit integer (Python's int datatype), so that's why you're getting that error.
In Python 2.6, I can't even do xrange(1 << 32) or xrange(1 << 31) without getting an error, much less len on the result.
Edit If you want a little more detail...
1 << 31 represents the number 0x80000000 which in 2's complement representation is the lowest representable negative number (-1 * 2^31) for a 32-bit int. So yes, due to the bit-wise representation of the numbers you're working with, it's actually becoming negative.
For a 32-bit 2's complement number, 0x7FFFFFFF is the highest representable integer (2^31 - 1) before you "overflow" into negative numbers.
Further reading, if you're interested.
Note that when you see something like 2147483648L in the prompt, the "L" at the end signifies that it's now being represented as a "long integer" (64 bits, usually, I can't make any promises on how Python handles it because I haven't read up on it).
1<<32, when treated as a signed integer, is negative.

Categories