In Python one can handle very large integers (for instance, uuid.uuid4().int.bit_length() gives 128), but the largest integer type the C API documentation offers is long long, which is typically a 64-bit int.
I would love to be able to get a C int128 from a PyLong, but it seems there is no tooling for this. PyLong_AsLongLong, for instance, cannot handle Python integers that don't fit in 64 bits.
Is there some documentation I missed, and is this actually possible?
Is it currently not possible, but does some workaround exist? (I would love to have the tooling the Python C API offers for long long mirrored for int128, for instance a PyLong_AsInt128AndOverflow function.)
Is it a planned feature in a forthcoming Python release?
There are a couple of different ways you can access the level of precision you want.
Some systems with 64-bit longs have 128-bit long longs; the C standard only requires "at least 64 bits", as the article you link notes. It's worth checking sizeof(long long) first, in case there's nothing further to do.
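You can check this from Python itself via ctypes, which mirrors the C types your interpreter was built with:

import ctypes

# Width of the platform's C long long; 64 bits is the typical result
# on mainstream platforms today.
print(ctypes.sizeof(ctypes.c_longlong) * 8)  # e.g. 64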
Assuming that is not what you are working with, you'll have to look more closely at the raw PyLongObject, which is actually a typedef of the private _longobject structure.
The raw bits are accessible through the ob_digit field, with the length given by ob_size. The data type of the digits, and the actual number of bits they hold, is given by the typedef digit and the macro PYLONG_BITS_IN_DIGIT. The latter must be smaller than 8 * sizeof(digit), larger than 8, and a multiple of 5 (so 30 or 15, depending on how your build was done).
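To make the layout concrete, here is a sketch in pure Python of how a value splits into such digits, assuming a 30-bit-digit build (PYLONG_BITS_IN_DIGIT == 30):

import uuid

# Sketch: split a Python int into base-2**30 "digits", least significant
# first, mirroring ob_digit on a 30-bit-digit build (assumed here).
SHIFT = 30
MASK = (1 << SHIFT) - 1

def to_digits(n):
    n = abs(n)
    digits = []
    while n:
        digits.append(n & MASK)
        n >>= SHIFT
    return digits

print(len(to_digits(uuid.uuid4().int)))  # usually 5 digits for 128 bits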
Luckily for you, there is an "undocumented" method in the C API that will copy the bytes of the number for you: _PyLong_AsByteArray. The comment in longobject.h reads:
/* _PyLong_AsByteArray: Convert the least-significant 8*n bits of long
v to a base-256 integer, stored in array bytes. Normally return 0,
return -1 on error.
If little_endian is 1/true, store the MSB at bytes[n-1] and the LSB at
bytes[0]; else (little_endian is 0/false) store the MSB at bytes[0] and
the LSB at bytes[n-1].
If is_signed is 0/false, it's an error if v < 0; else (v >= 0) n bytes
are filled and there's nothing special about bit 0x80 of the MSB.
If is_signed is 1/true, bytes is filled with the 2's-complement
representation of v's value. Bit 0x80 of the MSB is the sign bit.
Error returns (-1):
+ is_signed is 0 and v < 0. TypeError is set in this case, and bytes
isn't altered.
+ n isn't big enough to hold the full mathematical value of v. For
example, if is_signed is 0 and there are more digits in the v than
fit in n; or if is_signed is 1, v < 0, and n is just 1 bit shy of
being large enough to hold a sign bit. OverflowError is set in this
case, but bytes holds the least-significant n bytes of the true value.
*/
You should be able to get a UUID with something like
PyLongObject *mylong;   /* assumed to already point at your Python int */
unsigned char myuuid[16];
/* little_endian=1, is_signed=0: a UUID is an unsigned 128-bit value */
if (_PyLong_AsByteArray(mylong, myuuid, sizeof(myuuid), 1, 0) < 0) {
    /* an OverflowError or TypeError has been set */
}
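If you only need the bytes on the Python side, the documented equivalent (Python 3) is int.to_bytes, with the same endianness and signedness choices:

import uuid

# Python-level equivalent: 16 little-endian bytes of an unsigned
# 128-bit value, matching little_endian=1, is_signed=0 above.
value = uuid.uuid4().int
raw = value.to_bytes(16, byteorder='little')
assert int.from_bytes(raw, byteorder='little') == value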
I want to check the size of the int data type in Python:
import sys
sys.getsizeof(int)
It comes out to be "436", which doesn't make sense to me.
Anyway, I want to know how many bytes (2,4,..?) int will take on my machine.
The short answer
You're getting the size of the class, not of an instance of the class. Call int() to create an instance and take the size of that:
>>> sys.getsizeof(int())
24
If that size still seems a little bit large, remember that a Python int is very different from an int in (for example) C. In Python, an int is a fully-fledged object. This means there's extra overhead.
Every Python object contains at least a refcount and a reference to the object's type in addition to other storage; on a 64-bit machine, that takes up 16 bytes! The int internals (as determined by the standard CPython implementation) have also changed over time, so that the amount of additional storage taken depends on your version.
Some details about int objects in Python 2 and 3
Here's the situation in Python 2. (Some of this is adapted from a blog post by Laurent Luce). Integer objects are represented as blocks of memory with the following structure:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro defining the storage for the refcount and the object type. It's described in some detail by the documentation, and the code can be seen in this answer.
The memory is allocated in large blocks so that there's not an allocation bottleneck for every new integer. The structure for the block looks like this:
struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};
typedef struct _intblock PyIntBlock;
These are all empty at first. Then, each time a new integer is created, Python uses the memory pointed at by next and increments next to point to the next free integer object in the block.
I'm not entirely sure how this changes once you exceed the storage capacity of an ordinary integer, but once you do so, the size of an int gets larger. On my machine, in Python 2:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
24
>>> sys.getsizeof(2 ** 62)
24
>>> sys.getsizeof(2 ** 63)
36
In Python 3, I think the general picture is the same, but the size of integers increases in a more piecemeal way:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(2 ** 30 - 1)
28
>>> sys.getsizeof(2 ** 30)
32
>>> sys.getsizeof(2 ** 60 - 1)
32
>>> sys.getsizeof(2 ** 60)
36
These results are, of course, all hardware-dependent! YMMV.
The variability in integer size in Python 3 is a hint that they may behave more like variable-length types (like lists). And indeed, this turns out to be true. Here's the definition of the C struct for int objects in Python 3:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
The comments that accompany this definition summarize Python 3's representation of integers. Zero is represented not by a stored value, but by an object with size zero (which is why sys.getsizeof(0) is 24 bytes while sys.getsizeof(1) is 28). Negative numbers are represented by objects with a negative size attribute! So weird.
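You can see both points directly with sys.getsizeof; these numbers are from a typical 64-bit CPython 3 build and will vary:

import sys

# Zero stores no digits at all, and the sign lives in ob_size, so
# 1 and -1 occupy the same amount of memory.
print(sys.getsizeof(0))    # 24 on this build: header only, no digits
print(sys.getsizeof(1))    # 28: one 30-bit digit
print(sys.getsizeof(-1))   # 28: ob_size == -1, same digit storage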
I've recently learnt how to see the maximum integer (int) that my x64 arch can deal with (above this number, Python uses long).
Now I'm studying bitwise operators, and learnt on this site that integers in Python are stored in a two's complement system.
When I type:
print sys.maxsize.bit_length()
I get 63. I think this depends on my machine arch (Ubuntu 64-bit).
The questions are:
where is the 64th bit?
is it the leading 0 or 1 in the two's complement notation?
why isn't it included in the bit length?
Added:
why do negative numbers need 64 bits and not 63?
Because on your platform the value is derived from a signed integer. The maximum value fits in 63 bits, leaving the 64th bit for the sign.
Note that the int.bit_length() method gives you the minimum number of bits required to represent that specific integer and doesn't ever include leading zeros. It doesn't say anything about the underlying C integer:
>>> 1 .bit_length()
1
>>> 2 .bit_length()
2
>>> 3 .bit_length()
2
>>> 4 .bit_length()
3
From the int.bit_length() documentation:
Return the number of bits necessary to represent an integer in binary, excluding the sign and leading zeros
sys.maxsize usually reflects the maximum value a ssize_t C integer can hold, but it gives you a Python int object. The fact that the C type might use two's complement hardly matters to the Python type.
The source code merely converts a C constant to an int object; the constant is defined in pyport.h, so how the value is derived is platform dependent. For Linux that'll be:
typedef ssize_t Py_ssize_t;
/* ... */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))
Clearly the value must be using a two's complement signed number representation for that last part to work; the value -1 is shifted one bit to the right to arrive at the largest possible value. In two's complement, -1 is represented as all 1 bits, and shifting these to the right gives you a 0 followed by all 1s.
In the two's complement system of encoding it is the most significant bit (the left-most) that encodes the sign, so in a 64-bit number that leaves only the other 63 bits to encode the integer value.
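Putting the pieces together at the Python level (assuming a 64-bit platform):

import sys

# ((size_t)-1) >> 1 replayed in Python: all 64 bits set, shifted right
# once, equals sys.maxsize on a 64-bit build.
all_ones = (1 << 64) - 1
print(all_ones >> 1 == sys.maxsize)      # True on 64-bit platforms
print(sys.maxsize.bit_length())          # 63: the sign bit isn't counted
print((-sys.maxsize - 1).bit_length())   # 64: magnitude of -2**63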
In C, C++, and Java, an integer has a certain range. One thing I realized in Python is that I can calculate really large integers such as pow(2, 100). The equivalent code in C would clearly overflow, since on a 32-bit architecture the unsigned integer type ranges from 0 to 2^32 - 1. How is it possible for Python to calculate these large numbers?
Basically, big numbers in Python are stored in arrays of 'digits'. That's in quotes because each 'digit' can be quite a big number in its own right.
You can check the details of implementation in longintrepr.h and longobject.c:
There are two different sets of parameters: one set for 30-bit digits,
stored in an unsigned 32-bit integer type, and one set for 15-bit
digits with each digit stored in an unsigned short. The value of
PYLONG_BITS_IN_DIGIT, defined either at configure time or in pyport.h,
is used to decide which digit size to use.
/* Long integer representation.
   The absolute value of a number is equal to
       SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero. Also, in all cases, for all valid i,
       0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
*/
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
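The SUM formula in that comment is easy to replay at the Python level. A sketch, assuming a 30-bit-digit build (SHIFT == 30):

SHIFT = 30

def from_digits(ob_digit, ob_size):
    # Reconstruct the value from CPython-style digits and a signed size.
    value = sum(d << (SHIFT * i) for i, d in enumerate(ob_digit))
    return -value if ob_size < 0 else value

print(from_digits([0, 1], 2))    # 1073741824 == 2**30
print(from_digits([0, 1], -2))   # -1073741824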
How is it possible for Python to calculate these large numbers?
How is it possible for you to calculate these large numbers if you only have the 10 digits 0-9? Well, you use more than one digit!
Bignum arithmetic works the same way, except the individual "digits" are not 0-9 but, say, 0-4294967295 or 0-18446744073709551615.
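As an illustration of the idea (a toy sketch, not CPython's actual code), here is multi-digit addition in base 10**9; CPython does the same thing with base-2**30 or base-2**15 digits:

BASE = 10 ** 9  # toy base; CPython uses 2**30 or 2**15

def add_digits(a, b):
    """Add two numbers stored as lists of base-BASE digits, LSB first."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        total = carry
        if i < len(a):
            total += a[i]
        if i < len(b):
            total += b[i]
        result.append(total % BASE)
        carry = total // BASE
    if carry:
        result.append(carry)
    return result

print(add_digits([999999999, 1], [1]))  # [0, 2], i.e. 2 * 10**9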
My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.
The license generation and checking algorithm, however, is based on hashing methods which rely on the fact that an integer is of a fixed byte size and that bit-shifting a value will not extend the integer's byte count.
This is a simplified example code:
unsigned int HashString(const char* str) {
    unsigned int hash = 3151;
    while (*str != 0) {
        hash = (hash << 3) + (*str << 2) * 3;
        str++;
    }
    return hash;
}
How can this be translated to Python? The direct translation obviously yields a different result:
def hash_string(str):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash
For instance:
hash_string("foo bar spam") # 228667414299004
HashString("foo bar spam") // 3355459964
Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.
Mask the hash value with &:
def hash_string(str, _width=2**32-1):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash & _width
This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.
Demo:
>>> hash_string("foo bar spam")
3355459964
The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.
The easiest fix is just to correct at the end:
return hash % (1 << 32)
For very large strings, it may be a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of once will easily outweigh the cost of dealing with a 48-bit int.
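If you do want the mask-per-iteration variant, here is a sketch; the final value is identical because shifts, additions, and multiplications all commute with truncation modulo 2**32:

def hash_string_masked(s, _mask=2**32 - 1):
    # Masking every iteration keeps intermediates within 32 bits.
    h = 3151
    for c in s:
        h = ((h << 3) + (ord(c) << 2) * 3) & _mask
    return h

print(hash_string_masked("foo bar spam"))  # 3355459964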
PHP may have the same problem, or a different one.
PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)
But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.
Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.
So, CPython (2.4) has some interesting behaviour when the length of something gets near to 1 << 32 (the size of an int).
r = xrange(1<<30)
assert len(r) == 1<<30
is fine, but:
r = xrange(1<<32)
assert len(r) == 1<<32
ValueError: xrange object size cannot be reported
Alex's wowrange has this behaviour as well. wowrange(1<<32).l is fine, but len(wowrange(1<<32)) is bad. I'm guessing there is some overflow behaviour (the value being read as negative) going on here.
What exactly is happening here? (this is pretty well-solved below!)
How can I get around it? Longs?
(My specific application is random.sample(xrange(1<<32),ABUNCH)) if people want to tackle that question directly!)
CPython assumes that lists fit in memory. This extends to objects that behave like lists, such as xrange. Essentially, the len function expects the __len__ method to return something that is convertible to a C size type, which won't happen if the number of logical elements is too large, even if those elements don't actually exist in memory.
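For the random.sample(xrange(1<<32), ABUNCH) application mentioned in the question, one workaround that never asks the range for its length is to draw 32-bit values directly and deduplicate. A sketch (ABUNCH stands in for the question's sample size):

import random

ABUNCH = 1000  # hypothetical stand-in for the question's sample size

# Sample distinct 32-bit values without materializing (or calling len()
# on) a 2**32-element sequence.
picked = set()
while len(picked) < ABUNCH:
    picked.add(random.getrandbits(32))
sample = list(picked)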
You'll find that
xrange((1 << 31) - 1)
is the last one that behaves as you want. This is because the maximum signed (32-bit) integer is 2^31 - 1. (Note the parentheses: Python parses 1 << 31 - 1 as 1 << (31 - 1), since - binds more tightly than <<.)
1 << 32 does not fit in a positive signed 32-bit integer (which is what Python 2's int datatype holds on that platform), so that's why you're getting that error.
In Python 2.6, I can't even do xrange(1 << 32) or xrange(1 << 31) without getting an error, much less len on the result.
Edit If you want a little more detail...
1 << 31 represents the number 0x80000000 which in 2's complement representation is the lowest representable negative number (-1 * 2^31) for a 32-bit int. So yes, due to the bit-wise representation of the numbers you're working with, it's actually becoming negative.
For a 32-bit 2's complement number, 0x7FFFFFFF is the highest representable integer (2^31 - 1) before you "overflow" into negative numbers.
Further reading, if you're interested.
Note that when you see something like 2147483648L in the prompt, the "L" at the end signifies that it's now being represented as a "long integer": Python 2's arbitrary-precision integer type, not a fixed 64-bit value.
1 << 32 doesn't fit in a signed 32-bit integer at all; 1 << 31 is already negative when interpreted as one.
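To see what C-style reinterpretation does to these values, here is a sketch that reads the low 32 bits of a Python int as a signed 32-bit value:

def as_int32(n):
    # Reinterpret the low 32 bits of n as a two's complement value,
    # the way a C int would see it.
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n & (1 << 31) else n

print(as_int32(1 << 31))      # -2147483648: already negative
print(as_int32(0x7FFFFFFF))   #  2147483647: the 32-bit maximum
print(as_int32(1 << 32))      #  0: bit 32 is truncated away entirely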