In Python, what is `sys.maxsize`? - python

I assumed that this number (2^63 - 1) was the maximum value Python could handle, or store as a variable. But these commands seem to be working fine:
>>> sys.maxsize
9223372036854775807
>>> a=sys.maxsize + 1
>>> a
9223372036854775808
So is there any significance at all? Can Python handle arbitrarily large numbers, if computation resources permit?
Note, here's the print-out of my version:
>>> sys.version
'3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]'

Python can handle arbitrarily large integers in computation. Any integer too big to fit in 64 bits (or whatever the underlying hardware limit is) is handled in software. For that reason, Python 3 doesn't have a sys.maxint constant.
The value sys.maxsize, on the other hand, reports the platform's pointer size, and that limits the size of Python's data structures such as strings and lists.
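A minimal sketch of that distinction (the variable names are only illustrative): integers grow past sys.maxsize without any special handling, and bit_length() shows how many bits each value needs.

```python
import sys

# Integers are arbitrary precision, so going past sys.maxsize just works.
beyond = sys.maxsize + 1
huge = 2 ** 200

# bit_length() reports how many bits are needed to represent each value.
print(beyond.bit_length())  # 64 on a typical 64-bit build
print(huge.bit_length())    # 201
```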

Documentation for sys.maxsize:
An integer giving the maximum value a variable of type Py_ssize_t can take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform. (Python 3 documentation)
The largest positive integer supported by the platform’s Py_ssize_t type, and thus the maximum size lists, strings, dicts, and many other containers can have. (Python 2 documentation)
What is Py_ssize_t?
It is an index type (number type for indexing things, like lists). It is the signed version of size_t (from the C language).
We don't use a normal number/ int, because this is unbounded in Python.
In Python, we don't use size_t because we want to support negative indexing; for example, we can write my_list[-4:]. So Py_ssize_t provides both negative and positive numbers within a fixed range.
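To make the signed versus unsigned difference concrete, here is a small sketch using ctypes, whose c_ssize_t and c_size_t types mirror the C types. It assumes CPython's usual ctypes behaviour of wrapping out-of-range values rather than raising.

```python
import ctypes
import sys

# The signed type can hold -1, which is what negative indexing needs.
signed_minus_one = ctypes.c_ssize_t(-1).value

# The unsigned type cannot: -1 wraps around to the largest unsigned value.
unsigned_minus_one = ctypes.c_size_t(-1).value

print(signed_minus_one)                            # -1
print(unsigned_minus_one == 2 * sys.maxsize + 1)   # True

# Negative indices are ordinary Python, backed by the signed index type.
items = [10, 20, 30, 40, 50]
print(items[-4:])                                  # [20, 30, 40, 50]
```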
The _t stands for type, to inform developers that size_t is a type name, not a variable. Just a convention.
So what is the effect of having a limit on Py_ssize_t? Why does this limit list, strings, dict size?
There is no way to index a list with an index larger than this. The list cannot get bigger than this, because it won't accept an index that doesn't fit in a Py_ssize_t.
In the dictionary case, Py_ssize_t is used as the hash type. Python doesn't use linked lists in its dictionary implementation; it uses open addressing (probing): if a collision is found, there is a systematic way of finding another slot in which to look up the key and put the value. So a dictionary in Python can't hold more entries than the maximum Py_ssize_t value.
In all practical cases (64 bit machines aka. probably you), you will run out of memory before you max out Py_ssize_t. Trying dict.fromkeys(range(sys.maxsize + 5)) never got there, it just slowed my computer down.
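The limit can be observed without allocating anything large. The sketch below relies on CPython behaviour (other implementations may raise different exceptions): an index or length that doesn't fit in Py_ssize_t is rejected up front.

```python
import sys

too_big = sys.maxsize + 1
small_list = [1, 2, 3]

# Indexing with a value that cannot fit in Py_ssize_t fails immediately.
try:
    small_list[too_big]
except IndexError as exc:
    index_error = str(exc)

# range() accepts the value, but len() must return a Py_ssize_t.
try:
    len(range(too_big))
except OverflowError as exc:
    overflow_error = str(exc)

print(index_error)
print(overflow_error)
```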

Related

What is the largest value Python 3's int can hold?

I read that sys.maxsize is the largest value Python 3's int can hold.
However, it seems not to be the case; I can put much bigger number and it still does not overflow.
What is the limit that int can hold in Python 3? I am asking because I am converting a string to an integer and I am wondering if I need to worry about a possibility of overflow when doing the conversion.
From the docs what's new page:
The sys.maxint constant was removed, since there is no longer a limit
to the value of integers. However, sys.maxsize can be used as an
integer larger than any practical list or string index. It conforms to
the implementation’s “natural” integer size and is typically the same
as sys.maxint in previous releases on the same platform (assuming the
same build options).
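For the string-to-integer case specifically, a short sketch: int() does not overflow on long digit strings. One caveat worth knowing is that Python 3.11 and later cap str/int conversions at 4300 digits by default and raise ValueError beyond that (adjustable with sys.set_int_max_str_digits), so the example stays well under that cap.

```python
# A 100-digit string converts without any overflow.
digits = "9" * 100
value = int(digits)

print(value == 10 ** 100 - 1)   # True
print(str(value) == digits)     # True: the round trip is lossless
```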

Understanding memory allocation for large integers in Python

How does Python allocate memory for large integers?
An int type has a size of 28 bytes and as I keep increasing the value of the int, the size increases in increments of 4 bytes.
Why 28 bytes initially for any value as low as 1?
Why increments of 4 bytes?
PS: I am running Python 3.5.2 on a x86_64 (64 bit machine). Any pointers/resources/PEPs on how the (3.0+) interpreters work on such huge numbers is what I am looking for.
Code illustrating the sizes:
>>> a=1
>>> print(a.__sizeof__())
28
>>> a=1024
>>> print(a.__sizeof__())
28
>>> a=1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024*1024*1024
>>> a
1152921504606846976
>>> print(a.__sizeof__())
36
Why 28 bytes initially for any value as low as 1?
I believe @bgusach answered that completely; Python uses C structs to represent objects in the Python world, any objects including ints:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
PyObject_VAR_HEAD is a macro that, when expanded, adds another field to the struct (a PyVarObject field, which is used specifically for objects that have some notion of length), and ob_digit is an array holding the value of the number. The boilerplate in the size comes from that struct, for small and large Python numbers alike.
Why increments of 4 bytes?
Because, when a larger number is created, the size (in bytes) is a multiple of the sizeof(digit); you can see that in _PyLong_New where the allocation of memory for a new longobject is performed with PyObject_MALLOC:
/* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
   sizeof(digit)*size. Previous incarnations of this code used
   sizeof(PyVarObject) instead of the offsetof, but this risks being
   incorrect in the presence of padding between the PyVarObject header
   and the digits. */
if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
    PyErr_SetString(PyExc_OverflowError,
                    "too many digits in integer");
    return NULL;
}
result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
                         size*sizeof(digit));
offsetof(PyLongObject, ob_digit) is the 'boiler-plate' (in bytes) for the long object that isn't related with holding its value.
digit is defined in the header file holding the struct _longobject as a typedef for uint32:
typedef uint32_t digit;
and sizeof(uint32_t) is 4 bytes. That's the amount by which you'll see the size in bytes increase when the size argument to _PyLong_New increases.
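The same step size can be checked from Python itself. This sketch reads the digit layout from sys.int_info instead of hard-coding 30 bits and 4 bytes, since those values are CPython build details and can differ.

```python
import sys

bits = sys.int_info.bits_per_digit      # value bits stored per digit
digit_size = sys.int_info.sizeof_digit  # bytes occupied by one digit

# 2 ** (bits * k) needs exactly k + 1 digits, so each step adds one digit.
sizes = [sys.getsizeof(2 ** (bits * k)) for k in range(4)]
steps = [later - earlier for earlier, later in zip(sizes, sizes[1:])]

print(sizes)   # e.g. [28, 32, 36, 40] on a typical 64-bit CPython
print(steps)   # each step equals digit_size
```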
Of course, this is just how CPython has chosen to implement it. It is an implementation detail and as such you won't find much information in PEPs. The python-dev mailing list would hold implementation discussions if you can find the corresponding thread :-).
Either way, you might find differing behavior in other popular implementations, so don't take this one for granted.
It's actually easy. Python's int is not the kind of primitive you may be used to from other languages, but a full fledged object, with its methods and all the stuff. That is where the overhead comes from.
Then, you have the payload itself, the integer that is being represented. And there is no limit for that, except your memory.
The size of a Python's int is what it needs to represent the number plus a little overhead.
If you want to read further, take a look at the relevant part of the documentation:
Integers have unlimited precision

OverflowError occurs when using cython with a large int

python 3.4, windows 10, cython 0.21.1
I'm compiling this function to c with cython
def weakchecksum(data):
    """
    Generates a weak checksum from an iterable set of bytes.
    """
    cdef long a, b, l
    a = b = 0
    l = len(data)
    for i in range(l):
        a += data[i]
        b += (l - i)*data[i]
    return (b << 16) | a, a, b
which produces this error:
"OverflowError: Python int too large to convert to C long"
I've also tried declaring them as unsigned longs. What type do I use to work with really large numbers? If it's too large for a c long are there any workarounds?
Cython compiles pyx files to C, so it depends on the underlying C compiler.
The size of integer types in C varies across platforms and operating systems, and the C standard doesn't dictate an exact implementation.
However, there are de facto implementation conventions.
Windows, both 32-bit and 64-bit, uses 4 bytes (32 bits) for int and long, and 8 bytes (64 bits) for long long. The difference between Win32 and Win64 is the size of a pointer (32 bits for Win32 and 64 bits for Win64). (See Data Type Ranges from MSDN.)
Linux uses another model: int is 32 bits on both linux-32 and linux-64, and long long is always 64 bits. long and pointers vary: 32 bits on linux-32 and 64 bits on linux-64.
Long story short: if you need the maximum capacity for an integer type that doesn't change across platforms, use long long (or unsigned long long).
The data range for long long is [–9223372036854775808, 9223372036854775807].
If you need numbers with arbitrary precision, there is the GMP library, the de facto standard for high-precision arithmetic. Python has a wrapper for it called gmpy2.
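To check which model the current platform follows, ctypes can report the C type widths from Python. This is only a sketch for inspecting the local platform, not part of the Cython build.

```python
import ctypes

# Width of C long: 32 bits on Windows, typically 64 bits on 64-bit Linux.
long_bits = ctypes.sizeof(ctypes.c_long) * 8

# Width of C long long: 64 bits on the mainstream platforms discussed above.
long_long_bits = ctypes.sizeof(ctypes.c_longlong) * 8

# Signed range implied by that width.
long_long_min = -(2 ** (long_long_bits - 1))
long_long_max = 2 ** (long_long_bits - 1) - 1

print(long_bits, long_long_bits)
print(long_long_min, long_long_max)
```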
If you make sure that your calculations are done in C (for instance, declare i to be long, and put the data element into a cdef'ed variable or cast it before the calculation), you won't get this error. Your actual results, though, could vary by platform, depending (potentially) on the exact assembly code generated and the resulting treatment of overflows. There are better algorithms for this, as @cod3monk3y has noted (look at the "simple checksums" link).
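One way to get platform-independent results is to bound the running sums explicitly instead of relying on C overflow. The sketch below is plain Python and reduces both sums modulo 2**16, in the style of rsync-like weak checksums. That modulus is an assumption, not something the question specifies, and it changes the output relative to the original function; weakchecksum_portable is a hypothetical name.

```python
def weakchecksum_portable(data, modulus=1 << 16):
    """Weak checksum with explicitly bounded sums (assumed modulus 2**16)."""
    a = b = 0
    n = len(data)
    for i, byte in enumerate(data):
        # Reduce at every step so the values never depend on C integer width.
        a = (a + byte) % modulus
        b = (b + (n - i) * byte) % modulus
    return (b << 16) | a, a, b


print(weakchecksum_portable(b"abc"))   # (38404390, 294, 586)
```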

Maximum and Minimum values for ints

How do I represent minimum and maximum values for integers in Python? In Java, we have Integer.MIN_VALUE and Integer.MAX_VALUE.
See also: What is the maximum float in Python?
Python 3
In Python 3, this question doesn't apply. The plain int type is unbounded.
However, you might actually be looking for information about the current interpreter's word size, which will be the same as the machine's word size in most cases. That information is still available in Python 3 as sys.maxsize, which is the maximum value representable by a signed word. Equivalently, it's the size of the largest possible list or in-memory sequence.
Generally, the maximum value representable by an unsigned word will be sys.maxsize * 2 + 1, and the number of bits in a word will be math.log2(sys.maxsize * 2 + 2). See this answer for more information.
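Those two formulas can be evaluated directly. As a sketch, the integer-only bit_length() variant gives the same word size without going through floating point.

```python
import math
import sys

# Largest value of an unsigned word of the same width as Py_ssize_t.
unsigned_max = sys.maxsize * 2 + 1

# Word size in bits, as given in the text (log2 returns a float).
word_bits = int(math.log2(sys.maxsize * 2 + 2))

# Integer-only alternative: sys.maxsize + 1 is a power of two.
word_bits_exact = (sys.maxsize + 1).bit_length()

print(unsigned_max, word_bits, word_bits_exact)
```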
Python 2
In Python 2, the maximum value for plain int values is available as sys.maxint:
>>> sys.maxint # on my system, 2**63-1
9223372036854775807
You can calculate the minimum value with -sys.maxint - 1 as shown in the docs.
Python seamlessly switches from plain to long integers once you exceed this value. So most of the time, you won't need to know it.
If you just need a number that's bigger than all others, you can use
float('inf')
in similar fashion, a number smaller than all others:
float('-inf')
This works in both python 2 and 3.
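A typical use of the infinity sentinel is seeding a running minimum. The helper name smallest below is only illustrative; note that the comparison also works against integers far too large to be converted to a float.

```python
def smallest(values):
    """Return the minimum of values, or float('inf') if there are none."""
    best = float('inf')
    for value in values:
        if value < best:
            best = value
    return best


print(smallest([3, 10 ** 400, -7]))   # -7
print(float('inf') > 10 ** 400)       # True
```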
The sys.maxint constant has been removed from Python 3.0 onward, instead use sys.maxsize.
Integers
PEP 237: Essentially, long renamed to int. That is, there is only one built-in integral type, named int; but it behaves mostly like the old long type.
...
The sys.maxint constant was removed, since there is no longer a limit to the value of integers. However, sys.maxsize can be used as an integer larger than any practical list or string index. It conforms to the implementation’s “natural” integer size and is typically the same as sys.maxint in previous releases on the same platform (assuming the same build options).
For Python 3, it is
import sys
max_size = sys.maxsize
min_size = -sys.maxsize - 1
In Python 2, integers automatically switch from a fixed-size int representation to a variable-width long representation once you pass the value sys.maxint, which is either 2**31 - 1 or 2**63 - 1 depending on your platform. Notice the L that gets appended here:
>>> 9223372036854775807
9223372036854775807
>>> 9223372036854775808
9223372036854775808L
From the Python manual:
Numbers are created by numeric literals or as the result of built-in functions and operators. Unadorned integer literals (including binary, hex, and octal numbers) yield plain integers unless the value they denote is too large to be represented as a plain integer, in which case they yield a long integer. Integer literals with an 'L' or 'l' suffix yield long integers ('L' is preferred because 1l looks too much like eleven!).
Python tries very hard to pretend its integers are mathematical integers and are unbounded. It can, for instance, calculate a googol with ease:
>>> 10**100
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
You may use 'inf' like this:
import math
bool_true = 0 < math.inf
bool_false = 0 < -math.inf
Refer: math — Mathematical functions
If you want the max for array or list indices (the signed counterpart of size_t in C/C++), you can use numpy:
np.iinfo(np.intp).max
This is the same as sys.maxsize; the advantage is that you don't need to import sys just for this.
If you want max for native int on the machine:
np.iinfo(np.intc).max
You can look at other available types in doc.
For floats you can also use sys.float_info.max.
sys.maxsize is not actually the maximum integer value that is supported. You can double maxsize, or multiply it by itself, and the result is still a valid and correct value.
However, if you try sys.maxsize ** sys.maxsize, it will hang your machine for a significant amount of time. As many have pointed out, the byte and bit size does not seem to be relevant, because in practice no such limit exists. I guess Python just happily expands its integers when it needs more memory space. So in general there is no limit.
Now, if you're talking about packing or storing integers in a safe way where they can later be retrieved with integrity then of course that is relevant. I'm really not sure about packing but I know python's pickle module handles those things well. String representations obviously have no practical limit.
So really, the bottom line is: what is your applications limit? What does it require for numeric data? Use that limit instead of python's fairly nonexistent integer limit.
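On the packing question, a small sketch of the difference: fixed-width formats such as struct's 64-bit "q" reject values that don't fit, while int.to_bytes and pickle round-trip integers of any size. The variable names are illustrative.

```python
import pickle
import struct
import sys

big = sys.maxsize ** 3   # far wider than 64 bits

# A fixed-width 64-bit field cannot hold it.
try:
    struct.pack("<q", big)
    fits_in_int64 = True
except (struct.error, OverflowError):
    fits_in_int64 = False

# Variable-width encoding: use as many bytes as the value needs.
n_bytes = (big.bit_length() + 7) // 8
raw = big.to_bytes(n_bytes, "big")
restored = int.from_bytes(raw, "big")

# pickle also preserves large integers exactly.
pickled_copy = pickle.loads(pickle.dumps(big))

print(fits_in_int64, restored == big, pickled_copy == big)
```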
I rely heavily on commands like this.
python -c 'import sys; print(sys.maxsize)'
Max int returned: 9223372036854775807
For more references for 'sys' you should access
https://docs.python.org/3/library/sys.html
https://docs.python.org/3/library/sys.html#sys.maxsize
The code given below will help you.
For the maximum value you can use sys.maxsize, and for the minimum you can negate the same value and use that.
import sys
ni=sys.maxsize
print(ni)

fast, large-width, non-cryptographic string hashing in python

I have a need for a high-performance string hashing function in python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those every accepted/upvoted answer I could find fell into one of a few categories, which don't apply (for the given reason.)
Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!
An ideal solution would have the following properties, in a relative, loose order of importance.
Have an output range extending at least 34 bits long, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.
'Perturbed' hash example, where the hash value is changed drastically by a small integer value n
def perturb_hash(key,n):
    return hash((key,n))
Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)
Thanks!
Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, either in pure Python or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.
Update: I did a quick Python wrapper for the Murmur3 hash function. The Github project is here and you can find it on the Python Package Index as well; it just needs a C++ compiler to build; no Boost required.
Usage example and timing comparison:
import murmur3
import timeit
# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)
# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()
t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()
Output:
15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!
Since Python 3.3, string hashing is fed with a different seed every time the interpreter starts (I don't know more details), so it generates different values on every run, but not for native numeric types.
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-1756730906053498061 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4556027264747844925 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4403217265550417031 322818021289917443
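If values must be reproducible across runs and machines, one option is a keyed digest from hashlib truncated to 64 bits. This is a sketch only: stable_hash64 is a hypothetical helper, the perturbation is passed through BLAKE2b's salt parameter, and nothing here is claimed to meet the question's performance target.

```python
import hashlib


def stable_hash64(text, perturbation=0):
    """Deterministic 64-bit hash of text; perturbation must fit in 8 bytes."""
    salt = perturbation.to_bytes(8, "little")
    digest = hashlib.blake2b(
        text.encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


# The same input always gives the same value, in any interpreter session.
print(stable_hash64("Hello!"))
print(stable_hash64("Hello!", 1))
```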
Use the built-in hash() function. This function, at least on the machine I'm developing for (with
python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for
my purposes.
That's not true. The built-in hash function will generate a 64-bit hash on a 64-bit system.
This is the python str hashing function from Objects/stringobject.c (Python version 2.7):
static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x; /* Notice the 64-bit hash, at least on a 64-bit system */
    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
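The C routine above can be ported to pure Python to see how the width of long changes the result. This is a sketch under two assumptions: signed overflow wraps in two's complement, and the bits parameter stands in for the platform's long width. The name py27_string_hash is hypothetical.

```python
def py27_string_hash(data, bits=64):
    """Port of the string_hash routine above for a bytes object."""
    if not data:
        return 0
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask
    for byte in data:
        # Emulate C wraparound by masking to the chosen width.
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    # Reinterpret the unsigned bit pattern as a signed long.
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    return -2 if x == -1 else x


print(py27_string_hash(b"a", bits=64))   # 12416037344
print(py27_string_hash(b"a", bits=32))   # -468864544
```

With bits=32 the same input folds into a 32-bit value, which matches the observation that the hash width follows the platform's long.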
"strings": I'm presuming you wish to hash Python 2.x str objects and/or Python3.x bytes and/or bytearray objects.
This may violate your first constraint, but: consider using something like
(zlib.adler32(strg, perturber) << N) ^ hash(strg)
to get a (32+N)-bit hash.
If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.
Have a look at xxHash, there's also the pip package.
xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).
I've been using xxHash for a long time (my typical use case is to hash strings, not for security purposes) and I'm really satisfied with the performance.
