python 3.4, windows 10, cython 0.21.1
I'm compiling this function to C with Cython:
def weakchecksum(data):
    """
    Generates a weak checksum from an iterable set of bytes.
    """
    cdef long a, b, l
    a = b = 0
    l = len(data)
    for i in range(l):
        a += data[i]
        b += (l - i)*data[i]
    return (b << 16) | a, a, b
which produces this error:
"OverflowError: Python int too large to convert to C long"
I've also tried declaring them as unsigned longs. What type do I use to work with really large numbers? If it's too large for a C long, are there any workarounds?
Cython compiles .pyx files to C, so the result depends on the underlying C compiler.
The size of integer types in C varies across platforms and operating systems, and the C standard doesn't dictate an exact implementation.
However, there are de facto conventions.
Windows, for both 32-bit and 64-bit, uses 4 bytes (32 bits) for int and long, and 8 bytes (64 bits) for long long. The difference between Win32 and Win64 is the size of a pointer (32 bits on Win32, 64 bits on Win64). (See Data Type Ranges on MSDN.)
Linux uses another model: int is 32 bits on both linux-32 and linux-64, and long long is always 64 bits. long and pointers vary: 32 bits on linux-32 and 64 bits on linux-64.
Long story short: if you need the maximum-capacity integer type that doesn't change across platforms, use long long (or unsigned long long).
The data range for long long is [–9223372036854775808, 9223372036854775807].
If you need numbers with arbitrary precision, there is the GMP library -- the de facto standard for high-precision arithmetic. Python has a wrapper for it called gmpy2.
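A minimal usage sketch, assuming gmpy2 is installed (pip install gmpy2):

import gmpy2

x = gmpy2.mpz(2) ** 200      # mpz is gmpy2's arbitrary-precision integer type
print(x % 1000003)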
If you make sure that your calculations stay in C (for instance, declare i to be long, and put each data element into a cdef'd variable or cast it before the calculation), you won't get this error. Your actual results, though, could vary by platform, depending (potentially) on the exact assembly code generated and the resulting treatment of overflows. There are better algorithms for this, as @cod3monk3y has noted (look at the "simple checksums" link).
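For illustration, a minimal sketch of that fix applied to the function from the question. Note that with unsigned 64-bit C integers the sums wrap modulo 2**64 instead of growing without bound, so the results can differ from the pure-Python version for large inputs:

def weakchecksum(data):
    cdef unsigned long long a = 0, b = 0
    cdef unsigned long long d
    cdef long long i, l
    l = len(data)
    for i in range(l):
        d = data[i]          # pull the byte into a C variable once
        a += d
        b += (l - i) * d     # all arithmetic now stays in C
    return (b << 16) | a, a, b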
Related
I assumed that this number (2^63 - 1) was the maximum value Python could handle, or store as a variable. But these commands seem to be working fine:
>>> sys.maxsize
9223372036854775807
>>> a=sys.maxsize + 1
>>> a
9223372036854775808
So is there any significance at all? Can Python handle arbitrarily large numbers, if computing resources permit?
Note: here's the print-out of my version:
>>> sys.version
'3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]'
Python can handle arbitrarily large integers in computation. Any integer too big to fit in 64 bits (or whatever the underlying hardware limit is) is handled in software. For that reason, Python 3 doesn't have a sys.maxint constant.
The value sys.maxsize, on the other hand, reports the platform's pointer size, and that limits the size of Python's data structures such as strings and lists.
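A quick demonstration of both halves of that, on plain CPython:

import sys

n = sys.maxsize + 1              # one past the largest Py_ssize_t
print(n * n)                     # arbitrary-precision arithmetic: exact result

try:
    b'x' * n                     # container sizes are capped by Py_ssize_t
except (OverflowError, MemoryError) as e:
    print(type(e).__name__)      # OverflowError on CPython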
Documentation for sys.maxsize:
An integer giving the maximum value a variable of type Py_ssize_t can take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform. (Python 3 docs)
The largest positive integer supported by the platform’s Py_ssize_t type, and thus the maximum size lists, strings, dicts, and many other containers can have. (Python 2 docs)
What is Py_ssize_t?
It is an index type (a number type for indexing things, like lists). It is the signed version of size_t (from the C language).
We don't use a normal Python int for this, because Python ints are unbounded.
We don't use size_t because we want to support negative indexing; in Python we can do my_list[-4:]. So Py_ssize_t provides both negative and positive numbers within a range.
The _t stands for type, to inform developers that size_t is a type name, not a variable. Just a convention.
So what is the effect of having a limit on Py_ssize_t? Why does it limit the size of lists, strings, and dicts?
There is no way to index a list with an index larger than this, and the list cannot get bigger than this, because it won't accept anything that doesn't fit in a Py_ssize_t.
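For example (the exact exception type is a CPython implementation detail, so this sketch catches both plausible ones):

import sys

lst = [1, 2, 3]
try:
    lst[sys.maxsize + 1]          # the index can't be represented as Py_ssize_t
except (IndexError, OverflowError) as e:
    print(type(e).__name__, e)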
In the dictionary case, Py_ssize_t is used as the hash. Python doesn't use linked lists in its dictionary implementation; it uses open addressing/probing, where if a collision is found, we have a systematic way of picking another place to look for the key and put the value. So you can't have more than Py_ssize_t entries in a dictionary in Python.
In all practical cases (on 64-bit machines, i.e. probably you), you will run out of memory before you max out Py_ssize_t. Trying dict.fromkeys(range(sys.maxsize + 5)) never got there; it just slowed my computer down.
How does Python allocate memory for large integers?
An int type has a size of 28 bytes and as I keep increasing the value of the int, the size increases in increments of 4 bytes.
Why 28 bytes initially for any value as low as 1?
Why increments of 4 bytes?
PS: I am running Python 3.5.2 on an x86_64 (64-bit) machine. Any pointers/resources/PEPs on how the (3.0+) interpreters work with such huge numbers are what I am looking for.
Code illustrating the sizes:
>>> a=1
>>> print(a.__sizeof__())
28
>>> a=1024
>>> print(a.__sizeof__())
28
>>> a=1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024*1024*1024
>>> a
1152921504606846976
>>> print(a.__sizeof__())
36
Why 28 bytes initially for any value as low as 1?
I believe @bgusach answered that completely; Python uses C structs to represent objects in the Python world, any objects, including ints:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
PyObject_VAR_HEAD is a macro that, when expanded, adds another field to the struct (the PyVarObject header, which is specifically used for objects that have some notion of length), and ob_digit is an array holding the value of the number. The boilerplate in the size comes from that struct, for small and large Python numbers alike.
Why increments of 4 bytes?
Because, when a larger number is created, the size (in bytes) is a multiple of the sizeof(digit); you can see that in _PyLong_New where the allocation of memory for a new longobject is performed with PyObject_MALLOC:
/* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
   sizeof(digit)*size.  Previous incarnations of this code used
   sizeof(PyVarObject) instead of the offsetof, but this risks being
   incorrect in the presence of padding between the PyVarObject header
   and the digits. */
if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
    PyErr_SetString(PyExc_OverflowError,
                    "too many digits in integer");
    return NULL;
}
result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
                         size*sizeof(digit));
offsetof(PyLongObject, ob_digit) is the 'boiler-plate' (in bytes) for the long object that isn't related to holding its value.
digit is defined in the header file holding the struct _longobject as a typedef for uint32_t:
typedef uint32_t digit;
and sizeof(uint32_t) is 4 bytes. That's the amount by which you'll see the size in bytes increase when the size argument to _PyLong_New increases.
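You can observe this from Python. A sketch, assuming a typical 64-bit CPython where each 4-byte digit stores 30 bits of the value (PyLong_SHIFT is 15 on some builds, where the numbers differ):

import sys

for bits in (1, 30, 60, 90):                # one extra digit every 30 bits
    print(bits, sys.getsizeof(1 << bits))   # 28, 32, 36, 40 on such a build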
Of course, this is just how CPython has chosen to implement it. It is an implementation detail, and as such you won't find much information in PEPs. The python-dev mailing list would hold implementation discussions, if you can find the corresponding thread :-).
Either way, you might find differing behavior in other popular implementations, so don't take this one for granted.
It's actually easy. Python's int is not the kind of primitive you may be used to from other languages, but a full-fledged object, with its methods and all the other machinery. That is where the overhead comes from.
Then, you have the payload itself, the integer that is being represented. And there is no limit for that, except your memory.
The size of a Python's int is what it needs to represent the number plus a little overhead.
If you want to read further, take a look at the relevant part of the documentation:
Integers have unlimited precision
This question is a follow-up of this one. In Sun's math library (in C), the expression
*(1+(int*)&x)
is used to retrieve the high word of the floating point number x. Here, the OS is assumed 64-bit, with little-endian representation.
How do I translate the C expression above into Python? The difficulty here is how to translate the '&' and '*' in the expression. By the way, maybe Python has some built-in function that retrieves the high word of a floating-point number?
You can do this more easily with struct:
import struct

def get_high_word(x):
    high_word = struct.pack('<d', x)[4:8]
    return struct.unpack('<i', high_word)[0]
Here, high_word is a bytes object (or a str in 2.x) consisting of the four most significant bytes of x in little endian order (using IEEE 64-bit floating point format). We then unpack it back into a 32-bit integer (which is returned in a singleton tuple, hence the [0]).
This always uses little-endian for everything, regardless of your platform's underlying endianness. If you need to use native endianness, replace the < with = (and use > or ! to force big endian). It also guarantees 64-bit doubles and 32-bit ints, which C does not. You can remove that guarantee as well, but there is no good reason to do so since it makes your question nonsensical.
While this could be done with pointer arithmetic, it would involve messing around with ctypes and the conversion from Python float to C float would still be relatively expensive. The struct code is much easier to read.
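Usage example with the get_high_word function above: the IEEE-754 encoding of 1.0 is 0x3FF0000000000000, so its high word comes out as

print(hex(get_high_word(1.0)))   # 0x3ff00000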
I have a need for a high-performance string hashing function in Python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those, every accepted/upvoted answer I could find fell into one of a few categories, which don't apply (for the given reasons):
Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!
An ideal solution would have the following properties, in a relative, loose order of importance.
Have an output range of at least 34 bits, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.
'Perturbed' hash example, where the hash value is changed drastically by a small integer value n
def perturb_hash(key, n):
    return hash((key, n))
Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function: I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space), but I noticed that sometimes, if the filter got large enough, it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence the 34-bit minimum).
Thanks!
Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, pure or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.
Update: I did a quick Python wrapper for the Murmur3 hash function. The Github project is here, and you can find it on the Python Package Index as well; it just needs a C++ compiler to build; no Boost required.
Usage example and timing comparison:
import murmur3
import timeit
# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)
# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()
t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()
Output:
15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!
Since Python 3.3 it's fed with a different seed every time the interpreter starts (hash randomization, controllable via the PYTHONHASHSEED environment variable), so it generates different values on every run -- but not for native numeric types.
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-1756730906053498061 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4556027264747844925 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4403217265550417031 322818021289917443
Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
That's not true. The built-in hash function will generate a 64-bit hash on a 64-bit system.
This is the Python str hashing function from Objects/stringobject.c (Python version 2.7):
static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;    /* Notice the 64-bit hash, at least on a 64-bit system */

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
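On Python 3 you don't need to read the source to check this; sys.hash_info (available since 3.2) reports the hash width directly:

import sys
print(sys.hash_info.width)    # 64 on a 64-bit build, 32 on a 32-bit build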
"strings": I'm presuming you wish to hash Python 2.x str objects and/or Python3.x bytes and/or bytearray objects.
This may violate your first constraint, but: consider using something like
(zlib.adler32(strg, perturber) << N) ^ hash(strg)
to get a (32+N)-bit hash.
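A runnable sketch of that combination (the helper name is mine, and masking hash() to its low N bits is my addition, so the result stays a non-negative (32+N)-bit int):

import zlib

def wide_hash(strg, perturber=1, N=32):
    # adler32 supplies the high 32 bits; the built-in hash supplies the low N.
    return (zlib.adler32(strg, perturber) << N) ^ (hash(strg) & ((1 << N) - 1))

print(wide_hash(b'ACGTACGT'))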
If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.
Have a look at xxHash; there's also a pip package.
xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).
I've been using xxHash for a long time (my typical use case is hashing strings -- not for security purposes) and I'm really satisfied with the performance.
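Typical usage, assuming the xxhash package from PyPI (pip install xxhash):

import xxhash

h = xxhash.xxh64(b'samplebias', seed=123)
print(h.intdigest())      # 64-bit non-cryptographic hash as a plain int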
If it is environment-independent, what is the theoretical maximum number of characters in a Python string?
With a 64-bit Python installation, and (say) 64 GB of memory, a Python string of around 63 GB should be quite feasible, if not maximally fast. If you can upgrade your memory beyond 64 GB, your maximum feasible strings should get proportionally longer. (I don't recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).
With a typical 32-bit Python installation, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with high amounts of RAM.
I ran this code on an x2iedn.16xlarge EC2 instance, which has 2048 GiB (2.2 TB) of RAM:
>>> one_gigabyte = 1_000_000_000
>>> my_str = 'A' * (2000 * one_gigabyte)
It took a couple minutes but I was able to allocate a 2TB string on Python 3.10 running on Ubuntu 22.04.
>>> import sys
>>> sys.getsizeof(my_str)
2000000000049
>>> my_str
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
The last line actually hangs, but it would print 2 trillion As.
9 quintillion characters on a 64 bit system on CPython 3.10.
That's only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:
9,223,372,036,854,775,758 characters if your string only has ASCII characters (U+00 to U+7F) or
9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (U+80 to U+FF) or
4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. U+0100 to U+FFFF) or
2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. U+10000 and above)
On a 32 bit system it's around 2 billion or 500 million characters. If you don't know whether you're using a 64 bit or a 32 bit system or what that means, you're probably using a 64 bit system.
Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses Py_ssize_t as the data type for storing container lengths. Py_ssize_t is defined as the same size as the compiler's size_t, but signed. On a 64-bit system, size_t is 64 bits. 1 bit for the sign means you have 63 bits for the actual quantity, meaning CPython strings cannot be larger than 2⁶³ - 1 bytes, or around 9 million TB (8 EiB). This much RAM would cost you around 19 billion dollars if we multiply today's (November 2022) price of around $2/GB by 9 billion. On 32-bit systems (which are rare these days), it's 2³¹ - 1 bytes, or 2 GiB.
CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "widest" character in your string. So, for example, if you have a string like 'aaaaaaaaa', the a's each take 1 byte to store, but if you have a string like 'aaaaaaaaa😀' then all the a's will now take 4 bytes each. 1-byte-per-character strings also use either 48 or 72 bytes of metadata, while 2- or 4-bytes-per-character strings take 72 bytes of metadata. Each string also has an extra character at the end for a terminating null, so the empty string is actually 49 bytes.
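You can see both the per-character width and the metadata overhead with sys.getsizeof (values from a typical 64-bit CPython; exact overheads can vary slightly between versions):

import sys

print(sys.getsizeof(''))             # 49: metadata plus the terminating null
print(sys.getsizeof('aaaaaaaaa'))    # 58: 49 + 9 one-byte characters
print(sys.getsizeof('aaaaaaaaa😀'))  # 116: every character now costs 4 bytes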
When you allocate a string with PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) (see docs) in CPython, it performs this check:
/* Ensure we won't overflow the size. */
// [...]
if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
    return PyErr_NoMemory();
Where PY_SSIZE_T_MAX is
/* Largest positive value of type Py_ssize_t. */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))
which casts -1 to size_t (a type defined by the C compiler; a 64-bit unsigned integer on a 64-bit system), causing it to wrap around to its largest possible value, 2⁶⁴-1. Right-shifting that by 1 (so that the sign bit is 0) gives 2⁶³-1, which is then cast to a Py_ssize_t.
struct_size is just a bit of overhead for the str object's metadata, either 48 or 72; it's set earlier in the function:
struct_size = sizeof(PyCompactUnicodeObject);
if (maxchar < 128) {
    // [...]
    struct_size = sizeof(PyASCIIObject);
}
and char_size is either 1, 2 or 4, and so we have:
>>> ((2**63 - 1) - 72) // 4 - 1
2305843009213693932
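The same formula reproduces all four limits listed at the top of this answer:

for struct_size, char_size in [(48, 1), (72, 1), (72, 2), (72, 4)]:
    print(((2**63 - 1) - struct_size) // char_size - 1)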
There's of course the possibility that Python strings are practically limited by some other part of Python that I don't know about, but you should be able to at least allocate a new string of that size, assuming you can get your hands on 9 exabytes of RAM.