What is the max length of a Python string? - python

If it is environment-independent, what is the theoretical maximum number of characters in a Python string?

With a 64-bit Python installation, and (say) 64 GB of memory, a Python string of around 63 GB should be quite feasible, if not maximally fast. If you can upgrade your memory beyond 64 GB, your maximum feasible strings should get proportionally longer. (I don't recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).
With a typical 32-bit Python installation, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with high amounts of RAM.

I ran this code on an x2iedn.16xlarge EC2 instance, which has 2048 GiB (2.2 TB) of RAM.
>>> one_gigabyte = 1_000_000_000
>>> my_str = 'A' * (2000 * one_gigabyte)
It took a couple minutes but I was able to allocate a 2TB string on Python 3.10 running on Ubuntu 22.04.
>>> import sys
>>> sys.getsizeof(my_str)
2000000000049
>>> my_str
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
The last line actually hangs, but it would print 2 trillion As.

9 quintillion characters on a 64 bit system on CPython 3.10.
That's only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:
9,223,372,036,854,775,758 characters if your string only has ASCII characters (U+00 to U+7F) or
9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (U+80 to U+FF) or
4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. U+0100 to U+FFFF) or
2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. U+10000 and above)
On a 32 bit system it's around 2 billion or 500 million characters. If you don't know whether you're using a 64 bit or a 32 bit system or what that means, you're probably using a 64 bit system.
Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses Py_ssize_t as the data type for storing container length. Py_ssize_t is defined as the same size as the compiler's size_t but signed. On a 64 bit system, size_t is 64 bits. 1 bit for the sign means you have 63 bits for the actual quantity, meaning CPython strings cannot be larger than 2⁶³ - 1 bytes, or around 9 million TB (8 EiB). This much RAM would cost you around 19 billion dollars if we multiply today's (November 2022) price of around $2/GB by the roughly 9 billion gigabytes involved. On 32-bit systems (which are rare these days), the limit is 2³¹ - 1 bytes, or 2 GiB.
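As a quick sanity check from Python itself, sys.maxsize is documented to be the largest value a Py_ssize_t can hold, so on a 64-bit build it is exactly that 2⁶³ - 1 figure:
>>> import sys
>>> sys.maxsize
9223372036854775807
>>> sys.maxsize == 2**63 - 1
True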
CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "longest" character in your string. So for example if you have a string like 'aaaaaaaaa', the a's each take 1 byte to store, but if you have a string like 'aaaaaaaaa😀' then all the a's will now take 4 bytes each. 1-byte-per-character strings will also use either 48 or 72 bytes of metadata and 2 or 4 bytes-per-character strings will take 72 bytes for metadata. Each string also has an extra character at the end for a terminating null, so the empty string is actually 49 bytes.
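As a rough illustration of that inflation (exact figures from a 64-bit CPython 3.10; other 3.x versions may differ by a few bytes):
>>> import sys
>>> sys.getsizeof('')              # 48 bytes of metadata + 1-byte terminator
49
>>> sys.getsizeof('a' * 10)        # 1 byte per ASCII character
59
>>> sys.getsizeof('a' * 9 + '😀')  # same length, but now 4 bytes per character
116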
When you allocate a string with PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) (see docs) in CPython, it performs this check:
/* Ensure we won't overflow the size. */
// [...]
if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
    return PyErr_NoMemory();
Where PY_SSIZE_T_MAX is
/* Largest positive value of type Py_ssize_t. */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))
which casts -1 into a size_t (a type defined by the C compiler; a 64 bit unsigned integer on a 64 bit system), causing it to wrap around to its largest possible value, 2⁶⁴-1, then right shifts it by 1 (so that the sign bit is 0), which makes it 2⁶³-1, and casts that into a Py_ssize_t type.
struct_size is just a bit of overhead for the str object's metadata, either 48 or 72 bytes; it's set earlier in the function:
struct_size = sizeof(PyCompactUnicodeObject);
if (maxchar < 128) {
    // [...]
    struct_size = sizeof(PyASCIIObject);
}
and char_size is either 1, 2 or 4 and so we have
>>> ((2**63 - 1) - 72) // 4 - 1
2305843009213693932
There's of course the possibility that Python strings are practically limited by some other part of Python that I don't know about, but you should be able to at least allocate a new string of that size, assuming you can get your hands on 9 exabytes of RAM.
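If you want to reproduce the four limits quoted above, a quick back-of-the-envelope version of that same check, using the 48/72-byte struct sizes and the 1/2/4-byte character sizes, gives the same numbers:
>>> import sys
>>> for struct_size, char_size in [(48, 1), (72, 1), (72, 2), (72, 4)]:
...     print((sys.maxsize - struct_size) // char_size - 1)
...
9223372036854775758
9223372036854775734
4611686018427387866
2305843009213693932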

Related

How does Python manage size of integers and floats? [duplicate]

This question already has an answer here:
"sys.getsizeof(int)" returns an unreasonably large value?
(1 answer)
Closed 2 years ago.
I am running a 64-bit machine.
If I type getsizeof(int()), I get 24. What are the elements or objects that use these 24 bytes?
Here are some more confusing results:
getsizeof(0) returns 24.
getsizeof(1) returns 28. Why does 1 take 4 more bytes than 0?
And getsizeof(1.5) returns 24. Why does 1.5, which is a float, take up less space than the integer 1?
I'll talk just about ints for now, and look at floats at the end.
In Python, unlike C and many other languages, int is not just a datatype storing a number. It is a full object, with extra information about the object (more detail here).
This extra information takes up lots of space, which is why the objects can seem unusually large. Specifically, it takes up 3 lots of 8 bytes (3*8=24 bytes). These are a reference count, a type and a size.
Python stores integers using a specific number of bytes depending on how large they are. Specifically:
0 <= n < 2**0:
requires 24 bytes
2**0 <= n < 2**30:
requires 28 bytes
2**30 <= n < 2**60:
requires 32 bytes
In general, for every increase of 30 powers of 2 (for want of better terminology!), 4 extra bytes are required.
This pattern also follows for the negative numbers, just going the opposite direction.
These specific values are the values on my computer, but will vary depending on the environment you're running Python on. However, the general patterns are likely to be the same.
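You can see the jumps happen right at those powers of two (values from a 64-bit CPython of the era discussed here; newer versions may differ slightly):
>>> from sys import getsizeof
>>> getsizeof(2**30 - 1)   # still fits in the smaller representation
28
>>> getsizeof(2**30)       # crosses the boundary, 4 more bytes
32
>>> getsizeof(2**60)       # crosses the next boundary
36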
I believe (as explained a little here) that the reason zero alone uses less space is that the only int which can be represented using just 24 bytes is zero, so there is no need to additionally store the actual value, reducing the size. It's possible I've misunderstood this specific case so please correct me below if so!
However, floats are stored differently. Their values are simply stored using 64 bits (i.e. a double), which means there are 8 bytes representing the value. Since this never varies, there is no need to store the size as we do with integers, meaning there are only two 8 byte values to store alongside the specific float value. This means the total size is two lots of 8 bytes for the object data and one lot of 8 bytes for the actual value, or 24 bytes in total.
It is this property of not needing to store the value's size that frees up 8 bytes, which means 1.5 requires less space than 1.
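A quick way to convince yourself that floats have a fixed size while ints keep growing (again on a typical 64-bit CPython):
>>> from sys import getsizeof
>>> getsizeof(1.5)
24
>>> getsizeof(1.5e300)     # a much larger float, still 24 bytes
24
>>> getsizeof(10**300)     # an int of similar magnitude keeps growing
160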

What takes more memory, an 8 char string or 8 digit int?

I'm developing a program that will deal with approx. 90 billion records, so I need to manage memory carefully. Which is larger in memory: 8 char string or 8 digit int?
Details:
-Python 3.7.4
-64 bits
Edit1:
following the advice of user8080blablabla I got:
sys.getsizeof(99999999)
28
sys.getsizeof("99999999")
57
seriously? an 8 char string is 57 bytes long?!?
An int will generally take less memory than its representation as a string, because it is more compact. However, because Python int values are objects, they still take quite a lot of space each compared to primitive values in other languages: the integer object 1 takes up 28 bytes of memory on my machine.
>>> import sys
>>> sys.getsizeof(1)
28
If minimising memory use is your priority, and there is a maximum range the integers can be in, consider using the array module. It can store numeric data (or Unicode characters) in an array, in a primitive data type of your choice, so that each value isn't an object taking up 28+ bytes.
>>> from array import array
>>> arr = array('I') # unsigned int in C
>>> arr.extend(range(10000))
>>> arr.itemsize
4
>>> sys.getsizeof(arr)
40404
The actual number of bytes used per item is dependent on the machine architecture. On my machine, each number takes 4 bytes; there are 404 bytes of overhead for an array of length 10,000. Check arr.itemsize on your machine to see if you need a different primitive type; fewer than 4 bytes is not enough for an 8-digit number.
That said, you should not be trying to fit 90 billion numbers in memory, at 4 bytes each; this would take 360GB of memory. Look for a solution which doesn't require holding every record in memory at once.
You ought to remember that strings are represented as Unicode in Python, so a character can take up to 4 bytes to store, and every str also carries a substantial fixed object overhead on top of that, which is why you see such a large discrepancy between int and str (interesting read on the topic).
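For ASCII digits specifically, the per-character cost is only 1 byte; most of those 57 bytes are the fixed object overhead (rough breakdown, assuming a 64-bit CPython 3.x):
>>> import sys
>>> sys.getsizeof('')          # fixed overhead of an ASCII str object
49
>>> sys.getsizeof('99999999')  # 49 bytes of overhead + 1 byte per digit
57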
If you are worried about memory allocation I would instead recommend using pandas to manage the backend for you when it comes to manipulating large datasets.

Why are Unicode strings having a different memory footprint in Python 2 and 3? [duplicate]

This question already has an answer here:
How is unicode represented internally in Python?
(1 answer)
Closed 4 years ago.
In Python 2, an empty string occupies exactly 37 bytes,
>>> print sys.getsizeof('')
37
In Python 3.6, the same call would output 49 bytes,
>>> print(sys.getsizeof(''))
49
Now I thought that this was due to the fact that in Python 3, all strings are now Unicode. But, to my surprise, here are some confusing outputs,
Python 2.7
>>> print sys.getsizeof(u'')
52
>>> print sys.getsizeof(u'1')
56
Python 3.6
>>> print(sys.getsizeof(''))
49
>>> print(sys.getsizeof('1'))
50
An empty string is not the same size.
4 additional bytes are needed when adding a character in Python 2 and only one for Python 3
Why is the memory footprint different between the two versions?
EDIT
I specified the exact version of my python environment, because between different Python 3 builds there are differences.
There are reasons of course, but really, it should not matter for any practical purpose. If you have a Python system in which you have to keep so many strings in memory that you get close to the system's total memory, you should optimize it by (1) trying to lazily load/create strings in memory or (2) using a byte-oriented, efficient binary structure to deal with your data, such as those provided by NumPy or Python's own bytearray.
The change for the empty string literal (unicode literal for Py2) could be due to any number of implementation details between the versions you are looking at, and it should not matter even if you were writing C code to interact directly with Python strings: even that code should only touch the strings via the API.
Now, the specific reason why a string in Python 3 grows by just 1 byte per character, while in Python 2 it grows by 4 bytes, is PEP 393.
Before Python 3.3, any (unicode) string in Python used either a fixed 2 bytes or a fixed 4 bytes of memory per character, and the Python interpreter and Python modules using native code had to be compiled to use exactly one of these. I.e. you effectively could have incompatible Python binaries, even if the versions matched, due to the string-width option picked at build time; those builds were known as "narrow build" and "wide build". With the above-mentioned PEP 393, Python strings have their character size determined when they are instantiated, depending on the widest Unicode code point they contain. Strings whose code points all fall in the first 256 code points (equivalent to the Latin-1 character set) use only 1 byte per character.
Internally, Python 3 now stores strings in four different encodings, choosing the narrowest one that can represent each string. These encodings are ASCII, LATIN-1, UCS-2, and UTF-32. Each one is capable of representing a different subset of Unicode characters and has the useful property that the element at index i is also the Unicode code point at index i.
In [1]: import sys
In [2]: sys.getsizeof('\xFF')
Out[2]: 74
In [3]: sys.getsizeof('X\xFF')
Out[3]: 75
In [4]: sys.getsizeof('\u0100')
Out[4]: 76
In [5]: sys.getsizeof('X\u0100')
Out[5]: 78
In [6]: sys.getsizeof('\U00010000')
Out[6]: 80
In [7]: sys.getsizeof('X\U00010000')
Out[7]: 84
You can see that adding an additional character, in this case 'X', to a string causes that string to take up additional space depending on the values contained in the rest of the string.
This system was proposed in PEP-0393 and implemented in Python 3.3. Earlier versions of Python use the older unicode representation, which always used 2 or 4 bytes per... element (I hesitate to say "character"), depending on compile-time options and they could not be mixed.
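One consequence worth seeing directly: because the whole string is stored at the width of its widest character, a single wider character inflates the storage of everything around it (figures from a 64-bit CPython 3.x):
>>> import sys
>>> sys.getsizeof('a' * 1000)             # 1 byte per character
1049
>>> sys.getsizeof('a' * 1000 + '\u0100')  # one extra character, now 2 bytes each
2076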

Understanding memory allocation for large integers in Python

How does Python allocate memory for large integers?
An int type has a size of 28 bytes and as I keep increasing the value of the int, the size increases in increments of 4 bytes.
Why 28 bytes initially for any value as low as 1?
Why increments of 4 bytes?
PS: I am running Python 3.5.2 on an x86_64 (64 bit machine). Any pointers/resources/PEPs on how the (3.0+) interpreters work on such huge numbers are what I am looking for.
Code illustrating the sizes:
>>> a=1
>>> print(a.__sizeof__())
28
>>> a=1024
>>> print(a.__sizeof__())
28
>>> a=1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024*1024*1024
>>> a
1152921504606846976
>>> print(a.__sizeof__())
36
Why 28 bytes initially for any value as low as 1?
I believe @bgusach answered that completely; Python uses C structs to represent objects in the Python world, any object including ints:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
PyObject_VAR_HEAD is a macro that, when expanded, adds another field to the struct (the PyVarObject field, which is specifically used for objects that have some notion of length), and ob_digit is an array holding the value for the number. The boiler-plate in the size comes from that struct, for both small and large Python numbers.
Why increments of 4 bytes?
Because, when a larger number is created, the size (in bytes) is a multiple of the sizeof(digit); you can see that in _PyLong_New where the allocation of memory for a new longobject is performed with PyObject_MALLOC:
/* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
   sizeof(digit)*size.  Previous incarnations of this code used
   sizeof(PyVarObject) instead of the offsetof, but this risks being
   incorrect in the presence of padding between the PyVarObject header
   and the digits. */
if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
    PyErr_SetString(PyExc_OverflowError,
                    "too many digits in integer");
    return NULL;
}
result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
                         size*sizeof(digit));
offsetof(PyLongObject, ob_digit) is the 'boiler-plate' (in bytes) for the long object that isn't related with holding its value.
digit is defined in the header file holding the struct _longobject as a typedef for uint32_t:
typedef uint32_t digit;
and sizeof(uint32_t) is 4 bytes. That's the amount by which you'll see the size in bytes increase when the size argument to _PyLong_New increases.
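You don't have to read the C source to see these constants; sys.int_info exposes them at runtime (output from a 64-bit CPython, where digits are 30-bit values stored in 4 bytes):
>>> import sys
>>> sys.int_info.bits_per_digit
30
>>> sys.int_info.sizeof_digit
4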
Of course, this is just how CPython has chosen to implement it. It is an implementation detail and as such you won't find much information in PEPs. The python-dev mailing list would hold implementation discussions if you can find the corresponding thread :-).
Either way, you might find differing behavior in other popular implementations, so don't take this one for granted.
It's actually easy. Python's int is not the kind of primitive you may be used to from other languages, but a full-fledged object, with its methods and all the stuff. That is where the overhead comes from.
Then, you have the payload itself, the integer that is being represented. And there is no limit for that, except your memory.
The size of a Python's int is what it needs to represent the number plus a little overhead.
If you want to read further, take a look at the relevant part of the documentation:
Integers have unlimited precision
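So a Python int simply grows by one 4-byte digit at a time for as long as memory allows; for example (sizes from a 64-bit CPython of the era discussed here):
>>> (2**1000).bit_length()
1001
>>> import sys
>>> sys.getsizeof(2**1000)   # object header + 34 four-byte digits
160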

OverflowError occurs when using cython with a large int

Python 3.4, Windows 10, Cython 0.21.1
I'm compiling this function to C with Cython
def weakchecksum(data):
    """
    Generates a weak checksum from an iterable set of bytes.
    """
    cdef long a, b, l
    a = b = 0
    l = len(data)
    for i in range(l):
        a += data[i]
        b += (l - i)*data[i]
    return (b << 16) | a, a, b
which produces this error:
"OverflowError: Python int too large to convert to C long"
I've also tried declaring them as unsigned longs. What type do I use to work with really large numbers? If it's too large for a C long, are there any workarounds?
Cython compiles .pyx files to C, so it depends on the underlying C compiler.
The size of integer types in C varies across platforms and operating systems, and the C standard doesn't dictate an exact implementation.
However, there are de facto conventions.
Windows, for both 32 and 64 bit, uses 4 bytes (32 bits) for int and long, and 8 bytes (64 bits) for long long. The difference between Win32 and Win64 is the size of a pointer (32 bits on Win32, 64 bits on Win64). (See Data Type Ranges from MSDN.)
Linux uses another model: int is 32 bits on both linux-32 and linux-64, and long long is always 64 bits. long and pointers vary: 32 bits on linux-32 and 64 bits on linux-64.
Long story short: if you need the maximum capacity for an integer type that doesn't change across platforms, use long long (or unsigned long long).
The data range for long long is [-9223372036854775808, 9223372036854775807].
If you need numbers with arbitrary precision, there is the GMP library -- the de facto standard for high-precision arithmetic. Python has a wrapper for it called gmpy2.
If you make sure that your calculations are done in C (for instance, declare i to be long, and put the data element into a cdef'd variable or cast it before the calculation), you won't get this error. Your actual results, though, could vary by platform, depending (potentially) on the exact assembly code generated and the resulting treatment of overflows. There are better algorithms for this, as @cod3monk3y has noted (look at the "simple checksums" link).
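For comparison, here is a plain-Python sketch of the same checksum (weakchecksum_py is just an illustrative name); it is slower, but Python-level ints have arbitrary precision, so it can never raise this OverflowError no matter how large the input or the intermediate sums get:
def weakchecksum_py(data):
    # Same weak checksum as above, computed with unbounded Python ints.
    a = b = 0
    l = len(data)
    for i, byte in enumerate(data):
        a += byte
        b += (l - i) * byte
    return (b << 16) | a, a, b

print(weakchecksum_py(b"some example bytes"))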
