How does Python manage size of integers and floats? [duplicate] - python

I am running a 64-bit machine.
If I type getsizeof(int()), I get 24. What are the elements or objects that use these 24 bytes?
Here are some more confusing results:
getsizeof(0) returns 24.
getsizeof(1) returns 28. Why does 1 take 4 more bytes than 0?
And getsizeof(1.5) returns 24. Why does 1.5, which is a float, take less space than the integer 1?

I'll talk just about ints for now, and look at floats at the end.
In Python, unlike C and many other languages, int is not just a datatype storing a number. It is a full object, with extra information about the object (more detail here).
This extra information takes up lots of space, which is why the objects can seem unusually large. Specifically, it takes up 3 lots of 8 bytes (3*8=24 bytes): a reference count, a pointer to the object's type, and a size field.
Python stores integers using a specific number of bytes depending on how large they are. Specifically:
0 <= n < 2**0:
requires 24 bytes
2**0 <= n < 2**30:
requires 28 bytes
2**30 <= n < 2**60:
requires 32 bytes
In general, every additional 30 bits (i.e. each extra factor of 2**30) requires 4 more bytes.
This pattern also follows for the negative numbers, just going the opposite direction.
These specific values are the values on my computer, but will vary depending on the environment you're running Python on. However, the general patterns are likely to be the same.
I believe (as explained a little here) that the reason zero alone uses less space is that it is stored with no value digits at all, so only the 24-byte object header is needed; every non-zero int needs at least one extra 4-byte chunk to hold its value. It's possible I've misunderstood this specific case so please correct me below if so!
However, floats are stored differently. Their values are simply stored using 64 bits (i.e. a double), which means there are 8 bytes representing the value. Since this never varies, there is no need to store the size as we do with integers, meaning there are only two 8 byte values to store alongside the specific float value. This means the total size is two lots of 8 bytes for the object data and one lot of 8 bytes for the actual value, or 24 bytes in total.
It is this property of not needing to store the value's size that frees up 8 bytes, which means 1.5 requires less space than 1.
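For reference, here is a quick interactive check of the pattern described above (these are values from a 64-bit CPython build like the one described; the exact numbers may differ in your environment):
>>> import sys
>>> [sys.getsizeof(n) for n in (0, 1, 2**30 - 1, 2**30, 2**60)]
[24, 28, 28, 32, 36]
>>> sys.getsizeof(1.5)  # floats: two 8-byte header fields plus one 8-byte double
24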

Related

Why is 0.5 taking less memory than 1 in python?

I think I understand the concept of how Python is storing variables and why certain vars are larger than others. I also googled about floating point but that couldn't answer my question:
Why is a float, e.g. 0.5, only taking 24 bytes of memory while an integer like 1 takes 28? What confuses me even more is that 0 takes 24 bytes too (that part I think I understand: it stores just the object with "no" integer). But if Python adds 4 bytes whenever the number can't be saved with less, how can it store a larger binary number like 0.5 in the same space as 0?
I used sys.getsizeof() to get the size of the objects in Python 3.9.1 64-bit

What takes more memory, an 8 char string or 8 digit int?

I'm developing a program that will deal with approx. 90 billion records, so I need to manage memory carefully. Which is larger in memory: 8 char string or 8 digit int?
Details:
- Python 3.7.4
- 64 bits
Edit 1:
Following the advice of user8080blablabla I got:
>>> sys.getsizeof(99999999)
28
>>> sys.getsizeof("99999999")
57
Seriously? An 8 char string is 57 bytes long?!?
An int will generally take less memory than its representation as a string, because it is more compact. However, because Python int values are objects, they still take quite a lot of space each compared to primitive values in other languages: the integer object 1 takes up 28 bytes of memory on my machine.
>>> import sys
>>> sys.getsizeof(1)
28
If minimising memory use is your priority, and there is a maximum range the integers can be in, consider using the array module. It can store numeric data (or Unicode characters) in an array, in a primitive data type of your choice, so that each value isn't an object taking up 28+ bytes.
>>> from array import array
>>> arr = array('I') # unsigned int in C
>>> arr.extend(range(10000))
>>> arr.itemsize
4
>>> sys.getsizeof(arr)
40404
The actual number of bytes used per item is dependent on the machine architecture. On my machine, each number takes 4 bytes; there are 404 bytes of overhead for an array of length 10,000. Check arr.itemsize on your machine to see if you need a different primitive type; fewer than 4 bytes is not enough for an 8-digit number.
That said, you should not be trying to fit 90 billion numbers in memory, at 4 bytes each; this would take 360GB of memory. Look for a solution which doesn't require holding every record in memory at once.
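If the data genuinely cannot fit in RAM, one option (a sketch only, assuming numpy is available; the file name, dtype and length below are placeholders) is a memory-mapped array, which keeps the values on disk and lets the OS page them in and out on demand:
>>> import numpy as np
>>> # 'records.dat' is a hypothetical file; int32 comfortably holds any 8-digit number
>>> arr = np.memmap('records.dat', dtype=np.int32, mode='w+', shape=(10_000_000,))
>>> arr[:3] = [99999999, 12345678, 1]
>>> arr.flush()  # push changes out to the file on disk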
You ought to remember that strings are represented as Unicode in Python, therefore storing a digit in a string can take upwards of 4 bytes per character, which is why you see such a large discrepancy between int and str (interesting read on the topic).
If you are worried about memory allocation, I would instead recommend using pandas to manage the backend for you when it comes to manipulating large datasets.

A RAM error of big array

I need to get the numbers of one line at random, put each line in another array, and then get the numbers of one column.
I have a big file, more than 400 MB. In that file there are 13496*13496 numbers, i.e. 13496 rows and 13496 columns. I want to read them into an array.
This is my code:
i = 0
_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
while i < 13496:
    print "i=" + str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]
    _L1[i] = _strlf
    i += 1
_L1file.close()
And this is my error message:
MemoryError:
File "D:\research\space-function\ART3.py", line 30, in <module>
_strlf = _strlf.split('\t')
You might want to approach your problem in another way. Process the file line by line; I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.
for line in open("400MB_file"):
    # do something with line.
Or
f = open("file")
for linenum, line in enumerate(f):
    if linenum + 1 in [2, 3, 10]:
        print "there are ", len(line.split()), " columns"  # assuming you want to split on spaces
        print "100th column value is: ", line.split()[99]
    if linenum + 1 > 10:
        break  # break if you want to stop after the 10th line
f.close()
This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numerics, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.
Assuming each array element is just a single byte, your computer needs around 180 MB of RAM to hold it in memory in its entirety, and trying to process it all at once could well be impractical.
You need to think about the problem a different way; as has already been mentioned, a line-by-line approach might be a better option. Or perhaps processing the grid in smaller units, perhaps 10x10 or 100x100, and aggregating the results. Or maybe the problem itself can be expressed in a different form, which avoids the need to process the entire dataset altogether...?
If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.
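For the smaller-units idea above, something like itertools.islice can pull a block of rows at a time; this is a sketch only, with the block size and the aggregation step left as placeholders:
from itertools import islice

with open('distanceCMD.function.txt') as f:
    while True:
        block = list(islice(f, 100))  # next 100 rows, or fewer at the end of the file
        if not block:
            break
        rows = [[float(v) for v in line.split('\t')[:-1]] for line in block]
        # ... aggregate this 100-row block here, then let it go out of scope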
Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe.
On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.
Longer explanation: you should be aware that Python objects themselves can be quite large: even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.
Possible solutions:
(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.
(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each.
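As a rough check of option (2), assuming numpy is available: a dense 13496x13496 array of 8-byte floats is about 1.5 GB, well under the roughly 8.7 GB of per-entry overhead estimated above.
>>> import numpy as np
>>> a = np.empty((13496, 13496), dtype=np.float64)
>>> a.nbytes  # 13496 * 13496 * 8 bytes
1457136128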
Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10)) # 72 + 8 bytes for each pointer
152
MemoryError exception:
Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
It seems that, at least in your case, reading the entire file into memory is not a doable option.
Replace this:
_strlf = _strlf[:-1]
with this:
_strlf = [float(val) for val in _strlf[:-1]]
You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.
You might want to include some handling for null values.
You can also go to numpy's array type, but your problem may be too small to bother.
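A minimal sketch of the numpy route, assuming each line holds 13496 tab-separated values followed by a trailing tab, as in the original code (float32 keeps the whole matrix around 0.7 GB):
import numpy as np

_L1 = np.empty((13496, 13496), dtype=np.float32)  # preallocate the full matrix
with open('distanceCMD.function.txt') as f:
    for i, line in enumerate(f):
        # convert one row of strings to 32-bit floats and store it in place
        _L1[i] = np.array(line.split('\t')[:-1], dtype=np.float32)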

What is the max length of a Python string?

If it is environment-independent, what is the theoretical maximum number of characters in a Python string?
With a 64-bit Python installation, and (say) 64 GB of memory, a Python string of around 63 GB should be quite feasible, if not maximally fast. If you can upgrade your memory beyond 64 GB, your maximum feasible strings should get proportionally longer. (I don't recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).
With a typical 32-bit Python installation, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with high amounts of RAM.
I ran this code on an x2iedn.16xlarge EC2 instance, which has 2048 GiB (2.2 TB) of RAM
>>> one_gigabyte = 1_000_000_000
>>> my_str = 'A' * (2000 * one_gigabyte)
It took a couple minutes but I was able to allocate a 2TB string on Python 3.10 running on Ubuntu 22.04.
>>> import sys
>>> sys.getsizeof(my_str)
2000000000049
>>> my_str
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
The last line actually hangs, but it would print 2 trillion As.
9 quintillion characters on a 64 bit system on CPython 3.10.
That's only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:
9,223,372,036,854,775,758 characters if your string only has ASCII characters (U+00 to U+7F) or
9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (U+80 to U+FF) or
4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. U+0100 to U+FFFF) or
2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. U+10000 and above)
On a 32 bit system it's around 2 billion or 500 million characters. If you don't know whether you're using a 64 bit or a 32 bit system or what that means, you're probably using a 64 bit system.
Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses Py_ssize_t as the data type for storing container length. Py_ssize_t is defined as the same size as the compiler's size_t but signed. On a 64 bit system, size_t is 64 bits wide. 1 bit for the sign means you have 63 bits for the actual quantity, meaning CPython strings cannot be larger than 2⁶³ - 1 bytes or around 9.2 million TB (8 EiB). This much RAM would cost you around 19 billion dollars if we multiply today's (November 2022) price of around $2/GB by 9 billion. On 32-bit systems (which are rare these days), it's 2³¹ - 1 bytes or 2 GiB.
CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "longest" character in your string. So for example if you have a string like 'aaaaaaaaa', the a's each take 1 byte to store, but if you have a string like 'aaaaaaaaa😀' then all the a's will now take 4 bytes each. 1-byte-per-character strings will also use either 48 or 72 bytes of metadata and 2 or 4 bytes-per-character strings will take 72 bytes for metadata. Each string also has an extra character at the end for a terminating null, so the empty string is actually 49 bytes.
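The per-character cost is easy to see by comparing strings that differ by one character (numbers from a 64-bit CPython 3.x build; the exact metadata sizes can vary slightly between versions):
>>> import sys
>>> sys.getsizeof('')  # 48 bytes of metadata plus the terminating null
49
>>> sys.getsizeof('a' * 1000) - sys.getsizeof('a' * 999)  # ASCII: 1 byte per character
1
>>> sys.getsizeof('😀' * 1000) - sys.getsizeof('😀' * 999)  # outside the BMP: 4 bytes per character
4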
When you allocate a string with PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) (see docs) in CPython, it performs this check:
/* Ensure we won't overflow the size. */
// [...]
if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
return PyErr_NoMemory();
Where PY_SSIZE_T_MAX is
/* Largest positive value of type Py_ssize_t. */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))
which is casting -1 into a size_t (a type defined by the C compiler, a 64 bit unsigned integer on a 64 bit system) which causes it to wrap around to its largest possible value, 2⁶⁴-1 and then right shifts it by 1 (so that the sign bit is 0) which causes it to become 2⁶³-1 and casts that into a Py_ssize_t type.
struct_size is just a bit of overhead for the str object's metadata, either 48 or 72; it's set earlier in the function:
struct_size = sizeof(PyCompactUnicodeObject);
if (maxchar < 128) {
// [...]
struct_size = sizeof(PyASCIIObject);
}
and char_size is either 1, 2 or 4 and so we have
>>> ((2**63 - 1) - 72) // 4 - 1
2305843009213693932
There's of course the possibility that Python strings are practically limited by some other part of Python that I don't know about, but you should be able to at least allocate a new string of that size, assuming you can get your hands on 9 exabytes of RAM.

Short Integers in Python

Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
documentation for the array module
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module that packs C-style structs into a string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
You can use NumPy's integer types, such as np.int8 or np.int16.
Armin's suggestion of the array module is probably best. Two possible alternatives:
1. You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
2. You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work (see the sketch below).
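A minimal sketch of that bit-packing trick; pack_pair and unpack_pair are made-up helper names, and the values are assumed to fit in unsigned 16 bits:
def pack_pair(lo, hi):
    # store two unsigned 16-bit values in one Python int
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack_pair(packed):
    # recover the two 16-bit halves
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF
For example, pack_pair(1, 2) gives 131073, and unpack_pair(131073) returns (1, 2).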
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when the number was larger than 32 bits.)
You can also pack multiple integers of any size into a single large integer.
For example, as seen below, in Python 3 on a 64-bit x86 system a 1024-bit integer takes 164 bytes of memory. That means on average one byte stores around 6.24 bits. And if you go with even larger integers you get an even higher bit-storage density, for example around 7.50 bits per byte with a 2**20-bit-wide integer.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement.
One issue with this approach is that your data access will slow down due to the use of large-integer operations.
If you access a big batch of consecutively stored integers at once, so that the number of large-integer operations is minimized, the slowdown won't be much of an issue.
I guess using numpy will be the easier approach.
>>> import sys
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
Using a bytearray in Python, which is basically a C unsigned char array under the hood, is a better solution than using large integers. There is no overhead for manipulating a byte array, and it has much less storage overhead compared to large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
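If you go the bytearray route, the struct module mentioned earlier can read and write individual shorts in place. A minimal sketch, assuming unsigned 16-bit values and native byte order:
>>> import struct
>>> buf = bytearray(2 * 1000)  # room for 1000 unsigned shorts, 2 bytes each
>>> struct.pack_into('H', buf, 2 * 42, 31415)  # write 31415 at slot 42
>>> struct.unpack_from('H', buf, 2 * 42)[0]  # read it back
31415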
