Possible to store Python ints in less than 12 bytes? - python

Is it possible to make Python use less than 12 bytes for an int?
>>> x=int()
>>> x
0
>>> sys.getsizeof(x)
12
I am not a computer specialist but isn't 12 bytes excessive?
The smallest int I want to store is 0, the largest int 147097614, so I shouldn't really need more than 4 bytes.
(There is probably something I misunderstand here as I couldn't find an answer anywhere on the net. Keep that in mind.)

In python, ints are objects just like everything else. Because of that, there is a little extra overhead just associated with the fact that you're using an object which has some associated meta-data.
If you're going to use lots of ints, and it makes sense to lay them out in an array-like structure, you should look into numpy. Numpy ndarray objects will have a little overhead associated with them for the various pieces of meta-data that the array objects keep track of, but the actual data is stored as the datatype you specify (e.g. numpy.int32 for a 4-byte integer.)
Thus, if you have:
import numpy as np
a = np.zeros(5000,dtype=np.int32)
The array will take only slightly more than 4*5000 = 20000 bytes of your memory

Size of an integer object includes the overhead of maintaining other object information along with its value. The additional information can include object type, reference count and other implementation-specific details.
If you store many integers and want to optimize the space spent, use the array module, specifically arrays constructed with array.array('i').

Integers in python are objects, and are therefore stored with extra overhead.
You can read more information about it here
The integer type in cpython is stored in a structure like so:
typedef struct {
PyObject_HEAD
long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro that expands out into a reference count and a pointer to the type object.
So you can see that:
long ob_ival - 4 bytes for a long.
Py_ssize_t ob_refcnt - I would assume to size_t here is 4 bytes.
PyTypeObject *ob_type - Is a pointer, so another 4 bytes.
12 bytes in total!

Related

Understanding memory allocation for large integers in Python

How does Python allocate memory for large integers?
An int type has a size of 28 bytes and as I keep increasing the value of the int, the size increases in increments of 4 bytes.
Why 28 bytes initially for any value as low as 1?
Why increments of 4 bytes?
PS: I am running Python 3.5.2 on a x86_64 (64 bit machine). Any pointers/resources/PEPs on how the (3.0+) interpreters work on such huge numbers is what I am looking for.
Code illustrating the sizes:
>>> a=1
>>> print(a.__sizeof__())
28
>>> a=1024
>>> print(a.__sizeof__())
28
>>> a=1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024*1024*1024
>>> a
1152921504606846976
>>> print(a.__sizeof__())
36
Why 28 bytes initially for any value as low as 1?
I believe #bgusach answered that completely; Python uses C structs to represent objects in the Python world, any objects including ints:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
PyObject_VAR_HEAD is a macro that when expanded adds another field in the struct (field PyVarObject which is specifically used for objects that have some notion of length) and, ob_digits is an array holding the value for the number. Boiler-plate in size comes from that struct, for small and large Python numbers.
Why increments of 4 bytes?
Because, when a larger number is created, the size (in bytes) is a multiple of the sizeof(digit); you can see that in _PyLong_New where the allocation of memory for a new longobject is performed with PyObject_MALLOC:
/* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
sizeof(digit)*size. Previous incarnations of this code used
sizeof(PyVarObject) instead of the offsetof, but this risks being
incorrect in the presence of padding between the PyVarObject header
and the digits. */
if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
PyErr_SetString(PyExc_OverflowError,
"too many digits in integer");
return NULL;
}
result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
size*sizeof(digit));
offsetof(PyLongObject, ob_digit) is the 'boiler-plate' (in bytes) for the long object that isn't related with holding its value.
digit is defined in the header file holding the struct _longobject as a typedef for uint32:
typedef uint32_t digit;
and sizeof(uint32_t) is 4 bytes. That's the amount by which you'll see the size in bytes increase when the size argument to _PyLong_New increases.
Of course, this is just how CPython has chosen to implement it. It is an implementation detail and as such you wont find much information in PEPs. The python-dev mailing list would hold implementation discussions if you can find the corresponding thread :-).
Either way, you might find differing behavior in other popular implementations, so don't take this one for granted.
It's actually easy. Python's int is not the kind of primitive you may be used to from other languages, but a full fledged object, with its methods and all the stuff. That is where the overhead comes from.
Then, you have the payload itself, the integer that is being represented. And there is no limit for that, except your memory.
The size of a Python's int is what it needs to represent the number plus a little overhead.
If you want to read further, take a look at the relevant part of the documentation:
Integers have unlimited precision

What does python sys getsizeof for string return?

What does sys.getsizeof return for a standard string? I am noticing that this value is much higher than what len returns.
I will attempt to answer your question from a broader point of view. You're referring to two functions and comparing their outputs. Let's take a look at their documentation first:
len():
Return the length (the number of items) of an object. The argument may
be a sequence (such as a string, bytes, tuple, list, or range) or a
collection (such as a dictionary, set, or frozen set).
So in case of string, you can expect len() to return the number of characters.
sys.getsizeof():
Return the size of an object in bytes. The object can be any type of
object. All built-in objects will return correct results, but this
does not have to hold true for third-party extensions as it is
implementation specific.
So in case of string (as with many other objects) you can expect sys.getsizeof() the size of the object in bytes. There is no reason to think that it should be the same as the number of characters.
Let's have a look at some examples:
>>> first = "First"
>>> len(first)
5
>>> sys.getsizeof(first)
42
This example confirms that the size is not the same as the number of characters.
>>> second = "Second"
>>> len(second)
6
>>> sys.getsizeof(second)
43
We can notice that if we look at a string one character longer, its size is one byte bigger as well. We don't know if it's a coincidence or not though.
>>> together = first + second
>>> print(together)
FirstSecond
>>> len(together)
11
If we concatenate the two strings, their combined length is equal to the sum of their lengths, which makes sense.
>>> sys.getsizeof(together)
48
Contrary to what someone might expect though, the size of the combined string is not equal to the sum of their individual sizes. But it still seems to be the length plus something. In particular, something worth 37 bytes. Now you need to realize that it's 37 bytes in this particular case, using this particular Python implementation etc. You should not rely on that at all. Still, we can take a look why it's 37 bytes what they are (approximately) used for.
String objects are in CPython (probably the most widely used implementation of Python) implemented as PyStringObject. This is the C source code (I use the 2.7.9 version):
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
You can see that there is something called PyObject_VAR_HEAD, one int, one long and a char array. The char array will always contain one more character to store the '\0' at the end of the string. This, along with the int, long and PyObject_VAR_HEAD take the additional 37 bytes. PyObject_VAR_HEAD is defined in another C source file and it refers to other implementation-specific stuff, you need to explore if you want to find out where exactly are the 37 bytes. Plus, the documentation mentions that sys.getsizeof()
adds an additional garbage collector overhead if the object is managed
by the garbage collector.
Overall, you don't need to know what exactly takes the something (the 37 bytes here) but this answer should give you a certain idea why the numbers differ and where to find more information should you really need it.
To quote the documentation:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Built in strings are not simple character sequences - they are full fledged objects, with garbage collection overhead, which probably explains the size discrepancy you're noticing.

Python's variables size [duplicate]

When I ran the below code I got 3 and 36 as the answers respectively.
x ="abd"
print len(x)
print sys.getsizeof(x)
Can someone explain to me what's the difference between them ?
They are not the same thing at all.
len() queries for the number of items contained in a container. For a string that's the number of characters:
Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).
sys.getsizeof() on the other hand returns the memory size of the object:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Python string objects are not simple sequences of characters, 1 byte per character.
Specifically, the sys.getsizeof() function includes the garbage collector overhead if any:
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
String objects do not need to be tracked (they cannot create circular references), but string objects do need more memory than just the bytes per character. In Python 2, __sizeof__ method returns (in C code):
Py_ssize_t res;
res = PyStringObject_SIZE + PyString_GET_SIZE(v) * Py_TYPE(v)->tp_itemsize;
return PyInt_FromSsize_t(res);
where PyStringObject_SIZE is the C struct header size for the type, PyString_GET_SIZE basically is the same as len() and Py_TYPE(v)->tp_itemsize is the per-character size. In Python 2.7, for byte strings, the size per character is 1, but it's PyStringObject_SIZE that is confusing you; on my Mac that size is 37 bytes:
>>> sys.getsizeof('')
37
For unicode strings the per-character size goes up to 2 or 4 (depending on compilation options). On Python 3.3 and newer, Unicode strings take up between 1 and 4 bytes per character, depending on the contents of the string.
For containers such as dictionaries or lists that reference other objects, the memory size given covers only the memory used by the container and the pointer values used to reference those other objects. There is no straightforward method of including the memory size of the ‘contained’ objects because those same objects could have many more references elsewhere and are not necessarily owned by a single container.
The documentation states it like this:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
If you need to calculate the memory footprint of a container and anything referenced by that container you’ll have to use some method of traversing to those contained objects and get their size; the documentation points to a recursive recipe.
key difference is that len() will give actual length of elements in container , Whereas sys.sizeof() will give it's memory size which it occupy
for more information read docs of python which is available at
https://docs.python.org/3/library/sys.html#module-sys

What is the difference between len() and sys.getsizeof() methods in python?

When I ran the below code I got 3 and 36 as the answers respectively.
x ="abd"
print len(x)
print sys.getsizeof(x)
Can someone explain to me what's the difference between them ?
They are not the same thing at all.
len() queries for the number of items contained in a container. For a string that's the number of characters:
Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).
sys.getsizeof() on the other hand returns the memory size of the object:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Python string objects are not simple sequences of characters, 1 byte per character.
Specifically, the sys.getsizeof() function includes the garbage collector overhead if any:
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
String objects do not need to be tracked (they cannot create circular references), but string objects do need more memory than just the bytes per character. In Python 2, __sizeof__ method returns (in C code):
Py_ssize_t res;
res = PyStringObject_SIZE + PyString_GET_SIZE(v) * Py_TYPE(v)->tp_itemsize;
return PyInt_FromSsize_t(res);
where PyStringObject_SIZE is the C struct header size for the type, PyString_GET_SIZE basically is the same as len() and Py_TYPE(v)->tp_itemsize is the per-character size. In Python 2.7, for byte strings, the size per character is 1, but it's PyStringObject_SIZE that is confusing you; on my Mac that size is 37 bytes:
>>> sys.getsizeof('')
37
For unicode strings the per-character size goes up to 2 or 4 (depending on compilation options). On Python 3.3 and newer, Unicode strings take up between 1 and 4 bytes per character, depending on the contents of the string.
For containers such as dictionaries or lists that reference other objects, the memory size given covers only the memory used by the container and the pointer values used to reference those other objects. There is no straightforward method of including the memory size of the ‘contained’ objects because those same objects could have many more references elsewhere and are not necessarily owned by a single container.
The documentation states it like this:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
If you need to calculate the memory footprint of a container and anything referenced by that container you’ll have to use some method of traversing to those contained objects and get their size; the documentation points to a recursive recipe.
key difference is that len() will give actual length of elements in container , Whereas sys.sizeof() will give it's memory size which it occupy
for more information read docs of python which is available at
https://docs.python.org/3/library/sys.html#module-sys

Short Integers in Python

Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
documentation for the array module
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module that packs c-style structs in a string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
You can use NumyPy's int as np.int8 or np.int16.
Armin's suggestion of the array module is probably best. Two possible alternatives:
You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then
that's pretty simple to do.
You can
cheat and manipulate bits, so that
you're storing one number in the
lower half of the Python int, and
another one in the upper half.
You'd write some utility functions
to convert to/from these within your
data structure. Ugly, but it can be made to work.
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using a IIBTree (from ZODB) and managed to fit it. (The ints in a IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to a IOBTree when the number was larger than 32 bits).
You can also store multiple any size of integers in a single large integer.
For example as seen below, in python3 on 64bit x86 system, 1024 bits are taking 164 bytes of memory storage. That means on average one byte can store around 6.24 bits. And if you go with even larger integers you can get even higher bits storage density. For example around 7.50 bits per byte with 2**20 bits wide integer.
Obviously you will need some wrapper logic to access individual short numbers stored in the larger integer, which is easy to implement.
One issue with this approach is your data access will slow down due use of the large integer operations.
If you are accessing a big batch of consecutively stored integers at once to minimize the access to large integers, then the slower access to long integers won't be an issue.
I guess use of numpy will be easier approach.
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
Using bytearray in python which is basically a C unsigned char array under the hood will be a better solution than using large integers. There is no overhead for manipulating a byte array and, it has much less storage overhead compared to large integers. It's possible to get storage density of 7.99+ bits per byte with bytearrays.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228

Categories