Why do ints require three times as much memory in Python? - python

On a 64-bit system an integer in Python takes 24 bytes. This is 3 times the memory that would be needed in e.g. C for a 64-bit integer. Now, I know this is because Python integers are objects. But what is the extra memory used for? I have my guesses, but it would be nice to know for sure.

Remember that the Python int type does not have a limited range like C int has; the only limit is the available memory.
Memory goes to storing the value, the current size of the integer storage (the storage size is variable to support arbitrary sizes), and the standard Python object bookkeeping (a reference to the relevant object and a reference count).
You can look up the longintrepr.h source (the Python 3 int type was traditionally known as the long type in Python 2); it makes effective use of the PyVarObject C type to track integer size:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
The ob_digit array stores 'digits' of either 15 or 30 bits wide (depending on your platform); so on my 64-bit OS X system, an integer up to (2 ^ 30) - 1 uses 1 'digit':
>>> sys.getsizeof((1 << 30) - 1)
28
but if you use 2 30-bit digits in the number an additional 4 bytes are needed, etc:
>>> sys.getsizeof(1 << 30)
32
>>> sys.getsizeof(1 << 60)
36
>>> sys.getsizeof(1 << 90)
40
The base 24 bytes then are the PyObject_VAR_HEAD structure, holding the object size, the reference count and the type pointer (each 8 bytes / 64 bits on my 64-bit OS X platform).
On Python 2, integers <= sys.maxint but >= -sys.maxint - 1 are stored using a simpler structure storing just the single value:
typedef struct {
PyObject_HEAD
long ob_ival;
} PyIntObject;
because this uses PyObject instead of PyVarObject there is no ob_size field in the struct and the memory size is limited to just 24 bytes; 8 for the long value, 8 for the reference count and 8 for the type object pointer.

From longintrepr.h, we see that a Python 'int' object is defined with this C structure:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
Digit is a 32-bit unsigned value. The bulk of the space is taken by the variable size object header. From object.h, we can find its definition:
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
typedef struct _object {
_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;
We can see that we are using a Py_ssize_t, 64-bits assuming 64-bit system, to store the count of "digits" in the value. This is possibly wasteful. We can also see that the general object header has a 64-bit reference count, and a pointer to the object type, which will also be a 64-bits of storage. The reference count is necessary for Python to know when to deallocate the object, and the pointer to the object type is necessary to know that we have an int and not, say, a string, as C structures have no way to test the type of an object from an arbitrary pointer.
_PyObject_HEAD_EXTRA is defined to nothing on most builds of python, but can be used to store a linked list of all Python objects on the heap if the build enables that option, using another two pointers of 64-bits each.

Related

Python 2 vs Python 3 - Size of an integer [duplicate]

I want to check the size of int data type in python:
import sys
sys.getsizeof(int)
It comes out to be "436", which doesn't make sense to me.
Anyway, I want to know how many bytes (2,4,..?) int will take on my machine.
The short answer
You're getting the size of the class, not of an instance of the class. Call int to get the size of an instance:
>>> sys.getsizeof(int())
24
If that size still seems a little bit large, remember that a Python int is very different from an int in (for example) c. In Python, an int is a fully-fledged object. This means there's extra overhead.
Every Python object contains at least a refcount and a reference to the object's type in addition to other storage; on a 64-bit machine, that takes up 16 bytes! The int internals (as determined by the standard CPython implementation) have also changed over time, so that the amount of additional storage taken depends on your version.
Some details about int objects in Python 2 and 3
Here's the situation in Python 2. (Some of this is adapted from a blog post by Laurent Luce). Integer objects are represented as blocks of memory with the following structure:
typedef struct {
PyObject_HEAD
long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro defining the storage for the refcount and the object type. It's described in some detail by the documentation, and the code can be seen in this answer.
The memory is allocated in large blocks so that there's not an allocation bottleneck for every new integer. The structure for the block looks like this:
struct _intblock {
struct _intblock *next;
PyIntObject objects[N_INTOBJECTS];
};
typedef struct _intblock PyIntBlock;
These are all empty at first. Then, each time a new integer is created, Python uses the memory pointed at by next and increments next to point to the next free integer object in the block.
I'm not entirely sure how this changes once you exceed the storage capacity of an ordinary integer, but once you do so, the size of an int gets larger. On my machine, in Python 2:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
24
>>> sys.getsizeof(2 ** 62)
24
>>> sys.getsizeof(2 ** 63)
36
In Python 3, I think the general picture is the same, but the size of integers increases in a more piecemeal way:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(2 ** 30 - 1)
28
>>> sys.getsizeof(2 ** 30)
32
>>> sys.getsizeof(2 ** 60 - 1)
32
>>> sys.getsizeof(2 ** 60)
36
These results are, of course, all hardware-dependent! YMMV.
The variability in integer size in Python 3 is a hint that they may behave more like variable-length types (like lists). And indeed, this turns out to be true. Here's the definition of the C struct for int objects in Python 3:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
The comments that accompany this definition summarize Python 3's representation of integers. Zero is represented not by a stored value, but by an object with size zero (which is why sys.getsizeof(0) is 24 bytes while sys.getsizeof(1) is 28). Negative numbers are represented by objects with a negative size attribute! So weird.

Python struct.unpack equivalent in C

What I have in python is a
string lenght = 4
, that I can unpack to get some values.
Is there an equivalent of this function on python, in C?
In C, there is no concept of "packing" like this. Whenever you have a char buffer such as
char buf[128];
whether you treat it like a string or a complex data structure is up to you. The simplest way would be to define a struct and copy data back and forth from your array.
struct MyStruct{
int data1;
int data2;
};
char buf[sizeof(struct MyStruct)];
struct MyStruct myStruct;
myStruct.data1 = 1;
myStruct.data2 = 2;
memcpy(buf, &myStruct, sizeof(struct MyStruct));
Please note that there IS some packing/padding that MAY happen here. For example, if you have a short in your struct, the compiler MAY use 4 bytes anyway. This also fails when you have to use pointers, such as char* strings in the structure.

Why does sys.getsizeof() not return [size] in file.read([size]) in Python

I have a large binary file that I would like to read in and unpack using struct.unpack()
The file consists of a number of lines each 2957 bytes long.
I read in the file using the following code:
with open("bin_file", "rb") as f:
line = f.read(2957)
My question is why, is the size returned by:
import sys
sys.getsizeof(line)
not equal to 2957 (in my case it is 2978)?
You misunderstand what sys.getsizeof() does. It returns the amount of memory Python uses for a string object, not length of the line.
Python string objects track reference counts, the object type and other metadata together with the actual characters, so 2978 bytes is not the same thing as the string length.
See the stringobject.h definition of the type:
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
where PyObject_VAR_HEAD is defined in object.h, where the standard ob_refcnt, ob_type and ob_size fields are all defined.
So a string of length 2957 takes 2958 bytes (string length + null) and the remaining 20 bytes you see are to hold the reference count, the type pointer, the object 'size' (string length here), the cached string hash and the interned state flag.
Other object types will have different memory footprints, and the exact sizes of the C types used differ from platform to platform as well.
A string object representing 2957 bytes of data takes more than 2957 bytes of memory to represent, due to overhead such as the type pointer and the reference count. sys.getsizeof includes this additional overhead.

how to use python to built an int arrray and manuplate it as efficient as C?

From http://www.cs.bell-labs.com/cm/cs/pearls/sol01.html:
The C code is like that:
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
I found ctypes, BitArrays,numpy but I'm not sure whether they could be as efficient as the C codes above.
For example, if I write codes like this:
from ctypes import c_int
a=[c_int(9)]*1024*1024
Would the space used be 1M Bytes, or much more?
Does anyone know some good libraries that can do the same things in Python?
Numpy or ctypes are both good choices. But are you sure your Python code really needs to be as efficient as C, and are you sure this code is a performance hotspot?
The best thing to do is to use the Python profiler to ensure that this code really does need to be as efficient as C. If it truly does, then it will probably be easiest to just keep the code in C and link to it using something like ctypes or SWIG.
Edit: To answer your updated question, a numpy array of size N with element size M will contain N*M bytes of contiguous memory, plus a header and some bytes for views.
Here are a couple of related links:
Python memory usage of numpy arrays
Memory usage of numpy-arrays
you can also check the built-in array module:
>>> import array
>>> help(array)
Help on built-in module array:
NAME
array
FILE
(built-in)
DESCRIPTION
This module defines an object type which can efficiently represent
an array of basic values: characters, integers, floating point
numbers. Arrays are sequence types and behave very much like lists,
except that the type of objects stored in them is constrained. The
type is specified at object creation time by using a type code, which
is a single character. The following type codes are defined:
Type code C Type Minimum size in bytes
'b' signed integer 1
'B' unsigned integer 1
'u' Unicode character 2 (see note)
'h' signed integer 2
'H' unsigned integer 2
'i' signed integer 2
'I' unsigned integer 2
'l' signed integer 4
'L' unsigned integer 4
'f' floating point 4
'd' floating point 8
This:
a=[c_int()]
makes a list which contains a reference to a c_int object.
Multiplying the list merely duplicates the references, so:
a = [c_int()] * 1024 * 1024
actually creates a list of 1024 * 1024 references to the same single c_int object.
If you want an array of 1024 * 1024 c_ints, do this:
a = c_int * (1024 * 1024)

struct.error: unpack requires a string argument of length 4

Python says I need 4 bytes for a format code of "BH":
struct.error: unpack requires a string argument of length 4
Here is the code, I am putting in 3 bytes as I think is needed:
major, minor = struct.unpack("BH", self.fp.read(3))
"B" Unsigned char (1 byte) + "H" Unsigned short (2 bytes) = 3 bytes (!?)
struct.calcsize("BH") says 4 bytes.
EDIT: The file is ~800 MB and this is in the first few bytes of the file so I'm fairly certain there's data left to be read.
The struct module mimics C structures. It takes more CPU cycles for a processor to read a 16-bit word on an odd address or a 32-bit dword on an address not divisible by 4, so structures add "pad bytes" to make structure members fall on natural boundaries. Consider:
struct { 11
char a; 012345678901
short b; ------------
char c; axbbcxxxdddd
int d;
};
This structure will occupy 12 bytes of memory (x being pad bytes).
Python works similarly (see the struct documentation):
>>> import struct
>>> struct.pack('BHBL',1,2,3,4)
'\x01\x00\x02\x00\x03\x00\x00\x00\x04\x00\x00\x00'
>>> struct.calcsize('BHBL')
12
Compilers usually have a way of eliminating padding. In Python, any of =<>! will eliminate padding:
>>> struct.calcsize('=BHBL')
8
>>> struct.pack('=BHBL',1,2,3,4)
'\x01\x02\x00\x03\x04\x00\x00\x00'
Beware of letting struct handle padding. In C, these structures:
struct A { struct B {
short a; int a;
char b; char b;
}; };
are typically 4 and 8 bytes, respectively. The padding occurs at the end of the structure in case the structures are used in an array. This keeps the 'a' members aligned on correct boundaries for structures later in the array. Python's struct module does not pad at the end:
>>> struct.pack('LB',1,2)
'\x01\x00\x00\x00\x02'
>>> struct.pack('LBLB',1,2,3,4)
'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'
By default, on many platforms the short will be aligned to an offset at a multiple of 2, so there will be a padding byte added after the char.
To disable this, use: struct.unpack("=BH", data). This will use standard alignment, which doesn't add padding:
>>> struct.calcsize('=BH')
3
The = character will use native byte ordering. You can also use < or > instead of = to force little-endian or big-endian byte ordering, respectively.

Categories