I am writing a program where I need to know the memory efficiency of different data containers in Python / Cython. One of said containers is the standard Python list.
The Python list is tripping me up because I do not know how it works at the binary level. Unlike Python lists, C arrays are easy to understand: all of the elements are the same type, and the space is declared ahead of time, so when the programmer indexes the array, the program can compute exactly which memory address to go to. The problem is that a Python list can store many different data types, and even nested lists inside a list. The sizes of these data structures change all the time, and the list still holds them, accounting for the changes. Does extra separator memory exist to make the list as dynamic as it is?
If you could, I would appreciate an actual binary layout of an example list in RAM, annotated with what each byte represents. This will help me to fully understand the inner workings of the list, as I am working on the binary level.
The list object is defined in Include/listobject.h. The structure is really simple:
typedef struct {
PyObject_VAR_HEAD
/* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
PyObject **ob_item;
/* ob_item contains space for 'allocated' elements. The number
* currently in use is ob_size.
* Invariants:
* 0 <= ob_size <= allocated
* len(list) == ob_size
* ob_item == NULL implies ob_size == allocated == 0
* list.sort() temporarily sets allocated to -1 to detect mutations.
*
* Items must normally not be NULL, except during construction when
* the list is not yet visible outside the function that builds it.
*/
Py_ssize_t allocated;
} PyListObject;
and PyObject_VAR_HEAD brings in the fields of PyVarObject, which is defined (along with PyObject) as
typedef struct _object {
_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
Basically, then, a list object looks like this:
[ssize_t ob_refcnt]
[type *ob_type]
[ssize_t ob_size]
[object **ob_item] -> [object *][object *][object *]...
[ssize_t allocated]
Note that len retrieves the value of ob_size.
ob_item points to an array of PyObject * pointers. Each element in a list is a Python object, and Python objects are always passed by reference (at the C-API level, as pointers to the actual PyObjects). Therefore, lists only store pointers to objects, and not the objects themselves.
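A quick way to see this from Python itself (a small sketch using sys.getsizeof, which reports only the list object and its pointer vector, never the referenced objects):
import sys

x = "x" * 10_000          # a large string object
small = [1, 2, 3]
big = [x, x, x]
# Both lists hold three PyObject * pointers, so the list objects themselves
# are the same size; the 10,000-character string lives elsewhere.
print(sys.getsizeof(small), sys.getsizeof(big))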
When a list fills up, it will be reallocated. allocated tracks how many elements the list can hold at maximum (before reallocation). The reallocation algorithm is in Objects/listobject.c:
/* Ensure ob_item has room for at least newsize elements, and set
* ob_size to newsize. If newsize > ob_size on entry, the content
* of the new slots at exit is undefined heap trash; it's the caller's
* responsibility to overwrite them with sane values.
* The number of allocated elements may grow, shrink, or stay the same.
* Failure is impossible if newsize <= self.allocated on entry, although
* that partly relies on an assumption that the system realloc() never
* fails when passed a number of bytes <= the number of bytes last
* allocated (the C standard doesn't guarantee this, but it's hard to
* imagine a realloc implementation where it wouldn't be true).
* Note that self->ob_item may change, and even if newsize is less
* than ob_size on entry.
*/
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
PyObject **items;
size_t new_allocated;
Py_ssize_t allocated = self->allocated;
/* Bypass realloc() when a previous overallocation is large enough
to accommodate the newsize. If the newsize falls lower than half
the allocated size, then proceed with the realloc() to shrink the list.
*/
if (allocated >= newsize && newsize >= (allocated >> 1)) {
assert(self->ob_item != NULL || newsize == 0);
Py_SIZE(self) = newsize;
return 0;
}
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
PyErr_NoMemory();
return -1;
} else {
new_allocated += newsize;
}
if (newsize == 0)
new_allocated = 0;
items = self->ob_item;
if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
PyMem_RESIZE(items, PyObject *, new_allocated);
else
items = NULL;
if (items == NULL) {
PyErr_NoMemory();
return -1;
}
self->ob_item = items;
Py_SIZE(self) = newsize;
self->allocated = new_allocated;
return 0;
}
As you can see from the comments, lists grow rather slowly, in the following sequence:
0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
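The sequence can be reproduced from the formula in list_resize (a sketch of the rule quoted above; note that newer CPython releases have tweaked the constants slightly):
# Simulate repeated appends and record each new 'allocated' value.
def new_allocated(newsize):
    return (newsize >> 3) + (3 if newsize < 9 else 6) + newsize

growth, allocated = [], 0
for n in range(1, 200):           # append the n-th element
    if n > allocated:             # list is full, list_resize over-allocates
        allocated = new_allocated(n)
        growth.append(allocated)
print(growth[:10])   # [4, 8, 16, 25, 35, 46, 58, 72, 88, 106]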
Related
I'm new to C++ and looking for a faster way to append pixel values to a Python list. Currently the loop takes around 0.1 seconds to process one frame of an image at a resolution of 854x480. Does anyone have any idea?
I'd like to avoid using third-party modules if possible.
Here is what I've got so far:
PyObject* byte_list = PyList_New(static_cast<Py_ssize_t>(0));
AVFrame *pFrameRGB = av_frame_alloc();
av_frame_copy_props(pFrameRGB, this->pFrame);
pFrameRGB->width = this->pFrame->width;
pFrameRGB->height = this->pFrame->height;
pFrameRGB->format = AV_PIX_FMT_RGB24;
av_frame_get_buffer(pFrameRGB, 0);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0,
this->pCodecContext->height, pFrameRGB->data, pFrameRGB->linesize);
if (this->_debug) {
std::cout << "Frame linesize " << pFrameRGB->linesize[0] << "\n";
std::cout << "Frame width " << pFrameRGB->width << "\n";
std::cout << "Frame height " << pFrameRGB->height << "\n";
}
// This looping method seems slow
for(int y = 0; y < pFrameRGB->height; ++y) {
for(int x = 0; x < pFrameRGB->width; ++x) {
int p = x * 3 + y * pFrameRGB->linesize[0];
int r = pFrameRGB->data[0][p];
int g = pFrameRGB->data[0][p+1];
int b = pFrameRGB->data[0][p+2];
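// Note: PyList_Append does not steal the reference returned by
// PyLong_FromLong, so each temporary int created below is leaked;
// it should be Py_DECREF'ed after the append.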
PyList_Append(byte_list, PyLong_FromLong(r));
PyList_Append(byte_list, PyLong_FromLong(g));
PyList_Append(byte_list, PyLong_FromLong(b));
}
}
av_frame_free(&pFrameRGB);
Thanks!
After looking around, I've decided to use Python's built-in array module, which lets me copy the data in with a single memcpy instead of feeding a PyList one element at a time.
From my tests, this improves the speed by 2-10 times, depending on the data.
PyObject *vec_to_array(std::vector<uint8_t>& vec) {
static PyObject *single_array;
if (!single_array) {
PyObject *array_module = PyImport_ImportModule("array");
if (!array_module)
return NULL;
PyObject *array_type = PyObject_GetAttrString(array_module, "array");
Py_DECREF(array_module);
if (!array_type)
return NULL;
single_array = PyObject_CallFunction(array_type, "s[B]", "B", 0);
Py_DECREF(array_type);
if (!single_array)
return NULL;
}
// extra-fast way to create an empty array of count elements:
// array = single_element_array * count
PyObject *pysize = PyLong_FromSsize_t(vec.size());
if (!pysize)
return NULL;
PyObject *array = PyNumber_Multiply(single_array, pysize);
Py_DECREF(pysize);
if (!array)
return NULL;
// now, obtain the address of the array's buffer
PyObject *buffer_info = PyObject_CallMethod(array, "buffer_info", "");
if (!buffer_info) {
Py_DECREF(array);
return NULL;
}
PyObject *pyaddr = PyTuple_GetItem(buffer_info, 0);
void *addr = PyLong_AsVoidPtr(pyaddr);
// and, finally, copy the data.
if (vec.size())
memcpy(addr, &vec[0], vec.size() * sizeof(uint8_t));
return array;
}
After that, I pass the vector into that function:
std::vector<uint8_t> rgb_arr;
// Copy data from AV Frame
uint8_t* rgb_data[4]; int rgb_linesize[4];
av_image_alloc(rgb_data, rgb_linesize, this->pFrame->width, this->pFrame->height, AV_PIX_FMT_RGB24, 32);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0, this->pFrame->height, rgb_data, rgb_linesize);
// Put the data into vector
int rgb_size = pFrame->height * rgb_linesize[0];
std::vector<uint8_t> rgb_vector(rgb_size);
memcpy(rgb_vector.data(), rgb_data[0], rgb_size);
// Transfer the data from vector to rgb_arr
for(int y = 0; y < pFrame->height; ++y) {
rgb_arr.insert(
rgb_arr.end(),
rgb_vector.begin() + y * rgb_linesize[0],
rgb_vector.begin() + y * rgb_linesize[0] + 3 * pFrame->width
);
}
PyObject* arr = vec_to_array(rgb_arr);
This can then be accessed later from Python.
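For reference, the "multiply a one-element array" preallocation trick that vec_to_array relies on looks like this in pure Python (a small sketch; buffer_info() returns the address and element count that the C code uses for its memcpy):
import array

single = array.array("B", [0])
count = 854 * 480 * 3              # one RGB frame
buf = single * count               # preallocated array of `count` zero bytes
addr, length = buf.buffer_info()   # raw address and element count
print(length == count)             # True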
Use a container with fast append performance, such as std::vector or std::deque, instead of std::list. std::vector and std::deque append in amortized constant time and keep elements densely packed, whereas std::list makes a separate allocation per node and has poor cache locality.
Use a bulk insertion method, such as std::vector::insert() or std::deque::insert(), to insert multiple values at once instead of inserting them one at a time. This reduces the per-element overhead (see the sketch after this list).
Use a memory-efficient data structure, such as std::bitset, to store the pixel values if each pixel only has a few possible values (e.g. 0 or 1). This can reduce the memory usage and improve the performance of inserting and accessing the values.
Use C++11's emplace_back() method, which avoids the overhead of constructing and copying elements by constructing the element in place in the container.
Preallocate memory for the container to avoid the overhead of frequent reallocations as it grows. You can use std::vector::reserve() for this (std::deque has no reserve(), but it already allocates in fixed-size blocks).
Consider using a faster algorithm or data structure for the image processing task itself. For example, you may be able to use optimized image processing libraries or parallelize the image processing using multi-threading or SIMD instructions.
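The same principle shows up on the pure-Python side as well: one bulk construction beats appending element by element (a rough, machine-dependent illustration with timeit):
import timeit

data = bytes(range(256)) * 10_000     # ~2.5 MB of fake pixel bytes

def one_by_one():
    out = []
    for b in data:
        out.append(b)
    return out

def bulk():
    return list(data)                 # single bulk conversion

print(timeit.timeit(one_by_one, number=3))
print(timeit.timeit(bulk, number=3))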
a is big: len(a) = 10000000
Will the Python interpreter optimize an operation like a[:10] = [1,2,3] down to O(1) time?
Is there any difference between a[:10] = [1,2,3] and a[:3] = [1,2,3]? I mean, does it matter whether the length of the list changes?
There's very much a difference between the two statements:
a[:10] = [1,2,3]
a[:3] = [1,2,3]
The first involves actual deletion of some elements in the list, whereas the second can just change the elements that are already there. You can verify this by executing:
print(len(a))
before and after the operation.
There's a useful web page that shows the various operations on standard Python data structures along with their time complexities. Deleting from a list (which is really an array under the covers) is O(n) as it involves moving all elements beyond the deletion area to fill in the gap that would otherwise be left.
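A rough way to see the difference (a sketch with timeit; exact numbers will vary, but the shrinking assignment has to shift almost the whole tail):
import timeit

setup = "a = list(range(10_000_000))"
# Same length: the three elements are overwritten in place.
print(timeit.timeit("a[:3] = [1, 2, 3]", setup=setup, number=1))
# Shorter replacement: seven slots disappear, so the tail is shifted left, O(n).
print(timeit.timeit("a[:10] = [1, 2, 3]", setup=setup, number=1))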
And, in fact, if you look at the list_ass_slice code responsible for list slice assignment, you'll see it has a number of memcpy and memmove operations for modifying the list, for example:
if (d < 0) { /* Delete -d items */
Py_ssize_t tail;
tail = (Py_SIZE(a) - ihigh) * sizeof(PyObject *);
memmove(&item[ihigh+d], &item[ihigh], tail);
if (list_resize(a, Py_SIZE(a) + d) < 0) {
memmove(&item[ihigh], &item[ihigh+d], tail);
memcpy(&item[ilow], recycle, s);
goto Error;
}
item = a->ob_item;
}
Basically, the code first works out the size difference between the slice being copied and the slice it's replacing: d = n - norig. Before copying the individual elements, it inserts some new elements if d > 0, deletes some if d < 0, and does neither if d == 0.
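The three cases are easy to see from the length of the list (a small sketch):
a = list(range(10))
a[2:5] = [7, 7, 7]       # d == 0: overwrite in place, len stays 10
a[2:5] = [7]             # d < 0: two slots deleted, tail shifts left, len 8
a[2:3] = [7, 7, 7, 7]    # d > 0: three slots inserted, tail shifts right, len 11
print(len(a))            # 11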
I am wondering what the big-O runtime complexity is for comparing two collections.Counter objects. Here is some code to demonstrate what I mean:
import collections
counter_1 = collections.Counter("abcabcabcabcabcabcdefg")
counter_2 = collections.Counter("xyzxyzxyzabc")
comp = counter_1 == counter_2 # What is the runtime of this comparison statement?
Is the runtime of the equality comparison in the final statement O(1)? Or is it O(num_of_unique_keys_in_largest_counter)? Or is it something else?
For reference, here is the source code for collections.Counter https://github.com/python/cpython/blob/0250de48199552cdaed5a4fe44b3f9cdb5325363/Lib/collections/init.py#L497
I do not see the class implementing an __eq__() method.
Bonus points: If the answer to this question changes between python2 and python3, I would love to hear the difference?
Counter is a subclass of dict, therefore the big-O analysis is that of dict, with the caveat that Counter objects are specialized to hold only int values (i.e. they cannot hold collections of values as dicts can); this simplifies the analysis.
Looking at the c code implementation of the equality comparison:
There is an early exit if the number of keys differs (this does not influence the big-O).
Then a loop iterates over all the keys, exiting early if a key is not found or if the corresponding value differs (again, this has no bearing on the big-O).
If all keys are found and the corresponding values are all equal, the dictionaries are declared equal. The lookup and comparison of each key-value pair is O(1), and this operation is repeated at most n times (n being the number of keys).
In all, the time complexity is O(n), with n the number of keys.
This applies to both python 2 and 3.
from dictobject.c
/* Return 1 if dicts equal, 0 if not, -1 if error.
* Gets out as soon as any difference is detected.
* Uses only Py_EQ comparison.
*/
static int
dict_equal(PyDictObject *a, PyDictObject *b)
{
Py_ssize_t i;
if (a->ma_used != b->ma_used)
/* can't be equal if # of entries differ */
return 0;
/* Same # of entries -- check all of 'em. Exit early on any diff. */
for (i = 0; i < a->ma_keys->dk_nentries; i++) {
PyDictKeyEntry *ep = &DK_ENTRIES(a->ma_keys)[i];
PyObject *aval;
if (a->ma_values)
aval = a->ma_values[i];
else
aval = ep->me_value;
if (aval != NULL) {
int cmp;
PyObject *bval;
PyObject *key = ep->me_key;
/* temporarily bump aval's refcount to ensure it stays
alive until we're done with it */
Py_INCREF(aval);
/* ditto for key */
Py_INCREF(key);
/* reuse the known hash value */
b->ma_keys->dk_lookup(b, key, ep->me_hash, &bval);
if (bval == NULL) {
Py_DECREF(key);
Py_DECREF(aval);
if (PyErr_Occurred())
return -1;
return 0;
}
cmp = PyObject_RichCompareBool(aval, bval, Py_EQ);
Py_DECREF(key);
Py_DECREF(aval);
if (cmp <= 0) /* error or not equal */
return cmp;
}
}
return 1;
}
Internally, collections.Counter stores the counts as a dictionary (that's why it subclasses dict), so the same rules apply as for comparing dictionaries: namely, each key's value in one dictionary is compared against the value for that key in the other. For CPython, this is implemented in dict_equal(); other implementations may vary, but logically every key-value pair has to be checked to establish equality.
This also means that the complexity is O(n) in the worst case (the code loops through one of the dictionaries and checks whether each value is the same in the other). There are no significant changes between Python 2.x and Python 3.x in this regard.
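A quick empirical check of the linear behaviour (a sketch with timeit; absolute numbers are machine-dependent, but the time should scale roughly with the number of keys):
import collections
import timeit

for n in (10_000, 100_000, 1_000_000):
    c1 = collections.Counter({i: 1 for i in range(n)})
    c2 = collections.Counter(c1)
    t = timeit.timeit(lambda: c1 == c2, number=10)
    print(n, round(t, 4))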
I have some C++ code (externed as C) which I access from python.
I want to allocate a double** in python, pass it to the C/C++ code to copy the content of a class internal data, and then use it in python similarly to how I would use a list of lists.
Unfortunately I can not manage to specify to python the size of the most inner array, so it reads invalid memory when iterating over it and the program segfaults.
I can not change the structure of the internal data in C++, and I'd like to have python do the bound checking for me (like if I was using a c_double_Array_N_Array_M instead of an array of pointers).
test.cpp (compile with g++ -Wall -fPIC --shared -o test.so test.cpp )
#include <stdlib.h>
#include <string.h>
class Dummy
{
public: // the extern "C" functions below access these members directly
double** ptr;
int e;
int i;
};
extern "C" {
void * get_dummy(int N, int M) {
Dummy * d = new Dummy();
d->ptr = new double*[N];
d->e = N;
d->i = M;
for(int i=0; i<N; ++i)
{
d->ptr[i]=new double[M];
for(int j=0; j <M; ++j)
{
d->ptr[i][j] = i*N + j;
}
}
return d;
}
void copy(void * inst, double ** dest) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
memcpy(dest[i], d->ptr[i], sizeof(double) * d->i);
}
}
void cleanup(void * inst) {
if (inst != NULL) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
delete[] d->ptr[i];
}
delete[] d->ptr;
delete d;
}
}
}
Python (this segfaults; put it in the same directory as test.so)
import os
from contextlib import contextmanager
import ctypes as ct
DOUBLE_P = ct.POINTER(ct.c_double)
library_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'test.so')
lib = ct.cdll.LoadLibrary(library_path)
lib.get_dummy.restype = ct.c_void_p
N=15
M=10
@contextmanager
def work_with_dummy(N, M):
dummy = None
try:
dummy = lib.get_dummy(N, M)
yield dummy
finally:
lib.cleanup(dummy)
with work_with_dummy(N,M) as dummy:
internal = (ct.c_double * M)
# Dest is allocated in python, it will live out of the with context and will be deallocated by python
dest = (DOUBLE_P * N)()
for i in range(N):
dest[i] = internal()
lib.copy(dummy, dest)
# dummy is not usable any more here; all the C resources have been cleaned up
for i in dest:
for n in i:
print(n) #it segfaults reading more than the length of the array
What can I change in my python code so that I can treat the array as a list?
(I need only to read from it)
3 ways to pass an int** array from Python to C and back
So that Python knows the size of the array when iterating
The data
These solutions work, with slight modifications, for either a 2D array or an array of pointers to arrays, without the use of libraries like numpy.
I will use int as the type instead of double, and we will copy source, which is defined as
int N = 10;
int M = 15;
int ** source = (int **) malloc(sizeof(int*) * N);
for(int i=0; i<N; ++i)
{
source[i] = (int *) malloc(sizeof(int) * M);
for(int j=0; j<M; ++j)
{
source[i][j] = i*N + j;
}
}
1) Assigning the array pointers
Python allocation
dest = ((ctypes.c_int * M) * N) ()
int_P = ctypes.POINTER(ctypes.c_int)
temp = (int_P * N) ()
for i in range(N):
temp[i] = dest[i]
lib.copy(temp)
del temp
# temp gets collected by GC, but the data was stored into the memory allocated by dest
# You can now access dest as if it was a list of lists
for row in dest:
for item in row:
print(item)
C copy function
void copy(int** dest)
{
for(int i=0; i<N; ++i)
{
memcpy(dest[i], source[i], sizeof(int) * M);
}
}
Explanation
We first allocate a 2D array. A 2D array[N][M] is allocated as a 1D array[N*M], with 2d_array[n][m] == 1d_array[n*M + m].
Since our code is expecting a int**, but our 2D array in allocated as a int *, we create a temporary array to provide the expected structure.
We allocate temp as an array of N pointers, and then we assign to each slot the address of the memory we allocated previously: temp[n] = 2d_array[n] = &1d_array[n*M] (the second equality is there to show what is happening with the real memory we allocated).
If you change the copying code so that it copies more than M elements per row, say M+1, you will see that it does not segfault; instead it overwrites the memory of the next row, because the rows are contiguous. (If you do change the copying code, remember to increase the size of dest allocated in Python by 1, otherwise it will segfault when it writes past the last item of the last row.)
2) Slicing the pointers
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
dest = (int_P * N) ()
for i in range(N):
dest[i] = inner_array()
lib.copy(dest)
for row in dest:
# Python knows the length of dest, so everything works fine here
for item in row:
# Python doesn't know that row is an array, so it will continue to read memory without ever stopping (actually, a segfault will stop it)
print(item)
dest = [internal[:M] for internal in dest]
for row in dest:
for item in row:
# No more segfaulting: Python now knows that each row is M items long
print(item)
C copy function
Same as for solution 1
Explanation
This time we are allocating an actual array of pointers of array, like source was allocated.
Since the outermost array (dest) is an array of pointers, Python doesn't know the length of the arrays being pointed to (it doesn't even know that each one is an array; it could just as well be a pointer to a single int).
If you iterate over such a pointer, Python will not bounds-check and it will keep reading through your memory, resulting in a segfault.
So, we slice each pointer, taking its first M elements (which are in fact all the elements in the array). Now Python knows that it should only iterate over the first M elements, and it won't segfault any more.
I believe that with this method Python copies the pointed-to contents into new lists (see the sources).
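The difference between slicing (which copies) and re-casting to a sized array type (which gives a bounds-checked view of the same memory) can be seen without the C library at all (a self-contained sketch):
import ctypes as ct

M = 5
buf = (ct.c_int * M)(*range(M))            # a real M-element array
p = ct.cast(buf, ct.POINTER(ct.c_int))     # ctypes now only sees "pointer to int"

row_copy = p[:M]                           # slicing copies M ints into a Python list
row_view = ct.cast(p, ct.POINTER(ct.c_int * M))[0]   # zero-copy view that knows its length
row_view[0] = 42
print(row_copy[0], row_view[0], buf[0])    # 0 42 42 -> the view shares the memory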
2.1) Slicing the pointers, continued
Eryksun jumped in in the comments and proposed a solution that avoids copying all of the elements into new lists.
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
inner_array_P = ctypes.POINTER(inner_array)
dest = (int_P * N) ()
for i in range(N):
dest[i] = inner_array()
lib.copy(dest)
dest_arrays = [inner_array_P.from_buffer(x)[0] for x in dest]
for row in dest_arrays:
for item in row:
print(item)
C copying code
Same as for solution 1
3) Contiguous memory
This method is an option only if you can change the copying code on the C side. source will not need to be changed.
Python allocation
dest = ((ctypes.c_int * M) * N) ()
lib.copy(dest)
for row in dest:
for item in row:
print(item)
C copy function
void copy(int * dest) {
for(int i=0; i < N; ++i)
{
memcpy(&dest[i * M], source[i], sizeof(int) * M);
}
}
Explanation
This time, like in case 1) we are allocating a contiguous 2D array. But since we can change the C code, we don't need to create a different array and copy the pointers since we will be giving the expected type to C.
In the copy function, we pass the address of the first item of every row, and we copy M elements in that row, then we go to the next row.
The copy pattern is exactly as in case 1), but this time instead of writing the interface in python so that the C code receives the data how it expects it, we changed the C code to expect the data in that precise format.
If you keep this C code, you'll be able to use numpy arrays as well, as they are 2D row major arrays.
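For example, assuming numpy is installed and lib, N and M are set up as in the listings above, a contiguous row-major numpy array can be handed to the same copy function (a sketch under those assumptions, not a tested build):
import numpy as np
import ctypes as ct

dest = np.zeros((N, M), dtype=np.intc)                  # contiguous, row-major C ints
lib.copy(dest.ctypes.data_as(ct.POINTER(ct.c_int)))     # same layout as ((c_int * M) * N)()
print(dest[0, :5])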
All of this answer is possible thanks to the great (and concise) comments of @eryksun below the original question.
How does Python hash long numbers? I guess it takes O(1) time for 32-bit ints, but the way long integers work in Python makes me think the complexity is not O(1) for them. I've looked for answers in relevant questions, but have found none straightforward enough to make me confident. Thank you in advance.
The long_hash() function indeed loops and depends on the size of the integer, yes:
/* The following loop produces a C unsigned long x such that x is
congruent to the absolute value of v modulo ULONG_MAX. The
resulting x is nonzero if and only if v is. */
while (--i >= 0) {
/* Force a native long #-bits (32 or 64) circular shift */
x = (x >> (8*SIZEOF_LONG-PyLong_SHIFT)) | (x << PyLong_SHIFT);
x += v->ob_digit[i];
/* If the addition above overflowed we compensate by
incrementing. This preserves the value modulo
ULONG_MAX. */
if (x < v->ob_digit[i])
x++;
}
where i is the 'object size', i.e. the number of digits used to represent the number; the size of a digit depends on your platform.
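You can observe this empirically: int objects do not cache their hash, so each call walks the digits again, and the time grows with the size of the number (a rough sketch with timeit; numbers are machine-dependent):
import timeit

for bits in (64, 6_400, 640_000):
    n = 1 << bits                      # an integer with many internal digits
    t = timeit.timeit(lambda: hash(n), number=10_000)
    print(bits, round(t, 4))           # time grows roughly linearly with the digit count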