Faster way to append to PyList in C++

I'm new to C++ and looking for a faster way to append pixel values to a Python list. Currently, the loop below takes around 0.1 seconds to process one frame with a resolution of 854x480. Does anyone have any ideas?
I would like to avoid third-party modules if possible.
Here is what I've got so far:
PyObject* byte_list = PyList_New(static_cast<Py_ssize_t>(0));
AVFrame *pFrameRGB = av_frame_alloc();
av_frame_copy_props(pFrameRGB, this->pFrame);
pFrameRGB->width = this->pFrame->width;
pFrameRGB->height = this->pFrame->height;
pFrameRGB->format = AV_PIX_FMT_RGB24;
av_frame_get_buffer(pFrameRGB, 0);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0,
this->pCodecContext->height, pFrameRGB->data, pFrameRGB->linesize);
if (this->_debug) {
std::cout << "Frame linesize " << pFrameRGB->linesize[0] << "\n";
std::cout << "Frame width " << pFrameRGB->width << "\n";
std::cout << "Frame height " << pFrameRGB->height << "\n";
}
// This looping method seems slow
for(int y = 0; y < pFrameRGB->height; ++y) {
    for(int x = 0; x < pFrameRGB->width; ++x) {
        int p = x * 3 + y * pFrameRGB->linesize[0];
        // Append r, g and b in turn. PyList_Append does not steal the
        // reference, so each new PyLong has to be released after appending.
        for(int k = 0; k < 3; ++k) {
            PyObject *value = PyLong_FromLong(pFrameRGB->data[0][p + k]);
            PyList_Append(byte_list, value);
            Py_XDECREF(value);
        }
    }
}
av_frame_free(&pFrameRGB);
Thanks!

After looking around, I decided to use Python's built-in array module, whose buffer can be filled with a single memcpy, instead of a PyList, which requires inserting the values one by one.
In my tests, this was 2 to 10 times faster, depending on the data.
PyObject *vec_to_array(std::vector<uint8_t>& vec) {
static PyObject *single_array;
if (!single_array) {
PyObject *array_module = PyImport_ImportModule("array");
if (!array_module)
return NULL;
PyObject *array_type = PyObject_GetAttrString(array_module, "array");
Py_DECREF(array_module);
if (!array_type)
return NULL;
single_array = PyObject_CallFunction(array_type, "s[B]", "B", 0);
Py_DECREF(array_type);
if (!single_array)
return NULL;
}
// extra-fast way to create an empty array of count elements:
// array = single_element_array * count
PyObject *pysize = PyLong_FromSsize_t(vec.size());
if (!pysize)
return NULL;
PyObject *array = PyNumber_Multiply(single_array, pysize);
Py_DECREF(pysize);
if (!array)
return NULL;
// now, obtain the address of the array's buffer
PyObject *buffer_info = PyObject_CallMethod(array, "buffer_info", "");
if (!buffer_info) {
Py_DECREF(array);
return NULL;
}
PyObject *pyaddr = PyTuple_GetItem(buffer_info, 0); // borrowed reference
void *addr = PyLong_AsVoidPtr(pyaddr);
Py_DECREF(buffer_info); // buffer_info() returned a new reference
// and, finally, copy the data.
if (vec.size())
memcpy(addr, &vec[0], vec.size() * sizeof(uint8_t));
return array;
}
After that, I pass the vector into that function:
std::vector<uint8_t> rgb_arr;
// Copy data from AV Frame
uint8_t* rgb_data[4]; int rgb_linesize[4];
av_image_alloc(rgb_data, rgb_linesize, this->pFrame->width, this->pFrame->height, AV_PIX_FMT_RGB24, 32);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0, this->pFrame->height, rgb_data, rgb_linesize);
// Put the data into vector
int rgb_size = pFrame->height * rgb_linesize[0];
std::vector<uint8_t> rgb_vector(rgb_size);
memcpy(rgb_vector.data(), rgb_data[0], rgb_size);
// Transfer the data from vector to rgb_arr
for(int y = 0; y < pFrame->height; ++y) {
rgb_arr.insert(
rgb_arr.end(),
rgb_vector.begin() + y * rgb_linesize[0],
rgb_vector.begin() + y * rgb_linesize[0] + 3 * pFrame->width
);
}
av_freep(&rgb_data[0]); // release the buffer allocated by av_image_alloc
PyObject* arr = vec_to_array(rgb_arr);
The resulting array can then be accessed from Python.
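The same idea, dropping the per-row padding and then copying the pixel bytes in bulk rather than one value at a time, can be sketched from the Python side. This is only an illustration under assumptions: `strip_row_padding` and `pixels_to_array` are hypothetical helper names, and `raw` stands in for the frame buffer (`linesize` bytes per row, of which only `3 * width` are pixel data).

```python
import array


def strip_row_padding(raw, width, height, linesize, bytes_per_pixel=3):
    """Drop the alignment padding at the end of every row of a packed frame."""
    row_bytes = width * bytes_per_pixel
    return b"".join(
        raw[y * linesize : y * linesize + row_bytes] for y in range(height)
    )


def pixels_to_array(pixels):
    """Bulk-copy bytes into an array('B') instead of appending one by one."""
    out = array.array("B")
    out.frombytes(pixels)
    return out


# Tiny 2x2 "frame" with 8 bytes per row (6 bytes of RGB data + 2 of padding).
raw = bytes(range(16))
arr = pixels_to_array(strip_row_padding(raw, width=2, height=2, linesize=8))
print(arr.tolist())
```

The row slicing plays the role of the per-row `insert` loop above, and `frombytes` plays the role of the `memcpy` into the array's buffer.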

Use a container with cheap appends and compact storage, such as std::vector or std::deque, instead of std::list. Appending to a std::vector is amortized constant time and needs no per-element allocation, whereas std::list, although it also inserts in constant time, allocates a separate node for every element.
Use a bulk insertion method, such as std::vector::insert() or std::deque::insert(), to insert multiple values at once instead of inserting them one at a time. This can reduce the overhead of inserting individual elements.
Use a memory-efficient data structure, such as std::bitset, to store the pixel values if each pixel only has a few possible values (e.g. 0 or 1). This can reduce the memory usage and improve the performance of inserting and accessing the values.
Use C++11's emplace_back() method, which avoids the overhead of constructing and copying elements by constructing the element in place in the container.
Preallocate the memory for the container to avoid the overhead of frequent memory reallocations as the container grows. You can use the reserve() method of std::vector for this; std::deque has no reserve().
Consider using a faster algorithm or data structure for the image processing task itself. For example, you may be able to use optimized image processing libraries or parallelize the image processing using multi-threading or SIMD instructions.

Related

Ctypes: allocate double** , pass it to C, then use it in Python

EDIT 3
I have some C++ code (externed as C) which I access from python.
I want to allocate a double** in python, pass it to the C/C++ code to copy the content of a class internal data, and then use it in python similarly to how I would use a list of lists.
Unfortunately I can not manage to specify to python the size of the most inner array, so it reads invalid memory when iterating over it and the program segfaults.
I can not change the structure of the internal data in C++, and I'd like to have python do the bound checking for me (like if I was using a c_double_Array_N_Array_M instead of an array of pointers).
test.cpp (compile with g++ -Wall -fPIC --shared -o test.so test.cpp )
#include <stdlib.h>
#include <string.h>
struct Dummy // struct, so the members are public and usable from the extern "C" functions
{
double** ptr;
int e;
int i;
};
extern "C" {
void * get_dummy(int N, int M) {
Dummy * d = new Dummy();
d->ptr = new double*[N];
d->e = N;
d->i = M;
for(int i=0; i<N; ++i)
{
d->ptr[i]=new double[M];
for(int j=0; j <M; ++j)
{
d->ptr[i][j] = i*N + j;
}
}
return d;
}
void copy(void * inst, double ** dest) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
memcpy(dest[i], d->ptr[i], sizeof(double) * d->i);
}
}
void cleanup(void * inst) {
if (inst != NULL) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
delete[] d->ptr[i];
}
delete[] d->ptr;
delete d;
}
}
}
Python (this segfaults. Put it in the same dir in which the test.so is)
import os
from contextlib import contextmanager
import ctypes as ct
DOUBLE_P = ct.POINTER(ct.c_double)
library_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'test.so')
lib = ct.cdll.LoadLibrary(library_path)
lib.get_dummy.restype = ct.c_void_p
# Declare the pointer arguments too, so 64-bit addresses are not truncated to a C int
lib.copy.argtypes = [ct.c_void_p, ct.POINTER(DOUBLE_P)]
lib.cleanup.argtypes = [ct.c_void_p]
N=15
M=10
@contextmanager
def work_with_dummy(N, M):
    dummy = None
    try:
        dummy = lib.get_dummy(N, M)
        yield dummy
    finally:
        lib.cleanup(dummy)
with work_with_dummy(N,M) as dummy:
    internal = (ct.c_double * M)
    # Dest is allocated in python, it will live out of the with context and will be deallocated by python
    dest = (DOUBLE_P * N)()
    for i in range(N):
        dest[i] = internal()
    lib.copy(dummy, dest)
# dummy is not available anymore here. All the C resources have been cleaned up
for i in dest:
    for n in i:
        print(n) # it segfaults reading more than the length of the array
What can I change in my python code so that I can treat the array as a list?
(I need only to read from it)
3 ways to pass an int** array from Python to C and back
So that Python knows the size of the array when iterating
The data
These solutions work for either a 2D array or an array of pointers to arrays with slight modifications, without the use of libraries like numpy.
I will use int as a type instead of double and we will copy source, which is defined as
int N = 10;
int M = 15;
int ** source = (int **) malloc(sizeof(int*) * N);
for(int i=0; i<N; ++i)
{
    source[i] = (int *) malloc(sizeof(int) * M);
    for(int j=0; j<M; ++j)
    {
        source[i][j] = i*N + j;
    }
}
1) Assigning the array pointers
Python allocation
dest = ((ctypes.c_int * M) * N) ()
int_P = ctypes.POINTER(ctypes.c_int)
temp = (int_P * N) ()
for i in range(N):
    temp[i] = dest[i]
lib.copy(temp)
del temp
# temp gets collected by GC, but the data was stored into the memory allocated by dest
# You can now access dest as if it was a list of lists
for row in dest:
    for item in row:
        print(item)
C copy function
void copy(int** dest)
{
for(int i=0; i<N; ++i)
{
memcpy(dest[i], source[i], sizeof(int) * M);
}
}
Explanation
We first allocate a 2D array. A 2D array[N][M] is allocated as a 1D array[N*M], with 2d_array[n][m] == 1d_array[n*M + m].
Since our code is expecting an int**, but our 2D array is allocated as a single contiguous block (effectively an int*), we create a temporary array of pointers to provide the expected structure.
We allocate temp as an array of N pointers, and then assign to each one the address of a row of the memory we allocated previously: temp[n] = 2d_array[n] = &1d_array[n*M] (the second equality is there to show what is happening with the real memory we allocated).
If you change the copying code so that it copies more than M items per row, say M+1, you will see that it does not segfault, but it overwrites the start of the next row because the rows are contiguous (if you change the copying code, remember to increase the size of dest allocated in Python by 1, otherwise it will segfault when you write past the last item of the last row).
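The layout described above can be checked without compiling anything. The following is a minimal sketch, assuming small demonstration sizes and a hypothetical `fake_copy` function that stands in for the C `copy()` by writing through the row pointers from Python; it uses an explicit `ctypes.cast` to build the row pointers.

```python
import ctypes

N, M = 3, 4  # small sizes chosen only for the demonstration

dest = ((ctypes.c_int * M) * N)()  # contiguous 2D array
int_P = ctypes.POINTER(ctypes.c_int)

# Array of row pointers: the int** shape the C function expects.
temp = (int_P * N)()
for i in range(N):
    temp[i] = ctypes.cast(dest[i], int_P)


def fake_copy(rows):
    """Hypothetical Python stand-in for the C copy(): writes through int**."""
    for i in range(N):
        for j in range(M):
            rows[i][j] = i * N + j


fake_copy(temp)
del temp  # the data lives in dest, not in the pointer array

# dest[n][m] and the flat view at n*M + m refer to the same memory.
flat = ctypes.cast(dest, int_P)
print([list(row) for row in dest])
print(all(flat[n * M + m] == dest[n][m] for n in range(N) for m in range(M)))
```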
2) Slicing the pointers
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
dest = (int_P * N) ()
for i in range(N):
    dest[i] = inner_array()
lib.copy(dest)
for row in dest:
    # Python knows the length of dest, so everything works fine here
    for item in row:
        # Python doesn't know that row is an array, so it will continue to read memory without ever stopping (actually, a segfault will stop it)
        print(item)
dest = [internal[:M] for internal in dest]
for row in dest:
    for item in row:
        # No more segfaulting, as Python now knows that each row is M items long
        print(item)
C copy function
Same as for solution 1
Explanation
This time we are allocating an actual array of pointers to arrays, the same way source was allocated.
Since the outermost array (dest) is an array of pointers, Python doesn't know the length of the array pointed to (it doesn't even know that it is an array; it could just as well be a pointer to a single int).
If you iterate over that pointer, Python will not do any bounds checking and will keep reading memory past the end of the row, resulting in a segfault.
So, we slice the pointer, taking the first M elements (which are in fact all the elements in the array). Now Python knows that it should only iterate over the first M elements, and it won't segfault any more.
I believe that Python copies the pointed-to content into a new list when slicing this way (see the ctypes sources).
2.1) Slicing the pointers, continued
Eryksun jumped in in the comments and proposed a solution which avoids the copying of all the elements in new lists.
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
inner_array_P = ctypes.POINTER(inner_array)
dest = (int_P * N) ()
for i in range(N):
    dest[i] = inner_array()
lib.copy(dest)
# Reinterpret each int* as a pointer to an M-element array and dereference it (no copy)
dest_arrays = [inner_array_P.from_buffer(x)[0] for x in dest]
for row in dest_arrays:
    for item in row:
        print(item)
C copying code
Same as for solution 1
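The difference between slicing (solution 2) and a sized view (solution 2.1) can be seen in a small self-contained sketch. It assumes a single row and uses `ctypes.cast(...).contents` as an alternative spelling for obtaining the sized view, rather than the `from_buffer(...)[0]` form above; the values are arbitrary.

```python
import ctypes

M = 4
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = ctypes.c_int * M

buf = inner_array(10, 20, 30, 40)
ptr = ctypes.cast(buf, int_P)  # bare pointer: carries no length information

copied = ptr[:M]  # slicing builds a new Python list (a copy)
# Sized view over the same memory, no copy:
view = ctypes.cast(ptr, ctypes.POINTER(inner_array)).contents

buf[0] = 99  # change the underlying memory
print(copied)      # the list keeps the old value
print(list(view))  # the view sees the new value
print(len(view))   # the view knows its length, so iteration stops after M items
```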
3) Contiguous memory
This method is an option only if you can change the copying code on the C side. source will not need to be changed.
Python allocation
dest = ((ctypes.c_int * M) * N) ()
lib.copy(dest)
for row in dest:
    for item in row:
        print(item)
C copy function
void copy(int * dest) {
for(int i=0; i < N; ++i)
{
memcpy(&dest[i * M], source[i], sizeof(int) * M);
}
}
Explanation
This time, like in case 1) we are allocating a contiguous 2D array. But since we can change the C code, we don't need to create a different array and copy the pointers since we will be giving the expected type to C.
In the copy function, we pass the address of the first item of every row, and we copy M elements in that row, then we go to the next row.
The copy pattern is exactly as in case 1), but this time instead of writing the interface in python so that the C code receives the data how it expects it, we changed the C code to expect the data in that precise format.
If you keep this C code, you'll be able to use numpy arrays as well, as they are 2D row major arrays.
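The row-by-row copy into contiguous memory can also be mimicked from Python alone. This is a sketch under assumptions: `copy_rows` is a hypothetical mirror of the C `copy(int * dest)` above, `source` is rebuilt as a list of separately allocated ctypes rows, and the sizes are small demonstration values.

```python
import ctypes

N, M = 3, 4
row_type = ctypes.c_int * M
row_bytes = ctypes.sizeof(row_type)

# Stand-in for the C-side data: N separately allocated rows.
source = [row_type(*(i * N + j for j in range(M))) for i in range(N)]

dest = (row_type * N)()  # one contiguous block of N*M ints


def copy_rows(dest_block):
    """Hypothetical Python mirror of the C copy(int *dest) above."""
    base = ctypes.addressof(dest_block)
    for i in range(N):
        # Same as memcpy(&dest[i * M], source[i], sizeof(int) * M)
        ctypes.memmove(base + i * row_bytes, source[i], row_bytes)


copy_rows(dest)
print([list(row) for row in dest])
```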
All of this answer is possible thanks to the great (and concise) comments of @eryksun below the original question.

python scipy/weave c. Using python variables in c code

I'm trying to run some C code in Python using inline from scipy.weave.
Let's say we have two double arrays and one double value. I wish to add each element of the first array to the corresponding element of the second array, plus the value.
The C code:
double* first;
double* second;
double val;
int length;
int i;
for (i = 0; i < length; i++) {
second[i] = second[i] + first[i] + val;
}
Then I wish to use the "second" array in my Python code again.
Given the following python code:
import numpy
from scipy import weave
first = numpy.zeros(10) #first double array
second = numpy.ones(10) #second double array
val = 1.0
code = """
the c code
"""
second = weave.inline(code,[first, second, val, 10])
Now I am not sure if this is the correct way of passing the arrays in and getting the result out, or how to access them within the C code.

C++ and Python version of the same algorithm giving different result

The following code is an algorithm to count the integer triangles whose longest side is less than or equal to MAX and that have an integer median. The Python version works but is too slow for larger values of MAX, while the C++ version is a lot faster but doesn't give the right result.
When MAX is 10, C++ and Python both return 3.
When MAX is 100, Python returns 835 and C++ returns 836.
When MAX is 200, Python returns 4088 and C++ returns 4102.
When MAX is 500, Python returns 32251 and C++ returns 32296.
When MAX is 1000, Python returns 149869 and C++ returns 150002.
Here's the C++ version:
#include <cstdio>
#include <math.h>
const int MAX = 1000;
int main()
{
long long int x = 0;
for (int b = MAX; b > 4; b--)
{
printf("%d\n", b);
for (int a = b; a > 4; a -= 2){
for (int c = floor(b/2); c < floor(MAX/2); c+=1)
{
if (a+b > 2*c){
int d = 2*(pow(a,2)+pow(b,2)-2*pow(c,2));
if (sqrt(d)/2==floor(sqrt(d)/2))
x+=1;
}
}
}
}
printf("Done: ");
printf("%lld\n", x);
}
Here's the original Python version:
import math
def sumofSquares(n):
    f = 0
    for b in range(n,4,-1):
        print(b)
        for a in range(b,4,-2):
            for C in range(math.ceil(b/2),n//2+1):
                if a+b>2*C:
                    D = 2*(a**2+b**2-2*C**2)
                    if (math.sqrt(D)/2).is_integer():
                        f += 1
    return f
a = int(input())
print(sumofSquares(a))
print('Done')
I'm not too familiar with C++ so I have no idea what could be happening that's causing this (maybe an overflow error?).
Of course, any optimizations for the algorithm are more than welcome!
The issue is that the ranges for your c variable (C in Python) do not match. To make the Python range equivalent to your existing C++ range, you can change your Python loop to:
for C in range(int(math.floor(b/2)), int(math.floor(n/2))):
...
To make them equivalent to your existing python range, you can change your C++ loop to:
for (int c = ceil(b/2.0); c < MAX/2 + 1; c++) {
...
}
Depending on which loop is originally correct, this will make the results match.
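To see the mismatch concretely, the two inner-loop ranges can be listed side by side. This is a small sketch with hypothetical helper names (`python_range`, `cpp_range`) and arbitrarily chosen values b = 7 and MAX = 10; the C++ bounds are reproduced with integer division, which is what `floor(b/2)` on two ints amounts to.

```python
import math


def python_range(b, n):
    """Values of C visited by the original Python loop."""
    return list(range(math.ceil(b / 2), n // 2 + 1))


def cpp_range(b, max_side):
    """Values of c visited by the original C++ loop (integer division)."""
    return list(range(b // 2, max_side // 2))


print(python_range(7, 10))
print(cpp_range(7, 10))
```

For an odd b the C++ loop starts one value lower, and it always stops one value earlier, so the two programs test different sets of triangles.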
It seems some of the trouble could also be here, where a floating-point square root is compared for exact equality:
(sqrt(d)==floor(sqrt(d)))
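One way to avoid floating-point comparisons altogether is to do the perfect-square test in integers. The following is a sketch, not the original poster's code: `has_integer_median` is a hypothetical helper that applies the same formula for D as the question, and it relies on `math.isqrt`, which is available from Python 3.8.

```python
import math


def has_integer_median(a, b, c):
    """Exact integer version of (math.sqrt(D) / 2).is_integer()."""
    d = 2 * (a * a + b * b - 2 * c * c)
    if d < 0:
        return False
    root = math.isqrt(d)  # exact integer square root
    # sqrt(D)/2 is an integer iff D is a perfect square with an even root
    return root * root == d and root % 2 == 0


print(has_integer_median(5, 5, 3))  # D = 64
print(has_integer_median(5, 5, 2))  # D = 84, not a perfect square
```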

Binary layout of Python lists [duplicate]

I am writing a program where I need to know the efficiency (memory wise) of different data containers in Python / Cython. One of said containers is the standard Python list.
The Python list is tripping me up because I do not know how it works on the binary level. Unlike Python, C's arrays are easy to understand, because all of the elements are the same type, and the space is declared ahead of time. This means when the programmer wants to go in and index the array, the program knows mathematically what memory address to go to. But the problem is, a Python list can store many different data types, and even nested lists inside of a list. The size of these data structures changes all the time, and the list still holds them, accounting for the changes. Does extra separator memory exist to make the list as dynamic as it is?
If you could, I would appreciate an actual binary layout of an example list in RAM, annotated with what each byte represents. This will help me to fully understand the inner workings of the list, as I am working on the binary level.
The list object is defined in Include/listobject.h. The structure is really simple:
typedef struct {
PyObject_VAR_HEAD
/* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
PyObject **ob_item;
/* ob_item contains space for 'allocated' elements. The number
* currently in use is ob_size.
* Invariants:
* 0 <= ob_size <= allocated
* len(list) == ob_size
* ob_item == NULL implies ob_size == allocated == 0
* list.sort() temporarily sets allocated to -1 to detect mutations.
*
* Items must normally not be NULL, except during construction when
* the list is not yet visible outside the function that builds it.
*/
Py_ssize_t allocated;
} PyListObject;
and PyObject_VAR_HEAD expands to a PyVarObject ob_base; member, where PyObject and PyVarObject are defined as
typedef struct _object {
_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
Basically, then, a list object looks like this:
[ssize_t ob_refcnt]
[type *ob_type]
[ssize_t ob_size]
[object **ob_item] -> [object *][object *][object *]...
[ssize_t allocated]
Note that len retrieves the value of ob_size.
ob_item points to an array of PyObject * pointers. Each element in a list is a Python object, and Python objects are always passed by reference (at the C-API level, as pointers to the actual PyObjects). Therefore, lists only store pointers to objects, and not the objects themselves.
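That lists hold only pointers can be observed from Python with `sys.getsizeof`. This is a rough sketch that assumes CPython, where `[x] * n` allocates exactly n slots; other implementations may report different numbers.

```python
import ctypes
import sys

pointer_size = ctypes.sizeof(ctypes.c_void_p)
n = 1000

small_items = [None] * n          # n pointers to a tiny object
big_items = ["x" * 10_000] * n    # n pointers to a 10 kB string

# Each slot costs one pointer, whatever it points to.
per_slot = (sys.getsizeof(small_items) - sys.getsizeof([])) / n
print(pointer_size, per_slot)
print(sys.getsizeof(small_items) == sys.getsizeof(big_items))
```

`sys.getsizeof` counts the list header and the pointer array, not the objects the pointers refer to.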
When a list fills up, it will be reallocated. allocated tracks how many elements the list can hold at maximum (before reallocation). The reallocation algorithm is in Objects/listobject.c:
/* Ensure ob_item has room for at least newsize elements, and set
* ob_size to newsize. If newsize > ob_size on entry, the content
* of the new slots at exit is undefined heap trash; it's the caller's
* responsibility to overwrite them with sane values.
* The number of allocated elements may grow, shrink, or stay the same.
* Failure is impossible if newsize <= self.allocated on entry, although
* that partly relies on an assumption that the system realloc() never
* fails when passed a number of bytes <= the number of bytes last
* allocated (the C standard doesn't guarantee this, but it's hard to
* imagine a realloc implementation where it wouldn't be true).
* Note that self->ob_item may change, and even if newsize is less
* than ob_size on entry.
*/
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
PyObject **items;
size_t new_allocated;
Py_ssize_t allocated = self->allocated;
/* Bypass realloc() when a previous overallocation is large enough
to accommodate the newsize. If the newsize falls lower than half
the allocated size, then proceed with the realloc() to shrink the list.
*/
if (allocated >= newsize && newsize >= (allocated >> 1)) {
assert(self->ob_item != NULL || newsize == 0);
Py_SIZE(self) = newsize;
return 0;
}
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
PyErr_NoMemory();
return -1;
} else {
new_allocated += newsize;
}
if (newsize == 0)
new_allocated = 0;
items = self->ob_item;
if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
PyMem_RESIZE(items, PyObject *, new_allocated);
else
items = NULL;
if (items == NULL) {
PyErr_NoMemory();
return -1;
}
self->ob_item = items;
Py_SIZE(self) = newsize;
self->allocated = new_allocated;
return 0;
}
As you can see from the comments, lists grow rather slowly, in the following sequence:
0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
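The sequence can be reproduced by replaying the over-allocation formula from the `list_resize` listing above for a series of appends. This is a sketch of that one formula only, with hypothetical function names; newer CPython releases use a different formula, so a real interpreter may show other numbers.

```python
def list_resize_allocated(newsize):
    """Capacity chosen by the list_resize() shown above when it must reallocate."""
    if newsize == 0:
        return 0
    return (newsize >> 3) + (3 if newsize < 9 else 6) + newsize


def growth_pattern(steps):
    """Successive capacities seen while appending one item at a time."""
    allocated = 0
    pattern = [allocated]
    size = 0
    while len(pattern) < steps:
        size += 1
        if size > allocated:  # the list is full, so it is resized
            allocated = list_resize_allocated(size)
            pattern.append(allocated)
    return pattern


print(growth_pattern(10))
```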

Python style iterators in C

The "yield" statement in python allows simple iteration from a procedure, and it also means that sequences don't need to be pre-calculated AND stored in a array of "arbitrary" size.
Is there a similar way of iterating (with yield) from a C procedure?
Here follows a community-wiki copy of the self-answer, which can be chosen as "the" answer. Please direct up/downvotes to the actual self-answer
Here is the method I found (it relies on nested functions, which are a GCC extension):
/* Example calculates the sum of the prime factors of the first 32 Fibonacci numbers */
#include <stdio.h>
typedef enum{false=0, true=1}bool;
/* the following line is the only time I have ever required "auto" */
#define FOR(i,iterator) auto bool lambda(i); yield_init = (void *)&lambda; iterator; bool lambda(i)
#define DO {
#define YIELD(x) if(!yield(x))return
#define BREAK return false
#define CONTINUE return true
#define OD CONTINUE; }
/* Warning: _Most_ FOR(,){ } loops _must_ have a CONTINUE as the last statement.
* * Otherwise the lambda will return random value from stack, and may terminate early */
typedef void iterator; /* hint at procedure purpose */
static volatile void *yield_init;
#define YIELDS(type) bool (*yield)(type) = yield_init
iterator fibonacci(int n){
YIELDS(int);
int i;
int pair[2] = {0,1};
YIELD(0); YIELD(1);
for(i=2; i<n; i++){
pair[i%2] = pair[0] + pair[1];
YIELD(pair[i%2]);
}
}
iterator factors(int n){
YIELDS(int);
int i;
for(i=2; i*i<=n; i++){
while(n%i == 0 ){
YIELD(i);
n/=i;
}
}
YIELD(n);
}
int main(void){
FOR(int i, fibonacci(32)){
printf("%d:", i);
int sum = 0;
FOR(int factor, factors(i)){
sum += factor;
printf(" %d",factor);
CONTINUE;
}
printf(" - sum of factors: %d\n", sum);
CONTINUE;
}
}
Got the idea from http://rosettacode.org/wiki/Prime_decomposition#ALGOL_68 - but it reads better in C
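For comparison, here is roughly what the same computation looks like with Python generators, the feature the question is trying to emulate. It is a sketch rather than a line-by-line translation: unlike the C version, this `factors` does not yield a trailing 1, and only the first 10 Fibonacci numbers are printed.

```python
def fibonacci(n):
    """Yield the first n Fibonacci numbers, one at a time."""
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b


def factors(n):
    """Yield the prime factors of n in increasing order."""
    i = 2
    while i * i <= n:
        while n % i == 0:
            yield i
            n //= i
        i += 1
    if n > 1:
        yield n


for value in fibonacci(10):
    fs = list(factors(value))
    print(f"{value}: {fs} - sum of factors: {sum(fs)}")
```

Each `yield` suspends the function and hands one value to the caller, so no sequence is computed or stored ahead of time.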
I pull this URL out as a joke from time to time: Coroutines in C.
I think the correct answer to your question is this: there's no direct equivalent, and attempts to fake it probably won't be nearly as clean or easy to use.
No.
Nice and short!
