The "yield" statement in python allows simple iteration from a procedure, and it also means that sequences don't need to be pre-calculated AND stored in a array of "arbitrary" size.
Is there a there a similar way of iterating (with yield) from a C procedure?
Here follows a community-wiki copy of the self-answer, which can be chosen as "the" answer. Please direct up/downvotes to the actual self-answer
Here is the method I found:
/* Example calculates the sum of the prime factors of the first 32 Fibonacci numbers */
#include <stdio.h>
typedef enum{false=0, true=1}bool;
/* the following line is the only time I have ever required "auto" */
#define FOR(i,iterator) auto bool lambda(i); yield_init = (void *)λ iterator; bool lambda(i)
#define DO {
#define YIELD(x) if(!yield(x))return
#define BREAK return false
#define CONTINUE return true
#define OD CONTINUE; }
/* Warning: _Most_ FOR(,){ } loops _must_ have a CONTINUE as the last statement.
* * Otherwise the lambda will return random value from stack, and may terminate early */
typedef void iterator; /* hint at procedure purpose */
static volatile void *yield_init;
#define YIELDS(type) bool (*yield)(type) = yield_init
iterator fibonacci(int n){
YIELDS(int);
int i;
int pair[2] = {0,1};
YIELD(0); YIELD(1);
for(i=2; i<n; i++){
pair[i%2] = pair[0] + pair[1];
YIELD(pair[i%2]);
}
}
iterator factors(int n){
YIELDS(int);
int i;
for(i=2; i*i<=n; i++){
while(n%i == 0 ){
YIELD(i);
n/=i;
}
}
YIELD(n);
}
main(){
FOR(int i, fibonacci(32)){
printf("%d:", i);
int sum = 0;
FOR(int factor, factors(i)){
sum += factor;
printf(" %d",factor);
CONTINUE;
}
printf(" - sum of factors: %d\n", sum);
CONTINUE;
}
}
Got the idea from http://rosettacode.org/wiki/Prime_decomposition#ALGOL_68 - but it reads better in C
I pull this URL out as a joke from time to time: Coroutines in C.
I think the correct answer to your question is this: there's no direct equivalent, and attempts to fake it probably won't be nearly as clean or easy to use.
No.
Nice and short!
Related
I'm new to c++ and looking for a faster way to append pixel value to python list, since currrently on loop it takes around 0.1 second to process one frame of image with resolution of 854x480, do anyone have any idea?
I tried to avoid using third party module if possible.
Here is what I've got so far:
PyObject* byte_list = PyList_New(static_cast<Py_ssize_t>(0));
AVFrame *pFrameRGB = av_frame_alloc();
av_frame_copy_props(pFrameRGB, this->pFrame);
pFrameRGB->width = this->pFrame->width;
pFrameRGB->height = this->pFrame->height;
pFrameRGB->format = AV_PIX_FMT_RGB24;
av_frame_get_buffer(pFrameRGB, 0);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0,
this->pCodecContext->height, pFrameRGB->data, pFrameRGB->linesize);
if (this->_debug) {
std::cout << "Frame linesize " << pFrameRGB->linesize[0] << "\n";
std::cout << "Frame width " << pFrameRGB->width << "\n";
std::cout << "Frame height " << pFrameRGB->height << "\n";
}
// This looping method seems slow
for(int y = 0; y < pFrameRGB->height; ++y) {
for(int x = 0; x < pFrameRGB->width; ++x) {
int p = x * 3 + y * pFrameRGB->linesize[0];
int r = pFrameRGB->data[0][p];
int g = pFrameRGB->data[0][p+1];
int b = pFrameRGB->data[0][p+2];
PyList_Append(byte_list, PyLong_FromLong(r));
PyList_Append(byte_list, PyLong_FromLong(g));
PyList_Append(byte_list, PyLong_FromLong(b));
}
}
av_frame_free(&pFrameRGB);
Thanks!
After looking around, I've decided to use Python Built-in Array Library that can use memcpy instead of PyList which require to input the data one by one.
From my test, this improve the speed from 2-10 times, depending on the data.
PyObject *vec_to_array(std::vector<uint8_t>& vec) {
static PyObject *single_array;
if (!single_array) {
PyObject *array_module = PyImport_ImportModule("array");
if (!array_module)
return NULL;
PyObject *array_type = PyObject_GetAttrString(array_module, "array");
Py_DECREF(array_module);
if (!array_type)
return NULL;
single_array = PyObject_CallFunction(array_type, "s[B]", "B", 0);
Py_DECREF(array_type);
if (!single_array)
return NULL;
}
// extra-fast way to create an empty array of count elements:
// array = single_element_array * count
PyObject *pysize = PyLong_FromSsize_t(vec.size());
if (!pysize)
return NULL;
PyObject *array = PyNumber_Multiply(single_array, pysize);
Py_DECREF(pysize);
if (!array)
return NULL;
// now, obtain the address of the array's buffer
PyObject *buffer_info = PyObject_CallMethod(array, "buffer_info", "");
if (!buffer_info) {
Py_DECREF(array);
return NULL;
}
PyObject *pyaddr = PyTuple_GetItem(buffer_info, 0);
void *addr = PyLong_AsVoidPtr(pyaddr);
// and, finally, copy the data.
if (vec.size())
memcpy(addr, &vec[0], vec.size() * sizeof(uint8_t));
return array;
}
after that I passed the vector into that function
std::vector<uint8_t> rgb_arr;
// Copy data from AV Frame
uint8_t* rgb_data[4]; int rgb_linesize[4];
av_image_alloc(rgb_data, rgb_linesize, this->pFrame->width, this->pFrame->height, AV_PIX_FMT_RGB24, 32);
sws_scale(this->swsCtx, this->pFrame->data, this->pFrame->linesize, 0, this->pFrame->height, rgb_data, rgb_linesize);
// Put the data into vector
int rgb_size = pFrame->height * rgb_linesize[0];
std::vector<uint8_t> rgb_vector(rgb_size);
memcpy(rgb_vector.data(), rgb_data[0], rgb_size);
// Transfer the data from vector to rgb_arr
for(int y = 0; y < pFrame->height; ++y) {
rgb_arr.insert(
rgb_arr.end(),
rgb_vector.begin() + y * rgb_linesize[0],
rgb_vector.begin() + y * rgb_linesize[0] + 3 * pFrame->width
);
}
PyObject* arr = vec_to_array(rgb_arr);
This then later can be accessed by python.
Use a container with a faster insertion time, such as std::vector or std::deque instead of std::list. These containers have a constant-time insertion time, whereas std::list has a linear-time insertion time.
Use a bulk insertion method, such as std::vector::insert() or std::deque::insert(), to insert multiple values at once instead of inserting them one at a time. This can reduce the overhead of inserting individual elements.
Use a memory-efficient data structure, such as std::bitset, to store the pixel values if each pixel only has a few possible values (e.g. 0 or 1). This can reduce the memory usage and improve the performance of inserting and accessing the values.
Use C++11's emplace_back() method, which avoids the overhead of constructing and copying elements by constructing the element in place in the container.
Preallocate the memory for the container to avoid the overhead of frequent memory reallocations as the container grows. You can use the reserve() method of std::vector or std::deque to preallocate the memory.
Consider using a faster algorithm or data structure for the image processing task itself. For example, you may be able to use optimized image processing libraries or parallelize the image processing using multi-threading or SIMD instructions.
I'm trying to generate a python wrapper for my C code with swig4.
As I do not have any experience I'm having some issues converting a numpy-array to a c array and returning a numpy-array from a c array.
Here is my repository with all my code:
https://github.com/FelixWeichselgartner/SimpleDFT
void dft(int *x, float complex *X, int N) {
X = malloc(N * sizeof(float complex));
float complex wExponent = -2 * I * pi / N;
float complex temp;
for (int l = 0; l < N; l++) {
temp = 0 + 0 * I;
for (int k = 0; k < N; k++) {
temp = x[k] * cexp(wExponent * k * l) + temp;
}
X[l] = temp/N;
}
}
This is what my C function looks like.
int *x is a pointer to a integer array with the length N.
float complex *X will be allocated by the function dft with the length of N.
/* File: dft.i */
%module dft
%{
#define SWIG_FILE_WITH_INIT
#include "dft.h"
%}
%include "numpy.i"
%init %{
import_array();
%}
%apply (int * IN_ARRAY1, int DIM1) {(int *x, int N)}
%apply (float complex * ARGOUT_ARRAY1, int DIM1) {(float complex *X, int N)}
%include "dft.h"
This is what my swig file looks like.
I guess the %apply part in the end is false, but I have no idea how to match this.
When I build my project by executing my build.sh file the created wrapper looks like this:
def dft(x, X, N):
return _dft.dft(x, X, N)
Here I can see that swig did not recognize my %apply as I want it to look like this
def dft(x):
# N is len(x)
# do some stuff
return X # X is a numpy.array() with len(X) == len(x)
I want to use it like this:
X = dft(x)
Obviously I get this error by executing:
File "example.py", line 12, in <module>
X = dft(x)
TypeError: dft() missing 2 required positional arguments: 'X' and 'N'
Do I have to write my own typemap or is it possible for me to use numpy.i.
And how would a typemap with an IN_ARRAY1 and ARGOUT_ARRAY1 in one function work.
Thanks for your help.
I was working on something similar and saw that this post had no answer. Then I realized that #schnauzbartS had already solved it and his codes are on his github (linked in the question), which is a good example.
I think a simple example people might find useful if the question is about how to typemap with IN_ARRAY1 and ARGOUT_ARRAY1 is this page:
%apply (double* IN_ARRAY1 , int DIM1) {(double* vec , int m)};
%apply (double* ARGOUT_ARRAY1, int DIM1) {(double* vec2, int n)};
Caveat: I struggled with my code a bit until I realized that IN_ARRAY from python has to be in the format of float32 not 64 (in my case it was float array). Not sure why. Another thing I found was that ARGOUT_ARRAY always needs the integer input for the size of array. So from my python end, I ended up with calling the function
myswigfunc(outarraysize,inarray,inarray2)
where outarraysize is for ARGOUT_ARRAY and the other inputs are IN_ARRAY_1 and _2 for example.
Here is a comparison I made. np.argsort was timed on a float32 ndarray consists of 1,000,000 elements.
In [1]: import numpy as np
In [2]: a = np.random.randn(1000000)
In [3]: a = a.astype(np.float32)
In [4]: %timeit np.argsort(a)
86.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And here is a C++ program do the same procedure but on vectors referring to this answer.
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
std::vector<float> numbers;
for (int i = 0; i != 1000000; ++i) {
numbers.push_back((float)rand() / (RAND_MAX));
}
double e1 = (double)cv::getTickCount();
std::vector<size_t> idx(numbers.size());
std::iota(idx.begin(), idx.end(), 0);
std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
{ return numbers[a] < numbers[b];});
double e2 = (double)cv::getTickCount();
std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
return 0;
}
It prints Finished in 525.908 milliseconds. and it is far slower than the numpy version. So could anyone explain what makes np.argsort so fast? Thanks.
Edit1: np.__version__ returns 1.15.0 which runs on Python 3.6.6 |Anaconda custom (64-bit) and g++ --version prints 8.2.0. Operating system is Manjaro Linux.
Edit2: I treid to compile with -O2 and -O3 flags in g++ and I got result within 216.515 miliseconds and 205.017 miliseconds. That is an improve but still slower than numpy version. (Referring to this question) This was deleted because I mistakenly run the test with my laptop's DC adapter unplugged, which would cause it slow down. In a fair competition, C-array and vector version perform equally (take about 100ms).
Edit3: Another approach would be to replace vector with C like array: float numbers[1000000];. After that the running time is about 100ms(+/-5ms). Full code here:
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
//std::vector<float> numbers;
float numbers[1000000];
for (int i = 0; i != 1000000; ++i) {
numbers[i] = ((float)rand() / (RAND_MAX));
}
double e1 = (double)cv::getTickCount();
std::vector<size_t> idx(1000000);
std::iota(idx.begin(), idx.end(), 0);
std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
{ return numbers[a] < numbers[b];});
double e2 = (double)cv::getTickCount();
std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
return 0;
}
I took your implementation and measured it with 10000000 items. It took approximated 1.7 seconds.
Now I introduced a class
class valuePair {
public:
valuePair(int idx, float value) : idx(idx), value(value){};
int idx;
float value;
};
with is initialized as
std::vector<valuePair> pairs;
for (int i = 0; i != 10000000; ++i) {
pairs.push_back(valuePair(i, (double)rand() / (RAND_MAX)));
}
and the sorting than is done with
std::sort(pairs.begin(), pairs.end(), [&](const valuePair &a, const valuePair &b) { return a.value < b.value; });
This code reduces the runtime down to 1.1 seconds. This is I think due to a better cache coherency, but still quite far away from the python results.
Ideas:
different underlying algorithm:. np.argsort uses quicksort as default, the implementation in C++ might depend on your compiler.
function call overhead: I'm not sure if C++ compilers inline your comparison function. If not, calling this function also might introduce some overhead. not the case according to this post
compiler flags ?
EDIT 3
I have some C++ code (externed as C) which I access from python.
I want to allocate a double** in python, pass it to the C/C++ code to copy the content of a class internal data, and then use it in python similarly to how I would use a list of lists.
Unfortunately I can not manage to specify to python the size of the most inner array, so it reads invalid memory when iterating over it and the program segfaults.
I can not change the structure of the internal data in C++, and I'd like to have python do the bound checking for me (like if I was using a c_double_Array_N_Array_M instead of an array of pointers).
test.cpp (compile with g++ -Wall -fPIC --shared -o test.so test.cpp )
#include <stdlib.h>
#include <string.h>
class Dummy
{
double** ptr;
int e;
int i;
};
extern "C" {
void * get_dummy(int N, int M) {
Dummy * d = new Dummy();
d->ptr = new double*[N];
d->e = N;
d->i = M;
for(int i=0; i<N; ++i)
{
d->ptr[i]=new double[M];
for(int j=0; j <M; ++j)
{
d->ptr[i][j] = i*N + j;
}
}
return d;
}
void copy(void * inst, double ** dest) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
memcpy(dest[i], d->ptr[i], sizeof(double) * d->i);
}
}
void cleanup(void * inst) {
if (inst != NULL) {
Dummy * d = static_cast<Dummy*>(inst);
for(int i=0; i < d->e; ++i)
{
delete[] d->ptr[i];
}
delete[] d->ptr;
delete d;
}
}
}
Python (this segfaults. Put it in the same dir in which the test.so is)
import os
from contextlib import contextmanager
import ctypes as ct
DOUBLE_P = ct.POINTER(ct.c_double)
library_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'test.so')
lib = ct.cdll.LoadLibrary(library_path)
lib.get_dummy.restype = ct.c_void_p
N=15
M=10
#contextmanager
def work_with_dummy(N, M):
dummy = None
try:
dummy = lib.get_dummy(N, M)
yield dummy
finally:
lib.cleanup(dummy)
with work_with_dummy(N,M) as dummy:
internal = (ct.c_double * M)
# Dest is allocated in python, it will live out of the with context and will be deallocated by python
dest = (DOUBLE_P * N)()
for i in range(N):
dest[i] = internal()
lib.copy(dummy, dest)
#dummy is not available anymore here. All the C resources has been cleaned up
for i in dest:
for n in i:
print(n) #it segfaults reading more than the length of the array
What can I change in my python code so that I can treat the array as a list?
(I need only to read from it)
3 ways to pass a int** array from Python to C and back
So that Python knows the size of the array when iterating
The data
This solutions work for either 2d array or array of pointers to arrays with slight modifications, without the use of libraries like numpy.
I will use int as a type instead of double and we will copy source, which is defined as
N = 10;
M = 15;
int ** source = (int **) malloc(sizeof(int*) * N);
for(int i=0; i<N; ++i)
{
source[i] = (int *) malloc(sizeof(int) * M);
for(int j=0; j<M; ++j)
{
source[i][j] = i*N + j;
}
}
1) Assigning the array pointers
Python allocation
dest = ((ctypes.c_int * M) * N) ()
int_P = ctypes.POINTER(ctypes.c_int)
temp = (int_P * N) ()
for i in range(N):
temp[i] = dest[i]
lib.copy(temp)
del temp
# temp gets collected by GC, but the data was stored into the memory allocated by dest
# You can now access dest as if it was a list of lists
for row in dest:
for item in row:
print(item)
C copy function
void copy(int** dest)
{
for(int i=0; i<N; ++i)
{
memcpy(dest[i], source[i], sizeof(int) * M);
}
}
Explanation
We first allocate a 2D array. A 2D array[N][M] is allocated as a 1D array[N*M], with 2d_array[n][m] == 1d_array[n*M + m].
Since our code is expecting a int**, but our 2D array in allocated as a int *, we create a temporary array to provide the expected structure.
We allocate temp[N][M], and than we assign the address of the memory we allocated previously temp[n] = 2d_array[n] = &1d_array[n*M] (the second equal is there to show what is happening with the real memory we allocated).
If you change the copying code so that it copies more than M, let's say M+1, you will see that it will not segfault, but it will override the memory of the next row because they are contiguous (if you change the copying code, remember to add increase by 1 the size of dest allocated in python, otherwise it will segfault when you write after the last item of the last row)
2) Slicing the pointers
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
dest = (int_P * N) ()
for i in range(N):
dest[i] = inner_array()
lib.copy(dest)
for row in dest:
# Python knows the length of dest, so everything works fine here
for item in row:
# Python doesn't know that row is an array, so it will continue to read memory without ever stopping (actually, a segfault will stop it)
print(item)
dest = [internal[:M] for internal in dest]
for row in dest:
for item in row:
# No more segfaulting, as now python know that internal is M item long
print(item)
C copy function
Same as for solution 1
Explanation
This time we are allocating an actual array of pointers of array, like source was allocated.
Since the outermost array ( dest ) is an array of pointers, python doesn't know the length of the array pointed to (it doesn't even know that is an array, it could be a pointer to a single int as well).
If you iterate over that pointer, python will not bound check and it will start reading all your memory, resulting in a segfault.
So, we slice the pointer taking the first M elements (which actually are all the elements in the array). Now python knows that it should only iterate over the first M elements, and it won't segfault any more.
I believe that python copies the content pointed to a new list using this method ( see sources )
2.1) Slicing the pointers, continued
Eryksun jumped in in the comments and proposed a solution which avoids the copying of all the elements in new lists.
Python allocation
int_P = ctypes.POINTER(ctypes.c_int)
inner_array = (ctypes.c_int * M)
inner_array_P = ctypes.POINTER(inner_array)
dest = (int_P * N) ()
for i in range(N):
dest[i] = inner_array()
lib.copy(dest)
dest_arrays = [inner_array_p.from_buffer(x)[0] for x in dest]
for row in dest_arrays:
for item in row:
print(item)
C copying code
Same as for solution 1
3) Contiguous memory
This method is an option only if you can change the copying code on the C side. source will not need to be changed.
Python allocation
dest = ((ctypes.c_int * M) * N) ()
lib.copy(dest)
for row in dest:
for item in row:
print(item)
C copy function
void copy(int * dest) {
for(int i=0; i < N; ++i)
{
memcpy(&dest[i * M], source[i], sizeof(int) * M);
}
}
Explanation
This time, like in case 1) we are allocating a contiguous 2D array. But since we can change the C code, we don't need to create a different array and copy the pointers since we will be giving the expected type to C.
In the copy function, we pass the address of the first item of every row, and we copy M elements in that row, then we go to the next row.
The copy pattern is exactly as in case 1), but this time instead of writing the interface in python so that the C code receives the data how it expects it, we changed the C code to expect the data in that precise format.
If you keep this C code, you'll be able to use numpy arrays as well, as they are 2D row major arrays.
All of this answer is possible thanks the great (and concise) comments of #eryksun below the original question.
The following code is an algorithm to determine the amount of integer triangles, with their biggest side being smaller or equal to MAX, that have an integer median. The Python version works but is too slow for bigger N, while the C++ version is a lot faster but doesn't give the right result.
When MAX is 10, C++ and Python both return 3.
When MAX is 100, Python returns 835 and C++ returns 836.
When MAX is 200, Python returns 4088 and C++ returns 4102.
When MAX is 500, Python returns 32251 and C++ returns 32296.
When MAX is 1000, Python returns 149869 and C++ returns 150002.
Here's the C++ version:
#include <cstdio>
#include <math.h>
const int MAX = 1000;
int main()
{
long long int x = 0;
for (int b = MAX; b > 4; b--)
{
printf("%lld\n", b);
for (int a = b; a > 4; a -= 2){
for (int c = floor(b/2); c < floor(MAX/2); c+=1)
{
if (a+b > 2*c){
int d = 2*(pow(a,2)+pow(b,2)-2*pow(c,2));
if (sqrt(d)/2==floor(sqrt(d)/2))
x+=1;
}
}
}
}
printf("Done: ");
printf("%lld\n", x);
}
Here's the original Python version:
import math
def sumofSquares(n):
f = 0
for b in range(n,4,-1):
print(b)
for a in range(b,4,-2):
for C in range(math.ceil(b/2),n//2+1):
if a+b>2*C:
D = 2*(a**2+b**2-2*C**2)
if (math.sqrt(D)/2).is_integer():
f += 1
return f
a = int(input())
print(sumofSquares(a))
print('Done')
I'm not too familiar with C++ so I have no idea what could be happening that's causing this (maybe an overflow error?).
Of course, any optimizations for the algorithm are more than welcome!
The issue is that the range for your c (C in python) variables do not match. To make them equivalent to your existing C++ range, you can change your python loop to:
for C in range(int(math.floor(b/2)), int(math.floor(n/2))):
...
To make them equivalent to your existing python range, you can change your C++ loop to:
for (int c = ceil(b/2.0); c < MAX/2 + 1; c++) {
...
}
Depending on which loop is originally correct, this will make the results match.
It seams some troubles could be here:
(sqrt(d)==floor(sqrt(d)))