Performance of LLVM-Compiler on native C code vs Python+Numba

Performance of LLVM-Compiler on native C code vs Python+Numba - python

I recently did some tests on performance optimization in Python. One part was doing a benchmark on Monte-Carlo Pi calculation using SWIG and compile a library to import in Python. The other solution was using Numba. Now I totally wonder why the native C solution is worse than Numba even if LLVM compiler is used for both. So I'm wondering if I'm doing something wrong.
Runtime on my Laptop
native C module: 7.09 s
Python+Numba: 2.75 s
Native C code
#include "swigtest.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
float monte_carlo_pi(long nsamples)
{
int accGlob=0;
int accLoc=0;
int i,ns;
float x,y;
float res;
float iRMX=1.0/(float) RAND_MAX;
srand(time(NULL));
for(i=0;i<nsamples;i++)
{
x = (float)rand()*iRMX;
y = (float)rand()*iRMX;
if((x*x + y*y) < 1.0) { acc += 1;}
}
res = 4.0 * (float) acc / (float) nsamples;
printf("cres = %.5f\n",res);
return res;
}
swigtest.i
%module swigtest
%{
#define SWIG_FILE_WITH_INIT
#include "swigtest.h"
%}
float monte_carlo_pi(long nsamples);
Compiler call
clang.exe swigtest.c swigtest_wrap.c -Ofast -o _swigtest.pyd -I C:\python37\include -shared -L c:\python37\libs -g0 -mtune=intel -msse4.2 -mmmx
testswig.py
from swigtest import monte_carlo_pi
import time
import os
start = time.time()
pi = monte_carlo_pi(250000000)
print("pi: %.5f" % pi)
print("tm:",time.time()-start)
Python version with Numba
from numba import jit
import random
import time
start = time.time()
#jit(nopython=True,cache=True,fastmath=True)
def monte_carlo_pi(nsamples: int)-> float:
acc:int = 0
for i in range(nsamples):
x:float = random.random()
y:float = random.random()
if (x * x + y * y) < 1.0: acc += 1
return 4.0 * acc / nsamples
pi = monte_carlo_pi(250000000)
print("pi:",pi)
print("tm:",time.time()-start)

Summary up to now:
The rand() function seems to consume most of the time. Using a deterministic approach like this
...
ns = (long) sqrt((double)nsamples)+1;
dx = 1./sqrt((double)nsamples);
dy = dx;
...
for(i=0;i<ns;i++)
for(k=0;k<ns;k++)
{
x = i*dx;
y = k*dy;
if((x*x + y*y) < 1.0) { accLoc += 1;}
}
...
instead of rand() results in an execution tim of only 0.04 s! Obviously Numba uses another much more efficient random function.

Related

How to copy a 2D array (matrix) from python with a C function (and do some computer heavy computation) which return a 2D array (matrix) in python?

I want to copy a 2D numpy array (matrix) in a C function a get it back in python (and then do some calculation on it in C taking the speed advantage of C) . Therefore I need the C function matrix_copy to return a 2D array (or, I guess, a pointer to it). I tried with the following code but I get the following output (where one can see the second dimension of the array is lost).
matrix_in.shape:
(300, 200)
matrix_out.shape:
(300,)
How could I change the code (I guess the matrix_copy.c adding some pointer magic) so I could obtain an exact copy of the matrix_in in matrix_out?
Here is the main.py script:
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import pdb
n = 300
m = 200
matrix_in = np.random.randn(n, m)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.restype = ndpointer(dtype=c_double,
shape=(n,))
matrix_out = matrix_copy(c_void_p(matrix_in.ctypes.data),
c_int(n),
c_int(m))
print("matrix_in.shape:")
print(matrix_in.shape)
print("matrix_out.shape:")
print(matrix_out.shape)
Here is the matrix_copy.c script:
#include <stdlib.h>
#include <stdio.h>
double * matrix_copy(const double * matrix_in, int n, int m){
double * matrix_out = (double *)malloc(sizeof(double) * (n*m));
int index = 0;
for(int i=0; i< n; i++){
for(int j=0; j<m; j++){
matrix_out[i+j] = matrix_in[i+j];
//matrix_out[i][j] = matrix_in[i][j];
// some heavy computations not yet implemented
}
}
return matrix_out;
}
which I compile with the command
cc -fPIC -shared -o matrix_copy.so matrix_copy.c
And as a side note, why does the notation matrix_out[i][j] = matrix_in[i][j]; throws me an error on compilation?
matrix_copy.c:10:26: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
~~~~~~~~~~~~~^~
matrix_copy.c:10:44: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];

The second dimension is 'lost' because you explicitly omit it in the named shape argument of ndpointer. Change:
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,))
to
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,m), flags='C')
Where flags='C' additionally notes that the returned data is stored contiguously in row major order.
With regards to matrix_out[i][j] = matrix_in[i][j]; throwing an error, consider that matrix_in is of type const double *. matrix_in[i] would yield a value of type const double - how do you further index this value (i.e., with [j])?
If you want to emulate accessing a 2-dimensional array via a single pointer, you must calculate offsets manually. matrix_out[i+j] is not sufficient, as you must account for the span of each sub array:
matrix_out[i * m + j] = matrix_in[i * m + j];
Note that in C, size_t is the generally preferred type to use when dealing with memory sizes or array lengths.
matrix_copy.c, simplified:
#include <stdlib.h>
double *matrix_copy(const double *matrix_in, size_t n, size_t m)
{
double *matrix_out = malloc(sizeof *matrix_out * n * m);
for (size_t i = 0; i < n; i++)
for (size_t j = 0; j < m; j++)
matrix_out[i * m + j] = matrix_in[i * m + j];
return matrix_out;
}
matrix.py, with more explicit typing:
from ctypes import c_void_p, c_double, c_size_t, cdll, POINTER
from numpy.ctypeslib import ndpointer
import numpy as np
c_double_p = POINTER(c_double)
n = 300
m = 200
matrix_in = np.random.randn(n, m).astype(c_double)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.argtypes = c_double_p, c_size_t, c_size_t
matrix_copy.restype = ndpointer(
dtype=c_double,
shape=(n,m),
flags='C')
matrix_out = matrix_copy(
matrix_in.ctypes.data_as(c_double_p),
c_size_t(n),
c_size_t(m))
print("matrix_in.shape:", matrix_in.shape)
print("matrix_out.shape:", matrix_out.shape)
print("in == out", matrix_in == matrix_out)

The incoming data is a probably single block of memory. You need to create the substructure.
In my C++ code I have to do the following on data (block) coming in via swig:
void divide2DDoubleArray(double * &block, double ** &subblockdividers, int noofsubblocks, int subblocksize){
/* The starting address of a block of doubles is used to generate
* pointers to subblocks.
*
* block: memory containing the original block of data
* subblockdividers: array of subblock addresses
* noofsubblocks: specify the number of subblocks produced
* subblocksize: specify the size of the subblocks produced
*
* Design by contract: application should make sure the memory
* in block is allocated and initialized properly.
*/
// Build 2D matrix for cols
subblockdividers=new double *[noofsubblocks];
subblockdividers[0]= block;
for (int i=1; i<noofsubblocks; ++i) {
subblockdividers[i] = &subblockdividers[i-1][subblocksize];
}
}
Now the pointer returned in subblockdividers can be used the way you would like to.
Don't forget to free subblockdividers when your done. (Note: adjustments might be needed to compile this as C code)

How to get rid of for loop in my function? [duplicate]

T(i) = Tm(i) + (T(i-1)-Tm(i))**(-tau(i))
Tm and tau are NumPy vectors of the same length that have been previously calculated, and the desire is to create a new vector T. The i is included only to indicate the element index for what is desired.
Is a for loop necessary for this case?

You might think this would work:
import numpy as np
n = len(Tm)
t = np.empty(n)
t[0] = 0 # or whatever the initial condition is
t[1:] = Tm[1:] + (t[0:n-1] - Tm[1:])**(-tau[1:])
but it doesn't: you can't actually do recursion in numpy this way (since numpy calculates the whole RHS and then assigns it to the LHS).
So unless you can come up with a non-recursive version of this formula, you're stuck with an explicit loop:
tt = np.empty(n)
tt[0] = 0.
for i in range(1,n):
tt[i] = Tm[i] + (tt[i-1] - Tm[i])**(-tau[i])

2019 Update. The Numba code broke with the new version of numba. Changing dtype="float32" to dtype=np.float32 solved it.
I performed some benchmarks and in 2019 using Numba is the first option people should try to accelerate recursive functions in Numpy (adjusted proposal of Aronstef). Numba is already preinstalled in the Anaconda package and has one of the fastest times (about 20 times faster than any Python). In 2019 Python supports #numba annotations without additional steps (at least versions 3.6, 3.7, and 3.8). Here are three benchmarks: performed on 2019-12-05, 2018-10-20 and 2016-05-18.
And, as mentioned by Jaffe, in 2018 it is still not possible to vectorize recursive functions. I checked the vectorization by Aronstef and it does NOT work.
Benchmarks sorted by execution time:
-------------------------------------------
|Variant |2019-12 |2018-10 |2016-05 |
-------------------------------------------
|Pure C | na | na | 2.75 ms|
|C extension | na | na | 6.22 ms|
|Cython float32 | 0.55 ms| 1.01 ms| na |
|Cython float64 | 0.54 ms| 1.05 ms| 6.26 ms|
|Fortran f2py | 4.65 ms| na | 6.78 ms|
|Numba float32 |73.0 ms| 2.81 ms| na |
|(Aronstef) | | | |
|Numba float32v2| 1.82 ms| 2.81 ms| na |
|Numba float64 |78.9 ms| 5.28 ms| na |
|Numba float64v2| 4.49 ms| 5.28 ms| na |
|Append to list |73.3 ms|48.2 ms|91.0 ms|
|Using a.item() |36.9 ms|58.3 ms|74.4 ms|
|np.fromiter() |60.8 ms|60.0 ms|78.1 ms|
|Loop over Numpy|71.3 ms|71.9 ms|87.9 ms|
|(Jaffe) | | | |
|Loop over Numpy|74.6 ms|74.4 ms| na |
|(Aronstef) | | | |
-------------------------------------------
Corresponding code is provided at the end of the answer.
It seems that with time Numba and Cython times get better. Now both of them are faster than Fortran f2py. Cython is faster 8.6 times now and Numba 32bit is faster 2.5 times. Fortran was very hard to debug and compile in 2016. So now there is no reason to use Fortran at all.
I did not check Pure C and C extension in 2019 and 2018, because it is not easy to compile them in Jupyter notebooks.
I had the following setup in 2019:
Processor: Intel i5-9600K 3.70GHz
Versions:
Python: 3.8.0
Numba: 0.46.0
Cython: 0.29.14
Numpy: 1.17.4
I had the following setup in 2018:
Processor: Intel i7-7500U 2.7GHz
Versions:
Python: 3.7.0
Numba: 0.39.0
Cython: 0.28.5
Numpy: 1.15.1
The recommended Numba code using float32 (adjusted Aronstef):
#numba.jit("float32[:](float32[:], float32[:])", nopython=True, nogil=True)
def calc_py_jit32v2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float32)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
All the other code:
Data creation (like Aronstef + Mike T comment):
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float64'))
tau = np.random.uniform(-1, 0, size=n).astype('float64')
ar = np.column_stack([Tm,tau])
Tm32 = Tm.astype('float32')
tau32 = tau.astype('float32')
Tm_l = list(Tm)
tau_l = list(tau)
The code in 2016 was slightly different as I used abs() function to prevent nans and not the variant of Mike T. In 2018 the function is exactly the same as OP (Original Poster) wrote.
Cython float32 using Jupyter %% magic. The function can be used directly in Python. Cython needs a C++ compiler in which Python was compiled. Installation of the right version of Visual C++ compiler (for Windows) could be problematic:
%%cython
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
cdef extern from "math.h":
np.float32_t exp(np.float32_t m)
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.infer_types(True)
#cython.initializedcheck(False)
def cy_loop32(np.float32_t[:] Tm,np.float32_t[:] tau,int alen):
cdef np.float32_t[:] T=np.empty(alen, dtype=np.float32)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Cython float64 using Jupyter %% magic. The function can be used directly in Python:
%%cython
cdef extern from "math.h":
double exp(double m)
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.infer_types(True)
#cython.initializedcheck(False)
def cy_loop(double[:] Tm,double[:] tau,int alen):
cdef double[:] T=np.empty(alen)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Numba float64:
#numba.jit("float64[:](float64[:], float64[:])", nopython=False, nogil=True)
def calc_py_jitv2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float64)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
Append to list. Fastest non-compiled solution:
def rec_py_loop(Tm,tau,alen):
T = [Tm[0]]
for i in range(1,alen):
T.append(Tm[i] - (T[i-1] + Tm[i])**(-tau[i]))
return np.array(T)
Using a.item():
def rec_numpy_loop_item(Tm_,tau_):
n_ = len(Tm_)
tt=np.empty(n_)
Ti=tt.item
Tis=tt.itemset
Tmi=Tm_.item
taui=tau_.item
Tis(0,Tm_[0])
for i in range(1,n_):
Tis(i,Tmi(i) - (Ti(i-1) + Tmi(i))**(-taui(i)))
return tt[1:]
np.fromiter():
def it(Tm,tau):
T=Tm[0]
i=0
while True:
yield T
i+=1
T=Tm[i] - (T + Tm[i])**(-tau[i])
def rec_numpy_iter(Tm,tau,alen):
return np.fromiter(it(Tm,tau), np.float64, alen)[1:]
Loop over Numpy (based on the Jaffe's idea):
def rec_numpy_loop(Tm,tau,alen):
tt=np.empty(alen)
tt[0]=Tm[0]
for i in range(1,alen):
tt[i] = Tm[i] - (tt[i-1] + Tm[i])**(-tau[i])
return tt[1:]
Loop over Numpy (Aronstef's code). On my computer float64 is the default type for np.empty.
def calc_py(Tm_, tau_):
tt = np.empty(len(Tm_),dtype="float64")
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = (Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i]))
return tt[1:]
Pure C without using Python at all. Version from year 2016 (with fabs() function):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <sys\timeb.h>
double randn() {
double u = rand();
if (u > 0.5) {
return sqrt(-1.57079632679*log(1.0 - pow(2.0 * u - 1, 2)));
}
else {
return -sqrt(-1.57079632679*log(1.0 - pow(1 - 2.0 * u,2)));
}
}
void rec_pure_c(double *Tm, double *tau, int alen, double *T)
{
for (int i = 1; i < alen; i++)
{
T[i] = Tm[i] + pow(fabs(T[i - 1] - Tm[i]), (-tau[i]));
}
}
int main() {
int N = 100000;
double *Tm= calloc(N, sizeof *Tm);
double *tau = calloc(N, sizeof *tau);
double *T = calloc(N, sizeof *T);
double time = 0;
double sumtime = 0;
for (int i = 0; i < N; i++)
{
Tm[i] = randn();
tau[i] = randn();
}
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
for (int j = 0; j < 1000; j++)
{
for (int i = 0; i < 3; i++)
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
rec_pure_c(Tm, tau, N, T);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
if (i == 0)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
else {
if (time > (double)ElapsedMicroseconds.QuadPart / 1000)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
}
}
sumtime += time;
}
printf("1000 loops,best of 3: %.3f ms per loop\n",sumtime/1000);
free(Tm);
free(tau);
free(T);
}
Fortran f2py. Function can be used from Python. Version from year 2016 (with abs() function):
subroutine rec_fortran(tm,tau,alen,result)
integer*8, intent(in) :: alen
real*8, dimension(alen), intent(in) :: tm
real*8, dimension(alen), intent(in) :: tau
real*8, dimension(alen) :: res
real*8, dimension(alen), intent(out) :: result
res(1)=0
do i=2,alen
res(i) = tm(i) + (abs(res(i-1) - tm(i)))**(-tau(i))
end do
result=res
end subroutine rec_fortran

Update: 21-10-2018
I have corrected my answer based on comments.
It is possible to vectorize operations on vectors as long as the calculation is not recursive. Because a recursive operation depends on the previous calculated value it is not possible to parallel process the operation.
This does therefore not work:
def calc_vect(Tm_, tau_):
return Tm_[1:] - (Tm_[:-1] + Tm_[1:]) ** (-tau_[1:])
Since (serial processing / a loop) is necessary, the best performance is gained by moving as close as possible to optimized machine code, therefore Numba and Cython are the best answers here.
A Numba approach can be achieves as follows:
init_string = """
from math import pow
import numpy as np
from numba import jit, float32
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float32'))
tau = np.random.uniform(-1, 0, size=n).astype('float32')
def calc_python(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
#jit(float32[:](float32[:], float32[:]), nopython=False, nogil=True)
def calc_numba(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
"""
import timeit
py_time = timeit.timeit('calc_python(Tm, tau)', init_string, number=100)
numba_time = timeit.timeit('calc_numba(Tm, tau)', init_string, number=100)
print("Python Solution: {}".format(py_time))
print("Numba Soltution: {}".format(numba_time))
Timeit comparison of the Python and Numba functions:
Python Solution: 54.58057559299999
Numba Soltution: 1.1389029540000024

This is a good question. I am also interested to know if this is possible but so far I have not found a way to do it except in some simple cases.
Option 1. numpy.ufunc.accumulate
This seems to be a promising option as mentioned by #Karl Knechtel. You need to create a ufunc first. This web page explains how.
In the simple case of a recurrent function that takes two scalars as input and outputs one scaler, it seems to work:
import numpy as np
def test_add(x, data):
return x + data
assert test_add(1, 2) == 3
assert test_add(2, 3) == 5
# Make a Numpy ufunc from my test_add function
test_add_ufunc = np.frompyfunc(test_add, 2, 1)
assert test_add_ufunc(1, 2) == 3
assert test_add_ufunc(2, 3) == 5
assert np.all(test_add_ufunc([1, 2], [2, 3]) == [3, 5])
data_sequence = np.array([1, 2, 3, 4])
f_out = test_add_ufunc.accumulate(data_sequence, dtype=object)
assert np.array_equal(f_out, [1, 3, 6, 10])
[Note the dtype=object argument which is necessary as explained on the web page linked above].
But in your case (and mine) we want to compute a recurrent equation that has more than one data input (and potentially more than one state variable too).
When I tried this using the ufunc.accumulate approach above I got ValueError: accumulate only supported for binary functions.
If anyone knows a way round that constraint I would be very interested.
Option 2. Python's builtin accumulate function
In the mean time, this solution doesn't quite achieve what you wanted in terms of a vectorized calculation in numpy, but it does at least avoid a for loop.
from itertools import accumulate, chain
def t_next(t, data):
Tm, tau = data # Unpack more than one data input
return Tm + (t - Tm)**tau
assert t_next(2, (0.38, 0)) == 1.38
t0 = 2 # Initial t
Tm_values = np.array([0.38, 0.88, 0.56, 0.67, 0.45, 0.98, 0.58, 0.72, 0.92, 0.82])
tau_values = np.linspace(0, 0.9, 10)
# Combine the input data into a 2D array
data_sequence = np.vstack([Tm_values, tau_values]).T
t_out = np.fromiter(accumulate(chain([t0], data_sequence), t_next), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]
# Slightly more readable version possible in Python 3.8+
t_out = np.fromiter(accumulate(data_sequence, t_next, initial=t0), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]

To build on NPE's answer, I agree that there has to be a loop somewhere. Perhaps your goal is to avoid the overhead associated with a Python for loop? In that case, numpy.fromiter does beat out a for loop, but only by a little:
Using the very simple recursion relation,
x[i+1] = x[i] + 0.1
I get
#FOR LOOP
def loopit(n):
x = [0.0]
for i in range(n-1): x.append(x[-1] + 0.1)
return np.array(x)
#FROMITER
#define an iterator (a better way probably exists -- I'm a novice)
def it():
x = 0.0
while True:
yield x
x += 0.1
#use the iterator with np.fromiter
def fi_it(n):
return np.fromiter(it(), np.float, n)
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 31.7 ms per loop
%timeit -n 100 fi_it(100000)
#100 loops, best of 3: 18.6 ms per loop
Interestingly, pre-allocating a numpy array results in a substantial loss in performance. This is a mystery to me, though I would guess that there must be more overhead associated with accessing an array element than with appending to a list.
def loopit(n):
x = np.zeros(n)
for i in range(n-1): x[i+1] = x[i] + 0.1
return x
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 50.1 ms per loop

Same random numbers in C++ as computed by Python3 numpy.random.rand

I would like to duplicate in C++ the testing for some code that has already been implemented in Python3 which relies on numpy.random.rand and randn values and a specific seed (e.g., seed = 1).
I understand that Python's random implementation is based on a Mersenne twister. The C++ standard library also supplies this in std::mersenne_twister_engine.
The C++ version returns an unsigned int, whereas Python rand is a floating point value.
Is there a way to obtain the same values in C++ as are generated in Python, and be sure that they are the same? And the same for an array generated by randn ?

You can do it this way for integer values:
import numpy as np
np.random.seed(12345)
print(np.random.randint(256**4, dtype='<u4', size=1)[0])
#include <iostream>
#include <random>
int main()
{
std::mt19937 e2(12345);
std::cout << e2() << std::endl;
}
The result of both snippets is 3992670690
By looking at source code of rand you can implement it in your C++ code this way:
import numpy as np
np.random.seed(12345)
print(np.random.rand())
#include <iostream>
#include <iomanip>
#include <random>
int main()
{
std::mt19937 e2(12345);
int a = e2() >> 5;
int b = e2() >> 6;
double value = (a * 67108864.0 + b) / 9007199254740992.0;
std::cout << std::fixed << std::setprecision(16) << value << std::endl;
}
Both random values are 0.9296160928171479
It would be convenient to use std::generate_canonical, but it uses another method to convert the output of Mersenne twister to double. The reason they differ is likely that generate_canonical is more optimized than the random generator used in NumPy, as it avoids costly floating point operations, especially multiplication and division, as seen in source code. However it seems to be implementation dependent, while NumPy produces the same result on all platforms.
double value = std::generate_canonical<double, std::numeric_limits<double>::digits>(e2);
This doesn't work and produces result 0.8901547132827379, which differs from the output of Python code.

For completeness and to avoid re-inventing the wheel, here is an implementation for both numpy.rand and numpy.randn in C++
The header file:
#ifndef RANDOMNUMGEN_NUMPYCOMPATIBLE_H
#define RANDOMNUMGEN_NUMPYCOMPATIBLE_H
#include "RandomNumGenerator.h"
//Uniform distribution - numpy.rand
class RandomNumGen_NumpyCompatible {
public:
RandomNumGen_NumpyCompatible();
RandomNumGen_NumpyCompatible(std::uint_fast32_t newSeed);
std::uint_fast32_t min() const { return m_mersenneEngine.min(); }
std::uint_fast32_t max() const { return m_mersenneEngine.max(); }
void seed(std::uint_fast32_t seed);
void discard(unsigned long long); // NOTE!! Advances and discards twice as many values as passed in to keep tracking with Numpy order
uint_fast32_t operator()(); //Simply returns the next Mersenne value from the engine
double getDouble(); //Calculates the next uniformly random double as numpy.rand does
std::string getGeneratorType() const { return "RandomNumGen_NumpyCompatible"; }
private:
std::mt19937 m_mersenneEngine;
};
///////////////////
//Gaussian distribution - numpy.randn
class GaussianRandomNumGen_NumpyCompatible {
public:
GaussianRandomNumGen_NumpyCompatible();
GaussianRandomNumGen_NumpyCompatible(std::uint_fast32_t newSeed);
std::uint_fast32_t min() const { return m_mersenneEngine.min(); }
std::uint_fast32_t max() const { return m_mersenneEngine.max(); }
void seed(std::uint_fast32_t seed);
void discard(unsigned long long); // NOTE!! Advances and discards twice as many values as passed in to keep tracking with Numpy order
uint_fast32_t operator()(); //Simply returns the next Mersenne value from the engine
double getDouble(); //Calculates the next normally (Gaussian) distrubuted random double as numpy.randn does
std::string getGeneratorType() const { return "GaussianRandomNumGen_NumpyCompatible"; }
private:
bool m_haveNextVal;
double m_nextVal;
std::mt19937 m_mersenneEngine;
};
#endif
And the implementation:
#include "RandomNumGen_NumpyCompatible.h"
RandomNumGen_NumpyCompatible::RandomNumGen_NumpyCompatible()
{
}
RandomNumGen_NumpyCompatible::RandomNumGen_NumpyCompatible(std::uint_fast32_t seed)
: m_mersenneEngine(seed)
{
}
void RandomNumGen_NumpyCompatible::seed(std::uint_fast32_t newSeed)
{
m_mersenneEngine.seed(newSeed);
}
void RandomNumGen_NumpyCompatible::discard(unsigned long long z)
{
//Advances and discards TWICE as many values to keep with Numpy order
m_mersenneEngine.discard(2*z);
}
std::uint_fast32_t RandomNumGen_NumpyCompatible::operator()()
{
return m_mersenneEngine();
}
double RandomNumGen_NumpyCompatible::getDouble()
{
int a = m_mersenneEngine() >> 5;
int b = m_mersenneEngine() >> 6;
return (a * 67108864.0 + b) / 9007199254740992.0;
}
///////////////////
GaussianRandomNumGen_NumpyCompatible::GaussianRandomNumGen_NumpyCompatible()
: m_haveNextVal(false)
{
}
GaussianRandomNumGen_NumpyCompatible::GaussianRandomNumGen_NumpyCompatible(std::uint_fast32_t seed)
: m_haveNextVal(false), m_mersenneEngine(seed)
{
}
void GaussianRandomNumGen_NumpyCompatible::seed(std::uint_fast32_t newSeed)
{
m_mersenneEngine.seed(newSeed);
}
void GaussianRandomNumGen_NumpyCompatible::discard(unsigned long long z)
{
//Burn some CPU cyles here
for (unsigned i = 0; i < z; ++i)
getDouble();
}
std::uint_fast32_t GaussianRandomNumGen_NumpyCompatible::operator()()
{
return m_mersenneEngine();
}
double GaussianRandomNumGen_NumpyCompatible::getDouble()
{
if (m_haveNextVal) {
m_haveNextVal = false;
return m_nextVal;
}
double f, x1, x2, r2;
do {
int a1 = m_mersenneEngine() >> 5;
int b1 = m_mersenneEngine() >> 6;
int a2 = m_mersenneEngine() >> 5;
int b2 = m_mersenneEngine() >> 6;
x1 = 2.0 * ((a1 * 67108864.0 + b1) / 9007199254740992.0) - 1.0;
x2 = 2.0 * ((a2 * 67108864.0 + b2) / 9007199254740992.0) - 1.0;
r2 = x1 * x1 + x2 * x2;
} while (r2 >= 1.0 || r2 == 0.0);
/* Box-Muller transform */
f = sqrt(-2.0 * log(r2) / r2);
m_haveNextVal = true;
m_nextVal = f * x1;
return f * x2;
}

After doing a bit of testing, it does seem that the values are within a tolerance (see #fdermishin 's comment below) when the C++ unsigned int is divided by the maximum value for an unsigned int like this:
#include <limits>
...
std::mt19937 generator1(seed); // mt19937 is a standard mersenne_twister_engine
unsigned val1 = generator1();
std::cout << "Gen 1 random value: " << val1 << std::endl;
std::cout << "Normalized Gen 1: " << static_cast<double>(val1) / std::numeric_limits<std::uint32_t>::max() << std::endl;
However, Python's version seems to skip every other value.
Given the following two programs:
#!/usr/bin/env python3
import numpy as np
def main():
np.random.seed(1)
for i in range(0, 10):
print(np.random.rand())
###########
# Call main and exit success
if __name__ == "__main__":
main()
sys.exit()
and
#include <cstdlib>
#include <iostream>
#include <random>
#include <limits>
int main()
{
unsigned seed = 1;
std::mt19937 generator1(seed); // mt19937 is a standard mersenne_twister_engine
for (unsigned i = 0; i < 10; ++i) {
unsigned val1 = generator1();
std::cout << "Normalized, #" << i << ": " << (static_cast<double>(val1) / std::numeric_limits<std::uint32_t>::max()) << std::endl;
}
return EXIT_SUCCESS;
}
the Python program prints:
0.417022004702574
0.7203244934421581
0.00011437481734488664
0.30233257263183977
0.14675589081711304
0.0923385947687978
0.1862602113776709
0.34556072704304774
0.39676747423066994
0.538816734003357
whereas the C++ program prints:
Normalized, #0: 0.417022
Normalized, #1: 0.997185
Normalized, #2: 0.720324
Normalized, #3: 0.932557
Normalized, #4: 0.000114381
Normalized, #5: 0.128124
Normalized, #6: 0.302333
Normalized, #7: 0.999041
Normalized, #8: 0.146756
Normalized, #9: 0.236089
I could easily skip every other value in the C++ version, which should give me numbers that match the Python version (within a tolerance). But why would Python's implementation seem to skip every other value, or where do these extra values in the C++ version come from?

Debugging C++ extension for Python

I have a problem when I try to debug my C++ extension for Python.
The error is
Fatal Python error: PyThreadState_Get: no current thread
I followed this guide and it works when I run in the release version.
Python code:
from itertools import islice
from random import random
from time import perf_counter
COUNT = 500000 # Change this value depending on the speed of your computer
DATA = list(islice(iter(lambda: (random() - 0.5) * 3.0, None), COUNT))
e = 2.7182818284590452353602874713527
def sinh(x):
return (1 - (e ** (-2 * x))) / (2 * (e ** -x))
def cosh(x):
return (1 + (e ** (-2 * x))) / (2 * (e ** -x))
def tanh(x):
tanh_x = sinh(x) / cosh(x)
return tanh_x
def sequence_tanh(data):
'''Applies the hyperbolic tangent function to map all values in
the sequence to a value between -1.0 and 1.0.
'''
result = []
for x in data:
result.append(tanh(x))
return result
def test(fn, name):
start = perf_counter()
result = fn(DATA)
duration = perf_counter() - start
print('{} took {:.3f} seconds\n\n'.format(name, duration))
for d in result:
assert -1 <= d <=1, " incorrect values"
from superfastcode import fast_tanh
if __name__ == "__main__":
test(lambda d: [fast_tanh(x) for x in d], '[fast_tanh(x) for x in d]')
C++ code:
#include <Python.h>
#include <cmath>
const double e = 2.7182818284590452353602874713527;
double sinh_impl(double x) {
return (1 - pow(e, (-2 * x))) / (2 * pow(e, -x));
}
double cosh_impl(double x) {
return (1 + pow(e, (-2 * x))) / (2 * pow(e, -x));
}
PyObject* tanh_impl(PyObject *, PyObject* o) {
double x = PyFloat_AsDouble(o);
double tanh_x = sinh_impl(x) / cosh_impl(x);
return PyFloat_FromDouble(tanh_x);
}
static PyMethodDef superfastcode_methods[] = {
// The first property is the name exposed to Python, fast_tanh, the second is the C++
// function name that contains the implementation.
{ "fast_tanh", (PyCFunction)tanh_impl, METH_O, nullptr },
// Terminate the array with an object containing nulls.
{ nullptr, nullptr, 0, nullptr }
};
static PyModuleDef superfastcode_module = {
PyModuleDef_HEAD_INIT,
"superfastcode", // Module name to use with Python import statements
"Provides some functions, but faster", // Module description
0,
superfastcode_methods // Structure that defines the methods of the module
};
PyMODINIT_FUNC PyInit_superfastcode() {
return PyModule_Create(&superfastcode_module);
}
I am using the 64 bit version of Python 3.6, and are building the C++ code in x64 mode. Visual Studio 2017 15.6.4
I am linking with C:\Python\Python36.x64\libs\python36_d.lib and including header files from C:\Python\Python36.x64\include
My Python interpreter is in C:\Python\Python36.x64\
I get this result when I run the release build
[fast_tanh(x) for x in d] took 0.067 seconds
Update: I got it running in Py x86 but not x64.
When I hit the break point and step over (F10) it throws an exception.

I got this solution from Steve Dower # Microsoft:
This looks more like a mismatch between debug binaries and release headers.
The guide you've referenced is designed to always use the release binaries of Python, even if you are building a debug extension. So either you should be linking against python36.lib/python36.dll, or ignoring most of the setting changes listed in the guide and linking against python36_d.lib/python36_d.dll (the linking should be automatic once you set the paths - the choice of C runtime library will determine whether debug/release Python is used).
Reference: PTVS issues

Memory Leak in Ctypes method

I have a project mostly written in Python. This project runs on my Raspberry Pi (Model B). With the use of the Pi Camera I record to a stream. Every second I pauze the recording to take the last frame from the stream and compare it with a older frame. The comparing is done in C code (mainly because it is faster than Python).
The C code is called from Python using Ctypes. See the code below.
# Load picturecomparer.so and set argument and return types
cmethod = ctypes.CDLL(Paths.CMODULE_LOCATION)
cmethod.compare_pictures.restype = ctypes.c_double
cmethod.compare_pictures.argtypes = [ctypes.c_char_p, types.c_char_p]
The 2 images that must be compared are stored on the disk. Python gives the paths of both images as arguments to the C code. The C code will return a value (double) which is the difference in percentage of both images.
# Call the C method to compare the images
difflevel = cmethod.compare_pictures(path1, path2)
The C code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#ifndef STBI_ASSERT
#define STBI_ASSERT(x)
#endif
#endif
#define COLOR_R 0
#define COLOR_G 1
#define COLOR_B 2
#define OFFSET 10
double compare_pictures(const char* path1, const char* path2);
double compare_pictures(const char* path1, const char* path2)
{
double totalDiff = 0.0, value;
unsigned int x, y;
int width1, height1, comps1;
unsigned char * image1 = stbi_load(path1, &width1, &height1, &comps1, 0);
int width2, height2, comps2;
unsigned char * image2 = stbi_load(path2, &width2, &height2, &comps2, 0);
// Perform some checks to be sure images are valid
if (image1 == NULL | image2 == NULL) { return 0; }
if (width1 != width2 | height1 != height2) { return 0; }
for (y = 0; y < height1; y++)
{
for (x = 0; x < width1; x++)
{
// Calculate difference in RED
value = (int)image1[(x + y*width1) * comps1 + COLOR_R] - (int)image2[(x + y*width2) * comps2 + COLOR_R];
if (value < OFFSET && value > (OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
// Calculate difference in GREEN
value = (int)image1[(x + y*width1) * comps1 + COLOR_G] - (int)image2[(x + y*width2) * comps2 + COLOR_G];
if (value < OFFSET && value >(OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
// Calculate difference in BLUE
value = (int)image1[(x + y*width1) * comps1 + COLOR_B] - (int)image2[(x + y*width2) * comps2 + COLOR_B];
if (value < OFFSET && value >(OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
}
}
totalDiff = 100.0 * totalDiff / (double)(width1 * height1 * 3);
return totalDiff;
}
The C code will be executed every ~2 seconds. I just noticed that there is a memory leak. After around 10 to 15 minutes my Raspberry Pi haves like 10MB ram left to use. A few minutes later it crashes and doesn't respond anymore.
I have done some checks to find out what causes this in my project. My entire project uses around 30-40MB ram if I disable the C code. This project is all my Raspberry Pi have to execute.
Model B: 512MB ram which shares between CPU and GPU.
GPU: 128MB (/boot/config.txt).
My Linux distro uses: ~60MB.
So I have ~300MB for my project.
Hope someone could point me where it goes wrong, or if I have to call GC myself, etc..
Thanks in advance.
p.s. I know the image comparing is not the best way, but it works for me now.

Since the images are being returned as pointers to buffers stbi_load must be allocating space for them and you are not releasing this space before returning so the memory leak is not surprising.
Check for the documentation to see if there is a specific stpi_free function or try adding free(image1); free(image2); before the final return.
Having checked I can categorically say that you should be calling STBI_FREE(image1); STBI_FREE(image2); before returning.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Performance of LLVM-Compiler on native C code vs Python+Numba - python

Related

How to copy a 2D array (matrix) from python with a C function (and do some computer heavy computation) which return a 2D array (matrix) in python?

How to get rid of for loop in my function? [duplicate]

Same random numbers in C++ as computed by Python3 numpy.random.rand

Debugging C++ extension for Python

Memory Leak in Ctypes method

Categories

Resources