I am using several techniques (NumPy, Weave, Cython, Numba) to perform a Python performance benchmark. The code takes two numpy arrays of size NxN and multiplies them element-wise and stores the values in another array C.
My weave.inline() code gives me a scipy.weave.build_tools.CompileError. I have created a minimal example which generates the same error. Could someone please help?
import time
import numpy as np
from scipy import weave
from scipy.weave import converters
def benchmark():
N = np.array(5000, dtype=np.int)
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.zeros([N, N], dtype=float)
t = time.clock()
weave_inline_loop(A, B, C, N)
print time.clock() - t
def weave_inline_loop(A, B, C, N):
code = """
int i, j;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
C(i, j) = A(i, j) * B(i, j);
}
}
return_val = C;
"""
C = weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
benchmark()
Three small changes are needed:
N can't be a 0D-numpy array (it has to be an integer so that i < N works in the C code). You should write N = 5000 instead of N = np.array(5000, dtype=np.int).
The C array is being modified in-place so it doesn't have to be returned. I don't know the restrictions on the kind of objects that return_val can handle, but if you keep return_val = C; compilation fails with: don't know how to convert ‘blitz::Array<double, 2>’ to ‘const py::object&’.
After that, weave.inline returns None. Keeping the assignment C = weave.inline(... makes the code look confusing, even if it works fine and the array named C will hold the result in the benchmark scope.
This is the end result:
import time
import numpy as np
from scipy import weave
from scipy.weave import converters
def benchmark():
N = 5000
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.zeros([N, N], dtype=float)
t = time.clock()
weave_inline_loop(A, B, C, N)
print time.clock() - t
def weave_inline_loop(A, B, C, N):
code = """
int i, j;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
C(i, j) = A(i, j) * B(i, j);
}
}
"""
weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
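The in-place behaviour described above does not depend on weave. As a minimal sketch in plain NumPy for Python 3 (the helper name multiply_into is hypothetical, and np.multiply with out= only stands in for the inlined C loop), a function can fill the caller's C without returning anything:

```python
import numpy as np

def multiply_into(A, B, C):
    # Hypothetical helper: writes the element-wise product into the
    # caller's array C and, like the weave.inline call above, returns None.
    np.multiply(A, B, out=C)

N = 4  # small size, only for illustration
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.zeros((N, N), dtype=float)

result = multiply_into(A, B, C)
print(result)                 # the function itself returns nothing useful
print(np.allclose(C, A * B))  # C in the calling scope holds the product
```

This is why assigning the return value back to C is unnecessary: the array object passed in is the one being modified.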
Two issues. First, you don't need the line return_val = C. You are directly manipulating the data in the variable C in your inlined code, so it's already available to Python and there's no need to explicitly return it to the environment (trying to do so causes errors in the type conversion). So change your function to:
def weave_inline_loop(A, B, C, N):
code = """
int i, j;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
C(i, j) = A(i, j) * B(i, j);
}
}
"""
weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
return C
Second, you are comparing i and j (both ints) to N, which is a 0-d NumPy array rather than a plain integer. This also generates an error. It works if you convert N with int(N) when calling your code:
def benchmark():
N = np.array(5000, dtype=np.int)
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.zeros([N, N], dtype=float)
t = time.clock()
print weave_inline_loop(A, B, C, int(N))
# I added a print statement so you can see that C is being
# populated with the new 2d array
print time.clock() - t
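To see the difference between the 0-d array and a plain integer, here is a small Python 3 sketch. It uses the builtin int as the dtype, since the np.int alias from the question is no longer available in recent NumPy releases:

```python
import numpy as np

# 0-d array, as in the question (builtin int used instead of np.int)
N = np.array(5000, dtype=int)
print(type(N).__name__, N.ndim, N.shape)

# Converting with int() yields a plain Python integer, which is what
# the C loop bound `i < N` needs to receive.
n = int(N)
print(type(n).__name__, n)
```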
Edit
I created a similar question which I found more understandable and practical there: How to copy a 2D array (matrix) from python with a C function (and do some computer heavy computation) which return a 2D array (matrix) in python?
Original question
I want to use C in Python to perform a computation on all entries of a big non-square matrix of size n times m. I copied the code from the excellent tutorial here: https://medium.com/spikelab/calling-c-functions-from-python-104e609f2804. The code there is for a square matrix.
I first compiled the c_sum.c script
$ cc -fPIC -shared -o c_sum.so c_sum.c
and then ran the python script:
$ python main.py
and that ran well. However, if I set n and m in main.py to different values, I get a segmentation fault. I guess one has to allocate memory separately for n and m, but my knowledge of C is too rudimentary to know how to do it. What code would work with, let's say, m=3000 and n=2000?
Here is the script c_sum.c:
#include <stdlib.h>
double * c_sum(const double * matrix, int n, int m){
double * results = (double *)malloc(sizeof(double) * n);
int index = 0;
for(int i=0; i< n*m; i+=n){
results[index] = 0;
for(int j=0; j<m; j++){
results[index] += matrix[i+j];
}
index += 1;
}
return results;
}
Here is the main.py script:
# https://medium.com/spikelab/calling-c-functions-from-python-104e609f2804
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import time
import pdb
def py_sum(matrix: np.array, n: int, m: int) -> np.array:
result = np.zeros(n)
for i in range(0, n):
for j in range(0, m):
result[i] += matrix[i][j]
return result
n = 3000
m = 3000
matrix = np.random.randn(n, m)
time1 = time.time()
py_result = py_sum(matrix, n, m)
time2 = time.time() - time1
print("py running time in seconds:", time2)
py_time = time2
lib = cdll.LoadLibrary("c_sum.so")
c_sum = lib.c_sum
c_sum.restype = ndpointer(dtype=c_double,
shape=(n,))
time1 = time.time()
result = c_sum(c_void_p(matrix.ctypes.data),
c_int(n),
c_int(m))
time2 = time.time() - time1
print("c running time in seconds:", time2)
c_time = time2
print("speedup:", py_time/c_time)
I assume you want to compute the sum along the last axis of an (n,m) matrix. A segmentation fault occurs when you access memory to which you have no access. The issue lies in the erroneous outer loop. You need to iterate over both dimensions, but you iterate over the same dimension twice.
double * results = (double *)malloc(sizeof(double) * n); /* you allocate n doubles.
Do you free this Outside function? If not, you are having a memory leak.
An alternative way is to pass the output array to function, so that you can avoid creating memory in the function*/
for(int i=0; i< n*m; i+=n){ /* i+=n => you are iterating for m times. also you are iterating over last dimension */
results[index] = 0; /* when index >= n you are writing to memory that
you did not allocate, leading to the segmentation fault */
for(int j=0; j<m; j++) /* you are iterating again over last axis*/
{
results[index] += matrix[i+j];
}
index += 1; /* this leads to index > n as you iterate for m times and m>n in this case.
For a square matrix, m=n, so you don't have any issue */
}
TLDR: To fix the segmentation fault, you need to replace for(int i=0; i< n*m; i+=n) with for(int i=0; i< n*m; i+=m) so that you only iterate for n times and over both dimensions.
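The effect of the stride can be checked without compiling anything. The following Python sketch mimics the C loop over a flat row-major buffer; the helper names loop_stats and row_sums are hypothetical and exist only for this illustration:

```python
import numpy as np

def loop_stats(n, m, stride):
    # Start offsets visited by the outer C loop: for (i = 0; i < n*m; i += stride)
    starts = range(0, n * m, stride)
    outer_iterations = len(starts)        # how many results[] slots get written
    largest_index = starts[-1] + (m - 1)  # largest matrix[i + j] that is read
    return outer_iterations, largest_index

def row_sums(matrix, stride):
    # Same traversal as the C code, on a flattened copy of the matrix.
    n, m = matrix.shape
    flat = matrix.ravel()
    return np.array([flat[i:i + m].sum() for i in range(0, n * m, stride)])

n, m = 2000, 3000
print("buffer length:", n * m, "results length:", n)
print("stride n:", loop_stats(n, m, n))  # more than n writes, reads past the end
print("stride m:", loop_stats(n, m, m))  # exactly n writes, stays in bounds

small = np.arange(6.0).reshape(2, 3)
print(np.array_equal(row_sums(small, 3), small.sum(axis=1)))
```

With the stride set to m, each outer iteration starts at the beginning of a row, so the number of writes matches the n doubles that were allocated.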
I came across some classical Knapsack solutions and they always build a 2-dimensional DP array.
In my opinion, my code below solves the classical knapsack problem but with only a 1-dim DP array.
Can someone tell me where my solution does not work or why it is computationally inefficient compared to the 2D-DP version?
A 2D-DP version can be found here
https://www.geeksforgeeks.org/python-program-for-dynamic-programming-set-10-0-1-knapsack-problem/
example input:
weights = [(3,30),(2,20),(1,50),(4,30)]
constraint = 5
And my solution:
def knapsack(weights,constraint):
n = len(weights)
#define dp array
dp = [0]*(constraint+1)
#start filling in the array
for k in weights:
for i in range(constraint,k[0]-1,-1):
dp[i] = max(dp[i],dp[i-k[0]]+k[1])
return dp[constraint]
The version using O(nW) memory is more intuitive and makes it easy to retrieve the subset of items that produces the optimal value.
With only O(n + W) memory, we cannot retrieve this subset directly. It is still possible, however, using the divide-and-conquer technique explained in https://codeforces.com/blog/entry/47247?#comment-316200.
Sample code
#include <bits/stdc++.h>
using namespace std;
using vi = vector<int>;
#define FOR(i, b) for(int i = 0; i < (b); i++)
template<class T>
struct Knapsack{
int n, W;
vector<T> dp, vl;
vi ans, opt, wg;
Knapsack(int n_, int W): n(0), W(W),
dp(W + 1), vl(n_), opt(W + 1), wg(n_){}
void Add(T v, int w){
vl[n] = v;
wg[n++] = w;
}
T conquer(int l, int r, int W){
if(l == r){
if(W >= wg[l])
return ans.push_back(l), vl[l];
return 0;
}
FOR(i, W + 1)
opt[i] = dp[i] = 0;
int m = (l + r) >> 1;
for(int i = l; i <= r; i++)
for(int sz = W; sz >= wg[i]; sz--){
T dpCur = dp[sz - wg[i]] + vl[i];
if(dpCur > dp[sz]){
dp[sz] = dpCur;
opt[sz] = i <= m ? sz : opt[sz - wg[i]];
}
}
T ret = dp[W];
int K = opt[W];
T ret2 = conquer(l, m, K) + conquer(m + 1, r, W - K);
assert(ret2 == ret);
return ret;
}
T Solve(){
return conquer(0, n - 1, W);
}
};
int main(){
cin.tie(0)->sync_with_stdio(0);
int n, W, vl, wg;
cin >> n >> W;
Knapsack<int> ks(n, W);
FOR(i, n){
cin >> vl >> wg;
ks.Add(vl, wg);
}
cout << ks.Solve() << endl;
}
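For comparison, here is a minimal Python sketch of the trade-off discussed in this answer, using the example input from the question. The function names knapsack_1d and knapsack_2d are hypothetical. The 1D version mirrors the question's code and returns only the best value, while the 2D version keeps the full table so the chosen items can be read back:

```python
def knapsack_1d(items, capacity):
    # O(W) memory: iterate capacities downwards so each item is used at most once.
    dp = [0] * (capacity + 1)
    for weight, value in items:
        for c in range(capacity, weight - 1, -1):
            dp[c] = max(dp[c], dp[c - weight] + value)
    return dp[capacity]

def knapsack_2d(items, capacity):
    # O(nW) memory: table[i][c] is the best value using the first i items.
    n = len(items)
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (weight, value) in enumerate(items, start=1):
        for c in range(capacity + 1):
            table[i][c] = table[i - 1][c]
            if c >= weight:
                table[i][c] = max(table[i][c], table[i - 1][c - weight] + value)
    # Walk the table backwards to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if table[i][c] != table[i - 1][c]:
            chosen.append(i - 1)
            c -= items[i - 1][0]
    return table[n][capacity], chosen[::-1]

items = [(3, 30), (2, 20), (1, 50), (4, 30)]  # (weight, value) pairs from the question
best_1d = knapsack_1d(items, 5)
best_2d, chosen = knapsack_2d(items, 5)
print(best_1d, best_2d, chosen)
```

Both functions compute the same optimum; the extra memory of the 2D table only pays off when the chosen subset is needed.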
My task is to use OpenMP to speed up a program and compare the results. I use scipy.weave to do it.
I subtract a vector multiplied by a number from each row of the matrix. I use Python 2.7 (because weave exists only for this version).
import weave
import numpy
from numpy import *
from random import *
from time import time
codeOpenMP = \
"""
int i = 0;
omp_set_num_threads(2);
#pragma omp parallel shared(matrix, randRow, c) private(i)
{
#pragma omp for
for(i = 0; i < N*M; i++) {
matrix[0,i] = matrix[0,i] - (c * randRow[i%M]);
}
}
"""
# generate matrix
def randMat(x, y):
randRaw = lambda a: [randint(0, 100) for i in xrange(0, a)]
randConst = lambda x, y: [randRaw(x) for e in xrange(0, y)]
return array(randConst(x, y))
def test():
sizeMat = [100, 1000, 2000, 3000]
results = []
for n in sizeMat:
sourceMat = randMat(n, n)
N, M = sourceMat.shape
randRow = sourceMat[randint(0, N)]
c = randint(0, n)
print "\nTest on size: %dx%d" % (n, n)
""" python test """
matrix = array(sourceMat)
t1 = time()
for i in xrange(N):
matrix[i, :] -= c * randRow
timePython = (time() - t1) * MACRO
print "\tPure python: ", timePython
results.append(matrix)
""" C & OpenMP test """
matrix = array(sourceMat)
t1 = time()
weave.inline(codeOpenMP, ['matrix', 'c', 'randRow', 'N', 'M'],
extra_compile_args=['-O3 fopenmp'],
compiler='gcc', libraries=['gomp'],
headers=['<omp.h>'])
timeOpenMP = (time() - t1) * MACRO
print "\tC plus OpenMP: %s" % (timeOpenMP)
results.append(matrix)
if array_equal(results[0], results[1]) and \
array_equal(results[1], results[2]):
print "\tTest - ok"
else:
print "\tTest - false"
test()
But I get an error (shown in the linked image).
It seems to be something to do with encoding, but I don't understand what exactly.
I tried adding something like this to the code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
But it doesn't help me!
I read in this question that Eigen has very good performance. However, I compared the speed of Eigen MatrixXi multiplication against NumPy array multiplication, and NumPy performs better (~26 seconds vs. ~29). Is there a more efficient way to do this in Eigen?
Here is my code:
Numpy:
import numpy as np
import time
n_a_rows = 4000
n_a_cols = 3000
n_b_rows = n_a_cols
n_b_cols = 200
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
start = time.time()
d = np.dot(a, b)
end = time.time()
print "time taken : {}".format(end - start)
Result:
time taken : 25.9291000366
Eigen:
#include <ctime>
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{
int n_a_rows = 4000;
int n_a_cols = 3000;
int n_b_rows = n_a_cols;
int n_b_cols = 200;
MatrixXi a(n_a_rows, n_a_cols);
for (int i = 0; i < n_a_rows; ++ i)
for (int j = 0; j < n_a_cols; ++ j)
a (i, j) = n_a_cols * i + j;
MatrixXi b (n_b_rows, n_b_cols);
for (int i = 0; i < n_b_rows; ++ i)
for (int j = 0; j < n_b_cols; ++ j)
b (i, j) = n_b_cols * i + j;
MatrixXi d (n_a_rows, n_b_cols);
clock_t begin = clock();
d = a * b;
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << "Time taken : " << elapsed_secs << std::endl;
}
Result:
Time taken : 29.05
I am using numpy 1.8.1 and eigen 3.2.0-4.
My question has been answered by @Jitse Niesen and @ggael in the comments.
I need to add a flag to turn on the optimizations when compiling: -O2 -DNDEBUG (O is capital o, not zero).
After including this flag, eigen code runs in 0.6 seconds as opposed to ~29 seconds without it.
Change:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
into:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)*1.0
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)*1.0
This gives a boost of about a factor of 100, at least on my laptop:
time taken : 11.1231250763
vs:
time taken : 0.124922037125
Unless you really want to multiply integers. In Eigen it is also quicker to multiply double precision numbers (amounts to replacing MatrixXi with MatrixXd three times), but there I see just 1.5 factor: Time taken : 0.555005 vs 0.846788.
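The integer versus floating-point comparison can be reproduced with a small Python 3 sketch. The sizes below are scaled down from the question so that it runs quickly, and the helper timed_dot is hypothetical. The usual explanation is that NumPy hands floating-point products to an optimized BLAS routine while integer products go through a slower generic loop, so the actual timings depend on your NumPy build:

```python
import time
import numpy as np

def timed_dot(a, b):
    # Hypothetical helper: returns the product and the elapsed wall time.
    start = time.perf_counter()
    out = np.dot(a, b)
    return out, time.perf_counter() - start

rows, inner, cols = 400, 300, 20  # scaled-down sizes, for illustration only
a_int = np.arange(rows * inner, dtype=np.int64).reshape(rows, inner)
b_int = np.arange(inner * cols, dtype=np.int64).reshape(inner, cols)
a_flt = a_int * 1.0  # same values as float64, as in the answer above
b_flt = b_int * 1.0

d_int, t_int = timed_dot(a_int, b_int)
d_flt, t_flt = timed_dot(a_flt, b_flt)
print("int64   time:", t_int)
print("float64 time:", t_flt)
print("same values:", np.allclose(d_int, d_flt))
```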
Is there a more efficient way to do this eigen?
Whenever you have a matrix multiplication where the matrix on the left side of the = does not also appear on the right side, you can safely tell the compiler that there is no aliasing taking place. This will save you one unnecessary temporary variable and assignment operation, which for big matrices can make an important difference in performance. This is done with the .noalias() function as follows.
d.noalias() = a * b;
This way a*b is directly evaluated and stored in d. Otherwise, to avoid aliasing problems, the compiler will first store the product in a temporary variable and then assign this variable to your target matrix d.
So, in your code, the line:
d = a * b;
is actually compiled as follows:
temp = a*b;
d = temp;
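A loosely analogous idea exists on the NumPy side of the comparison. This is not Eigen's noalias(), only a sketch of the same "write straight into the destination" pattern using the out= argument of np.dot, which requires a preallocated array of matching shape and dtype that is not one of the operands:

```python
import numpy as np

a = np.random.rand(4, 3)
b = np.random.rand(3, 2)

# Preallocate the destination once; it must not overlap with a or b.
d = np.empty((4, 2))

# The product is written into d instead of into a freshly allocated array.
returned = np.dot(a, b, out=d)

print(np.shares_memory(returned, d))
print(np.allclose(d, a @ b))
```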
I am doing some performance tests on a variant of the prime numbers generator from http://docs.cython.org/src/tutorial/numpy.html.
The below performance measures are with kmax=1000
Pure Python implementation, running in CPython: 0.15s
Pure Python implementation, running in Cython: 0.07s
def primes(kmax):
p = []
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p.append(n)
k = k + 1
n = n + 1
return p
Pure Python+Numpy implementation, running in CPython: 1.25s
import numpy
def primes(kmax):
p = numpy.empty(kmax, dtype=int)
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
n = n + 1
return p
Cython implementation using int*: 0.003s
from libc.stdlib cimport malloc, free
def primes(int kmax):
cdef int n, k, i
cdef int *p = <int *>malloc(kmax * sizeof(int))
result = []
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
result.append(n)
n = n + 1
free(p)
return result
The above performs great but looks horrible, as it holds two copies of the data... so I tried reimplementing it:
Cython + Numpy: 1.01s
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def primes(DTYPE_t kmax):
cdef DTYPE_t n, k, i
cdef np.ndarray p = np.empty(kmax, dtype=DTYPE)
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
n = n + 1
return p
Questions:
why is the numpy array so incredibly slower than a python list, when running on CPython?
what did I do wrong in the Cython+Numpy implementation? cython is obviously NOT treating the numpy array as an int[] as it should.
how do I cast a numpy array to a int*? The below doesn't work
cdef numpy.nparray a = numpy.zeros(100, dtype=int)
cdef int * p = <int *>a.data
cdef DTYPE_t [:] p_view = p
Using this typed memoryview instead of p in the calculations reduced the runtime from 580 ms to 2.8 ms for me. That is about the same runtime as the implementation using int*, and about the best you can expect.
DTYPE = np.int
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def primes(DTYPE_t kmax):
cdef DTYPE_t n, k, i
cdef np.ndarray p = np.empty(kmax, dtype=DTYPE)
cdef DTYPE_t [:] p_view = p
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p_view[i] != 0:
i = i + 1
if i == k:
p_view[k] = n
k = k + 1
n = n + 1
return p
why is the numpy array so incredibly slower than a python list, when running on CPython?
Because you didn't fully type it. Use
cdef np.ndarray[DTYPE_t, ndim=1] p = np.empty(kmax, dtype=DTYPE)
how do I cast a numpy array to a int*?
By using np.intc as the dtype, not np.int (which is a C long). That's
cdef np.ndarray[dtype=int, ndim=1] p = np.empty(kmax, dtype=np.intc)
(But really, use a memoryview, they're much cleaner and the Cython folks want to get rid of the NumPy array syntax in the long run.)
Best syntax I found so far:
import numpy
cimport numpy
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def primes(int kmax):
cdef int n, k, i
cdef numpy.ndarray[int] p = numpy.empty(kmax, dtype=numpy.int32)
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
n = n + 1
return p
Note where I used numpy.int32 instead of int. Anything on the left side of a cdef is a C type (thus int = int32 and float = float32 on typical platforms), while anything on the right side of it (or outside of a cdef) is a Python type (float = float64, and int = int64 on most 64-bit platforms).
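The sizes behind this distinction can be checked from plain Python. The sketch below only inspects item sizes on the machine it runs on, so the printed numbers are platform-dependent (in particular the size NumPy picks for the Python int dtype), and the dictionary keys are just labels chosen for this example:

```python
import ctypes
import numpy as np

sizes = {
    "C int (np.intc)": np.dtype(np.intc).itemsize,
    "np.int32": np.dtype(np.int32).itemsize,
    "Python int as dtype": np.dtype(int).itemsize,
    "Python float as dtype": np.dtype(float).itemsize,
}
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes} bytes")

# A buffer declared with the C type int needs an array whose items have the
# size of a C int, which is why np.intc (or int32 where they coincide) is used.
print("C int size via ctypes:", ctypes.sizeof(ctypes.c_int))
```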