Cython Numpy code not faster than pure python

First, I know that there are many similarly themed questions on SO, but I can't find a solution after a day of searching, reading, and testing.
I have a Python function which calculates the pairwise correlations between the rows of a numpy ndarray (m x n). I was originally doing this purely in numpy, but the function also computed the reciprocal pairs (i.e. as well as calculating the correlation between rows A and B of the matrix, it calculated the correlation between rows B and A too). So I took a slightly different approach that is about twice as fast for matrices of large m (realistic sizes for my problem are m ~ 8000).
This was great but still a tad slow, as there will be many such matrices, and processing them all will take a long time. So I started investigating Cython as a way to speed things up. I understand from what I've read that Cython won't really speed up numpy all that much. Is this true, or is there something I am missing?
I think the bottlenecks below are np.sqrt, np.dot, the call to the ndarray's .T attribute and np.absolute. I've seen people use sqrt from libc.math to replace np.sqrt, so I suppose my first question is: are there similar functions for the other methods in libc.math that I can use? I am afraid that I am completely and utterly unfamiliar with C/C++/C# or any of the C family languages, so this typing and Cython business is very new territory to me; apologies if the reason/solution is obvious.
Failing that, any ideas about what I could do to get some performance gains?
Below are my pyx code, the setup code, and the call to the pyx function. I don't know if it's important, but when I call python setup.py build_ext --inplace it works, but there are a lot of warnings which I don't really understand. Could these also be a reason why I am not seeing a speed improvement?
Any help is very much appreciated, and sorry for the super long post.
setup.py
from distutils.core import setup
from distutils.extension import Extension
import numpy
from Cython.Distutils import build_ext
setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("calcBrownCombinedP",
                             ["calcBrownCombinedP.pyx"],
                             include_dirs=[numpy.get_include()])]
)
and the output of setup:
>python setup.py build_ext --inplace
running build_ext
cythoning calcBrownCombinedP.pyx to calcBrownCombinedP.c
building 'calcBrownCombinedP' extension
C:\Anaconda\Scripts\gcc.bat -DMS_WIN64 -mdll -O -Wall -IC:\Anaconda\lib\site-packages\numpy\core\include -IC:\Anaconda\include -IC:\Anaconda\PC -c calcBrownCombinedP.c -o build\temp.win-amd64-2.7\Release\calcbrowncombinedp.o
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1728:0,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:17,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/arrayobject.h:15,
from calcBrownCombinedP.c:340:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/npy_deprecated_api.h:8:9: note: #pragma message: C:\Anaconda\lib\site-packages\numpy\core\include/numpy/npy_deprecated_api.h(8) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
calcBrownCombinedP.c: In function '__Pyx_RaiseTooManyValuesError':
calcBrownCombinedP.c:4473:18: warning: unknown conversion type character 'z' in format [-Wformat]
calcBrownCombinedP.c:4473:18: warning: too many arguments for format [-Wformat-extra-args]
calcBrownCombinedP.c: In function '__Pyx_RaiseNeedMoreValuesError':
calcBrownCombinedP.c:4479:18: warning: unknown conversion type character 'z' in format [-Wformat]
calcBrownCombinedP.c:4479:18: warning: format '%s' expects argument of type 'char *', but argument 3 has type 'Py_ssize_t' [-Wformat]
calcBrownCombinedP.c:4479:18: warning: too many arguments for format [-Wformat-extra-args]
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:26:0,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/arrayobject.h:15,
from calcBrownCombinedP.c:340:
calcBrownCombinedP.c: At top level:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/__multiarray_api.h:1594:1: warning: '_import_array' defined but not used [-Wunused-function]
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ufuncobject.h:311:0,
from calcBrownCombinedP.c:341:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/__ufunc_api.h:236:1: warning: '_import_umath' defined but not used [-Wunused-function]
writing build\temp.win-amd64-2.7\Release\calcBrownCombinedP.def
C:\Anaconda\Scripts\gcc.bat -DMS_WIN64 -shared -s build\temp.win-amd64-2.7\Release\calcbrowncombinedp.o build\temp.win-amd64-2.7\Release\calcBrownCombinedP.def -LC:\Anaconda\libs -LC:\Anaconda\PCbuild\amd64 -lpython27 -lmsvcr90 -o C:\cygwin64\home\Davy\SNPsets\src\calcBrownCombinedP.pyd
the pyx code - 'calcBrownCombinedP.pyx'
import numpy as np
cimport numpy as np
from scipy import stats
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def calcBrownCombinedP(np.ndarray genotypeArray):
    cdef int nSNPs, i
    cdef np.ndarray ms, datam, datass, d, rs, temp
    cdef float runningSum, sigmaSq, E, df
    nSNPs = genotypeArray.shape[0]
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]
    datam = genotypeArray - ms
    datass = np.sqrt(stats.ss(datam,axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:],datam[i].T)
        d = (datass[i:]*datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += sum(rs*(3.25+(0.75*rs)))
    sigmaSq = 4*nSNPs+2*runningSum
    E = 2*nSNPs
    df = (2*(E*E))/sigmaSq
    runningSum = sigmaSq/(2*E)
    return runningSum
The code that tests the above against some pure python - 'test.py'
import numpy as np
from scipy import stats
import random
import time
from calcBrownCombinedP import calcBrownCombinedP
from PycalcBrownCombinedP import PycalcBrownCombinedP
ms = [10,50,100,500,1000,5000]
for m in ms:
    print '---testing implentation with m = {0}---'.format(m)
    genotypeArray = np.empty((m,20),dtype=int)
    for i in xrange(m):
        genotypeArray[i] = [random.randint(0,2) for j in xrange(20)]
    print genotypeArray.shape
    start = time.time()
    print calcBrownCombinedP(genotypeArray)
    print 'cython implementation took {0}'.format(time.time() - start)
    start = time.time()
    print PycalcBrownCombinedP(genotypeArray)
    print 'python implementation took {0}'.format(time.time() - start)
and the output of that code is:
---testing implentation with m = 10---
(10L, 20L)
2.13660168648
cython implementation took 0.000999927520752
2.13660167749
python implementation took 0.000999927520752
---testing implentation with m = 50---
(50L, 20L)
8.82721138
cython implementation took 0.00399994850159
8.82721130234
python implementation took 0.00500011444092
---testing implentation with m = 100---
(100L, 20L)
16.7438983917
cython implementation took 0.0139999389648
16.7438965333
python implementation took 0.0120000839233
---testing implentation with m = 500---
(500L, 20L)
80.5343856812
cython implementation took 0.183000087738
80.5343694046
python implementation took 0.161000013351
---testing implentation with m = 1000---
(1000L, 20L)
160.122573853
cython implementation took 0.615000009537
160.122491308
python implementation took 0.598000049591
---testing implentation with m = 5000---
(5000L, 20L)
799.813842773
cython implementation took 10.7159998417
799.813880445
python implementation took 11.2510001659
Lastly, the pure python implementation 'PycalcBrownCombinedP.py'
import numpy as np
from scipy import stats
def PycalcBrownCombinedP(genotypeArray):
    nSNPs = genotypeArray.shape[0]
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]
    datam = genotypeArray - ms
    datass = np.sqrt(stats.ss(datam,axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:],datam[i].T)
        d = (datass[i:]*datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += sum(rs*(3.25+(0.75*rs)))
    sigmaSq = 4*nSNPs+2*runningSum
    E = 2*nSNPs
    df = (2*(E*E))/sigmaSq
    runningSum = sigmaSq/(2*E)
    return runningSum

Profiling with kernprof shows the bottleneck is the last line of the loop:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
<snip>
    16      5000      6145280   1229.1     86.6          runningSum += sum(rs*(3.25+(0.75*rs)))
This is no surprise as you're using the Python built-in function sum in both the Python and Cython versions. Switching to np.sum speeds the code up by a factor of 4.5 when the input array has shape (5000, 20).
If a small loss in accuracy is alright, then you can leverage linear algebra to speed up the final line further:
np.sum(rs * (3.25 + 0.75 * rs))
is really a vector dot product, i.e.
np.dot(rs, 3.25 + 0.75 * rs)
This is still suboptimal as it loops over rs three times and constructs two rs-sized temporary arrays. Using elementary algebra, this expression can be rewritten as
3.25 * np.sum(rs) + .75 * np.dot(rs, rs)
which not only gives the original result without the round-off error of the previous version, but also loops over rs only twice and uses constant memory.(*)
The bottleneck is now np.dot, so installing a better BLAS library is going to buy you more than rewriting the whole thing in Cython.
(*) Or logarithmic memory in the very latest NumPy, which has a recursive reimplementation of np.sum that is faster than the old iterative one.
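The three forms of the loop's last line discussed above can be checked against each other on random data. This is a minimal sketch: the array `rs` is made-up test data standing in for one iteration's vector of absolute correlations, not the asker's genotype data, and the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical stand-in for one iteration's vector of absolute correlations.
rng = np.random.default_rng(0)
rs = rng.random(5000)

# Original form: Python's built-in sum iterates over the array element by element.
builtin_total = sum(rs * (3.25 + 0.75 * rs))

# Same arithmetic, but the reduction runs inside NumPy.
numpy_total = np.sum(rs * (3.25 + 0.75 * rs))

# Algebraically expanded form: sum(3.25*r + 0.75*r*r) with no rs-sized temporaries.
expanded_total = 3.25 * np.sum(rs) + 0.75 * np.dot(rs, rs)

print(builtin_total, numpy_total, expanded_total)
```

The three totals should agree up to floating-point round-off; how much faster the later forms are depends on the array size and the BLAS library in use.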

Related

Issues Transferring Numpy array into Fortran using Ctypes

I am trying to wrap some Fortran code in Python using the ctypes library but am having major issues with the data transfer. When I print the data from my Python script, it looks substantially different from when I print it within the Fortran code. Can someone please help me figure out what is going on here? I tried playing around with the datatypes, which did not fix the problem, and so far I have not found any other SO questions that address my issue. Also please note that using f2py will not work for the end product that this example refers to.
Below is an example of my code:
temp.f90
subroutine test(num_mod_sel,goodval,x,y,coeff,coeff_flag)
    implicit none
    ! Input/Output
    integer num_mod_sel, goodval
    real, dimension(goodval,num_mod_sel) :: x
    real, dimension(goodval) :: y
    real, dimension(num_mod_sel) :: coeff
    integer, dimension(num_mod_sel) :: coeff_flag
    print*, num_mod_sel,goodval,x
    return
end subroutine test
!===================================================================================================!
The above f90 code is compiled with:
gfortran -shared -fPIC -o temp.so temp.f90
test.py
from ctypes import CDLL,cdll, POINTER, c_int, c_float
import numpy as np
def test_fcode(num_mod_sel,goodval,x,y,coeff,coeff_flag):
    fortran = CDLL('./temp.so')
    fortran.test_.argtypes = [ POINTER(c_int),
                               POINTER(c_int),
                               POINTER(c_float),
                               POINTER(c_float),
                               POINTER(c_float),
                               POINTER(c_int) ]
    fortran.test_.restype = None
    num_mod_sel_ = c_int(num_mod_sel)
    goodval_ = c_int(goodval)
    x_ = x.ctypes.data_as(POINTER(c_float))
    y_ = y.ctypes.data_as(POINTER(c_float))
    coeff_ = coeff.ctypes.data_as(POINTER(c_float))
    coeff_flag_ = coeff_flag.ctypes.data_as(POINTER(c_int))
    fortran.test_(num_mod_sel_,goodval_,x_,y_,coeff_,coeff_flag_)
#Create some test data
num_mod_sel = 4
goodval = 10
x = np.full((num_mod_sel,goodval),999.,dtype=float)
x[:] = np.random.rand(num_mod_sel,goodval)
y = np.full(goodval,999.,dtype=float)
y[:] = np.random.rand(goodval)
coeff = np.empty(num_mod_sel,dtype=float)
coeff_flag = np.empty(num_mod_sel,dtype=int)
#Run the fortran code
test_fcode(num_mod_sel,goodval,x,y,coeff,coeff_flag)
print(x) from the python code:
[[0.36677304 0.8734628 0.72076823 0.20234787 0.91754331 0.26591916
0.46325577 0.00334941 0.98890871 0.3284262 ]
[0.15428096 0.24979671 0.97374747 0.83996786 0.59849493 0.55188578
0.9668523 0.98441142 0.50954678 0.22003844]
[0.54362548 0.42636074 0.65118397 0.69455346 0.30531619 0.88668116
0.97278714 0.29046492 0.64851937 0.64885967]
[0.31798739 0.37279389 0.88855305 0.38754276 0.94985151 0.56566525
0.99488508 0.13812829 0.0940132 0.07921261]]
print*, x from f90:
-2.91465824E-17 1.68338645 13.0443134 1.84336567 -7.44153724E-34 1.80519199 -2.87629426E+27 1.57734776 -1297264.38 1.85438573 -236487.531 1.63295949 -1.66118658E-33 1.73162782 -6.73423983E-09 0.919681191 -1.09687280E+21 1.87222707 5.50313165E+09 1.66421306 8.38275158E+34 1.52928090 -2.15154066E-13 1.62479663 3.88800366E+30 1.86843681 127759.977 1.83499193 -3.55062879E+15 1.77462363 2.43241945E+19 1.76297140 3.16150975E-03 1.86671305 1.35183692E+21 1.87110281 1.74403865E-31 1.75238669 9.85857248E-02 1.59503841 -2.33541620E+30 1.79045486 -1.86185171E+11 1.78229403 4.23132255E-20 1.81525886 2.96771497E-04 1.82888138 -4.55096013E-26 1.86097753 0.00000000 3.68934881E+19 -7.37626273E+15 1.58494916E+29 0 -1064355840 -646470284 -536868869

The problem is a mismatch of datatypes.
The Fortran real is usually a 32 bit float (C float) while numpy interprets the Python datatype float as numpy.float_ which is an alias of numpy.float64, the C double with 64 bits.
Solution: In Python use numpy.float32 as dtype for numpy array creation.
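A minimal sketch of that fix on the Python side, assuming (as is typical for gfortran, but not guaranteed for every compiler or flag set) that Fortran's default `real` and `integer` are both 32 bits wide. It only builds the arrays and checks element sizes; it does not load `temp.so`. Note that `dtype=int` can cause the same kind of mismatch as `dtype=float` on platforms where NumPy's default integer is 64 bits, so the sketch also uses `np.int32` for the flag array.

```python
import ctypes
import numpy as np

num_mod_sel, goodval = 4, 10

# float32 matches a 32-bit Fortran real (and ctypes.c_float).
x = np.random.rand(num_mod_sel, goodval).astype(np.float32)
y = np.random.rand(goodval).astype(np.float32)
coeff = np.empty(num_mod_sel, dtype=np.float32)
# int32 matches a 32-bit Fortran integer (and ctypes.c_int on common platforms).
coeff_flag = np.zeros(num_mod_sel, dtype=np.int32)

# The element sizes now agree with what the ctypes pointers promise.
x_ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
flag_ptr = coeff_flag.ctypes.data_as(ctypes.POINTER(ctypes.c_int))
print(x.itemsize, ctypes.sizeof(ctypes.c_float))
print(x_ptr[0] == x[0, 0])
```

With these dtypes, the first value read through the pointer is the first element of `x`, which is what the Fortran side will see as well.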

Cython Optimization of Numpy for Loop

I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy
DTYPE = np.double
def hdcfTransfomation(scanData):
    cdef Py_ssize_t Position
    scanLength = scanData.shape[0]
    hdcfFunction_np = np.zeros(scanLength, dtype = DTYPE)
    cdef double [::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis = 0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction
I know that using C math library functions would be better than calling back into numpy (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, etc., but nothing is providing the performance that I think is possible with a fully optimized Cython implementation.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength)
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis = 0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction
scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransfomation(scanDataArray)

Always provide as much information as possible (some example data, Python/Cython version, compiler version/settings and CPU model).
Without that it is quite hard to compare any timings. For example, this problem benefits quite well from SIMD vectorization. It makes quite a difference which compiler you use, or whether you want to redistribute a compiled version which should also run on low-end or quite old CPUs (e.g. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C compiler also needs additional flags like -march=native, but the exact syntax is compiler dependent. I am also not sure how Cython or the C compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If those loops (all vectorized commands are actually loops) are not joined, but instead temporary arrays are created as in pure Python, this will slow down the code. It is a good idea to create a 1D array first (e.g. scanData = scanData[:, 0]).
As said, I am not that familiar with Cython, so I tried what is possible with Numba. At least it shows what should also be possible with a reasonably good Cython implementation.
Maybe easier for the compiler to optimize
import numba as nb
import numpy as np
@nb.njit(fastmath=True,error_model='numpy',parallel=True)
#scanData is a 1D-array here
def hdcfTransfomation(scanData):
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength, dtype = scanData.dtype)
    for position in nb.prange(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:scanData.shape[0]-(position + 1)]
        sum=0.
        jj=0
        for i in range(scanLength-(position + 1)):
            jj+=1
            sum+=(topShift[i]-bottomShift[i])**2
        hdcfFunction[position] = sum/jj
    return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba it doesn't matter if you use a slightly modified version of your code (1D-array), or the code above.
Timings
#Quadcore Core i7-4th gen,Numba 0.4dev,Python 3.6
scanData=np.random.rand(80_000)
#The first call to the function isn't measured (compilation overhead),but the following calls.
Pure Python: 5900ms
Numba single-threaded: 947ms
Numba parallel: 260ms
Especially for arrays larger than np.random.rand(80_000) there may be better approaches (loop tiling for better cache usage), but for this size that should be more or less OK (at least it fits in the L3 cache).
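Independent of Numba, the explicit inner loop in the function above computes the same values as the slicing version from the question. A small cross-check in plain NumPy, assuming a 1D input; the function names `hdcf_slices` and `hdcf_loops` are made up for this sketch, and the pure-Python loop is only meant for small test arrays.

```python
import numpy as np

def hdcf_slices(scan):
    """Slicing formulation from the question, for a 1D array."""
    n = scan.shape[0]
    out = np.zeros(n, dtype=scan.dtype)
    for position in range(n - 1):
        diff = scan[1 + position:] - scan[:n - (position + 1)]
        out[position] = np.mean(diff * diff)
    return out

def hdcf_loops(scan):
    """Explicit-loop formulation: one accumulator, no temporary arrays."""
    n = scan.shape[0]
    out = np.zeros(n, dtype=scan.dtype)
    for position in range(n - 1):
        offset = position + 1
        total = 0.0
        for i in range(n - offset):
            total += (scan[i + offset] - scan[i]) ** 2
        out[position] = total / (n - offset)
    return out

scan = np.random.rand(200)
print(np.allclose(hdcf_slices(scan), hdcf_loops(scan)))
```

The loop form is the one a compiler (Numba here, or Cython with typed memoryviews) can turn into a single pass without allocating intermediate arrays.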
Naive GPU Implementation
from numba import cuda, float32
@cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData,out_data):
    scanLength = scanData.shape[0]
    position = cuda.grid(1)
    if position < scanLength - 1:
        sum= float32(0.)
        offset=1 + position
        for i in range(scanLength-offset):
            sum+=(scanData[i+offset]-scanData[i])**2
        out_data[position] = sum/(scanLength-offset)
# res_3 is the output array; its creation is not shown in the original answer
hdcfTransfomation_gpu[scanData.shape[0]//64,64](scanData,res_3)
This gives about 400ms on a GT640 (float32) and 970ms (float64). For a good implementation, shared arrays should be considered.

Putting Cython aside, does this do the same thing as your current code but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try to apply numpy operations to 2D arrays before turning to Cython for for loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)

Why is my Fortran code wrapped with f2py using so much memory?

I am trying to calculate all the distances between approximately a hundred thousand points. I have the following code written in Fortran and compiled using f2py:
C        1         2         3         4         5         6         7
C123456789012345678901234567890123456789012345678901234567890123456789012
      subroutine distances(coor,dist,n)
      double precision coor(n,3),dist(n,n)
      integer n
      double precision x1,y1,z1,x2,y2,z2,diff2
cf2py intent(in) :: coor,dist
cf2py intent(in,out):: dist
cf2py intent(hide)::n
cf2py intent(hide)::x1,y1,z1,x2,y2,z2,diff2
      do 200,i=1,n-1
         x1=coor(i,1)
         y1=coor(i,2)
         z1=coor(i,3)
         do 100,j=i+1,n
            x2=coor(j,1)
            y2=coor(j,2)
            z2=coor(j,3)
            diff2=(x1-x2)*(x1-x2)+(y1-y2)*(y1-y2)+(z1-z2)*(z1-z2)
            dist(i,j)=sqrt(diff2)
  100    continue
  200 continue
      end
I am compiling the fortran code using the following python code setup_collision.py:
# System imports
from distutils.core import *
from distutils import sysconfig
# Third-party modules
import numpy
from numpy.distutils.core import Extension, setup
# Obtain the numpy include directory. This logic works across numpy versions.
try:
    numpy_include = numpy.get_include()
except AttributeError:
    numpy_include = numpy.get_numpy_include()
# simple extension module
collision = Extension(name="collision",sources=['./collision.f'],
                      include_dirs = [numpy_include],
                      )
# NumyTypemapTests setup
setup( name = "COLLISION",
       description = "Module calculates collision energies",
       author = "Stvn66",
       version = "0.1",
       ext_modules = [collision]
       )
Then running it as follows:
import numpy as np
import collision
coor = np.loadtxt('coordinates.txt')
n_atoms = len(coor)
dist = np.zeros((n_atoms, n_atoms), dtype=np.float16) # float16 reduces memory
n_dist = n_atoms*(n_atoms-1)/2
n_GB = n_dist * 2 / float(2**30) # 1 kB = 1024 B
n_Gb = n_dist * 2 / 1E9 # 1 kB = 1000 B
print 'calculating %d distances between %d atoms' % (n_dist, n_atoms)
print 'should use between %f and %f GB of memory' % (n_GB, n_Gb)
dist = collision.distances(coor, dist)
Using this code with 30,000 atoms, which should use around 1 GB of memory to store the distances, it instead uses 10 GB. With this difference, performing this calculation with 100,000 atoms will require 100 GB instead of 10 GB. I only have 20 GB in my computer.
Am I missing something related to passing the data between Python and Fortran? The huge difference indicates a major flaw in the implementation.

You are feeding double precision arrays to the Fortran subroutine. Each double precision element requires 8 bytes of memory. For N=30,000 that makes
coor(n,3) => 30,000*3*8 ~ 0.7 MB
dist(n,n) => 30,000^2*8 ~ 6.7 GB
Since the half precision floats are additionally required for Python, that accounts for another 1-2GB. So the overall requirement is 9-10GB.
The same holds true for N=100,000, which will require ~75GB for the Fortran part alone.
Instead of double precision floats, you should use single precision reals - if that is sufficient for your calculations. This will lead to half the memory requirements. [I have no experience with that, but I assume that if both parts use the same precision, Python can operate on the data directly...]
As @VladimirF noted in his comment, "usual compilers do not support 2 byte reals". I checked with gfortran and ifort, and neither does. So you need to use at least single precision.
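The estimates in this answer can be reproduced with a few lines of arithmetic. This sketch only multiplies element counts by element sizes; the helper name `dist_matrix_gib` is made up for illustration, and it reports binary gigabytes (2**30 bytes).

```python
def dist_matrix_gib(n_atoms, bytes_per_element):
    """Size of a dense n_atoms x n_atoms matrix in GiB (2**30 bytes)."""
    return n_atoms * n_atoms * bytes_per_element / 2**30

for n_atoms in (30_000, 100_000):
    for label, size in (("float16", 2), ("single", 4), ("double", 8)):
        print(f"n={n_atoms:>7} {label:>7}: {dist_matrix_gib(n_atoms, size):6.1f} GiB")
```

For 30,000 atoms this gives roughly 6.7 GiB for a double precision matrix plus about 1.7 GiB for the float16 array created in Python, which is in line with the 9-10 GB observed.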

Python dictionaries vs C++ std:unordered_map (cython) vs cythonized python dict

I was trying to compare the performance of Python dictionaries, cythonized Python dictionaries and a cythonized C++ std::unordered_map, doing only an initialization procedure. Since the cythonized C++ code is compiled, I thought it should be faster than the pure Python version. I did a test using 4 different scenario/notation options:
Cython CPP code using std::unordered_map and Cython book notation (defining a pair and using insert method)
Cython CPP code using std::unordered_map and python notation (map[key] = value)
Cython code (typed code) using python dictionaries (map[key] = value)
Pure python code
I was expecting to see the Cython code outperform the pure Python code, but in this case there is no improvement. What could be the reason? I'm using Cython-0.22, python-3.4 and g++-4.8.
I got this exec time (seconds) using timeit:
Cython CPP book notation -> 15.696417249999968
Cython CPP python notation -> 16.481350984999835
Cython python notation -> 18.585355018999962
Pure python -> 18.162724677999904
Code is here and you can use it:
cython -a map_example.pyx
python3 setup_map.py build_ext --inplace
python3 use_map_example.py
map_example.pyx
from libcpp.unordered_map cimport unordered_map
from libcpp.pair cimport pair
cpdef int example_cpp_book_notation(int limit):
    cdef unordered_map[int, int] mapa
    cdef pair[int, int] entry
    cdef int i
    for i in range(limit):
        entry.first = i
        entry.second = i
        mapa.insert(entry)
    return 0
cpdef int example_cpp_python_notation(int limit):
    cdef unordered_map[int, int] mapa
    cdef pair[int, int] entry
    cdef int i
    for i in range(limit):
        mapa[i] = i
    return 0
cpdef int example_ctyped_notation(int limit):
    mapa = {}
    cdef int i
    for i in range(limit):
        mapa[i] = i
    return 0
setup_map.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import os
os.environ["CC"] = "g++"
os.environ["CXX"] = "g++"
modules = [Extension("map_example",
                     ["map_example.pyx"],
                     language = "c++",
                     extra_compile_args=["-std=c++11"],
                     extra_link_args=["-std=c++11"])]
setup(name="map_example",
      cmdclass={"build_ext": build_ext},
      ext_modules=modules)
use_map_example.py
import map_example
C_MAXV = 100000000
C_NUMBER = 10
def cython_cpp_book_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_cpp_book_notation(x)
        x *= 10
def cython_cpp_python_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_cpp_python_notation(x)
        x *= 10
def cython_ctyped_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_ctyped_notation(x)
        x *= 10
def pure_python():
    x = 1
    while(x<C_MAXV):
        map_a = {}
        for i in range(x):
            map_a[i] = i
        x *= 10
    return 0
if __name__ == '__main__':
    import timeit
    print("Cython CPP book notation")
    print(timeit.timeit("cython_cpp_book_notation()", setup="from __main__ import cython_cpp_book_notation", number=C_NUMBER))
    print("Cython CPP python notation")
    print(timeit.timeit("cython_cpp_python_notation()", setup="from __main__ import cython_cpp_python_notation", number=C_NUMBER))
    print("Cython python notation")
    print(timeit.timeit("cython_ctyped_notation()", setup="from __main__ import cython_ctyped_notation", number=C_NUMBER))
    print("Pure python")
    print(timeit.timeit("pure_python()", setup="from __main__ import pure_python", number=C_NUMBER))

My timings from your code (after correcting that python *10 indent :) ) are
Cython CPP book notation
21.617647969018435
Cython CPP python notation
21.229907534987433
Cython python notation
24.44413448998239
Pure python
23.609809526009485
Basically everyone is in the same ballpark, with a modest edge for the CPP versions.
Nothing special about my machine: the usual Ubuntu 14.10, Cython 0.20.2, Python 3.4.2.

Efficiently select elements from numpy array with multiple criteria

I'm looking for the fastest way to select the elements of a numpy array that satisfy several criteria. As an example, say I want to select all elements that lie between 0.2 and 0.8 from an array. I normally do something like this:
the_array = np.random.random(100000)
idx = (the_array > 0.2) * (the_array < 0.8)
selected_elements = the_array[idx]
However, this creates two additional arrays with the same size as the_array (one for the_array > 0.2 and one for the_array < 0.8). If the array is large, this can consume a lot of memory. Is there any way to get around this? All of the built-in numpy functions (such as logical_and) seem to do the same thing under the hood.

You could implement a custom C call for the select. The most basic way to do this is through a ctypes implementation.
select.c
int select(float lower, float upper, float* in, float* out, int n)
{
    int ii;
    int outcount = 0;
    float val;
    for (ii=0;ii<n;ii++)
    {
        val = in[ii];
        if ((val>lower) && (val<upper))
        {
            out[outcount] = val;
            outcount++;
        }
    }
    return outcount;
}
which is compiled as:
gcc -lm -shared select.c -o lib.so
And on the python side:
select.py
import ctypes as C
from numpy.ctypeslib import as_ctypes
import numpy as np
# open the library in python
lib = C.CDLL("./lib.so")
# explicitly tell ctypes the argument and return types of the function
pfloat = C.POINTER(C.c_float)
lib.select.argtypes = [C.c_float,C.c_float,pfloat,pfloat,C.c_int]
lib.select.restype = C.c_int
size = 1000000
# create numpy arrays
np_input = np.random.random(size).astype(np.float32)
np_output = np.empty(size).astype(np.float32)
# expose the array contents to ctypes
ctypes_input = as_ctypes(np_input)
ctypes_output = as_ctypes(np_output)
# call the function and get the number of selected points
outcount = lib.select(0.2,0.8,ctypes_input,ctypes_output,size)
# select those points
selected = np_output[:outcount]
Don't expect wild speedups with such a vanilla implementation, but on the C side you have the option of adding OpenMP pragmas to get quick and dirty parallelism, which may give you a significant boost.
Also as mentioned in the comments, numexpr may be a faster neater way to do all this in just a few lines.
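If staying in pure NumPy is preferred, one way to bound the size of the temporary masks is to process the array in chunks. This is not what the answer above does; it is a minimal sketch of an alternative, and the function name `select_between` and the default `chunk_size` are made up for illustration.

```python
import numpy as np

def select_between(values, lower, upper, chunk_size=65536):
    """Return elements with lower < x < upper, building masks one chunk at a time."""
    pieces = []
    for start in range(0, values.shape[0], chunk_size):
        chunk = values[start:start + chunk_size]   # a view, no copy
        mask = chunk > lower                       # at most chunk_size booleans
        mask &= chunk < upper                      # second temporary, also chunk-sized
        pieces.append(chunk[mask])
    if not pieces:
        return values[:0].copy()
    return np.concatenate(pieces)

the_array = np.random.random(100000)
selected_elements = select_between(the_array, 0.2, 0.8)
print(selected_elements.shape)
```

The boolean temporaries are then limited to `chunk_size` elements each, at the cost of a Python-level loop and a final concatenation. Note that NumPy boolean arrays use one byte per element, so even the unchunked masks are smaller than a float64 array of the same length.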
