I tried to Cythonize part of my code as follows, hoping to gain some speed:
# cython: boundscheck=False
import numpy as np
cimport numpy as np
import time

cpdef object my_function(np.ndarray[np.double_t, ndim=1] array_a,
                         np.ndarray[np.double_t, ndim=1] array_b,
                         int n_rows,
                         int n_columns):
    cdef double minimum_of_neighbours, difference, change
    cdef int i
    cdef np.ndarray[np.int_t, ndim=1] locations
    locations = np.argwhere(array_a > 0)
    for i in locations:
        minimum_of_neighbours = min(array_a[i - n_columns], array_a[i + 1],
                                    array_a[i + n_columns], array_a[i - 1])
        if array_a[i] - minimum_of_neighbours < 0:
            difference = minimum_of_neighbours - array_a[i]
            change = min(difference, array_a[i] / 5.)
            array_a[i] += change
            array_b[i] -= change * 5.
    print time.time()
    return array_a, array_b
I can compile it without an error, but when I use the function I get this error:
from cythonized_code import my_function
import numpy as np
array_a = np.random.uniform(low=-100, high=100, size = 100).astype(np.double)
array_b = np.random.uniform(low=0, high=20, size = 100).astype(np.double)
a, b = my_function(array_a,array_b,5,20)
# which gives me this error:
# locations = np.argwhere(array_a > 0)
# ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Do I need to declare the type of locations here? The reason I wanted to declare it is that the corresponding line shows up yellow in the annotated HTML file generated when compiling the code.
It's a good idea not to use the Python functionality locations[i], because it is too expensive: Python would create a full-fledged Python integer from the lowly C integer (which is what is stored in the locations numpy array), register it with the garbage collector, cast it back to int, and then destroy the Python object - quite an overhead.
To get direct access to the lowly C integers, one needs to bind locations to a type. The normal course of action would be to look up which properties locations has:
>>> locations.ndim
2
>>> locations.dtype
dtype('int64')
which translates to cdef np.ndarray[np.int64_t, ndim=2] locations.
However, this will (probably - I cannot check it right now) not be enough to get rid of the Python overhead, because of a Cython quirk:
for i in locations:
    ...
will not be interpreted as raw array access but will invoke the Python machinery. See for example here.
So you will have to change it to:
for index in range(len(locations)):
    i = locations[index][0]
then Cython "understands" that you want access to the raw c-int64 array.
Actually, that is not completely true: in this case an nd.array is created first (e.g. locations[0] or locations[1]), and then __Pyx_PyInt_As_int (which is more or less an alias for PyLong_AsLongAndOverflow) is called; this creates a PyLongObject, from which the C int value is obtained before the temporary PyLongObject and the nd.array are destroyed.
Here we get lucky, because length-1 numpy arrays can be converted to Python scalars. The code would not work if the second dimension of locations were > 1.
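Putting these pieces together, a minimal sketch of the corrected function could look as follows (the body is unchanged from the question; only the 2D int64 declaration of locations and the index-based loop are new, int64 assuming a 64-bit platform as in the dtype check above, and the timing print is dropped):

# cython: boundscheck=False
import numpy as np
cimport numpy as np

cpdef object my_function(np.ndarray[np.double_t, ndim=1] array_a,
                         np.ndarray[np.double_t, ndim=1] array_b,
                         int n_rows,
                         int n_columns):
    cdef double minimum_of_neighbours, difference, change
    cdef int i, index
    # np.argwhere returns a 2D (n, 1) array, so declare it as such
    cdef np.ndarray[np.int64_t, ndim=2] locations = np.argwhere(array_a > 0)
    for index in range(locations.shape[0]):
        i = locations[index, 0]   # direct buffer access, no Python integers
        minimum_of_neighbours = min(array_a[i - n_columns], array_a[i + 1],
                                    array_a[i + n_columns], array_a[i - 1])
        if array_a[i] - minimum_of_neighbours < 0:
            difference = minimum_of_neighbours - array_a[i]
            change = min(difference, array_a[i] / 5.)
            array_a[i] += change
            array_b[i] -= change * 5.
    return array_a, array_b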
This runs smoothly and fast:
solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
cdef bytes line
cdef str decoded_line
cdef int counter = 0  # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.

with gzip.open(file_in, "rb") as f:
    for line in f:
        if counter % 4 == 0:    # first line of the sequence (obtain tile info)
            counter = 0
        elif counter % 3 == 0:  # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)):  # enumerate(line.decode('utf-8')):
                sums[n, ord(decoded_line[n])] += 1
        counter += 1
Here the numpy ndarray sums contains the results.
However, instead of a single numpy array, I need an unknown number of arrays in a dictionary (named tiles) and this is the code that should accomplish my goal:
solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef dict tiles = {}  # each tile will have its own 'sums' numpy array
cdef bytes line
cdef str decoded_line
cdef str tile
cdef int counter = 0  # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.

with gzip.open(file_in, "rb") as f:
    for line in f:
        if counter % 4 == 0:    # first line of the sequence (obtain tile info)
            decoded_line = line.decode('utf-8')
            tile = decoded_line.split(':')[4]
            if tile != tile_specific and tile not in tiles.keys():  # tile_specific is mentioned elsewhere.
                tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
            counter = 0
        elif counter % 3 == 0:  # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)):  # enumerate(line.decode('utf-8')):
                tiles[tile][n, ord(decoded_line[n])] += 1
        counter += 1
In this second example I don't know a priori the number of keys in the dictionary tiles, and therefore the numpy arrays have to be declared and initialized at runtime (please correct me if I am wrong or using the wrong terms).
Cython did not translate/compile when I used the Cython declaration for these numpy arrays, and hence I left it as tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32).
Since all the other Cython optimizations shared between the two snippets work fine, I believe this numpy array declaration is the problem.
How should I fix that? Here, the manual describes ways of dynamically allocating memory, but I don't know how this works with numpy arrays, or whether I should do it at all.
Thank you!
I would ignore the documentation about dynamically allocating memory. That's not what you want to do - it's very much at C level and you're handling Python objects.
You can easily reassign a variable typed as a Numpy array (or equally the newer typed memoryview) multiple times so that it refers to a different Numpy array. I suspect what you want is something like
# start of function
cdef np.ndarray[np.uint32_t, ndim=2] tile_array

# in "if counter%4==0":
if tile != tile_specific and tile not in tiles.keys():  # tile_specific is mentioned elsewhere.
    tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
tile_array = tiles[tile]  # not a copy! Just two references to exactly the same object

# in "elif counter%3==0":
tile_array[n, ord(decoded_line[n])] += 1
There's a small cost to tile_array = tiles[tile] just to do some type-checking, so it'll probably only be worthwhile if you use tile_array a few times between each assignment (it's hard to guess exactly what the threshold is, but time it against your current version).
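For illustration, here is a rough sketch of how that fits into the loop from the question (keeping the question's control flow and names as they are; length, file_in and tile_specific are assumed to be defined elsewhere, as in the original code):

import gzip
import numpy as np
cimport numpy as np

solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef dict tiles = {}                               # one 'sums' array per tile
cdef np.ndarray[np.uint32_t, ndim=2] tile_array    # typed handle to the current tile's array
cdef bytes line
cdef str decoded_line, tile
cdef int counter = 0, n

with gzip.open(file_in, "rb") as f:
    for line in f:
        if counter % 4 == 0:      # first line of the sequence (obtain tile info)
            decoded_line = line.decode('utf-8')
            tile = decoded_line.split(':')[4]
            if tile != tile_specific and tile not in tiles:
                tiles[tile] = np.zeros(shape=(length, len(solexa_scores) + 33), dtype=np.uint32)
            tile_array = tiles[tile]   # rebind the typed variable; no copy is made
            counter = 0
        elif counter % 3 == 0:    # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)):
                tile_array[n, ord(decoded_line[n])] += 1
        counter += 1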
I am trying to speed up my Python program by implementing a function in C++ and embedding it in my code using CFFI. The function takes two 3x3 arrays and computes a distance.
The Python code is the following:
import cffi
import numpy as np
ffi = cffi.FFI()
ffi.cdef("""
extern double dist(const double s[3][3], const double t[3][3]);
""")
lib = ffi.dlopen("./dist.so")
S = np.array([[-1.63538, 0.379116, -1.16372],[-1.63538, 0.378137, -1.16366 ],[-1.63193, 0.379116, -1.16366]], dtype=np.float32)
T = np.array([[-1.6467834, 0.3749715, -1.1484985],[-1.6623441, 0.37410975, -1.1647063 ],[-1.6602284, 0.37400728, -1.1496595 ]], dtype=np.float32)
Sp = ffi.cast("double(*) [3]", S.ctypes.data)
Tp = ffi.cast("double(*) [3]", T.ctypes.data)
dd = lib.dist(Sp,Tp);
This solution doesn't work as intended. Indeed the arguments printed by the C function are:
Sp=[[0.000002, -0.270760, -0.020458]
[0.000002, 0.000000, 0.000000]
[0.000000, 0.000000, 0.000000]]
Tp=[[0.000002, -0.324688, -0.020588]
[0.000002, 0.000000, 0.000000]
[0.000000, 0.000000, -nan]]
I have also tried the following to initialize the pointers:
Sp = ffi.new("double *[3]")
for i in range(3):
    Sp[i] = ffi.cast("double *", S[i].ctypes.data)

Tp = ffi.new("double *[3]")
for i in range(3):
    Tp[i] = ffi.cast("double *", T[i].ctypes.data)

dd = lib.dist(Sp, Tp)
But this solution raises an error in dist(Sp,Tp):
TypeError: initializer for ctype 'double(*)[3]' must be a pointer to same type, not cdata 'double *[3]'
Do you have any idea on how to make it work? Thanks.
The types double[3][3] and double *[3] are not equivalent. The former is a 3x3 2D array of double, stored as 9 contiguous doubles. The latter is an array of 3 pointers to double, which is not how static 2D arrays are laid out in C or C++.
Both numpy arrays and C/C++ static arrays are represented in memory as a contiguous block of elements; it's just the double *[3] type in the middle that's throwing a wrench in the works. What you want is to work with double[3][3] proper, or with double (*)[3] (a pointer to rows of three doubles, which is what your first cast already produces). Note that if you use the latter, you may need to change the function prototype to take double [][3].
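As a minimal sketch of the first approach with that sorted out (dist.so being the compiled library from the question; note also, as an extra detail not covered above, that the question's arrays are float32, so casting their buffers to double pointers reads the wrong bytes - which matches the garbage values printed - and they need to be float64 to match double):

import cffi
import numpy as np

ffi = cffi.FFI()
ffi.cdef("double dist(const double s[3][3], const double t[3][3]);")
lib = ffi.dlopen("./dist.so")  # the compiled library from the question

# double[3][3] is 9 contiguous doubles, so the arrays must be
# C-contiguous float64 (8 bytes per element), not float32
S = np.ascontiguousarray(np.random.rand(3, 3), dtype=np.float64)
T = np.ascontiguousarray(np.random.rand(3, 3), dtype=np.float64)

# a pointer to rows of three doubles is spelled double(*)[3]
Sp = ffi.cast("double(*)[3]", S.ctypes.data)
Tp = ffi.cast("double(*)[3]", T.ctypes.data)

dd = lib.dist(Sp, Tp)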
I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy

DTYPE = np.double

def hdcfTransfomation(scanData):
    cdef Py_ssize_t position
    scanLength = scanData.shape[0]
    hdcfFunction_np = np.zeros(scanLength, dtype=DTYPE)
    cdef double [::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis=0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction
I know that using C math library functions would be more ideal than calling back into numpy (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, etc., but nothing is providing the performance that I think should be possible with a fully optimized Cython implementation.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength)
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis=0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction

scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransfomation(scanDataArray)
Always provide as much information as possible (some example data, Python/Cython version, compiler version/settings and CPU model).
Without that, it is quite hard to compare any timings. For example, this problem benefits quite a bit from SIMD vectorization. It will make quite a difference which compiler you use, or whether you want to redistribute a compiled version that should also run on low-end or quite old CPUs (e.g. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C compiler needs additional flags like march=native, but the exact syntax is compiler dependent. I am also not sure how Cython or the C compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If those loops (all vectorized commands are actually loops) are not fused, but temporary arrays are created instead, as in pure Python, this will slow down the code. It would also be a good idea to create a 1D array first (e.g. scanData=scanData[::1]).
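For comparison with the Numba version below, a fully typed Cython translation of the same idea (typed scanData and an explicit inner loop, so no temporary arrays are created) might look roughly like this sketch; it is untested here and assumes scanData has been flattened to a contiguous 1D float64 array:

# cython: boundscheck=False, wraparound=False
import numpy as np

def hdcfTransfomation(double[::1] scanData):
    cdef Py_ssize_t scanLength = scanData.shape[0]
    cdef Py_ssize_t position, i, n
    cdef double acc, diff
    hdcfFunction_np = np.zeros(scanLength)
    cdef double[::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        acc = 0.
        n = scanLength - (position + 1)
        # explicit loop instead of np.subtract/np.square/np.mean,
        # so no temporary arrays are allocated per shift
        for i in range(n):
            diff = scanData[i + position + 1] - scanData[i]
            acc += diff * diff
        hdcfFunction[position] = acc / n
    return hdcfFunction_np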
As I said, I am not that familiar with Cython, so I tried what is possible with Numba. At least it shows what should also be possible with a reasonably good Cython implementation.
Maybe easier to optimize for the compiler:
import numba as nb
import numpy as np

@nb.njit(fastmath=True, error_model='numpy', parallel=True)
def hdcfTransfomation(scanData):
    # scanData is a 1D array here
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength, dtype=scanData.dtype)
    for position in nb.prange(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:scanData.shape[0] - (position + 1)]
        sum = 0.
        jj = 0
        for i in range(scanLength - (position + 1)):
            jj += 1
            sum += (topShift[i] - bottomShift[i])**2
        hdcfFunction[position] = sum / jj
    return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba, it doesn't matter whether you use a slightly modified (1D-array) version of your code or the code above.
Timings
# Quad-core Core i7 (4th gen), Numba 0.4dev, Python 3.6
scanData = np.random.rand(80_000)
# The first call to the function isn't measured (compilation overhead), but the following calls are.

Pure Python:           5900 ms
Numba single-threaded:  947 ms
Numba parallel:         260 ms
Especially for arrays larger than np.random.rand(80_000) there may be better approaches (loop tiling for better cache usage), but for this size that should be more or less OK (at least the data fits in the L3 cache).
Naive GPU Implementation
from numba import cuda, float32

@cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData, out_data):
    scanLength = scanData.shape[0]
    position = cuda.grid(1)
    if position < scanLength - 1:
        sum = float32(0.)
        offset = 1 + position
        for i in range(scanLength - offset):
            sum += (scanData[i + offset] - scanData[i])**2
        out_data[position] = sum / (scanLength - offset)

# res_3: a preallocated output array (not shown in the original snippet)
hdcfTransfomation_gpu[scanData.shape[0]//64, 64](scanData, res_3)
This gives about 400 ms on a GT640 (float32) and 970 ms (float64). For a good implementation, shared arrays should be considered.
Putting Cython aside, does this do the same thing as your current code, but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try applying operations in numpy to 2D arrays before turning to Cython for for-loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)
I'd like to overwrite part of a PyOpenCL array with another array.
Let's say
import numpy as np, pyopencl.array as cla
a = cla.zeros(queue,(3,3),'int8')
b = cla.ones(queue,(2,2),'int8')
Now I want to do something like a[0:2,0:2] = b and hopefully get
1 1 0
1 1 0
0 0 0
How would I do that without copying everything to the host for speed reasons?
PyOpenCL arrays are able to do that - to a very limited extent at the time of this answer - with the numpy syntax (i.e. exactly how you wrote it), the limitation being that you can only use slices along the first axis.
import numpy as np, pyopencl.array as cla
a = cla.zeros(queue,(3,3),'int8')
b = cla.ones(queue,(2,3),'int8')
# note b is 2x3 here
a[0:2]=b #<-works
a[0:2,0:2]=b[:,0:2] #<-Throws an error about non-contiguity
So a[0:2,0:2] = b won't work, as the destination sliced array has non-contiguous data.
The only solution I am aware of (as nothing in the pyopencl.array class is yet able to work with sliced arrays/non-contiguous data) is to write your own OpenCL kernel to do the copy "by hand".
Here is a piece of code I use to do copies on 1D or 2D pyopencl arrays of any dtype:
import numpy as np, pyopencl as cl, pyopencl.array as cla

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel = cl.Program(ctx, """__kernel void copy(
    __global char *dest, const int offsetd, const int stridexd, const int strideyd,
    __global const char *src, const int offsets, const int stridexs, const int strideys,
    const int word_size)
{
    int write_idx = offsetd + get_global_id(0) + get_global_id(1) * stridexd + get_global_id(2) * strideyd;
    int read_idx  = offsets + get_global_id(0) + get_global_id(1) * stridexs + get_global_id(2) * strideys;
    dest[write_idx] = src[read_idx];
}""").build()

def copy(dest, src):
    assert dest.dtype == src.dtype
    assert dest.shape == src.shape
    if len(dest.shape) == 1:
        dest.shape = (dest.shape[0], 1)
        src.shape = (src.shape[0], 1)
        dest.strides = (dest.strides[0], 0)
        src.strides = (src.strides[0], 0)
    kernel.copy(queue, (src.dtype.itemsize, src.shape[0], src.shape[1]), None,
                dest.base_data, np.uint32(dest.offset), np.uint32(dest.strides[0]), np.uint32(dest.strides[1]),
                src.base_data, np.uint32(src.offset), np.uint32(src.strides[0]), np.uint32(src.strides[1]),
                np.uint32(src.dtype.itemsize))

a = cla.zeros(queue, (3,3), 'int8')
b = cla.ones(queue, (2,2), 'int8')
copy(a[0:2,0:2], b)
print(a)
On the pyopencl mailing list, Andreas Klöckner gave me a hint: there's an undocumented function in pyopencl.array called multi_put(). The syntax is like this:
cla.multi_put([arr],indices,out=[out])
'arr' is the source array, 'out' is the destination array, and 'indices' is a 1D array of ints (also on the device) which contains the linear element indices, counted row-major.
For example, in my first post, the indices for putting 'b' into 'a' would be (0, 1, 3, 4). You just need to put together your indices somehow and can use multi_put() instead of writing a kernel. len(indices) must of course be equal to b.size. There are also take() and multi_take() functions for reading elements from an array.
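Following the syntax described above, a sketch for the original 3x3/2x2 example could look roughly like this. It is untested: multi_put() was undocumented at the time, and the exact index dtype it expects is an assumption here.

import numpy as np, pyopencl as cl, pyopencl.array as cla

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = cla.zeros(queue, (3,3), 'int8')
b = cla.ones(queue, (2,2), 'int8')

# linear (row-major) indices of the 2x2 block inside the 3x3 array
indices = cla.to_device(queue, np.array([0, 1, 3, 4], dtype=np.int64))

# source array(s), indices, destination array(s); both arrays are treated
# in flat row-major element order, so len(indices) == b.size must hold
cla.multi_put([b], indices, out=[a])
print(a)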
I'm looking for the fastest way to select the elements of a numpy array that satisfy several criteria. As an example, say I want to select all elements that lie between 0.2 and 0.8 from an array. I normally do something like this:
the_array = np.random.random(100000)
idx = (the_array > 0.2) * (the_array < 0.8)
selected_elements = the_array[idx]
However, this creates two additional arrays with the same size as the_array (one for the_array > 0.2 and one for the_array < 0.8). If the array is large, this can consume a lot of memory. Is there any way to get around this? All of the built-in numpy functions (such as logical_and) seem to do this same thing under the hood.
You could implement a custom C call for the selection. The most basic way to do this is through a ctypes implementation.
select.c
int select(float lower, float upper, float* in, float* out, int n)
{
    int ii;
    int outcount = 0;
    float val;
    for (ii=0; ii<n; ii++)
    {
        val = in[ii];
        if ((val>lower) && (val<upper))
        {
            out[outcount] = val;
            outcount++;
        }
    }
    return outcount;
}
which is compiled as:
gcc -lm -shared select.c -o lib.so
And on the python side:
select.py
import ctypes as C
from numpy.ctypeslib import as_ctypes
import numpy as np
# open the library in python
lib = C.CDLL("./lib.so")
# explicitly tell ctypes the argument and return types of the function
pfloat = C.POINTER(C.c_float)
lib.select.argtypes = [C.c_float,C.c_float,pfloat,pfloat,C.c_int]
lib.select.restype = C.c_int
size = 1000000
# create numpy arrays
np_input = np.random.random(size).astype(np.float32)
np_output = np.empty(size).astype(np.float32)
# expose the array contents to ctypes
ctypes_input = as_ctypes(np_input)
ctypes_output = as_ctypes(np_output)
# call the function and get the number of selected points
outcount = lib.select(0.2,0.8,ctypes_input,ctypes_output,size)
# select those points
selected = np_output[:outcount]
Don't expect wild speedups from such a vanilla implementation, but on the C side you have the option of adding OpenMP pragmas to get quick and dirty parallelism, which may give you a significant boost.
Also, as mentioned in the comments, numexpr may be a faster, neater way to do all of this in just a few lines.
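For completeness, a minimal sketch of the numexpr route mentioned above: numexpr evaluates the whole expression chunk-wise (and multithreaded), so the two full-size intermediate boolean arrays from the pure-numpy version are avoided; only the final mask is materialized.

import numexpr as ne
import numpy as np

the_array = np.random.random(100000)

# evaluate the combined condition in one pass over the data
mask = ne.evaluate("(the_array > 0.2) & (the_array < 0.8)")
selected_elements = the_array[mask]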