I'm having trouble phrasing this problem in NumPy. I need to simulate an analog maximum tracker (resistor, diode, capacitor). I have a very long 1-D array X from which I want to calculate the output array Y, such that
Y[0] = X[0]
Y[i] = max(0.99 * Y[i - 1], X[i])
I've faked it by approximating my above rules with Y^30 = ExpDecayFunc * X^30, where the asterisk is convolution. Surely there is something much more straightforward I'm missing? Thanks so much!
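(In case it helps, here is roughly what I mean by that approximation. This is only a sketch: the power p = 30, the kernel length and the function name are arbitrary choices of mine, and it assumes X is non-negative.)

import numpy as np

# max(a, b) is roughly (a**p + b**p)**(1/p) for a large power p, so the
# recurrence Y[i] = max(0.99*Y[i-1], X[i]) becomes, approximately,
# Y[i]**p ~ 0.99**p * Y[i-1]**p + X[i]**p, i.e. a convolution of X**p
# with an exponentially decaying kernel.
def approx_max_tracker(X, p=30, decay=0.99, kernel_len=2000):
    kernel = decay ** (p * np.arange(kernel_len))
    Yp = np.convolve(X ** p, kernel)[:len(X)]
    return Yp ** (1.0 / p)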
Are you trying to simulate an asymmetric signal filter (resistor, diode, capacitor)? It is a nasty non-linear operation, which cannot be calculated in parallel. So, this is really not something nice for NumPy to solve.
The trivial solution is:
import numpy as np

# just do something random
X = np.random.random(1000000)

def my_filter(X):
    Y = np.empty(len(X))
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y
This takes time; my machine needs a whopping 1.36 s for this (1.36 us per item). Not very nice. (Edit: the stupid use of np.arange has been changed to range.)
The algorithm can be made a bit faster by rearranging it to avoid lookups:
def my_filter_2(X):
    Y = np.empty(len(X))
    Y[0] = X[0]
    a = .99 * Y[0]
    for i in range(1, len(X)):
        a = max(a, X[i])
        Y[i] = a
        a *= .99
    return Y
Now we have 1.16 s (1.16 us per element). An improvement, but not very fast after all.
But then we have cython. This is done with IPython's %%cython (not my solution, Andrew Jaffe shows this in his great answer):
%%cython
import numpy as np
cimport numpy as np

# just do something random
cdef np.ndarray cX = np.random.random(1000000)

def cy_filter(np.ndarray[np.double_t] X):
    cdef int i
    cdef np.ndarray[np.double_t] Y = np.empty(len(X))
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y
This is fast! My computer claims 6.43 ms (6.43 ns/element).
Another almost-Pythonic solution is numba as suggested by DSM in their answer:
from numba import autojit
import numpy as np

@autojit
def my_filter_nb(X, Y):
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y

def my_filter_fast(X):
    Y = np.empty(len(X))
    my_filter_nb(X, Y)
    return Y
This gives 4.18 ms (4.18 ns/element).
But if we still need speed, let's C:
import numpy as np
import scipy.weave

X = np.random.random(1000000)

def my_filter_c(X):
    x_len = len(X)
    Y = np.empty(x_len)
    c_source = """
    #include <math.h>
    int i;
    double a, x;

    Y(0) = X(0);
    a = .99 * Y(0);
    for (i = 1; i < x_len; i++)
    {
        x = X(i);
        if (x > a)
            a = x;
        Y(i) = a;
        a *= .99;
    }
    """
    scipy.weave.inline(c_source, ["X", "Y", "x_len"],
                       compiler="gcc",
                       headers=["<math.h>"],
                       type_converters=scipy.weave.converters.blitz)
    return Y
This one gives 3.72 ms (3.72 ns/element). (BTW, my brain is not multi-threaded, and writing inline C into Python would require two threads - it's amazing how many semicolons one can miss when writing a simple program in C.) The improvement is not that big, but the extra trouble is.
To see how bad or good this is compared to plain C:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

#define NUMITER 100000000

int main(void)
{
    double *x, *y;
    double a, b, time_delta;
    int i;
    struct rusage ru0, ru1;

    x = (double *)malloc(NUMITER * sizeof(double));
    y = (double *)malloc(NUMITER * sizeof(double));
    for (i = 0; i < NUMITER; i++)
        x[i] = rand() / (double)(RAND_MAX - 1);

    getrusage(RUSAGE_SELF, &ru0);
    y[0] = x[0];
    a = .99 * y[0];
    for (i = 0; i < NUMITER; i++)
    {
        b = x[i];
        if (b > a)
            a = b;
        y[i] = a;
        a *= .99;
    }
    getrusage(RUSAGE_SELF, &ru1);
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("Took %.6lf seconds, %.2lf nanoseconds per element", time_delta, 1e9 * time_delta / NUMITER);
    return (int)y[1234] % 2; // just to make sure the optimizer is not too clever
}
This compiled with gcc -Ofast takes 318 ms or 3.18 ns/element (note the larger number of elements) and is thus the winner.
All Python timings have been performed with IPython's %timeit, and they include some overhead from the np.empty, but that is quite insignificant. However, probably due to memory management issues the results vary somewhat from one run to another, so they need to be taken with a pinch of salt in any case.
I also tried the faster solutions with 500 million elements to avoid call overheads:
%cython: 7.5 ns/element
numba: 7.3 ns/element
inlined C (weave): 5.7 ns/element
plain C: 3.2 ns/element
I also tried some hand-optimizing tricks with plain C, but at least without looking at the compiler output it seems that gcc is at least as clever as I am.
Out of this stack I'd probably take numba or plain C, depending on how much of a rush I am in. With this specific problem scipy.weave.inline is too much trouble compared to the advantage.
Also -- depending on the data -- this could possibly be made slightly faster with parallel processing, but the worst case is then worse, and the whole thing may be memory-bandwidth-limited anyway.
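As a rough illustration of the chunking idea (my sketch, not benchmarked; the warm-up length is a data-dependent guess): because the 0.99**k decay makes old peaks irrelevant after enough samples, each chunk can be filtered independently as long as it is first warmed up on a long enough tail of the previous chunk, and the independent chunks could then be handed to separate processes.

import numpy as np

def my_filter_chunked(X, chunk=1000000, n_warmup=2000):
    # Filter each chunk after a warm-up on the tail of the previous chunk.
    # Only approximate: correct when peaks older than n_warmup samples no
    # longer matter, i.e. when 0.99**n_warmup * max(X) is below typical X.
    Y = np.empty(len(X))
    for start in range(0, len(X), chunk):
        warm_start = max(0, start - n_warmup)
        piece = my_filter_2(X[warm_start:start + chunk])
        Y[start:start + chunk] = piece[start - warm_start:]
    return Y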
Cython is very fast. I ran this in IPython using the cython magic.
%%cython
import numpy as np
cimport numpy as np

# just do something random
cdef np.ndarray cX = np.random.random(1000000)

def cy_filter(np.ndarray[np.double_t] X):
    cdef int i
    cdef np.ndarray[np.double_t] Y = np.empty(len(X))
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y
Using %timeit, I get a speedup from
1 loops, best of 3: 1.52 s per loop
to
100 loops, best of 3: 4.67 ms per loop
(For what it's worth, when I missed the cdef int i it was only about a factor 3 speedup, instead of 300!)
You could also use numba, although it would require a few changes:
from numba import autojit
import numpy as np

@autojit
def my_filter_nb(X, Y):
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y

def my_filter_fast(X):
    Y = np.empty(len(X))
    my_filter_nb(X, Y)
    return Y

def my_filter(X):
    Y = np.empty(len(X))
    Y[0] = X[0]
    for i in np.arange(1, len(X)):
        Y[i] = max(.99*Y[i-1], X[i])
    return Y
which gives me:
>>> X = np.random.random(1000000)
>>> %timeit my_filter(X)
1 loops, best of 3: 936 ms per loop
>>> %timeit my_filter_fast(X)
100 loops, best of 3: 3.83 ms per loop
>>> (my_filter(X) == my_filter_fast(X)).all()
True
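(A side note of mine, not part of the answer above: autojit has since been deprecated and removed from numba, so on a recent numba version the same function would be written with the nopython jit decorator.)

from numba import njit
import numpy as np

@njit
def my_filter_nb(X, Y):
    Y[0] = X[0]
    for i in range(1, len(X)):
        Y[i] = max(.99 * Y[i - 1], X[i])
    return Y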
Related
I have the following simple function written using Cython syntax:
%%cython
import numpy as np
cimport cython
import math

@cython.boundscheck(False)
@cython.wraparound(False)
def calc_cy(float[:, ::1] matrix, int nXX, int nYY, float git, float dgit, float[:, ::1] bus, float[:, ::1] kapa):
    cdef Py_ssize_t x_max = nXX + 1
    cdef Py_ssize_t y_max = nYY + 1
    result = np.zeros((x_max, y_max), dtype=np.float32)
    cdef float[:, ::1] result_view = result
    cdef float tmp = 0.0, tmp1 = 0.0, pref = 0.0, dgit_u = 0.0
    cdef Py_ssize_t x, y

    pref = 5.1008 * 10.0**-5 * (3.92**(0.08 / 5.214 * (10**2) / (git + 78.05)))
    dgit = dgit/30601
    for x in range(x_max):
        for y in range(y_max):
            dgit_u = dgit * (matrix[x, y]**1.692 / pref)
            tmp = kapa[x, y] + dgit_u
            tmp1 = bus[x, y] - (2.7182**(- tmp ** 4.0 / 1.73)) * dgit_u / 7.13
            #result_view[x, y] = tmp
    return result
If I run this function in a loop of 100 iterations with random inputs (code below), it only takes around 0.09 s. But if I uncomment result_view[x, y] = tmp on the line before the last line of the function and run the same loop, it takes 2.7 s. Does anyone know why assigning to the result_view array is this slow? Any comment would be highly appreciated.
import time
import numpy as np

nXX, nYY = 999, 999
git, dgit = np.float32(35.0), np.float32(0.01)
matrix = np.random.uniform(0, 1, size=(nXX+1, nYY+1)).astype(np.float32)
bus = np.random.uniform(0, 1, size=(nXX+1, nYY+1)).astype(np.float32)
kapa = np.random.uniform(0, 1, size=(nXX+1, nYY+1)).astype(np.float32)

past = time.time()
for i in range(100):
    calc_cy(matrix, nXX, nYY, git, dgit, bus, kapa)
print(time.time() - past)
Many thanks!
I tried to recast the data type, but it didn't solve the problem. I also checked to make sure the data type produced inside the function is the same as the data type of the array. I expected the assignment to take at most 1 second, but it is taking over 2 seconds.
But if I uncomment result_view[x, y] = tmp in the line before the last line in the function and I run the same loop, it takes 2.7 sec.
This is something seen quite a bit in optimization questions. What you're seeing is that if you don't use the result of the loop then the C compiler eliminates the whole loop body and it seems really quick.
The 2.7s is the speed it actually takes to run.
It looks like you're typing most variables correctly, so there aren't any quick, obvious optimisations.
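If you want to convince yourself that this is what is happening, here is a minimal sketch of the same effect (mine, not taken from your code): a loop whose result is never observed is free to be removed by the C compiler, while one that returns its accumulator has to do the work. Timing both with %timeit should show a large difference when the optimizer kicks in.

%%cython
def loop_discarded(int n):
    cdef double s = 0.0
    cdef int i
    for i in range(n):
        s += i * 0.5   # s is never used afterwards, so the C compiler may drop the loop
    return None

def loop_kept(int n):
    cdef double s = 0.0
    cdef int i
    for i in range(n):
        s += i * 0.5
    return s           # s is observed, so the loop must actually run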
I've got a Python function that I'm trying to port to Cython. I have tested two implementations, but I don't understand why the second one is slower than the first. Furthermore, I am looking for ways to improve the speed a little more, but I have no clue how.
Base code
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.int
ctypedef np.int_t DTYPE_t

cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b

cdef extern from "math.h":
    double exp(double x)

@cython.boundscheck(False)
@cython.wraparound(False)
def bilateral_filter_C(np.ndarray[np.float_t, ndim=1] samples, int w=20):
    # Filter Parameters
    cdef Py_ssize_t size = samples.shape[0]
    cdef float rang
    cdef float sigma = 2*3.0*3.0
    cdef int j, L
    cdef unsigned int a, b
    cdef np.float_t W, num, sub_sample, intensity

    # Initialization
    cdef np.ndarray[np.float_t, ndim=1] gauss = np.zeros(2*w+1, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] sub_samples, intensities = np.empty(size, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] samples_filtered = np.empty(size, dtype=np.float)

    L = 2*w+1
    for j in xrange(L):
        rang = -w+1.0/L
        rang *= rang
        gauss[j] = exp(-rang/sigma)

    <CODE TO IMPROVE>

    return samples_filtered
I tried to inject those two code samples in the <CODE TO IMPROVE> section:
Most efficient approach
for i in xrange(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)
    L = b-a

    sub_samples = samples[a:b]-samples[i]
    sub_samples *= sub_samples
    for j in xrange(L):
        sub_samples[j] = exp(-sub_samples[j]/sigma)
    intensities = gauss[w-i+a:w-i+b]*sub_samples

    num = 0.0
    W = 0.0
    for j in xrange(L):
        W += intensities[j]
        num += intensities[j]*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 45 ms per loop
Less efficient
for i in xrange(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)

    num = 0.0
    W = 0.0
    for j in xrange(b-a):
        sub_sample = samples[a+j]-samples[i]
        intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma)
        W += intensity
        num += intensity*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 125 ms per loop
You have a few typos:
1) You forgot to define i, just add cdef int i, j, L
2) In the second algorithm you wrote intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma), it should be intensity, without the 1
3) I would add @cython.cdivision(True) to avoid the check for division by zero
With those changes and with x = np.random.rand(10000) I got the following results
%timeit bilateral_filter_C1(x, 20) # First code
10 loops, best of 3: 74.1 ms per loop
%timeit bilateral_filter_C2(x, 20) # Second code
100 loops, best of 3: 9.5 ms per loop
And, to check the results
np.all(np.equal(bilateral_filter_C1(x, 20), bilateral_filter_C2(x, 20)))
True
To avoid these problems I suggest using the option cython my_file.pyx -a; it generates an HTML file that shows you the possible problems you have in your code
EDIT
Reading the code again, it seems to have more errors:
for j in xrange(L):
    rang = -w+1.0/L
    rang *= rang
    gauss[j] = exp(-rang/sigma)
gauss gets the same value for every j; what is rang supposed to be?
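For what it's worth, my guess at what was intended (only a guess, the question doesn't say) is that rang should be the offset of tap j from the window centre:

for j in xrange(L):
    rang = j - w            # offset from the centre of the window
    rang *= rang
    gauss[j] = exp(-rang/sigma)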
I'm trying to find the fastest way to get the functionality of numpy's 'where' statement on a 2D numpy array; namely, retrieving the indices where a condition is met. It is simply much slower than in other languages I have used (e.g., IDL, Matlab).
I have cythonized a function that marches through the array in nested for-loops. There is almost an order of magnitude increase in speed, but I would like to increase performance even more, if possible.
TEST.py:
from cython_where import *
import time
import numpy as np
data = np.zeros((2600,5200))
data[100:200,100:200] = 10
t0 = time.time()
inds,ct = cython_where(data,'EQ',10)
print time.time() - t0
t1 = time.time()
tmp = np.where(data == 10)
print time.time() - t1
My cython_where.pyx program:
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython

DTYPE1 = np.float
ctypedef np.float_t DTYPE1_t
DTYPE2 = np.int
ctypedef np.int_t DTYPE2_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def cython_where(np.ndarray[DTYPE1_t, ndim=2] data, oper, DTYPE1_t val):
    assert data.dtype == DTYPE1

    cdef int xmax = data.shape[0]
    cdef int ymax = data.shape[1]
    cdef unsigned int x, y
    cdef int count = 0
    cdef np.ndarray[DTYPE2_t, ndim=1] xind = np.zeros(100000, dtype=int)
    cdef np.ndarray[DTYPE2_t, ndim=1] yind = np.zeros(100000, dtype=int)

    if(oper == 'EQ' or oper == 'eq'):  # I didn't want to include GT, GE, LT, LE here
        for x in xrange(xmax):
            for y in xrange(ymax):
                if(data[x, y] == val):
                    xind[count] = x
                    yind[count] = y
                    count += 1

    return tuple([xind[0:count], yind[0:count]]), count
Output of TEST.py:
$ python TEST.py
0.0139019489288
0.0982608795166
I've also tried numpy's argwhere, which is about as fast as where. I'm pretty new to numpy and cython, so if you have any other ideas to really increase performance, I'm all ears!
Contributions:
NumPy can be sped up by working on the flattened array, for a 4x gain:
%timeit np.where(data==10)
1 loops, best of 3: 105 ms per loop
%timeit np.unravel_index(np.where(data.ravel()==10),data.shape)
10 loops, best of 3: 26.0 ms per loop
I think you can optimize your cython code with that, avoiding computing k=i*ncol+j for each cell.
Numba gives a simple alternative:
import numpy as np
from numba import jit

dtype = data.dtype

@jit(nopython=True)
def numbaeq(flatdata, x, nrow, ncol):
    size = ncol*nrow
    # note: the index arrays inherit data's (float) dtype here; an integer
    # dtype such as np.intp would be the more natural choice
    ix = np.empty(size, dtype=dtype)
    jx = np.empty(size, dtype=dtype)
    count = 0
    k = 0
    while k < size:
        if flatdata[k] == x:
            ix[count] = k//ncol
            jx[count] = k%ncol
            count += 1
        k += 1
    return ix[:count], jx[:count]

def whereequal(data, x): return numbaeq(data.ravel(), x, *data.shape)
which gives:
%timeit whereequal(data,10)
10 loops, best of 3: 20.2 ms per loop
Not a great optimisation for numba on this kind of problem; it stays below the Cython performance.
k//ncol and k%ncol can be computed at the same time with an optimized divmod operation, as sketched below.
The ultimate steps are assembly language and parallelisation, but those are other sports.
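A sketch of that divmod variant (my reading of the suggestion; numba's nopython mode supports the divmod builtin for integers, but whether it actually compiles down to a single division is up to the compiler):

import numpy as np
from numba import jit

@jit(nopython=True)
def numbaeq_divmod(flatdata, x, nrow, ncol):
    size = nrow * ncol
    ix = np.empty(size, dtype=np.intp)
    jx = np.empty(size, dtype=np.intp)
    count = 0
    for k in range(size):
        if flatdata[k] == x:
            i, j = divmod(k, ncol)   # row and column from one operation
            ix[count] = i
            jx[count] = j
            count += 1
    return ix[:count], jx[:count]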
I was wondering if I'm missing something when using Cython with Numpy because I haven't seen much of an improvement. I wrote this code as an example.
Naive version:
import numpy as np
from skimage.util import view_as_windows

it = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000, 1000)
windows = view_as_windows(arr, (it, it), it)
container = np.zeros((windows.shape[0], windows.shape[1]))

def test(windows):
    for i in range(windows.shape[0]):
        for j in range(windows.shape[1]):
            container[i, j] = np.mean(windows[i, j])
    return container
%%timeit
test(windows)
1 loops, best of 3: 131 ms per loop
Cythonized version:
%%cython --annotate
import numpy as np
cimport numpy as np
from skimage.util import view_as_windows
import cython

cdef int step = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000, 1000)
windows = view_as_windows(arr, (step, step), step)

@cython.boundscheck(False)
def cython_test(np.ndarray[np.float64_t, ndim=4] windows):
    cdef np.ndarray[np.float64_t, ndim=2] container = np.zeros((windows.shape[0], windows.shape[1]), dtype=np.float64)
    cdef int i, j
    I = windows.shape[0]
    J = windows.shape[1]
    for i in range(I):
        for j in range(J):
            container[i, j] = np.mean(windows[i, j])
    return container
%timeit cython_test(windows)
10 loops, best of 3: 126 ms per loop
As you can see, there is a very modest improvement, so maybe I'm doing something wrong. By the way, this is the annotation that Cython produces:
As you can see, the numpy lines have a yellow background even after including the efficient indexing syntax np.ndarray[DTYPE_t, ndim=2]. Why?
By the way, in my view the ideal outcome is being able to use most numpy functions but still get some reasonable improvement after taking advantage of efficient indexing syntax or maybe memory views as in HYRY's answer.
UPDATE
It seems I'm not doing anything wrong in the code I posted above, and the yellow background in some lines is normal, so I was left wondering the following: in which situations can I get a benefit from typing cdef np.ndarray[np.float64_t, ndim=2] in front of numpy arrays? I suppose there are specific instances where this is helpful; otherwise there wouldn't be much purpose in doing it.
You need to implement the mean() function yourself to speed up the code, because the overhead of calling a numpy function for every small window is very high.
@cython.boundscheck(False)
@cython.wraparound(False)
def cython_test(double[:, :, :, :] windows):
    cdef double[:, ::1] container
    cdef int i, j, k, l
    cdef int n0, n1, n2, n3
    cdef double inv_n
    cdef double s

    n0, n1, n2, n3 = windows.base.shape
    container = np.zeros((n0, n1))
    inv_n = 1.0 / (n2 * n3)

    for i in range(n0):
        for j in range(n1):
            s = 0
            for k in range(n2):
                for l in range(n3):
                    s += windows[i, j, k, l]
            container[i, j] = s * inv_n
    return container.base
Here are the %timeit results:
python_test(windows): 63.7 ms
cython_test(windows): 1.24 ms
np.mean(windows, axis=(2, 3)): 2.66 ms
As part of a large piece of code, I need to calculate arrays of incomplete gamma functions. For example, I need a function that returns (the log of) gamma(z + m, a, inf)/m! for m in [0, m_max], for various values of m_max (typically around 400), z, and a. I need to do this quickly. Currently, this step is the slowest in my code by around a factor of ~2. However, the full code takes ~a day to run, so reducing the computation time of this step by 2 would save me a lot of wall time.
I am using the following cython code for the calculation:
import numpy as np
cimport numpy as np
from mpmath import mp

sp_max = 5000

def log_factorial(k):
    return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))

log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))

gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)

def gammainc_up_fct_ary_log(np.int m_max, np.float z, np.float a):
    cdef np.ndarray gi_list = np.zeros(m_max + 1, dtype=np.float)
    gi_list[0] = np.float(gammainc_mem(z, a))
    cdef np.ndarray i_array = np.arange(1., m_max + 1., dtype=np.float)
    cdef Py_ssize_t i
    for i in np.arange(1, m_max + 1):
        gi_list[i] = (i_array[i-1] - 1. + z)*gi_list[i-1]/i + np.exp((i_array[i-1] - 1. + z)*np.log(a) - a - log_factorial_ary[i])
    return gi_list
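For reference, my reading of the recurrence this loop implements (it is not spelled out above) is the standard one for the upper incomplete gamma function, Gamma(z + m, a) = (z + m - 1) * Gamma(z + m - 1, a) + a^(z + m - 1) * exp(-a); dividing by m! gives G_m = ((z + m - 1)/m) * G_{m-1} + a^(z + m - 1) * exp(-a)/m!, which is exactly the update in the loop and is why each term depends on the previous one, making the loop inherently sequential.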
As an example, when I call gammainc_up_fct_ary_log(400,-0.3,10.0) it takes around ~0.015-0.025 seconds. I would like to speed this up by at least a factor of 2 (or, ideally, as fast as possible).
Is there a clear way to speed up this computation using cython? If not, would C or Fortran be significantly faster? If so, what is the fastest way to write this function in that language and then call the code from python (the rest of my code is written in python/cython).
Thanks in advance.
There are several big issues in your cython version:
i_array is useless, you can safely replace i_array[i-1] by just i
You're not getting the most out of Cython. If you have a look at the output of cython -a on your code, you'll see that Cython is just generating calls to the C API, while you need calls to plain C code to have it run fast.
Here is an example of what you could achieve (incomplete, but the speedup is already great)
import numpy as np
cimport numpy as np
cimport cython
from mpmath import mp

cdef extern from "math.h":
    double log(double x) nogil
    double exp(double x) nogil

sp_max = 5000

def log_factorial(k):
    return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))

factorial_ary = np.array([np.float(mp.factorial(m)) for m in np.arange(sp_max)])
log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))

gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)

def gammainc_up_fct_ary_log(m_max, z, a):
    return gammainc_up_fct_ary_log_impl(m_max, z, a)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef gammainc_up_fct_ary_log_impl(int m_max, double z, double a):
    cdef double[::1] gi_list = np.zeros(m_max + 1, dtype=np.float)
    gi_list[0] = gammainc_mem(z, a)
    cdef Py_ssize_t i
    for i in range(1, m_max + 1):
        t0 = (i - 1. + z)
        t1 = (i - 1. + z)*log(a) - a
        gi_list[i] = t0*gi_list[i-1]/i + exp(t1 - log_factorial_ary[i])
    return gi_list
running this code gives me:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
10000 loops, best of 3: 132 usec per loop
while your version only gives:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
100 loops, best of 3: 2.44 msec per loop