T(i) = Tm(i) + (T(i-1)-Tm(i))**(-tau(i))
Tm and tau are NumPy vectors of the same length that have already been calculated, and the goal is to create a new vector T. The i is included only to indicate the element index that is desired.
Is a for loop necessary for this case?
You might think this would work:
import numpy as np
n = len(Tm)
t = np.empty(n)
t[0] = 0 # or whatever the initial condition is
t[1:] = Tm[1:] + (t[0:n-1] - Tm[1:])**(-tau[1:])
but it doesn't: you can't actually do recursion in numpy this way (since numpy calculates the whole RHS and then assigns it to the LHS).
So unless you can come up with a non-recursive version of this formula, you're stuck with an explicit loop:
tt = np.empty(n)
tt[0] = 0.
for i in range(1,n):
tt[i] = Tm[i] + (tt[i-1] - Tm[i])**(-tau[i])
2019 Update. The Numba code broke with the new version of numba. Changing dtype="float32" to dtype=np.float32 solved it.
I performed some benchmarks, and in 2019 using Numba is the first option people should try to accelerate recursive functions in Numpy (adjusted proposal of Aronstef). Numba is already preinstalled in the Anaconda package and has one of the fastest times (about 20 times faster than pure Python). In 2019 Python supports @numba decorators without additional steps (at least in versions 3.6, 3.7, and 3.8). Here are three benchmarks, performed on 2019-12-05, 2018-10-20 and 2016-05-18.
And, as mentioned by Jaffe, in 2018 it is still not possible to vectorize recursive functions. I checked the vectorization by Aronstef and it does NOT work.
Benchmarks sorted by execution time:
-------------------------------------------------
| Variant         | 2019-12 | 2018-10 | 2016-05 |
-------------------------------------------------
| Pure C          |      na |      na | 2.75 ms |
| C extension     |      na |      na | 6.22 ms |
| Cython float32  | 0.55 ms | 1.01 ms |      na |
| Cython float64  | 0.54 ms | 1.05 ms | 6.26 ms |
| Fortran f2py    | 4.65 ms |      na | 6.78 ms |
| Numba float32   | 73.0 ms | 2.81 ms |      na |
| (Aronstef)      |         |         |         |
| Numba float32v2 | 1.82 ms | 2.81 ms |      na |
| Numba float64   | 78.9 ms | 5.28 ms |      na |
| Numba float64v2 | 4.49 ms | 5.28 ms |      na |
| Append to list  | 73.3 ms | 48.2 ms | 91.0 ms |
| Using a.item()  | 36.9 ms | 58.3 ms | 74.4 ms |
| np.fromiter()   | 60.8 ms | 60.0 ms | 78.1 ms |
| Loop over Numpy | 71.3 ms | 71.9 ms | 87.9 ms |
| (Jaffe)         |         |         |         |
| Loop over Numpy | 74.6 ms | 74.4 ms |      na |
| (Aronstef)      |         |         |         |
-------------------------------------------------
Corresponding code is provided at the end of the answer.
It seems that Numba and Cython times improve over time. Both of them are now faster than Fortran f2py: Cython is 8.6 times faster and 32-bit Numba is 2.5 times faster. Fortran was very hard to debug and compile in 2016, so now there is no reason to use Fortran at all.
I did not check Pure C and C extension in 2019 and 2018, because it is not easy to compile them in Jupyter notebooks.
I had the following setup in 2019:
Processor: Intel i5-9600K 3.70GHz
Versions:
Python: 3.8.0
Numba: 0.46.0
Cython: 0.29.14
Numpy: 1.17.4
I had the following setup in 2018:
Processor: Intel i7-7500U 2.7GHz
Versions:
Python: 3.7.0
Numba: 0.39.0
Cython: 0.28.5
Numpy: 1.15.1
The recommended Numba code using float32 (adjusted Aronstef):
#numba.jit("float32[:](float32[:], float32[:])", nopython=True, nogil=True)
def calc_py_jit32v2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float32)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
All the other code:
Data creation (like Aronstef + Mike T comment):
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float64'))
tau = np.random.uniform(-1, 0, size=n).astype('float64')
ar = np.column_stack([Tm,tau])
Tm32 = Tm.astype('float32')
tau32 = tau.astype('float32')
Tm_l = list(Tm)
tau_l = list(tau)
The code in 2016 was slightly different, as I used the abs() function to prevent NaNs rather than the Mike T variant. In 2018 the function is exactly the same as the OP (Original Poster) wrote it.
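For reference, here is a minimal Python sketch of that 2016 abs() variant, reconstructed from the C and Fortran versions further down; treat it as an illustration rather than the exact original:
import numpy as np

def calc_py_abs_2016(Tm_, tau_):
    # Hypothetical reconstruction: abs() keeps the base of the power positive,
    # which prevents NaNs from raising a negative number to a fractional power.
    tt = np.empty(len(Tm_), dtype=np.float64)
    tt[0] = 0.0
    for i in range(1, len(Tm_)):
        tt[i] = Tm_[i] + abs(tt[i-1] - Tm_[i])**(-tau_[i])
    return tt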
Cython float32 using Jupyter %% magic. The function can be used directly in Python. Cython needs a C compiler matching the one Python was compiled with; installing the right version of the Visual C++ compiler (for Windows) can be problematic:
%%cython
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
cdef extern from "math.h":
np.float32_t exp(np.float32_t m)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
@cython.initializedcheck(False)
def cy_loop32(np.float32_t[:] Tm,np.float32_t[:] tau,int alen):
cdef np.float32_t[:] T=np.empty(alen, dtype=np.float32)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Cython float64 using Jupyter %% magic. The function can be used directly in Python:
%%cython
cdef extern from "math.h":
double exp(double m)
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
@cython.initializedcheck(False)
def cy_loop(double[:] Tm,double[:] tau,int alen):
cdef double[:] T=np.empty(alen)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Numba float64:
#numba.jit("float64[:](float64[:], float64[:])", nopython=False, nogil=True)
def calc_py_jitv2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float64)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
Append to list. Fastest non-compiled solution:
def rec_py_loop(Tm,tau,alen):
T = [Tm[0]]
for i in range(1,alen):
T.append(Tm[i] - (T[i-1] + Tm[i])**(-tau[i]))
return np.array(T)
Using a.item():
def rec_numpy_loop_item(Tm_,tau_):
n_ = len(Tm_)
tt=np.empty(n_)
Ti=tt.item
Tis=tt.itemset
Tmi=Tm_.item
taui=tau_.item
Tis(0,Tm_[0])
for i in range(1,n_):
Tis(i,Tmi(i) - (Ti(i-1) + Tmi(i))**(-taui(i)))
return tt[1:]
np.fromiter():
def it(Tm,tau):
T=Tm[0]
i=0
while True:
yield T
i+=1
T=Tm[i] - (T + Tm[i])**(-tau[i])
def rec_numpy_iter(Tm,tau,alen):
return np.fromiter(it(Tm,tau), np.float64, alen)[1:]
Loop over Numpy (based on Jaffe's idea):
def rec_numpy_loop(Tm,tau,alen):
tt=np.empty(alen)
tt[0]=Tm[0]
for i in range(1,alen):
tt[i] = Tm[i] - (tt[i-1] + Tm[i])**(-tau[i])
return tt[1:]
Loop over Numpy (Aronstef's code). On my computer float64 is the default type for np.empty.
def calc_py(Tm_, tau_):
tt = np.empty(len(Tm_),dtype="float64")
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = (Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i]))
return tt[1:]
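A one-line check of the np.empty default mentioned above (float64 is the documented default dtype):
import numpy as np
print(np.empty(3).dtype)  # float64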
Pure C without using Python at all. Version from year 2016 (with fabs() function):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <sys\timeb.h>
double randn() {
double u = rand();
if (u > 0.5) {
return sqrt(-1.57079632679*log(1.0 - pow(2.0 * u - 1, 2)));
}
else {
return -sqrt(-1.57079632679*log(1.0 - pow(1 - 2.0 * u,2)));
}
}
void rec_pure_c(double *Tm, double *tau, int alen, double *T)
{
for (int i = 1; i < alen; i++)
{
T[i] = Tm[i] + pow(fabs(T[i - 1] - Tm[i]), (-tau[i]));
}
}
int main() {
int N = 100000;
double *Tm= calloc(N, sizeof *Tm);
double *tau = calloc(N, sizeof *tau);
double *T = calloc(N, sizeof *T);
double time = 0;
double sumtime = 0;
for (int i = 0; i < N; i++)
{
Tm[i] = randn();
tau[i] = randn();
}
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
for (int j = 0; j < 1000; j++)
{
for (int i = 0; i < 3; i++)
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
rec_pure_c(Tm, tau, N, T);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
if (i == 0)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
else {
if (time > (double)ElapsedMicroseconds.QuadPart / 1000)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
}
}
sumtime += time;
}
printf("1000 loops,best of 3: %.3f ms per loop\n",sumtime/1000);
free(Tm);
free(tau);
free(T);
}
Fortran f2py. Function can be used from Python. Version from year 2016 (with abs() function):
subroutine rec_fortran(tm,tau,alen,result)
integer*8, intent(in) :: alen
real*8, dimension(alen), intent(in) :: tm
real*8, dimension(alen), intent(in) :: tau
real*8, dimension(alen) :: res
real*8, dimension(alen), intent(out) :: result
res(1)=0
do i=2,alen
res(i) = tm(i) + (abs(res(i-1) - tm(i)))**(-tau(i))
end do
result=res
end subroutine rec_fortran
Update: 21-10-2018
I have corrected my answer based on comments.
It is possible to vectorize operations on vectors as long as the calculation is not recursive. Because a recursive operation depends on the previously calculated value, it cannot be processed in parallel.
Therefore, this does not work:
def calc_vect(Tm_, tau_):
return Tm_[1:] - (Tm_[:-1] + Tm_[1:]) ** (-tau_[1:])
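A quick, self-contained check (a sketch with the same kind of random data as above) makes the mismatch visible: the vectorised expression reads Tm_[:-1] where the recursion needs the previously computed tt[i-1].
import numpy as np

np.random.seed(0)
Tm = np.cumsum(np.random.uniform(0.1, 1, size=1000))
tau = np.random.uniform(-1, 0, size=1000)

# Reference result from the explicit recursion
tt = np.empty_like(Tm)
tt[0] = Tm[0]
for i in range(1, len(Tm)):
    tt[i] = Tm[i] - (tt[i-1] + Tm[i])**(-tau[i])

print(np.allclose(tt[1:], calc_vect(Tm, tau)))  # False -- the "vectorised" attempt diverges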
Since serial processing (a loop) is necessary, the best performance is gained by moving as close as possible to optimized machine code; therefore Numba and Cython are the best answers here.
A Numba approach can be achieved as follows:
init_string = """
from math import pow
import numpy as np
from numba import jit, float32
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float32'))
tau = np.random.uniform(-1, 0, size=n).astype('float32')
def calc_python(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
@jit(float32[:](float32[:], float32[:]), nopython=False, nogil=True)
def calc_numba(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
"""
import timeit
py_time = timeit.timeit('calc_python(Tm, tau)', init_string, number=100)
numba_time = timeit.timeit('calc_numba(Tm, tau)', init_string, number=100)
print("Python Solution: {}".format(py_time))
print("Numba Soltution: {}".format(numba_time))
Timeit comparison of the Python and Numba functions:
Python Solution: 54.58057559299999
Numba Solution: 1.1389029540000024
This is a good question. I am also interested to know if this is possible but so far I have not found a way to do it except in some simple cases.
Option 1. numpy.ufunc.accumulate
This seems to be a promising option, as mentioned by @Karl Knechtel. You need to create a ufunc first. This web page explains how.
In the simple case of a recurrent function that takes two scalars as input and outputs one scalar, it seems to work:
import numpy as np
def test_add(x, data):
return x + data
assert test_add(1, 2) == 3
assert test_add(2, 3) == 5
# Make a Numpy ufunc from my test_add function
test_add_ufunc = np.frompyfunc(test_add, 2, 1)
assert test_add_ufunc(1, 2) == 3
assert test_add_ufunc(2, 3) == 5
assert np.all(test_add_ufunc([1, 2], [2, 3]) == [3, 5])
data_sequence = np.array([1, 2, 3, 4])
f_out = test_add_ufunc.accumulate(data_sequence, dtype=object)
assert np.array_equal(f_out, [1, 3, 6, 10])
[Note the dtype=object argument which is necessary as explained on the web page linked above].
But in your case (and mine) we want to compute a recurrent equation that has more than one data input (and potentially more than one state variable too).
When I tried this using the ufunc.accumulate approach above I got ValueError: accumulate only supported for binary functions.
If anyone knows a way round that constraint I would be very interested.
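For completeness, here is a minimal sketch of the kind of attempt that triggers that error; the step function and names below are made up for illustration:
import numpy as np

def t_step(t, Tm, tau):                      # hypothetical step: one state plus two data inputs
    return Tm + (t - Tm)**tau

t_step_ufunc = np.frompyfunc(t_step, 3, 1)   # nin == 3, so this is not a binary ufunc

# Calling accumulate on it fails:
# t_step_ufunc.accumulate(np.arange(4.0), dtype=object)
# ValueError: accumulate only supported for binary functions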
Option 2. Python's built-in accumulate function (from itertools)
In the meantime, this solution doesn't quite achieve what you wanted in terms of a vectorized calculation in numpy, but it does at least avoid a for loop.
from itertools import accumulate, chain
def t_next(t, data):
Tm, tau = data # Unpack more than one data input
return Tm + (t - Tm)**tau
assert t_next(2, (0.38, 0)) == 1.38
t0 = 2 # Initial t
Tm_values = np.array([0.38, 0.88, 0.56, 0.67, 0.45, 0.98, 0.58, 0.72, 0.92, 0.82])
tau_values = np.linspace(0, 0.9, 10)
# Combine the input data into a 2D array
data_sequence = np.vstack([Tm_values, tau_values]).T
t_out = np.fromiter(accumulate(chain([t0], data_sequence), t_next), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]
# Slightly more readable version possible in Python 3.8+
t_out = np.fromiter(accumulate(data_sequence, t_next, initial=t0), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]
To build on NPE's answer, I agree that there has to be a loop somewhere. Perhaps your goal is to avoid the overhead associated with a Python for loop? In that case, numpy.fromiter does beat out a for loop, but only by a little:
Using the very simple recursion relation,
x[i+1] = x[i] + 0.1
I get
#FOR LOOP
def loopit(n):
x = [0.0]
for i in range(n-1): x.append(x[-1] + 0.1)
return np.array(x)
#FROMITER
#define an iterator (a better way probably exists -- I'm a novice)
def it():
x = 0.0
while True:
yield x
x += 0.1
#use the iterator with np.fromiter
def fi_it(n):
return np.fromiter(it(), float, n)
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 31.7 ms per loop
%timeit -n 100 fi_it(100000)
#100 loops, best of 3: 18.6 ms per loop
Interestingly, pre-allocating a numpy array results in a substantial loss in performance. This is a mystery to me, though I would guess that there must be more overhead associated with accessing an array element than with appending to a list.
def loopit(n):
x = np.zeros(n)
for i in range(n-1): x[i+1] = x[i] + 0.1
return x
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 50.1 ms per loop
What I am trying to do
I am trying to create a very simple function which I want to optimise with numba (or at least verify if numba makes any difference).
I am running numpy 1.19.2 and numba 0.51.2 in an Anaconda installation on Windows.
The function takes 3 numeric inputs: a, b, c; the inputs can be scalars or numpy arrays; the output will of course be, respectively, a scalar or a numpy array.
The function is fairly simple:
if a == 0 --> it returns np.nan
if b == 0 --> it returns a certain number
otherwise it performs some very simple algebra
The issue
I have come up with the toy example below (my actual formulas are more complex but I can show what I need to show with this easier example).
if the inputs are arrays, it works perfectly
if the inputs are scalar, numba doesn't work (Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2' )
if the inputs are arrays of size 1 (I make an array out of each scalar) numba works again
What I tried / similar questions
The closest question I found was this, but the mismatch there was between an int and a float.
Here it is between an array(int64, 0d, C) and a float64. I can convert my inputs to float but the mismatch remains.
Any ideas? I am not sure what the array and the float being compared are, to be honest.
The one solution I have found is to add a = np.array([a]) at the beginning of the function, but I don't understand why, plus this returns an array of size 1, whereas I'd like a scalar returned in these cases.
Toy example
@numba.jit
def my_fun(a,b,c):
return np.where(a == 0, np.nan,
np.where(b ==0 , 0 , c**2) )
a = np.arange(0,11)
b = np.arange(3,14)
b[1] = 0
c = np.arange(10,21)
out_array = my_fun(a,b,c)
out_scalar = my_fun(0,0,1)
The exact warning:
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function my_fun failed at nopython mode lowering due to: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2', defined at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: typing of assignment at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: lowering "$36call_method.17 = call $4load_method.1($10compare_op.4, $14load_attr.6, $34call_method.16, func=$4load_method.1, args=[Var($10compare_op.4, refactor numba.py:8), Var($14load_attr.6, refactor numba.py:8), Var($34call_method.16, refactor numba.py:9)], kws=(), vararg=None)" at D:\MY DATA\USERNAME\Python\scratch scripts\refactor numba.py (8)
@numba.jit
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:177: NumbaWarning: Function "my_fun" was compiled in object mode without forceobj=True.
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaWarning(warn_msg,
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:187: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
For more information visit https://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaDeprecationWarning(msg,
I have found a solution, but it's far from elegant, and I am hoping there is a better one.
To recap, I needed a function which:
works with numba
works with both scalars and arrays
returns scalar (not a one-sized array) when the inputs are scalars, and arrays when the inputs are arrays
I have tried the following, and found option 2 to be the fastest.
my_fun_optimised_1: a function which, without numba, determines whether the inputs are scalar or not, and then calls, accordingly, a sub-function for the scalar case and one for the arrays. Both sub-functions run with numba, but take forever. I guess this is because numba must be re-initialised at each call of the main function.
my_fun_optimised_2: similar to the above, except the scalar and array functions, both running with numba, are main functions and not subfunctions. Much much faster.
my_fun_non_opt_no_numba : a function which runs without numba.
The results are:
+-------------------------+----------------------------+-----------------------------+
| Function | Array: time vs the fastest | Scalar: time vs the fastest |
+-------------------------+----------------------------+-----------------------------+
| optimised numba 1 | 54,403 | 42,961 |
| optimised numba 2 | 1 | 1 |
| non-optimised, no numba | 3.409 | 4.53892 |
+-------------------------+----------------------------+-----------------------------+
What this means is that, on my PC, the non-optimised, no-numba code takes 4.5 times longer than "optimised numba 2" to run on scalars, and 3.4 times longer for arrays.
The "optimised numba 1" is not optimised at all and takes an insane amount of time.
I hope all of this can be of use to other people.
PS I am well aware of the pitfalls of premature optimisation. I am only doing this because I have a specific case where 60% of the time is spent doing a similar (but not identical) calculation to the one shown here.
The code to time the functions is:
import numpy as np
import numba
import timeit
import pandas as pd
def my_fun_optimised_1(a,b,c):
@numba.jit
def my_fun_vectorised(a,b,c):
return np.where(a == 0, np.nan,
np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
)
@numba.jit
def my_fun_scalar(a,b,c):
if a ==0:
return np.nan
elif b == 0:
return np.nan
else:
return b*a**3 + a*b**3 + a*b*c**3
if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
return my_fun_scalar(a,b,c)
else:
return my_fun_vectorised(a,b,c)
def my_fun_optimised_2(a,b,c):
if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
return fun_2_scalar(a,b,c)
else:
return fun_2_vectorised(a,b,c)
@numba.jit
def fun_2_scalar(a,b,c):
if a ==0:
return np.nan
elif b == 0:
return np.nan
else:
return b*a**3 + a*b**3 + a*b*c**3
@numba.jit
def fun_2_vectorised(a,b,c):
return np.where(a == 0, np.nan,
np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
)
def my_fun_non_opt_no_numba(a,b,c):
# multiplying by 1 converts from array to scalar
return 1 * np.where(a == 0, np.nan,
np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
)
# I couldn't get this to work with Numba
# @numba.jit
def my_fun_non_opt_numba(a,b,c):
a = np.array([a])
b = np.array([b])
c = np.array([c])
out = np.where(a == 0, np.nan,
np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
)
return out
r = 4
n = int(100)
a = 3
b = 4
c = 5
x = my_fun_optimised_2(a,b,c)
t_scalar_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_scalar = pd.DataFrame(index = ['min time'])
resdf_scalar['optimised numba 1'] = [min(t_scalar_opt_numba_1)]
resdf_scalar['optimised numba 2'] = [min(t_scalar_opt_numba_2)]
resdf_scalar['non-optimised, no numba'] = [min(t_scalar_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_scalar = resdf_scalar.transpose()
resdf_scalar['diff vs fastest'] = (resdf_scalar / resdf_scalar.min() )
a = np.arange(3,13)
b = np.arange(0,10)
c = np.arange(20,30)
t_array_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_array = pd.DataFrame(index = ['min time'])
resdf_array['optimised numba 1'] = [min(t_array_opt_numba_1)]
resdf_array['optimised numba 2'] = [min(t_array_opt_numba_2)]
resdf_array['non-optimised, no numba'] = [min(t_array_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_array = resdf_array.transpose()
resdf_array['diff vs fastest'] = (resdf_array / resdf_array.min() )
I'm trying to solve a 2D Ising model with a Monte Carlo approach.
As it is slow, I used Cython to accelerate the code execution. I would like to push it even further and parallelize the Cython code. My idea is to split the 2D lattice in two, so that any point on one lattice has its nearest neighbours on the other lattice. This way I can randomly choose one lattice and flip all of its spins, and this can be done in parallel since all those spins are independent.
So far this is my code (inspired by http://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/):
%load_ext Cython
%%cython
cimport cython
cimport numpy as np
import numpy as np
from cython.parallel cimport prange
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_ising_step(np.int64_t[:, :] field,float beta):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef int offset = np.random.randint(0,2)
cdef np.int64_t[:,] n_update = np.arange(offset,N,2,dtype=np.int64)
cdef int m,n,i,j
for m in prange(M,nogil=True):
i = m % 2
for j in range(n_update.shape[0]) :
n = n_update[j]
cy_spin_flip(field,(n+i) %N,m%M,beta)
return np.array(field,dtype=np.int64)
cdef cy_spin_flip(np.int64_t[:, :] field,int n,int m, float beta=0.4,float J=1.0):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
if dE <= 0 :
field[n,m] *= -1
elif np.exp(-dE * beta) > np.random.rand():
field[n,m] *= -1
I tried using a prange constructor, but I'm having a lot of trouble with the GIL lock. I'm new to Cython and parallel computing, so I could easily have missed something.
The error :
Discarding owned Python object not allowed without gil
Calling gil-requiring function not allowed without gil
Q : "How to use prange in cython?" . . . . + ( an Epilogue on True-[PARALLEL] True-randomness ... )
Short version : best in those and only those places where it actually gains performance.
Longer version : Your problem starts not with avoiding GIL-lock ownership, but with the Physics & the Performance losses from near computational anti-patterns, irrespective of all the powers the cython-isation may ever have enabled.
The code as-is attempts to apply a 2D-kernel operator over a whole 2D-domain of the {-1|+1}-spin-field[N,M], best in some fast and smart manner.
The actual result is INCONGRUENT with the PHYSICAL ISING FIELD, because the technique of "destructively" self-rewriting the actual state of field[n_,m] right "during" a current generation of the [PAR][SEQ]-organised coverage of the 2D-domain of field[:,:] of current spin values sequentially modifies the state of field[i,j], which obviously does not happen in the real world of the recognised Laws of Physics. Computers are ignorant of these rules; we, humans, should prefer not to be.
Next, the prange'd attempt calls the cdef-ed cy_spin_flip() ( M * N / 2 ) times, in a way that might've been easy to code, yet which is immensely inefficient, if not a performance anti-pattern, to ever run this way.
If one benchmarks the cost of invoking about 1E6 calls to a cy_spin_flip() function repaired so as to become congruent with the Laws of Physics, one straight away sees that the per-call overheads start to matter, the more so when issuing them in a prange-d fashion ( isolated, un-coordinated, memory-layout agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost of going for naive prange, instead of attempting some vectorised / block-optimised, memory-I/O-smarter matrix / kernel processing.
Vectorised code using a 2D-kernel convolution :
A quickly sketched, vectorised code, using a trick proposed by a Master of Vectorisation @Divakar, can produce one step per ~ 3k3 [us] without CPU-architecture tuning and further tweaking, on spin_2Dstate[200,200] :
The initial state is :
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
[ 1, 0, 1 ],
[ 0, 1, 0 ]
],
dtype = np.int8 # [PERF] to be field-tested,
) # some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger 2D-domains ~ [ > 200, > 200 ] users will observe growing costs due to useless amount of memory-I/O spent on 8-B-rich transfers of a principally binary { -1 | +1 } or even more compact bitmap stored-{ 0 | 1 } spin-information.
Next, instead of double-looped calls on each field[:,:] cell, rather block-update the full 2D-domain in one step; the helpers get:
# T[:,:] * sum(?)
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
spin_2Dkernel,
boundary = 'wrap',
mode = 'same'
)[:,:]
Because of the Physics inside the spin-kernel properties, this helper array will consist of only { -4 | -2 | 0 | +2 | +4 } values.
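A quick sanity check of that claim (a sketch; it assumes scipy.signal is available, as the snippets above already rely on signal.convolve2d):
import numpy as np
from scipy import signal

S = np.random.randint(2, size=(200, 200)).astype(np.int8) * 2 - 1   # {-1, +1} spins
K = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=np.int8)
C = S * signal.convolve2d(S, K, boundary='wrap', mode='same')
print(np.unique(C))   # a subset of [-4 -2  0  2  4]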
A simplified, fast vector code :
def aVectorisedSpinUpdateSTEPrandom( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
For cases where the Physics does not recognise a uniform probability for a spin-flip to happen across the whole 2D-domain at the same value, replace the scalar produced from np.random.rand() with a 2D-field-of-(individualised † )-probabilities delivered from np.random.rand( N, M )[:,:], and this will now add some costs, up to some 7k3 ~ 9k3 [us] per spin-update step :
def aVectorisedSpinUpdateSTEPrand2D( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
wide-screen commented as-was source :
// ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
# https://www.desmos.com/calculator/bgz9t3s3nm
@cython.boundscheck( False ) # https://www.desmos.com/calculator/ttz3r735qy
@cython.wraparound( False ) # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field, # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
float beta # beta: a float-factor
): #
cdef int N = field.shape[0] # const
cdef int M = field.shape[1] # const
cdef int offset = np.random.randint( 0, 2 ) #_GIL-lock # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an inital RANDOM-flip-MODE-choice-{0|1}
cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
cdef int m, n, i, j # idxs{ (E) | [O] }
# #
for m in prange( M, nogil = True ): # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
i = m % 2 # ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
for j in range( n_update.shape[0] ) : # [SEQ] j over ... ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
# n = n_update[j] # n = n_update[j] |||||||||
# cy_spin_flip( field, ( n + i ) % N, m % M, beta ) # |||||||||
# ||||| # INCONGRUENT with PHYSICAL FIELD ISING |||||||||
# vvvvv # self-rewriting field[n_,m]"during" current generation of [PAR][SEQ]-organised coverage of 2D-field[:,:]
pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
# # |||||||||
return np.array( field, dtype = np.int64 ) # ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field, # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
int n, # const int
int m, # const int
float beta = 0.4, # const float ? is a pure positive scalar or can also be negative ?
float J = 1.0 # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
):
cdef int N = field.shape[0] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef int M = field.shape[1] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef float dE = ( 2 * J * field[ n, m ] # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
*( field[( n - 1 ) % N, m ] # | (const) vvvv------------aSureSpinFLIP
+ field[( n + 1 ) % N, m ] # [?]-T[n,m]-[?] sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? ) |::::|
+ field[ n, ( m - 1 ) % M] # | := {-8J |-4J | 0 | 4J | 8J }
+ field[ n, ( m + 1 ) % M] # [?] a T-dependent choice|__if_+T__| |__if_-T__| FLIP #random-scaled by 2*J*beta
)# | | # ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY )
) # | | #
if dE <= 0 : # | | #
field[ n, m ] *= -1 # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy)
elif ( np.exp( -dE * beta ) # | | # [PERF]: with a minusBETA, one MUL uop SAVED * M * N
> np.random.rand() #__________|_____________|__________GIL-lock# [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS???
): # | | # Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
field[ n, m ] *= -1 # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
# # in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The Physics:
The beta-controlled ( given const J ) model of spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 }, which are the only cases for ~ 2 * spin_2Dkernel-convolutions across the whole 2D-domain of the current spin_2Dstate, is available here : https://www.desmos.com/calculator/bgz9t3s3nm. One may live-experiment with beta to see the lowering threshold for either of the possible positive outputs { +4 | +8 }, as np.exp( -dE * 2 * J * beta ) is strongly controlled by beta: the larger the beta, the lower the probability that a randomly drawn number, warranted to be from the semi-closed range [0, 1), will not dominate the np.exp()-result.
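A tiny numeric illustration of that control, using the flip rule from the code above (flip if dE <= 0, otherwise flip with probability exp(-dE*beta)); with J = 1 the listed values are the only possible dE:
import numpy as np

dE = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])          # 2*J*spin*sum(neighbours) with J = 1
for beta in (0.2, 0.4, 0.8):
    p_flip = np.minimum(1.0, np.exp(-dE * beta))    # flip happens if exp(-dE*beta) > rand()
    print(f"beta = {beta}: p_flip = {np.round(p_flip, 3)}")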
† An Epilogue on a Post-Festum Remark :
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."– Angelo C 7 hours ago
Thanks for all the remarks & details on the method and your choices. The "most-(densely)-aggressive" spin updates by a pair of non-"intervening" lattices require a more careful choice of strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance ( a technical issue on its own: how to maintain an FSA-state, if one resorts to a naive, central PRNG-source ).
You either design your process to be truly randomness-based ( using some of the available sources of indeed non-deterministic entropy ), or you are willing to be subordinated to a policy of allowing repeatable experiments ( for re-inspection & re-validation of scientific computing ), for which you have one more duty - a duty of Configuration Management of such a scientific experiment ( to record / setup / distribute / manage the initial "seeding" of all the PRNG-s that the scientific computing experiment is configured to use ).
Here, given that the nature warrants the spins to be mutually independent in the 2D-domain of the field[:,:], the direction of the time-arrow ought to be the only direction in which such (deterministic) PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1). As a side-effect of that, they will cause no problems for a parallelisation of the individual evolution of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way of doing this ( be warned, if not already aware, that not all "COTS" PRNG-s have all these properties "built-in" ).
That means either of the spins will remain fair & congruent with the Laws of Physics if and only if it sources its spin-flip decision threshold from its "own" (thus congruently autonomous, so as to retain the uniformity of the distribution of outputs) PRNG-instance (not a problem, but care is needed not to forget to implement it right & run it efficiently).
For cases where an indeed non-deterministic PRNG is needed, the source of truly ND-entropy may become a performance bottleneck, if one tries to use it beyond its performance ceiling. A fight for nature-like entropy is a challenging task in a domain of (no matter how large, yet still) Finite-State Automata, isn't it?
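One conventional NumPy way to set up such per-worker, reproducibly seeded PRNG instances is sketched below; the number of streams and the recorded seed are illustrative choices, not part of the original code:
import numpy as np

root = np.random.SeedSequence(12345)             # record this value for Configuration Management
children = root.spawn(8)                         # e.g. one independent stream per worker / stripe
rngs = [np.random.default_rng(child) for child in children]

# Each worker then draws its own uniform [0, 1) spin-flip thresholds from its own stream:
thresholds = [rng.random(4) for rng in rngs]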
From a Cython point-of-view the main problem is that cy_spin_flip requires the GIL. You need to add nogil to the end of its signature, and set the return type to void (since by default it returns a Python object, which requires the GIL).
However, np.exp and np.random.rand also require the GIL, because they're Python function calls. np.exp is probably easily replaced with libc.math.exp. np.random is a bit harder, but there's plenty of suggestions for C- and C++-based approaches: 1 2 3 4 (+ others).
A more fundamental problem is the line:
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
You've parallelized this with respect to m (i.e. different values of m are run in different threads), and each iteration changes field. However, in this line you are looking up several different values of m. This means the whole thing is a race condition (the result depends on the order in which the different threads finish) and suggests your algorithm may be fundamentally unsuitable for parallelization. Or that you should copy field and have field_in and field_out. It isn't obvious to me, but this is something that you should be able to work out.
Edit: it does look like you've given the race condition some thought with using i%2. It isn't obvious to me that this is right though. I think a working implementation of your "alternate cells" scheme would look something like:
for oddeven in range(2):
for m in prange(M):
for n in range(N):
# some mechanism to pick the alternate cells here.
i.e. you need a regular loop to pick the alternate cells outside your parallel loop.
A 256*64 pixel OLED display connected to a Raspberry Pi (Zero W) has 4-bit greyscale pixel data packed into a byte (i.e. two pixels per byte), so 8192 bytes in total. E.g. the bytes
0a 0b 0c 0d (only lower nibble has data)
become
ab cd
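In code, that packing of the example bytes looks like this (a minimal sketch, not the actual driver code):
src = bytes([0x0a, 0x0b, 0x0c, 0x0d])        # only the lower nibble of each byte carries data
packed = bytes(((src[i] & 0x0F) << 4) | (src[i + 1] & 0x0F)
               for i in range(0, len(src), 2))
assert packed == bytes([0xab, 0xcd])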
Converting these bytes, obtained either from a Pillow (PIL) Image or a cairo ImageSurface, takes up to 0.9 s when naively iterating the pixel data, depending on color depth.
Combining every two bytes from a Pillow "L" (monochrome 8 bit) Image:
imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
nib = n & 0x0F
if msn is not None:
b = msn << 4 | nib
packed.append(b)
msn = None
else:
msn = nib
This (omitting state and saving float/integer conversion) brings it down to about half (0.2 s):
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
Basically the first approach applied to an RGB24 (32-bit!) cairo ImageSurface, though with crude greyscale conversion:
mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)
# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16) & 0x0F
if msn is not None:
b = msn << 4 | nib
o.append(b)
msn = None
else:
msn = nib
takes about twice as long (0.9 s vs 0.4 s).
The struct module does not support nibbles (half-bytes).
bitstring does allow packing nibbles:
>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>
But there does not seem to be a method to unpack this into a list of integers quickly -- this takes 30 seconds!:
a = bitstring.BitStream()
for p in imd:
a.append( bitstring.Bits(uint=p//16, length=4) )
packed=[]
a.pos=0
for p in range(256*64//2):
packed.append( a.read(8).uint )
Does Python 3 have the means to do this efficiently or do I need an alternative?
External packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Looks very good, see my answer.
Down to 130 ms from 200 ms by just wrapping the loop in a function
def packer0(imd):
"""same loop in a def"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 35 ms by Cythonizing the same code
def packer1(imd):
"""Cythonize python nibble packing loop"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 16 ms with a typed loop variable
def packer2(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int b
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Not much of a difference with a "simplified" loop
def packer3(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int i
for i in range(256*64/2):
packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
return packed
Maybe a tiny bit faster even (15 ms)
def packer4(it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
Here it is with timeit:
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).
Or with parallelism on the GPU or the multi-core Raspberry Pis.
A crude comparison with the same loop in C (ideone):
#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
uint8_t in[SIZE] = {0};
uint8_t out[SIZE/2] = {0};
uint8_t t;
for(t=0; t<100; t++){
uint16_t i;
for(i=0; i<SIZE/2; i++){
out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
}
}
return 0;
}
It's apparently 100 times faster:
pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out
real 0m0.085s
user 0m0.060s
sys 0m0.010s
Eliminating the shifts/division may be another slight optimization (I have not checked the resulting C, nor the binary):
def packs(bytes it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
results in
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
I implemented an exponentially weighted moving average (ewma) in Python 3 and in Haskell (compiled). Both take about the same time. However, when this function is applied twice, the Haskell version slows down unpredictably (more than 1000 times, whereas the Python version is only about 2 times slower).
Python3 version:
import numpy as np
def ewma_f(y, tau):
a = 1/tau
avg = np.zeros_like(y)
for i in range(1, len(y)):
avg[i] = a*y[i-1]+(1-a)*avg[i-1]
return avg
Haskell with lists:
ewmaL :: [Double] -> Double -> [Double]
ewmaL ys tau = reverse $ e (reverse ys) (1.0/tau)
where e [x] a = [a*x]
e (x:xs) a = (a*x + (1-a)*(head $ e xs a) : e xs a)
Haskell with arrays:
import qualified Data.Vector as V
ewmaV :: V.Vector Double -> Double -> V.Vector Double
ewmaV x tau = V.map f $ V.enumFromN 0 (V.length x)
where
f (-1) = 0
f n = (x V.! n)*a + (1-a)*(f (n-1))
a = 1/tau
In all cases, computation takes about the same time (tested for an array with 10000 elements).
Haskell code was compiled without any flags, though "ghc -O2" didn't make any difference.
I used the computed ewma to compute the absolute deviation from it; I then applied the ewma function to this deviation.
Python3:
def ewmd_f(y, tau):
ewma = ewma_f(y, tau)
return ewma_f(np.abs(y-ewma), tau)
It takes about twice as long as ewma.
Haskell with lists:
ewmdL :: [Double] -> Double -> [Double]
ewmdL xs tau = ewmaL devs tau
where devs = zipWith (\ x y -> abs $ x-y) xs avg
avg = (ewmaL xs tau)
Haskell with vectors:
ewmdV :: V.Vector Double -> Double -> V.Vector Double
ewmdV xs tau = ewmaV devs tau
where devs = V.zipWith (\ x y -> abs $ x-y) xs avg
avg = ewmaV xs tau
Both ewmd versions run > 1000 times slower than their ewma counterparts.
I evaluated python3 code with:
from time import time
x = np.sin(np.arange(10000))
tau = 100.0
t1 = time()
ewma = ewma_f(x, tau)
t2 = time()
ewmd = ewmd_f(x, tau)
t3 = time()
print("EWMA took {} s".format(t2-t1))
print("EWMD took {} s".format(t3-t2))
I evaluated Haskell code with:
import System.CPUTime
timeIt f = do
start <- getCPUTime
end <- seq f getCPUTime
let d = (fromIntegral (end - start)) / (10^12) in
return (show d)
main = do
let n = 10000 :: Int
let tau = 100.0
let l = map sin [0.0..(fromIntegral $ n-1)]
let x = V.map sin $ V.enumFromN 0 n
putStrLn "Vectors"
aV <- timeIt $ V.last $ ewmaV x tau
putStrLn $ "EWMA (vector) took "++aV
dV <- timeIt $ V.last $ ewmdV x tau
putStrLn $ "EWMD (vector) took "++dV
putStrLn ""
putStrLn "Lists"
lV <- timeIt $ last $ ewmaL l tau
putStrLn $ "EWMA (list) took "++lV
lD <- timeIt $ last $ ewmdL l tau
putStrLn $ "EWMD (list) took "++lD
Your Python and Haskell algorithms may look equivalent, but they actually have different asymptotic complexity:
ewmaV x tau = V.map f $ V.enumFromN 0 (V.length x)
where
f (-1) = 0
f n = (x V.! n)*a + (1-a)
*(f (n-1)) -- Recursion!
a = 1/tau
This makes the Haskell implementation O(n²), which is unacceptable. The reason you don't notice this when only evaluating V.last . ewmaV is laziness: to evaluate the last element only, you don't really need to process the entire vector; instead you only get one recursion loop across x. OTOH, ewmdV actually forces all of the elements, hence the extra cost.
One simple (but not optimal, I daresay) way to get around this is to memoise the result:
ewmaV :: V.Vector Double -> Double -> V.Vector Double
ewmaV x tau = result
where result = V.map f $ V.enumFromN 0 (V.length x)
f 0 = V.head x * a
f n = (x V.! n)*a + (1-a)*(result V.! (n-1))
a = 1/tau
Now ewmdV takes ≈twice as long as ewmaV:
sagemuej@sagemuej-X302LA:/tmp$ ghc wtmpf-file6122.hs -O2 && ./wtmpf-file6122
[1 of 1] Compiling Main ( wtmpf-file6122.hs, wtmpf-file6122.o )
Linking wtmpf-file6122 ...
Vectors
EWMA (vector) took 4.932e-3
EWMD (vector) took 7.758e-3
(Those timings aren't very reliable; for accurate performance tests use criterion.)
A better solution IMO would be to avoid this indexing business entirely – we're not writing Fortran, are we? IIRs like EWMA are better implemented in a purely “local” manner; this can be expressed nicely in Haskell with a state monad, so you're independent of what container the data ships in.
import Data.Traversable
import Control.Monad (forM)
import Control.Monad.State
ewma :: Traversable t => t Double -> Double -> t Double
ewma x tau = (`evalState`0) . forM x $
\xi -> state $ \carry
-> let yi = a*xi + (1-a)*carry
in (yi, yi)
where a = 1/tau
While we're at generalising: there's no reason to restrict this only to work with Double data; you can filter any kind of variable that can be scaled and interpolated.
{-# LANGUAGE FlexibleContexts #-}
import Data.VectorSpace
ewma :: (Traversable t, VectorSpace v, Fractional (Scalar v))
=> t v -> Scalar v -> t v
ewma x tau = (`evalState`zeroV) . forM x $
\xi -> state $ \carry
-> let yi = a*^xi ^+^ (1-a)*^carry
in (yi, yi)
where a = 1/tau
This way, you can in principle use the same filter for motion-blurring video data stored in a lazily streamed infinite list of picture frames, as for lowpass-filtering a radio signal pulse stored in an unboxed Vector. (VU.Vector actually has no Traversable instance; you need to substitute oforM then.)
The following makes two recursive calls:
ewmaL ys tau = reverse $ e (reverse ys) (1.0/tau)
where e [x] a = [a*x]
e (x:xs) a = (a*x + (1-a)*(head $ e xs a) : e xs a)
We can make one recursive call, and use the result for both cases:
ewmaLcse :: [Double] -> Double -> [Double]
ewmaLcse ys tau = reverse $ e (reverse ys) (1.0/tau)
where e [x] a = [a*x]
e (x:xs) a = (a*x + (1-a)*(head zs) : zs)
where zs = e xs a
I also chose to benchmark the sum of the list, so as to force all of it to be computed:
lV <- timeIt $ sum $ ewmaL l tau
putStrLn $ "EWMA (list) took "++lV
lVcse <- timeIt $ sum $ ewmaLcse l tau
putStrLn $ "EWMAcse (list) took "++lVcse
Results, with n=10000
Lists
EWMA (list) took 2.384
EWMAcse (list) took 0.0
with n=20000
Lists
EWMA (list) took 16.472
EWMAcse (list) took 4.0e-3
By the way, one can also use standard library loops for this specific recursion. Here I resorted to mapAccumL: no need to double-reverse lists.
ewmaL2 :: [Double] -> Double -> [Double]
ewmaL2 ys tau = snd $ mapAccumL e 0 ys
where a = 1/tau
e !prevAvg !y = (nextAvg,nextAvg)
where !nextAvg = a*y+(1-a)*prevAvg
(Actually, the fact that I am using (nextAvg,nextAvg) means that a simpler scanl would also do the job. Oh, well...)
To add to leftaroundabout's great answer, I have eventually solved my problem using V.scanl:
ewma :: V.Vector Double -> Double -> V.Vector Double
ewma x tau = V.tail $ V.scanl ma 0 x
where
ma avg x = x*a + (1-a)*avg
a = 1/tau
My guess is that scanl has O(n) complexity, not O(n²) like my initial code.
With unboxed vectors, it gives pretty decent performance.
Numpy doesn't yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower than numpy's quicksort.
Test and benchmark:
a = np.random.randint(0, 1e8, 1e6)
assert(np.all(radix_sort(a) == np.sort(a)))
%timeit np.sort(a)
%timeit radix_sort(a)
The mask_b loop can be at least partially vectorized, broadcasting out across masks from &, and using cumsum with axis arg, but that ends up being a pessimization, presumably due to the increased memory footprint.
If anyone can see a way to improve on what I have I'd be interested to hear, even if it's still slower than np.sort...this is more a case of intellectual curiosity and interest in numpy tricks.
Note that you can implement a fast counting sort easily enough, though that's only relevant for small integer data.
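For reference, a minimal NumPy counting-sort sketch along those lines (only sensible when the maximum value is small):
import numpy as np

def counting_sort(x):
    # Count how often each value occurs, then expand the counts back out in order.
    counts = np.bincount(x)
    return np.repeat(np.arange(len(counts)), counts)

small = np.array([3, 1, 2, 3, 0, 1])
assert np.array_equal(counting_sort(small), np.sort(small))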
Edit 1: Taking np.arange(n) out of the loop helps a little, but that's not very exciting.
Edit 2: The cumsum was actually redundant (oops!) but this simpler version only helps marginally with performance.
def radix_sort(a):
bit_len = np.max(a).bit_length()
n = len(a)
cached_arange = np.arange(n)
idx = np.empty(n, dtype=int) # fully overwritten each iteration
for mask_b in range(bit_len):
is_one = (a & 2**mask_b).astype(bool)
n_ones = np.sum(is_one)
n_zeros = n-n_ones
idx[~is_one] = cached_arange[:n_zeros]
idx[is_one] = cached_arange[:n_ones] + n_zeros
# next three lines just do: a[idx] = a, but correctly
new_a = np.empty(n, dtype=a.dtype)
new_a[idx] = a
a = new_a
return a
Edit 3: rather than loop over single bits, you can loop over two or more at a time, if you construct idx in multiple steps. Using 2 bits helps a little, I've not tried more:
idx[is_zero] = np.arange(n_zeros)
idx[is_one] = np.arange(n_ones)
idx[is_two] = np.arange(n_twos)
idx[is_three] = np.arange(n_threes)
Edits 4 and 5: going to 4 bits seems best for the input I'm testing. Also, you can get rid of the idx step entirely. Now only about 5 times, rather than 10 times, slower than np.sort (source available as gist):
Edit 6: This is a tidied up version of the above, but it's also a tiny bit slower. 80% of the time is spent on repeat and extract - if only there was a way to broadcast the extract :( ...
def radix_sort(a, batch_m_bits=3):
bit_len = np.max(a).bit_length()
batch_m = 2**batch_m_bits
mask = 2**batch_m_bits - 1
val_set = np.arange(batch_m, dtype=a.dtype)[:, nax] # nax = np.newaxis
for _ in range((bit_len-1)//batch_m_bits + 1): # ceil-division
a = np.extract((a & mask)[nax, :] == val_set,
np.repeat(a[nax, :], batch_m, axis=0))
val_set <<= batch_m_bits
mask <<= batch_m_bits
return a
Edits 7 & 8: Actually, you can broadcast the extract using as_strided from numpy.lib.stride_tricks, but it doesn't seem to help much performance-wise:
Initially this made sense to me on the grounds that extract will be iterating over the whole array batch_m times, so the total number of cache lines requested by the CPU will be the same as before (it's just that by the end of the process it has requested each cache line batch_m times). However the reality is that extract is not clever enough to iterate over arbitrary stepped arrays, and has to expand out the array before beginning, i.e. the repeat ends up being done anyway.
In fact, having looked at the source for extract, I now see that the best we can do with this approach is:
a = a[np.flatnonzero((a & mask)[nax, :] == val_set) % len(a)]
which is marginally slower than extract. However, if len(a) is a power of two we can replace the expensive mod operation with & (len(a) - 1), which does end up being a bit faster than the extract version (now about 4.9x np.sort for a=randint(0, 1e8, 2**20)). I suppose we could make this work for non-power-of-two lengths by zero-padding and then cropping the extra zeros at the end of the sort... however this would be a pessimisation unless the length was already close to being a power of two.
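Concretely, a runnable sketch of that power-of-two variant (same structure as the Edit 6 function, with the modulo swapped for a mask; the length assertion is mine):
import numpy as np
nax = np.newaxis

def radix_sort_pow2(a, batch_m_bits=3):
    # Works only when len(a) == 2**k, so that `% len(a)` can become `& (len(a) - 1)`.
    assert len(a) & (len(a) - 1) == 0, "length must be a power of two"
    bit_len = int(a.max()).bit_length()
    batch_m = 2**batch_m_bits
    mask = 2**batch_m_bits - 1
    val_set = np.arange(batch_m, dtype=a.dtype)[:, nax]
    for _ in range((bit_len - 1)//batch_m_bits + 1):    # ceil-division over bit batches
        a = a[np.flatnonzero((a & mask)[nax, :] == val_set) & (len(a) - 1)]
        val_set <<= batch_m_bits
        mask <<= batch_m_bits
    return a

a = np.random.randint(0, int(1e8), 2**20)
assert np.array_equal(radix_sort_pow2(a), np.sort(a))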
I had a go with Numba to see how fast a radix sort could be. The key to good performance with Numba (often) is to write out all the loops, which is very instructive. I ended up with the following:
from numba import jit
@jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
mask = (1 << batch_m_bits) - 1
for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
# set bit sums to zero
for i in range(bitsums.shape[0]):
bitsums[i] = 0
# determine bit sums
for i in range(a.shape[0]):
j = (a[i] & mask) >> shift
bitsums[j] += 1
# take the cumsum of the bit sums
cumsum = 0
for i in range(bitsums.shape[0]):
temp = bitsums[i]
bitsums[i] = cumsum
cumsum += temp
# sorting loop
for i in range(a.shape[0]):
j = (a[i] & mask) >> shift
out[bitsums[j]] = a[i]
bitsums[j] += 1
# prepare next iteration
mask <<= batch_m_bits
# can't use `temp` here because of numba internal types
temp2 = a
a = out
out = temp2
return a
From the 4 inner loops, it's easy to see it's the 4th one making it hard to vectorize with Numpy.
One way to cheat around that problem is to pull in a particular C++ function from Scipy: scipy.sparse.coo.coo_tocsr. It does pretty much the same inner loops as the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:
from scipy.sparse.coo import coo_tocsr
def radix_step(radix, keys, bitsums, a, w):
coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
return w, a
def scipysparse_radix_perbyte(a):
# coo_tocsr internally works with system int and upcasts
# anything else. We need to copy anyway to not mess with
# original array. Also take into account endianness...
a = a.astype('<i', copy=True)
bitlen = int(a.max()).bit_length()
radix = 256
work = np.empty_like(a)
_ = np.empty(radix+1, int)
for i in range((bitlen-1)//8 + 1):
keys = a.view('u1')[i::a.itemsize].astype(int)
a, work = radix_step(radix, keys, _, a, work)
return a
EDIT: Optimized the function a little bit.. see edit history.
One inefficiency of LSB radix sorting like above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one could opt to first do a pass with MSB radix sort, to put items in roughly the right block of RAM, before sorting every resulting group with a LSB radix sort. Here's one implementation:
def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
"""
Parameters
----------
a : Array of non-negative integers to be sorted.
bbits : Number of bits in radix for LSB sorting.
gbits : Number of bits in radix for MSB grouping.
"""
a = a.copy()
bitlen = int(a.max()).bit_length()
work = np.empty_like(a)
# Group values by single iteration of MSB radix sort:
# Casting to np.int_ to get rid of python BigInt
ngroups = np.int_(2**gbits)
group_offset = np.empty(ngroups + 1, int)
shift = max(bitlen-gbits, 0)
a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
bitlen = shift
if not bitlen:
return a
# LSB radix sort each group:
agroups = np.split(a, group_offset[1:-1])
# Mask off high bits to not undo the grouping..
gmask = (1 << shift) - 1
nbatch = (bitlen-1) // bbits + 1
radix = np.int_(2**bbits)
_ = np.empty(radix + 1, int)
for agi in agroups:
if not agi.size:
continue
mask = (radix - 1) & gmask
wgi = work[:agi.size]
for shift in range(0, nbatch*bbits, bbits):
keys = (agi & mask) >> shift
agi, wgi = radix_step(radix, keys, _, agi, wgi)
mask = (mask << bbits) & gmask
if nbatch % 2:
# Copy result back in to `a`
wgi[...] = agi
return a
Timings (with best performing settings for each on my system):
def numba_radix(a, batch_m_bits=8):
a = a.copy()
bit_len = int(a.max()).bit_length()
nbatches = (bit_len-1)//batch_m_bits +1
work = np.zeros_like(a)
bitsums = np.zeros(2**batch_m_bits + 1, int)
srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
return srtd
a = np.random.randint(0, 1e8, 1e6)
%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
#10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
#10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
#10 loops, best of 3: 75.4 ms per loop
Numba performs very well, as expected. And also with some clever application of existing C-extensions it's possible to beat numpy.sort. IMO, at the level of optimization you've already reached, it's worth it to also consider add-ons to Numpy, but I wouldn't really consider the implementations in my answer "vectorized": the bulk of the work is done in an external dedicated function.
One other thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried my implementations were still slower than numpy.sort, so in practice some sort of heuristic would be required to offer good performance across the board.
Can you change this to be a counting / radix sort that works 8 bits at a time? For 32-bit unsigned integers, create a matrix[4][257] of counts of occurrences of byte fields, making one read pass over the array to be sorted. matrix[][0] = 0, matrix[][1] = # of occurrences of 0, ... . Then convert the counts into indices, where matrix[][0] = 0, matrix[][1] = # of bytes == 0, matrix[][2] = # of bytes == 0 + # of bytes == 1, ... . The last count is not used, since that would index the end of the array. Then do 4 passes of radix sort, moving data back and forth between the original array and the output array. Working 16 bits at a time would need a matrix[2][65537], but would only take 2 passes. Example C code:
size_t mIndex[4][257] = {0}; /* index matrix */
size_t i, j, m;
uint32_t u;
uint32_t *pData; /* ptr to original array */
uint32_t *pTemp; /* ptr to working array */
uint32_t *pSrc; /* working ptr */
uint32_t *pDst; /* working ptr */
/* n is size of array */
for(i = 0; i < n; i++){ /* generate histograms */
u = pData[i];
for(j = 0; j < 4; j++){
mIndex[j][1 + (size_t)(u & 0xff)]++; /* note [1 + ... */
u >>= 8;
}
}
for(j = 0; j < 4; j++){ /* convert to indices */
for(i = 1; i < 257; i++){ /* (last count never used) */
mIndex[j][i] += mIndex[j][i-1];
}
}
pDst = pTemp; /* radix sort */
pSrc = pData;
for(j = 0; j < 4; j++){
for(i = 0; i < n; i++){ /* sort pass */
u = pSrc[i];
m = (size_t)(u >> (j<<3)) & 0xff;
/* pDst[mIndex[j][m]++] = u; split into 2 lines */
pDst[mIndex[j][m]] = u;
mIndex[j][m]++;
}
pTemp = pSrc; /* swap ptrs */
pSrc = pDst;
pDst = pTemp;
}