I'm trying to wrap the LAPACK function dgtsv (a solver for tridiagonal systems of equations) using Cython.
I came across this previous answer, but since dgtsv is not one of the LAPACK functions that are wrapped in scipy.linalg I don't think I can use this particular approach. Instead I've been trying to follow this example.
Here's the contents of my lapacke.pxd file:
ctypedef int lapack_int

cdef extern from "lapacke.h" nogil:

    int LAPACK_ROW_MAJOR
    int LAPACK_COL_MAJOR

    lapack_int LAPACKE_dgtsv(int matrix_order,
                             lapack_int n,
                             lapack_int nrhs,
                             double * dl,
                             double * d,
                             double * du,
                             double * b,
                             lapack_int ldb)
...here's my thin Cython wrapper in _solvers.pyx:
#!python
cimport cython
from lapacke cimport *

cpdef TDMA_lapacke(double[::1] DL, double[::1] D, double[::1] DU,
                   double[:, ::1] B):
    cdef:
        lapack_int n = D.shape[0]
        lapack_int nrhs = B.shape[1]
        lapack_int ldb = B.shape[0]
        double * dl = &DL[0]
        double * d = &D[0]
        double * du = &DU[0]
        double * b = &B[0, 0]
        lapack_int info
    info = LAPACKE_dgtsv(LAPACK_ROW_MAJOR, n, nrhs, dl, d, du, b, ldb)
    return info
...and here's a Python wrapper and test script:
import numpy as np
from scipy import sparse
from cymodules import _solvers
def trisolve_lapacke(dl, d, du, b, inplace=False):
    if (dl.shape[0] != du.shape[0] or dl.shape[0] != d.shape[0] - 1
            or b.shape != d.shape):
        raise ValueError('Invalid diagonal shapes')
    if b.ndim == 1:
        # b is (LDB, NRHS)
        b = b[:, None]
    # be sure to force a copy of d and b if we're not solving in place
    if not inplace:
        d = d.copy()
        b = b.copy()
    # this may also force copies if arrays are improperly typed/noncontiguous
    dl, d, du, b = (np.ascontiguousarray(v, dtype=np.float64)
                    for v in (dl, d, du, b))
    # b will now be modified in place to contain the solution
    info = _solvers.TDMA_lapacke(dl, d, du, b)
    print info
    return b.ravel()

def test_trisolve(n=20000):
    dl = np.random.randn(n - 1)
    d = np.random.randn(n)
    du = np.random.randn(n - 1)
    M = sparse.diags((dl, d, du), (-1, 0, 1), format='csc')
    x = np.random.randn(n)
    b = M.dot(x)
    x_hat = trisolve_lapacke(dl, d, du, b)
    print "||x - x_hat|| = ", np.linalg.norm(x - x_hat)
Unfortunately, test_trisolve just segfaults on the call to _solvers.TDMA_lapacke.
I'm pretty sure my setup.py is correct - ldd _solvers.so shows that _solvers.so is being linked to the correct shared libraries at runtime.
I'm not really sure how to proceed from here - any ideas?
A brief update:
for smaller values of n I tend not to get segfaults immediately, but I do get nonsense results (||x - x_hat|| ought to be very close to 0):
In [28]: test_trisolve2.test_trisolve(10)
0
||x - x_hat|| = 6.23202576396
In [29]: test_trisolve2.test_trisolve(10)
-7
||x - x_hat|| = 3.88623414288
In [30]: test_trisolve2.test_trisolve(10)
0
||x - x_hat|| = 2.60190676562
In [31]: test_trisolve2.test_trisolve(10)
0
||x - x_hat|| = 3.86631743386
In [32]: test_trisolve2.test_trisolve(10)
Segmentation fault
Usually LAPACKE_dgtsv returns with code 0 (which should indicate success), but occasionally I get -7, which means that argument 7 (b) had an illegal value. What's happening is that only the first value of b is actually being modified in place. If I keep on calling test_trisolve I will eventually hit a segfault even when n is small.
OK, I figured it out eventually - it seems I've misunderstood what row- and column-major refer to in this case.
Since C-contiguous arrays follow row-major order, I assumed that I ought to specify LAPACK_ROW_MAJOR as the first argument to LAPACKE_dgtsv.
In fact, if I change
info = LAPACKE_dgtsv(LAPACK_ROW_MAJOR, ...)
to
info = LAPACKE_dgtsv(LAPACK_COL_MAJOR, ...)
then my function works:
test_trisolve2.test_trisolve()
0
||x - x_hat|| = 6.67064747632e-12
This seems pretty counter-intuitive to me - can anyone explain why this is the case?
Although rather old, the question still seems relevant.
The observed behavior is the result of a misinterpretation of the parameter LDB:
Fortran arrays are column major, and the leading dimension of the array B corresponds to N. Therefore LDB >= max(1,N).
With row major, LDB corresponds to NRHS instead, so the condition LDB >= max(1,NRHS) must be met.
The comment # b is (LDB, NRHS) is not correct: in row-major storage b is an N x NRHS array with leading dimension LDB = NRHS, so LDB should be 1 in this case.
Switching from LAPACK_ROW_MAJOR to LAPACK_COL_MAJOR fixes the issue as long as NRHS is equal to 1: the memory layout of a column-major (N,1) array is the same as that of a row-major (1,N) array. It will fail, however, if NRHS is greater than 1.
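Concretely, the original row-major call could also have been kept by passing NRHS rather than N as LDB - a minimal, untested sketch against the wrapper shown in the question:

info = LAPACKE_dgtsv(LAPACK_ROW_MAJOR, n, nrhs,
                     dl, d, du, b,
                     nrhs)   # ldb >= NRHS for a C-contiguous (n, nrhs) array B

With LAPACK_COL_MAJOR, the original ldb = B.shape[0] = n satisfies LDB >= max(1,N), but then B itself must be stored column by column (Fortran order) once NRHS > 1.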
What I am trying to do
I am trying to create a very simple function which I want to optimise with numba (or at least verify if numba makes any difference).
I am running numpy 1.19.2 and numba 0.51.2 in an Anaconda installation on Windows.
The function takes 3 numeric inputs: a, b, c; the inputs can be scalars or numpy arrays, and the output will, correspondingly, be a scalar or a numpy array.
The function is fairly simple:
if a == 0 --> it returns np.nan
if b == 0 --> it returns a certain number
otherwise it performs some very simple algebra
The issue
I have come up with the toy example below (my actual formulas are more complex but I can show what I need to show with this easier example).
if the inputs are arrays, it works perfectly
if the inputs are scalar, numba doesn't work (Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2' )
if the inputs are arrays of size 1 (I make an array out of each scalar) numba works again
What I tried / similar questions
The closest question I found was this, but the mismatch there was between an int and a float.
Here it is between an array(int64, 0d, C) and a float64. I can convert my inputs to float but the mismatch remains.
Any ideas? I am not sure what the array and the float being compared are, to be honest.
The one solution I have found is to add a = np.array([a]) at the beginning of the function, but I don't understand why it is needed; plus, it returns an array of size 1, whereas I'd like a scalar returned in these cases.
Toy example
@numba.jit
def my_fun(a,b,c):
    return np.where(a == 0, np.nan,
                    np.where(b == 0, 0, c**2))
a = np.arange(0,11)
b = np.arange(3,14)
b[1] = 0
c = np.arange(10,21)
out_array = my_fun(a,b,c)
out_scalar = my_fun(0,0,1)
The exact warning:
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function my_fun failed at nopython mode lowering due to: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2', defined at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: typing of assignment at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: lowering "$36call_method.17 = call $4load_method.1($10compare_op.4, $14load_attr.6, $34call_method.16, func=$4load_method.1, args=[Var($10compare_op.4, refactor numba.py:8), Var($14load_attr.6, refactor numba.py:8), Var($34call_method.16, refactor numba.py:9)], kws=(), vararg=None)" at D:\MY DATA\USERNAME\Python\scratch scripts\refactor numba.py (8)
@numba.jit
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:177: NumbaWarning: Function "my_fun" was compiled in object mode without forceobj=True.
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaWarning(warn_msg,
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:187: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
For more information visit https://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaDeprecationWarning(msg,
I have found a solution, but it's far from elegant, and I am hoping there is a better one.
To recap, I needed a function which:
works with numba
works with both scalars and arrays
returns scalar (not a one-sized array) when the inputs are scalars, and arrays when the inputs are arrays
I have tried the following, and found option 2 to be the fastest.
my_fun_optimised_1: a function which, without numba, determines whether the inputs are scalar or not, and then calls, accordingly, a sub-function for the scalar case and one for the arrays. Both sub-functions run with numba, but take forever - presumably because the inner functions are redefined on every call of the main function, so numba has to recompile them each time.
my_fun_optimised_2: similar to the above, except the scalar and array functions, both running with numba, are main functions and not subfunctions. Much much faster.
my_fun_non_opt_no_numba : a function which runs without numba.
The results are:
+-------------------------+----------------------------+-----------------------------+
| Function | Array: time vs the fastest | Scalar: time vs the fastest |
+-------------------------+----------------------------+-----------------------------+
| optimised numba 1 | 54,403 | 42,961 |
| optimised numba 2 | 1 | 1 |
| non-optimised, no numba | 3.409 | 4.53892 |
+-------------------------+----------------------------+-----------------------------+
What this means is that, on my PC, the non-optimised, no-numba code takes 4.5 times longer than "optimised numba 2" to run on scalars and 3.4 times longer for arrays.
The "optimised numba 1" is not optimised at all and takes an insane amount of time.
I hope all of this can be of use to other people.
PS I am well aware of the pitfalls of premature optimisation. I am only doing this because I have a specific case where 60% of the time is spent doing a similar (but not identical) calculation to the one shown here.
The code to time the functions is:
import numpy as np
import numba
import timeit
import pandas as pd
def my_fun_optimised_1(a,b,c):

    @numba.jit
    def my_fun_vectorised(a,b,c):
        return np.where(a == 0, np.nan,
                        np.where(b == 0, 0, b*a**3 + a*b**3 + a*b*c**3)
                        )

    @numba.jit
    def my_fun_scalar(a,b,c):
        if a == 0:
            return np.nan
        elif b == 0:
            return np.nan
        else:
            return b*a**3 + a*b**3 + a*b*c**3

    if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
        return my_fun_scalar(a,b,c)
    else:
        return my_fun_vectorised(a,b,c)

def my_fun_optimised_2(a,b,c):
    if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
        return fun_2_scalar(a,b,c)
    else:
        return fun_2_vectorised(a,b,c)

@numba.jit
def fun_2_scalar(a,b,c):
    if a == 0:
        return np.nan
    elif b == 0:
        return np.nan
    else:
        return b*a**3 + a*b**3 + a*b*c**3

@numba.jit
def fun_2_vectorised(a,b,c):
    return np.where(a == 0, np.nan,
                    np.where(b == 0, 0, b*a**3 + a*b**3 + a*b*c**3)
                    )

def my_fun_non_opt_no_numba(a,b,c):
    # multiplying by 1 converts the 0-d array result into a scalar
    return 1 * np.where(a == 0, np.nan,
                        np.where(b == 0, 0, b*a**3 + a*b**3 + a*b*c**3)
                        )

# I couldn't get this to work with Numba
#@numba.jit
def my_fun_non_opt_numba(a,b,c):
    a = np.array([a])
    b = np.array([b])
    c = np.array([c])
    out = np.where(a == 0, np.nan,
                   np.where(b == 0, 0, b*a**3 + a*b**3 + a*b*c**3)
                   )
    return out
r = 4
n = int(100)
a = 3
b = 4
c = 5
x = my_fun_optimised_2(a,b,c)
t_scalar_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_scalar = pd.DataFrame(index = ['min time'])
resdf_scalar['optimised numba 1'] = [min(t_scalar_opt_numba_1)]
resdf_scalar['optimised numba 2'] = [min(t_scalar_opt_numba_2)]
resdf_scalar['non-optimised, no numba'] = [min(t_scalar_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_scalar = resdf_scalar.transpose()
resdf_scalar['diff vs fastest'] = (resdf_scalar / resdf_scalar.min() )
a = np.arange(3,13)
b = np.arange(0,10)
c = np.arange(20,30)
t_array_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_array = pd.DataFrame(index = ['min time'])
resdf_array['optimised numba 1'] = [min(t_array_opt_numba_1)]
resdf_array['optimised numba 2'] = [min(t_array_opt_numba_2)]
resdf_array['non-optimised, no numba'] = [min(t_array_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_array = resdf_array.transpose()
resdf_array['diff vs fastest'] = (resdf_array / resdf_array.min() )
I'm trying to solve a 2D-Ising model with Monte Carlo approach.
As it is slow, I used Cython to accelerate the code execution. I would like to push it even further and parallelize the Cython code. My idea is to split the 2D lattice in two, so that any point on one sub-lattice has its nearest neighbours on the other sub-lattice. This way I can randomly choose one sub-lattice and flip all of its spins, and this can be done in parallel since all those spins are independent.
So far this is my code (inspired by http://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/):
%load_ext Cython
%%cython
cimport cython
cimport numpy as np
import numpy as np
from cython.parallel cimport prange
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_ising_step(np.int64_t[:, :] field, float beta):
    cdef int N = field.shape[0]
    cdef int M = field.shape[1]
    cdef int offset = np.random.randint(0, 2)
    cdef np.int64_t[:,] n_update = np.arange(offset, N, 2, dtype=np.int64)
    cdef int m, n, i, j
    for m in prange(M, nogil=True):
        i = m % 2
        for j in range(n_update.shape[0]):
            n = n_update[j]
            cy_spin_flip(field, (n+i) % N, m % M, beta)
    return np.array(field, dtype=np.int64)

cdef cy_spin_flip(np.int64_t[:, :] field, int n, int m, float beta=0.4, float J=1.0):
    cdef int N = field.shape[0]
    cdef int M = field.shape[1]
    cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
    if dE <= 0:
        field[n,m] *= -1
    elif np.exp(-dE * beta) > np.random.rand():
        field[n,m] *= -1
I tried using a prange constructor, but I'm having a lot of trouble with the GIL. I'm new to Cython and parallel computing, so I could easily have missed something.
The error :
Discarding owned Python object not allowed without gil
Calling gil-requiring function not allowed without gil
Q : "How to use prange in cython?" . . . . + ( an Epilogue on True-[PARALLEL] True-randomness ... )
Short version: best in those and only those places where it gains performance.
Longer version: your problem starts not with avoiding a GIL-lock ownership, but with the Physics & the Performance losses from almost computational anti-patterns, irrespective of all the powers the cython-isation may have ever enabled.
The code as-is attempts to apply a 2D-kernel operator over a whole 2D-domain of the {-1|+1}-spin-field[N,M], best in some fast and smart manner.
The actual result is INCONGRUENT with PHYSICAL FIELD ISING, because a technique of "destructive" self-rewriting of the actual state of field[n_,m] right "during" a current generation of the [PAR][SEQ]-organised coverage of the 2D-domain of the field[:,:] of current spin values sequentially modifies the state of field[i,j], which obviously does not happen in the real world of the recognised Laws of Physics. Computers are ignorant of these rules; we, humans, should prefer not to be.
Next, the prange'd attempt calls ( M * N / 2 )-times a cdef-ed cy_spin_flip() in a way, that might've been easy to code, yet which is immensely inefficient, if not a performance anti-pattern testing canard to ever run this way.
If one benchmarks the costs of invoking about 1E6-calls to a repaired, so as to become congruent with the Laws of Physics, cy_spin_flip() function, one straight sees the costs of per-call overheads start matter, the more when passing them in a prange-d fashion ( isolated, un-coordinated, memory-layout agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost for going into naive prange, instead of attempts to do some vectorised / block-optimised, memory-I/O smarter matrix / kernel processing.
Vectorised code using a 2D-kernel convolution :
A fast sketched, vectorised code, using a trick proposed by a Master of Vectorisation @Divakar, can produce one step per ~ 3k3 [us] without CPU-architecture tuning and further tweaking on spin_2Dstate[200,200]:
The initial state is :
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
                            [ 1, 0, 1 ],
                            [ 0, 1, 0 ]
                            ],
                          dtype = np.int8  # [PERF] to be field-tested,
                          )                # some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger 2D-domains ~ [ > 200, > 200 ] users will observe growing costs due to useless amount of memory-I/O spent on 8-B-rich transfers of a principally binary { -1 | +1 } or even more compact bitmap stored-{ 0 | 1 } spin-information.
Next, instead of double-looping calls on each field[:,:] cell, rather block-update the full 2D-domain in one step; the helper gets:
# needs: from scipy import signal
# T[:,:] * sum(?)
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
                                                          spin_2Dkernel,
                                                          boundary = 'wrap',
                                                          mode     = 'same'
                                                          )[:,:]
Because of the Physics inside the spin-kernel properties, this helper array will consist of only { -4 | -2 | 0 | +2 | +4 } values.
A simplified, fast vector code :
def aVectorisedSpinUpdateSTEPrandom( S = spin_2Dstate,
                                     C = spin_2Dconv,
                                     K = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
For cases where the Physics does not recognise a uniform probability for a spin-flip to happen across the whole 2D-domain at the same value, replace the scalar produced from np.random.rand() with a 2D-field-of-(individualised †)-probabilities delivered from np.random.rand( N, M )[:,:]; this will now add some costs, up to some 7k3 ~ 9k3 [us] per spin-update step:
def aVectorisedSpinUpdateSTEPrand2D( S = spin_2Dstate,
                                     C = spin_2Dconv,
                                     K = spin_2Dkernel,
                                     minus2betaJ = -2 * beta * J
                                     ):
    C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
    S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
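For completeness, a minimal driver sketch tying the fragments above together (it assumes the state, kernel and update-function definitions shown above are run in that order; the N, M, beta, J values below are only the ones used in the question and the timings, and the sweep count is arbitrary):

import numpy as np
from scipy import signal          # needed by the convolve2d-based helpers above

N, M    = 200, 200                # lattice size used for the ~ 3k3 [us] timing
beta, J = 0.4, 1.0                # the question's default values

# ...define spin_2Dstate, spin_2Dconv, spin_2Dkernel and the update function as above, then:
for _ in range( 100 ):            # 100 sweeps, an arbitrary choice
    aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -2 * beta * J )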
wide-screen commented as-was source :
// ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
# https://www.desmos.com/calculator/bgz9t3s3nm
@cython.boundscheck( False ) # https://www.desmos.com/calculator/ttz3r735qy
@cython.wraparound( False )  # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field, # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
float beta # beta: a float-factor
): #
cdef int N = field.shape[0] # const
cdef int M = field.shape[1] # const
cdef int offset = np.random.randint( 0, 2 ) #_GIL-lock # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an inital RANDOM-flip-MODE-choice-{0|1}
cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
cdef int m, n, i, j # idxs{ (E) | [O] }
# #
for m in prange( M, nogil = True ): # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
i = m % 2 # ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
for j in range( n_update.shape[0] ) : # [SEQ] j over ... ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
# n = n_update[j] # n = n_update[j] |||||||||
# cy_spin_flip( field, ( n + i ) % N, m % M, beta ) # |||||||||
# ||||| # INCONGRUENT with PHYSICAL FIELD ISING |||||||||
# vvvvv # self-rewriting field[n_,m]"during" current generation of [PAR][SEQ]-organised coverage of 2D-field[:,:]
pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
# # |||||||||
return np.array( field, dtype = np.int64 ) # ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field, # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
int n, # const int
int m, # const int
float beta = 0.4, # const float ? is a pure positive scalar or can also be negative ?
float J = 1.0 # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
):
cdef int N = field.shape[0] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef int M = field.shape[1] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef float dE = ( 2 * J * field[ n, m ] # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
*( field[( n - 1 ) % N, m ] # | (const) vvvv------------aSureSpinFLIP
+ field[( n + 1 ) % N, m ] # [?]-T[n,m]-[?] sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? ) |::::|
+ field[ n, ( m - 1 ) % M] # | := {-8J |-4J | 0 | 4J | 8J }
+ field[ n, ( m + 1 ) % M] # [?] a T-dependent choice|__if_+T__| |__if_-T__| FLIP #random-scaled by 2*J*beta
)# | | # ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY )
) # | | #
if dE <= 0 : # | | #
field[ n, m ] *= -1 # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy)
elif ( np.exp( -dE * beta ) # | | # [PERF]: with a minusBETA, one MUL uop SAVED * M * N
> np.random.rand() #__________|_____________|__________GIL-lock# [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS???
): # | | # Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
field[ n, m ] *= -1 # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
# # in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The Physics:
The beta-controlled ( given const J ) model of spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 }, which are the only cases for ~ 2 * spin_2Dkernel-convolutions across the whole 2D-domain of the current spin_2Dstate, is available here: https://www.desmos.com/calculator/bgz9t3s3nm. One may live-experiment with beta to see the lowering threshold for either of the possible positive outputs { +4 | +8 }, as np.exp( -dE * 2 * J * beta ) is strongly controlled by beta: the larger the beta, the lower the probability that a randomly drawn number, warranted to be from the semi-closed range [0, 1), will not dominate the np.exp()-result.
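As a quick numerical cross-check of the flip factors actually used by the vectorised step above (a tiny script, assuming beta = 0.4 and J = 1 as in the question's defaults; C takes only the helper-array values listed earlier):

import numpy as np

beta, J = 0.4, 1.0
for C in ( -4, -2, 0, 2, 4 ):                      # the only values C[:,:] can take
    factor = np.exp( C * ( -2 * beta * J ) )       # the quantity compared against np.random.rand()
    print( C, factor, min( 1.0, factor ) )         # last column ~ probability of a flip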
† An Epilogue on a Post-Festum Remark :
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."– Angelo C 7 hours ago
Thanks for all remarks & details on the method and your choices. The "most-(densely)-aggressive" spin updates by a pair of non-"intervening" lattices require a more careful choice of strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance ( a technical issue on its own how to maintain a FSA-state, if resorted to a naive, central PRNG-source ).
You either design your process to be truly randomness-based ( using some of the available sources of indeed non-deterministic entropy ), or you are willing to be sub-ordinated to a policy of allowing repeatable experiments ( for re-inspection & re-validation of scientific computing ), for which you have one more duty - a duty of Configuration Management of such a scientific experiment ( to record / setup / distribute / manage the initial "seeding" of all PRNG-s that the scientific computing experiment is configured to use ).
Here, given that the nature warrants the spins to be mutually independent in the 2D-domain of the field[:,:], the direction of the time-arrow ought to be the only direction in which such (deterministic) PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1). As a side-effect of that, they will cause no problems for a parallelisation of the individual evolution of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way for doing this ( be warned, if not aware of it already, that not all "COTS" PRNG-s have all these properties "built-in" ).
That means, either of the spins will remain fair & congruent with the Laws of Physics if and only if it sources its spin-flip decision threshold from its "own" ( thus congruently autonomous, so as to retain the uniformity of the distribution of outputs ) PRNG-instance ( not a problem, but care is needed not to forget to implement it right & run it efficiently ).
For a case of a need to operate an indeed non-deterministic PRNG, the source of a truly ND-entropy may become a performance bottleneck, if trying to use it beyond its performance ceiling limit. A fight for a nature-like entropy is a challenging task in a domain of (no matter how large, yet still) Finite-State-Automata, isn't it?
From a Cython point-of-view the main problem is that cy_spin_flip requires the GIL. You need to add nogil to the end of its signature, and set the return type to void (since by default it returns a Python object, which requires the GIL).
However, np.exp and np.random.rand also require the GIL, because they're Python function calls. np.exp is probably easily replaced with libc.math.exp. np.random is a bit harder, but there are plenty of suggestions for C- and C++-based approaches: 1 2 3 4 (+ others).
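To make that concrete, a hedged sketch of what the nogil-ified helper could look like (untested; it mirrors the question's cy_spin_flip, swaps np.exp for libc.math.exp and, purely as a placeholder, uses libc.stdlib.rand, which is neither thread-safe nor a good PRNG, so a real run should substitute one of the C/C++ generators mentioned above):

from libc.math cimport exp
from libc.stdlib cimport rand, RAND_MAX

# (assumes cimport numpy as np, as in the question)
cdef void cy_spin_flip(np.int64_t[:, :] field, int n, int m,
                       float beta=0.4, float J=1.0) nogil:
    cdef int N = field.shape[0]
    cdef int M = field.shape[1]
    cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m] + field[(n+1)%N,m]
                                    + field[n,(m-1)%M] + field[n,(m+1)%M])
    if dE <= 0:
        field[n,m] *= -1
    elif exp(-dE * beta) > (rand() / <float>RAND_MAX):   # placeholder PRNG, see caveat above
        field[n,m] *= -1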
A more fundamental problem is the line:
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
You've parallelized this with respect to m (i.e. different values of m are run in different threads), and each iteration changes field. However in this line you are looking up several different values of m. This means the whole thing is a race-condition (the result depends on which order the different threads finish) and suggests your algorithm may be fundamentally unsuitable for parallelization. Or that you should copy field and have field_in and field_out. It isn't obvious to me, but this is something that you should be able to work out.
Edit: it does look like you've given the race condition some thought with using i%2. It isn't obvious to me that this is right though. I think a working implementation of your "alternate cells" scheme would look something like:
for oddeven in range(2):
    for m in prange(M):
        for n in range(N):
            # some mechanism to pick the alternate cells here.
i.e. you need a regular loop to pick the alternate cells outside your parallel loop.
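For illustration, one way that cell-picking mechanism could be filled in (a hedged sketch, not a drop-in replacement for the question's n_update indexing; it assumes cy_spin_flip has been made nogil as described above) is a checkerboard test on the parity of n + m:

cdef int oddeven, m, n
for oddeven in range(2):
    for m in prange(M, nogil=True):
        for n in range(N):
            if (n + m) % 2 == oddeven:        # update one colour of the checkerboard per pass
                cy_spin_flip(field, n, m, beta)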
I am trying to solve a logarithmic equation using Python. I am searching for an irrational number, so I wrote this bisection algorithm:
def racine(x):
    a=0
    b=x/2
    c=(a+b)/2
    while a!=b and a<c<b:
        if c**2<x:
            a=c
        else:
            b=c
        c=(a+b)/2
    return a, b
which seems to work, at least for finding irrational roots. However then I have a more complicated function:
ln(P)=A+B/T+C*ln(T)
where P, A, B and C are known constants. Isolating T gives this:
T = e**((ln(P)-A-B/T)/C)
But this still can't be solved because T is on both sides. Can somebody see the way around it? For now I have this code, which clearly doesn't work.
def temperature(P):
    A=18.19
    B=-23180
    C=-0.8858
    T==e**((log(P)-A-B/T)/C)
    return racine(T)
Thank you!
The answer should be to use the bisection method again.
a = small estimate
fa = f(a)
b = large estimate
fb = f(b)
while( b-a > 1e-12 ) {
    c = (a+b)/2
    fc = f(c)
    if( fabs(fc) < 1e-12 ) return c;
    if( (fc>0) == (fa>0) ) {
        a = c; fa = fc;
    } else {
        b = c; fb = fc;
    }
}
return (a+b)/2
For more efficient methods look up the regula falsi method in its Illinois variant.
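Applied to the temperature equation from the question, a small Python version of this could look like the following (a sketch only: the function names are mine, the constants are the question's, and the starting bracket [100, 5000] is a guess that assumes f changes sign there - adjust it to your units):

from math import log

A, B, C = 18.19, -23180, -0.8858

def f(T, P):
    return A + B/T + C*log(T) - log(P)

def temperature(P, a=100.0, b=5000.0, tol=1e-12):
    fa = f(a, P)                     # f(a) and f(b) must have opposite signs
    while b - a > tol:
        c = (a + b) / 2
        fc = f(c, P)
        if (fc > 0) == (fa > 0):
            a, fa = c, fc
        else:
            b = c
    return (a + b) / 2

# temperature(10) gives roughly 2597.8, matching the scipy-based answer below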
If you have NumPy and SciPy installed, you can find the temperature at a given pressure numerically, for example with scipy.optimize.newton:
import numpy as np
from scipy.optimize import newton
A, B, C = 18.19, -23180, -0.8858
fr = lambda T, lnp: (A + B/T + C*np.log(T)) - lnp
def T(p):
    return newton(fr, 1000, args=(np.log(p),))
In [1]: p1 = 10
In [2]: T1 = T(p1)
In [3]: T1
Out[3]: 2597.8167133280913
In [4]: np.exp(A + B/T1 + C*np.log(T1)) # check
Out[4]: 10.000000000000002
The initial guess value (here 1000) you might have to customize for your usage: I don't know your units.
I am trying to improve this Cython code (which works). Please note that I don't want to use numpy.fromfile, because I want to be able to parse binary structures that are not fixed.
from libc.stdio cimport *
import struct
cpdef inline cimport_td(char* f, double[:] dates, double[:] tpx, int[:] tvo):
    f_b = open(f.replace('\\','/'),'rb').read()
    cdef int B = len(f_b), bb = 0, dd = 0
    while bb < B:
        dates[dd], tpx[dd], tvo[dd] = struct.unpack('ddi', f_b[bb:bb+20])
        bb += 20
        dd += 1
    del f_b
    return dates, tpx, tvo
Is there anything better than open/read and struct unpack ?
Thank you.
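Not a full answer, but one small refinement that keeps the flexible-format approach: pre-compile the format with struct.Struct and use unpack_from, which avoids both slicing f_b and re-parsing 'ddi' on every record (a sketch reusing the question's variable names):

import struct

record = struct.Struct('ddi')                    # compiled once; record.size is 20 here
bb, dd = 0, 0
while bb < B:
    dates[dd], tpx[dd], tvo[dd] = record.unpack_from(f_b, bb)
    bb += record.size
    dd += 1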
There exists one very good linear interpolation method. It performs linear interpolation requiring at most one multiply per output sample. I found its description in a third edition of Understanding DSP by Lyons. This method involves a special hold buffer. Given a number of samples to be inserted between any two input samples, it produces output points using linear interpolation. Here, I have rewritten this algorithm using Python:
temp1, temp2 = 0, 0
iL = 1.0 / L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
where x contains input samples, L is a number of points to be inserted, y will contain output samples.
My question is how to implement such algorithm in ANSI C in a most effective way, e.g. is it possible to avoid the second loop?
NOTE: presented Python code is just to understand how this algorithm works.
UPDATE: here is an example how it works in Python:
from math import sin, pi

x = []
y = []
hold = []
num_points = 20
points_inbetween = 2
temp1, temp2 = 0, 0

for i in range(num_points):
    x.append( sin(i*2.0*pi * 0.1) )

L = points_inbetween
iL = 1.0/L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
Let's say x=[.... 10, 20, 30 ....]. Then, if L=1, it will produce [... 10, 15, 20, 25, 30 ...]
Interpolation in the sense of "signal sample rate increase"
... or as I call it, "upsampling" (wrong term, probably; disclaimer: I have not read Lyons'). I just had to understand what the code does and then re-write it for readability. As given it has a couple of problems:
a) it is inefficient - two loops is OK, but it does a multiplication for every single output item; it also uses intermediary lists (hold) and generates the result with append (small beer)
b) it interpolates the first interval wrong; it generates fake data in front of the first element. Say we have multiplier=5 and seq=[20,30] - it will generate [4,8,12,16,20,22,24,26,28,30] instead of [20,22,24,26,28,30].
So here is the algorithm in form of a generator:
def upsampler(seq, multiplier):
    if seq:
        step = 1.0 / multiplier
        y0 = seq[0];
        yield y0
        for y in seq[1:]:
            dY = (y-y0) * step
            for i in range(multiplier-1):
                y0 += dY;
                yield y0
            y0 = y;
            yield y0
Ok and now for some tests:
>>> list(upsampler([], 3)) # this is just the same as [Y for Y in upsampler([], 3)]
[]
>>> list(upsampler([1], 3))
[1]
>>> list(upsampler([1,2], 3))
[1, 1.3333333333333333, 1.6666666666666665, 2]
>>> from math import sin, pi
>>> seq = [sin(2.0*pi * i/10) for i in range(20)]
>>> seq
[0.0, 0.58778525229247314, 0.95105651629515353, 0.95105651629515364, 0.58778525229247325, 1.2246063538223773e-016, -0.58778525229247303, -0.95105651629515353, -0.95105651629515364, -0.58778525229247336, -2.4492127076447545e-016, 0.58778525229247214, 0.95105651629515353, 0.95105651629515364, 0.58778525229247336, 3.6738190614671318e-016, -0.5877852522924728, -0.95105651629515342, -0.95105651629515375, -0.58778525229247347]
>>> list(upsampler(seq, 2))
[0.0, 0.29389262614623657, 0.58778525229247314, 0.76942088429381328, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247325, 0.29389262614623668, 1.2246063538223773e-016, -0.29389262614623646, -0.58778525229247303, -0.76942088429381328, -0.95105651629515353, -0.95105651629515364, -0.95105651629515364, -0.7694208842938135, -0.58778525229247336, -0.29389262614623679, -2.4492127076447545e-016, 0.29389262614623596, 0.58778525229247214, 0.76942088429381283, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247336, 0.29389262614623685, 3.6738190614671318e-016, -0.29389262614623618, -0.5877852522924728, -0.76942088429381306, -0.95105651629515342, -0.95105651629515364, -0.95105651629515375, -0.76942088429381361, -0.58778525229247347]
And here is my translation to C, fit into Kratz's fn template:
/**
 *
 * @param src     caller supplied array with data
 * @param src_len len of src
 * @param steps   to interpolate
 * @param dst     output param, will be filled with (src_len - 1) * steps + 1 samples
 */
float* linearInterpolation(float* src, int src_len, int steps, float* dst)
{
    float step, y0, dY;
    float *src_end;
    float *dst_start = dst;
    if (src_len > 0) {
        step = 1.0f / steps;
        for (src_end = src + src_len; *dst++ = y0 = *src++, src < src_end; ) {
            dY = (*src - y0) * step;
            /* steps - 1 intermediate points; the next outer write emits *src itself */
            for (int i = steps - 1; i > 0; i--) {
                *dst++ = y0 += dY;
            }
        }
    }
    return dst_start;
}
Please note the C snippet is "typed but never compiled or run", so there might be syntax errors, off-by-1 errors etc. But overall the idea is there.
In that case I think you can avoid the second loop:
def interpolate2(x, L):
    new_list = []
    new_len = (len(x) - 1) * (L + 1)
    for i in range(0, new_len):
        step = i / (L + 1)
        substep = i % (L + 1)
        fr = x[step]
        to = x[step + 1]
        dy = float(to - fr) / float(L + 1)
        y = fr + (dy * substep)
        new_list.append(y)
    new_list.append(x[-1])
    return new_list
print interpolate2([10, 20, 30], 3)
You just calculate the member in the position you want directly. Though that might not be the most efficient way to do it. The only way to be sure is to compile both and see which one is faster.
Well, first of all, your code is broken. L is not defined, and neither is y or x.
Once that is fixed, I run cython on the resulting code:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
And that seemed to work. I haven't tried to compile it, though, and you can also improve the speed a lot by adding different optimizations.
"e.g. is it possible to avoid the second loop?"
If it is, then it's possible in Python too. And I don't see how, although I don't see why you would do it the way you do. First creating a list of length L filled with i-temp1 is completely pointless. Just loop L times:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = i-temp1
    temp1 = i
    for j in range(L):
        temp2 += hold
        y.append(temp2 * iL)
It all seems overcomplicated for what you get out though. What are you trying to do, actually? Interpolate something? (Duh it says so in the title. Sorry about that.)
There are surely easier ways of interpolating.
Update, a much simplified interpolation function:
# A simple list, so it's easy to see that you interpolate.
indata = [float(x) for x in range(0, 110, 10)]
points_inbetween = 3
outdata = [indata[0]]
for point in indata[1:]: # All except the first
    step = (point - outdata[-1]) / (points_inbetween + 1)
    for i in range(points_inbetween + 1):   # the in-between points plus the sample itself
        outdata.append(outdata[-1] + step)
I don't see a way to get rid of the inner loop, nor a reason for wanting to do so.
Converting it to C I'll leave up to someone else, or even better, Cython, as C is a great language if you want to talk to hardware, but otherwise just needlessly difficult.
I think you need the two loops. You have to step over the samples in x to initialize the interpolator, not to mention copy their values into y, and you have to step over the output samples to fill in their values. I suppose you could do one loop to copy x into the appropriate places in y, followed by another loop to use all the values from y, but that will still require some stepping logic. Better to use the nested loop approach.
(And, as Lennart Regebro points out) As a side note, I don't see why you do hold = [i-temp1] * L. Instead, why not do hold = i-temp1, then loop for j in xrange(L): and temp2 += hold? This will use less memory but otherwise behave exactly the same.
Here's my try at a C implementation of your algorithm. Before trying to optimize it further, I'd suggest you profile its performance with all compiler optimizations enabled.
/**
 *
 * @param src     caller supplied array with data
 * @param src_len len of src
 * @param steps   to interpolate
 * @param dst     output param, needs to be of size src_len * steps
 */
float* linearInterpolation(float* src, size_t src_len, size_t steps, float* dst)
{
    float* dst_ptr = dst;
    float* src_ptr = src;
    float stepIncrement = 1.0f / steps;
    float temp1 = 0.0f;
    float temp2 = 0.0f;
    float hold;
    size_t idx_src, idx_steps;

    for(idx_src = 0; idx_src < src_len; ++idx_src)
    {
        hold = *src_ptr - temp1;
        temp1 = *src_ptr;
        ++src_ptr;

        for(idx_steps = 0; idx_steps < steps; ++idx_steps)
        {
            temp2 += hold;
            *dst_ptr = temp2 * stepIncrement;
            ++dst_ptr;
        }
    }
    return dst;   /* return the caller-supplied buffer, now filled */
}
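A small usage sketch for this version (my own example values; dst must hold src_len * steps floats, and the output reproduces the original algorithm's leading ramp-up from 0 before the first sample):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float  src[] = { 10.0f, 20.0f, 30.0f };
    size_t src_len = 3, steps = 2;
    float *dst = malloc(src_len * steps * sizeof *dst);

    linearInterpolation(src, src_len, steps, dst);
    for (size_t i = 0; i < src_len * steps; i++)
        printf("%g ", dst[i]);                   /* prints: 5 10 15 20 25 30 */
    free(dst);
    return 0;
}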
}