Array of complex numbers in PyopenCL

Array of complex numbers in PyopenCL - python

I've been working on a new problem with PyopenCl in which i have to deal with complex numbers. More accurately, it would be really handy to use a 2D numpy array with complex numbers inside.
Something like:
np_array[np_array[C_number, C_number, ..], np_array[C_number, C_number, ..], ...]
Then for the results i would need a simple 1D numpy array of complex numbers.
I've also noticed that pyopencl sees a numpy complex number as a float2, for which i use a float16 for my data data array since I have around 8 numbers to deal with.
To work out the basic operations I've built a simple program.
I've worked out building the initials arrays and sending them to the kernel but the results and different from what i expected. I guess it has something to do with the thread's ID but i can't seem to figure it out.
The code i'm using is the following.
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
MF = cl.mem_flags
M = 3
zero = np.complex64(0.0)
X1_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
X2_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
X3_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
Y1_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
Y2_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
Y3_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
aux_h = np.complex64(1 + 1j*1)
RES_h = np.empty_like(X1_h)
dados_h = []
for i in range(3):
dados_h.append(np.array([X1_h[i], X2_h[i], X3_h[i], Y1_h[i], Y2_h[i], Y3_h[i]]).astype(np.complex64))
dados_h = np.array(dados_h).astype(np.complex64)
print dados_h
aux_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf=aux_h)
dados_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf=dados_h)
RES_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf = RES_h)
Source = """
__kernel void soma(__global float2 *aux, __global float16 *dados, __global float2 *res){
const int gid_x = get_global_id(0);
const int gid_y = get_global_id(1);
res[gid_x].x = dados[gid_y].s0;
res[gid_x].y = dados[gid_y].s1;
}
"""
prg = cl.Program(ctx, Source).build()
completeEvent = prg.soma(queue, (M,), None, aux_d, dados_d, RES_d)
completeEvent.wait()
cl.enqueue_copy(queue, RES_h, RES_d)
print "GPU"
print RES_h
The output I'm getting is the following:
[[ 1.+2.j 1.+2.j 1.+2.j 4.+5.j 4.+5.j 4.+5.j]
[ 2.+3.j 2.+3.j 2.+3.j 5.+6.j 5.+6.j 5.+6.j]
[ 3.+4.j 3.+4.j 3.+4.j 6.+7.j 6.+7.j 6.+7.j]]
GPU
[ 1.+2.j 1.+2.j 1.+2.j]
My expected output is:
[ 1.+2.j 2.+3.j 3.+4.j]
I can't understand how I'm getting that result. As said, i believe it to be something related to the thread IDs but i can't figure it out. If i use gid_x instead of gid_y on the right part of the red[gid_x] = ... I get the following
[ 1.+2.j 2.+3.j 6.+7.j]
Can anyone give me some insight in what I'm doing wrong, please?

You're launching a 1D kernel, so get_global_id(1) will always return 0. This explains why your kernel simply copies the first element of the dados array into each element of the output.
Using a float16 to represent one 'row' of your input only works if you actually have 8 complex numbers per row. In your example you only have 6, which is why you don't quite get the correct results when copying from dados[gid_x].
To allow your code to deal with an arbitrary row size, simply pass the width in as a parameter, and then compute the linear index manually:
__kernel void soma(__global float2 *aux,
__global float2 *dados,
__global float2 *res,
int rowWidth){
const int gid_x = get_global_id(0);
res[gid_x] = dados[gid_x*rowWidth];
}
and then pass the row-width as an extra argument when you launch the kernel:
# Pass your actual row-width instead of 6!
completeEvent = prg.soma(queue, (M,), None, aux_d, dados_d, RES_d, np.int32(6))

Related

Tradingview pinescript's RMA (Moving average used in RSI. It is the exponentially weighted moving average with alpha = 1 / length) in python, pandas

I've been trying to get same results from tradingviews RMA method but I dont know how to accomplish it
In their page RMA is computed as:
plot(rma(close, 15))
//the same on pine
pine_rma(src, length) =>
alpha = 1/length
sum = 0.0
sum := na(sum[1]) ? sma(src, length) : alpha * src + (1 - alpha) * nz(sum[1])
plot(pine_rma(close, 15))
to test I used input and their result, this is input column and the same input after applying tradingview´s rma(input,14):
data = [[588.0,519.9035093599585],
[361.98999999999984,508.62397297710436],
[412.52000000000055,501.7594034787397],
[197.60000000000042,480.0337318016869],
[208.71999999999932,460.6541795301378],
[380.1100000000006,454.90102384941366],
[537.6599999999999,460.8123792887413],
[323.5600000000013,451.0086379109742],
[431.78000000000077,449.6351637744761],
[299.6299999999992,438.9205092191563],
[225.1900000000005,423.65404427493087],
[292.42000000000013,414.28018396957873],
[357.64999999999964,410.23517082889435],
[692.5100000000003,430.3976586268306],
[219.70999999999916,415.34854015348543],
[400.32999999999987,414.2757872853794],
[604.3099999999995,427.849659622138],
[204.29000000000087,411.8811125062711],
[176.26000000000022,395.0510330415374],
[204.1800000000003,381.41738782428473],
[324.0,377.3161458368358],
[231.67000000000007,366.91284970563316],
[184.21000000000092,353.8626461552309],
[483.0,363.08674285842864],
[290.6399999999994,357.911975511398],
[107.10000000000036,339.996834403441],
[179.0,328.49706051748086],
[182.36000000000058,318.05869905194663],
[275.0,314.98307769109323],
[135.70000000000073,302.17714357030087],
[419.59000000000015,310.56377617242225],
[275.6399999999994,308.06922073153487],
[440.48999999999984,317.5278478221396],
[224.0,310.8472872634153],
[548.0100000000001,327.78748103031415],
[257.0,322.73123238529183],
[267.97999999999956,318.82043007205664],
[366.51000000000016,322.2268279240526],
[341.14999999999964,323.57848307233456],
[147.4200000000001,310.9957342814536],
[158.78000000000063,300.12318183277836],
[416.03000000000077,308.4022402732943],
[360.78999999999917,312.14422311091613],
[1330.7299999999996,384.90035003156487],
[506.92000000000013,393.61603931502464],
[307.6100000000006,387.4727507925229],
[296.7299999999996,380.991125735914],
[462.0,386.7774738976345],
[473.8099999999995,392.9940829049463],
[769.4200000000002,419.88164841173585],
[971.4799999999997,459.2815306680404],
[722.1399999999994,478.0571356203232],
[554.9799999999996,483.5516259331572],
[688.5,498.19079550936027],
[292.0,483.462881544406],
[634.9500000000007,494.2833900055199]]
# Create the pandas DataFrame
dfRMA = pd.DataFrame(data, columns = ['input', 'wantedrma'])
dfRMA['try1'] = dfRMA['input'].ewm( alpha=1/14, adjust=False).mean()
dfRMA['try2'] = numpy_ewma_vectorized(dfRMA['input'],14)
dfRMA
ewm does not give me same results, so I searched and found this but I just replicated ewma
def numpy_ewma_vectorized(data, window):
alpha = 1/window
alpha_rev = 1-alpha
scale = 1/alpha_rev
n = data.shape[0]
r = np.arange(n)
scale_arr = scale**r
offset = data[0]*alpha_rev**(r+1)
pw0 = alpha*alpha_rev**(n-1)
mult = data*pw0*scale_arr
cumsums = mult.cumsum()
out = offset + cumsums*scale_arr[::-1]
return out
I am getting these results
Do you know how to translate pinescript rma method in pandas?
I realized that using pandas ewm it seems to converge, last rows are closer and closer to the value, is this correct?
...

As far as I know Pine-script will use data that is not exported, so the weighted mean is taking into account previous records that are not in your table, meaning that you can't reproduce the results without more information.
What you need to do is load around 50-100 points (depending on alpha) of data further into the past than what you actually will use, and use a threshold for the comparing the data. You need that both python and pine-script is using data with the same or at least similar "history".
So you make the calculations using the whole dataframe and then you skip the first rows. You can see the effect of the historical data in your own example as difference between your calculation and pine-script one is quickly vanishing after the 55 point, but of course the difference will also depend on alpha.
So actually your code could be already well written. In any case you can use the pandas implementation directly, it will be easier and probably faster.

const cloneArray = (input) => [...input]
const pluck = (input, key) => input.map(element => element[key])
const pineSma = (source, length) => {
let sum = 0.0
for (let i = 0; i < length; ++i) {
sum += source[i] / length
}
return sum
}
const pineRma = (src, length, last) => {
const alpha = 1.0 / length
return alpha * src[0] + (1.0 - alpha) * last
}
const calculatePineRma = (candles, sourceKey, length) => {
const results = []
for (let i = 0; i <= candles.length - length; ++i) {
const sourceCandles = cloneArray(candles.slice(i, i + length)).reverse()
const closes = pluck(sourceCandles, sourceKey)
if (i === 0) {
results.push(pineSma(closes, length))
continue
}
results.push(pineRma(closes, length, results[results.length - 1]))
}
return results
}

How to use prange in cython?

I'm trying to solve a 2D-Ising model with Monte Carlo approach.
As it is slow I used Cython to accelerate the code execution. I would like to push it even further and parallelize the Cython code. My idea is to split the 2D-lattice in two, so for any point on a lattice has it's nearest neigbours on the other lattice. This way I can randomly choose one lattice and I can flip all the spins and this could be done in parallel since all those spins are independent.
So far this is my code :( inspired from http://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/ )
%load_ext Cython
%%cython
cimport cython
cimport numpy as np
import numpy as np
from cython.parallel cimport prange
#cython.boundscheck(False)
#cython.wraparound(False)
def cy_ising_step(np.int64_t[:, :] field,float beta):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef int offset = np.random.randint(0,2)
cdef np.int64_t[:,] n_update = np.arange(offset,N,2,dtype=np.int64)
cdef int m,n,i,j
for m in prange(M,nogil=True):
i = m % 2
for j in range(n_update.shape[0]) :
n = n_update[j]
cy_spin_flip(field,(n+i) %N,m%M,beta)
return np.array(field,dtype=np.int64)
cdef cy_spin_flip(np.int64_t[:, :] field,int n,int m, float beta=0.4,float J=1.0):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
if dE <= 0 :
field[n,m] *= -1
elif np.exp(-dE * beta) > np.random.rand():
field[n,m] *= -1
I tried using a prange-constructor but I'm having a lots of troubles with GIL-lock. I'am new to Cython and parallel computing so I could easily have missed something.
The error :
Discarding owned Python object not allowed without gil
Calling gil-requiring function not allowed without gil

Q : "How to use prange in cython?" . . . . + ( an Epilogue on True-[PARALLEL] True-randomness ... )
Short version : best in those and only those places, where performance gains.
Longer version :Your problem starts not with avoiding a GIL-lock ownership, but with the Physics & the Performance losses from almost computational anti-patterns, irrespective of all the powers the cython-isation may have ever enabled.
The code as-is attempts to apply a 2D-kernel operator over a whole 2D-domain of the {-1|+1}-spin-field[N,M], best in some fast and smart manner.
The actual result is INCONGRUENT with PHYSICAL FIELD ISING, because a technique of "destructive"-self-rewriting the actual-state of the field[n_,m] right "during" a current generation of [PAR][SEQ]-organised coverage of the 2D-domain of the field[:,:] of current spin values sequentially modifies the state of the field[i,j], which obviously does not happen in the real-world of the recognised Laws of Physics. Computers are ignorant of these rules, we, humans, should prefer not to.
Next, the prange'd attempt calls ( M * N / 2 )-times a cdef-ed cy_spin_flip() in a way, that might've been easy to code, yet which is immensely inefficient, if not a performance anti-pattern testing canard to ever run this way.
If one benchmarks the costs of invoking about 1E6-calls to a repaired, so as to become congruent with the Laws of Physics, cy_spin_flip() function, one straight sees the costs of per-call overheads start matter, the more when passing them in a prange-d fashion ( isolated, un-coordinated, memory-layout agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost for going into naive prange, instead of attempts to do some vectorised / block-optimised, memory-I/O smarter matrix / kernel processing.
Vectorised code using a 2D-kernel convolution :
A fast sketched, vectorised code, using a trick proposed by a Master of Vectorisation #Divakar, can produce one step per ~ 3k3 [us] without CPU-architecture tuning and further tweaking on spin_2Dstate[200,200] :
The initial state is :
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
[ 1, 0, 1 ],
[ 0, 1, 0 ]
],
dtype = np.int8 # [PERF] to be field-tested,
) # some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger 2D-domains ~ [ > 200, > 200 ] users will observe growing costs due to useless amount of memory-I/O spent on 8-B-rich transfers of a principally binary { -1 | +1 } or even more compact bitmap stored-{ 0 | 1 } spin-information.
Next, instead of double-looping calls on each field[:,:]-cell, rather block-update the full 2D-domain in one step, the helpers get:
# T[:,:] * sum(?)
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
spin_kernel,
boundary = 'wrap',
mode = 'same'
)[:,:]
Because of the Physics inside the spin-kernel properties,this helper array will consist of only { -4 | -2 | 0 | +2 | +4 } values.
A simplified, fast vector code :
def aVectorisedSpinUpdateSTEPrandom( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
For cases where the Physics does not recognise a uniform probability for spin-flip to happen across the whole 2D-domain at a same value, replace a scalar produced from the np.random.rand() with a 2D-field-of-(individualised † )-probabilities delivered from np.random.rand( N, M )[:,:] and this will now add some costs up to some 7k3 ~ 9k3 [us] per a spin update step :
def aVectorisedSpinUpdateSTEPrand2D( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
wide-screen commented as-was source :
// ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
# https://www.desmos.com/calculator/bgz9t3s3nm
#cython.boundscheck( False ) # https://www.desmos.com/calculator/ttz3r735qy
#cython.wraparound( False ) # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field, # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
float beta # beta: a float-factor
): #
cdef int N = field.shape[0] # const
cdef int M = field.shape[1] # const
cdef int offset = np.random.randint( 0, 2 ) #_GIL-lock # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an inital RANDOM-flip-MODE-choice-{0|1}
cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
cdef int m, n, i, j # idxs{ (E) | [O] }
# #
for m in prange( M, nogil = True ): # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
i = m % 2 # ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
for j in range( n_update.shape[0] ) : # [SEQ] j over ... ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
# n = n_update[j] # n = n_update[j] |||||||||
# cy_spin_flip( field, ( n + i ) % N, m % M, beta ) # |||||||||
# ||||| # INCONGRUENT with PHYSICAL FIELD ISING |||||||||
# vvvvv # self-rewriting field[n_,m]"during" current generation of [PAR][SEQ]-organised coverage of 2D-field[:,:]
pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
# # |||||||||
return np.array( field, dtype = np.int64 ) # ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field, # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
int n, # const int
int m, # const int
float beta = 0.4, # const float ? is a pure positive scalar or can also be negative ?
float J = 1.0 # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
):
cdef int N = field.shape[0] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef int M = field.shape[1] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef float dE = ( 2 * J * field[ n, m ] # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
*( field[( n - 1 ) % N, m ] # | (const) vvvv------------aSureSpinFLIP
+ field[( n + 1 ) % N, m ] # [?]-T[n,m]-[?] sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? ) |::::|
+ field[ n, ( m - 1 ) % M] # | := {-8J |-4J | 0 | 4J | 8J }
+ field[ n, ( m + 1 ) % M] # [?] a T-dependent choice|__if_+T__| |__if_-T__| FLIP #random-scaled by 2*J*beta
)# | | # ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a 2D-field TOPOLOGY )
) # | | #
if dE <= 0 : # | | #
field[ n, m ] *= -1 # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or numpy-efficient block-wise XOR MASK) (2D-requires more efforts for best cache-eff'cy)
elif ( np.exp( -dE * beta ) # | | # [PERF]: with a minusBETA, one MUL uop SAVED * M * N
> np.random.rand() #__________|_____________|__________GIL-lock# [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS???
): # | | # Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
field[ n, m ] *= -1 # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
# # in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The Physics:
The beta-controlled ( given const J ) model of spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 } which are the only cases for ~ 2 * spin_2Dkernel-convolutions across the whole 2D-domain of the current spin_2Dstate, is available here : https://www.desmos.com/calculator/bgz9t3s3nm one may live-experiment with beta to see the lowering threshold for either of possible positive outputs { + 4 | + 8 }, as np.exp( -dE * 2 * J * beta ) is strongly controlled by beta and the larger the beta the lower the probability a randomly drawn number, warranted to be from a semi-closed range [0, 1) will not dominate the np.exp()-result.
† An Epilogue on a Post-Festum Remark :
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of thems are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."– Angelo C 7 hours ago
Thanks for all remarks & details on method and your choices. The "most-(densely)-aggressive" spin updates by a pair of non-"intervening" lattices requires the more careful choice of strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance ( a technical issue on its own how to maintain a FSA-state, if resorted to a naive, central PRNG-source ).
You either design your process to be truly a randomness based ( using some of the available sources of indeed non-deterministic entropy ), or willing to be sub-ordinated to a policy to allowing repeatable experiments ( for re-inspection & re-validation of scientific computing ), for which you have one more duty - a duty of Configuration Management of such scientific experiment ( to record / setup / distribute / manage the initial "seeding" of all PRNG-s, that the scientific computing experiment is configured to use.
Here, given the nature warrants the spins to be mutually independent in the 2D-domain of the field[:,:], the direction of the time-arrow ought be the only direction, in which such (deterministic)-PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1). As a side-effect of that, they will cause no problems for a parallelisation of their individual evolution of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way for doing this ( be warned, if not aware of already, not all "COTS" PRNG-s have all these properties "built-in" ).
That means, either of the spins will remain fair & congruent with the Laws of Physics if and only if it sources a spin-flip decision treshhold from its "own" (thus congruently autonomous to retain the uniformity of distribution of outputs) PRNG-instance (not a problem, but a care is needed not to forget it implement right & run it efficiently).
For a case of a need to operate an indeed non-deterministic PRNG, the source of a truly ND-entropy may become a performance bottleneck, if trying to use it beyond its performance ceiling limit. A fight for a nature-like entropy is a challenging task in a domain of (no matter how large, yet still) Finite-State-Automata, isn't it?

From a Cython point-of-view the main problem is that cy_spin_flip requires the GIL. You need to add nogil to the end of its signature, and set the return type to void (since by default it returns a Python object, which requires the GIL).
However, np.exp and np.random.rand also require the GIL, because they're Python function calls. np.exp is probably easily replaced with libc.math.exp. np.random is a bit harder, but there's plenty of suggestions for C- and C++-based approaches: 1 2 3 4 (+ others).
A more fundamental problem is the line:
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
You've parallelized this with respect to m (i.e. different values of m are run in different threads), and each iteration changes field. However in this line you are looking up several different values of m. This means the whole thing is a race-condition (the result depends on which order the different threads finish) and suggests your algorithm may be fundamentally unsuitable for parallelization. Or that you should copy field and have field_in and field_out. It isn't obvious to me, but this is something that you should be able to work out.
Edit: it does look like you've given the race condition some thought with using i%2. It isn't obvious to me that this is right though. I think a working implementation of your "alternate cells" scheme would look something like:
for oddeven in range(2):
for m in prange(M):
for n in range(N):
# some mechanism to pick the alternate cells here.
i.e. you need a regular loop to pick the alternate cells outside your parallel loop.

Binary reading with python gives unexpected results

I'm trying to read some binary files with python for my analysis generated with Zemax OpticStudio. The structure of the file is supposed to be the following:
2 x 32-bit integer as header
n chunks of data
Each chunk is made by
32-bit integer indicating the number of C struc that come after
m C structures
The structures' definition is the following:
typedef struct
{
unsigned int status;
int level;
int hit_object;
int hit_face;
int unused;
int in_object;
int parent;
int storage;
int xybin, lmbin;
double index, starting_phase;
double x, y, z;
double l, m, n;
double nx, ny, nz;
double path_to, intensity;
double phase_of, phase_at;
double exr, exi, eyr, eyi, ezr, ezi;
}
which has a size of 208 bytes, for your convenience.
Here is the code that I wrote with some research and a couple of brilliant answers from here.
from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE
import struct
def little_endian_int(x):
return int.from_bytes(x,'little')
def file_byte_iterator(path):
"""iterator over lazily loaded file
"""
path = Path(path)
with path.open('rb') as file:
reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
file_iterator = iter(reader, bytes())
for chunk in file_iterator:
yield from chunk
def ray_tell(rays_idcs:list,ray_idx:int,seg_idx:int):
idx = rays_idcs[ray_idx][0]
idx += 4 + 208*seg_idx
return idx
def read_header(bytearr:bytearray):
version = int.from_bytes(bytearr[0:4],'little')
zrd_format = version//10000
version = version%10000
num_seg_max = int.from_bytes(bytearr[4:8],'little')
return zrd_format,version,num_seg_max
def rays_indices(bytearr:bytearray):
index=8
rays=[]
while index <len(bytearr):
num_seg = int.from_bytes(bytearr[index:index+4],'little')
rays.append((index,num_seg))
index = index+4 + 208*num_seg
return rays
def read_ray(bytearr:bytearray,ray):
ray_idx,num_seg = ray
data = []
ray_idx = ray_idx + 4
seg_idx=0
for ray_idx in range(8,8+num_seg*208,208):
offsets = [0,4,8,12,16,20,24,28,32,36,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200]
int_vars = offsets[0:11]
doubl_vars = offsets[11:]
data_integ = [bytearr[ray_idx+offset:ray_idx+offset+4] for offset in int_vars]
data_doubl = [bytearr[ray_idx+offset:ray_idx+offset+8] for offset in doubl_vars]
data.append([seg_idx,data_integ,data_doubl])
seg_idx += 1
return data
file="test_uncompressed.ZRD"
raypath = {}
filebin = bytearray(file_byte_iterator(file))
header = read_header(filebin)
print(header)
rays_idcs = rays_indices(filebin)
rays = []
for ray in rays_idcs:
rays.append(read_ray(filebin,ray))
ray = rays[1] #Random ray
segm = ray[2] #Random segm
ints = segm[1]
doub = segm[2]
print("integer vars:")
for x in ints:
print(x,little_endian_int(x))
print("double vars:")
for x in doub:
print(x,struct.unpack('<d',x))
I have verified that all of the structures have the right size and number of chunks and structures (my reading matches the number of segments and rays that I read with Zemax, ) , and thanks to the header, I verified the endianness of the file (little endian).
My output is the following:
(0, 2002)
bytearray(b'\x1f\xd8\x9c?') 1067243551
bytearray(b'\x06\x80\x00\x00') 32774
bytearray(b'\x02\x00\x00\x00') 2
bytearray(b'\x11\x00\x00\x00') 17
bytearray(b'\x02\x00\x00\x00') 2
bytearray(b'\x00\x00\x00\x00') 0
bytearray(b'\x11\x00\x00\x00') 17
bytearray(b'\x01\x00\x00\x00') 1
bytearray(b'\x00\x00\x00\x00') 0
bytearray(b'\x00\x00\x00\x00') 0
double vars:
bytearray(b'\x00\x00\x00\x00# \xac\xe8') (-1.6425098109028998e+196,)
bytearray(b'\xe8\xe3\xf9?\x00\x00\x00\x00') (5.3030112e-315,)
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00') (0.0,)
bytearray(b'\x00\x00\x00\x00p_\xb4\xec') (-4.389425605765071e+215,)
bytearray(b'5\xe3\x9d\xbf\xf0\xbd"\xa2') (-3.001836066957746e-144,)
bytearray(b'z"\xc0?\x00\x00\x00\x00') (5.28431047e-315,)
bytearray(b'\x00\x00\x00\x00 \xc9+\xa3') (-2.9165705864036956e-139,)
bytearray(b'g\xd4\xcd?\x9ch{ ') (3.2707669223572687e-152,)
bytearray(b'q\x1e\xef?\x00\x00\x00\x00') (5.299523535e-315,)
bytearray(b'\x00\x00\x00\x00%\x0c\xb4A') (336340224.0,)
bytearray(b'\t\xf2u\xbf\\3L\xe6') (-5.991371249309652e+184,)
bytearray(b'\xe1\xff\xef\xbf1\x8dV\x1e') (1.5664573023148095e-162,)
bytearray(b'\xa1\xe9\xe8?\x9c\x9a6\xfc') (-2.202825582975923e+290,)
bytearray(b'qV\xb9?\x00\x00\x00\x00') (5.28210966e-315,)
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00') (0.0,)
bytearray(b'\x00\x00\x00\x00\xc6\xfd\x0c\xa1') (-1.7713316840526727e-149,)
bytearray(b'\x96\x94\x8d?\xad\xf9(\xcc') (-7.838624888507203e+58,)
bytearray(b'yN\xb2\xbff.\\\x1a') (1.0611651097687064e-181,)
bytearray(b'\xb9*\xae?\xac\xaf\xe5\xe1') (-3.90257774261585e+163,)
bytearray(b'c\xab\xd2\xbf\xccQ\x8bj') (1.7130904564012918e+205,)
bytearray(b'\xc8\xea\x8c\xbf\xdf\xdc\xe49') (8.22891935818188e-30,)
I'm reading correctly just the int values. I don't understand why I get those binaries for all the other variables
EDIT
I want to highlight that bytearrays contain non-hexadecimal digits, and I'm sure that binary files are not corrupted, since I can read those in zemax

Solved.
It was just an error in my pointer arithmetic in the read_ray function. Thanks to Mad Physicist for his suggestion to unpack the whole structure which put me in the right direction.
def read_ray(bytearr:bytearray,ray):
ray_idx,num_seg = ray
data = []
assert num_seg==little_endian_int(bytearr[ray_idx:ray_idx+4])
ray_idx = ray_idx + 4
for seg_ptr in range(ray_idx,ray_idx + num_seg*208,208):
...
data_integ = [bytearr[seg_ptr+offset:seg_ptr+offset+4] for offset in int_vars]
data_doubl = [bytearr[seg_ptr+offset:seg_ptr+offset+8] for offset in doubl_vars]
...
return data

Matlab to Python conversion - Can't assign to function call

I have recently been trying to convert a piece of Matlab code into Python code.
I have made most of the changes that I need to however, the issue I am having is the line where it says:
y(index(m)) = 1-x(index(m));
I get the error:
"Can't assign to function call"
However I am not sure how to restructure it in order to remove this error.
I have had a look around and people mention "get item" and "set item" however I have tried to use them, but I can't get them to work (probably because I can't figure out the structure)
Here is the full code:
import numpy
N = 100;
B = N+1;
M = 5e4;
burnin = M;
Niter = 20;
p = ones(B,Niter+1)/B;
hit = zeros(B,1);
for j in range(1,Niter):
x = double(rand(1,N)>0.5);
bin_x = 1+sum(x);
index = ceil(N*rand(1,M+burnin));
acceptval = rand(1,M+burnin);
for m in range(1,M+burnin):
y = x;
y(index(m)) = 1-x(index(m));
bin_y = 1+sum(y);
alpha = min(1, p(bin_x,j)/p(bin_y,j) );
if acceptval(m)<alpha:
x = y; bin_x = bin_y;
end
if m > burnin: hit(bin_x) = hit(bin_x)+1; end
end
pnew = p[:,j];
for b in range(1,B-1):
if (hit(b+1)*hit(b) == 0):
pnew(b+1) = pnew(b)*(p(b+1,j)/p(b,j));
else:
g(b,j) = hit(b+1)*hit(b) / (hit(b+1)+hit(b));
g_hat(b) = g(b,j)/sum(g(b,arange(1,j)));
pnew(b+1) = pnew(b)*(p(b+1,j)/p(b,j))+((hit(b+1)/hit(b))^g_hat(b));
end
end
p[:,j+1] = pnew/sum(pnew);
hit[:] = 0;
end
Thanks in advance

The round brackets () indicate a function. For indexing you need [] square brackets - but that is only the first of many, many errors... I am currently going through line by line, but it's taking a while.
This code at least runs... you need to figure out whether the indexing is doing what you are expecting since Python arrays are indexed from zero, and Matlab arrays start at 1. I tried to fix that in a couple of places but didn't go through line by line - that's debugging.
Some key learnings:
There is no end statement... just stop indenting
When you import a library, you need to reference it (numpy.zeros, not zeros)
Lists are indexed from zero, not one
Indexing is done with [], not ()
Creating an array of random numbers is done with [random.random() for r in xrange(N)], not random(N).
... and many other things you will find as you look through the code below.
Good luck!
import numpy
import random
N = int(100);
B = N+1;
M = 5e4;
burnin = M;
Niter = 20;
p = numpy.ones([B,Niter+1])/B;
hit = numpy.zeros([B,1]);
g = numpy.zeros([B, Niter]);
b_hat = numpy.zeros(B);
for j in range(1,Niter):
x = [float(random.randint(0,1)>0.5) for r in xrange(N)];
bin_x = 1+sum(x);
index = [random.randint(0,N-1) for r in xrange(int(M+burnin))];
#acceptval = rand(1,M+burnin);
acceptval = [random.random() for r in xrange(int(M+burnin))];
for m in range(1,int(M+burnin)):
y = x;
y[index[m]] = 1-x[index[m]];
bin_y = 1+sum(y);
alpha = min(1, p[bin_x,j]/p[bin_y,j] );
if acceptval[m]<alpha:
x = y; bin_x = bin_y;
if m > burnin:
hit[bin_x] = hit[bin_x]+1;
pnew = p[:,j];
for b in range(1,B-1):
if (hit[b+1]*hit[b] == 0):
pnew[b+1] = pnew[b]*(p[b+1,j]/p[b,j]);
else:
g[b,j] = hit[b+1]*hit[b] / [hit[b+1]+hit[b]];
g_hat[b] = g[b,j]/sum(g[b,numpy.arange(1,j)]);
pnew[b+1] = pnew[b]*(p[b+1,j]/p[b,j])+((hit[b+1]/hit[b])^g_hat[b]);
p[:,j+1] = pnew/sum(pnew);
hit[:] = 0;

Linear Interpolation. How to implement this algorithm in C ? (Python version is given)

There exists one very good linear interpolation method. It performs linear interpolation requiring at most one multiply per output sample. I found its description in a third edition of Understanding DSP by Lyons. This method involves a special hold buffer. Given a number of samples to be inserted between any two input samples, it produces output points using linear interpolation. Here, I have rewritten this algorithm using Python:
temp1, temp2 = 0, 0
iL = 1.0 / L
for i in x:
hold = [i-temp1] * L
temp1 = i
for j in hold:
temp2 += j
y.append(temp2 *iL)
where x contains input samples, L is a number of points to be inserted, y will contain output samples.
My question is how to implement such algorithm in ANSI C in a most effective way, e.g. is it possible to avoid the second loop?
NOTE: presented Python code is just to understand how this algorithm works.
UPDATE: here is an example how it works in Python:
x=[]
y=[]
hold=[]
num_points=20
points_inbetween = 2
temp1,temp2=0,0
for i in range(num_points):
x.append( sin(i*2.0*pi * 0.1) )
L = points_inbetween
iL = 1.0/L
for i in x:
hold = [i-temp1] * L
temp1 = i
for j in hold:
temp2 += j
y.append(temp2 * iL)
Let's say x=[.... 10, 20, 30 ....]. Then, if L=1, it will produce [... 10, 15, 20, 25, 30 ...]

Interpolation in the sense of "signal sample rate increase"
... or i call it, "upsampling" (wrong term, probably. disclaimer: i have not read Lyons'). I just had to understand what the code does and then re-write it for readability. As given it has couple of problems:
a) it is inefficient - two loops is ok but it does multiplication for every single output item; also it uses intermediary lists(hold), generates result with append (small beer)
b) it interpolates wrong the first interval; it generates fake data in front of the first element. Say we have multiplier=5 and seq=[20,30] - it will generate [0,4,8,12,16,20,22,24,28,30] instead of [20,22,24,26,28,30].
So here is the algorithm in form of a generator:
def upsampler(seq, multiplier):
if seq:
step = 1.0 / multiplier
y0 = seq[0];
yield y0
for y in seq[1:]:
dY = (y-y0) * step
for i in range(multiplier-1):
y0 += dY;
yield y0
y0 = y;
yield y0
Ok and now for some tests:
>>> list(upsampler([], 3)) # this is just the same as [Y for Y in upsampler([], 3)]
[]
>>> list(upsampler([1], 3))
[1]
>>> list(upsampler([1,2], 3))
[1, 1.3333333333333333, 1.6666666666666665, 2]
>>> from math import sin, pi
>>> seq = [sin(2.0*pi * i/10) for i in range(20)]
>>> seq
[0.0, 0.58778525229247314, 0.95105651629515353, 0.95105651629515364, 0.58778525229247325, 1.2246063538223773e-016, -0.58778525229247303, -0.95105651629515353, -0.95105651629515364, -0.58778525229247336, -2.4492127076447545e-016, 0.58778525229247214, 0.95105651629515353, 0.95105651629515364, 0.58778525229247336, 3.6738190614671318e-016, -0.5877852522924728, -0.95105651629515342, -0.95105651629515375, -0.58778525229247347]
>>> list(upsampler(seq, 2))
[0.0, 0.29389262614623657, 0.58778525229247314, 0.76942088429381328, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247325, 0.29389262614623668, 1.2246063538223773e-016, -0.29389262614623646, -0.58778525229247303, -0.76942088429381328, -0.95105651629515353, -0.95105651629515364, -0.95105651629515364, -0.7694208842938135, -0.58778525229247336, -0.29389262614623679, -2.4492127076447545e-016, 0.29389262614623596, 0.58778525229247214, 0.76942088429381283, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247336, 0.29389262614623685, 3.6738190614671318e-016, -0.29389262614623618, -0.5877852522924728, -0.76942088429381306, -0.95105651629515342, -0.95105651629515364, -0.95105651629515375, -0.76942088429381361, -0.58778525229247347]
And here is my translation to C, fit into Kratz's fn template:
/**
*
* #param src caller supplied array with data
* #param src_len len of src
* #param steps to interpolate
* #param dst output param will be filled with (src_len - 1) * steps + 1 samples
*/
float* linearInterpolation(float* src, int src_len, int steps, float* dst)
{
float step, y0, dy;
float *src_end;
if (src_len > 0) {
step = 1.0 / steps;
for (src_end = src+src_len; *dst++ = y0 = *src++, src < src_end; ) {
dY = (*src - y0) * step;
for (int i=steps; i>0; i--) {
*dst++ = y0 += dY;
}
}
}
}
Please note the C snippet is "typed but never compiled or run", so there might be syntax errors, off-by-1 errors etc. But overall the idea is there.

In that case I think you can avoid the second loop:
def interpolate2(x, L):
new_list = []
new_len = (len(x) - 1) * (L + 1)
for i in range(0, new_len):
step = i / (L + 1)
substep = i % (L + 1)
fr = x[step]
to = x[step + 1]
dy = float(to - fr) / float(L + 1)
y = fr + (dy * substep)
new_list.append(y)
new_list.append(x[-1])
return new_list
print interpolate2([10, 20, 30], 3)
you just calculate the member in the position you want directly. Though - that might not be the most efficient to do it. The only way to be sure is to compile it and see which one is faster.

Well, first of all, your code is broken. L is not defined, and neither is y or x.
Once that is fixed, I run cython on the resulting code:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
hold = [i-temp1] * L
temp1 = i
for j in hold:
temp2 += j
y.append(temp2 *iL)
And that seemed to work. I haven't tried to compile it, though, and you can also improve the speed a lot by adding different optimizations.
"e.g. is it possible to avoid the second loop?"
If it is, then it's possible in Python too. And I don't see how, although I don't see why you would do it the way you do. First creating a list of L length of i-temp is completely pointless. Just loop L times:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
hold = i-temp1
temp1 = i
for j in range(L):
temp2 += hold
y.append(temp2 *iL)
It all seems overcomplicated for what you get out though. What are you trying to do, actually? Interpolate something? (Duh it says so in the title. Sorry about that.)
There are surely easier ways of interpolating.
Update, a much simplified interpolation function:
# A simple list, so it's easy to see that you interpolate.
indata = [float(x) for x in range(0, 110, 10)]
points_inbetween = 3
outdata = [indata[0]]
for point in indata[1:]: # All except the first
step = (point - outdata[-1]) / (points_inbetween + 1)
for i in range(points_inbetween):
outdata.append(outdata[-1] + step)
I don't see a way to get rid of the inner loop, nor a reason for wanting to do so.
Converting it to C I'll leave up to someone else, or even better, Cython, as C is a great langauge of you want to talk to hardware, but otherwise just needlessly difficult.

I think you need the two loops. You have to step over the samples in x to initialize the interpolator, not to mention copy their values into y, and you have to step over the output samples to fill in their values. I suppose you could do one loop to copy x into the appropriate places in y, followed by another loop to use all the values from y, but that will still require some stepping logic. Better to use the nested loop approach.
(And, as Lennart Regebro points out) As a side note, I don't see why you do hold = [i-temp1] * L. Instead, why not do hold = i-temp, and then loop for j in xrange(L): and temp2 += hold? This will use less memory but otherwise behave exactly the same.

Heres my try at a C implementation for your algorithm. Before trying to further optimize it id suggest you profile its performance with all compiler optimizations enabled.
/**
*
* #param src caller supplied array with data
* #param src_len len of src
* #param steps to interpolate
* #param dst output param needs to be of size src_len * steps
*/
float* linearInterpolation(float* src, size_t src_len, size_t steps, float* dst)
{
float* dst_ptr = dst;
float* src_ptr = src;
float stepIncrement = 1.0f / steps;
float temp1 = 0.0f;
float temp2 = 0.0f;
float hold;
size_t idx_src, idx_steps;
for(idx_src = 0; idx_src < src_len; ++idx_src)
{
hold = *src_ptr - temp1;
temp1 = *src_ptr;
++src_ptr;
for(idx_steps = 0; idx_steps < steps; ++idx_steps)
{
temp2 += hold;
*dst_ptr = temp2 * stepIncrement;
++dst_ptr;
}
}
}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Array of complex numbers in PyopenCL - python

Related

Tradingview pinescript's RMA (Moving average used in RSI. It is the exponentially weighted moving average with alpha = 1 / length) in python, pandas

How to use prange in cython?

Binary reading with python gives unexpected results

Matlab to Python conversion - Can't assign to function call

Linear Interpolation. How to implement this algorithm in C ? (Python version is given)

Categories

Resources