What am I doing wrong in my R to Python translation? - python

I have a code in R which implements the Metropolis Hastings algorithm :
trials <- 100000
sim <- numeric(trials)
sim[1] <- 2
for (i in 2:trials) {
  old <- sim[i-1]
  prop <- runif(1, 0, 5)
  acc <- (exp(-(prop-1)^2/2) + exp(-(prop-4)^2/2)) /
         (exp(-(old-1)^2/2) + exp(-(old-4)^2/2))
  if (runif(1) < acc)
    sim[i] <- prop
  else
    sim[i] <- old
}
mean(sim)
var(sim)
and the results are right.
But when I translate it into Python, the results are different.
trials = 100000
sim = np.repeat(0, trials+1)
sim[0] = 2
for i in range(2, trials):
    old = sim[i-1]
    prop = np.random.uniform(0, 5, 1)
    acc = (np.exp(-(prop-1)**2/2) + np.exp(-(prop-4)**2/2)) / (np.exp(-(old-1)**2/2) + np.exp(-(old-4)**2/2))
    if np.random.uniform(1) < acc:
        sim[i] = prop
    else:
        sim[i] = old
Why? What am I doing wrong here?

Firstly, you want to start your Python for loop at i=1, i.e. range(1, trials), since Python indexing starts at 0.
Secondly, in if np.random.uniform(1) you are just producing 1.0 (the argument is read as low=1, and high defaults to 1.0), so that needs to change, e.g. to np.random.uniform(min, max, size=1), or np.random.uniform(size=1) if you just want a uniform number between 0 and 1. Have a look at the documentation for np.random.uniform if this isn't clear.
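For example (a quick illustration of the three calls; nothing here is specific to the question's data):
import numpy as np

np.random.uniform(1)         # read as low=1, high=1.0, so this always returns 1.0
np.random.uniform(0, 5, 1)   # one draw from U(0, 5), as in the proposal step
np.random.uniform(size=1)    # one draw from U(0, 1), for the accept/reject test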
Update
Thirdly, you are unknowingly casting to integers (np.repeat(0, trials) gives an integer array), so this needs to be handled as well. When you are used to R this is easy to forget (I just did it myself). I have refactored your code below, and this solution should give you a result similar to what you see in R.
Here I turned sim into a float vector, and I made sure to subtract and divide using floats inside the exp-functions. Hope this works for you.
import numpy as np

trials = 100000
sim = np.repeat(0, trials).astype(np.float64)
sim[0] = 2.0
for i in range(1, trials):
    old = sim[i-1]
    prop = np.random.uniform(0, 5, 1)
    acc = (np.exp(-(prop-1.0)**2/2.0) + np.exp(-(prop-4.0)**2/2.0)) \
          / (np.exp(-(old-1.0)**2/2.0) + np.exp(-(old-4.0)**2/2.0))
    if np.random.uniform(size=1) < acc:
        sim[i] = prop
    else:
        sim[i] = old
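One more thing when comparing against the R output: R's var() divides by n-1, while np.var() divides by n by default, so pass ddof=1 when checking. A small sketch, using the sim array from the refactored code above:
print(np.mean(sim))          # by the symmetry of the target on [0, 5], roughly 2.5
print(np.var(sim, ddof=1))   # ddof=1 matches R's var(), which divides by n-1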

Related

Convert Pine Script into Python

I am just wondering how to convert the PineScript dev() function into Python code. Is my interpretation correct?
The Pine Script example is the following:
plot(dev(close, 10))
// the same on pine
pine_dev(source, length) =>
    mean = sma(source, length)
    sum = 0.0
    for i = 0 to length - 1
        val = source[i]
        sum := sum + abs(val - mean)
    dev = sum/length
plot(pine_dev(close, 10))
My Python code is the following:
df["SMA_highest"] = ta.sma(df["Close"], 10)
df["dev_abs_highest"] = (df["Close"] - df["SMA_highest"]).abs()
df["dev_cumsum_highest"] = df["dev_abs_highest"].rolling(window=10).sum()
df["DEV_SMA_highest"] = df["dev_cumsum_highest"] / 10
What do I need to adjust in the Python code to have the same result as in the Pine Script?
Thanks for any hints :)
I was looking for the same script too and did not find a ready-to-go solution, so I implemented it myself. Unfortunately I did not test it completely, because the stock prices between yfinance and TradingView differ a little bit, so the result differs a little bit too.
def pine_dev(column):
    summ = 0.0
    mean = column.mean()
    length = len(column)
    for i in range(0, length):
        summ = summ + abs(column[i] - mean)
    ret_val = summ / length
    return ret_val

# raw=True passes each window as a plain ndarray, so column[i] indexes positionally
diffavg = stock[columnname].rolling(days).apply(pine_dev, raw=True)
Basically I use the rolling function, and if you apply a function to it, you get all the values from the rolling timeframe inside that function.
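Note that the question's own attempt subtracts the SMA at each row, so every element of a window gets compared against a different mean, while Pine's dev() compares all elements of a window against that window's own mean. Staying with the question's DataFrame, the same rolling mean absolute deviation can also be written directly; this is just a sketch, assuming a df with a Close column as in the question:
import numpy as np

# dev(close, 10): for each 10-bar window, the mean absolute deviation around that window's own mean
df["DEV"] = df["Close"].rolling(10).apply(lambda w: np.abs(w - w.mean()).mean(), raw=True)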

percentiles pandas vs. scala where is the bug?

For a list of numbers
val numbers = Seq(0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205)
python / pandas computes the following percentiles:
25% 0.167289
50% 0.348107
75% 0.692389
However, scala returns:
calcPercentiles(Seq(.25, .5, .75), sortedNumber.toArray)
25% 0.1601818278168149
50% 0.3481071101229365
75% 0.7182103704579226
The numbers are almost matching - but different. How can I get rid of the difference (and most likely fix a bug in my Scala code)?
val sortedNumber = numbers.sorted
import scala.collection.mutable

case class PercentileResult(percentile: Double, value: Double)

// https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
def calculatePercentile(arr: Array[Double], p: Double) = {
  // +1 so that the .5 == mean for even number of elements.
  val f = (arr.length + 1) * p
  val i = f.toInt
  if (i == 0) arr.head
  else if (i >= arr.length) arr.last
  else {
    arr(i - 1) + (f - i) * (arr(i) - arr(i - 1))
  }
}

def calcPercentiles(percentiles: Seq[Double], arr: Array[Double]): Array[PercentileResult] = {
  val results = new mutable.ListBuffer[PercentileResult]
  percentiles.foreach(p => {
    val r = PercentileResult(percentile = p, value = calculatePercentile(arr, p))
    results.append(r)
  })
  results.toArray
}
python:
import pandas as pd
df = pd.DataFrame({'foo':[0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205]})
display(df.head())
df.describe()
After a bit of trial and error I wrote this code, which returns the same results as pandas (using linear interpolation, as this is the pandas default):
def calculatePercentile(numbers: Seq[Double], p: Double): Double = {
  // interpolate only - no special handling of the case when rank is integer
  val rank = (numbers.size - 1) * p
  val i = numbers(math.floor(rank).toInt)
  val j = numbers(math.ceil(rank).toInt)
  val fraction = rank - math.floor(rank)
  i + (j - i) * fraction
}
From that I would say that the error was here:
(arr.length + 1) * p
The percentile at 0 should map to index 0, and the percentile at 100% should map to the maximal index.
So for numbers (.size == 21) that would be indices 0 and 20. However, for 100% the formula gives an index value of 22 - bigger than the size of the array! If not for these guard clauses:
else if (i >= arr.length) arr.last
you would get an error, and you could suspect that something is wrong. Perhaps the authors of the code:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
used a different definition of percentile... (?) or they might simply have a bug. I cannot tell.
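A quick way to see where the difference comes from (a sketch in Python, just comparing the two rank conventions for the question's 21 sorted values at p = 0.25):
n, p = 21, 0.25

# pandas/numpy "linear" convention: rank on (n - 1)
print((n - 1) * p)   # 5.0 -> exactly the element at index 5, 0.167289..., which pandas reports
# breeze-style convention: rank on (n + 1)
print((n + 1) * p)   # 5.5 -> halfway between indices 4 and 5, i.e. 0.160182..., the Scala result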
BTW: This:
def calcPercentiles(percentiles:Seq[Double], arr: Array[Double]): Array[PercentileResult]
could be written much more simply like this:
def calcPercentiles(percentiles: Seq[Double], numbers: Seq[Double]): Seq[PercentileResult] =
  percentiles.map { p =>
    PercentileResult(p, calculatePercentile(numbers, p))
  }

Metropolis-Hastings accept-reject implementation

I've been reading about the Metropolis-Hastings (MH) algorithm. Theoretically, I understood how the algorithm works. Now, I am trying to implement the MH algorithm using python.
I came across the following notebook. It suits my problem exactly, since I want to fit my data with a straight line while taking into consideration the measurement errors on my data. I am going to paste the code I am having difficulty understanding:
# initial m, b
m, b = 2, 0
# step sizes
mstep, bstep = 0.1, 10.
# how many steps?
nsteps = 10000
chain = []
probs = []
naccept = 0
print 'Running MH for', nsteps, 'steps'
# First point:
L_old = straight_line_log_likelihood(x, y, sigmay, m, b)
p_old = straight_line_log_prior(m, b)
prob_old = np.exp(L_old + p_old)
for i in range(nsteps):
    # step
    mnew = m + np.random.normal() * mstep
    bnew = b + np.random.normal() * bstep
    # evaluate probabilities
    # prob_new = straight_line_posterior(x, y, sigmay, mnew, bnew)
    L_new = straight_line_log_likelihood(x, y, sigmay, mnew, bnew)
    p_new = straight_line_log_prior(mnew, bnew)
    prob_new = np.exp(L_new + p_new)
    if (prob_new / prob_old > np.random.uniform()):
        # accept
        m = mnew
        b = bnew
        L_old = L_new
        p_old = p_new
        prob_old = prob_new
        naccept += 1
    else:
        # Stay where we are; m,b stay the same, and we append them
        # to the chain below.
        pass
    chain.append((b,m))
    probs.append((L_old,p_old))
print 'Acceptance fraction:', naccept/float(nsteps)
The code is simple and easy, but I am having difficulty understanding how MH is being implemented.
My question is about the chain.append (the third line from the bottom). The author is appending m and b whether they were accepted or rejected. Why? Shouldn't he append only the accepted points?
The following R code demonstrates why it is important to capture the rejected case:
# 20 samples from 0 or 1. 1 has an 80% probability of being chosen.
the.population <- sample(c(0, 1), 20, replace = TRUE, prob = c(0.2, 0.8))

# Create a new sample that only catches changes
the.sample <- c(the.population[1])

# Loop through the.population,
# but only copy the.population to the.sample if the value changes
for (i in 2:length(the.population))
{
  if (the.population[i] != the.population[i-1])
    the.sample <- append(the.sample, the.population[i])
}
When this code runs, the.population gets 20 values, for example:
0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1
The probability of a 1 in this population is 16/20 or 0.8. Exactly the probability we expected...
The sample, on the other hand, which only records changes, looks like this:
0 1 0 1 0 1
The probability of a 1 in the sample is 3/6 or 0.5.
We are trying to build a distribution; rejecting the new values means that the old values are more likely than the new values, and that needs to be captured so our distribution is correct.
From a quick reading of the algorithm description: When a candidate is rejected, it still counts as a step, but the value is the same as the old step. I.e. b, m are appended either way, but they only get updated (to bnew, mnew) in the case where the candidate is accepted.
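A minimal sketch that makes the same point numerically (not the notebook's code; a toy sampler for a standard normal target, with hypothetical variable names): keeping the repeated state on rejection gives the right spread, while keeping only the accepted proposals does not.
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.exp(-0.5 * x**2)      # unnormalised N(0, 1) density

x = 0.0
full_chain, accepted_only = [], []
for _ in range(100000):
    prop = x + rng.normal(scale=2.0)        # symmetric random-walk proposal
    if rng.uniform() < target(prop) / target(x):
        x = prop
        accepted_only.append(x)             # only the accepted proposals
    full_chain.append(x)                    # the current state, accepted or not

print(np.var(full_chain))      # close to 1, the variance of the target
print(np.var(accepted_only))   # biased, because the repeats were dropped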

trimmed/winsorized standard deviation

What's an efficient way to calculate a trimmed or winsorized standard deviation of a list?
I don't mind using numpy, but if I have to make a separate copy of the list, it's going to be quite slow.
This will make two copies, but you should give it a try because it should be very fast.
import numpy as np

def trimmed_std(data, low, high):
    tmp = np.asarray(data)
    return tmp[(low <= tmp) & (tmp < high)].std()
Do you need to do rank order trimming (ie 5% trimmed)?
Update:
If you need percentile trimming, the best way I can think of is to sort the data first. Something like this should work:
def trimmed_std(data, percentile):
    data = np.array(data)
    data.sort()
    percentile = percentile / 2.
    low = int(percentile * len(data))
    high = int((1. - percentile) * len(data))
    return data[low:high].std(ddof=0)
You can obviously implement this without using numpy, but even including the time of converting the list to an array, using numpy is faster than anything I could think of.
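For example (a quick usage sketch; passing 0.1 drops the lowest and highest 5% of the sorted values):
import numpy as np

data = np.random.normal(size=10000)
print(trimmed_std(data, 0.1))   # standard deviation of the middle 90% of the data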
This is what generator functions are for.
SD requires two passes, plus a count. For this reason, you'll need to "tee" some iterators over the base collection.
So.
import itertools
import math

trimmed = (x for x in the_list if low <= x < high)
sum_iter, len_iter, var_iter = itertools.tee(trimmed, 3)
# note: tee has to buffer items until every tee'd iterator has consumed them
n = sum(1 for x in len_iter)
mean = sum(sum_iter) / n
sd = math.sqrt(sum((x - mean)**2 for x in var_iter) / (n - 1))
Something like that might do what you want without copying anything.
In order to get an unbiased trimmed mean you have to account for fractional bits of items in the list as described here and (a little less directly) here. I wrote a function to do it:
from math import modf

def percent_tmean(data, pcent):
    # make sure data is a list
    dc = list(data)
    # find the number of items
    n = len(dc)
    # sort the list
    dc.sort()
    # get the proportion to trim
    p = pcent / 100.0
    k = n * p
    # print "n = %i\np = %.3f\nk = %.3f" % ( n, p, k )
    # get the decimal and integer parts of k
    dec_part, int_part = modf(k)
    # get an index we can use
    index = int(int_part)
    # trim down the list
    dc = dc[index: index * -1]
    # deal with the case of trimming fractional items
    if dec_part != 0.0:
        # deal with the first remaining item
        dc[0] = dc[0] * (1 - dec_part)
        # deal with the last remaining item
        dc[-1] = dc[-1] * (1 - dec_part)
    return sum(dc) / (n - 2.0 * k)
I also made an iPython Notebook that demonstrates it.
My function will probably be slower than those already posted but it will give unbiased results.
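The question also mentions winsorizing, which none of the answers above cover. A sketch of that with NumPy: clip the values at the chosen percentiles instead of dropping them (the 5/95 cut-offs below are just an example):
import numpy as np

def winsorized_std(data, percentile=0.05, ddof=0):
    arr = np.asarray(data, dtype=float)
    lo, hi = np.percentile(arr, [100 * percentile, 100 * (1 - percentile)])
    return np.clip(arr, lo, hi).std(ddof=ddof)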

Linear Interpolation. How to implement this algorithm in C ? (Python version is given)

There exists one very good linear interpolation method. It performs linear interpolation requiring at most one multiply per output sample. I found its description in the third edition of Understanding DSP by Lyons. This method involves a special hold buffer. Given a number of samples to be inserted between any two input samples, it produces output points using linear interpolation. Here, I have rewritten this algorithm using Python:
temp1, temp2 = 0, 0
iL = 1.0 / L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
where x contains the input samples, L is the number of points to be inserted, and y will contain the output samples.
My question is how to implement this algorithm in ANSI C in the most efficient way, e.g. is it possible to avoid the second loop?
NOTE: the presented Python code is just to show how this algorithm works.
UPDATE: here is an example of how it works in Python:
from math import sin, pi

x = []
y = []
hold = []
num_points = 20
points_inbetween = 2

temp1, temp2 = 0, 0

for i in range(num_points):
    x.append(sin(i*2.0*pi * 0.1))

L = points_inbetween
iL = 1.0 / L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
Let's say x=[.... 10, 20, 30 ....]. Then, if L=1, it will produce [... 10, 15, 20, 25, 30 ...]
Interpolation in the sense of "signal sample rate increase"
... or as I'd call it, "upsampling" (wrong term, probably; disclaimer: I have not read Lyons'). I just had to understand what the code does and then re-write it for readability. As given it has a couple of problems:
a) it is inefficient - two loops are ok, but it does a multiplication for every single output item; it also uses an intermediary list (hold) and builds the result with append (small beer)
b) it interpolates the first interval wrong; it generates fake data in front of the first element. Say we have multiplier=5 and seq=[20,30] - it will generate [4, 8, 12, 16, 20, 22, 24, 26, 28, 30] instead of [20, 22, 24, 26, 28, 30].
So here is the algorithm in form of a generator:
def upsampler(seq, multiplier):
    if seq:
        step = 1.0 / multiplier
        y0 = seq[0]
        yield y0
        for y in seq[1:]:
            dY = (y - y0) * step
            for i in range(multiplier - 1):
                y0 += dY
                yield y0
            y0 = y
            yield y0
Ok and now for some tests:
>>> list(upsampler([], 3)) # this is just the same as [Y for Y in upsampler([], 3)]
[]
>>> list(upsampler([1], 3))
[1]
>>> list(upsampler([1,2], 3))
[1, 1.3333333333333333, 1.6666666666666665, 2]
>>> from math import sin, pi
>>> seq = [sin(2.0*pi * i/10) for i in range(20)]
>>> seq
[0.0, 0.58778525229247314, 0.95105651629515353, 0.95105651629515364, 0.58778525229247325, 1.2246063538223773e-016, -0.58778525229247303, -0.95105651629515353, -0.95105651629515364, -0.58778525229247336, -2.4492127076447545e-016, 0.58778525229247214, 0.95105651629515353, 0.95105651629515364, 0.58778525229247336, 3.6738190614671318e-016, -0.5877852522924728, -0.95105651629515342, -0.95105651629515375, -0.58778525229247347]
>>> list(upsampler(seq, 2))
[0.0, 0.29389262614623657, 0.58778525229247314, 0.76942088429381328, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247325, 0.29389262614623668, 1.2246063538223773e-016, -0.29389262614623646, -0.58778525229247303, -0.76942088429381328, -0.95105651629515353, -0.95105651629515364, -0.95105651629515364, -0.7694208842938135, -0.58778525229247336, -0.29389262614623679, -2.4492127076447545e-016, 0.29389262614623596, 0.58778525229247214, 0.76942088429381283, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247336, 0.29389262614623685, 3.6738190614671318e-016, -0.29389262614623618, -0.5877852522924728, -0.76942088429381306, -0.95105651629515342, -0.95105651629515364, -0.95105651629515375, -0.76942088429381361, -0.58778525229247347]
And here is my translation to C, fit into Kratz's fn template:
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param will be filled with (src_len - 1) * steps + 1 samples
 */
float* linearInterpolation(float* src, int src_len, int steps, float* dst)
{
    float step, y0, dY;
    float *src_end;
    float *dst_start = dst;   /* remember the start of dst so it can be returned */
    if (src_len > 0) {
        step = 1.0 / steps;
        for (src_end = src + src_len; *dst++ = y0 = *src++, src < src_end; ) {
            dY = (*src - y0) * step;
            /* steps - 1 intermediate points; the next source sample is written by the outer loop */
            for (int i = steps - 1; i > 0; i--) {
                *dst++ = y0 += dY;
            }
        }
    }
    return dst_start;
}
Please note the C snippet is "typed but never compiled or run", so there might be syntax errors, off-by-1 errors etc. But overall the idea is there.
In that case I think you can avoid the second loop:
def interpolate2(x, L):
    new_list = []
    new_len = (len(x) - 1) * (L + 1)
    for i in range(0, new_len):
        step = i // (L + 1)       # integer division: index of the left-hand source point
        substep = i % (L + 1)
        fr = x[step]
        to = x[step + 1]
        dy = float(to - fr) / float(L + 1)
        y = fr + (dy * substep)
        new_list.append(y)
    new_list.append(x[-1])
    return new_list

print interpolate2([10, 20, 30], 3)
you just calculate the member in the position you want directly. Though - that might not be the most efficient way to do it. The only way to be sure is to compile it and see which one is faster.
Well, first of all, your code is broken. L is not defined, and neither is y or x.
Once that was fixed, I ran Cython on the resulting code:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
And that seemed to work. I haven't tried to compile it, though, and you can also improve the speed a lot by adding different optimizations.
"e.g. is it possible to avoid the second loop?"
If it is, then it's possible in Python too. And I don't see how, although I don't see why you would do it the way you do. First, creating a list of length L filled with i-temp1 is completely pointless. Just loop L times:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = i-temp1
    temp1 = i
    for j in range(L):
        temp2 += hold
        y.append(temp2 * iL)
It all seems overcomplicated for what you get out though. What are you trying to do, actually? Interpolate something? (Duh it says so in the title. Sorry about that.)
There are surely easier ways of interpolating.
Update, a much simplified interpolation function:
# A simple list, so it's easy to see that you interpolate.
indata = [float(x) for x in range(0, 110, 10)]
points_inbetween = 3

outdata = [indata[0]]
for point in indata[1:]:  # All except the first
    step = (point - outdata[-1]) / (points_inbetween + 1)
    for i in range(points_inbetween):
        outdata.append(outdata[-1] + step)
    outdata.append(point)  # close each interval on the input point itself
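With that final append in place, the output stays anchored on the input samples; a quick check for the indata above:
print(outdata[:9])   # [0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0]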
I don't see a way to get rid of the inner loop, nor a reason for wanting to do so.
Converting it to C I'll leave up to someone else, or even better, Cython, as C is a great language if you want to talk to hardware, but otherwise just needlessly difficult.
I think you need the two loops. You have to step over the samples in x to initialize the interpolator, not to mention copy their values into y, and you have to step over the output samples to fill in their values. I suppose you could do one loop to copy x into the appropriate places in y, followed by another loop to use all the values from y, but that will still require some stepping logic. Better to use the nested loop approach.
(And, as Lennart Regebro points out) As a side note, I don't see why you do hold = [i-temp1] * L. Instead, why not do hold = i-temp1, and then loop for j in xrange(L): and temp2 += hold? This will use less memory but otherwise behave exactly the same.
Here's my try at a C implementation of your algorithm. Before trying to optimize it further, I'd suggest you profile its performance with all compiler optimizations enabled.
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param needs to be of size src_len * steps
 */
float* linearInterpolation(float* src, size_t src_len, size_t steps, float* dst)
{
    float* dst_ptr = dst;
    float* src_ptr = src;
    float stepIncrement = 1.0f / steps;
    float temp1 = 0.0f;
    float temp2 = 0.0f;
    float hold;
    size_t idx_src, idx_steps;

    for (idx_src = 0; idx_src < src_len; ++idx_src)
    {
        hold = *src_ptr - temp1;
        temp1 = *src_ptr;
        ++src_ptr;

        for (idx_steps = 0; idx_steps < steps; ++idx_steps)
        {
            temp2 += hold;
            *dst_ptr = temp2 * stepIncrement;
            ++dst_ptr;
        }
    }
    return dst;
}
