percentiles pandas vs. scala where is the bug? - python

For a list of numbers
val numbers = Seq(0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205)
python / pandas computes the following percentiles:
25% 0.167289
50% 0.348107
75% 0.692389
However, scala returns:
calcPercentiles(Seq(.25, .5, .75), sortedNumber.toArray)
25% 0.1601818278168149
50% 0.3481071101229365
75% 0.7182103704579226
The numbers are almost matching - but different. How can I get rid of the difference (and most likely fix a bug in my Scala code)?
val sortedNumber = numbers.sorted
import scala.collection.mutable
case class PercentileResult(percentile: Double, value: Double)
// https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
def calculatePercentile(arr: Array[Double], p: Double) = {
  // +1 so that the .5 == mean for even number of elements.
  val f = (arr.length + 1) * p
  val i = f.toInt
  if (i == 0) arr.head
  else if (i >= arr.length) arr.last
  else {
    arr(i - 1) + (f - i) * (arr(i) - arr(i - 1))
  }
}
def calcPercentiles(percentiles: Seq[Double], arr: Array[Double]): Array[PercentileResult] = {
  val results = new mutable.ListBuffer[PercentileResult]
  percentiles.foreach(p => {
    val r = PercentileResult(percentile = p, value = calculatePercentile(arr, p))
    results.append(r)
  })
  results.toArray
}
python:
import pandas as pd
df = pd.DataFrame({'foo':[0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205]})
display(df.head())
df.describe()
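For reference, the same three percentiles can be read off directly with the standard pandas quantile API (same df as above; the default interpolation is linear):
df['foo'].quantile([0.25, 0.5, 0.75])  # matches the 25% / 50% / 75% rows of describe()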

After a bit of trial and error I wrote this code, which returns the same results as pandas (using linear interpolation, since that is the pandas default):
def calculatePercentile(numbers: Seq[Double], p: Double): Double = {
  // interpolate only - no special handling of the case when the rank is an integer
  val rank = (numbers.size - 1) * p
  val i = numbers(math.floor(rank).toInt)
  val j = numbers(math.ceil(rank).toInt)
  val fraction = rank - math.floor(rank)
  i + (j - i) * fraction
}
From that I would say that the error was here:
(arr.length + 1) * p
A percentile of 0 should map to index 0, and a percentile of 100% should map to the maximal index.
So for these numbers (.size == 21) that would be indices 0 and 20. However, for 100% you would get an index value of 22 - bigger than the size of the array! If not for this guard clause:
else if (i >= arr.length) arr.last
you would get an error and could suspect that something is wrong. Perhaps the authors of the code:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
used a different definition of percentile... (?) or they might simply have a bug. I cannot tell.
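One plausible explanation (my assumption, based on the (arr.length + 1) * p formula): Breeze's rank corresponds to the "exclusive" quantile convention, which NumPy exposes as method="weibull", while pandas defaults to "linear". A quick cross-check sketch, assuming the 21 values above are in a Python list numbers (the method= keyword needs NumPy >= 1.22; older versions spell it interpolation=):
import numpy as np
print(np.percentile(numbers, [25, 50, 75]))                    # linear (pandas default): 0.167289, 0.348107, 0.692389
print(np.percentile(numbers, [25, 50, 75], method="weibull"))  # (n+1)*p convention: 0.160182, 0.348107, 0.718210
So the Scala numbers are not necessarily a bug; they just follow a different percentile definition.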
BTW: this:
def calcPercentiles(percentiles: Seq[Double], arr: Array[Double]): Array[PercentileResult]
could be written much more simply like this:
def calcPercentiles(percentiles: Seq[Double], numbers: Seq[Double]): Seq[PercentileResult] =
  percentiles.map { p =>
    PercentileResult(p, calculatePercentile(numbers, p))
  }

Related

How to change the code from Scala to Python

def computeTotalVariationDistance(p: Distribution, q: Distribution): Double = {
  val pSum = p.sum
  val qSum = q.sum
  val l1Distance = p.zip(q)
    .map { case (pVal, qVal) =>
      math.abs((pVal / pSum) - (qVal / qSum))
    }
    .sum
  0.5 * l1Distance
}
Can someone help me change this code into Python?
It's actually relatively straightforward. Instead of map, you can use a list comprehension:
l1Distance = sum(
    [abs((pVal / pSum) - (qVal / qSum))
     for pVal, qVal in zip(p, q)]
)
I have not tested this but it should work, or something very similar.
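For completeness, a minimal sketch of the whole function (assuming p and q are plain sequences of numeric weights):
def compute_total_variation_distance(p, q):
    # Normalize each sequence to a probability distribution,
    # then take half of the L1 distance between them.
    p_sum = sum(p)
    q_sum = sum(q)
    l1_distance = sum(
        abs(p_val / p_sum - q_val / q_sum)
        for p_val, q_val in zip(p, q)
    )
    return 0.5 * l1_distance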

Tradingview pinescript's RMA (Moving average used in RSI. It is the exponentially weighted moving average with alpha = 1 / length) in python, pandas

I've been trying to get the same results as TradingView's RMA method, but I don't know how to accomplish it.
On their page RMA is computed as:
plot(rma(close, 15))
// the same on pine
pine_rma(src, length) =>
    alpha = 1/length
    sum = 0.0
    sum := na(sum[1]) ? sma(src, length) : alpha * src + (1 - alpha) * nz(sum[1])
plot(pine_rma(close, 15))
To test, I used an input series and their result; this is the input column and the same input after applying TradingView's rma(input, 14):
data = [[588.0,519.9035093599585],
[361.98999999999984,508.62397297710436],
[412.52000000000055,501.7594034787397],
[197.60000000000042,480.0337318016869],
[208.71999999999932,460.6541795301378],
[380.1100000000006,454.90102384941366],
[537.6599999999999,460.8123792887413],
[323.5600000000013,451.0086379109742],
[431.78000000000077,449.6351637744761],
[299.6299999999992,438.9205092191563],
[225.1900000000005,423.65404427493087],
[292.42000000000013,414.28018396957873],
[357.64999999999964,410.23517082889435],
[692.5100000000003,430.3976586268306],
[219.70999999999916,415.34854015348543],
[400.32999999999987,414.2757872853794],
[604.3099999999995,427.849659622138],
[204.29000000000087,411.8811125062711],
[176.26000000000022,395.0510330415374],
[204.1800000000003,381.41738782428473],
[324.0,377.3161458368358],
[231.67000000000007,366.91284970563316],
[184.21000000000092,353.8626461552309],
[483.0,363.08674285842864],
[290.6399999999994,357.911975511398],
[107.10000000000036,339.996834403441],
[179.0,328.49706051748086],
[182.36000000000058,318.05869905194663],
[275.0,314.98307769109323],
[135.70000000000073,302.17714357030087],
[419.59000000000015,310.56377617242225],
[275.6399999999994,308.06922073153487],
[440.48999999999984,317.5278478221396],
[224.0,310.8472872634153],
[548.0100000000001,327.78748103031415],
[257.0,322.73123238529183],
[267.97999999999956,318.82043007205664],
[366.51000000000016,322.2268279240526],
[341.14999999999964,323.57848307233456],
[147.4200000000001,310.9957342814536],
[158.78000000000063,300.12318183277836],
[416.03000000000077,308.4022402732943],
[360.78999999999917,312.14422311091613],
[1330.7299999999996,384.90035003156487],
[506.92000000000013,393.61603931502464],
[307.6100000000006,387.4727507925229],
[296.7299999999996,380.991125735914],
[462.0,386.7774738976345],
[473.8099999999995,392.9940829049463],
[769.4200000000002,419.88164841173585],
[971.4799999999997,459.2815306680404],
[722.1399999999994,478.0571356203232],
[554.9799999999996,483.5516259331572],
[688.5,498.19079550936027],
[292.0,483.462881544406],
[634.9500000000007,494.2833900055199]]
import pandas as pd

# Create the pandas DataFrame
dfRMA = pd.DataFrame(data, columns=['input', 'wantedrma'])
dfRMA['try1'] = dfRMA['input'].ewm(alpha=1/14, adjust=False).mean()
dfRMA['try2'] = numpy_ewma_vectorized(dfRMA['input'], 14)
dfRMA
ewm does not give me the same results, so I searched and found the function below, but it just replicates ewm:
import numpy as np

def numpy_ewma_vectorized(data, window):
    alpha = 1/window
    alpha_rev = 1 - alpha
    scale = 1/alpha_rev
    n = data.shape[0]
    r = np.arange(n)
    scale_arr = scale**r
    offset = data[0]*alpha_rev**(r+1)
    pw0 = alpha*alpha_rev**(n-1)
    mult = data*pw0*scale_arr
    cumsums = mult.cumsum()
    out = offset + cumsums*scale_arr[::-1]
    return out
I am getting these results.
Do you know how to translate the Pine Script rma method into pandas?
I realized that using pandas ewm it seems to converge; the last rows are closer and closer to the wanted value. Is this correct?
...
As far as I know, Pine Script will use data that is not exported, so the weighted mean takes into account previous records that are not in your table, meaning that you can't reproduce the results without more information.
What you need to do is load around 50-100 points (depending on alpha) of data further into the past than what you will actually use, and use a threshold when comparing the data. You need both Python and Pine Script to be using data with the same, or at least a similar, "history".
So you make the calculations using the whole dataframe and then skip the first rows. You can see the effect of the historical data in your own example: the difference between your calculation and the Pine Script one vanishes quickly after roughly the 55th point, though of course the difference will also depend on alpha.
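One quick way to see that convergence on the sample above (using the dfRMA frame defined earlier; the diff column is just for illustration):
dfRMA['diff'] = (dfRMA['wantedrma'] - dfRMA['try1']).abs()
print(dfRMA[['wantedrma', 'try1', 'diff']].tail(10))  # the gap keeps shrinking toward the end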
So actually your code could already be well written. In any case, you can use the pandas implementation directly; it will be easier and probably faster.
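If you also want to replicate Pine's seeding exactly (the sma() seed for the first value, then the recursion), here is a minimal sketch mirroring the pine_rma code above; the function name is mine, and it assumes src is a pandas Series with len(src) >= length:
import numpy as np
import pandas as pd

def pine_rma_pandas(src: pd.Series, length: int) -> pd.Series:
    # Seed with the SMA of the first `length` values, as pine_rma does,
    # then apply the recursive update alpha * x + (1 - alpha) * prev.
    alpha = 1.0 / length
    out = np.full(len(src), np.nan)
    out[length - 1] = src.iloc[:length].mean()  # SMA seed
    for i in range(length, len(src)):
        out[i] = alpha * src.iloc[i] + (1 - alpha) * out[i - 1]
    return pd.Series(out, index=src.index)
Note this still will not match the table above in the early rows, because TradingView seeds from history that is not in the exported data.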
const cloneArray = (input) => [...input]
const pluck = (input, key) => input.map(element => element[key])

const pineSma = (source, length) => {
  let sum = 0.0
  for (let i = 0; i < length; ++i) {
    sum += source[i] / length
  }
  return sum
}

const pineRma = (src, length, last) => {
  const alpha = 1.0 / length
  return alpha * src[0] + (1.0 - alpha) * last
}

const calculatePineRma = (candles, sourceKey, length) => {
  const results = []
  for (let i = 0; i <= candles.length - length; ++i) {
    const sourceCandles = cloneArray(candles.slice(i, i + length)).reverse()
    const closes = pluck(sourceCandles, sourceKey)
    if (i === 0) {
      results.push(pineSma(closes, length))
      continue
    }
    results.push(pineRma(closes, length, results[results.length - 1]))
  }
  return results
}

shortcuts in if-else condition python

I have the following abbreviated excerpt of a function in my code:
s = 0.5
m = np.nonzero((velo > freq - fthrow - s*maskwidth_f))
velo_mask = np.delete(velo, m)
spec_mask = np.delete(spec, m)
if average(velo_mask < 0.9):
    s = 0.8
    m = np.nonzero((velo > freq - fthrow - s*maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
else:
    s = 0.5
    m = np.nonzero((velo > freq - fthrow - s*maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
This means that I have to compute the two arrays first, based on the initial value of s, then test the condition and, based on it, change the value of s and re-run the whole previous code with the new value. (I have a loop, and the whole data changes each time.)
It is actually a huge piece of code, and I don't want to write it three times: once to calculate the average, once in the if branch, and once in the else branch.
Is there maybe a way to tell Python to re-run the whole previous part inside the if-else condition?
Use functions to avoid code duplication. Example:
def create_mask(velo, spec, freq, fthrow, maskwidth_f, s):
    m = np.nonzero((velo > freq - fthrow - s * maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
    return velo_mask, spec_mask
...
s = 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
s = 0.8 if average(velo_mask < 0.9) else 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
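Since the else branch recomputes exactly the same masks for s = 0.5, you could also skip the redundant second call (same assumptions about average and the arrays):
s = 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
if average(velo_mask < 0.9):
    s = 0.8
    velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)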

splitting list in chunks of balanced weight

I need an algorithm to split a list of values into chunks such that the sum of the values in every chunk is approximately equal (it's some variation of the knapsack problem, I suppose).
So, for example [1, 2, 1, 4, 10, 3, 8] => [[8, 2], [10], [1, 3, 1, 4]]
Chunks of equal length are preferred, but it's not a constraint.
Python is the preferred language, but others are welcome as well.
Edit: the number of chunks is defined.
Greedy:
1. Order the available items descending.
2. Create N empty groups.
3. Start adding the items one at a time into the group that has the smallest sum in it.
I think in most real-life situations this should be enough.
Based on @Alin Purcaru's answer and @amit's remarks, I wrote this code (Python 3.1). It has, as far as I tested, linear performance (both in the number of items and the number of chunks, so finally it's O(N * M)). I avoid sorting the list every time by keeping the current sum of values for every chunk in a dict (this can be less practical with a greater number of chunks):
import time, random

def split_chunks(l, n):
    """
    Splits list l into n chunks with approximately equal sums of values.
    See http://stackoverflow.com/questions/6855394/splitting-list-in-chunks-of-balanced-weight
    """
    result = [[] for i in range(n)]
    sums = {i: 0 for i in range(n)}
    c = 0
    for e in l:
        for i in sums:
            if c == sums[i]:
                result[i].append(e)
                break
        sums[i] += e
        c = min(sums.values())
    return result

if __name__ == '__main__':
    MIN_VALUE = 0
    MAX_VALUE = 20000000
    ITEMS = 50000
    CHUNKS = 256
    l = [random.randint(MIN_VALUE, MAX_VALUE) for i in range(ITEMS)]
    t = time.time()
    r = split_chunks(l, CHUNKS)
    print(ITEMS, CHUNKS, time.time() - t)
Just because, you know, we can, here is the same code in PHP 5.3 (2-3 times slower than Python 3.1):
function split_chunks($l, $n){
    $result = array_fill(0, $n, array());
    $sums = array_fill(0, $n, 0);
    $c = 0;
    foreach ($l as $e){
        foreach ($sums as $i => $sum){
            if ($c == $sum){
                $result[$i][] = $e;
                break;
            } // if
        } // foreach
        $sums[$i] += $e;
        $c = min($sums);
    } // foreach
    return $result;
}

define('MIN_VALUE', 0);
define('MAX_VALUE', 20000000);
define('ITEMS', 50000);
define('CHUNKS', 128);

$l = array();
for ($i = 0; $i < ITEMS; $i++){
    $l[] = rand(MIN_VALUE, MAX_VALUE);
}
$t = microtime(true);
$r = split_chunks($l, CHUNKS);
$t = microtime(true) - $t;
print(ITEMS . ' ' . CHUNKS . ' ' . $t . ' ');
This will be faster and a little cleaner (based on the above ideas!):
def split_chunks2(l, n):
    result = [[] for i in range(n)]
    sums = [0] * n
    i = 0
    for e in l:
        result[i].append(e)
        sums[i] += e
        i = sums.index(min(sums))
    return result
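If the number of chunks is large, the linear min-scan in the loop can be replaced by a heap. A sketch along the same lines using Python's standard heapq module (the function name is mine):
import heapq

def split_chunks_heap(l, n):
    # Keep (current_sum, chunk_index) pairs in a min-heap so the lightest
    # chunk can be found and updated in O(log n) per element.
    result = [[] for _ in range(n)]
    heap = [(0, i) for i in range(n)]
    heapq.heapify(heap)
    for e in l:
        s, i = heapq.heappop(heap)
        result[i].append(e)
        heapq.heappush(heap, (s + e, i))
    return result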
You may want to use Artificial Intelligence tools for the problem.
First define your problem:
States = {(c1, c2, ..., ck) | c1, ..., ck are subgroups of your items, and union(c1, ..., ck) = S}
successors((c1, ..., ck)) = {switch one element from one sublist to another}
utility(c1, ..., ck) = max{sum(c1), sum(c2), ...} - min{sum(c1), sum(c2), ...}
Now you can use steepest-ascent hill climbing with random restarts (minimizing the utility above).
This algorithm is anytime, meaning you can start searching and, when time's up, stop it, and you will get the best result so far; the result will improve as run time increases.
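A rough sketch of that search (the helper names are mine; steepest-ascent over single-element relocation moves, with random restarts):
import random

def spread(sums):
    return max(sums) - min(sums)

def hill_climb_split(values, n, restarts=5):
    # Random initial assignment, then repeatedly apply the single-element
    # move that most reduces the max-min spread; restart and keep the best.
    best_assign, best_spread = None, float('inf')
    for _ in range(restarts):
        assign = [random.randrange(n) for _ in values]
        sums = [0] * n
        for v, c in zip(values, assign):
            sums[c] += v
        improved = True
        while improved:
            improved = False
            cur, move = spread(sums), None
            for idx, v in enumerate(values):
                i = assign[idx]
                for j in range(n):
                    if j == i:
                        continue
                    sums[i] -= v; sums[j] += v  # try moving values[idx] to chunk j
                    if spread(sums) < cur:
                        cur, move = spread(sums), (idx, j)
                    sums[i] += v; sums[j] -= v  # undo the trial move
            if move is not None:
                idx, j = move
                sums[assign[idx]] -= values[idx]
                sums[j] += values[idx]
                assign[idx] = j
                improved = True
        if spread(sums) < best_spread:
            best_spread, best_assign = spread(sums), assign[:]
    return [[v for v, c in zip(values, best_assign) if c == k] for k in range(n)]

# e.g. hill_climb_split([1, 2, 1, 4, 10, 3, 8], 3) -> something like [[1, 4, 3, 1], [2, 8], [10]]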
Scala version of foxtrotmikew's answer:
def workload_balancer(element_list: Seq[(Long, Any)], partitions: Int): Seq[Seq[(Long, Any)]] = {
  val result = scala.collection.mutable.Seq.fill(partitions)(Seq.empty[(Long, Any)])
  val weights = scala.collection.mutable.Seq.fill(partitions)(0L)
  var i = 0
  for (e <- element_list) {
    result(i) = result(i) :+ e
    weights(i) = weights(i) + e._1
    i = weights.indexOf(weights.min)
  }
  result.toSeq
}
element_list should be (weight: Long, Object: Any); then you can order and split the objects into different workloads (result). It helped me a lot, thanks!

Linear Interpolation. How to implement this algorithm in C ? (Python version is given)

There exists one very good linear interpolation method. It performs linear interpolation requiring at most one multiply per output sample. I found its description in the third edition of Understanding DSP by Lyons. This method involves a special hold buffer. Given a number of samples to be inserted between any two input samples, it produces output points using linear interpolation. Here, I have rewritten this algorithm in Python:
temp1, temp2 = 0, 0
iL = 1.0 / L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
where x contains the input samples, L is the number of points to be inserted, and y will contain the output samples.
My question is how to implement this algorithm in ANSI C in the most efficient way, e.g. is it possible to avoid the second loop?
NOTE: the presented Python code is just to understand how this algorithm works.
UPDATE: here is an example of how it works in Python:
from math import sin, pi

x = []
y = []
hold = []
num_points = 20
points_inbetween = 2
temp1, temp2 = 0, 0
for i in range(num_points):
    x.append( sin(i*2.0*pi * 0.1) )
L = points_inbetween
iL = 1.0/L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
Let's say x = [.... 10, 20, 30 ....]. Then, if L = 2, it will produce [... 10, 15, 20, 25, 30 ...]
Interpolation in the sense of "signal sample rate increase"
... or as I call it, "upsampling" (wrong term, probably; disclaimer: I have not read Lyons'). I just had to understand what the code does and then rewrite it for readability. As given, it has a couple of problems:
a) it is inefficient - two loops are OK, but it does a multiplication for every single output item; it also uses an intermediary list (hold) and generates the result with append (small beer);
b) it interpolates the first interval wrong; it generates fake data in front of the first element. Say we have multiplier = 5 and seq = [20, 30] - it will generate [4, 8, 12, 16, 20, 22, 24, 26, 28, 30] instead of [20, 22, 24, 26, 28, 30].
So here is the algorithm in the form of a generator:
def upsampler(seq, multiplier):
    if seq:
        step = 1.0 / multiplier
        y0 = seq[0]
        yield y0
        for y in seq[1:]:
            dY = (y - y0) * step
            for i in range(multiplier - 1):
                y0 += dY
                yield y0
            y0 = y
            yield y0
Ok and now for some tests:
>>> list(upsampler([], 3)) # this is just the same as [Y for Y in upsampler([], 3)]
[]
>>> list(upsampler([1], 3))
[1]
>>> list(upsampler([1,2], 3))
[1, 1.3333333333333333, 1.6666666666666665, 2]
>>> from math import sin, pi
>>> seq = [sin(2.0*pi * i/10) for i in range(20)]
>>> seq
[0.0, 0.58778525229247314, 0.95105651629515353, 0.95105651629515364, 0.58778525229247325, 1.2246063538223773e-016, -0.58778525229247303, -0.95105651629515353, -0.95105651629515364, -0.58778525229247336, -2.4492127076447545e-016, 0.58778525229247214, 0.95105651629515353, 0.95105651629515364, 0.58778525229247336, 3.6738190614671318e-016, -0.5877852522924728, -0.95105651629515342, -0.95105651629515375, -0.58778525229247347]
>>> list(upsampler(seq, 2))
[0.0, 0.29389262614623657, 0.58778525229247314, 0.76942088429381328, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247325, 0.29389262614623668, 1.2246063538223773e-016, -0.29389262614623646, -0.58778525229247303, -0.76942088429381328, -0.95105651629515353, -0.95105651629515364, -0.95105651629515364, -0.7694208842938135, -0.58778525229247336, -0.29389262614623679, -2.4492127076447545e-016, 0.29389262614623596, 0.58778525229247214, 0.76942088429381283, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247336, 0.29389262614623685, 3.6738190614671318e-016, -0.29389262614623618, -0.5877852522924728, -0.76942088429381306, -0.95105651629515342, -0.95105651629515364, -0.95105651629515375, -0.76942088429381361, -0.58778525229247347]
And here is my translation to C, fit into Kratz's fn template:
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param, will be filled with (src_len - 1) * steps + 1 samples
 */
float* linearInterpolation(float* src, int src_len, int steps, float* dst)
{
    float step, y0, dY;
    float *src_end;
    float *dst_start = dst;
    if (src_len > 0) {
        step = 1.0f / steps;
        /* the comma expression copies the current sample through and advances src */
        for (src_end = src + src_len; *dst++ = y0 = *src++, src < src_end; ) {
            dY = (*src - y0) * step;
            for (int i = steps; i > 1; i--) {   /* steps - 1 interpolated points */
                *dst++ = y0 += dY;
            }
        }
    }
    return dst_start;
}
Please note the C snippet is "typed but never compiled or run", so there might be syntax errors, off-by-1 errors etc. But overall the idea is there.
In that case I think you can avoid the second loop:
def interpolate2(x, L):
    new_list = []
    new_len = (len(x) - 1) * (L + 1)
    for i in range(0, new_len):
        step = i // (L + 1)
        substep = i % (L + 1)
        fr = x[step]
        to = x[step + 1]
        dy = float(to - fr) / float(L + 1)
        y = fr + (dy * substep)
        new_list.append(y)
    new_list.append(x[-1])
    return new_list

print(interpolate2([10, 20, 30], 3))
You just calculate the member at the position you want directly. Though that might not be the most efficient way to do it. The only way to be sure is to compile both and see which one is faster.
Well, first of all, your code is broken: L is not defined, and neither are x and y.
Once that is fixed, I ran Cython on the resulting code:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
And that seemed to work. I haven't tried to compile it, though, and you can also improve the speed a lot by adding different optimizations.
"e.g. is it possible to avoid the second loop?"
If it is, then it's possible in Python too. And I don't see how, although I don't see why you would do it the way you do. First, creating a list of length L filled with i-temp1 is completely pointless. Just loop L times:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = i - temp1
    temp1 = i
    for j in range(L):
        temp2 += hold
        y.append(temp2 * iL)
It all seems overcomplicated for what you get out of it, though. What are you trying to do, actually? Interpolate something? (Duh, it says so in the title. Sorry about that.)
There are surely easier ways of interpolating.
Update, a much simplified interpolation function:
# A simple list, so it's easy to see that you interpolate.
indata = [float(x) for x in range(0, 110, 10)]
points_inbetween = 3

outdata = [indata[0]]
for point in indata[1:]:  # all except the first
    step = (point - outdata[-1]) / (points_inbetween + 1)
    for i in range(points_inbetween):
        outdata.append(outdata[-1] + step)
    outdata.append(point)  # also emit the input sample itself, so the next interval starts from it
I don't see a way to get rid of the inner loop, nor a reason for wanting to do so.
Converting it to C I'll leave up to someone else, or even better, Cython, as C is a great langauge of you want to talk to hardware, but otherwise just needlessly difficult.
I think you need the two loops. You have to step over the samples in x to initialize the interpolator, not to mention copy their values into y, and you have to step over the output samples to fill in their values. I suppose you could do one loop to copy x into the appropriate places in y, followed by another loop to use all the values from y, but that will still require some stepping logic. Better to use the nested loop approach.
(And, as Lennart Regebro points out:) As a side note, I don't see why you do hold = [i-temp1] * L. Instead, why not do hold = i - temp1, and then loop for j in xrange(L): with temp2 += hold? This will use less memory but otherwise behave exactly the same.
Here's my try at a C implementation of your algorithm. Before trying to optimize it further, I'd suggest you profile its performance with all compiler optimizations enabled.
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param, needs to be of size src_len * steps
 */
float* linearInterpolation(float* src, size_t src_len, size_t steps, float* dst)
{
    float* dst_ptr = dst;
    float* src_ptr = src;
    float stepIncrement = 1.0f / steps;
    float temp1 = 0.0f;
    float temp2 = 0.0f;
    float hold;
    size_t idx_src, idx_steps;
    for(idx_src = 0; idx_src < src_len; ++idx_src)
    {
        hold = *src_ptr - temp1;
        temp1 = *src_ptr;
        ++src_ptr;
        for(idx_steps = 0; idx_steps < steps; ++idx_steps)
        {
            temp2 += hold;
            *dst_ptr = temp2 * stepIncrement;
            ++dst_ptr;
        }
    }
    return dst;
}
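Note that this mirrors the original Python exactly, including the initial ramp from zero (temp1 starts at 0.0f), which the generator-based answer above deliberately avoids.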
