I have the following abbreviated version of a function in my code:
s = 0.5
m = np.nonzero((velo>freq-fthrow - s*maskwidth_f))
velo_mask = np.delete(velo, m)
spec_mask = np.delete(spec, m)
if average(velo_mask<0.9):
    s = 0.8
    m = np.nonzero((velo>freq-fthrow - s*maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
else:
    s = 0.5
    m = np.nonzero((velo>freq-fthrow - s*maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
This means I first have to compute the two arrays using the initial value of s, then evaluate the condition, and, depending on the result, change s and re-run all of the previous code with the new value. (This sits inside a loop, and the data changes on every iteration.)
The real code is huge, and I don't want to write it out three times: once to calculate the average, once in the if branch, and once in the else branch.
Is there a way to tell Python to re-run the whole previous part inside the if-else?
Use functions to avoid code duplication. Example:
def create_mask(velo, spec, freq, fthrow, maskwidth_f, s):
    m = np.nonzero((velo > freq - fthrow - s * maskwidth_f))
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
    return velo_mask, spec_mask
...
s = 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
s = 0.8 if average(velo_mask < 0.9) else 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
I'm stuck on a function that checks two samples for an A/B test and prints the result. The calculation is correct, but somehow I'm getting the result twice. The code and output are below.
def test (sample1, sample2):
    for i in it.chain (range(len(sample1)), range(len(sample2))):
        alpha = .05
        difference = (sample1['step_conversion'][i] - sample2['step_conversion'][i])/100
        if (i > 0):
            p_combined = (sample1['unq_user'][i] + sample2['unq_user'][i]) / (sample1['unq_user'][i-1] + sample2['unq_user'][i-1])
            z_value = difference / mth.sqrt(
                p_combined * (1 - p_combined) * (1 / sample1['unq_user'][i-1] + 1 / sample2['unq_user'][i-1]))
            distr = st.norm(0, 1)
            p_value = (1 - distr.cdf(abs(z_value))) * 2
            print( sample1['event_name'][i], 'p-value: ', p_value)
            if p_value < alpha:
                print('Deny H0')
            else:
                print('Accept H0')
    return
So I need the result in output just once (tagged in the box), but I get it twice from both samples.
When using Pandas dataframes, you should avoid most for loops, and use the standard vectorised approach. Use NumPy where applicable.
First, I've reset the indexes (indices) of the dataframes, to be sure .loc can be used with a standard numerical index.
sample1 = sample1.reset_index()
sample2 = sample2.reset_index()
The code below does what I think your for loop does.
I can't test it, and without a clear description, example dataframes and expected outcome, it is anyone's guess if the code below does what you want. But it may get close, and mostly serves as an example of the vectorised approach.
import numpy as np
difference = (sample1['step_conversion'] - sample2['step_conversion']) / 100
n = len(sample1)
# Note: `.loc` slices by label and includes the end label, so `.loc[:n-1]` covers labels 0 through n-1
p_combined = ((sample1.loc[1:, 'unq_user'] + sample2.loc[1:, 'unq_user']).reset_index(drop=True) /
              (sample1.loc[:n-1, 'unq_user'] + sample2.loc[:n-1, 'unq_user'])).reset_index(drop=True)
z_value = difference / np.sqrt(
    p_combined * (1 - p_combined) * (
        1 / sample1.loc[:n-1, 'unq_user'] + 1 / sample2.loc[:n-1, 'unq_user']))
distr = st.norm(0, 1) # ??
p_value = (1 - distr.cdf(np.abs(z_value))) * 2
sample1['p_value'] = p_value
print(sample1)
# The below prints a list of True values for elements for which the condition is valid.
# You can also use e.g. `print(sample1[p_value < alpha])`.
alpha = 0.05
print('Deny H0:')
print(p_value < alpha)
print('Accept H0:')
print(p_value >= alpha)
No for loop needed, and for a large dataframe, the above will be notably faster.
Note that the .reset_index(drop=True) is a bit ugly. But without it, Pandas would align the two Series on their index labels before dividing, which is not what we want here. Resetting the index avoids that.
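As for the doubled output in the question: `it.chain(range(len(sample1)), range(len(sample2)))` yields every index once per range, so each row is processed twice. A minimal sketch of that effect, assuming both samples have the same number of rows (the two lists below are made-up stand-ins for the dataframes; only their length matters):

```python
import itertools as it

# Hypothetical stand-ins for the two dataframes.
sample1 = ["a", "b", "c"]
sample2 = ["x", "y", "z"]

# chain() concatenates the two ranges, so every index appears twice.
chained = list(it.chain(range(len(sample1)), range(len(sample2))))
print(chained)

# Iterating over a single range visits each row exactly once.
once = list(range(min(len(sample1), len(sample2))))
print(once)
```

Looping over a single `range(len(sample1))` would therefore print each event once, even without vectorising.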
I'm trying to implement the following formula in Python. It's basically a long chain of nested summations, where an additional summation is added each time a new 'element' is needed. To explain the formula's structure simply, here's how it looks for 2 to 5 elements:
2 elements
3 elements
4 elements
5 elements
By the way, here's the g function shown in the formulas:
g function
Now, I foolishly tried coding this formula with my extremely barebones python programming skills. The initial goal was to try this with 15 elements, but given that it contained a lot of nested for loops and factorials, I quickly noticed that I could not really obtain a result from that.
At the end I ended up with this monstrous code, that would finish just after the heat death of the universe:
import math
pNuevos = [0,2,2,2,2,1,1,1,2,2,2,1,2,2,1,1]
pTotales = [0,10,10,7,8,7,7,7,7,7,10,7,8,7,8,8]
def PTirada (personajes):
    tirada = 0.05/personajes
    return tirada
def Ppers1 (personajes, intentos):
    p1pers = ((math.factorial(intentos-1)) / ((math.factorial(4))*(math.factorial(intentos-5)))) * (PTirada(personajes)**5) * ((1-PTirada(personajes))**(intentos-5))
    return p1pers
def Ppers2 (personajes, intentos):
    p2pers = 0
    for i in range(10,intentos+1):
        p2pers = p2pers + ( (math.factorial(intentos-1)) / ((math.factorial(4))*(math.factorial(i-5))*(math.factorial(intentos-i))) ) * (PTirada(personajes)**i) * ((1 - 2*(PTirada(personajes))) **(intentos-i))
    p2pers = 2*p2pers
    return p2pers
def Activate (z) :
    probability1 = 0
    probability2 = 0
    probability3 = 0
    probability4 = 0
    probability5 = 0
    probability6 = 0
    probability7 = 0
    probability8 = 0
    probability9 = 0
    probability10 = 0
    probability11 = 0
    probability12 = 0
    probability13 = 0
    probability14 = 0
    for i in range (5*pNuevos[1], z-5*pNuevos[2]+1):
        for j in range (5*pNuevos[2], z-i-5*pNuevos[3]+1):
            for k in range (5*pNuevos[3], z-j-i-5*pNuevos[4]+1):
                for l in range (5*pNuevos[4], z-k-j-i-5*pNuevos[5]+1):
                    for m in range (5*pNuevos[5], z-l-k-j-i-5*pNuevos[6]+1):
                        for n in range (5*pNuevos[6], z-m-l-k-j-i-5*pNuevos[7]+1):
                            for o in range (5*pNuevos[7], z-n-m-l-k-j-i-5*pNuevos[8]+1):
                                for p in range (5*pNuevos[8], z-o-n-m-l-k-j-i-5*pNuevos[9]+1):
                                    for q in range (5*pNuevos[9], z-p-o-n-m-l-k-j-i-5*pNuevos[10]+1):
                                        for r in range (5*pNuevos[10], z-q-p-o-n-m-l-k-j-i-5*pNuevos[11]+1):
                                            for s in range (5*pNuevos[11], z-r-q-p-o-n-m-l-k-j-i-5*pNuevos[12]+1):
                                                for t in range (5*pNuevos[12], z-s-r-q-p-o-n-m-l-k-j-i-5*pNuevos[13]+1):
                                                    for u in range (5*pNuevos[13], z-t-s-r-q-p-o-n-m-l-k-j-i-5*pNuevos[14]+1):
                                                        for v in range (5*pNuevos[14], z-u-t-s-r-q-p-o-n-m-l-k-j-i-5*pNuevos[15]+1):
                                                            probability14 = probability14 + eval("Ppers"+str(pNuevos[14])+"("+str(pTotales[14])+","+str(v)+")") * eval("Ppers"+str(pNuevos[15])+"("+str(pTotales[15])+","+str(z-v-u-t-s-r-q-p-o-n-m-l-k-j-i)+")")
                                                        probability13 = probability13 + eval("Ppers"+str(pNuevos[13])+"("+str(pTotales[13])+","+str(u)+")") * probability14
                                                    probability12 = probability12 + eval("Ppers"+str(pNuevos[12])+"("+str(pTotales[12])+","+str(t)+")") * probability13
                                                probability11 = probability11 + eval("Ppers"+str(pNuevos[11])+"("+str(pTotales[11])+","+str(s)+")") * probability12
                                            probability10 = probability10 + eval("Ppers"+str(pNuevos[10])+"("+str(pTotales[10])+","+str(r)+")") * probability11
                                        probability9 = probability9 + eval("Ppers"+str(pNuevos[9])+"("+str(pTotales[9])+","+str(q)+")") * probability10
                                    probability8 = probability8 + eval("Ppers"+str(pNuevos[8])+"("+str(pTotales[8])+","+str(p)+")") * probability9
                                probability7 = probability7 + eval("Ppers"+str(pNuevos[7])+"("+str(pTotales[7])+","+str(o)+")") * probability8
                            probability6 = probability6 + eval("Ppers"+str(pNuevos[6])+"("+str(pTotales[6])+","+str(n)+")") * probability7
                        probability5 = probability5 + eval("Ppers"+str(pNuevos[5])+"("+str(pTotales[5])+","+str(m)+")") * probability6
                    probability4 = probability4 + eval("Ppers"+str(pNuevos[4])+"("+str(pTotales[4])+","+str(l)+")") * probability5
                probability3 += eval("Ppers"+str(pNuevos[3]) + "("+str(pTotales[3])+","+str(k)+")") * probability4
            probability2 += eval("Ppers"+str(pNuevos[2]) + "("+str(pTotales[2])+","+str(j)+")") * probability3
        probability1 += eval("Ppers"+str(pNuevos[1]) + "("+str(pTotales[1])+","+str(i)+")") * probability2
    return probability1
print (str(Activate(700)))
Edit: Alright I think it would be helpful to explain a couple things:
-First of all, I was trying to find ways the code could run faster, as I'm aware the nested for loops are a performance hog. I was also hoping there would be a way to optimize so many factorial operations.
-Also, the P(A) function described in the g function represents the probability of an event happening, which is already in the code, in the first function from the top.
There's also the function f in the formula, which is just a simplification of the function g for specific cases.
The function f is the second function in the code, whereas g is the third function in the code.
I will try to find a way to simplify the multiple summations, and thanks for the tip of not using eval()!
I'm sorry again for not specifying the question more, and for that mess of code also.
I would expect to break it down with something like this:
def main():
    A = 0.5
    m = 10
    result = g(A, m)
    return result
def sigma(k, m):
    ''' function to deal with the sum loop'''
    total = 0
    for i in range(k, m+1):
        # add the bits in the formula to total here
        pass
    return total
def g(A, m):
    ''' function to deal with g '''
    k=10
    return 2 * sigma(k,m)
if __name__=='__main__':
    # This is executed when run from the command line
    main()
Alternatively, you could do something similar with classes.
I expect you also need a function for p(A) and one for factorials.
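On the two points raised in the question's edit (the cost of the factorials and dropping eval()), here is a minimal sketch. It relies on the identities (n-1)!/(4!(n-5)!) = C(n-1, 4) and (n-1)!/(4!(i-5)!(n-i)!) = C(n-1, 4) * C(n-5, i-5), and uses `math.comb` (Python 3.8+). The lowercase names `p_tirada`, `ppers1`, `ppers2`, `PPERS` and `ppers` are illustrative renames, not part of the original code, and the nested summation itself is left untouched.

```python
import math

def p_tirada(personajes):
    # Same per-draw probability as PTirada in the question.
    return 0.05 / personajes

def ppers1(personajes, intentos):
    # (intentos-1)! / (4! * (intentos-5)!) is the binomial coefficient C(intentos-1, 4).
    p = p_tirada(personajes)
    return math.comb(intentos - 1, 4) * p**5 * (1 - p)**(intentos - 5)

def ppers2(personajes, intentos):
    # (n-1)! / (4! * (i-5)! * (n-i)!) equals C(n-1, 4) * C(n-5, i-5).
    p = p_tirada(personajes)
    total = 0.0
    for i in range(10, intentos + 1):
        coeff = math.comb(intentos - 1, 4) * math.comb(intentos - 5, i - 5)
        total += coeff * p**i * (1 - 2 * p)**(intentos - i)
    return 2 * total

# Dispatch table: replaces eval("Ppers" + str(n) + "(...)").
PPERS = {1: ppers1, 2: ppers2}

def ppers(nuevos, totales, intentos):
    # Look up the right function by the pNuevos value and call it directly.
    return PPERS[nuevos](totales, intentos)

print(ppers(1, 7, 20), ppers(2, 10, 20))
```

A call such as `ppers(pNuevos[14], pTotales[14], v)` would then stand in for the corresponding eval() expression.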
For a list of numbers
val numbers = Seq(0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205)
python / pandas computes the following percentiles:
25% 0.167289
50% 0.348107
75% 0.692389
However, scala returns:
calcPercentiles(Seq(.25, .5, .75), sortedNumber.toArray)
25% 0.1601818278168149
50% 0.3481071101229365
75% 0.7182103704579226
The numbers almost match, but they are different. How can I get rid of the difference (and most likely fix a bug in my Scala code)?
val sortedNumber = numbers.sorted
import scala.collection.mutable
case class PercentileResult(percentile:Double, value:Double)
// https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
def calculatePercentile(arr: Array[Double], p: Double)={
  // +1 so that the .5 == mean for even number of elements.
  val f = (arr.length + 1) * p
  val i = f.toInt
  if (i == 0) arr.head
  else if (i >= arr.length) arr.last
  else {
    arr(i - 1) + (f - i) * (arr(i) - arr(i - 1))
  }
}
def calcPercentiles(percentiles:Seq[Double], arr: Array[Double]):Array[PercentileResult] = {
  val results = new mutable.ListBuffer[PercentileResult]
  percentiles.foreach(p => {
    val r = PercentileResult(percentile = p, value = calculatePercentile(arr, p))
    results.append(r)
  })
  results.toArray
}
python:
import pandas as pd
df = pd.DataFrame({'foo':[0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205]})
display(df.head())
df.describe()
After a bit of trial and error I wrote this code, which returns the same results as pandas (using linear interpolation, as this is the pandas default):
def calculatePercentile(numbers: Seq[Double], p: Double): Double = {
  // interpolate only - no special handling of the case when rank is integer
  val rank = (numbers.size - 1) * p
  val i = numbers(math.floor(rank).toInt)
  val j = numbers(math.ceil(rank).toInt)
  val fraction = rank - math.floor(rank)
  i + (j - i) * fraction
}
From that I would say the error was here:
(arr.length + 1) * p
The 0th percentile should map to index 0, and the 100th percentile should map to the maximal index.
So for numbers (.size == 21) that would be indices 0 and 20. However, for 100% you would get an index value of 22, which is bigger than the size of the array! If not for this guard clause:
else if (i >= arr.length) arr.last
you would get an error and might suspect that something is wrong. Perhaps the authors of the code:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
used a different definition of percentile, or they might simply have a bug. I cannot tell.
BTW: This:
def calcPercentiles(percentiles:Seq[Double], arr: Array[Double]): Array[PercentileResult]
could be much easier to write like this:
def calcPercentiles(percentiles:Seq[Double], numbers: Seq[Double]): Seq[PercentileResult] =
  percentiles.map { p =>
    PercentileResult(p, calculatePercentile(numbers, p))
  }
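To make the difference between the two rank definitions concrete, here is a small Python sketch that applies both to the question's data. The function names `percentile_n_minus_1` and `percentile_n_plus_1` are made up for this illustration; the first follows the `(size - 1) * p` rank used in the answer, the second the `(length + 1) * p` rank from the question.

```python
import math

numbers = sorted([
    0.0817381355303346, 0.08907955219917718, 0.10581384008994665,
    0.10970915785902469, 0.1530743353025532, 0.16728932033107657,
    0.181932212814931, 0.23200826752868853, 0.2339654613723784,
    0.2581657775305527, 0.3481071101229365, 0.5010850992326521,
    0.6153244818101578, 0.6233250409474894, 0.6797744231690304,
    0.6923891392381571, 0.7440316016776881, 0.7593186414698002,
    0.8028091068764153, 0.8780699052482807, 0.8966649331194205,
])

def percentile_n_minus_1(data, p):
    # Rank on a 0..n-1 scale, then interpolate linearly between neighbours.
    rank = (len(data) - 1) * p
    lo, hi = math.floor(rank), math.ceil(rank)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)

def percentile_n_plus_1(data, p):
    # Rank on a 1..n scale using (n + 1) * p, with guards at both ends.
    f = (len(data) + 1) * p
    i = int(f)
    if i == 0:
        return data[0]
    if i >= len(data):
        return data[-1]
    return data[i - 1] + (f - i) * (data[i] - data[i - 1])

for p in (0.25, 0.5, 0.75):
    print(p, percentile_n_minus_1(numbers, p), percentile_n_plus_1(numbers, p))
```

With 21 values, the first definition lands exactly on indices 5, 10 and 15, while the second lands halfway between two samples for 25% and 75%, which is where the two sets of results in the question diverge.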
I'm new to Python and really stumped on this. I'm reading from a book and the code works fine; I just don't get it!
T[i+1] = m*v[i+1]ˆ**/L
What's with the double asterisk part of this code? It's even followed by a forward slash. The variable L is initialized with the value 1.0. It looks like someone slumped over the keyboard, but the code works fine. Is this a math expression or something more? I would appreciate help understanding this. Thanks!
full code:
from pylab import *
g = 9.8 # m/sˆ2
dt = 0.01 # s
time = 10.0 # s
v0 = 2.0 # s
D = 0.05 #
L = 1.0 # m
m = 0.5 # kg
# Numerical initialization
n = int(round(time/dt))
t = zeros(n,float)
s = zeros(n,float)
v = zeros(n,float)
T = zeros(n,float)
# Initial conditions
v[0] = v0
s[0] = 0.0
# Simulation loop
i = 0
while (i<n AND T[i]>=0.0):
    t[i+1] = t[i] + dt
    a = -D/m*v[i]*abs(v[i])-g*sin(s[i]/L)
    v[i+1] = v[i] + a*dt
    s[i+1] = s[i] + v[i+1]*dt
    T[i+1] = m*v[i+1]ˆ**/L + m*g*cos(s[i+1]/L)
    i = i + 1
This code is from the book "Elementary Mechanics Using Python: A Modern Course Combining Analytical and Numerical Techniques".
According to the formula on page 255:
So the Python line should be:
T[i+1] = m*v[i+1]**2/L + m*g*cos(s[i+1]/L)
What's with the double asterisk part of this code?
The answer to your core question (at least as it stands at the time of writing) is that the double asterisk (star) is the power operator: "raise to the power". So, i**3 would be "cube i".
My (cross check) source: https://stackoverflow.com/a/1044866/18196
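A quick illustration of the operator, using the m and L values from the question's listing; the speed `v = 2.0` is just an assumed example value:

```python
m, L = 0.5, 1.0   # mass and length, as in the question's listing
v = 2.0           # example speed (assumed value)

print(2 ** 3)     # exponentiation: 2 to the power of 3
print(2 ^ 3)      # the caret is bitwise XOR in Python, not a power

# ** binds tighter than * and /, so this reads m * (v**2) / L
centripetal = m * v**2 / L
print(centripetal)
```

This also shows why a caret in a printed formula cannot simply be typed as `^` in Python code.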
There is a very good linear interpolation method that requires at most one multiply per output sample. I found its description in the third edition of Understanding DSP by Lyons. The method uses a special hold buffer. Given the number of samples to be inserted between any two input samples, it produces the output points using linear interpolation. Here, I have rewritten this algorithm in Python:
temp1, temp2 = 0, 0
iL = 1.0 / L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 *iL)
where x contains the input samples, L is the number of output points produced per input sample, and y will contain the output samples.
My question is how to implement this algorithm in ANSI C in the most efficient way, e.g. is it possible to avoid the second loop?
NOTE: presented Python code is just to understand how this algorithm works.
UPDATE: here is an example how it works in Python:
from math import sin, pi
x=[]
y=[]
hold=[]
num_points=20
points_inbetween = 2
temp1,temp2=0,0
for i in range(num_points):
    x.append( sin(i*2.0*pi * 0.1) )
L = points_inbetween
iL = 1.0/L
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 * iL)
Let's say x=[.... 10, 20, 30 ....]. Then, with L=2 in the code above (one new point between each pair of input samples), it will produce [... 10, 15, 20, 25, 30 ...]
Interpolation in the sense of "signal sample rate increase"
... or as I call it, "upsampling" (probably the wrong term; disclaimer: I have not read Lyons). I just had to understand what the code does and then rewrite it for readability. As given, it has a couple of problems:
a) it is inefficient: two loops are OK, but it does a multiplication for every single output item; it also uses an intermediate list (hold) and builds the result with append (small beer)
b) it interpolates the first interval wrongly: it generates fake data in front of the first element. Say we have multiplier=5 and seq=[20,30]: it will generate [4,8,12,16,20,22,24,26,28,30] instead of [20,22,24,26,28,30].
So here is the algorithm in form of a generator:
def upsampler(seq, multiplier):
    if seq:
        step = 1.0 / multiplier
        y0 = seq[0]
        yield y0
        for y in seq[1:]:
            dY = (y-y0) * step
            for i in range(multiplier-1):
                y0 += dY
                yield y0
            y0 = y
            yield y0
Ok and now for some tests:
>>> list(upsampler([], 3)) # this is just the same as [Y for Y in upsampler([], 3)]
[]
>>> list(upsampler([1], 3))
[1]
>>> list(upsampler([1,2], 3))
[1, 1.3333333333333333, 1.6666666666666665, 2]
>>> from math import sin, pi
>>> seq = [sin(2.0*pi * i/10) for i in range(20)]
>>> seq
[0.0, 0.58778525229247314, 0.95105651629515353, 0.95105651629515364, 0.58778525229247325, 1.2246063538223773e-016, -0.58778525229247303, -0.95105651629515353, -0.95105651629515364, -0.58778525229247336, -2.4492127076447545e-016, 0.58778525229247214, 0.95105651629515353, 0.95105651629515364, 0.58778525229247336, 3.6738190614671318e-016, -0.5877852522924728, -0.95105651629515342, -0.95105651629515375, -0.58778525229247347]
>>> list(upsampler(seq, 2))
[0.0, 0.29389262614623657, 0.58778525229247314, 0.76942088429381328, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247325, 0.29389262614623668, 1.2246063538223773e-016, -0.29389262614623646, -0.58778525229247303, -0.76942088429381328, -0.95105651629515353, -0.95105651629515364, -0.95105651629515364, -0.7694208842938135, -0.58778525229247336, -0.29389262614623679, -2.4492127076447545e-016, 0.29389262614623596, 0.58778525229247214, 0.76942088429381283, 0.95105651629515353, 0.95105651629515364, 0.95105651629515364, 0.7694208842938135, 0.58778525229247336, 0.29389262614623685, 3.6738190614671318e-016, -0.29389262614623618, -0.5877852522924728, -0.76942088429381306, -0.95105651629515342, -0.95105651629515364, -0.95105651629515375, -0.76942088429381361, -0.58778525229247347]
And here is my translation to C, fit into Kratz's fn template:
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param will be filled with (src_len - 1) * steps + 1 samples
 * @return dst
 */
float* linearInterpolation(float* src, int src_len, int steps, float* dst)
{
    float step, y0, dy;
    float *src_end;
    float *out = dst;
    if (src_len > 0) {
        step = 1.0f / steps;
        /* Copy each input sample, then fill in the steps-1 points that follow it. */
        for (src_end = src+src_len; *out++ = y0 = *src++, src < src_end; ) {
            dy = (*src - y0) * step;
            for (int i=steps; i>1; i--) {
                *out++ = y0 += dy;
            }
        }
    }
    return dst;
}
Please note the C snippet is "typed but never compiled or run", so there might be syntax errors, off-by-1 errors etc. But overall the idea is there.
In that case I think you can avoid the second loop:
def interpolate2(x, L):
    new_list = []
    new_len = (len(x) - 1) * (L + 1)
    for i in range(0, new_len):
        step = i // (L + 1)  # index of the input sample that starts this interval
        substep = i % (L + 1)  # position inside the interval
        fr = x[step]
        to = x[step + 1]
        dy = float(to - fr) / float(L + 1)
        y = fr + (dy * substep)
        new_list.append(y)
    new_list.append(x[-1])
    return new_list
print(interpolate2([10, 20, 30], 3))
You just calculate the member at the position you want directly, though that might not be the most efficient way to do it. The only way to be sure is to compile it and see which one is faster.
Well, first of all, your code is broken. L is not defined, and neither is y or x.
Once that is fixed, I run cython on the resulting code:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = [i-temp1] * L
    temp1 = i
    for j in hold:
        temp2 += j
        y.append(temp2 *iL)
And that seemed to work. I haven't tried to compile it, though, and you can also improve the speed a lot by adding different optimizations.
"e.g. is it possible to avoid the second loop?"
If it is, then it's possible in Python too. I don't see how, although I also don't see why you would do it the way you do. Creating a list of length L filled with i-temp1 first is completely pointless. Just loop L times:
L = 1
temp1, temp2 = 0, 0
iL = 1.0 / L
y = []
x = range(5)
for i in x:
    hold = i-temp1
    temp1 = i
    for j in range(L):
        temp2 += hold
        y.append(temp2 *iL)
It all seems overcomplicated for what you get out though. What are you trying to do, actually? Interpolate something? (Duh it says so in the title. Sorry about that.)
There are surely easier ways of interpolating.
Update, a much simplified interpolation function:
# A simple list, so it's easy to see that you interpolate.
indata = [float(x) for x in range(0, 110, 10)]
points_inbetween = 3
outdata = [indata[0]]
for point in indata[1:]: # All except the first
    step = (point - outdata[-1]) / (points_inbetween + 1)
    for i in range(points_inbetween):
        outdata.append(outdata[-1] + step)
I don't see a way to get rid of the inner loop, nor a reason for wanting to do so.
Converting it to C I'll leave to someone else, or even better, to Cython, as C is a great language if you want to talk to hardware, but otherwise just needlessly difficult.
I think you need the two loops. You have to step over the samples in x to initialize the interpolator, not to mention copy their values into y, and you have to step over the output samples to fill in their values. I suppose you could do one loop to copy x into the appropriate places in y, followed by another loop to use all the values from y, but that will still require some stepping logic. Better to use the nested loop approach.
(And, as Lennart Regebro points out) As a side note, I don't see why you do hold = [i-temp1] * L. Instead, why not do hold = i-temp1, and then loop for j in xrange(L): and temp2 += hold? This will use less memory but otherwise behave exactly the same.
Here's my try at a C implementation of your algorithm. Before trying to optimize it further, I'd suggest you profile its performance with all compiler optimizations enabled.
/**
 *
 * @param src caller supplied array with data
 * @param src_len len of src
 * @param steps to interpolate
 * @param dst output param needs to be of size src_len * steps
 * @return dst
 */
float* linearInterpolation(float* src, size_t src_len, size_t steps, float* dst)
{
    float* dst_ptr = dst;
    float* src_ptr = src;
    float stepIncrement = 1.0f / steps;
    float temp1 = 0.0f;
    float temp2 = 0.0f;
    float hold;
    size_t idx_src, idx_steps;
    for(idx_src = 0; idx_src < src_len; ++idx_src)
    {
        /* Difference to the previous input sample, held for `steps` outputs. */
        hold = *src_ptr - temp1;
        temp1 = *src_ptr;
        ++src_ptr;
        for(idx_steps = 0; idx_steps < steps; ++idx_steps)
        {
            temp2 += hold;
            *dst_ptr = temp2 * stepIncrement;
            ++dst_ptr;
        }
    }
    return dst;
}