I'm stuck on returning the result from a function that checks the samples of an A/B test and reports the result. The calculation is correct, but somehow I'm getting the result twice. The code and output are below.
import itertools as it
import math as mth
from scipy import stats as st

def test(sample1, sample2):
    for i in it.chain(range(len(sample1)), range(len(sample2))):
        alpha = .05
        difference = (sample1['step_conversion'][i] - sample2['step_conversion'][i]) / 100
        if i > 0:
            p_combined = (sample1['unq_user'][i] + sample2['unq_user'][i]) / (sample1['unq_user'][i-1] + sample2['unq_user'][i-1])
            z_value = difference / mth.sqrt(
                p_combined * (1 - p_combined) * (1 / sample1['unq_user'][i-1] + 1 / sample2['unq_user'][i-1]))
            distr = st.norm(0, 1)
            p_value = (1 - distr.cdf(abs(z_value))) * 2
            print(sample1['event_name'][i], 'p-value: ', p_value)
            if p_value < alpha:
                print('Deny H0')
            else:
                print('Accept H0')
    return
So I need the result in the output just once (marked in the box), but I get it twice, once from each sample.
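A side note on why each line appears twice: it.chain(range(len(sample1)), range(len(sample2))) iterates over the row indices of sample1 and then over the row indices of sample2 again, so when the two samples have the same number of rows, every index is visited twice. A minimal sketch of that behaviour, with hypothetical lengths:

import itertools as it

len1, len2 = 3, 3  # hypothetical, equal-length samples
print(list(it.chain(range(len1), range(len2))))  # [0, 1, 2, 0, 1, 2] -- each index twice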
When using Pandas dataframes, you should avoid most for loops, and use the standard vectorised approach. Use NumPy where applicable.
First, I've reset the indexes (indices) of the dataframes, to be sure .loc can be used with a standard numerical index.
sample1 = sample1.reset_index()
sample2 = sample2.reset_index()
The code below does what I think your for loop does.
I can't test it, and without a clear description, example dataframes and the expected outcome, it is anyone's guess whether the code below does what you want. But it may get close, and it mostly serves as an example of the vectorised approach.
import numpy as np
from scipy import stats as st  # `st` as used in the question

difference = (sample1['step_conversion'] - sample2['step_conversion']) / 100
n = len(sample1)
# Note that `.loc` slicing is label-based and *inclusive* of the end label,
# so after reset_index the highest valid label is n-1
p_combined = ((sample1.loc[1:, 'unq_user'] + sample2.loc[1:, 'unq_user']).reset_index(drop=True) /
              (sample1.loc[:n-1, 'unq_user'] + sample2.loc[:n-1, 'unq_user'])).reset_index(drop=True)
z_value = difference / np.sqrt(
    p_combined * (1 - p_combined) * (
        1 / sample1.loc[:n-1, 'unq_user'] + 1 / sample2.loc[:n-1, 'unq_user']))
distr = st.norm(0, 1)  # standard normal, as in the question
p_value = (1 - distr.cdf(np.abs(z_value))) * 2
sample1['p_value'] = p_value
print(sample1)
# The below prints a boolean Series: True for the elements where the condition holds.
# You can also use e.g. `print(sample1[p_value < alpha])` to show only the matching rows.
alpha = 0.05
print('Deny H0:')
print(p_value < alpha)
print('Accept H0:')
print(p_value >= alpha)
No for loop needed, and for a large dataframe, the above will be notably faster.
Note that the .reset_index(drop=True) is a bit ugly. But without it, Pandas would align the two Series on equal index labels before dividing, which is not what we want here; resetting the index avoids that.
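To illustrate the index-alignment behaviour that the reset_index(drop=True) works around, here is a tiny self-contained sketch (the Series values are made up):

import pandas as pd

s = pd.Series([10, 20, 30, 40])
shifted = s.loc[1:]                                 # index labels 1, 2, 3
print(shifted / s.loc[:2])                          # aligned on labels: NaN at 0 and 3, not a shifted ratio
print(shifted.reset_index(drop=True) / s.loc[:2])   # labels 0, 1, 2: element-wise "previous row" ratio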
I have two arrays with two rows each, which I am using to do some calculation. The last step in the calculation is a sum. When I sum using .sum(), I get one answer, and when I sum manually with + by explicitly indexing each row, I get a different answer:
>>> ad=[[0.,0.,0.,-0.91,0.34],[0.,0.,1.,0.93,0.65]]
>>> nad=np.array(ad)
>>> W=[[0.29,0.],[0.23,0.]]
>>> NW = np.array(W)
>>> (nad[:,4] * (nad[:,3] + 0.96 * NW[nad[:,2].astype(int)][0][0])).sum() #use sum() on row 0 and 1
0.5707160000000001
>>> (nad[0,4] * (nad[0,3] + 0.96 * NW[nad[0,2].astype(int)][0])) #calculate by explicitly indexing row 0
-0.21474400000000005
>>> (nad[1,4] * (nad[1,3] + 0.96 * NW[nad[1,2].astype(int)][0])) #calculate by explicitly indexing row 1
0.74802
>>> (nad[0,4] * (nad[0,3] + 0.96 * NW[nad[0,2].astype(int)][0])) + (nad[1,4] * (nad[1,3] + 0.96 * NW[nad[1,2].astype(int)][0])) #calculate by explicitly indexing row 0 and row 1
0.533276
The difference (0.57071 - 0.533276) might look small, but it blows up with iterative calculations. I feel both approaches should give the same values, right? So why are they giving different values? What are my eyes missing?
Even summing manually gives the latter answer (the one with +). Is numpy's sum() doing something I don't know about?
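For what it's worth, a quick way to see where the two expressions diverge is the NW term: in the vectorised line, the trailing [0][0] reduces NW[nad[:,2].astype(int)] to the single scalar 0.29, which is then reused for both rows, whereas the row-by-row versions pick up 0.29 and 0.23 respectively. A small sketch using the arrays from the question:

import numpy as np

nad = np.array([[0., 0., 0., -0.91, 0.34], [0., 0., 1., 0.93, 0.65]])
NW = np.array([[0.29, 0.], [0.23, 0.]])

print(NW[nad[:, 2].astype(int)][0][0])  # 0.29 -- one scalar, reused for both rows
print(NW[nad[:, 2].astype(int), 0])     # [0.29 0.23] -- one value per row
# With per-row values, the vectorised sum matches the manual one (~0.533276):
print((nad[:, 4] * (nad[:, 3] + 0.96 * NW[nad[:, 2].astype(int), 0])).sum())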
Hi,
I have this dataset with 7 columns (see image). First, I want to group by the Name column; afterwards I want to assign weights as follows:
1. Compute 10% of 1/n (if a Name has more than one Provider), where n = the number of unique IDs for that Name. For Sammy, for example, n = 2.
2. Add 5% of 1/n if the column Accel_5 is 1, an extra 10% of 1/n if Accel_10 is 1, and an extra 15% of 1/n if Accel_15 is 1.
3. Add 10% for each additional tech.
Altogether: group by Name (Sammy, Josh, Sarah), then compute 10% of 1/n (if there is more than one Provider) + 5% of 1/n (if Accel_5 equals 1) + 10% of 1/n (if Accel_10 equals 1) + 15% of 1/n (if Accel_15 equals 1) + 10% of 1/n (for each additional tech).
So far I have been able to group by Name and get the number of unique IDs per Name, but I am stuck there. See sample code below:
sample = pd.read_csv("Records.csv")
test = sample.groupby("Name")
test["ID"].nunique()
Link to data: Link to image depicted above
I appreciate your help.
Thanks.
You could try to create a custom function, and then use .apply() as:
def assign_weights(x):
    n = len(x['ID'].unique())
    x["Weight"] = 0
    # 1.
    n_providers = len(x['Provider'].unique())
    if n_providers > 1:
        x["Weight"] += 0.1 * 1/n
    # 2. -- note: `(x[col] == 1).any()` tests the *values*; `1 in x[col]` would test the index labels
    if (x['Accel_5'] == 1).any():
        x["Weight"] += 0.05 * 1/n
    if (x['Accel_10'] == 1).any():
        x["Weight"] += 0.1 * 1/n
    if (x['Accel_15'] == 1).any():
        x["Weight"] += 0.15 * 1/n
    # 3.
    n_tech = len(x['Tech'].unique())
    x["Weight"] += 0.1 * n_tech
    return x

sample.groupby("Name").apply(assign_weights)
This creates a new column Weight, based on the conditions 1, 2 and 3 you supplied. Because the input data was not provided in a usable form (only as an image), I have not tested the code, but I believe it should work as intended.
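Since the original data was only shared as an image, here is a small, made-up frame with the same column names, just to show how the function above could be called (the resulting weights are purely illustrative):

import pandas as pd

# Hypothetical stand-in for Records.csv, using the column names from the question.
demo = pd.DataFrame({
    "Name": ["Sammy", "Sammy", "Josh"],
    "ID": [1, 2, 3],
    "Provider": ["A", "B", "A"],
    "Accel_5": [1, 0, 0],
    "Accel_10": [0, 1, 0],
    "Accel_15": [0, 0, 1],
    "Tech": ["x", "y", "x"],
})

print(demo.groupby("Name").apply(assign_weights))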
I have a Python function that groups by a single column and calculates the count and mean.
def calc_smooth_mean(df, by, on, m):
    mean = df[on].mean()
    agg_value = df.groupby(by)[on].agg(['count', 'mean'])
    counts = agg_value['count']
    means = agg_value['mean']
    smooth = (counts * means + m * mean) / (counts + m)
    return df[by].map(smooth)
When I pass more than one column to "by", it throws the error "DataFrame object has no attribute 'map'". I tried converting the columns to a list and passing that to the function, but that did not work.
You should change the map to apply, since map is defined for Series and Index objects only; when by is a list of columns, df[by] is a DataFrame.
We can fix it with transform:
def calc_smooth_mean(df, by, on, m):
    mean = df[on].mean()
    counts = df.groupby(by)[on].transform('count')
    means = df.groupby(by)[on].transform('mean')
    smooth = (counts * means + m * mean) / (counts + m)
    return smooth
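For completeness, a quick hypothetical usage example of the transform-based version with two grouping columns (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    "cat_a": ["x", "x", "y", "y"],
    "cat_b": ["p", "q", "p", "q"],
    "target": [1.0, 0.0, 1.0, 1.0],
})

# Smoothed per-group mean of `target`, grouped by both categorical columns.
df["target_smooth"] = calc_smooth_mean(df, by=["cat_a", "cat_b"], on="target", m=10)
print(df)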
I get this error from a Python script that calculates pi using the Gauss-Legendre algorithm. I can only use up to 1024 iterations before getting this:
C:\Users\myUsernameHere>python Desktop/piWriter.py
End iteration: 1025
Traceback (most recent call last):
  File "Desktop/piWriter.py", line 15, in <module>
    vars()['t' + str(sub)] = vars()['t' + str(i)] - vars()['p' + str(i)] * math.pow((vars()['a' + str(i)] - vars()['a' + str(sub)]), 2)
OverflowError: long int too large to convert to float
Here is my code:
import math

a0 = 1
b0 = 1/math.sqrt(2)
t0 = .25
p0 = 1

finalIter = input('End iteration: ')
finalIter = int(finalIter)

for i in range(0, finalIter):
    sub = i + 1
    vars()['a' + str(sub)] = (vars()['a' + str(i)] + vars()['b' + str(i)]) / 2
    vars()['b' + str(sub)] = math.sqrt((vars()['a' + str(i)] * vars()['b' + str(i)]))
    vars()['t' + str(sub)] = vars()['t' + str(i)] - vars()['p' + str(i)] * math.pow((vars()['a' + str(i)] - vars()['a' + str(sub)]), 2)
    vars()['p' + str(sub)] = 2 * vars()['p' + str(i)]
    n = i

pi = math.pow((vars()['a' + str(n)] + vars()['b' + str(n)]), 2) / (4 * vars()['t' + str(n)])
print(pi)
Ideally, I want to be able to plug in a very large number as the iteration value and come back a while later to see the result.
Any help appreciated!
Thanks!
Floats can only represent numbers up to sys.float_info.max, or 1.7976931348623157e+308. Once you have an int with more than 308 digits (or so), you are stuck. Your iteration fails when p1024 has 309 digits:
179769313486231590772930519078902473361797697894230657273430081157732675805500963132708477322407536021120113879871393357658789768814416622492847430639474124377767893424865485276302219601246094119453082952085005768838150682342462881473913110540827237163350510684586298239947245938479716304835356329624224137216L
You'll have to find a different algorithm for pi, one that doesn't require such large values.
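A minimal sketch of the limit being described here (plain Python, independent of the pi script):

import sys

print(sys.float_info.max)    # 1.7976931348623157e+308
big = 2 ** 1024              # 309 decimal digits
print(len(str(big)))         # 309
try:
    float(big)
except OverflowError as exc:
    print(exc)               # int too large to convert to float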
Actually, you'll have to be careful with floats all around, since they are only approximations. If you modify your program to print the successive approximations of pi, it looks like this:
2.914213562373094923430016933707520365715026855468750000000000
3.140579250522168575088244324433617293834686279296875000000000
3.141592646213542838751209274050779640674591064453125000000000
3.141592653589794004176383168669417500495910644531250000000000
3.141592653589794004176383168669417500495910644531250000000000
3.141592653589794004176383168669417500495910644531250000000000
3.141592653589794004176383168669417500495910644531250000000000
In other words, after only 4 iterations, your approximation has stopped getting better. This is due to inaccuracies in the floats you are using, perhaps starting with 1/math.sqrt(2). Computing many digits of pi requires a very careful understanding of the numeric representation.
As noted in the previous answer, the float type has an upper bound on number size. In typical implementations, sys.float_info.max is 1.7976931348623157e+308, which reflects the 11-bit exponent field of a 64-bit floating point number (the largest finite value sits just below 2**1024). (Note that 1024*math.log(2)/math.log(10) is about 308.2547155599.)
You can get a much larger exponent range (the default decimal context allows exponents up to 999999) by using the Decimal number type. Here is an example (snipped from an IPython interpreter session):
In [48]: import decimal, math
In [49]: g=decimal.Decimal('1e12345')
In [50]: g.sqrt()
Out[50]: Decimal('3.162277660168379331998893544E+6172')
In [51]: math.sqrt(g)
Out[51]: inf
This illustrates that decimal's sqrt() function performs correctly with larger numbers than does math.sqrt().
As noted above, getting lots of digits is going to be tricky, but looking at all those vars hurts my eyes. So here's a version of your code after (1) replacing your use of vars with dictionaries, and (2) using ** instead of the math functions:
a, b, t, p = {}, {}, {}, {}
a[0] = 1
b[0] = 2**-0.5
t[0] = 0.25
p[0] = 1

finalIter = 4

for i in range(finalIter):
    sub = i + 1
    a[sub] = (a[i] + b[i]) / 2
    b[sub] = (a[i] * b[i])**0.5
    t[sub] = t[i] - p[i] * (a[i] - a[sub])**2
    p[sub] = 2 * p[i]
    n = i

pi_approx = (a[n] + b[n])**2 / (4 * t[n])
Instead of playing games with vars, I've used dictionaries to store the values, which makes your code much more readable. You can probably even see an optimization or two now.
As noted in the comments, you really don't need to store all the values, only the last, but I think it's more important that you see how to do things without dynamically creating variables. Instead of a dict, you could also have simply appended the values to a list, but lists are always zero-indexed and you can't easily "skip ahead" and set values at arbitrary indices. That can occasionally be confusing when working with algorithms, so let's start simple.
Anyway, the above gives me
>>> print(pi_approx)
3.141592653589794
>>> print(pi_approx-math.pi)
8.881784197001252e-16
A simple solution is to install and use the arbitrary-precision mpmath module, which now supports Python 3. However, since I completely agree with DSM that your use of vars() to create variables on the fly is an undesirable way to implement the algorithm, I've based my answer on his rewrite of your code and [trivially] modified it to make use of mpmath for the calculations.
If you insist on using vars(), you could probably do something similar -- although I suspect it would be more difficult, and the result would definitely be harder to read, understand, and modify.
from mpmath import mpf  # arbitrary-precision float type

a, b, t, p = {}, {}, {}, {}
a[0] = mpf(1)
b[0] = mpf(2**-0.5)
t[0] = mpf(0.25)
p[0] = mpf(1)

finalIter = 10000

for i in range(finalIter):
    sub = i + 1
    a[sub] = (a[i] + b[i]) / 2
    b[sub] = (a[i] * b[i])**0.5
    t[sub] = t[i] - p[i] * (a[i] - a[sub])**2
    p[sub] = 2 * p[i]
    n = i

pi_approx = (a[n] + b[n])**2 / (4 * t[n])
print(pi_approx)  # 3.14159265358979
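One caveat: by default mpmath works with about 15 significant decimal digits (mp.dps = 15), so the value printed above is no more precise than the plain-float result. If you actually want many digits of pi, the working precision has to be raised first. A small sketch of that:

from mpmath import mp, mpf, sqrt, pi

mp.dps = 60             # carry ~60 significant decimal digits in every mpf operation
b0 = 1 / sqrt(mpf(2))   # 1/sqrt(2) at full precision (mpf(2**-0.5) would inherit float error)
print(b0)
print(pi)               # mpmath's own pi at the same precision, for comparison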
What's an efficient way to calculate a trimmed or winsorized standard deviation of a list?
I don't mind using numpy, but if I have to make a separate copy of the list, it's going to be quite slow.
This will make two copies, but you should give it a try because it should be very fast.
def trimmed_std(data, low, high):
    tmp = np.asarray(data)
    return tmp[(low <= tmp) & (tmp < high)].std()
Do you need to do rank-order trimming (i.e. 5% trimmed)?
Update:
If you need percentile trimming, the best way I can think of is to sort the data first. Something like this should work:
def trimmed_std(data, percentile):
    data = np.array(data)
    data.sort()
    percentile = percentile / 2.
    low = int(percentile * len(data))
    high = int((1. - percentile) * len(data))
    return data[low:high].std(ddof=0)
You can obviously implement this without using numpy, but even including the time of converting the list to an array, using numpy is faster than anything I could think of.
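A quick usage sketch of the percentile version above (the data is random, so the exact numbers will vary):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=10_000).tolist()

print(np.std(data))              # untrimmed standard deviation, roughly 1.0
print(trimmed_std(data, 0.05))   # 5% trimmed (2.5% off each tail), somewhat smaller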
This is what generator functions are for.
SD requires two passes, plus a count. For this reason, you'll need to "tee" some iterators over the base collection.
So.
import math
import itertools

trimmed = (x for x in the_list if low <= x < high)
sum_iter, len_iter, var_iter = itertools.tee(trimmed, 3)
n = sum(1 for x in len_iter)
mean = sum(sum_iter) / n
sd = math.sqrt(sum((x - mean)**2 for x in var_iter) / (n - 1))
Something like that might do what you want without making an explicit copy of the list (though note that itertools.tee buffers items internally when its iterators are consumed at different rates, as they are here).
In order to get an unbiased trimmed mean you have to account for fractional bits of items in the list as described here and (a little less directly) here. I wrote a function to do it:
from math import modf

def percent_tmean(data, pcent):
    # make sure data is a list
    dc = list(data)
    # find the number of items
    n = len(dc)
    # sort the list
    dc.sort()
    # get the proportion to trim
    p = pcent / 100.0
    k = n * p
    # print("n = %i\np = %.3f\nk = %.3f" % (n, p, k))
    # get the decimal and integer parts of k
    dec_part, int_part = modf(k)
    # get an index we can use
    index = int(int_part)
    # trim down the list (n - index rather than -index, so that index == 0 keeps the whole list)
    dc = dc[index: n - index]
    # deal with the case of trimming fractional items
    if dec_part != 0.0:
        # deal with the first remaining item
        dc[0] = dc[0] * (1 - dec_part)
        # deal with the last remaining item
        dc[-1] = dc[-1] * (1 - dec_part)
    return sum(dc) / (n - 2.0 * k)
I also made an IPython Notebook that demonstrates it.
My function will probably be slower than those already posted but it will give unbiased results.
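A quick sanity check, assuming the percent_tmean function defined above (the data is symmetric, so both trimmed means should equal the plain mean of 5.5):

data = list(range(1, 11))        # 1..10
print(percent_tmean(data, 10))   # trims one item per end: (2 + ... + 9) / 8 = 5.5
print(percent_tmean(data, 15))   # k = 1.5, so the end items 2 and 9 get half weight; still 5.5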