I'm trying to run an AB test - comparing revenue amongst variants on websites.
Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random
def resampler(original_array, number_of_samples):
sample_array = np.zeros(number_of_samples)
choice = random.choice
for i in range(number_of_samples):
sample_array[i] = sum([choice(original_array) for _ in range(len(original_array))])
y = stats.normaltest(sample_array)
if y[1] > 0.001:
print y
new_y = resampler(original_array, number_of_samples * 2)
y = new_y
return sample_array
Basically, randomly sample from the 'revenue vector' (a sparsely populated vector - a zero for all non-converting visitors) and sum the resulting vectors until you've got a normal distribution.
I can perform this for both test groups at which point I've got two normally distributed quantities for t-testing. Using scipy.stats.ttest_ind I was able to get results that looked someway reasonable.
However, I wondered what the effect of running this procedure on cookie split would be (expected each group to see 50% of the cookies). Here, I saw something fairly unexpected - given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print stats.ttest_ind(x,y)
I get the output: (0.0021911476165975929, 0.99827342714956546)
Not at all significant (I think I'm interpreting that correctly?)
However, when I run this code:
for i in range(1000, 100000, 5000):
one_array = resampler(x,i)
two_array = resampler(y,i)
t_value, p_value = stats.ttest_ind(one_array, two_array)
t_value_array.append(t_value)
p_value_array.append(p_value)
print np.mean(t_value_array)
print np.mean(p_value_array)
I get:
0.642213492773
0.490587258892
I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.
Am I missing an obvious way to run this kind of test?
Cheers,
Matt
p.s
The data we have:
Website 1 : test group 1: unique cookies: revenue
Website 1 : test group 2: unique cookies: revenue
Website 2 : test group 1: unique cookies: revenue
Website 2 : test group 2: unique cookies: revenue
e.t.c.
What we'd like:
Test group x is beating test group y with z% certainty
(null hypothesis of test group 1 = test group 2)
Bonus:
The same as above but at a per site, as well as overall, basis
Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
On to your question. It's very hard to read that code and understand what you think you're testing---what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense) it looks pretty confused.
I'm going to have to guess what the data look like---you have a bunch of samples like this:
Website Method Revenue
------- ------ -------
w1 A 12
w2 B 0
w3 A 6
w4 B 0
etc etc. Does this look correct? Do you have repeated measures (i.e. do you have a revenue measurement for each website for each method? Or did you randomly assign websites to methods?)? I'm guessing that what you're passing to your method is an array of all revenues for one of the methods in turn, but do they pair up across methods in any way?
I can imagine testing various hypotheses with this data. For example, is method A more likely to generate non-zero revenue than method B (use logistic regression, response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (t-test on non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality when you include the zeros). I assume this troubling assumption is why you run the procedure of repeatedly subsampling until your data look normal, but you can't do this and test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only this part of it! In fact, I wouldn't be surprised to see that what this essentially does is excludes either most of the zero entries or most of the non-zero entries.
If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.
Related
I am interested in learning if there has been published some type of code or package that can help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0,1,2,3,4,5), each with their own unique probability.
I would like to estimate the probability of the total values -after all the scenarios have been simulated - is above X (e.g. 24).
The issue I have is that I can't - in a given event where the value is 3- multiply the probability of value 3*3 and add it together with the previous obtained values. Instead I need to simulate every single variation that is possible.
Is there any relatively simple solution to solve this issue?
First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0,1,2,3,4,5], and each outcome has a probability of occurring given by the list (for example) p =
[.1,.2,.2,.3,.1,.1], you can approximate the probability that the sum of all 30 events is greater than X with
import numpy as np
X = 80
np.mean([sum(np.random.choice(a=[0,1,2,3,4,5], size= 30,p=[.1,.2,.2,.3,.1,.1])) > X for i in range(10000)])
Background
I'm running an A-B test for two campaigns.
I got three step funnels set up for both campaigns.
So far B seems to be better than A, but how do I know when I have gathered enough measure points?
Funnel steps
In the data below, there are three steps. Step_1 is the number of users that reached our sign up page.
Step_2 is the number of users that filled in our sign up form
Step_3 is the number of users that confirmed their email address.
Question
How can I calculate the likelihood that A is better than B, or vice versa?
Or more eloquently:
Given an "infinate amount of cases" where we have A:6 and B:8 observations in Step_3 and a conversion rate from Step_1 of A:12.5% and B:13.333...%. In how many of these cases does A end up with a higher conversion rate than B and vice versa?
Step_1 Step_2 Step_3
A 144.0 18 6
B 135.0 18 8
Rationale
Each user going through the funnel is unaffected by other users.
Each user cannot reach the next step without going through the earlier.
Each user will either stop at a step, or continue to the next. Giving only two options for each independent observation
This means a binomial distribution can be used to predict the likeliness of a user converting to the next step.
What I tried so far
So far I've tried using a poisson distribution
from scipy.stats.distributions import poisson
And using the poisson.ppf somehow I should be able to say "The likeliness of A being better than B is 5%, the likeliness of B being better than A is 25%."
Of course I can just plug in some values to the function and go "Hey, this looks great" but I feel like I need to call upon the vast knowledge of the Stacked Oracles of Stack Overflow to make sure I'm doing something statistically sound.
Why Poisson
In my humble understanding of distributions:
The poisson distribution is a lot like the binomial distribution (scipy.stats.binom), but better suited for predictions involving few observations than it's binom big brother.
The poisson distribution is a binomial distribution, because it asserts two possible outcomes
The reason binomial distributions are what I want to use is because there are two outcomes in my simulated scenario, either the user proceeds down the funnel, or the user exits. This is the bi in binomial.
The poisson distribution is based on the assumption that two observations cannot affect each other. So wether or not user_1 makes it to step_3, step_2 or just to step_1, it does not matter for user_2. This is very much the case, they do not know of each others existence.
Mathematically speaking Binomial is more precise in this case than Poisson. For example, using Poisson you'll get a positive probability of more than 18 of your 18 candidates making the conversion. Poisson owes its popularity to being easier to compute.
The result also depends on your prior knowledge. For example if both your outcomes look very high compared to typical conversion rates then all else being equal the difference you see is more significant.
Assuming no prior knowledge, i.e. assuming that every conversion rate between 0 and 1 is equally likely if you know nothing else, the probability of a given conversion rate r once you take into account your observation of 6 out of 18 possible conversions is given by the Beta distribution, in this case Beta(r; 6+1, 18-6+1)
Technically speaking this is not a probability but a likelihood. The difference is the following: a probablity describes how often you will observe different outcomes if you compare "parallel universes" that are identical, even though reputable statisticians probably wouldn't use that terminology. A likelihood is the other way round: given a fixed outcome comparing different universes how often will you observe a particular kind of universe. (To be even more technical, this description is only fully correct if as we did a "flat prior" is assumed.) In your case there are two kinds of universe, one where A is better than B and one where B is better than A.
The probability of B being better than A is then
integral_0^1 dr Beta_cdf(r; 6+1, 18-6+1) x Beta_pdf(r; 8+1, 18-8+1)
You can use scipy.stats.beta and scipy.integrate.quad to calculate that and you'll get a 0.746 probability of B being better than A:
quad(lambda r: beta(7, 13).cdf(r) * beta(9,11).pdf(r), 0, 1)
# (0.7461608994979401, 1.3388378385104094e-08)
To conclude, by this measure the evidence for B being better than A is not very strong.
UPDATE:
The two step case can be solved conceptually similarly, but is a bit more challenging to compute.
We have two steps 135 / 144 -> 18 -> 8 / 6. Given these numbers how are the conversion rates for A and B and step 1 and step 2 distributed? Ultimately we are interested in the product of step 1 and step 2 for A and for B. Since I couldn't get scipy to solve the integrals in reasonable time I fell back to a Monte Carlo scheme. Just draw the conversion rates with appropriate probabilites N=10^7 times and count how often B is better than A:
(beta(9,11).rvs(N)*beta(19,118).rvs(N) > beta(7,13).rvs(N)*beta(19,127).rvs(N)).mean()
The result is very similar to the single step one: 0.742 in favour of B. Again, not very strong evidence.
As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.
from StringIO import StringIO
myst="""01/01/2016 8781262
01/02/2016 8958598
01/03/2016 8787628
01/04/2016 9770861
01/05/2016 8409410
01/06/2016 8924784
01/07/2016 8597500
01/08/2016 6436862
01/09/2016 1542677
"""
u_cols=['month', 'count']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep='\t', names = u_cols)
Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?
This is easy if I define a limit (for e.g. 9 or 10%). But I want the script to decide that for me and return the values if the difference between the lowest and second last lowest value is more than overall 5%. In this case the September month count should be returned.
A very common approach to filtering outliers is to use standard deviation. In this case, we will calculate a zscore which will quickly identify how many standard deviations away from the mean each observation is. We can then filter those observations that are greater than 2 standard deviations. For normally distributed random variables, this should happen approximately 5% of the time.
Define a zscore function
def zscore(s):
return (s - np.mean(s)) / np.std(s)
Apply it to the count column
zscore(df['count'])
0 0.414005
1 0.488906
2 0.416694
3 0.831981
4 0.256946
5 0.474624
6 0.336390
7 -0.576197
8 -2.643349
Name: count, dtype: float64
Notice that the September observation is 2.6 standard deviations away.
Use abs and gt to identify outliers
zscore(df['count']).abs().gt(2)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
Name: count, dtype: bool
Again, September comes back true.
Tie it all together to filter your original dataframe
df[zscore(df['count']).abs().gt(2)]
filter the other way
df[zscore(df['count']).abs().le(2)]
first of all, the "way below or too high" concept you refer to is known as Outlier, and quoting Wikipedia (not the best source),
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
But on the other side:
In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.
So in my opinion this boils down to the question, wether is it possible to make assumptions about the nature of your data, to be able to automatize such decissions.
STRAIGHTFORWARD APPROACH
If you are lucky enough to have a relatively big sample size, and your different samples aren't correlated, you can apply the central limit theorem, which states that your values will follow a normal distribution (see this for a python-related explanation).
In this context, you may be able to quickly get the mean value and standard deviation of the given dataset. And by applying the corresponding function (with this two parameters) to each given value you can calculate its probability of belonging to the "cluster" (see this stackoverflow post for a possible python solution).
Then you do have to put a lower bound, since this distribution returns 0% probability only when a point is infinitely far away from the mean value. But the good thing is that (if the assumptions are true) this bound will nicely adapt to each different dataset, because of its exponential, normalized nature. This bound is typically expressed in Sigma unities, and widely used in science and statistics. As a matter of fact, the Physics Nobel Price 2013, dedicated to the discovery of Higgs boson, was granted after a 5-sigma range was reached, quoting the link:
High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
ALTERNATIVES
If you cannot make such simple assumptions of how your data should look like, you can always let a program infere them. This approach is a core feature of most machine learning algorithms, which can nicely adapt to strong correlated and even skewed data if finetuned properly. If this is what you need, Python has many good libraries for that purpose, that can even fit in a small script (the one I know best is tensorflow from google).
In this case I would regard two different approaches, depending again on how does your data look like:
Supervised learning: In case you have a training set at disposal, that states which samples belong and which ones don't (known as labeled), there are algorithms like the support vector machine that, although lightweight, can adapt to highly non-linear boundaries amazingly.
Unsupervised learning: This is probably what I would try first: When you simply have the unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of anomaly detector, and thus can be highly tweaked and customized to also regard correlations in an even infinite amount of dimensions, due to the kernel trick. To understand the motivations and approach of a ML-based anomaly detector, I would suggest to take a look at Andrew Ng's videos on the matter.
I hope it helps!
Cheers
One way to filter outliers is the interquartile range (IQR, wikipedia), which is the difference between 75% (Q3) and 25% quartile (Q1).
The outliers are defined if the data falls below Q1 - k * IQR resp. above Q3 + k * IQR.
You can select the constant k based on your domain knowledge (a common choice is 1.5).
Given the data, a filter in pandas could look like this:
iqr_filter = pd.DataFrame(df["count"].quantile([0.25, 0.75])).T
iqr_filter["iqr"] = iqr_filter[0.75]-iqr_filter[0.25]
iqr_filter["lo"] = iqr_filter[0.25] - 1.5*iqr_filter["iqr"]
iqr_filter["up"] = iqr_filter[0.75] + 1.5*iqr_filter["iqr"]
df_filtered = df.loc[(df["count"] > iqr_filter["lo"][0]) & (df["count"] < iqr_filter["up"][0]), :]
I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, lets say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0,3):
prob_product = 1
for i in words: #just the index of words from my vocabulary in this document
prob_product = prob_product*wordprob[i,j]
sum = sum + prob_product
What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.
Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.
Any help will be much appreciated.
I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive so overflow can't occur. Underflow is not a problem as for them exp will return 0. At least one of them (the one where L[i] is M) will be zero so it's exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.
your question seems on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is but the sum of logs equals the log of the products. Dont know if that helps..
Also, you are calculating one prob_product for every j but you are just using the last one (and you are re-initializing it). you meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, i doesnt look that you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all i have for now.
Sorry for the long post and no code.
High school algebra tells you this:
log(A*B*....*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + .... + Z)
I am solving the homework-1 of Caltech Machine Learning Course (http://work.caltech.edu/homework/hw1.pdf) . To solve ques 7-10 we need to implement a PLA. This is my implementation in python:
import sys,math,random
w=[] # stores the weights
data=[] # stores the vector X(x1,x2,...)
output=[] # stores the output(y)
# returns 1 if dot product is more than 0
def sign_dot_product(x):
global w
dot=sum([w[i]*x[i] for i in xrange(len(w))])
if(dot>0):
return 1
else :
return -1
# checks if a point is misclassified
def is_misclassified(rand_p):
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
f=open("data.dat","r")
global w
for line in f:
data_tmp=([1]+[float(x) for x in line.split(" ")])
data.append(data_tmp[0:-1])
output.append(data_tmp[-1])
def train():
global w
w=[ random.uniform(-1,1) for i in xrange(len(data[0]))] # initializes w with random weights
iter=1
while True:
rand_p=random.randint(0,len(output)-1) # randomly picks a point
check=[0]*len(output) # check is a list. The ith location is 1 if the ith point is correctly classified
while not is_misclassified(rand_p):
check[rand_p]=1
rand_p=random.randint(0,len(output)-1)
if sum(check)==len(output):
print "All points successfully satisfied in ",iter-1," iterations"
print iter-1,w,data[rand_p]
return iter-1
sign=output[rand_p]
w=[w[i]+sign*data[rand_p][i] for i in xrange(len(w))] # changing weights
if iter>1000000:
print "greater than 1000"
print w
return 10000000
iter+=1
load_data()
def simulate():
#tot_iter=train()
tot_iter=sum([train() for x in xrange(100)])
print float(tot_iter)/100
simulate()
The problem according to the answer of question 7 it should take around 15 iterations for perceptron to converge when size of training set but the my implementation takes a average of 50000 iteration . The training data is to be randomly generated but I am generating data for simple lines such as x=4,y=2,..etc. Is this the reason why I am getting wrong answer or there is something else wrong. Sample of my training data(separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, because of some of the stylistic inefficiencies with your code, it makes it difficult to help you and it creates a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data point and you should not revisit any points. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious if this implies the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases when not is_misclassified evaluates to True) if you then witness a new random point that is misclassified, then your code will kick down into the larger section of the outer while loop, and then go back to the top where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence that it evaluates them (in the inner while loop) happens to be a string of all 1's except for the miraculous ability that on any particular 0, on that pass through the array, it classifies correctly.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check if my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loop all together and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your function to load and prep the data return things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).