Python: Solving a complex scenario analysis

I am interested in learning whether any code or package has been published that could help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0,1,2,3,4,5), each with their own unique probability.
I would like to estimate the probability that the total value - after all the scenarios have been simulated - is above X (e.g. 24).
The issue I have is that I can't - in a given event where the value is 3 - just multiply the probability of value 3 by 3 and add it to the previously obtained values. Instead I need to simulate every single variation that is possible.
Is there any relatively simple solution to solve this issue?

First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0,1,2,3,4,5], and each outcome has a probability of occurring given by the list (for example) p = [.1,.2,.2,.3,.1,.1], you can approximate the probability that the sum of all 30 events is greater than X with
import numpy as np

X = 80
p = [.1, .2, .2, .3, .1, .1]
np.mean([sum(np.random.choice(a=[0, 1, 2, 3, 4, 5], size=30, p=p)) > X for i in range(10000)])
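The Monte Carlo estimate above is usually good enough. If you want an exact answer instead, one option (a sketch, not part of the original answer) is to build the distribution of the sum by repeatedly convolving the single-event probabilities, then summing the tail above X:

import numpy as np

p = np.array([.1, .2, .2, .3, .1, .1])  # P(value = 0..5) for a single event
n_events = 30
X = 80

pmf_sum = np.array([1.0])  # PMF of the sum over zero events
for _ in range(n_events):
    pmf_sum = np.convolve(pmf_sum, p)  # PMF of the sum after one more event

print(pmf_sum[X + 1:].sum())  # P(total > X); index k holds P(total == k)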

Related

Likeliness of "A" being better than "B" using Poisson distribution

Background
I'm running an A-B test for two campaigns.
I have three-step funnels set up for both campaigns.
So far B seems to be better than A, but how do I know when I have gathered enough measure points?
Funnel steps
In the data below, there are three steps. Step_1 is the number of users that reached our sign-up page.
Step_2 is the number of users that filled in our sign-up form.
Step_3 is the number of users that confirmed their email address.
Question
How can I calculate the likelihood that A is better than B, or vice versa?
Or more eloquently:
Given an "infinate amount of cases" where we have A:6 and B:8 observations in Step_3 and a conversion rate from Step_1 of A:12.5% and B:13.333...%. In how many of these cases does A end up with a higher conversion rate than B and vice versa?
Step_1 Step_2 Step_3
A 144.0 18 6
B 135.0 18 8
Rationale
Each user going through the funnel is unaffected by other users.
Each user cannot reach the next step without going through the earlier.
Each user will either stop at a step, or continue to the next. Giving only two options for each independent observation
This means a binomial distribution can be used to predict the likeliness of a user converting to the next step.
What I tried so far
So far I've tried using a Poisson distribution:
from scipy.stats.distributions import poisson
Using poisson.ppf, I should somehow be able to say "The likeliness of A being better than B is 5%, the likeliness of B being better than A is 25%."
Of course I can just plug in some values to the function and go "Hey, this looks great" but I feel like I need to call upon the vast knowledge of the Stacked Oracles of Stack Overflow to make sure I'm doing something statistically sound.
Why Poisson
In my humble understanding of distributions:
The Poisson distribution is a lot like the binomial distribution (scipy.stats.binom), but better suited for predictions involving few observations than its binom big brother.
The poisson distribution is a binomial distribution, because it asserts two possible outcomes
The reason binomial distributions are what I want to use is because there are two outcomes in my simulated scenario, either the user proceeds down the funnel, or the user exits. This is the bi in binomial.
The Poisson distribution is based on the assumption that two observations cannot affect each other. So whether user_1 makes it to step_3, step_2 or just to step_1 does not matter for user_2. This is very much the case; they do not know of each other's existence.
Mathematically speaking Binomial is more precise in this case than Poisson. For example, using Poisson you'll get a positive probability of more than 18 of your 18 candidates making the conversion. Poisson owes its popularity to being easier to compute.
The result also depends on your prior knowledge. For example if both your outcomes look very high compared to typical conversion rates then all else being equal the difference you see is more significant.
Assuming no prior knowledge, i.e. assuming that every conversion rate between 0 and 1 is equally likely if you know nothing else, the probability of a given conversion rate r once you take into account your observation of 6 out of 18 possible conversions is given by the Beta distribution, in this case Beta(r; 6+1, 18-6+1)
Technically speaking this is not a probability but a likelihood. The difference is the following: a probability describes how often you will observe different outcomes if you compare "parallel universes" that are identical, even though reputable statisticians probably wouldn't use that terminology. A likelihood is the other way round: given a fixed outcome, comparing different universes, how often will you observe a particular kind of universe? (To be even more technical, this description is only fully correct if, as we did, a "flat prior" is assumed.) In your case there are two kinds of universe, one where A is better than B and one where B is better than A.
The probability of B being better than A is then
integral_0^1 dr Beta_cdf(r; 6+1, 18-6+1) x Beta_pdf(r; 8+1, 18-8+1)
You can use scipy.stats.beta and scipy.integrate.quad to calculate that and you'll get a 0.746 probability of B being better than A:
from scipy.stats import beta
from scipy.integrate import quad
quad(lambda r: beta(7, 13).cdf(r) * beta(9, 11).pdf(r), 0, 1)
# (0.7461608994979401, 1.3388378385104094e-08)
To conclude, by this measure the evidence for B being better than A is not very strong.
UPDATE:
The two step case can be solved conceptually similarly, but is a bit more challenging to compute.
We have two steps: 135 / 144 -> 18 -> 8 / 6. Given these numbers, how are the conversion rates for A and B at step 1 and step 2 distributed? Ultimately we are interested in the product of step 1 and step 2 for A and for B. Since I couldn't get scipy to solve the integrals in reasonable time, I fell back to a Monte Carlo scheme: just draw the conversion rates with the appropriate probabilities N = 10^7 times and count how often B is better than A:
(beta(9,11).rvs(N)*beta(19,118).rvs(N) > beta(7,13).rvs(N)*beta(19,127).rvs(N)).mean()
The result is very similar to the single step one: 0.742 in favour of B. Again, not very strong evidence.
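For reference, a runnable version of that Monte Carlo snippet might look like the following (assuming scipy is installed; the parameters are the same Beta(k+1, n-k+1) posteriors as above):

from scipy.stats import beta

N = 10**7  # number of Monte Carlo draws

# Step-1 conversion rates: A converted 18/144, B converted 18/135 (flat prior)
step1_a = beta(18 + 1, 144 - 18 + 1).rvs(N)
step1_b = beta(18 + 1, 135 - 18 + 1).rvs(N)

# Step-2 conversion rates: A converted 6/18, B converted 8/18
step2_a = beta(6 + 1, 18 - 6 + 1).rvs(N)
step2_b = beta(8 + 1, 18 - 8 + 1).rvs(N)

# Probability that B's overall (step 1 * step 2) conversion rate beats A's
print((step1_b * step2_b > step1_a * step2_a).mean())  # roughly 0.74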

Generate 'normal-distribution'-like data based on one value in Python

I have one variable temp, say temp = 100. What I want to do is generate 8 data points, displayed as shown in the figure. It looks like a normal distribution, but I want to add a fair amount of random noise so the points do not look like a perfect normal distribution. The final data (the area under the curve) should sum to temp. Could someone advise how to do this easily and neatly in Python please?
I have tried using the distribution functions in numpy/matplotlib. However, I wonder how I can get 8 points like those shown in the figure (x = 0,1,2,3,4...)? Also, I can't figure out how to make them sum to 100.
By imposing the sum temp=100 you introduce a dependency between at least two data points, making it impossible to create a set of independently sampled random data points.
This answer on mathworks provides more detailed information.
An easier example:
Imagine one coin flip. The randomness in the system is exactly one binary outcome, or 1 bit.
Imagine two coin flips. The randomness in the system is exactly two binary outcomes or 2 bit.
Now imagine imposing a sum constraint on two coin flips, let's say you want the sum of coin flips in the system to equal exactly 1. Since the outcome of the second coin flip is determined by the outcome of the first binary decision, the randomness in the system shrinks.
Therefore you reduce the total randomness of the system from 2 bit to 1 bit.
Sampling 8 truly (pseudo)-random points from a normal distribution with a sum-constraint is therefore not possible.
Your best bet would be to sample 7 random points from a distribution with appropriate mean and then add a point to the dataset to absorb the difference:
>>> import numpy as np
>>> temp = 100.0
>>> datapoints = 8
>>> dev = 1
>>> data = np.random.normal(temp/datapoints, dev, datapoints-1)
>>> print(data)
[ 11.70369328 10.77010243 11.20507387 12.40637644 12.81099137
12.55329521 10.95809056]
>>> data = np.append(data,temp-sum(data))
>>> data
array([ 11.70369328, 10.77010243, 11.20507387, 12.40637644,
12.81099137, 12.55329521, 10.95809056, 17.59237685])
>>> sum(data)
100.0
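A small variation on the same idea (not part of the original answer, just a sketch): draw all 8 points and then spread the correction evenly over them, so no single point has to absorb the whole difference:

import numpy as np

temp = 100.0
datapoints = 8
dev = 1

data = np.random.normal(temp / datapoints, dev, datapoints)
data += (temp - data.sum()) / datapoints  # shift every point by the same amount so the total is exactly temp
print(data, data.sum())

The trade-off is the same as before: the points are no longer fully independent, since the sum constraint removes one degree of freedom.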

Python - Cosine gradually reveals small-amp oscillations ("wobbles")

I have a problem that is equal parts trig and Python. I am plotting a cosine over time interval [0,t] whose frequency changes (slightly) according to another cosine function. So what I'd expect to see is a repeating pattern of higher-to-lower frequency that repeats over the duration of the window [0,t].
Instead what I'm seeing is that over time a low-freq motif emerges in the cosine plot and repeats over time, each time becoming lower and lower in freq until eventually the cosine doesn't even oscillate properly it just "wobbles", for lack of a better term.
I don't understand how this is emerging over the course of the window [0,t] because cosine is (obviously) periodic and the function modulating it is as well. So how can "new" behavior emerge?? The behavior should be identical across all periods of the modulatory cosine that tunes the freq of the base cosine, right?
As a note, I'm technically using a modified cosine: instead of cos(wt) I'm using e^(cos(wt)) (called the von Mises equation, or something similar).
Minimum needed code:
import numpy as np
import matplotlib.pyplot as plt

cos_plot = []
for wind, pos_theta in zip(window, pos_theta_vec):  # window is a vector of time increments
    # for ref: DBFT(pos_theta) = (1/(2*np.pi))*np.cos(np.radians(pos_theta - base_pos))
    f = float(baserate + DBFT(pos_theta))  # DBFT() returns a value in [-0.15, 0.15], periodic in pos_theta
    cos_plot.append(np.exp(np.cos(f*2*np.pi*wind)))
plt.plot(cos_plot)
plt.show()
What you are observing could be due to "aliasing", i.e. the emergence of low-frequency artifacts caused by sampling a high-frequency function with a step that is too big.
(The Wikipedia page on aliasing has an illustrative picture.)
If the issue is NOT aliasing, consider that any function shape between -1 and 1 can be obtained with cos(f(x)*x) simply by choosing f(x).
To see this, take any function -1 <= g(x) <= 1 and set f(x) = arccos(g(x))/x.
To look for the problem, try plotting your "frequency" and see if anything really strange is present in it. Maybe you have a bug in DBFT.
In the interest of posterity, in case anyone ever needs an answer to this question:
I wanted a cosine whose frequency was a time-varying function freq(t). My mistake was simply evaluating this function at each time t, like this: A*cos(2*pi*freq(t)*t). Instead you need to integrate freq(t) from 0 to t at each time point: y = cos(2*pi*integral_0^t f(tau) dtau + 2*pi*f0*t + phase), where f(t) is the time-varying part of the frequency and f0 the base rate. The term for this procedure is a frequency sweep or chirp (not identical terms, but similar enough if you need to search Google/SO for answers).
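For reference, a minimal sketch of that integrated-frequency (chirp) construction; the variable names (base_freq, mod_depth, mod_freq) are illustrative, not taken from the original code:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 5000)   # time window [0, t]
dt = t[1] - t[0]
base_freq = 2.0                # base rate (illustrative)
mod_depth = 0.15               # frequency deviation, comparable to DBFT's [-0.15, 0.15]
mod_freq = 0.2                 # frequency of the modulating cosine

inst_freq = base_freq + mod_depth * np.cos(2 * np.pi * mod_freq * t)

# Integrate the instantaneous frequency to get the phase,
# instead of evaluating cos(2*pi*inst_freq*t) directly.
phase = 2 * np.pi * np.cumsum(inst_freq) * dt

plt.plot(t, np.exp(np.cos(phase)))  # same exp(cos(...)) shaping as the original code
plt.show()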
Thanks to those who responded with help :)
SB

Running AB tests on Revenue in Python

I'm trying to run an AB test - comparing revenue amongst variants on websites.
Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random

def resampler(original_array, number_of_samples):
    sample_array = np.zeros(number_of_samples)
    choice = random.choice
    for i in range(number_of_samples):
        sample_array[i] = sum([choice(original_array) for _ in range(len(original_array))])
    y = stats.normaltest(sample_array)
    if y[1] > 0.001:
        print y
        new_y = resampler(original_array, number_of_samples * 2)
        y = new_y
    return sample_array
Basically, randomly sample from the 'revenue vector' (a sparsely populated vector - a zero for all non-converting visitors) and sum the resulting vectors until you've got a normal distribution.
I can perform this for both test groups at which point I've got two normally distributed quantities for t-testing. Using scipy.stats.ttest_ind I was able to get results that looked someway reasonable.
However, I wondered what the effect of running this procedure on the cookie split would be (we expected each group to see 50% of the cookies). Here, I saw something fairly unexpected - given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print stats.ttest_ind(x,y)
I get the output: (0.0021911476165975929, 0.99827342714956546)
Not at all significant (I think I'm interpreting that correctly?)
However, when I run this code:
for i in range(1000, 100000, 5000):
    one_array = resampler(x, i)
    two_array = resampler(y, i)
    t_value, p_value = stats.ttest_ind(one_array, two_array)
    t_value_array.append(t_value)
    p_value_array.append(p_value)
print np.mean(t_value_array)
print np.mean(p_value_array)
I get:
0.642213492773
0.490587258892
I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.
Am I missing an obvious way to run this kind of test?
Cheers,
Matt
p.s
The data we have:
Website 1 : test group 1: unique cookies: revenue
Website 1 : test group 2: unique cookies: revenue
Website 2 : test group 1: unique cookies: revenue
Website 2 : test group 2: unique cookies: revenue
e.t.c.
What we'd like:
Test group x is beating test group y with z% certainty
(null hypothesis of test group 1 = test group 2)
Bonus:
The same as above, but on a per-site as well as an overall basis
Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
On to your question. It's very hard to read that code and understand what you think you're testing---what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense) it looks pretty confused.
I'm going to have to guess what the data look like---you have a bunch of samples like this:
Website Method Revenue
------- ------ -------
w1 A 12
w2 B 0
w3 A 6
w4 B 0
etc etc. Does this look correct? Do you have repeated measures (i.e. do you have a revenue measurement for each website for each method? Or did you randomly assign websites to methods?)? I'm guessing that what you're passing to your method is an array of all revenues for one of the methods in turn, but do they pair up across methods in any way?
I can imagine testing various hypotheses with this data. For example, is method A more likely to generate non-zero revenue than method B (use logistic regression, response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (t-test on non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality when you include the zeros). I assume this troubling assumption is why you run the procedure of repeatedly subsampling until your data look normal, but you can't do this and test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only this part of it! In fact, I wouldn't be surprised to see that what this essentially does is excludes either most of the zero entries or most of the non-zero entries.
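As an illustration of the first suggestion (logistic regression on the binary "generated revenue or not" outcome), here is a minimal sketch using statsmodels; the column names and the toy data are made up, not taken from the question:

import pandas as pd
import statsmodels.formula.api as smf

# One row per visitor: which method they saw, and whether they generated any revenue.
df = pd.DataFrame({
    "method": ["A"] * 200 + ["B"] * 200,
    "converted": [1] * 30 + [0] * 170 + [1] * 45 + [0] * 155,
})

model = smf.logit("converted ~ method", data=df).fit()
print(model.summary())  # the coefficient on method[T.B] tests B against A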
If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
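For illustration, here is a minimal pure-Python sketch of the alias method (not the linked implementation, just the idea: O(n) table construction followed by O(1) draws):

import random

def build_alias_table(probs):
    # Walker/Vose alias method: O(n) preprocessing of a discrete distribution.
    n = len(probs)
    scaled = [p * n for p in probs]
    prob_table = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob_table[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]           # give the leftover mass to the large item
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                    # whatever remains is 1.0 up to rounding error
        prob_table[i] = 1.0
    return prob_table, alias

def alias_draw(prob_table, alias):
    # O(1) per draw: pick a column uniformly, then keep it or jump to its alias.
    i = random.randrange(len(prob_table))
    return i if random.random() < prob_table[i] else alias[i]

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
prob_table, alias = build_alias_table(MyProba)
drawn = [aList[alias_draw(prob_table, alias)] for _ in range(len(aList))]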
Here's my lazy method: build a list with the expected number of occurrences of each value for the desired distribution, and use random.choice() to pick a value from that list.
>>> import random
>>>
>>> aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
>>> MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
>>> # zip the values with their probabilities directly (a dict would collapse the duplicate values 3 and 4)
>>> expected_dist = sum([[value] * int(prob * 100) for value, prob in zip(aList, MyProba)], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and build a tree from these intervals. Then you will get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear complexity you have now.
You're calculating cumulative_proba every time you call random_pick. I suggest calculating it once outside the method and using a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(log n).
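To illustrate the last two suggestions, here is a sketch that precomputes the cumulative distribution once and then binary-searches it with bisect for each draw (a sorted array of cumulative probabilities plays the role of the tree):

import bisect
import itertools
import random

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]

# Preprocess once: the cumulative distribution of MyProba.
cumulative = list(itertools.accumulate(MyProba))

def random_pick_fast(some_list, cumulative):
    x = random.random()                           # uniform in [0, 1)
    i = bisect.bisect_right(cumulative, x)        # binary search: O(log n) per draw
    return some_list[min(i, len(some_list) - 1)]  # clamp guards against floating-point round-off

list_of_drawn_elements = [random_pick_fast(aList, cumulative) for _ in range(len(aList))]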
