Poisson simulation not working as expected? - python

I have a simple script to set up a Poisson distribution: it constructs an array of "events" each with probability 0.1, and then counts the number of successes in each group of 10. It almost works, but the distribution is not quite right (P(0) should equal P(1), but is instead about 90% of P(1)). It's like there's an off-by-one kind of error, but I can't figure out what it is. The script uses the Counter class from here (because I have Python 2.6 and not 2.7) and the grouping uses itertools as discussed here. It's not a stochastic issue: repeats give pretty tight results, and the overall mean and group size both look good. Any ideas where I've messed up?
from itertools import izip_longest
import numpy as np
import Counter

def groups(iterable, n=3, padvalue=0):
    "groups('abcde', 3, 'x') --> ('a','b','c'), ('d','e','x')"
    return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

def event():
    f = 0.1
    r = np.random.random()
    if r < f:
        return 1
    return 0

L = [event() for i in range(100000)]
rL = [sum(g) for g in groups(L, n=10)]

print len(rL)
print sum(list(L))

C = Counter.Counter(rL)
for i in range(max(C.keys()) + 1):
    print str(i).rjust(2), C[i]
$ python script.py
10000
9949
0 3509
1 3845
2 1971
3 555
4 104
5 15
6 1
$ python script.py
10000
10152
0 3417
1 3879
2 1978
3 599
4 115
5 12

I did a combinatorial reality check on your math, and it looks like your results are actually correct. P(0) should not be roughly equal to P(1):
.9^10 = 0.34867844 = probability of 0 events
.1 * .9^9 * (10 choose 1) = .1 * .9^9 * 10 = 0.387420489 = probability of 1 event
I wonder if you accidentally did your math thusly:
.1 * .9^10 * (10 choose 1) = 0.34867844 = incorrect probability of 1 event
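For anyone who wants to double-check that arithmetic, here is a small sketch of my own (not part of the original answer) that computes the exact binomial probabilities for n = 10, p = 0.1; it assumes Python 3.8+ for math.comb:
from math import comb

n, p = 10, 0.1
for k in range(4):
    # exact binomial probability of k successes, and the expected count out of 10000 groups
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(k, round(prob, 6), round(prob * 10000))
# P(0) ~ 0.348678 and P(1) ~ 0.387420, matching the observed ~3500 vs ~3850 counts above.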

A while loop time complexity

I'm interested in determining the big O time complexity of the following:
def f(x):
    r = x / 2
    d = 1e-10
    while abs(x - r**2) > d:
        r = (r + x/r) / 2
    return r
I believe this is O(log n). To arrive at this, I merely collected empirical data via the timeit module, plotted the results, and the plot looked logarithmic. Here is the code I used:
import timeit
import numpy as np
import matplotlib.pyplot as plt

ns = np.linspace(1, 50_000, 100, dtype=int)
ts = [timeit.timeit('f({})'.format(n),
                    number=100,
                    globals=globals())
      for n in ns]
plt.plot(ns, ts, 'or')
But this seems like a corny way to go about figuring this out. Intuitively, I understand that the body of the while loop divides an expression by 2 some number of times, say k, until the while condition's expression drops below d. This repeated division by 2 gives something like 1/2^k, from which I can see where a log comes in when solving for k. I can't seem to write down a more explicit derivation, though. Any help?
This is Heron's (or the Babylonian) method for calculating the square root of a number: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
Big-O analysis of this requires a numerical-analysis approach. For more details, see the Wikipedia page above, or look up Heron's error convergence or fixed-point iteration (or see https://mathcirclesofchicago.org/wp-content/uploads/2015/08/johnson.pdf).
Broad strokes: write the relative error as e_n = (r_n - sqrt(x)) / sqrt(x). One iteration of the update then gives e_{n+1} = e_n**2 / (2*(e_n + 1)).
From that, e_{n+1} <= min(e_n**2 / 2, e_n / 2), so the error decreases quadratically, with the number of accurate digits effectively doubling each iteration.
What's different between this analysis and big-O is that the time it takes does NOT depend on the size of the input, but rather on the desired accuracy. So in terms of the input, this while loop is O(1), because its number of iterations is bounded by the accuracy, not the input.
In terms of accuracy, the error is bounded above by e_n < 2**(-n), so we need an n such that 2**(-n) < d, i.e. n > log_2(1/d). Assuming d < 1, n = ceil(log_2(1/d)) iterations suffice, so in terms of d the loop is O(log(1/d)).
EDIT: Some more info on error analysis of fixed point iteration http://www.maths.lth.se/na/courses/FMN050/media/material/part3_1.pdf
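To see the quadratic convergence numerically, here is a small sketch of my own (not part of the answer) that prints the error after each Heron iteration for x = 2:
# The number of correct digits roughly doubles per iteration once r is near sqrt(x).
x = 2.0
r = x / 2
for i in range(1, 7):
    r = (r + x / r) / 2
    print(i, abs(r - x ** 0.5))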
I believe you're correct that it's O(log n).
Here you can see the successive values of r when x = 100000:
1 50000
2 25001
3 12502
4 6255
5 3136
6 1584
7 823
8 472
9 342
10 317
11 316
12 316
(I've rounded them off because the fractions are not interesting).
What you can see is that it goes through two phases.
Phase 1 is when r is large. During these first few iterations, x/r is tiny compared to r. As a result, r + x/r is close to r, so (r + x/r) / 2 is approximately r/2. You can see this in the first 8 iterations.
Phase 2 is when it gets close to the final result. During the last few iterations, x/r is close to r, so r + x/r is close to 2 * r, so (r + x/r) / 2 is close to r. At this point we're just improving the approximation by small amounts. These iterations are not really very dependent on the magnitude of x.
Here's the succession for x = 1000000 (10x the above):
1 500000
2 250001
3 125002
4 62505
5 31261
6 15646
7 7855
8 3991
9 2121
10 1296
11 1034
12 1001
13 1000
14 1000
This time there are 10 iterations in Phase 1, then we again have 4 iterations in Phase 2.
The complexity of the algorithm is dominated by Phase 1, which is logarithmic because it's approximately dividing by 2 each time.
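For reference, here is a sketch (my own instrumentation, not from the answer) that reproduces the traces of successive r values shown above:
# Print the first `steps` values of r, starting from the initial guess x / 2,
# rounded as in the tables above.
def trace_heron(x, steps=12):
    r = x / 2
    for i in range(1, steps + 1):
        print(i, round(r))
        r = (r + x / r) / 2

trace_heron(100000)              # 1 50000, 2 25001, ..., 12 316
trace_heron(1000000, steps=14)   # 1 500000, 2 250001, ..., 14 1000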

Python Random Values with a Given Density

Say I want to build a maze with a certain probability of an obstacle at each position. This probability is determined by a density value ranging from 0 to 10, with 0 meaning "no chance", and 10 meaning "certain".
Does this Python code do what I want?
import random

obstacle_density = 10

if random.randint(0, 9) < obstacle_density:
    print("There is an obstacle")
I've tried various combinations of upper and lower bounds and inequalities, and this seems to do the job, but I'm suspicious. For one thing, there are 11 possible values of obstacle_density but only 10 possible outcomes from random.randint(0, 9).
Not super sure about your solution. It seems like it would work, though.
Here's how I would approach it, even if it is a bit redundant - I'd start with a table just for my own reference:
density | probability of obstacle
---------------------------------
0 | 0%
1 | 10%
2 | 20%
3 | 30%
4 | 40%
5 | 50%
6 | 60%
7 | 70%
8 | 80%
9 | 90%
10 | 100%
This seems to add up. I present two versions of a function which returns True or False depending on the density. In the first version, I use the density to create the associated weights to be used with random.choices (the total weight in this case would be 100). For example, if density = 3, then weights = [30, 70] - 30% to be True, 70% to be False.
def get_obstacle_state_version_1(density):
    from random import choices
    assert isinstance(density, int)
    assert density in range(0, 11)  # 0 - 10 inclusive
    true_weight = density * 10
    false_weight = 100 - true_weight
    weights = [true_weight, false_weight]
    return choices([True, False], weights=weights, k=1)[0]
Here's the second version, in which I use random.choice rather than random.choices. The latter always returns a list of samples, even if the sample size k is 1.
Here, the idea is the same, but basically the density influences the number of Trues and Falses that appear in the population to be sampled. For example, if density = 3, then random.choice would pick one element from a list of 30 Trues, and 70 Falses with a uniform distribution.
def get_obstacle_state_version_2(density):
    from random import choice
    assert isinstance(density, int)
    assert density in range(0, 11)  # 0 - 10 inclusive
    true_count = density * 10
    false_count = 100 - true_count
    return choice([True] * true_count + [False] * false_count)
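As a quick sanity check (my addition, not part of the answer), you can run either function many times and confirm that the observed frequency of True is close to density * 10%:
from collections import Counter

trials = 100_000
results = Counter(get_obstacle_state_version_1(3) for _ in range(trials))
print(results[True] / trials)   # roughly 0.30 for density = 3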
You should loop over the maze and assign a probability at each site, something like this:
probability = random.randint(0, 10) / 10
I have no idea what you mean by obstacle_density, so I am not gonna go there.
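Combining the rule from the question with this answer's "loop over the maze" idea, a minimal sketch (my own; build_maze is a hypothetical helper) could look like this:
import random

def build_maze(width, height, obstacle_density):
    # Each cell is an obstacle with probability obstacle_density / 10.
    return [[random.randint(0, 9) < obstacle_density for _ in range(width)]
            for _ in range(height)]

maze = build_maze(5, 3, obstacle_density=3)   # roughly 30% of cells are obstacles
for row in maze:
    print(row)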

python random binary list, evenly distributed

I have code to make a binary list of any length I want, with a random number of bits turned on:
import random

rand_binary_list = lambda n: [random.randint(0, 1) for b in range(1, n+1)]
rand_binary_list(10)
this returns something like this:
[0,1,1,0,1,0,1,0,0,0]
and if you run it a million times you'll get a bell-curve distribution, where sum(rand_binary_list(10)) comes out around 5 far more often than 1 or 10.
What I'd prefer is that having 1 bit turned on out of 10 is equally as likely as having half of them turned on. The number of bits turned on should be uniformly distributed.
I'm not sure how this can be done without compromising the integrity of the randomness. Any ideas?
EDIT:
I wanted to show this bell curve phenomenon explicitly so here it is:
>>> import random
>>> rand_binary_list = lambda n: [random.randint(0,1) for b in range(1,n+1)]
>>> counts = {0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0}
>>> for i in range(10000):
...     x = sum(rand_binary_list(10))
...     counts[x] = counts[x] + 1
...
>>> counts[0]
7
>>> counts[1]
89
>>> counts[2]
454
>>> counts[3]
1217
>>> counts[4]
2017
>>> counts[5]
2465
>>> counts[6]
1995
>>> counts[7]
1183
>>> counts[8]
460
>>> counts[9]
107
>>> counts[10]
6
see how the chances of getting 5 turned on are much higher than the chances of getting 1 bit turned on?
Something like this:
import random

def randbitlist(n=10):
    n_on = random.randint(0, n)
    n_off = n - n_on
    result = [1]*n_on + [0]*n_off
    random.shuffle(result)
    return result
The number of bits "on" should be uniformly distributed in [0, n] inclusive, and then those bits selected will be uniformly distributed throughout the list.
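To verify (my own check, not part of the answer), tally the number of set bits over many runs; with randbitlist the counts should come out roughly flat instead of bell-shaped:
from collections import Counter

# Each value 0..10 should occur about 10,000 times out of 110,000 runs.
counts = Counter(sum(randbitlist(10)) for _ in range(110_000))
for k in range(11):
    print(k, counts[k])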

Standard deviation of combinations of dice

I am trying to find the standard deviation of a sequence of numbers extracted from the combinations of 30 dice that sum up to 120. I am very new to Python, and this code makes the console freeze because the combinations are endless; I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied together the items of each combination in the result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy

dice = [1, 2, 3, 4, 5, 6]
subset = itertools.product(dice, repeat=30)

result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)

my_result = numpy.product(result, axis=1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where h[i][N] counts the sequences of i dice summing to N, f[i][N] is the sum of the product of faces over those sequences, and g[i][N] is the sum of the squared products:
f[i][N] = sum(k*f[i-1][N-k]) (1<=k<=6)
g[i][N] = sum(k^2*g[i-1][N-k])
h[i][N] = sum(h[i-1][N-k])
f[1][k] = k ( 1<=k<=6)
g[1][k] = k^2 ( 1<=k<=6)
h[1][k] = 1 ( 1<=k<=6)
Sample implementation:
import numpy as np

Nmax = 120
nmax = 30
min_value = 1
max_value = 6

# The intermediate results get really huge; to keep them exact we use Python big ints.
f = np.zeros((nmax+1, Nmax+1), dtype='object')
g = np.zeros((nmax+1, Nmax+1), dtype='object')
h = np.zeros((nmax+1, Nmax+1), dtype='object')

for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1

for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]

result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result over an unfiltered length of 6^30 ≈ 2*10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
Include more thinking to pre-treat the problem, e.g. on how to sample only those with sum 120.
Do a Monte Carlo simulation instead, i.e. don't sample all combinations, but only a random couple of thousand, to obtain a representative sample that determines the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30      # number of dice
M = 100000  # number of samples
S = 120     # required sum

result = [[random.randint(1, 6) for _ in xrange(N)] for _ in xrange(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1 000 000 samples to get roughly reproducible values for the std (1 digit) - takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*10^16 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only the 3, 4, 5, 6 frequencies
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # frequency of face 2
    a = 30 - (b + sum(s))                         # frequency of face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with possible permutations
print numpy.std(product)  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that the result is 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - it's not that easy: the samples do not all have the same probability, and that is the problem here. Each frequency pattern corresponds to a different number of possible orderings, which gives its weight, and that weight has to be considered before taking the standard deviation. I have drafted that in the code above.
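One way to fill in those TODOs (a sketch of my own, assuming Python 3): weight each frequency pattern by its multinomial coefficient, i.e. the number of orderings of the 30 dice with those face counts, and compute a weighted mean and standard deviation from that.
import itertools
from math import factorial

def multinomial(counts):
    # Number of distinct orderings of the dice with the given face counts.
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

weighted_sum = 0    # sum over patterns of weight * product
weighted_sq = 0     # sum over patterns of weight * product**2
total_weight = 0    # total number of ordered rolls summing to 120

for s in itertools.product(range(31), repeat=4):     # frequencies of faces 3, 4, 5, 6
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])     # frequency of face 2
    a = 30 - (b + sum(s))                            # frequency of face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        p = 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
        w = multinomial((a, b) + s)
        weighted_sum += w * p
        weighted_sq += w * p * p
        total_weight += w

mean = weighted_sum / total_weight
std = (weighted_sq / total_weight - mean**2) ** 0.5
print(std)   # should agree with the dynamic-programming result above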

Metropolis-Hastings accept-reject implementation

I've been reading about the Metropolis-Hastings (MH) algorithm. Theoretically, I understood how the algorithm works. Now, I am trying to implement the MH algorithm using python.
I came across the following notebook. It suits my problem exactly, since I want to fit my data with a straight line while taking the measurement errors on my data into account. I am going to paste the code that I am having difficulty understanding:
# initial m, b
m, b = 2, 0

# step sizes
mstep, bstep = 0.1, 10.

# how many steps?
nsteps = 10000

chain = []
probs = []
naccept = 0

print 'Running MH for', nsteps, 'steps'

# First point:
L_old = straight_line_log_likelihood(x, y, sigmay, m, b)
p_old = straight_line_log_prior(m, b)
prob_old = np.exp(L_old + p_old)

for i in range(nsteps):
    # step
    mnew = m + np.random.normal() * mstep
    bnew = b + np.random.normal() * bstep

    # evaluate probabilities
    # prob_new = straight_line_posterior(x, y, sigmay, mnew, bnew)
    L_new = straight_line_log_likelihood(x, y, sigmay, mnew, bnew)
    p_new = straight_line_log_prior(mnew, bnew)
    prob_new = np.exp(L_new + p_new)

    if (prob_new / prob_old > np.random.uniform()):
        # accept
        m = mnew
        b = bnew
        L_old = L_new
        p_old = p_new
        prob_old = prob_new
        naccept += 1
    else:
        # Stay where we are; m, b stay the same, and we append them
        # to the chain below.
        pass

    chain.append((b, m))
    probs.append((L_old, p_old))

print 'Acceptance fraction:', naccept/float(nsteps)
The code is simple and easy, but I have difficulty understanding how MH is being implemented.
My question is about the chain.append (the third line from the bottom). The author appends m and b whether they were accepted or rejected. Why? Shouldn't he append only the accepted points?
The following R code demonstrates why it is important to capture the rejected case:
# 20 samples from 0 or 1. 1 has an 80% probability of being chosen.
the.population <- sample(c(0,1), 20, replace = TRUE, prob=c(0.2, 0.8))
# Create a new sample that only catches changes
the.sample <- c(the.population[1])
# Loop though the.population,
# but only copy the.population to the.sample if the value changes
for( i in 2:length(the.population))
{
if(the.population[i] != the.population[i-1])
the.sample <- append(the.sample, the.population[i])
}
When this code runs, the.population gets 20 values, for example:
0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1
The probability of a 1 in this population is 16/20 or 0.8. Exactly the probability we expected...
The sample, on the other hand, which only records changes, looks like this:
0 1 0 1 0 1
The probability of a 1 in the sample is 3/6 or 0.5.
We are trying to build a distribution; rejecting the new value means that the old value is more likely than the new one. That needs to be captured so that our distribution is correct.
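The same demonstration translated into Python, for readers following the rest of this thread in that language (my own rendering, not from the answer):
import random

# Draw 10,000 values that are 1 with probability 0.8.  Keeping every draw
# preserves the 80/20 ratio; keeping only the changes pulls it towards 50/50.
population = [1 if random.random() < 0.8 else 0 for _ in range(10000)]
print(sum(population) / len(population))        # close to 0.8

changes_only = [population[0]] + [
    population[i] for i in range(1, len(population))
    if population[i] != population[i - 1]
]
print(sum(changes_only) / len(changes_only))    # close to 0.5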
From a quick reading of the algorithm description: When a candidate is rejected, it still counts as a step, but the value is the same as the old step. I.e. b, m are appended either way, but they only get updated (to bnew, mnew) in the case where the candidate is accepted.
