I'm practicing the Birthday Paradox problem in Python. I've run it a bunch of times, changing the number of random birthdays and the number of loop iterations, but the probability always comes out as either 0% or 100%, and I was unable to get an intermediate probability like 50%. Can someone help me look through my code and see what I did wrong? Thank you so much!
from random import randint
from datetime import datetime, timedelta
first_day_of_year = datetime(2017, 1, 1)
num_of_ppl = 45
birthdays = []
# get 45 random birthdays list
for i in range(num_of_ppl):
    new_birthday = first_day_of_year + timedelta(days = randint(0, 365))
    birthdays.append(new_birthday)
# find if there's matched birthdays, run 10000 times
dups = 0
duplicates = set()
for i in range(10000):
    for bday in birthdays:
        if birthdays.count(bday) > 1:
            duplicates.add(bday)
    if len(duplicates) >= 1:
        dups += 1
# calculate the probability
probability = dups/10000 * 100
print(probability)
The problem is that you generate the birthdays list only once, outside the loop, so all 10,000 iterations check the exact same list and the result is always 0% or 100%. If you generate a fresh birthdays list on each iteration, the probability comes out as expected. I also didn't see a need for the datetime or set objects, so I replaced them with ints and a bool without changing anything functionally. You can also use a list comprehension to generate the birthdays list in one line:
from random import randint
num_iterations = 10000
num_people = 45
num_duplicates_overall = 0
# generate a random birthday for each person, check if there was a duplicate,
# and repeat num_iterations times
for i in range(num_iterations):
    # start with a new, empty list every time.
    # get a list of random birthdays, of length num_people.
    birthdays = [randint(0, 365) for _ in range(num_people)]
    # Keep track of whether or not there was a duplicate for this iteration
    was_duplicate = False
    for bday in birthdays:
        if birthdays.count(bday) > 1:
            # We found a duplicate for this iteration, so we can stop checking
            was_duplicate = True
            break
    if was_duplicate:
        num_duplicates_overall += 1
probability = num_duplicates_overall / num_iterations
print(f"Probability: {probability * 100}%")
Output with num_iterations = 1000000 and num_people = 23:
Probability: 50.6452%
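For reference, the simulated value can be checked against the closed-form birthday probability. A minimal sketch (using 366 equally likely birthdays to match randint(0, 365) above):
def birthday_probability(n, days=366):
    # probability that at least two of n people share a birthday
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

print(f"{birthday_probability(23) * 100:.4f}%")  # about 50.6%, in line with the simulation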
Edit: Alternatively, there's this method to check for duplicates which is supposedly faster (but mainly I like it because it's on one line):
if len(birthdays) != len(set(birthdays)):
    num_duplicates_overall += 1
So, your code could look as simple as this:
from random import randint
num_iterations = 10000
num_people = 45
num_duplicates_overall = 0
for i in range(num_iterations):
    birthdays = [randint(0, 365) for _ in range(num_people)]
    if len(birthdays) != len(set(birthdays)):
        num_duplicates_overall += 1
probability = num_duplicates_overall / num_iterations
print(f"Probability: {probability * 100}%")
So, I'm working on this loop:
stuff_so_far = [intl_pt]
for i in range(0, num_pts - 1):
    rdm = random.randint(0, len(points_in_code) - 1)
    a = (stuff_so_far[i][0] + points_in_code[rdm][0]) // 2
    b = (stuff_so_far[i][1] + points_in_code[rdm][1]) // 2
    stuff_so_far.append((a, b))
Basically, what I want to achieve is to get a random index into "points_in_code" every time the code loops. It is doing that now, but what I want to know is: how do I make it not repeat a number? For example, if rdm gets set to 1 in the first iteration and 3 in the second, it can still be set to 1 again in the third iteration. How do I make sure it is never 1 again, for as long as the loop is running?
I've tried everything I know and searched online but found nothing. How do I make that happen without altering my code too much? (I'm new to programming.)
I know that each call to random.randint() creates a single random number; it does not magically produce a new, not-yet-used number every time the loop iterates.
You can use random.sample:
import random
points_in_code = [11, 4, 13, 18, 7, 12]  # just an example
num_pts = 4
indexes = random.sample(range(len(points_in_code)), num_pts - 1)
for rdm, i in zip(indexes, range(0, num_pts - 1)):
    pass  # rdm will be a random unique index
          # i will increase by 1 each iteration
You could also use enumerate(indexes) instead of zip(indexes, range(0, num_pts - 1)), but then you would need to swap the order of i and rdm.
See the documentation for random.sample for more info. See also info on zip and enumerate
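A quick sketch of that enumerate variant, reusing the example values above:
import random

points_in_code = [11, 4, 13, 18, 7, 12]
num_pts = 4
indexes = random.sample(range(len(points_in_code)), num_pts - 1)

# enumerate yields (position, value), so here i comes first and rdm second
for i, rdm in enumerate(indexes):
    print(i, rdm)  # i counts up from 0; rdm is a unique random index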
You can use a list to keep track of which random numbers you have already used. You only need to regenerate the random number if you get one you already had.
stuff_so_far = [intl_pt]
tracking_num = []  # keeps track of the random numbers that were already used
for i in range(num_pts - 1):  # small tip: you can write a for-loop without the leading zero
    # regenerate the random number while it is already contained in the 'tracking_num' list
    rdm = random.randint(0, len(points_in_code) - 1)
    while rdm in tracking_num:
        rdm = random.randint(0, len(points_in_code) - 1)
    tracking_num.append(rdm)  # remember this index so it cannot be picked again
    a = (stuff_so_far[i][0] + points_in_code[rdm][0]) // 2
    b = (stuff_so_far[i][1] + points_in_code[rdm][1]) // 2
    stuff_so_far.append((a, b))
Try the approach below:
import random as rnd
## Let's say you want at most 100
MAX = 100
arr = [i for i in range(MAX + 1)]
while True:
    curr = rnd.choice(arr)
    arr.remove(curr)
    print(curr)
    if len(arr) == 0:
        break
Your code:
stuff_so_far = [intl_pt]
choice_arr = [i for i in range(len(points_in_code))]
for i in range(0, num_pts - 1):
    rdm = random.choice(choice_arr)
    choice_arr.remove(rdm)
    if len(choice_arr) == 0:
        print("No more choices available")
        break
    a = (stuff_so_far[i][0] + points_in_code[rdm][0]) // 2
    b = (stuff_so_far[i][1] + points_in_code[rdm][1]) // 2
    stuff_so_far.append((a, b))
I am running 1,000 trials of random.choice() on a list of the numbers 0-11. How do I track the number of random selections needed before all 12 numbers have been selected at least once? (Many will be selected more than once.)
For instance, suppose the simulation yields the following sequence of random choices for a single trial: 2 5 6 8 2 9 11 10 6 3 1 9 7 10 0 7 0 7 4, where all 12 numbers have been selected at least once. The count for this example trial is 19. Collect the count for each trial of the simulation in a single list (ultimately consisting of 1,000 counts).
Here is a solution using collections.Counter as a container:
from collections import Counter
import random
nums = list(range(12))
n = 1000
counts = [0]*n
for trial in range(n):
    c = Counter()
    while len(c) < len(nums):
        c[random.choice(nums)] += 1
    counts[trial] = sum(c.values())  # c.total() in python ≥ 3.10
counts
Output:
[28, 24, 39, 27, 40, 36, ...] # 1000 elements
Distribution of the counts (histogram figure omitted).
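As a sanity check on those counts: the expected value for this coupon-collector-style problem is 12·H_12 = 12·(1 + 1/2 + ... + 1/12) ≈ 37.24, which the mean of the 1,000 simulated counts should approach:
from fractions import Fraction

# expected number of draws until all 12 values have appeared at least once
expected = 12 * sum(Fraction(1, k) for k in range(1, 13))
print(float(expected))  # ~37.24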
Maybe you can try using a set to store your results in a non-redundant way, while checking to see if all numbers have been used:
import random
guesses = set()
count = 0
for i in range(1000):
    count += 1
    guesses.add(random.randrange(0, 12))
    if len(guesses) == 12:
        break
print(count)
This will give you 1 count. A better method is outlined by mozway in their answer.
You can run the code many more times (100,000 in the example below) and collect the results in a list, then graph it like so (updated to use a while condition):
import random
import numpy as np
from matplotlib import pyplot as plt
counts = []
for i in range(100000):
    guesses = set()
    count = 0
    while len(guesses) != 12:
        count += 1
        guesses.add(random.randrange(0, 12))
    counts.append(count)
fig = plt.figure(figsize=(8, 6))
x = np.array([i for i in range(np.max(np.array(counts)))])
y = np.array([counts.count(i) for i in range(np.max(np.array(counts)))])
plt.bar(x, y)
plt.xlabel('Number of Guesses')
plt.ylabel('Frequency')
plt.show()
So [counts.count(i) for i in range(np.max(np.array(counts)))] gives you a list of how often the guessing game finished at each possible number of guesses. For example, the first values of the list are 0, because there is no way the game can finish in only one guess (it needs at least 12), while at position 25 (25 guesses) there are over 2,000 instances of that happening.
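As a side note, numpy can build that frequency list directly; a small sketch of an essentially equivalent way to get x and y from the counts list built above:
import numpy as np

counts_arr = np.array(counts)
y = np.bincount(counts_arr)   # y[i] = number of trials that finished in exactly i guesses
x = np.arange(len(y))         # the corresponding numbers of guesses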
I'm trying to write Python code to see how many coin tosses, on average, are required to get a sequences of N heads in a row.
The thing that I'm puzzled by is that the answers produced by my code don't match ones that are given online, e.g. here (and many other places) https://math.stackexchange.com/questions/364038/expected-number-of-coin-tosses-to-get-five-consecutive-heads
According to that, the expected number of tosses that I should need to get various numbers of heads in a row are: E(1) = 2, E(2) = 6, E(3) = 14, E(4) = 30, E(5) = 62. But I don't get those answers! For example, I get E(3) = 8, instead of 14. The code below runs to give that answer, but you can change n to test for other target numbers of heads in a row.
What is going wrong? Presumably there is some error in the logic of my code, but I confess that I can't figure out what it is.
You can see, run and make modified copies of my code here: https://trinket.io/python/17154b2cbd
Below is the code itself, outside of that runnable trinket.io page. Any help figuring out what's wrong with it would be greatly appreciated!
Many thanks,
Raj
P.S. The closest related question that I could find was this one: Monte-Carlo Simulation of expected tosses for two consecutive heads in python
However, as far as I can see, the code in that question does not actually test for two consecutive heads, but instead tests for a sequence that starts with a head and then at some later, possibly non-consecutive, time gets another head.
# Click here to run and/or modify this code:
# https://trinket.io/python/17154b2cbd
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3
possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
toss_sequence = []
seq_lengths_rec = []
for trial_num in range(0,num_trials):
    if (trial_num % 100) == 0:
        print 'Trial num', trial_num, 'out of', num_trials
        # (The free version of trinket.io uses Python2)
    target_reached = 0
    toss_num = 0
    while target_reached == 0:
        toss_num += 1
        random.shuffle(possible_tosses)
        this_toss = possible_tosses[0]
        #print([toss_num, this_toss])
        toss_sequence.append(this_toss)
        last_n_tosses = toss_sequence[-n:]
        #print(last_n_tosses)
        if last_n_tosses == target_seq:
            #print('Reached target at toss', toss_num)
            target_reached = 1
    seq_lengths_rec.append(toss_num)
print 'Average', sum(seq_lengths_rec) / len(seq_lengths_rec)
You don't re-initialize toss_sequence for each experiment, so you start every experiment with a pre-existing sequence of heads, having a 1 in 2 chance of hitting the target sequence on the first try of each new experiment.
Initializing toss_sequence inside the outer loop will solve your problem:
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 4
possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
seq_lengths_rec = []
for trial_num in range(0,num_trials):
    if (trial_num % 100) == 0:
        print('Trial num {} out of {}'.format(trial_num, num_trials))
        # (The free version of trinket.io uses Python2)
    target_reached = 0
    toss_num = 0
    toss_sequence = []
    while target_reached == 0:
        toss_num += 1
        random.shuffle(possible_tosses)
        this_toss = possible_tosses[0]
        #print([toss_num, this_toss])
        toss_sequence.append(this_toss)
        last_n_tosses = toss_sequence[-n:]
        #print(last_n_tosses)
        if last_n_tosses == target_seq:
            #print('Reached target at toss', toss_num)
            target_reached = 1
    seq_lengths_rec.append(toss_num)
print(sum(seq_lengths_rec) / len(seq_lengths_rec))
You can simplify your code a bit, and make it less error-prone:
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3
possible_tosses = [ 'h', 't' ]
num_trials = 1000
seq_lengths_rec = []
for trial_num in range(0, num_trials):
    if (trial_num % 100) == 0:
        print('Trial num {} out of {}'.format(trial_num, num_trials))
        # (The free version of trinket.io uses Python2)
    heads_counter = 0
    toss_counter = 0
    while heads_counter < n:
        toss_counter += 1
        this_toss = random.choice(possible_tosses)
        if this_toss == 'h':
            heads_counter += 1
        else:
            heads_counter = 0
    seq_lengths_rec.append(toss_counter)
print(sum(seq_lengths_rec) / len(seq_lengths_rec))
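For reference, the simulated averages can be compared against the closed-form values quoted in the question (E(1)=2, E(2)=6, E(3)=14, E(4)=30, E(5)=62), which for a fair coin follow E(n) = 2^(n+1) - 2:
# expected number of fair-coin tosses until n heads in a row
for n in range(1, 6):
    print(n, 2 ** (n + 1) - 2)  # 2, 6, 14, 30, 62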
We can eliminate one additional loop by running each experiment with a long enough (ideally infinite) sequence of tosses, e.g., tossing a coin n=1000 times per trial. It is then very likely that a run of 5 heads appears somewhere in each such trial; if it does, we call the trial effective, otherwise we reject it.
In the end, we take the average number of tosses needed over the effective trials (by the law of large numbers this approximates the expected number of tosses). Consider the following code:
import random

N = 100000 # total number of trials
n = 1000   # long enough sequence of tosses
k = 5      # k heads in a row
ntosses = []
pat = ''.join(['1']*k)
effective_trials = 0
for i in range(N): # num of trials
    seq = ''.join(map(str, random.choices(range(2), k=n)))  # toss a coin n times (long enough)
    if pat in seq:
        ntosses.append(seq.index(pat) + k)
        effective_trials += 1
print(effective_trials, sum(ntosses) / effective_trials)
# 100000 62.19919
Notice that the result may not be correct if n is small, since a finite sequence only approximates an infinite number of coin tosses (to find the expected number of tosses needed to obtain 5 heads in a row, n=1000 is fine, since the actual expected value is 62).
I am trying to find the standard deviation of a sequence of numbers extracted from combinations of 30 dice that sum up to 120. I am very new to Python; this code makes the console freeze because the number of combinations is enormous, and I am not sure how to fit the computation into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items within each combination in the result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy
dice = [1,2,3,4,5,6]
subset = itertools.product(dice, repeat = 30)
result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)
my_result = numpy.product(result, axis = 1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where h[i][N] is the number of ordered ways i dice can sum to N, f[i][N] is the sum of the products of the dice over all those ways, and g[i][N] is the sum of the squared products:
f[i][N] = sum(k * f[i-1][N-k])     (1 <= k <= 6)
g[i][N] = sum(k^2 * g[i-1][N-k])
h[i][N] = sum(h[i-1][N-k])
f[1][k] = k      (1 <= k <= 6)
g[1][k] = k^2    (1 <= k <= 6)
h[1][k] = 1      (1 <= k <= 6)
Sample implementation:
import numpy as np
Nmax = 120
nmax = 30
min_value = 1
max_value = 6
f = np.zeros((nmax+1, Nmax+1), dtype ='object')
g = np.zeros((nmax+1, Nmax+1), dtype ='object') # the intermediate results will be really huge, to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype ='object')
for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1
for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            if N - k < 1:
                continue  # skip impossible sums (also avoids numpy wrapping negative indices)
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result over an unfiltered space of 6^30 ≈ 2·10^23 combinations, impossible to handle as such.
There are two possibilities that can be combined:
1. Put more thought into pre-treating the problem, e.g. how to sample only those combinations with sum 120.
2. Do a Monte Carlo simulation instead, i.e. don't sample all combinations, but only a random few thousand, to obtain a representative sample that determines the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30 # number of dice
M = 100000 # number of samples
S = 120 # required sum
result = [[random.randint(1,6) for _ in xrange(N)] for _ in xrange(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1,000,000 samples to get roughly reproducible values for the std (to 1 digit) - that takes my PC about 20 seconds, still considerably less than a million years :-D.
Is a number like 3.22*10^16 what you are looking for?
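A minimal sketch of that Monte Carlo estimate (my own sketch of the idea, not the answer's original code; the sample size is arbitrary). Only rolls summing to 120 are kept, and the std of their products should approach the exact value from the other answer (on the order of 3.2·10^16):
import random
import numpy as np

N = 30          # number of dice
M = 1_000_000   # number of Monte Carlo samples
S = 120         # required sum

products = []
for _ in range(M):
    roll = [random.randint(1, 6) for _ in range(N)]
    if sum(roll) == S:
        products.append(np.prod(np.array(roll, dtype=np.float64)))

print(len(products), np.std(products))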
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only the 3, 4, 5, 6 frequencies
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # frequency of the 2s
    a = 30 - (b + sum(s))                         # frequency of the 1s
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with the number of possible permutations
print numpy.std(product)  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - it's not that easy: the samples do not all have the same probability - that is the problem here. Each frequency combination corresponds to a different number of orderings (permutations), which gives its weight, and that weight has to be taken into account before computing the standard deviation. I have sketched that in the code above.
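A sketch of how those weights could be filled in, using the multinomial coefficient 30!/(a!·b!·s0!·s1!·s2!·s3!) as the number of orderings for each frequency combination, together with a weighted mean and variance in place of a plain numpy.std (this is my own completion of the TODOs above, not the original code). If the weighting is right, the result should agree with the exact value from the other answer:
import itertools
from math import factorial
import numpy as np

def orderings(a, b, s):
    # number of distinct ordered dice sequences with these face-value frequencies
    denom = factorial(a) * factorial(b)
    for freq in s:
        denom *= factorial(freq)
    return factorial(30) // denom

products = []
weights = []
for s in itertools.product(range(31), repeat=4):   # frequencies of the faces 3, 4, 5, 6
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])   # frequency of the face 2
    a = 30 - (b + sum(s))                          # frequency of the face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        products.append(2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3])
        weights.append(orderings(a, b, s))

p = np.array(products, dtype=np.float64)
w = np.array(weights, dtype=np.float64)
mean = np.average(p, weights=w)
std = np.sqrt(np.average((p - mean) ** 2, weights=w))
print(std)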
I have two large vectors (~133,000 values each) of different lengths. Each is sorted from small to large values. I want to find values that are similar within a given tolerance. This is my solution, but it is very slow. Is there a way to speed this up?
import numpy as np
for lv in range(np.size(vector1)):
    for lv_2 in range(np.size(vector2)):
        if np.abs(vector1[lv_2]-vector2[lv])<.02:
            print(vector1[lv_2],vector2[lv],lv,lv_2)
            break
Your algorithm is far from optimal: you compare way too many values. Assume you are at a certain position in vector1 and the current value in vector2 is already more than 0.02 bigger - why compare against the rest of vector2?
Start with something like
pos1 = 0
pos2 = 0
Now compare the values at those positions in your vectors. If the difference is too big, move the position of the smaller one forward and check again. Continue until you reach the end of one vector.
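A minimal sketch of that two-pointer idea (untested against your data; like the answer below, it reports matches as the pointers advance rather than enumerating every possible close pair):
def find_close_pairs(vector1, vector2, tol=0.02):
    # both vectors are assumed to be sorted in ascending order
    pos1, pos2 = 0, 0
    matches = []
    while pos1 < len(vector1) and pos2 < len(vector2):
        diff = vector1[pos1] - vector2[pos2]
        if abs(diff) < tol:
            matches.append((vector1[pos1], vector2[pos2], pos1, pos2))
            pos1 += 1          # record the match, then move one pointer on
        elif diff < 0:
            pos1 += 1          # vector1 value is smaller, move it forward
        else:
            pos2 += 1          # vector2 value is smaller, move it forward
    return matches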
I haven't tested it, but the following should work. The idea is to exploit the fact that the vectors are sorted:
lv_1, lv_2 = 0, 0
while lv_1 < len(vector1) and lv_2 < len(vector2):
    if np.abs(vector1[lv_1] - vector2[lv_2]) < .02:
        print(vector1[lv_1], vector2[lv_2], lv_1, lv_2)
        lv_1 += 1
        lv_2 += 1
    elif vector1[lv_1] < vector2[lv_2]:
        lv_1 += 1
    else:
        lv_2 += 1
The following code gives a nice increase in performance that depends upon how dense the numbers are. Using a set of 1000 random numbers, sampled uniformly between 0 and 100, it runs about 30 times faster than your implementation.
pos1_start = 0
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
The timing:
time new method: 0.112464904785
time old method: 3.59720897675
Which is produced by the following script:
import random
import numpy as np
import time
# initialize the vectors to be compared
vector1 = [random.uniform(0, 40) for i in range(1000)]
vector2 = [random.uniform(0, 40) for i in range(1000)]
vector1.sort()
vector2.sort()
# the arrays that will contain the results for the first method
results1 = []
# the arrays that will contain the results for the second method
results2 = []
pos1_start = 0
t_start = time.time()
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
t1 = time.time() - t_start
print "time new method:", t1
t = time.time()
for lv1 in range(np.size(vector1)):
    for lv2 in range(np.size(vector2)):
        if np.abs(vector1[lv1]-vector2[lv2])<.02:
            results2 += [(vector1[lv1], vector2[lv2], lv1, lv2)]
t2 = time.time() - t
print "time old method:", t2
# sort the results
results1.sort()
results2.sort()
print np.allclose(results1, results2)