Different forms of genetic algorithm - Python

I wrote code that implements a simple genetic algorithm to maximize:
f(x) = 15x - x^2
The function has its maximum at x = 7.5, so the code should output 7 or 8, since the individuals are integers.
When I run the code 10 times, I get 7 or 8 about three times out of 10.
What modifications should I make to improve the algorithm, and what are the different types of genetic algorithms?
Here is the code:
from random import *
import numpy as np
# fitness function
def fit(x):
    return 15*x - x**2
# convert binary list to decimal number
def to_dec(x):
    return int("".join(str(e) for e in x), 2)
# picks pairs from the original population
def gen_pairs(populationl, prob):
    pairsl = []
    test = [0, 1, 2, 3, 4, 5]
    for i in range(3):
        pair = []
        for j in range(2):
            temp = np.random.choice(test, p=prob)
            pair.append(populationl[temp].copy())
        pairsl.append(pair)
    return pairsl
# mating function
def cross_over(prs, mp):
    new = []
    for pr in prs:
        if mp[prs.index(pr)] == 1:
            index = np.random.choice([1,2,3], p=[1/3, 1/3, 1/3])
            pr[0][:index], pr[1][:index] = pr[1][:index], pr[0][:index]
    for pr in prs:
        new.append(pr[0])
        new.append(pr[1])
    return new
# mutation
def mutation(x):
    for chromosome in x:
        for gene in chromosome:
            mutation_prob = np.random.choice([0, 1], p=[0.999, .001])
            if mutation_prob == 1:
                #m_index = np.random.choice([0,1,2,3])
                if gene == 0:
                    gene = 1
                else:
                    gene = 0
# generate initial population
randlist = lambda n:[randint(0,1) for b in range(1, n+1)]
for j in range(10):
    population = [randlist(4) for i in range(6)]
    for _ in range(20):
        fittness = [fit(to_dec(y)) for y in population]
        s = sum(fittness)
        prob = [e/s for e in fittness]
        pairsg = gen_pairs(population.copy(), prob)
        mating_prob = []
        for i in pairsg:
            mating_prob.append(np.random.choice([0,1], p=[0.4,0.6]))
        new_population = cross_over(pairsg, mating_prob)
        mutated = mutation(new_population)
        decimal_p = [to_dec(i)for i in population]
        decimal_new = [to_dec(i)for i in new_population]
        # print(decimal_p)
        # print(decimal_new)
        population = new_population
    print(decimal_new)

This is a very typical situation with evolutionary algorithms. Success rate is a common metric, and 30% is a decent result.
As an example, I recently implemented a GP/GE solver for the Santa Fe Trail problem, and its success rate is 30% or less.
How to improve success rate
What follows is a personal interpretation of the problem, based on limited experience.
An evolutionary algorithm fails to find a solution close to the global optimum when it converges around a local optimum or gets stuck on a large plateau, and its population lacks the diversity needed to escape the trap by finding a better region.
You can try to give your algorithm more diversity by increasing the population size. You can also look into techniques such as novelty search and quality diversity.
By the way, here is a very nice interactive demonstration of novelty search vs. fitness search: http://eplex.cs.ucf.edu/noveltysearch/userspage/demo.html
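To see how population size affects the success rate, a minimal sketch could look like the following. It is not the code from the question: run_ga, success_rate and their parameters are names made up for this illustration, "success" is assumed to mean that the best individual of the final generation decodes to 7 or 8, and a tiny epsilon is added to the selection weights so that a population with zero total fitness does not break the selection. Note that the mutation step flips bits by index; assigning to the loop variable (gene = 1) only rebinds a local name and leaves the chromosome unchanged.

```python
import random

def fit(x):
    # Fitness function from the question: f(x) = 15x - x^2
    return 15 * x - x ** 2

def to_dec(bits):
    # Convert a list of 0/1 bits to an integer
    return int("".join(map(str, bits)), 2)

def run_ga(pop_size=6, n_bits=4, generations=20, p_cross=0.6, p_mut=0.001, rng=None):
    """Hypothetical helper: run one GA and return the best final chromosome."""
    rng = rng or random.Random()
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Roulette-wheel weights; the epsilon is an assumption that keeps
        # the weights valid when every individual has fitness 0.
        weights = [fit(to_dec(c)) + 1e-9 for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = (list(c) for c in rng.choices(pop, weights=weights, k=2))
            if rng.random() < p_cross:
                cut = rng.randint(1, n_bits - 1)
                a[:cut], b[:cut] = b[:cut], a[:cut]
            new_pop += [a, b]
        new_pop = new_pop[:pop_size]
        for chrom in new_pop:
            for i in range(n_bits):
                if rng.random() < p_mut:
                    chrom[i] ^= 1  # flip the bit in place, by index
        pop = new_pop
    return max(pop, key=lambda c: fit(to_dec(c)))

def success_rate(pop_size, runs=200, seed=0):
    """Fraction of runs whose best individual decodes to 7 or 8."""
    rng = random.Random(seed)
    hits = sum(to_dec(run_ga(pop_size=pop_size, rng=rng)) in (7, 8) for _ in range(runs))
    return hits / runs

if __name__ == "__main__":
    for size in (6, 20, 50):
        print(size, success_rate(size))
```

Comparing the printed rates for a few population sizes gives a rough feel for how much extra diversity helps on this problem; the exact numbers depend on the seed and on the parameters chosen here.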


How to efficiently process a list that is continuously being appended with new items in Python

Objective:
To visualize the population size of a particular organism over finite time.
Assumptions:
The organism has a life span of age_limit days
Only females older than day_lay_egg days can lay eggs, and a female may lay eggs at most max_lay_egg times. In each breeding session at most egg_no eggs are laid, each with a 50% probability of producing male offspring.
The initial population of 3 organisms consists of 2 females and 1 male
Code Snippets:
Currently, the code below produces the expected output
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
def get_breeding(d,**kwargs):
    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
        nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
        npol=[dict(s=x,d=d['d'], lay_egg=0, dborn=0) for x in nums]
        d['lay_egg'] = d['lay_egg'] + 1
        return d,npol
    return d,None
def to_loop_initial_population(**kwargs):
    npol=kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:
        # print(f'Executing day {nday}')
        k = []
        for dpol in npol:
            dpol['d'] += 1
            dpol['dborn'] += 1
            dpol,h = get_breeding(dpol,**kwargs)
            if h is None and dpol['dborn'] <= kwargs['age_limit']:
                # If beyond the age limit, ignore the parent and update only the decedent
                k.append(dpol)
            elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
                # If below age limit, append the parent and its offspring
                h.extend([dpol])
                k.extend(h)
        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k
    return total_population_per_day
## Some spec and store all setting in a dict
numsex=[1,1,0] # 0: Male, 1: Female
# s: sex, d: day, lay_egg: Number of time the female lay an egg, dborn: The organism age
ipol=[dict(s=x,d=0, lay_egg=0, dborn=0) for x in numsex] # The initial population
age_limit = 45 # Age limit for the species
egg_no=3 # Number of eggs
day_lay_egg = 30 # Matured age for egg laying
nday_limit=360
max_lay_egg=2
para=dict(nday_limit=nday_limit,ipol=ipol,age_limit=age_limit,
          egg_no=egg_no,day_lay_egg=day_lay_egg,max_lay_egg=max_lay_egg)
dpopulation = to_loop_initial_population(**para)
### make some plot
df = pd.DataFrame(dpopulation)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()
Output:
Problem/Question:
The execution time increases exponentially with nday_limit. I need to improve the efficiency of the code. How can I speed it up?
Other Thoughts:
I am tempted to apply joblib as below. To my surprise, the execution time is worse.
import itertools
from joblib import Parallel, delayed
def djob(dpol,k,**kwargs):
    dpol['d'] = dpol['d'] + 1
    dpol['dborn'] = dpol['dborn'] + 1
    dpol,h = get_breeding(dpol,**kwargs)
    if h is None and dpol['dborn'] <= kwargs['age_limit']:
        # If beyond the age limit, ignore the that particular subject
        k.append(dpol)
    elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
        # If below age limit, append the parent and its offspring
        h.extend([dpol])
        k.extend(h)
    return k
def to_loop_initial_population(**kwargs):
    npol=kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:
        k = []
        njob=1 if len(npol)<=50 else 4
        if njob==1:
            print(f'Executing day {nday} with single cpu')
            for dpols in npol:
                k=djob(dpols,k,**kwargs)
        else:
            print(f'Executing day {nday} with single parallel')
            k=Parallel(n_jobs=-1)(delayed(djob)(dpols,k,**kwargs) for dpols in npol)
            k = list(itertools.chain(*k))
            ll=1
        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k
    return total_population_per_day
for nday_limit=365
Your code looks alright overall, but I can see several points that slow it down significantly.
Note, though, that you cannot completely avoid the slowdown as nday_limit grows, since the population you need to track keeps growing and you rebuild the tracking list every day. As the number of objects increases, each loop is expected to take longer, but you can reduce the time a single loop takes.
elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
Here you check the type of h on every iteration, right after checking whether it is None. At that point h can only be a list; if the list could not be created, the code would already have raised an error before reaching this line.
Furthermore, the age check on dpol is duplicated, and you first extend h by dpol and then extend k by h, which is redundant. Together with the previous issue, this can be simplified to:
if dpol['dborn'] <= kwargs['age_limit']:
    k.append(dpol)
    if h:
        k.extend(h)
The results are identical.
Additionally, you're passing around a lot of **kwargs. This is a sign that your code should be a class instead, where some unchanging parameters are saved through self.parameter. You could even use a dataclass here (https://docs.python.org/3/library/dataclasses.html)
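As an illustration of that suggestion, the unchanging settings could be grouped roughly like this. This is only a sketch: SimParams and can_breed are names invented here, and the default values are simply the ones from the question.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimParams:
    """Hypothetical container for the settings that never change during a run."""
    age_limit: int = 45      # age limit for the species
    egg_no: int = 3          # eggs per breeding session
    day_lay_egg: int = 30    # maturity age for egg laying
    max_lay_egg: int = 2     # breeding-count limit, as checked in the question
    nday_limit: int = 360    # number of simulated days

def can_breed(organism, params):
    # Same condition as in the question's get_breeding, read from params
    # instead of **kwargs.
    return (organism['s'] == 1
            and organism['dborn'] > params.day_lay_egg
            and organism['lay_egg'] <= params.max_lay_egg)

params = SimParams(nday_limit=100)
female = dict(s=1, d=31, lay_egg=0, dborn=31)
print(can_breed(female, params))
```

Making the dataclass frozen is a design choice for this sketch; it signals that the parameters are read-only for the whole simulation.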
Also, you mix responsibilities of functions which is unnecessary and makes your code more confusing. For instance:
def get_breeding(d,**kwargs):
    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
        nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
        npol=[dict(s=x,d=d['d'], lay_egg=0, dborn=0) for x in nums]
        d['lay_egg'] = d['lay_egg'] + 1
        return d,npol
    return d,None
This code contains two responsibilities: Generating a new individual if conditions are met, and checking these conditions, and returning two different things based on them.
This would be better done through two separate functions, one which simply checks the conditions, and another that generates a new individual as follows:
def check_breeding(d, max_lay_egg, day_lay_egg):
    return d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg and d['s'] == 1
def get_breeding(d, egg_no):
    nums = np.random.choice([0, 1], size=egg_no, p=[.5, .5]).tolist()
    npol=[dict(s=x, d=d['d'], lay_egg=0, dborn=0) for x in nums]
    return npol
Where d['lay_egg'] could be updated in-place when iterating over the list if the condition is met.
You could speed up your code even further by editing the list in place as you iterate over it. This is not typically recommended, but it is perfectly fine if you know what you are doing: iterate by index, limit the loop to the length the list had before the iteration started, and decrement both the index and that bound when an element is removed.
Example:
i = 0
maxiter = len(npol)
while i < maxiter:
    if check_breeding(npol[i], max_lay_egg, day_lay_egg):
        npol.extend(get_breeding(npol[i], egg_no))
    if npol[i]['dborn'] > age_limit:
        npol.pop(i)
        i -= 1
        maxiter -= 1
    i += 1  # move on to the next element (or stay in place after a removal)
Which could significantly reduce processing time since you're not making a new list and appending all elements all over again every iteration.
Finally, you could check some population growth equation and statistical methods, and you could even reduce this whole code to a calculation problem with iterations, though that wouldn't be a sim anymore.
Edit
I've fully implemented my suggestions for improvements to your code and timed them in a Jupyter notebook using %%time. I've separated out the function definitions from both versions so they don't contribute to the time, and the results are telling. I also made females produce another female 100% of the time to remove randomness; otherwise it would be even faster. I compared the output of both versions to verify that they produce identical results (they do, but I removed the 'd_born' parameter because it is only ever set and never used).
Your implementation, with nday_limit=100 and day_lay_egg=15:
Wall time 23.5s
My implementation with same parameters:
Wall time 18.9s
So the difference is quite significant, and it grows even larger for larger nday_limit values.
Full implementation of edited code:
from dataclasses import dataclass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
#dataclass
class Organism:
    sex: int
    times_laid_eggs: int = 0
    age: int = 0
    def __init__(self, sex):
        self.sex = sex
def check_breeding(d, max_lay_egg, day_lay_egg):
    return d.times_laid_eggs <= max_lay_egg and d.age > day_lay_egg and d.sex == 1
def get_breeding(egg_no): # Make sure to change probabilities back to 0.5 and 0.5 before using it
    nums = np.random.choice([0, 1], size=egg_no, p=[0.0, 1.0]).tolist()
    npol = [Organism(x) for x in nums]
    return npol
def simulate(organisms, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit):
    npol = organisms
    nday = 0
    total_population_per_day = []
    while nday < nday_limit:
        i = 0
        maxiter = len(npol)
        while i < maxiter:
            npol[i].age += 1
            if check_breeding(npol[i], max_lay_egg, day_lay_egg):
                npol.extend(get_breeding(egg_no))
                npol[i].times_laid_eggs += 1
            if npol[i].age > age_limit:
                npol.pop(i)
                maxiter -= 1
                continue
            i += 1
        total_population_per_day.append(dict(nsize=len(npol), day=nday))
        nday += 1
    return total_population_per_day
if __name__ == "__main__":
    numsex = [1, 1, 0] # 0: Male, 1: Female
    ipol = [Organism(x) for x in numsex] # The initial population
    age_limit = 45 # Age limit for the species
    egg_no = 3 # Number of eggs
    day_lay_egg = 15 # Matured age for egg laying
    nday_limit = 100
    max_lay_egg = 2
    dpopulation = simulate(ipol, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit)
    df = pd.DataFrame(dpopulation)
    sns.lineplot(x="day", y="nsize", data=df)
    plt.xticks(rotation=15)
    plt.title('Day vs population')
    plt.show()
Try structuring your code as a matrix like state[age][eggs_remaining] = count instead. It will have age_limit rows and max_lay_egg columns.
Males start in the 0 eggs_remaining column, and every time a female lays an egg they move down one (3->2->1->0 with your code above).
For each cycle, you just drop the last row, iterate over all the rows after day_lay_egg, and insert a new first row with the number of newborn males and females.
If (as in your example) there only is a vanishingly small chance that a female would die of old age before laying all their eggs, you can just collapse everything into a state_alive[age][gender] = count and a state_eggs[eggs_remaining] = count instead, but it shouldn't be necessary unless the age goes really high or you want to run thousands of simulations.
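A rough sketch of this count-based idea is shown below. It is one possible reading of the suggestion, not a tested drop-in replacement: simulate_counts and its arguments are hypothetical names, the columns count how many times a female has already laid (rather than eggs remaining), and the breeding rule mirrors the question's checks (dborn > day_lay_egg and lay_egg <= max_lay_egg). The sexes of all eggs laid on one day are drawn with a single binomial sample, which is an assumption made here to avoid per-egg loops.

```python
import numpy as np

def simulate_counts(nday_limit, age_limit, egg_no, day_lay_egg, max_lay_egg,
                    n_females=2, n_males=1, seed=None):
    """Population size per day, tracking counts per age instead of individuals."""
    rng = np.random.default_rng(seed)
    # Column j holds females that have laid j times; the last column means "finished".
    n_cols = max_lay_egg + 2
    females = np.zeros((age_limit + 1, n_cols), dtype=np.int64)  # rows = age in days
    males = np.zeros(age_limit + 1, dtype=np.int64)
    females[0, 0] = n_females
    males[0] = n_males
    sizes = []
    for _ in range(nday_limit):
        # Everyone ages by one day; the oldest row falls off the end (dies).
        females = np.roll(females, 1, axis=0)
        females[0] = 0
        males = np.roll(males, 1)
        males[0] = 0
        # Mature females that have not finished laying breed once today
        # and move one column to the right.
        old = females[day_lay_egg + 1:].copy()
        n_breeders = int(old[:, :-1].sum())
        females[day_lay_egg + 1:, 0] = 0
        females[day_lay_egg + 1:, 1:] = old[:, :-1]
        females[day_lay_egg + 1:, -1] += old[:, -1]
        # One binomial draw decides how many of today's eggs are female.
        n_eggs = n_breeders * egg_no
        new_females = int(rng.binomial(n_eggs, 0.5))
        females[0, 0] += new_females
        males[0] += n_eggs - new_females
        sizes.append(int(females.sum() + males.sum()))
    return sizes

if __name__ == "__main__":
    sizes = simulate_counts(360, age_limit=45, egg_no=3, day_lay_egg=30, max_lay_egg=2, seed=0)
    print(sizes[-1])
```

The cost per day depends only on age_limit and max_lay_egg, not on the population size, so the running time grows linearly with nday_limit. The counts are int64, so extremely long runs could eventually overflow.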
Using NumPy array operations as much as possible instead of Python loops can improve performance; see the code below, tested in this notebook: https://www.kaggle.com/gfteafun/notebook03118c731b
Note that the nsize scale matters when comparing times.
%%time

# s: sex, d: day, lay_egg: Number of time the female lay an egg, dborn: The organism age
x = np.array([(x, 0, 0, 0) for x in numsex ] )
iparam = np.array([0, 1, 0, 1])

total_population_per_day = []
for nday in range(nday_limit):
    x = x + iparam
    c = np.all(x < np.array([2, nday_limit, max_lay_egg, age_limit]), axis=1) & np.all(x >= np.array([1, day_lay_egg, 0, day_lay_egg]), axis=1)
    total_population_per_day.append(dict(nsize=len(x[x[:,3]<age_limit, :]), day=nday))
    n = x[c, 2].shape[0]

    if n > 0:
        x[c, 2] = x[c, 2] + 1
        newborns = np.array([(x, nday, 0, 0) for x in np.random.choice([0, 1], size=egg_no, p=[.5, .5]) for i in range(n)])
        x = np.vstack((x, newborns))

df = pd.DataFrame(total_population_per_day)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()

Average time to hit a given line on 2D random walk on a unit grid

I am trying to simulate the following problem:
Given a 2D random walk (on a lattice grid) starting from the origin, what is the average waiting time to hit the line y=1-x?
import numpy as np
from tqdm import tqdm
N=5*10**3
results=[]
for _ in tqdm(range(N)):
    current = [0,0]
    step=0
    while (current[1]+current[0] != 1):
        step += 1
        a = np.random.randint(0,4)
        if (a==0):
            current[0] += 1
        elif (a==1):
            current[0] -= 1
        elif (a==2):
            current[1] += 1
        elif (a==3):
            current[1] -= 1
    results.append(step)
This code is slow even for N<10**4. I am not sure how to optimize it or change it to properly simulate the problem.
Instead of simulating a bunch of random walks sequentially, let's simulate multiple paths at the same time and track the probability of each one happening. For instance, we start at position 0 with probability 1:
states = {0+0j: 1}
and the possible moves along with their associated probabilities would be something like this:
moves = {1+0j: 0.25, 0+1j: 0.25, -1+0j: 0.25, 0-1j: 0.25}
# moves = {1: 0.5, -1:0.5} # this would basically be equivalent
With this construct we can move to the new states by going over every combination of state and move and updating the probabilities accordingly:
def simulate_one_step(current_states):
    newStates = {}
    for cur_pos, prob_of_being_here in current_states.items():
        for movement_dist,prob_of_moving_this_way in moves.items():
            newStates.setdefault(cur_pos+movement_dist, 0)
            newStates[cur_pos+movement_dist] += prob_of_being_here*prob_of_moving_this_way
    return newStates
Then we just iterate this popping out all winning states at each step:
for stepIdx in range(1, 100):
    states = simulate_one_step(states)
    winning_chances = 0
    # iterate over a copy of the items so we can delete cases out of states as we go.
    for pos, prob in set(states.items()):
        # if y = 1-x
        if pos.imag == 1 - pos.real:
            winning_chances += prob
            # we no longer consider this a state that propagates because the path stops here.
            del states[pos]
    print(f"probability of winning after {stepIdx} moves is: {winning_chances}")
You could also inspect states to get an idea of the distribution of possible positions, although totalling it in terms of distance from the line simplifies the data. Anyway, the final step is to weight each step count by the probability of finishing in exactly that many steps, sum the results, and see whether the sum converges:
total_average_num_moves += stepIdx * winning_chances
But we might be able to gather more insight by using symbolic variables! (note I'm simplifying this to a 1D problem which I describe how at the bottom)
import sympy
x = sympy.Symbol("x") # will sub in 1/2 later
moves = {
    1: x, # assume x is the chances for us to move towards the target
    -1: 1-x # and therefore 1-x is the chance of moving away
}
This with the exact code as written above gives us this sequence:
probability of winning after 1 moves is: x
probability of winning after 2 moves is: 0
probability of winning after 3 moves is: x**2*(1 - x)
probability of winning after 4 moves is: 0
probability of winning after 5 moves is: 2*x**3*(1 - x)**2
probability of winning after 6 moves is: 0
probability of winning after 7 moves is: 5*x**4*(1 - x)**3
probability of winning after 8 moves is: 0
probability of winning after 9 moves is: 14*x**5*(1 - x)**4
probability of winning after 10 moves is: 0
probability of winning after 11 moves is: 42*x**6*(1 - x)**5
probability of winning after 12 moves is: 0
probability of winning after 13 moves is: 132*x**7*(1 - x)**6
And if we ask the OEIS what the sequence 1,2,5,14,42,132... means it tells us those are Catalan numbers with the formula of (2n)!/(n!(n+1)!) so we can write a function for the non-zero terms in that series as:
f(n,x) = (2n)! / (n! * (n+1)!) * x^(n+1) * (1-x)^n
or in actual code:
import math
def probability_of_winning_after_2n_plus_1_steps(n, prob_of_moving_forward = 0.5):
    return (math.factorial(2*n)/math.factorial(n)/math.factorial(n+1)
            * prob_of_moving_forward**(n+1) * (1-prob_of_moving_forward)**n)
which now gives us a nearly instant way of calculating the relevant quantities for any length or, more usefully, lets us ask Wolfram Alpha what the average would be (it diverges).
Note that we can simplify this to a 1D problem by considering x+y as one variable: we start at x+y = 0, each move increases or decreases x+y by 1 with equal chance, and we are interested in when x+y = 1, which is exactly the line y = 1-x. This means we can consider the 1D case by substituting z = x+y.
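To see the divergence numerically, one can accumulate the partial sums of the Catalan-number formula for x = 1/2. The sketch below is only an illustration; hitting_time_partial_sums is a hypothetical name. It uses the ratio of consecutive terms, p(n+1)/p(n) = (2n+1)/(2(n+2)), which follows from the formula above, so that no large factorials are needed.

```python
def hitting_time_partial_sums(n_terms):
    """Return (total probability, partial expected step count) over the first n_terms terms.

    Term n is the probability of first hitting the line after 2n+1 steps,
    for a symmetric walk (x = 1/2).
    """
    p = 0.5              # n = 0: hit on the very first step
    total_prob = 0.0
    partial_mean = 0.0
    for n in range(n_terms):
        total_prob += p
        partial_mean += (2 * n + 1) * p
        p *= (2 * n + 1) / (2 * (n + 2))   # ratio of consecutive terms
    return total_prob, partial_mean

if __name__ == "__main__":
    for n_terms in (10**2, 10**4, 10**6):
        print(n_terms, hitting_time_partial_sums(n_terms))
```

The total probability creeps towards 1 (the walk hits the line eventually), while the partial expected value keeps growing roughly like the square root of the number of terms, which is consistent with a divergent mean.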
Vectorisation would result in much faster code, roughly 90K times faster. Here is a function that generates a trajectory on the 2D grid with unit steps, starting from (0,0), and returns the step at which it first hits the line y=1-x.
import numpy as np
def _random_walk_2D(sim_steps):
    """ Walk on 2D unit steps
    return x_sim, y_sim, trajectory, number_of_steps_first_hit to y=1-x """
    random_moves_x = np.insert(np.random.choice([1,0,-1], sim_steps), 0, 0)
    random_moves_y = np.insert(np.random.choice([1,0,-1], sim_steps), 0, 0)
    x_sim = np.cumsum(random_moves_x)
    y_sim = np.cumsum(random_moves_y)
    trajectory = np.array((x_sim,y_sim)).T
    y_hat = 1-x_sim # checking if hit y=1-x
    y_hit = y_hat-y_sim
    hit_steps = np.where(y_hit == 0)
    number_of_steps_first_hit = -1
    if hit_steps[0].shape[0] > 0:
        number_of_steps_first_hit = hit_steps[0][0]
    return x_sim, y_sim, trajectory, number_of_steps_first_hit
If number_of_steps_first_hit is -1, the trajectory did not hit the line.
A longer simulation with more repeats might give the average behaviour, but the following run suggests that, when the walk does not escape to infinity, it hits the line after ~84 steps on average.
sim_steps= 5*10**3 # 5K steps
#Repeat
nrepeat = 40000
hit_step = [_random_walk_2D(sim_steps)[3] for _ in range(nrepeat)]
hit_step = [h for h in hit_step if h > -1]
np.mean(hit_step) # ~84 step
Much longer sim_steps will change the result though.
PS:
Good exercise. I hope this wasn't homework; if it was, please cite this answer if you use it.
Edit
As discussed in the comments, the current _random_walk_2D moves in 8 directions. To restrict it to the cardinal directions, we could apply the following filter:
cardinal_x_y = [(t[0], t[1]) for t in zip(random_moves_x, random_moves_y)
                if np.abs(t[0]) != np.abs(t[1])]
random_moves_x = [t[0] for t in cardinal_x_y]
random_moves_y = [t[1] for t in cardinal_x_y]
This slows the function down a bit, but it will still be much faster than for-loop solutions.
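An alternative that avoids the filtering step is to draw one of the four cardinal moves per step directly and simulate many walks at once. The following is a sketch under assumptions, not the answer's code: first_hit_steps is a hypothetical name, walks are truncated at max_steps, and -1 marks walks that did not reach the line within that budget.

```python
import numpy as np

def first_hit_steps(n_walks, max_steps, seed=None):
    """First step at which each walk reaches x + y == 1, or -1 if it never does."""
    rng = np.random.default_rng(seed)
    moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])   # the four cardinal moves
    idx = rng.integers(0, 4, size=(n_walks, max_steps))    # one move index per step
    steps = moves[idx]                                     # shape (n_walks, max_steps, 2)
    s = steps.sum(axis=2).cumsum(axis=1)                   # x + y after each step
    hit = s == 1
    first = hit.argmax(axis=1) + 1                         # 1-based step of the first hit
    first[~hit.any(axis=1)] = -1                           # no hit within max_steps
    return first

if __name__ == "__main__":
    first = first_hit_steps(n_walks=2000, max_steps=5000, seed=0)
    reached = first[first > 0]
    print(len(reached) / len(first), reached.mean())
```

Because the walks are truncated, the mean over the walks that did reach the line depends on max_steps and keeps growing as max_steps increases, in line with the remark that much longer sim_steps will change the result.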

How to speed up an N dimensional interval tree in python?

Consider the following problem: Given a set of n intervals and a set of m floating-point numbers, determine, for each floating-point number, the subset of intervals that contain the floating-point number.
This problem has been addressed by constructing an interval tree (sometimes also called a range tree or segment tree). Implementations exist for the one-dimensional case, e.g. Python's intervaltree package. Usually, these implementations consider one or a few floating-point numbers, i.e. a small "m" above.
In my problem setting, both n and m are extremely large (they come from an image processing problem). Further, I need to consider N-dimensional intervals (called cuboids when N=3, because I was modeling human brains with the Finite Element Method). I have implemented a simple N-dimensional interval tree in Python, but it runs in a loop and can only take one floating-point number at a time. Can anyone help improve the efficiency of the implementation? You can change the data structure freely.
import sys
import time
import numpy as np
# find the index of a satisfying x > a in one dimension
def find_index_smaller(a, x):
    idx = np.argsort(a)
    ss = np.searchsorted(a, x, sorter=idx)
    res = idx[0:ss]
    return res
# find the index of a satisfying x < a in one dimension
def find_index_larger(a, x):
    return find_index_smaller(-a, -x)
# find the index of a satisfying amin < x < amax in one dimension
def find_intv_at(amin, amax, x):
    idx = find_index_smaller(amin, x)
    idx2 = find_index_larger(amax[idx], x)
    res = idx[idx2]
    return res
# find the index of a satisfying amin < x < amax in N dimensions
def find_intv_at_nd(amin, amax, x):
    dim = amin.shape[0]
    res = np.arange(amin.shape[-1])
    for i in range(dim):
        idx = find_intv_at(amin[i, res], amax[i, res], x[i])
        res = res[idx]
    return res
I also have two test examples for sanity check and performance testing:
def demo1():
    print ("By default, we do a correctness test")
    n_intv = 2
    n_point = 2
    # generate the test data
    point = np.random.rand(3, n_point)
    intv_min = np.random.rand(3, n_intv)
    intv_max = intv_min + np.random.rand(3, n_intv)*8
    print ("point ")
    print (point)
    print ("intv_min")
    print (intv_min)
    print ("intv_max")
    print (intv_max)
    print ("===Indexes of intervals that contain the point===")
    for i in range(n_point):
        print (find_intv_at_nd(intv_min,intv_max, point[:, i]))
def demo2():
    print ("Performance:")
    n_points=100
    n_intv = 1000000
    # generate the test data
    points = np.random.rand(n_points, 3)*512
    intv_min = np.random.rand(3, n_intv)*512
    intv_max = intv_min + np.random.rand(3, n_intv)*8
    print ("point.shape = "+str(points.shape))
    print ("intv_min.shape = "+str(intv_min.shape))
    print ("intv_max.shape = "+str(intv_max.shape))
    starttime = time.time()
    for point in points:
        tmp = find_intv_at_nd(intv_min, intv_max, point)
    print("it took this long to run {} points, with {} intervals: {}".format(n_points, n_intv, time.time()-starttime))
My idea would be:
Remove np.argsort() from the algo, because the interval tree does not change, so sorting could have been done in pre-processing.
Vectorize x. The algo runs a loop for each x. It would be nice if we can get rid of the loop over x.
Any contribution would be appreciated.
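As a starting point for the second idea (vectorizing over x), here is a hedged sketch. It is not an interval tree: it swaps the sort-based search for plain broadcasting comparisons and processes the points in chunks to keep memory bounded. find_intv_batch and chunk are hypothetical names, amin and amax are assumed to have shape (N, n_intv) as in the code above, and points is assumed to have shape (n_points, N) as in demo2.

```python
import numpy as np

def find_intv_batch(amin, amax, points, chunk=16):
    """For each point, return the indexes of intervals with amin < x < amax in every dimension."""
    results = []
    for start in range(0, len(points), chunk):
        p = points[start:start + chunk]                        # (c, N)
        inside = np.ones((len(p), amin.shape[1]), dtype=bool)  # (c, n_intv)
        for d in range(amin.shape[0]):
            col = p[:, d, None]                                # (c, 1), broadcasts over intervals
            inside &= (amin[d] < col) & (col < amax[d])
        results.extend(np.flatnonzero(row) for row in inside)
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    intv_min = rng.random((3, 1000)) * 512
    intv_max = intv_min + rng.random((3, 1000)) * 8
    pts = rng.random((10, 3)) * 512
    print([len(r) for r in find_intv_batch(intv_min, intv_max, pts)])
```

Each chunk costs O(chunk * n_intv * N) comparisons, so this is brute force; whether it beats the per-point search depends on n, m and the available memory, and would need to be measured.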

Coding a secretary problem (Monte Carlo) - problems with python code

Trying to code the secretary problem in python by doing a Monte Carlo simulation (without using e). The essence of the problem is here: https://en.wikipedia.org/wiki/Secretary_problem
Described as: Imagine an administrator who wants to hire the best secretary out of n rankable applicants for a position. The applicants are interviewed one by one in random order. A decision about each particular applicant is to be made immediately after the interview. Once rejected, an applicant cannot be recalled. During the interview, the administrator can rank the applicant among all applicants interviewed so far but is unaware of the quality of yet unseen applicants. The question is about the optimal strategy (stopping rule) to maximize the probability of selecting the best applicant. Taken from: https://www.geeksforgeeks.org/secretary-problem-optimal-stopping-problem/
Table that I'm checking my code against:
Here is my python code so far:
n = 7; # of applicants
m = 10000; # of repeats
plot = np.zeros(1);
for i in range (2,m): #multiple runs
    array = np.random.randint(1,1000,n);
    for j in range(2,n): #over range of array
        test = 0;
        if array[j] > array[1] and array[j] == array.max():
            plot=plot+1
            test = 1;
            break
        if array[j]> array[1]:
            test = 2;
            break
print(plot/m)
print(array)
print("j = ",j)
print("test = ",test)
I am doing something wrong with my code here that I'm unable to replicate the table. In the above code I've tried to do 7 = number of applicants and take the best applicant after '2'.
The plot/m should output the percentage in column three given the number of applicants and 'take the best after'.
Answered! As below.
Additional code:
import numpy as np
import matplotlib.pyplot as plt
import time
plt.style.use('seaborn-whitegrid')
n = 150 #total number of applicants
nplot = np.empty([1,1])
#take = 3 #not necessary, turned into J below:
for k in range(2,n):
    m = 10000 #number of repeats
    plot = np.empty([1,1]);
    for j in range(1,k):
        passed = 0
        for i in range (0,m): #multiple runs
            array = np.random.rand(k);
            picked = np.argmax(array[j:]>max(array[0:j])) + j
            best = np.argmax(array)
            if best == picked:
                passed = passed+1
        #print(passed/m)
        plot = np.append(plot,[passed/m])
    #print(plot)
    plot = plot[1:];
    x = range(1,k);
    y = plot
    #print("N = ",k)
    print("Check ",plot.argmax()," if you have ",k," applicants", round(100* plot.max(),2),"% chance of finding the best applicant")
    nplot =np.append(nplot,plot.max())
# Plot:
nplot = nplot[1:];
x = range(2,n);
y = nplot
plt.plot(x, y, 'o', color='black');
plt.xlabel("Number of Applicants")
plt.ylabel("Probability of Best Applicant")
Here is something that seems to do the job and is a bit simpler. Comments:
Use argmax to determine who is the best secretary, or to pick the first that has a better grade than another group
Draw from a real-valued function to reduce the odds of having 2 secretaries having the same grade.
Hence:
import numpy as np
n = 7
take = 2
m = 100000
passed = 0
for i in range (0,m): #multiple runs
    array = np.random.rand(n);
    picked = np.argmax(array[take:]>max(array[0:take])) + take
    best = np.argmax(array)
    if best == picked:
        passed = passed+1
print(passed/m)
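To check a simulated value against theory, the success probability of the "skip the first take applicants, then pick the first one who beats them all" rule has a standard closed form, P = (take/n) * sum(1/i for i = take .. n-1), with P = 1/n for take = 0. A small sketch (exact_success_probability is a name chosen for this illustration):

```python
from fractions import Fraction

def exact_success_probability(n, take):
    """Exact chance of picking the best of n applicants when the first `take` are only observed."""
    if take == 0:
        return Fraction(1, n)   # the first applicant is hired, best with chance 1/n
    return Fraction(take, n) * sum(Fraction(1, i) for i in range(take, n))

if __name__ == "__main__":
    p = exact_success_probability(7, 2)
    print(p, float(p))
```

For n = 7 and take = 2 this gives 29/70, about 0.414, which the simulation above should approach as m grows.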

Standard deviation of combinations of dice

I am trying to find stdev for a sequence of numbers that were extracted from combinations of dice (30) that sum up to 120. I am very new to Python, so this code makes the console freeze because the numbers are endless and I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items in the list within result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy
dice = [1,2,3,4,5,6]
subset = itertools.product(dice, repeat = 30)
result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)
my_result = numpy.product(result, axis = 1).tolist()
std = numpy.std(my_result)
print(std)
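Note that D(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences. Here f[i][N] is the sum of the products over all sequences of i dice with total N, g[i][N] is the sum of the squared products, and h[i][N] is the number of such sequences.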
f[i][N] = sum(k*f[i-1][N-k]) (1<=k<=6)
g[i][N] = sum(k^2*g[i-1][N-k])
h[i][N] = sum(h[i-1][N-k])
f[1][k] = k ( 1<=k<=6)
g[1][k] = k^2 ( 1<=k<=6)
h[1][k] = 1 ( 1<=k<=6)
Sample implementation:
import numpy as np
Nmax = 120
nmax = 30
min_value = 1
max_value = 6
f = np.zeros((nmax+1, Nmax+1), dtype ='object')
g = np.zeros((nmax+1, Nmax+1), dtype ='object') # the intermediate results will be really huge, to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype ='object')
for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1
for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            if N - k < 0:
                continue # a negative index would wrap around to the end of the row
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
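A cheap way to gain confidence in these recurrences is to compare them with brute-force enumeration on a case small enough to enumerate. The sketch below does that; moments_dp, std_dp and std_brute are hypothetical names, and the tables are kept in dictionaries of Python integers instead of NumPy object arrays.

```python
import itertools
import math
from fractions import Fraction

def moments_dp(n_dice, target):
    """Return (h, f, g): count, sum of products, sum of squared products for the given total."""
    count, s1, s2 = {0: 1}, {0: 1}, {0: 1}   # zero dice: one empty sequence with product 1
    for _ in range(n_dice):
        nc, n1, n2 = {}, {}, {}
        for s in count:
            for k in range(1, 7):
                nc[s + k] = nc.get(s + k, 0) + count[s]
                n1[s + k] = n1.get(s + k, 0) + k * s1[s]
                n2[s + k] = n2.get(s + k, 0) + k * k * s2[s]
        count, s1, s2 = nc, n1, n2
    return count.get(target, 0), s1.get(target, 0), s2.get(target, 0)

def std_dp(n_dice, target):
    h, f, g = moments_dp(n_dice, target)
    variance = Fraction(g, h) - Fraction(f, h) ** 2   # exact until the final square root
    return math.sqrt(variance)

def std_brute(n_dice, target):
    prods = [math.prod(c) for c in itertools.product(range(1, 7), repeat=n_dice)
             if sum(c) == target]
    mean = sum(prods) / len(prods)
    return math.sqrt(sum((p - mean) ** 2 for p in prods) / len(prods))

if __name__ == "__main__":
    print(std_dp(3, 10), std_brute(3, 10))   # the two values should agree
    print(std_dp(30, 120))                   # the case asked about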
You are asking for a result over an unfiltered set of 6^30 ≈ 2*10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
(1) Think more about the problem up front, e.g. about how to sample only those combinations with sum 120.
(2) Do a Monte Carlo simulation instead, i.e. don't enumerate all combinations, but only a random sample of a few thousand, to obtain a representative sample that determines the std with sufficient accuracy.
Now, I only apply (2), giving the brute force code:
import random
N = 30 # number of dice
M = 100000 # number of samples
S = 120 # required sum
result = [[random.randint(1,6) for _ in range(N)] for _ in range(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
OK, if you are after the standard deviation of the product of the 30 dice, that is what your code does. I then need 1 000 000 samples to get roughly reproducible values for the std (1 digit), which takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*10^16 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy
def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
hits = range(31)
subset = itertools.product(hits, repeat=4) # only 3,4,5,6 frequencies
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3]) # 2 frequency
    a = 30 - (b + sum(s)) # 1 frequency
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1) # TODO: Replace 1 with possible permutations
print(numpy.std(product)) # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: The sampling is not of the same probability - that is the problem here. Each sample has a different number of possible combinations, giving its weight, which has to be considered before taking the std-deviation. I have drafted that in the code above.
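One way to fill in the two TODOs, sketched under assumptions: each frequency tuple (a, b, c3, c4, c5, c6) stands for n!/(a! b! c3! c4! c5! c6!) ordered dice sequences, so that multinomial coefficient can serve as its weight. weighted_std is a hypothetical name, and the constant 90 from the code above is generalised here to target - n_dice.

```python
import itertools
import math
from fractions import Fraction

def weighted_std(n_dice=30, target=120):
    """Std of the product of the dice over all ordered sequences with the given sum."""
    fact = [math.factorial(i) for i in range(n_dice + 1)]
    total = sum_p = sum_p2 = 0
    for s in itertools.product(range(n_dice + 1), repeat=4):   # counts of faces 3, 4, 5, 6
        b = (target - n_dice) - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # count of 2s
        a = n_dice - (b + sum(s))                                      # count of 1s
        if a < 0 or b < 0:
            continue
        # Number of ordered sequences with these face counts (multinomial coefficient)
        weight = fact[n_dice] // (fact[a] * fact[b] * fact[s[0]] * fact[s[1]]
                                  * fact[s[2]] * fact[s[3]])
        product = 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
        total += weight
        sum_p += weight * product
        sum_p2 += weight * product**2
    variance = Fraction(sum_p2, total) - Fraction(sum_p, total) ** 2
    return math.sqrt(variance)

if __name__ == "__main__":
    print(weighted_std(30, 120))
```

All sums are exact integers, so the only rounding happens in the final square root; for a small case such as 3 dice with sum 10 the result can be checked directly against full enumeration.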
