Trying to code the secretary problem in python by doing a Monte Carlo simulation (without using e). The essence of the problem is here: https://en.wikipedia.org/wiki/Secretary_problem
Described as: Imagine an administrator who wants to hire the best secretary out of n rankable applicants for a position. The applicants are interviewed one by one in random order. A decision about each particular applicant is to be made immediately after the interview. Once rejected, an applicant cannot be recalled. During the interview, the administrator can rank the applicant among all applicants interviewed so far but is unaware of the quality of yet unseen applicants. The question is about the optimal strategy (stopping rule) to maximize the probability of selecting the best applicant. Taken from: https://www.geeksforgeeks.org/secretary-problem-optimal-stopping-problem/
Table that I'm checking my code against:
Here is my python code so far:
import numpy as np

n = 7      # number of applicants
m = 10000  # number of repeats
plot = np.zeros(1)
for i in range(2, m):  # multiple runs
    array = np.random.randint(1, 1000, n)
    for j in range(2, n):  # over range of array
        test = 0
        if array[j] > array[1] and array[j] == array.max():
            plot = plot + 1
            test = 1
            break
        if array[j] > array[1]:
            test = 2
            break
print(plot/m)
print(array)
print("j = ", j)
print("test = ", test)
I am doing something wrong in my code, because I am unable to replicate the table. In the code above I use 7 applicants and take the best applicant seen after the first 2.
The value plot/m should output the percentage in column three of the table, given the number of applicants and the 'take the best after' cutoff.
Answered! As below.
Additional code:
import numpy as np
import matplotlib.pyplot as plt
import time

plt.style.use('seaborn-whitegrid')

n = 150  # total number of applicants
nplot = np.empty([1, 1])
#take = 3  # not necessary, turned into j below
for k in range(2, n):
    m = 10000  # number of repeats
    plot = np.empty([1, 1])
    for j in range(1, k):
        passed = 0
        for i in range(0, m):  # multiple runs
            array = np.random.rand(k)
            picked = np.argmax(array[j:] > max(array[0:j])) + j
            best = np.argmax(array)
            if best == picked:
                passed = passed + 1
        #print(passed/m)
        plot = np.append(plot, [passed/m])
    #print(plot)
    plot = plot[1:]
    x = range(1, k)
    y = plot
    #print("N = ", k)
    print("Check", plot.argmax(), "if you have", k, "applicants:",
          round(100 * plot.max(), 2), "% chance of finding the best applicant")
    nplot = np.append(nplot, plot.max())

# Plot:
nplot = nplot[1:]
x = range(2, n)
y = nplot
plt.plot(x, y, 'o', color='black')
plt.xlabel("Number of Applicants")
plt.ylabel("Probability of Best Applicant")
Here is something that seems to do the job and is a bit simpler. Comments:
Use argmax to determine who is the best secretary, or to pick the first that has a better grade than another group
Draw from a real-valued function to reduce the odds of having 2 secretaries having the same grade.
Hence:
import numpy as np

n = 7
take = 2
m = 100000
passed = 0
for i in range(0, m):  # multiple runs
    array = np.random.rand(n)
    picked = np.argmax(array[take:] > max(array[0:take])) + take
    best = np.argmax(array)
    if best == picked:
        passed = passed + 1
print(passed/m)
Objective:
To visualize the population size of a particular organism over finite time.
Assumptions:
The organism has a life span of age_limit days
Only females of age day_lay_egg days can lay eggs, and a female is allowed to lay eggs a maximum of max_lay_egg times. In each breeding session a maximum of egg_no eggs can be laid, each with a 50% probability of being male offspring.
The initial population of 3 organisms consists of 2 females and 1 male.
Code Snippets:
Currently, the code below produces the expected output.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def get_breeding(d, **kwargs):
    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
        nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
        npol = [dict(s=x, d=d['d'], lay_egg=0, dborn=0) for x in nums]
        d['lay_egg'] = d['lay_egg'] + 1
        return d, npol
    return d, None

def to_loop_initial_population(**kwargs):
    npol = kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:
        # print(f'Executing day {nday}')
        k = []
        for dpol in npol:
            dpol['d'] += 1
            dpol['dborn'] += 1
            dpol, h = get_breeding(dpol, **kwargs)

            if h is None and dpol['dborn'] <= kwargs['age_limit']:
                # If beyond the age limit, ignore the parent and update only the descendant
                k.append(dpol)
            elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
                # If below age limit, append the parent and its offspring
                h.extend([dpol])
                k.extend(h)

        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k
    return total_population_per_day

## Some specs; store all settings in a dict
numsex = [1, 1, 0]  # 0: Male, 1: Female
# s: sex, d: day, lay_egg: number of times the female laid an egg, dborn: the organism's age
ipol = [dict(s=x, d=0, lay_egg=0, dborn=0) for x in numsex]  # The initial population
age_limit = 45    # Age limit for the species
egg_no = 3        # Number of eggs
day_lay_egg = 30  # Matured age for egg laying
nday_limit = 360
max_lay_egg = 2

para = dict(nday_limit=nday_limit, ipol=ipol, age_limit=age_limit,
            egg_no=egg_no, day_lay_egg=day_lay_egg, max_lay_egg=max_lay_egg)

dpopulation = to_loop_initial_population(**para)

### make some plot
df = pd.DataFrame(dpopulation)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()
Output:
Problem/Question:
The execution time increases exponentially with nday_limit. I need to improve the efficiency of the code. How can I speed up the running time?
Other Thoughts:
I am tempted to apply joblib as below. To my surprise, the execution time is worse.
from joblib import Parallel, delayed
import itertools

def djob(dpol, k, **kwargs):
    dpol['d'] = dpol['d'] + 1
    dpol['dborn'] = dpol['dborn'] + 1
    dpol, h = get_breeding(dpol, **kwargs)

    if h is None and dpol['dborn'] <= kwargs['age_limit']:
        # If beyond the age limit, ignore that particular subject
        k.append(dpol)
    elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
        # If below age limit, append the parent and its offspring
        h.extend([dpol])
        k.extend(h)
    return k

def to_loop_initial_population(**kwargs):
    npol = kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:
        k = []
        njob = 1 if len(npol) <= 50 else 4
        if njob == 1:
            print(f'Executing day {nday} with a single cpu')
            for dpols in npol:
                k = djob(dpols, k, **kwargs)
        else:
            print(f'Executing day {nday} in parallel')
            k = Parallel(n_jobs=-1)(delayed(djob)(dpols, k, **kwargs) for dpols in npol)
            k = list(itertools.chain(*k))
            ll = 1
        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k
    return total_population_per_day
for nday_limit = 365.
Your code looks alright overall, but I can see several points of improvement that are slowing it down significantly.
That said, you can't entirely prevent the code from slowing down as nday increases, since the population you need to keep track of keeps growing and you re-populate a list to track it every day. It's expected that the loops take longer as the number of objects increases, but you can reduce the time it takes to complete a single loop.
elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
Here you check the type of h on every single iteration, right after already checking whether it is None. You know for a fact that if h is not None it is a list; if it weren't, your code would have errored before ever reaching that line, because the list could not have been created.
Furthermore, you check dpol's age twice, and you redundantly first extend h with dpol and then k with h. Together with the previous issue, this can be simplified to:
if dpol['dborn'] <= kwargs['age_limit']:
    k.append(dpol)
    if h:
        k.extend(h)
The results are identical.
Additionally, you're passing around a lot of **kwargs. This is a sign that your code should be a class instead, where some unchanging parameters are saved through self.parameter. You could even use a dataclass here (https://docs.python.org/3/library/dataclasses.html)
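As a rough illustration (the class name SimConfig and the choice of fields are my own sketch, not code from the question), the fixed parameters could be grouped like this:

from dataclasses import dataclass

@dataclass
class SimConfig:
    # Fixed simulation parameters that are currently threaded through **kwargs
    age_limit: int = 45
    egg_no: int = 3
    day_lay_egg: int = 30
    nday_limit: int = 360
    max_lay_egg: int = 2

cfg = SimConfig()
print(cfg.day_lay_egg)  # attributes instead of repeated dict lookups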
Also, you mix responsibilities of functions which is unnecessary and makes your code more confusing. For instance:
def get_breeding(d, **kwargs):
    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
        nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
        npol = [dict(s=x, d=d['d'], lay_egg=0, dborn=0) for x in nums]
        d['lay_egg'] = d['lay_egg'] + 1
        return d, npol
    return d, None
This code has two responsibilities: checking the breeding conditions and generating new individuals, and it returns two different things depending on which path is taken.
This would be better done through two separate functions, one which simply checks the conditions, and another that generates a new individual as follows:
def check_breeding(d, max_lay_egg, day_lay_egg):
    return d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg and d['s'] == 1

def get_breeding(d, egg_no):
    nums = np.random.choice([0, 1], size=egg_no, p=[.5, .5]).tolist()
    npol = [dict(s=x, d=d['d'], lay_egg=0, dborn=0) for x in nums]
    return npol
Where d['lay_egg'] could be updated in-place when iterating over the list if the condition is met.
You could speed up your code even further this way if you edit the list as you iterate over it. This is not typically recommended, but it is perfectly fine if you know what you are doing: make sure to iterate by index, limit the loop to the original length of the list, and decrement the index when an element is removed.
Example:
i = 0
maxiter = len(npol)
while i < maxiter:
    if check_breeding(npol[i], max_lay_egg, day_lay_egg):
        npol.extend(get_breeding(npol[i], egg_no))

    if npol[i]['dborn'] > age_limit:
        npol.pop(i)
        i -= 1
        maxiter -= 1

    i += 1  # advance to the next element (the pop above already stepped the index back)
Which could significantly reduce processing time since you're not making a new list and appending all elements all over again every iteration.
Finally, you could look into population growth equations and statistical methods; you could even reduce this whole code to an iterative calculation, though that wouldn't be a simulation anymore.
Edit
I've fully implemented my suggested improvements to your code and timed both versions in a jupyter notebook using %%time. I've separated out the function definitions from both so they wouldn't contribute to the time, and the results are telling. I also made it so females produce another female 100% of the time, to remove randomness, otherwise it would be even faster. I compared the results from both to verify they produce identical results (they do, but I removed the 'd_born' parameter because it's not used in the code apart from being set).
Your implementation, with nday_limit=100 and day_lay_egg=15:
Wall time 23.5s
My implementation with same parameters:
Wall time 18.9s
So you can tell the difference is quite significant, which grows even farther apart for larger nday_limit values.
Full implementation of edited code:
from dataclasses import dataclass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#dataclass
class Organism:
    sex: int
    times_laid_eggs: int = 0
    age: int = 0

    def __init__(self, sex):
        self.sex = sex

def check_breeding(d, max_lay_egg, day_lay_egg):
    return d.times_laid_eggs <= max_lay_egg and d.age > day_lay_egg and d.sex == 1

def get_breeding(egg_no):  # Make sure to change probabilities back to 0.5 and 0.5 before using it
    nums = np.random.choice([0, 1], size=egg_no, p=[0.0, 1.0]).tolist()
    npol = [Organism(x) for x in nums]
    return npol

def simulate(organisms, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit):
    npol = organisms
    nday = 0
    total_population_per_day = []
    while nday < nday_limit:
        i = 0
        maxiter = len(npol)
        while i < maxiter:
            npol[i].age += 1

            if check_breeding(npol[i], max_lay_egg, day_lay_egg):
                npol.extend(get_breeding(egg_no))
                npol[i].times_laid_eggs += 1

            if npol[i].age > age_limit:
                npol.pop(i)
                maxiter -= 1
                continue

            i += 1

        total_population_per_day.append(dict(nsize=len(npol), day=nday))
        nday += 1

    return total_population_per_day

if __name__ == "__main__":
    numsex = [1, 1, 0]  # 0: Male, 1: Female

    ipol = [Organism(x) for x in numsex]  # The initial population
    age_limit = 45    # Age limit for the species
    egg_no = 3        # Number of eggs
    day_lay_egg = 15  # Matured age for egg laying
    nday_limit = 100
    max_lay_egg = 2

    dpopulation = simulate(ipol, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit)

    df = pd.DataFrame(dpopulation)
    sns.lineplot(x="day", y="nsize", data=df)
    plt.xticks(rotation=15)
    plt.title('Day vs population')
    plt.show()
Try structuring your code as a matrix like state[age][eggs_remaining] = count instead. It will have age_limit rows and max_lay_egg columns.
Males start in the 0 eggs_remaining column, and every time a female lays an egg they move down one (3->2->1->0 with your code above).
For each cycle, you just drop the last row, iterate over the remaining rows (the breeding-age ones are the rows that lay eggs), and insert a new first row with the numbers of newborn males and females.
If (as in your example) there only is a vanishingly small chance that a female would die of old age before laying all their eggs, you can just collapse everything into a state_alive[age][gender] = count and a state_eggs[eggs_remaining] = count instead, but it shouldn't be necessary unless the age goes really high or you want to run thousands of simulations.
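To make the idea concrete, here is a minimal sketch of that state-matrix bookkeeping. The variable names, the 50/50 binomial draw for offspring sex, and the breed-then-age update order are my own assumptions, not tested code from this answer:

import numpy as np

age_limit, day_lay_egg = 45, 30
max_lay_egg, egg_no, nday_limit = 2, 3, 360

# state[age, e] = number of organisms of that age with e breeding sessions left;
# males and females that have used all their sessions sit in column e == 0.
state = np.zeros((age_limit + 1, max_lay_egg + 1), dtype=np.int64)
state[0, max_lay_egg] = 2   # two newborn females
state[0, 0] = 1             # one newborn male
history = []

rng = np.random.default_rng(0)
for day in range(nday_limit):
    # Every female that is old enough and still has sessions left lays eggs today.
    breeders = int(state[day_lay_egg + 1:, 1:].sum())
    daughters = int(rng.binomial(breeders * egg_no, 0.5))  # roughly half the eggs are female
    sons = breeders * egg_no - daughters

    # Breeders use up one session: shift their columns one step toward e == 0.
    old = state[day_lay_egg + 1:].copy()
    state[day_lay_egg + 1:, :-1] = old[:, 1:]
    state[day_lay_egg + 1:, -1] = 0
    state[day_lay_egg + 1:, 0] += old[:, 0]

    # Everyone ages one day: drop the last row, insert a new first row of newborns.
    newborn_row = np.zeros(max_lay_egg + 1, dtype=np.int64)
    newborn_row[max_lay_egg] = daughters
    newborn_row[0] = sons
    state = np.vstack([newborn_row, state[:-1]])

    history.append(int(state.sum()))

print(history[-1])  # population size on the last simulated day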
Using numpy array operations as much as possible instead of loops can improve performance; see the code below, tested in a notebook: https://www.kaggle.com/gfteafun/notebook03118c731b
Note that the nsize scale matters when comparing the times.
%%time

# s: sex, d: day, lay_egg: number of times the female laid an egg, dborn: the organism's age
x = np.array([(x, 0, 0, 0) for x in numsex])
iparam = np.array([0, 1, 0, 1])
total_population_per_day = []

for nday in range(nday_limit):
    x = x + iparam
    c = np.all(x < np.array([2, nday_limit, max_lay_egg, age_limit]), axis=1) & np.all(x >= np.array([1, day_lay_egg, 0, day_lay_egg]), axis=1)
    total_population_per_day.append(dict(nsize=len(x[x[:, 3] < age_limit, :]), day=nday))
    n = x[c, 2].shape[0]
    if n > 0:
        x[c, 2] = x[c, 2] + 1
        newborns = np.array([(x, nday, 0, 0)
                             for x in np.random.choice([0, 1], size=egg_no, p=[.5, .5])
                             for i in range(n)])
        x = np.vstack((x, newborns))

df = pd.DataFrame(total_population_per_day)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()
Energy calculations in molecular simulation are inherently full of "for" loops. Traditionally, coordinates for each atom/molecule were stored in arrays. Arrays are fairly straightforward to vectorize, but structures are nicer to code with. Treating molecules as individual objects, each with their own coordinates and other properties, is very convenient and much clearer as far as book-keeping goes.
I am using Python version 3.6
My problem is that I cannot figure out how to vectorize calculations when I am using an array of objects... it seems that a for loop cannot be avoided. Is it necessary for me to use arrays in order to take advantage of numpy and vectorize my code?
Here is a python example which utilizes arrays (line 121 of the code), and shows a fast (numpy) and a slow ('normal') python energy calculation.
https://github.com/Allen-Tildesley/examples/blob/master/python_examples/mc_lj_module.py
The calculation is much faster using the numpy accelerated method because it is vectorized.
How would I vectorize an energy calculation if I was not using arrays, but an array of objects, each with their own coordinates? This seems to necessitate using the slower for loop.
Here is a simple example code with a working slow version of the for loop, and an attempted vectorization that doesn't work:
import numpy as np
import time

class Mol:
    num = 0

    def __init__(self, r):
        Mol.num += 1
        self.r = np.empty((3), dtype=np.float_)
        self.r[0] = r[0]
        self.r[1] = r[1]
        self.r[2] = r[2]
        """ A lot more useful things go in here in practice """

################################################
#                                              #
#                 Main Program                 #
#                                              #
################################################

L = 5.0              # Length of simulation box (arbitrary)
r_cut_box_sq = L/2   # arbitrary cutoff - required

mol_list = []
nmol = 1000  # number of molecules
part = 1     # arbitrary molecule to interact with rest of molecules

""" make 1000 molecules (1 atom per molecule), give random coordinates """
for i in range(nmol):
    r = np.random.rand(3) * L
    mol_list.append(Mol(r))

energy = 0.0
start = time.time()

################################################
#                                              #
#          Slow but functioning loop           #
#                                              #
################################################

for i in range(nmol):
    if i == part:
        continue
    rij = mol_list[part].r - mol_list[i].r
    rij = rij - np.rint(rij/L)*L  # apply periodic boundary conditions
    rij_sq = np.sum(rij**2)       # Squared separations
    in_range = rij_sq < r_cut_box_sq
    sr2 = np.where(in_range, 1.0 / rij_sq, 0.0)
    sr6 = sr2 ** 3
    sr12 = sr6 ** 2
    energy += sr12 - sr6

end = time.time()
print('slow: ', end-start)
print('energy: ', energy)

start = time.time()

################################################
#                                              #
#         Failed vectorization attempt         #
#                                              #
################################################

""" The next line is my problem, how do I vectorize this so I can avoid the for loop altogether?
    Leads to error AttributeError: 'list' object has no attribute 'r' """
""" I also must add in that part cannot interact with itself in mol_list """
rij = mol_list[part].r - mol_list[:].r
rij = rij - np.rint(rij/L)*L  # apply periodic boundary conditions
rij_sq = np.sum(rij**2)
in_range = rij_sq < r_cut_box_sq
sr2 = np.where(in_range, 1.0 / rij_sq, 0.0)
sr6 = sr2 ** 3
sr12 = sr6 ** 2
energy = sr12 - sr6
energy = sum(energy)

end = time.time()
print('faster??: ', end-start)
print('energy: ', energy)
Lastly:
Would any possible solution be affected if, inside the energy calculation, it were necessary to loop over each atom in each molecule, where there is now more than one atom per molecule and not all molecules have the same number of atoms? That would require a double for loop over molecule-molecule interactions rather than the simple pair-pair interactions currently employed.
Making use of the itertools library might be the way forward here. Suppose you wrap the energy calculation of a pair of molecules in a function:
def calc_pairwise_energy(pair):
    # the function takes a 2-item tuple of molecules
    mol_a, mol_b = pair
    # energy calculating code here
    return pairwise_energy
Then you can use itertools.combinations to get all the pairs of molecules and python's built in list comprehensions (the code inside [ ] on the last line below):
from itertools import combinations
pairs = combinations(mol_list,2)
energy = sum( [calc_pairwise_energy(pair) for pair in pairs] )
I've come back to this answer as I realised I hadn't properly answered your question. With what I've already posted the pairwise energy calculation function looked like this (I've made a few optimisations to your code):
def calc_pairwise_energy(molecules):
    rij = molecules[0].r - molecules[1].r
    rij = rij - np.rint(rij/L)*L
    rij_sq = np.sum(rij**2)  # Squared separations
    if rij_sq < r_cut_box_sq:
        return (rij_sq ** -6) - (rij_sq ** -3)
    else:
        return 0.0
Whereas a vectorised implementation that does all the pairwise calculations in a single call might look like this:
def calc_all_energies(molecules):
    energy = 0
    for i in range(len(molecules)-1):
        mol_a = molecules[i]
        other_mols = molecules[i+1:]
        coords = np.array([mol.r for mol in other_mols])
        rijs = coords - mol_a.r
        # np.apply_along_axis replaced as per @hpaulj's comment (see below)
        #rijs = np.apply_along_axis(lambda x: x - np.rint(x/L)*L, 0, rijs)
        rijs = rijs - np.rint(rijs/L)*L
        rijs_sq = np.sum(rijs**2, axis=1)
        rijs_in_range = rijs_sq[rijs_sq < r_cut_box_sq]
        energy += sum(rijs_in_range ** -6 - rijs_in_range ** -3)
    return energy
This is much faster but there is still plenty to optimise here.
If you want to calculate energies with coordinates as inputs, I'm assuming you're looking for pair-wise distances. For this, you should look into the SciPy library. Specifically, I would look at scipy.spatial.distance.pdist. The documentation can be found here.
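For instance, a minimal sketch built on the question's variables (mol_list, r_cut_box_sq); note that, for brevity, this skips the periodic boundary wrapping that the original loop applies:

import numpy as np
from scipy.spatial.distance import pdist

coords = np.array([mol.r for mol in mol_list])  # (nmol, 3) array pulled out of the objects
d_sq = pdist(coords, metric='sqeuclidean')      # condensed vector of all pairwise squared separations
d_sq = d_sq[d_sq < r_cut_box_sq]                # keep only pairs inside the cutoff
energy = np.sum(d_sq ** -6 - d_sq ** -3)        # same sr12 - sr6 form as in the question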
I wrote a code that implements a simple genetic algorithm to maximize:
f(x) = 15x - x^2
The function has its maximum at 7.5, so the code output should be 7 or 8, since the population members are integers.
When I run the code 10 times I get 7 or 8 around three times out of 10.
What modification should I make to further improve the algorithm and what are different types of genetic algorithms?
Here is the code:
from random import *
import numpy as np

# fitness function
def fit(x):
    return 15*x - x**2

# convert a binary list to a decimal number
def to_dec(x):
    return int("".join(str(e) for e in x), 2)

# picks pairs from the original population
def gen_pairs(populationl, prob):
    pairsl = []
    test = [0, 1, 2, 3, 4, 5]
    for i in range(3):
        pair = []
        for j in range(2):
            temp = np.random.choice(test, p=prob)
            pair.append(populationl[temp].copy())
        pairsl.append(pair)
    return pairsl

# mating function
def cross_over(prs, mp):
    new = []
    for pr in prs:
        if mp[prs.index(pr)] == 1:
            index = np.random.choice([1, 2, 3], p=[1/3, 1/3, 1/3])
            pr[0][:index], pr[1][:index] = pr[1][:index], pr[0][:index]
    for pr in prs:
        new.append(pr[0])
        new.append(pr[1])
    return new

# mutation
def mutation(x):
    for chromosome in x:
        for gene in chromosome:
            mutation_prob = np.random.choice([0, 1], p=[0.999, .001])
            if mutation_prob == 1:
                #m_index = np.random.choice([0,1,2,3])
                if gene == 0:
                    gene = 1
                else:
                    gene = 0

# generate initial population
randlist = lambda n: [randint(0, 1) for b in range(1, n+1)]

for j in range(10):
    population = [randlist(4) for i in range(6)]
    for _ in range(20):
        fittness = [fit(to_dec(y)) for y in population]
        s = sum(fittness)
        prob = [e/s for e in fittness]
        pairsg = gen_pairs(population.copy(), prob)
        mating_prob = []
        for i in pairsg:
            mating_prob.append(np.random.choice([0, 1], p=[0.4, 0.6]))
        new_population = cross_over(pairsg, mating_prob)
        mutated = mutation(new_population)
        decimal_p = [to_dec(i) for i in population]
        decimal_new = [to_dec(i) for i in new_population]
        # print(decimal_p)
        # print(decimal_new)
        population = new_population
    print(decimal_new)
This is a very typical situation with evolutionary algorithms. Success rate is quite a common metric, and 30% is a decent result.
Just as an example: recently I implemented a GP/GE solver for the Santa Fe Trail problem, and it achieves a success rate of 30% or less.
How to improve success rate
A personal interpretation of the problem based on limited experience follows.
An evolutionary algorithm fails to find a solution close to the global optimum when it converges around a local optimum or gets stuck on a large plateau and does not have enough diversity in its population to escape this trap by finding a better region.
You may try to supply your algorithm with more diversity by increasing the size of the population. Or you may look into techniques like novelty search, and quality diversity.
By the way, here is a very nice interactive demonstration of novelty search vs. fitness search: http://eplex.cs.ucf.edu/noveltysearch/userspage/demo.html
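To make the first suggestion concrete for the code in the question: the population size is currently hard-wired to 6 individuals (and gen_pairs to indices 0-5). A rough sketch of parametrizing it, where pop_size is my own name and 20 is only a hypothetical value, could look like this:

pop_size = 20  # hypothetical larger population for more diversity

def gen_pairs(populationl, prob):
    pairsl = []
    indices = list(range(len(populationl)))           # no longer hard-coded to [0..5]
    for i in range(len(populationl) // 2):            # keeps the population size constant
        pair = [populationl[np.random.choice(indices, p=prob)].copy() for _ in range(2)]
        pairsl.append(pair)
    return pairsl

population = [randlist(4) for i in range(pop_size)]   # initial population of pop_size chromosomes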
I am trying to find the standard deviation of a sequence of numbers extracted from combinations of 30 dice that sum to 120. I am very new to Python, and this code makes the console freeze because the number of combinations is enormous; I am not sure how to fit it all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items in the list within result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy

dice = [1, 2, 3, 4, 5, 6]
subset = itertools.product(dice, repeat=30)

result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)

my_result = numpy.product(result, axis=1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where h[i][N] is the number of i-dice sequences summing to N, f[i][N] is the sum of the products of those sequences, and g[i][N] is the sum of their squared products:
f[i][N] = sum(k * f[i-1][N-k])    (1 <= k <= 6)
g[i][N] = sum(k^2 * g[i-1][N-k])  (1 <= k <= 6)
h[i][N] = sum(h[i-1][N-k])        (1 <= k <= 6)
with base cases
f[1][k] = k    (1 <= k <= 6)
g[1][k] = k^2  (1 <= k <= 6)
h[1][k] = 1    (1 <= k <= 6)
Sample implementation:
import numpy as np

Nmax = 120
nmax = 30
min_value = 1
max_value = 6

f = np.zeros((nmax+1, Nmax+1), dtype='object')
g = np.zeros((nmax+1, Nmax+1), dtype='object')  # the intermediate results will be really huge; to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype='object')

for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1

for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]

result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result over an unfiltered set of 6^30 ≈ 2·10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
1. Include more thinking to pre-treat the problem, e.g. on how to sample only those combinations with sum 120.
2. Do a Monte Carlo simulation instead, i.e. don't enumerate all combinations, but only a random couple of thousand, to obtain a representative sample and determine the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30      # number of dice
M = 100000  # number of samples
S = 120     # required sum

result = [[random.randint(1, 6) for _ in range(N)] for _ in range(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1,000,000 samples to get roughly reproducible values for the std (1 digit); that takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*10^16 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only the 3, 4, 5, 6 frequencies

product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # the 2 frequency
    a = 30 - (b + sum(s))                         # the 1 frequency
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with possible permutations
print(numpy.std(product))  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: the samples are not equally probable, and that is the problem here. Each sampled frequency vector corresponds to a different number of possible orderings of the dice, which gives its weight, and this has to be considered before taking the std-deviation. I have drafted that in the code above.
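For reference, a sketch of how that weighting could be filled in; the helper name n_permutations is mine, and the weight of each frequency vector is the multinomial coefficient counting how many orderings of the 30 dice produce it:

from math import factorial
import numpy

def n_permutations(a, b, s):
    # number of distinct orderings of 30 dice with a ones, b twos and s[i] of face i+3
    denom = factorial(a) * factorial(b)
    for count in s:
        denom *= factorial(count)
    return factorial(30) // denom

# inside the loop above, replace `permutations.append(1)` with
#     permutations.append(n_permutations(a, b, s))
# and then compute the weighted mean and standard deviation of the products:
w = numpy.array(permutations, dtype=float)
p = numpy.array(product, dtype=float)
mean = numpy.sum(w * p) / numpy.sum(w)
std = numpy.sqrt(numpy.sum(w * p**2) / numpy.sum(w) - mean**2)
print(std)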
I've written some python code to calculate a certain quantity from a cosmological simulation. It does this by checking whether a particle is contained within a box of size 8,000^3, starting at the origin and advancing the box when all particles contained within it have been found. As I am counting ~2 million particles altogether, and the total size of the simulation volume is 150,000^3, this is taking a long time.
I'll post my code below, does anybody have any suggestions on how to improve it?
Thanks in advance.
from __future__ import division
import numpy as np

def check_range(pos, i, j, k):
    a = 0
    if i <= pos[2] < i+8000:
        if j <= pos[3] < j+8000:
            if k <= pos[4] < k+8000:
                a = 1
    return a

def sigma8(data):
    N = []
    to_do = data
    print 'Counting number of particles per cell...'
    for k in range(0, 150001, 8000):
        for j in range(0, 150001, 8000):
            for i in range(0, 150001, 8000):
                temp = []
                n = []
                for count in range(len(to_do)):
                    n.append(check_range(to_do[count], i, j, k))
                    to_do[count][1] = n[count]
                    if to_do[count][1] == 0:
                        # Only particles that have not been found are
                        # searched for again
                        temp.append(to_do[count])
                to_do = temp
                N.append(sum(n))
            print 'Next row'
        print 'Next slice, %i still to find' % len(to_do)
    print 'Calculating sigma8...'
    if not sum(N) == len(data):
        return 'Error!\nN measured = {0}, total N = {1}'.format(sum(N), len(data))
    else:
        return 'sigma8 = %.4f, variance = %.4f, mean = %.4f' % (np.sqrt(sum((N-np.mean(N))**2)/len(N))/np.mean(N), np.var(N), np.mean(N))
I'll try to post some code, but my general idea is the following: create a Particle class that knows about the box that it lives in, which is calculated in the __init__. Each box should have a unique name, which might be the coordinate of the bottom left corner (or whatever you use to locate your boxes).
Get a new instance of the Particle class for each particle, then use a Counter (from the collections module).
Particle class looks something like:
# static consts - outside so that every instance of Particle doesn't take them along
# for the ride...
MAX_X = 150000
X_STEP = 8000
# etc.

class Particle(object):
    def __init__(self, data):
        self.x = data[xvalue]
        self.y = data[yvalue]
        self.z = data[zvalue]
        self.compute_box_label()

    def compute_box_label(self):
        import math
        x_label = math.floor(self.x / X_STEP)
        y_label = math.floor(self.y / Y_STEP)
        z_label = math.floor(self.z / Z_STEP)
        self.box_label = str(x_label) + '-' + str(y_label) + '-' + str(z_label)
Anyway, I imagine your sigma8 function might look like:
def sigma8(data):
    import collections as col
    particles = [Particle(x) for x in data]
    boxes = col.Counter([x.box_label for x in particles])
    counts = boxes.most_common()
    # some other stuff
counts will be a list of tuples which map a box label to the number of particles in that box. (Here we're treating particles as indistinguishable.)
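One thing to be careful about (a sketch under my own assumptions, not tested code from this answer): cells that contain no particles never show up in the Counter, but they still need to enter the variance as zeros, e.g.:

import numpy as np

n_cells = (150000 // 8000 + 1) ** 3        # 19 cells per axis for the 150,000^3 volume
N = np.zeros(n_cells)
N[:len(counts)] = [c for _, c in counts]   # occupied cells; the remaining entries stay 0
sigma8 = np.sqrt(np.sum((N - N.mean())**2) / len(N)) / N.mean()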
Using list comprehensions is much faster than using loops. I think the reason is that you're basically relying more on the underlying C, but I'm not the person to ask. Counter is (supposedly) highly optimized as well.
Note: None of this code has been tested, so you shouldn't try the cut-and-paste-and-hope-it-works method here.