Implementing probability function involving heavy combinatorics in python - python

This is regarding the answer to a question I asked on the math stack.
I'm looking to convert this question/solution into python, but I'm having trouble interpreting all of the notation used here.
I realize this post is a bit too 'gimme the code' to be a great question, but I ask with the intention of understanding the math involved here. I don't understand the language of mathematical notations used here in concert very well, but I can interpret python well enough to conceptualize the answer if I see it.
The problem can be set up like this
import numpy as np
bag = np.hstack((
np.repeat(0, 80),
np.repeat(1, 21),
np.repeat(3, 5),
np.repeat(7,1)
))

I'm not sure if this is exactly what you're after but this is how I would calculate, for example, the probability of getting a sum == 6.
It's more practical than mathematical and just addresses this particular problem, so I'm not sure if it will help you under stand the maths.
import numpy as np
import itertools
from collections import Counter
import pandas as pd
bag = np.hstack((
np.repeat(0, 80),
np.repeat(1, 21),
np.repeat(3, 5),
np.repeat(7,1)
))
#107*106*105*104*103*102*101*100*99*98
#Out[176]: 127506499163211168000 #Permutations
##Need to reduce the number to sample from without changing the number of possible combinations
reduced_bag = np.hstack((
np.repeat(0, 10), ## 0 can be chosen all 10 times
np.repeat(1, 10), ## 1 can be chosen all 10 times
np.repeat(3, 5), ## 3 can be chosen up to 5 times
np.repeat(7,1) ## 7 can be chosen once
))
## There are 96 unique combinations
number_unique_combinations = len(set(list(itertools.combinations(reduced_bag,10))))
### sorted list of all combinations
unique_combinations = sorted(list(set(list(itertools.combinations(reduced_bag,10)))))
### sum of each unique combination
sums_list = [sum(uc) for uc in unique_combinations]
### probability for each unique combination
probability_dict = {0:80, 1:21, 3:5, 7:1} ##Dictionary to refer to
n = 107 ##Number in the bag
probability_list = []
##This part is VERY slow to run because of the itertools.permutations
for x in unique_combinations:
print(x)
p = 1 ##Start with the probability again
n = 107 ##Start with a full bag for each combination
count_x = Counter(x)
for i in x:
i_left = probability_dict[i] - (Counter(x)[i] - count_x[i]) ##Number of that type left in bag
p *= i_left/n ##Multiply the probability
n = n-1 # non replacement
count_x[i] = count_x[i] - 1 ##Reduce the number in the bag
p *= len(set(list(itertools.permutations(x,10)))) ##Multiply by the number of permutations per combination
probability_list.append(p)
##sum(probability_list) ## Has a rounding error
##Out[57]: 1.0000000000000002
##
##Put the combinations into dataframe
ar = np.array((unique_combinations,sums_list,probability_list))
df = pd.DataFrame(ar).T
##Name the columns
df.columns = ["combination", "sum","probability"]
## probability that sum is >= 6
df[df["sum"] >= 6]['probability'].sum()
## 0.24139909236232826
## probability that sum is == 6
df[df["sum"] == 6]['probability'].sum()
## 0.06756408790812335

Related

Extracting total value of a tuple

I'm very new to Python, so please forgive my ignorance. I'm trying to calculate the total number of energy units in a system. For example, the Omega here will output both (0,0,0,1) and (2,2,2,1) along with a whole lot of other tuples. I want to extract from Omega how many tuples have a total value of 1 (like the first example) and how many have a total value of 7 (like the second example). How do I achieve this?
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
N = 4 ##The number of Oscillators
q = range(3) ## Range of number of possible energy units per oscillator
Omega = product(q, repeat = N)
print(list(product(q, repeat = N)))
try this:
Omega = product(q, repeat = N)
l = list(product(q, repeat = N))
l1 = [i for i in l if sum(i)==1]
l2 = [i for i in l if sum(i)==7]
print(l1,l2)
I believe you can use sum() on tuples as well as lists of integers/numbers.
Now you say omega is a list of tuples, is that correct? Something like
Omega = [(0,0,0,1), (2,2,2,1), ...)]
In that case I think you can do
sums_to_1 = [int_tuple for int_tuple in omega if sum(int_tuple) == 1]
If you want to have some default value for the tuples that don't sum to one you can put the if statement in the list comprehension in the beginning and do
sums_to_1 = [int_tuple if sum(int_tuple) == 1 else 'SomeDefaultValue' for int_tuple in omega]

PYTHON How to generate 20 million unrepeatable random numbers

Need to generate 20 million unrepeatable random numbers with 8 characters length and save it in an array.
I try with multiprocessing,threading but it stays slow.
Try with multiprocessing:
from numpy.random import default_rng
from multiprocessing import Process,Queue
import os,time
import numpy as np
rng = default_rng()
f=np.array([],dtype=np.int64)
def generate(q,start,stop):
numbers=[rng.choice(range(start,stop),replace=False) for _ in range(1000)]
q.put(numbers)
if __name__ == '__main__':
timeInit = time.time()
for x in range(20000):
q=Queue()
p = Process(target=generate,args=(q,11111111,99999999,))
p.start()
f=np.append(f,q.get())
p.join()
print(f)
timeStop = time.time()
print('[TIME EXECUTED] ' + str(timeStop-timeInit) +' segs')
This took less than 30 secs on my personal laptop, if it works for you:
import random
candidates = list(range(10**7, 10**8)) # all numbers from 10000000 to 99999999
random.shuffle(candidates)
result = candidates[:20* 10**6] # take first 20 million
You haven't explained why you're doing all of that overhead. I simply took a random sample from the candidate numbers:
from random import sample
result = sample(
list(range(10**7, 10**8)),
2*10**7
)
51 seconds on my laptop, with interference from other jobs.
I just ran a more controlled test on both solutions. The one in this post took 48.5 seconds; the one from naicolas took 81.6 seconds, likely due to the extra list creation.
I hope I got your idea. The random numbers that you are trying to generate are actually a bit tricky. Basically we are looking for a set of unique (non-repeatable) but random numbers. In this case, we can not draw random numbers from uniform distribution, because there is no guarantee that numbers are unique.
There are 2 possible algorithms. The first one is to generate A LOT of possible random numbers, and remove those repeated ones. For instance,
import numpy as np
N = 20_000_000
L0 = 11_111_111 # legitimate int in Python
L1 = L0 * 9
not_enough_unique = True
while not_enough_unique:
X = np.random.uniform(L0, L1, int(N * 2)).astype(int)
X_unique = np.unique(X) # remove repeated numbers
not_enough_unique = len(X_unique) < N
random_numbers = X_unique[:N]
np.random.shuffle(random_numbers)
There is also another more "physics" approach. We can start with equal–spaced numbers, and move each number a little bit. The result will not be as random as the first one, but it is much faster and purely fun.
import numpy as np
N = 20_000_000
L0 = 11_111_111 # legitimate int in Python
L1 = L0 * 9
lattice = np.linspace(L0, L1, N) # all numbers have equal spacing
pertubation = np.random.normal(0, 0.4, N) # every number move left/right a little bit
random_numbers = (lattice + pertubation).astype(int)
# check if the minimum distance between two successive numbers
# i.e. all numbers are unique
min_dist = np.abs(np.diff(random_numbers)).min()
print(f"generating random numbers with minimum separation of {min_dist}")
print("(if it is > 1 you are good)")
np.random.shuffle(random_numbers)
(Both algorithms generate the result within 10s on my laptop)

Python - random numbers, higher and lower chances [duplicate]

I have a file with some probabilities for different values e.g.:
1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2
I would like to generate random numbers using this distribution. Does an existing module that handles this exist? It's fairly simple to code on your own (build the cumulative density function, generate a random value [0,1] and pick the corresponding value) but it seems like this should be a common problem and probably someone has created a function/module for it.
I need this because I want to generate a list of birthdays (which do not follow any distribution in the standard random module).
scipy.stats.rv_discrete might be what you want. You can supply your probabilities via the values parameter. You can then use the rvs() method of the distribution object to generate random numbers.
As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.
numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
If you are using Python 3.6 or above, you can use random.choices() from the standard library – see the answer by Mark Dickinson.
Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices.
Example usage: let's set up a population and weights matching those in the OP's question:
>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
Now choices(population, weights) generates a single sample, contained in a list of length 1:
>>> choices(population, weights)
[4]
The optional keyword-only argument k allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter to check that the distribution we get roughly matches the weights we gave.
>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
An advantage to generating the list using CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can get k numbers in O(k log n). Since normal Python lists are inefficient, you can use array module.
If you insist on constant space, you can do the following; O(n) time, O(1) space.
def random_distr(l):
r = random.uniform(0, 1)
s = 0
for item, prob in l:
s += prob
if s >= r:
return item
return item # Might occur because of floating point inaccuracies
(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)
pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)
I pseudo-confirmed that this works by eyeballing the output of this expression:
sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
for _ in range(1000))
Maybe it is kind of late. But you can use numpy.random.choice(), passing the p parameter:
val = numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
I wrote a solution for drawing random samples from a custom continuous distribution.
I needed this for a similar use-case to yours (i.e. generating random dates with a given probability distribution).
You just need the funtion random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.
import numpy as np
#funtion
def random_custDist(x0,x1,custDist,size=None, nControl=10**6):
#genearte a list of size random samples, obeying the distribution custDist
#suggests random samples between x0 and x1 and accepts the suggestion with probability custDist(x)
#custDist noes not need to be normalized. Add this condition to increase performance.
#Best performance for max_{x in [x0,x1]} custDist(x) = 1
samples=[]
nLoop=0
while len(samples)<size and nLoop<nControl:
x=np.random.uniform(low=x0,high=x1)
prop=custDist(x)
assert prop>=0 and prop<=1
if np.random.uniform(low=0,high=1) <=prop:
samples += [x]
nLoop+=1
return samples
#call
x0=2007
x1=2019
def custDist(x):
if x<2010:
return .3
else:
return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
samples=random_custDist(x0,x1,custDist=custDist,size=1000)
print(samples)
#plot
import matplotlib.pyplot as plt
#hist
bins=np.linspace(x0,x1,int(x1-x0+1))
hist=np.histogram(samples, bins )[0]
hist=hist/np.sum(hist)
plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
#dist
grid=np.linspace(x0,x1,100)
discCustDist=np.array([custDist(x) for x in grid]) #distrete version
discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
plt.plot(grid,discCustDist,label='custom distribustion (custDist)', color='C1', linewidth=4)
#decoration
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()
The performance of this solution is improvable for sure, but I prefer readability.
Make a list of items, based on their weights:
items = [1, 2, 3, 4, 5, 6]
probabilities= [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
# if the list of probs is normalized (sum(probs) == 1), omit this part
prob = sum(probabilities) # find sum of probs, to normalize them
c = (1.0)/prob # a multiplier to make a list of normalized probs
probabilities = map(lambda x: c*x, probabilities)
print probabilities
ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
ml = len(str(ml)) - str(ml).find('.') -1
amounts = [ int(x*(10**ml)) for x in probabilities]
itemsList = list()
for i in range(0, len(items)): # iterate through original items
itemsList += items[i:i+1]*amounts[i]
# choose from itemsList randomly
print itemsList
An optimization may be to normalize amounts by the greatest common divisor, to make the target list smaller.
Also, this might be interesting.
Another answer, probably faster :)
distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]
# init distribution
dlist = []
sumchance = 0
for value, chance in distribution:
sumchance += chance
dlist.append((value, sumchance))
assert sumchance == 1.0 # not good assert because of float equality
# get random value
r = random.random()
# for small distributions use lineair search
if len(distribution) < 64: # don't know exact speed limit
for value, sumchance in dlist:
if r < sumchance:
return value
else:
# else (not implemented) binary search algorithm
from __future__ import division
import random
from collections import Counter
def num_gen(num_probs):
# calculate minimum probability to normalize
min_prob = min(prob for num, prob in num_probs)
lst = []
for num, prob in num_probs:
# keep appending num to lst, proportional to its probability in the distribution
for _ in range(int(prob/min_prob)):
lst.append(num)
# all elems in lst occur proportional to their distribution probablities
while True:
# pick a random index from lst
ind = random.randint(0, len(lst)-1)
yield lst[ind]
Verification:
gen = num_gen([(1, 0.1),
(2, 0.05),
(3, 0.05),
(4, 0.2),
(5, 0.4),
(6, 0.2)])
lst = []
times = 10000
for _ in range(times):
lst.append(next(gen))
# Verify the created distribution:
for item, count in Counter(lst).iteritems():
print '%d has %f probability' % (item, count/times)
1 has 0.099737 probability
2 has 0.050022 probability
3 has 0.049996 probability
4 has 0.200154 probability
5 has 0.399791 probability
6 has 0.200300 probability
based on other solutions, you generate accumulative distribution (as integer or float whatever you like), then you can use bisect to make it fast
this is a simple example (I used integers here)
l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]
def get_cdf(l):
ret=[]
c=0
for i in l: c+=i[0]; ret.append((c, i[1]))
return ret
def get_random_item(cdf):
return cdf[bisect.bisect_left(cdf, (random.randint(0, cdf[-1][0]),))][1]
cdf=get_cdf(l)
for i in range(100): print get_random_item(cdf),
the get_cdf function would convert it from 20, 60, 10, 10 into 20, 20+60, 20+60+10, 20+60+10+10
now we pick a random number up to 20+60+10+10 using random.randint then we use bisect to get the actual value in a fast way
you might want to have a look at NumPy Random sampling distributions
None of these answers is particularly clear or simple.
Here is a clear, simple method that is guaranteed to work.
accumulate_normalize_probabilities takes a dictionary p that maps symbols to probabilities OR frequencies. It outputs usable list of tuples from which to do selection.
def accumulate_normalize_values(p):
pi = p.items() if isinstance(p,dict) else p
accum_pi = []
accum = 0
for i in pi:
accum_pi.append((i[0],i[1]+accum))
accum += i[1]
if accum == 0:
raise Exception( "You are about to explode the universe. Continue ? Y/N " )
normed_a = []
for a in accum_pi:
normed_a.append((a[0],a[1]*1.0/accum))
return normed_a
Yields:
>>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200 } )
[('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]
Why it works
The accumulation step turns each symbol into an interval between itself and the previous symbols probability or frequency (or 0 in the case of the first symbol). These intervals can be used to select from (and thus sample the provided distribution) by simply stepping through the list until the random number in interval 0.0 -> 1.0 (prepared earlier) is less or equal to the current symbol's interval end-point.
The normalization releases us from the need to make sure everything sums to some value. After normalization the "vector" of probabilities sums to 1.0.
The rest of the code for selection and generating a arbitrarily long sample from the distribution is below :
def select(symbol_intervals,random):
print symbol_intervals,random
i = 0
while random > symbol_intervals[i][1]:
i += 1
if i >= len(symbol_intervals):
raise Exception( "What did you DO to that poor list?" )
return symbol_intervals[i][0]
def gen_random(alphabet,length,probabilities=None):
from random import random
from itertools import repeat
if probabilities is None:
probabilities = dict(zip(alphabet,repeat(1.0)))
elif len(probabilities) > 0 and isinstance(probabilities[0],(int,long,float)):
probabilities = dict(zip(alphabet,probabilities)) #ordered
usable_probabilities = accumulate_normalize_values(probabilities)
gen = []
while len(gen) < length:
gen.append(select(usable_probabilities,random()))
return gen
Usage :
>>> gen_random (['a','b','c','d'],10,[100,300,400,200])
['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c'] #<--- some of the time
Here is a more effective way of doing this:
Just call the following function with your 'weights' array (assuming the indices as the corresponding items) and the no. of samples needed. This function can be easily modified to handle ordered pair.
Returns indexes (or items) sampled/picked (with replacement) using their respective probabilities:
def resample(weights, n):
beta = 0
# Caveat: Assign max weight to max*2 for best results
max_w = max(weights)*2
# Pick an item uniformly at random, to start with
current_item = random.randint(0,n-1)
result = []
for i in range(n):
beta += random.uniform(0,max_w)
while weights[current_item] < beta:
beta -= weights[current_item]
current_item = (current_item + 1) % n # cyclic
else:
result.append(current_item)
return result
A short note on the concept used in the while loop.
We reduce the current item's weight from cumulative beta, which is a cumulative value constructed uniformly at random, and increment current index in order to find the item, the weight of which matches the value of beta.

Standard deviation of combinations of dices

I am trying to find stdev for a sequence of numbers that were extracted from combinations of dice (30) that sum up to 120. I am very new to Python, so this code makes the console freeze because the numbers are endless and I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items in the list within result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy
dice = [1,2,3,4,5,6]
subset = itertools.product(dice, repeat = 30)
result = []
for x in subset:
if sum(x) == 120:
result.append(x)
my_result = numpy.product(result, axis = 1).tolist()
std = numpy.std(my_result)
print(std)
Note that D(X^2) = E(X^2) - E(X)^2, you can solve this problem analytically by following equations.
f[i][N] = sum(k*f[i-1][N-k]) (1<=k<=6)
g[i][N] = sum(k^2*g[i-1][N-k])
h[i][N] = sum(h[i-1][N-k])
f[1][k] = k ( 1<=k<=6)
g[1][k] = k^2 ( 1<=k<=6)
h[1][k] = 1 ( 1<=k<=6)
Sample implementation:
import numpy as np
Nmax = 120
nmax = 30
min_value = 1
max_value = 6
f = np.zeros((nmax+1, Nmax+1), dtype ='object')
g = np.zeros((nmax+1, Nmax+1), dtype ='object') # the intermediate results will be really huge, to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype ='object')
for i in range(min_value, max_value+1):
f[1][i] = i
g[1][i] = i**2
h[1][i] = 1
for i in range(2, nmax+1):
for N in range(1, Nmax+1):
f[i][N] = 0
g[i][N] = 0
h[i][N] = 0
for k in range(min_value, max_value+1):
f[i][N] += k*f[i-1][N-k]
g[i][N] += (k**2)*g[i-1][N-k]
h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result of an unfiltered lengths of 630 = 2*1023, impossible to handle as such.
There are two possibilities that can be combined:
Include more thinking to pre-treat the problem, e.g. on how to sample only
those with sum 120.
Do a Monte Carlo simulation instead, i.e. don't sample all
combinations, but only a random couple of 1000 to obtain a representative
sample to determine std sufficiently accurate.
Now, I only apply (2), giving the brute force code:
N = 30 # number of dices
M = 100000 # number of samples
S = 120 # required sum
result = [[random.randint(1,6) for _ in xrange(N)] for _ in xrange(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are out after the standard deviation of the product of the 30 dices, that is what your code does. Then I need 1 000 000 samples to get roughly reproducible values for std (1 digit) - takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*1016 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
def p2(b, s):
return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
hits = range(31)
subset = itertools.product(hits, repeat=4) # only 3,4,5,6 frequencies
product = []
permutations = []
for s in subset:
b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3]) # 2 frequency
a = 30 - (b + sum(s)) # 1 frequency
if 0 <= b <= 30 and 0 <= a <= 30:
product.append(p2(b, s))
permutations.append(1) # TODO: Replace 1 with possible permutations
print numpy.std(product) # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: The sampling is not of the same probability - that is the problem here. Each sample has a different number of possible combinations, giving its weight, which has to be considered before taking the std-deviation. I have drafted that in the code above.

Generate N "random" string of length K using probability table

How to create N "random" strings of length K using the probability table? K would be some even number.
prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
Let's say K = 6, there would be a higher probability of 'acacab' than 'aaaaaa'.
This is sub-problem of a larger problem that I’m using to generate synthetic sequences based on a probability table. I’m not sure how to use the probability table to generate “random” strings?
What I have so far:
def seq_prob(fprob_table,K= 6, N= 10):
#fprob_table is the probability dictionary that you input
#K is the length of the sequence
#N is the amount of sequences
seq_list = []
#possibly using itertools or random to generate the semi-"random" strings based on the probabilities
return seq_list
There are some good approaches to making weighted random choices described at the end of the documentation for the builtin random module:
A common task is to make a random.choice() with weighted probabilities.
If the weights are small integer ratios, a simple technique is to build a sample population with repeats:
>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'
A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():
>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'
To adapt that latter approach to your specific problem, I'd do:
import random
import itertools
import bisect
def seq_prob(fprob_table, K=6, N=10):
choices, weights = fprob_table.items()
cumdist = list(itertools.accumulate(weights))
results = []
for _ in range(N):
s = ""
while len(s) < K:
x = random.random() * cumdist[-1]
s += choices[bisect.bisect(cumdist, x)]
results.append(s)
return results
This assumes that the key strings in your probability table are all the same length If they have multiple different lengths, this code will sometimes (perhaps most of the time!) give answers that are longer than K characters. I suppose it also assumes that K is an exact multiple of the key length, though it will actually work if that's not true (it just will give result strings that are all longer than K characters, since there's no way to get K exactly).
You could use random.random:
from random import random
def seq_prob(fprob_table, K=6, N=10):
#fprob_table is the probability dictionary that you input
#K is the length of the sequence
#N is the amount of sequences
seq_list = []
s = ""
while len(seq_list) < N:
for k, v in fprob_table.items():
if len(s) == K:
seq_list.append(s)
s = ""
break
rn = random()
if rn <= v:
s += k
return seq_list
This can be no doubt be improved upon but the random.random is useful when dealing with probability.
I'm sure there is a cleaner/better way, but here is one easy way to do this.
Here we're filling pick_list with the 100 separate character-pair values, the number of values determined by the probability. In this case, there are 20 'aa', 30 'ab' and 50 'ac' entries within pick_list. Then random.choice(pick_list) uniformly pulls a random entry from the list.
import random
prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
def seq_prob(fprob_table, K=6, N=10):
#fprob_table is the probability dictionary that you input
# fill list with number of items based on the probabilities
pick_list = []
for key, prob in fprob_table.items():
pick_list.extend([key] * int((prob * 100)))
#K is the length of the sequence
#N is the amount of sequences
seq_list = []
for i in range(N):
sub_seq = "".join(random.choice(pick_list) for _ in range(int(K/2)))
seq_list.append(sub_seq)
return seq_list
With results:
seq_prob(prob_table)
['ababac',
'aaacab',
'aaaaac',
'acacac',
'abacac',
'acaaac',
'abaaab',
'abaaab',
'aaabaa',
'aaabaa']
If your tables or sequences are large, using numpy may be helpful as it will probably be significantly faster. Also, numpy is built for this sort of problem, and the approach is easy to understand and just 3 or 4 lines.
The idea would be to convert the probabilities into cumulative probabilities, ie, mapping (.2, .5, .3) to (.2, .7, 1.), and then random numbers generated along the flat distribution from 0 to 1 will fall within the bins of the cumulative sum with a frequency corresponding to the weights. Numpy's searchsorted can be used to quickly find which bin the random values lie within. That is,
import numpy as np
prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3 # number of strings (not number of characters)
rvals = np.random.random((N, k)) # generate a bunch of random values
string_indices = np.searchsorted(np.cumsum(prob_table.values()), rvals) # weighted indices
x = np.array(prob_table.keys())[string_indices] # get the strings associated with the indices
y = ["".join(x[i,:]) for i in range(x.shape[0])] # convert this to a list of strings
# y = ['acabab', 'acacab', 'acabac', 'aaacaa', 'acabac', 'acacab', 'acabaa', 'aaabab', 'abacac', 'aaabab']
Here I used k as the number of strings you would need, rather than K as the number of characters, since the problem statement is ambiguous about strings/characters.

Categories