Statistics: Optimizing probability calculations within Python

Setup:
The question is a more complex form of a classic probability question:
70 colored balls are placed in an urn, 10 for each of the seven rainbow colors.
What is the expected number of distinct colors in 20 randomly picked balls?
My solution uses Python's itertools library:
import itertools
combos = itertools.combinations(urn, 20)
print(sum(1 for x in combos))
(where urn is a list of the 70 balls in the urn).
I can unpack the iterator up to combinations(urn, 8); past that, my computer can't handle it.
Note: I know this alone wouldn't give me the answer; it is only the roadblock in my script. In other words, if this worked, my script would work.
Question: How could I find the expected number of colors accurately, without the world's fastest supercomputer? Is my approach even computationally feasible?

Since a couple of people have asked to see the mathematical solution, I'll give it. This is one of the Project Euler problems that can be done in a reasonable amount of time with pencil and paper. The answer is
7(1 - (60 choose 20)/(70 choose 20))
To get this, write X, the count of colors present, as a sum X0+X1+X2+...+X6, where Xi is 1 if the ith color is present, and 0 if it is not present.
E(X)
= E(X0+X1+...+X6)
= E(X0) + E(X1) + ... + E(X6) by linearity of expectation
= 7E(X0) by symmetry
= 7 * probability that a particular color is present
= 7 * (1 - probability that a particular color is absent)
= 7 * (1 - (# ways to pick 20 avoiding a color)/(# ways to pick 20))
= 7 * (1 - (60 choose 20)/(70 choose 20))
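As a quick numeric check of that closed form (math.comb is available in Python 3.8+):
from math import comb
print(7 * (1 - comb(60, 20) / comb(70, 20)))   # roughly 6.8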
Expectation is always linear. So, when you are asked to find the average value of some random quantity, it often helps to try to rewrite the quantity as a sum of simpler pieces such as indicator (0-1) random variables.
This does not say how to make the OP's approach work. Although there is a direct mathematical solution, it is good to know how to iterate through the cases in an organized and practicable fashion. This could help if you next wanted a more complicated function of the set of colors present than the count. Duffymo's answer suggested something that I'll make more explicit:
You can break up the ways to draw 20 balls from 70 into categories indexed by the counts of colors. For example, the index (5,5,10,0,0,0,0) means we drew 5 of the first color, 5 of the second color, 10 of the third color, and none of the other colors.
The set of possible indices is contained in the collection of 7-tuples of nonnegative integers with sum 20. Some of these are impossible, such as (11,9,0,0,0,0,0) by the problem's assumption that there are only 10 balls of each color, but we can deal with that. The set of 7-tuples of nonnegative numbers adding up to 20 has size (26 choose 6)=230230, and it has a natural correspondence with the ways of choosing 6 dividers among 26 spaces for dividers or objects. So, if you have a way to iterate through the 6 element subsets of a 26 element set, you can convert these to iterate through all indices.
You still have to weight the cases by the counts of the ways to draw 20 balls from 70 to get that case. The weight of (a0,a1,a2,...,a6) is (10 choose a0) * (10 choose a1) * ... * (10 choose a6). This handles the case of impossible indices gracefully: 10 choose 11 is 0, so the product is 0.
So, if you didn't know about the mathematical solution by the linearity of expectation, you could iterate through 230230 cases and compute a weighted average of the number of nonzero coordinates of the index vector, weighted by a product of small binomial terms.
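Here is a sketch of that iteration; the function name and structure are just one way to organize it, using math.comb and the stars-and-bars encoding described above:
from itertools import combinations
from math import comb

def expected_colors(per_color=10, colors=7, draw=20):
    # Iterate over all 7-tuples of nonnegative counts summing to 20 by choosing
    # 6 divider positions among 26 slots, weight each case by the number of
    # ways to draw it, and average the number of nonzero colors.
    slots = draw + colors - 1
    total = 0
    for dividers in combinations(range(slots), colors - 1):
        counts, prev = [], -1
        for d in dividers:
            counts.append(d - prev - 1)
            prev = d
        counts.append(slots - 1 - prev)
        weight = 1
        for a in counts:
            weight *= comb(per_color, a)   # comb(10, 11) == 0, so impossible cases drop out
        total += weight * sum(1 for a in counts if a > 0)
    return total / comb(per_color * colors, draw)

print(expected_colors())   # agrees with 7*(1 - C(60,20)/C(70,20))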

Wouldn't it just be combinations with repetition?
http://www.mathsisfun.com/combinatorics/combinations-permutations.html

Make an urn with 10 of each color.
Decide on the number of trials you want.
Make a container to hold the result of each trial.
For each trial, pick a random sample of twenty items from the urn, make a set of those items, and add the length of that set to the results.
Find the average of the results.
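A minimal sketch of that simulation (color labels and trial count are just illustrative):
import random

urn = [color for color in range(7) for _ in range(10)]   # 10 balls of each of 7 colors
trials = 100000
results = []
for _ in range(trials):
    sample = random.sample(urn, 20)      # draw 20 balls without replacement
    results.append(len(set(sample)))     # distinct colors in this draw
print(sum(results) / trials)             # converges to roughly 6.8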

Related

Matrix Math - Maximizing

I have a dataframe with an index of magic card names. The columns use the same index, resulting in a 1081 x 1081 dataframe of each card in my collection paired with each other card in my collection.
I have code that identifies combos of cards that go well together. For example, "Whenever you draw a card" pairs well with "Draw a card" cards. I find the cell where those two cards intersect and increase its value by 1.
Now, I need to find the maximum value for 36 cards.
But, how?
Randomly selecting cards is useless; there are 1.717391336E+74 potential combinations. I've tried pulling out the lowest values, and that reduces the set of potential combinations, but even at 100 cards you're talking about 1.977204582E+27 potentials.
This has to have been solved by someone smarter than me - can y'all point me in the right direction?
As you pointed out already, the combinatorics are not on your side here. There are 1081 choose 36 possible sets (binomial coefficient), so it is out of the question to check all of them.
I am not aware of any practicable solution for finding the optimal set in the general problem, that is, without knowing anything further about the 1081x1081 matrix.
For an approximate solution for the general problem, you might want to try a greedy approach, while keeping a history of n sets after each step, with e.g. n = 1000.
So you would start with going through all sets with 2 cards, which is 1081 * 1080 / 2 combinations, look up the value in the matrix for each and pick the n max ones.
In the second step, for each of the n kept sets, go through all possible combinations with a third card (and check for duplicate sets), i.e. checking n * 1079 sets, and keep the n max ones.
In the third step, check n * 1078 sets with a fourth card, and so on, and so forth.
Of course, this won't give you the optimal solution for the general case, but maybe it's good enough for your given situation. You can also take a look at the history, to get a feeling for how often it happens that the best set from step x is caught up by another set in a step y > x. Depending on your matrix, it might not happen that often or even never.
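A rough sketch of that beam search, assuming the goal is to maximize the sum of pairwise synergy values and that the scores live in a 1081x1081 numpy array called matrix (all names here are illustrative):
import numpy as np

def beam_search(matrix, set_size=36, beam_width=1000):
    n = matrix.shape[0]
    # Step 1: score every pair of cards and keep the beam_width best pairs.
    pairs = [(matrix[i, j], frozenset((i, j)))
             for i in range(n) for j in range(i + 1, n)]
    beam = sorted(pairs, key=lambda t: t[0], reverse=True)[:beam_width]
    # Steps 2 onward: extend every kept set by one card, drop duplicate sets,
    # and keep the beam_width highest-scoring sets.
    for _ in range(set_size - 2):
        candidates = {}
        for score, members in beam:
            for c in range(n):
                if c in members:
                    continue
                grown = members | {c}
                new_score = score + sum(matrix[c, m] for m in members)
                if candidates.get(grown, float("-inf")) < new_score:
                    candidates[grown] = new_score
        beam = sorted(((s, m) for m, s in candidates.items()),
                      key=lambda t: t[0], reverse=True)[:beam_width]
    return max(beam, key=lambda t: t[0])   # (score, frozenset of 36 card indices)
Pure Python like this will be slow at 1081 cards; vectorizing the extension step with numpy helps, but the shape of the search is the same.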

Choose One Item from Every List, up to N combination, uniform distribution

I have 100 lists [x1..x100], each one containing about 10 items [x_i_1, ..., x_i_10].
I need to generate 80 vectors. Each vector is a product over all the lists, kind of like itertools.product(*x), except for 2 things:
(1)
I need every item in each vector to have a uniform distribution.
for example:
[np.random.choice(xi) for xi in [x1..x100]] would be good, except for my second condition:
(2)
I can't have repetitions.
itertools.product solves this, but it doesn't meet condition (1).
I need to generate 80 vectors, use them, and re-ask for another 80, and repeat this process until a certain condition is met.
For EACH vector across the whole 80-size batch, I need them to be uniform (condition 1) and non-repeating (condition 2).
Creating all permutations and shuffling that list is a great solution for a smaller list; I'm using this batch system because of the HUGE number of possible permutations.
Any ideas?
thx
Just use [np.random.choice(xi) for xi in [x1..x100]]. The probability that the same vector will be generated twice in 80 trials is vanishingly small. By the birthday problem, the probability that n items chosen independently from a set of d items will contain a repeat is approximately 1 - exp(-n*(n-1)/(2*d)). In your case n = 80 and d = 10**100. The resulting probability is zero to a ridiculously large number of decimal places (it is roughly 3.2 x 10^-97, i.e. it begins 0.000... with about 96 zeros after the decimal point). Forget 80. You could generate 80 trillion such vectors and still have a vanishingly small probability of generating the same vector twice.
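If you want to be defensive anyway, a cheap duplicate guard costs essentially nothing. A sketch (assuming the list items are hashable; pass the same seen set to every call so repeats are ruled out across batches too):
import numpy as np

def next_batch(lists, seen, batch_size=80):
    # One uniform, independent choice per list; skip the (astronomically
    # unlikely) event that a vector was already produced.
    batch = []
    while len(batch) < batch_size:
        vec = tuple(np.random.choice(xi) for xi in lists)
        if vec not in seen:
            seen.add(vec)
            batch.append(vec)
    return batch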

Random Integer inside range with probability (Python) [duplicate]

This question already has answers here:
Random weighted choice
(7 answers)
Closed 8 years ago.
I am making a text-based RPG. I have an algorithm that determines the damage dealt by the player to the enemy which is based off the values of two variables. I am not sure how the first part of the algorithm will work quite yet, but that isn't important.
(AttackStrength is an attribute of the player that represents generally how strong his attacks are. WeaponStrength is an attribute of swords the player wields and represents generally how strong attacks are with the weapon.)
Here is how the algorithm will go:
import random
Damage = AttackStrength (Do some math operation to WeaponStrength) WeaponStrength
DamageDealt = random.randrange(Damage - 4, Damage + 2)  # bad pseudocode, sorry; randrange excludes the stop value
What I am trying to do with the last line is get a random integer inside a range of integers with the minimum bound as 4 less than Damage, and the maximum bound as 1 more than Damage. But, that's not all. I want to assign probabilities that:
X% of the time DamageDealt will equal Damage
Y% of the time DamageDealt will equal one less than Damage
Z% of the time DamageDealt will equal two less than Damage
A% of the time DamageDealt will equal three less than Damage
B% of the time DamageDealt will equal three less than Damage
C% of the time DamageDealt will equal one more than Damage
I hope I haven't over-complicated all of this. Thank you!
I think the easiest way to do random weighted probability when you have nice integer probabilities like that is to simply populate a list with multiple copies of your choices - in the right ratios - then choose one element from it, randomly.
Let's do it from -3 to 1 with your (original) weights of 10,10,25,25,30 percent. These share a gcd of 5, so you only need a list of length 20 to hold your choices:
choices = [-3]*2 + [-2]*2 + [-1]*5 + [0]*5 + [1]*6
And implementation done, just choose randomly from that. Demo showing 100 trials:
trials = [random.choice(choices) for _ in range(100)]
[trials.count(i) for i in range(-3,2)]
Out[18]: [11, 7, 27, 22, 33]
Essentially, what you're trying to accomplish is simulation of a loaded die: you have six possibilities and want to assign different probabilities to each one. This is a fairly interesting problem, mathematically speaking, and here is a wonderful piece on the subject.
Still, you're probably looking for something a little less verbose, and the easiest pattern to implement here would be via roulette wheel selection. Given a dictionary where keys are the various 'sides' (in this case, your possible damage formulae) and the values are the probabilities that each side can occur (.3, .25, etc.), the method looks like this:
import random

def weighted_random_choice(choices):
    total = sum(choices.values())
    pick = random.uniform(0, total)
    current = 0
    for key, value in choices.items():
        current += value
        if current > pick:
            return key
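For example, using the question's offsets as keys (the weights here just echo the question and only need to be relative):
damage_probs = {+1: 0.10, 0: 0.30, -1: 0.25, -2: 0.25, -3: 0.15}
DamageDealt = Damage + weighted_random_choice(damage_probs)   # Damage as computed earlier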
Suppose that we wanted to have these relative weights for the outcomes:
a = (10, 15, 15, 25, 25, 30)
Then we create a list of partial sums b and a function c:
import random
b = [sum(a[:i+1]) for i, x in enumerate(a)]

def c():
    n = random.randrange(sum(a))
    for i, v in enumerate(b):
        if n < v:
            return i
The function c will return an integer from 0 to len(a)-1 with probability proportional to the weights specified in a.
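A quick empirical check of c():
from collections import Counter
print(Counter(c() for _ in range(120000)))   # counts roughly proportional to the weights in a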
This can be a tricky problem with a lot of different probabilities. Since you want to impose probabilities on the outcomes it's not really fair to call them "random". It always helps to imagine how you might represent your data. One way would be to keep a tuple of tuples like
probs = ((10, +1), (30, 0), (25, -1), (25, -2), (15, -3))
You will notice I have adjusted the series to put the highest adjustment first and so on. I have also removed the duplicate (15, -3) that your question implies because (I imagine) of a line duplicated by accident. One very useful test is to ensure that your probabilities add up to 100 (since I've represented them as integer percentages). This reveals a data fault:
>>> sum(prob[0] for prob in probs)
105
This needn't be an issue unless you really want your probabilities to sum to a sensible value. If this isn't necessary you can just treat them as weightings and select random numbers from (0, 104) instead of (0, 99). This is the course I will follow, but the adjustment should be relatively simple.
Given probs and a random number between 0 and (in your case) 104, you can iterate over the probs structure, accumulating probabilities until you find the bin this particular random number belongs to. This would look (something) like:
def damage_offset(N):
    pick = random.randint(0, N - 1)
    cum_prob = 0
    for prob, offset in probs:
        cum_prob += prob
        if cum_prob > pick:
            return offset
This should always terminate if you get your data right (hence my paranoid check on your weightings - I've been doing this quite a while).
Of course it's often possible to trade memory for speed. If the above needs to work faster then it's relatively easy to create a structure that maps random integer choices direct to their results. One way to construct such a mapping would be
damage_offsets = []
for weight, offset in probs:
    damage_offsets.extend([offset] * weight)   # weight copies of each offset
Then all you have to do after you've picked your random number r between 1 and N is to look up damage_offsets[r-1] for that particular value of r, and you have an O(1) operation. As I mentioned at the start, this isn't likely to be terribly useful unless your probability list becomes huge (but if it does then you really will need to avoid O(N) operations when N, the number of probability buckets, is large).
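Using it is then just (with N the total weight, 105 here):
N = sum(weight for weight, offset in probs)
r = random.randint(1, N)
print(damage_offsets[r - 1])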
Apologies for untested code.

Divide set into subsets with equal number of elements

For the purpose of conducting a psychological experiment I have to divide a set of pictures (240) described by 4 features (real numbers) into 3 subsets with an equal number of elements in each subset (240/3 = 80), in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
If I understand your problem correctly, you might use random.sample() in Python:
import random

pool = set(["foo", "bar", "baz", "123", "456", "789"])   # your 240 elements here
slen = len(pool) // 3                        # we need 3 subsets
set1 = set(random.sample(list(pool), slen))  # 1st random subset
pool -= set1
set2 = set(random.sample(list(pool), slen))  # 2nd random subset
pool -= set2
set3 = pool                                  # 3rd random subset
I would tackle this as follows:
Divide into 3 equal subsets.
Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
Compare each pair of elements; if swapping would reduce the "unevenness", swap them. Continue until there are either no more pairs to compare, or the total unevenness is below some arbitrary "good enough" threshold.
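A sketch of that swap loop, assuming the 240x4 feature values sit in a numpy array called features (the unevenness measure below is just one reasonable choice):
import numpy as np

def unevenness(groups, features):
    # spread of per-group means and standard deviations across the groups
    means = np.array([features[g].mean(axis=0) for g in groups])
    stds = np.array([features[g].std(axis=0) for g in groups])
    return means.var(axis=0).sum() + stds.var(axis=0).sum()

def balance(features, n_groups=3, max_passes=20):
    idx = list(np.random.permutation(len(features)))
    groups = [idx[k::n_groups] for k in range(n_groups)]   # 3 equal subsets of 80
    for _ in range(max_passes):
        improved = False
        for a in range(n_groups):
            for b in range(a + 1, n_groups):
                for i in range(len(groups[a])):
                    for j in range(len(groups[b])):
                        before = unevenness(groups, features)
                        groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
                        if unevenness(groups, features) < before:
                            improved = True   # keep the swap
                        else:
                            groups[a][i], groups[b][j] = groups[b][j], groups[a][i]   # undo
        if not improved:
            break
    return groups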
You can easily do this using the plyr library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, 60, replace = TRUE))
In case you are still interested in the exhaustive search question: you have 240 choose 80 possibilities to choose the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i] X[3*i+1] X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure.
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
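A sketch of that procedure, again assuming a 240x4 numpy array called features and a group count that divides the number of items; the mismatch measure below only compares group means, which is one of several reasonable choices:
import numpy as np
from itertools import permutations

def mahalanobis_split(features, n_groups=3):
    mu = features.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(features, rowvar=False))
    diff = features - mu
    dist = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis distance
    order = np.argsort(-dist)                              # most extreme items first
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(order), n_groups):
        chunk = [int(x) for x in order[start:start + n_groups]]
        def mismatch(perm):
            trial = [groups[g] + [chunk[perm[g]]] for g in range(n_groups)]
            means = np.array([features[t].mean(axis=0) for t in trial])
            return means.var(axis=0).sum()
        best = min(permutations(range(n_groups)), key=mismatch)
        for g in range(n_groups):
            groups[g].append(chunk[best[g]])
    return groups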

Challenging dynamic programming problem

This is a toned down version of a computer vision problem I need to solve. Suppose you are given parameters n,q and have to count the number of ways of assigning integers 0..(q-1) to elements of n-by-n grid so that for each assignment the following are all true
No two neighbors (horizontally or vertically) get the same value.
Value at position (i,j) is 0
Value at position (k,l) is 0
Since (i,j,k,l) are not given, the output should be an array of the evaluations above, one for every valid setting of (i,j,k,l)
A brute force approach is below. The goal is to get an efficient algorithm that works for q<=100 and for n<=18.
def tuples(n, q):
    # all q-ary tuples of length n
    return [[a] + b for a in range(q) for b in tuples(n - 1, q)] if n > 1 else [[a] for a in range(q)]

def isvalid(t, n):
    grid = [t[n * i:n * (i + 1)] for i in range(n)]
    for r in range(n):
        for c in range(n):
            v = grid[r][c]
            left = grid[r][c - 1] if c > 0 else -1
            right = grid[r][c + 1] if c < n - 1 else -1
            top = grid[r - 1][c] if r > 0 else -1
            bottom = grid[r + 1][c] if r < n - 1 else -1
            if v == left or v == right or v == top or v == bottom:
                return False
    return True

def count(n, q):
    result = []
    for pos1 in range(n ** 2):
        for pos2 in range(n ** 2):
            total = 0
            for t in tuples(n ** 2, q):
                if t[pos1] == 0 and t[pos2] == 0 and isvalid(t, n):
                    total += 1
            result.append(total)
    return result
assert count(2,2)==[1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
Update 11/11
I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 3 hours for n=10, any q, from author's estimate)
Maybe this sounds too simple, but it works. Randomly distribute values to all the cells until only two are empty. Test all values for adjacency violations. Compute the running percentage of successful casts versus all casts until the variance drops to within an acceptable margin.
The risk goes to zero, and all that is at risk is a little runtime.
This isn't an answer, just a contribution to the discussion which is too long for a comment.
tl;dr: Any algorithm which boils down to "compute the possibilities and count them," such as Eric Lippert's or a brute force approach, won't work for @Yaroslav's goal of q <= 100 and n <= 18.
Let's first think about a single n x 1 column. How many valid numberings of this one column exist? For the first cell we can pick between q numbers. Since we can't repeat vertically, we can pick between q - 1 numbers for the second cell, and therefore q - 1 numbers for the third cell, and so on. For q == 100 and n == 18 that means there are q * (q - 1) ^ (n - 1) = 100 * 99 ^ 17 valid colorings which is very roughly 10 ^ 36.
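For the problem's limits that single-column count is easy to check numerically:
q, n = 100, 18
print(q * (q - 1) ** (n - 1))   # about 8.4e35, i.e. very roughly 10**36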
Now consider any two valid columns (call them the bread columns) separated by a buffer column (call it the mustard column). Here is a trivial algorithm to find a valid set of values for the mustard column when q >= 4. Start at the top cell of the mustard column. We only have to worry about the adjacent cells of the bread columns which have at most 2 unique values. Pick any third number for the mustard column. Consider the second cell of the mustard column. We must consider the previous mustard cell and the 2 adjacent bread cells with a total of at most 3 unique values. Pick the 4th value. Continue to fill out the mustard column.
We have at most 2 columns containing a hard coded cell of 0. Using mustard columns, we can therefore make at least 6 bread columns, each with about 10 ^ 36 solutions for a total of at least 10 ^ 216 valid solutions, give or take an order of magnitude for rounding errors.
There are, according to Wikipedia, about 10 ^ 80 atoms in the universe.
Therefore, be cleverer.
Update 11/11 I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 41 hours for n=10, any q, from author's estimate)
I'm the author. Not 41, just 3 embarrassingly parallelizable CPU hours. I've counted symmetries: for n=10 there are only 675 really distinct pairs of (i,j) and (k,l). My program needs ~16 seconds for each.
I'm building on Dave Aaron Smith's contribution to the discussion.
Let's not consider for now the last two constraints ((i,j) and (k,l)).
With only one column (nx1) the solution is q * (q - 1) ^ (n - 1).
How many choices are there for a second column? (q-1) for the top cell (1,2), but then q-1 or q-2 for the cell (2,2), depending on whether (1,2) and (2,1) have the same color or not.
Same thing for (3,2): q-1 or q-2 solutions.
We can see we have a binary tree of possibilities and we need to sum over that tree. Let's assume left child is always "same color on top and at left" and right child is "different colors".
By computing, over the tree, the number of possibilities for the left column to create such a configuration and the number of possibilities for the new cells we are coloring, we would count the number of possibilities for coloring two columns.
But let's now consider the probability distribution for the coloring of the second column: if we want to iterate the process, we need a uniform distribution on the second column. It should be as if the first column never existed, so that among all colorings of the first two columns we could say things like: 1/q of them have color 0 in the top cell of the second column.
Without a uniform distribution it would be impossible.
The problem: is the distribution uniform?
Answer:
We would obtain the same number of solutions by building the second column first, then the first one, and then the third one. The distribution of the second column is uniform in that case, so it also is in the first case.
We can now apply the same "tree idea" to count the number of possibilities for the third column.
I will try to develop that further and build a general formula (since the tree is of size 2^n we don't want to explore it explicitly).
A few observations which might help other answerers as well:
The values 1..q are interchangeable - they could be letters and the result would be the same.
The constraint that no neighbours match is a very mild one, so a brute force approach will be excessively expensive. Even if you knew the values in all but one cell, there would still be at least q-8 possibilities for q>8.
The output of this will be pretty long - every set of i,j,k,l will need a line. The number of combinations is something like n^2(n^2-3), since the two fixed zeroes can be anywhere except adjacent to each other, unless they need not obey the first rule. For n=100 and q=18, the maximally hard case, this is ~ 100^4 = 100 million. So that's your minimum complexity, and it is unavoidable as the problem is currently stated.
There are simple cases - when q=2, there are only the two possible checkerboards, so for any given pair of zeroes the answer is 1 or 0, depending on whether both zeroes can lie on the same checkerboard.
Point 3 makes the whole program O(n^2(n^2-3)) as a minimum, and also suggests that you will need something reasonably efficient for each pair of zeroes, as simply writing 100 million lines without any computation will take a while. For reference, at a second per line, that is 1x10^8 s ~ 3 years, or 3 months on a 12-core box.
I suspect that there is an elegant answer given a pair of zeroes, but I'm not sure that there is an analytic solution to it. Given that you can do it with 2 or 3 colours depending on the positions of the zeroes, you could split the map into a series of regions, each of which uses only 2 or 3 colours, and then it's just the number of different combinations of 2 or 3 in q (qC2 or qC3) for each region times the number of regions, times the number of ways of splitting the map.
I'm not a mathematician, but it occurs to me that there ought to be an analytical solution to this problem, namely:
First, compute how many different colourings are possible for an NxN board with Q colours (where neighbours, defined as cells sharing a common edge, don't get the same colour). This ought to be a pretty simple formula.
Then figure out how many of these solutions have 0 in (i,j); this should be a 1/Q fraction of them.
Then figure out how many of the remaining solutions have 0 in (k,l), depending on the Manhattan distance |i-k|+|j-l|, and possibly the distance to the board edge and the "parity" of these distances, as in distance divisible by 2, divisible by 3, or divisible by Q.
The last part is the hardest, though I think it might still be doable if you are really good at math.
