So, I'm a beginner in python (coding in general, really), and I've tried to make this little program which generates a random number of rods in 305 attempts
import random
rods = 0
def blazerods():
    global rods
    seed = random.randint(0, 100000000000)
    random.seed(seed)
    i = 0
    rods = 0
    for i in range(0, 305):
        rnd = random.random()
        if rnd < 0.50:
            rods += 1
    print(rods)
    return rods
while 1==1:
    blazerods()
    if rods >= 211:
        break
The goal is to get 211 or more rods. However, I ran the program for 30 minutes without results.
My questions are: Is it even possible to get 211 or higher with just this code I included?
Can I make it more likely that rods can be more than 211 (still being a very unlikely result, ofc) without changing the chance(50%)?
Is random.seed(seed) even useful?
The probability distribution of rods is Binomial(305,0.5), that is the probability of getting exactly n rods is (305 choose n) * 0.5^305.
To get the probability to get at least 211, you need to sum these terms from 211 to 305. Wolfram alpha gives that as 8.8e-12.
So... it is really, really unlikely and you will have to wait a long time.
If your loop runs 1000 times a second, you will expect to have enough rods about once every 4 years.
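The sum described above can be checked with the standard library alone. This is a minimal sketch rather than part of the original answer: binomial_tail is a helper name chosen here, and the 1000 tries per second is the same assumption as in the text.

```python
from math import comb

def binomial_tail(n, k_min):
    """P(X >= k_min) for X ~ Binomial(n, 0.5), using exact integer counts."""
    favourable = sum(comb(n, k) for k in range(k_min, n + 1))
    return favourable / 2**n

p = binomial_tail(305, 211)
print(p)

# Expected waiting time in years, assuming 1000 tries per second
tries_per_second = 1000
seconds_per_year = 365.25 * 24 * 3600
print(1 / p / tries_per_second / seconds_per_year)
```

The printed probability should agree with the value quoted from Wolfram Alpha, and the second number with the "about once every 4 years" estimate.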
If I remember correctly, Matt Parker from the Youtube channel Stand-up Maths has something to say about this particular case in his video "How lucky is too lucky".
As pointed out by Jens, this is easy to calculate via the Binomial distribution. The SciPy stats module allows you to calculate this by doing:
from scipy import stats
# i.e. 305 draws with equal probability
d = stats.binom(305, 0.5)
# the probability of seeing something greater than this value
p = d.sf(210)
which should give you the same value as Jens got: ~8.8e-12.
Next we can use the datetime module to convert this number into the expected time you have to wait:
from datetime import timedelta
time_per_try = timedelta(seconds=1/1000)
print(time_per_try / p)
which should give you ~1300 days, or 3.6 years. Technically, this is the mean waiting time: there is roughly a 63% chance (1 - 1/e) of seeing it by then, a 50% chance after about 0.69 times that (around 2.5 years), and it could appear much sooner or later.
You can calculate reasonable values of when this would happen, using the negative binomial distribution. In Python, this looks like:
for q in stats.nbinom(1, p).ppf([0.025, 0.975]):
print(time_per_try * q)
where the 0.025 and 0.975 values give you the central 95% interval of the waiting time (the same 2.5% and 97.5% cut-offs as the 95% confidence intervals you hear scientists talking about).
It tells you that if you had 20 computers running your algorithm in parallel, each doing 1000 tests per second, you could expect the first one to finish in around a month while the slowest one would likely be going on for more than 10 years.
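The waiting-time figures can also be reproduced without SciPy, because the number of tries until the first success follows a geometric distribution. The sketch below is only an illustration: it plugs in the rounded p of 8.8e-12 and the assumed 1000 tries per second, and the helper names tries_quantile and to_days are made up here.

```python
import math

p = 8.8e-12              # per-try success probability quoted above (rounded)
tries_per_second = 1000  # assumed speed, as in the text

def tries_quantile(q, p):
    """Number of tries by which a first success has occurred with probability q."""
    return math.log1p(-q) / math.log1p(-p)

def to_days(tries):
    return tries / tries_per_second / 86400

days_low = to_days(tries_quantile(0.025, p))
days_median = to_days(tries_quantile(0.5, p))
days_high = to_days(tries_quantile(0.975, p))

# Chance of at least one success within the mean number of tries, 1/p
prob_by_mean = -math.expm1(math.log1p(-p) / p)

print(days_low, days_median, days_high, prob_by_mean)
```

The low and high values correspond to the "around a month" and "more than 10 years" figures, while the median sits noticeably below the mean of roughly 1300 days.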
I want to calculate the remaining probabilities for each result in a football game at minute n.
In this case I have expected goals of 2.69 for the home team and 1.12 for the away team, at minute 70, with a current result of 2-1.
Code
from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd
xgh = 2.69
xga = 1.12
minute = 70
hg, ag = 2,1
phs=[]
pas=[]
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga, k=l, loc=ag)
    pas.append(pa)
prod_table = np.array([(i*j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)
prob_df = pd.DataFrame(prod_table, index=range(0,6), columns=range(0, 6))
This returns a probability of 2.21% for a 2-1 final result, which is pretty low; I expected a high probability considering there are only 20 minutes left.
Math considerations
Poisson distribution is the probability that an event occurs k times in a given time frame, knowing that, on average, it is supposed to occur μ times in this same time frame.
The postulate of Poisson distribution is that events are totally independent. So how many times it has already occurred is meaningless. And that they are uniformly distributed (If I may use this confusing word, since this is not a uniform distribution).
Most of the time, Poisson's usage is to compute probability of occurrence of k events in a timeframe T, when we know that μ events occur on average in a timeframe τ (difference with 1st sentence being that T and τ are not the same).
But that is the easy part: since events are uniformly distributed, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to observe millions of time frames T, then on average there would be μT/τ events in each of them).
So, to compute the probability that the event occurs k times in time frame T, knowing that it occurs μ times on average in time frame τ, you just have to answer the question "what is the probability that the event occurs k times in time frame T, knowing that it occurs μT/τ times on average in that same time frame". Which is the question Poisson can answer.
In python, that answer is poisson.pmf(k, μT/τ).
In your case, you know μ, the number of goals expected in a 90-minute time frame. You know that the time frame left to score is 20 minutes. If 2.69 goals are expected in a time frame of 90 minutes, then 0.5978 goals are expected in a time frame of 20 minutes (at least, the Poisson postulates say that things work that way).
Therefore, the probability for that team to score no other goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc to work with the total number of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is just cosmetic: having a loc parameter just replaces k by k-loc).
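To make the scaling concrete, here is a small sketch that uses a hand-written Poisson pmf so that it runs without SciPy. The helper names poisson_pmf and remaining_mu are illustrative only, and the 90-minute match length is the same assumption as in the text (no stoppage time).

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of exactly k events when mu are expected."""
    return mu**k * exp(-mu) / factorial(k)

def remaining_mu(mu_full, minute, match_length=90):
    """Expected goals in the time that is left, scaled linearly."""
    return mu_full * (match_length - minute) / match_length

mu_home = remaining_mu(2.69, 70)   # about 0.5978
mu_away = remaining_mu(1.12, 70)

# The score stays 2-1 only if neither team scores again
p_no_more_goals = poisson_pmf(0, mu_home) * poisson_pmf(0, mu_away)
print(mu_home, mu_away, p_no_more_goals)
```

Under these assumptions the probability that the match ends 2-1 comes out at about 0.43, far above the 2.21% obtained with the unscaled 90-minute expectations.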
tl;dr solution
So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
    pas.append(pa)
Other comments
zip
While at it, and since there is a python tag, some comments on the code
for i, l in zip(range(0, 6), range(0, 6)):
    print(i,l)
produces
0 0
1 1
2 2
3 3
4 4
5 5
So it is quite strange not to use a single variable. Especially if you consider that different ranges would not help here (zip stops as soon as its shortest iterable is exhausted, and I don't see under which circumstances we would need, for example, i to grow from 0 to 5 while l grows from 0 to 10).
So just
for k in range(0, 6):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
    pas.append(pa)
I surmise, especially because of the subject of the next remark, that once upon a time there was a product instead of that zip, before you realized that this was computing the exact same pmf several times.
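As a quick illustration of how zip behaves when the two ranges differ in length (the variable names here are just for the demonstration):

```python
short = range(0, 6)
long = range(0, 11)

# zip stops when the shorter iterable runs out, so the extra values are never seen
pairs = list(zip(short, long))
print(len(pairs))   # 6
print(pairs[-1])    # (5, 5)
```

So both variables always carry the same value, which is why a single loop variable is enough.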
Cross product
That usage of product has probably been then reduced to the task of computing phs[i]×pas[j] for all i,j. That is a good usage of product.
But, since you have 2 arrays, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it.
prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
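A small check that the broadcasting version builds the same table as the product-based one. The probabilities below are made up for the demonstration, not taken from the football example.

```python
from itertools import product
import numpy as np

phs = [0.5, 0.3, 0.2]   # illustrative values only
pas = [0.6, 0.4]

# Original approach: every pair via itertools.product, then reshape
table_product = np.array([i * j for i, j in product(phs, pas)])
table_product.shape = (len(phs), len(pas))

# Broadcasting: a column vector times a row vector gives the full table
table_broadcast = np.array(phs).reshape(-1, 1) * np.array(pas)

print(np.allclose(table_product, table_broadcast))
```

The reshape(-1, 1) turns phs into a column, so multiplying by the one-dimensional pas broadcasts into a len(phs) by len(pas) table.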
Getting arrays directly from Poisson
Which leads to another optimization. If the goal is to transform phs and pas into arrays, so that we can multiply them (one as a row, another as a column) to get the table, why not let numpy build those arrays directly? Like many numpy functions, pmf can take an array-like k rather than a scalar, and then returns an array rather than a scalar.
So
phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
So, altogether
prod_table=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1,1)*poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
Timings
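|---------------+------------|
| Optimisations | Time in μs |
|---------------+------------|
| Without       | 1647       |
| With          | 329        |
|---------------+------------|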
So, it is not just more compact and readable. It is also (almost exactly) 5 times faster.
I would like to find the fastest way to generate ~10^9 poisson random numbers in python/numpy—for instance, say I have a mean Poisson parameter (calculated elsewhere) of shape (1000, 2000), and I need 500 independent samples. This is a bottleneck in my code, taking several minutes to complete. I have tried three methods, but am looking for something faster:
import numpy as np
# example parameters
nsamples = 500
nmeas = 2000
ninputs = 1000
lambdax = np.ones([ninputs, nmeas]) * 20
# numpy, one big array
sample0 = np.random.poisson(lam=lambdax, size=(nsamples, ninputs, nmeas))
# numpy, current version where other code happens in the loop
sample1 = np.zeros([nsamples, ninputs, nmeas])
for i in range(nsamples):
    sample1[i, :, :] = np.random.poisson(lam=lambdax)
# scipy
from scipy.stats import poisson
sample2 = poisson.rvs(lambdax, size=(nsamples, ninputs, nmeas))
Results:
sample0: 1 m 16 s
sample1: 1 m 20 s
sample2: 1 m 50 s
Not shown here, I am also parallelizing the independent samples via multiprocessing, but the calculations are still pretty expensive for such large parameters. Is there a better way?
I have been in your shoes and here are my suggestions:
For large mean values, the Poisson distribution behaves much like a normal distribution (with mean and variance both equal to λ). Check out this post (and probably more if you search).
A runtime of ~1 minute seems reasonable for generating such a large number of random numbers. I don't think you can beat the sample0 method by much through coding alone. Now, depending on what you want to do with the random numbers:
If your issue is rerunning the program multiple times, try saving sample0 to a file and reloading it in the next runs.
If not, I suggest creating fewer random numbers and reusing them. Many of the random numbers in sample0 will be repeated in your sample, depending on your mean value. You might want to create a smaller sample and randomly choose from it. For example, I would choose a random number from sample0 and reuse it, e.g., 100 times (since that number would appear in sample0 over 100 times anyway).
If you provide more information on what you intend to do with the random numbers, we might be able to help more. Otherwise, coding-wise, I am not sure you can do much more.
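On the large-mean behaviour mentioned at the start of this answer: a Poisson distribution with a large mean λ is well approximated by a normal distribution with mean λ and variance λ. The sketch below only illustrates how close the two are at λ = 20 (the value used in the question); whether the approximation is accurate enough for a given application is a separate judgment, and both helper functions are written here for the demonstration.

```python
from math import exp, factorial, pi, sqrt

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

lam = 20
# Largest absolute difference between the two curves over k = 0..60
max_gap = max(abs(poisson_pmf(k, lam) - normal_pdf(k, lam, lam))
              for k in range(0, 61))
print(max_gap)
```

The gap is small compared with the peak probability of roughly 0.09, and it shrinks further as λ grows.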
I'm relatively new to python and wanted to test myself, by tackling the birthday problem. Rather than calculating it mathematically, I wanted to simulate it, to see if I would get the right answer. So I assigned all boolean values in the list sieve[] as False and then randomly pick a value from 0 to 364 and change it to True, if it's already True then it outputs how many times it had to iterate as an answer.
For some reason, every time I run the code, I get a value between 24.5 and 24.8
The expected result for 50% is 23 people, so why is my result 6% higher than it should be? Is there an error in my code?
import random
def howManyPeople():
    sieve = [False] * 365
    count = 1
    while True:
        newBirthday = random.randint(0,364)
        if sieve[newBirthday]:
            return count
        else:
            sieve[newBirthday] = True
            count += 1
def multipleRun():
    global timesToRun
    results = []
    for i in range(timesToRun):
        results.append(howManyPeople())
    finalResultAverage = sum(results)
    return (finalResultAverage / timesToRun)
timesToRun = int(input("How many times would you like to run this code?"))
print("Average of all solutions = " + str(multipleRun()) + " people")
There's no error in your code. You're computing the mean of your sample of howManyPeople return values, when what you're really interested in (and what the birthday paradox tells you about) is the median of the distribution.
That is, you've got a random process where you incrementally add people to a set, then report the total number of people in that set on the first birthday collision. The birthday paradox implies that at least 50% of the time, your set will have 23 or fewer people. That's not the same thing as saying the expected number of people in the set is 23.0 or smaller.
Here's what I see from one million samples of your howManyPeople function.
In [4]: sample = [howManyPeople() for _ in range(1000000)]
In [5]: import numpy as np
In [6]: np.median(sample)
Out[6]: 23.0
In [7]: np.mean(sample)
Out[7]: 24.617082
In [8]: np.mean([x <= 23 for x in sample])
Out[8]: 0.506978
Note that there's a (tiny) amount of luck here: the median of the distribution of howManyPeople return values is 23 (at least according to Wikipedia's definition), but there's a chance that an unusual sample could have different median, purely through randomness. In this particular case, that chance is entirely negligible. And as user2357112 points out in comments, things are a bit messier in the 2-day year example, where any real number between 2.0 and 3.0 (inclusive) is a valid distribution median, and we could reasonably expect a sample median to be either 2 or 3.
Instead of sampling, we can also compute the probabilities of each output of howManyPeople directly: for any positive integer k, the probability that the output is strictly larger than k is the same as the probability that the first k people have distinct birthdays, which is given (in Python syntax) by factorial(365)//factorial(365-k)/365**k, and we can use that to compute the probabilities of individual outputs. Here I'm using the name X for the random variable represented by howManyPeople. Some inefficient code:
from math import factorial
def prob_x_greater_than(k):
    """Probability that the output of howManyPeople is > k."""
    if k <= 0:
        return 1.0
    elif k > 365:
        return 0.0
    else:
        # exact integer division first, so large k doesn't overflow a float
        return factorial(365) // factorial(365 - k) / 365**k
def prob_x_equals(k):
    """Probability that the output of howManyPeople is == k."""
    return prob_x_greater_than(k-1) - prob_x_greater_than(k)
With this, we can get the exact (well, okay, exact up to numerical errors) mean and verify that it roughly matches what we got from the sample:
In [18]: sum(k*prob_x_equals(k) for k in range(1, 366))
Out[18]: 24.616585894598863
And the birthday paradox in this case should tell us that the sum of the probabilities for k <= 23 is greater than 0.5:
In [19]: sum(prob_x_equals(k) for k in range(1, 24))
Out[19]: 0.5072972343239854
What you're seeing is normal. There may be a >50% chance of having a duplicate birthday in a room of 23 random people (ignoring leap years and nonuniform birthday distributions), but that doesn't mean that if you add people to a room one by one, the mean point at which you get a duplicate will be 23.
To get an intuitive feel for this, imagine if years only had two days. In this case, it's clear that there's a 50% chance of having a duplicate birthday in a room with 2 people. However, if you add random people to the room one by one, you're going to need at least two people - 50% chance of stopping at 2 and 50% of 3. The mean stopping point is 2.5, not 2.
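The two-day example can be checked exactly, and the same calculation extends to a 365-day year. This is a sketch under the usual assumption of equally likely birthdays; stopping_distribution is a name chosen here, not something from the question.

```python
from fractions import Fraction

def stopping_distribution(days):
    """Return {k: P(the first duplicate appears with the k-th person)}."""
    dist = {}
    p_all_distinct = Fraction(1)   # P(the first k-1 people are all distinct)
    for k in range(1, days + 2):
        # the k-th person must hit one of the k-1 birthdays already taken
        p_collide = Fraction(k - 1, days)
        dist[k] = p_all_distinct * p_collide
        p_all_distinct *= 1 - p_collide
    return dist

two_day = stopping_distribution(2)
mean_two_day = sum(k * p for k, p in two_day.items())
print(mean_two_day)   # 5/2

year = stopping_distribution(365)
mean_year = float(sum(k * p for k, p in year.items()))
p_at_most_23 = float(sum(p for k, p in year.items() if k <= 23))
print(mean_year, p_at_most_23)
```

For the 365-day year, the mean lands near 24.6 while the probability of stopping at 23 or fewer people is just over one half, which is the mean versus median distinction described above.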
I have implemented a naive merge sorting algorithm in Python. Algorithm and test code is below:
import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque
def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()
def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result
LOOP_COUNT = 100
START_N = 1
END_N = 1000
def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return time.clock() - start
def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()
run_test()
As much as I can see everything is OK and I should get a nice N*logN curve as a result. But the picture differs a bit:
Things I've tried to investigate the issue:
PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious.
I had another implementation (a recursive one using the same merge function), and it behaves in a similar way. The more full test cycles I run, the more "jumps" there are in the curve.
So how can this behaviour be explained and - hopefully - fixed?
UPD: changed lists to collections.deque
UPD2: added the full test code
UPD3: I use Python 2.7.1 on Ubuntu 11.04, on a quad-core 2 GHz notebook. I tried to turn off most other processes: the number of spikes went down, but at least one of them was still there.
You are simply picking up the impact of other processes on your machine.
You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.
Let's say once in a while your OS (or you yourself) start doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes look slow. This is precisely what you see in your chart.
I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (it's important, see explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.
That way, your spikes should only affect each input size only a few times out of 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.
If your spikes are really really long and frequent, you of course might want to increase the number of repetitions beyond the current 100 per input size.
Looking at your spikes, I notice the execution slows down by a factor of exactly 3 during a spike. I'm guessing the OS gives your python process one slot out of three during high load. Whether my guess is correct or not, the approach I recommend should resolve the issue.
EDIT:
I realized that I didn't clarify one point in my proposed solution to your problem.
Should you use the same input in each of your 100 runs for the given input size? Or should you use 100 different (random) inputs?
Since I recommended taking the minimum of the execution times, the inputs should be the same (otherwise you'll be getting incorrect output, as you'll be measuring the best-case algorithm complexity instead of the average complexity!).
But when you take the same inputs, you create some noise in your chart, since some inputs are simply faster than others.
So a better solution is to resolve the system load problem without creating the problem of having only one input per input size (this is obviously pseudocode):
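import random
from collections import defaultdict

# generate_random_input() and get_execution_time() are placeholders
# that you need to supply yourself
seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25
runtimes = defaultdict(lambda: float('inf'))
for r in range(repeats):
    random.seed(seed)  # same seed, so every repeat sees the same inputs
    for i in range(inputs_per_size):
        for n in range(1000):
            data = generate_random_input(size=n)
            execution_time = get_execution_time(data)
            # keep the fastest time seen for this (size, input) pair
            if runtimes[(n, i)] > execution_time:
                runtimes[(n, i)] = execution_time
for n in range(1000):
    # average the per-input minima for each size
    runtimes[n] = sum(runtimes[(n, i)] for i in range(inputs_per_size)) / inputs_per_size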
Now you can use runtimes[n] to build your plot.
Of course, depending on how noisy your system is, you might change (repeats, inputs_per_size) from (4,25) to, say, (10,10), or even (25,4).
I can reproduce the spikes using your code:
You should choose an appropriate timing function (time.time() vs. time.clock(); see from timeit import default_timer), the number of repetitions in a test (how long each test takes), and the number of tests to choose the minimal time from. This gives you better precision and less external influence on the results. Read the note from the timeit.Timer.repeat() docs:
It’s tempting to calculate mean and standard deviation from the result
vector and report these. However, this is not very useful. In a
typical case, the lowest value gives a lower bound for how fast your
machine can run the given code snippet; higher values in the result
vector are typically not caused by variability in Python’s speed, but
by other processes interfering with your timing accuracy. So the min()
of the result is probably the only number you should be interested in.
After that, you should look at the entire vector and apply common
sense rather than statistics.
The timeit module can choose appropriate parameters for you:
$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
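The same "take the minimum" advice can be followed from inside a script with timeit.repeat. The sketch below is only an illustration: fastest_time is a helper name made up here, the built-in sorted stands in for the sort function from the question, and the repeat counts are arbitrary.

```python
import random
import timeit

def fastest_time(func, data, repeats=5, number=10):
    """Best per-call time over several repeats; min() filters out load spikes."""
    timings = timeit.repeat(lambda: func(data), repeat=repeats, number=number)
    return min(timings) / number

random.seed(0)                                   # same inputs on every run
test_data = random.sample(range(100000), 200)
best = fastest_time(sorted, test_data)           # swap in your own sort here
print("%.3g s per call" % best)
```

Calling fastest_time once per input size and plotting the results should give a curve that is much less sensitive to what else the machine is doing.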
Here's the timeit-based performance curve:
The figure shows that sort() behavior is consistent with O(n*log(n)):
|--------------------------+-------------------|
| Fitted polynomial        | Function          |
|--------------------------+-------------------|
| 1.00 log2(N) + 1.25e-015 | N                 |
| 2.00 log2(N) + 5.31e-018 | N*N               |
| 1.19 log2(N) + 1.116     | N*log2(N)         |
| 1.37 log2(N) + 2.232     | N*log2(N)*log2(N) |
|--------------------------+-------------------|
To generate the figure I've used make-figures.py:
$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin
where:
# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0]) # `sort()` from the question
def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0]) # builtin
Input lists are described here (note: the input is sorted so builtin sorted() function shows expected O(N) performance).
I have a small piece of software that calculates the number of factors of each triangle number, to find the first one that has more than X factors (yes, it's a Project Euler problem, number 12, although I haven't solved it yet). While trying various values of X to see what the code does and how long it takes, I noticed something strange (to me at least): up to X=47 the execution time increases in an obviously normal way, but at X=48 it increases far more than normal, and the number of function calls jumps well beyond the previous rate; it "explodes", I would say. Why does it do that?
The code:
def fac(n):
    c=0
    for i in range (1,n+1):
        if n%i==0:
            c=c+1
    return c
n=1
while True:
    summ=0
    for i in range (1,n+1):
        summ=summ+i
    if fac(summ)>X:
        break
    n=n+1
print summ
and when profiling:
when X=45 : 314 function calls in 0.027 CPU seconds
when X=46 : 314 function calls in 0.026 CPU seconds
when X=47 : 314 function calls in 0.026 CPU seconds
when X=48 : 674 function calls in 0.233 CPU seconds
when X=49 : 674 function calls in 0.237 CPU seconds
I assume that if I continued, I would meet other points where the function calls and the time increase suddenly, and that there were points like that earlier too, but the time was so small that it didn't matter much. Why do the function calls suddenly increase? Isn't it supposed to call the function just one more time for the new value?
P.S. I am using cProfile as a profiler, and X in the code here is just for demonstration; I write the value directly in the code. Thank you in advance.
Have you looked at the actual values involved?
The first triangular number with more than 47 factors is T(104) = 5460, which has 48 factors.
But the first triangular number with more than 48 factors is T(224) = 25200, which has 90 factors. So no wonder it takes a lot more work.
If your code runs up to T(n), then it calls range 2n times and fac n times, for a total of 3n function calls. Thus for T(104) it requires 312 function calls, and for T(224) it requires 672 function calls. Presumably there are 2 function calls of overhead somewhere that you're not showing us, which explains the profiling results you get.
Your current strategy is not going to get you to the answer for the Project Euler problem. But I can give some hints.
Do you have to start over again with summ=0 each time you compute a triangular number?
Do you have to loop over all the numbers up to n in order to work out how many divisors it has? Could there be a quicker way? (How many divisors does 2^16 = 65536 have? Do you have to loop over all the numbers from 1 to 65536?)
How many divisors do triangular numbers have? (Look at some small triangular numbers where you can compute the answer.) Can you see any patterns that would help you compute the answer for much bigger triangular numbers?
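One possible reading of the first two hints, written in Python 3 as a sketch rather than a full solution: keep a running triangular number instead of re-summing, and count divisors in pairs up to the square root. The function names are chosen here for illustration, and the third hint is deliberately left unexplored.

```python
from math import isqrt

def count_divisors(n):
    """Count divisors by pairing each i <= sqrt(n) with n // i."""
    count = 0
    for i in range(1, isqrt(n) + 1):
        if n % i == 0:
            count += 1 if i * i == n else 2
    return count

def first_triangular_with_more_than(x):
    """Return (n, T(n)) for the first triangular number with more than x divisors."""
    n, triangle = 1, 1
    while count_divisors(triangle) <= x:
        n += 1
        triangle += n   # T(n) = T(n-1) + n, no need to start from 0 again
    return n, triangle

print(first_triangular_with_more_than(47))
print(first_triangular_with_more_than(48))
```

This should reproduce the values quoted above, T(104) = 5460 and T(224) = 25200, and shows the same jump in n between X=47 and X=48.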
If you check the output, you'll see several spikes (sudden increases) in execution time.
The reason is that the number of loops needed does not go up gradually but abruptly. Print out n after your while True loop and you'll see it.
Note: Project Euler is a math site, don't write brute force algorithms ;)