Poisson distribution considering time left - python

I want to calculate the remaining probabilities for each result in a football game at minute n.
In this case I have expected goals of 2.69 for the home team and 1.12 for the away team, at minute 70, with a current result of 2-1.
Code
from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd
xgh = 2.69
xga = 1.12
minute = 70
hg, ag = 2,1
phs=[]
pas=[]
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga, k=l, loc=ag)
    pas.append(pa)
prod_table = np.array([(i*j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)
prob_df = pd.DataFrame(prod_table, index=range(0,6), columns=range(0, 6))
This returns a probability of 2.21% for a 2-1 final result, which is pretty low; I would expect a high probability considering there are only 20 minutes left.

Math considerations
The Poisson distribution gives the probability that an event occurs k times in a given time frame, knowing that, on average, it occurs μ times in this same time frame.
The postulate of the Poisson distribution is that events are totally independent, so how many times the event has already occurred is meaningless. And that they are uniformly distributed (if I may use this confusing word, since this is not a uniform distribution).
Most of the time, Poisson's usage is to compute the probability that k events occur in a time frame T, when we know that μ events occur on average in a time frame τ (the difference with the first sentence being that T and τ are not the same).
But that is the easy part: since events are uniformly distributed, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to experiment with millions of time frames T, then on average there should be μT/τ events in each of them).
So, to compute the probability that the event occurs k times in a time frame T, knowing that it occurs μ times in a time frame τ, you just have to answer the question "what is the probability that the event occurs k times in a time frame in which it occurs μT/τ times on average?". Which is the question Poisson can answer.
In python, that answer is poisson.pmf(k, μT/τ).
In your case, you know μ, the number of goals expected in a 90-minute time frame. You know that the time frame left to score is 20 minutes. If 2.69 goals are expected in a time frame of 90 minutes, then 0.5978 goals are expected in a time frame of 20 minutes (at least, Poisson's postulates say that things work that way).
Therefore, the probability for that team to score no other goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc, to have the total amount of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is just cosmetic: the loc parameter just replaces k by k-loc).
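As a quick check that loc is indeed purely cosmetic, both calls below return the same value: the probability (≈ 0.55) that the home team scores no further goal in the last 20 minutes.

from scipy.stats import poisson

mu = 2.69 * 20 / 90  # ≈ 0.5978 home goals expected in the remaining 20 minutes

print(poisson.pmf(k=0, mu=mu))         # P(no further home goal)
print(poisson.pmf(k=2, mu=mu, loc=2))  # same value: loc shifts k, so the pmf is evaluated at k - loc = 0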
tl;dr solution
So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
    pas.append(pa)
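Putting it all together, here is a minimal end-to-end version of your script with the scaled means (with these numbers, the 2-1 cell, i.e. no further goals, comes out around 0.43 instead of 0.0221):

from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1

phs, pas = [], []
for k in range(6):
    phs.append(poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg))
    pas.append(poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag))

prod_table = np.array([i*j for i, j in product(phs, pas)]).reshape(6, 6)
prob_df = pd.DataFrame(prod_table, index=range(6), columns=range(6))
print(prob_df.loc[2, 1])  # probability the score stays 2-1, ≈ 0.43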
Other comments
zip
While we're at it, and since there is a python tag, some comments on the code.
for i, l in zip(range(0, 6), range(0, 6)):
    print(i, l)
produces
0 0
1 1
2 2
3 3
4 4
5 5
So it is quite strange not to use a single variable, especially since there is no way you could meaningfully use different ranges here (zip stops at the shortest iterable anyway, and we don't see under which circumstances we would need, for example, i to grow from 0 to 5 while l grows from 0 to 10).
So just
for k in range(0, 6):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
    pas.append(pa)
I surmise, especially given the object of the next remark, that once upon a time there was a product instead of that zip, before you realized that this was computing the exact same pmf several times.
Cross product
That usage of product has then probably been reduced to the task of computing phs[i]×pas[j] for all i, j. That is a good usage of product.
But since you have two arrays, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it.
prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
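Incidentally, this broadcasting trick computes exactly the outer product of the two vectors, so np.outer does the same job if you find it more explicit:

import numpy as np

phs = np.array([0.1, 0.3, 0.6])  # illustrative values
pas = np.array([0.2, 0.8])
assert np.allclose(phs.reshape(-1, 1) * pas, np.outer(phs, pas))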
Getting arrays directly from Poisson
Which leads to another optimization. If the goal is to turn phs and pas into arrays so that we can multiply them (one as a row, the other as a column) to get the table, why not let numpy build those arrays directly? Like many numpy functions, pmf accepts a list for k rather than a scalar, and then returns an array rather than a scalar.
So
phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
So, altogether
prod_table = (poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1, 1)
              * poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag))
Timings
Optimisation    Time in μs
Without         1647
With            329
So, it is not just more compact and readable; it is also (almost exactly) 5 times faster.
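If you want to reproduce such a comparison yourself, here is a sketch using timeit (the function names are mine and the exact figures will vary by machine):

from timeit import timeit
from itertools import product
from scipy.stats import poisson
import numpy as np

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1

def loop_version():
    phs, pas = [], []
    for k in range(6):
        phs.append(poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg))
        pas.append(poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag))
    return np.array([i*j for i, j in product(phs, pas)]).reshape(6, 6)

def vector_version():
    return (poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1, 1)
            * poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag))

print(timeit(loop_version, number=100))    # loop/zip/product version
print(timeit(vector_version, number=100))  # vectorized version, ~5x faster here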

Related

Is this just very unlikely? Or is it impossible?

So, I'm a beginner in python (coding in general, really), and I've tried to make this little program which generates a random number of rods in 305 attempts:
import random
rods = 0
def blazerods():
    global rods
    seed = random.randint(0, 100000000000)
    random.seed(seed)
    i = 0
    rods = 0
    for i in range(0, 305):
        rnd = random.random()
        if rnd < 0.50:
            rods += 1
    print(rods)
    return rods

while 1 == 1:
    blazerods()
    if rods >= 211:
        break
The goal is to get 211 or more rods. However, I ran the program for 30 minutes without results.
My questions are: Is it even possible to get 211 or higher with just the code I included?
Can I make it more likely for rods to exceed 211 (still a very unlikely result, of course) without changing the 50% chance?
Is random.seed(seed) even useful?
The probability distribution of rods is Binomial(305, 0.5); that is, the probability of getting exactly n rods is (305 choose n) * 0.5^305.
To get the probability of getting at least 211, you need to sum these terms from 211 to 305. Wolfram Alpha gives that as ~8.8e-12.
So... it is really, really unlikely, and you will have to wait a long time.
If your loop runs 1000 times a second, you can expect to have enough rods about once every 4 years.
If I remember correctly, Matt Parker from the YouTube channel Stand-up Maths has something to say about this particular case in his video "How lucky is too lucky".
As pointed out by Jens, this is easy to calculate via the Binomial distribution. The SciPy stats module allows you to calculate this by doing:
from scipy import stats
# i.e. 305 draws with equal probability
d = stats.binom(305, 0.5)
# the probability of seeing something greater than this value
p = d.sf(210)
which should give you the same value as Jens got: ~8.8e-12.
Next we can use the datetime module to convert this number into the expected time you have to wait:
from datetime import timedelta
time_per_try = timedelta(seconds=1/1000)
print(time_per_try / p)
which should give you ~1300 days, or about 3.6 years. Technically, this is the expected (mean) waiting time rather than the median: there is roughly a 63% chance of seeing it within that time, and it could appear much sooner or later.
You can calculate reasonable values of when this would happen, using the negative binomial distribution. In Python, this looks like:
for q in stats.nbinom(1, p).ppf([0.025, 0.975]):
    print(time_per_try * q)
where the 0.025 and 0.975 values give you the 95% confidence interval you hear scientists talking about.
It tells you that if you had 20 computers running your algorithm in parallel, each doing 1000 tests per second, you could expect the first one to finish in around a month while the slowest one would likely be going on for more than 10 years.

Fastest way to generate ~10^9 poisson random numbers in python/numpy

I would like to find the fastest way to generate ~10^9 Poisson random numbers in python/numpy. For instance, say I have a mean Poisson parameter (calculated elsewhere) of shape (1000, 2000), and I need 500 independent samples. This is a bottleneck in my code, taking several minutes to complete. I have tried three methods, but am looking for something faster:
import numpy as np
# example parameters
nsamples = 500
nmeas = 2000
ninputs = 1000
lambdax = np.ones([ninputs, nmeas]) * 20
# numpy, one big array
sample0 = np.random.poisson(lam=lambdax, size=(nsamples, ninputs, nmeas))
# numpy, current version where other code happens in the loop
sample1 = np.zeros([nsamples, ninputs, nmeas])
for i in range(nsamples):
    sample1[i, :, :] = np.random.poisson(lam=lambdax)
# scipy
from scipy.stats import poisson
sample2 = poisson.rvs(lambdax, size=(nsamples, ninputs, nmeas))
Results:
sample0: 1 m 16 s
sample1: 1 m 20 s
sample2: 1 m 50 s
Not shown here, I am also parallelizing the independent samples via multiprocessing, but the calculations are still pretty expensive for such large parameters. Is there a better way?
I have been in your shoes and here are my suggestions:
For large mean values, the Poisson distribution is well approximated by a normal (Gaussian) distribution; check out this post (and probably more if you search), and see the sketch below.
A runtime of ~1 minute seems reasonable for generating such a large number of random numbers; I don't think you can beat the sample0 method by much through coding alone. Now, depending on what you want to do with the random numbers:
if your issue is rerunning the program multiple times, try saving sample0 to a file and reloading it in the next runs.
if not, I suggest creating a smaller number of random values and reusing them. Many of the numbers in sample0 will be repeated anyway, depending on your mean value. You might want to create a smaller sample and randomly choose from it; for example, I would choose a number from sample0 and reuse it, say, 100 times (since that number would appear in sample0 over 100 times anyway).
If you provide more information on what you intend to do with the random numbers, we might be able to help more. Otherwise, coding-wise, I am not sure you can do much more.
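To illustrate the first suggestion, here is a minimal sketch of the normal-approximation route (my own example, not the asker's code; it assumes λ = 20 is large enough that rounding a normal sample is acceptable for your application):

import numpy as np

rng = np.random.default_rng()

# Smaller nsamples than the question's 500 so the sketch fits comfortably in memory.
nsamples, ninputs, nmeas = 5, 1000, 2000
lambdax = np.ones((ninputs, nmeas)) * 20

# For large lambda, Poisson(lambda) is close to Normal(lambda, sqrt(lambda)),
# and drawing normals is typically much cheaper than exact Poisson sampling.
approx = rng.normal(loc=lambdax, scale=np.sqrt(lambdax),
                    size=(nsamples, ninputs, nmeas))
approx = np.rint(np.clip(approx, 0, None)).astype(np.int64)  # round to non-negative integers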

How to Use Numpy.FFT on A Pulsed Signal

I am just starting to learn numpy.fft, so apologies in advance.
I have an array of 1000 elements of 1s and 0s, representing 1000 ms of pulsed input made up of trues and falses. I wanted to perform rfft on this array. For a simple example, I created an array that has a 1 on every 3rd element and 0 otherwise:
arr3hz = []
freq = 3
for j in range(0, 1000):
    if freq != 0 and (((j + 1) % freq) == 0):
        arr3hz.append(1)
    else:
        arr3hz.append(0)
I was expecting rfft to somehow give me 3 Hz. I used this code:
import numpy as np

n = len(arr3hz)
d = 1 / 1000  # sample spacing in seconds (1 ms)
hs = np.fft.rfft(arr3hz)
fs = np.fft.rfftfreq(n, d)
amps = np.absolute(hs)
fw = open("fft_output.txt", "w")  # output file; the name here is illustrative
for j in range(0, len(fs)):
    fw.write("Freq: %d Amp: %f\n" % (fs[j], amps[j]))
In the written file, I am just seeing random-looking frequency elements with random amplitudes, which I was not able to make sense of. What is wrong with my use of np.fft.rfft? I was also not sure what to use for n and d for an array like this.
There are a few things going on here.
The first block gives a pulse every 3 ms, i.e. a frequency of 333.33 Hz, not 3 Hz.
The mean is not zero, so there will also be a zero frequency component.
1000 ms is not divisible by 3 ms. The discrete Fourier transform assumes that the entire signal repeats with period equal to the window length (1000 ms). This means that you have 332 intervals of 3 ms and 1 interval of 4 ms. As it is not periodic at 333 Hz, there will be a spread of frequencies.
You were using the correct values for n and d.
In general, I find it more helpful to plot the output than to print out the values.
import matplotlib.pyplot as plt
plt.plot(fs, amps)
[The original answer ends with the resulting plot: a strong zero-frequency component plus a peak near 333 Hz.]
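As an illustration of the points above, here is a self-contained sketch (mine, not from the original answer) that removes the mean before transforming, so the peak near 333 Hz stands out:

import numpy as np
import matplotlib.pyplot as plt

# Rebuild the example signal: a 1 every 3 ms over 1000 ms.
sig = np.array([1.0 if (j + 1) % 3 == 0 else 0.0 for j in range(1000)])
sig -= sig.mean()  # remove the DC (zero-frequency) component

hs = np.fft.rfft(sig)
fs = np.fft.rfftfreq(len(sig), d=1/1000)  # bin frequencies in Hz

plt.plot(fs, np.abs(hs))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.show()  # dominant peak near 333 Hz, with some spread due to the 1000/3 mismatch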

Generate a specific number of permutations

I have browsed SO extensively and I have found many questions about generating all possible permutations, but none regarding generating a specific number of permutations.
I developed, thanks to many SO questions, a decent permutation test routine. However, I have to repeat it many times, and it is taking too long.
My code:
def exact_mc_perm_test(ys, nmc, boolean_selection):
    # ys: all time series values
    # nmc: number of shuffles
    # boolean_selection: mask selecting the subsample
    # sample difference in mean
    mean_ys = np.mean(ys)
    diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
    k = 0
    for j in np.arange(nmc):
        # in-place shuffling
        np.random.shuffle(ys)
        # difference now between the full-series mean and the shuffled subsample values
        diff_shuffled = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
        k += diff < diff_shuffled
    return k / nmc
I took this SO answer and modified it for my specific test.
I have to run it over a 3D array stored in an xarray dataset. The dataset has (lon, lat, time) coordinates, and I need to run the test for each (lon, lat) position (along the time dimension).
I run it using itertools.chain:
for ii in chain.from_iterable(zip(*dataset.variable())):
    iis = ii[selected_position].values
    ind_x = dataset.lon == ii.lon
    ind_y = dataset.lat == ii.lat
    dataset.perm_test[ind_y, ind_x] = exact_mc_perm_test(iis, ii.values, 1000., selected_position)
Ideally I want to run a permutation test with 20000 permutations. The two loops (over each (lon, lat) position and over the 20000 shuffles) add up.
I am looking to speed up the permutation test code.
Therefore I thought about generating a 2D array of shape (len(ys), 20000), essentially holding 20000 shuffled copies of ys, and then computing the 20000 differences (diff in the code) at once. (Or finding a trade-off between memory usage and looping, say 5 loops of 4000 shuffles at a time.)
I could not figure out or find a way to do this.
The permutations function from itertools generates all possible permutations, which in my case are too many to handle.
I have looked at the random library but couldn't find anything that fits my needs. Any suggestions?
Take a look at compress() and permutations() from itertools:
from itertools import compress, permutations

for perm in compress(permutations(iterable, r=length), boolean_selection):
    print(perm)
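Note that compress()/permutations() does not actually give you the (len(ys), 20000) array of shuffles described in the question. A minimal sketch of that vectorized idea (mine, assuming NumPy >= 1.20 for Generator.permuted):

import numpy as np

rng = np.random.default_rng()

def vectorized_perm_test(ys, nmc, boolean_selection):
    mean_ys = ys.mean()
    diff = np.abs(ys[boolean_selection].mean() - mean_ys)
    # Tile ys into an (nmc, len(ys)) array and shuffle each row independently.
    shuffled = rng.permuted(np.tile(ys, (nmc, 1)), axis=1)
    diff_shuffled = np.abs(shuffled[:, boolean_selection].mean(axis=1) - mean_ys)
    return np.mean(diff < diff_shuffled)

For nmc = 20000 this materializes a (20000, len(ys)) array, so the memory/looping trade-off the question mentions (e.g. 5 blocks of 4000 shuffles) may still be needed for long series.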

Python: Solving a complex scenario analysis

I am interested to learn whether some code or package has been published that can help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0, 1, 2, 3, 4, 5), each with its own unique probability.
I would like to estimate the probability that the total of the values, after all the events have been simulated, is above X (e.g. 24).
The issue I have is that I can't, for a given event where the value is 3, simply multiply the probability of value 3 by 3 and add it to the previously obtained values. Instead I need to simulate every single variation that is possible.
Is there any relatively simple way to solve this issue?
First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0, 1, 2, 3, 4, 5], and each outcome has a probability of occurring given by the list (for example) p = [.1, .2, .2, .3, .1, .1], you can approximate the probability that the sum of all 30 events is greater than X with:
import numpy as np

X = 80
p = [.1, .2, .2, .3, .1, .1]
np.mean([sum(np.random.choice(a=[0, 1, 2, 3, 4, 5], size=30, p=p)) > X
         for i in range(10000)])
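If an exact answer is preferred over simulation, the distribution of the sum can also be computed directly by convolving the single-event pmf with itself 30 times (a sketch of mine, not part of the original answer):

import numpy as np

p = np.array([.1, .2, .2, .3, .1, .1])  # P(value = 0..5) for a single event

# Exact pmf of the sum of 30 independent events via repeated convolution.
pmf = np.array([1.0])
for _ in range(30):
    pmf = np.convolve(pmf, p)

X = 80
print(pmf[X + 1:].sum())  # exact P(sum > X)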
