I am trying to have agents arrive in my model according to a Poisson process. I know from data that on average 230 agents arrive per day (i.e. 9.583 agents/hr, or 0.1597/minute). In the simulation, I now need to use this information to add agents. One simulation time step equals 5 minutes of real time, so from the data, on average 0.7986 agents should be added per time step to reach an average of 230 per day. But how can I do this? I cannot use 0.7986 per time step because I need an integer number of agents to add. If I round 0.7986 up to 1, I overestimate the arrivals.
Clearly I cannot add an agent every time step, but I have no clue how to select the time steps in which an agent must be added. If I knew which time steps to select, I could do the rest easily. Does anyone know how to do this in Python? I tried the code below, but I cannot really understand what it actually does:
import random

for i in range(12):  # 1 simulation time step equals 5 min, so this loop covers 1 hour
    time = int(random.expovariate(1/0.7986))
I do not really understand the code above, as it produces quite different numbers each run. Any help, please?
If agent arrivals follow a Poisson process, then the time between individual arrivals has an exponential distribution. That is what the code you provided generates, but it is only useful if you are using continuous time with discrete event scheduling. With a time step as the time-advance mechanism, you actually just want to stick with the Poisson distribution, adjusting the rate to match your time-step interval size, which you have already done.
import numpy

last_step = 12 * 24  # to simulate one day, for example
rate = 230.0 / last_step
for time_step in range(1, last_step + 1):
    number_of_new_agents = numpy.random.poisson(rate)
    for new_agent_number in range(number_of_new_agents):
        pass  # do whatever you want at this point
Note that the number_of_new_agents will often be 0, in which case the inner loop will iterate zero times.
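As a quick sanity check (my own sketch, not part of the answer above), you can simulate many days and confirm the long-run average comes out near 230 arrivals per day:

import numpy

last_step = 12 * 24
rate = 230.0 / last_step
days = 1000
totals = numpy.random.poisson(rate, size=(days, last_step)).sum(axis=1)
print(totals.mean())  # should be close to 230

And if you ever do switch to continuous time with discrete event scheduling, the expovariate approach from the question would look roughly like this (a sketch: random.expovariate takes the rate, here 230 arrivals per 1440 minutes, and returns exponential inter-arrival gaps in minutes):

import random

arrival_rate = 230.0 / (24 * 60)  # about 0.1597 arrivals per minute
t = 0.0
arrival_times = []
while True:
    t += random.expovariate(arrival_rate)  # draw the gap to the next arrival
    if t >= 24 * 60:                       # stop at the end of one simulated day
        break
    arrival_times.append(t)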
Related
I want to calculate the remaining probabilities for each result in a football game at minute n.
In this case I have expected goals of 2.69 for the home team and 1.12 for the away team at minute 70, with a current score of 2-1.
Code
from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd

xgh = 2.69
xga = 1.12
minute = 70
hg, ag = 2, 1

phs = []
pas = []
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga, k=l, loc=ag)
    pas.append(pa)

prod_table = np.array([(i * j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)
prob_df = pd.DataFrame(prod_table, index=range(0, 6), columns=range(0, 6))
This returns a probability of 2.21% for a 2-1 final result, which is pretty low; I would expect a high probability considering there are only 20 minutes left.
Math considerations
The Poisson distribution gives the probability that an event occurs k times in a given time frame, knowing that, on average, it occurs μ times in that same time frame.
The postulate of the Poisson distribution is that events are totally independent, so how many times the event has already occurred is irrelevant. It also assumes that events are uniformly distributed in time (if I may use this confusing word, since this is not a uniform distribution).
Most of the time, Poisson is used to compute the probability of k events occurring in a time frame T, when we know that μ events occur on average in a time frame τ (the difference from the first sentence being that T and τ are not the same).
But that is the easy part: since events are uniformly distributed, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to run millions of experiments over time frames of length T, then on average there would be μT/τ events in each of them).
So, to compute the probability that the event occurs k times in time frame T, knowing that it occurs μ times in time frame τ, you just have to answer the question "how likely is it that the event occurs k times in a time frame in which it occurs μT/τ times on average?". Which is the question Poisson can answer.
In Python, that answer is poisson.pmf(k, μT/τ).
In your case, you know μ, the number of goals expected in a 90-minute time frame. You know that the time frame left to score in is 20 minutes. If 2.69 goals are expected in a time frame of 90 minutes, then 2.69×20/90 ≈ 0.5978 goals are expected in a time frame of 20 minutes (at least, Poisson postulates that things work that way).
Therefore, the probability for that team to score no other goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc to express the total number of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is just cosmetic: the loc parameter just replaces k by k-loc).
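For instance (a quick illustration of that equivalence, not part of the original answer):

from scipy.stats import poisson

mu = 2.69 * (90 - 70) / 90              # 0.5978 expected goals in the last 20 minutes
print(poisson.pmf(k=0, mu=mu))          # probability of no further goal
print(poisson.pmf(k=2, mu=mu, loc=2))   # same value: loc just shifts k to k - loc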
tl;dr solution
So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
    pas.append(pa)
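As a rough check of the rescaled numbers (my own addition, reusing the variables from the question's code), the probability that the score stays 2-1 is the product of each team scoring zero further goals:

p_no_more_goals = (poisson.pmf(mu=xgh*(90-minute)/90, k=0)
                   * poisson.pmf(mu=xga*(90-minute)/90, k=0))
print(p_no_more_goals)  # roughly 0.43 with xgh=2.69, xga=1.12, minute=70

That is far more plausible than 2.21% for a game with only 20 minutes left.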
Other comments
zip
While we're at it, and since there is a python tag, some comments on the code:
for i, l in zip(range(0, 6), range(0, 6)):
    print(i, l)
produces
0 0
1 1
2 2
3 3
4 4
5 5
So it is quite strange not to use a single variable, especially since different ranges would make no sense here (zip simply stops at the shortest iterable, and it is hard to see under which circumstances we would need, for example, i to grow from 0 to 5 while l grows from 0 to 10).
So just
for k in range(0, 6):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
    pas.append(pa)
I surmise, especially given the subject of the next remark, that once upon a time there was a product instead of that zip, before you realized that it was computing the same exact pmf several times.
Cross product
That usage of product was probably then reduced to the task of computing phs[i]×pas[j] for all i, j. That is a good usage of product.
But since you have two arrays, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it:
prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
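Equivalently (just a different spelling of the same computation, my addition), numpy has a named function for exactly this outer product:

prod_table = np.outer(phs, pas)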
Getting arrays directly from Poisson
Which leads to another optimization. If the goal is to turn phs and pas into arrays, so that we can multiply them (one as a row, the other as a column) to get the table, why not let numpy build those arrays directly? Like many numpy-aware functions, pmf can take a list for k rather than a scalar, and then returns an array rather than a scalar.
So
phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
So, altogether
prod_table = (poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1, 1)
              * poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag))
Timings
Optimisations   Time in μs
Without         1647 μs
With             329 μs
So, it is not just more compact and readable; it is also (almost exactly) 5 times faster.
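For reference, a minimal harness along these lines reproduces the comparison (my own sketch; the exact numbers will of course vary by machine):

import timeit

setup = ("from scipy.stats import poisson; import numpy as np; "
         "xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1")

loop_version = '''
phs, pas = [], []
for k in range(6):
    phs.append(poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg))
    pas.append(poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag))
prod_table = np.array(phs).reshape(-1, 1) * np.array(pas)
'''

vector_version = '''
prod_table = (poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1, 1)
              * poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag))
'''

print(timeit.timeit(loop_version, setup=setup, number=1000))    # total seconds, loop version
print(timeit.timeit(vector_version, setup=setup, number=1000))  # total seconds, vectorized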
So, I'm a beginner in Python (coding in general, really), and I've tried to make this little program which generates a random number of rods in 305 attempts:
import random

rods = 0

def blazerods():
    global rods
    seed = random.randint(0, 100000000000)
    random.seed(seed)
    rods = 0
    for i in range(0, 305):
        rnd = random.random()
        if rnd < 0.50:
            rods += 1
    print(rods)
    return rods

while 1 == 1:
    blazerods()
    if rods >= 211:
        break
The goal is to get 211 or more rods. However, I ran the program for 30 minutes without results.
My questions are: Is it even possible to get 211 or higher with just this code I included?
Can I make it more likely that rods ends up at 211 or more (still a very unlikely result, of course) without changing the chance (50%)?
Is random.seed(seed) even useful?
The probability distribution of rods is Binomial(305, 0.5); that is, the probability of getting exactly n rods is (305 choose n) * 0.5^305.
To get the probability of getting at least 211, you need to sum these terms from 211 to 305. Wolfram Alpha gives that as 8.8e-12.
So... it is really, really unlikely and you will have to wait a long time.
If your loop runs 1000 times a second, you will expect to have enough rods about once every 4 years.
If I remember correctly, Matt Parker from the Youtube channel Stand-up Maths has something to say about this particular case in his video "How lucky is too lucky".
As pointed out by Jens, this is easy to calculate via the Binomial distribution. The SciPy stats module allows you to calculate this by doing:
from scipy import stats
# i.e. 305 draws with equal probability
d = stats.binom(305, 0.5)
# the probability of seeing something greater than this value
p = d.sf(210)
which should give you the same value as Jens got: ~8.8e-12.
Next we can use the datetime module to convert this number into the expected time you have to wait:
from datetime import timedelta
time_per_try = timedelta(seconds=1/1000)
print(time_per_try / p)
which should give you ~1300 days, or 3.6 years. Technically, this is the expected (mean) waiting time, so it could appear much sooner or later.
You can calculate reasonable values of when this would happen, using the negative binomial distribution. In Python, this looks like:
for q in stats.nbinom(1, p).ppf([0.025, 0.975]):
    print(time_per_try * q)
where the 0.025 and 0.975 values give you the 95% confidence interval you hear scientists talking about.
It tells you that if you had 20 computers running your algorithm in parallel, each doing 1000 tests per second, you could expect the first one to finish in around a month while the slowest one would likely be going on for more than 10 years.
I am interested in learning whether any code or package has been published that can help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0,1,2,3,4,5), each with their own unique probability.
I would like to estimate the probability that the total value, after all 30 events have been simulated, is above X (e.g. 24).
The issue I have is that I can't, in a given event where the value is 3, just multiply the probability of value 3 by 3 and add it to the previously obtained values. Instead I need to simulate every single variation that is possible.
Is there any relatively simple solution to solve this issue?
First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0,1,2,3,4,5], and each outcome has a probability of occurring given by the list (for example) p = [.1,.2,.2,.3,.1,.1], you can approximate the probability that the sum of all 30 events is greater than X with:
import numpy as np

X = 80
np.mean([sum(np.random.choice(a=[0, 1, 2, 3, 4, 5], size=30, p=[.1, .2, .2, .3, .1, .1])) > X
         for i in range(10000)])
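If you want an exact answer rather than a Monte Carlo estimate, the distribution of the sum of 30 independent events can be built by repeatedly convolving the single-event distribution with itself (a sketch of mine, using the same example probabilities):

import numpy as np

p = np.array([.1, .2, .2, .3, .1, .1])  # P(single event = 0), ..., P(single event = 5)
dist = np.array([1.0])                  # distribution of the running total: P(0) = 1
for _ in range(30):
    dist = np.convolve(dist, p)         # fold in one more event

X = 80
print(dist[X + 1:].sum())               # exact P(total > X); index i holds P(total = i)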
I would like to produce random data of baby sleep times, but I want the random data to behave similarly (not necessarily identically) to the following graph:
(This is just imaginary data; please don't conclude anything from it, especially not about when your baby should sleep...)
The output that I want to produce is something like:
Baby name   Sleep start        Sleep end
Noah        2016/03/21 08:38   2016/03/21 09:28
Liam        2016/03/21 12:43   2016/03/21 15:00
Emma        2016/03/21 19:45   2016/03/22 06:03
So I thought I would create a weights table of time of day and weight (for the chance that a baby will be asleep).
The question is: how would I generate, from this weights table, random sleep time ranges?
(Think about it: if a baby starts to sleep at around 8am, most likely he/she will wake within the next two hours rather than continue sleeping, and almost certainly won't sleep till 7am.)
Is there another way you would build this (without the weights table)?
I prefer to build this in Python (3), but I would appreciate the general algorithm or a pointer toward the solution.
Given the weights table data, you could use numpy.random.choice:
np.random.choice(list_of_times,
                 num_babies,
                 p=list_of_weights_for_those_times)
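For example, here is a minimal sketch of how those arguments could be filled in (the per-minute weights are made up for illustration):

import numpy as np

minutes = np.arange(24 * 60)        # one candidate start time per minute of the day
weights = np.ones(24 * 60)
weights[22 * 60:] = 5.0             # e.g. make sleep starts more likely after 22:00
p = weights / weights.sum()         # choice() needs probabilities that sum to 1

num_babies = 3
sleep_start_minutes = np.random.choice(minutes, num_babies, p=p)
print(sleep_start_minutes)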
Without using a weights table, you would need to find the function that describes your distribution. Then see the answer to this question.
Let me start by answering the reverse of your question, since I misunderstood it at first; but it gave me the answer too.
Assume that you already have a list of intervals dispersed around the 24 hours. You would like to find the number of intervals that overlap any given minute of the day, which you refer to as the weight.
I can think of two approaches, but first you should convert your time intervals into minutes, so the times in your list become:
# Note the 19:45-06:03 interval has been split into two intervals, (1185, 1440) and (0, 363)
st = sorted(list(to_parts(sleep_table)))
>>> [(0, 363), (518, 568), (763, 900), (1185, 1440)]
First, a simple solution is to convert all intervals into runs of 1s and sum over all the intervals:
from functools import reduce  # needed in Python 3

eod = 60 * 24
weights = reduce(lambda c, x: [l + r for l, r in zip(c, [0]*x[0] + [1]*(x[1]-x[0]) + [0]*(eod-x[1]))],
                 st, [0]*eod)
This will give you a list of size 1440, where each entry is the weight for a given minute of the day.
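If the reduce one-liner is hard to follow, this plain loop computes exactly the same list:

weights = [0] * eod
for start, end in st:
    for minute in range(start, end):
        weights[minute] += 1  # one more interval covers this minute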
Second is a slightly more complex line sweep algorithm, which will give you the same values in O(n log n) time for n segments. All you need to do is take the start and end times of the intervals and sort them, while keeping track of whether each time is a start or an end time:
from itertools import groupby
from operator import itemgetter

def start_end(st):
    for a, b in st:
        yield (a, 1)    # an interval opens: weight goes up by 1
        yield (b, -1)   # an interval closes: weight goes down by 1

events = sorted(start_end(st))

# perform a line sweep to find changes in the weights
changes = [(t, sum(d for _, d in grp))
           for t, grp in groupby(events, key=itemgetter(0))]

# compute the running sum of weights
# (see question 35605014 for this part of the answer)
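One standard way to do that running sum (my addition, not from the linked answer) is itertools.accumulate:

from itertools import accumulate

times, deltas = zip(*changes)
weights_from = list(zip(times, accumulate(deltas)))  # (minute, weight from that minute on)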
Now, if you start from the weights themselves, you can easily convert them into a list of start and end times, not yet coupled into intervals. All you need to do is convert the smooth spline in the post into a step function. Whenever the step function increases in value, you add a sleep start time, and whenever it decreases you add a sleep end time. Finally, you perform a line sweep to match each sleep start time with a sleep end time. There is a bit of wiggle room here, as you can match any start time with any end time. If you want more data points, you can introduce additional sleep start and end times, as long as they are at the same point in time.
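Here is a compact sketch of that idea (my own; it assumes integer weights sampled per minute, and pairs each end with the most recently opened start, which is one of the valid matchings mentioned above):

def weights_to_intervals(weights):
    open_starts, intervals = [], []
    prev = 0
    for minute, w in enumerate(weights):
        if w > prev:                      # step up: open (w - prev) new sleep intervals
            open_starts.extend([minute] * (w - prev))
        elif w < prev:                    # step down: close (prev - w) intervals
            for _ in range(prev - w):
                intervals.append((open_starts.pop(), minute))
        prev = w
    intervals.extend((s, len(weights)) for s in open_starts)  # close any still open
    return intervals

print(weights_to_intervals([0, 1, 2, 2, 1, 0]))  # [(2, 4), (1, 5)]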
I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and Place (int). A scatter plot of the start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].copy()
starts.sort()
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()
which correctly picks up the 16 (apparent) groups, coloured by heat number in the original plot.
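With the heat labels in hand, splitting up the analysis is then straightforward (a sketch of mine, using the column names from the question):

df["Heat"] = heat
for heat_number, runners in df.groupby("Heat"):
    print(heat_number, len(runners), runners["Net Time"].mean())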
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats, so you can select the number of classes a priori to be 16.
k = jenks.getJenksBreaks(full['Start Num'].tolist(),16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
[plt.axvline(x) for x in k]
From your sample data, we can see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins: