Algorithm for random range of time in day with weights - python

I would like to produce random data of baby sleep times, but I want the random data to behave similarly (not necessarily identically) to the following graph:
(This is just imaginary data, please don't conclude anything from it, especially not about when your baby should sleep...)
The output that I want to produce is something like:
Baby name Sleep start Sleep end
Noah 2016/03/21 08:38 2016/03/21 09:28
Liam 2016/03/21 12:43 2016/03/21 15:00
Emma 2016/03/21 19:45 2016/03/22 06:03
So I thought I would create a weights table of time of day and weight (the chance that a baby is asleep at that time).
The question is how would I generate from this weights table a random data of a range of time to sleep?
(Consider that if a baby starts to sleep at around 8am, he/she will most likely wake within the next two hours rather than keep sleeping, and almost certainly won't sleep until 7am.)
Is there another way you would build this (without the weights table)?
I prefer to build this in Python (3), but I would appreciate the general algorithm or a lead towards the solution.

Given the weights table data, you could use numpy.random.choice:
import numpy as np

np.random.choice(list_of_times,
                 num_babies,
                 p=list_of_weights_for_those_times)
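For instance, a minimal sketch of how this could look with per-minute weights (the weights below are made up purely for illustration):

import numpy as np

# Hypothetical per-minute weights: one entry for each minute of the day (1440 values).
minutes = np.arange(24 * 60)
weights = np.ones(minutes.size)
weights[1260:] = 10.0              # make late evening (21:00 onwards) much more likely
weights /= weights.sum()           # np.random.choice needs probabilities that sum to 1

# Draw a sleep-start minute for each of three babies.
start_minutes = np.random.choice(minutes, size=3, p=weights)
print([f"{int(m) // 60:02d}:{int(m) % 60:02d}" for m in start_minutes])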
Without using a weights table, you would need to find the function that describes your distribution. Then see the answer to this question.

Let me start by answering the reverse of your question, since I misunderstood it at first; but it leads to the answer too.
Assume that you already have a list of intervals dispersed around the 24 hours. You would like to find the number of intervals that overlap any given minute of the day, which is what you refer to as the weight.
I can think of two approaches. But first you should convert your time intervals into minutes, so the times in your list become:
# Note the 19:45-06:03 has been split into two intervals (1185,1440) and (0,363)
st = sorted(to_parts(sleep_table))
# [(0, 363), (518, 568), (763, 900), (1185, 1440)]
First, a simple solution is to convert each interval into a list of 0s and 1s and sum over all the intervals:
from functools import reduce  # reduce lives in functools in Python 3

eod = 60*24
weights = reduce(lambda c, x: [l + r for l, r in zip(c, [0]*x[0] + [1]*(x[1]-x[0]) + [0]*(eod-x[1]))], st, [0]*eod)
This will give you a list of size 1440, where each entry is the weight for a given minute of the day.
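A plain-loop equivalent of the same idea, which may be easier to read (same st list as above):

eod = 60 * 24
weights = [0] * eod
for start, end in st:          # st is the list of (start, end) minute pairs above
    for minute in range(start, end):
        weights[minute] += 1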
Second, a slightly more complex line-sweep algorithm will give you the same values in O(n log n) time for n segments. All you need to do is take the start and end times of the intervals and sort them, while keeping track of whether each time is a start or an end time:
from itertools import groupby
from operator import itemgetter

def start_end(st):
    for start, end in st:
        yield (start, 1)   # an interval opens
        yield (end, -1)    # an interval closes

events = sorted(start_end(st))

# perform a line sweep to find the changes in the weights
changes = [(t, sum(map(itemgetter(1), group)))
           for t, group in groupby(events, itemgetter(0))]
#compute the running sum of weights
#See question 35605014 for this part of the answer
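For completeness, a small sketch of that running-sum step, using itertools.accumulate on the changes computed above:

from itertools import accumulate

times = [t for t, _ in changes]
deltas = [d for _, d in changes]

# The running sum of the deltas is the weight in effect from each time onwards.
running = list(accumulate(deltas))
weight_steps = list(zip(times, running))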
Now, if you start from the weights themselves, you can easily convert them into a list of start and end times that are not yet coupled into intervals. All you need to do is convert the smooth spline in the post into a step function. Whenever the step function increases in value, you add a sleep start time, and whenever it decreases you add a sleep end time. Finally, you perform a line sweep to match each sleep start time to a sleep end time. There is a bit of wiggle room here, since you can match any start time with any end time. If you want more data points, you can introduce additional sleep start and end times, as long as they occur at the same point in time.
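A rough sketch of that last idea, assuming the step function is given as a list of (minute, weight) pairs like the weight_steps list sketched above; starts and ends are matched first-in-first-out here, but as noted, any matching will do:

def steps_to_intervals(weight_steps):
    """Turn a (minute, weight) step function back into sleep intervals.

    Whenever the weight rises by k, open k intervals; whenever it falls
    by k, close k of the currently open intervals (first-in-first-out).
    """
    open_starts = []
    intervals = []
    prev_weight = 0
    for minute, weight in weight_steps:
        delta = weight - prev_weight
        if delta > 0:
            open_starts.extend([minute] * delta)
        elif delta < 0:
            for _ in range(-delta):
                intervals.append((open_starts.pop(0), minute))
        prev_weight = weight
    return intervals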

Related

Quickly get coverage of each position in array

It is a short problem. There is a list of intervals, for example:
[1,4],[2,5],[3,6]
I want to get the coverage, i.e. the number of intervals that overlap each position, for every position from the smallest to the largest number (1 to 6), like a line sweep from 1 to 6 counting the number of hits at each position. The result should be:
[1,2,3,3,2,1]
Is there a way to find this quickly? I can iterate over all positions and check whether each one lies in each interval, but that is too slow, since in practice there can be millions of intervals. I was thinking of representing each interval as a bit array but still cannot figure it out. If anyone has an idea, please let me know. Thanks!
You could map your set of intervals to a set of number pairs (time; delta):
[1,4],[2,5],[3,6]
becomes
(1;+1), (4;-1), (2;+1), (5;-1), (3;+1), (6;-1)
Then sort the set of pairs in ascending order of their time entries
(1;+1), (2;+1), (3;+1), (4;-1), (5;-1), (6;-1)
Finally, go through this list and increment/decrement the coverage count, initialized at zero:
(1;1), (2;2), (3;3), (4;2), (5;1), (6;0)
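A minimal Python sketch of this approach (note that to reproduce the [1,2,3,3,2,1] from the question, where interval ends are inclusive, the -1 is placed just after each end):

intervals = [(1, 4), (2, 5), (3, 6)]

# +1 where an interval opens, -1 just after it closes (ends are inclusive here).
deltas = sorted([(s, +1) for s, e in intervals] + [(e + 1, -1) for s, e in intervals])

coverage = []
count = 0
i = 0
for pos in range(1, 7):                     # sweep positions 1..6
    while i < len(deltas) and deltas[i][0] <= pos:
        count += deltas[i][1]
        i += 1
    coverage.append(count)

print(coverage)                             # [1, 2, 3, 3, 2, 1]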

Agent arrivals according to a Poisson process

I am trying to model agent arrivals according to a Poisson process. I know from data that on average 230 agents arrive per day (or 9.583 agents/hr, or 0.1597 per minute). In the simulation, I now need to use this information to add agents. One simulation time step equals 5 minutes of real time, and if we calculate from the data, on average 0.7986 agents should be added per time step to achieve an average of 230 per day. But how can I do this? I cannot use 0.7986 per time step, because I need an integer number of agents to add, and if I round 0.7986 up to 1, I overestimate.
It is clear that I cannot add an agent every time step, but I have no clue how to select the time steps in which an agent must be added. If I knew which time steps to select, I could do the rest easily. Does anyone know how to do this in Python? I tried the code below but cannot really understand what it actually does:
import random

for i in range(1, 12):  # 1 simulation time step is equal to 5 min, so this loop covers 1 hour
    time = int(random.expovariate(1/0.7986))

I do not really understand the above code, as it produces quite different numbers each time. Any help, please.
If agent arrivals are a Poisson process, then the time between individual agent arrivals has an exponential distribution. That's what the code you provided generates, but it is only useful if you are using continuous time with discrete-event scheduling. With a time step as the time-advance mechanism, you actually just want to stick with the Poisson distribution, adjusting the rate to match your time-step interval size, which you've already done.
import numpy

last_step = 12 * 24             # to simulate one day, for example (12 five-minute steps per hour)
rate = 230.0 / last_step        # expected number of arrivals per time step

for time_step in range(1, last_step + 1):
    number_of_new_agents = numpy.random.poisson(rate)
    for new_agent_number in range(number_of_new_agents):
        # do whatever you want at this point
        pass
Note that the number_of_new_agents will often be 0, in which case the inner loop will iterate zero times.
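For comparison, a minimal sketch of the continuous-time alternative mentioned above, where exponential inter-arrival gaps are accumulated and then mapped onto 5-minute steps (variable names are illustrative):

import random

arrivals_per_minute = 230.0 / (24 * 60)    # about 0.1597 arrivals per minute
minutes_per_step = 5
day_minutes = 24 * 60

t = random.expovariate(arrivals_per_minute)            # time of the first arrival, in minutes
arrival_steps = []
while t < day_minutes:
    arrival_steps.append(int(t // minutes_per_step))   # the 5-minute step this arrival falls in
    t += random.expovariate(arrivals_per_minute)       # exponential gap until the next arrival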

Predict a user input using current datetime and past history

The project
I'm working on a small project where users can create events (such as Eat, Sleep, Watch a movie, etc.) and record log entries matching these events.
My data model looks like this (the application itself is in Python 3 / Django, but I don't think it matters here):
# a sample event
event = {
    'id': 1,
    'name': 'Eat',
}

# a sample entry
entry = {
    'event_id': event['id'],
    'user_id': 12,
    # record date of the entry
    'date': '2017-03-16T12:56:32.095465+00:00',
    # user-given tags for the entry
    'tags': ['home', 'delivery'],
    # a user notation for the entry, it can be positive or negative
    'score': 2,
    'comment': 'That was a tasty meal',
}
Users can record an arbitrary number of entries for an arbitrary number of events, they can create new events when they need to. The data is stored in a relational database.
Now, I'd like to make data entry easier for users by suggesting relevant events to them when they visit the "Add an entry" form. At the moment, they can select the event corresponding to their entry in a dropdown, but I want to suggest a few relevant events on top of that.
I was thinking that, given a user's history (all the recorded entries), it should be possible to predict likely inputs by identifying patterns in the entries, e.g.:
Eat usually happens every day, around noon and 7:00PM
Sleep usually occurs after 10:00 PM
Watch a movie usually occurs on Friday nights after 8:00 PM
Ideally, I'd like a function, that, given a user ID and a datetime, and using the user history, will return a list of events that are more likely to occur:
def get_events(user_id, datetime, max=3):
    # implementation
    # returns a list of up to max events
    return events
So if I take the previous example (with more human dates), I would get the following results:
>>> get_events(user_id, 'Friday at 9:00 PM')
['Watch a movie', 'Sleep', 'Eat']
>>> get_events(user_id, 'Friday at 9:00 PM', max=2)
['Watch a movie', 'Sleep']
>>> get_events(user_id, 'Monday at 9:00 PM')
['Sleep', 'Eat', 'Watch a movie']
>>> get_events(user_id, 'Monday at noon')
['eat']
Of course, in the real life, I'll pass real datetimes, and I'd like to get an event ID so I can get corresponding data from the database.
My question
(Sorry, if it took some time to explain the whole thing)
My actual question is, what are the actual algorithms / tools / libraries required to achieve this? Is it even possible to do?
My current guess is that I'll need to use some fancy machine learning stuff, using something like scikit-learn and classifiers, feeding it with the user history to train it, then ask the whole thing to do its magic.
I'm not familiar at all with machine learning and I fear I don't have enough mathematical/scientific background to get started on my own. Can you provide me with some reference material to help me understand how to solve this, the algorithms/vocabulary I have to dig into, or some pseudocode?
I think a k-nearest neighbours (kNN) approach would be a good starting point. The idea in this specific case is to look for the k events that happen closest to the given time-of-day and count the ones that occur the most often.
Example
Say you have as input Friday at 9:00 PM. Take the distance of all events in the database to this date and rank them in ascending order. For example, if we take the distance in minutes for all elements in the database, an example ranking could be as follows:
('Eat', 34)
('Sleep', 54)
('Eat', 76)
...
('Watch a movie', 93)
Next you take the first k = 3 of those and compute how often they occur:
('Eat', 2)
('Sleep', 1)
so that the function returns ['Eat', 'Sleep'] (in that order).
Choosing a good k is important. Too small a value will allow accidental outliers (doing something once at a specific moment) to have a large influence on the result. Choosing k too large will allow unrelated events to be included in the count. One way to alleviate this is by using distance-weighted kNN (see below).
Choosing a distance function
As mentioned in the comments, using a simple distance between two timestamps might lose you some information such as day-of-week. We can solve this by making the distance function d(e1, e2) slightly more complex. In this case, we can choose it to be a tradeoff between time-of-day and day-of-week, e.g.
d(e1, e2) = a * |timeOfDay(e1) - timeOfDay(e2)| / 1440 +
            b * |dayOfWeek(e1) - dayOfWeek(e2)| / 7
where we normalised both differences by the maximum difference possible between times in a day (in minutes) and days of the week. a and b are parameters that can be used to give more weight to one of these differences. For example if we choose a = 3 and b = 1 we say that occurring at the same day is three times more important than occurring at the same time.
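A minimal sketch of such a distance function, assuming the entries are Python datetime objects (a and b are the weighting parameters described above):

def d(e1, e2, a=1.0, b=1.0):
    """Distance between two datetimes, mixing time-of-day and day-of-week."""
    tod1 = e1.hour * 60 + e1.minute
    tod2 = e2.hour * 60 + e2.minute
    time_part = abs(tod1 - tod2) / 1440              # normalised time-of-day difference
    day_part = abs(e1.weekday() - e2.weekday()) / 7  # normalised day-of-week difference
    return a * time_part + b * day_part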
Distance weighted kNN
You can increase complexity (and hopefully performance) by not simply selecting the k closest elements, but by assigning each event a weight based on its distance to the given point. Let e be the input example and o be an example from the database. Then we compute o's weight with respect to e as
w_o = 1 / d(e, o)^2
We see that points lose weight quickly as their distance to e increases. In your case, a number of events then has to be selected from the final ranking. This can be done by summing the weights of identical events to compute a final ranking of event types.
Implementation
The nice thing about kNN is that it is very easy to implement. You'll roughly need the following components.
An implementation of the distance function d(e1, e2).
A function that ranks all elements in the database according to this function and a given input example.
def rank(e, db, d):
    """Rank the examples in db with respect to e using
    distance function d.
    """
    return sorted([(o, d(e, o)) for o in db],
                  key=lambda x: x[1])
A function that selects some elements from this ranking.
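For example, a small sketch of that selection step using plain counting over the k nearest entries; event_name_of is a placeholder for however you look up the event belonging to an entry:

from collections import Counter

def select(ranking, k, max_events=3):
    """Count events among the k nearest entries and return the most common ones."""
    nearest = [entry for entry, _ in ranking[:k]]
    # event_name_of is a placeholder for however entries are mapped to their events
    counts = Counter(event_name_of(entry) for entry in nearest)
    return [name for name, _ in counts.most_common(max_events)]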

Calculate time intervals with custom start/end points

I have a few datasets each of which has time intervals as following:
configStartDate configEndDate
2012-06-07 10:38:01.000 2012-06-11 13:35:25.000
2012-07-12 20:00:55.000 2012-07-17 10:17:53.000
2012-07-18 12:44:15.000 2012-07-20 02:15:47.000
2012-07-20 02:15:47.000 2012-10-05 10:35:19.000
2012-10-05 10:35:19.000 2012-11-13 10:44:24.000
I need to write a query function (in R, but I am just figuring out the logic right now; prototyping in Python) which would take two custom start and end dates and sum up the intervals in between.
The issue is that the query dates could start in the middle of, or outside, the time chunks. So, for instance, in the above example, my query could be for the interval 2012-06-09 to 2012-11-11, in which case I'd have to trim the start and end dates of the first and last chunks. But the query could also start in the middle of the second chunk, and so on.
The code to add up chunks is trivial:
diff_days = (pd.to_datetime(df_41.configEndDate) - pd.to_datetime(df_41.configStartDate)).astype('timedelta64[h]') / 24
print(sum(diff_days))
# 126.541666667 days
But now I am looking for the most efficient way to do the custom query starts and ends.
Right now what I am thinking is:
Loop through each configStartDate, configEndDate combination:
If query_start is before a particular chunk's end date, set that chunk as the first to be included in the calculation and set its start date to max(query_start, current_start_date). Exit the loop.
Then do the same, but mirrored, for query_end (replace the chunk's end date with its start date, "before" with "after", and maximum with minimum). Set that chunk as the last to be included in the calculation. Exit the loop.
In R-style pseudocode, it would look like:
ix = which(end_date > start_query)[1]
start_date[ix] = max(start_date[ix], start_query)
chunk[ix] -> first chunk
repeat with end_query, opposite signs
Is there an easier way to implement this? (I am not looking for specific code help, just advice on the logic.) Thanks.
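To make the intent concrete, here is a rough (untested) Python sketch of the logic above, reusing the df_41 frame and the query dates from the example:

import pandas as pd

starts = pd.to_datetime(df_41.configStartDate).tolist()
ends = pd.to_datetime(df_41.configEndDate).tolist()
query_start = pd.Timestamp('2012-06-09')
query_end = pd.Timestamp('2012-11-11')

# first chunk whose end date is after the query start
first = next(i for i, e in enumerate(ends) if e > query_start)
starts[first] = max(starts[first], query_start)

# last chunk whose start date is before the query end
last = max(i for i, s in enumerate(starts) if s < query_end)
ends[last] = min(ends[last], query_end)

total = sum((e - s for s, e in zip(starts[first:last + 1], ends[first:last + 1])),
            pd.Timedelta(0))
print(total.total_seconds() / 86400, "days")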

Dividing Pandas DataFrame rows into similar time-based groups

I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and "Place" (int). A scatter plot of the start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].copy()
starts = starts.sort_values()                    # sort the start times
dt = starts.diff()                               # gap to the previous start
heat = (dt > pd.Timedelta(seconds=60)).cumsum()  # new heat whenever the gap exceeds 60 s
heat = heat.sort_index()                         # restore the original row order
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
# `jenks` is the module from the gist linked above; `full` is the results DataFrame
k = jenks.getJenksBreaks(full['Start Num'].tolist(), 16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15, 15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:
