I've been studying the OR-Tools Employee Scheduling example and would like to change it so that the model allows employees to be assigned to multiple shifts per day, while preferring solutions where an employee is assigned consecutive shifts within a given day. Allowing multiple shifts seems straightforward, but how can I define an objective function that prefers consecutive shifts? I don't want consecutiveness to be a hard constraint on the solution.
First, ignore the nurse rostering example; look at the shift_scheduling_sat.py example instead.
You can try having more types of shifts.
If you really want flexible shifts, define a maximum number of continuous shifts, each with start and end.
Order them: start1 <= end1 < start2 <= end2 ....
Force each empty shift to be midnight -> midnight, then start from there. But I would recommend sticking with more sets of fixed shifts.
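If you keep fixed shifts, one way to express the soft preference is to reward adjacent worked-shift pairs in the objective rather than constraining them. Here is a minimal CP-SAT sketch for a single employee and day (the variable names are illustrative, not taken from shift_scheduling_sat.py):
from ortools.sat.python import cp_model

model = cp_model.CpModel()
num_shifts = 4  # fixed shifts within one day, e.g. morning .. night

# works[s] is true if this employee works shift s on this day
# (in the real model these would be the existing assignment variables).
works = [model.NewBoolVar(f'works_{s}') for s in range(num_shifts)]

# pairs[s] may only be true when shifts s and s+1 are both worked.
pairs = []
for s in range(num_shifts - 1):
    p = model.NewBoolVar(f'consecutive_{s}')
    model.Add(p <= works[s])
    model.Add(p <= works[s + 1])
    pairs.append(p)

# Reward consecutive pairs instead of forcing them: because we maximize,
# the solver sets each pair variable to 1 whenever both shifts are worked.
model.Maximize(sum(pairs))

solver = cp_model.CpSolver()
solver.Solve(model)
In the full model this term would be added (with a weight) to the existing objective, so consecutiveness stays a preference rather than a requirement.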
# total payments = the sum of monthly payments
# object-level method for calculation in Loan class
def totalPayments(self):
    # the monthly payment might be different depending on the period
    t = 0      # initialize the period
    m_sum = 0  # initialize the sum
    while t < self._term:                # run until we reach the total term
        m_sum += self.monthlyPayment(t)  # sum up each monthly payment
        t += 1                           # go to next period
    return m_sum
The monthly payment might be different depending on the period, so instead of simply multiplying it by the term, I chose to sum up each payment individually. Is there an easier way of doing this?
I thought of doing this at first:
sum(payment for payment in self.monthlyPayment(t) if term <= t)
But t is not initialized and won't be incremented to calculate each payment. So I was wondering if there is any easier approach that could possibly achieve the above functionality in a single line or so?
Your variable t increments by 1 each time, so why don't you use a range object?
for t in range(0, self._term):  # you can omit the 0
    ...
So, if you want to maintain your comprehension, the best way should be this:
sum(self.monthlyPayment(t) for t in range(self._term))
You're close, but you need to iterate over the values of t here, and range lets you bake in the end condition:
sum(self.monthlyPayment(t) for t in range(self._term))
or, if you like using map (slightly less verbose since you've already got a method doing what you want, though less familiar to some, and perhaps trivially faster by avoiding bytecode execution during the loop):
sum(map(self.monthlyPayment, range(self._term)))
I think the proper statement would be
sum(self.monthlyPayment(t) for t in range(self._term))
self.monthlyPayment(t) doesn't return a sequence that you can iterate over. You need to loop over the range of arguments to this function and call it for each.
sum(self.monthlyPayment(t) for t in range(self._term))
That should do it.
m_sum = sum(self.monthlyPayment(t) for t in range(self._term))
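As a self-contained check, here is a toy Loan class with a flat payment, just to show the one-liner in context (in the real class monthlyPayment presumably varies with the period t):
class Loan:
    def __init__(self, term, payment):
        self._term = term
        self._payment = payment

    def monthlyPayment(self, t):
        # stand-in: a flat payment regardless of the period
        return self._payment

    def totalPayments(self):
        return sum(self.monthlyPayment(t) for t in range(self._term))

loan = Loan(term=12, payment=500)
print(loan.totalPayments())  # 6000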
The project
I'm working on a small project where users can create events (such as Eat, Sleep, Watch a movie, etc.) and record log entries matching these events.
My data model looks like this (the application itself is in Python 3 / Django, but I don't think it matters here):
# a sample event
event = {
'id': 1,
'name': 'Eat',
}
# a sample entry
entry = {
'event_id': event['id'],
'user_id': 12,
# record date of the entry
'date': '2017-03-16T12:56:32.095465+00:00',
# user-given tags for the entry
'tags': ['home', 'delivery'],
# A user notation for the entry, it can be positive or negative
'score': 2,
'comment': 'That was a tasty meal',
}
Users can record an arbitrary number of entries for an arbitrary number of events, they can create new events when they need to. The data is stored in a relational database.
Now, I'd like to make data entry easier for users by suggesting relevant events when they visit the "Add an entry" form. At the moment, they can select the event corresponding to their entry in a dropdown, but I want to suggest a few relevant events to them on top of that.
I was thinking that, given a user history (all the recorded entries), it should be possible to predict likely inputs, by identifying patterns in entries, e.g:
Eat usually happens every day, around noon and 7:00 PM
Sleep usually occurs after 10:00 PM
Watch a movie usually occurs on Friday nights after 8:00 PM
Ideally, I'd like a function that, given a user ID and a datetime, and using the user history, returns a list of events that are most likely to occur:
def get_events(user_id, datetime, max=3):
    # implementation
    # returns a list of up to max events
    return events
So if I take the previous example (with more human-friendly dates), I would get the following results:
>>> get_events(user_id, 'Friday at 9:00 PM')
['Watch a movie', 'Sleep', 'Eat']
>>> get_events(user_id, 'Friday at 9:00 PM', max=2)
['Watch a movie', 'Sleep']
>>> get_events(user_id, 'Monday at 9:00 PM')
['Sleep', 'Eat', 'Watch a movie']
>>> get_events(user_id, 'Monday at noon')
['eat']
Of course, in real life, I'll pass real datetimes, and I'd like to get an event ID back so I can fetch the corresponding data from the database.
My question
(Sorry if it took some time to explain the whole thing.)
My actual question is: what algorithms / tools / libraries are required to achieve this? Is it even possible to do?
My current guess is that I'll need to use some fancy machine learning stuff, using something like scikit-learn and classifiers, feeding it with the user history to train it, then ask the whole thing to do its magic.
I'm not familiar at all with machine learning and I fear I don't have enough mathematical/scientific background to get started on my own. Can you provide some reference material to help me understand how to solve this, the algorithms / vocabulary I have to dig into, or some pseudocode?
I think a k-nearest neighbours (kNN) approach would be a good starting point. The idea in this specific case is to look for the k events that happen closest to the given time-of-day and count the ones that occur the most often.
Example
Say you have as input Friday at 9:00 PM. Take the distance of all events in the database to this date and rank them in ascending order. For example, if we take the distance in minutes for all elements in the database, an example ranking could be as follows.
('Eat', 34)
('Sleep', 54)
('Eat', 76)
...
('Watch a movie', 93)
Next you take the first k = 3 of those and compute how often they occur,
('Eat', 2)
('Sleep', 1)
so that the function returns ['Eat', 'Sleep'] (in that order).
Choosing a good k is important. Too small values will allow accidental outliers (doing something once on a specific moment) to have a large influence on the result. Choosing k too large will allow for unrelated events to be included in the count. One way to alleviate this is by using distance weighted kNN (see below).
Choosing a distance function
As mentioned in the comments, using a simple distance between two timestamps might lose you some information such as day-of-week. We can solve this by making the distance function d(e1, e2) slightly more complex. In this case, we can choose it to be a tradeoff between time-of-day and day-of-week, e.g.
d(e1, e2) = a * |timeOfDay(e1) - timeOfDay(e2)| * (1/1440) +
            b * |dayOfWeek(e1) - dayOfWeek(e2)| * (1/7)
where we normalised both differences by the maximum difference possible between times of day (in minutes) and days of the week. a and b are parameters that can be used to give more weight to one of these differences. For example, if we choose a = 3 and b = 1 we say that occurring at the same time of day is three times more important than occurring on the same day of the week.
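A sketch of this distance on Python datetime objects (the helper names are mine):
from datetime import datetime

def time_of_day(dt):
    # minutes since midnight
    return dt.hour * 60 + dt.minute

def day_of_week(dt):
    # 0 = Monday ... 6 = Sunday
    return dt.weekday()

def d(e1, e2, a=3, b=1):
    # weighted distance between two datetimes, mirroring the formula above
    time_part = abs(time_of_day(e1) - time_of_day(e2)) / 1440
    day_part = abs(day_of_week(e1) - day_of_week(e2)) / 7
    return a * time_part + b * day_part

# e.g. distance between Friday 9:00 PM and Thursday 8:30 PM
print(d(datetime(2017, 3, 17, 21, 0), datetime(2017, 3, 16, 20, 30)))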
Distance weighted kNN
You can increase complexity (and hopefully performance) by not simply selecting the k closest elements, but by assigning a weight (based on the distance) to all events according to their distance to the given point. Let e be the input example and o be an example from the database. Then we compute o's weight with respect to e as
w_o = 1 / d(e, o)^2
We see that points lose weight quickly as their distance to e increases. In your case, a number of elements then has to be selected from the final ranking. This can be done by summing the weights of identical events to compute a final ranking of event types.
Implementation
The nice thing about kNN is that it is very easy to implement. You'll roughly need the following components.
An implementation of the distance function d(e1, e2).
A function that ranks all elements in the database according to this function and a given input example.
def rank(e, db, d):
    """Rank the examples in db with respect to e using
    distance function d.
    """
    return sorted([(o, d(e, o)) for o in db],
                  key=lambda x: x[1])
A function that selects some elements from this ranking.
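For example (a sketch: it assumes each database example o exposes the name of its event via o['event_name'], which is an illustrative field name):
from collections import Counter, defaultdict

def select_top_k(ranking, k=3, max_events=3):
    # plain kNN: count event names among the k nearest examples
    counts = Counter(o['event_name'] for o, _ in ranking[:k])
    return [name for name, _ in counts.most_common(max_events)]

def select_weighted(ranking, max_events=3, eps=1e-6):
    # distance-weighted kNN: sum 1 / d^2 per event name over all examples
    totals = defaultdict(float)
    for o, dist in ranking:
        totals[o['event_name']] += 1.0 / (dist + eps) ** 2
    return sorted(totals, key=totals.get, reverse=True)[:max_events]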
I would like to produce random data of baby sleep times, but I want the random data to behave similarly (not necessarily identically) to the following graph:
(This is just imaginary data, please don't conclude anything from it, especially not about when your baby should sleep...)
The output that I want to produce is something like:
Baby name   Sleep start         Sleep end
Noah        2016/03/21 08:38    2016/03/21 09:28
Liam        2016/03/21 12:43    2016/03/21 15:00
Emma        2016/03/21 19:45    2016/03/22 06:03
So I thought I would create a weights table of time of day and weight (for the chance that a baby will sleep).
The question is how would I generate from this weights table a random data of a range of time to sleep?
(Think about it: if a baby starts to sleep at around 8am, most likely he/she will wake within the next two hours and not continue to sleep longer, and almost certainly won't sleep until 7am).
Is there another way you would build this (without the weights table)?
I prefer to build this in Python(3), but I would appreciate the general algorithm or lead to the solution.
Given the weights table data, you could use numpy.random.choice:
np.random.choice(list_of_times,
                 num_babies,
                 p=list_of_weights_for_those_times)
Without using a weights table, you would need to find the function that describes your distribution. Then see the answer to this question.
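One rough way the weights-table idea could be turned into start/duration pairs (both the weight numbers and the duration model below are made up purely for illustration):
import numpy as np

# illustrative weights: relative chance of falling asleep in each hour of the day
hours = np.arange(24)
weights = np.array([6, 6, 5, 4, 3, 2, 1, 1, 2, 3, 2, 1,
                    3, 4, 3, 1, 1, 1, 2, 4, 6, 7, 7, 6], dtype=float)
p = weights / weights.sum()

def random_sleep(num_babies=3):
    # sample a sleep start hour from the weights and a rough duration in hours
    starts = np.random.choice(hours, size=num_babies, p=p)
    night = (starts >= 19) | (starts <= 2)
    durations = np.where(night,
                         np.random.normal(9.0, 1.5, num_babies),   # night sleep
                         np.random.normal(1.5, 0.5, num_babies))   # daytime nap
    return list(zip(starts, np.clip(durations, 0.25, None)))

print(random_sleep())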
Let me start by answering the reverse of your question, since I misunderstood it at first; it leads to the answer too.
Assume that you already have a list of intervals dispersed around the 24 hours. You would like to find the number of intervals that overlap any given minute of the day, which is what you refer to as the weight.
I can think of two approaches. But first you should convert your time intervals into minutes, so the times in your list become:
# Note the 19:45-06:03 has been split into two intervals (1185,1440) and (0,363)
st = sorted(to_parts(sleep_table))
>>> [(0, 363), (518, 568), (763, 900), (1185, 1440)]
First, a simple solution will be to convert all intervals into a bunch of 1s and sum over all the intervals:
from functools import reduce  # needed on Python 3

eod = 60 * 24
weights = reduce(lambda c, x: [l + r for l, r in zip(c, [0]*x[0] + [1]*(x[1] - x[0]) + [0]*(eod - x[1]))],
                 st, [0]*eod)
This will give you a list of size 1440, where each entry is the weight for a given minute of the day.
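Written as a plain loop (same result, just easier to read):
eod = 60 * 24
weights = [0] * eod
for start, end in st:                # st holds (start_minute, end_minute) pairs
    for minute in range(start, end):
        weights[minute] += 1         # one more interval covers this minute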
Second is a slightly more complex line sweep algorithm, which will give you the same values in O(n log n) time for n segments. All you need is to take the start and end times of the intervals and sort them, while keeping track of whether each time is a start or an end time:
from itertools import groupby
from operator import itemgetter

def start_end(st):
    for start, end in st:
        yield (start, 1)    # an interval opens here
        yield (end, -1)     # an interval closes here

sorted(start_end(st))
# perform a line sweep to find changes in the weights
[(t, sum(map(itemgetter(1), grp)))
 for t, grp in groupby(sorted(start_end(st)), itemgetter(0))]
#compute the running sum of weights
#See question 35605014 for this part of the answer
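One possible way to finish that running-sum step with itertools.accumulate (the variable names here are mine):
from itertools import accumulate, groupby
from operator import itemgetter

events = sorted(start_end(st))
# net change in the number of active intervals at each distinct time
changes = [(t, sum(d for _, d in grp))
           for t, grp in groupby(events, key=itemgetter(0))]
times = [t for t, _ in changes]
# number of overlapping intervals (the weight) in effect from each time onwards
weights_at = list(accumulate(d for _, d in changes))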
Now, if you start from the weights themselves, you can easily convert them into a list of start and end times that are not yet coupled into intervals. All you need to do is convert the smooth spline in the post into a step function. Whenever the step function increases in value, you add a sleep start time, and whenever it decreases you add a sleep stop time. Finally you perform a line sweep to match each sleep start time to a sleep end time. There is a bit of wiggle room here, as you can match any start time with any end time. If you want more data points, you can introduce additional sleep start and end times, as long as they are at the same point in time.
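A rough sketch of that conversion, assuming weights is the per-minute step function from above (one integer per minute of the day):
def weights_to_events(weights):
    # emit sleep start/stop minutes from a per-minute weight step function
    starts, stops = [], []
    prev = weights[-1]                             # wrap around midnight
    for minute, w in enumerate(weights):
        if w > prev:
            starts.extend([minute] * (w - prev))   # step up: sleeps begin
        elif w < prev:
            stops.extend([minute] * (prev - w))    # step down: sleeps end
        prev = w
    return starts, stops
Matching each start to a later stop (the line sweep mentioned above) then yields concrete intervals.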
I have a few datasets each of which has time intervals as following:
configStartDate configEndDate
2012-06-07 10:38:01.000 2012-06-11 13:35:25.000
2012-07-12 20:00:55.000 2012-07-17 10:17:53.000
2012-07-18 12:44:15.000 2012-07-20 02:15:47.000
2012-07-20 02:15:47.000 2012-10-05 10:35:19.000
2012-10-05 10:35:19.000 2012-11-13 10:44:24.000
I need to write a query function (in R, but I am just figuring out the logic right now; prototyping in Python) which would take two custom start and end dates and sum up the intervals in between.
The issue is that the query dates could start in the middle of, or outside, the time chunks. So, for instance, in the above example, my query could be for the time interval 2012-06-09 to 2012-11-11, in which case I'd have to modify the start and end dates of the first and last chunk. But the query interval could also start in the middle of the second chunk, and so on.
The code to add up chunks is trivial:
diff_days = (pd.to_datetime(df_41.configEndDate) - pd.to_datetime(df_41.configStartDate)).astype('timedelta64[h]') / 24
print(sum(diff_days))
# 126.541666667 days
But now I am looking for the most efficient way to do the custom query starts and ends.
Right now what I am thinking is:
Loop through each configStartDate, configEndDate combination:
If query_start is before the particular chunk’s end date, set that chunk as the first to be included in the calculation and set its start date as the max(query_start, current_start_date). Exit the loop.
Then do the same but opposite for the query_end (replace the chunk's end date with start date, before with after, and maximum with minimum). Set that chunk as the last to be included for the calculation. Exit the loop.
In R-style pseudocode, it would look like:
ix = which(end_date > start_query)[1]
start_date[ix] = max(start_date[ix], start_query)
chunk[ix] -> first chunk
repeat with end_query, opposite signs
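A rough Python sketch of that chunk-clipping logic, treating the chunks as (start, end) Timestamp pairs (illustrative only):
import pandas as pd

def clipped_total_days(chunks, query_start, query_end):
    # chunks: iterable of (start, end) Timestamps, sorted and non-overlapping
    total = pd.Timedelta(0)
    for start, end in chunks:
        s = max(start, query_start)   # clip the chunk start to the query window
        e = min(end, query_end)       # clip the chunk end to the query window
        if e > s:                     # the chunk overlaps the query window
            total += e - s
    return total / pd.Timedelta(days=1)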
Is there an easier way to implement this? (I am not looking for specific code help, just advice on the logic.) Thanks.
I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and "Place" (int). A scatter plot of the start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].copy()
starts = starts.sort_values()  # Series.sort() in older pandas versions
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()  # back to the original row order
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
k = jenks.getJenksBreaks(full['Start Num'].tolist(),16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins.