Calculate time intervals with custom start/end points - python

I have a few datasets, each of which has time intervals like the following:
configStartDate configEndDate
2012-06-07 10:38:01.000 2012-06-11 13:35:25.000
2012-07-12 20:00:55.000 2012-07-17 10:17:53.000
2012-07-18 12:44:15.000 2012-07-20 02:15:47.000
2012-07-20 02:15:47.000 2012-10-05 10:35:19.000
2012-10-05 10:35:19.000 2012-11-13 10:44:24.000
I need to write a query function (in R, but I am just figuring out the logic right now; prototyping in Python) which would take two custom start and end dates and sum up the intervals in between.
The issue is that the query dates could start in the middle of, or outside of, the time chunks. So, for instance, in the above example, my query could be for the time interval from 2012-06-09 to 2012-11-11, in which case I'd have to modify the start date of the first chunk and the end date of the last chunk. But the query interval could also start in the middle of the second chunk, and so on.
The code to add up chunks is trivial:
diff_days = (pd.to_datetime(df_41.configEndDate) - pd.to_datetime(df_41.configStartDate)).astype('timedelta64[h]') / 24
print(sum(diff_days))
# 126.541666667 days
But now I am looking for the most efficient way to do the custom query starts and ends.
Right now what I am thinking is:
Loop through each configStartDate, configEndDate combination:
If query_start is before the particular chunk’s end date, set that chunk as the first to be included in the calculation and set its start date as the max(query_start, current_start_date). Exit the loop.
Then do the same but opposite for the query_end (replace the chunk's end date with start date, before with after, and maximum with minimum). Set that chunk as the last to be included for the calculation. Exit the loop.
In R-style pseudocode, it would look like:
ix = which(end_date > start_query)[1]
start_date[ix] = max(start_date[ix], start_query)
chunk[ix] -> first chunk
repeat with end_query, opposite signs
Is there an easier way to implement this? (I am not looking for specific code help, just advice on the logic.) Thanks.
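One possible simplification, sketched here in Python and assuming the chunks do not overlap each other: instead of searching for the first and last chunk, clip every chunk to the query window and add up whatever remains. The function name `covered_time` is made up for this illustration.

```python
from datetime import datetime, timedelta

def covered_time(intervals, query_start, query_end):
    """Sum the parts of each (start, end) interval that fall inside the query window."""
    total = timedelta(0)
    for start, end in intervals:
        lo = max(start, query_start)   # clip the chunk's start to the window
        hi = min(end, query_end)       # clip the chunk's end to the window
        if hi > lo:                    # chunks outside the window contribute nothing
            total += hi - lo
    return total

fmt = "%Y-%m-%d %H:%M:%S"
raw = [
    ("2012-06-07 10:38:01", "2012-06-11 13:35:25"),
    ("2012-07-12 20:00:55", "2012-07-17 10:17:53"),
    ("2012-07-18 12:44:15", "2012-07-20 02:15:47"),
    ("2012-07-20 02:15:47", "2012-10-05 10:35:19"),
    ("2012-10-05 10:35:19", "2012-11-13 10:44:24"),
]
intervals = [(datetime.strptime(a, fmt), datetime.strptime(b, fmt)) for a, b in raw]

total = covered_time(intervals, datetime(2012, 6, 9), datetime(2012, 11, 11))
print(total.total_seconds() / 86400, "days")
```

Because each chunk is clipped independently, there is no need to special-case a query that starts inside, between, or outside the chunks. The same idea should carry over to R with the vectorised pmax and pmin.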

Related

OR-Tools employee scheduling: maximising consecutive shifts

I've been studying the OR-Tools Employee Scheduling example and would like to change it so that the model would allow employees to be assigned to multiple shifts per day and at the same time give preference to solutions where an employee is assigned consecutive shifts within the span of a given day. Allowing multiple shifts seems straightforward, but how can I define an objective function that would prefer consecutive shifts? I don't want consecutiveness to be a hard constraint for the solution.
First, ignore the nurse rostering example and look at the shift_scheduling_sat.py example instead.
You can try having more types of shifts.
If you really want flexible shifts, define a maximum number of continuous shifts, each with a start and an end.
Order them: start1 <= end1 < start2 <= end2 ....
Force each empty shift to be midnight -> midnight. Then start from there. But I would recommend more sets of fixed shifts.

Algorithm for random range of time in day with weights

I would like to produce random baby sleep-time data, but I want the random data to behave similarly (not necessarily equally) to the following graph:
(This is just imaginary data; please don't conclude anything from it, especially not when your baby should sleep...)
The output that I want to produce is something like:
Baby name Sleep start Sleep end
Noah 2016/03/21 08:38 2016/03/21 09:28
Liam 2016/03/21 12:43 2016/03/21 15:00
Emma 2016/03/21 19:45 2016/03/22 06:03
So I thought I would create a weights table of time of day and weight (for the chance that a baby will sleep).
The question is: how would I generate random sleep time ranges from this weights table?
(Consider that if a baby starts to sleep at around 8am, he/she will most likely wake within the next two hours rather than sleep longer, and almost certainly won't sleep until 7am.)
Is there another way you would build this (without the weights table)?
I prefer to build this in Python(3), but I would appreciate the general algorithm or lead to the solution.
Given the weights table data, you could use numpy.random.choice:
np.random.choice(list_of_times,
                 num_babies,
                 p=list_of_weights_for_those_times)
Without using a weights table, you would need to find the function that describes your distribution. Then see the answer to this question.
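As a rough illustration of weighted sampling from such a table, here is a sketch that uses the standard library's `random.choices` as a stand-in for `numpy.random.choice`. The hourly weights below are invented for the example, and `sample_sleep_starts` is a hypothetical helper name; it only picks sleep start times, not durations.

```python
import random

# Hypothetical weights table: one weight per hour of the day (0-23).
hours = list(range(24))
weights = [8, 8, 8, 8, 8, 7, 5, 2, 1, 2, 3, 2,
           3, 5, 4, 2, 1, 1, 2, 4, 6, 7, 8, 8]

def sample_sleep_starts(num_babies, rng):
    """Pick a start hour per baby in proportion to its weight, then a uniform minute."""
    picked_hours = rng.choices(hours, weights=weights, k=num_babies)
    return [(hour, rng.randrange(60)) for hour in picked_hours]

rng = random.Random(0)  # seeded so the run is repeatable
for hour, minute in sample_sleep_starts(5, rng):
    print(f"{hour:02d}:{minute:02d}")
```

Unlike the `p` argument of `numpy.random.choice`, the weights passed to `random.choices` do not have to sum to 1.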
Let me start by answering the reverse of your question, since I misunderstood it at first, but doing so gave me the answer too.
Assume that you already have a list of intervals spread across the 24 hours. You would like to find the number of intervals that overlap any given minute of the day, which you refer to as the weight.
I can think of two approaches. But first, you should convert your time intervals into minutes, so the times in your list become:
# Note the 19:45-06:03 has been split into two intervals (1185,1440) and (0,363)
st = sorted(to_parts(sleep_table))
# st is now [(0, 363), (518, 568), (763, 900), (1185, 1440)]
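The `to_parts` helper is not shown in the answer. A minimal sketch of what it might look like follows, assuming `sleep_table` is a list of `(start, end)` datetime pairs and that no sleep crosses more than one midnight; `minute_of_day` is a name introduced here for illustration.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60

def minute_of_day(dt):
    return dt.hour * 60 + dt.minute

def to_parts(sleep_table):
    """Yield (start_minute, end_minute) pairs, splitting sleeps that cross midnight."""
    for start, end in sleep_table:
        s, e = minute_of_day(start), minute_of_day(end)
        if end.date() > start.date():
            yield (s, MINUTES_PER_DAY)   # part before midnight
            if e > 0:
                yield (0, e)             # part after midnight
        else:
            yield (s, e)

# The three sleeps from the question's example output.
sleep_table = [
    (datetime(2016, 3, 21, 8, 38), datetime(2016, 3, 21, 9, 28)),
    (datetime(2016, 3, 21, 12, 43), datetime(2016, 3, 21, 15, 0)),
    (datetime(2016, 3, 21, 19, 45), datetime(2016, 3, 22, 6, 3)),
]
st = sorted(to_parts(sleep_table))
print(st)
```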
First, a simple solution is to convert each interval into a run of 1s and sum over all the intervals:
from functools import reduce  # reduce is not a builtin in Python 3
eod = 60*24
weights = reduce(lambda c,x: [l+r for l,r in zip(c, [0]*x[0] + [1]*(x[1]-x[0]) + [0]*(eod-x[1]))], st, [0]*eod)
This will give you a list of size 1440, where each entry is the weight for a given minute of the day.
The second approach is a slightly more complex line sweep algorithm, which will give you the same values in O(n log n) time for n segments. All you need to do is take the start and end times of the intervals and sort them, while keeping track of whether each time is a start or an end:
from itertools import groupby
from operator import itemgetter

def start_end(st):
    for t in st:
        yield (t[0], 1)   # an interval starts here
        yield (t[1], -1)  # an interval ends here

events = sorted(start_end(st))
# perform a line sweep to find changes in the weights
changes = [(t, sum(d for _, d in grp)) for t, grp in groupby(events, key=itemgetter(0))]
# compute the running sum of weights
# See question 35605014 for this part of the answer
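For completeness, the running-sum step could be sketched with `itertools.accumulate`. This is one way to finish the sweep, not necessarily what the referenced answer does, and `weight_steps` is a name chosen for this example.

```python
from itertools import accumulate, groupby
from operator import itemgetter

st = [(0, 363), (518, 568), (763, 900), (1185, 1440)]

def weight_steps(intervals):
    """Return (minute, weight) pairs: the weight holds from that minute until the next pair."""
    events = sorted((t, d) for s, e in intervals for t, d in ((s, 1), (e, -1)))
    times, deltas = [], []
    for t, grp in groupby(events, key=itemgetter(0)):
        times.append(t)
        deltas.append(sum(d for _, d in grp))   # net change at this minute
    return list(zip(times, accumulate(deltas)))  # running sum gives the weight

steps = weight_steps(st)
print(steps)
```

The result is a compact step function rather than a 1440-entry list, which is why the sweep scales with the number of intervals and not with the number of minutes.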
Now, suppose you start from the weights themselves. You can easily convert them into a list of start and end times that are not yet coupled into intervals. All you need to do is convert the smooth spline in the post into a step function. Whenever the step function increases in value, you add a sleep start time, and whenever it goes down, you add a sleep end time. Finally, you perform a line sweep to match each sleep start time to a sleep end time. There is a bit of wiggle room here, as you can match any start time with any later end time. If you want more data points, you can introduce additional sleep start and end times, as long as each added start and end are at the same point in time.
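A small sketch of that reverse direction, assuming the spline has already been rounded to an integer step function with one value per minute. The function names and the first-in, first-out pairing rule are choices made for this example; as noted above, other pairings are equally valid.

```python
def events_from_steps(weights):
    """Turn per-minute integer weights into unpaired start and end minutes."""
    starts, ends = [], []
    prev = 0
    for minute, w in enumerate(weights):
        if w > prev:
            starts.extend([minute] * (w - prev))   # weight rose: sleeps begin
        elif w < prev:
            ends.extend([minute] * (prev - w))     # weight fell: sleeps end
        prev = w
    ends.extend([len(weights)] * prev)             # close anything still open
    return starts, ends

def pair_fifo(starts, ends):
    """Match the earliest start with the earliest end, and so on."""
    return list(zip(starts, ends))

toy_weights = [0, 1, 2, 2, 1, 0, 1, 1]  # a toy 8-minute "day"
starts, ends = events_from_steps(toy_weights)
intervals = pair_fifo(starts, ends)
print(intervals)
```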

Dividing Pandas DataFrame rows into similar time-based groups

I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and Place (int). A scatter plot of the start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there are at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and assign a new heat number every time we find a gap.
This can be done easily using the diff-compare-cumsum pattern:
# Series.sort() has been removed from pandas; sort_values() returns a sorted copy
starts = df["Start Time"].sort_values()
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
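To make the diff-compare-cumsum pattern concrete without pandas, here is a plain-Python sketch of the same three steps. `assign_heats` is a hypothetical helper, and the start times are made-up seconds rather than the race data.

```python
from itertools import accumulate

def assign_heats(start_seconds, gap=60):
    """Label each start time with a heat number; a gap larger than `gap` opens a new heat."""
    if not start_seconds:
        return []
    # Sort, but remember each runner's original position.
    order = sorted(range(len(start_seconds)), key=start_seconds.__getitem__)
    sorted_times = [start_seconds[i] for i in order]
    # diff + compare: 1 wherever the jump from the previous runner exceeds the gap
    new_heat = [0] + [int(b - a > gap) for a, b in zip(sorted_times, sorted_times[1:])]
    # cumsum: the running total of jumps is the heat number
    heat_sorted = list(accumulate(new_heat))
    # Restore the original order (the equivalent of sort_index).
    heats = [0] * len(start_seconds)
    for pos, i in enumerate(order):
        heats[i] = heat_sorted[pos]
    return heats

starts = [5, 130, 12, 125, 300, 8, 310]
print(assign_heats(starts))
```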
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
k = jenks.getJenksBreaks(full['Start Num'].tolist(),16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
for x in k:
    plt.axvline(x)
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:

fill missing values in python array

Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that just downloads historic weather data from a third-party API (wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need to have exactly one temperature value for any given hour, so I figured I could just fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all cool; however, I am having problems getting my program to detect whether the list has missing hours and, if so, insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit

# Evaluate simple quadratic function
def tempcal (array,x):
    return array[0]*x**2 + array[1]*x + array[2]

# Sample data, note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]

# Fit coefficients
coefs = polyfit(hrs,temps,2)

# Cycle control
i = 0
done = False
while not done:
    # It has missing hour, insert it and calculate a temperature
    if hrs[i] != i:
        hrs.insert(i,i)
        temps.insert(i,tempcal(coefs,i))
    # We are done, leave now
    if i == 24:
        done = True
    i += 1
I can see why this isn't working: the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying a list's length inside a loop has to be done carefully. Sure enough, I am either not being careful enough or just overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library) but I feel like I can solve this problem without it, (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
When i equals 21, hrs[i] refers to the twenty-second value in the list, but the list only has 21 values.
In future, I recommend using PyCharm with breakpoints for debugging, or a try-except construction.
I'm not sure I would recommend this way of interpolating values. I would have used the closest points surrounding the missing values instead of the whole dataset. But using numpy, your proposed way is fairly straightforward.
import numpy as np
hrs = np.array(hrs)
temps = np.array(temps)
newTemps = np.empty(25)
newTemps.fill(-300) # just fill it with some invalid data; temperatures don't go this low, so it should be safe
# fill in original values (index i holds hour i, matching range(25))
newTemps[hrs] = temps
# get indices of missing values
missing = np.nonzero(newTemps == -300)[0]
# calculate and insert missing values
newTemps[missing] = tempcal(coefs, missing)
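The alternative mentioned above, using only the closest points around each gap, amounts to linear interpolation. Here is a dependency-free sketch; `fill_hours` is a made-up name, and hours before the first or after the last reading simply reuse the nearest known value, which is an assumption of this example.

```python
from bisect import bisect_left

hrs = [1, 2, 3, 6, 9, 11, 13, 14, 15, 18, 19, 20]
temps = [14.0, 14.5, 14.5, 15.4, 17.8, 21.3, 23.5, 24.5, 25.5, 23.4, 21.3, 19.8]

def fill_hours(known_hours, known_temps, all_hours):
    """Return one temperature per hour, interpolating linearly between neighbours."""
    filled = []
    for h in all_hours:
        pos = bisect_left(known_hours, h)
        if pos < len(known_hours) and known_hours[pos] == h:
            filled.append(known_temps[pos])        # reading exists for this hour
        elif pos == 0:
            filled.append(known_temps[0])          # before the first reading
        elif pos == len(known_hours):
            filled.append(known_temps[-1])         # after the last reading
        else:
            h0, h1 = known_hours[pos - 1], known_hours[pos]
            t0, t1 = known_temps[pos - 1], known_temps[pos]
            filled.append(t0 + (t1 - t0) * (h - h0) / (h1 - h0))
    return filled

all_temps = fill_hours(hrs, temps, range(25))
print(all_temps)
```

With NumPy available, `np.interp` performs the same kind of piecewise-linear fill in a single call.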

Get Intervals Between Two Times

A User will specify a time interval of n secs/mins/hours and then two times (start / stop).
I need to be able to take this interval, and then step through the start and stop times, in order to get a list of these times. Then after this, I will perform a database look up via a table.objects.filter, in order to retrieve the data corresponding to each time.
I'm making some ridiculously long algorithms at the moment and I'm positive there could be an easier way to do this. That is, a more pythonic way. Thoughts?
It fits nicely as a generator, too:
def timeseq(start,stop,interval):
    while start <= stop:
        yield start
        start += interval
used as:
for t in timeseq(start,stop,interval):
    table.objects.filter(t)
or:
data = [table.objects.filter(t) for t in timeseq(start,stop,interval)]
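A self-contained run of the generator with `datetime` and `timedelta` values might look like the sketch below. The dates are arbitrary, and the guard against a non-positive interval is an addition made for this example, since such an interval would otherwise loop forever.

```python
from datetime import datetime, timedelta

def timeseq(start, stop, interval):
    """Yield start, start + interval, ... up to and including stop."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = start
    while current <= stop:
        yield current
        current += interval

times = list(timeseq(datetime(2024, 1, 1, 9, 0),
                     datetime(2024, 1, 1, 10, 0),
                     timedelta(minutes=15)))
for t in times:
    print(t.isoformat())
```

Collecting the times into a list first also makes it possible to fetch everything with one query instead of one query per time step, if the ORM supports an "in" style lookup.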
Are you looking for something like this? (pseudo code)
t = start
while t <= stop:
    table.objects.filter(t)
    t += interval
What about ...
result = RelevantModel.objects.filter(relevant_field__in=[
    start + interval * i
    for i in range(int((end - start).total_seconds() // interval.total_seconds()))
])
... ?
I can't imagine this is very different from what you're already doing, but perhaps it's more compact (particularly if you weren't using foo__in=[bar] or a list comprehension). Of course start and end would be datetime.datetime objects and interval would be a datetime.timedelta object.
