Python DateTime In and DateTime Out - Calculations regarding a time window

I have a dataframe with two DateTime object columns (one recording when a surgery is clocked in and the other when it is clocked out). For each row (i.e. case), I need to create a column of total time within business hours (07:00-17:30) and another column of total time outside of business hours. I am not sure of the best approach.
Reproducible segment of my dataframe:
Actual Room In DateTime Actual Room Out DateTime
0 2013-11-01 02:16 2013-11-01 04:35
1 2016-06-10 16:42 2016-06-10 19:28
2 2014-12-13 09:15 2014-12-13 10:55
3 2014-01-03 19:46 2014-01-03 22:54
4 2015-01-12 18:13 2015-01-12 19:58
5 2017-03-24 18:55 2017-03-24 19:57
6 2015-08-07 18:46 2015-08-07 19:42
7 2016-03-18 20:43 2016-03-19 00:40
8 2017-02-23 15:21 2017-02-23 17:35
9 2013-11-29 17:08 2013-11-29 17:42
10 2014-05-28 18:17 2014-05-28 19:12
11 2017-07-15 17:04 2017-07-15 18:19
12 2017-02-16 09:14 2017-02-16 21:29
13 2014-07-11 12:04 2014-07-11 17:40
14 2017-07-05 12:27 2017-07-05 20:08
15 2014-08-18 17:55 2014-08-18 19:50
16 2015-01-23 15:41 2015-01-23 19:41
17 2015-01-12 16:59 2015-01-12 17:49
18 2014-02-23 11:24 2014-02-23 15:06
19 2017-09-21 13:40 2017-09-21 18:11
pd.read_clipboard(sep=',')
The maximum amount of time between the two columns is:
df['Room Difference'] = df['Actual Room Out DateTime'] - df['Actual Room In DateTime']
max(df['Room Difference'])
Timedelta('1 days 01:17:00')
This helps me think about the problem and the algorithm I want to write.
I guess it would go something like this (as pseudocode):
if 00:00:00 <= 'Actual Room In DateTime' < 07:00:00 and 00:00:00 <= 'Actual Room Out DateTime' < 07:00:00:
    'After-hours' = 'Actual Room Out DateTime' - 'Actual Room In DateTime'
... to cover all the possible cases.
Is there an easier way or some sort of framework/tool for this exact kind of problem?

Subtract the In time from the business start time, to get the outside hours before the day starts
Subtract the business end time from the Out time, to get the outside hours after the day ends
Subtract the In time from the Out time, to get the total hours of surgery
Add the two outside-hours values, to get the total outside hours
Subtract the total outside hours from the total hours, to get the total within business hours
Make a separate column for every calculation (a sketch follows below)
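A minimal vectorised sketch of these steps, assuming df is the frame above and each case starts and ends on the same calendar day (the one sample row that crosses midnight would need to be split first). Rather than summing the two outside pieces directly, it clips each interval to the business window, which also covers cases that fall entirely outside 07:00-17:30:
import numpy as np
import pandas as pd

df['In'] = pd.to_datetime(df['Actual Room In DateTime'])
df['Out'] = pd.to_datetime(df['Actual Room Out DateTime'])

day = df['In'].dt.normalize()                      # midnight of the day the case starts
open_ = day + pd.Timedelta(hours=7)                # business start, 07:00
close = day + pd.Timedelta(hours=17, minutes=30)   # business end, 17:30

total = df['Out'] - df['In']
# overlap of [In, Out] with [open_, close], floored at zero for cases entirely outside
within = (np.minimum(df['Out'], close) - np.maximum(df['In'], open_)).clip(lower=pd.Timedelta(0))
df['Within Business Hours'] = within
df['Outside Business Hours'] = total - within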

Related

Resample pandas dataframe by two columns

I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks showing the number of arrivals at each station in the time block that starts at the given time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high-performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
df.groupby(['id', pd.Grouper(key='time', freq='5min')])\
  .size()\
  .to_frame('arrivals')\
  .reset_index()
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1

calculate effective time of a process by subtracting non-working-time

I have a pandas dataframe with over 100 timestamps that define the non-working time of a machine:
>>> off_time
date (index) start end
2020-07-04 18:00:00 23:50:00
2020-08-24 00:00:00 08:00:00
2020-08-24 14:00:00 16:00:00
2020-09-04 00:00:00 23:59:59
2020-10-05 18:00:00 22:00:00
I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:
>>> data
process-name start-time end-time duration
name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 day 14:00:00
name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 11:00:00
name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 12:00:00
name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 02:50:00
name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 day 09:00:00
In order to get the effective working time for each process in data, I now have to subtract the non-working time from the duration. For example, I have to subtract the time between 18:00 and 20:00 for the process "name5", since this time is planned as non-working time.
I wrote code with many if-else conditions, which I see as a potential source of errors. Is there a clean way to calculate the effective time without using too many if-else branches? Any help would be greatly appreciated.
Set up sample data (I added a couple of rows to your samples to include some edge cases):
import datetime as dt
import pandas as pd

######### OFF TIMES
off = pd.DataFrame([
    ["2020-07-04", dt.time(18), dt.time(23, 50)],
    ["2020-08-24", dt.time(0), dt.time(8)],
    ["2020-08-24", dt.time(14), dt.time(16)],
    ["2020-09-04", dt.time(0), dt.time(23, 59, 59)],
    ["2020-10-04", dt.time(15), dt.time(18)],
    ["2020-10-05", dt.time(18), dt.time(22)]], columns=["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")
### Convert start and end times to datetimes in the UTC timezone, since that is much
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string") + "+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string") + "+00:00")
off
>>
start end
date
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00
######### PROCESS TIMES
data = pd.DataFrame([
    ["name1", "2020-07-17 08:00:00+00:00", "2020-07-18 22:00:00+00:00"],
    ["name2", "2020-08-24 01:00:00+00:00", "2020-08-24 12:00:00+00:00"],
    ["name3", "2020-09-20 07:00:00+00:00", "2020-09-20 19:00:00+00:00"],
    ["name4", "2020-09-04 16:00:00+00:00", "2020-09-04 18:50:00+00:00"],
    ["name5", "2020-10-04 11:00:00+00:00", "2020-10-05 20:00:00+00:00"],
    ["name6", "2020-09-03 10:00:00+00:00", "2020-09-06 05:00:00+00:00"]
], columns=["process", "start", "end"])
data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])
data["duration"] = data.end - data.start
data
>>
process start end duration
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00
As you can see, I added a row to off on 2020-10-04, so that name5 has two off times, which could happen in your data and would need to be handled correctly. (This means that, in the example from your question, 5 hours need to be subtracted instead of 2.)
I also added the process name6, which is multiple days long.
This is my solution, which is applied to each row of data:
def get_relevant_off(pr):
    relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
    if not relevant.empty:
        relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
        relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
        to_subtract = (relevant.end - relevant.start).sum()
        return pr["duration"] - to_subtract
    else:
        return pr["duration"]
Explanation:
The first line in the function subsets the relevant rows of off, based on the row pr.
Off starts that are earlier than the process start are replaced with the process start, and likewise for the ends, since we don't want to sum the whole off time, only the part that actually overlaps the process.
The off durations are obtained by subtracting the off starts from the off ends, and summed.
That sum is then subtracted from the total duration.
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
process start end duration effective
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00 0 days 04:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00 0 days 00:00:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00 1 days 04:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00 1 days 19:00:01
Caveat: I am assuming that off times never overlap; a sketch for handling overlaps follows below. Also, I liked this problem, but don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.
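If overlapping off times can occur, one way to avoid subtracting the same span twice is to merge overlapping intervals before applying the function. This is only a sketch, assuming off["start"] and off["end"] are the UTC datetimes built above (the merged frame keeps just those two columns, which is all get_relevant_off uses):
# sort by start, then begin a new group whenever an interval starts
# after the running maximum of the ends seen so far
off = off.sort_values("start")
new_group = off["start"].gt(off["end"].cummax().shift())
off = off.groupby(new_group.cumsum()).agg(start=("start", "min"), end=("end", "max"))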

get each shift's values using groupby - python data frame

I am working on a production analysis data set (shift-wise, Day/Night). The day shift is 7 AM-7 PM and the night shift is 7 PM-7 AM.
Sometimes a day or night shift can be divided into two or more portions (e.g. the 7AM-7PM day shift can be split into 7AM-10AM and 10AM-7PM).
If a shift is divided into two or more portions, first check whether the Brand is the same across all of the shift's partitions.
If YES, set the start time to the beginning of the first partition and the end time to the end of the last partition.
For Production: take the total over the shift partitions.
For RPM: take the average over the shift partitions.
If NO, get the appropriate values for each Brand.
(For more understanding, please check the expected output.)
Sample of the Raw dataframe:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 17:07 Day A 5 50
7/9/2020 17:07 7/9/2020 17:58 Day A 10 100
7/9/2020 17:58 7/9/2020 19:00 Day A 5 60
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/9/2020 22:40 Night B 5 20
7/9/2020 22:40 7/10/2020 7:00 Night B 5 30
7/10/2020 7:00 7/10/2020 18:27 Day C 15 20
7/10/2020 18:27 7/10/2020 19:00 Day C 5 40
Expected Output:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 19:00 Day A 20 70
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/10/2020 7:00 Night B 10 25
7/10/2020 7:00 7/10/2020 19:00 Day C 20 30
Thanks in advance.
Here's a suggestion:
Make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
Then
df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
        .agg(Start=pd.NamedAgg(column='Start', aggfunc='min'),
             End=pd.NamedAgg(column='End', aggfunc='max'),
             Production=pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM=pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()[df.columns]
        .drop('Day', axis='columns'))
gives you
Start End Shift Brand Production RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00 Night A 10 50
1 2020-07-09 07:00:00 2020-07-09 19:00:00 Day A 20 70
2 2020-07-09 19:00:00 2020-07-09 21:30:00 Night A 2 10
3 2020-07-09 21:30:00 2020-07-10 07:00:00 Night B 10 25
4 2020-07-10 07:00:00 2020-07-10 19:00:00 Day C 20 30
which seems to be your desired output (if I'm not mistaken).
If you want to transform the columns Start and End back to string with a format similar to the one you've given above (there's some additional padding):
df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
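One caveat, as an assumption on my part: if a night shift's partitions can start after midnight (say a Night row running 01:00-07:00), grouping by the calendar day of Start would split that shift across two days. A hedged variant anchors each row to the day its shift began, by shifting Start back 7 hours in the Day assignment above; on the sample data it produces the same grouping, since every partition there starts before midnight:
# anchor each row to the calendar day its shift began (shifts start at 07:00/19:00)
df['Day'] = (df['Start'] - pd.Timedelta(hours=7)).dt.strftime('%Y-%m-%d')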

How do I get the mean of celsius based on the measured_at column?

I want to get the expected output below. How do I use groupby or resampling to get the mean celsius by hour while still keeping minute values in the measured_at column?
My input:
measured_at celsius
0 2020-05-19 01:13:40+00:00 15.00
1 2020-05-19 01:14:40+00:00 16.50
2 2020-05-20 02:13:26+00:00 30.00
3 2020-05-20 02:14:57+00:00 15.35
4 2020-05-20 02:15:19+00:00 14.00
5 2020-05-20 12:06:39+00:00 20.00
6 2020-05-21 03:13:07+00:00 15.50
7 2020-05-22 12:09:37+00:00 15.00
df['measured_at'] = pd.to_datetime(df.measured_at)
df1 = df.resample('60T', on='measured_at')['celsius'].mean().dropna().reset_index()
My output:
measured_at celsius
0 2020-05-19 01:00:00+00:00 15.750000
1 2020-05-20 02:00:00+00:00 19.783333
2 2020-05-20 12:00:00+00:00 20.000000
3 2020-05-21 03:00:00+00:00 15.500000
4 2020-05-22 12:00:00+00:00 15.000000
Expected output:
measured_at celsius
0 2020-05-19 01:13:00+00:00 15.750000
1 2020-05-20 02:13:00+00:00 19.783333
2 2020-05-20 12:06:00+00:00 20.000000
3 2020-05-21 03:13:00+00:00 15.500000
4 2020-05-22 12:09:00+00:00 15.000000
Here's the code for your use case.
I took out the minutes and seconds so that they could be averaged and added back after the resampling.
The +00:00 suffix is a UTC offset (the timestamps are timezone-aware); pd.to_datetime preserves it, and it does not affect the resampling.
import pandas as pd

# Convert to datetime objects
df['measured_at'] = pd.to_datetime(df['measured_at'])
# Extract the minutes and seconds as total seconds
df['seconds'] = df['measured_at'].apply(lambda x: (x.minute * 60) + x.second)
# Resample to periods of one hour
df = df.resample('60T', on='measured_at').mean().dropna().reset_index()
# Add back the average minutes and seconds for each period
df['measured_at'] = df['measured_at'] + pd.to_timedelta(df['seconds'].astype(int), unit='s')
# Remove the helper column
df = df.drop(columns='seconds')
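For what it's worth, a sketch of an alternative on recent pandas versions (1.3+, where datetime columns can be averaged directly), applied to the original df before the resample above: average the timestamps themselves, which makes the add-back step unnecessary.
# group by the hour and take the mean of both the timestamps and the readings
hour = df['measured_at'].dt.floor('H').rename('hour')
df1 = (df.groupby(hour)
         .agg(measured_at=('measured_at', 'mean'),
              celsius=('celsius', 'mean'))
         .reset_index(drop=True))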

python pandas mean by hour of day

I'm working with the following dataset with hourly counts in columns. The dataframe has more than 1400 columns and 100 rows.
My dataset looks like this:
CITY 2019-10-01 00:00 2019-10-01 01:00 2019-10-01 02:00 .... 2019-12-01 12:00
Wien 15 16 16 .... 14
Graz 11 11 11 .... 10
Innsbruck 12 12 10 .... 12
....
How can I aggregate these hourly columns into daily columns, like this:
CITY 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
(or 1 day) (or 2 day) (or 3 day) (or 72 day)
Wien 14 15 16 .... 12
Graz 13 12 14 .... 10
Innsbruck 13 12 12 .... 12
....
I would like the average over all hours of a day to be in that day's column.
The data type is:
type(df.columns[0])
out: str
type(df.columns[1])
out: pandas._libs.tslibs.timestamps.Timestamp
Thanks for your help!
I would do something like this:
days = pd.to_datetime(df.columns[1:].to_series()).dt.normalize()
df.set_index('CITY').groupby(days, axis=1).mean()
Output:
2019-10-01 2019-12-01
CITY
Wien 15.666667 14.0
Salzburg 12.000000 14.0
Graz 11.000000 10.0
Innsbruck 11.333333 12.0
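Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+). An equivalent long-format sketch, assuming the same frame df with the CITY column and hourly Timestamp column labels (long and out are illustrative names):
# melt to long format, take the date part of each hourly column label,
# then pivot back to one column per day containing the daily mean
long = df.melt(id_vars='CITY', var_name='time', value_name='count')
long['day'] = pd.to_datetime(long['time']).dt.normalize()
out = long.pivot_table(index='CITY', columns='day', values='count', aggfunc='mean')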
