I have the following pandas dataframe, where the duration is expressed in minutes:
Start Date Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
I would like to calculate the duration of each event for a single day. The problem is that there are some events, like the one in line 3, that span multiple days.
What I would like to obtain is something like this:
Date Event Duration
2021.01.01 1 880
2021.01.01 2 560
2021.01.02 1 760
2021.01.02 2 60
In general, the sum of all events on a specific day cannot exceed 1440, which is 24 hours * 60 minutes. The events are continuous, so there is always an event running; there are never times without events.
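For reference, the sample frame used in the answers below can be rebuilt like this (values copied verbatim from the table above):
import pandas as pd

df = pd.DataFrame({
    'Start Date': ['2021.01.01 00:00 AM', '2021.01.01 9:00 AM',
                   '2021.01.01 12:00 PM', '2021.01.01 12:20 PM',
                   '2021.01.02 12:20 PM', '2021.01.02 1:20 PM'],
    'Event': [2, 1, 2, 1, 2, 1],
    'Duration': [540, 180, 20, 1440, 60, 20],
})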
For some weird reason I could not convert your dates right away but needed to collapse repeated whitespace. Nonetheless, let's start by converting your Start Date column to pandas datetimes and setting it as the index:
>>> df['Start Date'] = pd.to_datetime(df['Start Date'].str.replace(r'\s+', ' ', regex=True))
>>> df = df.set_index('Start Date')
>>> df
Event Duration
2021-01-01 00:00:00 2 540
2021-01-01 09:00:00 1 180
2021-01-01 12:00:00 2 20
2021-01-01 12:20:00 1 1440
2021-01-02 12:20:00 2 60
2021-01-02 13:20:00 1 20
We can then compute which splits need to be done, i.e. the timestamps where the day changes but which don't appear as a Start Date, and add those to the index:
>>> splits = pd.date_range(df.index.min().floor(freq='D') + pd.Timedelta(days=1), df.index.max().ceil(freq='D') - pd.Timedelta(days=1), freq='D')
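For this sample, the only boundary that needs inserting is the midnight between the two days:
>>> splits
DatetimeIndex(['2021-01-02'], dtype='datetime64[ns]', freq='D')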
>>> df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())
>>> df
Event Duration
2021-01-01 00:00:00 2.0 540.0
2021-01-01 09:00:00 1.0 180.0
2021-01-01 12:00:00 2.0 20.0
2021-01-01 12:20:00 1.0 1440.0
2021-01-02 00:00:00 NaN NaN
2021-01-02 12:20:00 2.0 60.0
2021-01-02 13:20:00 1.0 20.0
At this point, the difference between consecutive index entries is exactly the time we want. Fill in the blanks from Duration, and then we can simply group by day/event and sum without any unexpected behaviour:
>>> minutes = df.index.to_series().diff().shift(-1).dt.total_seconds().div(60).fillna(df['Duration'])
>>> minutes
2021-01-01 00:00:00 540.0
2021-01-01 09:00:00 180.0
2021-01-01 12:00:00 20.0
2021-01-01 12:20:00 700.0
2021-01-02 00:00:00 740.0
2021-01-02 12:20:00 60.0
2021-01-02 13:20:00 20.0
dtype: float64
>>> minutes.groupby([df.index.date, df['Event'].ffill()]).sum()
Event
2021-01-01 1.0 880.0
2.0 560.0
2021-01-02 1.0 760.0
2.0 60.0
dtype: float64
Note that we also made sure to propagate event ids to the split lines with .ffill().
This solution has the advantage of not generating huge dataframes with 1 entry per minute, and without limits on how many days can be contained in a single Duration value.
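Putting the steps above together as a single reusable function (a sketch, assuming the same column names as in the question):
def daily_event_minutes(df):
    # index by the start timestamp
    df = df.copy()
    df['Start Date'] = pd.to_datetime(df['Start Date'].str.replace(r'\s+', ' ', regex=True))
    df = df.set_index('Start Date')
    # add the missing midnight boundaries
    splits = pd.date_range(df.index.min().floor('D') + pd.Timedelta(days=1),
                           df.index.max().ceil('D') - pd.Timedelta(days=1), freq='D')
    df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())
    # minutes until the next row; the last row falls back to its own Duration
    minutes = df.index.to_series().diff().shift(-1).dt.total_seconds().div(60).fillna(df['Duration'])
    return minutes.groupby([df.index.date, df['Event'].ffill()]).sum()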
This is not the most elegant solution, but it is a starting point:
# convert to datetime
df['Start Date'] = pd.to_datetime(df['Start Date'])
# calculate the end date
df['End Date'] = df['Start Date'] + df['Duration'].apply(pd.Timedelta, unit='min')
# create a mask marking rows whose start and end fall on different days
cross_day_mask = df['Start Date'].dt.date != df['End Date'].dt.date
# create two new frames
same_day_df = df[~cross_day_mask].copy()
not_same_day_df = df[cross_day_mask].copy()
# calculate the minutes from the start time until midnight
not_same_day_df['day1'] = (not_same_day_df['End Date'].dt.normalize() - not_same_day_df['Start Date']).dt.total_seconds()/60
# calculate the remaining minutes of the duration
not_same_day_df['day2'] = not_same_day_df['Duration'] - not_same_day_df['day1']
# reassign the first-day part to Duration
not_same_day_df['Duration'] = not_same_day_df['day1']
not_same_day_df['Event2'] = not_same_day_df['Event']
new = not_same_day_df[['End Date', 'day2', 'Event2']].rename(columns={'End Date': 'Start Date',
'day2': 'Duration',
'Event2': 'Event'})
# concatenate the data frames together (DataFrame.append was removed in pandas 2.0)
final_df = pd.concat([same_day_df, not_same_day_df[not_same_day_df.columns[:3]], new])
# groupby and sum
print(final_df.groupby([final_df['Start Date'].dt.normalize(), 'Event'])['Duration'].sum().reset_index())
Start Date Event Duration
0 2021-01-01 1 880.0
1 2021-01-01 2 560.0
2 2021-01-02 1 760.0
3 2021-01-02 2 60.0
You can do it by creating per-minute date ranges with pd.date_range, then explode and groupby:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["TimeRange"] = [
pd.date_range(s, periods=m, freq="T")
for s, m in zip(df["Start Date"], df["Duration"])
]
df_out = (
    df.explode("TimeRange")
    .groupby(["Event", pd.Grouper(key="TimeRange", freq="D")])["Event"]
    .count().rename('Duration').reset_index()
)
df_out
Output:
Event TimeRange Duration
0 1 2021-01-01 880
1 1 2021-01-02 760
2 2 2021-01-01 560
3 2 2021-01-02 60
Create a record per minute starting at Start Date, then count the records grouped by event and date.
Have a look at DataFrame.groupby.
For example, you could calculate the sum of all durations on a day like this:
import pandas as pd
import io
df = """
Date Time Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
"""
df = pd.read_csv(io.StringIO(df), sep=r"\s+")
df.reset_index().groupby(["index", "Event"]).sum()
                  Duration
index      Event
2021.01.01 1          1620
           2           560
2021.01.02 1            20
           2            60
I have the column below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column in the same format, but holding the starting date of each quarter?
Expected output
date        quarter
2019-05-11  2019-04-01
2019-11-11  2019-10-01
2020-03-01  2020-01-01
2021-02-18  2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output:
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime objects, if it is not already of datetime type
Convert the dates to quarter periods with dt.to_period or with PeriodIndex
Convert the resulting quarter periods to timestamps with to_timestamp, to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter will always be in the same year and will start at day 1. All there is to calculate is the month.
Considering that a quarter is 3 months (12 / 4), the quarter-start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this:
n = month
quarter = ( (n-1) // 3 ) * 3 + 1
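As a quick sketch of that arithmetic applied in pandas (using the dates from the question; the column names are only illustrative):
import pandas as pd

df = pd.DataFrame({'date': ['2019-05-11', '2019-11-11', '2020-03-01', '2021-02-18']})
df['date'] = pd.to_datetime(df['date'])
# starting month of the quarter: ((n - 1) // 3) * 3 + 1
quarter_month = (df['date'].dt.month - 1) // 3 * 3 + 1
# rebuild a date from the year, the computed month and day 1
df['quarter'] = pd.to_datetime(pd.DataFrame({'year': df['date'].dt.year,
                                             'month': quarter_month,
                                             'day': 1}))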
I've been reading the forum and investigating on the internet, but I can't figure out how to use pandas functions to condense this whole code:
def get_time_and_date(schedule, starting_date, position):
    # calculate the datetime for a start or end time; if the end time < start time, add one day to the end
    my_time = datetime.strptime(schedule.split('-')[position], '%H:%M')
    my_date = datetime.strptime(starting_date, '%Y-%m-%d')
    # get the starting hour for the range if we are calculating the last interval
    if position == 1:
        starting_hour = datetime.strptime(schedule.split('-')[0], '%H:%M')
        starting_hour = datetime(my_date.year, my_date.month, my_date.day, starting_hour.hour, 0)
    # unify my_time and my_date, normalizing the minutes to 30-minute slots
    if my_time.minute >= 30:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 30)
    else:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 0)
    # if the final time of the day < the starting time, there is a day jump, so we add a day
    if position == 1 and my_hour_and_date < starting_hour:
        my_hour_and_date += timedelta(days=1)
    return my_hour_and_date
def get_time_interval_ranges(schedule, my_date):
    # get all the schedules for the day, if there are several
    schedules = schedule.split('/')
    intervals_list = []
    # loop through all the schedules and add each range, using 'separate_range_here' as a split separator
    for my_schedule in schedules:
        current_range = pd.date_range(start=get_time_and_date(my_schedule, my_date, 0),
                                      end=get_time_and_date(my_schedule, my_date, 1),
                                      freq="30min").strftime('%Y-%m-%d, %H:%M').to_list()
        intervals_list += current_range
        intervals_list.append('separate_range_here')
    return intervals_list
def generate_time_intervals(df, column_to_process, new_column):
    # generate the range-of-times column (the stray third argument is dropped to match the signature above)
    df[new_column] = df.apply(lambda row: get_time_interval_ranges(row[column_to_process], row['my_date']), axis=1)
    return df
I believe there is a better way to do this, but I can't figure out how. What I'm giving to the first function (generate_time_intervals) is a DataFrame with some columns, but only Date (yyyy-mm-dd) and Schedule are important.
When the schedule is 09:00-15:00 it's easy: just split on the "-" and give the parts to the built-in function date_range. The problem comes when handling horrendous times like the one in the title, or the likes of 09:17-16:24.
Is there any way to handle this without so much looping and the like in my code?
Edit:
With this input:
Worker   Date        Schedule
Worker1  2022-05-01  09:00-10:00/11:00-14:00/15:00-18:00
Worker2  2022-05-01  09:37-15:38
I would like this output:
Date        Interval  Working Minutes
2022-05-01  09:00     30
2022-05-01  09:30     53
2022-05-01  10:00     30
2022-05-01  10:30     30
2022-05-01  11:00     60
2022-05-01  11:30     60
2022-05-01  12:00     60
2022-05-01  12:30     60
2022-05-01  13:00     60
2022-05-01  13:30     60
2022-05-01  14:00     30
2022-05-01  14:30     30
2022-05-01  15:00     60
2022-05-01  15:30     38
2022-05-01  16:00     30
2022-05-01  16:30     30
2022-05-01  17:00     30
2022-05-01  17:30     30
2022-05-01  18:00     0
Working with datetime:
df= pd.DataFrame({'schedule':['09:17-16:24','19:40-21:14']})
schedules = df.schedule.str.split('-',expand=True)
start = pd.to_datetime(schedules[0]).dt.round('H')
end = pd.to_datetime(schedules[1]).dt.round('H')
df['interval_out'] = start.dt.hour.astype(str) + ':00 - ' + end.dt.hour.astype(str) + ':00'
And result:
>>> df
schedule
0 09:17-16:24
1 19:40-21:14
>>> schedules
0 1
0 09:17 16:24
1 19:40 21:14
>>> start
0 2022-05-18 09:00:00
1 2022-05-18 20:00:00
Name: 0, dtype: datetime64[ns]
>>> end
0 2022-05-18 16:00:00
1 2022-05-18 21:00:00
Name: 1, dtype: datetime64[ns]
>>> df
schedule interval_out
0 09:17-16:24 9:00 - 16:00
1 19:40-21:14 20:00 - 21:00
>>>
Of course the rounding should be floor & ceil if you want to expand it...
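For instance, a minimal sketch reusing the schedules frame from above:
# floor the start down and ceil the end up, so the interval expands outwards
start = pd.to_datetime(schedules[0]).dt.floor('H')
end = pd.to_datetime(schedules[1]).dt.ceil('H')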
EDIT: Trying the original question... It also helps to read about the datetime functions in pandas (which I have now learnt) :facepalm:
Expand the schedule blocks into individual start/stop items
Floor / ceil them to get the start/stop slots
Calculate the intervals using a convenient pandas function
Explode the intervals as rows
Calculate the late starts
Calculate the soon (early) stops
Calculate how many people were actually in the office
Group the data by slot, adding up the lost minutes and worked minutes * workers
Do the calculation
df['timeblocks']= df.Schedule.str.split('/')
df2 = df.explode('timeblocks')
timeblocks = df2.timeblocks.str.split('-',expand=True)
df2['start'] = pd.to_datetime(df2.Date + " " + timeblocks[0])
df2['stop'] = pd.to_datetime(df2.Date + " " + timeblocks[1])
df2['start_slot'] = df2.start.dt.floor('30min')
df2['stop_slot'] = df2.stop.dt.ceil('30min')
df2['intervals'] = df2.apply(lambda x: pd.date_range(x.start_slot, x.stop_slot, freq='30min'), axis=1)
df3 = df2.explode('intervals')
df3['late_start'] = (df3.start>df3.intervals)*(df3.start-df3.intervals).dt.seconds/60
df3['soon_stop']= ((df3.stop>df3.intervals) & (df3.stop<(df3.intervals+pd.Timedelta('30min'))))*((df3.intervals+pd.Timedelta('30min'))-df3.stop).dt.seconds/60
df3['someone'] = (df3.start < df3.intervals + pd.Timedelta('30min')) & (df3.stop > df3.intervals)
df4 = df3.groupby('intervals').agg({'late_start': 'sum', 'soon_stop': 'sum', 'someone': 'sum'})
df4['worked_time'] = df4.someone*30 - df4.late_start - df4.soon_stop
df4
>>> df4
late_start soon_stop someone worked_time
intervals
2022-05-01 09:00:00 0.0 0.0 1 30.0
2022-05-01 09:30:00 7.0 0.0 2 53.0
2022-05-01 10:00:00 0.0 0.0 1 30.0
2022-05-01 10:30:00 0.0 0.0 1 30.0
2022-05-01 11:00:00 0.0 0.0 2 60.0
2022-05-01 11:30:00 0.0 0.0 2 60.0
2022-05-01 12:00:00 0.0 0.0 2 60.0
2022-05-01 12:30:00 0.0 0.0 2 60.0
2022-05-01 13:00:00 0.0 0.0 2 60.0
2022-05-01 13:30:00 0.0 0.0 2 60.0
2022-05-01 14:00:00 0.0 0.0 1 30.0
2022-05-01 14:30:00 0.0 0.0 1 30.0
2022-05-01 15:00:00 0.0 0.0 2 60.0
2022-05-01 15:30:00 0.0 22.0 2 38.0
2022-05-01 16:00:00 0.0 0.0 1 30.0
2022-05-01 16:30:00 0.0 0.0 1 30.0
2022-05-01 17:00:00 0.0 0.0 1 30.0
2022-05-01 17:30:00 0.0 0.0 1 30.0
2022-05-01 18:00:00 0.0 0.0 0 0.0
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a day-per-row and hour-per-column format, where the hours become the columns and the Value column fills the cell values.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
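For example, the pivot and the label removal can be chained in one expression (a small sketch of the same call as above):
out = (
    df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
      .rename_axis(index=None, columns=None)
)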
Having the following DF:
group_id timestamp
A 2020-09-29 06:00:00 UTC
A 2020-09-29 08:00:00 UTC
A 2020-09-30 09:00:00 UTC
B 2020-09-01 04:00:00 UTC
B 2020-09-01 06:00:00 UTC
I would like to count the deltas between consecutive records within each group, without counting deltas across groups. Result for the above example:
delta count
2 2
25 1
Explanation: In group A the deltas are
06:00:00 -> 08:00:00 (2 hours)
08:00:00 -> 09:00:00 on the next day (25 hours)
And in group B:
04:00:00 -> 06:00:00 (2 hours)
How can I achieve this using Python Pandas?
Use DataFrameGroupBy.diff for the differences per group, convert them to seconds with Series.dt.total_seconds, divide by 3600 to get hours, and finally count the values with Series.value_counts, converting the Series to a two-column DataFrame:
df1 = (df.groupby("group_id")['timestamp']
.diff()
.dt.total_seconds()
.div(3600)
.value_counts()
.rename_axis('delta')
.reset_index(name='count'))
print (df1)
delta count
0 2.0 2
1 25.0 1
Code
df_out = df.groupby("group_id").diff().groupby("timestamp").size()
# convert to dataframe
df_out = df_out.to_frame().reset_index().rename(columns={"timestamp": "delta", 0: "count"})
Result
print(df_out)
delta count
0 0 days 02:00:00 2
1 1 days 01:00:00 1
The NaT values (missing values) produced by the groupby diff were ignored automatically.
To represent the timedeltas in hours, just call the total_seconds() method:
df_out["delta"] = df_out["delta"].dt.total_seconds() / 3600
print(df_out)
delta count
0 2.0 2
1 25.0 1
I have got two dataframes (i.e. df1 and df2).
df1 contains date and time columns. The time column contains a time series in 30-minute intervals:
df1:
date time
0 2015-04-01 00:00:00
1 2015-04-01 00:30:00
2 2015-04-01 01:00:00
3 2015-04-01 01:30:00
4 2015-04-01 02:00:00
df2 contains date, start-time, end-time, value:
df2
INCIDENT_DATE INTERRUPTION_TIME RESTORE_TIME WASTED_MINUTES
0 2015-04-01 00:32 01:15 1056.0
1 2015-04-01 01:20 02:30 3234.0
2 2015-04-01 01:22 03:30 3712.0
3 2015-04-01 01:30 03:15 3045.0
Now I want to copy the WASTED_MINUTES column from df2 to df1 whenever the date columns of both data frames are the same and the INTERRUPTION_TIME of df2 falls within the 30-minute interval in the time column of df1. So the output should look like:
df1:
date time Wasted_columns
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
I tried the merge command (on the basis of the date column), but it didn't produce the desired result, because I am not sure how to check whether a time falls within a 30-minute interval or not. Could anyone guide me on how to fix this?
Convert time to timedelta and assign it back to df1. Convert INTERRUPTION_TIME to timedelta and floor it to 30-minute intervals, assigning the result to s. Group df2 by INCIDENT_DATE and s and take the sum of WASTED_MINUTES. Finally, join the result of the groupby back to df1:
df1['time'] = pd.to_timedelta(df1['time'].astype(str)) #cast to str before calling `to_timedelta`
s = pd.to_timedelta(df2.INTERRUPTION_TIME+':00').dt.floor('30Min')
df_final = df1.join(df2.groupby(['INCIDENT_DATE', s]).WASTED_MINUTES.sum(),
on=['date', 'time'])
Out[631]:
date time WASTED_MINUTES
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
You can do this
df1['time']=pd.to_datetime(df1['time'])
df1['Wasted_columns']=df1.apply(lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME'])>= x['time']) & (pd.to_datetime(df2['INTERRUPTION_TIME'])< x['time']+pd.Timedelta(minutes=30)),'WASTED_MINUTES'].sum(), axis=1)
df1['time']=df1['time'].dt.time
If you convert the 'time' column in the lambda function itself, then it is just one line of code as below
df1['Wasted_columns']=df1.apply(lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME'])>= pd.to_datetime(x['time'])) & (pd.to_datetime(df2['INTERRUPTION_TIME'])< pd.to_datetime(x['time'])+pd.Timedelta(minutes=30)),'WASTED_MINUTES'].sum(), axis=1)
Output (note that intervals without any incident get 0.0 rather than NaN, since the sum over an empty selection is 0):
date time Wasted_columns
0 2015-04-01 00:00:00 0.0
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 0.0
The idea:
+ Convert to datetime
+ Round up to the next 30-minute slot
+ Merge
from datetime import datetime, timedelta
def ceil_dt(dt, delta):
return dt + (datetime.min - dt) % delta
# Convert
df1['dt'] = (df1['date'] + ' ' + df1['time']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M:%S'])
df2['dt'] = (df2['INCIDENT_DATE'] + ' ' + df2['INTERRUPTION_TIME']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M'])
# Round
df2['dt'] = df2['dt'].apply(ceil_dt, args=[timedelta(minutes=30)])
# Merge
final = df1.merge(df2.loc[:, ['dt', 'WASTED_MINUTES']], on='dt', how='left')
Also, if multiple incidents happen in the same 30-minute timeframe, you would want to group df2 by the rounded dt column first to sum up the wasted minutes, then merge.
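A minimal sketch of that grouping step (assuming the rounded dt column and the WASTED_MINUTES column from above):
# sum the wasted minutes per rounded 30-minute slot before merging
wasted = df2.groupby('dt', as_index=False)['WASTED_MINUTES'].sum()
final = df1.merge(wasted, on='dt', how='left')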