I am a beginner in Python. These readings come from sensors that report to the system at 20-minute intervals. I would like to find the total downtime of each outage, from its start time until its end time.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
First, let's break them into groups:
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here.
# total_seconds() is used because .dt.seconds drops the days part of longer spans:
df2.down_time = df2.down_time.dt.total_seconds()/60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
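For reference, here is a minimal end-to-end sketch of the steps above, assuming the timestamps from the question are loaded into a column named date (the reading column name is made up here). It takes the group number directly from the cumulative sum of the gap flags, which is a compact variant of the grouping shown above rather than the exact same code.

```python
import pandas as pd

# Timestamps copied from the question; every row has a reading of 0.
stamps = [
    "1/1/2022 9:00", "1/1/2022 9:20", "1/1/2022 9:40", "1/1/2022 10:00",
    "1/1/2022 10:20", "1/1/2022 10:40", "1/1/2022 12:40", "1/1/2022 13:00",
    "1/1/2022 13:20", "1/3/2022 1:20", "1/3/2022 1:40", "1/3/2022 2:00",
    "1/4/2022 14:40", "1/4/2022 15:00", "1/4/2022 15:20", "1/4/2022 17:20",
    "1/4/2022 17:40", "1/4/2022 18:00", "1/4/2022 18:20", "1/4/2022 18:40",
]
df = pd.DataFrame({
    "date": pd.to_datetime(stamps, format="%m/%d/%Y %H:%M"),
    "reading": 0,  # hypothetical column name
})

# A new group starts wherever the gap to the previous reading exceeds 20 minutes.
new_group = df["date"].diff() > pd.Timedelta(minutes=20)
df["group"] = new_group.cumsum()

# First and last timestamp of each group, plus the span between them in minutes.
df2 = (
    df.groupby("group")["date"]
    .agg(start_time="min", end_time="max")
    .reset_index(drop=True)
)
df2["down_time"] = (df2["end_time"] - df2["start_time"]).dt.total_seconds() / 60
print(df2)
```

The first row's diff() is NaT, which compares as False, so the first reading always lands in group 0.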
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.
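As a small sketch of what this produces, assuming df has a datetime column named date (the four sample timestamps are borrowed from the question, and the block_end column is a made-up name): the last row's difference is NaT, and any difference above the 20-minute reporting interval marks the last reading of a block.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2022-01-01 10:20", "2022-01-01 10:40",
    "2022-01-01 12:40", "2022-01-01 13:00",
])})

df["date2"] = df["date"].shift(-1)
df["difference"] = df["date2"] - df["date"]

# The final row has no following timestamp, so its difference is NaT.
# A difference above the reporting interval flags the last row of a block.
df["block_end"] = df["difference"] > pd.Timedelta(minutes=20)
print(df)
```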
I have attached an example of a dataframe with quarter-hourly data. I wish to resample it to one row per minute without any aggregation.
Input dataframe:
Date (CET)         Price
2020-01-01 11:00   50
2020-01-01 11:15   60
2020-01-01 11:15   100
The output I want is this:
Date (CET)         Price
2020-01-01 11:00   50
2020-01-01 11:01   50
2020-01-01 11:02   50
2020-01-01 11:03   50
2020-01-01 11:04   50
2020-01-01 11:05   50
2020-01-01 11:06   50
2020-01-01 11:07   50
2020-01-01 11:08   50
2020-01-01 11:09   50
2020-01-01 11:10   50
2020-01-01 11:11   50
2020-01-01 11:12   50
2020-01-01 11:13   50
2020-01-01 11:14   50
2020-01-01 11:15   60
I tried using df.resample, but it requires me to aggregate with mean() or sum(), which I don't want. I want the value to stay the same within each quarter hour, as in the output table, where the price remains 50 from 11:00 to 11:14.
Use:
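# convert the column to datetimes
df['Date (CET)'] = pd.to_datetime(df['Date (CET)'])
# remove duplicated timestamps (the first row of each is kept)
df = df.drop_duplicates('Date (CET)')
# set the DatetimeIndex that resample needs
df = df.set_index('Date (CET)')
# upsample to one row per minute, forward filling the values
df.resample('min').ffill()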
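A self-contained sketch of the same approach on the question's input, for anyone who wants to run it directly. The frame is rebuilt by hand here, and 'min' is used as the minute alias.

```python
import pandas as pd

# Input from the question, including the duplicated 11:15 timestamp.
df = pd.DataFrame({
    "Date (CET)": ["2020-01-01 11:00", "2020-01-01 11:15", "2020-01-01 11:15"],
    "Price": [50, 60, 100],
})

df["Date (CET)"] = pd.to_datetime(df["Date (CET)"])
# drop_duplicates keeps the first row per timestamp by default (Price 60 here).
df = df.drop_duplicates("Date (CET)").set_index("Date (CET)")

# Upsample to one row per minute and carry the last known price forward.
out = df.resample("min").ffill()
print(out)
```

Note that resample only spans the first to the last timestamp, so the final row here is 11:15. Covering the remaining minutes of that last quarter hour would need the index to be extended first.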
I've been reading the forum and searching the internet, but I can't figure out how to use pandas functions to condense this whole code:
from datetime import datetime, timedelta
import pandas as pd

def get_time_and_date(schedule, starting_date, position):
    # Build the datetime for the start (position 0) or end (position 1) of a schedule.
    # If the end time is earlier than the start time, one day is added to the end.
    my_time = datetime.strptime(schedule.split('-')[position], '%H:%M')
    my_date = datetime.strptime(starting_date, '%Y-%m-%d')
    # get the starting hour for the range if we are calculating the last interval
    if position == 1:
        starting_hour = datetime.strptime(schedule.split('-')[0], '%H:%M')
        starting_hour = datetime(my_date.year, my_date.month, my_date.day, starting_hour.hour, 0)
    # combine my_time and my_date, normalizing the minutes down to :00 or :30
    if my_time.minute >= 30:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 30)
    else:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 0)
    # if the end time is earlier than the start time there is a day jump, so we add a day
    if position == 1 and my_hour_and_date < starting_hour:
        my_hour_and_date += timedelta(days=1)
    return my_hour_and_date

def get_time_interval_ranges(schedule, my_date):
    # get all the schedules for the day, if there is more than one
    schedules = schedule.split('/')
    intervals_list = []
    # loop through all the schedules and add each range followed by a separator marker
    for my_schedule in schedules:
        current_range = pd.date_range(
            start=get_time_and_date(my_schedule, my_date, 0),
            end=get_time_and_date(my_schedule, my_date, 1),
            freq="30min",
        ).strftime('%Y-%m-%d, %H:%M').to_list()
        intervals_list += current_range
        intervals_list.append('separate_range_here')
    return intervals_list

def generate_time_intervals(df, column_to_process, new_column):
    # generate the column holding the list of time intervals
    df[new_column] = df.apply(lambda row: get_time_interval_ranges(row[column_to_process], row['my_date']), axis=1)
    return df
I believe there is a better way to do this, but I can't work out how. What I pass to the first function (generate_time_intervals) is a DataFrame with several columns, of which only the date (yyyy-mm-dd) and the schedule matter.
When the schedule is 09:00-15:00 it's easy: just split on the "-" and pass the parts to the built-in pandas function date_range. The problem is handling awkward times like the one in the title or the likes of 09:17-16:24.
Is there any way to handle this without so much looping in my code?
Edit:
With this input:
Worker    Date         Schedule
Worker1   2022-05-01   09:00-10:00/11:00-14:00/15:00-18:00
Worker2   2022-05-01   09:37-15:38
I would like this output:
Date         Interval   Working Minutes
2022-05-01   09:00      30
2022-05-01   09:30      53
2022-05-01   10:00      30
2022-05-01   10:30      30
2022-05-01   11:00      60
2022-05-01   11:30      60
2022-05-01   12:00      60
2022-05-01   12:30      60
2022-05-01   13:00      60
2022-05-01   13:30      60
2022-05-01   14:00      30
2022-05-01   14:30      30
2022-05-01   15:00      60
2022-05-01   15:30      38
2022-05-01   16:00      30
2022-05-01   16:30      30
2022-05-01   17:00      30
2022-05-01   17:30      30
2022-05-01   18:00      0
Working with datetime:
df= pd.DataFrame({'schedule':['09:17-16:24','19:40-21:14']})
schedules = df.schedule.str.split('-',expand=True)
start = pd.to_datetime(schedules[0]).dt.round('H')
end = pd.to_datetime(schedules[1]).dt.round('H')
df['interval_out'] = start.dt.hour.astype(str) + ':00 - ' + end.dt.hour.astype(str) + ':00'
And result:
>>> df
schedule
0 09:17-16:24
1 19:40-21:14
>>> schedules
0 1
0 09:17 16:24
1 19:40 21:14
>>> start
0 2022-05-18 09:00:00
1 2022-05-18 20:00:00
Name: 0, dtype: datetime64[ns]
>>> end
0 2022-05-18 16:00:00
1 2022-05-18 21:00:00
Name: 1, dtype: datetime64[ns]
>>> df
schedule interval_out
0 09:17-16:24 9:00 - 16:00
1 19:40-21:14 20:00 - 21:00
>>>
Of course the rounding should be floor & ceil if you want to expand it...
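A minimal sketch of that floor and ceil variant, assuming the same two schedules and a 30-minute slot size (the slot size is an assumption taken from the question's expected output). The times are parsed with an explicit format, so the date part defaults to 1900-01-01 and is ignored when formatting.

```python
import pandas as pd

df = pd.DataFrame({"schedule": ["09:17-16:24", "19:40-21:14"]})
parts = df["schedule"].str.split("-", expand=True)

# Floor the start and ceil the end so the slot range fully covers the schedule.
start = pd.to_datetime(parts[0], format="%H:%M").dt.floor("30min")
end = pd.to_datetime(parts[1], format="%H:%M").dt.ceil("30min")

df["interval_out"] = start.dt.strftime("%H:%M") + " - " + end.dt.strftime("%H:%M")
print(df)
```

Unlike round, this never shrinks the range: 09:17 becomes 09:00 and 16:24 becomes 16:30.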
EDIT: Trying the original question... It also helps to read about the datetime functions in pandas (which I have now learnt...).
Expand the blocks into individual items start/stop
Floor / ceil them for the start/stop
Calculate the intervals using a convenient pandas function...
Explode the intervals as rows
Calculate the late start
Calculate soon stop
Calculate how many people were actually in the office
Group data on slots, adding lost minutes and worked minutes * worker
Do the calculation
df['timeblocks']= df.Schedule.str.split('/')
df2 = df.explode('timeblocks')
timeblocks = df2.timeblocks.str.split('-',expand=True)
df2['start'] = pd.to_datetime(df2.Date + " " + timeblocks[0])
df2['stop'] = pd.to_datetime(df2.Date + " " + timeblocks[1])
df2['start_slot'] = df2.start.dt.floor('30min')
df2['stop_slot'] = df2.stop.dt.ceil('30min')
df2['intervals'] = df2.apply(lambda x: pd.date_range(x.start_slot, x.stop_slot, freq='30min'), axis=1)
df3 = df2.explode('intervals')
df3['late_start'] = (df3.start>df3.intervals)*(df3.start-df3.intervals).dt.seconds/60
df3['soon_stop']= ((df3.stop>df3.intervals) & (df3.stop<(df3.intervals+pd.Timedelta('30min'))))*((df3.intervals+pd.Timedelta('30min'))-df3.stop).dt.seconds/60
df3['someone'] = (df3.start<df3.intervals+pd.Timedelta('30min'))&(df3.stop>df3.intervals)
df4 = df3.groupby('intervals').agg({'late_start':'sum', 'soon_stop':'sum', 'someone':'sum'})
df4['worked_time'] = df4.someone*30 - df4.late_start - df4.soon_stop
df4
>>> df4
late_start soon_stop someone worked_time
intervals
2022-05-01 09:00:00 0.0 0.0 1 30.0
2022-05-01 09:30:00 7.0 0.0 2 53.0
2022-05-01 10:00:00 0.0 0.0 1 30.0
2022-05-01 10:30:00 0.0 0.0 1 30.0
2022-05-01 11:00:00 0.0 0.0 2 60.0
2022-05-01 11:30:00 0.0 0.0 2 60.0
2022-05-01 12:00:00 0.0 0.0 2 60.0
2022-05-01 12:30:00 0.0 0.0 2 60.0
2022-05-01 13:00:00 0.0 0.0 2 60.0
2022-05-01 13:30:00 0.0 0.0 2 60.0
2022-05-01 14:00:00 0.0 0.0 1 30.0
2022-05-01 14:30:00 0.0 0.0 1 30.0
2022-05-01 15:00:00 0.0 0.0 2 60.0
2022-05-01 15:30:00 0.0 22.0 2 38.0
2022-05-01 16:00:00 0.0 0.0 1 30.0
2022-05-01 16:30:00 0.0 0.0 1 30.0
2022-05-01 17:00:00 0.0 0.0 1 30.0
2022-05-01 17:30:00 0.0 0.0 1 30.0
2022-05-01 18:00:00 0.0 0.0 0 0.0
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a format with one row per day and one column per time of day, with the Value column filling the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried a pivot table without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
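A short sketch of the rename_axis options on a tiny frame. The three sample rows and the axis name "day" are made up for illustration; only the pivot_table call mirrors the one above.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the question's data.
df = pd.DataFrame({
    "Date": ["2022-01-01 10:00:00", "2022-01-01 10:30:00", "2022-01-02 10:00:00"],
    "Value": [7, 5, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

out = df.pivot_table(values="Value", index=df["Date"].dt.date,
                     columns=df["Date"].dt.time, fill_value=0)

# Remove both "Date" labels.
clean = out.rename_axis(index=None, columns=None)
# Or rename the row axis and remove only the column label.
named = out.rename_axis(index="day", columns=None)
print(named)
```

rename_axis returns a new frame, so out itself keeps both labels unless the result is assigned back.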
I have a pandas dataframe with over 100 timestamps that defines the non-working-time of a machine:
>>> off_time
date (index) start end
2020-07-04 18:00:00 23:50:00
2020-08-24 00:00:00 08:00:00
2020-08-24 14:00:00 16:00:00
2020-09-04 00:00:00 23:59:59
2020-10-05 18:00:00 22:00:00
I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:
>>> data
process-name start-time end-time duration
name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 day 14:00:00
name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 14:00:00
name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 12:00:00
name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 02:50:00
name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 day 09:00:00
To get the effective working time for each process in data, I have to subtract the non-working time from the duration. For example, for the process name5 I have to subtract the time between 18:00 and 20:00 on 2020-10-05, since it is planned as non-working time.
I wrote code with many if-else conditions, which I see as a potential source of errors. Is there a clean way to calculate the effective time without so many if-else branches? Any help would be greatly appreciated.
Set up sample data (I added a couple of rows to your samples to include some edge cases):
import datetime as dt
import pandas as pd

######### OFF TIMES
off = pd.DataFrame([
    ["2020-07-04", dt.time(18), dt.time(23,50)],
    ["2020-08-24", dt.time(0), dt.time(8)],
    ["2020-08-24", dt.time(14), dt.time(16)],
    ["2020-09-04", dt.time(0), dt.time(23,59,59)],
    ["2020-10-04", dt.time(15), dt.time(18)],
    ["2020-10-05", dt.time(18), dt.time(22)]], columns= ["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")
### Convert start and end times to datetimes in UTC timezone, since that is much
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string")+"+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string")+"+00:00")
off
>>
start end
date
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00
######### PROCESS TIMES
data = pd.DataFrame([
    ["name1","2020-07-17 08:00:00+00:00","2020-07-18 22:00:00+00:00"],
    ["name2","2020-08-24 01:00:00+00:00","2020-08-24 12:00:00+00:00"],
    ["name3","2020-09-20 07:00:00+00:00","2020-09-20 19:00:00+00:00"],
    ["name4","2020-09-04 16:00:00+00:00","2020-09-04 18:50:00+00:00"],
    ["name5","2020-10-04 11:00:00+00:00","2020-10-05 20:00:00+00:00"],
    ["name6","2020-09-03 10:00:00+00:00","2020-09-06 05:00:00+00:00"]
], columns = ["process", "start", "end"])
data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])
data["duration"] = data.end -data.start
data
>>
process start end duration
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00
As you can see, I added a row to off on 2020-10-04, so that name5 overlaps two off times; this could happen in your data and needs to be handled correctly. (It means that for the example in your question you now need to subtract 5 hours instead of 2.)
I also added the process name6, which spans multiple days.
This is my solution, which is applied to each row in data:
def get_relevant_off(pr):
    relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
    if not relevant.empty:
        relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
        relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
        to_subtract = (relevant.end - relevant.start).sum()
        return pr["duration"] - to_subtract
    else:
        return pr.duration
Explanation:
The first line of the function selects the rows of off that overlap the process row pr.
Off starts that are earlier than the process start are replaced with the process start, and likewise off ends later than the process end are replaced with the process end, since we don't want to sum the whole off time, only the part that coincides with the process.
The off durations are computed by subtracting the off starts from the off ends, and are then summed.
That sum is subtracted from the total duration.
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
process start end duration effective
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00 0 days 04:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00 0 days 00:00:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00 1 days 04:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00 1 days 19:00:01
Caveat: I am assuming that off times never overlap. Also, I liked this problem, but don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.
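If off times can overlap, one option is to merge them into disjoint intervals before applying the function. The following is only a sketch under assumptions: merge_overlaps is a hypothetical helper that is not part of the solution above, the 17:00 to 19:00 row is invented to create an overlap, and the timestamps are timezone-naive to keep the example short.

```python
import pandas as pd

def merge_overlaps(intervals: pd.DataFrame) -> pd.DataFrame:
    """Merge overlapping or touching [start, end] rows into disjoint rows."""
    s = intervals.sort_values("start").reset_index(drop=True)
    # Latest end among all earlier rows; a row opens a new block only
    # if it starts after that running maximum.
    prev_max_end = s["end"].cummax().shift()
    block = (s["start"] > prev_max_end).cumsum()
    merged = s.groupby(block).agg(start=("start", "min"), end=("end", "max"))
    return merged.reset_index(drop=True)

off_example = pd.DataFrame({
    "start": pd.to_datetime(["2020-10-04 15:00", "2020-10-04 17:00", "2020-10-05 18:00"]),
    "end": pd.to_datetime(["2020-10-04 18:00", "2020-10-04 19:00", "2020-10-05 22:00"]),
})
merged = merge_overlaps(off_example)
print(merged)
```

Without the merge, the hour between 17:00 and 18:00 on 2020-10-04 would be subtracted twice.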
I have an indexed dataframe (indexed by type, then date) and would like to compute the difference, in hours, between the end time of one row and the start time of the next row:
type date start_time end_time code
A 01/01/2018 01/01/2018 9:00 01/01/2018 14:00 525
01/02/2018 01/02/2018 5:00 01/02/2018 17:00 524
01/04/2018 01/04/2018 8:00 01/04/2018 10:00 528
B 01/01/2018 01/01/2018 5:00 01/01/2018 14:00 525
01/04/2018 01/04/2018 2:00 01/04/2018 17:00 524
01/05/2018 01/05/2018 7:00 01/05/2018 10:00 528
I would like to get the resulting table with a new column['interval']:
type date interval
A 01/01/2018 -
01/02/2018 15
01/04/2018 39
B 01/01/2018 -
01/04/2018 60
01/05/2018 14
The interval column is in hours
You can convert start_time and end_time to datetime format, then use apply to subtract the end_time of the previous row in each group (using groupby). To convert to hours, divide by pd.Timedelta('1 hour'):
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['interval'] = (df.groupby(level=0,sort=False).apply(lambda x: x.start_time-x.end_time.shift(1)) / pd.Timedelta('1 hour')).values
>>> df
start_time end_time code interval
type date
A 01/01/2018 2018-01-01 09:00:00 2018-01-01 14:00:00 525 NaN
01/02/2018 2018-01-02 05:00:00 2018-01-02 17:00:00 524 15.0
01/04/2018 2018-01-04 08:00:00 2018-01-04 10:00:00 528 39.0
B 01/01/2018 2018-01-01 05:00:00 2018-01-01 14:00:00 525 NaN
01/04/2018 2018-01-04 02:00:00 2018-01-04 17:00:00 524 60.0
01/05/2018 2018-01-05 07:00:00 2018-01-05 10:00:00 528 14.0
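As an alternative sketch, the previous row's end_time can be taken with groupby(...).shift(), which keeps the original index, so the subtraction aligns row by row without going through .values. This is a swapped-in variant and not the code above; the frame is rebuilt from the question's table and the date format string is an assumption about that data.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B", "B"],
    "date": ["01/01/2018", "01/02/2018", "01/04/2018",
             "01/01/2018", "01/04/2018", "01/05/2018"],
    "start_time": ["01/01/2018 9:00", "01/02/2018 5:00", "01/04/2018 8:00",
                   "01/01/2018 5:00", "01/04/2018 2:00", "01/05/2018 7:00"],
    "end_time": ["01/01/2018 14:00", "01/02/2018 17:00", "01/04/2018 10:00",
                 "01/01/2018 14:00", "01/04/2018 17:00", "01/05/2018 10:00"],
}).set_index(["type", "date"])

fmt = "%m/%d/%Y %H:%M"  # assumed month/day/year order
df["start_time"] = pd.to_datetime(df["start_time"], format=fmt)
df["end_time"] = pd.to_datetime(df["end_time"], format=fmt)

# Previous row's end_time within each type; the first row of each type gets NaT.
prev_end = df.groupby(level="type")["end_time"].shift()
df["interval"] = (df["start_time"] - prev_end) / pd.Timedelta(hours=1)
print(df)
```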