Subtracting between rows of different columns in an indexed dataframe in python - python

I have a DataFrame indexed by type and then date. Within each type, I would like to subtract the end time of the previous row from the start time of the current row, expressed in hours:
type  date        start_time        end_time          code
A     01/01/2018  01/01/2018 9:00   01/01/2018 14:00  525
      01/02/2018  01/02/2018 5:00   01/02/2018 17:00  524
      01/04/2018  01/04/2018 8:00   01/04/2018 10:00  528
B     01/01/2018  01/01/2018 5:00   01/01/2018 14:00  525
      01/04/2018  01/04/2018 2:00   01/04/2018 17:00  524
      01/05/2018  01/05/2018 7:00   01/05/2018 10:00  528
I would like to get the resulting table with a new column, interval:
type  date        interval
A     01/01/2018  -
      01/02/2018  15
      01/04/2018  39
B     01/01/2018  -
      01/04/2018  60
      01/05/2018  14
The interval column is in hours.

You can convert start_time and end_time to datetime format, then use apply to subtract the end_time of the previous row in each group (using groupby). To convert to hours, divide by pd.Timedelta('1 hour'):
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['interval'] = (df.groupby(level=0,sort=False).apply(lambda x: x.start_time-x.end_time.shift(1)) / pd.Timedelta('1 hour')).values
>>> df
                          start_time            end_time  code  interval
type date
A    01/01/2018  2018-01-01 09:00:00 2018-01-01 14:00:00   525       NaN
     01/02/2018  2018-01-02 05:00:00 2018-01-02 17:00:00   524      15.0
     01/04/2018  2018-01-04 08:00:00 2018-01-04 10:00:00   528      39.0
B    01/01/2018  2018-01-01 05:00:00 2018-01-01 14:00:00   525       NaN
     01/04/2018  2018-01-04 02:00:00 2018-01-04 17:00:00   524      60.0
     01/05/2018  2018-01-05 07:00:00 2018-01-05 10:00:00   528      14.0
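A possible alternative to apply, sketched here under the assumption that the index levels are named type and date as in the question, is to shift end_time within each group and subtract the whole column at once. The sample frame is rebuilt from the question's table, and prev_end is a name introduced only for this illustration.

```python
import pandas as pd

# Sample data copied from the question; the index is (type, date).
df = pd.DataFrame(
    {
        "type": ["A", "A", "A", "B", "B", "B"],
        "date": ["01/01/2018", "01/02/2018", "01/04/2018",
                 "01/01/2018", "01/04/2018", "01/05/2018"],
        "start_time": pd.to_datetime([
            "2018-01-01 09:00", "2018-01-02 05:00", "2018-01-04 08:00",
            "2018-01-01 05:00", "2018-01-04 02:00", "2018-01-05 07:00",
        ]),
        "end_time": pd.to_datetime([
            "2018-01-01 14:00", "2018-01-02 17:00", "2018-01-04 10:00",
            "2018-01-01 14:00", "2018-01-04 17:00", "2018-01-05 10:00",
        ]),
    }
).set_index(["type", "date"])

# Previous row's end_time within each type; the first row of a group gets NaT.
prev_end = df.groupby(level="type")["end_time"].shift()

# Dividing a timedelta column by a one-hour Timedelta yields float hours.
df["interval"] = (df["start_time"] - prev_end) / pd.Timedelta(hours=1)
print(df["interval"])
```

Because the shifted column keeps the original index, the result aligns row by row and no .values call is needed.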

Related

Pandas : Dataframe Output System Down Time

I am a beginner in Python. These readings are extracted from sensors that report to the system every 20 minutes. I would like to find the total downtime of each outage, from the time it starts until the time the sensor recovers.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as follows:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
First, let's break them into groups:
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here:
# total_seconds() also counts whole days, which .dt.seconds would drop
df2.down_time = df2.down_time.dt.total_seconds() / 60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.
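The grouping idea from the first answer can also be written without a temporary group column. The following is a minimal sketch, assuming readings arrive every 20 minutes while the sensor is down; the sample uses the first nine timestamps from the question, and the names outage, out and down_minutes are introduced only for this illustration.

```python
import pandas as pd

# First nine readings from the question (two separate outages).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-01 09:00", "2022-01-01 09:20", "2022-01-01 09:40",
        "2022-01-01 10:00", "2022-01-01 10:20", "2022-01-01 10:40",
        "2022-01-01 12:40", "2022-01-01 13:00", "2022-01-01 13:20",
    ])
})

# A new outage starts wherever the gap to the previous reading exceeds 20 minutes.
# The cumulative sum of those flags gives every outage its own integer label.
outage = df["date"].diff().gt(pd.Timedelta(minutes=20)).cumsum().rename("outage")

# First and last reading of each outage.
out = (
    df.groupby(outage)["date"]
      .agg(start_time="min", end_time="max")
      .reset_index(drop=True)
)
out["down_minutes"] = (out["end_time"] - out["start_time"]).dt.total_seconds() / 60
print(out)
```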

Sum of individual labels over a month of granular data

I have a dataframe which contains life logging data gathered over several years from 44 unique individuals.
Int64Index: 77171 entries, 0 to 4279
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start 77171 non-null datetime64[ns]
1 end 77171 non-null datetime64[ns]
2 labelName 77171 non-null category
3 id 77171 non-null int64
The start column contains granular datetimes of the format 2020-11-01 11:00:00, in intervals of 30 minutes. The labelName column has 14 different categories.
Categories (14, object): ['COOK', 'EAT', 'GO WALK', 'GO TO BATHROOM', ..., 'DRINK', 'WAKE UP', 'SLEEP', 'WATCH TV']
Here's a sample user's head, which is [2588 rows x 4 columns] and spans from 2020 to 2021. There are also gaps in the data, occasionally.
start end labelName id
0 2020-08-05 00:00:00 2020-08-05 00:30:00 GO TO BATHROOM 486
1 2020-08-05 06:00:00 2020-08-05 06:30:00 WAKE UP 486
2 2020-08-05 09:00:00 2020-08-05 09:30:00 COOK 486
3 2020-08-05 11:00:00 2020-08-05 11:30:00 EAT 486
4 2020-08-05 12:00:00 2020-08-05 12:30:00 EAT 486
.. ... ... ... ...
859 2021-03-10 12:30:00 2021-03-10 13:00:00 GO TO BATHROOM 486
861 2021-03-10 13:30:00 2021-03-10 14:00:00 GO TO BATHROOM 486
862 2021-03-10 18:30:00 2021-03-10 19:00:00 COOK 486
864 2021-03-11 08:00:00 2021-03-11 08:30:00 EAT 486
865 2021-03-11 12:30:00 2021-03-11 13:00:00 COOK 486
I want the sum of each unique labelName per user per month, but I'm not sure how to do this.
I would first split the data frame by id, which is easy. But how do you split the start datetimes when they are recorded every 30 minutes over several years of data, and then create 14 new columns that record the sums?
The final data frame might look something like this (with fake values):
user  month  SLEEP  ...  WATCH TV
486   jun20  324    ...  23
486   jul20  234    ...  12
The use-case for this data frame is training a few statistical and machine-learning models.
How do I achieve something like this?
Because each record covers 30 minutes, you can count the records per user and per month with pd.crosstab, converting the timestamps to monthly periods with Series.dt.to_period, and then multiply the counts by 0.5 to get hours.
A record can straddle two months, for example start 2020-09-30 23:30:00 and end 2020-10-01 00:00:00. To count such a record for October, use df['end'] in crosstab; to count it for September, use df['start']:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df1 = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName']).mul(0.5)
.rename_axis(columns=None, index=['id','month'])
.rename(columns=str)
.reset_index()
.assign(month=lambda x:x['month'].dt.strftime('%b%Y')))
print (df1)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0.0 0.0 1.0 0.5 1.0
1 650 Mar2021 0.5 1.0 0.5 0.5 0.0
For output in units of 30 minutes:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName'])
.rename_axis(columns=None, index=['id','month'])
.reset_index()
.assign(month=lambda x:x['month'].dt.strftime('%b%Y')))
print (df)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0 0 2 1 2
1 650 Mar2021 1 2 1 1 0
Use:
from collections import Counter
df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName'].apply(lambda x: Counter(x)).reset_index().pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0)
Demonstration:
#Preparing Data
string = """start end labelName id
2020-09-21 14:30:00 2020-09-21 15:00:00 WAKE UP 650
2020-09-21 15:00:00 2020-09-21 15:30:00 GO TO BATHROOM 650
2020-09-21 15:30:00 2020-09-21 16:00:00 SLEEP 650
2020-09-29 17:00:00 2020-09-29 17:30:00 WAKE UP 650
2020-09-29 17:30:00 2020-09-29 18:00:00 GO TO BATHROOM 650
2021-03-11 13:00:00 2021-03-11 13:30:00 EAT 650
2021-03-11 14:30:00 2021-03-11 15:00:00 GO TO BATHROOM 650
2021-03-11 15:00:00 2021-03-11 15:30:00 COOK 650
2021-03-11 15:30:00 2021-03-11 16:00:00 EAT 650
2021-03-11 16:00:00 2021-03-11 16:30:00 SLEEP 650"""
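import re
import pandas as pd
# Each row is "<start date> <start time> <end date> <end time> <label> <id>";
# the label may itself contain spaces, so a plain split on ' ' is not enough.
pattern = re.compile(r'(\S+ \S+) (\S+ \S+) (.+) (\d+)$')
lines = string.split('\n')
df = pd.DataFrame([pattern.match(x).groups() for x in lines[1:]], columns=lines[0].split())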
df['start'] = pd.to_datetime(df['start'])
#Solution
from collections import Counter
df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName'].apply(lambda x: Counter(x)).reset_index().pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0)
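The same per-user, per-month count can also be obtained with groupby, size and unstack. This is a sketch of an equivalent formulation rather than either answerer's method, assuming each row represents one 30-minute slot; the five sample rows are taken from the demonstration data, and month, counts and hours are names introduced here.

```python
import pandas as pd

# Five rows taken from the demonstration data above.
df = pd.DataFrame({
    "start": pd.to_datetime([
        "2020-09-21 14:30", "2020-09-21 15:00", "2020-09-29 17:00",
        "2021-03-11 13:00", "2021-03-11 15:30",
    ]),
    "labelName": ["WAKE UP", "GO TO BATHROOM", "WAKE UP", "EAT", "EAT"],
    "id": [650, 650, 650, 650, 650],
})

# Monthly period of each record, used as a grouping key.
month = df["start"].dt.to_period("M").rename("month")

# Count rows per (id, month, label), then spread the labels into columns.
counts = (
    df.groupby(["id", month, "labelName"])
      .size()
      .unstack("labelName", fill_value=0)
)

# Each row is assumed to cover 30 minutes, so half the count is the hours spent.
hours = counts * 0.5
print(hours)
```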

Python datetime problem converting time format

I would like to convert the following time format, which is stored in a pandas DataFrame column:
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
I would like to transform the previous time format into a standard HH:MM time format, as follows:
01:00
02:00
03:00
...
15:00
16:00
...
22:00
23:00
00:00
How can I do it in python?
Thank you in advance
This will give you a df with a datetime64[ns] and object dtype column for your data:
import pandas as pd
df = pd.read_csv('hm.txt', sep=r"[ ]{2,}", engine='python', header=None, names=['pre'])
df['pre_1'] = df['pre'].astype(str).str.replace('00', '')
df['datetime_dtype'] = pd.to_datetime(df['pre_1'], format='%H', exact=False)
df['str_dtype'] = df['datetime_dtype'].astype(str).str[11:16]
print(df.head(5))
pre datetime_dtype str_dtype
0 1 1900-01-01 01:00:00 01:00
1 2 1900-01-01 02:00:00 02:00
2 3 1900-01-01 03:00:00 03:00
3 4 1900-01-01 04:00:00 04:00
4 5 1900-01-01 05:00:00 05:00
print(df.dtypes)
pre object
datetime_dtype datetime64[ns]
str_dtype object
dtype: object
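If only the HH:MM string is needed, a plain string-formatting approach may be enough. The sketch below assumes the column holds integers in HHMM form and that 2400 should map to 00:00, as in the question's expected output; the column name time_str and the five sample values are chosen for illustration.

```python
import pandas as pd

# A few representative values from the question, including the 2400 edge case.
df = pd.DataFrame({"pre": [100, 900, 1500, 2300, 2400]})

# Wrap 2400 around to 0 (midnight), then zero-pad to four digits: 100 -> "0100".
hhmm = (df["pre"].astype(int) % 2400).astype(str).str.zfill(4)

# Insert the colon between the hour and minute digits.
df["time_str"] = hhmm.str[:2] + ":" + hhmm.str[2:]
print(df)
```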

Fixing inconsistent formatting of 24-hour to 12-hour

I essentially need to measure how much each employee gets paid during each hour of work. There was some data cleaning to do, so I'm trying to make the formatting consistent.
It is a homework problem and it's proving tough. I am new to Python, so please feel free to compress the code. I'm trying to use the pandas library.
csv file in pandas
break_notes end_time pay_rate start_time
0 15-18 23:00 10.0 10:00
1 18.30-19.00 23:00 12.0 18:00
2 4PM-5PM 22:30 14.0 12:00
3 3-4 18:00 10.0 09:00
4 4-4.10PM 23:00 20.0 09:00
5 15 - 17 23:00 10.0 11:00
6 11 - 13 16:00 10.0 10:00
import pandas as pd
import datetime
import numpy as np
work_shifts = pd.read_csv('work_shifts.csv')
break_shifts = work_shifts['break_notes'].str.extract(r'(?P<start>[\d\.]+)?\D*(?P<end>[\d\.]+)?')
print(work_shifts)
# Normalise the break start: "18.30" -> "18:30", "15" -> "15:00"
for i in range(len(break_shifts['start'])):
    if '.' not in break_shifts['start'][i]:
        break_shifts['start'][i] = break_shifts['start'][i] + ':00'
    else:
        break_shifts['start'][i] = break_shifts['start'][i].replace('.',':')
# Same normalisation for the break end
for i in range(len(break_shifts['end'])):
    if '.' in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i].replace('.',':')
    elif '.' not in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i] + ':00'
# Parse the "H:M" strings into datetime.time objects
for i in range(len(break_shifts['end'])):
    break_shifts['end'][i] = datetime.datetime.strptime(break_shifts['end'][i], '%H:%M').time()
    break_shifts['start'][i] = datetime.datetime.strptime(break_shifts['start'][i], '%H:%M').time()
work_shifts[['start_break','end_break']] = break_shifts[['start', 'end']]
for i in range(len(work_shifts['end_time'])):
    work_shifts['end_time'][i] = datetime.datetime.strptime(work_shifts['end_time'][i], '%H:%M').time()
for i in range(len(work_shifts['start_time'])):
    work_shifts['start_time'][i] = datetime.datetime.strptime(work_shifts['start_time'][i], '%H:%M').time()
print(work_shifts)
this is the result
break_notes end_time pay_rate start_time start_break end_break
0 15-18 23:00:00 10.0 10:00:00 15:00:00 18:00:00
1 18.30-19.00 23:00:00 12.0 18:00:00 18:30:00 19:00:00
2 4PM-5PM 22:30:00 14.0 12:00:00 04:00:00 05:00:00
3 3-4 18:00:00 10.0 09:00:00 03:00:00 04:00:00
4 4-4.10PM 23:00:00 20.0 09:00:00 04:00:00 04:10:00
5 15 - 17 23:00:00 10.0 11:00:00 15:00:00 17:00:00
6 11 - 13 16:00:00 10.0 10:00:00 11:00:00 13:00:00
I tried adding the times, but they are of inconsistent types. If there is a different approach, please provide guidance. I need to calculate how many employees are working at each time, and then how much pay is given to the employees per hour.
My approach was to convert the break notes into times, then convert 12-hour values to 24-hour, provided both end_break and start_break were before datetime.time(12, 0, 0).
I'm not sure how to calculate the money per hour. Maybe using if statements?
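As a rough sketch of the conversion rule described above (add 12 hours when both break times fall before noon), the break notes can be turned into minutes after midnight without explicit loops. The helper to_minutes and the choice to store times as integer minutes are assumptions made for this illustration, and the rule itself is only a heuristic: it would misread a genuine morning break.

```python
import pandas as pd

# Break notes copied from the question's CSV.
work_shifts = pd.DataFrame({
    "break_notes": ["15-18", "18.30-19.00", "4PM-5PM", "3-4",
                    "4-4.10PM", "15 - 17", "11 - 13"],
})

def to_minutes(text):
    """Convert a string such as '18.30' or '4' to minutes after midnight."""
    hours, _, minutes = text.replace(".", ":").partition(":")
    return int(hours) * 60 + int(minutes or 0)

# Pull the first and second number out of each note, ignoring "PM", "-" and spaces.
parts = work_shifts["break_notes"].str.extract(r"(?P<start>[\d.]+)\D*(?P<end>[\d.]+)")
break_start = parts["start"].map(to_minutes)
break_end = parts["end"].map(to_minutes)

# Heuristic from the question: if both times are before noon, treat them as PM.
noon = 12 * 60
is_pm = (break_start < noon) & (break_end < noon)
work_shifts["start_break"] = break_start + is_pm * noon
work_shifts["end_break"] = break_end + is_pm * noon
print(work_shifts)
```

Keeping the times as integer minutes makes later arithmetic, such as break length or overlap with a given hour, a matter of plain subtraction.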

Pandas : merge on date and hour from datetime index

I have two data frames like the following. Data frame A has datetimes with minutes, while data frame B mostly has whole hours.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on date and hour, so that every row in data frame A is filled with the matching count from B.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A,B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it is a very long process. Is there a more efficient way to perform the same operation using pandas time series or date functionality?
If you only need to append one column from B to A, use map, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
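A third option, sketched here as an assumption rather than something taken from the answer, is to keep pd.merge but join on a single floored timestamp instead of separate date and hour columns. The frames are rebuilt from the question's tables, and hour is a hypothetical helper column that is dropped after the merge.

```python
import pandas as pd

A = pd.DataFrame({
    "dataDate": pd.to_datetime([
        "2018-09-30 11:20", "2018-10-01 12:40", "2018-10-02 07:00",
        "2018-10-27 12:50", "2018-11-28 19:45",
    ]),
    "original": [3, 10, 5, 5, 7],
})
B = pd.DataFrame({
    "dataDate": pd.to_datetime([
        "2018-09-30 10:00", "2018-10-01 12:00", "2018-10-02 07:00",
        "2018-10-27 12:00", "2018-11-28 19:05",
    ]),
    "count": [300, 50, 120, 234, 714],
})

# "hour" is a hypothetical helper column: each timestamp floored to the hour.
left = A.assign(hour=A["dataDate"].dt.floor("h"))
right = B.assign(hour=B["dataDate"].dt.floor("h"))[["hour", "count"]]

# Left merge keeps every row of A; rows without a matching hour in B get NaN.
merged = left.merge(right, on="hour", how="left").drop(columns="hour")
print(merged)
```

The first row has no match because B's reading for that day falls in the 10:00 hour, not the 11:00 hour.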
