Summarize active time of user activity log data in hourly buckets - python

I am trying to find the hourly active time of a user from the user activity data. Below are the sample input and output.
Input
ID Status Datetime
A Online 24/09/2017 7:00:00 AM
A Offline 24/09/2017 7:30:00 AM
A Online 24/09/2017 9:30:00 AM
A Offline 24/09/2017 10:00:00 AM
B Online 24/09/2017 6:00:00 AM
B Offline 24/09/2017 7:30:00 AM
B Online 24/09/2017 9:10:00 AM
B Offline 24/09/2017 9:30:00 AM
B Online 24/09/2017 9:40:00 AM
B Offline 24/09/2017 10:00:00 AM
Expected Output
ID Hour_start Hour_end Online_time
A 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
A 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
A 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 1800
B 24/09/2017 6:00:00 AM 24/09/2017 7:00:00 AM 3600
B 24/09/2017 7:00:00 AM 24/09/2017 8:00:00 AM 1800
B 24/09/2017 8:00:00 AM 24/09/2017 9:00:00 AM 0
B 24/09/2017 9:00:00 AM 24/09/2017 10:00:00 AM 2400
Please help me out. TIA

My solution from Pandas Grouper calculate time elapsed between events gives correct results for this source data as well.
The result is:
ID Hour_start Hour_end Online_time
0 A 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
1 A 2017-09-24 08:00:00 2017-09-24 09:00:00 0
2 A 2017-09-24 09:00:00 2017-09-24 10:00:00 1800
3 B 2017-09-24 06:00:00 2017-09-24 07:00:00 3600
4 B 2017-09-24 07:00:00 2017-09-24 08:00:00 1800
5 B 2017-09-24 08:00:00 2017-09-24 09:00:00 0
6 B 2017-09-24 09:00:00 2017-09-24 10:00:00 2400
Just as your expected result, so I don't see any error in my solution.
If you have any source data for which my solution gives a wrong result, add that data to your post.
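For completeness, here is a minimal sketch of one way to compute the hourly buckets directly. It assumes the Online/Offline rows strictly alternate per ID and that every Online row has a matching Offline row (column names taken from the question):

import pandas as pd

data = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Status': ['Online', 'Offline'] * 5,
    'Datetime': ['24/09/2017 7:00:00 AM', '24/09/2017 7:30:00 AM',
                 '24/09/2017 9:30:00 AM', '24/09/2017 10:00:00 AM',
                 '24/09/2017 6:00:00 AM', '24/09/2017 7:30:00 AM',
                 '24/09/2017 9:10:00 AM', '24/09/2017 9:30:00 AM',
                 '24/09/2017 9:40:00 AM', '24/09/2017 10:00:00 AM']})
data['Datetime'] = pd.to_datetime(data['Datetime'], dayfirst=True)

rows = []
for uid, grp in data.sort_values('Datetime').groupby('ID'):
    starts = grp.loc[grp['Status'] == 'Online', 'Datetime'].tolist()
    stops = grp.loc[grp['Status'] == 'Offline', 'Datetime'].tolist()
    # hourly bucket edges covering this user's whole observed range
    edges = pd.date_range(min(starts).floor('H'), max(stops).ceil('H'), freq='H')
    for hour_start, hour_end in zip(edges[:-1], edges[1:]):
        # sum the overlap of every online interval with this hourly bucket
        online = sum((min(stop, hour_end) - max(start, hour_start)).total_seconds()
                     for start, stop in zip(starts, stops)
                     if stop > hour_start and start < hour_end)
        rows.append({'ID': uid, 'Hour_start': hour_start,
                     'Hour_end': hour_end, 'Online_time': int(online)})

result = pd.DataFrame(rows)
print(result)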

Related

Measure difference between timestamps using conditions - python

I'm trying to measure the difference between timestamps using certain conditions. Using the data below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the two timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have a different combination of items, e.g. A,B,C,D; A,B,D; A,D; etc.
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 10, 20, 20, 30],
                   'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00',
                                  '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00',
                                '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
                   'Item': ['A', 'B', 'D', 'A', 'D', 'A'],
                   })
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
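If you also need that per-ID result attached back to the original frame as a diff column on the Item == "D" rows, matching the intended output above, a small follow-up sketch:

per_id = (df2.query('Item == "D"')['Start Time']
          - df2.query('Item == "A"')['End Time'])
# map each row's ID to its per-ID timedelta, then blank out the non-"D" rows
df['diff'] = df['ID'].map(per_id).where(df['Item'] == 'D')
print(df)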
older answer
The issue is your fillna; you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
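If you really want a '-' placeholder, one option (a sketch; diff_display is just an illustrative name) is to format a string copy of the column after the timedelta math is done:

# for display only: render the timedeltas as strings and swap NaT for '-'
df['diff_display'] = df['diff'].astype(str).replace('NaT', '-')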
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]

Set the time as 0 for the row having a value in a specific column using python

I have a dataset with three inputs X1, X2, X3, including date and time.
The X3 column contains the values 0 and 5. What I want to code is: take the time of the first row containing 5 in the X3 column as the start time, and treat that time as time 0.
The other rows containing 5 in X3 are not changed. I only want the first such time of every day to be set as time 0.
date time x3
10/3/2018 6:15:00 0
10/3/2018 6:45:00 5
10/3/2018 7:45:00 0
10/3/2018 9:00:00 0
10/3/2018 9:25:00 0
10/3/2018 9:30:00 0
10/3/2018 11:00:00 0
10/3/2018 11:30:00 0
10/3/2018 13:30:00 0
10/3/2018 13:50:00 5
10/3/2018 15:00:00 0
10/3/2018 15:25:00 0
10/3/2018 16:25:00 0
10/3/2018 18:00:00 0
10/3/2018 19:00:00 0
10/3/2018 19:30:00 0
10/3/2018 20:00:00 0
10/3/2018 22:05:00 0
10/3/2018 22:15:00 5
10/3/2018 23:40:00 0
10/4/2018 6:58:00 5
10/4/2018 13:00:00 0
10/4/2018 16:00:00 0
10/4/2018 17:00:00 0
As you can see, the X3 column contains the values 0 and 5, along with date and time.
Taking only the first 5 value of each day:
desired output
10/3/2018 6:45:00 5 start time 6:45:00 convert 00:00:00
10/3/2018 13:50:00 5 Not taking
10/3/2018 22:15:00 5 Not taking
10/4/2018 6:58:00 5 start time 6:58:00 convert 00:00:00
I just want to code it like this. Can anyone help me solve this problem?
I tried the code below, but it gives the time difference for every row. I don't want the per-row differences; I just want the start time to be read and converted to time 0:
df['time_diff']= pd.to_datetime(df['date'] + " " + df['time'],
format='%d/%m/%Y %H:%M:%S', dayfirst=True)
mask = df['x3'].ne(0)
df['Duration'] = df[mask].groupby(['date','x3'])['time_diff'].transform('first')
df['Duration'] = df['time_diff'].sub(df['Duration']).dt.total_seconds().div(3600)
This gave me the duration for each 5 value. Here is what I exactly want.
To filter only the first 5 value per group, add DataFrame.drop_duplicates:
df['time_diff']= pd.to_datetime(df['date'] + " " + df['time'],
format='%d/%m/%Y %H:%M:%S', dayfirst=True)
mask = df['x3'].eq(5)
df['Duration'] = (df[mask].drop_duplicates(['date','x3'])
.groupby(['date','x3'])['time_diff']
.transform('first'))
df['Duration'] = df['time_diff'].sub(df['Duration']).dt.total_seconds().div(3600)
print (df)
date time x3 time_diff Duration
0 10/3/2018 6:15:00 0 2018-03-10 06:15:00 NaN
1 10/3/2018 6:45:00 5 2018-03-10 06:45:00 0.0
2 10/3/2018 7:45:00 0 2018-03-10 07:45:00 NaN
3 10/3/2018 9:00:00 0 2018-03-10 09:00:00 NaN
4 10/3/2018 9:25:00 0 2018-03-10 09:25:00 NaN
5 10/3/2018 9:30:00 0 2018-03-10 09:30:00 NaN
6 10/3/2018 11:00:00 0 2018-03-10 11:00:00 NaN
7 10/3/2018 11:30:00 0 2018-03-10 11:30:00 NaN
8 10/3/2018 13:30:00 0 2018-03-10 13:30:00 NaN
9 10/3/2018 13:50:00 5 2018-03-10 13:50:00 NaN
10 10/3/2018 15:00:00 0 2018-03-10 15:00:00 NaN
11 10/3/2018 15:25:00 0 2018-03-10 15:25:00 NaN
12 10/3/2018 16:25:00 0 2018-03-10 16:25:00 NaN
13 10/3/2018 18:00:00 0 2018-03-10 18:00:00 NaN
14 10/3/2018 19:00:00 0 2018-03-10 19:00:00 NaN
15 10/3/2018 19:30:00 0 2018-03-10 19:30:00 NaN
16 10/3/2018 20:00:00 0 2018-03-10 20:00:00 NaN
17 10/3/2018 22:05:00 0 2018-03-10 22:05:00 NaN
18 10/3/2018 22:15:00 5 2018-03-10 22:15:00 NaN
19 10/3/2018 23:40:00 0 2018-03-10 23:40:00 NaN
20 10/4/2018 6:58:00 5 2018-04-10 06:58:00 0.0
21 10/4/2018 13:00:00 0 2018-04-10 13:00:00 NaN
22 10/4/2018 16:00:00 0 2018-04-10 16:00:00 NaN
23 10/4/2018 17:00:00 0 2018-04-10 17:00:00 NaN
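If all you actually need is the list shown in the desired output, i.e. only the first x3 == 5 row of each day flagged as the zero time, a minimal sketch, assuming the rows are already in chronological order (converted is just an illustrative column name):

anchors = (df[df['x3'].eq(5)]
           .drop_duplicates('date')          # keep the first 5 row per date
           .assign(converted='00:00:00'))    # mark its time as the 0 time
print(anchors[['date', 'time', 'x3', 'converted']])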

How to fetch hours from m8[ns] object in pandas?

I have a dataframe as shown below:
import pandas as pd

df = pd.DataFrame({'time': ['2166-01-09 14:00:00', '2166-01-09 14:08:00', '2166-01-09 16:00:00', '2166-01-09 20:00:00',
                            '2166-01-09 04:00:00', '2166-01-10 05:00:00', '2166-01-10 06:00:00', '2166-01-10 07:00:00', '2166-01-10 11:00:00',
                            '2166-01-10 11:30:00', '2166-01-10 12:00:00', '2166-01-10 13:00:00', '2166-01-10 13:30:00']})
df['time'] = pd.to_datetime(df['time'])
I am trying to find the time difference between rows, for which I did the below:
df['time2'] = df['time'].shift(-1)
df['tdiff'] = (df['time2'] - df['time'])
So my result is a new tdiff column of timedelta (m8[ns]) values.
I found out that there is an accessor called dt.days, and I tried
df['tdiff'].dt.days
but it only gives the day component, and I am looking for something like the hours component.
I would like to have my output in hours, as shown below.
I am not sure how to calculate the hour equivalent of the negative time in row 3; that might be a data issue.
In pandas it is possible to convert timedeltas to seconds with Series.dt.total_seconds and then divide by 3600:
df['tdiff'] = (df['time2'] - df['time']).dt.total_seconds() / 3600
print (df)
time time2 tdiff
0 2166-01-09 14:00:00 2166-01-09 14:08:00 0.133333
1 2166-01-09 14:08:00 2166-01-09 16:00:00 1.866667
2 2166-01-09 16:00:00 2166-01-09 20:00:00 4.000000
3 2166-01-09 20:00:00 2166-01-09 04:00:00 -16.000000
4 2166-01-09 04:00:00 2166-01-10 05:00:00 25.000000
5 2166-01-10 05:00:00 2166-01-10 06:00:00 1.000000
6 2166-01-10 06:00:00 2166-01-10 07:00:00 1.000000
7 2166-01-10 07:00:00 2166-01-10 11:00:00 4.000000
8 2166-01-10 11:00:00 2166-01-10 11:30:00 0.500000
9 2166-01-10 11:30:00 2166-01-10 12:00:00 0.500000
10 2166-01-10 12:00:00 2166-01-10 13:00:00 1.000000
11 2166-01-10 13:00:00 2166-01-10 13:30:00 0.500000
12 2166-01-10 13:30:00 NaT NaN
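If you literally want the hours field split out rather than fractional hours, Series.dt.components is another option; a sketch reusing the frame above:

# break each timedelta into days/hours/minutes/... fields
tdiff = df['time2'] - df['time']
print(tdiff.dt.components[['days', 'hours', 'minutes']])
# note: for the negative timedelta in row 3 the sign sits in the days field
# (it is stored as -1 days +08:00:00), so total_seconds() / 3600 is usually simpler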

Pandas DataFrame Calculate time difference between 2 columns on specific time range

I want to calculate the time difference between two columns within a specific time range.
I tried df.between_time, but it only works on the index.
Ex. time range: between 18:00 and 08:00
Data :
start stop
0 2018-07-16 16:00:00 2018-07-16 20:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00
2 2018-07-13 17:54:00 2018-07-13 21:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00
4 2018-07-20 00:21:00 2018-07-20 04:21:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00
Expect Result :
start stop time_diff
0 2018-07-16 16:00:00 2018-07-16 20:00:00 02:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00 0
2 2018-07-13 17:54:00 2018-07-13 21:54:00 03:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00 0
4 2018-07-20 00:21:00 2018-07-20 04:21:00 04:00:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00 14:00:00
Note: if time_diff > 1 day, I already deal with that case.
Question: should I build a function to do this, or is there a pandas built-in function for it? Any help or guidance would be appreciated.
I think this can be a solution:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'time1': pd.to_datetime(['2018-07-16 16:00:00', '2018-07-11 08:03:00',
                                             '2018-07-13 17:54:00', '2018-07-14 13:09:00',
                                             '2018-07-20 00:21:00', '2018-07-20 17:00:00']),
                    'time2': pd.to_datetime(['2018-07-16 20:00:00', '2018-07-11 12:03:00',
                                             '2018-07-13 21:54:00', '2018-07-14 17:09:00',
                                             '2018-07-20 04:21:00', '2018-07-21 09:00:00'])})
time1_date = tmp.time1.dt.date.astype(str)
tmp['rule18'], tmp['rule08'] = pd.to_datetime(time1_date + ' 18:00:00'), pd.to_datetime(time1_date + ' 08:00:00')
# if stop exceeds 18:00:00, compute time difference from this hour
tmp['time_diff_rule1'] = np.where(tmp.time2 > tmp.rule18, (tmp.time2 - tmp.rule18), (tmp.time2 - tmp.time1))
# rearrange the dataframe with your second rule
tmp['time_diff_rule2'] = np.where((tmp.time2 < tmp.rule18) & (tmp.time1 > tmp.rule08), 0, tmp['time_diff_rule1'])
time_diff_rule1 time_diff_rule2
0 02:00:00 02:00:00
1 04:00:00 00:00:00
2 03:54:00 03:54:00
3 04:00:00 00:00:00
4 04:00:00 04:00:00
5 15:00:00 15:00:00
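A small follow-up sketch, if you want to keep only rule 2 as the final time_diff and drop the helper columns (names follow the code above):

tmp = (tmp.drop(columns=['rule18', 'rule08', 'time_diff_rule1'])
          .rename(columns={'time_diff_rule2': 'time_diff'}))
print(tmp)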

Incrementing dates in pandas groupby

I'm building a basic rota/schedule for staff, and I have a DataFrame from a MySQL cursor which gives a list of IDs, dates, and classes:
id the_date class
0 195593 2017-09-12 14:00:00 3
1 193972 2017-09-13 09:15:00 2
2 195594 2017-09-13 14:00:00 3
3 195595 2017-09-15 14:00:00 3
4 193947 2017-09-16 17:30:00 3
5 195627 2017-09-17 08:00:00 2
6 193948 2017-09-19 11:30:00 2
7 195628 2017-09-21 08:00:00 2
8 193949 2017-09-21 11:30:00 2
9 195629 2017-09-24 08:00:00 2
10 193950 2017-09-24 10:00:00 2
11 193951 2017-09-27 11:30:00 2
12 195644 2017-09-28 06:00:00 1
13 194400 2017-09-28 08:00:00 1
14 195630 2017-09-28 08:00:00 2
15 193952 2017-09-29 11:30:00 2
16 195631 2017-10-01 08:00:00 2
17 194401 2017-10-06 08:00:00 1
18 195645 2017-10-06 10:00:00 1
19 195632 2017-10-07 13:30:00 3
If the class == 1, I need that instance duplicated 5 times.
first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]
first_class_replicated = pd.concat([first_class] * 5, ignore_index=True).sort_values(['the_date'])
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-28 06:00:00 1
4 195644 2017-09-28 06:00:00 1
12 195644 2017-09-28 06:00:00 1
8 195644 2017-09-28 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-28 08:00:00 1
9 194400 2017-09-28 08:00:00 1
5 194400 2017-09-28 08:00:00 1
1 194400 2017-09-28 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-06 08:00:00 1
10 194401 2017-10-06 08:00:00 1
14 194401 2017-10-06 08:00:00 1
2 194401 2017-10-06 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-06 10:00:00 1
15 195645 2017-10-06 10:00:00 1
7 195645 2017-10-06 10:00:00 1
19 195645 2017-10-06 10:00:00 1
I then merge non_first_class and first_class_replicated. Before that though, I need the dates in first_class_replicated to increment by one day, grouped by id. Below is how I need it to look. Is there an elegant Pandas solution to this, or should I be looking at looping over a groupby series to modify the dates?
Desired:
id
0 195644 2017-09-28 6:00:00
16 195644 2017-09-29 6:00:00
4 195644 2017-09-30 6:00:00
12 195644 2017-10-01 6:00:00
8 195644 2017-10-02 6:00:00
17 194400 2017-09-28 8:00:00
13 194400 2017-09-29 8:00:00
9 194400 2017-09-30 8:00:00
5 194400 2017-10-01 8:00:00
1 194400 2017-10-02 8:00:00
6 194401 2017-10-06 8:00:00
18 194401 2017-10-07 8:00:00
10 194401 2017-10-08 8:00:00
14 194401 2017-10-09 8:00:00
2 194401 2017-10-10 8:00:00
11 195645 2017-10-06 10:00:00
3 195645 2017-10-07 10:00:00
15 195645 2017-10-08 10:00:00
7 195645 2017-10-09 10:00:00
19 195645 2017-10-10 10:00:00
You can use cumcount to count the occurrences per id, then convert with to_timedelta and add it to the date column:
import numpy as np

# another solution for the repeat
first_class_replicated = (first_class.loc[np.repeat(first_class.index, 5)]
                          .sort_values(['the_date']))
df1 = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(df1, unit='D')
print (first_class_replicated)
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-29 06:00:00 1
4 195644 2017-09-30 06:00:00 1
12 195644 2017-10-01 06:00:00 1
8 195644 2017-10-02 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-29 08:00:00 1
9 194400 2017-09-30 08:00:00 1
5 194400 2017-10-01 08:00:00 1
1 194400 2017-10-02 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-07 08:00:00 1
10 194401 2017-10-08 08:00:00 1
14 194401 2017-10-09 08:00:00 1
2 194401 2017-10-10 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-07 10:00:00 1
15 195645 2017-10-08 10:00:00 1
7 195645 2017-10-09 10:00:00 1
19 195645 2017-10-10 10:00:00 1
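And the final merge the question describes, putting the replicated class-1 rows back with the rest, as a sketch (variable names taken from the question):

# recombine with the non-class-1 rows and restore chronological order
final = (pd.concat([first_class_replicated, non_first_class])
           .sort_values('the_date')
           .reset_index(drop=True))
print(final)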
