I have a dataframe that consists of 3 years of data with two columns: remaining useful life (rul) and predicted remaining useful life (pred_diff).
I am aggregating rul and pred_diff over the 3 years of data for each machineID at the maximum date it has. The original dataframe looks like this:
rul pred_diff machineID datetime
10476749 870 312.207825 408 2021-05-25 00:00:00
11452943 68 288.517578 447 2023-03-01 12:00:00
12693829 381 273.159698 493 2021-09-16 16:00:00
3413787 331 291.326416 133 2022-10-26 12:00:00
464093 77 341.506195 19 2023-10-10 16:00:00
... ... ... ... ...
11677555 537 310.586090 456 2022-04-07 00:00:00
2334804 551 289.307129 92 2021-09-04 20:00:00
5508311 35 293.721771 214 2023-01-06 04:00:00
12319704 348 322.199219 479 2021-11-11 20:00:00
4777501 87 278.089417 186 2021-06-29 12:00:00
1287421 rows × 4 columns
And I am aggregating it with this code:
y_test_grp = y_test.groupby('machineID').agg({'datetime':'max', 'rul':'mean', 'pred_diff':'mean'})[['datetime','rul', 'pred_diff']].reset_index()
which gives the following output:
machineID datetime rul pred_diff
0 1 2023-10-03 20:00:00 286.817681 266.419401
1 2 2023-11-14 00:00:00 225.561953 263.372531
2 3 2023-10-25 00:00:00 304.736237 256.933351
3 4 2023-01-13 12:00:00 204.084899 252.476066
4 5 2023-09-07 00:00:00 208.702431 252.487156
... ... ... ... ...
495 496 2023-10-11 00:00:00 302.445285 298.836798
496 497 2023-08-26 04:00:00 281.601613 263.479885
497 498 2023-11-28 04:00:00 292.593906 263.985034
498 499 2023-06-29 20:00:00 260.887529 263.494844
499 500 2023-11-08 20:00:00 160.223614 257.326034
500 rows × 4 columns
Since this is grouped only by machineID, it gives just 500 rows, which is too few. I want to aggregate rul and pred_diff on a weekly basis so that for each machineID I get 52 weeks * 3 years = 156 rows. I am not able to identify which function to use to take 7 days as the interval and aggregate rul and pred_diff over it.
You can use Grouper:
y_test.groupby(['machineID', pd.Grouper(key='datetime', freq='7D')]).mean()
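If you want just the weekly means of rul and pred_diff as flat columns, a fuller sketch (assuming y_test is the dataframe shown above) could look like this:

y_test_weekly = (y_test
    .groupby(['machineID', pd.Grouper(key='datetime', freq='7D')])
    .agg({'rul': 'mean', 'pred_diff': 'mean'})
    .reset_index())
print(y_test_weekly)

This gives one row per machineID per 7-day bucket; the exact number of rows per machine depends on the date range that machine actually covers.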
I would like to convert the following time format, which is located in a pandas DataFrame column:
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
I would like to transform the previous time format into a standard time format of HH:MM as follows:
01:00
02:00
03:00
...
15:00
16:00
...
22:00
23:00
00:00
How can I do it in Python?
Thank you in advance.
This will give you a DataFrame with a datetime64[ns] column and an object dtype column for your data:
import pandas as pd
# read the raw values into a single column named 'pre'
df = pd.read_csv('hm.txt', sep=r"[ ]{2,}", engine='python', header=None, names=['pre'])
# drop the '00' minutes so only the hour digits remain
df['pre'] = df['pre'].astype(str).str.replace('00', '')
# parse the hour digits; the missing date parts default to 1900-01-01
df['datetime_dtype'] = pd.to_datetime(df['pre'], format='%H', exact=False)
# slice the HH:MM part out of the string representation
df['str_dtype'] = df['datetime_dtype'].astype(str).str[11:16]
print(df.head(5))
pre datetime_dtype str_dtype
0 1 1900-01-01 01:00:00 01:00
1 2 1900-01-01 02:00:00 02:00
2 3 1900-01-01 03:00:00 03:00
3 4 1900-01-01 04:00:00 04:00
4 5 1900-01-01 05:00:00 05:00
print(df.dtypes)
pre object
datetime_dtype datetime64[ns]
str_dtype object
dtype: object
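Note that 2400 is not a valid %H hour, so the last value may not convert cleanly with the approach above. If the column simply holds integers from 100 to 2400, an alternative sketch (using a hypothetical column name 'pre') is to do the conversion with integer arithmetic, wrapping 2400 around to 00:00:

import pandas as pd

df = pd.DataFrame({'pre': [100, 200, 1500, 2300, 2400]})
# integer-divide by 100 to get the hour, wrap 24 -> 0, then format as HH:MM
df['hhmm'] = (df['pre'] // 100 % 24).map('{:02d}:00'.format)
print(df)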
I have two data frames like the following: data frame A has datetimes with minutes, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A has all of its rows filled via a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A,B, how='left', left_on=['date', 'hour'],
right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas time series or date functionality?
Use map if you need to append only one column from B to A, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution, use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
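Alternatively, you can merge on a floored datetime key instead of building separate date and hour columns (a sketch, using the same A and B as above and a hypothetical helper column 'hour_key'):

A['hour_key'] = A['dataDate'].dt.floor('H')
B['hour_key'] = B['dataDate'].dt.floor('H')
# left join keeps every row of A and pulls in B's count where the hours match
merge_df = A.merge(B[['hour_key', 'count']], on='hour_key', how='left')
print(merge_df)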
I have a time series of events that spans multiple days; I'm mostly interested in counts per 10-minute interval. So currently, after resampling, it looks like this:
2018-02-27 16:20:00 5
2018-02-27 16:30:00 4
2018-02-27 16:40:00 0
2018-02-27 16:50:00 0
2018-02-27 17:00:00 0
...
2018-06-19 05:30:00 0
2018-06-19 05:40:00 0
2018-06-19 05:50:00 1
How can I "fold" this data over to have just one "day" of data, with the counts added up? So it would look something like this
00:00:00 0
00:10:00 0
...
11:00:00 47
11:10:00 36
11:20:00 12
...
23:40:00 1
23:50:00 0
If your series index is a DatetimeIndex, you can use the attribute time -- if it's a DataFrame and your datetimes are a column, you can use .dt.time. For example:
In [19]: times = pd.date_range("2018-02-27 16:20:00", "2018-06-19 05:50:00", freq="10 min")
...: ser = pd.Series(np.random.randint(0, 6, len(times)), index=times)
...:
...:
In [20]: ser.head()
Out[20]:
2018-02-27 16:20:00 0
2018-02-27 16:30:00 1
2018-02-27 16:40:00 4
2018-02-27 16:50:00 5
2018-02-27 17:00:00 0
Freq: 10T, dtype: int32
In [21]: out = ser.groupby(ser.index.time).sum()
In [22]: out.head()
Out[22]:
00:00:00 285
00:10:00 293
00:20:00 258
00:30:00 263
00:40:00 307
dtype: int32
In [23]: out.tail()
Out[23]:
23:10:00 280
23:20:00 291
23:30:00 236
23:40:00 303
23:50:00 299
dtype: int32
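The same idea works when the datetimes live in a column rather than the index; a sketch with hypothetical column names 'timestamp' and 'count':

folded = df.groupby(df['timestamp'].dt.time)['count'].sum()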
If I understand correctly, you want a sum of the values per 10-minute time-of-day interval in the first column. You can perhaps try something like:
df.groupby('time_column')['value'].agg('sum')
where 'time_column' and 'value' stand in for your time-of-day and count columns.
I have a dataframe consisting of counts within 10-minute time intervals; how would I set count = 0 if a time interval doesn't exist?
DF1
import pandas as pd
import numpy as np
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)
COUNT City Day Time
0 441 PHOENIX Thursday 10:20:00
1 641 ATLANTA Monday 14:30:00
2 661 PHOENIX Saturday 03:50:00
3 570 MIAMI Tuesday 21:00:00
4 222 CHICAGO Friday 15:00:00
DF2 - My approach is to create all the 10 minute time slots in a day (6*24 = 144 entries) and then use "not in"
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86401, 600), })
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
TIME_BIN
0 00:00:00
1 00:10:00
2 00:20:00
3 00:30:00
4 00:40:00
5 00:50:00
6 01:00:00
7 01:10:00
8 01:20:00
How do I check if the timeslots in DF2 do not exist in DF1 for each city and day and if so, set count = 0? I basically just need to fill in all the missing time slots in DF1.
Attempt:
for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
I think you need to reindex by a MultiIndex built with from_product:
np.random.seed(123)
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
df = df.drop_duplicates(['City','Day','Time'])
#print(df)

times = (pd.to_datetime(pd.Series(np.arange(0, 86401, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))

mux = pd.MultiIndex.from_product([df['City'].unique(),
                                  df['Day'].unique(),
                                  times], names=['City','Day','Time'])

df = (df.set_index(['City','Day','Time'])
        .reindex(mux, fill_value=0)
        .reset_index())
print (df.head(20))
City Day Time COUNT
0 CHICAGO Wednesday 00:00:00 66
1 CHICAGO Wednesday 00:10:00 205
2 CHICAGO Wednesday 00:20:00 260
3 CHICAGO Wednesday 00:30:00 127
4 CHICAGO Wednesday 00:40:00 594
5 CHICAGO Wednesday 00:50:00 683
6 CHICAGO Wednesday 01:00:00 203
7 CHICAGO Wednesday 01:10:00 0
8 CHICAGO Wednesday 01:20:00 372
9 CHICAGO Wednesday 01:30:00 109
10 CHICAGO Wednesday 01:40:00 32
11 CHICAGO Wednesday 01:50:00 184
12 CHICAGO Wednesday 02:00:00 630
13 CHICAGO Wednesday 02:10:00 108
14 CHICAGO Wednesday 02:20:00 35
15 CHICAGO Wednesday 02:30:00 604
16 CHICAGO Wednesday 02:40:00 500
17 CHICAGO Wednesday 02:50:00 367
18 CHICAGO Wednesday 03:00:00 118
19 CHICAGO Wednesday 03:10:00 546
One way is to convert to categories and use groupby to calculate the Cartesian product.
In fact, given that your data is largely categorical, this is a good idea and would yield memory benefits for a large number of Time-City-Day combinations.
for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')

bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))
                        .dt.round('10min').dt.strftime('%H:%M:%S')))
df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)

res = df.groupby(['Time', 'City', 'Day'], as_index=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)
# Time City Day COUNT
# 0 00:00:00 ATLANTA Friday 521
# 1 00:00:00 ATLANTA Monday 767
# 2 00:00:00 ATLANTA Saturday 474
# 3 00:00:00 ATLANTA Sunday 1126
# 4 00:00:00 ATLANTA Thursday 157
# 5 00:00:00 ATLANTA Tuesday 720
# 6 00:00:00 ATLANTA Wednesday 0
# 7 00:00:00 CHICAGO Friday 1114
# 8 00:00:00 CHICAGO Monday 813
# 9 00:00:00 CHICAGO Saturday 137
# 10 00:00:00 CHICAGO Sunday 134
# 11 00:00:00 CHICAGO Thursday 0
# 12 00:00:00 CHICAGO Tuesday 168
# ..........
Then you can try the following:
df.groupby(['City','Day']).apply(lambda x: x.set_index('Time')
                                            .reindex(df2.TIME_BIN.unique())
                                            .fillna({'COUNT': 0})
                                            .ffill())
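A pivot-based variation of the same idea (a sketch, assuming df and df2 as defined in the question): pivot to one column per time bin with missing cells filled as 0, add any bins that never occur at all, then stack back to long form.

bins = pd.Index(df2['TIME_BIN'].unique(), name='Time')
wide = df.pivot_table(index=['City', 'Day'], columns='Time', values='COUNT',
                      aggfunc='sum', fill_value=0)
wide = wide.reindex(columns=bins, fill_value=0)  # bins absent from df entirely
res = wide.stack().rename('COUNT').reset_index()
print(res.head())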