I have a dataframe that consists of 3 years of data with two columns: remaining useful life (rul) and predicted remaining useful life (pred_diff).
I am aggregating rul and pred_diff over the 3 years of data for each machineID at the maximum date it has. The original dataframe looks like this:
rul pred_diff machineID datetime
10476749 870 312.207825 408 2021-05-25 00:00:00
11452943 68 288.517578 447 2023-03-01 12:00:00
12693829 381 273.159698 493 2021-09-16 16:00:00
3413787 331 291.326416 133 2022-10-26 12:00:00
464093 77 341.506195 19 2023-10-10 16:00:00
... ... ... ... ...
11677555 537 310.586090 456 2022-04-07 00:00:00
2334804 551 289.307129 92 2021-09-04 20:00:00
5508311 35 293.721771 214 2023-01-06 04:00:00
12319704 348 322.199219 479 2021-11-11 20:00:00
4777501 87 278.089417 186 2021-06-29 12:00:00
1287421 rows × 4 columns
And I am aggregating it with this code:
y_test_grp = y_test.groupby('machineID').agg({'datetime':'max', 'rul':'mean', 'pred_diff':'mean'})[['datetime','rul', 'pred_diff']].reset_index()
which gives the following output:
machineID datetime rul pred_diff
0 1 2023-10-03 20:00:00 286.817681 266.419401
1 2 2023-11-14 00:00:00 225.561953 263.372531
2 3 2023-10-25 00:00:00 304.736237 256.933351
3 4 2023-01-13 12:00:00 204.084899 252.476066
4 5 2023-09-07 00:00:00 208.702431 252.487156
... ... ... ... ...
495 496 2023-10-11 00:00:00 302.445285 298.836798
496 497 2023-08-26 04:00:00 281.601613 263.479885
497 498 2023-11-28 04:00:00 292.593906 263.985034
498 499 2023-06-29 20:00:00 260.887529 263.494844
499 500 2023-11-08 20:00:00 160.223614 257.326034
500 rows × 4 columns
Since this is grouped only by machineID, it gives just 500 rows, which is too few. I want to aggregate rul and pred_diff on a weekly basis so that for each machineID I get 52 weeks * 3 years = 156 rows. I am not able to identify which function to use to take 7 days as the interval and aggregate rul and pred_diff over it.
You can use Grouper:
y_test.groupby(['machineID', pd.Grouper(key='datetime', freq='7D')]).mean()
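If you want just the weekly means of rul and pred_diff as flat columns, a fuller sketch (assuming y_test is the dataframe shown above) could look like this:

y_test_weekly = (y_test
    .groupby(['machineID', pd.Grouper(key='datetime', freq='7D')])
    .agg({'rul': 'mean', 'pred_diff': 'mean'})
    .reset_index())
print(y_test_weekly)

This gives one row per machineID per 7-day bucket; the exact number of rows per machine depends on the date range that machine actually covers.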
I would like to convert the following time format, which is located in a pandas DataFrame column:
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
I would like to transform the previous time format into a standard time format of HH:MM as follows:
01:00
02:00
03:00
...
15:00
16:00
...
22:00
23:00
00:00
How can I do it in Python?
Thank you in advance.
This will give you a DataFrame with a datetime64[ns] column and an object dtype column for your data:
import pandas as pd
# read the raw values into a single column named 'pre'
df = pd.read_csv('hm.txt', sep=r"[ ]{2,}", engine='python', header=None, names=['pre'])
# drop the '00' minutes so only the hour digits remain
df['pre'] = df['pre'].astype(str).str.replace('00', '')
# parse the hour digits; the missing date parts default to 1900-01-01
df['datetime_dtype'] = pd.to_datetime(df['pre'], format='%H', exact=False)
# slice the HH:MM part out of the string representation
df['str_dtype'] = df['datetime_dtype'].astype(str).str[11:16]
print(df.head(5))
pre datetime_dtype str_dtype
0 1 1900-01-01 01:00:00 01:00
1 2 1900-01-01 02:00:00 02:00
2 3 1900-01-01 03:00:00 03:00
3 4 1900-01-01 04:00:00 04:00
4 5 1900-01-01 05:00:00 05:00
print(df.dtypes)
pre object
datetime_dtype datetime64[ns]
str_dtype object
dtype: object
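Note that 2400 is not a valid %H hour, so the last value may not convert cleanly with the approach above. If the column simply holds integers from 100 to 2400, an alternative sketch (using a hypothetical column name 'pre') is to do the conversion with integer arithmetic, wrapping 2400 around to 00:00:

import pandas as pd

df = pd.DataFrame({'pre': [100, 200, 1500, 2300, 2400]})
# integer-divide by 100 to get the hour, wrap 24 -> 0, then format as HH:MM
df['hhmm'] = (df['pre'] // 100 % 24).map('{:02d}:00'.format)
print(df)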
I have two data frames like the following: data frame A has datetimes with minutes, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A has all of its rows filled via a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A,B, how='left', left_on=['date', 'hour'],
right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas time series or date functionality?
Use map if you need to append only one column from B to A, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution, use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
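Alternatively, you can merge on a floored datetime key instead of building separate date and hour columns (a sketch, using the same A and B as above and a hypothetical helper column 'hour_key'):

A['hour_key'] = A['dataDate'].dt.floor('H')
B['hour_key'] = B['dataDate'].dt.floor('H')
# left join keeps every row of A and pulls in B's count where the hours match
merge_df = A.merge(B[['hour_key', 'count']], on='hour_key', how='left')
print(merge_df)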
I have a time series of events that spans multiple days; I'm mostly interested in counts per 10-minute interval. So currently, after resampling, it looks like this:
2018-02-27 16:20:00 5
2018-02-27 16:30:00 4
2018-02-27 16:40:00 0
2018-02-27 16:50:00 0
2018-02-27 17:00:00 0
...
2018-06-19 05:30:00 0
2018-06-19 05:40:00 0
2018-06-19 05:50:00 1
How can I "fold" this data over to have just one "day" of data, with the counts added up? So it would look something like this
00:00:00 0
00:10:00 0
...
11:00:00 47
11:10:00 36
11:20:00 12
...
23:40:00 1
23:50:00 0
If your series index is a DatetimeIndex, you can use the attribute time -- if it's a DataFrame and your datetimes are a column, you can use .dt.time. For example:
In [19]: times = pd.date_range("2018-02-27 16:20:00", "2018-06-19 05:50:00", freq="10 min")
...: ser = pd.Series(np.random.randint(0, 6, len(times)), index=times)
...:
...:
In [20]: ser.head()
Out[20]:
2018-02-27 16:20:00 0
2018-02-27 16:30:00 1
2018-02-27 16:40:00 4
2018-02-27 16:50:00 5
2018-02-27 17:00:00 0
Freq: 10T, dtype: int32
In [21]: out = ser.groupby(ser.index.time).sum()
In [22]: out.head()
Out[22]:
00:00:00 285
00:10:00 293
00:20:00 258
00:30:00 263
00:40:00 307
dtype: int32
In [23]: out.tail()
Out[23]:
23:10:00 280
23:20:00 291
23:30:00 236
23:40:00 303
23:50:00 299
dtype: int32
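The same idea works when the datetimes live in a column rather than the index; a sketch with hypothetical column names 'timestamp' and 'count':

folded = df.groupby(df['timestamp'].dt.time)['count'].sum()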
If I understand correctly, you want a sum of the values per 10-minute time-of-day interval in the first column. You can perhaps try something like:
df.groupby('time_column')['value'].agg('sum')
where 'time_column' and 'value' stand in for your time-of-day and count columns.
I have a dataframe consisting of counts within 10-minute time intervals; how would I set count = 0 if a time interval doesn't exist?
DF1
import pandas as pd
import numpy as np
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)
COUNT City Day Time
0 441 PHOENIX Thursday 10:20:00
1 641 ATLANTA Monday 14:30:00
2 661 PHOENIX Saturday 03:50:00
3 570 MIAMI Tuesday 21:00:00
4 222 CHICAGO Friday 15:00:00
DF2 - My approach is to create all the 10 minute time slots in a day (6*24 = 144 entries) and then use "not in"
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86401, 600), })
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
TIME_BIN
0 00:00:00
1 00:10:00
2 00:20:00
3 00:30:00
4 00:40:00
5 00:50:00
6 01:00:00
7 01:10:00
8 01:20:00
How do I check if the timeslots in DF2 do not exist in DF1 for each city and day and if so, set count = 0? I basically just need to fill in all the missing time slots in DF1.
Attempt:
for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
I think you need to reindex by a MultiIndex built with from_product:
np.random.seed(123)
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
df = df.drop_duplicates(['City','Day','Time'])
#print(df)

times = (pd.to_datetime(pd.Series(np.arange(0, 86401, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))

mux = pd.MultiIndex.from_product([df['City'].unique(),
                                  df['Day'].unique(),
                                  times], names=['City','Day','Time'])

df = (df.set_index(['City','Day','Time'])
        .reindex(mux, fill_value=0)
        .reset_index())
print (df.head(20))
City Day Time COUNT
0 CHICAGO Wednesday 00:00:00 66
1 CHICAGO Wednesday 00:10:00 205
2 CHICAGO Wednesday 00:20:00 260
3 CHICAGO Wednesday 00:30:00 127
4 CHICAGO Wednesday 00:40:00 594
5 CHICAGO Wednesday 00:50:00 683
6 CHICAGO Wednesday 01:00:00 203
7 CHICAGO Wednesday 01:10:00 0
8 CHICAGO Wednesday 01:20:00 372
9 CHICAGO Wednesday 01:30:00 109
10 CHICAGO Wednesday 01:40:00 32
11 CHICAGO Wednesday 01:50:00 184
12 CHICAGO Wednesday 02:00:00 630
13 CHICAGO Wednesday 02:10:00 108
14 CHICAGO Wednesday 02:20:00 35
15 CHICAGO Wednesday 02:30:00 604
16 CHICAGO Wednesday 02:40:00 500
17 CHICAGO Wednesday 02:50:00 367
18 CHICAGO Wednesday 03:00:00 118
19 CHICAGO Wednesday 03:10:00 546
One way is to convert to categories and use groupby to calculate the Cartesian product.
In fact, given that your data is largely categorical, this is a good idea and would yield memory benefits for a large number of Time-City-Day combinations.
for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')

bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))
                        .dt.round('10min').dt.strftime('%H:%M:%S')))
df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)

res = df.groupby(['Time', 'City', 'Day'], as_index=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)
# Time City Day COUNT
# 0 00:00:00 ATLANTA Friday 521
# 1 00:00:00 ATLANTA Monday 767
# 2 00:00:00 ATLANTA Saturday 474
# 3 00:00:00 ATLANTA Sunday 1126
# 4 00:00:00 ATLANTA Thursday 157
# 5 00:00:00 ATLANTA Tuesday 720
# 6 00:00:00 ATLANTA Wednesday 0
# 7 00:00:00 CHICAGO Friday 1114
# 8 00:00:00 CHICAGO Monday 813
# 9 00:00:00 CHICAGO Saturday 137
# 10 00:00:00 CHICAGO Sunday 134
# 11 00:00:00 CHICAGO Thursday 0
# 12 00:00:00 CHICAGO Tuesday 168
# ..........
Then you can try the following:
df.groupby(['City','Day']).apply(lambda x: x.set_index('Time')
                                            .reindex(df2.TIME_BIN.unique())
                                            .fillna({'COUNT': 0})
                                            .ffill())
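A pivot-based variation of the same idea (a sketch, assuming df and df2 as defined in the question): pivot to one column per time bin with missing cells filled as 0, add any bins that never occur at all, then stack back to long form.

bins = pd.Index(df2['TIME_BIN'].unique(), name='Time')
wide = df.pivot_table(index=['City', 'Day'], columns='Time', values='COUNT',
                      aggfunc='sum', fill_value=0)
wide = wide.reindex(columns=bins, fill_value=0)  # bins absent from df entirely
res = wide.stack().rename('COUNT').reset_index()
print(res.head())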