This question builds on a previous one, but I have now added an extra part to it.
I have the following dataframe:
import pandas as pd
import datetime
data = {'id': [0, 0, 0, 0, 0, 0],
'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00', '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
I have been trying to calculate the shortest time difference between the orders in each 15-minute window. E.g. I take a 15-minute time window and its midpoint, 00:07:30, which means I would like to calculate the difference between the first order '2019-01-01 00:00:00' and 00:07:30, and between the second order '2019-01-01 00:11:00' and 00:07:30, and keep only the order that is closer to 00:07:30 on each day.
I did the following:
t = 0
s = datetime.datetime.utcfromtimestamp(t).strftime('%H:%M:%S')
#x = '00:00:00'
#y = '00:15:00'
tw = 900
g = 0
a = []
for k in range(30):
    begin = pd.Timestamp(s).to_pydatetime()
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    for i in range(1, len(df_data)):
        #g += 1
        if x <= df_data.iat[i-1, 3] <= y:
            half_time = (pd.Timestamp(y) - pd.Timestamp(x)) / 2
            half_window = (pd.Timestamp(x) + half_time).strftime('%H:%M:%S')
            for l in df_data['day_order']:
                for t_order in df_data['time_order']:
                    if l == t_order.strftime('%Y-%m-%d'):
                        distance1 = abs(pd.Timestamp(df_data.iat[i-1, 3]) - pd.Timestamp(half_window))
                        distance2 = abs(pd.Timestamp(df_data.iat[i, 3]) - pd.Timestamp(half_window))
                        if distance1 < distance2:
                            d = distance1
                        else:
                            d = distance2
                        a.append(d.seconds)
So the expected result for the first day is abs(00:11:00 - 00:07:30) = 00:03:30, which is less than abs(00:00:00 - 00:07:30) = 00:07:30. In this way I would like to keep only the shorter time distance, i.e. the 00:03:30, and ignore the first order on that day. I would like to do this for each day. I tried it with my code above, but it doesn't work. Any idea would be very appreciated. Thanks in advance.
Update:
I have just added an extra command to the code above so that I move the time window by one minute each iteration, e.g. from 00:00:00 - 00:15:00 to 00:01:00 - 00:16:00, and look inside this period for the shortest distance, as previously described, ignoring times that do not belong to that window. I tried this procedure for 30 minutes and it worked with your suggested solution. However, it took other times that do not belong to that period of time.
import pandas as pd
import datetime
data = {'id': [0, 0, 0, 0, 0, 0],
'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00', '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
x = '00:00:00'
y = '00:15:00'
s = '00:00:00'
tw = 900
begin = pd.Timestamp(s).to_pydatetime()
for k in range(10): # 10 times shift will happen
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    print('\n========\n', x, y)
    diff = (pd.Timedelta(y) - pd.Timedelta(x)) / 2
    df_data2 = df_data[(last >= pd.to_datetime(df_data['time'])) & (pd.to_datetime(df_data['time']) > begin1)].copy()
    #print(df_data2)
    df_data2['diff'] = abs(pd.to_timedelta(df_data2['time']) - (diff + pd.Timedelta(x)))
    mins = df_data2.groupby('day_order').apply(lambda z: z[z['diff'] == z['diff'].min()])
    mins.reset_index(drop=True, inplace=True)
    print(mins)
Output after first 10 shifts:
========
00:00:00 00:15:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:03:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:00:30
========
00:01:00 00:16:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:04:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:02:00 00:17:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:05:30
2 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:05:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:03:00 00:18:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:04:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:04:00 00:19:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:03:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:05:00 00:20:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:02:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:02:30
========
00:06:00 00:21:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:03:30
========
00:07:00 00:22:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:04:30
========
00:08:00 00:23:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:04:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:05:30
========
00:09:00 00:24:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:05:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:06:30
Now, if you look at the output, there were some iterations where 4 rows were generated. Looking at the diff column you will find pairs of rows that have the same time difference. This is because we treat positive and negative time differences as the same.
So, for example, in the second iteration of the output above, i.e. 00:01:00 to 00:16:00, we can see that there are two entries for 2019-01-03:
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
And this is because both of their differences are 00:01:30.
The midpoint for this range is at 00:01:00 + 00:07:30 = 00:08:30:
00:07:30 <----(- 01:30)----00:08:30---(+ 01:30)--->00:10:00
And that's why both orders were displayed
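If a single row per day is preferred even when two orders are equidistant from the midpoint, one option (a sketch assuming the earlier order should win ties; the two-row frame below is reconstructed from the output above) is idxmin, which keeps the first minimum per group:

```python
import pandas as pd

# hypothetical per-window result with two tied rows on 2019-01-03
df = pd.DataFrame({
    "day_order": ["2019-01-03", "2019-01-03"],
    "time": ["00:07:00", "00:10:00"],
    "diff": pd.to_timedelta(["00:01:30", "00:01:30"]),
})

# idxmin returns the label of the *first* minimal diff per day,
# so ties resolve in favor of the earlier order
mins = df.loc[df.groupby("day_order")["diff"].idxmin()]
print(mins)
```

Swapping this idxmin lookup for the boolean filter in the groupby-apply above would keep exactly one order per day.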
Related
I have two dataframes (simple examples shown below):
df1 df2
time column time column ID column Value
2022-01-01 00:00:00 2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 2022-01-01 00:30:00 1 9
2022-01-01 00:30:00 2022-01-02 00:30:00 1 5
2022-01-01 00:45:00 2022-01-02 00:45:00 1 15
2022-01-02 00:00:00 2022-01-01 00:00:00 2 6
2022-01-02 00:15:00 2022-01-01 00:15:00 2 2
2022-01-02 00:30:00 2022-01-02 00:45:00 2 7
2022-01-02 00:45:00
df1 shows every timestamp I am interested in. df2 shows data sorted by timestamp and ID. What I need to do is add every single timestamp from df1 that is not in df2 for each unique ID and add zero to the value column.
This is the outcome I'm interested in
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
df2.groupby("ID column")
.apply(lambda x: x.merge(df1, how="outer").fillna(0))
.drop(columns="ID column")
.droplevel(1)
.reset_index()
.sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
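An alternative sketch of the same fill-with-zeros idea, assuming every ID should receive every timestamp from df1: build the full (ID, time) grid with MultiIndex.from_product and reindex df2 against it (tiny hypothetical frames below):

```python
import pandas as pd

df1 = pd.DataFrame({"time column": pd.to_datetime(
    ["2022-01-01 00:00:00", "2022-01-01 00:15:00"])})
df2 = pd.DataFrame({
    "time column": pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 00:15:00",
                                   "2022-01-01 00:00:00"]),
    "ID column": [1, 1, 2],
    "Value": [10, 9, 6],
})

# full (ID, timestamp) grid; missing combinations get Value=0
full = pd.MultiIndex.from_product(
    [df2["ID column"].unique(), df1["time column"]],
    names=["ID column", "time column"],
)
df3 = (df2.set_index(["ID column", "time column"])
          .reindex(full, fill_value=0)
          .reset_index())
print(df3)
```

Because reindex fills the gaps directly, no float upcasting from fillna occurs and the Value column stays integer.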
I have the following two dataframes:
print(df_diff)
print(df_census_occupation)
pacients
2019-01-01 00:10:00 1
2019-01-01 00:20:00 1
2019-01-01 00:30:00 -1
2019-01-02 10:00:00 1
2019-01-02 11:30:00 1
2019-01-03 00:00:00 -1
2019-01-03 15:00:00 -1
2019-01-03 23:30:00 -1
2019-01-04 00:00:00 1
2019-01-04 00:00:00 1
2019-01-04 10:00:00 -1
2019-01-04 10:00:00 -1
pacients_census
2019-01-01 10
2019-01-02 20
2019-01-03 30
2019-01-04 10
And I need to transform them into:
pacients
2019-01-01 00:00:00 10
2019-01-01 00:10:00 11
2019-01-01 00:20:00 12
2019-01-01 00:30:00 11
2019-01-02 00:00:00 20
2019-01-02 10:00:00 21
2019-01-02 11:30:00 22
2019-01-03 00:00:00 30
2019-01-03 00:00:00 29
2019-01-03 15:00:00 28
2019-01-03 23:30:00 27
2019-01-04 00:00:00 10
2019-01-04 00:00:00 11
2019-01-04 00:00:00 12
2019-01-04 10:00:00 11
2019-01-04 10:00:00 10
It's like a cumsum by day, where each day starts over again from a value given by another dataframe (df_census_occupation). Care must be taken with repeated timestamps: there may be days with exactly the same hour in df_diff, and such hours may also coincide with the start of the day in df_census_occupation. This is what happens at 2019-01-04 00:00:00, for example.
I tried using cumsum with masks and shifts, and also some groupby operations, but the code was becoming difficult to understand and it was not considering the repeated hours issue.
Auxiliary code to generate the two dataframes:
import datetime
import pandas as pd
df_diff_index = [
"2019-01-01 00:10:00",
"2019-01-01 00:20:00",
"2019-01-01 00:30:00",
"2019-01-02 10:00:00",
"2019-01-02 11:30:00",
"2019-01-03 00:00:00",
"2019-01-03 15:00:00",
"2019-01-03 23:30:00",
"2019-01-04 00:00:00",
"2019-01-04 00:00:00",
"2019-01-04 10:00:00",
"2019-01-04 10:00:00",
]
df_diff_index = [datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S") for date in df_diff_index]
df_census_occupation_index = [
"2019-01-01",
"2019-01-02",
"2019-01-03",
"2019-01-04",
]
df_census_occupation_index = [datetime.datetime.strptime(date, "%Y-%m-%d") for date in df_census_occupation_index]
df_diff = pd.DataFrame({"pacients": [1, 1, -1, 1, 1, -1, -1, -1, 1, 1, -1, -1]}, index=df_diff_index)
df_census_occupation = pd.DataFrame({"pacients_census": [10, 20, 30, 10]}, index=df_census_occupation_index)
Concatenate the census values with the data, sort by index, then group by day and cumsum:
out = pd.concat([df_census_occupation.rename(columns={'pacients_census':'pacients'}), df_diff]
).sort_index().groupby(pd.Grouper(freq='D')).cumsum()
Output:
pacients
2019-01-01 00:00:00 10
2019-01-01 00:10:00 11
2019-01-01 00:20:00 12
2019-01-01 00:30:00 11
2019-01-02 00:00:00 20
2019-01-02 10:00:00 21
2019-01-02 11:30:00 22
2019-01-03 00:00:00 30
2019-01-03 00:00:00 29
2019-01-03 15:00:00 28
2019-01-03 23:30:00 27
2019-01-04 00:00:00 10
2019-01-04 00:00:00 11
2019-01-04 00:00:00 12
2019-01-04 10:00:00 11
2019-01-04 10:00:00 10
Note you may want to pass kind='mergesort' to sort_index so the sort is stable, i.e. the census value goes before the data.
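A minimal sketch of that stable-sort variant, using hypothetical rows where the census seed and two orders share the 2019-01-04 00:00:00 timestamp:

```python
import pandas as pd

df_census = pd.DataFrame({"pacients": [10]},
                         index=pd.to_datetime(["2019-01-04"]))
df_diff = pd.DataFrame({"pacients": [1, 1]},
                       index=pd.to_datetime(["2019-01-04 00:00:00",
                                             "2019-01-04 00:00:00"]))

# mergesort is stable: rows comparing equal keep their concat order,
# so the census seed stays first among the 00:00:00 rows
out = (pd.concat([df_census, df_diff])
         .sort_index(kind="mergesort")
         .groupby(pd.Grouper(freq="D"))
         .cumsum())
print(out)
```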
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values which are not within any start/end time pair to zero, and retain the values for the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all masks with Series.between in a list comprehension, join them with np.logical_or.reduce, and finally pass the result to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np
L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
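A related sketch, assuming pandas' IntervalIndex: build one interval per df2 row and test each timestamp for membership, which avoids materializing one mask per interval (hypothetical two-row df1 below):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-01-13 00:00:00", "2020-01-01 00:05:00"]),
    "Value": [-68.9537, -67.90175],
})
df2 = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-01-12 16:15:00", "2020-01-26 16:00:00"]),
    "end_time": pd.to_datetime(["2020-01-13 16:00:00", "2020-01-26 16:10:00"]),
})

# one interval per df2 row; a timestamp survives if any interval contains it
iv = pd.IntervalIndex.from_arrays(df2["start_time"], df2["end_time"],
                                  closed="both")
mask = df1["Timestamp"].apply(lambda t: iv.contains(t).any())
df1["Value"] = df1["Value"].where(mask, 0)
print(df1)
```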
Solution using outer join of merge method and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
idx = df1.reset_index().assign(key=0).merge(df2.assign(key=0), how='outer')\
    .query("timestamp >= start_time and timestamp < end_time")['index']
df1.loc[df1.index.difference(idx), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The dummy key assign(key=0) is added to both dataframes to produce a cartesian product, and reset_index carries df1's original row labels through the merge so the matches map back to the right rows.
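On pandas 1.2+ the dummy key can be dropped: merge(how="cross") builds the same cartesian product directly. A sketch under that version assumption (hypothetical two-row df1), again carrying df1's original index through reset_index:

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-13 00:00:00", "2020-01-01 00:05:00"]),
    "Value": [-68.9537, -67.90175],
})
df2 = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-01-12 16:15:00"]),
    "end_time": pd.to_datetime(["2020-01-13 16:00:00"]),
})

# cartesian product; the 'index' column maps matches back to df1's rows
keep = (df1.reset_index().merge(df2, how="cross")
           .query("timestamp >= start_time and timestamp < end_time")["index"])
df1.loc[df1.index.difference(keep), "Value"] = 0
print(df1)
```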
Let's say I have the below dataframe. How would I get an extra column 'flag' with 1s where a day has an age bigger than 90, but only if this happens on 2 consecutive days (48h in this case)? The output should contain 1s on 2 or more days, depending on how many days the condition is met. The dataset is much bigger, but I put here just a small portion so you get an idea.
Age
Dates
2019-01-01 00:00:00 29
2019-01-01 01:00:00 56
2019-01-01 02:00:00 82
2019-01-01 03:00:00 13
2019-01-01 04:00:00 35
2019-01-01 05:00:00 53
2019-01-01 06:00:00 25
2019-01-01 07:00:00 23
2019-01-01 08:00:00 21
2019-01-01 09:00:00 12
2019-01-01 10:00:00 15
2019-01-01 11:00:00 9
2019-01-01 12:00:00 13
2019-01-01 13:00:00 87
2019-01-01 14:00:00 9
2019-01-01 15:00:00 63
2019-01-01 16:00:00 62
2019-01-01 17:00:00 52
2019-01-01 18:00:00 43
2019-01-01 19:00:00 77
2019-01-01 20:00:00 95
2019-01-01 21:00:00 79
2019-01-01 22:00:00 77
2019-01-01 23:00:00 5
2019-01-02 00:00:00 78
2019-01-02 01:00:00 41
2019-01-02 02:00:00 10
2019-01-02 03:00:00 10
2019-01-02 04:00:00 88
2019-01-02 05:00:00 19
This would be the desired output:
Dates Age flag
0 2019-01-01 00:00:00 29 1
1 2019-01-01 01:00:00 56 1
2 2019-01-01 02:00:00 82 1
3 2019-01-01 03:00:00 13 1
4 2019-01-01 04:00:00 35 1
5 2019-01-01 05:00:00 53 1
6 2019-01-01 06:00:00 25 1
7 2019-01-01 07:00:00 23 1
8 2019-01-01 08:00:00 21 1
9 2019-01-01 09:00:00 12 1
10 2019-01-01 10:00:00 15 1
11 2019-01-01 11:00:00 9 1
12 2019-01-01 12:00:00 13 1
13 2019-01-01 13:00:00 87 1
14 2019-01-01 14:00:00 9 1
15 2019-01-01 15:00:00 63 1
16 2019-01-01 16:00:00 62 1
17 2019-01-01 17:00:00 52 1
18 2019-01-01 18:00:00 43 1
19 2019-01-01 19:00:00 77 1
20 2019-01-01 20:00:00 95 1
21 2019-01-01 21:00:00 79 1
22 2019-01-01 22:00:00 77 1
23 2019-01-01 23:00:00 5 1
24 2019-01-02 00:00:00 78 0
25 2019-01-02 01:00:00 41 0
26 2019-01-02 02:00:00 10 0
27 2019-01-02 03:00:00 10 0
28 2019-01-02 04:00:00 88 0
29 2019-01-02 05:00:00 19 0
The dates are the index of the dataframe, incremented by 1h.
thanks
You can first compare the column with Series.gt, then group by DatetimeIndex.date and check if there is at least one True per group using GroupBy.transform with GroupBy.any, and last cast the mask to integers for True/False to 1/0 mapping, then combine it with the previous answer:
df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='5H', periods=24))
#for test 1H timestamp use
#df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='H', periods=24 * 5))
df.loc[pd.Timestamp('2019-01-02 01:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-03 02:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-05 19:00:00'), 'Age'] = 95
#print (df)
#for test 48 consecutive values change N = 48
N = 10
s = df['Age'].gt(90)
s1 = (s.groupby(df.index.date).transform('any'))
g1 = s1.ne(s1.shift()).cumsum()
df['flag'] = (s.groupby(g1).transform('size').ge(N) & s1).astype(int)
print (df)
Age flag
2019-01-01 00:00:00 10 0
2019-01-01 05:00:00 10 0
2019-01-01 10:00:00 10 0
2019-01-01 15:00:00 10 0
2019-01-01 20:00:00 10 0
2019-01-02 01:00:00 95 1
2019-01-02 06:00:00 10 1
2019-01-02 11:00:00 10 1
2019-01-02 16:00:00 10 1
2019-01-02 21:00:00 10 1
2019-01-03 02:00:00 95 1
2019-01-03 07:00:00 10 1
2019-01-03 12:00:00 10 1
2019-01-03 17:00:00 10 1
2019-01-03 22:00:00 10 1
2019-01-04 03:00:00 10 0
2019-01-04 08:00:00 10 0
2019-01-04 13:00:00 10 0
2019-01-04 18:00:00 10 0
2019-01-04 23:00:00 10 0
2019-01-05 04:00:00 10 0
2019-01-05 09:00:00 10 0
2019-01-05 14:00:00 10 0
2019-01-05 19:00:00 95 0
Apparently, this could be a solution to the first version of the question: how to add a column whose row values are 1 if at least one of the rows with the same date (y-m-d) has an Age value greater than 90.
import pandas as pd
df = pd.DataFrame({
'Dates':['2019-01-01 00:00:00',
'2019-01-01 01:00:00',
'2019-01-01 02:00:00',
'2019-01-02 00:00:00',
'2019-01-02 01:00:00',
'2019-01-03 02:00:00',
'2019-01-03 03:00:00',],
'Age':[29, 56, 92, 13, 1, 2, 93],})
df.set_index('Dates', inplace=True)
df.index = pd.to_datetime(df.index)
df['flag'] = pd.DatetimeIndex(df.index).day
df['flag'] = df.flag.isin(df['flag'][df['Age']>90]).astype(int)
It returns:
Age flag
Dates
2019-01-01 00:00:00 29 1
2019-01-01 01:00:00 56 1
2019-01-01 02:00:00 92 1
2019-01-02 00:00:00 13 0
2019-01-02 01:00:00 1 0
2019-01-03 02:00:00 2 1
2019-01-03 03:00:00 93 1
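One caveat with this approach: DatetimeIndex(...).day keeps only the day-of-month number, so over longer spans e.g. 2019-01-01 and 2019-02-01 would share a flag. A hedged variant keyed on the full normalized date avoids that (hypothetical three-row frame below):

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [29, 92, 13]},
    index=pd.to_datetime(["2019-01-01 00:00:00",
                          "2019-01-01 02:00:00",
                          "2019-02-01 03:00:00"]))

# normalize() keeps year and month, so 2019-02-01 is not confused
# with 2019-01-01 the way a bare day-of-month number would be
days = df.index.normalize()
df["flag"] = days.isin(days[(df["Age"] > 90).to_numpy()]).astype(int)
print(df)
```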
I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5