I have a DataFrame df1:
Date Open Expiry Entry Strike
0 2020-01-03 12261.10 2020-01-09 12200.0
1 2020-01-10 12271.00 2020-01-16 12200.0
2 2020-01-17 12328.40 2020-01-23 12300.0
3 2020-01-24 12174.55 2020-01-30 12100.0
4 2020-01-31 12100.40 2020-02-06 12100.0
I want to add values from df2:
Date Expiry Type Strike Price Open Close
0 2020-01-03 2020-01-09 CE 13100 0.0 65.85
1 2020-01-03 2020-01-09 CE 13150 0.0 59.40
2 2020-01-03 2020-01-09 CE 13200 0.0 53.55
3 2020-01-03 2020-01-09 CE 13250 0.0 48.15
4 2020-01-03 2020-01-09 CE 13300 0.0 43.25
I want to compare the Date, Expiry and Entry Strike columns of df1 with the Date, Expiry and Strike Price columns of df2, and add the corresponding Open value from df2 to df1 where all three match. When I compare the columns directly I get errors like:
ValueError: Can only compare identically-labeled Series objects
Thanks for the help.
Did you try a simple merge to see if it solves what you want?
Do something like this:
pd.merge(df1,df2,how='left', left_on=['Date','Expiry','Entry Strike'], right_on=['Date','Expiry','Strike Price'])
Your output will be as shown below. For this, I modified the first row to match. Otherwise, the data has no matching records.
Date Open_x Expiry ... Strike Price Open_y Close
0 2020-01-03 12261.10 2020-01-09 ... 12200.0 0.0 43.25
1 2020-01-10 12271.00 2020-01-16 ... NaN NaN NaN
2 2020-01-17 12328.40 2020-01-23 ... NaN NaN NaN
3 2020-01-24 12174.55 2020-01-30 ... NaN NaN NaN
4 2020-01-31 12100.40 2020-02-06 ... NaN NaN NaN
You can then delete all columns that you don't want.
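For example, a minimal sketch of trimming the merged result back to df1's columns plus the matched price (the Open_x/Open_y suffixes are the ones produced by the merge above; the new name 'Matched Open' is just illustrative):
merged = pd.merge(df1, df2, how='left',
                  left_on=['Date', 'Expiry', 'Entry Strike'],
                  right_on=['Date', 'Expiry', 'Strike Price'])
# drop the df2 columns that are no longer needed and restore the original 'Open' name
merged = (merged.drop(columns=['Type', 'Strike Price', 'Close'])
                .rename(columns={'Open_x': 'Open', 'Open_y': 'Matched Open'}))
Also note that 'Entry Strike' in df1 is float while 'Strike Price' in df2 is int; when key dtypes differ, pandas may warn about merging on int and float columns, so casting both keys to a common dtype first is a safe precaution.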
You can also try it with the apply function:
def check(row):
    dt = row['Date']
    ex = row["Expiry"]
    sp = row["Entry Strike"]
    return df2[(df2['Date'] == dt) & (df2['Expiry'] == ex) & (df2["Strike Price"] == sp)]['Open']

df1['new_col'] = df1.apply(lambda x: check(x), axis=1)
This will also work; check the output below. I changed one of the values so that one row matches.
df1
Date Open Expiry Entry Strike new_col
0 2020-01-03 12261.10 2020-01-09 13100.0 0.0
1 2020-01-10 12271.00 2020-01-16 12200.0 NaN
2 2020-01-17 12328.40 2020-01-23 12300.0 NaN
3 2020-01-24 12174.55 2020-01-30 12100.0 NaN
4 2020-01-31 12100.40 2020-02-06 12100.0 NaN
Related
Suppose we have two dataframes: df1 with a timestamp, and df2 with start and end timestamps:
df1:

DateTime              Value1
2020-01-11 12:30:00        1
2020-01-11 13:00:00        2
2020-02-11 13:30:00        3
2020-02-11 14:00:00        4
2020-02-11 14:30:00        5
2020-02-11 15:00:00        6
2020-02-11 15:30:00        7
2020-02-11 16:00:00        8

df2:

StartDateTime         EnddDateTime          Value2
2020-01-11 12:23:12   2020-01-11 13:10:00        a
2020-01-11 14:12:20   2020-01-11 14:20:34        b
2020-01-11 15:20:00   2020-01-11 15:28:10        c
2020-01-11 15:45:20   2020-01-11 16:26:23        d
Each timestamp in df1 represents a half-hour period starting at the time in the DateTime column. I want to match the start and end times of df2 against these 30-minute periods. A value from df2 may fall into two rows of df1 if its period (the time between start and end) overlaps two DateTime slots of df1, even by only one second. The outcome should be a dataframe as below.
DateTime              Value1   Value2
2020-01-11 12:30:00        1        a
2020-01-11 13:00:00        2        a
2020-02-11 13:30:00        3      NaN
2020-02-11 14:00:00        4        b
2020-02-11 14:30:00        5      NaN
2020-02-11 15:00:00        6        c
2020-02-11 15:30:00        7        d
2020-02-11 16:00:00        8        d
Any suggestions to efficiently merge large data?
There may be shorter, better answers out there, because I am going longhand here.
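For reference, here is a sketch of constructing the two frames from the tables in the question (this setup is assumed, not part of the original answer):
import pandas as pd

df1 = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-11 12:30:00', '2020-01-11 13:00:00',
                                '2020-02-11 13:30:00', '2020-02-11 14:00:00',
                                '2020-02-11 14:30:00', '2020-02-11 15:00:00',
                                '2020-02-11 15:30:00', '2020-02-11 16:00:00']),
    'Value1': [1, 2, 3, 4, 5, 6, 7, 8]})
df2 = pd.DataFrame({
    'StartDateTime': pd.to_datetime(['2020-01-11 12:23:12', '2020-01-11 14:12:20',
                                     '2020-01-11 15:20:00', '2020-01-11 15:45:20']),
    'EnddDateTime': pd.to_datetime(['2020-01-11 13:10:00', '2020-01-11 14:20:34',
                                    '2020-01-11 15:28:10', '2020-01-11 16:26:23']),
    'Value2': ['a', 'b', 'c', 'd']})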
Melt the second data frame:
df3=pd.melt(df2, id_vars=['Value2'], value_vars=['StartDateTime', 'EnddDateTime'],value_name='DateTime').sort_values(by='DateTime')
Create temporary columns on both dataframes. The idea is to extract the time of day from each datetime and append it to a uniform date, so the times can be compared in the merge:
df1['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df1['DateTime']).dt.time.astype(str)
df3['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df3['DateTime']).dt.time.astype(str)
Convert the new columns computed above to datetime:
df3["DateTime1"]=pd.to_datetime(df3["DateTime1"])
df1["DateTime1"]=pd.to_datetime(df1["DateTime1"])
Finally, merge_asof with a time tolerance:
final = pd.merge_asof(df1, df3, on="DateTime1",tolerance=pd.Timedelta("39M"),suffixes=('_', '_df2')).drop(columns=['DateTime1','variable','DateTime_df2'])
DateTime_ Value1 Value2
0 2020-01-11 13:00:00 2 a
1 2020-02-11 13:30:00 3 a
2 2020-02-11 14:00:00 4 NaN
3 2020-02-11 14:30:00 5 b
4 2020-02-11 15:00:00 6 NaN
5 2020-02-11 15:30:00 7 c
6 2020-02-11 16:00:00 8 d
I would like to identify the maximum value in a column that occurs within the following X days from the current date.
This is a subset of the data frame, showing the daily values for 2020:
Date Data
6780 2020-01-02 323.540009
6781 2020-01-03 321.160004
6782 2020-01-06 320.489990
6783 2020-01-07 323.019989
6784 2020-01-08 322.940002
... ... ...
7028 2020-12-24 368.079987
7029 2020-12-28 371.739990
7030 2020-12-29 373.809998
7031 2020-12-30 372.339996
I would like to find a way to identify the max value within the following 30 days. e.g.
Date Data Max
6780 2020-01-02 323.540009 323.019989
6781 2020-01-03 321.160004 323.019989
6782 2020-01-06 320.489990 323.730011
6783 2020-01-07 323.019989 323.540009
6784 2020-01-08 322.940002 325.779999
... ... ... ...
7028 2020-12-24 368.079987 373.809998
7029 2020-12-28 371.739990 373.809998
7030 2020-12-29 373.809998 372.339996
7031 2020-12-30 372.339996 373.100006
I tried calculating the start and end dates and storing them in the columns. e.g.
df['startDate'] = df['Date'] + pd.to_timedelta(1, unit='d')
df['endDate'] = df['Date'] + pd.to_timedelta(30, unit='d')
before trying to calculate the max, e.g.
df['Max'] = df.loc[(df['Date'] > df['startDate']) & (df['Date'] < df['endDate'])]['Data'].max()
But this results in;
Date Data startDate endDate Max
6780 2020-01-02 323.540009 2020-01-03 2020-01-29 NaN
6781 2020-01-03 321.160004 2020-01-04 2020-01-30 NaN
6782 2020-01-06 320.489990 2020-01-07 2020-02-02 NaN
6783 2020-01-07 323.019989 2020-01-08 2020-02-03 NaN
6784 2020-01-08 322.940002 2020-01-09 2020-02-04 NaN
... ... ... ... ... ...
7027 2020-12-23 368.279999 2020-12-24 2021-01-19 NaN
7028 2020-12-24 368.079987 2020-12-25 2021-01-20 NaN
7029 2020-12-28 371.739990 2020-12-29 2021-01-24 NaN
7030 2020-12-29 373.809998 2020-12-31 2021-01-26 NaN
If I statically add dates to the loc[] statement it partially works, filling the max for that static range, but it just gives me the same value for every row.
Any help on the correct pandas way to achieve this would be appreciated.
Kind Regards
df.rolling can do this if you make the date a datetime object and use it as the index:
df["Date"] = pd.to_datetime(df.Date)
df.set_index("Date").rolling("2d").max()
output:
Data
Date
2020-01-02 323.540009
2020-01-03 323.540009
2020-01-06 320.489990
2020-01-07 323.019989
2020-01-08 323.019989
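Note that the rolling window above looks backward over the preceding days. For the "following 30 days" asked about in the question, a minimal (and deliberately simple) sketch is to mask each forward window explicitly; it assumes Date is already datetime and sorted ascending, and it is O(n²), so it is only meant to illustrate the logic:
import pandas as pd

# for each row, take the max of 'Data' over the next 30 days, excluding the current day
df = df.sort_values('Date').reset_index(drop=True)
window = pd.Timedelta(days=30)
df['Max'] = [
    df.loc[(df['Date'] > d) & (df['Date'] <= d + window), 'Data'].max()
    for d in df['Date']
]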
I have data in the format below:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:55:00 18 1047
Expected Output:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:20:00 0 41
xyz 01-01-2020 00:25:00 0 41
xyz 01-01-2020 00:30:00 0 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:50:00 0 1029
xyz 01-01-2020 11:55:00 18 1047
So I want to fill in timestamps at 5-minute intervals, fill the flowers column with 0, and fill the total_flowers column with the previous value (ffill).
My efforts:
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')
df = df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)
How do I fill the flowers column with zeros and the total_flowers column with ffill? I am also getting NaN values in the timestamp column.
Actual Output:
user timestamp flowers total_flowers
xyz NaN 15 15
xyz NaN 5 20
xyz NaN 21 41
xyz NaN NaN NaN
xyz NaN NaN NaN
xyz NaN NaN NaN
xyz NaN 1 42
...
xyz NaN 57 1029
xyz NaN NaN NaN
xyz NaN 18 1047
Reindex and refill
If you construct the dates such that you can reindex your timestamps, you can then just do some fillna and ffill operations. I had to remove the timezone information, but you should be able to add that back if your data are timezone aware. Here's the full example using some of your data:
d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
     'timestamp': {0: pd.Timestamp('2020-01-01 00:05:00'),
                   1: pd.Timestamp('2020-01-01 00:10:00'),
                   2: pd.Timestamp('2020-01-01 00:15:00'),
                   3: pd.Timestamp('2020-01-01 00:35:00')},
     'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
     'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}
df = pd.DataFrame(d)
# user timestamp flowers total_flowers
#0 xyz 2020-01-01 00:05:00 15 15
#1 xyz 2020-01-01 00:10:00 5 20
#2 xyz 2020-01-01 00:15:00 21 41
#3 xyz 2020-01-01 00:35:00 1 42
#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")
#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)
Output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
7 2020-01-01 00:40:00 xyz 0.0 42.0
8 2020-01-01 00:45:00 xyz 0.0 42.0
9 2020-01-01 00:50:00 xyz 0.0 42.0
10 2020-01-01 00:55:00 xyz 0.0 42.0
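On the timezone caveat above: a hedged sketch of keeping the +05:30 offset from the question instead of stripping it, by building the reindex target timezone-aware (not tested against the asker's real data):
# build a tz-aware 5-minute grid carrying the question's +05:30 offset
dates = pd.date_range(start='2020-01-01 00:05:00+05:30',
                      end='2020-01-01 00:55:00+05:30',
                      freq='5Min', name='timestamp')
# the 'timestamp' column must be timezone-aware with the same offset,
# otherwise the reindex labels will not align
df = df.set_index('timestamp').reindex(dates)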
Resample and refill
You can also use resample here using asfreq(), then do the filling as before. This is convenient for finding the dates (and should get around the timezone stuff):
# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df = df.resample('5T').asfreq()
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)
Same output:
timestamp flowers total_flowers user
0 2020-01-01 00:05:00 15 15.0 xyz
1 2020-01-01 00:10:00 5 20.0 xyz
2 2020-01-01 00:15:00 21 41.0 xyz
3 2020-01-01 00:20:00 0 41.0 xyz
4 2020-01-01 00:25:00 0 41.0 xyz
5 2020-01-01 00:30:00 0 41.0 xyz
6 2020-01-01 00:35:00 1 42.0 xyz
I couldn't find a way to do the filling during the resampling. For instance, using
df = df.resample('5T').agg({'flowers': 'sum',
                            'total_flowers': 'ffill',
                            'user': 'ffill'})
does not work (it gets you to the same place as asfreq, but there is more room to accidentally miss columns). This is odd, because applying ffill over the whole DataFrame does forward-fill the missing data (but we only want that for some columns, and the user column also gets dropped). Simply using asfreq and doing the filling after the fact seems fine to me with few columns.
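As a hedged aside (not in the original answer): the per-column filling after asfreq can also be written compactly with a dict passed to fillna, followed by a frame-wide ffill for the remaining columns:
# sketch, assuming df is already indexed by timestamp as above
df = df.resample('5T').asfreq()
df = df.fillna({'flowers': 0}).ffill()   # zeros for flowers, forward fill for the rest
df.index.name = 'timestamp'
df = df.reset_index()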
crossed with #Tom
You are almost there:
df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
                   'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00', '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
                   'flowers': [15, 5, 21, 1],
                   'total_flowers': [15, 20, 41, 42]
                   })
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
leads to the following output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times using a different method for each column. For user and total, I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and is there a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, yet resampled = df.resample('5T').ffill() would work for every column (which is undesired here, since it would also forward fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but it feels silly, given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't work out what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred']*len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
I'm trying to merge two dataframes by time with multiple matches. I'm looking for all the instances of df2 whose timestamp falls 7 days or less before endofweek in df1. There may be more than one record that fits the case, and I want all of the matches, not just the first or last (which pd.merge_asof does).
import pandas as pd
df1 = pd.DataFrame({'endofweek': ['2019-08-31', '2019-08-31', '2019-09-07', '2019-09-07', '2019-09-14', '2019-09-14'], 'GroupCol': [1234,8679,1234,8679,1234,8679]})
df2 = pd.DataFrame({'timestamp': ['2019-08-30 10:00', '2019-08-30 10:30', '2019-09-07 12:00', '2019-09-08 14:00'], 'GroupVal': [1234, 1234, 8679, 1234], 'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3']})
df1['endofweek'] = pd.to_datetime(df1['endofweek'])
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
I've tried
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
but that gets me
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 NaT NaN NaN
3 2019-09-07 8679 NaT NaN NaN
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
I'm losing the text 1234_1. Is there a way to do a sort of outer join for pd.merge_asof, where I can keep all of the instances of df2 and not just the first or last?
My ideal result would look like this (assuming that the endofweek times are treated like 00:00:00 on that date):
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
pd.merge_asof only does a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, it's more intuitive and faster to do pd.merge_asof on both data sources in different directions, and then do an outer join to combine them.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta('7d'), direction='backward',
                           left_on='endofweek', right_on='timestamp',
                           left_by='GroupCol', right_by='GroupVal')
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta('7d'), direction='forward',
                            left_on='timestamp', right_on='endofweek',
                            left_by='GroupVal', right_by='GroupCol')
merged = (left_merge.merge(right_merge, how="outer")
          .sort_values(['endofweek', 'GroupCol', 'timestamp'])
          .reset_index(drop=True))
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
In addition, it is much faster than my other answer:
import time
n = 1000
start = time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
              .sort_values(['endofweek', 'GroupCol', 'timestamp'])
              .reset_index(drop=True))
end = time.time()
end - start
15.040804386138916
One way I tried is using groupby on one data frame, and then subsetting the other one in a pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek']).
          apply(lambda x: pd.merge_ordered(x, df2[(
              (df2['GroupVal'] == x.name[0])
              & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d')))],
              left_on='endofweek', right_on='timestamp')))
merged
endofweek GroupCol timestamp GroupVal TextVal
GroupCol endofweek
1234 2019-08-31 0 NaT NaN 2019-08-30 10:00:00 1234.0 1234_1
1 NaT NaN 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
2019-09-07 0 2019-09-07 1234.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-08 14:00:00 1234.0 1234_3
1 2019-09-14 1234.0 NaT NaN NaN
8679 2019-08-31 0 2019-08-31 8679.0 NaT NaN NaN
2019-09-07 0 2019-09-07 8679.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-07 12:00:00 8679.0 8679_1
1 2019-09-14 8679.0 NaT NaN NaN
merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
                                     .fillna(method="bfill"))
merged.reset_index(drop=True, inplace=True)
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234.0 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234.0 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
3 2019-09-07 1234.0 NaT NaN NaN
4 2019-09-14 1234.0 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 1234.0 NaT NaN NaN
6 2019-08-31 8679.0 NaT NaN NaN
7 2019-09-07 8679.0 NaT NaN NaN
8 2019-09-14 8679.0 2019-09-07 12:00:00 8679.0 8679_1
9 2019-09-14 8679.0 NaT NaN NaN
However, this approach seems very slow to me:
import time
n = 1000
start = time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek']).
              apply(lambda x: pd.merge_ordered(x, df2[(
                  (df2['GroupVal'] == x.name[0])
                  & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d')))],
                  left_on='endofweek', right_on='timestamp')))
end = time.time()
end - start
40.72932052612305
I would greatly appreciate any improvements!