I have a ProductDf which has many versions of the same product. I want to filter out the last iteration of each product. So I did this as below:
productIndexDf = ProductDf.groupby('productId').apply(
    lambda x: x['startDtTime'].reset_index()).reset_index()
productToPick = productIndexDf.groupby('productId')['index'].max()

# get the value of productToPick into a string
productIndex = productToPick.to_string(header=False, index=False).replace('\n', ' ')
productIndex = productIndex.split()
productIndex = list(map(int, productIndex))
productIndex.sort()
productIndexStr = ','.join(str(e) for e in productIndex)
Once I get that in a Series, if I call the iloc function manually with a list of integers, it works:
filteredProductDf = ProductDf.iloc[[7,8],:]
If I pass it the string, I get an error:
filteredProductDf = ProductDf.iloc[productIndexStr,:]
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I also tried this:
filteredProductDf = ProductDf[productIndexStr]
But then I get this issue:
KeyError: '7,8'
The pandas DataFrame iloc method works only with integer-based indexers. If you want to use string labels to access data from a pandas DataFrame, you have to use the DataFrame loc method instead.
You can read more about these methods at the links below.
Use of Pandas Dataframe iloc method
Use of Pandas Dataframe loc method
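For example, if you already have the comma-separated string from your question, a minimal sketch (reusing your variable names) converts it back into a list of integers before handing it to iloc:
# iloc needs a list-like of integers, not the string '7,8'
positions = [int(i) for i in productIndexStr.split(',')]
filteredProductDf = ProductDf.iloc[positions, :]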
OK, I think you are overcomplicating this.
Given a dataframe that looks like this:
avgPrice productId startDtTime totalSold
0 42.5 A001 01/05/2018 100
1 55.5 A001 02/05/2018 150
2 48.5 A001 03/05/2018 300
3 42.5 A002 01/05/2018 220
4 53.5 A002 02/05/2018 250
I assume that you are interested in rows 2 and 4 (the last value for each productId). In pandas the easiest way would be to use drop_duplicates() with the parameter keep='last'. Consider this example:
import pandas as pd
d = {'startDtTime': {0: '01/05/2018', 1: '02/05/2018',
2: '03/05/2018', 3: '01/05/2018', 4: '02/05/2018'},
'totalSold': {0: 100, 1: 150, 2: 300, 3: 220, 4: 250},
'productId': {0: 'A001', 1: 'A001', 2: 'A001', 3: 'A002', 4: 'A002'},
'avgPrice': {0: 42.5, 1: 55.5, 2: 48.5, 3: 42.5, 4: 53.5}
}
# Recreate dataframe
ProductDf = pd.DataFrame(d)
# Convert column with dates to datetime objects
# (note: pass dayfirst=True to pd.to_datetime if your dates are DD/MM/YYYY)
ProductDf['startDtTime'] = pd.to_datetime(ProductDf['startDtTime'])
# Sort values by productId and startDtTime to ensure correct order
ProductDf.sort_values(by=['productId','startDtTime'], inplace=True)
# Drop the duplicates
ProductDf.drop_duplicates(['productId'], keep='last', inplace=True)
print(ProductDf)
And you get:
avgPrice productId startDtTime totalSold
2 48.5 A001 2018-03-05 300
4 53.5 A002 2018-02-05 250
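If you also want to keep your index-based approach, here is a sketch of the same idea with idxmax, which returns the index label of the latest startDtTime per productId and can be passed straight to loc:
# starting from the original ProductDf (before drop_duplicates):
# index labels of the most recent startDtTime for each productId
idx = ProductDf.groupby('productId')['startDtTime'].idxmax()
filteredProductDf = ProductDf.loc[idx]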
Related
I have this main dataframe that I wish to populate (blanks are NaNs):
final_outcome =
bankId latestPeriodEndDate tier1Ratio-mrq-0 tier1Ratio-mrq-1 leverageRatio-mrq-0 leverageRatio-mrq-1
0 1004381
The other two dataframes I wish to use to populate this one are:
mrq_0:
bankId tier1Ratio-mrq-0 leverageRatio-mrq-0
0 1004381 21.36 11.45
mrq_1:
bankId tier1Ratio-mrq-1 leverageRatio-mrq-1
0 1004381 15.82 8.65
What I have tried is a cascade of merges like this:
final_outcome = final_outcome.merge(mrq_0, on = 'bankId').merge(mrq_1, on = 'bankId')
Or using this:
final_outcome.merge(mrq_0, on =['bankId', 'tier1Ratio-mrq-0', 'leverageRatio-mrq-0']).merge(mrq_1, on = ['bankId', 'tier1Ratio-mrq-1', 'leverageRatio-mrq-1'])
But unfortunately, the outcome adds additional columns with suffixes (for this one I'll paste a screenshot for better readability):
Now, the outcome I desire is just a "more populated" version of final_outcome, ideally would look like something like this:
bankId latestPeriodEndDate tier1Ratio-mrq-0 tier1Ratio-mrq-1 leverageRatio-mrq-0 leverageRatio-mrq-1
0 1004381 21.36 15.82 11.45 8.65
How can I achieve this? Thanks in advance.
Given:
import numpy as np
import pandas as pd

d0 = {'bankId': {0: 1004381}, 'tier1Ratio-mrq-0': {0: 21.36}, 'leverageRatio-mrq-0': {0: 11.45}}
df0 = pd.DataFrame(d0)
d1 = {'bankId': {0: 1004381}, 'tier1Ratio-mrq-1': {0: 15.82}, 'leverageRatio-mrq-1': {0: 8.65}}
df1 = pd.DataFrame(d1)
Doing:
final_outcome = df0.merge(df1)
final_outcome['latestPeriodEndDate'] = np.nan
print(final_outcome)
Output:
bankId tier1Ratio-mrq-0 leverageRatio-mrq-0 tier1Ratio-mrq-1 leverageRatio-mrq-1 latestPeriodEndDate
0 1004381 21.36 11.45 15.82 8.65 NaN
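If more monthly frames arrive later (mrq_2, mrq_3, ... are hypothetical here), the same merge can be chained with functools.reduce; a minimal sketch, assuming every frame shares the bankId column:
from functools import reduce

frames = [df0, df1]  # append further mrq frames here as needed
final_outcome = reduce(lambda left, right: left.merge(right, on='bankId'), frames)
final_outcome['latestPeriodEndDate'] = np.nan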
I have the following pandas DataFrame:
df = pd.DataFrame({
'id': [1, 1, 1, 2],
'r': [1000, 1300, 1400, 1100],
's': [650, 720, 565, 600]
})
I'd like to aggregate the DataFrame and create a new column which is the range of r values, from the 25th to the 75th percentile. The aggregate for the s column is the mean.
If there is only one observation in a group, then keep the observation as it is.
Expected output:
id r s
1 1075 - 1325 645
2 1100 600
Here is one option, using Groupby.agg, quantile, and a regex.
NB. I am not fully sure which interpolation method you expect for the quantiles (check the linked documentation, there are several options).
import re
out = (df
       .groupby('id')
       .agg({'r': lambda x: re.sub(r'(\d+(?:\.\d+)?) - \1', r'\1',
                                   x.quantile([0.25, 0.75])
                                    .astype(str).str.cat(sep=' - ')),
             's': 'mean'})
)
Output:
r s
id
1 1150.0 - 1350.0 645.0
2 1100.0 600.0
Option two:
g_id = df.groupby('id')
g_id['r'].quantile([.25, .75])\
         .unstack()\
         .assign(s=g_id['s'].agg('mean'))
Output:
      0.25    0.75      s
id
1   1150.0  1350.0  645.0
2   1100.0  1100.0  600.0
Details:
Create a groupby object g_id, which we will use twice.
g_id['r'].quantile([.25, .75]) returns a MultiIndex Series with the outer level as id and the inner level as the labels for percentiles 25 and 75. You can then unstack this inner level to create columns. Lastly, we assign a new column to this dataframe with the 's' column of g_id aggregated using the mean.
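If you also want the 'low - high' string from the expected output, a small follow-up sketch collapses the two quantile columns into one, keeping a single value when both ends coincide:
out = g_id['r'].quantile([.25, .75]).unstack().assign(s=g_id['s'].mean())
low, high = out[0.25].astype(str), out[0.75].astype(str)
# keep a single value where the 25th and 75th percentiles are equal
out['r'] = low.where(out[0.25] == out[0.75], low + ' - ' + high)
print(out[['r', 's']])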
Say I have three lists of paired numeric data in Python. The lists are for the day of year (number between 1-365), the hour of day (number between 0-24), and the corresponding temperature at that time. I have provided example lists below:
day_of_year = [1,1,1,1,1,1,1,1,1,1,1,1]  # day = Jan 1 in this example
hour_of_day = [2,4,6,8,10,12,14,16,18,20,22,24]
temperature =[23.1,22.0,24.1,26.5,23.8,40.1,32.7,41.3,29.4,36.4,22.0,24.1]
I have these hourly paired data for a location for an entire year (I've just shown simplified lists above). So for each day I have 24 day_of_year values (the same number repeated; in this example day = 1) and 24 temperature values, since they're hourly data. I'm trying to design a for loop that lets me iterate through these data to calculate and use the maximum and minimum temperature for each day of the year, since another function my code uses needs those values. What would be the best way to reference all the temperature values where day_of_year is the same, in order to calculate max and min temperatures for every day?
I have a function that takes the following inputs:
minimum_temp_today, minimum_temp_tomorrow, maximum_temp_today, maximum_temp_yesterday
I need to figure out how to pull out those values for each day of the year. I am looking for suggestions on the best way to do this. Any suggestions/tips would be super appreciated!
There are a lot of ways you could approach this, depending on what data structures you want to use. If you don't care about when the min and max occur, then personally I'd do something like this.
from collections import defaultdict

daily_temps = defaultdict(list)
for day, value in zip(day_of_year, temperature):
    daily_temps[day].append(value)

ranges = dict()
for day, values in daily_temps.items():
    ranges[day] = (min(values), max(values))
Basically, you're constructing an intermediate dict that maps each day of the year to a list of all the measurements for that day. Then in the second step you use that dict to create your final dict which maps each day of the year to a tuple which is the minimum and maximum value recorded for that day.
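From there, feeding a function like yours is just dictionary lookups on neighbouring days. A sketch, where consecutive day numbers are assumed and my_function is a hypothetical stand-in for your real function:
for day in sorted(ranges):
    min_today, max_today = ranges[day]
    # look up the neighbouring days; None when there is no such day
    min_tomorrow = ranges.get(day + 1, (None, None))[0]
    max_yesterday = ranges.get(day - 1, (None, None))[1]
    # my_function(min_today, min_tomorrow, max_today, max_yesterday)  # hypothetical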
You could use pandas, which handles this quite efficiently. I am using pandas 1.0.1. We end up using named aggregation for this task.
import pandas as pd
df = pd.DataFrame({'day_of_year': day_of_year, 'hour_of_day': hour_of_day, 'temperature': temperature})
print(df)
day_of_year hour_of_day temperature
0 1 2 23.1
1 1 4 22.0
2 1 6 24.1
3 1 8 26.5
4 1 10 23.8
5 1 12 40.1
6 1 14 32.7
7 1 16 41.3
8 1 18 29.4
9 1 20 36.4
10 1 22 22.0
11 1 24 24.1
df.groupby('day_of_year') \
  .agg(min_temp=('temperature', 'min'),
       max_temp=('temperature', 'max')) \
  .reset_index() \
  .to_dict('records')
[{'day_of_year': 1, 'min_temp': 22.0, 'max_temp': 41.3}]
Now suppose we have data for more than one day.
day_of_year min_temp max_temp
0 1.0 22.0 41.3
1 2.0 24.0 26.0
2 3.0 24.5 42.3
grouped = df.groupby('day_of_year') \
            .agg(min_temp=('temperature', 'min'),
                 max_temp=('temperature', 'max')) \
            .reset_index()

tmrw = grouped.shift(-1) \
              .rename(columns={'min_temp': 'min_temp_tmrw',
                               'max_temp': 'max_temp_tmrw'}) \
              .drop('day_of_year', axis=1)
pd.concat([grouped, tmrw], axis=1).to_dict('records')
[{'day_of_year': 1.0,
'min_temp': 22.0,
'max_temp': 41.3,
'min_temp_tmrw': 24.0,
'max_temp_tmrw': 26.0},
{'day_of_year': 2.0,
'min_temp': 24.0,
'max_temp': 26.0,
'min_temp_tmrw': 24.5,
'max_temp_tmrw': 42.3},
{'day_of_year': 3.0,
'min_temp': 24.5,
'max_temp': 42.3,
'min_temp_tmrw': nan,
'max_temp_tmrw': nan}]
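Since the function in the question also needs maximum_temp_yesterday, the same shift trick works in the other direction; a sketch building on grouped and tmrw from above:
yday = grouped.shift(1)[['max_temp']] \
              .rename(columns={'max_temp': 'max_temp_yday'})
pd.concat([grouped, tmrw, yday], axis=1).to_dict('records')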
I have two dataframes with different timeseries data (see example below). Whereas Dataframe1 contains multiple daily observations per month, Dataframe2 only contains one observation per month.
What I want to do now is to align the data in Dataframe2 with the last day every month in Dataframe1. The last day per month in Dataframe1 does not necessarily have to be the last day of that respective calendar month.
I'm grateful for any hints on how to tackle this problem in an efficient manner (as the dataframes can be quite large).
Dataframe1
----------------------------------
date A B
1980-12-31 152.799 209.132
1981-01-01 152.799 209.132
1981-01-02 152.234 209.517
1981-01-05 152.895 211.790
1981-01-06 155.131 214.023
1981-01-07 152.596 213.044
1981-01-08 151.232 211.810
1981-01-09 150.518 210.887
1981-01-12 149.899 210.340
1981-01-13 147.588 207.621
1981-01-14 148.231 208.076
1981-01-15 148.521 208.676
1981-01-16 148.931 209.278
1981-01-19 149.824 210.372
1981-01-20 149.849 210.454
1981-01-21 150.353 211.644
1981-01-22 149.398 210.042
1981-01-23 148.748 208.654
1981-01-26 148.879 208.355
1981-01-27 148.671 208.431
1981-01-28 147.612 207.525
1981-01-29 147.153 206.595
1981-01-30 146.330 205.558
1981-02-02 145.779 206.635
Dataframe2
---------------------------------
date C D
1981-01-13 53.4 56.5
1981-02-15 52.2 60.0
1981-03-15 51.8 58.0
1981-04-14 51.8 59.5
1981-05-16 50.7 58.0
1981-06-15 50.3 59.5
1981-07-15 50.6 53.5
1981-08-17 50.1 44.5
1981-09-12 50.6 38.5
To provide a readable example, I prepared test data as follows:
df1 - A couple of observations from January and February:
date A B
0 1981-01-02 152.234 209.517
1 1981-01-07 152.596 213.044
2 1981-01-13 147.588 207.621
3 1981-01-20 151.232 211.810
4 1981-01-27 150.518 210.887
5 1981-02-05 149.899 210.340
6 1981-02-14 152.895 211.790
7 1981-02-16 155.131 214.023
8 1981-02-21 180.000 200.239
df2 - Your data, also from January and February:
date C D
0 1981-01-13 53.4 56.5
1 1981-02-15 52.2 60.0
Both dataframes have a date column of datetime type.
Start by getting the last observation in each month from df1:
res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)
The result, for my data, is:
date A B
4 1981-01-27 150.518 210.887
8 1981-02-21 180.000 200.239
Then, to join observations, the join must be performed on the
whole month period, not the exact date. To do this, run:
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
df2.assign(month=df2['date'].dt.to_period('M')),
how='left', on='month', suffixes=('_1', '_2'), )
The result is:
date_1 A B month date_2 C D
0 1981-01-27 150.518 210.887 1981-01 1981-01-13 53.4 56.5
1 1981-02-21 180.000 200.239 1981-02 1981-02-15 52.2 60.0
If you want the merge to include data only for months where there
is at least one observation in both df1 and df2, drop the how parameter.
Its default value is 'inner', which is the correct mode in this case.
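If both frames are sorted by date, pd.merge_asof is a possible alternative; a sketch, with the caveat that direction='nearest' matches each month-end row to the closest df2 observation, which coincides with the month-based join only when df2 has at most one observation per month:
# pairs each last-day-of-month row from res1 with the nearest df2 row by date
res_asof = pd.merge_asof(res1.sort_values('date'),
                         df2.sort_values('date'),
                         on='date', direction='nearest')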
When you have a sample dataframe, you can provide code that recreates it. Simply print each column as a list (steps 1 and 2) and use those lists to build the dataframe in code (steps 3 and 4).
import pandas as pd
# Step 1: create your dataframe, and print each column as a list, copy-paste into code example below.
df_1 = pd.read_csv('dataset1.csv')
print(list(df_1['date']))
print(list(df_1['A']))
print(list(df_1['B']))
# Step 2: create your dataframe, and print each column as a list, copy-paste into code example below.
df_2 = pd.read_csv('dataset2.csv')
print(list(df_2['date']))
print(list(df_2['C']))
print(list(df_2['D']))
# Step 3: create sample dataframe ... good if you can provide this in your future questions
df_1 = pd.DataFrame({
'date': ['12/31/1980', '1/1/1981', '1/2/1981', '1/5/1981', '1/6/1981',
'1/7/1981', '1/8/1981', '1/9/1981', '1/12/1981', '1/13/1981',
'1/14/1981', '1/15/1981', '1/16/1981', '1/19/1981', '1/20/1981',
'1/21/1981', '1/22/1981', '1/23/1981', '1/26/1981', '1/27/1981',
'1/28/1981', '1/29/1981', '1/30/1981', '2/2/1981'],
'A': [152.799, 152.799, 152.234, 152.895, 155.131,
152.596, 151.232, 150.518, 149.899, 147.588,
148.231, 148.521, 148.931, 149.824, 149.849,
150.353, 149.398, 148.748, 148.879, 148.671,
147.612, 147.153, 146.33, 145.779],
'B': [209.132, 209.132, 209.517, 211.79, 214.023,
213.044, 211.81, 210.887, 210.34, 207.621,
208.076, 208.676, 209.278, 210.372, 210.454,
211.644, 210.042, 208.654, 208.355, 208.431,
207.525, 206.595, 205.558, 206.635]
})
# Step 4: create sample dataframe ... good if you can provide this in your future questions
df_2 = pd.DataFrame({
'date': ['1/13/1981', '2/15/1981', '3/15/1981', '4/14/1981', '5/16/1981',
'6/15/1981', '7/15/1981', '8/17/1981', '9/12/1981'],
'C': [53.4, 52.2, 51.8, 51.8, 50.7, 50.3, 50.6, 50.1, 50.6],
'D': [56.5, 60.0, 58.0, 59.5, 58.0, 59.5, 53.5, 44.5, 38.5]
})
# Step 5: make sure the date field is actually a date, not a string
df_1['date'] = pd.to_datetime(df_1['date']).dt.date
# Step 6: create new column with year and month
df_1['date_year_month'] = pd.to_datetime(df_1['date']).dt.to_period('M')
# Step 7: create boolean mask that grabs the max date for each year-month
mask_last_day_month = df_1.groupby('date_year_month')['date'].transform(max) == df_1['date']
# Step 8: create new dataframe with only last day of month
df_1_max = df_1.loc[mask_last_day_month]
print('here is dataframe 1 with only last day in the month')
print(df_1_max)
print()
# Step 9: make sure the date field is actually a date, not a string
df_2['date'] = pd.to_datetime(df_2['date']).dt.date
# Step 10: create new column with year and month
df_2['date_year_month'] = pd.to_datetime(df_2['date']).dt.to_period('M')
print('here is the original dataframe 2')
print(df_2)
print()
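The answer stops just before the final join; a minimal sketch of the remaining step, merging the two prepared frames on the year-month column built above:
# Step 11: align df_2's monthly observation with the last day per month in df_1
result = df_1_max.merge(df_2, on='date_year_month', how='left',
                        suffixes=('_df1', '_df2'))
print('here is the combined dataframe')
print(result)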
I'm looking for a faster approach to improve the performance of my solution for the following problem: a certain DataFrame has two columns with a few NaN values in them. The challenge is to replace these NaNs with values from a secondary DataFrame.
Below I'll share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame with a few columns and some of them have rows with NaN values:
As you can see from the image above, columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values of these columns by looking into a second DataFrame called date_info_df, which looks like this:
By using the values from column visit_date in merged_df it is possible to search the second DataFrame on calendar_date and find equivalent matches. This method allows to get the values for day_of_week and holiday_flg from the second DataFrame.
The end result for this exercise is a DataFrame that looks like this:
You'll notice the approach I'm using relies on apply() to execute a custom function on every row of merged_df:
For every row, search for NaN values in day_of_week and holiday_flg;
When a NaN is found in either or both of these columns, use the date in that row's visit_date to find an equivalent match in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
After a successful match, the value from date_info_df['day_of_week'] must be copied into merged_df['day_of_week'], and the value from date_info_df['holiday_flg'] must be copied into merged_df['holiday_flg'].
Here is a working source code:
import math
import pandas as pd
import numpy as np
from IPython.display import display
### Data for df
data = { 'air_store_id': [ 'air_a1', 'air_a2', 'air_a3', 'air_a4' ],
'area_name': [ 'Tokyo', np.nan, np.nan, np.nan ],
'genre_name': [ 'Japanese', np.nan, np.nan, np.nan ],
'hpg_store_id': [ 'hpg_h1', np.nan, np.nan, np.nan ],
'latitude': [ 1234, np.nan, np.nan, np.nan ],
'longitude': [ 5678, np.nan, np.nan, np.nan ],
'reserve_datetime': [ '2017-04-22 11:00:00', np.nan, np.nan, np.nan ],
'reserve_visitors': [ 25, 35, 45, np.nan ],
'visit_datetime': [ '2017-05-23 12:00:00', np.nan, np.nan, np.nan ],
'visit_date': [ '2017-05-23' , '2017-05-24', '2017-05-25', '2017-05-27' ],
'day_of_week': [ 'Tuesday', 'Wednesday', np.nan, np.nan ],
'holiday_flg': [ 0, np.nan, np.nan, np.nan ]
}
merged_df = pd.DataFrame(data)
display(merged_df)
### Data for date_info_df
data = { 'calendar_date': [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ],
'day_of_week': [ 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday' ],
'holiday_flg': [ 0, 0, 0, 0, 1, 1 ]
}
date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
display(date_info_df)
# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']
    holiday = row['holiday_flg']
    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if type(weekday) == float and math.isnan(weekday):
        search_date = row['visit_date']
        #print(' --> weekday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        weekday = date_info_df.at[idx, 'day_of_week']
        #print(' --> weekday search_date=', search_date, 'is', weekday)
        row['day_of_week'] = weekday
    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if type(holiday) == float and math.isnan(holiday):
        search_date = row['visit_date']
        #print(' --> holiday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        holiday = date_info_df.at[idx, 'holiday_flg']
        #print(' --> holiday search_date=', search_date, 'is', holiday)
        row['holiday_flg'] = int(holiday)
    return row

# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1)
# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
display(merged_df)
I did a few measurements so you can understand the struggle:
On a DataFrame with 6 rows, apply() takes 3.01 ms;
On a DataFrame with ~250000 rows, apply() takes 2min 51s.
On a DataFrame with ~1215000 rows, apply() takes 4min 2s.
How do I improve the performance of this task?
You can use an index to speed up the lookup and combine_first() to fill the NaNs:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
The result:
day_of_week holiday_flg
0 Tuesday 0.0
1 Wednesday 0.0
2 Thursday 0.0
3 Saturday 1.0
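As in the original code, you may want to cast holiday_flg back to int afterwards; a one-liner, assuming no NaNs remain in the column:
# remove the decimal places left over from the float NaN representation
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)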
This is one solution. It should be efficient as there is no explicit merge or apply.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))
Result
air_store_id area_name day_of_week genre_name holiday_flg hpg_store_id \
0 air_a1 Tokyo Tuesday Japanese 0.0 hpg_h1
1 air_a2 NaN Wednesday NaN 0.0 NaN
2 air_a3 NaN Thursday NaN 0.0 NaN
3 air_a4 NaN Saturday NaN 1.0 NaN
latitude longitude reserve_datetime reserve_visitors visit_date \
0 1234.0 5678.0 2017-04-22 11:00:00 25.0 2017-05-23
1 NaN NaN NaN 35.0 2017-05-24
2 NaN NaN NaN 45.0 2017-05-25
3 NaN NaN NaN NaN 2017-05-27
visit_datetime
0 2017-05-23 12:00:00
1 NaN
2 NaN
3 NaN
Explanation
s is a pd.Series mapping calendar_date to day_of_week from date_info_df; t similarly maps day_of_week to holiday_flg.
Use pd.Series.map, which takes a pd.Series as input, to fill the missing values where possible.
Edit: one can also use merge to solve the problem. It is 10 times faster than the old approach. (You need to make sure "visit_date" and "calendar_date" are of the same format.)
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
left_on="visit_date",
right_on="calendar_date",
how="left") # outer should also work
With this approach, the desired result will be in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and the map approach, we don't use the old "day_of_week" and "holiday_flg" at all; we just need to map the results from date_info_df onto merged_df.
merge can also do the job because date_info_df's entries are unique (no duplicates will be created).
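To actually replace the old columns after this merge, a small clean-up sketch (the _x/_y suffixes are what pandas generates for the overlapping column names):
res = merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                      left_on="visit_date", right_on="calendar_date", how="left")
# drop the stale originals and the join key, then restore the column names
res = res.drop(columns=["day_of_week_x", "holiday_flg_x", "calendar_date"]) \
         .rename(columns={"day_of_week_y": "day_of_week",
                          "holiday_flg_y": "holiday_flg"})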
You can also try using pandas.Series.map. What it does is
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set "calendar_date" as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note that merged_df.visit_date was originally of string type. Thus, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to convert it to datetime.
Timings, using the date_info_df and merged_df datasets provided by karlphillip:
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I proprose on the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can see that HYRY's method runs 3 times faster if the result is assigned back to merged_df. This is why I thought HYRY's method was faster than mine at first glance. I suspect the reason lies in the nature of combine_first: its speed likely depends on how sparse merged_df is. While assigning the results back, the columns become full; therefore, when rerunning it, it is faster.
The performances of the merge and combine_first methods are nearly equivalent. There may be circumstances where one is faster than the other; it is left to each user to do some tests on their own datasets.
Another thing to note about the two methods: the merge method assumes every date in merged_df is contained in date_info_df. If some dates are in merged_df but not in date_info_df, the merge returns NaN for them, and those NaNs can overwrite parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria.
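A sketch of that safer combination, assuming merged_df has a default RangeIndex so the merged result aligns row-for-row: merge as above, but fill only the holes instead of overwriting, so rows missing from date_info_df keep their original values:
res = merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                      left_on="visit_date", right_on="calendar_date",
                      how="left", suffixes=("", "_new"))
# combine_first keeps existing values and only fills the NaNs
merged_df["day_of_week"] = merged_df["day_of_week"].combine_first(res["day_of_week_new"])
merged_df["holiday_flg"] = merged_df["holiday_flg"].combine_first(res["holiday_flg_new"])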