Merge FAST two DataFrames based on specific conditions, row by row - python

I have been struggling with my case for the past 10 days and I can't find a fast and efficient solution.
Here is the case. I have one DF containing web traffic data of a human resources website.
Every row of this DataFrame refers to an application (i.e. someone reached the website via a specific web source and applied to a specific job offer at a specific time).
Here is an example :
import pandas as pd
web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)
The second DataFrame is an extract of a Human Resources internal tool where we gather all the received applications with some complementary data. Here is an example:
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '343434', '010101'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-09 12:10:41', '2019-11-01 00:19:22'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)
Here are the difficulties I am facing :
There is no unique key I can use to merge those two tables. I have to use the "job_id" (which is unique to every job) combined with the time when the application occurred, via the columns "d_date_hour_event" (on the web_data DF) and "date_time_submission" (on the hr_data DF).
For the same application, the time registered in the two tables might not be the same (a difference of a few seconds).
Some applications present in web_data might not be present in hr_data.
In the end, I would like to get one DataFrame that looks like this:
[image: result_dataframe.png]
Actually, I already coded a function to perform this merge. It looks like this:
from datetime import timedelta

for i, row in web_data.iterrows():
    # store the values needed for the hr_data lookup
    date = row.d_date_hour_event
    job = row.job_id
    # compute the time window
    inf = date - timedelta(seconds=10)
    sup = date + timedelta(seconds=10)
    # check whether there is a matching row in hr_data
    temp_df = pd.DataFrame()
    temp_df = hr_data[(hr_data.job_id == job) &
                      (hr_data.date_time_submission >= inf) &
                      (hr_data.date_time_submission <= sup)].tail(1)
    # if there is a matching row, merge it and update the web_data table
    if not temp_df.empty:
        row = row.to_frame().transpose()
        join = pd.merge(row, temp_df, how='inner', on='job_id',
                        left_index=False, right_index=True)
        web_data.update(join)
But, because my web_data has over 250K rows and my hr_data has over 140K rows, it takes hours (an estimated 35 hours of runtime...).
I am sure that iterrows is not optimal and that this code can be optimized. I tried to use a custom function with .apply(lambda x: ...), but without success.
Any help would be more than welcome!
Please let me know if you need more explanations.
Many thanks!

Let's try this in a few steps:
1. Convert the datetime columns in both dataframes to pandas datetime format.
web_data = web_data.assign(d_date_hour_event= lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))
2. Rename the job_id column of the hr_data dataframe so it will not cause any conflicts when merging.
hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
3. Make NumPy arrays from the timestamp and job_id columns of both dataframes, then use NumPy broadcasting to find the row pairs in which the web_data timestamp is within 10 seconds of the hr_data timestamp and the job_id is the same.
web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values
import numpy as np

i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids)
)
overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
4. Assign new columns to the original web_data dataframe, so these rows can be updated with all the information from any overlapping rows.
web_data = web_data.assign(candidate_id=np.nan, job_id_hr=np.nan, date_time_submission=np.datetime64('NaT'), job_label=np.nan)
Finally, just update the web_data dataframe (or first create a copy if you don't want to overwrite the original dataframe):
web_data.update(overlapping_rows)
This should be a lot faster than iterating over all rows.
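If both frames are sorted by their time columns, a similar result can also be sketched with pd.merge_asof, which does a nearest-key join within a tolerance. The sketch below keeps job_id under the same name in both frames (i.e. before any renaming) and assumes at most one relevant submission per application:
import pandas as pd

# assumes both time columns have already been converted with pd.to_datetime
web_sorted = web_data.sort_values('d_date_hour_event')
hr_sorted = hr_data.sort_values('date_time_submission')

merged = pd.merge_asof(
    web_sorted, hr_sorted,
    left_on='d_date_hour_event',
    right_on='date_time_submission',
    by='job_id',                    # exact match on the job
    tolerance=pd.Timedelta('10s'),  # at most 10 seconds apart
    direction='nearest'             # closest submission, before or after
)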

This is the code I am using (which does not work unless you make the changes I described in the comments):
web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)
# placed '010101' and '2019-11-01 00:19:22' in second position instead of third position as they used to be
# if you switch these values back to third position in the 'job_id' and 'date_time_submission' arrays respectively, it should work
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '010101', '343434'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-01 00:19:22', '2019-11-09 12:10:41'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)
hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
web_data = web_data.assign(d_date_hour_event= lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))
web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values
i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids[:, None])
)
overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
overlapping_rows

Related

trying to figure out a pythonic way of code that is taking time even after using list comprehension and pandas

I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
# Initial call
init_df = init_df.assign(HighestHigh=lambda x:
                         highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))
# HighestHigh function in lambda call
def highestHigh(date1, ticker, allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns=['DateIdentified', 'Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x, y, allp_df, next(counter))
                 for x, y in zip(temp_df['DateIdentified'], temp_df['Ticker'])]
    return high_list
# Getting high for each ticker
def getHigh(dateidentified, ticker, allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker'] == ticker) &
                            (allp_df['date'] > dateidentified) &
                            (allp_df['date'] <= currDate)), ['high', 'date']]
    hh = allpm_df.iloc[:, 0].max()
    hd = allpm_df.loc[(allpm_df['high'] == hh), 'date']
    hh = round(hh, 2)
    h_list = [hh, hd]
    return h_list
# Split the list into 2 columns, one with the price and the other with the corresponding date
init_df = split_columns(init_df, "HighestHigh")
# The function to split the list elements into different columns
def split_columns(orig_df, col):
    split_df = pd.DataFrame(orig_df[col].tolist(), columns=[col + "Mod", col + "Date"])
    split_df[col + "Date"] = split_df[col + "Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df, split_df], axis=1)
    orig_df = orig_df.drop(col, axis=1)
    orig_df = orig_df.rename(columns={col + "Mod": col})
    return orig_df
There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax()]
This will replace two O(N) operations (finding the maximum in a list, and doing a list lookup using a comparison) with one O(N) operation and one O(1) operation.
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would set up your data frame to include a multi-index when you first create the data frame:
index = pd.MultiIndex.from_arrays(arrays=[ticker_symbols, dates], names=['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index=index)
allp_df = allp_df.sort_index(level=0, sort_remaining=True)
This should create a dataframe with a sorted, multi-level index associated with your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date-range by doing this:
allp_df.loc[(ticker, dateidentified):(ticker, currDate)]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.
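As a small self-contained sketch of that pattern (with made-up ticker and date data, not the asker's frames):
import numpy as np
import pandas as pd

# toy price data: two tickers, five days each (hypothetical values)
tickers = ['AAA'] * 5 + ['BBB'] * 5
dates = list(pd.date_range('2022-01-01', periods=5)) * 2
prices = pd.DataFrame(
    {'high': np.arange(10, dtype=float)},
    index=pd.MultiIndex.from_arrays([tickers, dates], names=['Symbol', 'Date'])
).sort_index()

# all 'AAA' rows between two dates, via a slice on the sorted MultiIndex
start, stop = pd.Timestamp('2022-01-02'), pd.Timestamp('2022-01-04')
window = prices.loc[('AAA', start):('AAA', stop)]

# index label of the highest 'high' in that window, then the row itself
best = window.loc[window['high'].idxmax()]
print(best)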

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
    pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
    temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
    new_time = []
    new_pwr = []
    new_tmp = []
    for i in range(1, len(pwr_data)):
        for j in range(1, len(temp_data)):
            if pwr_data['time'][i] == temp_data['Date'][j]:
                time = pwr_data['time'][i]
                pwr = pwr_data['watt_hour'][i]
                tmp = temp_data['Temp'][j]
                new_time.append(time)
                new_pwr.append(pwr)
                new_tmp.append(tmp)
    return pd.DataFrame({'Time': new_time, 'watt_hour': new_pwr, 'Temp': new_tmp})
I was trying to collect data with matching time indices so that I can assemble them in a data frame.
The code works well but it takes time (43 seconds for around 1300 data points). At the moment I don't have much data, but I was wondering if there is a more efficient and faster way to do this.
Do the pwr_data['time'] and temp_data['Date'] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
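If the granularities differ (say one file is hourly and the other minute-level), one possible approach, sketched here with hypothetical helper columns, is to floor both timestamps to a common unit before merging:
# align differing time resolutions before the merge
pwr_data['time_hour'] = pd.to_datetime(pwr_data['time']).dt.floor('H')
temp_data['date_hour'] = pd.to_datetime(temp_data['Date']).dt.floor('H')

df = pd.merge(pwr_data, temp_data, how='left',
              left_on='time_hour', right_on='date_hour')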
Just like #greco recommended, this did the trick, and in no time!
pd.merge(pwr_data,temp_data,how='inner',left_on='time',right_on='Date')
'time' and 'Date' are the columns on which you want to base the merge.

Pandas DF, DateOffset, creating new column

So I'm working with the JHU covid19 data and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below attempts to take the date of each row, find the date two weeks before that for the country of that row, and build a 'Recovered' column, which is the confirmed count of two weeks ago minus the deaths of today.
Maybe a pointless exercise, but I would still like to know how to do it, haha. I know it's a big dataset and there are a lot of operations there, but I've been running it for 20 minutes now and it's still going. Did I do something wrong or would it just take this long?
Thanks for any help, friends.
urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']
full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self merge the table on date==recovery_date (plus any other keys) and subtract the numbers.
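A rough sketch of that self merge, assuming 'Province/State' and 'Country/Region' are the other keys to match on:
# shift each row's date back 14 days, then join the table onto itself
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)

merged = full_table.merge(
    full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']],
    how='left',
    left_on=['Province/State', 'Country/Region', 'Recovery_date'],
    right_on=['Province/State', 'Country/Region', 'Date'],
    suffixes=('', '_14d_ago')
)

# recovered = confirmed 14 days ago minus deaths today
merged['Recovered'] = merged['Confirmed_14d_ago'] - merged['Deaths']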

How to improve performance on average calculations in python dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as per the below;
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate','Item','StartDate','EndDate','Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)
r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True,inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have doing the calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and
PeriodEndDate are created as datetime; see the following fragment of your
initialisation code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item
(both columns compared on equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine
and got a runtime almost 10 times shorter.
Check it on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price

Pandas Columns Operations with List

I have a pandas dataframe with two columns, the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill new columns in the df with the number of dates in verification_date whose difference is either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation accordingly. You can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i and column over_360 or under_360.
You can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check dataframe.ix here.
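Note that in recent pandas versions both set_value and .ix have been removed; an equivalent sketch with .at (label-based scalar access) would be:
# same per-cell assignment with the current pandas API
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)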
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
