I am looking to compare two dataframes (df-a and df-b) and find where a given ID and date from one dataframe (df-b) sits within a date range for the matching ID in the other dataframe (df-a). I then want to take all the columns from df-a and concat them to df-b wherever they match. E.g.
If I have a dataframe df-a, in the following format
df-a:
ID Start_Date End_Date A B C D E
0 cd2 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
1 cd2 2020-06-24 2020-07-21
2 cd56 2020-06-10 2020-07-03
3 cd915 2020-04-28 2020-07-21
4 cd103 2020-04-13 2020-04-24
and df-b in
ID Date
0 cd2 2020-05-12
1 cd2 2020-04-12
2 cd2 2020-06-10
3 cd15 2020-04-28
4 cd193 2020-04-13
I would like an output df like so, df-c:
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 - - - - - - -
1 cd2 2020-04-12 - - - - - - -
2 cd2 2020-06-10 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
3 cd15 2020-04-28 - - - - - - -
4 cd193 2020-04-13 - - - - - - -
In a previous post I got a brilliant answer which allowed me to compare the dataframes and drop rows wherever this condition was met, but I am struggling to figure out how to extract the information appropriately from df-a. My current attempt is below!
df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.ID[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.ID == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].End_Date
        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d = pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j, :]], axis=0)
            #df_fin_2 = df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)
So you want to make a sort of "soft" match. Here's a solution that vectorizes the date-range check within each ID group. Note that, relative to your naming, df_a below is the frame with ID and Date (your df-b) and df_b is the frame with the date ranges (your df-a).
# Note: working with dates as strings, the inequalities only behave correctly if dates are in y-m-d format;
# otherwise it is safer to parse all date columns, e.g. `df_a.Date = pd.to_datetime(df_a.Date)`.
import numpy as np
import pandas as pd

# Create a groupby object once so we can efficiently filter df_b inside the loop.
# A good idea if df_b is considerably large and has many different IDs.
gdf_b = df_b.groupby('ID')
b_IDs = gdf_b.indices  # a dictionary of grouped rows {ID: array of integer indices}

matched = []  # collect matched rows from df_b
# Iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_a)).
for i, ID, date in df_a.itertuples():
    if ID in b_IDs:
        gID = gdf_b.get_group(ID)  # the filtered df_b for this ID
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if any(inrange):
            matched.append(
                gID.loc[inrange.idxmax()]  # the first row with date in range
                .values[1:]                # use the array without column labels and slice `ID` out
            )
        else:
            matched.append([np.nan] * (df_b.shape[1] - 1))  # no date in range, fill with NaNs
    else:
        matched.append([np.nan] * (df_b.shape[1] - 1))  # no ID match, fill with NaNs

df_c = df_a.join(pd.DataFrame(matched, columns=df_b.columns[1:]))
print(df_c)
Output
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 NaN NaN NaN NaN NaN NaN NaN
1 cd2 2020-04-12 NaN NaN NaN NaN NaN NaN NaN
2 cd2 2020-06-10 2020-06-01 2020-06-24 a b c 10.0 20.0
3 cd15 2020-04-28 NaN NaN NaN NaN NaN NaN NaN
4 cd193 2020-04-13 NaN NaN NaN NaN NaN NaN NaN
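As a follow-up to the loop-based solution above: if a plain merge of the two frames fits in memory, a more vectorized sketch is possible. This uses the question's naming (df_a holds the date ranges, df_b holds ID and Date), assumes the date columns are already parsed with pd.to_datetime, and assumes each (ID, Date) pair in df_b is unique:
import numpy as np
import pandas as pd

# Merge every df_b row with all ranges for its ID, then flag in-range matches.
merged = df_b.merge(df_a, on='ID', how='left')
inrange = merged['Date'].between(merged['Start_Date'], merged['End_Date'])

# Blank out the range columns wherever the date is not inside the matched range.
merged.loc[~inrange, df_a.columns.drop('ID')] = np.nan

# Keep one row per original df_b row, preferring an in-range match if one exists.
df_c = (merged.assign(_hit=inrange)
              .sort_values('_hit', ascending=False)
              .drop_duplicates(subset=['ID', 'Date'])
              .drop(columns='_hit')
              .sort_index())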
Related
I have sales data like this as a DataFrame; the date columns have the pandas datetime64 dtype:
Shop ID   Special Offer Start   Special Offer End
A         '2022-01-01'          '2022-01-03'
B         '2022-01-09'          '2022-01-11'
etc.
I want to transform the data into a new binary format that shows the date in one column and the special-offer information as 0 and 1.
The resulting table should look like this:
Shop ID   Date           Special Offer?
A         '2022-01-01'   1
A         '2022-01-02'   1
A         '2022-01-03'   1
B         '2022-01-09'   1
B         '2022-01-10'   1
B         '2022-01-11'   1
I wrote a function which iterates over every row and creates a DataFrame containing a pandas date range plus the special-offer information; these DataFrames are then concatenated. As you can imagine, the code runs very slowly.
I was thinking of appending a "Special Offer?" column to the sales DataFrame and then joining it to a DataFrame containing all dates; afterwards I could just fill the NaNs with fillna (or drop them with dropna). But I couldn't find a function which lets me join on conditions in pandas.
See example below:
Shop ID   Special Offer Start   Special Offer End   Special Offer?
A         '2022-01-01'          '2022-01-03'        1
B         '2022-01-09'          '2022-01-11'        1
joined with the table below (the join condition being: Date between Special Offer Start and Special Offer End):
Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'
creates:
Shop ID   Date           Special Offer?
A         '2022-01-01'   1
A         '2022-01-02'   1
A         '2022-01-03'   1
A         '2022-01-04'   NaN
A         '2022-01-05'   NaN
A         '2022-01-06'   NaN
A         '2022-01-07'   NaN
A         '2022-01-08'   NaN
A         '2022-01-09'   NaN
A         '2022-01-10'   NaN
A         '2022-01-11'   NaN
B         '2022-01-01'   NaN
B         '2022-01-02'   NaN
B         '2022-01-03'   NaN
B         '2022-01-04'   NaN
B         '2022-01-05'   NaN
B         '2022-01-06'   NaN
B         '2022-01-07'   NaN
B         '2022-01-08'   NaN
B         '2022-01-09'   1
B         '2022-01-10'   1
B         '2022-01-11'   1
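pandas has no built-in conditional join, but the idea described above (join with the condition "Date between Special Offer Start and Special Offer End") can be emulated with a cross join followed by a between mask. A minimal sketch, using hypothetical frame names and assuming pandas >= 1.2 for how='cross':
import pandas as pd

sales_df = pd.DataFrame({'Shop ID': ['A', 'B'],
                         'Special Offer Start': pd.to_datetime(['2022-01-01', '2022-01-09']),
                         'Special Offer End': pd.to_datetime(['2022-01-03', '2022-01-11'])})
dates_df = pd.DataFrame({'Date': pd.date_range('2022-01-01', '2022-01-11')})

# Cross join every shop with every date, then evaluate the "between" condition as 0/1.
crossed = sales_df.merge(dates_df, how='cross')
between = crossed['Date'].between(crossed['Special Offer Start'], crossed['Special Offer End'])
out = crossed.assign(**{'Special Offer?': between.astype(int)})[['Shop ID', 'Date', 'Special Offer?']]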
EDIT:
here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]),
                      columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
Update
The Shop ID column is missing
You can use date_range to expand the dates:
import pandas as pd

# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
{'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)
# Not mandatory if you have already DatetimeIndex
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
# create full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
.pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
.reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
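If you only need rows for the offer dates themselves (the first target table in the question) rather than the full calendar per shop, the explode step alone is enough. A small sketch reusing the date_range lambda defined above:
out_simple = (df.assign(Date=df.apply(date_range, axis=1), **{'Special Offer?': 1})
                .explode('Date')[['Shop ID', 'Date', 'Special Offer?']]
                .reset_index(drop=True))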
I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and is <= the EOL of each record in the DF1, that row in DF2 must be assigned the LINE, DEVICE and TAPE_FILE.
So, every one of the points will be assigned a LINE, DEVICE and TAPE_FILE based on the SOL/EOL time the UTC_TIME is between in DF1.
I'm trying to use the numpy where function for each column like this
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row
for point in points:
    if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']:
        return df1['DEVICE']
Try with merge_asof:
#convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1,
                       left_on="UTC_TIME", right_on="SOL").drop(["SOL", "EOL"], axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12
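One caveat: merge_asof only matches each point to the most recent SOL and does not check the end of the window. If a point can fall after a line's EOL, a small follow-up mask (same column names as in the question) could blank out those rows before dropping SOL and EOL:
import numpy as np

matched = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1,
                        left_on="UTC_TIME", right_on="SOL")
past_end = matched["UTC_TIME"] > matched["EOL"]
matched.loc[past_end, ["LINE", "DEVICE", "TAPE_FILE"]] = np.nan
output = matched.drop(["SOL", "EOL"], axis=1)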
I have a dataframe of this type:
arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
.
.
What I need to do is: for every station, subtract each arr_time from every dep_time belonging to the same station (skipping the case where both come from the same row). For example:
for station a:
    for i in range(len(arr_time)):
        for j in range(len(dep_time)):
            if i != j:
                dep_time[j] - arr_time[i]
Result, for station a, must be:
result
-00:20:00
00:25:00
and so on, for all stations in station.
I need to write this with Pandas because of the large amount of data. I will be very thankful to whoever can help me!
Here is one way. I used pd.merge to link every station 'a' to every other station 'a' (etc.). Then I filtered so we won't compare a station to itself, and performed the time arithmetic.
from io import StringIO
import pandas as pd
data = ''' arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
'''
df = pd.read_csv(StringIO(data), sep='\s+')
# create unique identifier for each row
df['id'] = df.reset_index().groupby('station')['index'].rank(method='first').astype(int)
# SQL-style self-join: all station 1's; all station 2's, etc.
t = pd.merge(left=df, right=df, how='inner', on='station', suffixes=('_l', '_r'))
# don't compare station to itself
t = t[ t['id_l'] != t['id_r'] ]
# compute elapsed time (as timedelta object)
t['elapsed'] = pd.to_timedelta(t['dep_time_l']) - pd.to_timedelta(t['arr_time_r'])
# convert elapsed time to minutes (may not be necessary)
t['elapsed'] = t['elapsed'] / pd.Timedelta(minutes=1) # convert to minutes
# create display
t = (t[['station', 'elapsed', 'id_l', 'id_r']]
.sort_values(['station', 'id_l', 'id_r']))
print(t)
station elapsed id_l id_r
1 a 25.0 1 2
2 a -20.0 1 3
3 a -20.0 2 1
5 a -40.0 2 3
6 a 25.0 3 1
7 a 50.0 3 2
10 b -5.0 1 2
11 b 17.0 2 1
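If the question's "-00:20:00"-style strings are wanted instead of minutes, one option (a small sketch; it recomputes the timedelta because `elapsed` was converted to minutes above) is to format the timedelta by hand:
t['elapsed_td'] = pd.to_timedelta(t['dep_time_l']) - pd.to_timedelta(t['arr_time_r'])
t['result'] = t['elapsed_td'].apply(
    lambda td: ('-' if td < pd.Timedelta(0) else '') + str(abs(td)).split('days')[-1].strip())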
There is a dataframe. The period column contains lists. These lists contain time spans.
#load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan
#create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df.period = z.values()
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
Then I add columns which are called time slices.
#create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)
#add columns
for i in period:
    df[str(i)] = np.nan
result
How can I fill the NaN values with True when the corresponding period appears in the list in the period column?
Apply a function across the dataframe rows
def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd
# handle assignment error
pd.options.mode.chained_assignment = None
# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)
# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True
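A loop-lighter alternative (same `data` and `period` objects as above; note this fills False rather than leaving NaN):
for col in period.astype(str):
    data[col] = [col in months for months in data['time']]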
I am unable to solve this without loops, and I have a pretty long time series. I want to know the closest next maturity date based on the information we know today. Example below. Note that the next expiry date should be for that specific code. There has got to be a more pythonic way of doing this.
date matdate code
2-Jan-2018 5-Jan-2018 A
3-Jan-2018 6-Jan-2018 A
8-Jan-2018 12-Jan-2018 B
10-Jan-2018 15-Jan-2018 A
11-Jan-2018 16-Jan-2018 B
15-Jan-2018 17-Jan-2018 A
And I am looking for the output to be in the below format, which includes all weekday dates (the below could also be in pivot format, but should have all weekday dates as the index):
date matdate code BusinessDaysToNextMat
2-Jan-2018 5-Jan-2018 A 3
2-Jan 2018 B 0
3-Jan-2018 8-Jan-2018 A 2
3-Jan-2018 B 0
4-Jan-2018 A 1
4-Jan-2018 B 0
5-Jan-2018 A 0
5-Jan-2018 B 0
8-Jan-2018 A 0
8-Jan-2018 17-Jan-2018 B 7
9-Jan-2018 A 0
9-Jan-2018 B 6
10-Jan-2018 16-Jan-2018 A 4
10-Jan-2018 B 6
11-Jan-2018 A 3
11-Jan-2018 16-Jan-2018 B 3
12-Jan-2018 A 4
12-Jan-2018 B 2
15-Jan-2018 17-Jan-2018 A 1
15-Jan-2018 B 1
Thank you very much for taking a look!
You can use numpy.busday_count to achieve that:
import numpy as np
df['BusinessDaysToNextMat'] = df[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
df
# date matdate code BusinessDaysToNextMat
#0 2018-01-01 2018-01-05 A 4
#1 2018-01-03 2018-01-06 A 3
#2 2018-01-08 2018-01-12 B 4
#3 2018-01-10 2018-01-15 A 3
#4 2018-01-11 2018-01-16 B 3
#5 2018-01-15 2018-01-17 A 2
#6 2018-01-20 2018-01-22 A 0
This doesn't exactly match your example, but gets most of the way there:
index = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max(), freq='C').values,
     df['code'].unique()],
    names=['date', 'code'])
resampled = pd.DataFrame(index=index).reset_index().merge(df, on=['date', 'code'], how='left')
calc = resampled.dropna()
calc['BusinessDaysToNextMat'] = calc[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
final = resampled.merge(calc, on=['date', 'code', 'matdate'], how='left')
final['BusinessDaysToNextMat'].fillna(0, inplace=True)
final
# date code matdate BusinessDaysToNextMat
#0 2018-01-02 A 2018-01-05 3.0
#1 2018-01-02 B NaT 0.0
#2 2018-01-03 A 2018-01-06 3.0
#3 2018-01-03 B NaT 0.0
#4 2018-01-04 A NaT 0.0
#5 2018-01-04 B NaT 0.0
#6 2018-01-05 A NaT 0.0
#7 2018-01-05 B NaT 0.0
#8 2018-01-08 A NaT 0.0
#9 2018-01-08 B 2018-01-12 4.0
#10 2018-01-09 A NaT 0.0
#11 2018-01-09 B NaT 0.0
#12 2018-01-10 A 2018-01-15 3.0
#13 2018-01-10 B NaT 0.0
#14 2018-01-11 A NaT 0.0
#15 2018-01-11 B 2018-01-16 3.0
#16 2018-01-12 A NaT 0.0
#17 2018-01-12 B NaT 0.0
#18 2018-01-15 A 2018-01-17 2.0
#19 2018-01-15 B NaT 0.0
Here is what I am doing currently, which clearly isn't the most efficient:
# Step 1: make a new df with data for just one code and fill any blank matdates
# with the very first available matdate. After that:
temp_df['newmatdate'] = datetime.date(2014, 1, 1)  # temp column to hold the current minimum maturity date
temp_df['BusinessDaysToNextMat'] = 0               # the column we are after

mindates = []           # list of any new maturity dates that come up, kept min-sorted
mindates.append(dummy)  # `dummy` stands in for the very first available maturity date (as of the first
                        # date we only know this one trade); the real code for it is longer and not shown here
x = mindates[0]         # variable used in the loop

g = datetime.datetime.now()
for i in range(len(temp_df['matdate'])):                       # loop through every date
    if np.in1d(temp_df['matdate'][i], mindates)[0] == False:   # if the current maturity date is not yet in mindates, add it
        mindates.append(temp_df['matdate'][i])
    while min(mindates) < temp_df['date'][i]:  # while the current date is greater than the min mindate held so far
        mindates.sort()                        # sort so we are sure to remove the min mindate
        x = mindates[0]                        # note the date being dropped before dropping it
        del mindates[0]                        # drop the current min mindate, so the next one becomes the new min
        if temp_df['matdate'][i] != x:         # possibly redundant: check the matdate being re-added
            mindates.append(temp_df['matdate'][i])  # is not the one just removed; if not, add it to the list
    curr_min = min(mindates)
    temp_df['newmatdate'][i] = curr_min        # record the current min mindate in the column
h = datetime.datetime.now()
print('loop took ' + str((h - g).seconds) + ' seconds')

date = [d.date() for d in temp_df['date']]  # convert from Timestamp to date to be able to use np.busday_count()
newmatdate = [d.date() for d in temp_df['newmatdate']]
temp_df['BusinessDaysToNextMat'] = np.busday_count(date, newmatdate)  # phew
Also, this is just for a single code; I will then loop it over however many codes there are.
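For what it's worth, here is a loop-free sketch of the same idea. It is heavily hedged: it assumes `df` holds the original `date`, `matdate`, `code` columns already parsed with pd.to_datetime, needs pandas >= 1.2 for how='cross', and does not emit the zero rows for dates/codes with no known future maturity:
import numpy as np
import pandas as pd

# All weekdays between the first and last trade date.
days = pd.bdate_range(df['date'].min(), df['date'].max(), name='date').to_frame(index=False)
trades = df.rename(columns={'date': 'trade_date'})

# Pair every weekday with every trade, keep maturities already known (trade date <= current date)
# and not yet expired (matdate >= current date), then take the nearest one per date/code.
cross = days.merge(trades, how='cross')
known = cross[(cross['trade_date'] <= cross['date']) & (cross['matdate'] >= cross['date'])]
nearest = (known.sort_values('matdate')
                .drop_duplicates(subset=['date', 'code'])
                .sort_values(['date', 'code']))
nearest['BusinessDaysToNextMat'] = np.busday_count(
    nearest['date'].values.astype('datetime64[D]'),
    nearest['matdate'].values.astype('datetime64[D]'))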