Merge two data-frames based on multiple conditions - python

I am looking to compare two dataframes (df-a and df-b) and find where a given ID and date from one dataframe (df-b) sits within a date range of a matching ID in the other dataframe (df-a). I then want to take all the columns from df-a and concat them onto df-b where they match. E.g.
If I have a dataframe df-a, in the following format
df-a:
ID Start_Date End_Date A B C D E
0 cd2 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
1 cd2 2020-06-24 2020-07-21
2 cd56 2020-06-10 2020-07-03
3 cd915 2020-04-28 2020-07-21
4 cd103 2020-04-13 2020-04-24
and df-b in
ID Date
0 cd2 2020-05-12
1 cd2 2020-04-12
2 cd2 2020-06-10
3 cd15 2020-04-28
4 cd193 2020-04-13
I would like an output dataframe, df-c, like so:
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 - - - - - - -
1 cd2 2020-04-12 - - - - - - -
2 cd2 2020-06-10 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
3 cd15 2020-04-28 - - - - - - -
4 cd193 2020-04-13 - - - - - - -
In a previous post I got a brilliant answer which allowed me to compare the dataframes and drop rows wherever this condition was met, but I am struggling to figure out how to extract the matching information from df-a. My current attempt is below!
df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].Leave_End_Date
        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d = pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j, :]], axis=0)
            #df_fin_2 = df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)

So you want to make a sort of "soft" match. Here's a solution that vectorizes the date-range check within each ID group.
# notice we are working with dates as strings; the inequalities only work if dates are in y-m-d format
# otherwise it is safer to parse the date columns first, e.g. df_b['Date'] = pd.to_datetime(df_b['Date'])
import numpy as np
import pandas as pd

# create a groupby object once so we can efficiently filter df_a inside the loop
# good idea if df_a is considerably large and has many different IDs
gdf_a = df_a.groupby('ID')
a_IDs = gdf_a.indices  # a dictionary of grouped rows {ID: array of integer indices}
matched = []  # collect the matched rows from df_a
# iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_b))
for i, ID, date in df_b.itertuples():
    if ID in a_IDs:
        gID = gdf_a.get_group(ID)  # the rows of df_a for this ID
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if inrange.any():
            matched.append(
                gID.loc[inrange.idxmax()]  # the first row whose range contains the date
                   .values[1:]             # as a plain array, with `ID` sliced out
            )
        else:
            matched.append([np.nan] * (df_a.shape[1] - 1))  # no date in range, fill with NaNs
    else:
        matched.append([np.nan] * (df_a.shape[1] - 1))  # no ID match, fill with NaNs
df_c = df_b.join(pd.DataFrame(matched, columns=df_a.columns[1:]))
print(df_c)
Output
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 NaN NaN NaN NaN NaN NaN NaN
1 cd2 2020-04-12 NaN NaN NaN NaN NaN NaN NaN
2 cd2 2020-06-10 2020-06-01 2020-06-24 a b c 10.0 20.0
3 cd15 2020-04-28 NaN NaN NaN NaN NaN NaN NaN
4 cd193 2020-04-13 NaN NaN NaN NaN NaN NaN NaN
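As a possible alternative (my own sketch, not part of the answer above): if the frames are small enough that a many-to-many merge fits in memory, an ordinary merge on ID followed by a between filter avoids the explicit loop entirely. This assumes the date columns have already been parsed with pd.to_datetime.
import pandas as pd

# pair every df_b row with every df_a row for the same ID, then keep the in-range pairs
merged = df_b.merge(df_a, on='ID', how='inner')
inrange = merged['Date'].between(merged['Start_Date'], merged['End_Date'])
matches = merged[inrange].drop_duplicates(subset=['ID', 'Date'])  # first match per (ID, Date)
# left-join back so unmatched df_b rows keep NaNs in the df_a columns
df_c = df_b.merge(matches, on=['ID', 'Date'], how='left')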

Related

Is there a way to optimize this date range transformation? Conditional merge in pandas?

I have sales data like this in a DataFrame; the datatype of the date columns is pandas datetime64:
Shop ID  Special Offer Start  Special Offer End
A        '2022-01-01'         '2022-01-03'
B        '2022-01-09'         '2022-01-11'
etc.
I want to transform the data into a new binary format that shows the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
I wrote a function which iterates over every row and creates a DataFrame containing a pandas date_range plus the special offer information; these DataFrames are then concatenated. As you can imagine, the code runs very slowly.
I was thinking of appending a Special Offer? column to the sales DataFrame and then joining it to a DataFrame containing all dates; afterwards I could just fill the NaNs with dropna or fillna. But I couldn't find a function that lets me join on conditions in pandas (a rough sketch of this idea follows the example tables below).
See example below:
Shop ID  Special Offer Start  Special Offer End  Special Offer?
A        '2022-01-01'         '2022-01-03'       1
B        '2022-01-09'         '2022-01-11'       1
join with (the join condition being: if Date between Special Offer Start and Special Offer End):
Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'
creates:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
A        '2022-01-04'  NaN
A        '2022-01-05'  NaN
A        '2022-01-06'  NaN
A        '2022-01-07'  NaN
A        '2022-01-08'  NaN
A        '2022-01-09'  NaN
A        '2022-01-10'  NaN
A        '2022-01-11'  NaN
B        '2022-01-01'  NaN
B        '2022-01-02'  NaN
B        '2022-01-03'  NaN
B        '2022-01-04'  NaN
B        '2022-01-05'  NaN
B        '2022-01-06'  NaN
B        '2022-01-07'  NaN
B        '2022-01-08'  NaN
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
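A rough sketch of the join-on-condition idea described above (my own illustration, not from the original post): pandas has no conditional join, but a cross merge against the date table followed by a between check gives the same intermediate result. Column names follow the example tables and the date columns are assumed to already be datetime.
import pandas as pd

dates = pd.DataFrame({'Date': pd.date_range('2022-01-01', '2022-01-11')})
out = sales_df.assign(**{'Special Offer?': 1.0}).merge(dates, how='cross')
in_range = out['Date'].between(out['Special Offer Start'], out['Special Offer End'])
out.loc[~in_range, 'Special Offer?'] = float('nan')  # NaN outside the offer window, as in the table above
out = out[['Shop ID', 'Date', 'Special Offer?']]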
EDIT:
here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
Update: The Shop ID column is missing.
You can use date_range to expand the dates:
# Setup minimal reproducible example
import pandas as pd

data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
        {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)

# Not needed if the columns are already datetime
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])

# create the full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')

# expand each row to its covered dates, pivot to a shop-by-date grid, then stack back
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1)
         .explode('Offer')
         .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
         .reindex(dti, fill_value=0)
         .unstack()
         .rename('Special Offer?')
         .reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1

Filling NaN values from another dataframe based on a condition

I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and <= the EOL of a record in DF1, that row in DF2 must be assigned that record's LINE, DEVICE and TAPE_FILE.
So every point will be assigned a LINE, DEVICE and TAPE_FILE based on which SOL/EOL window in DF1 its UTC_TIME falls within.
I'm trying to use the numpy where function for each column like this
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row
for point in points:
    if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']:
        return df1['DEVICE']
Try with merge_asof:
#convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1,
                       left_on="UTC_TIME", right_on="SOL").drop(["SOL", "EOL"], axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12
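One hedged caveat about merge_asof (my note, not part of the answer above): it matches each UTC_TIME to the most recent SOL at or before it, and it requires both frames to be sorted on the join keys, but it never checks the EOL bound on its own. A minimal follow-up sketch, assuming the column names above, that blanks out matches falling after the line end:
import pandas as pd

output = pd.merge_asof(df2[["ID", "UTC_TIME"]].sort_values("UTC_TIME"),
                       df1.sort_values("SOL"),
                       left_on="UTC_TIME", right_on="SOL")
too_late = output["UTC_TIME"] > output["EOL"]             # point falls after the matched line's end
output.loc[too_late, ["LINE", "DEVICE", "TAPE_FILE"]] = pd.NA
output = output.drop(columns=["SOL", "EOL"])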

With Pandas, how do I subtract all recurring elements of a series with an element of another series?

I have a dataframe of this type:
arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
.
.
What I need to do is: for every group of rows with the same station, subtract each arr_time from every dep_time of the other rows in that group (never comparing a row with itself). For example, for station a:
for i in range(len(arr_time)):
    for j in range(len(dep_time)):
        if i != j:
            dep_time[j] - arr_time[i]
Result, for station a, must be:
result
-00:20:00
00:25:00
and so on, for all stations in station.
I need to write this with Pandas due to the large amount of data. I will be very thankful to whoever can help me!
Here is one way. I used pd.merge to link every station 'a' row to every other station 'a' row (and likewise for each station). Then I filtered out comparisons of a row with itself and performed the time arithmetic.
from io import StringIO
import pandas as pd
data = ''' arr_time dep_time station
0 19:20:00 19:20:00 a
1 19:38:00 19:45:00 b
2 18:55:00 19:00:00 a
3 19:40:00 19:45:00 a
4 19:50:00 19:55:00 b
'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
# create unique identifier for each row
df['id'] = df.reset_index().groupby('station')['index'].rank(method='first').astype(int)
# SQL-style self-join: all station 1's; all station 2's, etc.
t = pd.merge(left=df, right=df, how='inner', on='station', suffixes=('_l', '_r'))
# don't compare station to itself
t = t[ t['id_l'] != t['id_r'] ]
# compute elapsed time (as timedelta object)
t['elapsed'] = pd.to_timedelta(t['dep_time_l']) - pd.to_timedelta(t['arr_time_r'])
# convert elapsed time to minutes (may not be necessary)
t['elapsed'] = t['elapsed'] / pd.Timedelta(minutes=1) # convert to minutes
# create display
t = (t[['station', 'elapsed', 'id_l', 'id_r']]
       .sort_values(['station', 'id_l', 'id_r']))
print(t)
station elapsed id_l id_r
1 a 25.0 1 2
2 a -20.0 1 3
3 a -20.0 2 1
5 a -40.0 2 3
6 a 25.0 3 1
7 a 50.0 3 2
10 b -5.0 1 2
11 b 17.0 2 1
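If the per-station groups get large, a loop-free variant along these lines might also work (my own sketch, not benchmarked): convert the times to timedeltas and broadcast the subtraction inside each group.
import numpy as np
import pandas as pd

def pairwise(g):
    dep = pd.to_timedelta(g['dep_time']).to_numpy()
    arr = pd.to_timedelta(g['arr_time']).to_numpy()
    diff = dep[:, None] - arr[None, :]                      # every dep_time minus every arr_time
    keep = ~np.eye(len(g), dtype=bool)                      # drop row-vs-itself comparisons
    return pd.Series(diff[keep] / np.timedelta64(1, 'm'))   # minutes, as in the answer above

result = df.groupby('station', group_keys=True).apply(pairwise)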

How to fill periods in columns?

There is a dataframe. The period column contains lists. These lists contain time spans.
#load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan

#create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df['period'] = list(z.values())
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
Then I add columns which are called time slices.
#create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)

#add columns
for i in period:
    df[str(i)] = np.nan
Result (shown as a screenshot in the original post): the dataframe now has one NaN column per period.
How can I fill the NaN values with True when that period appears in the list in the period column? (The desired result was shown as a screenshot in the original post.)
Apply a function across the dataframe rows:
def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd

# handle assignment error
pd.options.mode.chained_assignment = None

# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)

# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True
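As a possible loop-free variant (my own sketch, using the same toy data before the NaN columns are added, and giving False instead of NaN where a period is absent): explode the list column, one-hot encode it, and collapse back to one row per original row.
import pandas as pd

dummies = (pd.get_dummies(data['time'].explode())   # one indicator column per period string
             .groupby(level=0).max()                # collapse back to one row per original row
             .astype(bool))
out = data[['time']].join(dummies)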

Retrospectively Minimum Business Days from today till the next date in another column for different Codes

I just can't solve this without loops, and I have a fairly long time series. I want to know the closest next maturity date based on the information we know today. Example below (note that the next expiry date should be for that specific code). There has got to be a more pythonic way of doing this.
date matdate code
2-Jan-2018 5-Jan-2018 A
3-Jan-2018 6-Jan-2018 A
8-Jan-2018 12-Jan-2018 B
10-Jan-2018 15-Jan-2018 A
11-Jan-2018 16-Jan-2018 B
15-Jan-2018 17-Jan-2018 A
I am looking for output in the format below, which includes all weekday dates (it could also be in pivot format, but should have all weekday dates as the index):
date matdate code BusinessDaysToNextMat
2-Jan-2018 5-Jan-2018 A 3
2-Jan 2018 B 0
3-Jan-2018 8-Jan-2018 A 2
3-Jan-2018 B 0
4-Jan-2018 A 1
4-Jan-2018 B 0
5-Jan-2018 A 0
5-Jan-2018 B 0
8-Jan-2018 A 0
8-Jan-2018 17-Jan-2018 B 7
9-Jan-2018 A 0
9-Jan-2018 B 6
10-Jan-2018 16-Jan-2018 A 4
10-Jan-2018 B 6
11-Jan-2018 A 3
11-Jan-2018 16-Jan-2018 B 3
12-Jan-2018 A 4
12-Jan-2018 B 2
15-Jan-2018 17-Jan-2018 A 1
15-Jan-2018 B 1
Thank you very much for taking a look!
You can use numpy.busday_count to achieve that:
import numpy as np
df['BusinessDaysToNextMat'] = df[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
df
# date matdate code BusinessDaysToNextMat
#0 2018-01-01 2018-01-05 A 4
#1 2018-01-03 2018-01-06 A 3
#2 2018-01-08 2018-01-12 B 4
#3 2018-01-10 2018-01-15 A 3
#4 2018-01-11 2018-01-16 B 3
#5 2018-01-15 2018-01-17 A 2
#6 2018-01-20 2018-01-22 A 0
This isn't exactly the output in your example, but it covers most of it:
index = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max(), freq='C').values,
     df['code'].unique()],
    names=['date', 'code'])
resampled = pd.DataFrame(index=index).reset_index().merge(df, on=['date', 'code'], how='left')
calc = resampled.dropna()
calc['BusinessDaysToNextMat'] = calc[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
final = resampled.merge(calc, on=['date', 'code', 'matdate'], how='left')
final['BusinessDaysToNextMat'].fillna(0, inplace=True)
final
# date code matdate BusinessDaysToNextMat
#0 2018-01-02 A 2018-01-05 3.0
#1 2018-01-02 B NaT 0.0
#2 2018-01-03 A 2018-01-06 3.0
#3 2018-01-03 B NaT 0.0
#4 2018-01-04 A NaT 0.0
#5 2018-01-04 B NaT 0.0
#6 2018-01-05 A NaT 0.0
#7 2018-01-05 B NaT 0.0
#8 2018-01-08 A NaT 0.0
#9 2018-01-08 B 2018-01-12 4.0
#10 2018-01-09 A NaT 0.0
#11 2018-01-09 B NaT 0.0
#12 2018-01-10 A 2018-01-15 3.0
#13 2018-01-10 B NaT 0.0
#14 2018-01-11 A NaT 0.0
#15 2018-01-11 B 2018-01-16 3.0
#16 2018-01-12 A NaT 0.0
#17 2018-01-12 B NaT 0.0
#18 2018-01-15 A 2018-01-17 2.0
#19 2018-01-15 B NaT 0.0
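A small hedged note on the busday_count calls above (mine, not the answerer's): depending on how the date columns are stored, np.busday_count can refuse pandas Timestamps, so it may be safer to convert to datetime64[D] first, which also lets the whole column be computed in one vectorized call instead of a row-wise apply:
import numpy as np
import pandas as pd

start = pd.to_datetime(calc['date']).values.astype('datetime64[D]')
end = pd.to_datetime(calc['matdate']).values.astype('datetime64[D]')
calc['BusinessDaysToNextMat'] = np.busday_count(start, end)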
Here is what I am doing currently, which clearly isn't the most efficient:
# Step 1: make a new df with data for just one code and fill up any blank matdates
# with the very first available matdate. After that:
temp_df['newmatdate'] = datetime.date(2014, 1, 1)  # temp column to hold the current minimum maturity date
temp_df['BusinessDaysToNextMat'] = 0               # this is the column that we are after
mindates = []           # list to maintain any new maturity dates which come up, kept min-sorted
mindates.append(dummy)  # dummy is the very first available maturity date (as of the 1st date we only know one trade); it stands in for longer code that may not pertain here
x = mindates[0]         # variable to be used in the loop
g = datetime.datetime.now()
for i in range(len(temp_df['matdate'])):                      # loop through every date
    if np.in1d(temp_df['matdate'][i], mindates)[0] == False:  # if the current maturity date is not in mindates yet, add it
        mindates.append(temp_df['matdate'][i])
    while min(mindates) < temp_df['date'][i]:  # if the current date is greater than the min mindate held so far,
        mindates.sort()                        # sort so you are sure to remove the min mindate
        x = mindates[0]                        # note the date being dropped before dropping it
        del mindates[0]                        # drop the current min mindate, so the next mindate becomes the new min
        if temp_df['matdate'][i] != x:         # possibly redundant: check the new matdate being added wasn't the one just removed
            mindates.append(temp_df['matdate'][i])  # if not, add this new one to the list
    curr_min = min(mindates)
    temp_df['newmatdate'][i] = curr_min  # add the current min mindate to the column
h = datetime.datetime.now()
print('loop took ' + str((h - g).seconds) + ' seconds')
date = [d.date() for d in temp_df['date']]  # convert from Timestamp to date to be able to use np.busday_count()
newmatdate = [d.date() for d in temp_df['newmatdate']]
temp_df['BusinessDaysToNextMat'] = np.busday_count(date, newmatdate)  # phew
Also, this is just for a single code; I would then loop it over however many codes there are.
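For what it's worth, a heavily hedged sketch of a loop-free starting point (my own, and deliberately simplified): build the weekday-by-code grid, attach the next recorded trade's maturity per code with a forward merge_asof, and count business days to it. This only looks at the next recorded trade per code rather than tracking every maturity still outstanding, so it will not reproduce the table above row for row; it assumes date and matdate are already parsed with pd.to_datetime.
import numpy as np
import pandas as pd

# all weekdays in the data span, crossed with all codes
grid = (pd.MultiIndex.from_product(
            [pd.date_range(df['date'].min(), df['date'].max(), freq='B'),
             df['code'].unique()],
            names=['date', 'code'])
          .to_frame(index=False))

# next recorded trade (and its matdate) for each weekday, per code
nxt = pd.merge_asof(grid.sort_values('date'), df.sort_values('date'),
                    on='date', by='code', direction='forward')

ok = nxt['matdate'].notna() & (nxt['matdate'] >= nxt['date'])
nxt['BusinessDaysToNextMat'] = 0
nxt.loc[ok, 'BusinessDaysToNextMat'] = np.busday_count(
    nxt.loc[ok, 'date'].values.astype('datetime64[D]'),
    nxt.loc[ok, 'matdate'].values.astype('datetime64[D]'))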
