Merge two dataframes on multiple keys with tolerance using merge_asof - python

We wish to find the best matches between two dataframes based on multiple keys. From the documentation, the method merge_asof seemed to be a reasonable choice. Testing it on one column, it worked as expected.
import pandas as pd
import numpy as np
data_key_1 = np.arange(10).astype(float)
data_key_2 = np.arange(10).astype(float)
data_key_1_noisy = data_key_1-0.25*np.random.rand(10)
data_key_2_noisy = data_key_2-0.1*np.random.rand(10)
data_target = list('abcdefghij')
# one key approach
df_1 = pd.DataFrame(zip(data_key_1[::2], ), columns=['key_1',])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_target), columns=['key_1', 'target',])
df_result_1 = pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5)
print(df_result_1)
With console output as follows:
   key_1 target
0    0.0      a
1    2.0      c
2    4.0      e
3    6.0      g
4    8.0      i
When trying to use two keys, it failed. We tried different combinations of keyword settings but didn't manage to get it running.
# two keys approach
df_1 = pd.DataFrame(zip(data_key_1[::2], data_key_2[::2]), columns=['key_1', 'key2'])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_key_2_noisy, data_target), columns=['key_1', 'key2', 'target'])
df_result_2 = pd.merge_asof(df_1, df_2, on=['key_1', 'key2'], direction='nearest', tolerance=0.5)
print(df_result_2)
It will throw an error:
pandas.errors.MergeError: can only asof on a key for left
Expected console output would have been:
   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i
So we asked ourselves whether we are trying to apply this method in an inappropriate context, or whether it is an applicable use case and we just got the keywords wrong.

I believe you can also use a KDTree to do this:
from scipy.spatial import cKDTree

# build a KD-tree over both key columns of df_2
tree2 = cKDTree(df_2[["key_1", "key2"]])
# get distances and indices of each df_1 row's nearest neighbor in df_2
dists, inds = tree2.query(df_1[['key_1', 'key2']], 1)
# assign target only if the nearest neighbor is within the tolerance
df_1['target'] = [df_2.loc[i, 'target'] if d < 0.5 else np.nan for d, i in zip(dists, inds)]
   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i
Previously I suggested query_ball_tree, but that method doesn't return nearest neighbors in any particular order, so one should use query to grab the nearest neighbor directly.
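For larger frames, the list comprehension above can be replaced by a vectorized lookup. This is just a sketch reusing the same df_1, df_2, dists and inds as above, with the same assumed tolerance of 0.5:
import numpy as np

# vectorized equivalent of the list comprehension: take the matched target where the
# nearest-neighbor distance is within tolerance, otherwise NaN
df_1['target'] = np.where(dists < 0.5, df_2['target'].to_numpy()[inds], np.nan)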

merge_asof always matches on exactly one asof key (think about why it requires that key to be sorted); additional columns can only be matched exactly via the `by` argument. You can merge on one key and filter on the other key afterwards:
(pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5, suffixes=['','_'])
# may need to drop duplicate on `key_1, key2` here based on `abs` as well
.loc[lambda x: x['key2'].sub(x['key2_']).abs() < 0.5]
.drop(columns=['key2_'])
)
Output:
   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i
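If several left rows end up matched to the same right row, one possible follow-up (a sketch, not part of the original answer) is to keep only the closest match per target, as hinted at in the comment above:
df_result_2 = (
    pd.merge_asof(df_1, df_2, on='key_1', direction='nearest',
                  tolerance=0.5, suffixes=['', '_'])
    .assign(key2_diff=lambda x: x['key2'].sub(x['key2_']).abs())  # distance on the second key
    .loc[lambda x: x['key2_diff'] < 0.5]                          # filter on the second key
    .sort_values('key2_diff')
    .drop_duplicates(subset=['target'])                           # keep the closest match per target
    .sort_index()
    .drop(columns=['key2_', 'key2_diff'])
)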

Related

Fill a DataFrame with medians of group only for cell containing specific value

I am trying to find a nice/smart way to fill my DataFrame with the median value of a group.
I have 2 groups, "I" and "J", and 2 factors, "A" and "B". I want to replace my negative values with the median of the group to which each value belongs.
One constraint is that I don't want to replace the NaN values.
Here is the code to build my initial DataFrame:
tuples = [('I','0'), ('I','1'), ('I', '2'), ('J', '3'), ('I', '4'), ('J', '5')]
index = pd.MultiIndex.from_tuples(tuples, names=["Id1", "Id2"])
df = pd.DataFrame(np.arange(12).reshape(-1, 2), columns=['A', 'B'], index=index)
df["A"].iloc[0]=-1
df["B"].iloc[-1:]=-1
df["B"].iloc[-2]=18
df["B"].iloc[0]=np.NaN
df["B"].iloc[1]=np.NaN
which gives:
          A     B
Id1 Id2
I   0    -1   NaN
    1     2   NaN
    2     4   5.0
J   3     6   7.0
I   4     8  18.0
J   5    10  -1.0
Here is the way I solved it:
ind, col = np.where(df<0)
nb_df_lt_0 = len(ind)
for ii in np.arange(nb_df_lt_0):
    df.iloc[ind[ii], col[ii]] = np.NAN
    xx, yy = ind[ii], col[ii]
    index_Id1 = df.index.get_level_values("Id1")[xx]
    df.iloc[xx, yy] = df.loc[index_Id1, :].iloc[:, yy].median()
df
This gives what I was looking for:
            A     B
Id1 Id2
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0
It works, but it doesn't look nice and is surely not very efficient, since I have a for loop.
I would be very pleased to see a solution using pandas or numpy functions that does the job.
Thanks in advance
You can do something like this:
df.mask(df<0, df.mask(df<0, np.nan).groupby(level=0).median())
Let's break that down. You need the median of the two groups "I" and "J" excluding the negative values:
median_df = df.mask(df<0, np.nan).groupby(level=0).median()
Then you want to substitute the median for the negative values in the original DataFrame:
df.mask(df<0, median_df)
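As a variation on the same idea (a sketch, not from the original answer), the group medians can be computed with transform so they already align with the DataFrame's MultiIndex, and the original NaNs are restored at the end:
masked = df.mask(df < 0)                                 # hide negatives, keep original NaN
medians = masked.groupby(level='Id1').transform('median')
result = masked.fillna(medians).where(df.notna())        # fill negatives, restore original NaN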
You can use this:
It groups each column and then finds the median of the group (not including the -1 values).
for col in df.columns:
    df[col] = df.groupby('Id1')[col].apply(lambda x: (
        x.replace(-1, x.loc[x != -1].median())
    ))
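Depending on your pandas version, the groupby/apply above may prepend the group key to the result's index; if so, passing group_keys=False should keep the result aligned with the original frame (a hedged variant, not part of the original answer):
for col in df.columns:
    df[col] = df.groupby('Id1', group_keys=False)[col].apply(
        lambda x: x.replace(-1, x.loc[x != -1].median())
    )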
Let's start from a small correction in the way you created the source DataFrame:
As each column can contain NaN, which is a special case of float,
create the temporary DataFrame with data type of float:
np.arange(12, dtype='float')
(no change in the rest of your code to create the DataFrame).
You will need the following group processing function:
def grpProc(grp):
    grp[grp == -1] = grp[grp != -1].median()
    return grp
It computes the median of the elements != -1 and writes it into the elements == -1, assuming that the source group (grp) is the part of the current column corresponding to each Id1.
Then the changed group is returned.
And to get the result, apply it to each column of your DataFrame,
grouped by Id1 (level 0):
result = df.apply(lambda col: col.groupby(level=0).apply(grpProc))
No axis parameter has been passed, so this function is applied to
each column (axis == 0).
For your sample data the result is:
            A     B
Id1 Id2
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0

python Pandas lambda apply doesn't work for NaN

I've been trying to do an efficient VLOOKUP-style operation in pandas, with an IF condition...
Basically, I want to apply this to the column ccy_grp: if the value (in a particular row) is NaN, it should take the value from another column, ccy
def func1(tkn1, tkn2):
    if tkn1 == 'NaN':
        return tkn2
    else:
        return tkn1
tmp1_.ccy_grp = tmp1_.apply(lambda x: func1(x.ccy_grp, x.ccy), axis = 1)
but nope, it doesn't work. The code cannot seem to detect NaN. I tried another way, np.isnan(tkn1), but I just got a boolean error message...
Does any experienced Python pandas developer know?
Use pandas.isna to detect whether a value is NaN.
generate data
import pandas as pd
import numpy as np
data = pd.DataFrame({'value': [np.NAN, None, 1, 2, 3],
                     'label': ['str:np.NAN', 'str: None', 'str: 1', 'str: 2', 'str: 3']})
data
create a function
def func1(x):
    if pd.isna(x):
        return 'is a na'
    else:
        return f'{x}'
apply function to data
data['func1_result'] = data['value'].apply(func1)
data
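Applied to the question's frame, the same pd.isna check can replace the string comparison in the OP's func1 (a sketch; tmp1_ is the OP's frame and is not defined here):
def func1(tkn1, tkn2):
    # pd.isna covers np.nan and None, unlike comparing against the string 'NaN'
    return tkn2 if pd.isna(tkn1) else tkn1

tmp1_['ccy_grp'] = tmp1_.apply(lambda x: func1(x.ccy_grp, x.ccy), axis=1)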
There is a pandas method for what you are trying to do. Check out combine_first:
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with
non-null values from the other Series.
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
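A minimal, self-contained illustration of combine_first (the ccy_grp/ccy column names come from the question; the sample values here are made up):
import numpy as np
import pandas as pd

tmp1_ = pd.DataFrame({'ccy_grp': ['EUR', np.nan, 'JPY'],
                      'ccy':     ['USD', 'GBP', 'CHF']})
tmp1_['ccy_grp'] = tmp1_['ccy_grp'].combine_first(tmp1_['ccy'])
print(tmp1_['ccy_grp'].tolist())  # ['EUR', 'GBP', 'JPY']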
This looks like it should be a pandas mask/where/fillna problem, not an apply:
Given:
   value  values2
0    NaN      0.0
1    NaN      0.0
2    1.0      1.0
3    2.0      2.0
4    3.0      3.0
Doing:
df.value.fillna(df.values2, inplace=True)
print(df)
# or
df.value.mask(df.value.isna(), df.values2, inplace=True)
print(df)
# or
df.value.where(df.value.notna(), df.values2, inplace=True)
print(df)
Output:
   value  values2
0    0.0      0.0
1    0.0      0.0
2    1.0      1.0
3    2.0      2.0
4    3.0      3.0
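Note that chained inplace calls like the ones above are discouraged in recent pandas versions; an equivalent assignment-based form (a small sketch of the same idea) would be:
df['value'] = df['value'].fillna(df['values2'])
# or
df['value'] = df['value'].mask(df['value'].isna(), df['values2'])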

DataFrame.isin is not working for me, returns the same number of rows even if there are intersecting rows

I am applying an inner join in a for loop on another dataset, and now I just need to remove the rows that are already part of the inner join, so I went with DataFrame.isin(another_df), but it is not giving me the expected results. I checked the column names and their data types; they are all the same. Can someone help me with that, please?
In the following code, isin is where I compare the two data frames, but still I'm not getting the expected result: I get the same set of rows back even though the frames have the same number of rows and columns.
Note: I'm dropping an extra column inside the isin call, since it is only present in one of the dataframes.
My code looks like this:
df = pd.DataFrame(columns=override.columns)
for i in list1:
    join_value = tuple(i)
    i.append('creditor_tier_interim')
    subset_df = override.merge(criteria[i].dropna(), on=list(join_value), how='inner')
    subset_df['PRE_CHARGEOFF_FLAG'] = pd.to_numeric(subset_df.PRE_CHARGEOFF_FLAG)
    override = override[~override.isin(subset_df.drop(columns='creditor_tier_interim'))].dropna(how='all')
    print('The override shape would be:', override.shape)
    df = df.append(subset_df)
df = df.append(override)
It sounds as if you have a 'left' and a 'right' DataFrame and you're looking for the records that are exclusively in one or the other. The code below returns the rows that are exclusively in either the right or the left DataFrame.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
dataframe_left = DataFrame(np.random.randn(25).reshape(5,5),columns=['A','B','C','D','E'],index=np.arange(5))
dataframe_right = DataFrame(np.random.randn(25).reshape(5,5),columns=['A','B','C','D','E'],index=np.arange(5))
insert_left = DataFrame(np.arange(5).reshape(1,5),columns=['A','B','C','D','E'],index=[7])
insert_right = DataFrame(np.arange(5).reshape(1,5),columns=['A','B','C','D','E'], index=[6])
dataframe_right = dataframe_right.append(insert_right)
dataframe_left = dataframe_left.append(insert_left)
The code above produces this output:
Left Table

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | -0.3240086903973736 | 1.0441549453943946 | -0.23640436950107843 | 0.5466767470739027 | -0.2123693649877372 |
| 1 | -0.04263388410830733 | -0.4855492977594353 | -1.5584284407735072 | 1.2438524586306603 | -0.31087239909921277 |
| 2 | 0.6982581750529829 | -0.42379154444215905 | 1.1625089013522614 | -3.378898146269229 | 1.0550121763954057 |
| 3 | 0.3774337535208665 | 0.6402576096348337 | -0.2787520258645991 | 0.31071767629270125 | 0.34499495360962007 |
| 4 | -0.133649590435452 | 0.3679768579635411 | -2.0196709364730014 | 1.2860033685128436 | -0.49674737879741193 |
| 7 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |

Right Table

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | -0.09946693056759418 | -0.03378933704588447 | -0.4117873368048701 | 0.21976489856531914 | -0.7020527418892488 |
| 1 | -2.9936183481793233 | 0.42443360961021837 | -0.1681576564885903 | -0.5080538565354785 | -0.29483296271514153 |
| 2 | -0.6567306172004121 | -1.221239625798079 | -1.2604670988941196 | 0.44472543746187265 | -0.4562966381137614 |
| 3 | -0.0027697712245823482 | 0.1323767897141191 | -0.11073953230359104 | -0.3596157927825233 | 1.9894525572891626 |
| 4 | 0.5170901011452596 | -1.1694605240821456 | 0.29238712582282705 | -0.38912521589557797 | -0.8793074660039492 |
| 6 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
After setting up the test dataframes we can join the two and filter for the rows we're interested in:
tmp = pd.merge(
    left=dataframe_left,
    right=dataframe_right,
    right_index=True,
    left_index=True,
    how='outer',
    suffixes=['_left', '_right'],
    indicator=True
)
tmp[tmp._merge.isin(['right_only', 'left_only'])]
This produces the result below:
|   | A_left | B_left | C_left | D_left | E_left | A_right | B_right | C_right | D_right | E_right | _merge |
|---|--------|--------|--------|--------|--------|---------|---------|---------|---------|---------|--------|
| 6 | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | right_only |
| 7 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | left_only |
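If you then want only the rows unique to, say, the left frame, with their original column names, a follow-on sketch using the tmp frame built above could look like this:
# keep only the rows marked left_only and the left-hand columns
left_only = tmp.loc[tmp['_merge'] == 'left_only',
                    [c for c in tmp.columns if c.endswith('_left')]]
# strip the suffix to recover the original column names
left_only.columns = [c[:-len('_left')] for c in left_only.columns]
print(left_only)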

python pandas: Find top n and then m in the top n

I have a pandas data frame that looks like the following:
fastmoving[['dist','unique','id']]
Out[683]:
       dist  unique            id
1  0.406677     4.0  4.997434e+09
2  0.406677     4.0  4.452593e+09
5  0.406677     4.0  4.188395e+09
1  0.434386     4.0  8.288070e+09
4  0.434386     4.0  3.274609e+09
What I want to achieve is to:
Find the top n longest-distance entries (column 'dist').
Find which ids have the largest percentage m in the top n entries (column 'id').
So far I have been able to write the code for the maximum entries:
#Get the first id with the largest dist:
fastmoving.loc[fastmoving['dist'].idxmax(),'id']
#Get all id's with the largest dist:
fastmoving.loc[fastmoving['dist']==fastmoving['dist'].max(),'id']
What I am missing is how to make my code work for more than one value.
So instead of the maximum value only, it should work for a range of maximum values (the top n values).
And then get all the ids that appear with more than some percentage m in those n maximum values.
Can you please help me with how to achieve that in pandas?
Thanks a lot
Alex
You can use nlargest for the top n and quantile for the top m%, like this:
import pandas as pd
from io import StringIO
fastmoving = pd.read_csv(StringIO("""
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09"""), sep=r"\s+")
n = 3
m = 50
top_n_dist = fastmoving.nlargest(n, ["dist"])
top_m_percent_id_in_top_n_dist = top_n_dist[top_n_dist['id'] > top_n_dist['id'].quantile(m / 100)]
print(top_m_percent_id_in_top_n_dist)
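If instead you read "largest percentage m" as each id's share of the top-n rows (an assumption about the intended meaning), a sketch reusing top_n_dist, n and m from above would be:
# share of each id within the top-n rows by distance
share = top_n_dist['id'].value_counts(normalize=True)
# ids that account for more than m percent of those rows
ids_over_m = share[share > m / 100].index
print(ids_over_m)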
IIUC, you can leverage nlargest. The following example would take the top 3 values of dist, and from that, extract the top 2 values of id:
fastmoving.nlargest(3, ["dist", "id"]).nlargest(2, "id")
       dist  unique            id
1  0.434386     4.0  8.288070e+09
1  0.406677     4.0  4.997434e+09

Find if sum of any two columns exceed X in pandas dataframe

Columns are attributes, rows are observations.
I would like to extract the rows where the sum of any two attributes exceeds a specified value (say 0.7). Then, in two new columns, list the column headers with the bigger and the smaller contribution to the sum.
I am new to Python, so I am stuck on how to proceed after generating my dataframe.
You can do this:
import pandas as pd
from itertools import combinations
THRESHOLD = 8.0
def valuation_formula(row):
    l = [sorted(x) for x in combinations(row, r=2) if sum(x) > THRESHOLD]
    if len(l) == 0:
        row["smaller"], row["larger"] = None, None
    else:
        row["smaller"], row["larger"] = l[0]  # since not specified by OP, we take the first such pair
    return row
contribution_df = df.apply(lambda row: valuation_formula(row), axis=1)
So that, if
df = pd.DataFrame({"a" : [1.0, 2.0, 4.0], "b" : [5.0, 6.0, 7.0]})
     a    b
0  1.0  5.0
1  2.0  6.0
2  4.0  7.0
then, contribution_df is
     a    b  smaller  larger
0  1.0  5.0      NaN     NaN
1  2.0  6.0      NaN     NaN
2  4.0  7.0      4.0     7.0
HTH.
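For wider frames, a vectorized alternative to the row-wise apply (a sketch, not part of the original answer, reusing the same df and THRESHOLD) relies on the fact that the pair with the maximum sum in a row is always its two largest values:
import numpy as np

vals = df.to_numpy()                      # assumes df holds only the attribute columns
top2 = np.sort(vals, axis=1)[:, -2:]      # smaller and larger value of the best-sum pair per row
hit = top2.sum(axis=1) > THRESHOLD        # rows where some pair exceeds the threshold
df['smaller'] = np.where(hit, top2[:, 0], np.nan)
df['larger'] = np.where(hit, top2[:, 1], np.nan)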
