merge dataframes based on column A OR B - python

I need to merge two dataframes, but the merge can be made on either of two columns of the right-hand dataframe.
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df_2 = pd.DataFrame({'col_a': ['a', 'b', np.nan], 'col_b': ['z', np.nan, 'c']})
df_1.merge(df_2, how='left', left_on='col', right_on='col_a')
In the example above, the merge finds a match for col == 'a' and col == 'b', because df_2 contains those values in its col_a column. But I would also like it to match col == 'c' against col_b of df_2. If merge accepted a regex, a good solution would look like this:
df_1.merge(df_2, how = 'left', left_on = 'col', right_on = 'col_a|col_b')
The output should look like this:
  col col_a col_b
0   a     a     z
1   b     b   NaN
2   c   NaN     c
Any ideas?

I believe what we are looking for here is to merge twice, concatenate the results, and drop any duplicates that can arise when col_a and col_b match in the same row.
import numpy as np
import pandas as pd
df_1 = pd.DataFrame({'col' : ['a', 'c', 'b']})
df_2 = pd.DataFrame({'col_a' : ['b', np.nan, 'a', 'a', 'c'], 'col_b' : [np.nan, 'c', 'z', 'b', 'c']})
df = (
pd.concat([
df_1.merge(df_2, left_on='col', right_on='col_a'),
df_1.merge(df_2, left_on='col', right_on='col_b'),
]).drop_duplicates()
.reset_index(drop=True)
)
print(df)
# col col_a col_b
# 0 a a z
# 1 a a b
# 2 c c c
# 3 b b NaN
# 4 c NaN c
# 5 b a b
We see this handles the tricky cases:
- 'a' matches col_a twice
- 'b' matches col_a and col_b separately (including a row that also matches 'a')
- 'c' matches col_a and col_b in the same row but isn't duplicated in the output.
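One caveat (my addition, not part of the original answer): both merges above are inner joins, so rows of df_1 that match neither column are dropped. A sketch that restores the left-join behavior of the question by re-attaching the unmatched rows:
matched = pd.concat([
    df_1.merge(df_2, left_on='col', right_on='col_a'),
    df_1.merge(df_2, left_on='col', right_on='col_b'),
]).drop_duplicates()
# rows of df_1 whose value appears in neither col_a nor col_b
unmatched = df_1[~df_1['col'].isin(matched['col'])]
df = pd.concat([matched, unmatched]).reset_index(drop=True)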

You could perform both merges and use combine_first to fuse the two merges:
(df_1.merge(df_2, left_on='col', right_on='col_a', how='left')
.combine_first(df_1.merge(df_2, left_on='col', right_on='col_b', how='left'))
)
output:
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
Another example (one that avoids the pitfall of an already-aligned index):
df_1 = pd.DataFrame({'col' : ['a', 'c', 'b']})
df_2 = pd.DataFrame({'col_a' : ['b', np.nan, 'a'], 'col_b' : [np.nan, 'c', 'z']})
output:
col col_a col_b
0 a a z
1 c NaN c
2 b b NaN
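A self-contained version of that second example (a sketch; both merges are one-to-one here, so the two how='left' results share df_1's row order and combine_first can fill the first merge's NaNs from the second):
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col': ['a', 'c', 'b']})
df_2 = pd.DataFrame({'col_a': ['b', np.nan, 'a'], 'col_b': [np.nan, 'c', 'z']})

m_a = df_1.merge(df_2, left_on='col', right_on='col_a', how='left')
m_b = df_1.merge(df_2, left_on='col', right_on='col_b', how='left')
print(m_a.combine_first(m_b))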

Let's try join, given your desired output:
df_1.join(df_2)
output
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
Or
df_1.merge(df_2, how='left', left_on='col', right_on='col_a').combine_first(df_2)
output
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
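A caveat on these last two (my addition): join and combine_first(df_2) both align on the row index, so they only work because df_1 and df_2 happen to hold matching values at the same row positions. Shuffling df_2 shows the assumption breaking:
df_2_shuffled = df_2.iloc[[2, 0, 1]].reset_index(drop=True)
print(df_1.join(df_2_shuffled))  # rows no longer pair up by value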

Related

How to use the pandas 'isin' function to give actual values of the df row instead of a boolean expression?

I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
Example: (the tables for df1, df2, and the expected result of df1 were posted as images)
I've tried the following. I know this only checks whether there's a match of column B in both dataframes; it returns boolean True/False values in the 'New' column. Is there a way to extract the value in column 'C' when there's a match and add it to the 'New' column in df1 instead of the boolean values?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd

df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C': [1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C': [0, 9, 555, 15, 1]})

# indexes of the df2 rows whose 'B' value also appears in df1['B']
ind = df2[df2['B'].isin(df1['B'])].index
# note: the assignment aligns by index label, so it assumes each matching
# value sits at the same row position in df1 and df2
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2
B C
0 k 0
1 l 9
2 f 555
3 j 15
4 h 1
Output df1
B C new
0 a 1 NaN
1 b 5 NaN
2 f 777 555.0
3 d 10 NaN
4 h 3 1.0
Here ind holds the indexes of the df2 rows that have matches. Then loc does the assignment: the row indexes select the rows, and the column name selects the column.
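As noted in the comments above, this relies on df1 and df2 keeping matching values at the same row positions. A position-independent variant (my sketch, not part of the original answer) merges on 'B' instead; the suffixes argument just renames df2's overlapping 'C' column:
df1['new'] = df1.merge(df2, on='B', how='left', suffixes=('', '_df2'))['C_df2']
print(df1)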

How to do a pandas join if the match might appear in either of two columns?

Here is the situation: I have two pandas data frames:
TABLE 1:
name   alias   col3
str    str     str

TABLE 2:
name_or_alias   col2   col3
str             str    str
table1.name and table1.alias contain only unique values; there are no duplicates within or across the two columns.
I need to do a left join on table2, but the problem is that the matching column may be either table1.name OR table1.alias.
So, if I do:
table1.merge(table2, how='left', left_on='name', right_on='name_or_alias')
I will only get some of the matches. If I do:
table1.merge(table2, how='left', left_on='alias', right_on='name_or_alias')
I will also only get some of the matches. What I tried was to concat the two merges and remove the duplicates:
pd.concat([
    df1.merge(df2, how='left', left_on='name', right_on='name_or_alias'),
    df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias'),
], axis=0).pipe(lambda x: x[x.index.duplicated()])
but this doesn't remove the duplicates correctly, because if the match was found in one merge but not the other, the row won't be duplicated (it is null in one merge and populated in the other).
I need to figure out how to remove the rows if the match was found in the other table. Any ideas?
You can melt, merge, and drop_duplicates:
(df1
.reset_index()
.melt(['index', 'col3'], value_name='name_or_alias')
.merge(df2, on='name_or_alias', suffixes=(None, '_2'), how='left')
.drop_duplicates('index')
.set_index('index')
)
NB. To keep the original DataFrame, join the output to it.
Output:
col3 variable name_or_alias col2 col3_2
index
0 0 name A 5.0 8.0
1 1 name B 3.0 6.0
2 2 name C NaN NaN
Used input:
df1 = pd.DataFrame({'name': ['A', 'B', 'C'], 'alias': ['D', 'E', 'F'], 'col3': [0, 1, 2]})
df2 = pd.DataFrame({'name_or_alias': ['B', 'D', 'A'], 'col2': [3, 4, 5], 'col3': [6, 7, 8]})
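To see why this works, here is the intermediate result of the melt step (a sketch using the input above): name and alias are stacked into a single name_or_alias column, so one merge covers both join keys:
print(df1.reset_index().melt(['index', 'col3'], value_name='name_or_alias'))
#    index  col3 variable name_or_alias
# 0      0     0     name             A
# 1      1     1     name             B
# 2      2     2     name             C
# 3      0     0    alias             D
# 4      1     1    alias             E
# 5      2     2    alias             F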
I think the problem here is that when you concat the two merged dataframes, you generate duplicated records.
Assuming there are no duplicated values in name_or_alias, you can use combine_first instead of concat to combine the two merged dataframes.
It fills null values in the first dataframe with non-null values from the second.
df3=df1.merge(df2, how='left', left_on='name', right_on='name_or_alias')
df4=df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias')
df3.combine_first(df4)
example input:
df1 = pd.DataFrame({'name': ['Peter', 'Mary', 'John', 'Tom'],'alias': ['P', 'M', 'J', 'T'],'col3': ['a', 'b' , 'c', 'd']})
df2 = pd.DataFrame({'name_or_alias': ['Peter', 'M', 'Tom'],'col2': ['e', 'f', 'g'],'col3': ['h', 'i' , 'j']})
output:
df1:
name alias col3
0 Peter P a
1 Mary M b
2 John J c
3 Tom T d
df2:
name_or_alias col2 col3
0 Peter e h
1 M f i
2 Tom g j
result:
name alias col3_x name_or_alias col2 col3_y
0 Peter P a Peter e h
1 Mary M b M f i
2 John J c NaN NaN NaN
3 Tom T d Tom g j
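A cheap guard for the stated assumption (my addition): if name_or_alias contained duplicates, the left merges would produce extra rows and the index alignment behind combine_first would silently pair the wrong rows.
assert df2['name_or_alias'].is_unique, 'duplicates would misalign combine_first'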

Python: Replace data from one dataframe using other dataframe

How to replace data from df1 using dataframe df2 based on column A
df1 = pd.DataFrame({'A': [0, 1, 2, 0, 4],'B': [5, 6, 7, 5, 9],'C': ['a', 'b', 'c', 'a', 'e'],'E': ['a1', '1b', '1c', '1a', '1e']})
df2 = pd.DataFrame({'A': [0, 1],'B': ['new', 'new1'],'C': ['t', 't1']})
Use DataFrame.merge with a left join, replace the missing values with the original DataFrame via DataFrame.fillna, and finally select the original columns with df1.columns:
df = df1.merge(df2, on='A', how='left', suffixes=('_','')).fillna(df1)[df1.columns]
print(df)
A B C E
0 0 new t a1
1 1 new1 t1 1b
2 2 7 c 1c
3 0 new t 1a
4 4 9 e 1e
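Unpacking that one-liner (a sketch; the intermediate names are mine):
# overlapping columns coming from df1 get the '_' suffix; df2's keep their names
merged = df1.merge(df2, on='A', how='left', suffixes=('_', ''))
# fillna with a DataFrame aligns on index and column names, so rows whose
# 'A' is absent from df2 fall back to df1's original B and C values
filled = merged.fillna(df1)
df = filled[df1.columns]  # keep only the original columns A, B, C, E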
Here is another option.
# set both indexes to 'A' so the frames align on it
df1 = df1.set_index('A')
df2 = df2.set_index('A')
# overwrite df1's values from df2 for the shared index labels and columns
df1.loc[df2.index, df2.columns] = df2
# reset the index to get 'A' back as a column
df1 = df1.reset_index()
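With the df1/df2 above, this yields the same frame as the merge approach; note (my observation) that both A == 0 rows are updated, since .loc aligns by index label:
print(df1)
#    A     B   C   E
# 0  0   new   t  a1
# 1  1  new1  t1  1b
# 2  2     7   c  1c
# 3  0   new   t  1a
# 4  4     9   e  1e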

pandas merge and update efficiently

I am getting df1 from the database.
df2 needs to be merged with df1. df1 contains additional columns not present in df2. df2 contains indexes that already exist in df1, whose rows need to be updated. The dataframes are multi-indexed.
What I want:
- keep rows in df1 that are not in df2
- update df1's values with df2's values for matching indexes
- in the updated rows, keep the values of the columns that are not present in df2
- append rows that are in df2 but not in df1
My Solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E'], 'idx2': [1, 2, 3, 4, 5],
                         'one': ['df1', 'df1', 'df1', 'df1', 'df1'],
                         'two': ['y', 'x', 'y', 'x', 'y']})
df2 = pd.DataFrame(data={'idx1': ['D', 'E', 'F', 'G'], 'idx2': [4, 5, 6, 7],
                         'one': ['df2', 'df2', 'df2', 'df2']})
desired_result = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                    'idx2': [1, 2, 3, 4, 5, 6, 7],
                                    'one': ['df1', 'df1', 'df1', 'df2', 'df2', 'df2', 'df2'],
                                    'two': ['y', 'x', 'y', 'x', 'y', np.nan, np.nan]})

updated = pd.merge(df1[['idx1', 'idx2']], df2, on=['idx1', 'idx2'], how='right')
keep = df1[~df1.isin(df2)].dropna()
my_res = pd.concat([updated, keep])
my_res.drop(columns='two', inplace=True)
my_res = pd.merge(my_res, df1[['idx1', 'idx2', 'two']], on=['idx1', 'idx2'])
This is very inefficient, as I:
- merge df2 by right outer join into the index-only columns of df1
- find the rows that are in df1 but not in df2
- concat the two dataframes
- drop the columns that were not included in df2
- merge on the index columns to re-attach the columns I previously dropped
Is there maybe a more efficient, easier way to do this? I just cannot wrap my head around it.
EDIT:
By multi-indexed I mean that to identify a row I need to look at 4 different columns combined.
And unfortunately my solution does not work properly.
Merge the dataframes, update the column one with the values from one_, then drop this temporary column.
df = df1.merge(df2, on=['idx1', 'idx2'], how='outer', suffixes=['', '_'])
df['one'].update(df['one_'])
df.drop(columns=['one_'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
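Series.update is doing the real work here: it overwrites values in place, aligning on the index and skipping NaNs in the source, so rows with no df2 match keep their df1 value. A tiny sketch of that behavior:
s = pd.Series(['df1', 'df1', 'df1'])
s.update(pd.Series(['df2', None, 'df2']))
print(s.tolist())  # ['df2', 'df1', 'df2']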
Using DataFrame.append, DataFrame.drop_duplicates and Series.update:
First we append df2 to df1. Then we drop the duplicates based on columns idx1 and idx2, keeping df2's versions. Finally we update the NaN values in column two from the existing values in df1. (Note: DataFrame.append has since been removed in pandas 2.0; see the pd.concat sketch after the output below.)
df3 = (df1.append(df2, sort=False)
.drop_duplicates(subset=['idx1', 'idx2'], keep='last')
.reset_index(drop=True))
df3['two'].update(df1['two'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
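Since DataFrame.append is gone in pandas 2.0, an equivalent of the above with pd.concat (a sketch):
df3 = (pd.concat([df1, df2], sort=False)
       .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
       .reset_index(drop=True))
df3['two'].update(df1['two'])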
A one-liner with combine_first:
Yourdf = df2.set_index(['idx1', 'idx2']).combine_first(df1.set_index(['idx1', 'idx2'])).reset_index()
Yourdf
Out[216]:
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN

Subtract one row from another in Pandas DataFrame

I am trying to subtract one row from another in a Pandas DataFrame. I have multiple descriptor columns preceding one numerical column, forcing me to set the index of the DataFrame on the two descriptor columns.
When I do this, I get a KeyError on whichever column name is listed first in the set_index() list. In this case it is 'COL_A':
df = pd.DataFrame({'COL_A': ['A', 'A'],
                   'COL_B': ['B', 'B'],
                   'COL_C': [4, 2]})
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
df.reset_index(inplace=True)
KeyError: 'COL_A'
I cannot figure out why this results in a KeyError.
I came upon this question for a quick answer. Here's what my solution ended up being.
>>> df = pd.DataFrame(data=[[5,5,5,5], [3,3,3,3]], index=['r1', 'r2'])
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
>>> df.loc['r3'] = df.loc['r1'] - df.loc['r2']
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
r3 2 2 2 2
>>>
Not sure I understand you correctly:
df = pd.DataFrame({'COL_A': ['A', 'A'],
                   'COL_B': ['B', 'B'],
                   'COL_C': [4, 2]})
gives:
COL_A COL_B COL_C
0 A B 4
1 A B 2
then
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
yields:
             COL_C
COL_A COL_B
A     B        4.0
      B        0.5
If you now want to subtract, say row 0 from row 1, you can:
df.iloc[1].subtract(df.iloc[0])
to get:
COL_C -3.5
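And to write that difference back over row 1 in place (my addition, mirroring the OP's original pattern):
df.iloc[1] = df.iloc[1].subtract(df.iloc[0])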
