How to replace data from df1 using dataframe df2 based on column A
df1 = pd.DataFrame({'A': [0, 1, 2, 0, 4],'B': [5, 6, 7, 5, 9],'C': ['a', 'b', 'c', 'a', 'e'],'E': ['a1', '1b', '1c', '1a', '1e']})
df2 = pd.DataFrame({'A': [0, 1],'B': ['new', 'new1'],'C': ['t', 't1']})
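Use DataFrame.merge with a left join, fill the missing values from the original DataFrame with DataFrame.fillna (which aligns on the index, so df1 should have its default RangeIndex), and finally select the original columns with df1.columns: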
df = df1.merge(df2, on='A', how='left', suffixes=('_','')).fillna(df1)[df1.columns]
print(df)
A B C E
0 0 new t a1
1 1 new1 t1 1b
2 2 7 c 1c
3 0 new t 1a
4 4 9 e 1e
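Here is another option: set A as the index of both dataframes, assign the df2 values through loc, then reset the index.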
##set index to be the same
df1 = df1.set_index('A')
df2 = df2.set_index('A')
##update df1
df1.loc[df2.index,df2.columns] = df2
##reset the index to get it back to a column
df1 = df1.reset_index()
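A third variant, sketched here under the assumption that the values of A are unique in df2, looks each replacement column up with Series.map and falls back to the original value where there is no match. The names lookup, mapped and out are illustrative only; the sample frames are the ones from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2, 0, 4], 'B': [5, 6, 7, 5, 9],
                    'C': ['a', 'b', 'c', 'a', 'e'],
                    'E': ['a1', '1b', '1c', '1a', '1e']})
df2 = pd.DataFrame({'A': [0, 1], 'B': ['new', 'new1'], 'C': ['t', 't1']})

# One lookup Series per replacement column, keyed by A (assumed unique in df2).
lookup = df2.set_index('A')

out = df1.copy()
for col in lookup.columns:
    # NaN wherever the A value has no counterpart in df2.
    mapped = out['A'].map(lookup[col])
    # Keep the mapped value where one exists, otherwise keep df1's own value.
    out[col] = mapped.astype(object).where(mapped.notna(), out[col])

print(out)
```

Because the lookup goes through the values of A rather than through row positions, this does not depend on how df1 is indexed.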
I have two dataframes with matching keys. I would like to merge them together based on their keys and have the corresponding columns line up side by side. I am not sure how to achieve this as the pd.merge displays all columns for the first dataframe and then all columns for the second data frame:
df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})
print(pd.merge(df1, df2, on=['key']))
key col1_x col2_x col1_y col2_y
0 a 1 3 5 7
1 b 2 4 6 8
I am looking for a way to do the same merge and have the columns displayed side by side, like this:
key col1_x col1_y col2_x col2_y
0 a 1 5 3 7
1 b 2 6 4 8
Any help achieving this would be greatly appreciated!
If you're ok with a bit of a shuffle you can sort the columns.
df = pd.merge(df1, df2, on=['key'])
df = df.reindex(columns = sorted(df.columns))
Or, to keep the key column at the front:
df = pd.merge(df1, df2, on=['key'])
cols = list(df.columns)
cols.remove('key')
df = df.reindex(columns=['key'] + sorted(cols))
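If alphabetical sorting would scramble the column order, the interleaved order can also be built explicitly. This is a minimal sketch that assumes every non-key column exists in both frames (so pandas applies the default _x and _y suffixes to all of them); value_cols and ordered are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})

merged = pd.merge(df1, df2, on='key')

# Non-key columns in df1's original order.
value_cols = [c for c in df1.columns if c != 'key']

# Pair each column's left (_x) and right (_y) version next to each other.
ordered = ['key'] + [f'{c}{suffix}' for c in value_cols for suffix in ('_x', '_y')]
merged = merged[ordered]

print(merged)
```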
I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
Example:
df1
df2
Expected Result of df1:
I've tried the following. I know that this checks if there's a match of column B in both the dataframes - it returns a boolean value of True/False in the 'New' column. Is there a way to extract the value indicated under column 'C' when there's a match and add it to the 'New' column in df1 instead of the boolean values?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd
df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C':[1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C':[0, 9, 555, 15, 1]})
ind = df2[df2['B'].isin(df1['B'])].index
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2
B C
0 k 0
1 l 9
2 f 555
3 j 15
4 h 1
Output df1
B C new
0 a 1 NaN
1 b 5 NaN
2 f 777 555.0
3 d 10 NaN
4 h 3 1.0
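Here ind holds the index labels of the rows of df2 whose B value also occurs in df1. loc then takes the row labels as its first argument and the column name as its second. Note that this only lines up correctly when the matching rows sit at the same index label in both dataframes, as they do in this example.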
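When the matching rows are not at the same positions, a lookup by value is safer. The sketch below uses the df1 from this answer but a reordered, hypothetical df2 to show that row position no longer matters; it assumes that keeping the first C per B value is acceptable if B repeats in df2.

```python
import pandas as pd

df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C': [1, 5, 777, 10, 3]})
# Hypothetical df2 with the matching rows in different positions.
df2 = pd.DataFrame({'B': ['h', 'f', 'k'], 'C': [1, 555, 0]})

# Series mapping each B value in df2 to its C value (first occurrence wins).
lookup = df2.drop_duplicates('B').set_index('B')['C']

# Look up every B of df1; values without a match become NaN.
df1['new'] = df1['B'].map(lookup)

print(df1)
```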
I need to merge two dataframes, but the merge can be made on either of two columns of the right-hand dataframe.
df_1 = pd.DataFrame({'col' : ['a', 'b', 'c']})
df_2 = pd.DataFrame({'col_a' : ['a', 'b', np.nan], 'col_b' : ['z', np.nan, 'c']})
df_1.merge(df_2, how = 'left', left_on = 'col', right_on = 'col_a')
In the example above, the merge is finding a match for col == 'a' and col == 'b', because df_2 contains those values in its col_a column. But I would also like it to find the match with the col_b == 'c' of df_2. If regex worked with merge, a good solution would look this way:
df_1.merge(df_2, how = 'left', left_on = 'col', right_on = 'col_a|col_b')
The output should look like this:
col col_a col_b
a a z
b b NaN
c NaN c
Any ideas?
I believe what we are looking for here is to merge twice, concatenate the results and drop any duplicates that might result from col_a and col_b being the same.
import numpy as np
import pandas as pd
df_1 = pd.DataFrame({'col' : ['a', 'c', 'b']})
df_2 = pd.DataFrame({'col_a' : ['b', np.nan, 'a', 'a', 'c'], 'col_b' : [np.nan, 'c', 'z', 'b', 'c']})
df = (
    pd.concat([
        df_1.merge(df_2, left_on='col', right_on='col_a'),
        df_1.merge(df_2, left_on='col', right_on='col_b'),
    ]).drop_duplicates()
    .reset_index(drop=True)
)
print(df)
# col col_a col_b
# 0 a a z
# 1 a a b
# 2 c c c
# 3 b b NaN
# 4 c NaN c
# 5 b a b
We see that this deals with three cases:
-a matches col_a twice
-b matches col_a and col_b separately (including a row whose col_a is a)
-c matches col_a and col_b at the same time but isn't duplicated in the output.
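Another way to express "match on col_a or col_b" is to reshape df_2 so that both columns feed a single key column, and merge once on that. This is only a sketch using the question's original frames; row_id, right, keys and out are hypothetical helper names.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df_2 = pd.DataFrame({'col_a': ['a', 'b', np.nan], 'col_b': ['z', np.nan, 'c']})

# Give every df_2 row an explicit id so it can be pulled back in later.
right = df_2.rename_axis('row_id').reset_index()

# Long table of (row_id, key): one row per non-missing value in col_a / col_b.
keys = (right.melt(id_vars='row_id', value_vars=['col_a', 'col_b'], value_name='col')
             .dropna(subset=['col'])[['row_id', 'col']]
             .drop_duplicates())

# Match df_1 against the combined keys, then attach the full df_2 rows.
out = (df_1.merge(keys, on='col', how='left')
           .merge(right, on='row_id', how='left')
           .drop(columns='row_id'))

print(out)
```

The drop_duplicates call keeps a df_2 row from being attached twice when its col_a and col_b hold the same value.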
You could perform both merges and use combine_first to fuse the two merges:
(df_1.merge(df_2, left_on='col', right_on='col_a', how='left')
.combine_first(df_1.merge(df_2, left_on='col', right_on='col_b', how='left'))
)
output:
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
Other example (without the pitfall of the already aligned index):
df_1 = pd.DataFrame({'col' : ['a', 'c', 'b']})
df_2 = pd.DataFrame({'col_a' : ['b', np.nan, 'a'], 'col_b' : [np.nan, 'c', 'z']})
output:
col col_a col_b
0 a a z
1 c NaN c
2 b b NaN
Let's try join, given your expected output (note that join aligns the rows on the index, not on the values in col):
df_1.join(df_2)
output
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
Or
df_1.merge(df_2, how='left', left_on='col', right_on='col_a').combine_first(df_2)
output
col col_a col_b
0 a a z
1 b b NaN
2 c NaN c
I am getting df1 from the database.
df2 needs to be merged into df1. df1 contains additional columns that are not present in df2. df2 contains index values that are already present in df1, and those rows need to be updated. The dataframes are multi-indexed.
What I want:
-keep rows in df1 that are not in df2
-update df1's values with df2's values for matching indexes
-in the updated rows keep the values of the columns that are not present in df2.
-append rows that are in df2 but not in df1
My Solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
data={'idx1': ['A', 'B', 'C', 'D', 'E'], 'idx2': [1, 2, 3, 4, 5], 'one': ['df1', 'df1', 'df1', 'df1', 'df1'],
'two': ["y", "x", "y", "x", "y"]})
df2 = pd.DataFrame(data={'idx1': ['D', 'E', 'F', 'G'], 'idx2': [4, 5, 6, 7], 'one': ['df2', 'df2', 'df2', 'df2']})
desired_result = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 'idx2': [1, 2, 3, 4, 5, 6, 7],
'one': ['df1','df1','df1','df2', 'df2', 'df2', 'df2'], 'two': ["y", "x", "y", "x", "y",np.nan,np.nan]})
updated = pd.merge(df1[['idx1', 'idx2']], df2, on=['idx1', 'idx2'], how='right')
keep = df1[~df1.isin(df2)].dropna()
my_res = pd.concat([updated, keep])
my_res.drop(columns='two', inplace=True)
my_res = pd.merge(my_res,df1[['idx1','idx2','two']], on=['idx1','idx2'])
This is very inefficient, as I:
-merge df2 into the index-only columns of df1 with a right outer join
-find indexes that are in df2 but not in df1
-concat the two dataframes
-drop the columns that were not included in df2
-merge on the index to append the columns that I previously dropped
Is there maybe a more efficient, easier way to do this? I just cannot wrap my head around this.
EDIT:
By multi-indexed I mean that to identify a row I need to look at 4 different columns combined.
And unfortunately my solution does not work properly.
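Merge the dataframes, overwrite column one with the values from one_ wherever df2 supplied one, then drop this temporary column.
df = df1.merge(df2, on=['idx1', 'idx2'], how='outer', suffixes=('', '_'))
# assign the result back: df['one'].update(...) may only change a temporary Series under copy-on-write
df['one'] = df['one_'].combine_first(df['one'])
>>> df.drop(columns=['one_'])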
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
Using pd.concat, DataFrame.drop_duplicates and Series.fillna (DataFrame.append was removed in pandas 2.0):
First we concatenate df1 and df2. Then we drop the duplicates based on columns idx1 and idx2, keeping the rows from df2. Finally we fill the NaN values in column two from the existing values in df1, which line up by index here.
df3 = (pd.concat([df1, df2], sort=False)
       .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
       .reset_index(drop=True))
df3['two'] = df3['two'].fillna(df1['two'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
One line combine_first
Yourdf=df2.set_index(['idx1','idx2']).combine_first(df1.set_index(['idx1','idx2'])).reset_index()
Yourdf
Out[216]:
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
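The combine_first one-liner can be wrapped in a small helper, sketched here with the frames from the question. The function name upsert and its parameter names are hypothetical, not part of pandas.

```python
import pandas as pd

def upsert(base, new, keys):
    """Update rows of `base` with `new` on `keys` and append unmatched rows."""
    # Values from `new` take priority; gaps are filled from `base`.
    return (new.set_index(keys)
               .combine_first(base.set_index(keys))
               .reset_index())

df1 = pd.DataFrame({'idx1': ['A', 'B', 'C', 'D', 'E'], 'idx2': [1, 2, 3, 4, 5],
                    'one': ['df1'] * 5, 'two': ['y', 'x', 'y', 'x', 'y']})
df2 = pd.DataFrame({'idx1': ['D', 'E', 'F', 'G'], 'idx2': [4, 5, 6, 7],
                    'one': ['df2'] * 4})

result = upsert(df1, df2, ['idx1', 'idx2'])
print(result)
```

One design consequence to keep in mind: combine_first treats NaN in df2 as "no value", so a missing value in df2 never overwrites an existing value in df1.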
I have two dataframes of different sizes (df1 and df2). I would like to remove from df1 all the rows that are stored in df2.
So if I have df2 equals to:
A B
0 wer 6
1 tyu 7
And df1 equals to:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal by using a for loop but I would like to know if there is a better and more elegant and efficient way to perform such operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
for i, row in df2.iterrows():
    df1 = df1[(df1['A']!=row['A']) & (df1['B']!=row['B'])].reset_index(drop=True)
Use merge with an outer join and indicator=True, filter with query, and finally remove the helper column with drop:
df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
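A closely related variant is a left anti-join: merge with how='left' and keep only the rows marked left_only. The sketch below uses the frames from the question; compared with the outer join it cannot let rows that exist only in df2 through, and a left join keeps df1's row order.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'], 'B': [6, 7]})

out = (df1.merge(df2.drop_duplicates(), on=['A', 'B'], how='left', indicator=True)
          .query("_merge == 'left_only'")   # rows of df1 with no partner in df2
          .drop(columns='_merge')
          .reset_index(drop=True))

print(out)
```

The drop_duplicates call on df2 is a precaution so that repeated (A, B) pairs in df2 cannot multiply rows of df1 during the merge.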
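If df2 is a subset of df1 that kept df1's index labels, the cleanest way I found is to pass that index to drop. This removes rows by index label only, so with the example frames above it would drop rows 0 and 1 whatever they contain: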
df1.drop(df2.index, axis=0,inplace=True)
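You can use np.isin (the successor to the deprecated np.in1d) to check whether the values of a row in df1 exist in df2, and then use the negated result as a mask to select rows from df1. Note that this tests each value against all values of df2, not against matching (A, B) pairs.
df1[~df1[['A','B']].apply(lambda x: np.isin(x,df2).all(),axis=1)]\
.reset_index(drop=True)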
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
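pandas has a method called isin, but DataFrame.isin called with another DataFrame compares values at matching index and column labels, so it cannot be used directly here. Instead, we can define a lambda function that builds a single key from the existing 'A' and 'B' columns of df1 and df2 and use Series.isin on that. We then negate the mask (as we want the rows not in df2) and reset the index: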
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print(df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True))
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
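The same membership test can be written without string concatenation by turning the (A, B) pairs into a MultiIndex, which avoids accidental collisions such as 'a_1' + '2' versus 'a' + '1_2'. A minimal sketch with the same frames; mask and out are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'], 'B': [6, 7]})

# Boolean array: True where the (A, B) pair of a df1 row also occurs in df2.
mask = pd.MultiIndex.from_frame(df1[['A', 'B']]).isin(
    pd.MultiIndex.from_frame(df2[['A', 'B']]))

out = df1[~mask].reset_index(drop=True)
print(out)
```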
I think the cleanest way can be the following, assuming the subset D1 kept the index labels of D:
We have a base dataframe D and want to remove a subset D1. Let the output be D2.
D2 = D.loc[D.index.difference(D1.index)].reset_index()
I find this other alternative useful too:
pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
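keep=False drops every row whose (A, B) pair appears more than once, so both copies of a match are removed. Note that this also drops (A, B) pairs repeated within df1 itself and keeps rows that occur only in df2.
The two dataframes do not need to have identical columns, so I find that a bit easier.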