I am getting df1 from the database.
df2 needs to be merged into df1. df1 contains additional columns not present in df2; df2 contains indexes that already exist in df1 and whose rows need to be updated. The dataframes are multi-indexed.
What I want:
- keep rows in df1 that are not in df2
- update df1's values with df2's values for matching indexes
- in the updated rows, keep the values of the columns that are not present in df2
- append rows that are in df2 but not in df1
My Solution:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E'],
                         'idx2': [1, 2, 3, 4, 5],
                         'one': ['df1', 'df1', 'df1', 'df1', 'df1'],
                         'two': ['y', 'x', 'y', 'x', 'y']})
df2 = pd.DataFrame(data={'idx1': ['D', 'E', 'F', 'G'],
                         'idx2': [4, 5, 6, 7],
                         'one': ['df2', 'df2', 'df2', 'df2']})
desired_result = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                    'idx2': [1, 2, 3, 4, 5, 6, 7],
                                    'one': ['df1', 'df1', 'df1', 'df2', 'df2', 'df2', 'df2'],
                                    'two': ['y', 'x', 'y', 'x', 'y', np.nan, np.nan]})
updated = pd.merge(df1[['idx1', 'idx2']], df2, on=['idx1', 'idx2'], how='right')
keep = df1[~df1.isin(df2)].dropna()
my_res = pd.concat([updated, keep])
my_res.drop(columns='two', inplace=True)
my_res = pd.merge(my_res,df1[['idx1','idx2','two']], on=['idx1','idx2'])
This is very inefficient, as I:
- merge df2 into the index-only columns of df1 with a right outer join
- find the rows of df1 that are not in df2
- concat the two dataframes
- drop the columns that were not included in df2
- merge on the index to reattach the columns I previously dropped
Is there maybe a more efficient, easier way to do this? I just cannot wrap my head around it.
EDIT:
By multi-indexed I mean that to identify a row I need to look at 4 different columns combined.
And unfortunately my solution does not work properly.
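Since a row here is identified by four columns combined, the approaches below generalize by passing the full key list wherever two keys are used; a minimal sketch with placeholder names k1..k4 standing in for the real key columns:
keys = ['k1', 'k2', 'k3', 'k4']  # placeholders for the four identifying columns
merged = df1.merge(df2, on=keys, how='outer', suffixes=['', '_'])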
Merge the dataframes, update the column one with the values from one_, then drop this temporary column. (Series.update skips NaN values, so rows present only in df1 keep their original one values.)
df = df1.merge(df2, on=['idx1', 'idx2'], how='outer', suffixes=['', '_'])
df['one'].update(df['one_'])
>>> df.drop(columns=['one_'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
Using DataFrame.append, DataFrame.drop_duplicates and Series.update:
First we append df2 to df1. Then we drop the duplicates based on the columns idx1 and idx2, keeping the last (df2) version of each row. Finally we fill the NaN values in the two column from the existing values in df1.
df3 = (df1.append(df2, sort=False)
          .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
          .reset_index(drop=True))
df3['two'].update(df1['two'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
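Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same chain with pd.concat instead:
df3 = (pd.concat([df1, df2], sort=False)
         .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
         .reset_index(drop=True))
df3['two'].update(df1['two'])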
One line with combine_first:
Yourdf = df2.set_index(['idx1', 'idx2']).combine_first(df1.set_index(['idx1', 'idx2'])).reset_index()
Yourdf
Out[216]:
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
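One caveat: combine_first keeps the caller's non-null values and falls back to the other frame only for NaNs, so a genuine NaN in df2 would be filled with df1's old value rather than overwriting it. A minimal sketch of that behaviour:
a = pd.DataFrame({'x': [None, 2.0]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [10.0, 20.0]}, index=['r1', 'r2'])
print(a.combine_first(b))  # r1 takes 10.0 from b; r2 keeps 2.0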
Related
I have two dataframes with matching keys. I would like to merge them based on their keys and have the corresponding columns line up side by side. I am not sure how to achieve this, as pd.merge displays all columns of the first dataframe and then all columns of the second dataframe:
df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})
print(pd.merge(df1, df2, on=['key']))
key col1_x col2_x col1_y col2_y
0 a 1 3 5 7
1 b 2 4 6 8
I am looking for a way to do the same merge and have the columns displays side by side as such:
key col1_x col1_y col2_x col2_y
0 a 1 5 3 7
1 b 2 6 4 8
Any help achieving this would be greatly appreciated!
If you're ok with a bit of a shuffle you can sort the columns.
df = pd.merge(df1, df2, on=['key'])
df = df.reindex(columns=sorted(df.columns))
or you could do this to keep key at the front (the merge has to happen before the column list is built):
df = pd.merge(df1, df2, on=['key'])
cols = list(df.columns)
cols.remove('key')
df = df.reindex(columns=['key'] + sorted(cols))
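If you would rather drive the order from df1's own column order than sort alphabetically, a sketch that builds the interleaved list explicitly (assuming the default _x/_y suffixes):
df = pd.merge(df1, df2, on=['key'])
order = ['key'] + [c + s for c in df1.columns if c != 'key' for s in ('_x', '_y')]
df = df[order]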
I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
I've tried the following. I know that this checks if there's a match of column B in both the dataframes - it returns a boolean value of True/False in the 'New' column. Is there a way to extract the value indicated under column 'C' when there's a match and add it to the 'New' column in df1 instead of the boolean values?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd
df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C':[1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C':[0, 9, 555, 15, 1]})
ind = df2[df2['B'].isin(df1['B'])].index
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2
B C
0 k 0
1 l 9
2 f 555
3 j 15
4 h 1
Output df1
B C new
0 a 1 NaN
1 b 5 NaN
2 f 777 555.0
3 d 10 NaN
4 h 3 1.0
Here ind holds the indexes of the df2 rows where there are matches. Then loc selects: on the left the row indices, on the right the column names.
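The positional alignment above works because the matches happen at the same row positions in both dataframes. When that is not guaranteed, a map-based sketch that keys on the value of B itself (assuming B is unique within df2) may be safer:
df1['new'] = df1['B'].map(df2.set_index('B')['C'])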
How to replace data from df1 using dataframe df2 based on column A
df1 = pd.DataFrame({'A': [0, 1, 2, 0, 4],'B': [5, 6, 7, 5, 9],'C': ['a', 'b', 'c', 'a', 'e'],'E': ['a1', '1b', '1c', '1a', '1e']})
df2 = pd.DataFrame({'A': [0, 1],'B': ['new', 'new1'],'C': ['t', 't1']})
Use DataFrame.merge with a left join, fill the resulting missing values from the original DataFrame with DataFrame.fillna, and finally filter the columns by df1.columns:
df = df1.merge(df2, on='A', how='left', suffixes=('_','')).fillna(df1)[df1.columns]
print(df)
A B C E
0 0 new t a1
1 1 new1 t1 1b
2 2 7 c 1c
3 0 new t 1a
4 4 9 e 1e
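To see why the suffixes matter: after the merge, df1's overlapping columns carry the '_' suffix while df2's values keep the original names; fillna(df1) then fills the unmatched rows from df1 by index and column label, and [df1.columns] drops the suffixed leftovers. A quick check of the intermediate columns:
step = df1.merge(df2, on='A', how='left', suffixes=('_', ''))
print(step.columns.tolist())  # ['A', 'B_', 'C_', 'E', 'B', 'C']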
Here is an option.
## set the same index on both frames
df1 = df1.set_index('A')
df2 = df2.set_index('A')
## update df1 in place for df2's rows and columns
df1.loc[df2.index, df2.columns] = df2
## reset the index to get A back as a column
df1 = df1.reset_index()
df1 has columns A B C D, df2 has columns A B D. df1 and df2 are in a list. How do I concatenate them into one dataframe?
Or can I directly append these dataframes into one single dataframe without using a list?
Short answer: yes, you can combine them into a single pandas dataframe without much work. Sample code:
import pandas as pd
df1 = [(1,2,3,4)]
df2 = [(9,9,9)]
df1 = pd.DataFrame(df1, columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(df2, columns=['A', 'B', 'D'])
df = pd.concat([df1, df2], sort=False)
Which results in:
>>> pd.concat([df1, df2], sort=False)
A B C D
0 1 2 3.0 4
0 9 9 NaN 9
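The duplicated 0 in the index comes from concat keeping each frame's own index; passing ignore_index=True renumbers the result from 0:
df = pd.concat([df1, df2], sort=False, ignore_index=True)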
I have two dataframes of different sizes (df1 and df2). I would like to remove from df1 all the rows that are stored within df2.
So if df2 equals:
A B
0 wer 6
1 tyu 7
And df1 equals:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal by using a for loop, but I would like to know if there is a more elegant and efficient way to perform this operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
for i, row in df2.iterrows():
    # keep rows that do not match this df2 row on both A and B
    df1 = df1[(df1['A'] != row['A']) | (df1['B'] != row['B'])].reset_index(drop=True)
Use merge with an outer join and indicator=True, filter with query, and finally remove the helper column with drop:
df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
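One thing to watch: with how='outer' the query also keeps rows that exist only in df2 (_merge == 'right_only'). If the goal is strictly "df1 minus df2", a left join filtered to left_only rows seems safer:
df = (pd.merge(df1, df2, on=['A','B'], how='left', indicator=True)
        .query("_merge == 'left_only'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))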
The cleanest way I found was to use drop with the index of the dataframe whose rows you want to remove (note this assumes the rows to drop carry the same index labels in both dataframes):
df1.drop(df2.index, axis=0, inplace=True)
You can use np.in1d (np.isin in newer NumPy) to check whether a row of df1 exists in df2, and then use the negated mask to select rows from df1:
import numpy as np

df1[~df1[['A','B']].apply(lambda x: np.in1d(x, df2).all(), axis=1)]\
    .reset_index(drop=True)
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
pandas has a method called isin; however, this relies on unique indices. We can define a lambda function that builds a combined key from the existing 'A' and 'B' columns of df1 and df2, and use it with isin. We then negate the result (as we want the values not in df2) and reset the index:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print(df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True))
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
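A variant of the same idea that avoids string concatenation (and any separator collisions, e.g. 'a_1' + 'b' vs 'a' + '1_b') is to compare MultiIndexes built from the key columns:
mask = df1.set_index(['A', 'B']).index.isin(df2.set_index(['A', 'B']).index)
print(df1[~mask].reset_index(drop=True))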
I think the cleanest way can be:
We have a base dataframe D and want to remove a subset D1; let the output be D2. (This assumes D1's rows keep the index labels they had in D.)
D2 = pd.DataFrame(D, index=set(D.index).difference(set(D1.index))).reset_index()
I find this other alternative useful too:
pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
keep=False drops both copies of every duplicate. (Note that rows duplicated within df1 itself on A and B would be dropped as well.)
It doesn't require the two dataframes to share all their columns, so I find it a bit easier.