Identifying rows shared between two Pandas DataFrames based on two columns - python

Related to: How to find row with same value in 2 columns between 2 dataframes but different values in other columns pandas
I have two DataFrames: df1 and df2.
I would like to find all the rows in these combined DataFrames that have identical values in 'columnA' (object) and 'columnB' (int). These rows will have differing values in other columns I don't care about. The shape of these DataFrames also differs.
I've tried something like:
concat = pd.concat([df1, df2])
overlap = concat[concat.duplicated(subset=['columnA','columnB'], keep=False)]
But the output doesn't look right (maybe it is). Just want to check - am I missing anything?
Edit:
Say I wanted all the rows with the same value in columnA but different values in columnB - would this work?
df3 = (concat[concat.duplicated(subset=['columnA'], keep=False)]
.drop_duplicates(subset=['columnB']))

You can use pd.merge
df1 = pd.DataFrame(data=[('A','B','C'),('E','F','G'),('A','B','F')], columns=['columnA','columnB','columnC'])
df2 = pd.DataFrame(data=[('X','Y','G'),('A','B','Y'),('A','C','F')], columns=['columnA','columnB','columnC'])
df2['columnB'] = df2['columnB'].astype(str) #convert to string
print(df1)
columnA columnB columnC
0 A B C
1 E F G
2 A B F
print(df2)
columnA columnB columnC
0 X Y G
1 A B Y
2 A C F
And then after applying pd.merge:
df_m = pd.merge(df1,df2,how='inner',on='columnA')
----
df_m
columnA columnB_x columnC_x columnB_y columnC_y
0 A B C B Y
1 A B C C F
2 A B F B Y
3 A B F C F
Regarding your edit, try this:
df_final = df_m[df_m['columnB_x'] != df_m['columnB_y']]
------
print(df_final)
columnA columnB_x columnC_x columnB_y columnC_y
1 A B C C F
3 A B F C F

Related

How to compare two dataframes in Python pandas and output the difference?

I have two df with the same numbers of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever in df2 that was previously not in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work with different rows of df.
Use DataFrame.explode by splitted values in columns col2, then use DataFrame.merge with right join and indicato parameter, filter by boolean indexing only rows with right_only and last aggregate join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
.groupby('col1')['col2']
.agg(','.join)
.reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2

How to replace data in one pandas df by the data of another one?

Want to replace some rows of some columns in a bigger pandas df by data in a smaller pandas df. The column names are same in both.
Tried using combine_first but it only updates the null values.
For example lets say df1.shape is 100, 25 and df2.shape is 10,5
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
To replace values in df1 the condition is where df1.A = df2.A and df1.B = df2.B
How can it be achieved in the most pythonic way? Any help will be appreciated.
Don't know I really understood your question does this solves your problem ?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
You could play with Multiindex.
First let us create those dataframe that you are working with:
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then transform A and B in indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()

Union of two pandas DataFrames

Say I have two data frames:
df1:
A
0 a
1 b
df2:
A
0 a
1 c
I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:
A B
0 a df1, df2
1 b df1
2 c df2
I can get the concatenated data frame (df3) without duplicates as follows:
import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)
I can't think of/find a method to have control over what element goes where. How can I add the extra column?
Thank you very much for any tips.
Merge with an indicator argument, and remap the result:
m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}
result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)
result
A B
0 a df1, df2
1 b df1
2 c df2
Use the command below:
df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) \
.groupby('A') \
.aggregate(list) \
.reset_index()
The result will be:
A source
0 a [df1, df2]
1 b [df1]
2 c [df2]
The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.
We use outer join to solve this -
df1 = pd.DataFrame({'A':['a','b']})
df2 = pd.DataFrame({'A':['a','c']})
df1['col1']='df1'
df2['col2']='df2'
df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
df['B']=df['col1']+','+df['col2']
df['B'] = df['B'].str.strip(',')
df=df[['A','B']]
df
A B
0 a df1,df2
1 b df1
2 c df2

How to create a dataframe from a column in another dataframe that concat by comma ','

i am wondering how to create a data frame that create from list of list and also transpose it
i have a data frame below
df : ColA ColB
A,B,C NA
D,E NA
df2 : ColA ColB Sales
A B 10
A C 10
B C 10
C B 10
C A 2
D E 100
My Expected Result
df3 : ColA TotalSales
A,B,C 42
D,E 100
Similar to your previous related question:
def summer(x):
values = x.split(',')
m1 = df2['ColA'].isin(values)
m2 = df2['ColB'].isin(values)
return df2.loc[m1 | m2, 'Sales'].sum()
df['TotalSales'] = df['ColA'].apply(summer)

Pandas - Interleave / Zip two DataFrames by row

Suppose I have two dataframes:
>> df1
0 1 2
0 a b c
1 d e f
>> df2
0 1 2
0 A B C
1 D E F
How can I interleave the rows? i.e. get this:
>> interleaved_df
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
inspired by this question (very similar, but asks on columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset the index i.e
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Output :
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
EDIT (OmerB) : Incase of keeping the order regardless of the index value then.
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
Here's an extension of #Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define Dataframes with the full set of column/ index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use #Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols col_a col_b col_c
df_id rows
1 one a b c
2 one A B C
1 two d e f
2 two D E F
You could also preallocate a new DataFrame, and then fill it using a slice.
def interleave(dfs):
data = np.transpose(np.array([np.empty(dfs[0].shape[0]*len(dfs), dtype=dt) for dt in dfs[0].dtypes]))
out = pd.DataFrame(data, columns=dfs[0].columns)
for ix, df in enumerate(dfs):
out.iloc[ix::len(dfs),:] = df.values
return out
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.
You can try this way :
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F

Categories