pandas compare 2 dataframes of different column names and different shape [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have 2 dataframes with different numbers of rows and different column names. I want to compare them and get the matching rows, restricted to those columns, as output.
e.g.
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333]})
df2 = pd.DataFrame({'AA': [1,22], 'BB': ['see','ab'], 'CC': [123,222]})
df1: foo bar foobar
0 11 aa 111
1 22 ab 222
2 33 ac 333
df2: AA BB CC
0 1 see 123
1 22 ab 222
df2 does not necessarily have the same number of rows and columns as df1.
expected output: for matching rows of df2 in df1
df3:
foo bar foobar
1 22 ab 222
I have tried using np.all, but this seems to work only if the dataframes have the same number of rows, or df2 has a single row.
df3 = df1.loc[np.all(df1[['bar','foobar']].values == df2[['BB','CC']].values, axis=1),:]
Essentially I need either the matching rows or the differing rows, from either df1 or df2.
expected output: for unmatched rows of df1 from df2
df3:
foo bar foobar
0 11 aa 111
2 33 ac 333
Now imagine this case: the column order is different, and I will handle the column mapping myself. For example, if the values of columns a, b, c of df1 equal the values of columns d, e, f of df2, get me the matched rows from df1 or df2.
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333], 'barfoo':[2,22,34]})
df2 = pd.DataFrame({'AA': [22,33], 'CC': [222,333], 'BB': ['ab','ac']})
output: in this case I am matching on (foo:AA, bar:BB, foobar:CC)
df3:
foo bar foobar barfoo
1 22 ab 222 22
2 33 ac 333 34
I appreciate any help, thanks.

You can temporarily rename the columns of df2 and perform the inner join (a.k.a. merge) on the two dataframes. It will find all rows that are present in both dataframes:
mapper = dict(zip(df2, df1)) # Column mapper
df2.rename(columns=mapper).merge(df1)
# foo bar foobar
#0 22 ab 222
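The question also asks for the unmatched rows of df1 (those with no counterpart in df2). A sketch of one way to get them, reusing the mapper above (merged and df1_unmatched are just illustrative names): do a left merge with indicator=True and keep only the 'left_only' rows:
merged = df1.merge(df2.rename(columns=mapper), how='left', indicator=True)
df1_unmatched = merged.loc[merged['_merge'] == 'left_only', df1.columns]
#   foo bar  foobar
# 0  11  aa     111
# 2  33  ac     333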

import pandas as pd
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333]})
df2 = pd.DataFrame({'AA': [1,22], 'BB': ['see','ab'], 'CC': [123,222]})
df3 = df2.rename(columns={'AA': 'foo', 'BB': 'bar', 'CC': 'foobar'})
df3 = df1.merge(df3, how='inner')
print('df1\n',df1)
print('df2\n',df2)
print('df3\n',df3)
Output
df1
foo bar foobar
0 11 aa 111
1 22 ab 222
2 33 ac 333
df2
AA BB CC
0 1 see 123
1 22 ab 222
df3
foo bar foobar
0 22 ab 222
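The same explicit mapping also covers the second example in the question, where the columns of df2 appear in a different order; a sketch with that data:
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333], 'barfoo': [2, 22, 34]})
df2 = pd.DataFrame({'AA': [22, 33], 'CC': [222, 333], 'BB': ['ab', 'ac']})
df3 = df1.merge(df2.rename(columns={'AA': 'foo', 'BB': 'bar', 'CC': 'foobar'}), how='inner')
print(df3)
#    foo bar  foobar  barfoo
# 0   22  ab     222      22
# 1   33  ac     333      34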

Related

Pandas conditional merge 2 dataframes with one to many relationship

I am trying to merge two pandas DataFrames with a one-to-many relationship. However, there are a couple of caveats, explained below.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC', 'DD'],
                    'col1': [1, 2, 3, 4],
                    'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB', 'BB', 'CC', 'DD'],
                    'col3': [0, 10, np.nan, 11, 12, np.nan]})
I'd like to merge the two DataFrames but ignore the 0 and np.nan values in df2 when joining. I cannot simply filter df2, as there are other columns that I need.
Basically, I'd like to join the rows with a one-to-many relationship, using the values that are not 0 or NaN.
Expected output:
name col1 col2 col3
0 AA 1 1 10.0
1 BB 2 2 11.0
2 CC 3 3 12.0
3 DD 4 4 NaN
How about this:
merged_Df = df1.merge(df2.sort_values(['name','col3'], ascending=False).groupby(['name']).head(1), on='name')
Output:
name col1 col2 col3
0 AA 1 1 10.0
1 BB 2 2 11.0
2 CC 3 3 12.0
3 DD 4 4 NaN
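A possible alternative, sketched under the assumption that col3 is the only df2 column you need to carry over (df2_best and merged_df are just illustrative names): treat 0 as missing, keep the maximum col3 per name, then do an ordinary left merge:
import numpy as np
df2_best = df2.replace(0, np.nan).groupby('name', as_index=False)['col3'].max()
merged_df = df1.merge(df2_best, on='name', how='left')
#   name  col1  col2  col3
# 0   AA     1     1  10.0
# 1   BB     2     2  11.0
# 2   CC     3     3  12.0
# 3   DD     4     4   NaN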
One way:
>>> df1.merge(df2).drop_duplicates(subset=['name'], keep='last')
name col1 col2 col3
1 AA 1 1 10.0
3 BB 2 2 11.0
4 CC 3 3 12.0
5 DD 4 4 NaN

Fill in values based on a different dataframe's values in pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have the following dataframe:
df1 = pd.DataFrame({'ID': ['foo', 'foo', 'bar', 'foo', 'baz', 'foo'], 'value': [1, 2, 3, 5, 4, 3]})
df2 = pd.DataFrame({'ID': ['foo', 'bar', 'baz', 'foo'],'age': [10, 21, 32, 15]})
I would like to create a new column in df1 called age and take the values from df2 that match on 'ID'. I would like those values to be duplicated (instead of NaN) when an 'ID' value appears more than once in df1.
I tried a merge of df1 and df2, but it produces NaNs instead of duplicates.
The Pandas Merging 101 post does not contain an answer for this problem.
I think you need an outer join:
df = pd.merge(df1, df2, on='ID', how='outer')
print(df)
ID value age
0 foo 1 10
1 foo 1 15
2 foo 2 10
3 foo 2 15
4 foo 5 10
5 foo 5 15
6 foo 3 10
7 foo 3 15
8 bar 3 21
9 baz 4 32
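If you only want to keep the rows of df1 (the question asks to add the column to df1), a left join does that; with this particular data it returns the same rows as the outer join above. A sketch:
df = pd.merge(df1, df2, on='ID', how='left')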

Update row values in a dataframe based on another row's values?

I have a dataframe with two columns: a and b
df
a b
0 john 123
1 john
2 mark
3 mark 456
4 marcus 789
I want to update the values of column b based on column a.
a b
0 john 123
1 john 123
2 mark 456
3 mark 456
4 marcus 789
If john has the value 123 in b, the remaining john rows must also have that value.
Assuming your dataframe is:
df = pd.DataFrame({'a': ['john', 'john', 'mark', 'mark', 'marcus'], 'b': [123, '', '', 456, 789]})
You can groupby the dataframe on column a and then apply a transform on column b of the grouped dataframe that returns the first non-empty value in each group.
Use:
df['b'] = (
    df.groupby('a')['b']
      .transform(lambda s: s[s.ne('')].iloc[0] if s.ne('').any() else s)
)
Result:
# print(df)
a b
0 john 123
1 john 123
2 mark 456
3 mark 456
4 marcus 789
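An equivalent approach, sketched here as an alternative and starting again from the original df above: replace the empty strings with NaN so that transform('first') picks the first non-null value in each group:
import numpy as np
df['b'] = df['b'].replace('', np.nan)
df['b'] = df.groupby('a')['b'].transform('first')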
Example:
df = pd.DataFrame({'A': [0, " ", 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df1 = df.replace({'A': " "}, 3)
Hope this helps. In your case it would be something like:
df1 = df.replace({'b': " "}, 123)

how to match two dataFrame in python

I have two DataFrames in Python.
The first one is df1:
'ID' 'B'
AA 10
BB 20
CC 30
DD 40
The second one is df2:
'ID' 'C' 'D'
BB 30 0
DD 35 0
What I want to get finally is like df3:
'ID' 'C' 'D'
BB 30 20
DD 35 40
How can I reach this goal?
My code is:
for i in df1.ID:
    if len(df2.ID[df2.ID == i]):
        df2.D[df2.ID == i] = df1.B[df2.ID == i]
but it doesn't work.
So first of all, I've interpreted the question differently, since your description is rather ambiguous. Mine boils down to this:
df1 is this data structure:
ID B <- column names
AA 10
BB 20
CC 30
DD 40
df2 is this data structure:
ID C D <- column names
BB 30 0
DD 35 0
DataFrames have a merge function; to merge the two frames on the ID column, the following code works:
import pandas as pd

df1 = pd.DataFrame(
    [
        ['AA', 10],
        ['BB', 20],
        ['CC', 30],
        ['DD', 40],
    ],
    columns=['ID', 'B'],
)
df2 = pd.DataFrame(
    [
        ['BB', 30, 0],
        ['DD', 35, 0],
    ],
    columns=['ID', 'C', 'D'],
)
df3 = pd.merge(df1, df2, on='ID')
Now df3 only contains rows with IDs present in both df1 and df2:
ID B C D <- column names
BB 20 30 0
DD 40 35 0
Now you wanted to fill D with the values of column B and drop B, i.e.
ID C D
BB 30 20
DD 35 40
Something that can be done with these simple steps:
df3 = pd.merge(df1, df2, on='ID') # merge them
df3.D = df3['B'] # set D to B's values
del df3['B'] # remove B from df3
Or to summarize:
def match(df1, df2):
    df3 = pd.merge(df1, df2, on='ID')  # merge them
    df3.D = df3['B']                   # set D to B's values
    del df3['B']                       # remove B from df3
    return df3
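Calling the helper on the example frames defined above returns the desired df3:
df3 = match(df1, df2)
print(df3)
#   ID   C   D
# 0 BB  30  20
# 1 DD  35  40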
The following code will replace the zeros in df1 with the corresponding values from df2:
df1=pd.DataFrame(['A','B',0,4,6],columns=['x'])
df2=pd.DataFrame(['A','X',3,0,5],columns=['x'])
df3=df1[df1!=0].fillna(df2)
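For the two example frames above this produces:
print(df3)
#    x
# 0  A
# 1  B
# 2  3
# 3  4
# 4  6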

How can I check the ID of a pandas data frame in another data frame in Python?

Hello, I have the following DataFrame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'ID_n': [3, 35, 0, 7, 1],
})
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want the index to be numeric you could use reset_index():
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index of df to ID, join df1 after also setting its index to ID, and then reset the index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).
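Another common pattern, sketched here as an alternative to the join: build a lookup Series from df1 and map it onto df's ID column (this assumes the IDs in df1 are unique, as they are here):
df['ID_n'] = df['ID'].map(df1.set_index('ID')['ID_n'])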
