How to join pandas dataframe on 2 columns? - python

Assume the following DataFrames
df1:
id data1
1 10
2 200
3 3000
4 40000
df2:
id1 id2 data2
1 2 210
1 3 3010
1 4 40010
2 3 3200
2 4 40200
3 4 43000
I want the new df3:
id1 id2 data2 data11 data12
1 2 210 10 200
1 3 3010 10 3000
1 4 40010 10 40000
2 3 3200 200 3000
2 4 40200 200 40000
3 4 43000 3000 40000
What is the correct way to achieve this in pandas?
Edit: Please note that the specific data can be arbitrary. I chose this data just to show where everything comes from; no data element is correlated with any other.
Another example with different dataframes, because the first one wasn't clear enough:
df4:
id data1
1 a
2 b
3 c
4 d
df5:
id1 id2 data2
1 2 e
1 3 f
1 4 g
2 3 h
2 4 i
3 4 j
I want the new df6:
id1 id2 data2 data11 data12
1 2 e a b
1 3 f a c
1 4 g a d
2 3 h b c
2 4 i b d
3 4 j c d
Edit2:
data11 and data12 are simply copies of data1, looked up by the corresponding id (id1 for data11, id2 for data12).

1. First merge df2 with df1, matching id1 against id.
2. Rename data1 to data11.
3. Drop the id column.
4. Now merge the result (df3) with df1 again, matching id2 against id, then rename data1 to data12 and drop id as before.
df3 = pd.merge(df2,df1,left_on=['id1'],right_on=['id'],how='left')
df3.rename(columns={'data1':'data11'},inplace=True)
df3.drop('id',axis=1,inplace=True)
df3 = pd.merge(df3,df1,left_on=['id2'],right_on=['id'],how='left')
df3.rename(columns={'data1':'data12'},inplace=True)
df3.drop('id',axis=1,inplace=True)
I hope it would solve your problem
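The steps above can be sketched end to end with the sample data from the question. This is a minimal sketch, assuming df1 and df2 are built by hand as shown; the intermediate name step is only illustrative and does not appear in the answer.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": [10, 200, 3000, 40000]})
df2 = pd.DataFrame({
    "id1": [1, 1, 1, 2, 2, 3],
    "id2": [2, 3, 4, 3, 4, 4],
    "data2": [210, 3010, 40010, 3200, 40200, 43000],
})

# Steps 1-3: look up data1 for id1, rename it, drop the helper id column
step = (df2.merge(df1, left_on="id1", right_on="id", how="left")
           .rename(columns={"data1": "data11"})
           .drop(columns="id"))

# Step 4: repeat the lookup for id2
df3 = (step.merge(df1, left_on="id2", right_on="id", how="left")
           .rename(columns={"data1": "data12"})
           .drop(columns="id"))

print(df3)
```

Chaining rename and drop avoids the inplace=True calls, which makes it harder to end up with a half-modified frame if one step fails.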

Try this:
# merge dataframes, first on id and id1 then on id2
df3 = pd.merge(df1, df2, left_on="id", right_on="id1", how="inner")
df3 = pd.merge(df1, df3, left_on="id", right_on="id2", how="inner")
# rename and reorder columns
cols = [ 'id1', 'id2', 'data2', 'data1_y', 'data1_x']
df3 = df3[cols]
new_cols = ["id1", "id2", "data2", "data11", "data12"]
df3.columns = new_cols
# sort on both keys and reset the index so the printed rows are numbered 0..5
df3 = df3.sort_values(["id1", "id2"], ignore_index=True)
print(df3)
This prints out:
id1 id2 data2 data11 data12
0 1 2 210 10 200
1 1 3 3010 10 3000
2 1 4 40010 10 40000
3 2 3 3200 200 3000
4 2 4 40200 200 40000
5 3 4 43000 3000 40000

One solution to your problem is:
data1 = {'id' : [1,2,3,4],
'data1' : [10,200,3000,40000]}
data2 = {'id1' : [1,1,1,2,2,3],
'id2' : [2,3,4,3,4,4],
'data2' : [210,3010,40010,3200,40200,43000]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1:
id data1
1 10
2 200
3 3000
4 40000
df2:
id1 id2 data2
1 2 210
1 3 3010
1 4 40010
2 3 3200
2 4 40200
3 4 43000
df3 = df2.set_index('id1').join(df1.set_index('id'))
df3.index.names = ['id1']
df3.reset_index(inplace=True)
final = df3.set_index('id2').join(df1.set_index('id'), rsuffix='2')
final.index.names = ['id2']
final.reset_index(inplace=True)
final[['id1','id2','data2','data1','data12']].sort_values('id1')
output df:
id1 id2 data2 data1 data12
1 2 210 10 200
1 3 3010 10 3000
1 4 40010 10 40000
2 3 3200 200 3000
2 4 40200 200 40000
3 4 43000 3000 40000
I hope this will help you.

Using merge in a list comprehension with range and f-strings
One way to generalise this, and to make it easier to extend to more than two id columns, is to run the merge inside a list comprehension over range, building the column names with f-strings.
After that we concatenate the results side by side, drop the id column, and remove the duplicated column names:
dfs = [df2.merge(df1,
left_on=f'id{x+1}',
right_on='id',
how='left').rename(columns={'data1':f'data1{x+1}'}) for x in range(2)]
df = pd.concat(dfs, axis=1).drop('id', axis=1)
df = df.loc[:, ~df.columns.duplicated()]
Output
id1 id2 data2 data11 data12
0 1 2 210 10 200
1 1 3 3010 10 3000
2 1 4 40010 10 40000
3 2 3 3200 200 3000
4 2 4 40200 200 40000
5 3 4 43000 3000 40000

As @tawab_shakeel mentioned earlier, the main step is to merge the DataFrames on a particular column according to (SQL-style) join rules. To understand the different approaches to merging on specific columns, here is a general guide:
Joining Dataframes in Pandas
SQL Join Types
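To make the difference between join rules concrete, here is a small sketch comparing how='left' and how='inner'. It assumes a hypothetical row in df2 whose id1 (99) has no match in df1; that row is invented for illustration and is not part of the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": ["a", "b", "c", "d"]})
# Hypothetical data: the last row's id1 (99) does not exist in df1
df2 = pd.DataFrame({"id1": [1, 2, 99], "data2": ["e", "h", "z"]})

# how="left" keeps every row of df2; unmatched lookups become NaN
left = df2.merge(df1, left_on="id1", right_on="id", how="left")

# how="inner" keeps only rows whose key exists in both frames
inner = df2.merge(df1, left_on="id1", right_on="id", how="inner")

print(len(left), len(inner))
```

For the lookup in the question, a left join is usually the safer default because it never silently drops rows of df2.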

Use two left merges on df2, one on column id1 and one on column id2:
txt="""id,data1
1,a
2,b
3,c
4,d
"""
import pandas as pd
from io import StringIO
f = StringIO(txt)
df1 = pd.read_table(f,sep =',')
df1['id']=df1['id'].astype(int)
txt="""id1,id2,data2
1,2,e
1,3,f
1,4,g
2,3,h
2,4,i
3,4,j
"""
f = StringIO(txt)
df2 = pd.read_table(f,sep =',')
df2['id1']=df2['id1'].astype(int)
df2['id2']=df2['id2'].astype(int)
left_on='id1'
right_on='id'
suffix='_1'
df2=df2.merge(df1, how='left', left_on=left_on, right_on=right_on,
suffixes=("", suffix))
left_on='id2'
right_on='id'
suffix='_2'
df2=df2.merge(df1, how='left', left_on=left_on, right_on=right_on,
suffixes=("", suffix))
print(df2)
output
id1 id2 data2 id data1 id_2 data1_2
0 1 2 e 1 a 2 b
1 1 3 f 1 a 3 c
2 1 4 g 1 a 4 d
3 2 3 h 2 b 3 c
4 2 4 i 2 b 4 d
5 3 4 j 3 c 4 d

Related

How can i concat 2 pandas dataframe with column filter

I have 2 pandas dataframe
DF1
rowid city id2 id3
1 citya 10 8
2 cityb 20 9
DF2
city id2 id3
cityc 10 8
cityd 10 4
citye 10 1
citye 20 9
cityf 20 4
citye 20 1
I want to concatenate the two dataframes by their id2 values, adding the DF2 rows under the matching DF1 rows without duplicated values.
Note: DF1 has many id2 values with different row numbers, like (id2: 10, id3: 2), and I need to filter by row values before inserting the DF2 values under the DF1 rows.
The expected result looks like this:
rowid  city   id2  id3
1      citya  10   8
       cityd  10   4
       citye  10   1
2      cityb  20   9
       cityf  20   4
       cityg  20   1
I don't have any idea how to do this.
You can use concat and drop_duplicates:
>>> (pd.concat([df1, df2.assign(rowid='')])
.drop_duplicates(['id2', 'id3'])
.sort_values('id2', ignore_index=True))
  rowid   city  id2  id3
0     1  citya   10    8
1        cityd   10    4
2        citye   10    1
3     2  cityb   20    9
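A self-contained sketch of the same idea with the question's DF1 and DF2 follows. It adds kind='stable' to sort_values, which is an assumption on my part and not in the answer above, so that rows keep their concat order within each id2 group.

```python
import pandas as pd

df1 = pd.DataFrame({"rowid": [1, 2], "city": ["citya", "cityb"],
                    "id2": [10, 20], "id3": [8, 9]})
df2 = pd.DataFrame({
    "city": ["cityc", "cityd", "citye", "citye", "cityf", "citye"],
    "id2": [10, 10, 10, 20, 20, 20],
    "id3": [8, 4, 1, 9, 4, 1],
})

out = (pd.concat([df1, df2.assign(rowid="")])
         # keep the first occurrence of each (id2, id3) pair, i.e. the df1 row
         .drop_duplicates(["id2", "id3"])
         # stable sort: df1 rows stay ahead of the df2 rows in each id2 group
         .sort_values("id2", kind="stable", ignore_index=True))
print(out)
```

Since df1 comes first in the concat, drop_duplicates keeps the df1 row whenever a (id2, id3) pair appears in both frames, which is why cityc is dropped in favour of citya.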

Pandas replace columns by merging another dataframe

I have a dataframe df1 that looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
I could drop column A from df1 first and then merge df2 into df1 on id, like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if df2 has something like 10 columns, this becomes pretty cumbersome. Is there a more elegant way to do so?
Here is one way to do it, using DataFrame.update. However, it requires setting id as the index on both dataframes so that the rows can be matched:
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.update(df2)
df1['A'] = df1['A'].astype(int) # value by default was of type float
df1.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
Merge just the id column of df1 with df2, and then use combine_first to fill in the remaining columns from the original DataFrame:
df1 = df1[['id']].merge(df2).combine_first(df1)
print(df1)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2
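Another option is a per-column lookup with Series.map, which keeps the original column order. This is a sketch under the assumption that id is unique in df2; the loop over df2's non-id columns and the names lookup and result are illustrative and not taken from the answers above.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 1, 2, 2], "A": [10, 11, 10, 11], "B": [5, 6, 7, 8]})
df2 = pd.DataFrame({"id": [1, 2], "A": [3, 4]})

# Index df2 by id so each remaining column can serve as a lookup table
lookup = df2.set_index("id")

result = df1.copy()
for col in lookup.columns:
    # overwrite each shared column with df2's value for the row's id
    result[col] = result["id"].map(lookup[col])
print(result)
```

Note that an id missing from df2 would produce NaN in the replaced column, so this sketch only fits the case where every id in df1 also occurs in df2.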

Python Pandas Dataframe Groupby Sum question

I'm new to Python and I need to combine two dataframes, with 'id' as the primary key. I need to sum up all the Charges from df1 and df2.
df1:
id Name Charge
1 A 100
1 A 100
2 B 200
2 B 200
5 C 300
6 D 400
df2:
id Name Charge
1 A 100
1 A 100
2 B 200
8 X 200
output:
id Name Charge(TOTAL from df1 & df2)
1 A 400
2 B 600
5 C 300
6 D 400
8 X 200
Try:
pd.concat([df1, df2]).groupby(['id', 'Name'], as_index=False)['Charge'].sum()
Output:
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200
ans = pd.concat([df1, df2], axis=0).groupby(["id", "Name"]).sum().reset_index()
print(ans)
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200

pandas dataframe how to merge all rows based on groupby

I have dataframe with many columns, 2 are categorical and the rest are numeric:
df = [type1 , type2 , type3 , val1, val2, val3
a b q 1 2 3
a c w 3 5 2
b c t 2 9 0
a b p 4 6 7
a c m 2 1 8]
I want to apply a merge based on the operation groupby(["type1","type2"]) that will create the following dataframe:
df = [type1 , type2 ,type3, val1, val2, val3 , val1_a, val2_b, val3_b
a b q 1 2 3 4 6 7
a c w 3 5 2 2 1 8
b c t 2 9 0 2 9 0
Please note: each group can contain 1 or 2 rows, but not more. If there is only 1 row, just duplicate it.
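The idea is to use GroupBy.cumcount to build a counter within each (type1, type2) group and add it as a third index level, creating a MultiIndex. DataFrame.unstack then reshapes the counter into columns, ffill(axis=1) forward-fills missing values along each row (which duplicates single rows), the values are converted to integers, the columns are sorted by the counter level, and finally a list comprehension flattens the column MultiIndex: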
g = df.groupby(["type1","type2"]).cumcount()
df1 = (df.drop(columns="type3")  # non-numeric; astype(int) below would fail on it
.set_index(["type1","type2", g])
.unstack()
.ffill(axis=1)
.astype(int)
.sort_index(level=1, axis=1))
df1.columns = [f'{a}_{b}' if b != 0 else a for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
type1 type2 val1 val2 val3 val1_1 val2_1 val3_1
0 a b 1 2 3 4 6 7
1 a c 3 5 2 2 1 8
2 b c 2 9 0 2 9 0
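To see what the counter does, the following sketch prints the GroupBy.cumcount values for the question's rows. Only val1 is kept, to keep the example short; that trimming is my simplification and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "type1": ["a", "a", "b", "a", "a"],
    "type2": ["b", "c", "c", "b", "c"],
    "val1": [1, 3, 2, 4, 2],
})

# cumcount numbers the rows inside each (type1, type2) group: 0, 1, ...
g = df.groupby(["type1", "type2"]).cumcount()
print(g.tolist())
```

The first row of every group gets 0 and the second gets 1, so after unstack the 0 columns hold the first row's values and the 1 columns hold the second row's values.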

How to extract values of one dataframe with values of other dataframe in pandas?

Suppose you create the following pandas data frames:
In[1]: print(df1.to_string())
ID value
0 1 a
1 2 b
2 3 c
3 4 d
In[2]: print(df2.to_string())
Id_a Id_b
0 1 2
1 4 2
2 2 1
3 3 3
4 4 4
5 2 2
How can I create a frame df_ids_to_values with the following values:
In[3]: print(df_ids_to_values.to_string())
value_a value_b
0 a b
1 d b
2 b a
3 c c
4 d d
5 b b
In other words, I would like to replace the ids in df2 with the corresponding values from df1. I have tried doing this with a for loop, but it is very slow, and I am hoping that there is a function in pandas that allows me to do this operation efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
Id_a Id_b value_a value_b
0 1 2 a b
1 4 2 d b
2 2 1 b a
3 3 3 c c
4 4 4 d d
5 2 2 b b
[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
