Merge three or more data frames with Pandas - python

The Pandas merge function merges two data frames, but I need to merge N of them, similar to an SQL statement that combines N tables in a full outer join. For example, I need to merge the three data frames below on 'type_1', 'subject_id_1', on 'type_2', 'subject_id_2' and on 'type_3', 'subject_id_3'. Is this possible?
import pandas as pd
raw_data = {
'type_1': [1, 1, 0, 0, 1],
'subject_id_1': ['1', '2', '3', '4', '5'],
'first_name_1': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'subject_id_1', 'first_name_1'])
raw_datab = {
'type_2': [1, 1, 0, 0, 0],
'subject_id_2': ['4', '5', '6', '7', '8'],
'first_name_2': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'subject_id_2', 'first_name_2'])
raw_datac = {
'type_3': [1, 1],
'subject_id_3': ['4', '5'],
'first_name_3': ['Joe', 'Paul']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'subject_id_3', 'first_name_3'])
### need to include here the third data frame
merged = pd.merge(df_a, df_b, left_on=['type_1','subject_id_1'],
right_on = ['type_2','subject_id_2'], how='outer')
print(merged)
Note: The names of the fields to join are different in each data frame.

I believe you need to join on indices created by set_index, using concat:
dfs = [df_a.set_index(['type_1','subject_id_1']),
df_b.set_index(['type_2','subject_id_2']),
df_c.set_index(['type_3','subject_id_3'])]
df = pd.concat(dfs, axis=1)
print (df)
    first_name_1 first_name_2 first_name_3
0 3        Allen          NaN          NaN
  4        Alice          NaN          NaN
  6          NaN         Bran          NaN
  7          NaN        Bryce          NaN
  8          NaN        Betty          NaN
1 1         Alex          NaN          NaN
  2          Amy          NaN          NaN
  4          NaN        Billy          Joe
  5       Ayoung        Brian         Paul
df = pd.concat(dfs, axis=1).rename_axis(('type','subject_id')).reset_index()
print (df)
type subject_id first_name_1 first_name_2 first_name_3
0 0 3 Allen NaN NaN
1 0 4 Alice NaN NaN
2 0 6 NaN Bran NaN
3 0 7 NaN Bryce NaN
4 0 8 NaN Betty NaN
5 1 1 Alex NaN NaN
6 1 2 Amy NaN NaN
7 1 4 NaN Billy Joe
8 1 5 Ayoung Brian Paul
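A different route, offered as a sketch rather than as part of the answer above, is to rename the key columns to shared names and fold pd.merge over the list with functools.reduce. The helper with_common_keys and the shared names type and subject_id are illustrative assumptions, not something from the question.

```python
from functools import reduce

import pandas as pd

df_a = pd.DataFrame({
    'type_1': [1, 1, 0, 0, 1],
    'subject_id_1': ['1', '2', '3', '4', '5'],
    'first_name_1': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']})
df_b = pd.DataFrame({
    'type_2': [1, 1, 0, 0, 0],
    'subject_id_2': ['4', '5', '6', '7', '8'],
    'first_name_2': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']})
df_c = pd.DataFrame({
    'type_3': [1, 1],
    'subject_id_3': ['4', '5'],
    'first_name_3': ['Joe', 'Paul']})


def with_common_keys(df, n):
    """Rename type_<n>/subject_id_<n> to shared key names (hypothetical helper)."""
    return df.rename(columns={f'type_{n}': 'type',
                              f'subject_id_{n}': 'subject_id'})


frames = [with_common_keys(df, n)
          for n, df in enumerate([df_a, df_b, df_c], start=1)]

# Fold a two-frame outer merge over the whole list of frames.
merged = reduce(
    lambda left, right: pd.merge(left, right,
                                 on=['type', 'subject_id'], how='outer'),
    frames)
print(merged)
```

Because the merge is applied pairwise, the same pattern should extend to any number of frames, as long as each one can be given the same key column names.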

Related

How to add data to NaN rows from another data

How can I add data from other data frames without removing the NaN values?
I have three data frames similar to these:
import numpy as np
import pandas as pd
df_main = pd.DataFrame({'ID': ['10', '11', '12', '13', '14', '15', '16'], 'Name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
ID Name
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
df2 = pd.DataFrame({'ID': ['10', '11', '12'], 'Name': [ 'Peter', 'Bruce', 'Tony']})
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
df3 = pd.DataFrame({'ID': ['15', '16'], 'Name': ['Wanda', 'Natasha']})
ID Name
0 15 Wanda
1 16 Natasha
What I want to have is data like this:
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
I tried this code but it did not work
for id in df2['ID'].unique():
    if id in df_main['ID'].unique():
        df_main.loc[df_main['ID'] == id, 'Name'] = df2.loc[df2['ID'] == id, 'Name']
for id in df3['ID'].unique():
    if id in df_main['ID'].unique():
        df_main.loc[df_main['ID'] == id, 'Name'] = df3.loc[df3['ID'] == id, 'Name']
IIUC, you can use concat with GroupBy.first:
out = pd.concat([df2, df_main, df3]).groupby("ID", as_index=False).first()
Output :
print(out)
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 None
4 14 None
5 15 Wanda
6 16 Natasha
concat df2/df3 and map the values:
df_main['Name'] = df_main['ID'].map(pd.concat([df2, df3]).set_index('ID')['Name'])
Output:
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
Another option is combine_first on an ID index:
out = (df_main.set_index("ID")
.combine_first(df2.set_index("ID"))
.combine_first(df3.set_index("ID"))
.reset_index())
print(out)
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
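The loop in the question most likely fails for df3 because assigning a Series through .loc aligns on index labels: the value selected from df3 is labelled 0 or 1, while the target rows in df_main are labelled 5 and 6. The sketch below shows the mismatch and one way around it, assigning scalars instead. Creating Name with object dtype is an assumption made here so that strings can be written into the column.

```python
import numpy as np
import pandas as pd

# Assumption: Name is created with object dtype so that strings can be
# assigned into it without a dtype change.
df_main = pd.DataFrame({
    'ID': ['10', '11', '12', '13', '14', '15', '16'],
    'Name': pd.Series([np.nan] * 7, dtype=object)})
df2 = pd.DataFrame({'ID': ['10', '11', '12'],
                    'Name': ['Peter', 'Bruce', 'Tony']})
df3 = pd.DataFrame({'ID': ['15', '16'], 'Name': ['Wanda', 'Natasha']})

# The value selected from df3 carries label 0, but the target row in
# df_main carries label 5, so label alignment finds nothing to assign.
rhs = df3.loc[df3['ID'] == '15', 'Name']
target = df_main.index[df_main['ID'] == '15']
print(rhs.index.tolist(), target.tolist())

# Assigning plain scalars sidesteps index alignment.
for source in (df2, df3):
    for id_, name in zip(source['ID'], source['Name']):
        df_main.loc[df_main['ID'] == id_, 'Name'] = name
print(df_main)
```

For anything beyond a few rows, the map and combine_first answers above avoid the Python-level loop altogether.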

Merging/Concat/Joining two dataframes

I have a pandas dataframe with a distinct code identifier, as detailed below:
df1 = pd.DataFrame([['a', 1], ['b', 2],['c', 3],['d', 4],['e', 5],['f', 5]],
columns=['code', 'value1'])
and a second dataframe with the following:
df2 = pd.DataFrame([['a', 11], ['b', 12],['c', 13],['d', 14],['e', 15],['f', 16],['g', 17], ['h', 2],['i', 3],['j', 4],['k', 5],['l', 5]],
columns=['code', 'value2'])
I would like to see only the codes present in df1 (i.e. a-f) and have a third column named value2.
I have tried
df1 = df1.join(df2, on = 'Code')
but I keep getting NaN values.
I have looked in several places and seen merge, concat and join, but none of them appear to work.
Try this:
df1 = df1.merge(df2, on = 'code')
The column is named 'code', not 'Code'.
To see only the codes present in df1 (i.e. a-f) and have a third column named value2, use the merge method with how='inner' and on='code':
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16
Use:
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16
Or, if you meant to keep the codes from both frames, use merge with how='outer':
>>> df1.merge(df2, how='outer', on='code')
code value1 value2
0 a 1.0 11
1 b 2.0 12
2 c 3.0 13
3 d 4.0 14
4 e 5.0 15
5 f 5.0 16
6 g NaN 17
7 h NaN 2
8 i NaN 3
9 j NaN 4
10 k NaN 5
11 l NaN 5
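As for why join did not behave as expected: DataFrame.join with on= matches the named column of the calling frame against the index of the other frame, not against a column of the same name. A minimal sketch of a join that does line up, assuming the df1 and df2 from the question, moves 'code' into df2's index first; a left merge is shown alongside for comparison.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5], ['f', 5]],
                   columns=['code', 'value1'])
df2 = pd.DataFrame([['a', 11], ['b', 12], ['c', 13], ['d', 14], ['e', 15],
                    ['f', 16], ['g', 17], ['h', 2], ['i', 3], ['j', 4],
                    ['k', 5], ['l', 5]],
                   columns=['code', 'value2'])

# join looks up df1['code'] in the index of the other frame,
# so 'code' has to be df2's index for the rows to match.
joined = df1.join(df2.set_index('code'), on='code')

# A left merge on the shared column keeps every row of df1 as well.
merged = df1.merge(df2, on='code', how='left')

print(joined)
```

With this data both calls keep exactly the six codes of df1; how='left' differs from how='inner' only when df1 contains a code that df2 lacks.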

Combining two dataframes and filling blanks in one with values of another (using email as index)

I have two dataframes that have the following columns : Phone, Email and Name
Dataframe1 has 20k in length, whereas dataframe2 has 1k length. I would like to fill the blanks in the Phone column in dataframe1 with the phone numbers in dataframe2 using the email as a match index between the two dataframes.
What is the best way to do this? I have tried combine_first() and merge(), but combine_first() returns the value in the same row rather than the value that matches the email address. merge() resulted in the same thing.
Am I wrong to think I need to set email as an index and then map phones to that index? I feel like this is correct but I simply do not know how to do this. Any help is appreciated! Thank you :)
Example :
In [1]:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Phone': [1, np.nan, 3, 4, 5, np.nan, 7],
'Name': ['Bob', 'Jon', 'Iris', 'Jacob','Donald','Beatrice','Jane'],
'Email': ['bob#gmail.com','jon#gmail.com','iris#gmail.com','jacob#gmail.com','donald#gmail.com','beatrice#gmail.com','jane#gmail.cm']})
df2 = pd.DataFrame({'Phone': [2, 1, 6, 5],
'Name': ['Jon', 'Bob', 'Beatrice', 'Donald'],
'Email': ['jon#gmail.com','bob#gmail.com', 'beatrice#gmail.com', 'donald#gmail.com']})
In [2]: df1
Out [2]:
Phone Name Email
1 Bob bob#gmail.com
NaN Jon jon#gmail.com
3 Iris iris#gmail.com
4 Jac jacob#gmail.com
5 Don donald#gmail.com
NaN Bea beatrice#gmail.com
7 Jane jane#gmail.com
x 20000 len
In [3]: df2
Out [3]:
Phone Name Email
2 Jon jon#gmail.com
1 Bob bob#gmail.com
6 Bea beatrice#gmail.com
5 Don donald#gmail.com
x 1100 len
What I've tried
In [4]: df3 = pd.merge(df1,df2, on="Email", how="left")
Out [4]:
Phone Name Email
1 Bob bob#gmail.com
1 Jon jon#gmail.com
3 Iris iris#gmail.com
4 Jac jacob#gmail.com
5 Don donald#gmail.com
NaN Bea beatrice#gmail.com
7 Jane jane#gmail.com
In [5]: df3 = df1.combine_first(df2)
Out [5]:
Phone Name Email
1 Bob bob#gmail.com
1 Jon jon#gmail.com
3 Iris iris#gmail.com
4 Jac jacob#gmail.com
5 Don donald#gmail.com
NaN Bea beatrice#gmail.com
7 Jane jane#gmail.com
What I would like it to look like:
In [6]: df3
Out [6]
1 Bob bob#gmail.com
2 Jon jon#gmail.com
3 Iris iris#gmail.com
4 Jac jacob#gmail.com
5 Don donald#gmail.com
6 Bea beatrice#gmail.com
7 Jane jane#gmail.com
Constructing the data frame like so:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Phone': [1, np.nan, 3, 4, 5, np.nan, 7],
'Name': ['Bob', 'Jon', 'Iris', 'Jacob','Donald','Beatrice','Jane'],
'Email': ['bob#gmail.com','jon#gmail.com','iris#gmail.com','jacob#gmail.com','donald#gmail.com','beatrice#gmail.com','jane#gmail.cm']})
df2 = pd.DataFrame({'Phone': [2, 1, 5, 6],
'Name': ['Jon', 'Bob', 'Donald', 'Beatrice'],
'Email': ['jon#gmail.com','bob#gmail.com', 'donald#gmail.com', 'beatrice#gmail.com']})
The merge gives:
>>> df1.merge(df2, on='Email', how='left')
Phone_x Name_x Email Phone_y Name_y
0 1.0 Bob bob#gmail.com 1.0 Bob
1 NaN Jon jon#gmail.com 2.0 Jon
2 3.0 Iris iris#gmail.com NaN NaN
3 4.0 Jacob jacob#gmail.com NaN NaN
4 5.0 Donald donald#gmail.com 5.0 Donald
5 NaN Beatrice beatrice#gmail.com 6.0 Beatrice
6 7.0 Jane jane#gmail.cm NaN NaN
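Then reduce the two Phone columns to one by forward-filling across columns, so that the right-most column holds a value whenever either frame has one: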
>>> df1.merge(df2, on='Email', how='left')[['Phone_x', 'Phone_y']].ffill(axis=1)
Phone_x Phone_y
0 1.0 1.0
1 NaN 2.0
2 3.0 3.0
3 4.0 4.0
4 5.0 5.0
5 NaN 6.0
6 7.0 7.0
Finally, assign the right-most column of that result back to the original data frame as its Phone column. If the output is assigned to result, that column is result.iloc[:, -1].
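The idea raised in the question, using the email as an index and mapping phones onto it, can be sketched as follows. This is a minimal example on the answer's sample frames; the name phone_by_email is an illustrative assumption, and drop_duplicates is added only because map needs a lookup with unique index labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Phone': [1, np.nan, 3, 4, 5, np.nan, 7],
    'Name': ['Bob', 'Jon', 'Iris', 'Jacob', 'Donald', 'Beatrice', 'Jane'],
    'Email': ['bob#gmail.com', 'jon#gmail.com', 'iris#gmail.com',
              'jacob#gmail.com', 'donald#gmail.com', 'beatrice#gmail.com',
              'jane#gmail.cm']})
df2 = pd.DataFrame({
    'Phone': [2, 1, 5, 6],
    'Name': ['Jon', 'Bob', 'Donald', 'Beatrice'],
    'Email': ['jon#gmail.com', 'bob#gmail.com', 'donald#gmail.com',
              'beatrice#gmail.com']})

# Email -> Phone lookup; duplicates are dropped so the index is unique.
phone_by_email = df2.drop_duplicates('Email').set_index('Email')['Phone']

# Fill only the missing phones, matching rows by email, not by position.
df1['Phone'] = df1['Phone'].fillna(df1['Email'].map(phone_by_email))
print(df1)
```

Unlike the forward-fill over the merged columns, this keeps an existing phone in df1 even when df2 holds a different number for the same email.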

Combine Dataframe rows to fill in missing data

Suppose I have a dataframe with rows containing missing data, but a set of columns acting as a key:
import pandas as pd
import numpy as np
data = {"id": [1, 1, 2, 2, 3, 3, 4, 4], "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"], "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan], "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70]}
df = pd.DataFrame.from_dict(data)
print(df)
id name height weight
0 1 John 178.0 NaN
1 1 John NaN NaN
2 2 Paul 182.0 NaN
3 2 Paul NaN 72.0
4 3 Ringo 175.0 NaN
5 3 Ringo NaN 68.0
6 4 George 188.0 NaN
7 4 George NaN 70.0
How would I go about "squashing" these rows with duplicate keys down to pick the non-nan value (if it exists)?
desired output:
id name height weight
0 1 John 178.0 NaN
2 2 Paul 182.0 72.0
4 3 Ringo 175.0 68.0
6 4 George 188.0 70.0
The index doesn't matter, and there is always at most one row with Non-NaN data. I think I need to use groupby(['id', 'name']), but I'm not sure where to go from there.
If there is always at most one non-NaN value per column in each group, you can aggregate in several ways:
df = df.groupby(['id', 'name'], as_index=False).first()
Or:
df = df.groupby(['id', 'name'], as_index=False).last()
Or:
df = df.groupby(['id', 'name'], as_index=False).mean()
Or:
df = df.groupby(['id', 'name'], as_index=False).sum(min_count=1)
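These aggregations all work here because each of them skips NaN by default, so a group with a single real value in a column reduces to that value. A short runnable sketch of the first() variant on the question's data (the name squashed is an illustrative choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "name": ["John", "John", "Paul", "Paul",
             "Ringo", "Ringo", "George", "George"],
    "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan],
    "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70]})

# first() returns the first non-null entry of each column within a group,
# so height and weight can come from different rows of the same group.
squashed = df.groupby(["id", "name"], as_index=False).first()
print(squashed)
```

John's weight stays NaN because neither of his rows has one; mean() and sum(min_count=1) would give the same frame here, but only as long as there is no more than one real value per column and group.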

Pandas join two dataframes based on relationship described in dictionary

I have two dataframes that I want to join based on a relationship described in a dictionary of lists, where the keys in the dictionary refer to ids from dfA idA column, and the items in the list are ids from dfB idB column. The dataframes and dictionary look something like this:
dfA
colA colB idA
0 a abc 3
1 b def 4
2 b ghi 5
dfB
colX idB colZ
0 bob 7 a
1 bob 7 b
2 bob 7 c
3 jim 8 d
4 jake 9 a
5 jake 9 e
myDict = { '3': [ '7', '8' ], '4': [], '5': ['7', '9'] }
How can I use myDict to join the two dataframes to produce a dataframe like the following?
dfC
  colA colB idA colX  idB colZ
0    a  abc   3  bob    7    a
1                            b
2                            c
3                jim    8    d
4    b  def   4 None None None
5    b  ghi   5  bob    7    a
6                            b
7                            c
8               jake    9    a
9                            e
You can create a linking table (DataFrame) from your dictionary. Below is a full working example. It might need some row and column sorting at the end to produce exactly your output.
import pandas as pd
import numpy as np
dfA = pd.DataFrame({'colA': ('a', 'b', 'b'),
'colB': ('abc', 'def', 'ghi'),
'idA': ('3', '4', '5')})
dfB = pd.DataFrame({'colX': ('bob', 'bob', 'bob', 'jim', 'jake', 'jake'),
'idB': ('7', '7', '7', '8', '9', '9'),
'colZ': ('a', 'b', 'c', 'd', 'a', 'e')})
myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}
dfC = pd.DataFrame(columns=['idA', 'idB'])
i = 0
for key, value in myDict.items():
    # the if statement is for empty list to create one record with NaNs
    if not value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = np.nan
        i += 1
    for val in value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = val
        i += 1
temp = dfA.merge(dfC, how='right')
result = temp.merge(dfB, how='outer')
print(result)
The output is:
colA colB idA idB colX colZ
0 a abc 3 7 bob a
1 a abc 3 7 bob b
2 a abc 3 7 bob c
3 b ghi 5 7 bob a
4 b ghi 5 7 bob b
5 b ghi 5 7 bob c
6 a abc 3 8 jim d
7 b def 4 NaN NaN NaN
8 b ghi 5 9 jake a
9 b ghi 5 9 jake e
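This is not the greatest solution, but it is fairly simple and gets the job done. The code expects dfA to already have an idAaux column that holds, for each row, the list of idB values from myDict: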
temp = pd.DataFrame(dfA.idAaux.tolist(), index = dfA.idA).stack()
temp = temp.reset_index()[['idA', 0]]
temp.columns = ['idA', 'idB']
temp2 = dfA.merge(temp, left_on='idA', right_on='idA', how='left').drop('idAaux', axis=1)
temp2['idB'] = pd.to_numeric(temp2['idB'])
res= temp2.merge(dfB, left_on='idB', right_on='idB', how='left')
Output:
colA colB idA idB colX colZ
0 a abc 3 7.0 bob a
1 a abc 3 7.0 bob b
2 a abc 3 7.0 bob c
3 a abc 3 8.0 jim d
4 b def 4 NaN NaN NaN
5 b ghi 5 7.0 bob a
6 b ghi 5 7.0 bob b
7 b ghi 5 7.0 bob c
8 b ghi 5 9.0 jake a
9 b ghi 5 9.0 jake e
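The linking table from the first answer can also be built in one step with a list comprehension instead of filling a frame row by row. This is a sketch under the same assumption as that answer, namely that all ids are strings; the name link is illustrative.

```python
import pandas as pd

dfA = pd.DataFrame({'colA': ('a', 'b', 'b'),
                    'colB': ('abc', 'def', 'ghi'),
                    'idA': ('3', '4', '5')})
dfB = pd.DataFrame({'colX': ('bob', 'bob', 'bob', 'jim', 'jake', 'jake'),
                    'idB': ('7', '7', '7', '8', '9', '9'),
                    'colZ': ('a', 'b', 'c', 'd', 'a', 'e')})
myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}

# One (idA, idB) row per dictionary entry; an empty list yields a single
# row with a missing idB so that the idA still shows up in the result.
link = pd.DataFrame(
    [(key, val) for key, vals in myDict.items() for val in (vals or [None])],
    columns=['idA', 'idB'])

# Two left merges: dfA -> link on idA, then -> dfB on idB.
result = (dfA.merge(link, on='idA', how='left')
             .merge(dfB, on='idB', how='left'))
print(result)
```

Using how='left' in both steps keeps the rows in dfA order and leaves the dfB columns empty for idA '4', whose list is empty.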
