Dataframe merging in Pandas - python

I have two dataframes. The first (df1) contains Name, ID and PIN. The second (df2) contains Identifier, City and Country. Both dataframes are shown below.
df1 = pd.DataFrame({"Name": ["Sam", "Ajay", "Lee", "Lee Yong Dae", "Cai Yun"], "ID": ["S01", "A01", "L02", "L03", "C01"], "PIN": ["SM392", "AA09", "Lee101", "Lee201", "C101"]})
df2 = pd.DataFrame({"Identifier": ["Sam", "L02", "C101"], "City": ["Moscow", "Seoul", "Beijing"], "Country": ["Russia", "Korea", "China"]})
I want to merge the dataframes if either Name, ID or PIN matches the Identifier of df2. The expected output is:
City Country Name PIN Student ID
0 Moscow Russia Sam SM392 S01
1 0 0 Ajay AA09 A01
2 Seoul Korea Lee Lee101 L02
3 0 0 Lee Yong Dae Lee201 L03
4 Beijing China Cai Yun C101 C01

This is perhaps not the most elegant solution, but it works for me.
You have to perform three separate merges and combine the results.
The code below gives the expected output (with NaN values instead of 0 for the unmatched elements of the DataFrame).
import pandas as pd

#Initial data
df1 = pd.DataFrame({"Name": ["Sam", "Ajay", "Lee", "Lee Yong Dae", "Cai Yun"], "ID": ["S01", "A01", "L02", "L03", "C01"], "PIN": ["SM392", "AA09", "Lee101", "Lee201", "C101"]})
df2 = pd.DataFrame({"Identifier": ["Sam", "L02", "C101"], "City": ["Moscow", "Seoul", "Beijing"], "Country": ["Russia", "Korea", "China"]})

def merge_three(df1, df2):
    #Perform three separate merges, one per candidate key
    df3 = df1.merge(df2, how='outer', left_on='ID', right_on='Identifier')
    df4 = df1.merge(df2, how='outer', left_on='Name', right_on='Identifier')
    df5 = df1.merge(df2, how='outer', left_on='PIN', right_on='Identifier')
    #Copy the 2nd and 3rd merge results into df3
    df3['City_x'] = df4['City']
    df3['Country_x'] = df4['Country']
    df3['City_y'] = df5['City']
    df3['Country_y'] = df5['Country']
    #Pick the matched City and Country values, skipping the NaNs
    df6 = df3[['City', 'Country', 'Name', 'PIN', 'ID']].copy()
    df6['City'] = df3['City'].fillna(df3['City_x']).fillna(df3['City_y'])
    df6['Country'] = df3['Country'].fillna(df3['Country_x']).fillna(df3['Country_y'])
    #Remove the extra unmatched rows introduced by the outer merges
    df_final = df6[df6['Name'].notnull()]
    return df_final

df_out = merge_three(df1, df2)
Output:
df_out
City Country Name PIN ID
0 Moscow Russia Sam SM392 S01
1 NaN NaN Ajay AA09 A01
2 Seoul Korea Lee Lee101 L02
3 NaN NaN Lee Yong Dae Lee201 L03
4 Beijing China Cai Yun C101 C01
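An alternative worth noting (a sketch, not part of the answer above): melt the three candidate key columns into a single Identifier column, merge once, and keep the first non-null match per original row. GroupBy.first skips NaN by default, which does the combining for us:
# melt Name/ID/PIN into one Identifier column, one row per candidate key
m = df1.reset_index().melt(id_vars='index', value_name='Identifier')
hits = m.merge(df2, on='Identifier', how='left')
# first() returns the first non-null City/Country per original row
info = hits.groupby('index')[['City', 'Country']].first()
alt_out = df1.join(info)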

Not sure, but maybe this is what you are looking for:
a = df1.merge(df2, left_on='ID', right_on='Identifier')
b = df1.merge(df2, left_on='Name', right_on='Identifier')
c = df1.merge(df2, left_on='PIN', right_on='Identifier')
df = pd.concat([a, b, c])  # DataFrame.append was removed in pandas 2.0
df
ID Name PIN City Country Identifier
0 L02 Lee Lee101 Seoul Korea L02
0 S01 Sam SM392 Moscow Russia Sam
0 C01 Cai Yun C101 Beijing China C101
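Note that these are plain inner merges, so the unmatched rows (Ajay and Lee Yong Dae) drop out. If you also want to keep them, as in the accepted answer (a sketch under the same assumptions as above), left-join the stacked matches back onto df1:
# stack the three match results, then left-join them onto the full df1
matches = pd.concat([a, b, c]).drop(columns='Identifier')
df_final = df1.merge(matches, on=['Name', 'ID', 'PIN'], how='left')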

Related

How to create a dataframe using a list of dictionaries that also consist of lists

I have a list of dictionaries that also consist of lists and would like to create a dataframe using this list. For example, the data looks like this:
lst = [{'France': [[12548, 'ABC'], [45681, 'DFG'], [45684, 'HJK']]},
       {'USA': [[84921, 'HJK'], [28917, 'KLESA']]},
       {'Japan': [[38292, 'ASF'], [48902, 'DSJ']]}]
And this is the dataframe I'm trying to create
Country Amount Code
France 12548 ABC
France 45681 DFG
France 45684 HJK
USA 84921 HJK
USA 28917 KLESA
Japan 38292 ASF
Japan 48902 DSJ
As you can see, the keys become the values of the Country column, and the numbers and strings become the Amount and Code columns. I thought I could use something like the following, but it's not working:
df = pd.DataFrame(lst)
You probably need to transform the data into a format that Pandas can read.
Original data
data = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
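As a quick sanity check (not part of the original answer), passing the raw data straight to the constructor shows why a transform is needed: each dict becomes one row and each country one column, with the inner lists kept as opaque cell values.
print(pd.DataFrame(data))
# 3 rows x 3 columns (France, USA, Japan), mostly NaN,
# each non-NaN cell holding an entire nested list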
Transforming the data
new_data = []
for country_data in data:
    for country, values in country_data.items():
        new_data += [{"Country": country, "Amount": amt, "Code": code} for amt, code in values]
Create the dataframe
df = pd.DataFrame(new_data)
Output
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
# Build one single-column frame per country, stack the non-null values,
# then expand the [amount, code] pairs into columns
df = pd.concat([pd.DataFrame(elem) for elem in lst])
df = df.apply(lambda x: pd.Series(x.dropna().values)).stack()
df = df.reset_index(level=[0], drop=True).to_frame(name='vals')
df = pd.DataFrame(df["vals"].to_list(), index=df.index, columns=['Amount', 'Code']).sort_index()
print(df)
output:
Amount Code
France 12548 ABC
USA 84921 HJK
Japan 38292 ASF
France 45681 DFG
USA 28917 KLESA
Japan 48902 DSJ
France 45684 HJK
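The country names end up as an unnamed index here; to turn them back into a regular Country column and match the target layout (a small addition, not in the original answer):
df = df.rename_axis('Country').reset_index()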
Use a nested list comprehension to flatten the data and pass it to the DataFrame constructor:
lst = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
L = [(country, *x) for country_data in lst
     for country, values in country_data.items()
     for x in values]
df = pd.DataFrame(L, columns=['Country','Amount','Code'])
print (df)
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
Build a new dictionary that combines the individual dicts into one, before concatenating the dataframes:
new_dict = {}
for ent in lst:
    for key, value in ent.items():
        new_dict[key] = pd.DataFrame(value, columns=['Amount', 'Code'])
pd.concat(new_dict, names=['Country']).droplevel(1).reset_index()
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ

Return gender by country from my dataframe

I have a dataframe as follow:
name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1
I want to assign the gender based on the country code 'us': if a name appears in multiple rows, all of them should take the gender of that name's 'us' row. For a name that has no duplicates, we keep its row unchanged.
The returned result should be:
name code gender
wang ca 1
jay us 1
lisa us 1
I used
df.groupby(['name', 'country'])['gender'].transform()
Any suggestions on how to fix this?
# Get country and gender as lists per name
a = df.groupby('name')['country'].apply(list).reset_index(name='country_list')
b = df.groupby('name')['gender'].apply(list).reset_index(name='gender_list')
# Merge
df2 = a.merge(b, on='name', how='left')

# Using apply, prefer the 'us' entry; otherwise fall back to the first one
def get_val(x):
    cl, gl = x
    final = [cl[0], gl[0]]
    for c, g in zip(cl, gl):
        if c == 'us':
            final = [c, g]
    return final

df2['final_col'] = df2[['country_list', 'gender_list']].apply(get_val, axis=1)
df2['code'] = df2['final_col'].apply(lambda l: l[0])
df2['gender'] = df2['final_col'].apply(lambda l: l[1])
print(df2)
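To reduce df2 to the three columns of the expected output (an assumption about the final step, which the answer leaves implicit):
print(df2[['name', 'code', 'gender']])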
The approach I've used is a merge() followed by a conditional replace (np.where()).
It's a bit more sophisticated, but it will also work for conditions not in your sample data.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1"""), sep=r"\s+")

# use "us" as the basis for the lookup; left merge on name only
df2 = (df.merge(df.query("country=='us'"),
                on=["name"], how="left", suffixes=("", "_new"))
         # replace only where it's not "us" and "us" has a different value
         .assign(gender=lambda x: np.where((x["country"] != "us") &
                                           (x["gender"] != x["gender_new"]) &
                                           ~(x["gender_new"].isna()),
                                           # force type casting so it doesn't become float64 because of NaN
                                           x["gender_new"].fillna(-1).astype("int64"),
                                           x["gender"]))
         # remove the columns inserted by the merge
         .drop(columns=["country_new", "gender_new"])
)
output
name country gender
wang ca 1
jay us 1
jay ca 1
jay ca 1
lisa en 1
lisa us 1
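If you want exactly the three de-duplicated rows from the question instead (a different reading of the requirement, sketched here): sort so the 'us' rows come first within each name, then keep one row per name. The key argument of sort_values needs pandas 1.1+:
out = (df.sort_values('country', key=lambda s: s.ne('us'), kind='mergesort')  # stable sort: 'us' first
         .drop_duplicates('name')
         .sort_index()
         .reset_index(drop=True))
print(out)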

How to clean duplicate entries from the left side of a joined dataframe in pandas?

For visualization and export of a wide, left-joined dataframe in pandas, I would like to remove repeated entries from the left side.
What do I mean by this?
import pandas as pd

cities = pd.DataFrame([
    {"Name": "Peter", "City": "Boston"},
    {"Name": "Paul", "City": "Houston"},
])
emails = pd.DataFrame([
    {"Name": "Peter", "Email": "peter@company.com"},
    {"Name": "Peter", "Email": "peter@university.edu"},
    {"Name": "Paul", "Email": "paul@company.com"},
])
print(cities.merge(emails))
This prints
Name City Email
0 Peter Boston peter@company.com
1 Peter Boston peter@university.edu
2 Paul Houston paul@company.com
What I would like to print is
Name City Email
0 Peter Boston peter@company.com
1 peter@university.edu
2 Paul Houston paul@company.com
How can I achieve this, ideally during the join so I don't have to keep track of which columns are from the former left and right sides?
Use Series.duplicated across all columns and then set '' with DataFrame.mask:
df = cities.merge(emails)
df1 = df.mask(df.apply(pd.Series.duplicated), '')
print (df1)
Name City Email
0 Peter Boston peter@company.com
1 peter@university.edu
2 Paul Houston paul@company.com
Inspired by @jezreal's answer, which unfortunately determines duplicates for each column individually, I came up with this:
df = cities.merge(emails)
df.loc[df.duplicated(subset=cities.columns), df.columns.isin(cities.columns)] = ""
print(df)
I wasn't able to get df.mask working with this approach, which would be a bit more explicit, but I'm okay with df.loc:
Name City Email
0 Peter Boston peter@company.com
1 peter@university.edu
2 Paul Houston paul@company.com
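Another presentation-only option (a sketch): move the left-side columns into the index; pandas' sparse MultiIndex display, which is on by default, then hides the repeats when printing.
# repeated Name/City values are blanked automatically in the output
print(cities.merge(emails).set_index(["Name", "City"]))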

Python split one column into multiple columns and reattach the split columns into original dataframe

I want to split one column of my dataframe into multiple columns, attach those columns back to the original dataframe, and then divide the dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID': ['1', '2', '3', '4', '5', '6', '7'],
        'Residence': ['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON', 'USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
        'Name': ['Ann', 'Betty', 'Carl', 'David', 'Emily', 'Frank', 'George'],
        'Gender': ['F', 'F', 'M', 'M', 'F', 'M', 'M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID': ['1', '2', '4', '7'],
                        'Name': ['Ann', 'Betty', 'David', 'George'],
                        'Gender': ['F', 'F', 'M', 'M'],
                        'Country': ['USA', 'USA', 'USA', 'USA'],
                        'State': ['CA', 'MA', 'FL', 'AZ'],
                        'County': ['Los Angeles', 'Suffolk', 'Charlotte', 'None'],
                        'City': ['Los Angeles', 'Boston', 'None', 'None']})
nonUSAdata = pd.DataFrame({'ID': ['3', '6'],
                           'Name': ['Carl', 'Frank'],
                           'Gender': ['M', 'M'],
                           'Country': ['Canada', 'Canada'],
                           'State': ['ON', 'QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence includes USA and those whose doesn't, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I have so far, but I'm curious whether there's a cleaner/smarter way to do this.)
The index of the original data is unique and is not changed by the following code for either DataFrame, so you can use concat to join the two pieces together and then add them to the original with DataFrame.join, or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
#changed the order (rename after dropping columns) to avoid an error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
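Building on that simplified version, a sketch of the final split the question asks for (note Emily's placeholder 'NA' has to be excluded explicitly, since here it is a real string rather than NaN):
usa = df['Country'] == 'USA'
# USA rows keep all split columns; non-USA rows only Country and State
USAdata = df[usa].drop(columns='Residence')
nonUSAdata = df[~usa & (df['Country'] != 'NA')].drop(columns=['Residence', 'County', 'City'])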

Joining strings of a column over several indices while keeping other columns

Here is an example data set:
>>> df1 = pandas.DataFrame({
...     "Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
...     "City": ["Seattle", None, None, "Portland", None, None],
...     "Age": [24, None, None, 26, None, None],
...     "Group": [1, 1, 1, 2, 2, 2]})
>>> df1
Age City Group Name
0 24.0 Seattle 1 Alice
1 NaN None 1 Marie
2 NaN None 1 Smith
3 26.0 Portland 2 Mallory
4 NaN None 2 Bob
5 NaN None 2 Doe
I would like to merge the Name column for all indices of the same group while keeping the City and the Age, wanting something like:
>>> df1_summarised
Age City Group Name
0 24.0 Seattle 1 Alice Marie Smith
1 26.0 Portland 2 Mallory Bob Doe
I know those two columns (Age, City) will be NaN/None after the first index of a given group, given the structure of my starting data.
I have tried the following:
>>> print(df1.groupby('Group')['Name'].apply(' '.join))
Group
1 Alice Marie Smith
2 Mallory Bob Doe
Name: Name, dtype: object
But I would like to keep the Age and City columns...
try this:
In [29]: df1.groupby('Group').ffill().groupby(['Group','Age','City']).Name.apply(' '.join)
Out[29]:
Group Age City
1 24.0 Seattle Alice Marie Smith
2 26.0 Portland Mallory Bob Doe
Name: Name, dtype: object
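On recent pandas versions GroupBy.ffill no longer returns the grouping column, so the chain above can raise a KeyError on 'Group'. A version-safe variant (a sketch) fills within each group first and then aggregates:
# forward-fill Age and City inside each group, then join the names
filled = df1.assign(Age=df1.groupby('Group')['Age'].ffill(),
                    City=df1.groupby('Group')['City'].ffill())
out = filled.groupby(['Group', 'Age', 'City']).Name.apply(' '.join)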
Using dropna and assign with groupby (see the docs for assign):
df1.dropna(subset=['Age', 'City']) \
.assign(Name=df1.groupby('Group').Name.apply(' '.join).values)
update
use groupby and agg
I thought of this and it feels far more satisfying
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join))
to get the exact output
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join)) \
   .reset_index().reindex(columns=df1.columns)
