I am trying to merge two pandas dataframes with large sets of data, but it is causing me some problems. I will try to illustrate with a smaller example.
df1 has a list of equipment and several columns relating to the equipment:
Item ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement Mixer Tim Active New York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong Kong
5 Winch Kojo Inactive Sydney
6 Circular Saw Alex Active Moscow
df2 has a list of instances where equipment has been used. It shares some columns with df1; however, some of the fields are NaN, and instances of equipment not in df1 have also been recorded:
Item ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement Mixer NaN 29/02/2020 New York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt
The resulting dataframe I was hoping to end up with was this:
Item ID Equipment Owner Status Date Location
1 Jackhammer James Active 08/09/2020 London
1 Jackhammer James Active 08/10/2020 London
2 Cement Mixer Tim Active 29/02/2020 New York
3 Drill Sarah Active 11/02/2020 Paris
3 Drill Sarah Active 30/11/2020 Paris
3 Drill Sarah Active 21/12/2020 Paris
4 Ladder Luke Inactive NaN Hong Kong
5 Winch Kojo Inactive NaN Sydney
6 Circular Saw Alex Active 19/06/2020 Moscow
7 Hammer Ken NaN 21/12/2020 Toronto
8 Sander Ezra NaN 19/06/2020 Frankfurt
Instead, with the following code I'm getting duplicated, suffixed columns, I think because of the NaN values:
data = pd.merge(df1, df2, how='outer', on=['Item ID'])
Item ID Equipment_x Equipment_y Owner_x Owner_y Status Date Location_x Location_y
1 Jackhammer Jackhammer James James Active 08/09/2020 London London
1 Jackhammer Jackhammer James James Active 08/10/2020 London London
2 Cement Mixer Cement Mixer Tim NaN Active 29/02/2020 New York New York
3 Drill Drill Sarah Sarah Active 11/02/2020 Paris NaN
3 Drill Drill Sarah Sarah Active 30/11/2020 Paris NaN
3 Drill Drill Sarah Sarah Active 21/12/2020 Paris NaN
4 Ladder NaN Luke NaN Inactive NaN Hong Kong NaN
5 Winch NaN Kojo NaN Inactive NaN Sydney NaN
6 Circular Saw Circular Saw Alex Alex Active 19/06/2020 Moscow Moscow
7 NaN Hammer NaN Ken NaN 21/12/2020 NaN Toronto
8 NaN Sander NaN Ezra NaN 19/06/2020 NaN Frankfurt
Ideally I could just drop the _y columns, however the data in the bottom rows means I would be losing important information. Instead, the only thing I can think of is merging the columns and forcing pandas to compare the values in each column, always favouring the non-NaN value. I'm not sure whether this is possible, though.
merging the columns and force pandas to compare the values in each column and always favour the non-NaN value.
Is this what you mean?
In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])
In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])
In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])
In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)
In [49]: data
Out[49]:
Item ID Equipment Status Date Location Owner
0 1 Jackhammer Active 08/09/2020 London James
1 1 Jackhammer Active 08/10/2020 London James
2 2 Cement Mixer Active 29/02/2020 New York Tim
3 3 Drill Active 11/02/2020 Paris Sarah
4 3 Drill Active 30/11/2020 Paris Sarah
5 3 Drill Active 21/12/2020 Paris Sarah
6 4 Ladder Inactive NaN Hong Kong Luke
7 5 Winch Inactive NaN Sydney Kojo
8 6 Circular Saw Active 19/06/2020 Moscow Alex
9 7 Hammer NaN 21/12/2020 Toronto Ken
10 8 Sander NaN 19/06/2020 Frankfurt Ezra
(To my knowledge) you cannot really merge on a null column. However, you can use fillna to take the value and replace it with something else if it is NaN. Not a very elegant solution, but it seems to solve your example at least.
Also see pandas combine two columns with null values
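For what it's worth, Series.combine_first collapses each fillna pair into a single call; a minimal sketch against the merged frame from In [45] (like the fillna order above, it favours the _y value):
# keep the _y value, fall back to _x where it is NaN
data['Location'] = data['Location_y'].combine_first(data['Location_x'])
data['Owner'] = data['Owner_y'].combine_first(data['Owner_x'])
data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)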
Generically you can do that as follows:
# merge the two dataframes using a suffix that ideally does
# not appear in your data
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# now remove the duplicate columns by merging their content:
# use the value of column + suffix_string if the column is empty
columns_to_remove = list()
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data.drop(columns=columns_to_remove, inplace=True)
data
The result is:
Item_ID Equipment Owner Status Location Date
0 1 Jackhammer James Active London 08/09/2020
1 1 Jackhammer James Active London 08/10/2020
2 2 Cement_Mixer Tim Active New_York 29/02/2020
3 3 Drill Sarah Active Paris 11/02/2020
4 3 Drill Sarah Active Paris 30/11/2020
5 3 Drill Sarah Active Paris 21/12/2020
6 4 Ladder Luke Inactive Hong_Kong NaN
7 5 Winch Kojo Inactive Sydney NaN
8 6 Circular_Saw Alex Active Moscow 19/06/2020
9 7 Hammer Ken NaN Toronto 21/12/2020
10 8 Sander Ezra NaN Frankfurt 19/06/2020
On the following test data:
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement_Mixer Tim Active New_York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong_Kong
5 Winch Kojo Inactive Sydney
6 Circular_Saw Alex Active Moscow"""), sep=r'\s+')
df2 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement_Mixer NaN 29/02/2020 New_York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular_Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt"""), sep=r'\s+')
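As a side note (my variant, not from the answer above), the where/isna line in the loop can be written with Series.combine_first, which keeps the df1 value and falls back to the df2 value where it is missing:
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        # keep the df1 value, fall back to the suffixed df2 value where it is NaN
        data[col] = data[col].combine_first(data[second_col])
        data = data.drop(columns=second_col)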
Related
I've been trying to merge two dataframes that look as below, one is multi-indexed while the other is not.
FIRST DATAFRAME: bd_df
outcome opp_name
Sam 3 win Roy Jones
2 win Floyd Mayweather
1 win Bernard Hopkins
James 3 win James Bond
2 win Michael O'Terry
1 win Donald Trump
Jonny 3 win Oscar De la Hoya
2 win Roberto Duran
1 loss Manny Pacquiao
Dyaus 3 win Thierry Henry
2 win David Beckham
1 loss Gabriel Jesus
SECOND DATAFRAME: bt_df
name country colour wins losses
0 Sam England red 10 0
1 Jonny China blue 9 3
2 Dyaus Arsenal white 3 8
3 James USA green 12 6
I'm aiming to merge the two dataframes so that bd_df is joined to bt_df based on the 'name' value where they match. I have also been trying to rename the axis of bd_df, with no luck; my current code and its output are below. Appreciate any help!
boxrec_tables = pd.read_csv(Path(boxrec_tables_path),index_col=[0,1]).rename_axis(['name', 'bout number'])
bt_df = pd.DataFrame(boxrec_tables)
bout_data = pd.read_csv(Path(bout_data_path))
bd_df = pd.DataFrame(bout_data)
OUTPUT
outcome opp_name name country colour wins losses
Sam 3 win Roy Jones James USA green 12 6
2 win Floyd Mayweather Dyaus Arsenal white 3 8
1 win Bernard Hopkins Jonny China blue 9 3
James 3 win James Bond James USA green 12 6
2 win Michael O'Terry Dyaus Arsenal white 3 8
1 win Donald Trump Jonny China blue 9 3
Jonny 3 win Oscar De la Hoya James USA green 12 6
2 win Roberto Duran Dyaus Arsenal white 3 8
1 loss Manny Pacquiao Jonny China blue 9 3
Dyaus 3 win Thierry Henry James USA green 12 6
2 win David Beckham Dyaus Arsenal white 3 8
1 loss Gabriel Jesus Jonny China blue 9 3
Following the suggestion by @Jezrael:
df = (bd_df.join(bt_df.set_index('opp name', drop=False)).set_index('name',append=True))
country colour wins losses outcome opp name
name
0 Sam England red 10 0 NaN NaN
1 Jonny China blue 9 3 NaN NaN
2 Dyaus Arsenal white 3 8 NaN NaN
3 James USA green 12 6 NaN NaN
The issue currently is that the merged dataframe values are showing as NaN, and the bout number values are also missing.
I think you need to merge the bout number level of the MultiIndex with the index of bt_df:
main_df = (bd_df.reset_index()
             .merge(bt_df,
                    left_on='bout number',
                    right_index=True,
                    how='left',
                    suffixes=('_', ''))
             .set_index(['name_', 'bout number'])
          )
print (main_df)
outcome opp_name name country colour wins \
name_ bout number
Sam 3 win Roy Jones James USA green 12
2 win Floyd Mayweather Dyaus Arsenal white 3
1 win Bernard Hopkins Jonny China blue 9
James 3 win James Bond James USA green 12
2 win Michael O'Terry Dyaus Arsenal white 3
1 win Donald Trump Jonny China blue 9
Jonny 3 win Oscar De la Hoya James USA green 12
2 win Roberto Duran Dyaus Arsenal white 3
1 loss Manny Pacquiao Jonny China blue 9
Dyaus 3 win Thierry Henry James USA green 12
2 win David Beckham Dyaus Arsenal white 3
1 loss Gabriel Jesus Jonny China blue 9
losses
name_ bout number
Sam 3 6
2 8
1 3
James 3 6
2 8
1 3
Jonny 3 6
2 8
1 3
Dyaus 3 6
2 8
1 3
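Side note: if the goal is instead to attach each boxer's bt_df row by name, as the question describes, DataFrame.join can align bt_df's index against the 'name' level of bd_df's MultiIndex. A sketch, assuming bd_df's index levels are indeed named 'name' and 'bout number':
# bt_df indexed by name joins against the matching 'name' level of the MultiIndex
main_df = bd_df.join(bt_df.set_index('name'))
print (main_df)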
Starting with this dataframe of train trip segments:
df = pd.DataFrame({'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
                   'Trip Id': [1, 1, 2, 3, 3],
                   'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
                   'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
                   'Passenger Count': [1, 1, 2, 4, 4]})
Name Trip Id From To Passenger Count
Susie 1 London Paris 1
Susie 1 Paris Berlin 1
Frank 2 Lyon Milan 2
Tony 3 Munich Prague 4
Tony 3 Prague Vienna 4
(Note: a trip is a number of associated segments that forms 1 travel activity, think changing trains.)
I need to expand the rows by passenger count (dropping the count column afterwards) to get a one-row-per-person-per-segment dataframe.
Each anonymous segment should list the reference passenger. Every traveler needs their own Trip Id.
The result should look like this:
Name Trip Id From To Named Passenger
Susie 1 London Paris NaN
Susie 1 Paris Berlin NaN
Frank 2 Lyon Milan NaN
NaN 4 Lyon Milan Frank
Tony 3 Munich Prague NaN
Tony 3 Prague Vienna NaN
NaN 5 Munich Prague Tony
NaN 5 Prague Vienna Tony
NaN 6 Munich Prague Tony
NaN 6 Prague Vienna Tony
NaN 7 Munich Prague Tony
NaN 7 Prague Vienna Tony
I almost achieved this, but am struggling with getting every person to have their own trip id.
I first managed to expand the passengers like this:
# First, setting the reference name for all records
df['Named Passenger'] = df.apply(lambda r: r['Name'], axis=1)
# Creating an expansion index.
new_index = df.index.repeat(df['Passenger Count'])
# Expanding the df
expanded = df.loc[new_index]
# Removing again the reference name for the original rows
expanded.loc[~new_index.duplicated(), 'Named Passenger'] = np.nan
# And removing the Name on duplicated rows (>1 personal info columns in reality)
expanded.loc[new_index.duplicated(), 'Name'] = np.nan
expanded = expanded.reset_index(drop=True)
expanded.drop(columns=['Passenger Count'], inplace=True)
expanded now looks like this:
Name Trip Id From To Named Passenger
0 Susie 1 London Paris NaN
1 Susie 1 Paris Berlin NaN
2 Frank 2 Lyon Milan NaN
3 NaN 2 Lyon Milan Frank
4 Tony 3 Munich Prague NaN
5 NaN 3 Munich Prague Tony
6 NaN 3 Munich Prague Tony
7 NaN 3 Munich Prague Tony
8 Tony 3 Prague Vienna NaN
9 NaN 3 Prague Vienna Tony
10 NaN 3 Prague Vienna Tony
11 NaN 3 Prague Vienna Tony
...but I have no idea how to correctly update the Trip Id now. (It doesn't matter what it is, as long as it's unique per passenger.)
You could combine range and explode. Does this work for you?
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Susie', 'Susie', 'Frank', 'Tony', 'Tony'],
                   'Trip Id': [1, 1, 2, 3, 3],
                   'From': ['London', 'Paris', 'Lyon', 'Munich', 'Prague'],
                   'To': ['Paris', 'Berlin', 'Milan', 'Prague', 'Vienna'],
                   'Passenger Count': [1, 1, 2, 4, 4]})
# Expand rows
df['Named Passenger'] = df['Name'].copy()
df['Passenger Id'] = df['Passenger Count'].map(lambda x: list(range(1, x+1)))
df = df.explode('Passenger Id')
df['Trip Id Unique'] = df['Trip Id'].astype(str) + "_" + df['Passenger Id'].astype(str)
# Remove names
df['Name'] = np.where(~df.index.duplicated(keep='first'), df['Name'], np.nan)
df['Named Passenger'] = np.where(df['Name'].isnull(), df['Named Passenger'], np.nan)
# Encode Trip ID with unique numerical ID
df['Trip Id Unique'] = pd.Categorical(df['Trip Id Unique'])
df['Trip Id Unique'] = df['Trip Id Unique'].cat.codes +1
# Print
df[['Name', 'Trip Id Unique', 'From', 'To', 'Named Passenger']].sort_values(by=['Trip Id Unique', 'From'])
# Name Trip Id Unique From To Named Passenger
# 0 Susie 1 London Paris NaN
# 1 Susie 1 Paris Berlin NaN
# 2 Frank 2 Lyon Milan NaN
# 2 NaN 3 Lyon Milan Frank
# 3 Tony 4 Munich Prague NaN
# 4 Tony 4 Prague Vienna NaN
# 3 NaN 5 Munich Prague Tony
# 4 NaN 5 Prague Vienna Tony
# 3 NaN 6 Munich Prague Tony
# 4 NaN 6 Prague Vienna Tony
# 3 NaN 7 Munich Prague Tony
# 4 NaN 7 Prague Vienna Tony
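A small variant (mine, not part of the answer above): groupby().ngroup() hands out the dense per-(trip, passenger) ids directly, replacing the string concatenation plus Categorical encoding:
# dense integer id per (Trip Id, Passenger Id) combination, starting at 1
df['Trip Id Unique'] = df.groupby(['Trip Id', 'Passenger Id']).ngroup() + 1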
I have df below as:
id | name | status | country | ref_id
3 Bob False Germany NaN
5 422 True USA 3
7 Nick False India NaN
6 Chris True Australia 7
8 324 True Africa 28
28 Tim False Canada 53
I want to add a new column for each row: if the status for that row is True and its ref_id exists in the id column of another row whose status is False, give me the value of name from that other row.
So expected output below would be:
id | name | status | country | ref_id | new
3 Bob False Germany NaN NaN
5 422 True USA 3 Bob
7 Nick False India NaN NaN
6 Chris True Australia 7 Nick
8 324 True Africa 28 Tim
28 Tim False Canada 53 NaN
I have code below that I am using for other purposes; it just filters for rows that have a status of True and a ref_id value that exists in the id column, like below:
df.loc[df["status"] & df["ref_id"].astype(float).isin(df.loc[~df["status"], "id"])]
But I am also trying to calculate the new column described above, holding the name value where there is one.
Thanks!
Let us try
df['new']=df.loc[df.status,'ref_id'].map(df.set_index('id')['name'])
df
id name status country ref_id new
0 3 Bob False Germany NaN NaN
1 5 422 True USA 3.0 Bob
2 7 Nick False India NaN NaN
3 6 Chris True Australia 7.0 Nick
4 8 324 True Africa 28.0 Tim
5 28 Tim False Canada 53.0 NaN
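For anyone who wants to run the answers here, a reconstruction of the example frame (my constructor, inferred from the tables in the question; the numeric-looking names are kept as strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [3, 5, 7, 6, 8, 28],
                   'name': ['Bob', '422', 'Nick', 'Chris', '324', 'Tim'],
                   'status': [False, True, False, True, True, False],
                   'country': ['Germany', 'USA', 'India', 'Australia', 'Africa', 'Canada'],
                   'ref_id': [np.nan, 3, np.nan, 7, 28, 53]})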
This is essentially a merge:
merged = (df.loc[df['status'], ['ref_id']]
            .merge(df.loc[~df['status'], ['id', 'name']],
                   left_on='ref_id', right_on='id'))
df['new'] = (df['ref_id'].map(merged.set_index('id')['name'])
             .where(df['status']))
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have, for example, 2 dataframes with users and their rating for each place, such as:
Dataframe 1:
Name Golden Gate
Adam 1
Susan 4
Mike 5
John 4
Dataframe 2:
Name Botanical Garden
Jenny 1
Susan 4
Leslie 5
John 3
I want to combine them into a single data frame with the result:
Combined Dataframe:
Name Golden Gate Botanical Garden
Adam 1 NA
Susan 4 4
Mike 5 NA
John 4 3
Jenny NA 1
Leslie NA 5
How can I do that? Thank you.
You need to perform an outer join or a concatenation along an axis:
final_df = df1.merge(df2,how='outer',on='Name')
Output:
Name Golden Gate Botanical Garden
0 Adam 1.0 NaN
1 Susan 4.0 4.0
2 Mike 5.0 NaN
3 John 4.0 3.0
4 Jenny NaN 1.0
5 Leslie NaN 5.0
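For completeness, the concatenation mentioned above would look like this (a sketch assuming Name is the alignment key); pd.concat with axis=1 outer-aligns the two frames on their index:
final_df = (pd.concat([df1.set_index('Name'), df2.set_index('Name')], axis=1)
            .reset_index())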
I found that pandas merge with how='outer' solves the problem. The link provided by @Celius Stingher is useful.
I'm trying to perform a groupby on a table where, within a given groupby index, all values are either correct or NaN, e.g.:
id country name
0 1 France None
1 1 France Pierre
2 2 None Marge
3 1 None Pierre
4 3 USA Jim
5 3 None Jim
6 2 UK None
7 4 Spain Alvaro
8 2 None Marge
9 3 None Jim
10 4 Spain None
11 3 None Jim
I just want to get the values for each of the 4 people, which should never clash, e.g.:
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
I've tried:
groupby().first()
groupby.nth(0,dropna='any'/'all')
and even
groupby().apply(lambda x: x.loc[x.first_valid_index()])
All to no avail. What am I missing?
EDIT: to help you make the example dataframe for testing:
df = pd.DataFrame({'id': [1,1,2,1,3,3,2,4,2,3,4,3],
                   'country': ['France','France',None,None,'USA',None,'UK','Spain',None,None,'Spain',None],
                   'name': [None,'Pierre','Marge','Pierre','Jim','Jim',None,'Alvaro','Marge','Jim',None,'Jim']})
Pandas groupby.first returns the first non-null value but does not support None; try:
import numpy as np

df.fillna(np.nan).groupby('id').first()
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
It is also possible to specify dropna when the values are None:
df.groupby('id').first(dropna=True)
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
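Another route that sidesteps the None-vs-NaN question entirely (a sketch of mine; it assumes every id has at least one non-missing value per column, as in the example) is to take the first non-missing entry per group with a small agg:
# Series.dropna treats both None and np.nan as missing,
# so this picks the first real value in each group and column
df.groupby('id').agg(lambda s: s.dropna().iloc[0])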