Combining Multiple Dataframes with Unique Name [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have, for example, 2 data frames with users and their ratings for each place, such as:
Dataframe 1:
Name Golden Gate
Adam 1
Susan 4
Mike 5
John 4
Dataframe 2:
Name Botanical Garden
Jenny 1
Susan 4
Leslie 5
John 3
I want to combine them into a single data frame with the result:
Combined Dataframe:
Name Golden Gate Botanical Garden
Adam 1 NA
Susan 4 4
Mike 5 NA
John 4 3
Jenny NA 1
Leslie NA 5
How can I do that?
Thank you.

You need to perform an outer join or a concatenation along an axis:
final_df = df1.merge(df2, how='outer', on='Name')
Output:
Name Golden Gate Botanical Garden
0 Adam 1.0 NaN
1 Susan 4.0 4.0
2 Mike 5.0 NaN
3 John 4.0 3.0
4 Jenny NaN 1.0
5 Leslie NaN 5.0
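Both approaches the answer mentions can be sketched as follows (a small self-contained example with data rebuilt from the question; either way every name from both frames is kept):

```python
import pandas as pd

# Data rebuilt from the question
df1 = pd.DataFrame({"Name": ["Adam", "Susan", "Mike", "John"],
                    "Golden Gate": [1, 4, 5, 4]})
df2 = pd.DataFrame({"Name": ["Jenny", "Susan", "Leslie", "John"],
                    "Botanical Garden": [1, 4, 5, 3]})

# Option 1: outer merge on the shared key
merged = df1.merge(df2, how="outer", on="Name")

# Option 2: concatenate along the column axis after indexing by Name
concatenated = pd.concat([df1.set_index("Name"), df2.set_index("Name")],
                         axis=1).reset_index()
```

Both results contain all six names, with NaN where a person did not rate a place.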

I found that pandas merge with how='outer' solves the problem. The link provided by @Celius Stingher is useful.

Related

Pandas: how to compare two columns in different sheets and return the matched value

I have two dataframes with multiple columns.
I would like to compare df1['id'] and df2['id'] and return a new df with another column that holds the matched value.
example:
df1
id Name
1 1 Paul
2 2 Jean
3 3 Alicia
4 4 Jennifer
df2
id Name
1 1 Paul
2 6 Jean
3 3 Alicia
4 7 Jennifer
output
id Name correct_id
1 1 Paul 1
2 2 Jean N/A
3 3 Alicia 3
4 4 Jennifer N/A
Note: the lengths of the two columns I want to match are not the same.
Try:
df1["correct_id"] = (df1["id"].isin(df2["id"]) * df1["id"]).replace(0, "N/A")
print(df1)
Prints:
id Name correct_id
0 1 Paul 1
1 2 Jean N/A
2 3 Alicia 3
3 4 Jennifer N/A
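One caveat with the isin-times-id trick is that it would misreport a genuine id of 0 as "N/A" (because 0 times anything is 0). A sketch of an alternative using Series.where avoids that edge case (data rebuilt from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})
df2 = pd.DataFrame({"id": [1, 6, 3, 7],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})

# Keep the id where it also appears in df2, otherwise mark it "N/A"
df1["correct_id"] = df1["id"].where(df1["id"].isin(df2["id"]), "N/A")
```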

How do you merge dataframes in pandas with different shapes?

I am trying to merge two dataframes in pandas with large sets of data; however, it is causing me some problems. I will try to illustrate with a smaller example.
df1 has a list of equipment and several columns relating to the equipment:
Item ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement Mixer Tim Active New York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong Kong
5 Winch Kojo Inactive Sydney
6 Circular Saw Alex Active Moscow
df2 has a list of instances where equipment has been used. This has some similar columns to df1, however some of the fields are NaN values and also instances of equipment not in df1 have also been recorded:
Item ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement Mixer NaN 29/02/2020 New York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt
The resulting dataframe I was hoping to end up with was this:
Item ID Equipment Owner Status Date Location
1 Jackhammer James Active 08/09/2020 London
1 Jackhammer James Active 08/10/2020 London
2 Cement Mixer Tim Active 29/02/2020 New York
3 Drill Sarah Active 11/02/2020 Paris
3 Drill Sarah Active 30/11/2020 Paris
3 Drill Sarah Active 21/12/2020 Paris
4 Ladder Luke Inactive NaN Hong Kong
5 Winch Kojo Inactive NaN Sydney
6 Circular Saw Alex Active 19/06/2020 Moscow
7 Hammer Ken NaN 21/12/2020 Toronto
8 Sander Ezra NaN 19/06/2020 Frankfurt
Instead, with the following code I'm getting duplicated columns, I think because of the NaN values:
data = pd.merge(df1, df2, how='outer', on=['Item ID'])
Item ID Equipment_x Equipment_y Owner_x Owner_y Status Date Location_x Location_y
1 Jackhammer NaN James James Active 08/09/2020 London London
1 Jackhammer NaN James James Active 08/10/2020 London London
2 Cement Mixer NaN Tim NaN Active 29/02/2020 New York New York
3 Drill NaN Sarah Sarah Active 11/02/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 30/11/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 21/12/2020 Paris NaN
4 Ladder NaN Luke NaN Inactive NaN Hong Kong Hong Kong
5 Winch NaN Kojo NaN Inactive NaN Sydney Sydney
6 Circular Saw NaN Alex NaN Active 19/06/2020 Moscow Moscow
7 NaN Hammer NaN Ken NaN 21/12/2020 NaN Toronto
8 NaN Sander NaN Ezra NaN 19/06/2020 NaN Frankfurt
Ideally I could just drop the _y columns; however, the data in the bottom rows means I would be losing important information. Instead, the only thing I can think of is merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value. I'm not sure if this is possible or not, though?
merging the columns and force pandas to compare the values in each column and always favour the non-NaN value.
Is this what you mean?
In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])
In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])
In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])
In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)
In [49]: data
Out[49]:
Item ID Equipment Status Date Location Owner
0 1 Jackhammer Active 08/09/2020 London James
1 1 Jackhammer Active 08/10/2020 London James
2 2 Cement Mixer Active 29/02/2020 New York Tim
3 3 Drill Active 11/02/2020 Paris Sarah
4 3 Drill Active 30/11/2020 Paris Sarah
5 3 Drill Active 21/12/2020 Paris Sarah
6 4 Ladder Inactive NaN Hong Kong Luke
7 5 Winch Inactive NaN Sydney Kojo
8 6 Circular Saw Active 19/06/2020 Moscow Alex
9 7 Hammer NaN 21/12/2020 Toronto Ken
10 8 Sander NaN 19/06/2020 Frankfurt Ezra
(To my knowledge) you cannot really merge on a null column. However, you can use fillna to take the value and replace it with something else if it is NaN. Not a very elegant solution, but it seems to solve your example at least.
Also see pandas combine two columns with null values
Generically you can do that as follows:
# merge the two dataframes using a suffix that ideally does
# not appear in your data
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# now remove the duplicate columns by merging their content:
# use the value of column + suffix_string if the column is empty
columns_to_remove = list()
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data.drop(columns=columns_to_remove, inplace=True)
data
The result is:
Item_ID Equipment Owner Status Location Date
0 1 Jackhammer James Active London 08/09/2020
1 1 Jackhammer James Active London 08/10/2020
2 2 Cement_Mixer Tim Active New_York 29/02/2020
3 3 Drill Sarah Active Paris 11/02/2020
4 3 Drill Sarah Active Paris 30/11/2020
5 3 Drill Sarah Active Paris 21/12/2020
6 4 Ladder Luke Inactive Hong_Kong NaN
7 5 Winch Kojo Inactive Sydney NaN
8 6 Circular_Saw Alex Active Moscow 19/06/2020
9 7 Hammer Ken NaN Toronto 21/12/2020
10 8 Sander Ezra NaN Frankfurt 19/06/2020
On the following test data (io and pandas need to be imported first):
import io
import pandas as pd

df1= pd.read_csv(io.StringIO("""Item_ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement_Mixer Tim Active New_York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong_Kong
5 Winch Kojo Inactive Sydney
6 Circular_Saw Alex Active Moscow"""), sep=r'\s+')
df2= pd.read_csv(io.StringIO("""Item_ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement_Mixer NaN 29/02/2020 New_York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular_Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt"""), sep=r'\s+')
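A shorter variant of the same idea (a sketch, not part of the original answer, shown here on a trimmed two-row version of the sample data) resolves each duplicated column pair with combine_first, which takes the left value and falls back to the right one where the left is NaN:

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement_Mixer Tim Active New_York"""), sep=r"\s+")
df2 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
2 Cement_Mixer NaN 29/02/2020 New_York"""), sep=r"\s+")

# Merge on the key only; df1's columns keep their names, df2's get a suffix
data = pd.merge(df1, df2, how="outer", on="Item_ID", suffixes=("", "_DF2"))
shared = [c for c in df1.columns if f"{c}_DF2" in data.columns]
for col in shared:
    # Prefer df1's value; fall back to df2's where df1 has NaN
    data[col] = data[col].combine_first(data[f"{col}_DF2"])
data = data.drop(columns=[f"{c}_DF2" for c in shared])
```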

Pandas Dataframe str.split error wrong number of items passed [duplicate]

This question already has answers here:
How to add multiple columns to pandas dataframe in one assignment?
(13 answers)
Closed 2 years ago.
Having trouble with a particular str.split error
My dataframe contains a number followed by text:
(Names are made up.)
print(df)
Date Entry
20/2/2019 6 John Smith
20/2/2019 8 Matt Princess
21/2/2019 4 Nick Dromos
21/2/2019 4 Adam Force
21/2/2019 5 Gary
21/2/2019 4 El Chaparro
21/2/2019 7 Mike O Malley
21/2/2019 8 Jason
22/2/2019 7 Mitchell
I am simply trying to split the Entry column into two following the number.
Code I have tried:
df['number','name'] = df['Entry'].str.split('([0-9])',n=1,expand=True)
ValueError: Wrong number of items passed 3, placement implies 1
And then I tried splitting on the space alone:
df['number','name'] = df['Entry'].str.split(" ",n=1,expand=True)
ValueError: Wrong number of items passed 2, placement implies 1
Ideally the df looks like:
print(df)
Date number name
20/2/2019 6 John Smith
20/2/2019 8 Matt Princess
21/2/2019 4 Nick Dromos
21/2/2019 4 Adam Force
21/2/2019 5 Gary
21/2/2019 4 El Chaparro
21/2/2019 7 Mike O Malley
21/2/2019 8 Jason
22/2/2019 7 Mitchell
I feel like it may be something small, but I can't seem to get it working. Any help would be great! Thanks very much.
Add double [] to assign to two columns and, if you also want to remove the column from the original dataframe, use DataFrame.pop. Last, remove the first, empty column produced by the split with drop. [0-9]+ is used instead of [0-9] so that numbers longer than one digit, like 10 or 567, are matched as a whole:
df[['number','name']] = df.pop('Entry').str.split('([0-9]+)',n=1,expand=True).drop(0, axis=1)
print (df)
Date number name
0 20/2/2019 6 John Smith
1 20/2/2019 8 Matt Princess
2 21/2/2019 4 Nick Dromos
3 21/2/2019 4 Adam Force
4 21/2/2019 5 Gary
5 21/2/2019 4 El Chaparro
6 21/2/2019 7 Mike O Malley
7 21/2/2019 8 Jason
8 22/2/2019 7 Mitchell
Solution with Series.str.extract:
df[['number','name']] = df.pop('Entry').str.extract('([0-9]+)(.*)')
#alternative
#df[['number','name']] = df.pop('Entry').str.extract('(\d+)(.*)')
print (df)
Date number name
0 20/2/2019 6 John Smith
1 20/2/2019 8 Matt Princess
2 21/2/2019 4 Nick Dromos
3 21/2/2019 4 Adam Force
4 21/2/2019 5 Gary
5 21/2/2019 4 El Chaparro
6 21/2/2019 7 Mike O Malley
7 21/2/2019 8 Jason
8 22/2/2019 7 Mitchell
The pop function is there to avoid having to remove the column after selecting it, so these two pieces of code work the same:
df[['number','name']] = df.pop('Entry').str.extract('(\d+)(.*)')
vs
df[['number','name']] = df['Entry'].str.extract('(\d+)(.*)')
df = df.drop('Entry', axis=1)
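A related variant of the extract solution (a sketch with made-up rows from the question's data): named capture groups in str.extract produce the column names directly, so no separate renaming or double-bracket assignment is needed:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["20/2/2019", "21/2/2019"],
                   "Entry": ["6 John Smith", "7 Mike O Malley"]})

# Named groups become the resulting column names; \s* drops the separator space
parts = df.pop("Entry").str.extract(r"(?P<number>\d+)\s*(?P<name>.*)")
df = df.join(parts)
```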

Python - How to fill string value with the modal value for the group

I have a dataset like the below. I want to populate the missing text with the value that is most common for the group. I have tried using ffill, but this doesn't help the ones that are blank at the start, and bfill similarly fails for the end. How can I do this?
Group Name
1 Annie
2 NaN
3 NaN
4 David
1 NaN
2 Bertha
3 Chris
4 NaN
Desired Output:
Group Name
1 Annie
2 Bertha
3 Chris
4 David
1 Annie
2 Bertha
3 Chris
4 David
Using collections.Counter to create a modal mapping by group:
from collections import Counter
s = df.dropna(subset=['Name'])\
      .groupby('Group')['Name']\
      .apply(lambda x: Counter(x).most_common()[0][0])
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
You can use value_counts and head:
s = df.groupby('Group')['Name'].apply(lambda x: x.value_counts().head(1)).reset_index(-1)['level_1']
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Output:
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
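The same group-wise modal fill can also be written with GroupBy.transform and Series.mode, which avoids building the intermediate mapping. A sketch, assuming every group has at least one non-missing name (data rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 2, 3, 4, 1, 2, 3, 4],
    "Name": ["Annie", None, None, "David", None, "Bertha", "Chris", None],
})

# For each group, fill missing names with that group's most frequent name
# (Series.mode ignores NaN by default)
df["Name"] = df.groupby("Group")["Name"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)
```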

Pandas Fillna with the MAX value_counts of each group

There are two columns in the DataFrame named "country" & "taster_name". Column "taster_name" has some missing values in it. I want to fillna the missing values with the MAX VALUE_COUNTS of the taster_name of each country (depending on which country the missing value belongs to). I don't know how to do it.
From the code below, we can check the MAX VALUE_COUNTS of the taster_name of each country.
wine[['country','taster_name']].groupby('country').taster_name.value_counts()
Try this (note the result needs to be assigned back to the column):
df['teaser_name'] = df.groupby('country')['teaser_name'].apply(lambda x: x.fillna(x.value_counts().index[0]))
As you didn't provide sample data, I created some myself.
Sample Input:
country teaser_name
0 A abraham
1 B silva
2 A abraham
3 A NaN
4 B NaN
5 C john
6 C NaN
7 C john
8 C jacob
9 A NaN
10 B silva
11 A william
Output:
country teaser_name
0 A abraham
1 B silva
2 A abraham
3 A abraham
4 B silva
5 C john
6 C john
7 C john
8 C jacob
9 A abraham
10 B silva
11 A william
Explanation:
Group by country and fill the NaN values using value_counts. value_counts sorts in descending order by default, so you can take its first element and fill the NaN values with it.