Pandas: Merge many-to-one - python

Let's say I have 2 data frames:
df1:
Name  Age
Pete  19
John  30
Max   24

df2:
Name  Subject  Grade
Pete  Math     90
Pete  History  100
John  English  90
Max   History  90
Max   Math     80
I want to merge df2 into df1, many-to-one, to end up with something like this:

Name  Age  Subject  Grade
Pete  19   Math     90
Pete  19   History  100
John  30   English  90
Max   24   History  90
Max   24   Math     80

I don't want to group by Subject and Grade; the Age values should be duplicated so that every row is kept.

You can simply use pd.merge as follows:
import pandas as pd

if __name__ == '__main__':
    df1 = pd.DataFrame({"Name": ["Pete", "John", "Max"],
                        "Age": [19, 30, 24]})
    df2 = pd.DataFrame({"Name": ["Pete", "Pete", "John", "Max", "Max"],
                        "Subject": ["Math", "History", "English", "History", "Math"],
                        "Grade": [90, 100, 90, 90, 80]})
    df3 = pd.merge(df1, df2, how="right", on="Name")
    print(df1)
    print(df2)
    print(df3)
Result:
   Name  Age
0  Pete   19
1  John   30
2   Max   24

   Name  Subject  Grade
0  Pete     Math     90
1  Pete  History    100
2  John  English     90
3   Max  History     90
4   Max     Math     80

   Name  Age  Subject  Grade
0  Pete   19     Math     90
1  Pete   19  History    100
2  John   30  English     90
3   Max   24  History     90
4   Max   24     Math     80
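A side note, not part of the original answer: pd.merge can also assert the expected relationship via its validate argument, so an accidental duplicate name in df1 fails loudly instead of silently multiplying rows:

# raises pandas.errors.MergeError if a Name appears more than once in df1
df3 = pd.merge(df1, df2, how="right", on="Name", validate="one_to_many")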

Related

How to replace a row value in a pandas dataframe after a desired number is achieved?

Here's a simple piece of code, similar to what I am doing. I'm trying to replace the value in the row after a 1 with -1. In my case, though, how would I do it if I don't know where the 1s are, in a dataframe with thousands of rows?
import pandas as pd
df = pd.DataFrame({'Name':['Craig', 'Davis', 'Anthony', 'Tony'], 'Age':[22, 27, 24, 33], 'Employed':[0, 1, 0, 0]})
df
I have this:

Name     Age  Employed
Craig    22   0
Davis    27   1
Anthony  24   0
Tony     33   0
I want something like this, but applied across thousands of rows:

Name     Age  Employed
Craig    22   0
Davis    27   1
Anthony  24   -1
Tony     33   0
Use shift to select the rows that come right after a 1:
df.loc[df['Employed'].shift() == 1, 'Employed'] = -1
print(df)
# Output
Name Age Employed
0 Craig 22 0
1 Davis 27 1
2 Anthony 24 -1
3 Tony 33 0
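An alternative sketch, not from the original answer: the same logic can be written with numpy.where, which leaves every other value unchanged:

import numpy as np

# set the row immediately after each 1 to -1, keep everything else as-is
df['Employed'] = np.where(df['Employed'].shift() == 1, -1, df['Employed'])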

Pandas dataframe sorting string values and by descending aggregated values

I'm working on transforming a dataframe to show the top 3 earners.
The dataframe looks like this
data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie','Evelyn'], 'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)
print(df)
Name Sale
0 Allistair 20
1 Bob 21
2 Carrie 19
3 Diane 18
4 Allistair 5
5 Bob 300
6 Carrie 35
7 Evelyn 22
In my actual dataset I have several more columns and rows. I want to end up with something like:
Name Sale
0 Bob 321
1 Carrie 35
2 Allistair 25
Every approach I've found so far doesn't quite get there, because I keep running into:
'Name' is both an index level and a column label, which is ambiguous.
Use groupby:
>>> df.groupby('Name').sum().sort_values('Sale', ascending=False)
Sale
Name
Bob 321
Carrie 54
Allistair 25
Evelyn 22
Diane 18
Thanks to @Andrej Kasely above, this also works:
df.groupby("Name")["Sale"].sum().nlargest(3)

Add a column from an existing dataframe into another between every other column

I'll try my best to explain this as I had trouble phrasing the title. I have two dataframes. What I would like to do is add a column from df1 into df2 between every other column.
For example, df1 looks like this :
Age City
0 34 Sydney
1 30 Toronto
2 31 Mumbai
3 32 Richmond
And after adding in df2 it looks like this:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
In terms of code, I wasn't quite sure where to even start.
'''Concatenating the dataframes'''
for i in range(len(df2)):
    pos = i + 1
    df3 = df2.insert
    # df2 = pd.concat([df1, df2], axis=1).sort_index(axis=1)
    # df2.columns = np.arange(len(df2.columns))
    # print(df2)
I was originally going to run it through a loop, but I wasn't quite sure how to do it. Any help would be appreciated!
You can use itertools.zip_longest. For example:
from itertools import zip_longest

new_columns = [
    v
    for v in (c for a in zip_longest(df2.columns, df1.columns) for c in a)
    if v is not None
]
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
Prints:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
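For reference, a self-contained setup you could use to reproduce this; the values are simply read off the frames printed above:

import pandas as pd

df1 = pd.DataFrame({'Age': [34, 30, 31, 32],
                    'City': ['Sydney', 'Toronto', 'Mumbai', 'Richmond']})
df2 = pd.DataFrame({'Name': ['Ali', 'Lori', 'Asher', 'Lylah'],
                    'Clicks': [10, 20, 45, 33],
                    'Country': ['Australia', 'Canada', 'United States', 'United States']})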

Overlay of two dataframes using Python

Problem:
I have two dataframes, 'infm' and 'ufl'. In 'ufl', the Age and Salary columns for Ben and Creg have been updated. I want to update the corresponding rows in 'infm' as well.
Approach taken:
I iterate through each row of 'infm' and use the 'Name' column to match the two dataframes. When the names match, I update the Age column of 'infm' with the value from 'ufl'.
Input:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 25 30000 y
Creg 23 22000 x
Dawood 25 30000 w
Update on two rows of Input:
NAME AGE SALARY COUNTRY
Ben 36 90000 y
Creg 34 92000 x
Expected Output:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 36 90000 y
Creg 34 92000 x
Dawood 25 30000 w
Actual output:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 25 30000 y
Creg 23 22000 x
Dawood 25 30000 w
Code used:
import pandas as pd

infm = pd.read_excel('D:/data/test.xls')
ufl = pd.read_excel('D:/data/test1.xls')

for row in infm.iterrows():
    a = row[1]['Name']
    b = ufl['Name'].unique().tolist()
    for i in b:
        if i == a:
            row[1]['Age'] = (ufl['Age'][ufl['Name'] == a]).tolist()[0]
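No answer was included with this question. As a hedged note on why the loop does nothing: iterrows() yields copies of each row, so assigning to row[1]['Age'] never writes back into infm. One common alternative, sketched here under the assumption that the key column is named 'Name' in both files as in the code above, is DataFrame.update with the name as the index:

# align both frames on the name column, then overlay the non-NaN values from ufl
infm = infm.set_index('Name')
infm.update(ufl.set_index('Name'))
infm = infm.reset_index()
print(infm)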

How to split single column of pandas dataframe into multiple columns with group?

I am new to Python pandas. I have a dataframe like the one below:
df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
print(df)
Name age
0 football 25
1 ramesh 22
2 suresh 21
3 pankaj 32
4 cricket 37
5 rakesh 26
6 mohit 24
7 mahesh 30
The "Name" column contains both sport names and sport person names. I want to split it into two different columns, like below:
Expected Output:

sports_name  sport_person_name  age
football     ramesh             25
             suresh             22
             pankaj             32
cricket      rakesh             26
             mohit              24
             mahesh             30
If I group by the "Name" column I don't get the expected output, which is the obvious, straightforward result because there are no duplicates in the "Name" column. What should I use to get the expected output?
Edit: if you don't want to hardcode the sports names:

import numpy as np

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['', '22', '21', '32', '', '26', '24', '30']})
df = df.replace('', np.nan, regex=True)
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name': 'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name', 'sport_person_name', 'age']]
print(df)

I just checked which rows contain NaN values in every column other than "Name"; those rows are definitely the sports names. I created a list of those sports names and then used the solutions below to create the sports_name and sport_person_name columns.
You can use:
#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
Similar solution with DataFrame.insert - then reorder is not necessary:
#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
If you want the sport name to appear only once per group, add limit=1 to ffill and replace the remaining NaNs with an empty string:
sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 suresh 21
2 pankaj 32
3 cricket rakesh 26
4 mohit 24
5 mahesh 30
The output you want is really a dictionary, not a dataframe. The dictionary would look like this:
{'Sport': {'Player': age, 'Player2': age}}
If you really want a dataframe, and the sport name always comes before its players:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket',
                            'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

# map every player to the sport row that precedes them
wanted_dict = {}
current_sport = ''
for val in df['Name']:
    if val in sports:
        current_sport = val
    else:
        wanted_dict[val] = current_sport
# Now you have {player_name: sport_name, ...}

# mark the sport rows with a placeholder, fill the player rows from the mapping
df['sports_name'] = 'sport'
for val in df['Name']:
    if val not in sports:
        df['sports_name'] = np.where(df['Name'] == val,
                                     wanted_dict[val], df['sports_name'])
df = df[df['sports_name'] != 'sport'].rename(columns={'Name': 'sport_person_name'})
df = df[['sports_name', 'sport_person_name', 'age']]
What it should look like:

sports_name  sport_person_name  age
football     ramesh             22
football     suresh             21
football     pankaj             32
cricket      rakesh             26
cricket      mohit              24
cricket      mahesh             30
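If you do want the nested dictionary form mentioned at the top of this answer, a minimal sketch, starting again from the question's original dataframe (called df_orig here to avoid clobbering the frame built above; the ages stay strings because that is how they are defined in the question):

df_orig = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket',
                                 'rakesh', 'mohit', 'mahesh'],
                        'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
# forward-fill the sport onto the player rows, then drop the sport rows themselves
df_orig['sports_name'] = df_orig['Name'].where(df_orig['Name'].isin(sports)).ffill()
players = df_orig[~df_orig['Name'].isin(sports)]
wanted = {sport: dict(zip(g['Name'], g['age']))
          for sport, g in players.groupby('sports_name')}
# {'cricket': {'rakesh': '26', 'mohit': '24', 'mahesh': '30'},
#  'football': {'ramesh': '22', 'suresh': '21', 'pankaj': '32'}}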
