I have two dataframes. I want to change some of the values in the first one using the second.
I know how to change values one at a time, using isin and a where statement, but I don't know how to apply a large list of changes.
df1
Name Type
David Staff
Jones Pilot
Jack Pilot
Susan Steward
John Staff
Leroy Staff
Steve Staff
df2
Name Type
David Captain
Leroy Pilot
Steve Pilot
How do I change the "Type" column of df1 by using df2?
df_desired
Name Type
David Captain
Jones Pilot
Jack Pilot
Susan Steward
John Staff
Leroy Pilot
Steve Pilot
You can map the Type column of df2 onto df1 by Name, then update:
df1['Type'].update(df1['Name'].map(df2.set_index('Name')['Type']))
print(df1)
Name Type
0 David Captain
1 Jones Pilot
2 Jack Pilot
3 Susan Steward
4 John Staff
5 Leroy Pilot
6 Steve Pilot
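A variant of the same idea, offered as a sketch rather than a replacement: calling update on df1['Type'] relies on that selection being a view of df1, which may not hold when pandas copy-on-write is enabled. Assigning the mapped values back, and filling unmatched names from the existing column, avoids that dependency. The frames are rebuilt from the question's tables, and the name lookup is an illustrative helper.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["David", "Jones", "Jack", "Susan", "John", "Leroy", "Steve"],
    "Type": ["Staff", "Pilot", "Pilot", "Steward", "Staff", "Staff", "Staff"],
})
df2 = pd.DataFrame({
    "Name": ["David", "Leroy", "Steve"],
    "Type": ["Captain", "Pilot", "Pilot"],
})

# Name -> new Type lookup built from df2 (assumes names in df2 are unique).
lookup = df2.set_index("Name")["Type"]

# Names missing from df2 map to NaN, so fall back to the existing Type.
df1["Type"] = df1["Name"].map(lookup).fillna(df1["Type"])
print(df1)
```

If df2 could contain the same name twice, it would need to be de-duplicated first, since map expects a unique index on the lookup Series.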
I feel like this should be really simple but I have been stuck on this all day.
I have a dataframe that looks like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005 Yes
Robert 2005 Yes
I am grouping by Name and Date. In each group, if "Yes" appears in Graduated, then clear the Really? column. And if "Yes" doesn't appear in the group, then leave as is. The output should look like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005
Robert 2005 Yes
I keep trying different variations of mask = df.groupby(['Name','Date'])['Graduated'].isin('Yes') before doing df.loc[mask, "Really?"] = None, but I receive an AttributeError (I assume my syntax is incorrect).
Edited expected output.
Try this:
s = df['Graduated'].eq('Yes').groupby([df['Name'],df['Date']]).transform('any')
df['Really?'] = df['Really?'].mask(s,'')
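To make the two lines above easier to follow, here is a self-contained sketch. The tables in the question lost their blank cells, so the placement of values in the Graduated and Really? columns below is an assumption chosen to be consistent with the expected output. The idea is that a grouped Series has no isin method, so the comparison is done first with eq and the per-group result is broadcast back to every row with transform('any').

```python
import pandas as pd

# Assumed reconstruction of the question's data; blank cells are empty strings.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Sally", "Sally", "Sally", "Robert", "Robert"],
    "Date": [2014, 2020, 1995, 1999, 1999, 2005, 2005],
    "Graduated": ["Yes", "Yes", "", "", "", "", "Yes"],
    "Really?": ["", "", "Yes", "", "No", "Yes", ""],
})

# True on every row whose (Name, Date) group contains at least one "Yes".
s = df["Graduated"].eq("Yes").groupby([df["Name"], df["Date"]]).transform("any")

# Where s is True, replace the value with an empty string; otherwise keep it.
df["Really?"] = df["Really?"].mask(s, "")
print(df)
```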
I have the following dataframe:
Year Name Town Vehicle
2000 John NYC Truck
2000 John NYC Car
2010 Jim London Bike
2010 Jim London Car
I would like to condense this dataframe to one row per Year/ Name /Town so that my end result is:
Year Name Town Vehicle Vehicle2
2000 John NYC Truck Car
2010 Jim London Bike Car
I'm guessing it is some sort of df.groupby statement, but I'm not sure how to create the new column. Any help would be much appreciated!
Use GroupBy.cumcount for counter with reshape by Series.unstack:
g = df.groupby(['Year', 'Name','Town']).cumcount()
df1 = (df.set_index(['Year', 'Name','Town', g])['Vehicle']
         .unstack()
         .add_prefix('Vehicle')
         .reset_index())
print (df1)
Year Name Town Vehicle0 Vehicle1
0 2000 John NYC Truck Car
1 2010 Jim London Bike Car
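The answer's columns come out as Vehicle0 and Vehicle1, while the question asked for Vehicle and Vehicle2. A minimal sketch of one way to get those labels is to rename the counter columns after unstack instead of using add_prefix. The naming rule (first column unnumbered, later ones numbered from 2) is inferred from the desired output, and the name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2000, 2000, 2010, 2010],
    "Name": ["John", "John", "Jim", "Jim"],
    "Town": ["NYC", "NYC", "London", "London"],
    "Vehicle": ["Truck", "Car", "Bike", "Car"],
})

# 0, 1, 2, ... position of each row within its Year/Name/Town group.
g = df.groupby(["Year", "Name", "Town"]).cumcount()

# One column per position; the column labels are the counter values.
wide = df.set_index(["Year", "Name", "Town", g])["Vehicle"].unstack()

# Counter 0 -> "Vehicle", counter 1 -> "Vehicle2", counter 2 -> "Vehicle3", ...
wide.columns = ["Vehicle" if i == 0 else f"Vehicle{i + 1}" for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Groups with fewer vehicles than the largest group would get NaN in the unused columns.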
I want to de-dupe rows in pandas based off of multiple criteria.
I have 3 columns: name, id and nick_name.
First rule is look for duplicate id's. When id's match, only keep rows where name and nick_name are different as long as I am keeping at least one row.
In other words, if name and nick_name don't match, keep that row. If name and nick_name match, then get rid of that row, as long as it isn't the only row that would be left for that id.
Example data:
import pandas as pd
data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
        "id": [1,1,2,2,2,3,3,4],
        "nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)
df
Produces:
name id nick_name
0 Sam 1 Sammie
1 Sam 1 Sam
2 Joseph 2 Joseph
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
6 Philip 3 Philip
7 James 4 James
Based on my rules above, I want a resulting dataframe to produce the following:
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
7 James 4 James
We can split this into 3 boolean conditions to filter your initial dataframe by.
# rows whose (name, nick_name) pair repeats an earlier row; keep='first' leaves the first occurrence unflagged
con1 = df.duplicated(subset=['name','nick_name'],keep='first')
# where ids are duplicated and name is not equal to nick_name
con2 = df.duplicated(subset=['id'],keep=False) & df['name'].ne(df['nick_name'])
# where no duplicate exists.
con3 = df.groupby('id')['id'].transform('size').eq(1)
print(df.loc[con1 | con2 | con3])
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
6 Philip 3 Philip
7 James 4 James
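Note that the result above keeps index 6 for Philip, whereas the question's expected output keeps index 5. As an alternative sketch that follows the stated rule more literally (keep rows where name and nick_name differ, and if an id has no such row, keep its first row), something like the following could be used. The variable names are illustrative.

```python
import pandas as pd

data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
        "id": [1, 1, 2, 2, 2, 3, 3, 4],
        "nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)

# Rows where the nickname differs from the name are always kept.
differs = df["name"].ne(df["nick_name"])

# True on every row of an id that has at least one differing row.
group_has_diff = differs.groupby(df["id"]).transform("any")

# First row of each id, used as the fallback when nothing differs.
first_in_group = ~df.duplicated(subset="id", keep="first")

result = df[differs | (~group_has_diff & first_in_group)]
print(result)
```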
I have a table as follows in a Pandas DataFrame:
Date Name
15/12/01 John Doe
15/12/01 John Doe
15/12/01 John Doe
15/12/02 Mary Jean
15/12/02 Mary Jean
15/12/02 Mary Jean
I would like to delete all instances of John Doe/Mary Jean (or whatever name may be there) with the same date and only keep the latest one. After the operation it would look like this:
Date Name
15/12/01 John Doe
15/12/02 Mary Jean
Here the third instance of both John Doe and Mary Jean has been kept and the rest have been deleted. How could I do this in an efficient and fast way in Pandas?
Thanks!
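One way to do this, sketched under the assumption that "latest" means the last row in the frame's current order for each Date/Name pair, is drop_duplicates with keep='last'. The frame is rebuilt from the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["15/12/01", "15/12/01", "15/12/01", "15/12/02", "15/12/02", "15/12/02"],
    "Name": ["John Doe", "John Doe", "John Doe", "Mary Jean", "Mary Jean", "Mary Jean"],
})

# For each (Date, Name) pair, keep only the last occurrence.
result = df.drop_duplicates(subset=["Date", "Name"], keep="last")
print(result)
```

The surviving rows keep their original index labels; reset_index(drop=True) can be chained on if a fresh 0..n-1 index is wanted.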
Given a pandas data frame with two string columns that looks like:
d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}
df = pd.DataFrame(d)
Notice the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last school.
This line returns the SCHOOL column as an np.array instead of a string, which makes it very difficult to work further with this df.
Both problems were solved based on @IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
If you also need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA