serial  name    match
1       John    5,6,8
2       Steve   1,7
3       Kevin   4
4       Kevin   3
5       John    1,6,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6
I need to compare the name of each row with the names of the rows whose serial numbers are listed in its match column. Keep the serials whose names match and remove the others. If nothing matches, put null.
serial  name    match  updated_match
1       John    5,6,8  5,8
2       Steve   1,7
3       Kevin   4      4
4       Kevin   3      3
5       John    1,6,8  1,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6  1,5
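Convert the values of the serial column to strings, then map the per-name sets of serials back to a Series of the same size as the original. Split the match column and, for each row, take the set of serials sharing its name, drop the row's own serial with difference, and intersect with the split values. Finally, sort and join back to strings: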
s = df['serial'].astype(str)
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print(df)
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
You can use a mapping to determine which matches have the same name:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode the individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# get exploded names
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only match with matching names
# aggregate back as string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
                       .groupby(level=0).agg(','.join)
                       .reindex(df.index, fill_value='')
                      )
output:
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
Alternative with a groupby.apply:
df['updated_match'] = (df.groupby('name', group_keys=False)
                         .apply(lambda g: g['match'].str.split(',').explode()
                                           .loc[lambda x: x.isin(g['serial'])]
                                           .groupby(level=0).agg(','.join))
                         .reindex(df.index, fill_value='')
                      )
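To cross-check any of the approaches above on the sample data, a plain-Python variant can be handy. This is only a sketch: the names name_of and keep_same_name are made up for illustration, it is likely slower than the vectorised versions on large frames, and it keeps the serials in their original order instead of sorting them.

```python
import pandas as pd

df = pd.DataFrame({
    'serial': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['John', 'Steve', 'Kevin', 'Kevin', 'John', 'Johnn', 'Steves', 'John'],
    'match': ['5,6,8', '1,7', '4', '3', '1,6,8', '1,5,8', '2', '1,5,6'],
})

# serial (as a string) -> name lookup; name_of is a hypothetical helper name
name_of = dict(zip(df['serial'].astype(str), df['name']))

def keep_same_name(row):
    # keep only the listed serials whose row carries the same name as this row
    kept = [m for m in row['match'].split(',') if name_of.get(m) == row['name']]
    return ','.join(kept)

df['updated_match'] = df.apply(keep_same_name, axis=1)
print(df)
```

Rows with no matching serial end up with an empty string, as in the outputs shown above.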
For example, I have two data frames with users and their ratings for each place:
Dataframe 1:
Name Golden Gate
Adam 1
Susan 4
Mike 5
John 4
Dataframe 2:
Name Botanical Garden
Jenny 1
Susan 4
Leslie 5
John 3
I want to combine them into a single data frame with the result:
Combined Dataframe:
Name Golden Gate Botanical Garden
Adam 1 NA
Susan 4 4
Mike 5 NA
John 4 3
Jenny NA 1
Leslie NA 5
How to do that?
Thank you.
You need to perform an outer join or a concatenation along an axis:
final_df = df1.merge(df2, how='outer', on='Name')
Output:
Name Golden Gate Botanical Garden
0 Adam 1.0 NaN
1 Susan 4.0 4.0
2 Mike 5.0 NaN
3 John 4.0 3.0
4 Jenny NaN 1.0
5 Leslie NaN 5.0
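A self-contained sketch of both options mentioned above, assuming the two frames are built as in the question and that Name is unique within each frame. The indicator=True argument is an optional extra that records which frame each row came from. The variable names combined and via_concat are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam', 'Susan', 'Mike', 'John'],
                    'Golden Gate': [1, 4, 5, 4]})
df2 = pd.DataFrame({'Name': ['Jenny', 'Susan', 'Leslie', 'John'],
                    'Botanical Garden': [1, 4, 5, 3]})

# Option 1: outer join on Name; the _merge column shows the origin of each row
combined = df1.merge(df2, how='outer', on='Name', indicator=True)

# Option 2: concatenate along the columns, aligning on Name as the index
via_concat = pd.concat([df1.set_index('Name'), df2.set_index('Name')], axis=1)

print(combined)
print(via_concat)
```

Ratings that are missing on one side become NaN, which is why the integer columns are shown as floats in the merged result.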
I found that pandas merge with how='outer' solves the problem. The link provided by @Celius Stingher is useful.
I have a dataframe with a string column and I would like to drop all rows after the last occurrence of a name.
first_name
Andy
Josh
Mark
Tim
Alex
Andy
Josh
Mark
Tim
Alex
Andy
Josh
Mark
What I would like is to drop rows after Alex occurs for the last time, so drop the rows with Andy, Josh and Mark.
I figured out how to drop the rows before the first occurrence with df = df[(df.first_name == 'Alex').idxmax():], but I don't know how to drop the trailing rows.
Thanks!
argmax
df.iloc[:len(df) - (df.first_name.to_numpy() == 'Alex')[::-1].argmax()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
last_valid_index
df.loc[:df.where(df == 'Alex').last_valid_index()]
Option 3
df.loc[:df.first_name.eq('Alex')[::-1].idxmax()]
Option 4
df.iloc[:np.flatnonzero(df.first_name.eq('Alex')).max() + 1]
Option 5
This is silly!
df[np.logical_or.accumulate(df.first_name.eq('Alex')[::-1])[::-1]]
mask and bfill
df[df['first_name'].mask(df['first_name'] != 'Alex').bfill().notna()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
cumsum and idxmax
df.loc[:(df['first_name'] == 'Alex').cumsum().idxmax()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
cumsum and max
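# fill_value=False keeps row 0; a plain shift() leaves NaN there, which drops that row
u = (df['first_name'] == 'Alex').shift(fill_value=False).cumsum()
df[u < u.max()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex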
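The one-liners above do not all behave the same way when the name never occurs in the column, so it may be worth wrapping the logic in a small function that makes that case explicit. The sketch below is one possible version; the function name keep_through_last and the choice to return the frame unchanged when there is no match are assumptions, not part of the original answers.

```python
import pandas as pd

def keep_through_last(df, column, value):
    """Return the rows up to and including the last row where df[column] == value."""
    # positions (not labels) of every matching row
    hits = (df[column] == value).to_numpy().nonzero()[0]
    if hits.size == 0:
        # assumption: with no match at all, leave the frame as it is
        return df
    return df.iloc[:hits[-1] + 1]

df = pd.DataFrame({'first_name': ['Andy', 'Josh', 'Mark', 'Tim', 'Alex'] * 2
                                 + ['Andy', 'Josh', 'Mark']})

trimmed = keep_through_last(df, 'first_name', 'Alex')
print(trimmed)
```

Because it works with positions through iloc, this does not depend on the index being a default RangeIndex.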
I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated column
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update instead of producing a copy
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
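The extracted values are still strings. If actual numbers are wanted, one option is to pass the extracted digits through pd.to_numeric. This is a sketch that rebuilds the question's frame and assumes every Site value contains at least one run of digits; the variable name digits is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Site': ['Site_1', 'Site_2', 'Site_4', 'Site_8',
              'Site_11', 'Site_12', 'Site_21', 'Site_34'],
     'Name': ['Tom', 'Tom', 'Jack', 'Rose', 'Marrie', 'Marrie', 'Jacob', 'Jacob']},
    index=pd.Index([0, 1, 2, 3, 5, 6, 7, 8], name='Index'))

# pull out the first run of digits as strings, then convert to a numeric dtype
digits = df['Site'].str.extract(r'(\d+)', expand=False)
df['Site'] = pd.to_numeric(digits)
print(df.dtypes)
```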
Make an array consisting of the names you want, then call:
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
Use .apply() to apply a function to each element in a series (the column is named 'Site' in the frame above):
df['Site'] = df['Site'].apply(lambda x: x.split('_')[-1])
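You can use the strip method you mentioned. Note that strip treats its argument as a set of characters to remove from both ends, not as a prefix; it works here because the part to keep consists only of digits.
>>> df["Site"] = df.Site.str.strip("Site_")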
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
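A short sketch of how str.strip differs from removing a prefix. The value 'Site_tie3' is made up for illustration and does not come from the question, and the anchored regex is just one way to remove only a leading "Site_".

```python
import pandas as pd

# the last value is a hypothetical one, added to show where the two methods diverge
s = pd.Series(['Site_1', 'Site_21', 'Site_tie3'])

# strip removes any of the characters S, i, t, e, _ from both ends
stripped = s.str.strip('Site_')

# an anchored regex removes only the literal leading "Site_"
prefix_removed = s.str.replace(r'^Site_', '', regex=True)

print(stripped.tolist())
print(prefix_removed.tolist())
```

On the question's data the two give the same result, since everything after the prefix is a digit.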
Hi, I'm trying to find the players who show up in every Team.
df =
Team Player Number
A Joe 8
A Mike 10
A Steve 11
B Henry 9
B Steve 19
B Joe 4
C Mike 18
C Joe 6
C Steve 18
C Dan 1
C Henry 3
and the result should be:
result =
Team Player Number
A Joe 8
A Steve 11
B Joe 4
B Steve 19
C Joe 6
C Steve 18
since Joe and Steve are the only players who appear in every Team.
You can use a GroupBy.transform to get a count of unique teams that each player is a member of, and compare this to the overall count of unique teams. This will give you a Boolean array, which you can use to filter your DataFrame:
df = df[df.groupby('Player')['Team'].transform('nunique') == df['Team'].nunique()]
The resulting output:
Team Player Number
0 A Joe 8
2 A Steve 11
4 B Steve 19
5 B Joe 4
7 C Joe 6
8 C Steve 18
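For reference, here is the same filter as a self-contained sketch with the intermediate steps named. The frame is rebuilt from the question, and teams_per_player and n_teams are illustrative names. The final sort_values is optional and only there to reproduce the row order of the expected result in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': list('AAABBBCCCCC'),
    'Player': ['Joe', 'Mike', 'Steve', 'Henry', 'Steve', 'Joe',
               'Mike', 'Joe', 'Steve', 'Dan', 'Henry'],
    'Number': [8, 10, 11, 9, 19, 4, 18, 6, 18, 1, 3],
})

# for every row, the number of distinct teams that row's player appears in
teams_per_player = df.groupby('Player')['Team'].transform('nunique')

# total number of distinct teams in the frame
n_teams = df['Team'].nunique()

# keep players present in all teams, then order rows by team and player
result = df[teams_per_player == n_teams].sort_values(['Team', 'Player'])
print(result)
```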