Pandas Dataframe str.split error wrong number of items passed [duplicate] - python

This question already has answers here:
How to add multiple columns to pandas dataframe in one assignment?
(13 answers)
Closed 2 years ago.
Having trouble with a particular str.split error
My dataframe contains a number followed by text:
(Names are made up.)
print(df)
Date Entry
20/2/2019 6 John Smith
20/2/2019 8 Matt Princess
21/2/2019 4 Nick Dromos
21/2/2019 4 Adam Force
21/2/2019 5 Gary
21/2/2019 4 El Chaparro
21/2/2019 7 Mike O Malley
21/2/2019 8 Jason
22/2/2019 7 Mitchell
I am simply trying to split the Entry column into two following the number.
Code I have tried:
df['number','name'] = df['Entry'].str.split('([0-9])',n=1,expand=True)
ValueError: Wrong number of items passed 3, placement implies 1
And then I tried splitting on the space alone:
df['number','name'] = df['Entry'].str.split(" ",n=1,expand=True)
ValueError: Wrong number of items passed 2, placement implies 1
Ideally the df looks like:
print(df)
Date number name
20/2/2019 6 John Smith
20/2/2019 8 Matt Princess
21/2/2019 4 Nick Dromos
21/2/2019 4 Adam Force
21/2/2019 5 Gary
21/2/2019 4 El Chaparro
21/2/2019 7 Mike O Malley
21/2/2019 8 Jason
22/2/2019 7 Mitchell
I feel like it may be something small, but I can't seem to get it working. Any help would be great! Thanks very much.

Use double brackets [[ ]] to assign two columns at once. If you also want to remove the column from the original DataFrame, use DataFrame.pop, and finally drop the first, empty column produced by the split. The pattern [0-9]+ also matches numbers with more than one digit, like 10 or 567:
df[['number','name']] = df.pop('Entry').str.split('([0-9]+)',n=1,expand=True).drop(0, axis=1)
print (df)
Date number name
0 20/2/2019 6 John Smith
1 20/2/2019 8 Matt Princess
2 21/2/2019 4 Nick Dromos
3 21/2/2019 4 Adam Force
4 21/2/2019 5 Gary
5 21/2/2019 4 El Chaparro
6 21/2/2019 7 Mike O Malley
7 21/2/2019 8 Jason
8 22/2/2019 7 Mitchell
Solution with Series.str.extract:
df[['number','name']] = df.pop('Entry').str.extract('([0-9]+)(.*)')
#alternative
#df[['number','name']] = df.pop('Entry').str.extract(r'(\d+)(.*)')
print (df)
Date number name
0 20/2/2019 6 John Smith
1 20/2/2019 8 Matt Princess
2 21/2/2019 4 Nick Dromos
3 21/2/2019 4 Adam Force
4 21/2/2019 5 Gary
5 21/2/2019 4 El Chaparro
6 21/2/2019 7 Mike O Malley
7 21/2/2019 8 Jason
8 22/2/2019 7 Mitchell
pop both selects the column and removes it from the DataFrame, so these two snippets do the same thing:
df[['number','name']] = df.pop('Entry').str.extract(r'(\d+)(.*)')
vs
df[['number','name']] = df['Entry'].str.extract(r'(\d+)(.*)')
df = df.drop('Entry', axis=1)
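For reference, the whole transformation can be run end to end. A minimal, self-contained sketch of the extract approach on a couple of sample rows (adding \s* so the name does not keep the leading space):

```python
import pandas as pd

# Two sample rows in the question's format: "<number> <name>"
df = pd.DataFrame({
    "Date": ["20/2/2019", "21/2/2019"],
    "Entry": ["6 John Smith", "7 Mike O Malley"],
})

# Double brackets assign two columns at once; df['number','name'] would
# target a single tuple-named column, hence "placement implies 1".
df[["number", "name"]] = df.pop("Entry").str.extract(r"(\d+)\s*(.*)")

print(df)
```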


How to match between row values

serial  name    match
1       John    5,6,8
2       Steve   1,7
3       Kevin   4
4       Kevin   3
5       John    1,6,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6
Each row's match column lists the serial numbers of other rows; for each of those serials, check whether that row's name matches the current row's name. Keep the serials whose names match and remove the rest; if nothing matches, leave the value empty.
serial  name    match  updated_match
1       John    5,6,8  5,8
2       Steve   1,7
3       Kevin   4      4
4       Kevin   3      3
5       John    1,6,8  1,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6  1,5
Convert the values of the serial column to strings, then map per-name aggregated sets of serials back to a Series the same size as the original. Split the match column, then intersect each row's match list with the set for its name (minus the row's own serial). Finally sort and join the results back into strings:
s = df['serial'].astype(str)
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print (df)
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
You can use mappings to determine which match serials refer to rows with the same name:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode the individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# get the name of each original row, aligned with the exploded serials
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only match with matching names
# aggregate back as string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
                       .groupby(level=0).agg(','.join)
                       .reindex(df.index, fill_value='')
                       )
output:
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
Alternative with a groupby.apply:
df['updated_match'] = (df.groupby('name', group_keys=False)
                       .apply(lambda g: g['match'].str.split(',').explode()
                              .loc[lambda x: x.isin(g['serial'])]
                              .groupby(level=0).agg(','.join))
                       .reindex(df.index, fill_value='')
                       )
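Both answers can be checked end to end. A runnable sketch of the set-based solution, building the sample frame first:

```python
import pandas as pd

df = pd.DataFrame({
    "serial": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["John", "Steve", "Kevin", "Kevin",
             "John", "Johnn", "Steves", "John"],
    "match": ["5,6,8", "1,7", "4", "3", "1,6,8", "1,5,8", "2", "1,5,6"],
})

s = df["serial"].astype(str)
# for each name, the set of serials that carry it
sets = df["name"].map(s.groupby(df["name"]).agg(set))
match = df["match"].str.split(",")
# intersect each match list with the same-name set, minus the own serial
df["updated_match"] = [",".join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print(df)
```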

Combining Multiple Dataframes with Unique Name [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have, for example, 2 data frames with users and their ratings for each place, such as:
Dataframe 1:
Name Golden Gate
Adam 1
Susan 4
Mike 5
John 4
Dataframe 2:
Name Botanical Garden
Jenny 1
Susan 4
Leslie 5
John 3
I want to combine them into a single data frame with the result:
Combined Dataframe:
Name Golden Gate Botanical Garden
Adam 1 NA
Susan 4 4
Mike 5 NA
John 4 3
Jenny NA 1
Leslie NA 5
How to do that?
Thank you.
You need to perform an outer join or a concatenation along an axis:
final_df = df1.merge(df2,how='outer',on='Name')
Output:
Name Golden Gate Botanical Garden
0 Adam 1.0 NaN
1 Susan 4.0 4.0
2 Mike 5.0 NaN
3 John 4.0 3.0
4 Jenny NaN 1.0
5 Leslie NaN 5.0
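Both routes mentioned above can be sketched side by side; the concat variant joins on a Name index (a sketch, assuming the sample frames from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Adam", "Susan", "Mike", "John"],
                    "Golden Gate": [1, 4, 5, 4]})
df2 = pd.DataFrame({"Name": ["Jenny", "Susan", "Leslie", "John"],
                    "Botanical Garden": [1, 4, 5, 3]})

# outer join keeps every Name from both frames, filling gaps with NaN
merged = df1.merge(df2, how="outer", on="Name")

# concatenation along the column axis does the same once Name is the index
concatenated = pd.concat(
    [df1.set_index("Name"), df2.set_index("Name")], axis=1
).reset_index()
```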
I found that pandas merge with how='outer' solves the problem. The link provided by @Celius Stingher is useful.

How do I drop all rows after last occurrence of a value?

I have a dataframe with a string column and I would like to drop all rows after the last occurrence of a name.
first_name
Andy
Josh
Mark
Tim
Alex
Andy
Josh
Mark
Tim
Alex
Andy
Josh
Mark
What I would like is to drop rows after Alex occurs for the last time, so drop the rows with Andy, Josh and Mark.
I figured out how to drop the rows before the first occurrence with df = df[(df.first_name == 'Alex').idxmax():], but I don't know how to drop the rows after the last one.
Thanks!
argmax
df.iloc[:len(df) - (df.first_name.to_numpy() == 'Alex')[::-1].argmax()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
last_valid_index
df.loc[:df.where(df == 'Alex').last_valid_index()]
Option 3
df.loc[:df.first_name.eq('Alex')[::-1].idxmax()]
Option 4
df.iloc[:np.flatnonzero(df.first_name.eq('Alex')).max() + 1]
Option 5
This is silly!
df[np.logical_or.accumulate(df.first_name.eq('Alex')[::-1])[::-1]]
mask and bfill
df[df['first_name'].mask(df['first_name'] != 'Alex').bfill().notna()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
cumsum and idxmax
df.loc[:(df['first_name'] == 'Alex').cumsum().idxmax()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
cumsum and max
Using fill_value=False keeps the first row, which a bare shift() would turn into NaN and exclude:
u = (df['first_name'] == 'Alex').shift(fill_value=False).cumsum()
df[u < u.max()]
first_name
0 Andy
1 Josh
2 Mark
3 Tim
4 Alex
5 Andy
6 Josh
7 Mark
8 Tim
9 Alex
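A runnable check of Option 3 against the sample data (names taken from the question):

```python
import pandas as pd

names = ["Andy", "Josh", "Mark", "Tim", "Alex",
         "Andy", "Josh", "Mark", "Tim", "Alex",
         "Andy", "Josh", "Mark"]
df = pd.DataFrame({"first_name": names})

# reverse the boolean mask so idxmax finds the label of the LAST 'Alex',
# then slice up to and including it with loc
res = df.loc[:df["first_name"].eq("Alex")[::-1].idxmax()]
print(res)
```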

How to strip the string and replace the existing elements in DataFrame

I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated column:
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name:
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update in place instead of producing a copy:
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Make an array consisting of the column names you want, then call:
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
Use .apply() to apply a function to each element in a series:
df['Site Name'] = df['Site Name'].apply(lambda x: x.split('_')[-1])
You can use exactly what you wanted (the strip method):
>>> df["Site"] = df.Site.str.strip("Site_")
Note that str.strip treats its argument as a set of characters to remove from both ends, not as a literal prefix; it works here only because the numbers contain none of the characters in "Site_".
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
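Several of the answers above are interchangeable on this data; a small sketch comparing them, then casting the result to integers as the desired output suggests:

```python
import pandas as pd

df = pd.DataFrame({"Site": ["Site_1", "Site_2", "Site_11", "Site_34"],
                   "Name": ["Tom", "Tom", "Marrie", "Jacob"]})

a = df["Site"].str.replace("Site_", "", regex=False)  # literal removal
b = df["Site"].str.split("_", n=1).str[1]             # split once on "_"
c = df["Site"].str.extract(r"\D+(\d+)", expand=False)  # regex capture

# all three agree on this data; cast to int if numeric sites are wanted
assert a.tolist() == b.tolist() == c.tolist() == ["1", "2", "11", "34"]
df["Site"] = a.astype(int)
```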

Pandas intersection of groups

Hi, I'm trying to find the unique Players who show up in every Team.
df =
Team Player Number
A Joe 8
A Mike 10
A Steve 11
B Henry 9
B Steve 19
B Joe 4
C Mike 18
C Joe 6
C Steve 18
C Dan 1
C Henry 3
and the result should be:
result =
Team Player Number
A Joe 8
A Steve 11
B Joe 4
B Steve 19
C Joe 6
C Steve 18
since Joe and Steve are the only Players who appear in every Team.
You can use a GroupBy.transform to get a count of unique teams that each player is a member of, and compare this to the overall count of unique teams. This will give you a Boolean array, which you can use to filter your DataFrame:
df = df[df.groupby('Player')['Team'].transform('nunique') == df['Team'].nunique()]
The resulting output:
Team Player Number
0 A Joe 8
2 A Steve 11
4 B Steve 19
5 B Joe 4
7 C Joe 6
8 C Steve 18
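A self-contained sketch of the transform approach, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Team":   ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "Player": ["Joe", "Mike", "Steve", "Henry", "Steve", "Joe",
               "Mike", "Joe", "Steve", "Dan", "Henry"],
    "Number": [8, 10, 11, 9, 19, 4, 18, 6, 18, 1, 3],
})

# a player survives the filter only if their distinct-team count
# equals the total number of distinct teams
result = df[df.groupby("Player")["Team"].transform("nunique")
            == df["Team"].nunique()]
print(result)
```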
