Pandas - Data transformation of column using no delimiters - python

I have a pandas dataframe which consists of players' names and statistics from a sporting match. The only source of data lists them in the following format:
# PLAYER M FG 3PT FT REB AST STL PTS
34 BLAKE Brad 38 17 5 6 3 0 3 0 24
12 JONES Ben 42 10 2 6 1 0 4 1 12
8 SMITH Todd J. 16 9 1 4 1 0 3 2 18
5 MAY-DOUGLAS James 9 9 0 3 1 0 2 1 6
44 EDLIN Taylor 12 6 0 5 1 0 0 1 8
The players' names are in reverse order: Surname Firstname. I need to transform the names into the correct order of Firstname Lastname. So, specifically:
BLAKE Brad -> Brad BLAKE
SMITH Todd J. -> Todd J. SMITH
MAY-DOUGLAS James -> James MAY-DOUGLAS
The case of the letters does not matter, but I thought it could potentially be used to differentiate the first name from the last name. I know all last names will always be in uppercase, even if they include a hyphen. The first name will always be capitalized (first letter uppercase and the rest lowercase). However, some names include a middle initial to differentiate players with the same name. I see how a space character can be used as a delimiter, potentially with a "split" transformation, but it gets difficult with the middle name.
Are there any suggestions for a pandas function I can use to achieve this?
The desired output is:
# PLAYER M FG 3PT FT REB AST STL PTS
34 Brad BLAKE 38 17 5 6 3 0 3 0 24
12 Ben JONES 42 10 2 6 1 0 4 1 12
8 Todd J. SMITH 16 9 1 4 1 0 3 2 18
5 James MAY-DOUGLAS 9 9 0 3 1 0 2 1 6
44 Taylor EDLIN 12 6 0 5 1 0 0 1 8

Try splitting on the first whitespace, then reversing the list and joining the list values with whitespace.
df['PLAYER'] = df['PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
To reverse only certain names, you can use isin and then boolean indexing:
names = ['BLAKE Brad', 'SMITH Todd J.', 'MAY-DOUGLAS James']
mask = df['PLAYER'].isin(names)
df.loc[mask, 'PLAYER'] = df.loc[mask, 'PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
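For reference, here is a minimal, self-contained sketch of the split, reverse, and join idea applied to the three sample names from the question. The small frame is built by hand purely for illustration, and n=1 is passed by keyword because recent pandas releases expect it that way.

```python
import pandas as pd

# Hand-built sample using names from the question (illustrative only).
df = pd.DataFrame({"PLAYER": ["BLAKE Brad", "SMITH Todd J.", "MAY-DOUGLAS James"]})

# Split once, on the first space only: "SMITH Todd J." -> ["SMITH", "Todd J."]
parts = df["PLAYER"].str.split(" ", n=1)

# Reverse each two-item list and join it back with a single space.
df["PLAYER"] = parts.str[::-1].str.join(" ")

print(df["PLAYER"].tolist())
```

This assumes the surname never contains a space. A surname written as two words would need a different rule, for example one based on the uppercase convention described in the question.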

Related

How to match between row values

serial  name    match
1       John    5,6,8
2       Steve   1,7
3       Kevin   4
4       Kevin   3
5       John    1,6,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6
I need to check the name in each row against the names of the rows whose serial numbers are listed in its match column. Keep the serials whose names match and remove the others. If nothing matches, put null.
serial  name    match   updated_match
1       John    5,6,8   5,8
2       Steve   1,7
3       Kevin   4       4
4       Kevin   3       3
5       John    1,6,8   1,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6   1,5
Convert the values of the serial column to strings, then map the per-name sets of serials onto a Series of the same length as the original. Split the match column, remove each row's own serial from its set, intersect the result with the split values, then sort and join back into strings:
s = df['serial'].astype(str)
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print (df)
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
You can use mappings to determine which matches have the same name:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode the individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# get exploded names
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only match with matching names
# aggregate back as string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
                       .groupby(level=0).agg(','.join)
                       .reindex(df.index, fill_value='')
                       )
output:
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
Alternative with a groupby.apply:
df['updated_match'] = (df.groupby('name', group_keys=False)
                         .apply(lambda g: g['match'].str.split(',').explode()
                                           .loc[lambda x: x.isin(g['serial'])]
                                           .groupby(level=0).agg(','.join))
                         .reindex(df.index, fill_value='')
                       )
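As a cross-check on the answers above, the same rule can be written as a plain dictionary lookup. This is only a sketch: the frame is rebuilt by hand from the question's table, and the helper name keep_same_name is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "serial": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["John", "Steve", "Kevin", "Kevin", "John", "Johnn", "Steves", "John"],
    "match": ["5,6,8", "1,7", "4", "3", "1,6,8", "1,5,8", "2", "1,5,6"],
})

# Lookup table: serial (as a string) -> name.
serial_to_name = dict(zip(df["serial"].astype(str), df["name"]))

def keep_same_name(name, match):
    """Keep only the serials in `match` whose row has the same name."""
    kept = [m for m in match.split(",") if serial_to_name.get(m) == name]
    return ",".join(kept)

df["updated_match"] = [keep_same_name(n, m) for n, m in zip(df["name"], df["match"])]
print(df)
```

Rows with no matching serial end up with an empty string, as in the outputs shown above; replace it with None or NaN afterwards if a real null is required.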

Assign values (1 to N) for similar rows in a dataframe Pandas [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
I have a dataframe df:
Name  Place   Price
Bob   NY      15
Jack  London  27
John  Paris   5
Bill  Sydney  3
Bob   NY      39
Jack  London  9
Bob   NY      2
Dave  NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name  Place   Price  Value
Bob   NY      15     1
Jack  London  27     1
John  Paris   5      1
Bill  Sydney  3      1
Bob   NY      39     2
Jack  London  9      2
Bob   NY      2      3
Dave  NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
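A self-contained version of the grouped cumulative count, with the frame rebuilt by hand from the question's table so it can be run as is:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Bob", "Jack", "John", "Bill", "Bob", "Jack", "Bob", "Dave"],
    "Place": ["NY", "London", "Paris", "Sydney", "NY", "London", "NY", "NY"],
    "Price": [15, 27, 5, 3, 39, 9, 2, 7],
})

# cumcount numbers the rows of each (Name, Place) group starting at 0,
# in their original order, so add 1 to start the counter at 1.
df["Value"] = df.groupby(["Name", "Place"]).cumcount() + 1
print(df)
```

No sorting is needed: the counter follows the order in which the rows already appear, and the original row order is preserved.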

How to merge pandas dataframe to a dataframe with more columns, while filling in the empty columns with corresponding values? [duplicate]

This question already has answers here:
Pandas Merging 101
I have two dataframes, df1 with (for example) 5 rows and 3 columns, and df2 with 5 rows but only 2 columns, as such:
df1 = pd.DataFrame({"First_Name":['Mark','John','Adam','Mark','Adam'], "Purchased":[20,12,13,40,23], "Last_Name":['S.','M.','C.','S.','C.']})
First_Name Purchased Last_Name
0 Mark 20 S.
1 John 12 M.
2 Adam 13 C.
3 Mark 40 S.
4 Adam 23 C.
df2 = pd.DataFrame({"First_Name":['Jane','Mark','Mark','Adam','Jane'], "Purchased":[3,16,17,10,23]})
First_Name Purchased
0 Jane 3
1 Mark 16
2 Mark 17
3 Adam 10
4 Jane 23
I want to append the rows from df2 to df1, while also creating values for the third column (in this example, "Last_Name") based on the values from df1.
For example, I want the output to be:
First_Name Purchased Last_Name
0 Mark 20 S.
1 John 12 M.
2 Adam 13 C.
3 Mark 40 S.
4 Adam 23 C.
5 Jane 3 nan
6 Mark 16 S.
7 Mark 17 S.
8 Adam 10 C.
9 Jane 23 nan
Is there any way to do all these functions simply? Thanks!
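This should do the trick for names that appear in df1:
final = pd.concat([df1, df2]).reset_index()
# sort in descending order (matching the output below) so equal names are adjacent, then forward-fill
final = final.sort_values('First_Name', ascending=False).ffill()
which gives the following. Note that Jane, who does not appear in df1, inherits "M." from the row above instead of staying NaN: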
index First_Name Purchased Last_Name
0 0 Mark 20 S.
3 3 Mark 40 S.
6 1 Mark 16 S.
7 2 Mark 17 S.
1 1 John 12 M.
5 0 Jane 3 M.
9 4 Jane 23 M.
2 2 Adam 13 C.
4 4 Adam 23 C.
8 3 Adam 10 C.
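If Jane should stay NaN, as in the desired output, one possible alternative is to look up the last name by first name before concatenating. This is a sketch that assumes each first name in df1 maps to exactly one last name; the variable names last_names and df2_filled are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"First_Name": ["Mark", "John", "Adam", "Mark", "Adam"],
                    "Purchased": [20, 12, 13, 40, 23],
                    "Last_Name": ["S.", "M.", "C.", "S.", "C."]})
df2 = pd.DataFrame({"First_Name": ["Jane", "Mark", "Mark", "Adam", "Jane"],
                    "Purchased": [3, 16, 17, 10, 23]})

# Lookup Series: First_Name -> Last_Name, one entry per first name.
last_names = (df1.drop_duplicates(subset="First_Name")
                 .set_index("First_Name")["Last_Name"])

# Names missing from the lookup (Jane) map to NaN.
df2_filled = df2.assign(Last_Name=df2["First_Name"].map(last_names))

# Stack the two frames and renumber the rows 0..9.
final = pd.concat([df1, df2_filled], ignore_index=True)
print(final)
```

Because the lookup is keyed on the name rather than on row position, no sorting or forward-filling is involved and the original row order is kept.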

How to strip the string and replace the existing elements in DataFrame

I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated column
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update instead of producing a copy
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Make an array consisting of the column names you want, then call:
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
Use .apply() to apply a function to each element in a series:
df['Site'] = df['Site'].apply(lambda x: x.split('_')[-1])
You can use exactly what you wanted (the strip method). Note that strip removes any of the characters S, i, t, e and _ from both ends rather than the literal prefix, which works here because the remaining values are digits:
>>> df["Site"] = df.Site.str.strip("Site_")
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
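To compare two of the approaches above side by side, here is a small runnable sketch. The frame is rebuilt by hand from the question, and the final conversion to integers is an optional extra step that the question did not ask for.

```python
import pandas as pd

# Sample data copied from the question, including the gap in the index.
df = pd.DataFrame(
    {"Site": ["Site_1", "Site_2", "Site_4", "Site_8",
              "Site_11", "Site_12", "Site_21", "Site_34"],
     "Name": ["Tom", "Tom", "Jack", "Rose", "Marrie", "Marrie", "Jacob", "Jacob"]},
    index=[0, 1, 2, 3, 5, 6, 7, 8],
)

# Approach 1: pull out the run of digits with a regular expression.
by_extract = df["Site"].str.extract(r"(\d+)", expand=False)

# Approach 2: remove the literal prefix (regex=False treats it as plain text).
by_replace = df["Site"].str.replace("Site_", "", regex=False)

# Both results are still strings; convert if numeric site IDs are wanted.
df["Site"] = by_extract.astype(int)
print(df)
```

The two approaches agree on this data. They would differ if a value lacked the "Site_" prefix or contained no digits, in which case extract returns NaN while replace leaves the value unchanged.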

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
# I did this just to create an empty data frame with the columns I want
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index()
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis=1, inplace=True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works, the problem is (apart from being very unpythonic), for large tables (my actual table has about 50 thousand rows) it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If Order needs to be the first column, use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
             .assign(Order=lambda d: d.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
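If the rows should stay in their original order, a grouped rank is one possible alternative to sorting first. The sketch below uses the question's data and checks that it agrees with the sort_values plus cumcount approach; rank(method="first") breaks ties by order of appearance.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Kate", "John", "Peter", "Kate", "John", "Peter"],
    "Distance": [23, 16, 32, 15, 31, 26],
    "Time": [3, 5, 2, 7, 9, 4],
})

# Rank each Time within its Name group (1, 2, ...), then shift to start at 0.
df["Order"] = df.groupby("Name")["Time"].rank(method="first").astype(int) - 1

# Same result via the sort-then-count approach, for comparison.
by_sort = df.sort_values(["Name", "Time"])
by_sort_order = by_sort.groupby("Name").cumcount()

print(df)
```

Both variants are vectorized, so they avoid the per-group Python loop from the question.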
