Pandas Dataframe, aggregate and put the data in the next column - python

I have a dataset like this:
Country Name Match Result
US Martin Win 3
US Martin Lose 1
US Martin Draw 5
UK Luther Win 5
UK Luther Draw 3
I'd like to add two more columns: the sum of Result over Win, Lose and Draw for each player, and each match's share of that sum, like this:
Country Name Match Result All Percentage
US Martin Win 3 8 0.375
US Martin Lose 1 8 0.125
US Martin Draw 5 8 0.625
UK Luther Win 6 10 0.6
UK Luther Draw 4 10 0.4
I've already tried using groupby and got the total per group. However, I don't know how to put it in a new column next to the original rows.
Thank you

IIUC, you need GroupBy.transform, which returns one value per original row instead of one per group (note that the sample DataFrame was changed):
df['All'] = df.groupby(['Country','Name'])['Result'].transform('sum')
df['Percentage'] = df.Result.div(df.All)
print (df)
Country Name Match Result All Percentage
0 US Martin Win 2 8 0.250
1 US Martin Lose 1 8 0.125
2 US Martin Draw 5 8 0.625
3 UK Luther Win 6 10 0.600
4 UK Luther Draw 4 10 0.400
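For reference, the answer above can be run end to end as the following minimal sketch. The DataFrame is rebuilt by hand from the changed sample shown in the printed output, so the construction step is an assumption about how the data was loaded.

```python
import pandas as pd

# Sample data copied from the printed output above (the changed sample).
df = pd.DataFrame({
    "Country": ["US", "US", "US", "UK", "UK"],
    "Name": ["Martin", "Martin", "Martin", "Luther", "Luther"],
    "Match": ["Win", "Lose", "Draw", "Win", "Draw"],
    "Result": [2, 1, 5, 6, 4],
})

# transform('sum') broadcasts each group's total back onto every row of the group.
df["All"] = df.groupby(["Country", "Name"])["Result"].transform("sum")

# Share of each match type within the player's total.
df["Percentage"] = df["Result"] / df["All"]

print(df)
```

Unlike `groupby(...).sum()`, which returns one row per group, `transform` keeps the original index, so the result can be assigned directly as a new column.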

Related

How to merge two dataframes, where one is multi-indexed, with different headers

I've been trying to merge two dataframes that look as below, one is multi-indexed while the other is not.
FIRST DATAFRAME: bd_df
outcome opp_name
Sam 3 win Roy Jones
2 win Floyd Mayweather
1 win Bernard Hopkins
James 3 win James Bond
2 win Michael O'Terry
1 win Donald Trump
Jonny 3 win Oscar De la Hoya
2 win Roberto Duran
1 loss Manny Pacquiao
Dyaus 3 win Thierry Henry
2 win David Beckham
1 loss Gabriel Jesus
SECOND DATAFRAME: bt_df
name country colour wins losses
0 Sam England red 10 0
1 Jonny China blue 9 3
2 Dyaus Arsenal white 3 8
3 James USA green 12 6
I'm aiming to merge the two dataframes such that bd_df is joined to bt_df based on the 'name' value where they match. I also have been trying to rename the axis of bd_df with no luck - code is also below.
My code is as below currently, with the output. Appreciate any help!
boxrec_tables = pd.read_csv(Path(boxrec_tables_path),index_col=[0,1]).rename_axis(['name', 'bout number'])
bt_df = pd.DataFrame(boxrec_tables)
bout_data = pd.read_csv(Path(bout_data_path))
bd_df = pd.DataFrame(bout_data)
OUTPUT
outcome opp_name name country colour wins losses
Sam 3 win Roy Jones James USA green 12 6
2 win Floyd Mayweather Dyaus Arsenal white 3 8
1 win Bernard Hopkins Jonny China blue 9 3
James 3 win James Bond James USA green 12 6
2 win Michael O'Terry Dyaus Arsenal white 3 8
1 win Donald Trump Jonny China blue 9 3
Jonny 3 win Oscar De la Hoya James USA green 12 6
2 win Roberto Duran Dyaus Arsenal white 3 8
1 loss Manny Pacquiao Jonny China blue 9 3
Dyaus 3 win Thierry Henry James USA green 12 6
2 win David Beckham Dyaus Arsenal white 3 8
1 loss Gabriel Jesus Jonny China blue 9 3
Following the suggestion by @Jezrael:
df = (bd_df.join(bt_df.set_index('opp name', drop=False)).set_index('name',append=True))
country colour wins losses outcome opp name
name
0 Sam England red 10 0 NaN NaN
1 Jonny China blue 9 3 NaN NaN
2 Dyaus Arsenal white 3 8 NaN NaN
3 James USA green 12 6 NaN NaN
The issue currently is that the merged dataframe values are showing as NaN, and the bout number values are also missing.
I think you need to merge on the bout number level of the MultiIndex against the index of bt_df:
main_df = (bd_df.reset_index()
.merge(bt_df,
left_on='bout number',
right_index=True,
how='left',
suffixes=('_',''))
.set_index(['name_', 'bout number'])
)
print (main_df)
outcome opp_name name country colour wins \
name_ bout number
Sam 3 win Roy Jones James USA green 12
2 win Floyd Mayweather Dyaus Arsenal white 3
1 win Bernard Hopkins Jonny China blue 9
James 3 win James Bond James USA green 12
2 win Michael O'Terry Dyaus Arsenal white 3
1 win Donald Trump Jonny China blue 9
Jonny 3 win Oscar De la Hoya James USA green 12
2 win Roberto Duran Dyaus Arsenal white 3
1 loss Manny Pacquiao Jonny China blue 9
Dyaus 3 win Thierry Henry James USA green 12
2 win David Beckham Dyaus Arsenal white 3
1 loss Gabriel Jesus Jonny China blue 9
losses
name_ bout number
Sam 3 6
2 8
1 3
James 3 6
2 8
1 3
Jonny 3 6
2 8
1 3
Dyaus 3 6
2 8
1 3
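Note that the merge above matches each bout number against the row position in bt_df, so the fighter details attached to a row do not necessarily belong to the fighter in that row's index. If the goal is the one stated in the question (joining where the 'name' values match), a minimal sketch could look like the following. The two small frames are hand-built stand-ins for the CSV data, and it assumes the first index level of bd_df is called 'name', as set by rename_axis in the question.

```python
import pandas as pd

# Hypothetical cut-down versions of the two frames from the question.
bd_df = pd.DataFrame(
    {
        "outcome": ["win", "win", "win", "loss"],
        "opp_name": ["Roy Jones", "Floyd Mayweather",
                     "Oscar De la Hoya", "Manny Pacquiao"],
    },
    index=pd.MultiIndex.from_tuples(
        [("Sam", 3), ("Sam", 2), ("Jonny", 3), ("Jonny", 1)],
        names=["name", "bout number"],
    ),
)
bt_df = pd.DataFrame({
    "name": ["Sam", "Jonny"],
    "country": ["England", "China"],
    "colour": ["red", "blue"],
    "wins": [10, 9],
    "losses": [0, 3],
})

# Move the index levels into columns, merge on the shared 'name' column,
# then restore the original two-level index.
merged = (
    bd_df.reset_index()
    .merge(bt_df, on="name", how="left")
    .set_index(["name", "bout number"])
)
print(merged)
```

With how='left', every bout row is kept, and a fighter missing from bt_df would get NaN in the added columns.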

Pandas - Data transformation of column using no delimiters

I have a pandas dataframe which consists of players names and statistics from a sporting match. The only source of data lists them in the following format:
# PLAYER M FG 3PT FT REB AST STL PTS
34 BLAKE Brad 38 17 5 6 3 0 3 0 24
12 JONES Ben 42 10 2 6 1 0 4 1 12
8 SMITH Todd J. 16 9 1 4 1 0 3 2 18
5 MAY-DOUGLAS James 9 9 0 3 1 0 2 1 6
44 EDLIN Taylor 12 6 0 5 1 0 0 1 8
The players' names are in reverse order: surname, then first name. I need to transform the names to the correct order of first name, then last name. So, specifically:
BLAKE Brad -> Brad BLAKE
SMITH Todd J. -> Todd J. SMITH
MAY-DOUGLAS James -> James MAY-DOUGLAS
The case of the letters does not matter; however, I thought it could potentially be used to differentiate the first name from the last name. I know all last names will always be in uppercase, even if they include a hyphen. The first name will always be in sentence case (first letter uppercase and the rest lowercase). However, some names include the middle name to differentiate players with the same name. I can see how a space character could be used as a delimiter, potentially with a "split" transformation, but it gets difficult with the middle name.
Are there any suggestions for a pandas function I can use to achieve this?
The desired output is:
# PLAYER M FG 3PT FT REB AST STL PTS
34 Brad BLAKE 38 17 5 6 3 0 3 0 24
12 Ben JONES 42 10 2 6 1 0 4 1 12
8 Todd J. SMITH 16 9 1 4 1 0 3 2 18
5 James MAY-DOUGLAS 9 9 0 3 1 0 2 1 6
44 Taylor EDLIN 12 6 0 5 1 0 0 1 8
Try splitting on the first whitespace only, then reverse the resulting list and join its values with a space.
df['PLAYER'] = df['PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
To reverse only certain names, you can use isin and then boolean indexing:
names = ['BLAKE Brad', 'SMITH Todd J.', 'MAY-DOUGLAS James']
mask = df['PLAYER'].isin(names)
df.loc[mask, 'PLAYER'] = df.loc[mask, 'PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
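As a quick check of the split-and-reverse idea on the three names from the question, here is a small self-contained sketch. It also shows a regex-based alternative using str.replace; the regex variant is not from the answer above and is only one possible way to express the same swap.

```python
import pandas as pd

players = pd.Series(["BLAKE Brad", "SMITH Todd J.", "MAY-DOUGLAS James"])

# Split once on the first space, reverse the two parts, join them again.
via_split = players.str.split(" ", n=1).str[::-1].str.join(" ")

# Alternative: capture the first token and the remainder, then swap them.
via_regex = players.str.replace(r"^(\S+)\s+(.+)$", r"\2 \1", regex=True)

print(via_split.tolist())
print(via_regex.tolist())
```

Both variants rely only on the surname being the first whitespace-delimited token, so hyphenated surnames and middle initials need no special handling.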

How do I select suitable rows from different relevant columns? (pandas Dataframe)

Hi everyone, I am a beginner with pandas.
My aim: select the most valuable team from "team_list".
The most valuable team means: most goals, fewest yellow and red cards.
"team_list" consists of four columns: "Team", "Goals", "Yellow Cards", "Red Cards".
I want to solve the problem like this, but it isn't idiomatic Python. How can I do that?
sortGoals = euro.sort_values(by=['Goals'], ascending=False)
sortCards = sortGoals.sort_values(by=['Yellow Cards', 'Red Cards'])
print(sortCards.head(1))
The result:
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
The team information:
from pandas import DataFrame
euro = DataFrame({'Team':['Croatia','Czech Republic','Denmark','England','France','Germany',
'Greece','Italy','Netherlands','Poland','Portugal','Republic of Ireland',
'Russia','Spain','Sweden','Ukraine'],
'Goals':[4,4,4,5,3,10,5,6,2,2,6,1,5,12,5,2],
'Yellow Cards':[9,7,4,5,6,4,9,16,5,7,12,6,6,11,7,5],
'Red Cards':[0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0]})
euro:
Team Goals Yellow Cards Red Cards
0 Croatia 4 9 0
1 Czech Republic 4 7 0
2 Denmark 4 4 0
3 England 5 5 0
4 France 3 6 0
5 Germany 10 4 0
6 Greece 5 9 1
7 Italy 6 16 0
8 Netherlands 2 5 0
9 Poland 2 7 1
10 Portugal 6 12 0
11 Republic of Ireland 1 6 1
12 Russia 5 6 0
13 Spain 12 11 0
14 Sweden 5 7 0
15 Ukraine 2 5 0
Joran Beasley inspires me, thank you.
euro['RedCard_rate'] = euro['Red Cards'] / euro['Goals']
euro['YellowCard_rate'] = euro['Yellow Cards'] / euro['Goals']
sort_teams = euro.sort_values(by=['YellowCard_rate', 'RedCard_rate'])
print(sort_teams[['Team', 'Goals', 'Yellow Cards', 'Red Cards']].head(1))
The results:
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
You can do this:
germany = euro.loc[euro.Team == 'Germany']
More on pandas here: https://pandas.pydata.org/docs/user_guide/index.html
Is this what you're looking for?
df[df['Team'].eq('Germany')]
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
import pandas
df =pandas.DataFrame({'Team':['Croatia','Czech Republic',
'Denmark','England','France','Germany',
'Greece','Italy','Netherlands','Poland','Portugal','Republic of Ireland',
'Russia','Spain','Sweden','Ukraine'],
'Goals':[4,4,4,5,3,10,5,6,2,2,6,1,5,12,5,2],
'Yellow Cards':[9,7,4,5,6,4,9,16,5,7,12,6,6,11,7,5],
'Red Cards':[0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0]})
scores = df['Goals'] - df['Yellow Cards'] - df['Red Cards']
df2 = pandas.DataFrame({'Team': df['Team'],'score':scores})
print(df2['Team'][df2['score'].idxmax()])
Is that what you mean?
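Another option is a single sort_values call with one sort direction per column. The sketch below assumes that fewer cards take priority and that goals act as the tie-breaker, which is the ordering the two chained sorts in the question appear to aim for; a different definition of "most valuable" would need a different key order.

```python
import pandas as pd

euro = pd.DataFrame({
    "Team": ["Croatia", "Czech Republic", "Denmark", "England", "France",
             "Germany", "Greece", "Italy", "Netherlands", "Poland",
             "Portugal", "Republic of Ireland", "Russia", "Spain",
             "Sweden", "Ukraine"],
    "Goals": [4, 4, 4, 5, 3, 10, 5, 6, 2, 2, 6, 1, 5, 12, 5, 2],
    "Yellow Cards": [9, 7, 4, 5, 6, 4, 9, 16, 5, 7, 12, 6, 6, 11, 7, 5],
    "Red Cards": [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Assumed priority: fewest yellow cards, then fewest red cards,
# then most goals as the tie-breaker.
best = euro.sort_values(
    by=["Yellow Cards", "Red Cards", "Goals"],
    ascending=[True, True, False],
).head(1)

print(best)
```

Passing a list to ascending avoids chaining two sorts, where the second sort would otherwise discard the ordering produced by the first.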

Python rolling sum taking data from two columns

The below is a part of a dataframe which consists of football game results.
FTHG stands for "Full time home goals"
FTAG stands for "Full time away goals"
Date HomeTeam AwayTeam FTHG FTAG FTR
14/08/93 Arsenal Coventry 0 3 A
14/08/93 Aston Villa QPR 4 1 H
16/08/93 Tottenham Arsenal 0 1 A
17/08/93 Everton Man City 1 0 H
21/08/93 QPR Southampton 2 1 H
21/08/93 Sheffield Arsenal 0 1 A
24/08/93 Arsenal Leeds 2 1 H
24/08/93 Man City Blackburn 0 2 A
28/08/93 Arsenal Everton 2 0 H
I want to write Python code that calculates a rolling sum (for example, over 3 games) of the goals scored by each team, regardless of whether the team was at home or away.
The groupby method does half the job. Say "a" is a variable and "df" is the dataframe:
a = df.groupby("HomeTeam")["FTHG"].rolling(3).sum()
The result would be something like this:
FTHG
Arsenal NaN
NaN
4.0
.....
However, I would also like the code to take into account the goals scored when Arsenal was the visiting team, and to produce a new column (it should not be called FTHG):
Arsenal NaN
NaN
2
4
5
Ideas will be much appreciated
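You can combine those columns together and then apply groupby:
# stack home and away results into one long frame with common column names
tmp1 = df[['Date','HomeTeam', 'FTHG']].copy()
tmp2 = df[['Date','AwayTeam', 'FTAG']].copy()
tmp1.columns = ['Date','name', 'score']
tmp2.columns = ['Date','name', 'score']
tmp = pd.concat([tmp1,tmp2])
# parse the dd/mm/yy strings so that sorting is chronological, not lexical
tmp['Date'] = pd.to_datetime(tmp['Date'], format='%d/%m/%y')
tmp.sort_values(by='Date').groupby("name")["score"].rolling(3).sum()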
name
Arsenal 0 NaN
2 NaN
5 2.0
6 4.0
8 5.0
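The approach can be checked against the sample rows from the question with the following self-contained sketch. The DataFrame is typed in by hand from the table above, and the dates are parsed explicitly so that the sort stays chronological across months.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["14/08/93", "14/08/93", "16/08/93", "17/08/93", "21/08/93",
             "21/08/93", "24/08/93", "24/08/93", "28/08/93"],
    "HomeTeam": ["Arsenal", "Aston Villa", "Tottenham", "Everton", "QPR",
                 "Sheffield", "Arsenal", "Man City", "Arsenal"],
    "AwayTeam": ["Coventry", "QPR", "Arsenal", "Man City", "Southampton",
                 "Arsenal", "Leeds", "Blackburn", "Everton"],
    "FTHG": [0, 4, 0, 1, 2, 0, 2, 0, 2],
    "FTAG": [3, 1, 1, 0, 1, 1, 1, 2, 0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

# One row per (team, game): home rows and away rows share the same columns.
home = df[["Date", "HomeTeam", "FTHG"]].rename(
    columns={"HomeTeam": "name", "FTHG": "score"})
away = df[["Date", "AwayTeam", "FTAG"]].rename(
    columns={"AwayTeam": "name", "FTAG": "score"})
long_df = pd.concat([home, away]).sort_values("Date", kind="stable")

# Rolling sum over each team's last 3 games, home or away.
rolling = long_df.groupby("name")["score"].rolling(3).sum()
arsenal = rolling.xs("Arsenal", level=0)
print(arsenal)
```

The first two values for each team are NaN because a window of 3 needs three games; passing min_periods=1 to rolling would produce partial sums instead.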

how to remove rows based on some specific criteria

I have a data frame like the table below. Based on the ranks I provided, I want to remove rows in the following way.
Data:
name date rank
angel 7/25/2017 3
maggie 8/8/2017 2
maggie 8/8/2017 1
maggie 8/8/2017 2
maggie 8/8/2017 3
smith 8/16/2017 1
smith 8/16/2017 3
laura 9/26/2017 2
laura 9/26/2017 1
laura 9/26/2017 2
laura 9/27/2017 3
lisa 9/5/2017 1
lisa 9/5/2017 3
bill 7/20/2017 1
bill 7/20/2017 3
bill 7/21/2017 3
bill 7/31/2017 3
bill 8/1/2017 3
bill 8/7/2017 1
tomy 8/1/2017 3
What I want to do is this: for every given name, if there is only one row for a date, I want to keep that row. If the same name has several rows on the same date with different ranks, I want to keep only the best-ranked row and remove the rest. So, for example, if bill has 4 rows on the same date with different ranks, I want to keep only the row with rank 1, with all of its row information.
The output I want is like this:
name date rank
angel 7/25/2017 3
maggie 8/8/2017 1
smith 8/16/2017 1
laura 9/26/2017 1
laura 9/27/2017 3
lisa 9/5/2017 1
bill 7/20/2017 1
bill 8/7/2017 1
tomy 8/1/2017 3
Can someone please help me with that?
I was able to get this answered with the following:
`data = df.loc[df.groupby(['name', 'date'])['rank'].idxmin()]`
However, I would still like to know whether a for loop can achieve the same thing. I am new to Python and would love to learn more.
Thanks
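As a rough illustration of the for-loop question, the sketch below keeps, for each (name, date) pair, the index of the row with the smallest rank, and compares the result with the idxmin one-liner. The sample frame is a hand-picked subset of the data above, and names such as best and loop_result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["maggie", "maggie", "maggie", "smith", "smith", "laura", "laura"],
    "date": ["8/8/2017", "8/8/2017", "8/8/2017", "8/16/2017", "8/16/2017",
             "9/26/2017", "9/27/2017"],
    "rank": [2, 1, 3, 1, 3, 2, 3],
})

# Explicit loop: remember the index of the lowest-ranked row per (name, date).
best = {}
for idx, row in df.iterrows():
    key = (row["name"], row["date"])
    if key not in best or row["rank"] < df.at[best[key], "rank"]:
        best[key] = idx
loop_result = df.loc[sorted(best.values())]

# Vectorized equivalent from the post above.
vectorized = df.loc[df.groupby(["name", "date"], sort=False)["rank"].idxmin()]

print(loop_result)
```

The loop is easier to step through when learning, but iterrows is slow on large frames, so the groupby version is usually the better choice in practice.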
