How do I select suitable rows from different relevant columns? (pandas Dataframe)

Hi everyone. I am a beginner with pandas.
My aim: select the most valuable team from "team_list".
The most valuable team means: most goals, fewest yellow and red cards.
"team_list" consists of four columns: "Team", "Goals", "Yellow Cards", "Red Cards".
I tried to solve the problem like this, but it isn't idiomatic Python. How can I do that?
sortGoals=euro.sort_values(by=['Goals'],ascending=False);
sortCards=sortGoals.sort_values(by=['Yellow Cards','Red Cards']);
print (sortCards.head(1));
the result :
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
the team information :
euro=DataFrame({'Team':['Croatia','Czech Republic','Denmark','England','France','Germany',
'Greece','Italy','Netherlands','Poland','Portugal','Republic of Ireland',
'Russia','Spain','Sweden','Ukraine'],
'Goals':[4,4,4,5,3,10,5,6,2,2,6,1,5,12,5,2],
'Yellow Cards':[9,7,4,5,6,4,9,16,5,7,12,6,6,11,7,5],
'Red Cards':[0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0]})
euro:
Team Goals Yellow Cards Red Cards
0 Croatia 4 9 0
1 Czech Republic 4 7 0
2 Denmark 4 4 0
3 England 5 5 0
4 France 3 6 0
5 Germany 10 4 0
6 Greece 5 9 1
7 Italy 6 16 0
8 Netherlands 2 5 0
9 Poland 2 7 1
10 Portugal 6 12 0
11 Republic of Ireland 1 6 1
12 Russia 5 6 0
13 Spain 12 11 0
14 Sweden 5 7 0
15 Ukraine 2 5 0
Joran Beasley inspires me, thank you.
euro['RedCard_rate']=euro['Red Cards']/euro['Goals'];
euro['YellowCard_rate']=euro['Yellow Cards']/euro['Goals'];
sort_teams=euro.sort_values(by=['YellowCard_rate','RedCard_rate']);
print (sort_teams[['Team','Goals','Yellow Cards','Red Cards']].head(1));
the results:
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0

You can do this:
germany = euro.loc[euro.Team == 'Germany']
More on pandas here: https://pandas.pydata.org/docs/user_guide/index.html

Is this what you're looking for?
df[df['Team'].eq('Germany')]
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0

import pandas
df =pandas.DataFrame({'Team':['Croatia','Czech Republic',
'Denmark','England','France','Germany',
'Greece','Italy','Netherlands','Poland','Portugal','Republic of Ireland',
'Russia','Spain','Sweden','Ukraine'],
'Goals':[4,4,4,5,3,10,5,6,2,2,6,1,5,12,5,2],
'Yellow Cards':[9,7,4,5,6,4,9,16,5,7,12,6,6,11,7,5],
'Red Cards':[0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0]})
scores = df['Goals'] - df['Yellow Cards'] - df['Red Cards']
df2 = pandas.DataFrame({'Team': df['Team'],'score':scores})
print(df2['Team'][df2['score'].idxmax()])
Is that what you mean?
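The two-step sort from the question can also be written as a single sort_values call with one ascending flag per key. This is only a minimal sketch, assuming the intended ranking is "fewest yellow cards, then fewest red cards, then most goals", which is roughly what the question's own code ends up applying. It uses a shortened copy of the data, and the name best is just for illustration.

```python
import pandas as pd

# Shortened copy of the question's data (five of the sixteen teams).
euro = pd.DataFrame({
    "Team": ["Denmark", "England", "Germany", "Italy", "Spain"],
    "Goals": [4, 5, 10, 6, 12],
    "Yellow Cards": [4, 5, 4, 16, 11],
    "Red Cards": [0, 0, 0, 0, 0],
})

# One call: cards ascending (fewer is better), goals descending as the tie-breaker.
best = euro.sort_values(
    by=["Yellow Cards", "Red Cards", "Goals"],
    ascending=[True, True, False],
).head(1)
print(best)
```

Denmark and Germany tie on cards here, so the goals key decides between them. A different priority order (for example goals first) would pick a different team, so the key order is a modelling choice rather than something pandas decides.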

Related

How to merge two dataframes, where one is multi-indexed, with different headers

I've been trying to merge two dataframes that look as below, one is multi-indexed while the other is not.
FIRST DATAFRAME: bd_df
outcome opp_name
Sam 3 win Roy Jones
2 win Floyd Mayweather
1 win Bernard Hopkins
James 3 win James Bond
2 win Michael O'Terry
1 win Donald Trump
Jonny 3 win Oscar De la Hoya
2 win Roberto Duran
1 loss Manny Pacquiao
Dyaus 3 win Thierry Henry
2 win David Beckham
1 loss Gabriel Jesus
SECOND DATAFRAME: bt_df
name country colour wins losses
0 Sam England red 10 0
1 Jonny China blue 9 3
2 Dyaus Arsenal white 3 8
3 James USA green 12 6
I'm aiming to merge the two dataframes such that bd_df is joined to bt_df based on the 'name' value where they match. I also have been trying to rename the axis of bd_df with no luck - code is also below.
My code is as below currently, with the output. Appreciate any help!
boxrec_tables = pd.read_csv(Path(boxrec_tables_path),index_col=[0,1]).rename_axis(['name', 'bout number'])
bt_df = pd.DataFrame(boxrec_tables)
bout_data = pd.read_csv(Path(bout_data_path))
bd_df = pd.DataFrame(bout_data)
OUTPUT
outcome opp_name name country colour wins losses
Sam 3 win Roy Jones James USA green 12 6
2 win Floyd Mayweather Dyaus Arsenal white 3 8
1 win Bernard Hopkins Jonny China blue 9 3
James 3 win James Bond James USA green 12 6
2 win Michael O'Terry Dyaus Arsenal white 3 8
1 win Donald Trump Jonny China blue 9 3
Jonny 3 win Oscar De la Hoya James USA green 12 6
2 win Roberto Duran Dyaus Arsenal white 3 8
1 loss Manny Pacquiao Jonny China blue 9 3
Dyaus 3 win Thierry Henry James USA green 12 6
2 win David Beckham Dyaus Arsenal white 3 8
1 loss Gabriel Jesus Jonny China blue 9 3
Following suggestion by @Jezrael:
df = (bd_df.join(bt_df.set_index('opp name', drop=False)).set_index('name',append=True))
country colour wins losses outcome opp name
name
0 Sam England red 10 0 NaN NaN
1 Jonny China blue 9 3 NaN NaN
2 Dyaus Arsenal white 3 8 NaN NaN
3 James USA green 12 6 NaN NaN
The issue now is that the merged dataframe values show as NaN, and the bout number values are missing as well.

I think you need to merge the bout number level of the MultiIndex in bd_df with the index of bt_df:
main_df = (bd_df.reset_index()
                .merge(bt_df,
                       left_on='bout number',
                       right_index=True,
                       how='left',
                       suffixes=('_',''))
                .set_index(['name_', 'bout number'])
           )
print (main_df)
outcome opp_name name country colour wins \
name_ bout number
Sam 3 win Roy Jones James USA green 12
2 win Floyd Mayweather Dyaus Arsenal white 3
1 win Bernard Hopkins Jonny China blue 9
James 3 win James Bond James USA green 12
2 win Michael O'Terry Dyaus Arsenal white 3
1 win Donald Trump Jonny China blue 9
Jonny 3 win Oscar De la Hoya James USA green 12
2 win Roberto Duran Dyaus Arsenal white 3
1 loss Manny Pacquiao Jonny China blue 9
Dyaus 3 win Thierry Henry James USA green 12
2 win David Beckham Dyaus Arsenal white 3
1 loss Gabriel Jesus Jonny China blue 9
losses
name_ bout number
Sam 3 6
2 8
1 3
James 3 6
2 8
1 3
Jonny 3 6
2 8
1 3
Dyaus 3 6
2 8
1 3
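The question states the goal as joining on the 'name' value, which is not what the bout-number merge above does. A minimal sketch of a name-based merge follows, assuming the MultiIndex levels of bd_df are named 'name' and 'bout number' and that bt_df has a plain 'name' column. The small frames are a cut-down version of the question's data, and merged is an illustrative name.

```python
import pandas as pd

# Cut-down bout data with a two-level index, as described in the question.
bd_df = pd.DataFrame(
    {
        "outcome": ["win", "win", "loss"],
        "opp_name": ["Roy Jones", "Roberto Duran", "Manny Pacquiao"],
    },
    index=pd.MultiIndex.from_tuples(
        [("Sam", 3), ("Jonny", 2), ("Jonny", 1)],
        names=["name", "bout number"],
    ),
)

# Cut-down boxer table with a default integer index.
bt_df = pd.DataFrame(
    {
        "name": ["Sam", "Jonny"],
        "country": ["England", "China"],
        "wins": [10, 9],
        "losses": [0, 3],
    }
)

# Move the index levels into columns, merge on the shared 'name' column,
# then restore the original two-level index.
merged = (
    bd_df.reset_index()
    .merge(bt_df, on="name", how="left")
    .set_index(["name", "bout number"])
)
print(merged)
```

With how='left', every bout row is kept and boxers missing from bt_df would simply get NaN in the added columns.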

Counting distinct, until a certain condition based on another row is met

I have the following df
Original df
Step | CampaignSource | UserId
1 Banana Jeff
1 Banana John
2 Banana Jefferson
3 Website Nunes
4 Banana Jeff
5 Attendance Nunes
6 Attendance Antonio
7 Banana Antonio
8 Website Joseph
9 Attendance Joseph
9 Attendance Joseph
Desired output
Steps | CampaignSource | CountedDistinctUserid
1 Website 2 (Because of different userids)
2 Banana 1
3 Banana 1
4 Website 1
5 Banana 1
6 Attendance 1
7 Attendance 1
8 Attendance 1
9 Attendance 1 (but i want to have 2 here even tho they have similar user ids and because is the 9th step)
What I want to do is impose a condition: if the Step column (which holds strings) equals '9', I want to count the user IDs without deduplicating them. Any ideas on how I could do that? I tried applying a function but couldn't make it work.
What i am currently doing:
df[['Steps','UserId','CampaignSource']].groupby(['Steps','CampaignSource'],as_index=False,dropna=False).nunique()

You can group by "Step" and use a condition on the group name:
df.groupby('Step')['UserId'].apply(lambda g: g.nunique() if g.name<9 else g.count())
output:
Step
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: UserId, dtype: int64
As DataFrame:
(df.groupby('Step', as_index=False)
   .agg(CampaignSource=('CampaignSource', 'first'),
        CountedDistinctUserid=('UserId', lambda g: g.nunique() if g.name<9 else g.count())
       )
)
output:
Step CampaignSource CountedDistinctUserid
0 1 Banana 2
1 2 Banana 1
2 3 Website 1
3 4 Banana 1
4 5 Attendance 1
5 6 Attendance 1
6 7 Banana 1
7 8 Website 1
8 9 Attendance 2
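The same idea can be sketched without a lambda by computing both the distinct and the raw counts per step and then choosing between them. This is a hedged alternative, not the answer's own method. It assumes Step holds integers (if it holds strings, as the question suggests, it would need converting first, for example with astype(int)), and the names distinct, total and counts are illustrative.

```python
import pandas as pd

# The question's data, with Step stored as integers (an assumption).
df = pd.DataFrame({
    "Step": [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9],
    "CampaignSource": ["Banana", "Banana", "Banana", "Website", "Banana",
                       "Attendance", "Attendance", "Banana", "Website",
                       "Attendance", "Attendance"],
    "UserId": ["Jeff", "John", "Jefferson", "Nunes", "Jeff", "Nunes",
               "Antonio", "Antonio", "Joseph", "Joseph", "Joseph"],
})

grouped = df.groupby("Step")["UserId"]
distinct = grouped.nunique()  # distinct users per step
total = grouped.size()        # all rows per step, duplicates included

# Keep the distinct count for steps below 9, otherwise fall back to the raw count.
counts = distinct.where(distinct.index < 9, total)
print(counts)
```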

You can apply a different function to each group depending on whether a condition matches.
out = (df[['Steps','UserId','CampaignSource']]
       .groupby(['Steps','CampaignSource'],as_index=False,dropna=False)
       .apply(lambda g: g.assign(CountedDistinctUserid=( [len(g)]*len(g)
                                                         if g['Steps'].eq(9).all()
                                                         else [g['UserId'].nunique()]*len(g) ))))
print(out)
Steps UserId CampaignSource CountedDistinctUserid
0 1 Jeff Banana 2
1 1 John Banana 2
2 2 Jefferson Banana 1
3 3 Nunes Website 1
4 4 Jeff Banana 1
5 5 Nunes Attendance 1
6 6 Antonio Attendance 1
7 7 Antonio Banana 1
8 8 Joseph Website 1
9 9 Joseph Attendance 2
10 9 Joseph Attendance 2

Seaborn heatmap from dataframe with color bars representing metadata

I have the following pandas dataframes:
DF =
USA Canada Denmark Japan England Spain Brazil
mountain 3 1 9 7 1 1 4
forest 6 3 2 1 5 2 4
plains 7 6 0 7 5 1 5
swampland 6 9 6 6 2 2 8
fjord 7 0 1 2 7 5 5
glacier 1 1 1 1 8 1 7
city 7 0 8 0 1 3 1
hills 3 5 3 9 3 0 0
MD =
Country Continent Random_int
USA America 1
Japan Asia 3
Denmark Europe 2
Norway Europe 1
Cambodia Asia 3
Canada America 2
England Asia 3
Chad Africa 2
China Asia 1
Brazil America 3
What I am trying to do is to use seaborn to create a heatmap as in this image:
From https://nbviewer.org/github/MaayanLab/clustergrammer-widget/blob/master/Running_clustergrammer_widget.ipynb
I am trying to get the color bars at the top (circled in blue; I don't know the terminology for these), where the authors of the clustergrammer heatmap display 'category' and 'gender' metadata.
How can I do that in seaborn? Or is there any alternative to seaborn that works better for this?
I haven't gotten anything I have tried to work.
Many thanks in advance!
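One possible direction, offered as a sketch rather than a confirmed answer: seaborn's clustermap accepts a col_colors argument (and row_colors) holding one color per column, which draws this kind of annotation strip. The snippet below only prepares that color Series with pandas. The palette dictionary is a hypothetical choice, the frames are cut-down copies of DF and MD, and the seaborn call itself is left commented out because it needs seaborn and a plotting backend.

```python
import pandas as pd

# Cut-down copies of the question's two frames.
DF = pd.DataFrame(
    {"USA": [3, 6], "Japan": [7, 1], "Denmark": [9, 2]},
    index=["mountain", "forest"],
)
MD = pd.DataFrame({
    "Country": ["USA", "Japan", "Denmark", "Norway"],
    "Continent": ["America", "Asia", "Europe", "Europe"],
})

# Hypothetical palette: one color name per continent.
palette = {"America": "tab:blue", "Asia": "tab:orange", "Europe": "tab:green"}

# Map each country to its continent's color, ordered like DF's columns.
col_colors = MD.set_index("Country")["Continent"].map(palette).reindex(DF.columns)
print(col_colors)

# With seaborn installed, the strip could then be drawn with:
# import seaborn as sns
# sns.clustermap(DF, col_colors=col_colors)
```

Countries that appear in MD but not in DF (Norway here) are dropped by the reindex, and a DF column missing from MD would get NaN instead of a color.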

Pandas - Data transformation of column using now delimiters

I have a pandas dataframe which consists of players names and statistics from a sporting match. The only source of data lists them in the following format:
# PLAYER M FG 3PT FT REB AST STL PTS
34 BLAKE Brad 38 17 5 6 3 0 3 0 24
12 JONES Ben 42 10 2 6 1 0 4 1 12
8 SMITH Todd J. 16 9 1 4 1 0 3 2 18
5 MAY-DOUGLAS James 9 9 0 3 1 0 2 1 6
44 EDLIN Taylor 12 6 0 5 1 0 0 1 8
The players' names are in reverse order: surname firstname. I need to transform the names to the correct order of firstname lastname. So, specifically:
BLAKE Brad -> Brad BLAKE
SMITH Todd J. -> Todd J. SMITH
MAY-DOUGLAS James -> James MAY-DOUGLAS
The case of the letters does not matter, but I thought it could be used to tell the first and last names apart. I know all last names will always be in uppercase, even if they include a hyphen. The first name will always be capitalized (first letter uppercase and the rest lowercase). However, some names include a middle initial to differentiate players with the same name. I can see how a space character could be used as a delimiter with a "split" transformation, but it gets difficult with the middle name.
Are there any suggestions for a pandas function I can use to achieve this?
The desired output is:
# PLAYER M FG 3PT FT REB AST STL PTS
34 Brad BLAKE 38 17 5 6 3 0 3 0 24
12 Ben JONES 42 10 2 6 1 0 4 1 12
8 Todd J. SMITH 16 9 1 4 1 0 3 2 18
5 James MAY-DOUGLAS 9 9 0 3 1 0 2 1 6
44 Taylor EDLIN 12 6 0 5 1 0 0 1 8

Try to split by first whitespace, then reverse the list and join list values with whitespace.
df['PLAYER'] = df['PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
To reverse only certain names, you can use isin then boolean indexing
names = ['BLAKE Brad', 'SMITH Todd J.', 'MAY-DOUGLAS James']
mask = df['PLAYER'].isin(names)
df.loc[mask, 'PLAYER'] = df.loc[mask, 'PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
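The question also mentions that surnames are always uppercase, which suggests a regex-based variant. The following is a minimal sketch under that assumption: the leading run of all-uppercase words (letters, hyphens, apostrophes) is treated as the surname and everything after it as the given names. The pattern is an illustrative guess, not something from the answer above.

```python
import pandas as pd

df = pd.DataFrame({"PLAYER": ["BLAKE Brad", "SMITH Todd J.", "MAY-DOUGLAS James"]})

# Group 1: one or more uppercase words (the surname).
# Group 2: the remainder (first name plus any middle initial).
pattern = r"^((?:[A-Z'\-]+\s)*[A-Z'\-]+)\s+(.+)$"

# Swap the two captured groups; names that do not match are left unchanged.
df["PLAYER"] = df["PLAYER"].str.replace(pattern, r"\2 \1", regex=True)
print(df)
```

Unlike splitting on the first space, this would keep a multi-word uppercase surname together, at the cost of misreading a given name that happens to be written entirely in capitals.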

Why am I not able to drop values within columns on pandas using python3?

I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to not include those countries who do not at least have one medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()
print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.

To drop a column you need to pass an axis, e.g. axis=1, to the drop function.
An axis of 0 means rows and 1 means columns; 0 is the default.
With axis=1 the entire column is dropped.
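Separately from the axis argument, the code in the question calls dropna() without keeping the result. dropna returns a new object and leaves the original untouched unless the result is assigned. A minimal sketch of both a dropna-based and a boolean-filter approach follows, assuming 'Gold.2' holds the combined gold count as in the question's output. The three-row frame and the names cleaned and has_gold are illustrative.

```python
import pandas as pd

# Three countries taken from the question's output.
df = pd.DataFrame(
    {"Gold": [0, 5, 18], "Gold.1": [0, 0, 0], "Gold.2": [0, 5, 18]},
    index=["Afghanistan", "Algeria", "Argentina"],
)

# 0 / 0 yields NaN for countries without any gold medal.
df["medal_dif"] = (df["Gold"] - df["Gold.1"]) / df["Gold.2"]

# dropna returns a new frame; the result has to be kept to have any effect.
cleaned = df.dropna(subset=["medal_dif"])

# Filtering on the condition itself avoids relying on NaN at all.
has_gold = df[df["Gold.2"] > 0]

print(cleaned)
print(has_gold)
```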
