When I run the code below in pandas:
gold.groupby(['Games','country'])['Medal'].value_counts()
I get the result below. How can I extract the top medal winner for each Games? The result should list every Games, the country with the most medals, and its medal tally.
Games country Medal
1896 Summer Australia Gold 2
Austria Gold 2
Denmark Gold 1
France Gold 5
Germany Gold 25
...
2016 Summer UK Gold 64
USA Gold 139
Ukraine Gold 2
Uzbekistan Gold 4
Vietnam Gold 1
Name: Medal, Length: 1101, dtype: int64
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal country notes
68 17294 Cai Yalin M 23.0 174.0 60.0 China CHN 2000 Summer 2000 Summer Sydney Shooting Shooting Men's Air Rifle, 10 metres Gold China NaN
77 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN 2012 Summer 2012 Summer London Badminton Badminton Men's Doubles Gold China NaN
87 17995 Cao Lei F 24.0 168.0 75.0 China CHN 2008 Summer 2008 Summer Beijing Weightlifting Weightlifting Women's Heavyweight Gold China NaN
104 18005 Cao Yuan M 17.0 160.0 42.0 China CHN 2012 Summer 2012 Summer London Diving Diving Men's Synchronized Platform Gold China NaN
105 18005 Cao Yuan M 21.0 160.0 42.0 China CHN 2016 Summer 2016 Summer Rio de Janeiro Diving Diving Men's Springboard Gold China NaN
Your data only included Chinese gold medal winners, so I added a row:
ID Name Sex Age Height Weight Team NOC \
0 17294 Cai Yalin M 23.0 174.0 60.0 China CHN
1 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN
2 17995 Cao Lei F 24.0 168.0 75.0 China CHN
3 18005 Cao Yuan M 17.0 160.0 42.0 China CHN
4 18005 Cao Yuan M 21.0 160.0 42.0 China CHN
5 292929 Serge de Gosson M 52.0 178.0 69.0 France FR
Games Year Season City Sport \
0 2000 Summer 2000 Summer Sydney Shooting
1 2012 Summer 2012 Summer London Badminton
2 2008 Summer 2008 Summer Beijing Weightlifting
3 2012 Summer 2012 Summer London Diving
4 2016 Summer 2016 Summer Rio de Janeiro Diving
5 2022 Summer 2022 Summer Stockholm Calisthenics
Event Medal country notes
0 Shooting Men's Air Rifle, 10 metres Gold China NaN
1 Badminton Men's Doubles Gold China NaN
2 Weightlifting Women's Heavyweight Gold China NaN
3 Diving Men's Synchronized Platform Gold China NaN
4 Diving Men's Springboard Gold China NaN
5 Planche Gold France NaN
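You want to do exactly what you did, but sort the counts in descending order and keep the top row for each Games. The explicit sort is needed because value_counts only orders rows within each (Games, country) group, not across the countries of one Games:
gold.groupby(['Games','country'])['Medal'].value_counts().sort_values(ascending=False).groupby(level=0, group_keys=False).head(1).sort_index()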
Which returns:
Games country Medal
2000 Summer China Gold 1
2008 Summer China Gold 1
2012 Summer China Gold 2
2016 Summer China Gold 1
2022 Summer France Gold 1
Name: Medal, dtype: int64
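Or as a dataframe (sorting the counts first so that head(1) keeps the largest one per Games):
GOLD_TOP = pd.DataFrame(gold.groupby(['Games','country'])['Medal'].value_counts().sort_values(ascending=False).groupby(level=0, group_keys=False).head(1).sort_index())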
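A different route to the same result, sketched here under assumptions: count rows per (Games, country) with size() and pick the largest count per Games with idxmax, which does not depend on row order. The sample rows and the function name top_country_per_games are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Invented sample rows; only the column names match the question's data.
gold = pd.DataFrame({
    "Games": ["2012 Summer"] * 4 + ["2016 Summer"] * 3,
    "country": ["China", "China", "USA", "China", "USA", "USA", "UK"],
    "Medal": ["Gold"] * 7,
})

def top_country_per_games(medals):
    """Return one row per Games: the country with the most medal rows."""
    # One row per (Games, country) with the number of medal rows.
    counts = (
        medals.groupby(["Games", "country"])
        .size()
        .reset_index(name="medals")
    )
    # idxmax gives the row label of the largest count within each Games.
    best_rows = counts.groupby("Games")["medals"].idxmax()
    return counts.loc[best_rows].reset_index(drop=True)

top = top_country_per_games(gold)
print(top)
```

On ties, idxmax keeps the first country it meets, so a Games with two equally successful countries reports only one of them.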
df_gold = df[df["Medal"]=="Gold"].groupby("Team").Medal.count().reset_index()
df_gold = df_gold.sort_values(by="Medal",ascending=False)[:8]
df_gold
I was trying to scrape NBA player info from https://nba.com/players and click the "Show Historic" button on the webpage.
Part of the HTML for the input button is shown below:
<div aria-label="Show Historic Toggle" class="Toggle_switch__2e_90">
<input type="checkbox" class="Toggle_input__gIiFd" name="showHistoric">
<span class="Toggle_slider__hCMQQ Toggle_sliderActive__15Jrf Toggle_slidercerulean__1UnnV">
</span>
</div>
I simply use find_element_by_xpath to locate the input button and click it:
button_show_historic = driver.find_element_by_xpath("//input[@name='showHistoric']")
button_show_historic.click()
However it says:
Exception has occurred: ElementNotInteractableException
Message: element not interactable
(Session info: chrome=88.0.4324.192)
Could anyone help me solve this issue? Is this because the input is not visible?
Simply wait for the span element, not the input element, and click it.
wait = WebDriverWait(driver, 30)
driver.get('https://www.nba.com/players')
wait.until(EC.element_to_be_clickable((By.XPATH,"//button[.='I Accept']"))).click()
wait.until(EC.element_to_be_clickable((By.XPATH,"//input[@name='showHistoric']/preceding::span[1]"))).click()
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Also, to find an API, look under Developer tools -> Network, then check the Headers and Response tabs of each request to see which one the data comes from.
Most probably the problem is that you don't have any wait code. You should wait until the page has loaded. You can use Python's simple sleep function:
import time
time.sleep(3) #it will wait 3 seconds
##Do your action
Or you can use an explicit wait. Check this page: selenium.dev
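To show how an explicit wait differs from a fixed sleep, here is a rough, library-free sketch of the idea: poll a condition until it becomes true or a timeout expires. The helper wait_until and the stand-in condition are hypothetical and are not part of Selenium; in Selenium itself, WebDriverWait(...).until(...) plays this role.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Stand-in for "the element is clickable": becomes true on the third check.
calls = {"n": 0}

def ready_on_third_call():
    calls["n"] += 1
    return calls["n"] >= 3

outcome = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print(outcome, calls["n"])
```

Unlike time.sleep(3), this returns as soon as the condition holds and fails loudly if it never does.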
No need to use selenium when there's an api. Try this:
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/playerindex'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Referer': 'http://stats.nba.com'}
payload = {
'College': '',
'Country': '',
'DraftPick': '',
'DraftRound': '',
'DraftYear': '',
'Height': '' ,
'Historical': '1',
'LeagueID': '00',
'Season': '2020-21',
'SeasonType': 'Regular Season',
'TeamID': '0',
'Weight': ''}
jsonData = requests.get(url, headers=headers, params=payload).json()
cols = jsonData['resultSets'][0]['headers']
data = jsonData['resultSets'][0]['rowSet']
df = pd.DataFrame(data, columns=cols)
Output: [4589 rows x 26 columns]
print(df.head(20).to_string())
PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION HEIGHT WEIGHT COLLEGE COUNTRY DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME FROM_YEAR TO_YEAR
0 76001 Abdelnaby Alaa alaa-abdelnaby 1.610613e+09 blazers 0 Portland Trail Blazers POR 30 F 6-10 240 Duke USA 1990.0 1.0 25.0 NaN 5.7 3.3 0.3 Career 1990 1994
1 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1.610613e+09 rockets 0 Houston Rockets HOU 54 C 6-9 235 Iowa State USA 1968.0 1.0 5.0 NaN 9.0 8.0 1.2 Career 1968 1977
2 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1.610613e+09 lakers 0 Los Angeles Lakers LAL 33 C 7-2 225 UCLA USA 1969.0 1.0 1.0 NaN 24.6 11.2 3.6 Career 1969 1988
3 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1.610613e+09 nuggets 0 Denver Nuggets DEN 1 G 6-1 162 Louisiana State USA 1990.0 1.0 3.0 NaN 14.6 1.9 3.5 Career 1990 2000
4 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1.610613e+09 kings 0 Sacramento Kings SAC 9 F-G 6-6 235 San Jose State France 1997.0 1.0 11.0 NaN 7.8 3.3 1.1 Career 1997 2003
5 949 Abdur-Rahim Shareef shareef-abdur-rahim 1.610613e+09 grizzlies 0 Memphis Grizzlies MEM 3 F 6-9 245 California USA 1996.0 1.0 3.0 NaN 18.1 7.5 2.5 Career 1996 2007
6 76005 Abernethy Tom tom-abernethy 1.610613e+09 warriors 0 Golden State Warriors GSW 5 F 6-7 220 Indiana USA 1976.0 3.0 43.0 NaN 5.6 3.2 1.2 Career 1976 1980
7 76006 Able Forest forest-able 1.610613e+09 sixers 0 Philadelphia 76ers PHI 6 G 6-3 180 Western Kentucky USA 1956.0 NaN NaN NaN 0.0 1.0 1.0 Career 1956 1956
8 76007 Abramovic John john-abramovic 1.610610e+09 None 1 Pittsburgh Ironmen PIT None F 6-3 195 Salem USA NaN NaN NaN NaN 9.5 NaN 0.7 Career 1946 1947
9 203518 Abrines Alex alex-abrines 1.610613e+09 thunder 0 Oklahoma City Thunder OKC 8 G 6-6 190 FC Barcelona Spain 2013.0 2.0 32.0 NaN 5.3 1.4 0.5 Career 2016 2018
10 1630173 Achiuwa Precious precious-achiuwa 1.610613e+09 heat 0 Miami Heat MIA 5 F 6-8 225 Memphis Nigeria 2020.0 1.0 20.0 1.0 5.9 3.9 0.6 Season 2020 2020
11 101165 Acker Alex alex-acker 1.610613e+09 clippers 0 LA Clippers LAC 3 G 6-5 185 Pepperdine USA 2005.0 2.0 60.0 NaN 2.7 1.0 0.5 Career 2005 2008
12 76008 Ackerman Donald donald-ackerman 1.610613e+09 knicks 0 New York Knicks NYK G 6-0 183 Long Island-Brooklyn USA 1953.0 2.0 NaN NaN 1.5 0.5 0.8 Career 1953 1953
13 76009 Acres Mark mark-acres 1.610613e+09 magic 0 Orlando Magic ORL 42 C 6-11 220 Oral Roberts USA 1985.0 2.0 40.0 NaN 3.6 4.1 0.5 Career 1987 1992
14 76010 Acton Charles charles-acton 1.610613e+09 rockets 0 Houston Rockets HOU 24 F 6-6 210 Hillsdale USA NaN NaN NaN NaN 3.3 2.0 0.5 Career 1967 1967
15 203112 Acy Quincy quincy-acy 1.610613e+09 kings 0 Sacramento Kings SAC 13 F 6-7 240 Baylor USA 2012.0 2.0 37.0 NaN 4.9 3.5 0.6 Career 2012 2018
16 76011 Adams Alvan alvan-adams 1.610613e+09 suns 0 Phoenix Suns PHX 33 C 6-9 210 Oklahoma USA 1975.0 1.0 4.0 NaN 14.1 7.0 4.1 Career 1975 1987
17 76012 Adams Don don-adams 1.610613e+09 pistons 0 Detroit Pistons DET 10 F 6-7 210 Northwestern USA 1970.0 8.0 120.0 NaN 8.7 5.6 1.8 Career 1970 1976
18 200801 Adams Hassan hassan-adams 1.610613e+09 nets 0 Brooklyn Nets BKN 8 F 6-4 220 Arizona USA 2006.0 2.0 54.0 NaN 2.5 1.2 0.2 Career 2006 2008
19 1629121 Adams Jaylen jaylen-adams 1.610613e+09 bucks 0 Milwaukee Bucks MIL 6 G 6-0 225 St. Bonaventure USA NaN NaN NaN 1.0 0.3 0.4 0.3 Season 2018 2020
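The headers/rowSet unpacking above can be wrapped in a small helper. This is a sketch under assumptions: the function name result_set_to_frame is made up, and the sample payload is a hand-written stand-in that only mimics the 'resultSets' shape used in the answer, so it runs without any network request.

```python
import pandas as pd

def result_set_to_frame(json_data, index=0):
    """Build a DataFrame from one entry of a 'resultSets' list."""
    result_set = json_data["resultSets"][index]
    return pd.DataFrame(result_set["rowSet"], columns=result_set["headers"])

# Hand-written stand-in payload with three of the columns shown above.
sample = {
    "resultSets": [
        {
            "headers": ["PERSON_ID", "PLAYER_LAST_NAME", "FROM_YEAR"],
            "rowSet": [
                [76001, "Abdelnaby", "1990"],
                [76003, "Abdul-Jabbar", "1969"],
            ],
        }
    ]
}

players = result_set_to_frame(sample)
print(players)
```

With a real response, the call would be result_set_to_frame(requests.get(url, headers=headers, params=payload).json()).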
I have a dataframe that lists the matches played by a team in a year. Match Date is a column.
Team 1 Team 2 Winner Match Date
5 Australia England England 2018-01-14
12 Australia England England 2018-01-19
14 Australia England England 2018-01-21
20 Australia England Australia 2018-01-26
22 Australia England England 2018-01-28
34 New Zealand England New Zealand 2018-02-25
35 New Zealand England England 2018-02-28
36 New Zealand England England 2018-03-03
43 New Zealand England New Zealand 2018-03-07
46 New Zealand England England 2018-03-10
62 Scotland England Scotland 2018-06-10
63 England Australia England 2018-06-13
64 England Australia England 2018-06-16
65 England Australia England 2018-06-19
66 England Australia England 2018-06-21
67 England Australia England 2018-06-24
68 England India India 2018-07-12
70 England India England 2018-07-14
72 England India England 2018-07-17
106 Sri Lanka England no result 2018-10-10
107 Sri Lanka England England 2018-10-13
108 Sri Lanka England England 2018-10-17
109 Sri Lanka England England 2018-10-20
112 Sri Lanka England Sri Lanka 2018-10-23
Match Date is a datetime column. I could plot the number of matches played versus the number of matches won. This is the code I used:
England.set_index('Match Date', inplace = True)
England.resample('1M').count()['Winner'].plot()
England_win.resample('1M').count()['Winner'].plot()
But I would like to plot the winning percentage by month. Please help.
Thank you
I am sure there are more efficient ways to do this, but one way to plot this using an approach similar to yours:
import matplotlib.pyplot as plt
import pandas as pd
#reading your sample data
df = pd.read_csv("test.txt", sep=r"\s{2,}", parse_dates=["Match Date"], index_col="ID", engine="python")
df.set_index('Match Date', inplace = True)
#creating df that count the wins
df1 = df[df["Winner"]=="England"].resample("1M").count()
#calculate and plot the percentage - if no game, NaN values are substituted with zero
df1.Winner.div(df.resample('1M').count()['Winner']).mul(100).fillna(0).plot()
plt.tight_layout()
plt.show()
Sample output: [plot of England's winning percentage by month]
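Another way to get the same percentages, sketched with a few rows from the question: turn "did the team win" into a boolean column and average it per calendar month, since the mean of a boolean column is the share of True values. The function name monthly_win_pct is made up for illustration, and grouping by to_period("M") is a swapped-in alternative to resample.

```python
import pandas as pd

# The first seven matches from the question's table.
matches = pd.DataFrame({
    "Winner": ["England", "England", "England", "Australia",
               "England", "New Zealand", "England"],
    "Match Date": pd.to_datetime([
        "2018-01-14", "2018-01-19", "2018-01-21", "2018-01-26",
        "2018-01-28", "2018-02-25", "2018-02-28",
    ]),
})

def monthly_win_pct(df, team):
    """Percentage of matches won by `team` in each calendar month."""
    won = df["Winner"].eq(team)                 # True where the team won
    month = df["Match Date"].dt.to_period("M")  # e.g. 2018-01
    return won.groupby(month).mean().mul(100)

pct = monthly_win_pct(matches, "England")
print(pct)
```

Calling pct.plot() would draw the curve. Two differences from the resample approach: months with no matches do not appear at all instead of showing 0, and "no result" matches count as not won.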
I'm working on my Python skills and I'm trying to scrape only the "Results" table from this page: https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results. I'm new to web scraping. Could anyone help me with an elegant solution for scraping the Results wikitable? Thanks!
The easiest way is to use Pandas to load the tables:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
# print second table (index 1):
print(tables[1])
Prints:
Date Venue Home team Away team Score Competition Winner Match report
0 7 March 2020 Twickenham Stadium England Wales 33–30 2020 Six Nations England BBC
1 22 February 2020 Principality Stadium Wales France 23–27 2020 Six Nations France BBC
2 8 February 2020 Aviva Stadium Ireland Wales 24–14 2020 Six Nations Ireland BBC
3 1 February 2020 Principality Stadium Wales Italy 42–0 2020 Six Nations Wales BBC
4 30 November 2019 Principality Stadium Wales Barbarians 43–33 Tour Match Wales BBC
.. ... ... ... ... ... ... ... ...
741 5 January 1884 Cardigan Fields England Wales 1G 2T–1G 1884 Home Nations Championship England NaN
742 8 January 1883 Raeburn Place Scotland Wales 3G–1G 1883 Home Nations Championship Scotland NaN
743 16 December 1882 St Helen's Wales England 0–2G 4T 1883 Home Nations Championship England NaN
744 28 January 1882 Lansdowne Road Ireland Wales 0–2G 2T NaN Wales NaN
745 19 February 1881 Richardson's Field England Wales 7G 6T 1D–0 NaN England NaN
[746 rows x 8 columns]
I am trying to create a new column in DF1 that lists the home team's number of All-Stars for that year.
DF1
Date Visitor V_PTS Home H_PTS \
0 2012-10-30 19:00:00 Washington Wizards 84 Cleveland Cavaliers 94
1 2012-10-30 19:30:00 Dallas Mavericks 99 Los Angeles Lakers 91
2 2012-10-30 20:00:00 Boston Celtics 107 Miami Heat 120
3 2012-10-31 19:00:00 Dallas Mavericks 94 Utah Jazz 113
4 2012-10-31 19:00:00 San Antonio Spurs 99 New Orleans Pelicans 95
Attendance Arena Location Capacity \
0 20562 Quicken Loans Arena Cleveland, Ohio 20562
1 18997 Staples Center Los Angeles, California 18997
2 20296 American Airlines Arena Miami, Florida 19600
3 17634 Vivint Smart Home Arena Salt Lake City, Utah 18303
4 15358 Smoothie King Center New Orleans, Louisiana 16867
Yr Arena Opened Season
0 1994 2012-13
1 1992 2012-13
2 1999 2012-13
3 1991 2012-13
4 1999 2012-13
DF2
2012-13 2013-14 2014-15 2015-16 2016-17
Cleveland Cavaliers 1 1 2 1 3
Los Angeles Lakers 2 1 1 1 0
Miami Heat 3 3 2 2 1
Chicago Bulls 2 1 2 2 1
Detroit Pistons 0 0 0 1 1
Los Angeles Clippers 2 2 2 1 1
New Orleans Pelicans 0 1 1 1 1
Philadelphia 76ers 1 0 0 0 0
Phoenix Suns 0 0 0 0 0
Portland Trail Blazers 1 2 2 0 0
Toronto Raptors 0 1 1 2 2
DF1['H_Allstars'] = DF2[DF1['Season'], DF1['Home']]
results in TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand the error; I am just not sure how else to do it.
I've removed the extra columns and just focused on the necessary ones for demonstration.
Input:
df1
Home 2012-13 2013-14 2014-15 2015-16 2016-17
0 Cleveland Cavaliers 1 1 2 1 3
1 Los Angeles Lakers 2 1 1 1 0
2 Miami Heat 3 3 2 2 1
3 Chicago Bulls 2 1 2 2 1
4 Detroit Pistons 0 0 0 1 1
5 Los Angeles Clippers 2 2 2 1 1
6 New Orleans Pelicans 0 1 1 1 1
7 Philadelphia 76ers 1 0 0 0 0
8 Phoenix Suns 0 0 0 0 0
9 Portland Trail Blazers 1 2 2 0 0
10 Toronto Raptors 0 1 1 2 2
df2
Visitor Home Season
0 Washington Wizards Cleveland Cavaliers 2012-13
1 Dallas Mavericks Los Angeles Lakers 2012-13
2 Boston Celtics Miami Heat 2012-13
3 Dallas Mavericks Utah Jazz 2012-13
4 San Antonio Spurs New Orleans Pelicans 2012-13
Step 1: Melt df1 to get the allstars column
df3 = pd.melt(df1, id_vars='Home', value_vars=df1.columns[df1.columns.str.contains('20')], var_name='Season', value_name='H_Allstars')
Output:
Home Season H_Allstars
0 Cleveland Cavaliers 2012-13 1
1 Los Angeles Lakers 2012-13 2
2 Miami Heat 2012-13 3
3 Chicago Bulls 2012-13 2
4 Detroit Pistons 2012-13 0
5 Los Angeles Clippers 2012-13 2
6 New Orleans Pelicans 2012-13 0
7 Philadelphia 76ers 2012-13 1
8 Phoenix Suns 2012-13 0
...
Step 2: Merge this new dataframe with df2 to get the H_Allstars column
df4 = pd.merge(df2, df3, how='left', on=['Home', 'Season'])
Output:
Visitor Home Season H_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0
2 Boston Celtics Miami Heat 2012-13 3.0
3 Dallas Mavericks Utah Jazz 2012-13 NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0
Step 3: Add the V_Allstars column
# renaming column as required
df3.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'}, inplace=True)
df5 = pd.merge(df4, df3, how='left', on=['Visitor', 'Season'])
Output:
Visitor Home Season H_Allstars V_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0 NaN
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0 NaN
2 Boston Celtics Miami Heat 2012-13 3.0 NaN
3 Dallas Mavericks Utah Jazz 2012-13 NaN NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0 NaN
You can use pandas.melt. Bring your data df2 into long format, i.e. Home and Season as columns and the All-Star counts as values, and then merge it into df1 on 'Home' and 'Season'.
import pandas as pd
df2['Home'] = df2.index
df2 = pd.melt(df2, id_vars = 'Home', value_vars = ['2012-13', '2013-14', '2014-15', '2015-16', '2016-17'], var_name = 'Season', value_name='H_Allstars')
df = df1.merge(df2, on=['Home','Season'], how='left')
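The melt-then-merge idea can be put together in one self-contained sketch that fills both the home and the visitor column. The function name add_allstar_counts and the shortened sample frames are assumptions for illustration (the 2013-14 game row is invented); the team and column names follow the question.

```python
import pandas as pd

# Shortened stand-ins for the All-Star table and the games table.
allstars = pd.DataFrame(
    {"2012-13": [1, 2, 3, 0], "2013-14": [1, 1, 3, 1]},
    index=["Cleveland Cavaliers", "Los Angeles Lakers",
           "Miami Heat", "New Orleans Pelicans"],
)
games = pd.DataFrame({
    "Visitor": ["Washington Wizards", "Miami Heat", "Dallas Mavericks"],
    "Home": ["Cleveland Cavaliers", "Los Angeles Lakers", "Utah Jazz"],
    "Season": ["2012-13", "2013-14", "2012-13"],
})

def add_allstar_counts(games, allstars):
    """Add H_Allstars and V_Allstars by merging on (team, Season)."""
    # Wide -> long: one row per (Team, Season) with the All-Star count.
    long = (
        allstars.rename_axis("Team")
        .reset_index()
        .melt(id_vars="Team", var_name="Season", value_name="Allstars")
    )
    out = games
    for side, col in (("Home", "H_Allstars"), ("Visitor", "V_Allstars")):
        # Rename on a copy so the long table is not modified in place.
        side_counts = long.rename(columns={"Team": side, "Allstars": col})
        out = out.merge(side_counts, on=[side, "Season"], how="left")
    return out

result = add_allstar_counts(games, allstars)
print(result)
```

Teams missing from the All-Star table (Utah Jazz, Washington Wizards and Dallas Mavericks here) get NaN, which also turns the count columns into floats.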
country state year area
usa iowa 2000 30
usa iowa 2001 30
usa iowa 2002 30
usa iowa 2003 30
usa kansas 2000 500
usa kansas 2001 500
usa kansas 2002 500
usa kansas 2003 500
usa washington 2000 245
usa washington 2001 245
usa washington 2002 245
usa washington 2003 245
In the dataframe above, I want to drop the rows where the state's share of the total area is below 10%. In this case that would be all rows with state as iowa. What is the best way to do it in pandas? I tried groupby but am not sure how to proceed.
df.groupby('area').sum()
Another solution with drop_duplicates and double boolean indexing:
a = df.drop_duplicates(['state','area'])
print (a)
country state year area
0 usa iowa 2000 30
4 usa kansas 2000 500
8 usa washington 2000 245
states = a.loc[a.area.div(a.area.sum()) >.1, 'state']
print (states)
4 kansas
8 washington
Name: state, dtype: object
print (df[df.state.isin(states)])
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245
You want to take any of the area values within each state and sum them up. I take the first.
groupby('state').area.first().sum() is the thing we normalize by.
df[df.area.div(df.groupby('state').area.first().sum()) >= .1]
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245
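For reference, the arithmetic behind both answers: the total area is 30 + 500 + 245 = 775, so iowa holds 30/775 (about 3.9%), kansas 500/775 (about 64.5%) and washington 245/775 (about 31.6%). A minimal sketch that wraps this in a function with an adjustable threshold follows; the name drop_small_states and the min_share parameter are made up for illustration, and the sample is trimmed to two years per state.

```python
import pandas as pd

# Trimmed version of the question's data (two years per state).
df = pd.DataFrame({
    "country": ["usa"] * 6,
    "state": ["iowa", "iowa", "kansas", "kansas", "washington", "washington"],
    "year": [2000, 2001] * 3,
    "area": [30, 30, 500, 500, 245, 245],
})

def drop_small_states(frame, min_share=0.10):
    """Keep rows whose state holds at least `min_share` of the total area."""
    # The area repeats for every year, so take one value per state.
    state_area = frame.groupby("state")["area"].first()
    share = state_area / state_area.sum()
    # Map each row's state to its share and filter on it.
    return frame[frame["state"].map(share) >= min_share]

kept = drop_small_states(df)
print(kept)
```

This assumes the area of a state is constant across years, as it is in the sample data.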