I'm scraping National Hockey League (NHL) data for multiple seasons from this URL:
https://www.hockey-reference.com/leagues/NHL_2018_skaters.html
I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!
import requests
from bs4 import BeautifulSoup
import pandas as pd
dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')
    #identifying table in html
    table = soup.find('table', id="stats")
    #grabbing <tr> tags in html
    rows = table.findAll("tr")
    #creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = { "data-stat":'player' }
        cell = rows[3].findAll("td",filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            #iterating through data_stat to grab values
            filter = { "data-stat":data }
            cell = rows[3].findAll("td",filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)
# conversion to numeric values and creating dataframe
columns = [
    "player",
    "age",
    "team_id",
    "pos",
    "games_played",
    "goals",
    "assists",
    "points",
    "plus_minus",
    "pen_min",
    "ps",
    "goals_ev",
    "goals_pp",
    "goals_sh",
    "goals_gw",
    "assists_ev",
    "assists_pp",
    "assists_sh",
    "shots",
    "shot_pct",
    "time_on_ice",
    "time_on_ice_avg",
    "blocks",
    "hits",
    "faceoff_wins",
    "faceoff_losses",
    "faceoff_percentage",
    "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player','team_id','pos','year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)
Output
Craig Adams Craig Adams 32 ... 43.9 2010
Luke Adam Luke Adam 22 ... 100.0 2013
Justin Abdelkader Justin Abdelkader 29 ... 29.4 2017
Will Acton Will Acton 27 ... 50.0 2015
Noel Acciari Noel Acciari 24 ... 44.1 2016
Pontus Aberg Pontus Aberg 25 ... 10.5 2019
[6 rows x 28 columns]
I'd just use pandas' .read_html(); it does the hard work of parsing tables for you (it uses BeautifulSoup under the hood).
Code:
import pandas as pd

# collect one dataframe per season, then combine once at the end
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html'
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    frames.append(df)

result = pd.concat(frames, sort=False)
# drop the repeated header rows embedded in the table
result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)
You can then save to file with result.to_csv('filename.csv',index=False)
Output:
print (result)
Rk Player Age Tm Pos GP ... BLK HIT FOW FOL FO% year
0 1 Justin Abdelkader 22 DET LW 50 ... 20 152 148 170 46.5 2010
1 2 Craig Adams 32 PIT RW 82 ... 58 193 243 311 43.9 2010
2 3 Maxim Afinogenov 30 ATL RW 82 ... 21 32 1 2 33.3 2010
3 4 Andrew Alberts 28 TOT D 76 ... 88 216 0 1 0.0 2010
4 4 Andrew Alberts 28 CAR D 62 ... 67 172 0 0 NaN 2010
5 4 Andrew Alberts 28 VAN D 14 ... 21 44 0 1 0.0 2010
6 5 Daniel Alfredsson 37 OTT RW 70 ... 36 41 14 25 35.9 2010
7 6 Bryan Allen 29 FLA D 74 ... 137 120 0 0 NaN 2010
8 7 Cody Almond 20 MIN C 7 ... 5 7 18 12 60.0 2010
9 8 Karl Alzner 21 WSH D 21 ... 21 15 0 0 NaN 2010
10 9 Artem Anisimov 21 NYR C 82 ... 41 45 310 380 44.9 2010
11 10 Nik Antropov 29 ATL C 76 ... 35 82 481 627 43.4 2010
12 11 Colby Armstrong 27 ATL RW 79 ... 29 74 10 10 50.0 2010
13 12 Derek Armstrong 36 STL C 6 ... 0 4 7 8 46.7 2010
14 13 Jason Arnott 35 NSH C 63 ... 17 24 526 551 48.8 2010
15 14 Dean Arsene 29 EDM D 13 ... 13 18 0 0 NaN 2010
16 15 Evgeny Artyukhin 26 TOT RW 54 ... 10 127 1 1 50.0 2010
17 15 Evgeny Artyukhin 26 ANA RW 37 ... 8 90 0 1 0.0 2010
18 15 Evgeny Artyukhin 26 ATL RW 17 ... 2 37 1 0 100.0 2010
19 16 Arron Asham 31 PHI RW 72 ... 16 92 2 11 15.4 2010
20 17 Adrian Aucoin 36 PHX D 82 ... 67 131 1 0 100.0 2010
21 18 Keith Aucoin 31 WSH C 9 ... 0 2 31 25 55.4 2010
22 19 Sean Avery 29 NYR C 69 ... 17 145 4 10 28.6 2010
23 20 David Backes 25 STL RW 79 ... 60 266 504 561 47.3 2010
24 21 Mikael Backlund 20 CGY C 23 ... 4 12 100 86 53.8 2010
25 22 Nicklas Backstrom 22 WSH C 82 ... 61 90 657 660 49.9 2010
26 23 Josh Bailey 20 NYI C 73 ... 36 67 171 255 40.1 2010
27 24 Keith Ballard 27 FLA D 82 ... 201 156 0 0 NaN 2010
28 25 Krys Barch 29 DAL RW 63 ... 13 120 0 3 0.0 2010
29 26 Cam Barker 23 TOT D 70 ... 53 75 0 0 NaN 2010
... ... .. ... .. .. ... ... ... ... ... ... ...
10251 885 Chris Wideman 29 TOT D 25 ... 26 35 0 0 NaN 2019
10252 885 Chris Wideman 29 OTT D 19 ... 25 26 0 0 NaN 2019
10253 885 Chris Wideman 29 EDM D 5 ... 1 7 0 0 NaN 2019
10254 885 Chris Wideman 29 FLA D 1 ... 0 2 0 0 NaN 2019
10255 886 Justin Williams 37 CAR RW 82 ... 32 55 92 150 38.0 2019
10256 887 Colin Wilson 29 COL C 65 ... 31 55 20 32 38.5 2019
10257 888 Garrett Wilson 27 PIT LW 50 ... 16 114 3 4 42.9 2019
10258 889 Scott Wilson 26 BUF C 15 ... 2 29 1 2 33.3 2019
10259 890 Tom Wilson 24 WSH RW 63 ... 52 200 29 24 54.7 2019
10260 891 Luke Witkowski 28 DET D 34 ... 27 67 0 0 NaN 2019
10261 892 Christian Wolanin 23 OTT D 30 ... 31 11 0 0 NaN 2019
10262 893 Miles Wood 23 NJD LW 63 ... 27 97 0 2 0.0 2019
10263 894 Egor Yakovlev 27 NJD D 25 ... 22 12 0 0 NaN 2019
10264 895 Kailer Yamamoto 20 EDM RW 17 ... 11 18 0 0 NaN 2019
10265 896 Keith Yandle 32 FLA D 82 ... 76 47 0 0 NaN 2019
10266 897 Pavel Zacha 21 NJD C 61 ... 24 68 348 364 48.9 2019
10267 898 Filip Zadina 19 DET RW 9 ... 3 6 3 3 50.0 2019
10268 899 Nikita Zadorov 23 COL D 70 ... 67 228 0 0 NaN 2019
10269 900 Nikita Zaitsev 27 TOR D 81 ... 151 139 0 0 NaN 2019
10270 901 Travis Zajac 33 NJD C 80 ... 38 66 841 605 58.2 2019
10271 902 Jakub Zboril 21 BOS D 2 ... 0 3 0 0 NaN 2019
10272 903 Mika Zibanejad 25 NYR C 82 ... 66 134 830 842 49.6 2019
10273 904 Mats Zuccarello 31 TOT LW 48 ... 43 57 10 20 33.3 2019
10274 904 Mats Zuccarello 31 NYR LW 46 ... 42 57 10 20 33.3 2019
10275 904 Mats Zuccarello 31 DAL LW 2 ... 1 0 0 0 NaN 2019
10276 905 Jason Zucker 27 MIN LW 81 ... 38 87 2 11 15.4 2019
10277 906 Valentin Zykov 23 TOT LW 28 ... 6 26 2 7 22.2 2019
10278 906 Valentin Zykov 23 CAR LW 13 ... 2 6 2 6 25.0 2019
10279 906 Valentin Zykov 23 VEG LW 10 ... 3 18 0 1 0.0 2019
10280 906 Valentin Zykov 23 EDM LW 5 ... 1 2 0 0 NaN 2019
[10281 rows x 29 columns]
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash Beautiful Soup; it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded by dense markup, if you're willing to be a bit utilitarian about it:
1. Select entire table on web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv
It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
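If the blank lines bother you, pandas can also drop them when you load the CSV back in. A minimal sketch, assuming the spreadsheet was saved as table.csv (the filename is just a placeholder):

import pandas as pd

# skip_blank_lines=True is read_csv's default, so fully empty lines vanish on load
df = pd.read_csv('table.csv', skip_blank_lines=True)
# spacer rows that survive as all-NaN can be dropped afterwards
df = df.dropna(how='all').reset_index(drop=True)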
I am newer to data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere between 1 row and 12 rows representing their season statistics over however many seasons they've played.
Example:
Player Season Pos GP G A P +/- PIM P/GP ... PPG PPP SHG SHP OTG GWG S S% TOI/GP FOW%
0 Nathan MacKinnon 2022 1 65 32 56 88 22 42 1.35 ... 7 27 0 0 1 5 299 10.7 21.07 45.4
1 Nathan MacKinnon 2021 1 48 20 45 65 22 37 1.35 ... 8 25 0 0 0 2 206 9.7 20.37 48.5
2 Nathan MacKinnon 2020 1 69 35 58 93 13 12 1.35 ... 12 31 0 0 2 4 318 11.0 21.22 43.1
3 Nathan MacKinnon 2019 1 82 41 58 99 20 34 1.21 ... 12 37 0 0 1 6 365 11.2 22.08 43.7
4 Nathan MacKinnon 2018 1 74 39 58 97 11 55 1.31 ... 12 32 0 1 3 12 284 13.7 19.90 41.9
5 Nathan MacKinnon 2017 1 82 16 37 53 -14 16 0.65 ... 2 14 2 2 2 4 251 6.4 19.95 50.6
6 Nathan MacKinnon 2016 1 72 21 31 52 -4 20 0.72 ... 7 16 0 1 0 6 245 8.6 18.87 48.4
7 Nathan MacKinnon 2015 1 64 14 24 38 -7 34 0.59 ... 3 7 0 0 0 2 192 7.3 17.05 47.0
8 Nathan MacKinnon 2014 1 82 24 39 63 20 26 0.77 ... 8 17 0 0 0 5 241 10.0 17.35 42.9
9 J.T. Compher 2022 2 70 18 15 33 6 25 0.47 ... 4 6 1 1 0 0 102 17.7 16.32 51.4
10 J.T. Compher 2021 2 48 10 8 18 10 19 0.38 ... 1 2 0 0 0 2 47 21.3 14.22 45.9
11 J.T. Compher 2020 2 67 11 20 31 9 18 0.46 ... 1 5 0 3 1 3 106 10.4 16.75 47.7
12 J.T. Compher 2019 2 66 16 16 32 -8 31 0.48 ... 4 9 3 3 0 3 118 13.6 17.48 49.2
13 J.T. Compher 2018 2 69 13 10 23 -29 20 0.33 ... 4 7 2 2 2 3 131 9.9 16.00 45.1
14 J.T. Compher 2017 2 21 3 2 5 0 4 0.24 ... 1 1 0 0 0 1 30 10.0 14.93 47.6
15 Darren Helm 2022 1 68 7 8 15 -5 14 0.22 ... 0 0 1 2 0 1 93 7.5 10.55 44.2
16 Darren Helm 2021 1 47 3 5 8 -3 10 0.17 ... 0 0 0 0 0 0 83 3.6 14.68 66.7
17 Darren Helm 2020 1 68 9 7 16 -6 37 0.24 ... 0 0 1 2 0 0 102 8.8 13.73 53.6
18 Darren Helm 2019 1 61 7 10 17 -11 20 0.28 ... 0 0 1 4 0 0 107 6.5 14.57 44.4
19 Darren Helm 2018 1 75 13 18 31 3 39 0.41 ... 0 0 2 4 0 0 141 9.2 15.57 44.1
If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.
As you can see from the data above, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.
While not drastic, it compounds into the other metrics as well.
I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.
I don't want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years' worth of data, and for those players I need to use as much data as I can; removing 1 row from a player with only 2 seasons would also negatively skew their averages.
I would really appreciate some assistance from anyone more experienced than me!
This can be accomplished by using groupby and apply. For example:
edited_players = (players
    .groupby("Player")
    .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)
Round brackets for formatting purposes.
The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.
The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I'll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".
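As a quick sanity check, here's a minimal sketch on an invented two-player frame (player names and numbers are made up for illustration): player A has 7 seasons, so their 2021 row is dropped, while player B's 3 seasons are left untouched.

import pandas as pd

players = pd.DataFrame({
    "Player": ["A"] * 7 + ["B"] * 3,
    "Season": [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2020, 2021, 2022],
    "GP": [80, 78, 82, 70, 69, 48, 65, 60, 50, 70],
})

edited_players = (players
    .groupby("Player", group_keys=False)  # group_keys=False keeps the original flat index
    .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)

print(edited_players)  # A: 6 rows (2021 gone), B: 3 rows (2021 kept)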
I am new to python and am doing a webscraping tutorial. I am having trouble getting my CSV file in the appropriate folder. Basically, I am not able to view the resulting CSV. Does anyone have a solution regarding this problem?
import pandas as pd
import re
from bs4 import BeautifulSoup
import requests
#Pulling in website source code#
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs= {'class':re.compile('row-player-10-')})
for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects', index = False, sep =',', encoding='utf-8')
I've checked your code and found one issue, this one:
for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects', index = False, sep =',', encoding='utf-8')
You have to use this (players changed to player in the loop, and the filename given a .csv extension):
for player in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in player.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index = False, sep =',', encoding='utf-8')
A few issues:
1. As stated in the previous solution, you need to change your for loop to for player in players:. You can't use the same variable name as the iterable you are looping through.
2. You shouldn't use . in your variable names as you have with temp.df; a dot indicates attribute or method access. Use an underscore instead: temp_df.
3. You never define final_df, then try to call it in your pd.concat().
4. You never define columns, then try to use it (and temp.df = columns would overwrite your temp_df as well). What you want instead is temp_df.columns = columns. But note you still need to define columns.
5. Your find_all() for the players is incorrect in that you're searching for a class that contains row-player-10-. There is no class with that. It is row player-10. Very subtle difference, but it's the difference between returning no elements and 50 elements.
6. stats = [stat.get_text() for stat in player.findall('td')] again needs to reference player from the for loop, as mentioned in 1. In fact, there are a few syntax things in there to change to actually pull out the text, so it should be [stat.text for stat in player.find_all('td')].
7. You pd.concat the temp_df to a final_df within your loop. You can do that (provided you create an initial final_dataframe or final_df; you use 2 different variable names, so I'm not sure which you really wanted), but that will lead to repeated header/column-name rows and require an extra step. What I would rather do is store each temp_df in a list, then, after the loop runs through all the players, concat the list of dataframes into a final one.
So here is the full code:
import pandas as pd
import re
from bs4 import BeautifulSoup
import requests

#Pulling in website source code
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs={'class': re.compile('.*row player-10-.*')})
columns = soup.find('tr', {'class': 'colhead'})
columns = [x.text for x in columns.find_all('td')]

#Initialize a list of dataframes
final_df_list = []

# Loop through the players
for player in players:
    ##Pulling stats for each player
    stats = [stat.text for stat in player.find_all('td')]
    ##Create a data frame for the single player stats
    temp_df = pd.DataFrame(stats).transpose()
    temp_df.columns = columns
    #Put temp_df in a list of dataframes
    final_df_list.append(temp_df)

##Join your list of single players stats
final_dataframe = pd.concat(final_df_list, ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index=False, sep=',', encoding='utf-8')
Output:
print(final_dataframe)
PLAYER YRS G AB R H ... HR RBI BB SO SB CS BA
0 1 J.D. Martinez 11 54 211 38 74 ... 8 28 24 55 0 0 .351
1 2 Paul Goldschmidt 11 62 236 47 82 ... 16 56 35 50 3 0 .347
2 3 Xander Bogaerts 9 62 232 39 77 ... 6 31 23 50 3 0 .332
3 4 Rafael Devers 5 63 258 53 85 ... 16 40 18 49 1 0 .329
4 5 Manny Machado 10 63 244 46 80 ... 11 43 29 46 7 1 .328
5 6 Jeff McNeil 4 61 216 30 70 ... 4 32 16 27 2 0 .324
6 7 Ty France 3 63 249 29 79 ... 10 41 18 40 0 0 .317
7 8 Bryce Harper 10 58 225 46 71 ... 15 46 24 48 7 2 .316
8 9 Yordan Alvarez 3 57 205 39 64 ... 17 45 31 38 0 1 .312
9 10 Aaron Judge 6 61 232 53 72 ... 25 49 31 66 4 0 .310
10 11 Jose Ramirez 9 59 222 40 68 ... 16 62 34 19 11 3 .306
11 12 Andrew Benintendi 6 61 226 23 68 ... 2 22 24 37 0 0 .301
12 13 Michael Brantley 13 55 207 23 62 ... 4 21 28 24 1 1 .300
13 14 Trea Turner 7 62 242 32 72 ... 8 47 21 48 13 2 .298
14 15 J.P. Crawford 5 59 216 28 64 ... 5 16 28 37 3 1 .296
15 16 Dansby Swanson 6 64 234 39 69 ... 9 37 23 70 9 2 .295
16 17 Mike Trout 11 57 201 44 59 ... 18 38 30 64 0 0 .294
17 Josh Bell 6 65 235 33 69 ... 8 39 28 37 0 1 .294
18 19 Santiago Espinal 2 63 219 25 64 ... 5 31 18 40 3 2 .292
19 20 Trey Mancini 5 58 217 25 63 ... 6 25 24 47 0 0 .290
20 21 Austin Hays 4 60 228 33 66 ... 9 37 18 41 1 3 .289
21 22 Eric Hosmer 11 59 222 23 64 ... 4 29 22 38 0 0 .288
22 23 Freddie Freeman 12 62 241 40 69 ... 5 34 32 43 6 0 .286
23 24 C.J. Cron 8 64 249 36 71 ... 14 44 16 74 0 0 .285
24 Tommy Edman 3 63 246 52 70 ... 7 26 26 45 15 2 .285
25 26 Starling Marte 10 54 222 40 63 ... 7 34 10 45 8 5 .284
26 27 Ian Happ 5 61 209 30 59 ... 7 31 34 50 5 1 .282
27 28 Pete Alonso 3 64 239 41 67 ... 18 59 26 56 2 1 .280
28 29 Lourdes Gurriel Jr. 4 58 206 21 57 ... 3 25 15 41 2 1 .277
29 30 Nathaniel Lowe 3 58 217 25 60 ... 8 24 15 57 1 1 .276
30 31 Mookie Betts 8 60 245 53 67 ... 17 40 27 47 6 1 .273
31 32 Jose Abreu 8 59 224 34 61 ... 9 30 33 42 0 0 .272
32 Amed Rosario 5 53 217 31 59 ... 1 16 10 31 7 1 .272
33 Ke'Bryan Hayes 2 57 213 26 58 ... 2 22 26 53 7 3 .272
34 35 Nolan Arenado 9 61 229 28 62 ... 11 41 25 31 0 2 .271
35 George Springer 8 58 218 39 59 ... 12 33 20 51 4 1 .271
36 37 Ryan Mountcastle 2 53 211 28 57 ... 12 35 11 57 2 0 .270
37 Vladimir Guerrero Jr. 3 62 233 34 63 ... 16 39 27 45 0 1 .270
38 39 Cesar Hernandez 9 65 271 37 73 ... 0 16 17 55 2 2 .269
39 Ketel Marte 7 61 223 33 60 ... 4 22 22 45 4 0 .269
40 Connor Joe 2 60 238 32 64 ... 5 16 32 52 3 2 .269
41 42 Brandon Nimmo 6 57 209 36 56 ... 4 21 27 44 0 1 .268
42 Thairo Estrada 3 59 205 34 55 ... 4 26 14 31 9 1 .268
43 44 Shohei Ohtani 4 63 243 42 64 ... 13 37 24 67 7 5 .263
44 45 Randy Arozarena 3 61 233 30 61 ... 7 31 14 58 12 5 .262
45 46 Nelson Cruz 17 60 222 29 58 ... 7 36 25 50 2 0 .261
46 Hunter Dozier 5 55 203 25 53 ... 6 21 15 50 1 2 .261
47 48 Kyle Tucker 4 58 204 24 53 ... 12 39 31 41 11 1 .260
48 Bo Bichette 3 63 265 35 69 ... 10 33 17 65 4 3 .260
49 50 Charlie Blackmon 11 57 232 29 60 ... 10 33 17 41 2 1 .259
[50 rows x 16 columns]
Lastly, tables are a great way to learn how to use BeautifulSoup because of their structure. But I do want to throw out there that pandas can parse <table> tags for you with less code:
import pandas as pd
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
final_dataframe = pd.read_html(url, header=1)[0]
final_dataframe = final_dataframe[final_dataframe['PLAYER'].ne('PLAYER')]
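And since the original goal was a CSV on disk, the same to_csv call applies to this result as well (the filename here is just a placeholder):

final_dataframe.to_csv('espn_mlb_leaders_2022.csv', index=False)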
How can I merge and sum the columns with the same name?
So the output should be 1 Column named Canada as a result of the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
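One caveat: newer pandas releases deprecate the axis=1 argument to groupby. If your version complains (an assumption about your environment), an equivalent spelling with the same result is to transpose, group on the index, and transpose back:

df = df.set_index('Country/Region')  # optional, as above
df = df.T.groupby(level=0).sum().T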
I found your dataset interesting, here's how I would clean it up from step 1:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, (20, 5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
For my Python code, I have been trying to scrape data from NCAAF Stats. I have been having issues extracting a td's text after I check whether the anchor tag a contains the text I want. I want to be able to find the team's number of TDs, points, and PPG. I have been able to successfully find the school by text in Selenium, but after that I am unable to extract the info I want. Here is what I have coded so far.
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')
# I plan to make a while or for loop later, that is why I used f strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')
# This was the way another similarly asked question was answered but did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)
driver.quit()
I have tried to look around for a similarly asked question and was able to find this snippet
tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
it unfortunately did not help me out much. I appreciate any help, and thank you!
No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:
import pandas as pd
url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]
Output:
print(df)
Rank Team G TDs PAT 2PT Def Pts FG Saf Pts PPG
0 1 Ohio St. 6 39 39 0 0 6 0 291.0 48.5
1 2 Pittsburgh 6 40 36 0 0 4 1 290.0 48.3
2 3 Coastal Carolina 7 43 42 0 0 6 1 320.0 45.7
3 4 Alabama 7 41 40 1 0 9 0 315.0 45.0
4 5 Ole Miss 6 35 30 1 0 6 1 262.0 43.7
5 6 Cincinnati 6 36 34 1 0 3 0 261.0 43.5
6 7 Oklahoma 7 35 34 1 1 17 0 299.0 42.7
7 - SMU 7 40 36 1 0 7 0 299.0 42.7
8 9 Texas 7 38 37 0 0 8 1 291.0 41.6
9 10 Western Ky. 6 31 27 1 0 10 0 245.0 40.8
10 11 Tennessee 7 36 36 0 0 7 1 275.0 39.3
11 12 Wake Forest 6 28 24 2 0 12 0 232.0 38.7
12 13 UTSA 7 33 33 0 0 13 0 270.0 38.6
13 14 Michigan 6 28 25 1 0 12 0 231.0 38.5
14 15 Georgia 7 34 33 0 0 10 1 269.0 38.4
15 16 Baylor 7 35 35 0 0 7 1 268.0 38.3
16 17 Houston 6 30 28 0 0 5 0 223.0 37.2
17 - TCU 6 29 28 0 0 7 0 223.0 37.2
18 19 Marshall 7 34 33 0 0 7 0 258.0 36.9
19 - North Carolina 7 34 32 2 0 6 0 258.0 36.9
20 21 Nevada 6 26 24 1 0 12 0 218.0 36.3
21 22 Virginia 7 31 29 2 0 10 2 253.0 36.1
22 23 Fresno St. 7 32 27 1 0 10 0 251.0 35.9
23 - Memphis 7 33 26 3 0 7 0 251.0 35.9
24 25 Texas Tech 7 32 31 0 0 9 0 250.0 35.7
25 26 Auburn 7 29 28 1 0 12 1 242.0 34.6
26 27 Florida 7 33 29 1 0 4 0 241.0 34.4
27 - Missouri 7 31 31 0 0 8 0 241.0 34.4
28 29 Liberty 7 33 29 1 0 3 1 240.0 34.3
29 - Michigan St. 7 30 30 0 0 10 0 240.0 34.3
30 31 UCF 6 28 26 0 0 3 1 205.0 34.2
31 32 Oregon St. 6 27 27 0 0 5 0 204.0 34.0
32 33 Oregon 6 26 26 0 0 7 0 203.0 33.8
33 34 Iowa St. 6 23 22 0 0 14 0 202.0 33.7
34 35 UCLA 7 30 28 0 0 9 0 235.0 33.6
35 36 San Diego St. 6 25 24 1 0 7 0 197.0 32.8
36 37 LSU 7 29 29 0 0 8 0 227.0 32.4
37 38 Louisville 6 24 23 0 0 9 0 194.0 32.3
38 - Miami (FL) 6 24 22 1 0 8 1 194.0 32.3
39 - NC State 6 25 24 0 0 6 1 194.0 32.3
40 41 Southern California 6 22 19 3 0 12 0 193.0 32.2
41 42 Tulane 7 31 23 4 0 2 0 223.0 31.9
42 43 Arizona St. 7 30 25 2 0 4 0 221.0 31.6
43 44 Utah 6 25 22 1 0 5 0 189.0 31.5
44 45 Air Force 7 29 27 1 0 5 1 220.0 31.4
45 46 App State 7 27 24 0 0 11 0 219.0 31.3
46 47 Arkansas 7 27 25 0 0 10 0 217.0 31.0
47 - Army West Point 6 25 22 0 0 4 1 186.0 31.0
48 - Notre Dame 6 23 20 2 0 8 0 186.0 31.0
49 - Western Mich. 7 28 25 0 0 8 0 217.0 31.0
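If you only want the Coastal Carolina numbers the question asked about, plain boolean indexing on that dataframe gets them (the column names below are taken from the printed output above):

row = df[df['Team'] == 'Coastal Carolina']
print(row[['TDs', 'Pts', 'PPG']])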
I'm kind of new with pandas and now I have a question.
I read a table from an HTML site and set my header according to the table on the website.
df = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)
Now I have my dataframe with a matching header, BUT I have some rows that are the same as the header, like the example below.
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG
1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06 253
2 John Tavares, C NYI 82 38 48 86 5 46 1.05 278
...
10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95 264
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG
14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88 268
I know that it's possible to delete duplicate rows with pandas, but is it possible to delete rows that are duplicates of the header or a specific row?
Hope you can help me out!
You can use boolean indexing:
df = df[df.PLAYER != 'PLAYER']
If you also need to remove rows with PP in column PLAYER, use isin.
Notice: I add [0] to the end of read_html because it returns a list of dataframes and you need to select the first item of the list:
df = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)[0]
print (df)
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G \
0 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
1 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
2 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
3 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
4 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
5 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
6 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
7 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
8 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
9 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
10 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
11 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
12 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
13 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
14 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
15 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
...
...
mask = df['PLAYER'].isin(['PLAYER', 'PP'])
print (df[~mask])
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG \
0 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06 253
1 2 John Tavares, C NYI 82 38 48 86 5 46 1.05 278
2 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09 237
3 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00 395
4 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99 221
5 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95 153
6 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08 280
7 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97 158
8 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93 226
9 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95 264
12 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92 182
13 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90 279
14 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89 101
15 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88 268
16 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94 203
17 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87 202
18 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85 261
19 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01 212
20 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91 191
21 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87 304
...
...