Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape data from Nick Saban's Sports-Reference page so that I can pull in the list of All-Americans he coached and then his bowl win-loss percentage.
I am new to Python, so this has been a massive struggle. When I inspect the page I see div id="leaderboard_all-americans" class="data_grid_box".
When I run the code below I get the Coaching Record table, which is the first table on the site. I tried using different indexes, thinking that might give a different result, but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))

sports-reference.com stores the HTML tables as comments in the basic request response. You first have to grab the commented block that contains the All-Americans and bowl results, and then parse it:
import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd

d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# find the HTML comment whose markup contains the All-Americans leaderboard
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
# re-parse the comment's contents as a document of its own
b = soup(str(block), 'html.parser')
# each row holds a linked player name followed by the year(s) in parentheses
players = b.select('#leaderboard_all-americans table.no_columns tr')
p_results = [{'name': i.td.a.text, 'year': i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
# the bowl win-loss leaderboard sits in the same commented block
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
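The same comment trick generalizes: if you want every hidden table on the page as a DataFrame, you can feed each commented block that contains table markup straight to pd.read_html. A minimal sketch, assuming every such block parses cleanly:
import bs4, requests, pandas as pd

html = requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text
comments = bs4.BeautifulSoup(html, 'html.parser').find_all(string=lambda t: isinstance(t, bs4.Comment))
hidden_tables = []
for c in comments:
    if '<table' in c:  # skip comments with no table markup; read_html raises on those
        hidden_tables.extend(pd.read_html(str(c)))
print(len(hidden_tables))  # one DataFrame per commented table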

Related

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back empty. Is there a workaround for this? I've tried using Beautiful Soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select the correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the "(aged XX)" part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
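If you then want real datetime values rather than strings, a small follow-up sketch (assuming the column name printed above):
df["Date of birth (age)"] = pd.to_datetime(df["Date of birth (age)"].str.strip(), errors="coerce")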
Try setting the parse_dates parameter to True inside the read_html method.
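For example, a minimal illustration of that parameter:
dfs = pd.read_html(url, parse_dates=True)  # behaves like the parse_dates argument of read_csv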

Pandas Merge Not Working When Values Are an Exact Match

Below are my code and DataFrames. stats_df is much bigger. Not sure if it matters, but the column values are EXACTLY as they appear in the actual files. I can't merge the two DFs without losing 'Alex Len', even though both DFs have the same PlayerID value of '20000852'.
stats_df = pd.read_csv('stats_todate.csv')
matchup_df = pd.read_csv('matchup.csv')
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']])
I have also tried:
stats_df['PlayerID'] = stats_df['PlayerID'].astype(str)
matchup_df['PlayerID'] = matchup_df['PlayerID'].astype(str)
stats_df['PlayerID'] = stats_df['PlayerID'].str.strip()
matchup_df['PlayerID'] = matchup_df['PlayerID'].str.strip()
Any ideas?
Here are my two DataFrames:
DF1:
PlayerID SeasonType Season Name Team Position
20001713 1 2018 A.J. Hammons MIA C
20002725 2 2022 A.J. Lawson ATL SG
20002038 2 2021 Élie Okobo BKN PG
20002742 2 2022 Aamir Simms NY PF
20000518 3 2018 Aaron Brooks MIN PG
20000681 1 2022 Aaron Gordon DEN PF
20001395 1 2018 Aaron Harrison DAL SG
20002680 1 2022 Aaron Henry PHI SF
20002005 1 2022 Aaron Holiday PHO PG
20001981 3 2018 Aaron Jackson HOU PF
20002539 1 2022 Aaron Nesmith BOS SF
20002714 1 2022 Aaron Wiggins OKC SG
20001721 1 2022 Abdel Nader PHO SF
20002251 2 2020 Abdul Gaddy OKC PG
20002458 1 2021 Adam Mokoka CHI SG
20002619 1 2022 Ade Murkey SAC PF
20002311 1 2022 Admiral Schofield ORL PF
20000783 1 2018 Adreian Payne ORL PF
20002510 1 2022 Ahmad Caver IND PG
20002498 2 2020 Ahmed Hill CHA PG
20000603 1 2022 Al Horford BOS PF
20000750 3 2018 Al Jefferson IND C
20001645 1 2019 Alan Williams BKN PF
20000837 1 2022 Alec Burks NY SG
20001882 1 2018 Alec Peters PHO PF
20002850 1 2022 Aleem Ford ORL SF
20002542 1 2022 Aleksej Pokuševski OKC PF
20002301 3 2021 Alen Smailagic GS PF
20001763 1 2019 Alex Abrines OKC SG
20001801 1 2022 Alex Caruso CHI SG
20000852 1 2022 Alex Len SAC C
DF2:
PlayerID Name Date Started Opponent GameStatus Matchup
20000681 Aaron Gordon 4/1/2022 1 MIN 16
20002005 Aaron Holiday 4/1/2022 0 MEM 21
20002539 Aaron Nesmith 4/1/2022 0 IND 13
20002714 Aaron Wiggins 4/1/2022 1 DET 14
20002311 Admiral Schofield 4/1/2022 0 TOR 10
20000603 Al Horford 4/1/2022 1 IND 13
20002542 Aleksej Pokuševski 4/1/2022 1 DET 14
20000852 Alex Len 4/1/2022 1 HOU 22
You need to specify the column you want to merge on using the on keyword argument:
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']], on='PlayerID')
Otherwise it will merge using all of the shared columns.
Here is the explanation from the pandas docs:
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
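To see why the default matters, here is a tiny made-up illustration (not the asker's actual CSVs): when a shared Name column differs by a stray space, the default merge drops the row even though PlayerID matches, while merging on PlayerID alone keeps it:
import pandas as pd

left = pd.DataFrame({'PlayerID': [20000852], 'Name': ['Alex Len ']})   # note the trailing space
right = pd.DataFrame({'PlayerID': [20000852], 'Name': ['Alex Len'], 'Matchup': [22]})

print(pd.merge(left, right))                  # empty: joins on PlayerID AND Name
print(pd.merge(left, right, on='PlayerID'))   # one row: Name_x / Name_y kept side by side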

Parse urls python

So I'm using this code to get a list of URLs. The thing is that I need a column with the URLs and another one with the tags or the text.
import requests
from bs4 import BeautifulSoup
getpage = requests.get('https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')
all_links = getpage_soup.findAll('a')
for link in all_links:
    print(link)
What i'm expecting is a dataframe similar to this
pd.DataFrame({'link': 'https://drive.google.com/file/d/1t1hLPvUkfCde1wglfjAh--r8NpLONbRf/view?usp=sharing', 'tag': 'Estatal 2020'})
Using your first example of what you need, this may help you:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published"
data = []
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find('div', {'class': 'article-body'})  # get div "article-body"
for ul in div.findAll('ul'):  # get all 'ul' tags inside div "article-body"
    for li in ul.findAll('li'):  # get all 'li' inside 'ul'
        for link in li.findAll('a', href=True):  # get 'a' inside 'li'
            data.append([link['href'], link.text])  # link['href'] = url | link.text = "Estatal 2020"
dataframe = pd.DataFrame(data, columns=['link', 'tag'])
print(dataframe)
[OUTPUT]
link tag
0 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
1 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
2 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
3 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
[/OUTPUT]
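For what it's worth, the same extraction can be written with a single CSS selector, assuming the page structure used above (an "article-body" div containing ul/li lists of links):
data = [[a['href'], a.text] for a in soup.select('div.article-body ul li a[href]')]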
You could try this:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
getpage = requests.get('https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')
all_links = getpage_soup.findAll('a', attrs={'href': re.compile("(^http://)|(^https://)")})  # get all the urls with protocol http or https
data = []
for link in all_links:
    if link.text.strip() == '':  # if the link doesn't have text, add the id
        data.append([link['href'], link.get('id')])
    else:
        data.append([link['href'], link.text.strip()])  # add the text without trailing and leading whitespace
df = pd.DataFrame(data, columns=['link', 'tag'])  # create the dataframe
print(df)
Output:
df
link tag
0 https://coronavirus.gob.mx/ Información importante Coronavirus COVID-19
1 https://www.gob.mx/busqueda?utf8=✓ botbusca
2 https://www.gob.mx/sesnsp/acciones-y-programas... Transparencia
3 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
4 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
5 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
6 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
7 https://drive.google.com/open?id=1mgFsF3rdoYLE... Estatal 2016
8 https://drive.google.com/open?id=1RQhk58-fHNPr... Estatal 2015
9 https://drive.google.com/file/d/1WIzrjJTF24DCX... Estatal 2020
10 https://drive.google.com/open?id=1QtjDM7pczeST... Estatal 2019
11 https://drive.google.com/open?id=15l9hl4eUmFCM... Estatal 2018
12 https://drive.google.com/open?id=1FO4W0HK8cdPk... Estatal 2017
13 https://drive.google.com/open?id=1tDEjJ1XLdFP8... Estatal 2016
14 https://drive.google.com/open?id=1lCeFrMi_D-Gr... Estatal 2015
15 https://drive.google.com/file/d/1q8AdhfxpLdF_l... Estatal 2015 - 2020
16 https://drive.google.com/file/d/1jopZOChRppi6Q... Mayo 2020
17 https://drive.google.com/open?id=1CvHXHC48SYWT... Febrero 2020
18 https://drive.google.com/open?id=1QxUe0HwLNNZH... Enero 2020
19 https://drive.google.com/open?id=1KZzHGdTlH5ya... Diciembre 2019
20 https://drive.google.com/open?id=119VQ5-1JPnWZ... Noviembre 2019
21 https://drive.google.com/open?id=1CbNV3sTkSn3t... Octubre 2019
22 https://drive.google.com/open?id=1gpMM2pi6Ta-r... Septiembre 2019
23 https://drive.google.com/open?id=1dHUhpr-DbOPx... Agosto 2019
24 https://drive.google.com/open?id=18CQlwY07tTaa... Julio 2019
25 https://drive.google.com/open?id=1EnhF4IOFxqLr... Junio 2019
26 https://drive.google.com/open?id=1wrTEwP5Q3xwZ... Mayo 2019
27 https://drive.google.com/open?id=1ZuY20S-5Gi8l... Abril 2019
28 https://drive.google.com/open?id=1P2Xvs7kLLclg... Marzo 2019
29 https://drive.google.com/open?id=16FWEKbbJ83KL... Febrero 2019
30 https://drive.google.com/open?id=1mIw1XKJBY8ZV... Enero 2019
31 https://drive.google.com/open?id=1iTGBC1Ge4UWP... Diciembre 2018
32 https://drive.google.com/open?id=1Kmtir0rhQLf7... Noviembre 2018
33 https://drive.google.com/open?id=1r7SHNfKVXGfe... Octubre 2018
34 https://drive.google.com/open?id=1IKpGJbJuNQKW... Septiembre 2018
35 https://drive.google.com/open?id=1spqdNT0T0pen... Agosto 2018
36 https://drive.google.com/open?id=1k07ZSk2c4irk... Julio 2018
37 https://drive.google.com/open?id=1HX4SlChjRbMm... Junio 2018
38 https://drive.google.com/open?id=1ErSyO9-rfHi3... Mayo 2018
39 https://drive.google.com/open?id=1cK5lR33-mA6-... Abril 2018
40 https://drive.google.com/open?id=1MaqJaSfq2KxB... Marzo 2018
41 https://drive.google.com/open?id=1GaoDPWud-2Iy... Febrero 2018
42 https://drive.google.com/open?id=1OXITYyRrUBwj... Enero 2018
43 https://drive.google.com/file/d/1KwjGdNYez72_z... Estatal 2015 - 2020
44 https://drive.google.com/file/d/14fDk5sBry1DOo... Municipal 2015 - 2020
45 https://www.gob.mx/sesnsp/acciones-y-programas... Regresar al menú principal de Incidencia Delic...
46 https://www.facebook.com/sharer/sharer.php?u=h... Compartir
47 http://www.participa.gob.mx Participa
48 https://datos.gob.mx/ Datos
49 https://www.gob.mx/publicaciones Publicaciones Oficiales
50 https://www.infomex.org.mx/gobiernofederal/hom... Sistema Infomex
51 http://www.inai.org.mx INAI
52 http://www.ordenjuridico.gob.mx Marco Jurídico
53 https://www.facebook.com/gobmexico Facebook
54 https://twitter.com/GobiernoMX Twitter
And if you only want the ones that start with "Estatal", you can add this to the code above:
import numpy as np

mask = np.where(df.tag.str.startswith('Estatal'), True, False)
print(df[mask])
Output:
link tag
3 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
4 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
5 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
6 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
7 https://drive.google.com/open?id=1mgFsF3rdoYLE... Estatal 2016
8 https://drive.google.com/open?id=1RQhk58-fHNPr... Estatal 2015
9 https://drive.google.com/file/d/1WIzrjJTF24DCX... Estatal 2020
10 https://drive.google.com/open?id=1QtjDM7pczeST... Estatal 2019
11 https://drive.google.com/open?id=15l9hl4eUmFCM... Estatal 2018
12 https://drive.google.com/open?id=1FO4W0HK8cdPk... Estatal 2017
13 https://drive.google.com/open?id=1tDEjJ1XLdFP8... Estatal 2016
14 https://drive.google.com/open?id=1lCeFrMi_D-Gr... Estatal 2015
15 https://drive.google.com/file/d/1q8AdhfxpLdF_l... Estatal 2015 - 2020
43 https://drive.google.com/file/d/1KwjGdNYez72_z... Estatal 2015 - 2020
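As a side note, the numpy detour isn't strictly required here: pandas accepts a boolean Series directly, so the same filter can be written as
print(df[df.tag.str.startswith('Estatal')])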

Python, Web Scraping a bar graph

I am currently trying to web scrape the bar graph/chart from this page, but am unsure which specific BeautifulSoup features are needed to extract these types of bar charts. Additionally, if anyone has a link describing which BeautifulSoup features are used for scraping which types of charts/graphs, that would be greatly appreciated. https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/
Here is the code I have so far
import pandas as pd
import requests
from bs4 import BeautifulSoup
dp = 'https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/'
page = requests.get(dp).text
soup = BeautifulSoup(page, 'html.parser')
#This is what I am trying to figure out
new = soup.find("div", id="bar")
print(new)
This script will get all data from the bar graph:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# the chart's underlying data is a plain HTML table whose cells alternate state, number, state, number, ...
tds = soup.select('#statTableHTML td')
data = []
for td1, td2 in zip(tds[::2], tds[1::2]):
    data.append({'State': td1.text, 'Number': td2.text})
df = pd.DataFrame(data)
print(df)
Prints:
State Number
0 Texas 725,368
1 Florida 432,581
2 California 376,666
3 Virginia 356,963
4 Pennsylvania 271,427
5 Georgia 225,993
6 Arizona 204,817
7 North Carolina 181,209
8 Ohio 175,819
9 Alabama 168,265
10 Illinois 147,698
11 Wyoming 134,050
12 Indiana 133,594
13 Maryland 128,289
14 Tennessee 121,140
15 Washington 119,829
16 Louisiana 116,398
17 Colorado 112,691
18 Arkansas 108,801
19 New Mexico 105,836
20 South Carolina 99,283
21 Minnesota 98,585
22 Nevada 96,822
23 Kentucky 93,719
24 Utah 93,440
25 New Jersey 90,217
26 Missouri 88,270
27 Michigan 83,355
28 Oklahoma 83,112
29 New York 82,917
30 Wisconsin 79,639
31 Connecticut 74,877
32 Oregon 74,722
33 District of Columbia 59,832
34 New Hampshire 59,341
35 Idaho 58,797
36 Kansas 54,409
37 Mississippi 52,346
38 West Virginia 41,651
39 Massachusetts 39,886
40 Iowa 36,540
41 South Dakota 31,134
42 Nebraska 29,753
43 Montana 23,476
44 Alaska 20,520
45 North Dakota 19,720
46 Maine 17,410
47 Hawaii 8,665
48 Vermont 7,716
49 Delaware 5,281
50 Rhode Island 4,655
51 *Other US Territories 866
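If you need the Number column as integers (for sorting or plotting), a small follow-up sketch based on the output above:
df['Number'] = df['Number'].str.replace(',', '', regex=False).astype(int)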
Maybe you can find out more about web scraping from this website: https://www.datacamp.com/community/tutorials/web-scraping-using-python

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying-length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
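Worth noting: if the pivot is only a stepping stone to the boxplot, pandas can also group the long format directly (a sketch; requires matplotlib):
import matplotlib.pyplot as plt

df.boxplot(column="Money", by="Year")  # one box per year, straight from the long-format frame
plt.show()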
