Struggling to grab data from baseball reference - python

I'm trying to grab the tables for all pitchers batting against found on this page.
I believe the problem lies with the data being inside an HTML comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td=tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR

The table cannot be extracted directly because it is stored inside an HTML comment, which BeautifulSoup does not parse as markup.
After you have fetched your web page and loaded it into BeautifulSoup, you can work around this with the following steps:
1. Find the div with id = 'all_players_batting_pitching', which contains your table.
2. Extract the table from the comment using the decode_contents function, then load the resulting text into a new soup.
3. Extract each record of your table by looking for the tr tags, then each value by looking for the td tags, keeping only the td positions [0, 3, 12] (Name, IP, HR). These are one lower than the indices [1, 4, 13] in the question because the code takes Rk from the row counter instead of a td cell.
4. Load your values into a pandas.DataFrame, ready to be used.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
# extracting table from html
soup = bs(page.content,"html.parser")
table = soup.find(id = 'all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text,"html.parser")
# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)
# loading records into a DataFrame
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)
Output:
      href                        Rk    Name               IP    HR
0     /players/a/abbotco01.shtml  1     Cory Abbott        48.0  12
1     /players/a/abreual01.shtml  2     Albert Abreu       38.2  5
2     /players/a/abreual01.shtml  3     Albert Abreu       8.2   2
3     /players/a/abreual01.shtml  4     Albert Abreu       4.1   1
4     /players/a/abreual01.shtml  5     Albert Abreu       25.2  2
...   ...                         ...   ...                ...   ...
1063  /players/z/zastrro01.shtml  1106  Rob Zastryzny*     1.0   0
1064  /players/z/zastrro01.shtml  1107  Rob Zastryzny*     3.0   0
1065  /players/z/zerpaan01.shtml  1108  Angel Zerpa*       11.0  2
1066  /players/z/zeuchtj01.shtml  1109  T.J. Zeuch         10.2  5
1067  /players/z/zimmebr02.shtml  1110  Bruce Zimmermann*  73.2  21
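Splitting on '--' works as long as no '--' appears inside the table text itself. A possibly more robust variant is to ask BeautifulSoup for the comment nodes directly through its Comment class. The sketch below runs on a small made-up HTML string instead of the live page, and the helper name commented_tables is hypothetical:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_players_batting_pitching">
  <!--
  <table>
    <tr><th>Rk</th><th>Name</th><th>HR</th></tr>
    <tr><th>1</th><td><a href="/players/a/abbotco01.shtml">Cory Abbott</a></td><td>12</td></tr>
  </table>
  -->
</div>
"""

def commented_tables(html, container_id):
    """Return the <table> tags hidden in comments under the given container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    tables = []
    # Comment is a NavigableString subclass, so it can be matched with string=
    for comment in container.find_all(string=lambda s: isinstance(s, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        tables.extend(inner.find_all("table"))
    return tables

tables = commented_tables(SAMPLE_HTML, "all_players_batting_pitching")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in tables[0].find_all("tr")]
print(rows)
```

With the live page you would pass page.content instead of SAMPLE_HTML; the rest of the record extraction stays the same.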

Related

How do I use Python and BeautifulSoup to scrape data from an html table?

If you look at this page, https://metals-api.com/currencies, there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify, I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data=[]
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    code =code.text
    data.append(code)
df=pd.DataFrame(data,columns=['code'])
#df.to_csv('code.csv',index=False)# to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
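Since the goal is a plain list for a dropdown menu, the DataFrame step is optional. Here is a minimal sketch of the same selector idea that ends in a Python list; the HTML string is a made-up stand-in for the real page, so the markup is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the currencies page.
SAMPLE_HTML = """
<table class="table">
  <thead><tr><th>Code</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>XAG</td><td>Silver</td></tr>
    <tr><td>XPT</td><td>Platinum</td></tr>
    <tr><td>XPD</td><td>Palladium</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# td:first-child selects the same cells as td:nth-child(1)
tickers = [td.get_text(strip=True)
           for td in soup.select("table tbody tr td:first-child")]
print(tickers)
```

The resulting list can be handed straight to whatever builds the dropdown.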
I stand corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works but did not here. With a very slight workaround, fetching the page with requests first, it still works just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
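One caveat: pandas 2.1 and later warn when read_html receives literal HTML instead of a path, URL, or file-like object. Wrapping the text in io.StringIO avoids the warning. The sketch below uses a small made-up HTML string, and like any read_html call it needs lxml (or bs4 plus html5lib) installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical snippet standing in for the downloaded page text.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Code</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>XAG</td><td>Silver</td></tr>
    <tr><td>XPT</td><td>Platinum</td></tr>
  </tbody>
</table>
"""

# With a live page this would be StringIO(page.text)
df = pd.read_html(StringIO(SAMPLE_HTML))[0]
print(df)
```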

Python web scraping on tables with customizable dropdowns

I am a novice who is trying to put together my first webpage scraping function, and running into some HTTP problems that I don't fully understand. For reference, I am trying to scrape from FanGraphs ZIPS projection page.
Their page defaults to displaying a table with 30 players, but you can edit the webpage such that the table displays 10, 20, 30, 50, 100, 500 or 1000 results per page. Ideally I would select the option for 500 results per page, but when I change the dropdown menu, the URL remains unchanged. So I'm trying to write a script that gets the HTML code and edits the attributes of the HTML page before requesting that variation of the webpage. (If you view the page source, it's on line 1247.)
Alternatively, you can cycle through additional pages with arrow button hyperlinks on the page, but they do not change the URL either.
I got around this problem for positional players since I can scrape each position separately and then concatenate the separate Pandas data frames, but I'm pretty oblivious about how HTTP requests work and was hoping for some help / guidance.
I tried something like:
from requests import Request, Session
url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=pit&type=zips&team=0&lg=al&players=0'
page = requests.get(url)
soup=BeautifulSoup(page.text, 'html.parser')
resultSet = soup.findAll(value=30)
for result in resultSet:
    result["value"]="500"
s = Session()
req = requests.Request('POST', url, data = {'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_Input value': 500})
resp = s.send(prepped)
print(resp.status_code)
My thought was to scrape the structure, edit the configuration to match my desired webpage, and then regurgitate the fixed structure to scrape the data. Then I realized that I had no idea what I was doing.
A few things:
Am I "not even wrong" here? Is this possible with requests or would I need something like Selenium? Am I improperly using POST?
Side note: If context helps, for positional players I did this (would be doing the same thing for pitchers):
base_url = 'https://www.fangraphs.com/projections.aspx?pos=&stats=bat&type=zips&team=0&lg=all&players=0'
positions = ['c', '1b', '2b', '3b', 'ss', 'lf', 'cf', 'rf']
def generate_positional_urls(urlRoot=base_url, pos=positions):
    index = urlRoot.index('pos=') + 4  # position just after 'pos='
    urlList = []
    numPositions = len(pos)
    for i in range(numPositions):
        position = pos[i]
        tempURL = urlRoot[:index] + position + urlRoot[index:]
        urlList.append(tempURL)
    return urlList
Which feeds into this:
def generate_df(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    masterTable = soup.findAll('table', {"class": "rgMasterTable"}, {"id": "ProjectionBoard1_dg1_ct100"})
    table_rows = masterTable[0].find_all('tr')
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        data.append(row)
    headers = masterTable[0].find_all("thead")
    colElements = headers[0].find_all("tr")
    soupColResults = colElements[1].findChildren("a")
    colStrings = [element.text for element in soupColResults]
    df = pd.DataFrame(data)
    df.columns = colStrings
    df.drop(columns=[''], inplace=True)
    df = df.iloc[3:, :]
    return df
Then it's easy enough to loop through and concatenate data frames, but for this HTTP problem, I really don't know what I am doing.
Note: I am doing this to try to learn and practice Python. I can easily avoid this by using my web browser to click on the hyperlink to export the data to a csv, so don't exert too much effort figuring out unnecessary tedium.
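As a side note on the positional URLs, the same list can be built from a dict of query parameters with urllib.parse.urlencode rather than by slicing the base URL string. This is only a sketch; the helper name positional_urls is made up, and the parameter values are the ones from the base URL above:

```python
from urllib.parse import urlencode

BASE = "https://www.fangraphs.com/projections.aspx"
POSITIONS = ["c", "1b", "2b", "3b", "ss", "lf", "cf", "rf"]

def positional_urls(positions=POSITIONS):
    """Build one projections URL per position from a dict of query parameters."""
    urls = []
    for pos in positions:
        query = {"pos": pos, "stats": "bat", "type": "zips",
                 "team": "0", "lg": "all", "players": "0"}
        urls.append(f"{BASE}?{urlencode(query)}")
    return urls

for url in positional_urls():
    print(url)
```

requests can do the same encoding for you if you pass the dict as params=.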
First of all, you are trying to scrape an html table into a pandas dataframe. This can be done without BeautifulSoup using pd.read_html.
Second, always keep track of what is happening in the network tab of Chrome or Firefox. You can see what gets loaded when you click to change the number of results. You can copy the request as curl and then convert this to a python request, for example using this tool. You can now load the data and pass them to pandas directly. Note that I had to remove __VIEWSTATE and __EVENTVALIDATION for readability, but you can get them from the page source code or from the curl command as indicated above.
import requests
import pandas as pd
params = (
    ('pos', 'all'),
    ('stats', 'pit'),
    ('type', 'zips'),
    ('team', '0'),
    ('lg', 'al'),
    ('players', '0'),
)
data = {
    'RadScriptManager1_TSM': '',
    '__EVENTTARGET': 'ProjectionBoard1$dg1',
    '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
    '__VIEWSTATE': '', #text removed for readability
    '__VIEWSTATEGENERATOR': 'C239D6F0',
    '__SCROLLPOSITIONX': '0',
    '__SCROLLPOSITIONY': '0',
    '__EVENTVALIDATION': '', #text removed for readability
    'ProjectionBoard1_tsStats_ClientState': '{"selectedIndexes":["1"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsPosition_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$rcbTeam': 'All Teams',
    'ProjectionBoard1_rcbTeam_ClientState': '',
    'ProjectionBoard1$rcbLeague': 'AL',
    'ProjectionBoard1_rcbLeague_ClientState': '',
    'ProjectionBoard1_tsProj_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsUpdate_ClientState': '{"selectedIndexes":[],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox': '1000',
    'ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState': '{"logEntries":[],"value":"1000","text":"1000","enabled":true,"checkedIndices":[],"checkedItemsTextOverflows":false}',
    'ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox': '30',
    'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState': '',
    'ProjectionBoard1_dg1_ClientState': ''
}
response = requests.post('https://www.fangraphs.com/projections.aspx', params=params, data=data)
df = pd.read_html(response.text)
read_html returns a list of all tables as DataFrames; the table you want can be accessed as df[8]. Output of df[8].head():
   Name           Unnamed: 1  Team  W   L   ERA   GS  G   IP     H    ER  HR  SO   BB  WHIP  K/9    BB/9  FIP   WAR  ADP
0  Lucas Giolito  nan         CHW   16  7   3     30  30  180    129  60  21  248  56  1.03  12.4   2.8   2.98  5.8  18.3
1  Gerrit Cole    nan         NYY   16  8   3.21  32  32  193.3  155  69  27  260  49  1.06  12.1   2.28  3.18  5.7  7.4
2  Shane Bieber   nan         CLE   14  9   3.51  32  32  197.7  177  77  26  230  36  1.08  10.47  1.64  3.23  5.4  10
3  Jose Berrios   nan         MIN   13  10  3.89  31  31  180.3  167  78  24  187  56  1.24  9.33   2.79  3.98  3.5  82.9
4  Chris Sale     nan         BOS   9   5   3.35  20  20  121    103  45  14  147  28  1.08  10.93  2.08  3.23  3.4  297
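The __VIEWSTATE and __EVENTVALIDATION values left blank above normally sit in hidden input fields of the page, so one way to fill them is to read every hidden input from a first GET of the page. A rough sketch on a made-up form fragment follows; the helper name hidden_fields and the dummy values are assumptions, not the real page content:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of an ASP.NET page; real values are long encoded strings.
SAMPLE_HTML = """
<form method="post" action="./projections.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy-viewstate" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C239D6F0" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="dummy-validation" />
  <input type="text" name="search" value="" />
</form>
"""

def hidden_fields(html):
    """Collect the name/value pairs of every hidden <input> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return {tag["name"]: tag.get("value", "")
            for tag in soup.find_all("input", type="hidden")
            if tag.has_attr("name")}

fields = hidden_fields(SAMPLE_HTML)
print(fields["__VIEWSTATE"])
```

With the live site you would presumably call it on the text of a GET response and merge the result into the data dict before posting, ideally within one requests.Session so cookies carry over.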

How to loop through scraped items and add them to a dictionary or pandas dataframe?

For a project I'm scraping data from futbin players, and I would like to add that scraped data to a dict or pandas DataFrame. I've been stuck for a couple of hours and would like some help if possible. My code so far is below. It only prints out the data, and I'm clueless about what to do from there.
Code:
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    info = soup.find('div', id='info_content')
    rows = info.find_all('td')
    for info in rows:
        print(info.text.strip())
The work you have already done to identify the table you want is good. From there:
1. Use read_html() to convert it to a DataFrame.
2. Apply basic transforms to turn the key-value pairs into columns.
3. Use a list comprehension to get the details of all wanted footballers.
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
def myhtml(url):
    # use BS4 to get table that has required data
    html = str(BeautifulSoup(requests.get(url).content, 'html.parser').find('div', id='info_content').find("table"))
    # read_html() returns a list, take first one, first column are attribute name, transpose to build DF
    return pd.read_html(html)[0].set_index(0).T
df = pd.concat([myhtml(u) for u in urls])
   Name                      Club          Nation  League      Skills  Weak Foot  Intl. Rep  Foot   Height        Weight  Revision  Def. WR  Att. WR  Added on    Origin  R.Face  B.Type  DOB         Robert Lewandowski FIFA 21 Career Mode  Age
1  Edson Arantes Nascimento  FUT 21 ICONS  Brazil  Icons       5       4          5          Right  173cm | 5'8"  70      Icon      Med      High     2020-09-10  Prime   nan     Unique  23-10-1940  nan
1  Robert Lewandowski        FC Bayern     Poland  Bundesliga  4       4          4          Right  184cm | 6'0"  80      TOTY      Med      High     2021-01-22  TOTY    nan     Unique  nan         Robert Lewandowski FIFA 21 Career Mode
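If you would rather keep your own loop than go through read_html, another option is to build one dict per player and hand the list of dicts to pandas. The sketch below assumes each table row holds a label cell followed by a value cell; the HTML strings are made-up stand-ins for the downloaded pages, and player_record is a hypothetical helper name:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-ins for the two downloaded player pages.
PAGES = {
    "pele": """<div id="info_content"><table>
        <tr><th>Name</th><td>Edson Arantes Nascimento</td></tr>
        <tr><th>Nation</th><td>Brazil</td></tr>
        <tr><th>Foot</th><td>Right</td></tr></table></div>""",
    "robert-lewandowski": """<div id="info_content"><table>
        <tr><th>Name</th><td>Robert Lewandowski</td></tr>
        <tr><th>Nation</th><td>Poland</td></tr>
        <tr><th>Foot</th><td>Right</td></tr></table></div>""",
}

def player_record(html):
    """Turn the label/value rows of the info table into one dict."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("div", id="info_content")
    record = {}
    for tr in info.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if len(cells) >= 2:  # skip rows without a label/value pair
            record[cells[0]] = cells[1]
    return record

records = [player_record(html) for html in PAGES.values()]
df = pd.DataFrame(records)
print(df)
```

Labels that are missing for one player simply come out as NaN in that row.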
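I would do it with open() and write():
file = open("filename.txt", "w")
The "w" mode means the following:
"w" - Write - Opens a file for writing, creates the file if it does not exist, and overwrites it if it does
And then:
file.write(text_to_save)
file.close()
open() is a built-in, so no import is required for it. Import os.path only if you need to build or check the file path:
import os.path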
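A with block is usually the safer way to write the file, because it closes the file even if an error occurs partway through. A minimal sketch, where the helper name save_lines and the temp-file location are just for illustration:

```python
import os
import tempfile

def save_lines(lines, path):
    """Write each scraped string on its own line; the file is closed automatically."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

path = os.path.join(tempfile.gettempdir(), "players.txt")
save_lines(["Edson Arantes Nascimento", "Robert Lewandowski"], path)

with open(path, encoding="utf-8") as f:
    print(f.read())
```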

bs4: How to extract table "team-roster home" and "team-roster away"

I'm trying to get the team lineup for the home and away teams from here: http://www.veikkausliiga.com/tilastot/2016/veikkausliiga/ottelut/498251/kokoonpanot/.
I have tried using tr and td tags, as well as with id and class tags. I cannot get any information from the table.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
def make_soup(url):
    thePage = urllib.request.urlopen(url)
    soupData = BeautifulSoup(thePage, "html.parser")
    return soupData
def makeTableData(url):
    matchDataSaved= ""
    soup = make_soup(url)
    table = soup.find("table",{"class":"team-rosters home"})
    return table
I expected the table rows to show, but the code outputs None. How can I get the players lineup?
I'm not sure how you want the output. Your code does work, in that it returns the element that has that table tag. You'd still need to go through it to get the tr and td tags.
The other option, since the page does have <table> tags, is to use pandas and then manipulate the DataFrames to extract what you want in the format you want. The result is probably not in the ideal form, but it gets you the data and you can work with it from there:
import pandas as pd
url = 'http://www.veikkausliiga.com/tilastot/2016/veikkausliiga/ottelut/498251/kokoonpanot/'
tables = pd.read_html(url)
ix_list = [2,3,4,5,7,8,9,10]
# DataFrame.append was removed in pandas 2.0, so concatenate the selected tables instead
results = pd.concat([tables[ix] for ix in ix_list])
results = results.reset_index(drop=True)
Output:
print (results)
0 ... 3
0 #21 T. Dähne ... NaN
1 #15 V. Jalasto ... #3 Taiwo
2 #20 Vincent ... NaN
3 #7 N. Alho ... NaN
4 #25 D. Assis ... NaN
5 #9 B. Span ... #11 J. Ibrahim
6 #4 P. Sparrdal Mantilla ... #7 T. Wirtanen
7 #1 W. Viitala ... NaN
[8 rows x 4 columns]
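If find() returns None for a multi-word class, one thing worth checking is how the class attribute is actually written on the page. BeautifulSoup matches a string such as "team-rosters home" only against the exact attribute value, while a CSS selector matches each class independently. The markup below is made up to show the difference; the real class names, order, and players may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's class names and order may differ.
SAMPLE_HTML = """
<table class="team-roster home extra"><tr><td>Player A</td></tr></table>
<table class="team-roster away"><tr><td>Player B</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word class string must equal the whole attribute, so this finds nothing here
exact = soup.find("table", {"class": "team-roster home"})
print(exact)

# A CSS selector matches each class on its own, regardless of order or extra classes
home = soup.select_one("table.team-roster.home")
away = soup.select_one("table.team-roster.away")
print(home.td.get_text(strip=True))
print(away.td.get_text(strip=True))
```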

scraping table from site - finding blank cells, python

Here's the site I'm working with: http://www.fantasypros.com/mlb/probable-pitchers.php
What I want to do is run the code every day and have it return a list of the pitchers who are pitching that day, so just the first column. Here's what I have so far.
from bs4 import BeautifulSoup
import requests
url = 'http://www.fantasypros.com/mlb/probable-pitchers.php'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table',{'class': 'table table-condensed'})
table2 = table.find('tbody') #this finds just the rows with pitchers (excludes dates)
daysOnPage = []
for row in table.findAll('th'):
    daysOnPage.append(row.text)
daysOnPage.pop(0)
#print(daysOnPage)
pitchers = []
for row in table2.findAll('a', {'class': 'available mpb-available'}):
    pitchers.append(row.text)
This returns a list of every pitcher on the page. If every cell on the table was always filled, I could do something like deleting every nth player or something like that, but that seems pretty inelegant, and also doesn't work since you don't ever know which cells will be blank. I've looked through the table2.prettify() code but I can't find anything that indicates to me where a blank cell is coming.
Thanks for the help.
Edit: Tinkering a little bit, I've figured this much out:
for row in table2.find('tr'):
    for a in row.findAll('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)
    continue
That prints the first row of pitchers, which is also a problem I was going to tackle later. Why does the continue not make it iterate through the rows?
When I hear table, I think pandas. You can have pandas.read_html do the parsing for you, then use pandas.Series.dropna to return only valid values.
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('http://www.fantasypros.com/mlb/probable-pitchers.php')
In [3]: df = dfs[0].head(10) # get the first dataframe and just use the first 10 teams for this example
In [4]: print(df['Thu Aug 6']) # Selecting just one day by label
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
2 NaN
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
6 NaN
7 NaN
8 NaN
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
In [5]: active = df['Thu Aug 6'].dropna() # now just drop any fields that are NaNs
In [6]: print(active)
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
I suppose the last thing you'll want to do is parse the strings in the table to get just the pitcher's name.
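One possible way to do that parsing is a regular expression. The sketch below assumes every cell looks like the ones shown above: an optional '#', a two- or three-letter team code, then a first initial, a period, and the surname, followed by the record in parentheses. The pattern and the helper name pitcher_name are my own guesses at the format, not something the site documents:

```python
import re

# Optional '#', a 2-3 letter team code, then "<initial>. <surname>" up to the "(" of the record
PITCHER_RE = re.compile(r"^#?[A-Z]{2,3}([A-Z]\. [^(]+)\(")

def pitcher_name(cell):
    """Return the pitcher's name from a cell string, or None if it does not match."""
    match = PITCHER_RE.match(cell)
    return match.group(1).strip() if match else None

cells = [
    "#WSHJ. Hellickson(7-6)SP 124",
    "MIAM. Wisler(5-1)SP 306",
    "SFK. Hendricks(4-5)SP 51",
]
print([pitcher_name(c) for c in cells])
```

On a Series you could apply the same function with active.map(pitcher_name).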
If you want to write the Series to a csv, you can do so directly by:
In [7]: active.to_csv('active.csv')
This gives you a csv that looks something like this:
0,#WSHJ. Hellickson(7-6)SP 126
1,MIAM. Wisler(5-1)SP 306
3,#NYYE. Rodriguez(6-3)SP 179
4,SFK. Hendricks(4-5)SP 51
5,STLM. Lorenzen(3-6)SP 301
9,KCB. Farmer(0-2)SP 267
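On the loop from the edit to the question: find('tr') returns only the first row, so the for loop walks that one row's children, and continue has nothing to do with moving to the next row. find_all('tr') yields every row. If you want to stay with BeautifulSoup, walking the cells row by row also shows where the blanks are. The table below is a made-up miniature of the layout described in the question, with placeholder team codes and pitcher names:

```python
from bs4 import BeautifulSoup

# Hypothetical table: one column per day, some cells blank.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Team</th><th>Thu Aug 6</th><th>Fri Aug 7</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td><a class="available mpb-available">A. Pitcher</a></td><td></td></tr>
    <tr><td>BBB</td><td></td><td><a class="available mpb-available">B. Pitcher</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
days = [th.get_text(strip=True) for th in soup.select("thead th")][1:]  # skip the team column

pitchers_by_day = {day: [] for day in days}
# find_all("tr") yields every row; find("tr") would return only the first one
for tr in soup.find("tbody").find_all("tr"):
    day_cells = tr.find_all("td")[1:]
    for day, td in zip(days, day_cells):
        link = td.find("a", class_="available mpb-available")
        if link is not None:  # a blank cell has no link, so it is skipped
            pitchers_by_day[day].append(link.get_text(strip=True))

print(pitchers_by_day)
```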
