I am a novice who is trying to put together my first webpage scraping function, and running into some HTTP problems that I don't fully understand. For reference, I am trying to scrape from FanGraphs ZIPS projection page.
Their page defaults to displaying a table with 30 players, but you can edit the webpage so that the table displays 10, 20, 30, 50, 100, 500 or 1000 results per page. Ideally I would select the option for 500 results per page, but when I change the dropdown menu, the URL remains unchanged, so I'm trying to write a script that gets the HTML, edits the attributes of the page, and then requests that variation of the webpage. (If you view the page source, it's on line 1247)
Alternatively, you can cycle through additional pages with arrow button hyperlinks on the page, but they do not change the URL either.
I got around this problem for positional players since I can scrape each position separately and then concatenate the separate Pandas data frames, but I'm pretty oblivious about how HTTP requests work and was hoping for some help / guidance.
I tried something like:
import requests
from requests import Request, Session
from bs4 import BeautifulSoup

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=pit&type=zips&team=0&lg=al&players=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# find the dropdown <option> elements whose value is "30" and change them to "500"
resultSet = soup.findAll(value="30")
for result in resultSet:
    result["value"] = "500"

s = Session()
req = Request('POST', url, data={'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_Input value': 500})
prepped = s.prepare_request(req)
resp = s.send(prepped)
print(resp.status_code)
My thought was to scrape the structure, edit the configuration to match my desired webpage, and then regurgitate the fixed structure to scrape the data. Then I realized that I had no idea what I was doing.
A few things:
Am I "not even wrong" here? Is this possible with requests or would I need something like Selenium? Am I improperly using POST?
Side note: If context helps, for positional players I did this (would be doing the same thing for pitchers):
base_url = 'https://www.fangraphs.com/projections.aspx?pos=&stats=bat&type=zips&team=0&lg=all&players=0'
positions = ['c', '1b', '2b', '3b', 'ss', 'lf', 'cf', 'rf']

def generate_positional_urls(urlRoot=base_url, pos=positions):
    index = urlRoot.index('pos=') + 4  # position right after 'pos='
    urlList = []
    for position in pos:
        # splice the position code into the URL
        tempURL = urlRoot[:index] + position + urlRoot[index:]
        urlList.append(tempURL)
    return urlList
Which feeds into this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

def generate_df(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # the class filter selects the projection grid (table id on the page: ProjectionBoard1_dg1_ct100)
    masterTable = soup.find_all('table', {"class": "rgMasterTable"})
    table_rows = masterTable[0].find_all('tr')
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [cell.text for cell in td]
        data.append(row)
    headers = masterTable[0].find_all("thead")
    colElements = headers[0].find_all("tr")
    soupColResults = colElements[1].findChildren("a")
    colStrings = [element.text for element in soupColResults]
    df = pd.DataFrame(data)
    df.columns = colStrings
    df.drop(columns=[''], inplace=True)
    df = df.iloc[3:, :]
    return df
Then it's easy enough to loop through and concatenate data frames, but for this HTTP problem, I really don't know what I am doing.
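(For reference, the loop-and-concatenate step I mean is roughly the following, using the two helper functions above:)
import pandas as pd

# one frame per position, stacked into a single DataFrame
frames = [generate_df(u) for u in generate_positional_urls()]
batters = pd.concat(frames, ignore_index=True)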
Note: I am doing this to try to learn and practice Python. I can easily avoid this by using my web browser to click on the hyperlink to export the data to a csv, so don't exert too much effort figuring out unnecessary tedium.
First of all, you are trying to scrape an html table into a pandas dataframe. This can be done without BeautifulSoup using pd.read_html.
Second, always keep track of what is happening in the network tab of Chrome or Firefox. You can see what gets loaded when you click to change the number of results. You can copy the request as curl and then convert it to a Python request, for example using this tool. You can now load the data and pass it to pandas directly. Note that I had to remove __VIEWSTATE and __EVENTVALIDATION for readability, but you can get them from the page source code or from the curl command as indicated above.
import requests
import pandas as pd

params = (
    ('pos', 'all'),
    ('stats', 'pit'),
    ('type', 'zips'),
    ('team', '0'),
    ('lg', 'al'),
    ('players', '0'),
)

data = {
    'RadScriptManager1_TSM': '',
    '__EVENTTARGET': 'ProjectionBoard1$dg1',
    '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
    '__VIEWSTATE': '',  # text removed for readability
    '__VIEWSTATEGENERATOR': 'C239D6F0',
    '__SCROLLPOSITIONX': '0',
    '__SCROLLPOSITIONY': '0',
    '__EVENTVALIDATION': '',  # text removed for readability
    'ProjectionBoard1_tsStats_ClientState': '{"selectedIndexes":["1"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsPosition_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$rcbTeam': 'All Teams',
    'ProjectionBoard1_rcbTeam_ClientState': '',
    'ProjectionBoard1$rcbLeague': 'AL',
    'ProjectionBoard1_rcbLeague_ClientState': '',
    'ProjectionBoard1_tsProj_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsUpdate_ClientState': '{"selectedIndexes":[],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox': '1000',
    'ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState': '{"logEntries":[],"value":"1000","text":"1000","enabled":true,"checkedIndices":[],"checkedItemsTextOverflows":false}',
    'ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox': '30',
    'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState': '',
    'ProjectionBoard1_dg1_ClientState': ''
}

response = requests.post('https://www.fangraphs.com/projections.aspx', params=params, data=data)
df = pd.read_html(response.text)
read_html returns a list of all tables as dataframes; the table you want can be accessed as df[8]. Output of df[8].head():
|   | Name | Unnamed: 1 | Team | W | L | ERA | GS | G | IP | H | ER | HR | SO | BB | WHIP | K/9 | BB/9 | FIP | WAR | ADP |
|---|------|------------|------|---|---|-----|----|---|----|---|----|----|----|----|------|-----|------|-----|-----|-----|
| 0 | Lucas Giolito | nan | CHW | 16 | 7 | 3 | 30 | 30 | 180 | 129 | 60 | 21 | 248 | 56 | 1.03 | 12.4 | 2.8 | 2.98 | 5.8 | 18.3 |
| 1 | Gerrit Cole | nan | NYY | 16 | 8 | 3.21 | 32 | 32 | 193.3 | 155 | 69 | 27 | 260 | 49 | 1.06 | 12.1 | 2.28 | 3.18 | 5.7 | 7.4 |
| 2 | Shane Bieber | nan | CLE | 14 | 9 | 3.51 | 32 | 32 | 197.7 | 177 | 77 | 26 | 230 | 36 | 1.08 | 10.47 | 1.64 | 3.23 | 5.4 | 10 |
| 3 | Jose Berrios | nan | MIN | 13 | 10 | 3.89 | 31 | 31 | 180.3 | 167 | 78 | 24 | 187 | 56 | 1.24 | 9.33 | 2.79 | 3.98 | 3.5 | 82.9 |
| 4 | Chris Sale | nan | BOS | 9 | 5 | 3.35 | 20 | 20 | 121 | 103 | 45 | 14 | 147 | 28 | 1.08 | 10.93 | 2.08 | 3.23 | 3.4 | 297 |
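As a side note, instead of hard-coding df[8], pd.read_html accepts a match argument that keeps only tables whose text contains a given string. A minimal sketch, assuming the projection table is the only one on the page containing the header 'WHIP':
import pandas as pd

# keep only the tables whose text contains 'WHIP' (assumed unique to the projection table)
tables = pd.read_html(response.text, match='WHIP')
projections = tables[0]
print(projections.head())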
Related
I'm trying to grab the batting-against tables for all pitchers found on this page. I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
soup = bs(page.content, "html.parser")

for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside a comment string.
After you have fetched the web page and loaded it into BeautifulSoup, you can solve this scraping issue with the following steps:
- gather the div tagged id = 'all_players_batting_pitching', which contains your table
- extract the table from the comment using the decode_contents function, then reload that text into a new soup
- extract each record of the table by looking for the tr tag, then each value by looking for the td tag, keeping only the desired columns (Name, IP, HR, which are td indices 0, 3, 12 because Rk is a row-header cell)
- load the values into a pandas DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)

# extracting table from html (the table is wrapped in an HTML comment)
soup = bs(page.content, "html.parser")
table = soup.find(id='all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text, "html.parser")

# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)

# loading records into a DataFrame
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
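As a side note, once tab_text has been pulled out of the comment, pandas can also parse the whole table in one call. A minimal sketch reusing the tab_text variable above (the href column would still need the BeautifulSoup loop, since read_html drops link targets):
import pandas as pd

# parse the table that was hidden inside the comment
stats_df = pd.read_html(tab_text)[0]
print(stats_df.head())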
https://training.lczero.org/networks/?show_all=1
I want to extract the columns called Number, Run, Network, Elo, Games from this website. I was able to do so using Pandas, but then the pd.read_html() function does not extract the href values that I need to be able to download this data. I tried using BeautifulSoup but did not land anywhere. I managed to get all the urls but I need the other columns as well to make sense of it. Can anyone please help?
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://training.lczero.org/networks/?show_all=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
df = pd.read_html(str(soup))[0]
df["links"] = [
"https://training.lczero.org" + a["href"] for a in soup.select("td > a")
]
print(df.head())
Prints:
Number Run Network Elo Games Blocks Filters Time Ordo Elo links
0 805799 1 a13e6d41 141.26 12533 15 512 2022-10-22 12:33:33 +00:00 0 https://training.lczero.org/get_network?sha=a13e6d412e4d7a113ca604647a6f56845ad280b5584ede96ca6a7658dba7f897
1 805798 1 d6eea775 138.51 63008 15 512 2022-10-22 11:57:32 +00:00 0 https://training.lczero.org/get_network?sha=d6eea77581d45a0e3bc46203baa10eb94b7e345e15c246f0d18b98b9d5d425f6
2 805797 1 cdffe453 133.00 65478 15 512 2022-10-22 11:20:34 +00:00 133 https://training.lczero.org/get_network?sha=cdffe45321e8a843eabc7c6ee71254647b31b5a8798440035ee2b222acc3162a
3 805796 1 6271053e 131.00 66486 15 512 2022-10-22 10:43:30 +00:00 131 https://training.lczero.org/get_network?sha=6271053e90de21c67a25ba23981d8f03e888a4f7afe543f736a057ebb5d07fec
4 805795 1 0b03a5b0 136.00 63894 15 512 2022-10-22 10:07:32 +00:00 136 https://training.lczero.org/get_network?sha=0b03a5b0dbc019e936f075e6f5eacc603d888970e56bb12c6e747b05fda09b86
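One thing to watch: the list comprehension assumes every table row contains exactly one link in a td cell; otherwise the column assignment fails or misaligns. A quick sanity check (a sketch reusing soup and df from above):
links = ["https://training.lczero.org" + a["href"] for a in soup.select("td > a")]
# the assignment only lines up if there is exactly one link per row
assert len(links) == len(df), (len(links), len(df))
df["links"] = links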
If you look at this page https://metals-api.com/currencies there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify I am not looking to run some fetch price commands against these tickers, I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data=[]
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    data.append(code.text)

df = pd.DataFrame(data, columns=['code'])
#df.to_csv('code.csv', index=False)  # to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
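Since the end goal is a dropdown of tickers rather than a DataFrame, the column converts straight to a plain Python list:
# plain list of ticker codes for the dropdown
tickers = df['code'].tolist()
print(tickers[:5])  # ['XAU', 'XAG', 'XPT', 'XPD', 'XCU']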
I sit corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works, but apparently with a very slight workaround it can still work just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
I'm trying to make an app that scrapes my top ten favorite space-related stock prices, but I have some trouble with my code and I'm new to scraping. Once I get this to work, I want to put it into a csv file and make a bar graph with it; I would love some help and suggestions. Also, I'm doing this in Anaconda:
#import libraries
import bs4
from bs4 import BeautifulSoup
#grequests is a unique library that allows you to use many urls with ease
#must install grequests in Anaconda, use: conda install -c conda-forge grequests
#if you know a better way to do this, please let me know
import grequests

#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

unsent_request = (grequests.get(url) for url in urls)
results = grequests.map(unsent_request)

def parsePrice(r):
    soup = bs4.BeautifulSoup(r.text, "html")
    price = soup.find_all('div', {'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
    return price

for r in results:
    parsePrice(r)
So what in my code is causing this error:
IndexError Traceback (most recent call last)
<ipython-input-6-9ac8cb94b6fb> in <module>
5
6 for r in results:
----> 7 parsePrice(r)
<ipython-input-6-9ac8cb94b6fb> in parsePrice(r)
1 def parsePrice(r):
2 soup = bs4.BeautifulSoup(r.text,"html")
----> 3 price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
4 return price
5
IndexError: list index out of range
What's going wrong?
The data on the page is with in <table> tags. Use pandas' .read_html(), as it uses BeautifulSoup under the hood. That way you can grab more.
That data is also available through an API/XHR request, but I won't get into that, as it's slightly more complex.
import pandas as pd

#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    df = pd.read_html(r)[0].T
    cols = list(df.iloc[0, :])
    temp_df = pd.DataFrame([list(df.iloc[1, :])], columns=cols)
    temp_df['url'] = r
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)

df.to_csv('path/filename.csv', index=False)
Output:
print (df.to_string())
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...
But IF you must go the route of BeautifulSoup, your find_all() is incorrect. First, the class is strictly the text between the double quotes after class=. You've included other attributes of the element, such as data-reactid, and the actual content/text that you want to pull. Secondly, that class is a child of the <span> tag, not the div tag. If you pull the div tag, that's fine, but then you'd still need to go inside THAT element to get the text.
Give this a try:
import bs4
import requests

#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    resp = requests.get(r)
    soup = bs4.BeautifulSoup(resp.text, "html")
    price = soup.find_all('span', {'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'})[0].text
    return price

for r in urls:
    print(parsePrice(r))
Output:
8.06
32.76
6.60
72.22
215.54
145.14
343.28
4.1550
7.43
348.32
Here's the site I'm working with: http://www.fantasypros.com/mlb/probable-pitchers.php
What I want to do is run the code every day and have it return a list of pitchers that are pitching that day, so just the first column. Here's what I have so far.
from bs4 import BeautifulSoup
import requests

url = 'http://www.fantasypros.com/mlb/probable-pitchers.php'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table', {'class': 'table table-condensed'})
table2 = table.find('tbody')  # this finds just the rows with pitchers (excludes dates)

daysOnPage = []
for row in table.findAll('th'):
    daysOnPage.append(row.text)
daysOnPage.pop(0)
#print(daysOnPage)

pitchers = []
for row in table2.findAll('a', {'class': 'available mpb-available'}):
    pitchers.append(row.text)
This returns a list of every pitcher on the page. If every cell in the table were always filled, I could do something like deleting every nth player, but that seems pretty inelegant, and it also doesn't work since you never know which cells will be blank. I've looked through the table2.prettify() output but I can't find anything that indicates to me where a blank cell is coming from.
Thanks for the help.
Edit: Tinkering a little bit, I've figured this much out:
for row in table2.find('tr'):
    for a in row.findAll('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)
    continue
That prints the first row of pitchers, which is also a problem I was going to tackle later. Why does the continue not make it iterate through the rows?
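(A side note on the edit: table2.find('tr') returns only the first <tr> tag, so the loop iterates over that single row's children rather than over the rows, and continue never reaches a second row. find_all('tr') returns the full row list; a minimal sketch reusing table2 from above:)
pitchers = []
for row in table2.find_all('tr'):  # every row, not just the first
    for a in row.find_all('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)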
When I hear table, I think pandas. You can have pandas.read_html do the parsing for you, then use pandas.Series.dropna to return only valid values.
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('http://www.fantasypros.com/mlb/probable-pitchers.php')
In [3]: df = dfs[0].head(10) # get the first dataframe and just use the first 10 teams for this example
In [4]: print(df['Thu Aug 6']) # Selecting just one day by label
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
2 NaN
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
6 NaN
7 NaN
8 NaN
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
In [5]: active = df['Thu Aug 6'].dropna() # now just drop any fields that are NaNs
In [6]: print(active)
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
I suppose the last thing you'll want to do is parse the strings in the table to get just the pitcher's name.
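For that last step, a small regex can pull the name out of strings like '#WSHJ. Hellickson(7-6)SP 124'. A sketch, assuming the cells keep the format of team code, then 'X. Surname', then the win-loss record in parentheses:
import re

def pitcher_name(cell):
    # assumes the name ('X. Surname') sits directly before the '(' of the win-loss record
    m = re.search(r"([A-Z]\. ?[A-Za-z'.-]+)\(", cell)
    return m.group(1).strip() if m else cell

names = active.map(pitcher_name)  # e.g. '#WSHJ. Hellickson(7-6)SP 124' -> 'J. Hellickson'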
If you want to write the Series to a csv, you can do so directly by:
In [7]: active.to_csv('active.csv')
This gives you a csv that looks something like this:
0,#WSHJ. Hellickson(7-6)SP 126
1,MIAM. Wisler(5-1)SP 306
3,#NYYE. Rodriguez(6-3)SP 179
4,SFK. Hendricks(4-5)SP 51
5,STLM. Lorenzen(3-6)SP 301
9,KCB. Farmer(0-2)SP 267