Scraping dynamic webpages with a date selector in Python

I am looking to use the requests module in python to scrape:
https://www.lines.com/betting/nba/odds
This site contains historical betting odds data.
The main issue is that there is a date selector on this page, and I cannot seem to find where the date value is stored. I've tried looking in the headers and the cookies, and still can't find where the date is kept in order to change it programmatically and scrape data from different dates.
Looking at the Network tab, it seems like the page pulls this data from:
https://www.lines.com/betting/nba/odds/best-line?date=2023-01-23
However, even when sending the same headers, I am unable to access this URL. It just returns the data from:
https://www.lines.com/betting/nba/odds
which is the current date.
I would like to do this without resorting to a different method (i.e. Selenium), even though that approach seems pretty straightforward (open page -> download data -> click previous date -> repeat).
Here is my code to do so:
import requests
url = 'https://www.lines.com/betting/nba/odds/'
requests.get(url).text
Thanks!

Try to pass headers={"X-Requested-With": "XMLHttpRequest"} to the request:
import requests
import pandas as pd
from itertools import cycle
from bs4 import BeautifulSoup

url = "https://www.lines.com/betting/nba/odds/best-line?date=2023-01-28"

# The X-Requested-With header makes the server treat this as the AJAX
# request the date selector sends, so the date parameter is honored:
soup = BeautifulSoup(
    requests.get(url, headers={"X-Requested-With": "XMLHttpRequest"}).content,
    "html.parser",
)

odds = []
for o in soup.select(".odds-list-col"):
    # Team names for this matchup, cycled so each odds value gets its team
    matches = [t["title"] for t in o.select(".odds-list-team")]
    teams = cycle(matches)
    for od in o.select(".odds-list-val"):
        odds.append(
            [
                next(teams),
                " vs ".join(matches),
                od.find_previous(class_="odds-col-title").text.strip(),
                od.get_text(strip=True, separator=" "),
            ]
        )

# One row per (team, match), one column per odds type
df = pd.DataFrame(odds, columns=["Team", "Match", "Odd", "Value"]).pivot(
    index=["Team", "Match"], columns="Odd", values="Value"
)
print(df)
Prints:
Odd M/L O/U P/S
Team Match
Bucks Bucks vs Pacers -325 o237.0 (-110) -7.5 (-115)
Cavaliers Cavaliers vs Thunder -115 — -1.0 (-110)
Grizzlies Grizzlies vs Timberwolves -150 o237.0 (-110) -3.0 (-110)
Heat Magic vs Heat -275 u218.0 (-110) -7.0 (-110)
Magic Magic vs Heat +265 o218.0 (-110) +7.5 (-110)
Pacers Bucks vs Pacers +280 u237.5 (-110) +8.0 (-108)
Raptors Raptors vs Warriors +188 — +5.5 (-110)
Thunder Cavaliers vs Thunder -105 — +1.0 (-110)
Timberwolves Grizzlies vs Timberwolves +145 u237.5 (-110) +3.5 (-110)
Warriors Raptors vs Warriors -205 — -5.0 (-114)
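To walk backwards through dates (the open page -> download -> previous date loop from the question), you can generate the date strings yourself and hit the same endpoint with the same header. A minimal sketch, assuming the best-line endpoint accepts any past date in YYYY-MM-DD form:

import requests
from datetime import date, timedelta

headers = {"X-Requested-With": "XMLHttpRequest"}
day = date(2023, 1, 28)
for _ in range(7):  # last 7 days, as an example
    url = f"https://www.lines.com/betting/nba/odds/best-line?date={day:%Y-%m-%d}"
    html = requests.get(url, headers=headers).content
    # ... parse `html` with the BeautifulSoup code above ...
    day -= timedelta(days=1)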

Related

sportsreference API wrong data issue

I'm using the sportsreference API to get some data, but I'm not sure if I am doing something wrong or there is an issue with the API. When I pull the data I need, it always says the away team won, even for games where this is not true.
Code snippet:
from datetime import datetime
import pandas as pd
from sportsreference.nba.boxscore import Boxscores
from sportsreference.nba.boxscore import Boxscore

# Select range of dates to get boxscores from (year, month, day)
games = Boxscores(datetime(2017, 10, 17), datetime(2017, 10, 20))

# Get boxscore abbreviations to get more detailed game boxscores
boxscore_abvs = []
for key in games.games.keys():
    for i in range(len(games.games[key])):
        boxscore_abvs.append(games.games[key][i]['boxscore'])

# Get more detailed boxscores
df = pd.DataFrame()
for abv in boxscore_abvs:
    game_data = Boxscore(abv)
    temp_df = game_data.dataframe
    df = df.append(temp_df)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
Sample of wrong output from df (Cavs won this game, API reports Celtics):
away_assist_percentage away_assists away_block_percentage away_blocks away_defensive_rating ... losing_name pace winner winning_abbr winning_name
201710170CLE 66.7 24 6.6 4 102.7 ... Cleveland Cavaliers 99.3 Away BOS Boston Celtics
It's a known issue caused by the site changing its HTML layout. Seems like it should be fixed in the 0.6.0 release: https://github.com/roclark/sportsreference/pull/506.
In the meantime, you can install from git to get the fixed version:
pip install --force-reinstall git+https://github.com/roclark/sportsreference.git#master
With that, I get the correct result:
Boxscore('201710170CLE').dataframe[['away_points', 'home_points', 'winning_name']]
# away_points home_points winning_name
# 201710170CLE 99 102 Cleveland Cavaliers
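If you want to verify the fix across a whole batch of games, a quick sanity check (a sketch, reusing the df built by the question's loop and assuming the points columns are numeric) is to compare the reported winner against the score:

# Rows where the reported winner disagrees with the points columns
mask = (df['away_points'] > df['home_points']) != (df['winner'] == 'Away')
print(df.loc[mask, ['away_points', 'home_points', 'winner', 'winning_name']])  # should be empty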

How do I scrape https://www.premierleague.com/players for information about team rosters for the last 10 years?

I have been trying to scrape data from https://www.premierleague.com/players to get team rosters for premier league clubs for the past 10 years.
The following is the code I am using. In this particular example se=17 specifies season 2008/09 and cl=12 is for Manchester United.
import requests
import pandas as pd

url = 'https://www.premierleague.com/players?se=17&cl=12'
r = requests.get(url)
d = pd.read_html(r.text)
d[0]
In spite of the URL showing the correct data in the browser, the table I get is the one for the current season 2019/20. I have tried multiple combinations of the URL and still cannot scrape the earlier seasons.
Can someone help?
I prefer to use BeautifulSoup to navigate the DOM. This works.
from bs4 import BeautifulSoup
import requests
import pandas as pd

resp = requests.get("https://www.premierleague.com/players", params={"se": 17, "cl": 12})
soup = BeautifulSoup(resp.content.decode(), "html.parser")
html = soup.find("div", {"class": "table playerIndex"}).find("table")
df = pd.read_html(str(html))[0]
sample output
Player Position Nationality
Rolando Aarons Midfielder England
Tammy Abraham Forward England
Che Adams Forward England
Dennis Adeniran Midfielder England
Adrián Goalkeeper Spain
Adrien Silva Midfielder Portugal
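To cover the last 10 years, the same request can be repeated over a range of se values. A sketch under the assumption that season ids are consecutive (the question states se=17 is 2008/09; the ids for other seasons are an assumption worth verifying in the page's network tab):

import requests
import pandas as pd
from bs4 import BeautifulSoup

frames = []
for se in range(17, 27):  # assumed consecutive season ids
    resp = requests.get("https://www.premierleague.com/players",
                        params={"se": se, "cl": 12})
    soup = BeautifulSoup(resp.content.decode(), "html.parser")
    html = soup.find("div", {"class": "table playerIndex"}).find("table")
    season = pd.read_html(str(html))[0]
    season["se"] = se  # keep track of which season each roster came from
    frames.append(season)

rosters = pd.concat(frames, ignore_index=True)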

How to get scrape all the td and tr data from NFL schedule

I am scraping data from espn.com for the upcoming NFL schedule. However, I am only able to get the first table and not the rest. I believe it is because of the structure of the HTML and that each date has a different 'td'. I can get Thursday's game data, but not the rest:
Thursday, September 5
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Green Bay
Chicago
8:20 PM NBC Tickets as low as $290 Soldier Field, Chicago
Sunday, September 8
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Tennessee
Cleveland
1:00 PM CBS Tickets as low as $121 FirstEnergy Stadium, Cleveland
Cincinnati
Seattle
4:05 PM CBS Tickets as low as $147 CenturyLink Field, Seattle
New York
Dallas
4:25 PM FOX Tickets as low as $50 AT&T Stadium, Arlington
Foxboro
Monday, September 9
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Houston
New Orleans
7:10 PM ESPN Tickets as low as $112 Mercedes-Benz Superdome, New Orleans
Denver
Oakland
10:20 PM ESPN Tickets as low as $72 Oakland Coliseum, Oakland
I have used BeautifulSoup and was easily able to get the data, but parsing it has been a challenge.
I tried just continuing with a for loop, but I get a StopIteration traceback. After reading a previous article about that traceback, I realized I need to try a different solution to the problem.
import requests
from bs4 import BeautifulSoup
import pandas as pd

main_url = 'http://www.espn.com/nfl/schedule'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
rows = iter(rows)
df = [td.text for td in next(rows).find_all('td') if td.text]
df2 = [td.text for td in next(rows).find_all('td') if td.text]
I believe that the problem lies in this line:

table = soup.find('table')

The page consists of 3 table elements that have the class="schedule" attribute, but your code uses find(), which returns only the first match, instead of find_all(). That's the major reason you ended up with only the contents of the first table. If you handle that part correctly, you'll be good to go. I'm not very familiar with the comprehension notation used to fill up the lists, hence the code below uses the good old for-loop style.
# List to store the rows
df = []
# Collect all the tables
tables = soup.find_all('table', class_="schedule")
for table in tables:
    # Search within the current table, not the whole soup
    rows = table.find_all('tr')
    row_item = []
    for row in rows:
        # Collect all 'td' elements from the row and append them to 'row_item'
        data_items = row.find_all('td')
        for data_item in data_items:
            row_item.append(data_item.text)
        # Append the completed row to 'df'
        df.append(row_item)
        row_item = []
print(df)
If you're trying to pull <table> tags, you can use Pandas .read_html() to do that. It'll return a list of dataframes. In this case, you can append them all together into 1 table:
import pandas as pd

url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)
df = pd.DataFrame()
for table in tables:
    df = df.append(table)
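Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on current pandas the loop becomes a single concat:

import pandas as pd

tables = pd.read_html('http://www.espn.com/nfl/schedule')
df = pd.concat(tables, ignore_index=True)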

BeautifulSoup elements output to list

I have an output using BeautifulSoup.
I need to convert the output from type bs4.element.Tag to a list and export the list into a DataFrame column named COLUMN_A.
I want my output to stop at the 14th element (the last three h2 headings are useless).
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
attraction_place = soup.find_all('h2', class_="sitename")
for attraction in attraction_place:
    print(attraction.text)
    type(attraction)
Output:
1 Vigeland Sculpture Park
2 Akershus Fortress
3 Viking Ship Museum
4 The National Museum
5 Munch Museum
6 Royal Palace
7 The Museum of Cultural History
8 Fram Museum
9 Holmenkollen Ski Jump and Museum
10 Oslo Cathedral
11 City Hall (Rådhuset)
12 Aker Brygge
13 Natural History Museum & Botanical Gardens
14 Oslo Opera House and Annual Music Festivals
Where to Stay in Oslo for Sightseeing
Tips and Tours: How to Make the Most of Your Visit to Oslo
More Related Articles on PlanetWare.com
I expect a list like:
attraction=[Vigeland Sculpture Park, Akershus Fortress, ......]
Thank you very much in advance.
A nice easy way is to take the alt attribute of the photos. This gets clean text output and only 14 without any need for slicing/indexing.
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm')
soup = BeautifulSoup(r.content, 'lxml')
attractions = [item['alt'] for item in soup.select('.photo [alt]')]
print(attractions)
Alternatively, keep a counter and stop after the 14th element:

new = []
count = 1
for attraction in attraction_place:
    if count < 15:
        new.append(attraction.text)
        count += 1
You can use a slice:

for attraction in attraction_place[:14]:
    print(attraction.text)
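To finish the export step from the question, the sliced list drops straight into a DataFrame column. A minimal sketch, reusing attraction_place from the question's code (COLUMN_A is the name the question asks for):

import pandas as pd

attractions = [attraction.text for attraction in attraction_place[:14]]
df = pd.DataFrame({'COLUMN_A': attractions})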

Python parsing HTML with BeautifulSoup

I'm trying to take specific data from this webpage (http://www.scoresandodds.com/index.html) and eventually want to put it into a table of my own; for right now, I just want to be able to get the data I want to show up. With the code below I am able to get all the teams with class 'team even', but I want both 'team odd' and 'team even' rows, preferably with 'team odd' showing up first.
I'm only focused on extracting the names for now. Any help would be greatly appreciated; I've been trying to figure this out all day and can't quite crack it! I just started learning Python and don't want you to give me the answer, just point me in the correct direction.
Thanks!
import requests
from bs4 import BeautifulSoup

# Scrape all data from the website
url = 'http://www.scoresandodds.com/index.html'
response = requests.get(url)
html = response.content

# Take the content from above and search it for elements with certain attributes
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
for row in table.findAll('tr', attrs={'class': 'team even'}):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    print(list_of_cells)
Getting just the names is simple: use class_="team" so you match both the odd and even rows, then pull the td with the class "name":
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)
for tr in table.find_all("tr", class_="team"):
    print(tr.find("td", "name").text.strip())
Which will give you:
951 SAN FRANCISCO GIANTS
952 PITTSBURGH PIRATES
953 SAN DIEGO PADRES
954 CINCINNATI REDS
955 CHICAGO CUBS
956 MIAMI MARLINS
957 NEW YORK METS
958 ATLANTA BRAVES
959 ARIZONA DIAMONDBACKS
960 COLORADO ROCKIES
961 SEATTLE MARINERS
962 DETROIT TIGERS
963 CHICAGO WHITE SOX
964 BOSTON RED SOX
965 OAKLAND ATHLETICS
966 LOS ANGELES ANGELS
967 PHILADELPHIA PHILLIES
968 MINNESOTA TWINS
To get multiple fields, you can pass a list of classes:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)
for tr in table.find_all("tr", class_="team"):
    print(", ".join([td.text.strip() for td in tr.find_all("td", ["name", "pitcher", "currentline", "score"])]))
If we look at the source, we can see that some class names are repeated, like "line", so we can also use the ids, matching on partial id text, to pull the pitcher, current line, run line, and so on:
for tr in table.find_all("tr", class_="team"):
    print(tr.select_one("td[id*=Pitcher]").text)
    print(tr.select_one("td[id*=Current]").text)
    print(tr.select_one("td[id*=Line]").text)
    print("")
Which would give you:
(r) surez, a
8.5o15
+1.5(-207)
(l) niese, j
-108
-1.5(+190)
(l) friedrich, c
9.5o15
+1.5(-195)
(l) lamb, j
-115
-1.5(+179)
(l) lester, j
-156
-1.5(-105)
(l) chen, w
7.5o15
+1.5(-103)
(r) harvey, m
-155
-1.5(+106)
(r) wisler, m
7.5u15
+1.5(-115)
(r) greinke, z
-150
-1.5(+109)
(r) butler, e
10.5
+1.5(-118)
(r) sampson, a
10u15
+1.5(-170)
(l) norris, d
-123
-1.5(+156)
(r) shields, j
10o20
+1.5(+117)
(r) porcello, r
-235
-1.5(-127)
(r) graveman, k
8o15
+1.5(-170)
(r) lincecum, t
-133
-1.5(+156)
(r) eickhoff, j
8.5
+1.5(-154)
(r) nolasco, r
-151
-1.5(+142)
You should be able to piece it all together to get all the table data you want.
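For example, here is a sketch that collects those same fields into a DataFrame, reusing the table variable from the code above (the column names are my own labels, not the site's):

import pandas as pd

records = []
for tr in table.find_all("tr", class_="team"):
    records.append({
        "team": tr.find("td", "name").text.strip(),
        "pitcher": tr.select_one("td[id*=Pitcher]").text.strip(),
        "current": tr.select_one("td[id*=Current]").text.strip(),
        "runline": tr.select_one("td[id*=Line]").text.strip(),
    })

print(pd.DataFrame(records))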
