Python parsing HTML with BeautifulSoup

I'm trying to take specific data from this [webpage][1] and eventually want to put it into a table of my own, but for right now I just want to be able to get the data I want to show up. With the code below I am able to get all the teams with class 'team even' to show up; however, I want both 'team odd' and 'team even' rows, preferably with 'team odd' showing up first.
I'm only focused on extracting the names for now. Any help would be greatly appreciated; I've been trying to figure this out all day and can't quite crack it! I just started learning Python, so please don't give me the answer outright, just point me in the right direction.
Thanks!
import bs4, requests
from bs4 import BeautifulSoup

# Scraping all data from website
url = 'http://www.scoresandodds.com/index.html'
response = requests.get(url)
html = response.content

# Parse the content and search it for elements with certain attributes
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
for row in table.findAll('tr', attrs={'class': 'team even'}):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    print(list_of_cells)

To just get the names is simple: use class_="team" so you match both the odd and even rows, then pull the td with the class name:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)

for tr in table.find_all("tr", class_="team"):
    print(tr.find("td", "name").text.strip())
Which will give you:
951 SAN FRANCISCO GIANTS
952 PITTSBURGH PIRATES
953 SAN DIEGO PADRES
954 CINCINNATI REDS
955 CHICAGO CUBS
956 MIAMI MARLINS
957 NEW YORK METS
958 ATLANTA BRAVES
959 ARIZONA DIAMONDBACKS
960 COLORADO ROCKIES
961 SEATTLE MARINERS
962 DETROIT TIGERS
963 CHICAGO WHITE SOX
964 BOSTON RED SOX
965 OAKLAND ATHLETICS
966 LOS ANGELES ANGELS
967 PHILADELPHIA PHILLIES
968 MINNESOTA TWINS
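The shared-class match can be tried offline on a stand-in snippet (the live page may have changed, so the markup below is only an imitation of its structure):

```python
from bs4 import BeautifulSoup

# Stand-in markup: the key point is that both row variants share
# the base class "team".
html = """
<table><tbody>
  <tr class="team odd"><td class="name">951 SAN FRANCISCO GIANTS</td></tr>
  <tr class="team even"><td class="name">952 PITTSBURGH PIRATES</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="team" matches any tr whose class list contains "team",
# so "team odd" and "team even" rows both come back, in document order.
names = [tr.find("td", "name").text for tr in soup.find_all("tr", class_="team")]
print(names)
```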
To get multiple pieces of data, you can pass a list of classes:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)

for tr in table.find_all("tr", class_="team"):
    print(", ".join([td.text.strip() for td in tr.find_all("td", ["name", "pitcher", "currentline", "score"])]))
If we look at the source, you can see that some class names are repeated (like 'line'), so we can also use the ids to get the current line, run line, etc., matching on partial id text:

for tr in table.find_all("tr", class_="team"):
    print(tr.select_one("td[id*=Pitcher]").text)
    print(tr.select_one("td[id*=Current]").text)
    print(tr.select_one("td[id*=Line]").text)
    print("")

Which would give you:
(r) surez, a
8.5o15
+1.5(-207)
(l) niese, j
-108
-1.5(+190)
(l) friedrich, c
9.5o15
+1.5(-195)
(l) lamb, j
-115
-1.5(+179)
(l) lester, j
-156
-1.5(-105)
(l) chen, w
7.5o15
+1.5(-103)
(r) harvey, m
-155
-1.5(+106)
(r) wisler, m
7.5u15
+1.5(-115)
(r) greinke, z
-150
-1.5(+109)
(r) butler, e
10.5
+1.5(-118)
(r) sampson, a
10u15
+1.5(-170)
(l) norris, d
-123
-1.5(+156)
(r) shields, j
10o20
+1.5(+117)
(r) porcello, r
-235
-1.5(-127)
(r) graveman, k
8o15
+1.5(-170)
(r) lincecum, t
-133
-1.5(+156)
(r) eickhoff, j
8.5
+1.5(-154)
(r) nolasco, r
-151
-1.5(+142)
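The [id*=...] selector used above is a CSS attribute-substring match, which you can see in isolation on a toy snippet (the real ids on the page differ; these are just illustrative):

```python
from bs4 import BeautifulSoup

# Toy markup imitating the partial-id pattern.
html = """
<table><tr class="team">
  <td id="row1Pitcher">(r) surez, a</td>
  <td id="row1Current">8.5o15</td>
  <td id="row1Line">+1.5(-207)</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# [id*=Pitcher] matches any td whose id merely contains "Pitcher",
# so a changing row prefix doesn't matter.
pitcher = soup.select_one("td[id*=Pitcher]").text
line = soup.select_one("td[id*=Line]").text
print(pitcher, line)
```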
You should be able to piece it all together to get all the table data you want.

Related

Struggling to grab data from baseball reference

I'm trying to grab the tables for all pitchers' batting-against stats found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
soup = bs(page.content, "html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside a comment string.
After you have fetched your web page and loaded it into BeautifulSoup, you can solve this scraping issue with the following steps:
1. grab the div tagged id='all_players_batting_pitching', which contains your table
2. extract the table from the comment using the decode_contents function, then reload the text into a new soup
3. extract each record of your table by looking for the tr tag, then each value by looking for the td tag, keeping a value only if its index matches a column you want (td indices 0, 3 and 12 for Name, IP and HR; Rk comes from the row counter)
4. load your values into a pandas.DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)

# extracting table from html (it is stored inside an HTML comment)
soup = bs(page.content, "html.parser")
table = soup.find(id='all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text, "html.parser")

# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)

df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
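Splitting on '--' works but is fragile; bs4 can also hand you the comment node directly via its Comment class. A sketch on inline HTML (a toy stand-in for the real page, which wraps its table in a comment the same way):

```python
from bs4 import BeautifulSoup, Comment

# Toy page hiding a table inside an HTML comment, as baseball-reference does.
html = """
<div id="all_players_batting_pitching">
<!-- <table><tr><td>Cory Abbott</td><td>48.0</td></tr></table> -->
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find(id="all_players_batting_pitching")

# find(string=...) with a predicate returns the comment node itself,
# whose text can then be re-parsed as ordinary HTML.
comment = div.find(string=lambda s: isinstance(s, Comment))
inner = BeautifulSoup(comment, "html.parser")
cells = [td.text for td in inner.find_all("td")]
print(cells)
```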

How do I scrape https://www.premierleague.com/players for information about team rosters for the last 10 years?

I have been trying to scrape data from https://www.premierleague.com/players to get team rosters for premier league clubs for the past 10 years.
The following is the code I am using. In this particular example se=17 specifies season 2008/09 and cl=12 is for Manchester United.
import requests
import pandas as pd

url = 'https://www.premierleague.com/players?se=17&cl=12'
r = requests.get(url)
d = pd.read_html(r.text)
d[0]
In spite of the URL providing the correct data on the page, the table I get is the one for the current season, 2019/20. I have tried multiple combinations of the URL and still I am not able to scrape the right one.
Can someone help?
I prefer to use BeautifulSoup to navigate the DOM. This works.
from bs4 import BeautifulSoup
import requests
import pandas as pd

resp = requests.get("https://www.premierleague.com/players", params={"se": 17, "cl": 12})
soup = BeautifulSoup(resp.content.decode(), "html.parser")
html = soup.find("div", {"class": "table playerIndex"}).find("table")
df = pd.read_html(str(html))[0]
Sample output:
Player Position Nationality
Rolando Aarons Midfielder England
Tammy Abraham Forward England
Che Adams Forward England
Dennis Adeniran Midfielder England
Adrián Goalkeeper Spain
Adrien Silva Midfielder Portugal
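A side note on the params argument used above: requests builds the query string from the dict for you, which you can verify offline by preparing the request without sending it:

```python
import requests

# Preparing (not sending) the request shows how the params dict becomes
# the ?se=17&cl=12 query string; no network access is needed.
req = requests.Request(
    "GET",
    "https://www.premierleague.com/players",
    params={"se": 17, "cl": 12},
)
print(req.prepare().url)
```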

How to return result to a table or csv type of format from html

I'm trying to set up a scrape from a betting site I can run during the NFL season to get the odds into Excel/DB, but as I am very new to python and bs4 I'm running into trouble.
I'm using Python 3.7.4 with BS4
import requests
from bs4 import BeautifulSoup

result2 = requests.get("https://www.betfair.com/sport/american-football/nfl-kampe/green-bay-packers-chicago-bears/29202049")
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')

for item in soup.find_all('div', {'class': 'minimarketview-content'}):
    print(item.text)
I would like output to be csv like this:
"Green Bay Packers", "2.3", "Chicago Bears", "1.55"
"Green Bay Packers", "1.7","+3,5", "Chicago Bears", "2.0","-3.5"
Current result (with big line breaks):
Green Bay Packers
2.3
Chicago Bears
1.55
Green Bay Packers
1.7
+3,5
etc
I can't access the site (it's blocked by the firewall on the public wifi I'm on), so I can't test the code below, but instead of printing the items, put them into a list, then convert that list to a DataFrame/table. Something like:
Note: Still work to be done to clean it up, but this gets you going
import requests
from bs4 import BeautifulSoup
import pandas as pd

result2 = requests.get("https://www.betfair.com/sport/american-football/nfl-kampe/green-bay-packers-chicago-bears/29202049")
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')

data = []
for item in soup.find_all('div', {'class': 'minimarketview-content'}):
    temp_data = [alpha for alpha in item.text.split('\n') if alpha != '']
    data.append(temp_data)

df = pd.DataFrame(data)
print(df)
df.to_csv('file.csv')
Output:
print (df.to_string())
0 1 2 3 4 5 6 7
0 Green Bay Packers 11/8 Chicago Bears 8/13 None None None None
1 Green Bay Packers 3/4 +3.5 Chicago Bears 11/10 -3.5 None None
2 Current Points: Over 20/23 +46 Under 19/20 +46 None
3 Green Bay Packers by 1-13 Pts 2/1 Green Bay Packers 14+ 5/1 Chicago Bears by 1-13 Pts 6/4 Chicago Bears 14+ 10/3
I think you can just replace the newline characters with commas:

import csv

with open('filename.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for item in soup.find_all('div', {'class': 'minimarketview-content'}):
        x = item.text.replace('\n', ',')
        writer.writerow([x])
Edit:
Added save to .csv file.
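Note that writerow([x]) with a pre-joined string produces a single quoted cell. If you want each value in its own CSV column, split the text into fields and hand the list to writerow. A sketch on sample text (the live page isn't reachable from here, so the input string is a stand-in for one market block's .text):

```python
import csv
import io

# Sample text mimicking one market block's .text: fields separated by newlines.
text = "\nGreen Bay Packers\n2.3\nChicago Bears\n1.55\n"

buf = io.StringIO()
writer = csv.writer(buf)
# Passing a list of fields lets the csv module handle separators and quoting;
# writerow([one_joined_string]) would emit a single quoted cell instead.
writer.writerow([field for field in text.split("\n") if field])
print(buf.getvalue().strip())
```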

BeautifulSoup elements output to list

I have output from BeautifulSoup.
I need to convert the output from type bs4.element.Tag to a list and export the list into a DataFrame column named COLUMN_A.
I want my output to stop at the 14th element (the last three h2 headings are useless).
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')

attraction_place = soup.find_all('h2', class_="sitename")
for attraction in attraction_place:
    print(attraction.text)
    type(attraction)
Output:
1 Vigeland Sculpture Park
2 Akershus Fortress
3 Viking Ship Museum
4 The National Museum
5 Munch Museum
6 Royal Palace
7 The Museum of Cultural History
8 Fram Museum
9 Holmenkollen Ski Jump and Museum
10 Oslo Cathedral
11 City Hall (Rådhuset)
12 Aker Brygge
13 Natural History Museum & Botanical Gardens
14 Oslo Opera House and Annual Music Festivals
Where to Stay in Oslo for Sightseeing
Tips and Tours: How to Make the Most of Your Visit to Oslo
More Related Articles on PlanetWare.com
I expect a list like:
attraction=[Vigeland Sculpture Park, Akershus Fortress, ......]
Thank you very much in advance.
A nice easy way is to take the alt attribute of the photos. This gets clean text output, and only 14 items, without any need for slicing/indexing.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm')
soup = bs(r.content, 'lxml')
attractions = [item['alt'] for item in soup.select('.photo [alt]')]
print(attractions)
You can count as you append and stop after the 14th element:

new = []
count = 1
for attraction in attraction_place:
    if count < 15:
        text = attraction.text
        new.append(text)
        count += 1
You can use a slice:

for attraction in attraction_place[:14]:
    print(attraction.text)
    type(attraction)
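The slicing approach can be tried offline on stand-in tags (toy markup below; the real page has 14 attractions followed by three filler headings):

```python
from bs4 import BeautifulSoup

# Stand-in markup: three attractions plus a trailing heading to be dropped.
html = """
<h2 class="sitename">1 Vigeland Sculpture Park</h2>
<h2 class="sitename">2 Akershus Fortress</h2>
<h2 class="sitename">3 Viking Ship Museum</h2>
<h2 class="sitename">Where to Stay in Oslo for Sightseeing</h2>
"""
soup = BeautifulSoup(html, "html.parser")
attraction_place = soup.find_all("h2", class_="sitename")

# Slice off the unwanted trailing heading(s); use [:14] on the real page.
attractions = [h2.text for h2 in attraction_place[:3]]
print(attractions)
```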

Iterating through list of URLs in Python - bs4

I have one .txt file (named test_1.txt) that is formatted as follows:
https://maps.googleapis.com/maps/api/directions/xml?origin=Bethesda,MD&destination=Washington,DC&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Miami,FL&destination=Mobile,AL&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Scranton,PA&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Baltimore,MD&destination=Charlotte,NC&sensor=false&mode=walking
If you go to one of the links above you'll see the output in XML. With the code written below, I've only managed to get it to the second directions request (Miami to Mobile), and it prints seemingly random data that isn't what I want. When I go to a single URL at a time directly from the code instead of the .txt, I can print exactly the data I need. Is there any reason it only goes to the second URL and prints the wrong info? The Python code is below:
import urllib2
from bs4 import BeautifulSoup

with open('test_1.txt', 'r') as f:
    f.readline()
    mapcalc = f.readline()
    response = urllib2.urlopen(mapcalc)
    soup = BeautifulSoup(response)
    for leg in soup.select('route > leg'):
        duration = leg.duration.text.strip()
        distance = leg.distance.text.strip()
        start = leg.start_address.text.strip()
        end = leg.end_address.text.strip()
        print duration
        print distance
        print start
        print end
EDIT:
This is the output of the Python code in the shell:
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA
Here's a link that could shed more light on the behavior you can get when opening files and reading lines, etc. (related to Lev Levitsky's comment).
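The symptom follows directly from readline(): the first call consumes line one and throws it away, the second returns line two, and the script never loops at all, which is why only the Miami-to-Mobile request was ever made. A quick illustration with StringIO standing in for the file:

```python
import io

# StringIO stands in for test_1.txt.
f = io.StringIO("url_one\nurl_two\nurl_three\n")

f.readline()           # consumes "url_one" and discards it
second = f.readline()  # returns "url_two", the only URL the original code fetched
print(second.strip())

f.seek(0)
urls = [line.strip() for line in f]  # iterating the file visits every line
print(urls)
```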
One way:

import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
with open('test_1.txt', 'r') as f:
    for mapcalc in f:
        status, response = http.request(mapcalc)
        for leg in BeautifulSoup(response).select('route > leg'):
            duration = leg.duration.text.strip()
            distance = leg.distance.text.strip()
            start = leg.start_address.text.strip()
            end = leg.end_address.text.strip()
            print duration
            print distance
            print start
            print end
I'm new to this sort of thing but I got the above code to work with the following output:
4877
1 hour 21 mins
6582
4.1 mi
Bethesda, MD, USA
Washington, DC, USA
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA
190
3 mins
269
0.2 mi
Chicago, IL, USA
Scranton, PA, USA
12
1 min
15
49 ft
Baltimore, MD, USA
Charlotte, NC, USA
