ESPN.com Python web scraping issue

I am trying to pull data for the rosters for all college football teams because I want to run some analysis on team performance based on composition of their roster.
My script works on the first page: it iterates over each team and can open the roster link for each team, but then the Beautiful Soup commands I run on a team's roster page keep throwing IndexErrors. When I look at the HTML, it seems as if the commands I am writing should work, yet when I print the page source from Beautiful Soup I don't see what I see in Chrome's Developer Tools. Is this an instance of JS being used to serve up the content? If so, I thought Selenium got around that?
My code...
import requests
import csv
from bs4 import BeautifulSoup
from selenium import webdriver

teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_soup = BeautifulSoup(teams_html, "html5lib")

i = 0
for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']

        roster_driver = webdriver.Firefox()
        roster_driver.get(roster_link)
        roster_html = teams_driver.page_source
        roster_soup = BeautifulSoup(roster_html, "html5lib")

        team_name_html = roster_soup.find_all('a', class_='sub-brand-title')[0]
        team_name = team_name_html.find_all('b')[0].text

        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            player_name = player_html.find_all('a')[0].text
            player_pos = player_html.find_all('td')[2].text
            player_height = player_html.find_all('td')[3].text
            player_weight = player_html.find_all('td')[4].text
            player_year = player_html.find_all('td')[5].text
            player_hometown = player_html.find_all('td')[6].text

            print(team_name)
            print('\t', player_name)

        roster_driver.close()

teams_driver.close()

In your for loop you're using the HTML of the 1st page (roster_html = teams_driver.page_source), so you get an IndexError when you try to select the 1st item of team_name_html, because find_all returns an empty list.
Also, you don't need to keep all those instances of Firefox open; you can close the driver as soon as you have the HTML.
teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_driver.quit()
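If you do stick with Selenium, here's a sketch (my suggestion, not part of the original answer) that reuses a single second driver for every roster page instead of spawning a new Firefox per team:

roster_driver = webdriver.Firefox()
for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']
        roster_driver.get(roster_link)
        roster_soup = BeautifulSoup(roster_driver.page_source, "html5lib")
        # ... parse roster_soup exactly as in the question ...
roster_driver.quit()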
But you don't have to use Selenium for this task; you can get all the data with requests and bs4.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.espn.com/college-football/teams")
teams_soup = BeautifulSoup(r.text, "html5lib")

for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']
        r = requests.get(roster_link)
        roster_soup = BeautifulSoup(r.text, "html5lib")
        team_name = roster_soup.find('a', class_='sub-brand-title').find('b').text

        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            player_name = player_html.find_all('a')[0].text
            player_pos = player_html.find_all('td')[2].text
            player_height = player_html.find_all('td')[3].text
            player_weight = player_html.find_all('td')[4].text
            player_year = player_html.find_all('td')[5].text
            player_hometown = player_html.find_all('td')[6].text

            print(team_name, player_name, player_pos, player_height, player_weight, player_year, player_hometown)
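Since the question already imports csv with analysis in mind, here's a minimal sketch (the filename and column names are my assumptions) of writing each player row to disk instead of printing:

import csv

with open('rosters.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['team', 'name', 'pos', 'height', 'weight', 'year', 'hometown'])
    # inside the player loop above, replace the print(...) with:
    # writer.writerow([team_name, player_name, player_pos, player_height,
    #                  player_weight, player_year, player_hometown])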


Pulling p tags from multiple URLs

I've struggled with this for days and am not sure what the issue could be. Basically, I'm trying to extract the profile box data (picture below) for each link; going through the inspector, I thought I could pull the p tags to do so.
I'm new to this and trying to understand, but here's what I have thus far:
-- a code that (somewhat) successfully pulls the info for ONE link:
import requests
from bs4 import BeautifulSoup
# getting html
url = 'https://basketball.realgm.com/player/Darius-Adams/Summary/28720'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
container = soup.find('div', attrs={'class', 'main-container'})
playerinfo = container.find_all('p')
print(playerinfo)
I then also have a code that pulls all of the HREF tags from multiple links:
from bs4 import BeautifulSoup
import requests

def get_links(url):
    links = []
    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text)
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    for link in links:
        print(link)
    print(len(links))

get_links('https://basketball.realgm.com/dleague/players/2022')
get_links('https://basketball.realgm.com/dleague/players/2021')
get_links('https://basketball.realgm.com/dleague/players/2020')
So basically, my goal is to combine these two and get one script that will pull all of the p tags from multiple URLs. I've been trying to do it, and I'm really not sure why this isn't working:
from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class', 'main-container'})
    for profile in container.find_all('a'):
        profiles.append(profile.get('p'))
    for profile in profiles:
        print(profile)

get_profile('https://basketball.realgm.com/player/Darius-Adams/Summary/28720')
get_profile('https://basketball.realgm.com/player/Marial-Shayok/Summary/26697')
Again, I'm really new to web scraping with Python, but any advice would be greatly appreciated. Ultimately, my end goal is a tool that can scrape this data in a clean way all at once
(Player name, Current Team, Born, Birthplace, etc.)... maybe I'm doing it entirely wrong, but any guidance is welcome!
You need to combine your two scripts together and make a request for each player. Try the following approach. This searches for <td> tags that have the data-th="Player" attribute:
import requests
from bs4 import BeautifulSoup

def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")

    for td in soup.find_all('td', {'data-th': 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")

        req_player_url = requests.get(f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name": name, "URL": player_url}

        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except ValueError:  # not all entries have values
                pass

        data.append(row)
    return data

urls = [
    'https://basketball.realgm.com/dleague/players/2022',
    'https://basketball.realgm.com/dleague/players/2021',
    'https://basketball.realgm.com/dleague/players/2020',
]

for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    for entry in data:
        print(entry)
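Since the stated end goal is a clean table (Player name, Current Team, Born, Birthplace, etc.), here is a sketch of dumping the collected rows to CSV; the filename is my assumption, and the union of keys is computed first because not every profile has the same fields:

import csv

rows = []
for url in urls:
    rows.extend(get_links(url))

fieldnames = sorted({key for row in rows for key in row})
with open("players.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)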

print text inside parent div beautifulsoup

I'm trying to fetch each product's name and price from https://www.daraz.pk/catalog/?q=risk but nothing shows up.
containers = page_soup.find_all("div", {"class": "c2p6A5"})
for container in containers:
    pname = container.findAll("div", {"class": "c29Vt5"})
    name = pname[0].text
    price1 = container.findAll("span", {"class": "c29VZV"})
    price = price1[0].text
    print(name)
    print(price)
There is JSON data in the page; you can get it from the <script> tag using BeautifulSoup, but I don't think that's needed, because you can get it directly with json and re:
import requests, json, re

html = requests.get('https://.......').text
jsonStr = re.search(r'window.pageData=(.*?)</script>', html).group(1)
jsonObject = json.loads(jsonStr)

for item in jsonObject['mods']['listItems']:
    print(item['name'])
    print(item['price'])
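For completeness, a sketch of the BeautifulSoup route mentioned above (locating the right <script> tag first, then parsing the JSON out of it; the exact tag layout is an assumption):

import requests, json, re
from bs4 import BeautifulSoup

html = requests.get('https://www.daraz.pk/catalog/?q=risk').text
soup = BeautifulSoup(html, 'html.parser')
# find the script whose text carries the pageData object
script = next(s.string for s in soup.find_all('script')
              if s.string and 'window.pageData' in s.string)
jsonObject = json.loads(re.search(r'window.pageData\s*=\s*(\{.*\})', script, re.S).group(1))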
If the page is dynamic, Selenium should take care of that:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.daraz.pk/catalog/?q=risk')
r = browser.page_source
page_soup = BeautifulSoup(r, 'html.parser')

containers = page_soup.find_all("div", {"class": "c2p6A5"})
for container in containers:
    pname = container.findAll("div", {"class": "c29Vt5"})
    name = pname[0].text
    price1 = container.findAll("span", {"class": "c29VZV"})
    price = price1[0].text
    print(name)
    print(price)

browser.close()
output:
Risk Strategy Game
Rs. 5,900
Risk Classic Board Game
Rs. 945
RISK - The Game of Global Domination
Rs. 1,295
Risk Board Game
Rs. 1,950
Risk Board Game - Yellow
Rs. 3,184
Risk Board Game - Yellow
Rs. 1,814
Risk Board Game - Yellow
Rs. 2,086
Risk Board Game - The Game of Global Domination
Rs. 975
...
I was wrong. The info needed to calculate the page count is present in the JSON, so you can get all the results. No regex is needed, as you can extract the relevant script tag directly. Also, you can build the page URL in a loop.
import requests
from bs4 import BeautifulSoup
import json
import math

def getNameAndPrice(url):
    global resultCount, resultsPerPage, numPages  # set from the first page's JSON
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    data = json.loads(soup.select('script')[2].text.strip('window.pageData='))
    if url == startingPage:
        resultCount = int(data['mainInfo']['totalResults'])
        resultsPerPage = int(data['mainInfo']['pageSize'])
        numPages = math.ceil(resultCount / resultsPerPage)
    result = [[item['name'], item['price']] for item in data['mods']['listItems']]
    return result

resultCount = 0
resultsPerPage = 0
numPages = 0
link = "https://www.daraz.pk/catalog/?page={}&q=risk"
startingPage = "https://www.daraz.pk/catalog/?page=1&q=risk"

results = []
results.append(getNameAndPrice(startingPage))
for links in [link.format(page) for page in range(2, numPages + 1)]:
    results.append(getNameAndPrice(links))
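results ends up as a list of per-page lists of [name, price] pairs; a small usage sketch to flatten it into one list:

all_items = [pair for page in results for pair in page]
print(len(all_items), 'items scraped')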
Referring to the JSON answer, for someone who is very new like me:
You can use Selenium to navigate to the search result page like this.
PS: Thanks to @ewwink very much. You saved my day!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time  # time delay while the page loads
import json, re

keyword = 'fan'
opt = webdriver.ChromeOptions()
opt.add_argument('headless')
driver = webdriver.Chrome(options=opt)
# driver = webdriver.Chrome()

url = 'https://www.lazada.co.th/'
driver.get(url)
search = driver.find_element_by_name('q')
search.send_keys(keyword)
search.send_keys(Keys.RETURN)
time.sleep(3)  # wait 3 secs for the page to load

page_html = driver.page_source  # Selenium way of page_html = webopen.read() for BS
driver.close()

jsonStr = re.search(r'window.pageData=(.*?)</script>', page_html).group(1)
jsonObject = json.loads(jsonStr)

for item in jsonObject['mods']['listItems']:
    print(item['name'])
    print(item['sellerName'])
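One caveat if you run this on a current setup (my note, not part of the original answer): the find_element_by_* helpers were removed in Selenium 4, so the lookup above becomes:

from selenium.webdriver.common.by import By

search = driver.find_element(By.NAME, 'q')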

BS4 Not Locating Element in Python

I am somewhat new to Python and can't for the life of me figure out why the following code isn't pulling the element I am trying to get.
Here is my code:
for player in all_players:
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]
    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test

    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')
    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
Currently returning:

--> for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())  # if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating through the children of the wrapper object returns <div class="table_outer_container"> as part of the tree.
Thanks!
Make sure that table contains the data you expect.
For example https://www.basketball-reference.com/players/a/abdulka01.html doesn't seem to contain a div with id='all_advanced_pbp'
Try to explicitly pass the html instead:
bs.BeautifulSoup(the_html, 'html.parser')
I tried to extract data from the URL you gave but did not get the full DOM. I then tried accessing the page in a browser with and without JavaScript; the site needs JavaScript to load some data, but pages like the player list do not. The simple way to get dynamic data is to use Selenium.
This is my test code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

player_pbp_data = []

def get_list(t="a"):
    with requests.Session() as se:
        url = "https://www.basketball-reference.com/players/{}/".format(t)
        req = se.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        with open("a.html", "wb") as f:
            f.write(req.text.encode())
        table = soup.find("div", class_="table_wrapper setup_long long")
        players = {player.a.text: "https://www.basketball-reference.com" + player.a["href"]
                   for player in table.find_all("th", class_="left ")}

def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")

get_each_player()
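Worth knowing about this particular site (my observation, not part of the answers above): basketball-reference often ships these tables inside HTML comments, which is one reason a plain requests fetch can appear to be missing markup that Developer Tools shows. For pages where the wrapper div is present, a sketch of recovering the table without Selenium by re-parsing the comments:

import requests
from bs4 import BeautifulSoup, Comment

html = requests.get("https://www.basketball-reference.com/players/a/abdulka01.html").text
soup = BeautifulSoup(html, "lxml")
wrapper = soup.find("div", id="all_advanced_pbp")
if wrapper:
    # the real table sits inside an HTML comment under the wrapper div
    for comment in wrapper.find_all(string=lambda t: isinstance(t, Comment)):
        inner = BeautifulSoup(comment, "lxml")
        for td in inner.find_all("td"):
            print(td.get_text())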

bs4 scraping python get contents until specific class name

I want to scrape this site:
https://www.eduvision.edu.pk/institutions-detail.php?city=51I&institute=5_allama-iqbal-open-university-islamabad
I want only the bachelor data in this URL, which is under class name=academicsList, and I don't want the MS (MASTERS) data below it.
I want my scraper to stop before the MS data. My logic is that we can set a temporary counter on class=academicsHead and stop when it hits the second academicsHead.
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}
response = requests.get('https://www.eduvision.edu.pk/institutions-detail.php?city=51I&institute=5_allama-iqbal-open-university-islamabad', headers=header)
soup = BeautifulSoup(response.content, 'html.parser')

disciplines = soup.findAll("ul", {"class": "academicsList"})
# temp = soup.findAll("ul", {"class": "academicsHead"})
# stop at second academicsHead
for d in disciplines:
    print(d.findAll('li')[0].text)
We can check if the class is 'academicsHead', and if it is, just check whether the text is BACHELOR; if not, break the loop.
Something like this would work:
import re

disciplines = soup.findAll('ul', attrs={'class': re.compile(r'academics+(.)+')})
for i in disciplines:
    if i['class'][0] == 'academicsHead':
        if i.find('li').text.strip() != 'BACHELOR':
            break
    else:
        print(i.find('li').text.strip())
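And a sketch of the counter idea from the question itself (same soup and markup assumptions as above): count academicsHead blocks and stop when the second one appears.

import re

heads_seen = 0
for ul in soup.findAll('ul', attrs={'class': re.compile(r'academics')}):
    if ul['class'][0] == 'academicsHead':
        heads_seen += 1
        if heads_seen == 2:  # second heading marks the start of the MS data
            break
    elif heads_seen == 1:  # lists between the first and second heading
        print(ul.find('li').text.strip())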

How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it into a map, and have a local copy on my computer.
I utilized Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page. I want to loop it so that it captures all data for golf courses from all pages found on the PGA site. There are about 18000 golf courses and 900 pages to capture data from.
Attached below is my script. I need help creating code that will capture ALL data from the PGA website, not just one page but all of them. In this manner it will provide me with all the data on golf courses in the United States.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []
for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except:
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except:
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

with open('filename5.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

# for item in g_data1:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
#     except:
#         pass

# for item in g_data2:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
#     except:
#         pass
This script only captures 20 at a time, and I want to capture everything in one script, which accounts for 18000 golf courses and 900 pages to scrape from.
The PGA website's search has multiple pages; the URL follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of a page, then increase the value of page by 1 and read the next page, and so on.
import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Your code for each individual page here
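For instance, a sketch of slotting the question's per-page scraping into that loop (collecting just the course names with the same selectors; untested against the live site):

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(907):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        title = item.find("div", {"class": "views-field-title"})
        if title:
            courses_list.append(title.get_text(strip=True))

with open("courses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name in courses_list:
        writer.writerow([name])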
If you're still reading this post, you can try this code too:
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)

for page in range(1, 5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    Title = soup.find_all("div", {"class": "views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class": "views-field-title"}).get_text()
            address = i.find("div", {"class": "views-field-address"}).get_text()
            city = i.find("div", {"class": "views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class": "views-field-work-phone"}).get_text()
            website = i.find("div", {"class": "views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",", "|") + ",{}".format(address) + ",{}".format(city).replace(",", " ") + ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:  # skip entries with missing fields
            pass
f.close()
Where it says range(1,5), just change that to run from 0 to the last page and you will get all the details in the CSV. I tried very hard to get your data in a proper format, but it's hard :).
You're putting a link to a single page; it's not going to iterate through each one on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're running for page 1 you'll only get 20. You'll need to create a loop that'll run through each page.
You can start off by creating a function that does one page then iterate that function.
Right after search? in the URL, starting at page 2, a page=1 parameter appears and increases until page 907, where it's page=906.
I noticed that the first solution repeated the first page's results; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below...
for i in range(1, 907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")  # can use whichever parser you prefer
    # Your code for each individual page here
I had this same exact problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it will pull all the pages you need, sending the right cookies with each numbered page.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)
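Continuing that snippet, a sketch of reusing the session across the numbered pages (the first request sets the cookies; s.get then sends them with every page):

for i in range(1, 907):
    page_url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = s.get(page_url)
    soup = BeautifulSoup(r.content, "html.parser")
    # your per-page scraping code here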
The PGA website has changed since this question was asked.
It seems they organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls

state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
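Finally, a sketch of writing all_courses out to CSV (the asker's end goal); the filename, and the assumption that every dict has exactly the three keys from parse_course, are mine:

import csv

with open("pga_courses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()
    writer.writerows(all_courses)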
