I have had no issues grabbing three stats: hits, runs, and RBIs. Here is the code I have been working with so far:
#import modules
from bs4 import BeautifulSoup
import requests, os
from selenium import webdriver
#start webdriver
os.chdir(r'C:\webdrivers')
header = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml')
#grab html
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
#parse three stats: rbi's, runs and hits
hits = [i.text for i in soup.find_all('td', {'data-stat': 'H'})]
runs = [i.text for i in soup.find_all('td', {'data-stat': 'R'})]
rbi = [i.text for i in soup.find_all('td', {'data-stat': 'RBI'})]
#print data
print(hits, runs, rbi)
The code above works great. When I try to grab the batters' names, however, I run into some problems. The names are not parsed correctly; I would like just the first and last name of each batter if possible.
Here is what I tried:
print(soup.find_all('td', {'data-stat': 'player'}))
The batters' names are in the output, but there is a lot of extra markup. Also, my computer slowed down a lot when I tried this line of code. Any suggestions? Thanks in advance for any help you may offer!
The data is not in the page source; please refer to this link:
https://d3k2oh6evki4b7.cloudfront.net/short/inc/players_search_list.csv
This is a CSV file. You can download it directly, or you can fetch the desired data with code as well.
How to get the batters' names:
Just request the player data directly. I found this URL while watching the page load; getting the player names from it is very easy:
https://d3k2oh6evki4b7.cloudfront.net/short/inc/players_search_list.csv
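For example, a minimal sketch of pulling that CSV with requests; the exact column layout of the file is an assumption here, so print a few raw rows first to find the name field:
import csv, io
import requests

csv_url = 'https://d3k2oh6evki4b7.cloudfront.net/short/inc/players_search_list.csv'
response = requests.get(csv_url, headers={'User-agent': 'Mozilla/5.0'})
response.raise_for_status()

# Inspect the first few rows to confirm which column holds the player name
reader = csv.reader(io.StringIO(response.text))
for row in list(reader)[:5]:
    print(row)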
How to speed up your code:
First: loading the webdriver with Selenium costs most of the time in your code.
For your target, I suggest you use requests directly instead of Selenium.
Second: the lxml parser is faster than html.parser, but you have to install it if you have never used it; running "pip install lxml" will do.
The BeautifulSoup documentation's section on installing a parser summarizes the advantages and disadvantages of each parser library.
for example:
import requests
from bs4 import BeautifulSoup
# start requests
target_url = 'https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml'
headers = {'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
page_source = requests.get(target_url, headers=headers).text
#grab html
soup = BeautifulSoup(page_source, 'lxml')
#parse three stats: rbi's, runs and hits
hits = [i.text for i in soup.find_all('td', {'data-stat': 'H'})]
runs = [i.text for i in soup.find_all('td', {'data-stat': 'R'})]
rbi = [i.text for i in soup.find_all('td', {'data-stat': 'RBI'})]
#print data
print(hits, runs, rbi)
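And once the page source is in hand, the player cells can be grabbed the same way as the other stats; taking .text strips the extra markup the question mentioned. A small follow-on sketch using the same soup as above, assuming the four lists line up row for row:
# Pull just the visible player names, then line them up with the stats
players = [i.text for i in soup.find_all('td', {'data-stat': 'player'})]
for name, h, r, rb in zip(players, hits, runs, rbi):
    print(name, h, r, rb)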
I want to get all the images within a div, but every time I try, the output returns None or just an empty list. The issue only seems to happen when I try to scrape inside a div or a class, even using different user agents, .find, or .find_all.
from bs4 import BeautifulSoup
import requests
abcde = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'}
r = requests.get('https://www.gettyimages.com.br/fotos/randon', headers=abcde)
soup = BeautifulSoup(r.content, 'html.parser')
check = soup.find_all('img', class_="GalleryItems-module__searchContent___DbMmK")
print(check)
I would recommend working with the API instead; there is one at https://developers.gettyimages.com/docs/
To answer your question concerning just the images: classes are not the best identifiers, because they are often dynamic; there is also a gallery (fixed) view and a mosaic view.
Simply select each <article> and its child <img> to get what you're after:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.gettyimages.com.br/fotos/randon?assettype=image&sort=mostpopular&phrase=randon',
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')
for e in soup.select('article img'):
    print(e.get('src'))
Output
https://media.gettyimages.com/photos/randon-norway-picture-id974597088?k=20&m=974597088&s=612x612&w=0&h=EIwbJNzCld1tbU7rTyt42pie2yCEk5z4e6L6Z4kWhdo=
https://media.gettyimages.com/photos/caption-patrick-roanhouse-a-266-member-chats-about-some-software-on-picture-id97112678?k=20&m=97112678&s=612x612&w=0&h=zmwqIlVv2f-M9Vz_qcpITPzj-3SON99G3P69h69J5Gs=
https://media.gettyimages.com/photos/12th-and-f-streets-nw-washington-dc-pedestrians-teofila-randon-left-picture-id97102402?k=20&m=97102402&s=612x612&w=0&h=potzNUgMo3gKab5eS_pwyNggS2YGn6sCnDQYxdGUHqc=
https://media.gettyimages.com/photos/randon-perdue-kari-barnhart-attend-the-other-nashville-society-one-picture-id969787702?k=20&m=969787702&s=612x612&w=0&h=kcaYmOKruLb57Vqga68xvEZB1V12wSPPYkC6GdvXO18=
https://media.gettyimages.com/photos/death-of-duguesclin-to-chateauneuf-de-randon-july-13-1380-during-the-picture-id959538894?k=20&m=959538894&s=612x612&w=0&h=lx3DHDSf3kBc_h-O2kjR2D6UYDjPPvhn8xJ_KM0cmMc=
https://media.gettyimages.com/photos/ski-de-randone-a-saintefoy-au-dessus-du-couloir-de-la-croix-savoie-mr-picture-id945817638?k=20&m=945817638&s=612x612&w=0&h=fRd3M2KCa5dd0z8ePnPw2IkAKhXYJpuCFuUTz7jpVPU=
...
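If you do go the API route recommended above, the request could look roughly like this. The endpoint path, the Api-Key header, and the response fields are assumptions taken from Getty's public v3 documentation, so verify them at https://developers.gettyimages.com/docs/ before relying on this; you need to register for an API key first:
import requests

API_KEY = 'your-api-key-here'  # placeholder, not a real key

resp = requests.get(
    'https://api.gettyimages.com/v3/search/images',  # assumed endpoint
    headers={'Api-Key': API_KEY},                    # assumed header name
    params={'phrase': 'randon'},
)
resp.raise_for_status()
for image in resp.json().get('images', []):          # assumed response shape
    print(image.get('id'), image.get('title'))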
I am trying to scrape information from the live tab of this website: https://www.flashscore.ro/baschet/. I want to receive an email every time something happens,
but my problem is with the scraping.
The code I have so far returns None. For now, I just want to get the name of the home team.
I am kind of new to this scraping-with-Python thing.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.flashscore.ro/baschet/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
def find_price():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    home_team = soup.html.find('div', {'class': 'event__participant event__participant--home'})
    return home_team
print(find_price())
The website uses JavaScript, which requests doesn't execute, so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the ChromeDriver matching your Chrome version from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.flashscore.ro/baschet/"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait for page to fully render
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all(
    "div", {"class": "event__participant event__participant--home"}
):
    print(tag.text)
driver.quit()
Output:
Lyon-Villeurbanne
Fortitudo Bologna
Virtus Roma
Treviso
Trieste
Trento
Unicaja
Gran Canaria
Galatasaray
Horizont Minsk 2 F
...And on
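As a side note, a fixed sleep(5) can be flaky on slow connections; Selenium's explicit waits poll until an element actually appears. A sketch of the same scrape with WebDriverWait (Selenium 3 API, matching the answer above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get("https://www.flashscore.ro/baschet/")
# Block until at least one home-team element is present, up to 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, ".event__participant--home")
    )
)
for el in driver.find_elements_by_css_selector(".event__participant--home"):
    print(el.text)
driver.quit()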
How to get the paragraph containing a <strong> tag and the three paragraphs below it using BeautifulSoup and Python requests?
Screenshot of the paragraphs I want to get using BeautifulSoup.
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200601'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find('p', text=re.compile('^Examples'))
print(data)
This script will get all paragraphs under the header "Examples":
import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200420'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def get_paragraphs(first_paragraph):
    current_paragraph = first_paragraph
    while True:
        yield current_paragraph
        current_paragraph = current_paragraph.find_next('p')
        if not current_paragraph or (current_paragraph.strong and current_paragraph.strong.text.strip() != ''):
            break
for p in get_paragraphs(soup.select_one('p:contains("Examples")')):
    print(p)
Prints:
<p><strong>Examples</strong></p>
<p><br/>My friend keeps banging on about where he’s going to go when he buys his new car. It’s really frustrating.</p>
<p>That person on the bus was really annoying. She kept banging on about how the prices had gone up.</p>
<p>Will you please stop banging on about my project!? If you think you could do a better job, you can do my work for me.</p>
For url = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200601' it prints:
<p><strong>Examples<br/></strong>He’s always flexing on social media, - showing these glamorous pictures of his holidays! </p>
<p>People who flex are so annoying! It’s just showing off. </p>
<p>She can’t stop flexing about her new house, but it’s actually not that nice.<strong> </strong></p>
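One caveat: newer versions of soupsieve (the library that backs select_one) deprecate the :contains pseudo-class, so on a recent install the selector would be written as:
soup.select_one('p:-soup-contains("Examples")')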
The question is very simple, but for me it doesn't work and I don't know why!
I want to scrape the beer rating from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.
This is my code:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url).text
soup = BeautifulSoup(test_html, "lxml")
rating = soup.findAll("span", class_="ratingValue")
rating
When I run it, it doesn't work, but if I do the same thing with another page, it works... I don't know why. Can someone help me? The expected rating is 4.58.
Thanks everybody!
If you print the test_html, you'll find you get a 403 forbidden response.
You should add a header (at least a user-agent) to your GET request.
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'html5lib')
rating = soup.find('span', {'itemprop': 'ratingValue'})
print(rating.text)
# 4.58
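To confirm the 403 for yourself before parsing, check the status code on the response; a small sketch using only the requests API:
import requests

url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
response = requests.get(url)   # no User-Agent header
print(response.status_code)    # prints 403 when the site blocks the request
response.raise_for_status()    # raises requests.exceptions.HTTPError on 4xx/5xx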
The reason behind the forbidden status code (HTTP error 403) is that the server will not fulfill your request despite understanding it. You will definitely get this error if you try to scrape many of the more popular websites, which have security features to prevent bots. So you need to disguise your request!
For that you need to use headers.
You also need to correct the tag attribute whose data you're trying to get, i.e. itemprop.
Use lxml as your tree builder, or any other parser of your choice.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
# Add this
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'lxml')
rating = soup.find('span', {'itemprop':'ratingValue'})
print(rating.text)
The page you are requesting responds with 403 Forbidden, so you might not get an error, but you will get a blank result like []. To avoid this situation we add a user agent, and this code will get you the desired result.
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers={'User-Agent':user_agent}
request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")
rating = soup.find('span', {'itemprop':'ratingValue'})
rating.text
You are facing this error because some websites can't be scraped with requests and Beautiful Soup alone. For these kinds of websites you have to use Selenium.
Download the latest ChromeDriver from this link according to your operating system.
Install Selenium with the command "pip install selenium".
# import required modules
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time, os
current_dir = os.getcwd()
print(current_dir)
# concatenate the current dir with the web driver name; on Windows, change '/' to '\'
# make sure you placed chromedriver in the current directory
driver = webdriver.Chrome(current_dir + '/chromedriver')
# driver.get opens the url in your browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)
# fetch the rendered html from the driver
super_html = driver.page_source
# now parse the raw html with 'html.parser'
soup = BeautifulSoup(super_html, "html.parser")
rating = soup.findAll("span", itemprop="ratingValue")
print(rating[0].text)
I'm trying to scrape the left side of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But it seems the data isn't anywhere to be found, neither in the HTML nor in a related API/JSON, etc. Is it some kind of push data?
Using Chrome's Network console I've found this api but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?
I first loaded the page with Selenium and then processed it with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
-----EDITED----
A more efficient way would be to create the request with some custom headers (edit: already confirmed this is not working):
import requests
headers = {
    "Accept": "*/*",
    "Host": "www.dr.dk",
    "Referer": "https://www.dr.dk/nyheder",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()