How to do scraping from a page with BeautifulSoup - python

The question is very simple, but my code doesn't work and I don't know why!
I want to scrape the beer rating from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.
This is my code:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url).text
soup = BeautifulSoup(test_html, "lxml")
rating = soup.findAll("span", class_="ratingValue")
rating
When I run it, it doesn't work, but if I do the same thing with another page it works... I don't know why. Can someone help me? The rating should be 4.58.
Thanks everybody!

If you print the test_html, you'll find you get a 403 forbidden response.
You should add a header (at least a user-agent : ) ) to your GET request.
import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'html5lib')
rating = soup.find('span', {'itemprop': 'ratingValue'})
print(rating.text)
# 4.58
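
A quick way to confirm the 403 without printing the whole body is to look at the status code before handing the text to BeautifulSoup. The sketch below is only an illustration and not part of the answer above: `ensure_ok` and `BlockedError` are hypothetical names, and the status code and body are passed in as plain values so the example runs without a network call.

```python
from http import HTTPStatus


class BlockedError(RuntimeError):
    """Raised when the server answered with a non-2xx status (hypothetical helper)."""


def ensure_ok(status_code, body):
    """Return body for a 2xx status; otherwise raise with a readable reason."""
    if 200 <= status_code < 300:
        return body
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown status"
    raise BlockedError(f"HTTP {status_code} {phrase}")


# Simulated values standing in for response.status_code and response.text.
try:
    ensure_ok(403, "<html>Forbidden</html>")
except BlockedError as error:
    message = str(error)
    print(message)  # HTTP 403 Forbidden
```

With requests you would pass `response.status_code` and `response.text`; calling `response.raise_for_status()` performs a similar check and raises `requests.HTTPError` on 4xx and 5xx responses.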

You are getting a forbidden status code (HTTP error 403), which means the server will not fulfill your request despite understanding it. You will often get this error when you try to scrape popular websites, many of which have security features to block bots. So you need to disguise your request!
To do that, you need to send headers.
You also need to correct the tag attribute you are searching on, i.e. itemprop.
Use lxml as your tree builder, or any other of your choice.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
# Add this
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'lxml')
rating = soup.find('span', {'itemprop':'ratingValue'})
print(rating.text)
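
The selector change matters as much as the header. The sketch below uses a made-up one-line HTML fragment (an assumption about the page's markup, based on the `itemprop` lookup in the answers) to show why the question's `class_="ratingValue"` finds nothing while the `itemprop` filter matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ in detail.
html = '<div><span itemprop="ratingValue">4.58</span></div>'
soup = BeautifulSoup(html, "html.parser")

# class_ filters on the class attribute, which this span does not have.
by_class = soup.find_all("span", class_="ratingValue")

# A dict filters on arbitrary attributes such as itemprop.
by_itemprop = soup.find("span", {"itemprop": "ratingValue"})

print(by_class)          # []
print(by_itemprop.text)  # 4.58
```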

The page you are requesting responds with 403 Forbidden, so you might not get an error, but you will get an empty result, []. To avoid this, add a user agent; this code will get you the desired result.
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers={'User-Agent':user_agent}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")
rating = soup.find('span', {'itemprop':'ratingValue'})
rating.text
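
For context on why the default request gets refused: Python's standard library identifies itself as `Python-urllib/<version>` unless told otherwise. The sketch below only builds the objects and inspects their headers; nothing is sent over the network. The shortened User-Agent string is an arbitrary placeholder, not a recommended value.

```python
import urllib.request

# The default opener announces itself as "Python-urllib/<version>".
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])

# Building a Request with an explicit header overrides that default.
url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (placeholder)"})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```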

import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'html5lib')
rating = soup.find('span', {'itemprop': 'ratingValue'})
print(rating.text)

You are facing this error because some websites can't be scraped with BeautifulSoup alone. For these kinds of websites, you have to use Selenium.
Download the latest ChromeDriver for your operating system from this link.
Install Selenium with this command: "pip install selenium"
# import required modules
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time, os
current_dir = os.getcwd()
print(current_dir)
# join the web driver path with your current dir; on Windows, change "/" to "\".
# make sure you placed chromedriver in the current directory
# (this positional path argument is the Selenium 3 style; Selenium 4 takes the path through a Service object)
driver = webdriver.Chrome(current_dir+'/chromedriver')
# driver.get opens the URL in your browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)
# fetch the HTML from the driver
super_html = driver.page_source
# now parse the raw HTML with 'html.parser'
soup=BeautifulSoup(super_html,"html.parser")
rating = soup.findAll("span",itemprop="ratingValue")
rating[0].text

Related

I am trying to get one element of a website but it prints "none" (Python Requests)

from bs4 import BeautifulSoup
import requests
url = "https://www.gamerdvr.com/gamer/cookz/videos"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
element = soup.find('span', id_="most-recorded")
print(element)
This always prints "None", but when I go to the website I can see the element. I even deleted all cookies and it's still there.
Without specifying a user agent, the site does not give you the tag you need.
from bs4 import BeautifulSoup
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = "https://www.gamerdvr.com/gamer/cookz/videos"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
element = soup.find('span', {'id': "most-recorded"}).get_text(strip=True)
print(element)
OUTPUT:
Fortnite
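
The header is one difference between the two snippets; the keyword is another. BeautifulSoup spells `class_` with a trailing underscore because `class` is a reserved word in Python, but there is no matching `id_` shortcut: an unrecognized keyword is treated as an attribute name, so `id_="most-recorded"` looks for an attribute literally called `id_`. The sketch below shows this on a made-up fragment (the markup is an assumption, not copied from the site).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page.
html = '<p><span id="most-recorded">Fortnite</span></p>'
soup = BeautifulSoup(html, "html.parser")

wrong = soup.find("span", id_="most-recorded")  # matches an attribute named "id_"
right = soup.find("span", id="most-recorded")   # matches the real id attribute

print(wrong)                       # None
print(right.get_text(strip=True))  # Fortnite
```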

Element is not in response Python Requests

I would like to scrape the last odds in the archive from this page https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/ but I can't get them with requests. How can I get them without using Selenium?
To trigger the archive odds request in the Developer Tools, I need to hover over the odds.
Code
url = "https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/14/"
headers = {
"Referer": "https://www.betexplorer.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
Json = requests.get(url, headers=headers).json()
As the site is loaded by JavaScript, requests alone doesn't work. I used Selenium to load the page and extract the complete source code after everything has loaded.
Then I used BeautifulSoup to create a soup object and get the required data.
From the source code you can see that the data-bid values of the <tr> elements are what is passed to get the odds data.
I extracted all the data-bid values and passed them, one by one, to the URL you provided at the very end of your question.
This code will get all the odds data in JSON format:
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
base_url = 'https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/'
driver = webdriver.Chrome()
driver.get(base_url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
t = soup.find('table', attrs= {'id': 'sortable-1'})
trs = t.find('tbody').findAll('tr')
for i in trs:
    data_bid = i['data-bid']
    url = f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{data_bid}/"
    headers = {"Referer": "https://www.betexplorer.com",'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
    Json = requests.get(url, headers=headers).json()
    # Do what you wish to do with the JSON data here....

BeautifulSoup returning 'None' object type

After watching a video, I tried to fetch the price of an item from amazon.de using the BeautifulSoup API.
#My CODE
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text()
price = soup.find(id='priceblock_ourprice')
print(title) #returns correct string from the URL above
print(price)
#returns 'None'. Unexpected. Expecting price with some extensions from <span id="priceblock_ourprice"
If anyone can find something wrong in my code, it would be really helpful.
Thanks in advance!
I cannot reproduce the 'None'; the code works fine. I just added get_text() to the price and strip() to both variables to make the result a little cleaner.
import requests, time
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
'Cache-Control': 'no-cache'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text().strip()
# to prevent the script from crashing when there isn't a price for the product
try:
    price = soup.find(id='priceblock_ourprice').get_text().strip()
    # keep only the numeric part of the price by slicing (it is still a string)
    convertedPrice = price[:8]
except AttributeError:
    price = 'not loaded'
    convertedPrice = 'not loaded'
print(title) #returns correct string from the URL above
print(price)
print(convertedPrice)
Output
Neues Apple iPhone 12 Pro (128 GB) - Graphit
1.120,00 €
1.120,00
However, as @Chase mentioned, if the content is dynamically generated, you may want to give Selenium a try. It can handle this with its waits: by adding a delay, you can wait until the page, including the dynamically generated content, has loaded and then grab your information.
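
Returning to the price printed in the output above: slicing keeps the price as text. If a number is needed, the German formatting (dot as thousands separator, comma as decimal mark) has to be undone first. The helper below is a minimal sketch under that assumption; `parse_german_price` is a hypothetical name, not something from the answer.

```python
import re
from decimal import Decimal


def parse_german_price(text):
    """Convert a string such as '1.120,00 €' to a Decimal; return None if no number is found."""
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators, then turn the decimal comma into a dot.
    number = match.group(0).replace(".", "").replace(",", ".")
    return Decimal(number)


print(parse_german_price("1.120,00 €"))  # 1120.00
print(parse_german_price("not loaded"))  # None
```

Returning `None` for unparseable input keeps the 'not loaded' fallback from the answer usable without a second try/except.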

This code for Web Scraping using python returning None. Why? Any help would be appreciated

from bs4 import BeautifulSoup
import requests
headers = {'Use-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
url = 'https://www.amazon.com/Sony-Alpha-a6400-Mirrorless-Camera/dp/B07MV3P7M8/ref=sr_1_4?keywords=sony+alpha&qid=1581656953&s=electronics&sr=1-4'
page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
Your parsing code is fine, but Amazon serves a robot check before the product page. Your request receives that robot-check page, so the lookups for the product elements find nothing and return None.
Here is a link which may help you: python requests & beautifulsoup bot detection
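
One more thing worth checking in the question's code: the headers dictionary spells its key `Use-Agent`, so the `User-Agent` header is never actually overridden. HTTP header names are case-insensitive, but a misspelled name is simply a different header. The check below is a small sketch (`has_user_agent` is a hypothetical helper) for catching this before sending a request.

```python
def has_user_agent(headers):
    """Return True if the mapping sets a User-Agent header (case-insensitive match)."""
    return any(name.strip().lower() == "user-agent" for name in headers)


question_headers = {"Use-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
fixed_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

print(has_user_agent(question_headers))  # False
print(has_user_agent(fixed_headers))     # True
```

Fixing the key does not guarantee that the robot check goes away; it only ensures the request carries the header the code intended to send.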

I am unable to get all the links present on the page

I am using BeautifulSoup and urllib to extract a webpage. I have set the user agent and the cookie, and yet I fail to receive all the links from the webpage...
Here's my code:
import bs4 as bs
import urllib.request
import requests
#sauce = urllib.request.urlopen('https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93').read()
#soup = bs.BeautifulSoup(sauce,'lxml')
'''
session = requests.Session()
response = session.get(url)
print(session.cookies.get_dict())
'''
url = 'https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Cookie' : '_gh_sess=eyJzZXNzaW9uX2lkIjoiMDNhMGI2NjQxZjY4Mjc1YmQ3ZjAyNmJiODM2YzIzMTUiLCJfY3NyZl90b2tlbiI6IlJJOUtrd3E3WVFOYldVUzkwdmUxZ0Z4MHZLN3M2eE83SzhIdVJTUFVsVVU9In0%3D--4485d36d4c86aec01cde254e34db68005193546e
logged_in: no'}
response = requests.get(url,headers=headers)
print(response.cookies)
soup = bs.BeautifulSoup(response.content,'lxml')
for url in soup.find_all('a'):
    print(url.get('href'))
Is there something I'm missing? In a browser I get the links to all the code, whereas in the script I get only a few of the links, none of them to the code...
The webpage opens perfectly in the browser...
