Error when trying to do web scraping with urllib.request - python

I am trying to get the HTML of the following page: https://betway.es/es/sports/cpn/tennis/230 in order to extract the match names and the odds,
using this Python code:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://betway.es/es/sports/cpn/tennis/230'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
But when I run the code, it raises the following exception: HTTPError: HTTP Error 403: Forbidden
I have read that setting headers might fix this, but I am completely new to this module and have no idea how to use them. Any advice? In addition, even when I am able to download the page, I cannot find the odds in it. Does anyone know what the reason could be?

Unfortunately, I am in a country that this site blocks.
However, you can send a User-Agent header with the requests package:
import requests as rq
from bs4 import BeautifulSoup as bs
url = 'https://betway.es/es/sports/cpn/tennis/230'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
page = rq.get(url, headers=headers)
You can find your own browser's headers in the developer tools: F12 -> Network -> select any request -> Headers tab.
Because I cannot reach the site myself, this is only a partial answer.
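Since the question uses urllib.request, the same header can be attached there as well. The following is a minimal sketch that has not been tested against the site in question. The helper names build_request and fetch_html are made up for illustration, and the User-Agent string is simply the one from the requests example.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the requests example.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) "
    "Gecko/20100101 Firefox/86.0"
)


def build_request(url):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download url and return its body as text, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # A 403 here means the server still refuses the request.
        print(f"Request failed: {err.code} {err.reason}")
        return None
```

Calling fetch_html('https://betway.es/es/sports/cpn/tennis/230') and passing the result to BeautifulSoup(html, 'html.parser') would then replace the urlopen call in the question. Whether the header alone is enough for this site is not something the sketch can guarantee.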

Related

Limit of requests/Beautifulsoup libraries in Python

I am trying to scrape a website using requests and BeautifulSoup4 in Python, here is my code:
import re
import requests
import bs4
result = requests.get("https://wolt.com/en/svk/bratislava/restaurant/la-donuteria-bratislava")
soup = bs4.BeautifulSoup(result.content, "html5lib")
for i in soup.find_all("div", {"class": re.compile("MenuItem-module__itemContainer____.*")}):
    print(i.text)
    print()
When I do this with the given URL, I get all results. However, when I try to scrape this URL, for instance:
To be scraped
The result is truncated and I only get 43 results back. Is this a limitation of requests/BS4 or am I doing something else wrong?
Thanks
I think you are getting an error for sending too many requests. If you build the request like this, with a browser User-Agent header, I guess you will not get banned from the API. Try it and let me know!
from urllib.request import Request
req = Request(
    "https://drand.cloudflare.com/public/latest",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0"
    },
)
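One way to check that guess is to look at the status code of the response before changing anything else. The sketch below uses only the standard library. describe_status and is_rate_limited are hypothetical helper names, and the assumption that a rate-limiting server answers with 429 is a common convention, not something confirmed for this site.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short label such as '429 Too Many Requests'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Code is not in the standard library's table of known statuses.
        phrase = "Unknown status"
    return f"{status_code} {phrase}"


def is_rate_limited(status_code):
    """Return True if the server answered with 429 Too Many Requests."""
    return status_code == HTTPStatus.TOO_MANY_REQUESTS
```

With the code from the question, print(describe_status(result.status_code)) shows whether the server sent an error at all. A 200 status with fewer items than expected would point away from rate limiting.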

BeautifulSoup and MechanicalSoup won't read website

I am working with BeautifulSoup and also trying MechanicalSoup. Both load other websites fine, but when I request this website, the request takes a long time and never completes. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the Mechanicalsoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so that the request would not be identified as coming from a bot.
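For the follow-up goal of stepping through bedroom counts and cities, the URLs can be generated up front. This is only a sketch: it assumes every search page follows the same path pattern as the URL in the question, build_search_urls is a hypothetical helper, and "provo-ut" is a made-up second city slug.

```python
BASE_URL = "https://www.apartments.com/apartments"


def build_search_urls(city_slugs, bedroom_counts):
    """Return one search URL per (city, bedroom count) combination."""
    return [
        f"{BASE_URL}/{city}/{bedrooms}-bedrooms/"
        for city in city_slugs
        for bedrooms in bedroom_counts
    ]


# "saratoga-springs-ut" comes from the question; "provo-ut" is a made-up
# second slug that only illustrates moving on to another city.
urls = build_search_urls(["saratoga-springs-ut", "provo-ut"], [1, 2, 3])
for url in urls:
    print(url)
```

Each generated URL can then be passed to requests.get(url, headers=headers) in a loop, ideally with a pause between requests.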

beautiful soup returns none when the element exists in browser

I have looked through the previous answers but none seemed to be applicable. I am building an open source quizlet scraper to extract all links from a class (e.g. https://quizlet.com/class/3675834/). In this case, the tag is a and class is "UILink". But when I use the following code, the list returned does not contain the element that I am looking for. Is it because of the JavaScript issue described here?
I tried to use the previous method of importing folder as written here but it does not contain the urls.
How can I scrape these urls?
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
url = 'https://quizlet.com/class/8536895/'
response = requests.get(url, verify=False, headers=headers)
soup = BeautifulSoup(response.text,'html.parser')
b = soup.find_all("a", class_="UILink")
You cannot scrape dynamic web pages directly using just requests. What you see in the browser is the fully rendered page, which the browser produces by running the page's JavaScript.
In order to scrape data from these kinds of web pages, you can follow either of the approaches below.
Use requests-html instead of requests
pip install requests-html
scraper.py
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
url = 'https://quizlet.com/class/8536895/'
response = session.get(url)
response.html.render() # render the webpage
# access html page source with html.html
soup = BeautifulSoup(response.html.html, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
Note: this uses a headless browser (Chromium) under the hood to render the page, so it can time out or be a little slow at times.
Use selenium webdriver
Use driver.get(url) to load the page and pass driver.page_source to BeautifulSoup.
Note: run this in headless mode as well; there might be some latency at times.
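The Selenium approach could look roughly like the sketch below. It assumes Selenium 4 with a local Chrome installation. fetch_rendered_links is a hypothetical helper name, and the "UILink" class comes from the question. Pages that load their content late may additionally need an explicit wait before page_source is read.

```python
def fetch_rendered_links(url, css_class="UILink"):
    """Render url in headless Chrome and return the href of each matching <a>."""
    # Third-party imports are kept inside the function so the module can be
    # imported even on machines without Selenium or BeautifulSoup installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser loads the page and runs its JavaScript
        html = driver.page_source  # HTML as the browser currently sees it
    finally:
        driver.quit()  # always close the browser, even on errors

    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a", class_=css_class)]
```

A call such as fetch_rendered_links('https://quizlet.com/class/8536895/') would return a list of link targets, or an empty list if the site serves a different page to automated browsers.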

BeautifulSoup 'find()' returns NoneType Value

I've just started to try and code a price tracker with Python, and have already run into an error I don't understand. This is the code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0."
                         "4103.116 Safari/537.36"}
targetPrice = 150
def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()  # Error happens here
    print(price)
if True:
    getPrice()
I see that soup.find(id="priceblock_ourprice") returns None, hence the AttributeError. I don't understand why it returns None. The code worked and printed the product price only ONCE, and never again. I ran the script again after that single successful attempt without changing anything, and have consistently gotten the AttributeError since.
I've also tried the following:
Used html5lib and lxml instead of html.parser.
Different IDs, to see if I can access different parts of the site.
Other User Agents.
I also downloaded a similar program from GitHub that uses the exact same code to see if it would run, but it didn't either.
What is happening here? Any help would be appreciated.
You're getting a captcha page. Try setting more HTTP headers, as a browser does, to get the correct page. When I set the Accept-Language HTTP header, I can no longer reproduce the error:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept-Language': 'en-US,en;q=0.5',
}
def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()
    print(price)
getPrice()
Prints:
$195.99
Try printing soup after soup = BeautifulSoup(page.content, 'html.parser').
Amazon knows that you are trying to crawl them, so the page they return is not the one you think it is.
Getting blocked when scraping Amazon (even with headers, proxies, delay)
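Both answers imply that the element may simply be absent from the page that was returned. A defensive version of the lookup avoids the AttributeError and lets the caller inspect the page instead. This is a sketch: get_price is a hypothetical helper that accepts an already parsed BeautifulSoup object.

```python
def get_price(soup, element_id="priceblock_ourprice"):
    """Return the price text, or None if the element is not on the page."""
    tag = soup.find(id=element_id)
    if tag is None:
        # Likely a captcha or otherwise different page; let the caller decide.
        return None
    return tag.get_text(strip=True)
```

A caller would use price = get_price(BeautifulSoup(page.content, 'html.parser')) and, when the result is None, print or save the soup to see which page actually came back.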

Trying to parse a div class but can't get the correct results

I'm trying to parse the div class titled "dealer-info" from the URL below.
https://www.nissanusa.com/dealer-locator.html
I tried this:
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Normally, I would expect that to work, but I'm getting this result: HTTPError: Forbidden
I also tried this:
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # The assembled request
response = urllib.request.urlopen(request)
data = response.read()  # The data you need
print(data)
That gives me all the HTML on the site, but it's pretty ugly to look at and hard to make any sense of.
I'm trying to get a structured data set of "dealer-info". I am using Python 3.6.
You might be being rejected by the server in your first example due to not pretending to be an ordinary browser. You should try combining the user agent code from the second example with the Beautiful Soup code from the first:
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # The assembled request
response = urllib.request.urlopen(request)
text = response.read()
soup = BeautifulSoup(text, "lxml")
data = soup.findAll('div', attrs={'class': 'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Keep in mind that if the web site is explicitly trying to keep Beautiful Soup or other non-recognized user agents out, they may take issue with you scraping their web site data. You should consult and obey https://www.nissanusa.com/robots.txt as well as any terms of use or terms of service agreements you may have agreed to.
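Checking robots.txt can be automated with the standard library. The sketch below parses a set of rules and asks whether a URL may be fetched. is_allowed is a hypothetical helper, and the rules shown are invented for illustration; they are not the real contents of the Nissan file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that were already downloaded
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; they are NOT the real contents of
# https://www.nissanusa.com/robots.txt.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(example_rules, "my-scraper", "https://example.com/private/page.html"))
print(is_allowed(example_rules, "my-scraper", "https://example.com/dealer-locator.html"))
```

For the live file, calling parser.set_url('https://www.nissanusa.com/robots.txt') followed by parser.read() loads the rules over the network. Note that if that download is itself rejected with 401 or 403, the parser treats every URL as disallowed.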
