I am trying to scrape a website using requests and BeautifulSoup4 in Python. Here is my code:
import re

import requests
import bs4

result = requests.get("https://wolt.com/en/svk/bratislava/restaurant/la-donuteria-bratislava")
soup = bs4.BeautifulSoup(result.content, "html5lib")

for i in soup.find_all("div", {"class": re.compile("MenuItem-module__itemContainer____.*")}):
    print(i.text)
    print()
When I do this with the given URL I get all the results. However, whenever I try to scrape this URL, for instance:
To be scraped
The result is truncated and I only get 43 results back. Is this a limitation of requests/BS4 or am I doing something else wrong?
Thanks
I think you are getting an error for too many requests. I guess if you use this request you will not get banned from the API; use it and tell me!
from urllib.request import Request

req = Request(
    "https://drand.cloudflare.com/public/latest",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0"
    },
)
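For completeness, here is a rough sketch of the same idea applied to the requests call from the question. It only adds a User-Agent header to the original Wolt request; whether that actually fixes the truncation depends on how the site serves its menu, so treat it as an untested suggestion rather than a confirmed fix:

import re

import requests
import bs4

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0"
}
result = requests.get(
    "https://wolt.com/en/svk/bratislava/restaurant/la-donuteria-bratislava",
    headers=headers,
)
soup = bs4.BeautifulSoup(result.content, "html5lib")

# Same class pattern as in the question; it may change as the site is updated.
for item in soup.find_all("div", {"class": re.compile("MenuItem-module__itemContainer____.*")}):
    print(item.text)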
I am working with BeautifulSoup and also trying MechanicalSoup, and I have gotten it to load other websites, but when I request this particular website the request takes a long time and never actually returns. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the MechanicalSoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, and then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so the request was not treated as coming from a bot.
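If you also want to keep the MechanicalSoup approach from the question, the same User-Agent trick should apply there too. This is only a rough, untested sketch: it assumes MechanicalSoup's Browser exposes its underlying requests session, and it does not parse any listing data from the page:

import mechanicalsoup

# Sketch: set the User-Agent on the session MechanicalSoup uses internally.
browser = mechanicalsoup.Browser()
browser.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
print(page.soup.title)  # just confirm we got real HTML back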
I am trying to get the HTML of the following page, https://betway.es/es/sports/cpn/tennis/230, in order to get the matches' names and the odds, with this Python code:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://betway.es/es/sports/cpn/tennis/230'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
But when I run the code it throws the following exception: HTTPError: HTTP Error 403: Forbidden
I have seen that it might be possible with headers, but I am completely new to this module, so I have no idea how to use them. Any advice? In addition, even when I am able to download the URL, I cannot find the odds; does anyone know what the reason could be?
Unfortunately, I'm in a country that is blocked by this site.
But, using the requests package:
import requests as rq
from bs4 import BeautifulSoup as bs
url = 'https://betway.es/es/sports/cpn/tennis/230'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
page = rq.get(url, headers=headers)
You can find your headers in the browser dev tools: F12 -> Network tab -> any request -> Headers tab.
It is, as a result, a partial answer.
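Since I can't load the page from here, the following is only a rough sketch of how you might check two things: whether the 403 is gone once the header is set, and whether the odds appear in the raw HTML at all. If they don't, they are almost certainly rendered client-side by JavaScript, and you would need to look for the underlying API call in the Network tab instead. No real selectors for the site are assumed here:

import requests as rq
from bs4 import BeautifulSoup as bs

url = 'https://betway.es/es/sports/cpn/tennis/230'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}

page = rq.get(url, headers=headers)
print(page.status_code)  # 200 means the 403 is gone

soup = bs(page.text, 'html.parser')
print(soup.title)

# Crude check: if a match name you can see in the browser is not in the raw
# HTML, the odds are being loaded by JavaScript after the page is served.
print('odds' in page.text.lower())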
I've just started to try and code a price tracker with Python, and have already run into an error I don't understand. This is the code:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0."
                         "4103.116 Safari/537.36"}
targetPrice = 150

def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()  # Error happens here
    print(price)

if True:
    getPrice()
I can see that soup.find(id="priceblock_ourprice") returns None, hence the AttributeError. I don't understand why it returns None. Only ONCE did the code actually work and print the product price, and never again. I ran the script again after the single successful attempt without changing anything, and got the AttributeError consistently again.
I've also tried the following:
Used html5lib and lxml instead of html.parser.
Different id's, to see if I can access different parts of a site.
Other User Agents.
I also downloaded a similar program from github that uses the exact same code to see if it would run, but it didn't either.
What is happening here? Any help would be appreciated.
You're getting a captcha page. Try setting more HTTP headers, as a browser would, to get the correct page. When I set the Accept-Language HTTP header I can no longer reproduce the error:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept-Language': 'en-US,en;q=0.5',
}

def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()
    print(price)

getPrice()
Prints:
$195.99
Try printing soup after soup = BeautifulSoup(page.content, 'html.parser').
Amazon knows that you are trying to crawl them, so the page they return is not the page you think you are getting.
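Building on that suggestion, a rough way to confirm you are being served a captcha/robot page rather than the product page is to inspect what actually came back. The "Robot Check" marker below is an assumption based on Amazon's usual captcha page title and may differ:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept-Language': 'en-US,en;q=0.5',
}

page = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

# A captcha page typically shows a title like "Robot Check" (assumed marker)
# instead of the product name.
print(soup.title)
if soup.find(id="priceblock_ourprice") is None:
    print("Price element not found - probably a captcha/robot page")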
I am trying to make a price tracker for Amazon by following a YouTube tutorial. I am new to Python and web scraping. Somehow I wrote this code, and it should return the product name, but instead it gives me "None" as output. Can you please help me with this?
I tried with different URLs and it is still not working.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/57.36 (HTML, like Gecko) Chrome/75.0.30.100 Safari/537.4'}

page = requests.get(URL, headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
I was inspecting the returned HTML and realized that Amazon sends (somewhat malformed?) HTML that trips the default html.parser, but using lxml I was able to scrape the title just fine.
import requests
from bs4 import BeautifulSoup


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'lxml')


def parse_product_page(soup: BeautifulSoup) -> dict:
    title = soup.select_one('#productTitle').text.strip()
    return {
        'title': title
    }


if __name__ == "__main__":
    url = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'
    soup = make_soup(url)
    info = parse_product_page(soup)
    print(info)
output:
{'title': "Nike Men's Zoom Rival M 9 Track and Field Shoes"}
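If you also want the price for the tracker, the same helper can be extended. This is only a sketch meant to replace parse_product_page above: the #priceblock_ourprice id is carried over from the earlier Amazon question in this thread, and Amazon's markup changes frequently, so the price may simply come back as None:

from bs4 import BeautifulSoup


def parse_product_page(soup: BeautifulSoup) -> dict:
    # '#priceblock_ourprice' is an assumption taken from the earlier question;
    # the element may be renamed or missing on some product pages.
    title = soup.select_one('#productTitle').text.strip()
    price_el = soup.select_one('#priceblock_ourprice')
    return {
        'title': title,
        'price': price_el.text.strip() if price_el else None,
    }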
You can make your locator more specific using .select(). You need to change the parser as well.
Try this instead:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'

page = requests.get(URL, headers={"User-Agent": 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'lxml')  # make sure you use the "lxml" or "html5lib" parser instead of "html.parser"
title = soup.select_one("h1 > #productTitle").get_text(strip=True)
print(title)
Output:
Nike Men's Zoom Rival M 9 Track and Field Shoes
Bot detection is pretty pervasive these days. No major site with any data worth mining, especially retail, is going to let you use requests on their site.
You're going to have to at the very least use Selenium / ChromeDriver to get a response from any reputable site. Even then, if they use something like Distil for bot detection, they will stop even Selenium.
Try a less popular site with Selenium, and you will get data back.
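As a rough illustration of that suggestion, here is a minimal Selenium sketch. It assumes Selenium 4+ with a matching ChromeDriver on your PATH, and it reuses the productTitle id from the question, which may still be hidden behind a captcha or renamed by the site:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Sketch only: requires ChromeDriver installed and on PATH.
driver = webdriver.Chrome()
try:
    driver.get('https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/')
    # Same element id as in the question; not guaranteed to be present.
    title = driver.find_element(By.ID, 'productTitle').text
    print(title)
finally:
    driver.quit()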
I am trying to scrape Reddit pages for the videos. I am using Python and Beautiful Soup to do the job. The following code sometimes returns the result and sometimes does not when I rerun it. I'm not sure where I'm going wrong. Can someone help? I'm a newbie to Python, so please bear with me.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
If you do print(page) after your page = requests.get('https:/.........'), you'll see you get a successful <Response [200]>
But if you run it quickly again, you'll get the <Response [429]>
"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." Source here
Additionally, if you look at the HTML source, you'd see:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
To add headers and avoid the 429, add:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
Full code:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print (page)
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
Output:
<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]
I have had no issues rerunning it multiple times after waiting a second or two.
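If you end up scraping several threads in a loop, pausing between requests in line with the "one request every two seconds" guidance quoted above should also help. A rough sketch, where the list of URLs is just a placeholder for whatever threads you actually want:

import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

# Placeholder list - substitute the threads you want to scrape.
urls = [
    'https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/',
]

for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup.find_all('source'))
    time.sleep(2)  # stay under Reddit's suggested one request every two seconds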
I have tried the code below and it works for me on every request. I added a timeout of 30 seconds.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)

if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
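If you still hit the occasional 429, a simple retry with a pause is one option. This is just a sketch combining the timeout above with the rate-limit advice from the other answer; the number of attempts and the backoff are arbitrary choices:

import time

import requests
from bs4 import BeautifulSoup

url = 'https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/'

page = None
for attempt in range(3):
    page = requests.get(url, timeout=30)
    if page.status_code == 200:
        break
    time.sleep(2 * (attempt + 1))  # wait a little longer before each retry

if page is not None and page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    print(soup.find_all('source'))
else:
    print('Giving up:', page.status_code if page is not None else 'no response')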