I am trying to scrape Reddit pages for the videos. I am using Python and BeautifulSoup to do the job. The following code sometimes returns the result and sometimes does not when I rerun it. I'm not sure where I'm going wrong. Can someone help? I'm a newbie to Python, so please bear with me.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
If you do print(page) after your page = requests.get('https:/.........'), you'll see you get a successful <Response [200]>.
But if you run it again quickly, you'll get <Response [429]>.
"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." Source here
Additionally, if you look at the HTML source, you'll see:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
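If you just want to respect that limit, one option is to back off and retry when you see a 429. Here is a minimal sketch of that idea (the retry count and the 2-second pause are only illustrative choices, not reddit-mandated values):

import time
import requests

url = ('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/'
       'just_trying_to_revive_my_buddy_and_then_he_got/')

# Retry a few times, sleeping between attempts so we stay under
# reddit's suggested one-request-every-two-seconds limit.
for attempt in range(3):
    page = requests.get(url)
    if page.status_code == 429:
        time.sleep(2)  # back off and try again
        continue
    break

print(page.status_code)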
To avoid the 429, add headers to your request:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
Full code:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print (page)
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
Output:
<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]
I have had no issues rerunning it multiple times after waiting a second or two.
I have tried the code below and it works for me on every request. I added a timeout of 30 seconds.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
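For what it's worth, timeout=30 doesn't slow anything down; it just makes requests give up by raising an exception if no response arrives within roughly 30 seconds, instead of hanging indefinitely. A minimal sketch of catching that, using the same URL:

import requests
from bs4 import BeautifulSoup

url = 'https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/'

try:
    # Raises requests.exceptions.Timeout if the server is too slow to respond
    page = requests.get(url, timeout=30)
except requests.exceptions.Timeout:
    print('request timed out after 30 seconds')
else:
    soup = BeautifulSoup(page.text, 'lxml')
    print(soup.find_all('source'))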
The code below extracts data from Zillow Sale.
My first question is: where do people get the headers information?
My second question is: how do I know when I need headers? For some other pages, like Cars.com, I don't need to pass headers=headers and I can still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url ='https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
           'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination'
           }
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Load the page content into BeautifulSoup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get the headers by going to the site in your browser and opening the Network tab of the developer tools; select a request there and you can see the headers that were sent with it.
Some websites don't serve bots, so to make them think you're not a bot you set the User-Agent header to one a browser uses. Some sites may require more headers for you to pass the "not a bot" test. You can see all the headers being sent in the developer tools, and you can test different headers until your request succeeds.
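As a minimal sketch of that trial-and-error approach (the header values below are only examples copied from a browser, and the shortened Zillow URL is just a placeholder; use whatever your own developer tools show and add more entries such as referer if you are still blocked):

import requests

# Example headers copied from a browser's developer tools (Network tab).
# These exact values are illustrative; use the ones your browser sends.
headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'),
    'accept-language': 'en-US,en;q=0.9',
}

r = requests.get('https://www.zillow.com/baltimore-md-21201/', headers=headers)
print(r.status_code)  # keep adding headers until this stops indicating a block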
From your browser, go to this website: http://myhttpheader.com/
You will find the headers info there.
Secondly, you only need to provide headers when a website like Zillow blocks you from scraping data.
I am dealing with BeautifulSoup and also trying it with MechanicalSoup, and I have got it to load other websites, but when I request this particular website it takes a long time and then never really gets it. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the MechanicalSoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, and then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code to ensure that the request wasn't read as coming from a bot.
Hi, I would like to know why my app is giving me this error. I've already tried everything I found on Google and still have no idea why it's happening.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/ref=sr_1_1_sspa?dchild=1&keywords=type+c&qid=1598485860&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE3TDNSNUlITUNKTUMmZW5jcnlwdGVkSWQ9QTAwNDg4MTMyUlFQN0Y4RllGQzE2JmVuY3J5cHRlZEFkSWQ9QTAxNDk0NzMyMFNLSUdPU0taVUpRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
When called:
C:\Proyecto nuevo>Python main.py
None
So if anyone would like to help me, that would be amazing!
If you look at the code of the webpage you are trying to scrape, you will find that it is pretty much all JavaScript that populates the page when it loads. The requests library gets this code and doesn't run it. Your find(id="productTitle") gets None because the fetched code doesn't contain that element.
To scrape this page you will have to run the JavaScript on it. You can check out the Selenium WebDriver in Python to do this.
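A minimal sketch of that approach (this assumes Chrome and a matching ChromeDriver are installed; the URL is the one from your question with the long tracking parameters trimmed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/'

options = Options()
options.add_argument('--headless')  # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)  # loads the page and runs its JavaScript
    title = driver.find_element(By.ID, 'productTitle')
    print(title.text)
finally:
    driver.quit()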
I've just started trying to code a price tracker with Python, and have already run into an error I don't understand. This is the code:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0."
                         "4103.116 Safari/537.36"}
targetPrice = 150

def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()  # Error happens here
    print(price)

if True:
    getPrice()
I see that soup.find(id="priceblock_ourprice") returns None, hence the AttributeError. I don't understand why it returns None. Only ONCE did the code actually work and print the product price, and never again. I ran the script again after the single successful attempt without changing anything, and got the AttributeError consistently again.
I've also tried the following:
Used html5lib and lxml instead of html.parser.
Different id's, to see if I can access different parts of a site.
Other User Agents.
I also downloaded a similar program from github that uses the exact same code to see if it would run, but it didn't either.
What is happening here? Any help would be appreciated.
You're getting a captcha page. Try setting more HTTP headers, as a browser would, to get the correct page. When I set the Accept-Language HTTP header I can no longer reproduce the error:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept-Language': 'en-US,en;q=0.5',
}

def getPrice():
    page = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()
    print(price)
getPrice()
Prints:
$195.99
Try printing soup after soup = BeautifulSoup(page.content, 'html.parser').
Amazon knows that you are trying to crawl them, so the page you think they are returning is not the page you actually get.
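A quick way to confirm this is to look at the raw response before parsing; a minimal sketch (the "captcha" substring check is just a heuristic for spotting Amazon's robot-check page, not an official API):

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Corsair-Platinum-Mechanical-Keyboard-Backlit/dp/B082GR814B/'
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept-Language': 'en-US,en;q=0.5',
}

page = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

# If Amazon served a robot check instead of the product page, the price
# element will be missing and the page usually mentions a captcha.
if 'captcha' in page.text.lower():
    print('got a captcha / robot-check page; try more browser-like headers')
else:
    price = soup.find(id="priceblock_ourprice")
    print(price.get_text() if price else 'price element not found')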
Getting blocked when scraping Amazon (even with headers, proxies, delay)
I would like to scrape Amazon's top 10 bestsellers in baby-products.
I want just the title text, but it seems that I have a problem.
I'm getting 'None' when I try this code.
After getting "result" I want to iterate over it using "content" and print the titles.
Thanks!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.amazon.com/gp/bestsellers/baby-products"
r=requests.get(url, headers=headers)
print("status: ", r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
print("url: ", r.url)
result = soup.find("ol", {"id": "zg-ordered-list"})
content = result.findAll("div", {"class": "a-section a-spacing-none aok-relative"})
print(result)
print(content)
You won't be able to scrape the Amazon website in this way. You are using requests.get to get the HTTP response body of the URL provided. Pay attention to what that response actually is (e.g. by print(r.content)). What you see in your web browser is different from the raw HTTP response because of the client-side rendering technologies Amazon uses (typically JavaScript, among others).
I advise you to use Selenium, which sort of "emulates" a typical browser inside the Python runtime, renders the site as a normal browser would, and lets you access properties of the same website you see in your web browser.
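A minimal sketch of that approach (this assumes Chrome and a matching ChromeDriver are installed; the zg-ordered-list id and the item class come from your own snippet, so they may need adjusting if Amazon changes its markup):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    # Selenium drives a real browser, so the JavaScript that builds
    # the bestseller list actually runs before we read the elements.
    driver.get('https://www.amazon.com/gp/bestsellers/baby-products')
    items = driver.find_elements(
        By.CSS_SELECTOR,
        '#zg-ordered-list div.a-section.a-spacing-none.aok-relative')
    for item in items[:10]:
        print(item.text)
finally:
    driver.quit()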