Error 403 when web scraping with Python and Beautiful Soup

'''
import requests
from bs4 import BeautifulSoup

# Seller page I am trying to scrape
url_ = "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"

# Spoof a desktop browser user agent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get(url_, headers=headers)
'''
I am trying to scrape this website: "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"
Everything was working fine until yesterday, but suddenly I started getting a 403 error.
I have tried proxies and a VPN, but the error persists.

When scraping a website, be mindful of its anti-DDoS protection. One form of DDoS attack is submitting many requests at once (for example, by refreshing repeatedly), which increases the server's load and hinders its performance. A web scraper does exactly that as it works through each link, so the website can mistake your bot for a DDoS attacker and block its IP address, making access from that IP address FORBIDDEN (error 403).
Usually the block is only temporary, so after 12 or 24 hours (or however long the website sets its block period) you should be good to go. If you want to avoid a future 403 FORBIDDEN error, consider sleeping for about 10 seconds between requests.
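A minimal sketch of that throttling, assuming you loop over a list of pages you want to fetch (the list below is a placeholder):
'''
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

# Placeholder list of pages to scrape -- replace with your own URLs.
urls = [
    "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412",
]

for url in urls:
    r = requests.get(url, headers=headers)
    print(url, r.status_code)
    time.sleep(10)  # pause between requests so the server is not flooded
'''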

Try using a proxy service such as Bright proxy. They have more than 72 million proxies. I think the issue will be resolved by rotating the proxy and the user agent.
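A rough sketch of rotating both with requests, assuming you maintain your own pools of proxy endpoints and user-agent strings (the proxy addresses below are placeholders):
'''
import random
import requests

# Placeholder proxy endpoints -- substitute the ones from your provider.
proxy_pool = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
]

url_ = "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"

proxy = random.choice(proxy_pool)
r = requests.get(url_,
                 headers={"User-Agent": random.choice(user_agents)},
                 proxies={"http": proxy, "https": proxy},
                 timeout=10)
print(r.status_code)
'''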

Related

How to complete geetest (captcha) when scraping with python-requests, when the request values are taken from solving the captcha manually?

I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a geetest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use Selenium (at least for now) and I'm trying to solve the problem with the Python requests module.
My idea was to complete the geetest myself and then have my program send the same request that my web browser sends after completing the slider.
To start, I scraped the HTML that the website returns when the captcha prompt appears:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built from the info in the HTML code above. As you can see in the link above,
cid is initialCid, hsh is hash, and so on; the cid part of the link is a cookie that I got at the moment the captcha appeared.
I've seen that there are services which can solve the captcha for you, but I decided to complete the captcha myself, then copy the exact request my browser sends (including cookies and headers) into my program and replay it with requests. For now I'm doing it by hand, but it doesn't work: the response is 403, while doing it manually gives 200 and a redirect.
Here is a sample request that my browser sends after completing the captcha:
sample request
I'm sending it from my program with:
import requests

s = requests.Session()
s.headers = headers                        # headers copied from the browser request
s.cookies.set(cookie_name, cookie_value)   # cookie copied from the browser after solving the captcha
captcha = s.get(request_url)               # the same URL the browser called
The response is 403 and I have no idea how to make it work. Please help.
Captchas are really tricky in the web scraping world. Most of the time you can bypass one by solving the captcha manually, taking the cookie from the returned page, and plugging it into your script. Depending on the website, that cookie could hold for 15 minutes, a day, or even longer.
The other alternative is to use a captcha-solving service such as https://www.scraperapi.com/, where you pay a fee for x amount of requests, but you won't run into the captcha issue as they solve it for you.
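A minimal sketch of the first approach, assuming you have copied the relevant cookie out of your browser's dev tools after solving the captcha (the cookie name "datadome" and its value are placeholders here -- check dev tools for the real ones):
'''
import requests

s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Referer": "https://allegro.pl/",
})

# Cookie copied manually from the browser after solving the captcha.
s.cookies.set("datadome", "PASTE_COOKIE_VALUE_HERE", domain=".allegro.pl")

r = s.get("https://allegro.pl/")
print(r.status_code)
'''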
Use a headers parameter to solve this problem, like so:
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it against the Google web cache first, then run it against the real URL.

Can we get IP banned even using Selenium?

I am using Python to scrape pages. Until now I haven't had any issues. I use Selenium for this, but I also hear that people get IP banned from some websites. I haven't faced that myself; the people affected were using the beautifulsoup, lxml and requests libraries...
Selenium feels like a real user driving the browser rather than a bot, but can it also get IP banned from some sites?
I am also setting a User-Agent header:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.132 Safari/537.36'
Yes. It depends on the requests you send to a website; when scraping, you can still get banned. Setting the user agent is a plus, because some websites won't let you in if it is not set.
If you don't want to get banned, also use a proxy IP.
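For reference, a rough sketch of setting both a custom user agent and a proxy in Selenium with Chrome (the proxy address is a placeholder):
'''
from selenium import webdriver

user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/80.0.3987.132 Safari/537.36')

options = webdriver.ChromeOptions()
options.add_argument(f'user-agent={user_agent}')
# Placeholder proxy endpoint -- substitute your own.
options.add_argument('--proxy-server=http://my.proxy.example.com:8000')

driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
print(driver.title)
driver.quit()
'''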

getting past ReadTimeout from Python Requests

I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works normally. If I navigate to the site manually in any browser, it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but that does not seem to help. From what I can see, it is not an issue of sending too many requests per time period (Selenium and manual browsing still work at any time), so I am confident this specific issue is NOT a rate limitation.
my code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    # Pick a random desktop user agent for each request
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive a response? I would prefer to use requests, but anything that runs fast, unlike Selenium, will be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' but rather 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the user agent. The reason Selenium works and requests does not is that Selenium drives a real browser to make the request, so it looks more human, while requests is much easier to detect as a script. Judging by Home Depot's robots.txt, product pages don't look like they are allowed to be scraped. I got a response by using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
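If a single user agent stops working, a small rotation can help; here is a sketch assuming you maintain your own list of realistic desktop UA strings:
'''
from random import choice
import requests

# Realistic desktop user agents to rotate through -- extend with your own.
desktop_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
]

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers={'User-Agent': choice(desktop_agents)},
    timeout=10)
print(response.status_code)
'''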

urllib getting HTML but missing data

Basically, I'm fetching HTML that displays the data I'm looking for just fine in a normal browser, but not in the HTML dump I get with urllib.
Example URL: https://betfred.mobi/sports/horses/event/4315034.2
Example data: Horse names like "She Is No Lady"
Displays just fine under a browser. Doesn't need any login or preexisting cookies or anything.
I thought maybe it was waiting to see an actual user agent or something but that should be fine as well. I'm setting one and I've checked - it's working.
import urllib2  # Python 2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36')]
response = opener.open("https://betfred.mobi/sports/horses/event/4315034.2")
print response.read()
It shows something all right, and I'm getting an HTML dump of the site, but the horse names, for example, are not showing up.
Am I missing something blindingly obvious here?
If you need to handle pages with JavaScript, try WATIR or Selenium; those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
At present, Mechanize doesn't handle JavaScript.
Your other option would be to work out what the JavaScript on the offending page does and bypass it manually, but that seems onerous.
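Since the rest of this thread is in Python, here is a short Selenium sketch for a JavaScript-rendered page like this one (the fixed sleep is a crude stand-in for a proper wait on a specific element):
'''
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://betfred.mobi/sports/horses/event/4315034.2")

# Crude wait for the JavaScript to populate the page; in practice you would
# use WebDriverWait on a specific element once you know its selector.
time.sleep(5)

html = driver.page_source  # JS-rendered markup, unlike the urllib dump
print("She Is No Lady" in html)
driver.quit()
'''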

Scrapy crawl blocked with 403/503

I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
Some sites incorporate JavaScript code that needs to be run.
Scrapy doesn't execute JavaScript, so the web app very quickly knows it's a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume the problem is JavaScript.
I think crunchbase.com uses this kind of protection against scraping.
It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
For me, cookies were already enabled.
What fixed it was using another, more common user agent.
Replace USER_AGENT in your project's settings.py file with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
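Pulling these answers together, a settings.py sketch (the delay and throttle values are illustrative, not tuned for any particular site):
'''
# settings.py -- illustrative values combining the fixes mentioned above

# Present a common desktop browser user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Keep cookies enabled so session cookies set by the site are sent back
COOKIES_ENABLED = True

# Adapt the crawl rate to the server's responses and add a base delay
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 2
'''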
