Scrapy crawl blocked with 403/503 - python

I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
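For reference, the user-agent rotation is set up roughly like this (RotateUserAgentMiddleware is a third-party snippet, so the sketch below only approximates it; the myproject.middlewares path and the agent list are illustrative):
# settings.py (Scrapy 0.24-style middleware paths)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}

# middlewares.py
import random

class RotateUserAgentMiddleware(object):
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # swap in a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)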

Some sites incorporate JavaScript code that needs to be run.
Scrapy doesn't execute JavaScript, so the web app can tell very quickly that it's a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume the problem is the JavaScript.
I think crunchbase.com uses this kind of protection against scraping.
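A quick way to check is to fetch one of the blocked URLs with Selenium and compare. A minimal sketch (the URL is a placeholder for one of your 403 pages):
# if this returns the real page while Scrapy gets 403/503,
# the block is probably tied to JavaScript or browser fingerprinting
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.example.com/blocked-page')  # placeholder URL
print(driver.page_source[:500])
driver.quit()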

It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
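For anyone else hitting this, the relevant settings in settings.py are simply:
# settings.py
COOKIES_ENABLED = True   # on by default; the crawl failed while this was off
COOKIES_DEBUG = True     # optional: log Cookie / Set-Cookie headers to verify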

For me, cookies were already enabled.
What fixed it was using a different, common user agent.
Replace USER_AGENT in your project's settings.py with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
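For reference, a minimal throttling setup in settings.py looks like this (the delay values are just examples):
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5    # initial download delay, example value
AUTOTHROTTLE_MAX_DELAY = 60     # maximum delay under high latency, example value
DOWNLOAD_DELAY = 1              # optional fixed baseline delay between requests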

Related

Getting reCaptcha in python requests

I am using the Python requests module to send some requests to Google, but after a few requests a reCaptcha pops up. I am setting a user agent but it still pops up!
What should I do?
Setting the user agent did change which browser the site sees, but it had no effect on the captcha problem.
from time import sleep
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
sleep(2)
# note: keyword and site are defined elsewhere; proxies is built but not passed to requests.get
file = requests.get(f'https://www.google.com/search?q=contact+email+{keyword}+site:{site}&num=100', headers=headers)
I used sleep but in vain. Any suggestions?
That’s kind of the entire point of captchas. They help deter bots and spammers. Most captchas can’t be bypassed easily, so just changing the user agent won’t make the captcha go away. Since it sounds like the captchas only appear after a certain number of requests, you could use rotating residential proxies and change the session’s IP address whenever a captcha is detected.
Alternatively, you can use a captcha solving service like Anti-Captcha or DeathByCaptcha which involves parsing information about the captcha and then sending it to a service that has workers manually complete it for you. It’s not exactly convenient or efficient, though, and it can often take up to ~30 seconds for a worker to complete a single captcha. Both options cost money.
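A rough sketch of the rotate-on-captcha idea with requests (the proxy list is a placeholder; with a residential provider you would pull addresses from their API, and the captcha check would need to match Google's actual response):
import random
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
proxy_pool = ['http://10.10.1.10:3128', 'http://10.10.1.11:3128']  # placeholder proxies

def fetch(url, retries=5):
    for _ in range(retries):
        proxy = random.choice(proxy_pool)          # new IP for each attempt
        resp = requests.get(url,
                            headers={'User-Agent': user_agent},
                            proxies={'http': proxy, 'https': proxy},
                            timeout=10)
        # Google usually redirects to a /sorry/ page when it wants a captcha
        if '/sorry/' not in resp.url and resp.status_code == 200:
            return resp
    return None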

Python requests in IIS always times out

I have a Flask application running on an IIS server. Everything works fine; however, I always get a timeout error when using requests.
import requests
r = requests.get('https://github.com')
Calling external web services is therefore impossible.
I have tried sending headers with the request, but I get the same result:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://github.com', headers=headers)
I also tried increasing the timeout limits, both in code and in IIS.
I also tried changing the Identity field under the Process Model section to LocalSystem.
I'm not familiar with IIS and I cannot think of anything else. I need help.
Based on your description, I don't think this issue is related to IIS; it looks like a network issue.
I suggest you first check your server's firewall to make sure the server is allowed to access the internet.
If you need to use a proxy to access the internet, try adding the settings below to the web.config of your Flask application.
<system.net>
  <defaultProxy>
    <proxy
      proxyaddress="The IP address"
      bypassonlocal="true"
    />
  </defaultProxy>
</system.net>
For details, see this article.
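If outbound traffic does have to go through a proxy, you can also pass it explicitly on the Python side instead of (or in addition to) the web.config change; the address below is a placeholder:
import requests

proxies = {
    'http': 'http://your.proxy.address:8080',    # placeholder proxy address
    'https': 'http://your.proxy.address:8080',
}
r = requests.get('https://github.com', proxies=proxies, timeout=30)
print(r.status_code)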

getting past ReadTimeout from Python Requests

I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works like normal. If I use any browser to navigate manually to the website, it also works fine (with or without session data/cookie data). I tried adding randomized headers to the request, but it does not seem to help. From what I can see, it's not an issue of sending too many requests per time period (considering that Selenium and manual browsing still work at any time). I am confident that this specific issue is NOT due to rate limiting.
my code:
from random import choice
import requests

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    # pick a random desktop user agent and a browser-like Accept header
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive my response? I would prefer to use requests, but anything that runs fast, unlike Selenium, will be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' but rather 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the User-Agent. The reason Selenium works and requests does not is that Selenium uses a web driver to make the request, so it looks more humanlike, while requests is much easier to detect as a script. From Home Depot's robots.txt page it doesn't look like products are allowed to be scraped. That said, I got a response by using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
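To push this a little further, you can send a fuller set of browser-like headers and reuse a Session so any cookies persist between requests; a sketch using the same user agent:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
})
response = session.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', timeout=10)
print(response.status_code)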

How do I know which browser is used to crawl in Scrapy framework?

What's my context:
As you know, the HTML structure a website serves to Chrome, Firefox, and Safari can be quite different. So when I use a CSS selector to get data from an element, sometimes the tag exists in the HTML seen by Chrome but not in the others. Because of that, I want to focus on only one browser to reduce my effort.
When I crawl data from URLs using the Scrapy framework, I don't know which browser Scrapy will use to crawl the data. Therefore, I also don't know what kind of HTML response body will be returned. I checked the responses and found that sometimes the structure is the same as what I get from Chrome, but sometimes it's not. It seems as if the Scrapy framework used many different web browsers to crawl the data.
What I want:
I want to use only Chrome browser for crawling data in Scrapy framework
The structure of the HTML response body must be obtained from Chrome
What I ask:
Does anyone have any Ideas or tips to help me deal with that issue?
Can I configure the WebDriver in the Scrapy framework the way Selenium does? (If it's possible, please show me where and how.)
Thank you!
Scrapy does not use a browser; it parses static HTML, like BeautifulSoup. If you want to parse a dynamic (JavaScript-generated) page, use Selenium, and if you want, you can then send the page source to Scrapy.
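If you do go the Selenium route, one way to hand the rendered page back to Scrapy's selectors is roughly this (the URL is a placeholder):
from selenium import webdriver
from scrapy.http import HtmlResponse

driver = webdriver.Chrome()                      # render the page with real Chrome
driver.get('https://example.com/some-page')      # placeholder URL
# wrap Chrome's rendered HTML so the usual .css()/.xpath() selectors work
response = HtmlResponse(url=driver.current_url,
                        body=driver.page_source,
                        encoding='utf-8')
titles = response.css('h1::text').extract()
driver.quit()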
To set Scrapy to use a custom user agent (Chrome), add this to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
or in my_spider.py:
import scrapy

class MySpider(scrapy.Spider):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers={"User-Agent": "Your Custom User Agent"})
You can set the user agent in your settings file, something like this:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
To the web server it will then look like the request is coming from Chrome.

Cookies not showing in Scrapy output even though I have enabled them

Cookies are enabled by default in Scrapy.
I have this in settings.py:
COOKIES_DEBUG = True
It works in all my other projects and shows the cookies in the terminal when I run the code.
But it is not showing the received cookies in the terminal for one specific project.
I have searched the internet but I am not sure what to do.
PS:
The website I am scraping does of course set cookies; I can see them when I visit the site from a browser.
What can I be missing?
From the discussions with the OP, it appears that this website does not send Set-Cookie headers when scrapy's default User-Agent string is used.
Changing the User-Agent string to something like this (in settings.py, for example):
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36'
fixes the issue.
