While crawling a site with Scrapy, I get redirected to a "user blocked" page about a fifth of the time, and when that happens I lose the pages I was redirected from. I don't know which middleware to use or which settings to set in it, but I want this:
DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)
To NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm again (maybe a few more times), I should get it. How do I set the number of tries for that?
If it matters, I'm only allowing the crawler to use one very specific starting URL and then only follow "next page" links, so it should go in order through a small number of pages - hence I need it to either retry, e.g., page 34, or come back to it later. The Scrapy documentation says it should retry 20 times by default, but I don't see it retrying at all. Also, if it helps: all the redirects go to the same page (a "go away" page, the foo.aspx above) - is there a way to tell Scrapy that that particular page "doesn't count" and, if it's getting redirected there, to keep retrying? I saw something in the downloader middleware referring to a list of particular HTTP codes - can I add 302 to the "always keep trying this" list somehow?
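For context on the settings involved: one way to approach this is with Scrapy's built-in retry settings, sketched below under the assumption that the redirect middleware is disabled so the retry middleware actually gets to see the 302 (the setting names are standard Scrapy, but the exact code list and retry count are illustrative):

# settings.py - a sketch, not a drop-in fix
REDIRECT_ENABLED = False    # otherwise RedirectMiddleware consumes the 302 before RetryMiddleware sees it
RETRY_ENABLED = True
RETRY_TIMES = 5             # extra attempts per request (the "number of tries")
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 302]   # add 302 to the codes worth retrying

The answers below take this further with a custom retry middleware and with handle_httpstatus_list.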
I had the same problem today with a website that used 301, 302, and 303 redirects, but also sometimes a meta refresh redirect. I built a retry middleware and reused some chunks from the redirect middlewares:
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_meta_refresh
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response
        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        hxs = HtmlXPathSelector(response)
        # test for captcha page
        captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            log.msg("captcha page %s" % url, level=log.INFO)
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        return response
In order to use this middleware, it's probably best to disable the existing redirect middlewares for this project in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
}
You can handle 302 responses by adding handle_httpstatus_list = [302] at the beginning of your spider like so:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [302]

    def parse(self, response):
        if response.status == 302:
            # Store response.url somewhere and go back to it later
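A rough sketch of what that comment might turn into inside the spider above, assuming you want to re-queue the blocked page a limited number of times: 'retries_left' is a made-up meta key, not a Scrapy built-in, while dont_filter is a standard Request argument that bypasses the duplicate filter.

    def parse(self, response):
        if response.status == 302:
            # hypothetical meta key tracking how many attempts remain for this page
            retries_left = response.meta.get('retries_left', 3)
            if retries_left > 0:
                # re-queue the same request; dont_filter lets it through the dupe filter
                yield response.request.replace(
                    dont_filter=True,
                    meta={'retries_left': retries_left - 1},
                )
            return
        # ... normal parsing of the real (non-redirected) page goes here ...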
I'm trying to crawl a site and check that none of the links pointing to pages within the site are down. As there is no sitemap available, I'm using Scrapy to crawl the site and get all the links on every page, but I can't get it to output a file with all the links found and their status codes. The site I'm using to test the code is quotes.toscrape.com and my code is:
from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http import Request
import re

class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I've tried adding the following lines after yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
But that gets me a bunch of duplicates and no URLs with status 404 or 301. Does anyone know how I can get all the URLs together with their status?
Scrapy by default does not pass unsuccessful responses to your callbacks, but you can fetch them and handle them in one of your functions if you set errback on the request.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
The failure parameter will contain more information on the exact reason for the failure, because it could be an HTTP error (where you can fetch a response), but also a DNS lookup error and the like (where there is no response).
The documentation contains an example of how to use failure to determine the error reason and access the Response if available:
# imports used by this errback (as in the Scrapy docs example)
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:
    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
You should use the HTTPERROR_ALLOW_ALL setting in your settings or set the meta key handle_httpstatus_all = True in all your requests; please refer to the docs for more information.
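As a brief illustration of those two options (both names are standard Scrapy; the parse body is just a cut-down version of the spider from the question):

# settings.py - let all non-2xx responses reach your callbacks project-wide
HTTPERROR_ALLOW_ALL = True

# or per request, inside the spider, via the meta key
def parse(self, response):
    for link in response.xpath('//a/@href').extract():
        yield Request(response.urljoin(link), self.parse,
                      meta={'handle_httpstatus_all': True})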
I'm writing a Scrapy application that crawls a website's main page, saves its URL, and also checks its menu items, recursively applying the same process to them.
class NeatSpider(scrapy.Spider):
    name = "example"
    start_urls = ['https://example.com/main/']

    def parse(self, response):
        url = response.url
        yield url

        # check for in-menu article links
        menu_items = response.css(MENU_BAR_LINK_ITEM).extract()
        if menu_items is not None:
            for menu_item in menu_items:
                yield scrapy.Request(response.urljoin(menu_item), callback=self.parse)
In the example website, each menu item leads to another page with more menu items.
Some pages' responses reach the 'parse' method, and so their URLs get saved, while others' do not.
The ones that don't reach it give back a 200 status (when I enter their address manually in the browser), don't throw any exceptions, and show pretty much the same behavior as the pages that do reach the parse method.
Additional information: ALL of the menu items reach the last line in the code (without any errors), and if I provide an 'errback' callback method, no request ever gets there.
EDIT: here is the log: http://pastebin.com/2j5HMkqN
There is a chance that the website you are scraping is showing captchas.
You can debug your scraper like this; it will open the scraped web page in your OS's default browser.
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.body:
        open_in_browser(response)
I am trying to crawl a long list of websites. Some of the websites in my start_urls list redirect (301). I want Scrapy to crawl the redirected websites from the start_urls list as if they were also on the allowed_domains list (which they are not). For example, example.com was on my start_urls list and on the allowed_domains list, and example.com redirects to foo.com. I want to crawl foo.com.
DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
I tried dynamically adding to allowed_domains in the parse_start_url method and returning a Request object, so that Scrapy would go back and scrape the redirected websites once they are on the allowed domains list, but I still get:
DEBUG: Filtered offsite request to 'www.foo.com'
Here is my attempt to dynamically add allowed_domains:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)
My other idea was to create a function in the offsite.py spider middleware that dynamically adds allowed_domains for redirected websites that originated from start_urls, but I have not been able to get that solution to work either.
I figured out the answer to my own question.
I edited the offsite middleware to get the updated list of allowed domains before it filters, and I dynamically add to the allowed domains list in the parse_start_url method.
I added this function to OffsiteMiddleware:
def update_regex(self, spider):
    self.host_regex = self.get_host_regex(spider)
I also edited this function inside OffsiteMiddleware
def should_follow(self, request, spider):
    # Custom code to update the regex
    self.update_regex(spider)
    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))
Lastly, for my use case, I added this code to my spider:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
    return self.parse_it(response, 1)
This code will add the redirected domain for any start_urls that get redirected, and will then crawl those redirected sites.
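One detail worth spelling out: Scrapy has to be told to load the edited copy instead of the stock middleware. A sketch of that wiring in settings.py, assuming the modified class is called CustomOffsiteMiddleware and lives in YOUR_PROJECT.middlewares (both names are placeholders, and the stock class path differs between Scrapy versions):

SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    'YOUR_PROJECT.middlewares.CustomOffsiteMiddleware': 500,
}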
Is there a way to stop a url from redirecting?
driver.get('http://loginrequired.com')
This redirects me to another page but I want it to stay on that page without redirecting by default.
There are two ways that what users call "redirection" typically happens:
You load a page and the page loads some JavaScript code which performs a test and decides to load a different page. This process can be interrupted in some browsers by hitting the ESCAPE key, and Selenium can send an ESCAPE key (see the sketch further down).
However, this redirection could happen before Selenium gives control back to your script, so whether it works in any specific case depends on the page being loaded.
You load a page and get an HTTP 3xx (301, 302, 303, etc.) response from the server. There is no opportunity for users to interrupt these redirections in their browser, so Selenium does not provide a means to interrupt or prevent them.
So there is no surefire way to prevent a redirection in Selenium.
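For completeness, a rough sketch of sending ESCAPE from Selenium; whether it wins the race against the redirect depends entirely on timing, and it may do nothing at all (the URL is the one from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('http://loginrequired.com')
# try to interrupt a JavaScript-driven redirect; this is a race and may be too late
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)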
A solution, in case you do not need to render the page but only need access to the source of "http://loginrequired.com", would be to use Selenium together with Scrapy.
Basically, you tell Scrapy to stop following redirects, and then the spider handles the 302 itself when it receives the response.
In settings.py you have to set:
REDIRECT_ENABLED = False
The spider code is:
from selenium import webdriver
from scrapy.spiders import CrawlSpider
from scrapy.http import Request

class LoginSpider(CrawlSpider):
    name = "login"
    allowed_domains = ['loginrequired.com']
    start_urls = ['http://loginrequired.com']
    handle_httpstatus_list = [302]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="http://loginrequired.com", callback=self.after_302)

    def after_302(self, response):
        print(response.url)
        # Your code to analyse the page goes here
Idea taken from how to handle 302 redirect in scrapy
I am getting this message once in the logs:
2014-01-16 12:41:45+0100 [mybot] DEBUG: Filtered duplicate request: <GET https://mydomain/someurl> - no more duplicates will be shown (see DUPEFILTER_CLASS)
The URL was requested using Request() and it says it's a duplicate the very first time it's requested. I don't know what's causing this. What can I do to debug this? How do I make it print all the duplicate URLs that it's filtering?
You probably need to create your own DupeFilter
from scrapy import log
from scrapy.dupefilter import RFPDupeFilter

class VerboseRFPDupeFilter(RFPDupeFilter):

    def log(self, request, spider):
        fmt = "Filtered duplicate request: %(request)s"
        log.msg(format=fmt, request=request, level=log.DEBUG, spider=spider)
and set the DUPEFILTER_CLASS setting to this new class (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
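For example, the wiring might look like this in settings.py; the module path is a placeholder, and DUPEFILTER_DEBUG is a separate built-in setting (available in newer Scrapy versions) that makes the stock filter log every duplicate, which may already be enough on its own:

DUPEFILTER_CLASS = 'myproject.dupefilters.VerboseRFPDupeFilter'   # placeholder path

# alternatively, on newer Scrapy versions, keep the stock filter but log all duplicates
DUPEFILTER_DEBUG = True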
Try the exact URL with curl -v URL and see if the response headers contain a 301 or 302. Alternatively, you can try scrapy shell URL.
I've seen some sites that redirect to the same page when the parameters are not in the expected order or letter case. Scrapy doesn't consider the order or the letter case of the parameters when comparing two request objects.
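If a request that looks like a duplicate genuinely needs to be fetched again, a common workaround is to bypass the filter for that one request with dont_filter, which is a regular Request argument (used here inside a callback):

# fetch the URL even if the dupe filter has already seen an equivalent request
yield Request(url, callback=self.parse, dont_filter=True)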