How do I make Scrapy print all duplicate urls? - python

I am getting this message once in the logs:
2014-01-16 12:41:45+0100 [mybot] DEBUG: Filtered duplicate request: <GET https://mydomain/someurl> - no more duplicates will be shown (see DUPEFILTER_CLASS)
The URL is requested with Request() only once, yet it is reported as a duplicate the very first time it is requested. I don't know what's causing this. What can I do to debug it, and how do I make Scrapy print every duplicate URL it filters?

You probably need to create your own dupefilter, for example:
from scrapy import log
from scrapy.dupefilter import RFPDupeFilter

class VerboseRFPDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # log every filtered duplicate instead of only the first one
        fmt = "Filtered duplicate request: %(request)s"
        log.msg(format=fmt, request=request, level=log.DEBUG, spider=spider)
and set the DUPEFILTER_CLASS setting to this new class (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
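For example, assuming the filter above is saved in a module such as myproject/dupefilters.py (the path is just an illustration), settings.py would point at it like this:

# settings.py -- adjust the module path to wherever you saved the class
DUPEFILTER_CLASS = 'myproject.dupefilters.VerboseRFPDupeFilter'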

Try the exact URL with curl -v URL and see whether the response headers contain a 301 or 302. Alternatively, you can try scrapy shell URL.
I've seen sites that redirect to the same page when the query parameters are not in the expected order or letter case. Scrapy doesn't consider the order or the letter case of the parameters when comparing two request objects.
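The parameter-order part is easy to see: Scrapy fingerprints requests from a canonicalized URL (via w3lib's canonicalize_url), which sorts the query parameters, so two URLs that differ only in parameter order collapse to the same fingerprint. A small sketch with made-up URLs:

from w3lib.url import canonicalize_url

# two URLs that differ only in query-parameter order
url_a = "https://mydomain/someurl?page=2&sort=asc"
url_b = "https://mydomain/someurl?sort=asc&page=2"

# both canonicalize to the same string, so the dupefilter sees one request
print(canonicalize_url(url_a) == canonicalize_url(url_b))  # True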

Related

scrapy.Request doesn't enter the download middleware, it returns Request instead of response

I'm using scrapy.Spider to scrape, and I want to issue another request inside the callback of the requests yielded from start_requests, but that inner request doesn't work: it should give me a response, but it only returns a Request.
Stepping through with a debugger, I found that in class Request(object_ref) the request only finishes its initialization; it never reaches request = next(slot.start_requests) to actually start downloading, so I'm left with just the Request object.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
Requests up to this point work fine. The following is where it breaks:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get the detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The URL is within allowed_domains, and it's not a duplicate. I'm guessing it's because of Scrapy's asynchronous request mechanism, but how could that affect the request made in self.parse_mashable, given that by then the Request from start_requests has already finished?
I managed to make the second request with the requests-html library instead, but I still can't figure out why the Scrapy version fails.
Could anyone point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it. Constructing a Request object doesn't download anything by itself; the request has to be yielded back to the engine so it can be scheduled and its callback invoked later.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
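As a sketch of that pattern, built on the question's own code (field and callback names are illustrative): yield the inner request back to the engine, carry the partially built item along in meta, and finish it in a second callback:

def parse_mashable(self, response):
    json_response = json.loads(response.text)
    for data in json_response['dataFeed']:
        item = Item()
        # yield the request instead of calling it directly; the item rides along in meta
        yield scrapy.Request(
            data['url'],
            callback=self.parse_article,
            meta={'item': item},
        )

def parse_article(self, response):
    item = response.meta['item']
    item['content'] = response.xpath("//body").get()
    yield item

On newer Scrapy versions, cb_kwargs serves the same purpose as the meta dict with a bit less ceremony.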

Scrapy Crawl all websites in start_url even if redirect

I am trying to crawl a long list of websites. Some of the websites in the start_url list redirect (301). I want scrapy to crawl the redirected websites from start_url list as if they were also on the allowed_domain list (which they are not). For example, example.com was on my start_url list and allowed domain list and example.com redirects to foo.com. I want to crawl foo.com.
DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
I tried dynamically adding to allowed_domains in the parse_start_url method and returning a Request object so that Scrapy would go back and scrape the redirected websites once they are on the allowed-domains list, but I still get:
DEBUG: Filtered offsite request to 'www.foo.com'
Here is my attempt to dynamically add allowed_domains:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)
My other ideas were to try and create a function in the spidermiddleware offsite.py that dynamically adds allowed_domains for redirected websites that originated from start_urls, but I have not been able to get that solution to work either.
I figured out the answer to my own question.
I edited the offsite middleware to re-read the updated list of allowed domains before it filters, and I dynamically add to the allowed-domains list in the parse_start_url method.
I added this method to OffsiteMiddleware:
def update_regex(self, spider):
    self.host_regex = self.get_host_regex(spider)
I also edited this method inside OffsiteMiddleware:
def should_follow(self, request, spider):
    # custom code to update the regex from the spider's current allowed_domains
    self.update_regex(spider)
    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))
Lastly, for my use case, I added this code to my spider:
def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
    return self.parse_it(response, 1)
This code will add the redirected domain for any start_urls that get redirected and then will crawl those redirected sites.
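If you'd rather not patch Scrapy's source in place, a roughly equivalent approach (a sketch; module paths are illustrative, and on newer Scrapy the import path is scrapy.spidermiddlewares.offsite rather than scrapy.contrib.spidermiddleware.offsite) is to subclass the offsite middleware and register the subclass:

# myproject/middlewares.py -- path is just an example
from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached

class DynamicOffsiteMiddleware(OffsiteMiddleware):
    def should_follow(self, request, spider):
        # rebuild the host regex on every check so domains appended to
        # spider.allowed_domains at runtime are taken into account
        self.host_regex = self.get_host_regex(spider)
        host = urlparse_cached(request).hostname or ''
        return bool(self.host_regex.search(host))

# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.DynamicOffsiteMiddleware': 500,
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}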

Scrapy retry or redirect middleware

While crawling through a site with scrapy, I get redirected to a user-blocked page about 1/5th of the time. I lose the pages that I get redirected from when that happens. I don't know which middleware to use or what settings to use in that middleware, but I want this:
DEBUG: Redirecting (302) to <GET http://domain.com/foo.aspx> from <GET http://domain.com/bar.htm>
To NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm again (maybe a few more times), I should get it. How do I set the number of tries for that?
If it matters, I'm only allowing the crawler to use a very specific starting URL and then only follow "next page" links, so it should go in order through a small number of pages - hence why I need it to either retry, e.g., page 34, or come back to it later. The Scrapy documentation says it should retry 20 times by default, but I don't see it retrying at all. Also, if it helps: all redirects go to the same page (a "go away" page, the foo.aspx above) - is there a way to tell Scrapy that that particular page "doesn't count" and, if it's being redirected there, to keep retrying? I saw something in the downloader middleware referring to particular HTTP codes in a list - can I add 302 to the "always keep trying this" list somehow?
I had the same problem today with a website that used 301..303 redirects, but also sometimes meta-refresh redirects. I've built a retry middleware and reused some chunks from the redirect middlewares:
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_meta_refresh
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response
        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        hxs = HtmlXPathSelector(response)
        # test for captcha page
        captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            log.msg("captcha page %s" % url, level=log.INFO)
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        return response
In order to use this middleware it's probably best to disable the existing redirect middlewares for this project in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
}
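As for the number of tries the question asks about: the retry count is governed by the standard RetryMiddleware settings. A sketch of the relevant settings.py entries, with illustrative values:

RETRY_ENABLED = True
RETRY_TIMES = 5  # maximum extra attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # status codes the stock RetryMiddleware retries on

The custom middleware above calls self._retry() directly, so it retries regardless of RETRY_HTTP_CODES, but RETRY_TIMES still caps how many times a given request is re-attempted.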
You can handle 302 responses by adding handle_httpstatus_list = [302] at the beginning of your spider like so:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [302]

    def parse(self, response):
        if response.status == 302:
            # Store response.url somewhere and go back to it later
            pass
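One way to "go back to it later", as a sketch (not from the original answer; the blocked_retries meta key and the retry cap are made up for illustration): re-yield the blocked URL with dont_filter=True so the dupefilter doesn't drop the repeated attempt.

import scrapy
from scrapy.spiders import CrawlSpider  # scrapy.contrib.spiders on old versions

class MySpider(CrawlSpider):
    handle_httpstatus_list = [302]
    max_blocked_retries = 5  # illustrative cap

    def parse(self, response):
        if response.status == 302:
            retries = response.meta.get('blocked_retries', 0)
            if retries < self.max_blocked_retries:
                # with handle_httpstatus_list the redirect is not followed,
                # so response.url is still the page we actually wanted
                yield scrapy.Request(
                    response.url,
                    callback=self.parse,
                    dont_filter=True,  # bypass the duplicate filter for the retry
                    meta={'blocked_retries': retries + 1},
                )
            return
        # normal parsing of successful pages goes here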

How to add Headers to Scrapy CrawlSpider Requests?

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.
As per this question, I checked
response.request.headers.get('Referer', None)
in my response parsing function, and the Referer header is not present. I assume that means the Referer is not being sent with the request (unless the site strips it; I'm not sure about that).
I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.
Does anyone know how to modify request headers dynamically?
You can pass the Referer manually to each request using the headers argument:
yield Request(url, callback=..., headers={'Referer': ...})
RefererMiddleware does the same, automatically taking the referrer url from the previous response.
You have to enable the SpiderMiddleware that will populate the referer for responses. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware
In short, you need to add this middleware to your project's settings file.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
Then in your response parsing method, you can use response.request.headers.get('Referer', None) to get the referer.

Crawling redirected url in scrapy

I am working in Scrapy.
I am fetching a site which consists of a list of URLs.
I requested the main URL in start_urls and collected all the href links (the links to fetch data from) in a list, then requested each URL in that list to fetch the data, but some of the URLs redirect, like below:
DEBUG: Redirecting (301) to <GET example.com/sch/mobile-68745.php> from <GET example.com/sch/mobile-8974.php>
I came to understand that Scrapy ignores the redirected links, but I want to catch the redirected URL and scrape it just like a URL that returns a 200 status.
Is there any way to catch that redirected URL and scrape the data from it? Do I need to disable the redirect middleware, or do I need to pass some meta key in the Request? Could you provide an example?
I've got no experience with Scrapy, but apparently you can define middlewares that change the way Scrapy works when resolving content.
There is the RedirectMiddleware that supports and handles redirects out of the box, so all you’d need to do is to enable that.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 123,
}
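In practice RedirectMiddleware is enabled by default, so redirects are already followed transparently and your callback receives the final page. A sketch of how the final and original URLs can be read in the callback (the callback name is just an example):

def parse_item(self, response):
    final_url = response.url  # URL after any redirects were followed
    # RedirectMiddleware records the URLs it redirected away from in request.meta
    original_urls = response.request.meta.get('redirect_urls', [])
    self.log("scraped %s (redirected from %s)" % (final_url, original_urls))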
