I have two spiders inheriting from a parent spider class as follows:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
class SpiderOpTest(CrawlSpider):
custom_settings = {
"USER_AGENT": "*",
"LOG_LEVEL": "WARNING",
"DOWNLOADER_MIDDLEWARES": {'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543},
}
httperror_allowed_codes = [301]
def parse_tournament(self, response):
print(f"Parsing tournament - {response.url}")
def parse_tournament_page(self, response):
print(f"Parsing tournament page - {response.url}")
class SpiderOpTest1(SpiderOpTest):
name = "test_1"
start_urls = ["https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/"]
rules = (Rule(LinkExtractor(allow="/page/"), callback="parse_tournament_page"),)
class SpiderOpTest2(SpiderOpTest):
name = "test_2"
start_urls = ["https://www.oddsportal.com/tennis/results/"]
rules = (
Rule(LinkExtractor(allow="/atp-buenos-aires/results/"), callback="parse_tournament", follow=True),
Rule(LinkExtractor(allow="/page/"), callback="parse_tournament_page"),
)
process = CrawlerProcess()
process.crawl(<spider_class>)
process.start()
The parse_tournament_page callback for the Rule in the first spider works fine.
However, the second spider only runs the parse_tournament callback from the first Rule, despite the fact that its second Rule is the same as the first spider's and operates on the same page.
I'm clearly missing something really simple but for the life of me I can't figure out what it is...
As key bits of the pages load via JavaScript, it might be useful for me to include the Selenium middleware I'm using:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SeleniumMiddleware:
@classmethod
def from_crawler(cls, crawler):
middleware = cls()
crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
return middleware
def process_request(self, request, spider):
self.driver.get(request.url)
return HtmlResponse(
self.driver.current_url,
body=self.driver.page_source,
encoding='utf-8',
request=request,
)
def spider_opened(self, spider):
options = webdriver.FirefoxOptions()
options.add_argument("--headless")
self.driver = webdriver.Firefox(options=options)
def spider_closed(self, spider):
self.driver.close()
Edit:
So I've managed to create a third spider which is able to execute the parse_tournament_page callback from inside parse_tournament:
class SpiderOpTest3(SpiderOpTest):
name = "test_3"
start_urls = ["https://www.oddsportal.com/tennis/results/"]
httperror_allowed_codes = [301]
rules = (
Rule(
LinkExtractor(allow="/atp-buenos-aires/results/"),
callback="parse_tournament",
follow=True,
),
)
def parse_tournament(self, response):
print(f"Parsing tournament - {response.url}")
xtr = LinkExtractor(allow="/page/")
links = xtr.extract_links(response)
for p in links:
yield response.follow(p.url, dont_filter=True, callback=self.parse_tournament_page)
def parse_tournament_page(self, response):
print(f"Parsing tournament PAGE - {response.url}")
The key here seems to be dont_filter=True - if this is left as the default False then the parse_tournament_page callback isn't executed. This suggests Scrapy is somehow interpreting the second page as a duplicate, which, as far as I can tell, it isn't. That aside, from what I've read, if I want to get around this then I need to add unique=False to the LinkExtractor. However, doing this doesn't result in the parse_tournament_page callback executing :(
Update:
So I think I've found the source of the issue. From what I can tell the request_fingerprint method of RFPDupeFilter creates the same hash for https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/ as https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/.
From reading around, it seems I need to subclass RFPDupeFilter to reconfigure the way request_fingerprint works. Any advice on why the same hashes are being generated and/or tips on how to do the subclassing correctly would be greatly appreciated!
The difference between the two URLs mentioned in the update is the fragment #/page/2/. Scrapy ignores fragments by default: "Also, servers usually ignore fragments in urls when handling requests, so they are also ignored by default when calculating the fingerprint. If you want to include them, set the keep_fragments argument to True (for instance when handling requests with a headless browser)." (from scrapy/utils/request.py)
Check DUPEFILTER_CLASS settings for more information.
The request_fingerprint function from scrapy.utils.request can already handle fragments. When subclassing, pass keep_fragments=True.
Add your class to the custom_settings of SpiderOpTest.
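For reference, a minimal sketch of such a subclass, assuming a project layout like the one in the custom_settings above (the module path scraper_scrapy.odds.dupefilters is an assumption; place the class wherever suits your project):

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class FragmentAwareDupeFilter(RFPDupeFilter):
    """Keep URL fragments when fingerprinting, so .../results/ and
    .../results/#/page/2/ no longer collapse into a single request."""

    def request_fingerprint(self, request):
        return request_fingerprint(request, keep_fragments=True)

and then point the spider at it:

custom_settings = {
    # ...existing settings from SpiderOpTest...
    "DUPEFILTER_CLASS": "scraper_scrapy.odds.dupefilters.FragmentAwareDupeFilter",
}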
Related
I'm an SEO specialist, not really into coding, but I want to try to create a broken-links checker in Python with the Scrapy module, which will crawl my website and show me all internal links with a 404 code.
So far I have managed to write this code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import Broken
class Spider(CrawlSpider):
name = 'example'
handle_httpstatus_list = [404]
allowed_domains = ['www.example.com']
start_urls = ['https://www.example.com']
rules = [Rule(LinkExtractor(), callback='parse_info', follow=True)]
def parse_info(self, response):
report = [404]
if response.status in report:
Broken_URLs = Broken()
#Broken_URLs['title']= response.xpath('/html/head/title').get()
Broken_URLs['referer'] = response.request.headers.get('Referer', None)
Broken_URLs['status_code']= response.status
Broken_URLs['url']= response.url
Broken_URLs['anchor']= response.meta.get('link_text')
return Broken_URLs
It crawls well, as long as there are absolute URLs in the site structure.
But there are some cases where the crawler comes across relative URLs and ends up with this kind of link:
Normally it should be:
https://www.example.com/en/...
But it gives me:
https://www.example.com/en/en/... - a doubled language folder, which ends up with a 404 code.
I'm trying to find a way to fix this language duplication so the final URLs have the correct structure.
Does somebody know the way how to fix it? Will much appreciate it!
Scrapy uses urllib.parse.urljoin to resolve relative URLs.
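You can see where the doubled folder comes from with a quick, purely illustrative check (the href below is a placeholder):

from urllib.parse import urljoin

# A relative href such as "en/page.html" found on a page that already lives
# under /en/ is resolved against the page URL, so the folder gets repeated:
print(urljoin("https://www.example.com/en/", "en/page.html"))
# -> https://www.example.com/en/en/page.html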
You can fix it by passing a custom function as process_request in the Rule definition:
def fix_urls():
def process_request(request, response):
return request.replace(url=request.url.replace("/en/en/", "/en/"))
return process_request
class Spider(CrawlSpider):
name = 'example'
...
rules = [Rule(LinkExtractor(), process_request=fix_urls(), callback='parse_info', follow=True)]
I'm beginning with Scrapy and I've made a couple of spiders that successfully target the same site.
The first one gets the products listed across the entire site, except their prices (because prices are hidden from users who aren't logged in), and the second one logs in to the website.
My problem looks a bit weird when I merge both pieces of code: the result doesn't work! The main problem is that the rules aren't processed; it's as if they aren't called by Scrapy.
Because the program has to log in to the website, I have to override start_requests, but when I override it the rules are not processed. I'm digging into the documentation, but I don't understand how the methods/functions are called by the framework and why the rules aren't processed.
Here is it my spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from oled.items import OledItem
from scrapy.utils.response import open_in_browser
class OledMovilesSpider(CrawlSpider):
name = 'webiste-spider'
allowed_domains = ['website.com']
rules = {
# Para cada item
# Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(text(), '>')]'))),
Rule(LinkExtractor(allow=(), restrict_xpaths=('//h2[@class="product-name"]/a')), callback='parse_item',
follow=False)
}
def start_requests(self):
return [scrapy.FormRequest('https://website.com/index.php?route=account/login',
formdata={'email':'website@website.com','password':'website#'},
callback=self.logged_in)]
def logged_in(self, response):
urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
print('antes de return')
return [scrapy.Request(url=url, callback=self.parse_item) for url in urls]
def parse_item(self, response):
print("Dentro de Parse")
open_in_browser(response)
ml_item = OledItem()
# info de producto
ml_item['nombre'] = response.xpath('normalize-space(//title/text())').extract_first()
ml_item['descripcion'] = response.xpath('normalize-space(//*[@id="product-des"])').extract()
ml_item['stock'] = response.xpath('normalize-space(//span[@class="available"])').extract()
#ml_item['precio'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[1]/span)').extract()
#ml_item['categoria'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[2]/span)').extract()
yield ml_item
Could someone tell me why the rules are not being processed?
I think you're bypassing the rules by overriding start_requests. The parse method is never called, so the rules aren't processed.
If you want to process the rules for page https://gsmoled.com/index.php?route=product/category&path=33_61 after you're logged in, you can try changing the callback of the logged_in method to parse like this: return [scrapy.Request(url=url, callback=self.parse) for url in urls].
The rules should be processed at that moment, and because you specified 'parse_item' as a callback in the rules, the parse_item method will be executed for all urls generated by the rules.
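For illustration, the changed method would look roughly like this (it slots into the OledMovilesSpider above, which already imports scrapy):

def logged_in(self, response):
    # Handing these requests to CrawlSpider's built-in parse lets the rules
    # run on the category page; parse_item is then called for every link the
    # LinkExtractor pulls out.
    urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
    return [scrapy.Request(url=url, callback=self.parse) for url in urls]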
I have two spiders in one Scrapy project. Spider1 crawls a list of pages or an entire website and analyzes the content. Spider2 uses Splash to fetch URLs on Google and passes that list to Spider1.
So Spider1 crawls and analyzes content, and can be used without being called by Spider2.
# coding: utf8
from scrapy.spiders import CrawlSpider
import scrapy
class Spider1(scrapy.Spider):
name = "spider1"
tokens = []
query = ''
def __init__(self, *args, **kwargs):
'''
This spider works with two modes,
if only one URL it crawls the entire website,
if a list of URLs only analyze the page
'''
super(Spider1, self).__init__(*args, **kwargs)
start_url = kwargs.get('start_url') or ''
start_urls = kwargs.get('start_urls') or []
query = kwargs.get('q') or ''
if query != '':
self.query = query
if start_url != '':
self.start_urls = [start_url]
if len(start_urls) > 0:
self.start_urls = start_urls
def parse(self, response):
'''
Analyze and store data
'''
if len(self.start_urls) == 1:
for next_page in response.css('a::attr("href")'):
yield response.follow(next_page, self.parse)
def closed(self, reason):
'''
Finalize crawl
'''
The code for Spider2
# coding: utf8
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
class Spider2(scrapy.Spider):
name = "spider2"
urls = []
page = 0
def __init__(self, *args, **kwargs):
super(Spider2, self).__init__(*args, **kwargs)
self.query = kwargs.get('q')
self.url = kwargs.get('url')
self.start_urls = ['https://www.google.com/search?q=' + self.query]
def start_requests(self):
splash_args = {
'wait': 2,
}
for url in self.start_urls:
splash_args = {
'wait': 1,
}
yield SplashRequest(url, self.parse, args=splash_args)
def parse(self, response):
'''
Extract URLs to self.urls
'''
self.page += 1
def closed(self, reason):
process = CrawlerProcess(get_project_settings())
for url in self.urls:
print(url)
if len(self.urls) > 0:
process.crawl('lexi', start_urls=self.urls, q=self.query)
process.start(False)
When running Spider2 I get this error: twisted.internet.error.ReactorAlreadyRunning, and Spider1 is called without the list of URLs.
I tried using CrawlerRunner as advised by the Scrapy documentation, but it's the same problem.
I tried using CrawlerProcess inside the parse method; it "works", but I still get the error message. When using CrawlerRunner inside the parse method, it doesn't work.
Currently it is not possible to start a spider from another spider if you're using the scrapy crawl command (see https://github.com/scrapy/scrapy/issues/1226). It is possible to start a spider from a spider if you write a startup script yourself - the trick is to use the same CrawlerProcess/CrawlerRunner instance.
I wouldn't do that though; you're fighting against the framework. It'd be nice to support this use case, but it is not really supported now.
An easier way is to either rewrite your code to use a single Spider class, or to create a script (bash, Makefile, luigi/airflow if you want to be fancy) which runs scrapy crawl spider1 -o items.jl followed by scrapy crawl spider2; the second spider can read items created by the first spider and generate start_requests accordingly.
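For completeness, the "same CrawlerProcess/CrawlerRunner instance" trick mentioned above is usually written as the standard sequential-crawl pattern from a standalone script; a rough sketch (spider names, arguments and the URL hand-off between runs are assumptions):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # run the Google/Splash spider first, then the analysis spider;
    # passing the collected URLs between them (file, database, ...) is up to you
    yield runner.crawl('spider2', q='some query')
    yield runner.crawl('spider1', start_urls=['https://example.com'])
    reactor.stop()

crawl()
reactor.run()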
FTR: combining SplashRequests and regular scrapy.Requests in a single spider is fully supported (it should just work); you don't have to create separate spiders for them.
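A rough illustration of that, with placeholder URLs and callbacks:

import scrapy
from scrapy_splash import SplashRequest

class MixedSpider(scrapy.Spider):
    name = "mixed"

    def start_requests(self):
        # JS-heavy page rendered through Splash
        yield SplashRequest('https://www.google.com/search?q=python',
                            self.parse_search, args={'wait': 1})
        # plain page fetched directly by Scrapy
        yield scrapy.Request('https://example.com', self.parse_page)

    def parse_search(self, response):
        pass  # extract result URLs here

    def parse_page(self, response):
        pass  # analyze content here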
I have a little problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. My idea is to only print them and not scrape them. My current piece of code is:
import scrapy
import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'rust'
allowed_domains = ['example.com']
start_urls = ['http://example.com']
rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
# Extract links matching 'item.php' and parse them with the spider's method parse_item
Rule(LinkExtractor(), callback='parse_item', follow=True),
)
def parse_item(self, response):
#if response.status == 301:
print response.url
However, this does not print the redirected URLs. Any help will be appreciated.
Thank you.
To parse any responses that are not 200 you'd need to do one of these things:
Project-wide
You can set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in the settings.py file. Or, if you want to enable it for all codes, you can set HTTPERROR_ALLOW_ALL = True instead.
Spider-wide
Add the handle_httpstatus_list attribute to your spider. In your case something like:
class MySpider(scrapy.Spider):
handle_httpstatus_list = [301]
# or
handle_httpstatus_all = True
Request-wide
You can set these meta keys on individual requests: handle_httpstatus_list = [301, 302, ...], or handle_httpstatus_all = True to allow all of them:
scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
To learn more see HttpErrorMiddleware
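Applied to the original spider, the spider-wide option might look roughly like this (a sketch; with handle_httpstatus_list set, the redirect middleware passes the 301 response through instead of following it, so parse_item can print it):

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    handle_httpstatus_list = [301]  # let 301 responses reach the callback
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        if response.status == 301:
            # print the URL that redirected and where it points to
            print(response.url, '->', response.headers.get('Location'))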
I want to get the website addresses of some jobs, so I wrote a Scrapy spider. I want to get all of the values with the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I execute the spider with the command:
scrapy crawl auseek -a addsthreshold=3
the variable urls used to hold the values is empty. Can someone help me figure it out?
here is my code:
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import time
class AuSeekSpider(CrawlSpider):
name = "auseek"
result_address = []
addressCount = int(0)
addressThresh = int(0)
allowed_domains = ["seek.com.au"]
start_urls = [
"http://www.seek.com.au/jobs/in-australia/"
]
def __init__(self,**kwargs):
super(AuSeekSpider, self).__init__()
self.addressThresh = int(kwargs.get('addsthreshold'))
print 'init finished...'
def parse_start_url(self,response):
print 'This is start url function'
log.msg("Pipeline.spider_opened called", level=log.INFO)
hxs = Selector(response)
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
print 'urls is:',urls
print 'test element:',urls[0].encode("ascii")
for url in urls:
postfix = url.getAttribute('href')
print 'postfix:',postfix
url = urlparse.urljoin(response.url,postfix)
yield Request(url, callback = self.parse_ad)
return
def parse_ad(self, response):
print 'this is parse_ad function'
hxs = Selector(response)
item = ADItem()
log.msg("Pipeline.parse_ad called", level=log.INFO)
item['name'] = str(self.name)
item['picNum'] = str(6)
item['link'] = response.url
item['date'] = time.strftime('%Y%m%d',time.localtime(time.time()))
self.addressCount = self.addressCount + 1
if self.addressCount > self.addressThresh:
raise CloseSpider('Get enough website address')
return item
The problem is:
urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
urls is empty when I try to print it out. I just can't figure out why it doesn't work or how I can correct it. Thanks for your help.
Here is a working example using Selenium and the PhantomJS headless webdriver in a downloader middleware.
class JsDownload(object):
@check_spider_middleware
def process_request(self, request, spider):
driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
driver.get(request.url)
return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:
def check_spider_middleware(method):
@functools.wraps(method)
def wrapper(self, request, spider):
msg = '%%s %s middleware step' % (self.__class__.__name__,)
if self.__class__ in spider.middleware:
spider.log(msg % 'executing', level=log.DEBUG)
return method(self, request, spider)
else:
spider.log(msg % 'skipping', level=log.DEBUG)
return None
return wrapper
settings.py:
DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}
For the wrapper to work, all spiders must have at minimum:
middleware = set([])
To include a middleware:
middleware = set([MyProj.middleware.ModuleName.ClassName])
You could have implemented this in a request callback (in the spider), but then the HTTP request would happen twice. This isn't a foolproof solution, but it works for stuff that loads on .ready(). If you spend some time reading into Selenium you can wait for specific events to trigger before saving the page source.
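For example, a sketch of that idea slotted into the process_request above (the CSS selector and the 10-second timeout are assumptions for this particular page):

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def process_request(self, request, spider):
    driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
    driver.get(request.url)
    # block until the JS-rendered job links are actually in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'a.job-title'))
    )
    return HtmlResponse(request.url, encoding='utf-8',
                        body=driver.page_source.encode('utf-8'))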
Another example: https://github.com/scrapinghub/scrapyjs
More info: What's the best way of scraping data from a website?
Cheers!
Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for.
curl http://www.seek.com.au/jobs/in-australia/ | grep job-title
You should try PhantomJS or Selenium instead.
After examining the network requests in Chrome, the job listings appear to have originated from this JSONP request. It should be easy to retrieve whatever you need from it.