Scrapy: only scrape parts of website - python

I'd like to scrape parts of a number of very large websites using Scrapy. For instance, from northeastern.edu I would like to scrape only pages below the URL http://www.northeastern.edu/financialaid/, such as http://www.northeastern.edu/financialaid/contacts or http://www.northeastern.edu/financialaid/faq. I do not want to scrape the university's entire website; for example, http://www.northeastern.edu/faq should not be allowed.
I have no problem with URLs in the format financialaid.northeastern.edu (by simply limiting allowed_domains to financialaid.northeastern.edu), but the same strategy doesn't work for northeastern.edu/financialaid. (The full spider code is actually longer, as it loops through different web pages; I can provide details. Everything works apart from the rules.)
import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northestern.edu/financialaid']
    start_urls = ['http://www.northestern.edu/financialaid/']

    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
The results look like this:
2015-05-12 14:10:46-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: finaid_scraper)
2015-05-12 14:10:46-0700 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:10:46-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finaid_scraper.spiders', 'SPIDER_MODULES': ['finaid_scraper.spiders'], 'FEED_URI': '/Users/hugo/Box Sync/finaid/ScrapedSiteText_check/Northeastern.json', 'USER_AGENT': 'stanford_sociology', 'BOT_NAME': 'finaid_scraper'}
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:10:46-0700 [graphspider] INFO: Spider opened
2015-05-12 14:10:46-0700 [graphspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-12 14:10:46-0700 [graphspider] DEBUG: Redirecting (301) to <GET http://www.northeastern.edu/financialaid/> from <GET http://www.northeastern.edu/financialaid>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/financialaid/> (referer: None)
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'assistive.usablenet.com': <GET http://assistive.usablenet.com/tt/http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.northeastern.edu': <GET http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/pages/Boston-MA/NU-Student-Financial-Services/113143082891>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/NUSFS>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'nusfs.wordpress.com': <GET http://nusfs.wordpress.com/>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'northeastern.edu': <GET http://northeastern.edu/howto>
2015-05-12 14:10:47-0700 [graphspider] INFO: Closing spider (finished)
2015-05-12 14:10:47-0700 [graphspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 431,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9574,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 12, 21, 10, 47, 94112),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 6,
'offsite/filtered': 32,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 5, 12, 21, 10, 46, 566538)}
2015-05-12 14:10:47-0700 [graphspider] INFO: Spider closed (finished)
The second strategy I attempted was to use the allow rules of the LxmlLinkExtractor and to limit the crawl to everything within the sub-directory, but in that case the entire website gets crawled. (Deny rules do work.)
import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['www.northestern.edu']
    start_urls = ['http://www.northestern.edu/financialaid/']

    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
I also tried:
rules = (
    Rule(LxmlLinkExtractor(allow=(r"northeastern.edu/financialaid",)), callback='parse_site', follow=True),
)
The log is too long to be posted here, but these lines show that Scrapy ignores the allow-rule:
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/tag/north-carolina/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Scraped from <200 http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/>
Here is my items.py:
from scrapy.item import Item, Field

class FinAidScraperItem(Item):
    # define the fields for your item here like:
    url = Field()
    linkedurls = Field()
    internal_linkedurls = Field()
    external_linkedurls = Field()
    http_status = Field()
    title = Field()
    text = Field()
I am using a Mac with Python 2.7 and Scrapy 0.24.4. Similar questions have been posted before, but none of the suggested solutions fixed my problem.

You have a typo in the URLs used inside your spiders; compare:
northeastern
vs
northestern
Here is the spider that worked for me (it follows "financialaid" links only):
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northeastern.edu']
    start_urls = ['http://www.northeastern.edu/financialaid/']

    rules = (
        Rule(LinkExtractor(allow=r"financialaid/"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print response.url
Note that I'm using the LinkExtractor shortcut and a string for the allow argument value.
I've also edited your question and fixed the indentation problems, assuming they were just posting issues.
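If you want to be stricter about staying under the /financialaid/ section, here is a sketch of essentially the same rule with the allow pattern anchored to the full path, in case "financialaid/" ever matches too broadly elsewhere on the site. This is my own untested variation (the spider name FinancialAidSpider is just for illustration):
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class FinancialAidSpider(CrawlSpider):
    name = 'financialaid'
    allowed_domains = ['northeastern.edu']
    start_urls = ['http://www.northeastern.edu/financialaid/']

    rules = (
        # only extract links whose URL contains the northeastern.edu/financialaid/ path prefix
        Rule(LinkExtractor(allow=r"northeastern\.edu/financialaid/"),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)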

Related

Scrapy terminates unexpectedly

I followed a book published by O'Reilly to create a spider, shown below:
articleSpider.py
from scrapy.spiders import CrawlSpider, Rule
from TestScrapy.items import Article
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Object-oriented_programming"]
    rules = [Rule(LinkExtractor(allow=('(/wiki/)((?!:).)*$',)), callback="parse_item", follow=True)]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print "Title is:" + title
        item['title'] = title
        return item
items.py
from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
However, when I run this spider, it just displays one result and then terminates. I expect it to keep running until I stop it.
Please see the result and debug info from Scrapy:
2016-06-06 15:45:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: TestScrapy)
2016-06-06 15:45:28 [scrapy] INFO: Optional features available: ssl, http11
2016-06-06 15:45:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TestScrapy.spiders', 'SPIDER_MODULES': ['TestScrapy.spiders'], 'BOT_NAME': 'TestScrapy'}
2016-06-06 15:45:29 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-06 15:45:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-06 15:45:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-06 15:45:30 [scrapy] INFO: Enabled item pipelines:
2016-06-06 15:45:30 [scrapy] INFO: Spider opened
2016-06-06 15:45:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-06 15:45:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-06 15:45:33 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Object-oriented_programming> (referer: None)
Title is:Object-oriented programming
2016-06-06 15:45:33 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Object-oriented_programming>
{'title': u'Object-oriented programming'}
2016-06-06 15:45:33 [scrapy] INFO: Closing spider (finished)
2016-06-06 15:45:33 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 246,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 51238,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 6, 7, 45, 33, 441000),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 6, 6, 7, 45, 30, 614000)}
2016-06-06 15:45:33 [scrapy] INFO: Spider closed (finished)
Change the method name from parse to parse_item.
CrawlSpider uses parse internally to apply its rules, so by overriding it you only crawl the start URL; when the rules look for the parse_item callback there is no such method, and the spider ends its execution.
Check this example of CrawlSpider:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example
You can also override parse_start_url instead of parse here if you need to handle the start URL's response.
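For reference, a minimal sketch of your spider with that rename applied (same imports, items and rule as in your question; untested, so treat it as a sketch):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from TestScrapy.items import Article

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Object-oriented_programming"]
    rules = [Rule(LinkExtractor(allow=('(/wiki/)((?!:).)*$',)),
                  callback="parse_item", follow=True)]

    # renamed from parse so CrawlSpider's built-in parse() can apply the rules
    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print "Title is:" + title
        item['title'] = title
        return item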

Why doesn't the spider return any response for this site?

I'm using Scrapy to scrape this site, but when I run the spider I don't see any response.
I tried reddit.com and quora.com, and they both returned data (started to crawl), but not the site I want.
Here is my simple spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule

class FirstSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ["bayt.com"]
    start_urls = (
        'http://www.bayt.com/',
    )

    rules = [
        Rule(
            LinkExtractor(allow=['.*']),
        )
    ]
I tried several combinations of urls in the start_urls but nothing seemed to work.
Here is the log after running the spider:
2015-12-13 20:31:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: bayt)
2015-12-13 20:31:45 [scrapy] INFO: Optional features available: ssl, http11
2015-12-13 20:31:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bayt.spiders', 'SPIDER_MODULES': ['bayt.spiders'], 'BOT_NAME': 'bayt'}
2015-12-13 20:31:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-13 20:31:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-13 20:31:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-13 20:31:45 [scrapy] INFO: Enabled item pipelines:
2015-12-13 20:31:45 [scrapy] INFO: Spider opened
2015-12-13 20:31:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-13 20:31:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-13 20:31:45 [scrapy] DEBUG: Redirecting (302) to <GET http://www.bayt.com/en/jordan/> from <GET http://www.bayt.com/>
2015-12-13 20:31:46 [scrapy] DEBUG: Crawled (200) <GET http://www.bayt.com/en/jordan/> (referer: None)
2015-12-13 20:31:46 [scrapy] INFO: Closing spider (finished)
2015-12-13 20:31:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 881,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2320,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 13, 18, 31, 46, 212468),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 13, 18, 31, 45, 138408)}
2015-12-13 20:31:46 [scrapy] INFO: Spider closed (finished)
The problem is that you are not really using the rules: as you mentioned, you have your own parse method, which is not OK, because CrawlSpider uses parse internally and you shouldn't override it.
If you are still getting items while overriding parse, it is because parse is the default callback for the start_urls requests, so those requests are not really following the rules; you are only crawling the URLs inside start_urls.
Just change the name of your parsing method from parse to something different, and specify that name as the callback on your rule.
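Something like this (the callback name parse_page is just an example, and I haven't run this against bayt.com, so take it as a sketch):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FirstSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ["bayt.com"]
    start_urls = ('http://www.bayt.com/',)

    rules = [
        # follow every link and hand each response to parse_page,
        # leaving CrawlSpider's own parse() untouched
        Rule(LinkExtractor(allow=['.*']), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        self.log('Crawled %s' % response.url)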
I ran curl www.bayt.com on the command line and it seems they redirect the request to http://www.bayt.com/en/jordan/.
I put that as my start_urls, changed the user agent in settings.py to localhost, and it worked.

Scrapy: Define items dynamically

As I started to learn Scrapy, I came across a requirement to build the Item attributes dynamically. I'm just scraping a web page that has a table structure, and I want to form the item and field attributes while crawling. I have gone through the example "Scraping data without having to explicitly define each field to be scraped" but couldn't make much of it.
Should I be writing an item pipeline to capture the info dynamically? I have also looked at the Item loader functionality, but if anyone can explain it in detail, it would be really helpful.
Just use a single Field as an arbitrary data placeholder. And then when you want to get the data out, instead of saying for field in item, you say for field in item['row']. You don't need pipelines or loaders to accomplish this task, but they are both used extensively for good reason: they are worth learning.
spider:
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider

class TableItem(Item):
    row = Field()

class TestSider(BaseSpider):
    name = "tabletest"
    start_urls = ('http://scrapy.org?finger', 'http://example.com/toe')

    def parse(self, response):
        item = TableItem()

        row = dict(
            foo='bar',
            baz=[123, 'test'],
        )
        row['url'] = response.url

        if 'finger' in response.url:
            row['digit'] = 'my finger'
            row['appendage'] = 'hand'
        else:
            row['foot'] = 'might be my toe'

        item['row'] = row
        return item
output:
stav#maia:/srv/stav/scrapie/oneoff$ scrapy crawl tabletest
2013-03-14 06:55:52-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
2013-03-14 06:55:52-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'oneoff.spiders', 'SPIDER_MODULES': ['oneoff.spiders'], 'USER_AGENT': 'Chromium OneOff 24.0.1312.56 Ubuntu 12.04 (24.0.1312.56-0ubuntu0.12.04.1)', 'BOT_NAME': 'oneoff'}
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled item pipelines:
2013-03-14 06:55:53-0600 [tabletest] INFO: Spider opened
2013-03-14 06:55:53-0600 [tabletest] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://scrapy.org?finger> (referer: None)
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://scrapy.org?finger>
{'row': {'appendage': 'hand',
'baz': [123, 'test'],
'digit': 'my finger',
'foo': 'bar',
'url': 'http://scrapy.org?finger'}}
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://example.com/toe>
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example> from <GET http://www.iana.org/domains/example/>
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example> (referer: None)
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://www.iana.org/domains/example>
{'row': {'baz': [123, 'test'],
'foo': 'bar',
'foot': 'might be my toe',
'url': 'http://www.iana.org/domains/example'}}
2013-03-14 06:55:53-0600 [tabletest] INFO: Closing spider (finished)
2013-03-14 06:55:53-0600 [tabletest] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1066,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 3833,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 848735),
'item_scraped_count': 2,
'log_count/DEBUG': 13,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 99635)}
2013-03-14 06:55:53-0600 [tabletest] INFO: Spider closed (finished)
Use this class:
class Arbitrary(Item):
    def __setitem__(self, key, value):
        self._values[key] = value
        self.fields[key] = {}
The custom __setitem__ solution didn't work for me when using item loaders in Scrapy 1.0.3 because the item loader accesses the fields attribute directly:
value = self.item.fields[field_name].get(key, default)
The custom __setitem__ is only called for item-level accesses like item['new field']. Since fields is just a dict, I realized I could simply create an Item subclass that uses a defaultdict to gracefully handle these situations.
In the end, just two extra lines of code:
from collections import defaultdict
import scrapy

class FlexItem(scrapy.Item):
    """An Item that creates fields dynamically"""
    fields = defaultdict(scrapy.Field)
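For what it's worth, a minimal usage sketch of the item-loader path described above (the spider, the page_title field name and the XPath are invented for illustration): the loader's first lookup in item.fields is what creates the missing Field, after which load_item() can assign it normally.
import scrapy
from collections import defaultdict
from scrapy.loader import ItemLoader

class FlexItem(scrapy.Item):
    """An Item that creates fields dynamically"""
    fields = defaultdict(scrapy.Field)

class FlexSpider(scrapy.Spider):
    name = 'flextest'
    start_urls = ['http://scrapy.org/']

    def parse(self, response):
        l = ItemLoader(item=FlexItem(), response=response)
        # 'page_title' was never declared on FlexItem; the loader's access
        # to item.fields['page_title'] creates the Field on the fly
        l.add_xpath('page_title', '//title/text()')
        return l.load_item()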
In Scrapy 1.0+ a better way could be to yield Python dicts instead of Item instances if you don't have a well-defined schema. See, for example, the snippet on the http://scrapy.org/ front page - there is no Item defined.
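A minimal sketch of that approach (the spider name and the fields are made up for illustration):
import scrapy

class TableSpider(scrapy.Spider):
    name = 'tabletest_dicts'
    start_urls = ['http://scrapy.org/']

    def parse(self, response):
        # no Item class at all: yield a plain dict with whatever keys this page has
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }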
I was expecting more of an explanation about handling the data with item loaders and pipelines.
Assuming:
fieldname = 'test'
fieldxpath = '//h1'
It's (in recent versions) very simple...
item = Item()
l = ItemLoader(item=item, response=response)
item.fields[fieldname] = Field()
l.add_xpath(fieldname, fieldxpath)
return l.load_item()

Scrapy parse_item callback not being called

I'm having a problem getting my Scrapy spider to run its callback method.
I don't think it's an indentation error, which seems to have been the cause in other, similar posts, but perhaps it is and I just don't see it? Any ideas?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract

class CrawlerSpider(CrawlSpider):
    name = "crawler"

    def __init__(self, initial_url):
        log.msg('initing...', level=log.WARNING)
        CrawlSpider.__init__(self)

        if not initial_url.startswith('http'):
            initial_url = 'http://' + initial_url

        ext = tldextract.extract(initial_url)
        initial_domain = ext.domain + '.' + ext.tld
        initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
        self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
        self.start_urls = [initial_url]

        self.rules = [
            Rule(SgmlLinkExtractor(), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
        ]
        self._compile_rules()

    def parse_item(self, response):
        log.msg('parse_item...', level=log.WARNING)
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//a/@href").extract()
        for link in links:
            log.msg('link', level=log.WARNING)
Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.
$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines:
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 363,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)
Thanks in advance!
The start URL http://www.szuhanchang.com/test.html has only one anchor link, namely:
<a href="http://www.20130219-0606.com/">Test</a>
which contains a link to the domain 20130219-0606.com and according to your allowed_domains of:
['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']
this Request gets filtered by the OffsiteMiddleware:
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
therefore parse_item will not be called for this url.
Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I cannot be sure if this will still be effective. Give it a go and let me know. :)
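For clarity, the rename I mean looks roughly like this (a sketch, not tested): parse_start_url is the hook CrawlSpider calls for responses to start_urls, so it fires even when every extracted link is filtered as offsite.
# inside CrawlerSpider from the question: rename parse_item to parse_start_url
# and point the first Rule's callback string at the new name
def parse_start_url(self, response):
    log.msg('parse_start_url...', level=log.WARNING)
    hxs = HtmlXPathSelector(response)
    for link in hxs.select("//a/@href").extract():
        log.msg(link, level=log.WARNING)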

python - scrapy doesn't follow links

I'm trying to parse a site with Scrapy. The URLs I need to parse are formed like this: http://example.com/productID/1234/. These links can be found on pages with addresses like http://example.com/categoryID/1234/. The thing is that my crawler fetches the first categoryID page (http://www.example.com/categoryID/79/, as you can see from the trace below), but nothing more. What am I doing wrong? Thank you.
Here is my Scrapy code:
# -*- coding: UTF-8 -*-
#THIRD-PARTY MODULES
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleComSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["http://www.example.com/"]

    start_urls = [
        "http://www.example.com/"
    ]

    rules = (
        # Extract links matching 'categoryID/xxx'
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/', ), )),

        # Extract links matching 'productID/xxx' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
Here is a trace of Scrapy:
2012-01-31 12:38:56+0000 [scrapy] INFO: Scrapy 0.14.1 started (bot: parsers)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled item pipelines:
2012-01-31 12:38:57+0000 [example.com] INFO: Spider opened
2012-01-31 12:38:57+0000 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-01-31 12:38:58+0000 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2012-01-31 12:38:58+0000 [example.com] DEBUG: Filtered offsite request to 'www.example.com': <GET http://www.example.com/categoryID/79/>
2012-01-31 12:38:58+0000 [example.com] INFO: Closing spider (finished)
2012-01-31 12:38:58+0000 [example.com] INFO: Dumping spider stats:
{'downloader/request_bytes': 199,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 121288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 1, 31, 12, 38, 58, 409806),
'request_depth_max': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 1, 31, 12, 38, 57, 127805)}
2012-01-31 12:38:58+0000 [example.com] INFO: Spider closed (finished)
2012-01-31 12:38:58+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 26992640, 'memusage/startup': 26992640}
It could be a difference between "www.example.com" and "example.com". If it helps, you can use them both, like this:
allowed_domains = ["www.example.com", "example.com"]
Replace:
allowed_domains = ["http://www.example.com/"]
with:
allowed_domains = ["example.com"]
That should do the trick.
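With that one change, the spider from the question would look roughly like this (untested sketch):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleComSpider(CrawlSpider):
    name = "example.com"
    # bare domain: no scheme, no trailing slash, so the OffsiteMiddleware
    # accepts requests to both example.com and www.example.com
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # follow category pages
        Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/',))),
        # parse product pages
        Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)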
