I'm trying to parse site with Scrapy. The urls I need to parse formed like this http://example.com/productID/1234/. This links can be found on pages with address like: http://example.com/categoryID/1234/. The thing is that my crawler fetches first categoryID page (http://www.example.com/categoryID/79/, as you can see from trace below), but nothing more. What am I doing wrong? Thank you.
Here is my Scrapy code:
# -*- coding: UTF-8 -*-
#THIRD-PARTY MODULES
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class ExampleComSpider(CrawlSpider):
name = "example.com"
allowed_domains = ["http://www.example.com/"]
start_urls = [
"http://www.example.com/"
]
rules = (
# Extract links matching 'categoryID/xxx'
# and follow links from them (since no callback means follow=True by default).
Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/', ), )),
# Extract links matching 'productID/xxx' and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/', )), callback='parse_item'),
)
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
Here is a trace of Scrapy:
2012-01-31 12:38:56+0000 [scrapy] INFO: Scrapy 0.14.1 started (bot: parsers)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled item pipelines:
2012-01-31 12:38:57+0000 [example.com] INFO: Spider opened
2012-01-31 12:38:57+0000 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-01-31 12:38:58+0000 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2012-01-31 12:38:58+0000 [example.com] DEBUG: Filtered offsite request to 'www.example.com': <GET http://www.example.com/categoryID/79/>
2012-01-31 12:38:58+0000 [example.com] INFO: Closing spider (finished)
2012-01-31 12:38:58+0000 [example.com] INFO: Dumping spider stats:
{'downloader/request_bytes': 199,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 121288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 1, 31, 12, 38, 58, 409806),
'request_depth_max': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 1, 31, 12, 38, 57, 127805)}
2012-01-31 12:38:58+0000 [example.com] INFO: Spider closed (finished)
2012-01-31 12:38:58+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 26992640, 'memusage/startup': 26992640}
It can be a difference between "www.example.com" and "example.com". If it helps, you can use them both this way
allowed_domains = ["www.example.com", "example.com"]
Replace:
allowed_domains = ["http://www.example.com/"]
with:
allowed_domains = ["example.com"]
That should do the trick.
Related
I'm trying to scrape the prices for shoes on the website in the code. I have no idea of knowing if my syntax is even correct. I could really use some help.
from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector
def Yeezy(Item):
price = Field()
class YeezySpider(BaseSpider):
name = "yeezy"
allowed_domains = ["https://www.grailed.com/"]
start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']
def parse(self, response):
hxs = HtmlXPathSelector(response)
price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
items = []
for price in price:
item = Yeezy()
item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
items.append(item)
yield item
The code is reporting this to the console:
ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
At first I thought it was a problem with the css elements I entered but now I'm not so sure. This is my first time trying a project like this, I could really use some insight. Thank you in advance.
EDIT: So I tried simulating an xhr request in my code by following another example. This is what I have:
import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem
class YeezySpider(scrapy.Spider):
name = "yeezy"
allowed_domains = ["www.grailed.com"]
start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]
def parse(self, response):
for i in range(0,2):
yield FormRequest(url = 'https://mnrwefss2q-
dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-
agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-
id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad',
method="post", formdata={"params":"query:boost
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather'
OR category_path:'footwear.hitop_sneakers' OR
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND
(marketplace:grailed)
hitsPerPage:40
facets ["strata","size","category","category_size",
"category_path","category_path_size",
"category_path_root_size","price_i","designers.id",
"location","marketplace"]
page:2"}, callback=self.data_parse())
def data_parse(self, response):
hxs = HtmlXPathSelector(response)
prices = hxs.xpath("//p").extract()
for prices in prices:
price = prices.select("a/text()").extract()
print price
I had to reformat things a little to fit the indentation differences between Python and Stackoverflow.
These are the logs reported in the terminal, again thanks for the help:
C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
Seems like the products are retrieved by AJAX (see related: Can scrapy be used to scrape dynamic content from websites that are using AJAX?).
If you open up browsers webinspector, select network tab and look for XHR requests when the page loads, you can see this:
Seems like a POST type request is being made with categories, filter etc. and a json of products is returned. You can reverse engineer it and replicate it in scrapy.
I follow book published by o'Reilly to create a spider as below:
articleSpider.py
from scrapy.spiders import CrawlSpider, Rule
from TestScrapy.items import Article
from scrapy.linkextractors import LinkExtractor
class ArticleSpider(CrawlSpider):
name = "article"
allowed_domains = ["en.wikipedia.org"]
start_urls = ["https://en.wikipedia.org/wiki/Object-oriented_programming"]
rules = [Rule(LinkExtractor(allow=('(/wiki/)((?!:).)*$'), ),callback="parse_item", follow=True)]
def parse(self, response):
item = Article()
title = response.xpath('//h1/text()')[0].extract()
print "Title is:" + title
item['title'] = title
return item
Items.py
from scrapy import Item, Field
class Article(Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = Field()
However, when I run this spider, it just display one result and the terminates. I expect it to run until I terminate it.
Please see the result and debug info from Scrapy:
2016-06-06 15:45:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: TestScrapy)
2016-06-06 15:45:28 [scrapy] INFO: Optional features available: ssl, http11
2016-06-06 15:45:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TestScrapy.spiders', 'SPIDER_MODULES': ['TestScrapy.spiders'], 'BOT_NAME': 'TestScrapy'}
2016-06-06 15:45:29 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-06 15:45:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-06 15:45:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-06 15:45:30 [scrapy] INFO: Enabled item pipelines:
2016-06-06 15:45:30 [scrapy] INFO: Spider opened
2016-06-06 15:45:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-06 15:45:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-06 15:45:33 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Object-oriented_programming> (referer: None)
Title is:Object-oriented programming
2016-06-06 15:45:33 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Object-oriented_programming>
{'title': u'Object-oriented programming'}
2016-06-06 15:45:33 [scrapy] INFO: Closing spider (finished)
2016-06-06 15:45:33 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 246,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 51238,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 6, 7, 45, 33, 441000),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 6, 6, 7, 45, 30, 614000)}
2016-06-06 15:45:33 [scrapy] INFO: Spider closed (finished)
Change the method name from parse to parse_item.
Now you're just crawling the start url but when filtering the rules there is no method to callback, thus the spider ends the execution.
Check this example of CrawlSpider:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example
You can also use start_product_requests instead of parse here.
I'm using scrapy to scrap this site but when I run the spider I don't see any response.
I tried reddit.com and quora.com and they both returned data (started to crawl) but not the site I want.
Here is my simple spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
class FirstSpider(CrawlSpider):
name = "jobs"
allowed_domains = ["bayt.com"]
start_urls = (
'http://www.bayt.com/',
)
rules = [
Rule(
LinkExtractor(allow=['.*']),
)
]
I tried several combinations of urls in the start_urls but nothing seemed to work.
Here is the log after running the spider:
2015-12-13 20:31:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: bayt)
2015-12-13 20:31:45 [scrapy] INFO: Optional features available: ssl, http11
2015-12-13 20:31:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bayt.spiders', 'SPIDER_MODULES': ['bayt.spiders'], 'BOT_NAME': 'bayt'}
2015-12-13 20:31:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-13 20:31:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-13 20:31:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-13 20:31:45 [scrapy] INFO: Enabled item pipelines:
2015-12-13 20:31:45 [scrapy] INFO: Spider opened
2015-12-13 20:31:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-13 20:31:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-13 20:31:45 [scrapy] DEBUG: Redirecting (302) to <GET http://www.bayt.com/en/jordan/> from <GET http://www.bayt.com/>
2015-12-13 20:31:46 [scrapy] DEBUG: Crawled (200) <GET http://www.bayt.com/en/jordan/> (referer: None)
2015-12-13 20:31:46 [scrapy] INFO: Closing spider (finished)
2015-12-13 20:31:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 881,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2320,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 13, 18, 31, 46, 212468),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 13, 18, 31, 45, 138408)}
2015-12-13 20:31:46 [scrapy] INFO: Spider closed (finished)
the problem is that you are not using the rules as you mentioned, you have your own parse method, which is not ok, CrawlSpider uses the parse method so you shouldn't override that method.
Now, if you are still getting items when overriding the parse method, it is because parse is the default method for the start_urls requests, so the requests are not really following the rules, but only crawling urls inside start_urls
Just change the name of your parsing method from parse to a different one, and specify that on your rule as a callback.
I did Curl www.bayt.com in the command line and it seems that they redirect the request to http://www.bayt.com/en/jordan/
I put that as My start_urls and it worked and changed the user agent in the settings.py to localhost and it worked.
I'd like to scrape parts of a number of very large websites using Scrapy. For instance, from northeastern.edu I would like to scrape only pages that are below the URL http://www.northeastern.edu/financialaid/, such as http://www.northeastern.edu/financialaid/contacts or http://www.northeastern.edu/financialaid/faq. I do not want to scrape the university's entire web site, i.e. http://www.northeastern.edu/faq should not be allowed.
I have no problem with URLs in the format financialaid.northeastern.edu (by simply limiting the allowed_domains to financialaid.northeastern.edu), but the same strategy doesn't work for northestern.edu/financialaid. (The whole spider code is actually longer as it loops through different web pages, I can provide details. Everything works apart from the rules.)
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['northestern.edu/financialaid']
start_urls = ['http://www.northestern.edu/financialaid/']
rules = (
Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
)
def parse_item(self, response):
i = testItem()
#i['domain_id'] = response.xpath('//input[#id="sid"]/#value').extract()
#i['name'] = response.xpath('//div[#id="name"]').extract()
#i['description'] = response.xpath('//div[#id="description"]').extract()
return i
The results look like this:
2015-05-12 14:10:46-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: finaid_scraper)
2015-05-12 14:10:46-0700 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:10:46-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finaid_scraper.spiders', 'SPIDER_MODULES': ['finaid_scraper.spiders'], 'FEED_URI': '/Users/hugo/Box Sync/finaid/ScrapedSiteText_check/Northeastern.json', 'USER_AGENT': 'stanford_sociology', 'BOT_NAME': 'finaid_scraper'}
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:10:46-0700 [graphspider] INFO: Spider opened
2015-05-12 14:10:46-0700 [graphspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-12 14:10:46-0700 [graphspider] DEBUG: Redirecting (301) to <GET http://www.northeastern.edu/financialaid/> from <GET http://www.northeastern.edu/financialaid>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/financialaid/> (referer: None)
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'assistive.usablenet.com': <GET http://assistive.usablenet.com/tt/http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.northeastern.edu': <GET http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/pages/Boston-MA/NU-Student-Financial-Services/113143082891>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/NUSFS>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'nusfs.wordpress.com': <GET http://nusfs.wordpress.com/>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'northeastern.edu': <GET http://northeastern.edu/howto>
2015-05-12 14:10:47-0700 [graphspider] INFO: Closing spider (finished)
2015-05-12 14:10:47-0700 [graphspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 431,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9574,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 12, 21, 10, 47, 94112),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 6,
'offsite/filtered': 32,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 5, 12, 21, 10, 46, 566538)}
2015-05-12 14:10:47-0700 [graphspider] INFO: Spider closed (finished)
The second strategy I attempted was to use allow-rules of the LxmlLinkExtractor and to limit the crawl to everything within the sub-domain, but in that case the entire web page is scraped. (Deny-rules do work.)
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['www.northestern.edu']
start_urls = ['http://www.northestern.edu/financialaid/']
rules = (
Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
)
def parse_item(self, response):
i = testItem()
#i['domain_id'] = response.xpath('//input[#id="sid"]/#value').extract()
#i['name'] = response.xpath('//div[#id="name"]').extract()
#i['description'] = response.xpath('//div[#id="description"]').extract()
return i
I also tried:
rules = (
Rule(LxmlLinkExtractor(allow=(r"northeastern.edu/financialaid",)), callback='parse_site', follow=True),
)
The log is too long to be posted here, but these lines show that Scrapy ignores the allow-rule:
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/tag/north-carolina/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Scraped from <200 http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/>
Here is my items.py:
from scrapy.item import Item, Field
class FinAidScraperItem(Item):
# define the fields for your item here like:
url=Field()
linkedurls=Field()
internal_linkedurls=Field()
external_linkedurls=Field()
http_status=Field()
title=Field()
text=Field()
I am using Mac, Python 2.7, Scrapy version 0.24.4. Similar questions have been posted before, but none of the suggested solutions fixed my problem.
You have a typo in your URLs used inside spiders, see:
northeastern
vs
northestern
Here is the spider that worked for me (it follows "financialaid" links only):
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['northeastern.edu']
start_urls = ['http://www.northeastern.edu/financialaid/']
rules = (
Rule(LinkExtractor(allow=r"financialaid/"), callback='parse_item', follow=True),
)
def parse_item(self, response):
print response.url
Note that I'm using LinkExtractor shortcut and a string for the allow argument value.
I've also edited your question and fixed the indentation problems assuming they were just "posting" issues.
I'm having a problem getting my Scrapy spider to run its callback method.
I don't think it's an indentation error which seems to be the case for the other previous posts, but perhaps it is and I don't know it? Any ideas?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract
class CrawlerSpider(CrawlSpider):
name = "crawler"
def __init__(self, initial_url):
log.msg('initing...', level=log.WARNING)
CrawlSpider.__init__(self)
if not initial_url.startswith('http'):
initial_url = 'http://' + initial_url
ext = tldextract.extract(initial_url)
initial_domain = ext.domain + '.' + ext.tld
initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
self.start_urls = [initial_url]
self.rules = [
Rule(SgmlLinkExtractor(), callback='parse_item'),
Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
]
self._compile_rules()
def parse_item(self, response):
log.msg('parse_item...', level=log.WARNING)
hxs = HtmlXPathSelector(response)
links = hxs.select("//a/#href").extract()
for link in links:
log.msg('link', level=log.WARNING)
Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.
$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines:
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 363,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)
Thanks in advance!
The start_urls of http://www.szuhanchang.com/test.html has only one anchor link, namely:
Test
which contains a link to the domain 20130219-0606.com and according to your allowed_domains of:
['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']
this Request gets filtered by the OffsiteMiddleware:
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
therefore parse_item will not be called for this url.
Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I cannot be sure if this will still be effective. Give it a go and let me know. :)