I followed a book published by O'Reilly to create the spider below:
articleSpider.py
from scrapy.spiders import CrawlSpider, Rule
from TestScrapy.items import Article
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Object-oriented_programming"]
    rules = [Rule(LinkExtractor(allow=('(/wiki/)((?!:).)*$'), ),callback="parse_item", follow=True)]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print "Title is:" + title
        item['title'] = title
        return item
Items.py
from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
However, when I run this spider, it displays just one result and then terminates. I expected it to keep running until I stop it.
Here are the output and debug info from Scrapy:
2016-06-06 15:45:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: TestScrapy)
2016-06-06 15:45:28 [scrapy] INFO: Optional features available: ssl, http11
2016-06-06 15:45:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TestScrapy.spiders', 'SPIDER_MODULES': ['TestScrapy.spiders'], 'BOT_NAME': 'TestScrapy'}
2016-06-06 15:45:29 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-06 15:45:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-06 15:45:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-06 15:45:30 [scrapy] INFO: Enabled item pipelines:
2016-06-06 15:45:30 [scrapy] INFO: Spider opened
2016-06-06 15:45:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-06 15:45:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-06 15:45:33 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Object-oriented_programming> (referer: None)
Title is:Object-oriented programming
2016-06-06 15:45:33 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Object-oriented_programming>
{'title': u'Object-oriented programming'}
2016-06-06 15:45:33 [scrapy] INFO: Closing spider (finished)
2016-06-06 15:45:33 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 246,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 51238,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 6, 7, 45, 33, 441000),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 6, 6, 7, 45, 30, 614000)}
2016-06-06 15:45:33 [scrapy] INFO: Spider closed (finished)
Change the method name from parse to parse_item.
As written, only the start URL is crawled: CrawlSpider uses parse internally to apply the rules, so overriding it means the rules never run, and the parse_item callback named in the rule does not exist. The spider therefore finishes after the first page.
See this CrawlSpider example:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example
You can also override parse_start_url instead of parse here if you want to process the response for the start URL.
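Once the callback is reachable, the allow pattern in the rule decides which links are followed. The sketch below checks that pattern against a few URLs with the standard re module. It assumes the link extractor applies the pattern as a regular-expression search over each absolute URL, and the sample URLs are made up for illustration.

```python
import re

# Same pattern as the `allow` argument in the question's rule.
ALLOW = re.compile(r'(/wiki/)((?!:).)*$')

# Hypothetical sample links, chosen only to illustrate the pattern.
candidates = [
    "https://en.wikipedia.org/wiki/Object-oriented_programming",
    "https://en.wikipedia.org/wiki/Talk:Object-oriented_programming",
    "https://en.wikipedia.org/wiki/Category:Programming_paradigms",
    "https://en.wikipedia.org/w/index.php?title=Special:Search",
]


def is_article_link(url):
    """Return True when the pattern finds a match anywhere in the URL."""
    return ALLOW.search(url) is not None


accepted = [url for url in candidates if is_article_link(url)]
for url in accepted:
    print(url)
```

Only links under /wiki/ with no colon after that prefix pass, which keeps Talk:, Category: and similar namespace pages out of the crawl.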
I am trying to scrape a site using Scrapy and Selenium.
I can get the web browser to open using Selenium, but I am unable to load the start URL in it. At present, the browser opens, does nothing, and then closes, and I get the error "<405 https://etc etc>: HTTP status code is not handled or not allowed".
As far as I understand, this confirms that I am not passing the URL to the web browser.
What am I doing wrong here?
import scrapy
import time
from selenium import webdriver
from glassdoor.items import GlassdoorItem

class glassdoorSpider(scrapy.Spider):
    name = "glassdoor"
    allowed_domains = ["glassdoor.co.uk"]
    start_urls = ["https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm",
    ]

    def __init__(self):
        self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(5)
        for sel in response.xpath('//*[@id="EmpStats"]'):
            item = GlassdoorItem()
            item['rating'] = sel.xpath('//*[@class="notranslate ratingNum"]/text()').extract()
            # item['recommend'] = sel.xpath('//*[@class="address"]/text()').extract()
            # item['approval'] = sel.xpath('//*[@class="address"]/text()').extract()
            yield item
        # self.driver.close()
The logs I get from the above are:
2017-01-26 21:49:02 [scrapy] INFO: Scrapy 1.0.5 started (bot: glassdoor)
2017-01-26 21:49:02 [scrapy] INFO: Optional features available: ssl, http11
2017-01-26 21:49:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'glassdoor.spiders', 'SPIDER_MODULES': ['glassdoor.spiders'], 'BOT_NAME': 'glassdoor'}
2017-01-26 21:49:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-01-26 21:49:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:58378/session {"requiredCapabilities": {}, "desiredCapabilities": {"platform": "ANY", "browserName": "chrome", "version": "", "chromeOptions": {"args": [], "extensions": []}, "javascriptEnabled": true}}
2017-01-26 21:49:06 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-01-26 21:49:06 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-01-26 21:49:06 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-01-26 21:49:06 [scrapy] INFO: Enabled item pipelines:
2017-01-26 21:49:06 [scrapy] INFO: Spider opened
2017-01-26 21:49:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-26 21:49:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-26 21:49:07 [scrapy] DEBUG: Crawled (405) <GET https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm> (referer: None)
2017-01-26 21:49:07 [scrapy] DEBUG: Ignoring response <405 https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm>: HTTP status code is not handled or not allowed
2017-01-26 21:49:07 [scrapy] INFO: Closing spider (finished)
2017-01-26 21:49:07 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 269,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7412,
'downloader/response_count': 1,
'downloader/response_status_count/405': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 26, 21, 49, 7, 388000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 1, 26, 21, 49, 6, 572000)}
2017-01-26 21:49:07 [scrapy] INFO: Spider closed (finished)
OK, as suggested by both replies, I was not working with the correct response: the XPath queries need to run on the page that Selenium loaded.
So I added this line:
response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
and changed one line of the code accordingly:
for sel in response1.xpath('//*[@id="EmpStats"]'):
The new code, which works, is:
import scrapy
import time
from scrapy.http import TextResponse
from selenium import webdriver
from glassdoor.items import GlassdoorItem

class glassdoorSpider(scrapy.Spider):
    header = {"User-Agent":"Mozilla/5.0 Gecko/20100101 Firefox/33.0"}
    name = "glassdoor"
    allowed_domains = ["glassdoor.co.uk"]
    start_urls = ["https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm",
    ]

    def __init__(self):
        self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response):
        self.driver.get(response.url)
        # give the page time to render before reading its source
        time.sleep(5)
        # build a Scrapy response from the HTML that Selenium loaded
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for sel in response1.xpath('//*[@id="EmpStats"]'):
            item = GlassdoorItem()
            item['rating'] = sel.xpath('//*[@class="notranslate ratingNum"]/text()').extract()
            # item['recommend'] = sel.xpath('//*[@class="address"]/text()').extract()
            # item['approval'] = sel.xpath('//*[@class="address"]/text()').extract()
            yield item
        # self.driver.close()
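The point of the fix is that the XPath queries run on the HTML that Selenium loaded rather than on the response Scrapy downloaded. Here is a small stand-alone sketch of that difference. It uses the standard library's ElementTree in place of Scrapy selectors, and both markup strings and the rating value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is far larger.
RAW_BODY = "<html><body><div id='app'/></body></html>"
RENDERED_BODY = (
    "<html><body><div id='EmpStats'>"
    "<span class='notranslate ratingNum'>3.2</span>"
    "</div></body></html>"
)


def extract_ratings(markup):
    """Return the text of every rating element inside the EmpStats block."""
    root = ET.fromstring(markup)
    ratings = []
    for block in root.findall(".//*[@id='EmpStats']"):
        for node in block.findall(".//*[@class='notranslate ratingNum']"):
            ratings.append(node.text)
    return ratings


raw_ratings = extract_ratings(RAW_BODY)
rendered_ratings = extract_ratings(RENDERED_BODY)
print(raw_ratings, rendered_ratings)
```

The same query finds nothing in the first string and one rating in the second, which is why the selector has to be built from driver.page_source.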
I'm using Scrapy to scrape this site, but when I run the spider I don't see any response.
I tried reddit.com and quora.com and both returned data (the crawl started), but not the site I want.
Here is my simple spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule

class FirstSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ["bayt.com"]
    start_urls = (
        'http://www.bayt.com/',
    )
    rules = [
        Rule(
            LinkExtractor(allow=['.*']),
        )
    ]
I tried several combinations of URLs in start_urls, but nothing seemed to work.
Here is the log after running the spider:
2015-12-13 20:31:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: bayt)
2015-12-13 20:31:45 [scrapy] INFO: Optional features available: ssl, http11
2015-12-13 20:31:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bayt.spiders', 'SPIDER_MODULES': ['bayt.spiders'], 'BOT_NAME': 'bayt'}
2015-12-13 20:31:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-13 20:31:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-13 20:31:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-13 20:31:45 [scrapy] INFO: Enabled item pipelines:
2015-12-13 20:31:45 [scrapy] INFO: Spider opened
2015-12-13 20:31:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-13 20:31:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-13 20:31:45 [scrapy] DEBUG: Redirecting (302) to <GET http://www.bayt.com/en/jordan/> from <GET http://www.bayt.com/>
2015-12-13 20:31:46 [scrapy] DEBUG: Crawled (200) <GET http://www.bayt.com/en/jordan/> (referer: None)
2015-12-13 20:31:46 [scrapy] INFO: Closing spider (finished)
2015-12-13 20:31:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 881,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2320,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 13, 18, 31, 46, 212468),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 13, 18, 31, 45, 138408)}
2015-12-13 20:31:46 [scrapy] INFO: Spider closed (finished)
The problem is that the rules are never applied: you have defined your own parse method, but CrawlSpider uses parse internally to implement its rules, so you shouldn't override it.
If you still get items when overriding parse, it is because parse is the default callback for the start_urls requests. The rules are not followed, and only the URLs in start_urls are crawled.
Just rename your parsing method from parse to something else, and specify that name as the callback in your rule.
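To see why the override matters, here is a toy model of the dispatch. ToyCrawlSpider is a simplified stand-in written for this illustration and is not Scrapy's actual code; the page dictionary and its links are also made up.

```python
class ToyCrawlSpider:
    """Simplified stand-in for CrawlSpider; not Scrapy's actual code."""

    rules = ()  # pairs of (link_filter, callback_name)

    def parse(self, page):
        # The base class owns parse(): this is where rules are applied.
        scheduled = []
        for link in page["links"]:
            for link_filter, callback_name in self.rules:
                if link_filter(link):
                    scheduled.append((link, callback_name))
                    break
        return scheduled


class BrokenSpider(ToyCrawlSpider):
    rules = ((lambda link: link.startswith("/wiki/"), "parse_item"),)

    def parse(self, page):
        # Overriding parse() replaces the rule handling above.
        return [{"title": page["title"]}]


class FixedSpider(ToyCrawlSpider):
    rules = ((lambda link: link.startswith("/wiki/"), "parse_item"),)

    def parse_item(self, page):
        return [{"title": page["title"]}]


page = {"title": "Object-oriented programming", "links": ["/wiki/Class", "/about"]}
broken_result = BrokenSpider().parse(page)
fixed_result = FixedSpider().parse(page)
print(broken_result)
print(fixed_result)
```

The broken spider returns a single item and schedules nothing, while the fixed one schedules the matching link for parse_item. This mirrors the "one result, then finished" behaviour in the log.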
I ran curl www.bayt.com on the command line, and it seems the request is redirected to http://www.bayt.com/en/jordan/.
I set that as my start_urls and changed the user agent in settings.py to localhost, and it worked.
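For reference, the user agent is controlled by the USER_AGENT setting. A minimal settings.py fragment could look like the following; the value simply repeats what the answer reports using, and whether a given site accepts it is an assumption.

```python
# settings.py (fragment)
# USER_AGENT sets the default User-Agent header Scrapy sends with requests.
# "localhost" is the value reported in the answer above, not a recommendation.
USER_AGENT = "localhost"
```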
I'm trying to scrape data from Twitter, but I'm having some problems. I think my spider can't log in to Twitter, but I'm not sure.
Here is my EXACT code:
class Clause(CrawlSpider):
    name="Clause"
    allowed_domains=['twitter.com']
    login_url=['http://twitter.com/login']
    dont_filter=True
    Rules=(
        Rule(SgmlLinkExtractor(allow= ('twittre.com.+')),callback='Myparse',follow=True),
    )

    def start_requests(self):
        print "\n\n\n start_requests\n\n\n"
        yield Request(url=self.login_url,
                      callback=self.login,
                      dont_filter=True
                      )

    def login(self,response):
        print "\n\n\n login is running \n\n\n"
        return FormRequest.from_response(response,
            formdata={'session[username_or_email]':'s.shahryar75@gmail.com','session[password]':'********'},
            callback=self.check_login)

    def check_login(self,response):
        print "\n\n\n check login is running\n\n\n"
        if "SlgShahryar" in response.body:
            print "\n\n\n ************successfully logged in************\n\n\n "
            return Request(url='http://twitter.com/SlgShahryar',callback='Myparse',dont_filter=True)
        else:
            print "\n\n\n __________authentication failed :(((( ___________ \n\n\n"
            return

    def Myparse(self,response):
        hxs=HtmlXPathSelector(response)
        print "***************My parse is running!*********************"
        tweets=hxs.select('//li')
        items=list()
        for tweet in tweets:
            item=ClauseItem()
            item['Text']=tweets.select('//p/text()').extract()
            item['writter']=tweets.select('@data-name')
My program runs start_requests() and then login(), but it doesn't run check_login and quits. Here is the output I got:
C:\Users\Shahryar\Desktop\FootBallFanFinder\crawling\Clause>scrapy crawl Clause -o scraped_data4.csv -t csv
2015-03-20 11:10:55+0330 [scrapy] INFO: Scrapy 0.24.5 started (bot: Clause)
2015-03-20 11:10:55+0330 [scrapy] INFO: Optional features available: ssl, http11
2015-03-20 11:10:55+0330 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Clause.spiders', 'FEED_URI': 'scraped_data4.csv', 'DEPTH_LIMIT': 50, 'SPIDER_MODULES': ['Clause.spiders'], 'BOT_NAME': 'Clause', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.8}
2015-03-20 11:11:05+0330 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-20 11:11:45+0330 [scrapy] INFO: Enabled item pipelines:
2015-03-20 11:11:45+0330 [Clause] INFO: Spider opened
2015-03-20 11:11:45+0330 [Clause] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
start_requests
2015-03-20 11:11:46+0330 [Clause] DEBUG: Redirecting (301) to <GET https://www.twitter.com/login> from <GET http://www.twitter.com/login>
2015-03-20 11:11:47+0330 [Clause] DEBUG: Redirecting (301) to <GET https://twitter.com/login> from <GET https://www.twitter.com/login>
2015-03-20 11:11:49+0330 [Clause] DEBUG: Crawled (200) <GET https://twitter.com/login> (referer: None)
login is running
2015-03-20 11:11:50+0330 [Clause] DEBUG: Crawled (404) <POST https://twitter.com/sessions/change_locale> (referer: https://twitter.com/login)
2015-03-20 11:11:50+0330 [Clause] DEBUG: Ignoring response <404 https://twitter.com/sessions/change_locale>: HTTP status code is not handled or not allowed
2015-03-20 11:11:50+0330 [Clause] INFO: Closing spider (finished)
2015-03-20 11:11:50+0330 [Clause] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1572,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 15533,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 20, 7, 41, 50, 205000),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 3, 20, 7, 41, 45, 97000)}
2015-03-20 11:11:50+0330 [Clause] INFO: Spider closed (finished)
I'm not sure about the part of the login function where I wrote session[username_or_email] and session[password]. Is that correct, and if not, what should I write there? (I used the name attributes of those fields on the login page, following examples I have seen.)
Could you please help me?
Thanks a lot in advance.
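The log shows the POST going to /sessions/change_locale, which suggests the request was built from a form other than the login form. FormRequest.from_response uses the first form on the page unless told otherwise, for example through its formnumber argument. One way to check which forms and field names a page contains is to list them. The sketch below does that with the standard library only. The sample markup is hypothetical: the second form's action and the first form's field are assumptions, not Twitter's real page.

```python
from html.parser import HTMLParser

# Hypothetical, simplified login page with two forms; the real markup differs.
SAMPLE_PAGE = """
<form action="/sessions/change_locale" method="post">
  <input name="lang" value="en">
</form>
<form action="/sessions" method="post">
  <input name="session[username_or_email]">
  <input name="session[password]" type="password">
</form>
"""


class FormLister(HTMLParser):
    """Collect each form's action and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action"), "inputs": []})
        elif tag == "input" and self.forms and "name" in attrs:
            self.forms[-1]["inputs"].append(attrs["name"])


lister = FormLister()
lister.feed(SAMPLE_PAGE)
for index, form in enumerate(lister.forms):
    print(index, form["action"], form["inputs"])
```

If the login form turns out not to be the first one, passing its index as formnumber to FormRequest.from_response would select it.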
I'm having a problem getting my Scrapy spider to run its callback method.
I don't think it's an indentation error, which seems to have been the cause in earlier posts, but perhaps it is and I just don't see it. Any ideas?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract

class CrawlerSpider(CrawlSpider):
    name = "crawler"

    def __init__(self, initial_url):
        log.msg('initing...', level=log.WARNING)
        CrawlSpider.__init__(self)
        if not initial_url.startswith('http'):
            initial_url = 'http://' + initial_url
        ext = tldextract.extract(initial_url)
        initial_domain = ext.domain + '.' + ext.tld
        initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
        self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
        self.start_urls = [initial_url]
        self.rules = [
            Rule(SgmlLinkExtractor(), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
        ]
        self._compile_rules()

    def parse_item(self, response):
        log.msg('parse_item...', level=log.WARNING)
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//a/@href").extract()
        for link in links:
            log.msg('link', level=log.WARNING)
Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.
$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines:
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 363,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)
Thanks in advance!
The start URL, http://www.szuhanchang.com/test.html, contains only one anchor link, namely:
Test
which points to the domain 20130219-0606.com. According to your allowed_domains of:
['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']
this request gets filtered by the OffsiteMiddleware:
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
so parse_item will not be called for this URL.
Changing the name of your callback to parse_start_url seems to work, although since the test page provided is so small, I cannot be sure this will hold for a larger site. Give it a go and let me know. :)
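The offsite filtering described above can be approximated in a few lines. This is a simplified check written for illustration (exact host or subdomain match) and is not the OffsiteMiddleware's actual implementation; the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Simplified offsite check: the host must equal an allowed domain
    or be a subdomain of one. Not Scrapy's actual implementation."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# The allowed_domains list the spider builds for the test URL.
allowed = ["szuhanchang.com", "www.szuhanchang.com", "www.szuhanchang.com"]

start_ok = is_onsite("http://www.szuhanchang.com/test.html", allowed)
link_ok = is_onsite("http://www.20130219-0606.com/", allowed)
print(start_ok, link_ok)
```

The start URL passes the check, but the only link on the page does not, so no request ever reaches parse_item.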
I'm trying to parse a site with Scrapy. The URLs I need to parse look like http://example.com/productID/1234/. These links can be found on pages with addresses like http://example.com/categoryID/1234/. The problem is that my crawler reaches the first categoryID page (http://www.example.com/categoryID/79/, as you can see from the trace below), but nothing more. What am I doing wrong? Thank you.
Here is my Scrapy code:
# -*- coding: UTF-8 -*-

#THIRD-PARTY MODULES
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleComSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["http://www.example.com/"]
    start_urls = [
        "http://www.example.com/"
    ]
    rules = (
        # Extract links matching 'categoryID/xxx'
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/', ), )),
        # Extract links matching 'productID/xxx' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
Here is a trace of Scrapy:
2012-01-31 12:38:56+0000 [scrapy] INFO: Scrapy 0.14.1 started (bot: parsers)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled item pipelines:
2012-01-31 12:38:57+0000 [example.com] INFO: Spider opened
2012-01-31 12:38:57+0000 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-01-31 12:38:58+0000 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2012-01-31 12:38:58+0000 [example.com] DEBUG: Filtered offsite request to 'www.example.com': <GET http://www.example.com/categoryID/79/>
2012-01-31 12:38:58+0000 [example.com] INFO: Closing spider (finished)
2012-01-31 12:38:58+0000 [example.com] INFO: Dumping spider stats:
{'downloader/request_bytes': 199,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 121288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 1, 31, 12, 38, 58, 409806),
'request_depth_max': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 1, 31, 12, 38, 57, 127805)}
2012-01-31 12:38:58+0000 [example.com] INFO: Spider closed (finished)
2012-01-31 12:38:58+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 26992640, 'memusage/startup': 26992640}
It may be the difference between "www.example.com" and "example.com". If it helps, you can list both, like this:
allowed_domains = ["www.example.com", "example.com"]
Replace:
allowed_domains = ["http://www.example.com/"]
with:
allowed_domains = ["example.com"]
That should do the trick.
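A short sketch of why the URL-style entry fails and the bare domain works. Both helpers are hypothetical names written for this illustration, and the matching rule (exact host or subdomain) is a simplification rather than Scrapy's own code.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Turn a URL-style entry into a bare host name; leave plain domains alone."""
    host = urlparse(entry).hostname if "://" in entry else entry
    return host[4:] if host.startswith("www.") else host


def host_allowed(url, allowed_domains):
    # Simplified check, not Scrapy's code: exact host or subdomain match.
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


category_url = "http://www.example.com/categoryID/79/"
original = ["http://www.example.com/"]
cleaned = [to_domain(entry) for entry in original]

before = host_allowed(category_url, original)
after = host_allowed(category_url, cleaned)
print(cleaned, before, after)
```

With the scheme and trailing slash in the entry, no host name can ever match it, which is consistent with the "Filtered offsite request" line in the trace.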