A genuine Scrapy and Python noob here so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of the Article. I managed to crawl a single page for one item but the moment I try and expand beyond that it all goes wrong.
my Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem
class BasicSpiderSpider(CrawlSpider):
name = "basic_spider"
allowed_domains = ["news24.com/"]
start_urls = (
'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
)
rules = (Rule (SgmlLinkExtractor(allow=("", ))
, callback="parse_items", follow= True),
)
def parse_items(self, response):
hxs = Selector(response)
titles = hxs.xpath('//*[#id="aspnetForm"]')
items = []
item = BasicItem()
item['Headline'] = titles.xpath('//*[#id="article_special"]//h1/text()').extract()
item["Article"] = titles.xpath('//*[#id="article-body"]/p[1]/text()').extract()
item["Date"] = titles.xpath('//*[#id="spnDate"]/text()').extract()
items.append(item)
return items
I am still getting the same problem, though have noticed that there is a "[" for every time I try and run the spider, to try and figure out what the issue is I have run the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOW
NLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, D
efaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddl
eware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328>
(referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Na
tal emergency services said on Friday.'],
'Date': [u'2015-03-28 07:30'],
'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So, I can see that the Item is being scraped, but there is no usable item data put into the json file. this is how i'm running scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line, (return items) as changing it to either yield or print gives me no items scraped in the parse.
This usually means nothing was scraped, no items were extracted.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class BasicSpiderSpider(CrawlSpider):
name = "basic_spider"
allowed_domains = ["news24.com"]
start_urls = [
'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
]
rules = [
Rule(LinkExtractor(), callback="parse_items", follow=True),
]
def parse_items(self, response):
for title in response.xpath('//*[#id="aspnetForm"]'):
item = BasicItem()
item['Headline'] = title.xpath('//*[#id="article_special"]//h1/text()').extract()
item["Article"] = title.xpath('//*[#id="article-body"]/p[1]/text()').extract()
item["Date"] = title.xpath('//*[#id="spnDate"]/text()').extract()
yield item
Related
I am trying to write a web crawler using scrapy and PyQuery.The full spider code is as follows.
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class gotspider(CrawlSpider):
name='gotspider'
allowed_domains=['fundrazr.com']
start_urls = ['https://fundrazr.com/find?category=Health']
rules = [
Rule(LinkExtractor(allow=('/find/category=Health')), callback='parse',follow=True)
]
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
print(response.body)
The web page skeleton
<div id="header">
<h2 class="title"> Township </h2>
<p><strong>Client: </strong> Township<br>
<strong>Location: </strong>Pennsylvania<br>
<strong>Size: </strong>54,000 SF</p>
</div>
output of the crawler, The crawler fetches the Requesting URL and its hitting the correct web target but the parse_item or parse method is not getting the response. The Response.URL is not printing. I tried to verify this by running the spider without logs scrapy crawl rsscrach--nolog but nothing is printed as logs. The problem is very granular.
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: rsscrach)
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'rsscrach', 'NEWSPIDER_MODULE': 'rsscrach.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['rsscrach.spiders']}
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-26 18:07:12 [scrapy.core.engine] INFO: Spider opened
2017-11-26 18:07:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-26 18:07:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-11-26 18:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/robots.txt> (referer: None)
2017-11-26 18:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/find?category=Health> (referer: None)
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-26 18:07:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 605,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13510,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 26, 10, 7, 15, 46516),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 52465664,
'memusage/startup': 52465664,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 11, 26, 10, 7, 12, 198182)}
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Spider closed (finished)
How do I get the Client, Location and Size of the attributes ?
I made standalone script with Scrapy which test different methods to get data and it works without problem. Maybe it helps you find your problem.
import scrapy
import pyquery
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://fundrazr.com/find?category=Health']
def parse(self, response):
print('--- css 1 ---')
for title in response.css('h2'):
print('>>>', title)
print('--- css 2 ---')
for title in response.css('h2'):
print('>>>', title.extract()) # without _first())
print('>>>', title.css('a').extract_first())
print('>>>', title.css('a ::text').extract_first())
print('-----')
print('--- css 3 ---')
for title in response.css('h2 a ::text'):
print('>>>', title.extract()) # without _first())
print('--- pyquery 1 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2'):
print('>>>', title, title.text, '<<<') # `title.text` gives "\n"
print('--- pyquery 2 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2').text():
print('>>>', title)
print(p('h2').text())
print('--- pyquery 3 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2 a'):
print('>>>', title, title.text)
# ---------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
I have wasted days to get my mind around Scrapy, reading the docs and other Scrapy Blogs and Q&A ... and now I am about to do what men hate most: Ask for directions ;-) The problem is: My spider opens, fetches the start_urls, but apparently does nothing with them. Instead it closes immediately and that was that. Apparently, I do not even get to the first self.log() statement.
What I've got so far is this:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *
class KiSpider(CrawlSpider):
name = "KiSpider"
allowed_domains = ['www.kiweb.de', 'kiweb.de']
start_urls = (
# ST Regra start page:
'https://www.kiweb.de/default.aspx?pageid=206',
# follow ST Regra links in the form of:
# https://www.kiweb.de/default.aspx?pageid=206&page=\d+
# https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
# ST Thermo start page:
'https://www.kiweb.de/default.aspx?pageid=202&page=1',
# follow ST Thermo links in the form of:
# https://www.kiweb.de/default.aspx?pageid=202&page=\d+
# https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
)
rules = (
# First rule that matches a given link is followed / parsed.
# Follow category pagination without further parsing:
Rule(
LinkExtractor(
# Extract links in the form:
allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
# but only within the pagination table cell:
restrict_xpaths=('//td[#id="ctl04_teaser_next"]'),
),
follow=True,
),
# Follow links to category (202|206) articles and parse them:
Rule(
LinkExtractor(
# Extract links in the form:
allow=r'Default\.aspx?pageid=299&docid=\d+',
# but only within article preview cells:
restrict_xpaths=("//td[#class='TOC-zelle TOC-text']"),
),
# and parse the resulting pages for article content:
callback='parse_init',
follow=False,
),
)
# Once an article page is reached, check whether a login is necessary:
def parse_init(self, response):
self.log('Parsing article: %s' % response.url)
if not response.xpath('input[#value="Logout"]'):
# Note: response.xpath() is a shortcut of response.selector.xpath()
self.log('Not logged in. Logging in...\n')
return self.login(response)
else:
self.log('Already logged in. Continue crawling...\n')
return self.parse_item(response)
def login(self, response):
self.log("Trying to log in...\n")
self.username = self.settings['KI_USERNAME']
self.password = self.settings['KI_PASSWORD']
return FormRequest.from_response(
response,
formname='Form1',
formdata={
# needs name, not id attributes!
'ctl04$Header$ctl01$textbox_username': self.username,
'ctl04$Header$ctl01$textbox_password': self.password,
'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
'ctl04$Header$ctl01$checkbox_permanent': 'True',
},
callback = self.parse_item,
)
def parse_item(self, response):
articles = response.xpath('//div[#id="artikel"]')
items = []
for article in articles:
item = KiSpiderItem()
item['link'] = response.url
item['title'] = articles.xpath("div[#class='ct1']/text()").extract()
item['subtitle'] = articles.xpath("div[#class='ct2']/text()").extract()
item['article'] = articles.extract()
item['published'] = articles.xpath("div[#class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
item['artid'] = articles.xpath("div[#class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
item['lang'] = 'de-DE'
items.append(item)
# return(items)
yield items
# what is the difference between return and yield?? found both on web.
When doing scrapy crawl KiSpider, this results in:
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider)
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (info#defrent.de)', 'DOWNLOAD_DELAY': 0.25}
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 465,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 48998,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)}
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished)
Is it that the login routine should not end with a callback, but some kind of return/yield statement? Or what am I doing wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how every bit connects to the others, especially Scrapy's docs seem to be written as a reference for people who already know a lot about Scrapy.
Somewhat frustrated greetings
Christopher
rules = (
# First rule that matches a given link is followed / parsed.
# Follow category pagination without further parsing:
Rule(
LinkExtractor(
# Extract links in the form:
# allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
# but only within the pagination table cell:
restrict_xpaths=('//td[#id="ctl04_teaser_next"]'),
),
follow=True,
),
# Follow links to category (202|206) articles and parse them:
Rule(
LinkExtractor(
# Extract links in the form:
# allow=r'Default\.aspx?pageid=299&docid=\d+',
# but only within article preview cells:
restrict_xpaths=("//td[#class='TOC-zelle TOC-text']"),
),
# and parse the resulting pages for article content:
callback='parse_init',
follow=False,
),
)
you do not need allow parameter, because there is only one link in the tag selected by XPath.
I do not understand the regex in allow parameter but at least you should escape the ?.
I'm trying to crawl a page that uses next buttons to move to new pages using scrapy. I'm using an instance of crawl spider and have defined the Linkextractor to extract new pages to follow. However, the spider just crawls the start url and stops at that. I've added the spider code and the log. Anyone has any idea why the spider is not able to crawl the pages.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from realcommercial.items import RealcommercialItem
from scrapy.selector import Selector
from scrapy.http import Request
class RealCommercial(CrawlSpider):
name = "realcommercial"
allowed_domains = ["realcommercial.com.au"]
start_urls = [
"http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
]
rules = [Rule(LinkExtractor( allow = ['/for-sale/in-vic/list-\d+?activeSort=list-date']),
callback='parse_response',
process_links='process_links',
follow=True),
Rule(LinkExtractor( allow = []),
callback='parse_response',
process_links='process_links',
follow=True)]
def parse_response(self, response):
sel = Selector(response)
sites = sel.xpath("//a[#class='details']")
#items = []
for site in sites:
item = RealcommercialItem()
link = site.xpath('#href').extract()
#print link, '\n\n'
item['link'] = link
link = 'http://www.realcommercial.com.au/' + str(link[0])
#print 'link!!!!!!=', link
new_request = Request(link, callback=self.parse_file_page)
new_request.meta['item'] = item
yield new_request
#items.append(item)
yield item
return
def process_links(self, links):
print 'inside process links'
for i, w in enumerate(links):
print w.url,'\n\n\n'
w.url = "http://www.realcommercial.com.au/" + w.url
print w.url,'\n\n\n'
links[i] = w
return links
def parse_file_page(self, response):
#item passed from request
#print 'parse_file_page!!!'
item = response.meta['item']
#selector
sel = Selector(response)
title = sel.xpath('//*[#id="listing_address"]').extract()
#print title
item['title'] = title
return item
Log
2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, bot
o
2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 're
alcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.
spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
2015-11-29 15:42:57 [scrapy] INFO: Spider opened
2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial
.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=l
ist-date> (referer: None)
2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 30599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)
I got the answer myself. There were two issues:
process_links was "http://www.realcommercial.com.au/" although it was already there. I thought it would give back the relative url.
The regular expression in link extractor was not correct.
I made changes to both of these and it worked.
I am doing the following tutorial on using Scrapy to scrape iTunes charts.
http://davidwalsh.name/python-scrape
The tutorial is slightly outdated, in that some of the syntaxes used have been deprecated in the current version of Scrapy (e.g. HtmlXPathSelector, BaseSpider..) - I have been working on completing the tutorial with the current version of Scrapy, but to no success.
If anyone knows what I'm doing incorrectly, would love to understand what I need to work on.
items.py
from scrapy.item import Item, Field
class AppItem(Item):
app_name = Field()
category = Field()
appstore_link = Field()
img_src = Field()
apple_spider.py
import scrapy
from scrapy.selector import Selector
from apple.items import AppItem
class AppleSpider(scrapy.Spider):
name = "apple"
allowed_domains = ["apple.com"]
start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]
def parse(self, response):
apps = response.selector.xpath('//*[#id="main"]/section/ul/li')
count = 0
items = []
for app in apps:
item = AppItem()
item['app_name'] = app.select('//h3/a/text()')[count].extract()
item['appstore_link'] = app.select('//h3/a/#href')[count].extract()
item['category'] = app.select('//h4/a/text()')[count].extract()
item['img_src'] = app.select('//a/img/#src')[count].extract()
items.append(item)
count += 1
return items
This is my console message after running scrapy crawl apple:
2015-02-10 13:38:12-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: apple)
2015-02-10 13:38:12-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2015-02-10 13:38:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'apple.spiders', '
SPIDER_MODULES': ['apple.spiders'], 'BOT_NAME': 'apple'}
2015-02-10 13:38:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, We
bService, CoreStats, SpiderState
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, Download
TimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddle
ware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, D
ownloaderStats
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMidd
leware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-10 13:38:13-0500 [apple] INFO: Spider opened
2015-02-10 13:38:13-0500 [apple] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items
/min)
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-10 13:38:13-0500 [apple] DEBUG: Crawled (200) <GET http://www.apple.com/itunes/charts/free-a
pps/> (referer: None)
2015-02-10 13:38:13-0500 [apple] INFO: Closing spider (finished)
2015-02-10 13:38:13-0500 [apple] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 13148,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 271000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 240000)}
2015-02-10 13:38:13-0500 [apple] INFO: Spider closed (finished)
Thanks in advance for any help/advice!
Before reading the technical part: make sure you are not violating the iTunes terms of use.
All of the problems you have are inside the parse() callback:
the main xpath is not correct (there is no ul element directly under the section)
instead of response.selector you can directly use response
the xpath expressions in the loop should be context-specific
The fixed version:
def parse(self, response):
apps = response.xpath('//*[#id="main"]/section//ul/li')
for app in apps:
item = AppItem()
item['app_name'] = app.xpath('.//h3/a/text()').extract()
item['appstore_link'] = app.xpath('.//h3/a/#href').extract()
item['category'] = app.xpath('.//h4/a/text()').extract()
item['img_src'] = app.xpath('.//a/img/#src').extract()
yield item
how can I get my rule to work in my crawlspider and to follow the links, I added this rule but its not working, nothing gets display but i don't get no error either. I comment out what my domains should look like in the code for my rule.
Rule #1
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
I also tried this rule but did not work nothing happen no errors either: Rule #2
rules = (
Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
)
code
class LinkedPySpider(CrawlSpider):
name = 'LinkedPy'
allowed_domains = ['linkedin.com']
login_page = 'https://www.linkedin.com/uas/login'
start_urls = ["http://www.linkedin.com/csearch/results"]
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
def start_requests(self):
yield Request(
url=self.login_page,
callback=self.login,
dont_filter=True
)
# def init_request(self):
#"""This function is called before crawling starts."""
# return Request(url=self.login_page, callback=self.login)
def login(self, response):
#"""Generate a login request."""
return FormRequest.from_response(response,
formdata={'session_key': 'yescobar2012#gmail.com', 'session_password': 'yescobar01'},
callback=self.check_login_response)
def check_login_response(self, response):
#"""Check the response returned by a login request to see if we aresuccessfully logged in."""
if "Sign Out" in response.body:
self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
# Now the crawling can begin..
self.log('Hi, this is an response page! %s' % response.url)
return Request(url='http://www.linkedin.com/csearch/results')
else:
self.log("\n\n\nFailed, Bad times :(\n\n\n")
# Something went wrong, we couldn't log in, so nothing happens.
def parse_item(self, response):
self.log("\n\n\n We got data! \n\n\n")
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ol[#id=\'result-set\']/li')
items = []
for site in sites:
item = LinkedconvItem()
item['title'] = site.select('h2/a/text()').extract()
item['link'] = site.select('h2/a/#href').extract()
items.append(item)
return items
output
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon
sole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scra
ped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linked
in.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.
linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.
linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2171,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 87904,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)
SgmlLinkExtractor uses re to find matches in link URLs.
What you pass in allow= goes through .compile()and then all links in pages is checked with _matches which uses... .search() on the compiled regex
_matches = lambda url, regexs: any((r.search(url) for r in regexs))
See https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py
When I check your regexes in the Python shell, they both works (they return an SRE_Match for URL 1 and 2; I added a failing regex to compare):
>>> import re
>>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
>>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
>>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
>>> regex1.search(url1)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url1)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url1)
>>> regex1.search(url2)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url2)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url2)
>>>
To check if you've got links in the page at all (if everything is not Javascript-generated) I would add a very generic Rule matching every link (set allow=() or do not set allow at all)
See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
But in the end, you may be better off using LinkedIn API for company search:
http://developer.linkedin.com/documents/company-search