Scrapy rule SgmlLinkExtractor not working - python

How can I get my rule to work in my CrawlSpider so that it follows the links? I added this rule but it's not working: nothing gets displayed, but I don't get any errors either. I commented out in the code what the URLs matched by my rule should look like.
Rule #1
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
I also tried this rule, but it did not work either; nothing happened and there were no errors: Rule #2
rules = (
    Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
)
code
class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item', follow=True)
    # looking for URLs like these to match my rule:
    #http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
    #http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': 'yescobar2012@gmail.com', 'session_password': 'yescobar01'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is a response page! %s' % response.url)
            return Request(url='http://www.linkedin.com/csearch/results')
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
output
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon
sole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scra
ped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linked
in.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.
linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.
linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2171,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 87904,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)

SgmlLinkExtractor uses re to find matches in link URLs.
What you pass in allow= goes through re.compile(), and then every link found in a page is checked with _matches, which uses .search() on the compiled regexes:
_matches = lambda url, regexs: any((r.search(url) for r in regexs))
See https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py
When I check your regexes in the Python shell, they both work (they return an SRE_Match object for URLs 1 and 2; I added a failing regex to compare):
>>> import re
>>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
>>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
>>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
>>> regex1.search(url1)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url1)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url1)
>>> regex1.search(url2)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url2)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url2)
>>>
To check whether you get any links from the page at all (i.e. that everything is not JavaScript-generated), I would add a very generic Rule matching every link (set allow=() or do not set allow at all); see the sketch below.
See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
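For example, a minimal debugging sketch of such a catch-all rule (the parse_item callback name is the one from the question; on LinkedIn the crawl would of course still depend on being logged in):
rules = (
    # no allow= pattern: extract and follow every link found on the page
    Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
)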
But in the end, you may be better off using the LinkedIn API for company search:
http://developer.linkedin.com/documents/company-search


Error with links in scrapy

I want to scrape some links from a news site and get the full news articles. However, the links are relative.
The news site is http://www.puntal.com.ar/v2/
and the links look like this:
<div class="article-title">
    <a href="/v2/article.php?id=187222">Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"</a>
</div>
so the relative link is "/v2/article.php?id=187222".
My spider is as follows (edited):
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from urlparse import urljoin
from scrapy.http.request import Request
try:
    from urllib.parse import urljoin   # Python 3.x
except ImportError:
    from urlparse import urljoin       # Python 2.7

from puntalcomar.items import PuntalcomarItem


class PuntalComArSpider(CrawlSpider):
    name = 'puntal.com.ar'
    allowed_domains = ['http://www.puntal.com.ar/v2/']
    start_urls = ['http://www.puntal.com.ar/v2/']

    rules = (
        Rule(LinkExtractor(allow=(''),), callback="parse", follow=True),
    )

    def parse_url(self, response):
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        print 'enlace relativo ', urls        # "relative link"
        for url in urls:
            urlfull = urljoin('http://www.puntal.com.ar', url)
            print 'enlace completo ', urlfull  # "full link"
            yield Request(urlfull, callback=self.parse_item)

    def parse_item(self, response):
        hxs = Selector(response)
        dates = hxs.xpath('//span[@class="date"]')
        title = hxs.xpath('//div[@class="title"]')
        subheader = hxs.xpath('//div[@class="subheader"]')
        body = hxs.xpath('//div[@class="body"]/p')
        items = []
        for date in dates:
            item = PuntalcomarItem()
            item["date"] = date.xpath('text()').extract()
            item["title"] = title.xpath("text()").extract()
            item["subheader"] = subheader.xpath('text()').extract()
            item["body"] = body.xpath("text()").extract()
            items.append(item)
        return items
But it does not work
I have Linux Mint with Python 2.7.6
Shell:
$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 50497,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 16,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)
I tried the absolute links and they are correct, but the crawl still did not happen.
That i[1:] is strange and is the key problem. There is no need for slicing:
def parse(self, response):
    urls = response.xpath('//div[@class="article-title"]/a/@href').extract()
    for url in urls:
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_url)
Note that I've also fixed the XPath expression: it needs to start with // to look for the div elements at any level in the DOM tree.
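As a side note (my addition, not part of the original answer): since the log shows Scrapy 1.1.0, response.urljoin() is also available and avoids importing urljoin at all. A minimal sketch:
def parse(self, response):
    # response.urljoin() resolves a relative href against the URL of the current page (Scrapy >= 1.0)
    for href in response.xpath('//div[@class="article-title"]/a/@href').extract():
        yield Request(response.urljoin(href), callback=self.parse_url)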

Scrapy CrawlSpider is not following Links

I'm trying to crawl a page that uses "next" buttons to move to new pages, using Scrapy. I'm using an instance of CrawlSpider and have defined the LinkExtractor to extract new pages to follow. However, the spider just crawls the start URL and stops there. I've added the spider code and the log. Does anyone have any idea why the spider is not able to crawl the pages?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from realcommercial.items import RealcommercialItem
from scrapy.selector import Selector
from scrapy.http import Request


class RealCommercial(CrawlSpider):
    name = "realcommercial"
    allowed_domains = ["realcommercial.com.au"]
    start_urls = [
        "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
    ]
    rules = [Rule(LinkExtractor(allow=['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True),
             Rule(LinkExtractor(allow=[]),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True)]

    def parse_response(self, response):
        sel = Selector(response)
        sites = sel.xpath("//a[@class='details']")
        #items = []
        for site in sites:
            item = RealcommercialItem()
            link = site.xpath('@href').extract()
            #print link, '\n\n'
            item['link'] = link
            link = 'http://www.realcommercial.com.au/' + str(link[0])
            #print 'link!!!!!!=', link
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            #items.append(item)
            yield item
        return

    def process_links(self, links):
        print 'inside process links'
        for i, w in enumerate(links):
            print w.url, '\n\n\n'
            w.url = "http://www.realcommercial.com.au/" + w.url
            print w.url, '\n\n\n'
            links[i] = w
        return links

    def parse_file_page(self, response):
        # item passed from request
        #print 'parse_file_page!!!'
        item = response.meta['item']
        # selector
        sel = Selector(response)
        title = sel.xpath('//*[@id="listing_address"]').extract()
        #print title
        item['title'] = title
        return item
Log
2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, bot
o
2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 're
alcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.
spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
2015-11-29 15:42:57 [scrapy] INFO: Spider opened
2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial
.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=l
ist-date> (referer: None)
2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 30599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)
I got the answer myself. There were two issues:
process_links was prepending "http://www.realcommercial.com.au/" although it was already there in the extracted URLs; I thought the extractor would give back relative URLs.
The regular expression in the link extractor was not correct.
I made changes to both of these and it worked; a sketch of the corrected rule follows.
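The answer does not quote the corrected code, so here is a minimal sketch of what the fix could look like, assuming the pagination URLs follow the pattern of the start URL (the escaped ? and the dropped process_links are my guesses, not the author's exact change):
rules = [
    # '?' must be escaped, otherwise it acts as a regex quantifier and the pattern never matches;
    # LinkExtractor already returns absolute URLs, so no process_links prefixing is needed
    Rule(LinkExtractor(allow=[r'/for-sale/in-vic/list-\d+\?']),
         callback='parse_response',
         follow=True),
]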

Scrapy outputs [ into my .json file

A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of each article. I managed to crawl a single page for one item, but the moment I try to expand beyond that, it all goes wrong.
my Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )

    rules = (Rule(SgmlLinkExtractor(allow=("", )),
                  callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items
I am still getting the same problem, though I have noticed that there is a "[" every time I try to run the spider. To try and figure out what the issue is, I ran the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOW
NLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, D
efaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddl
eware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328>
(referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Na
tal emergency services said on Friday.'],
'Date': [u'2015-03-28 07:30'],
'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So I can see that the item is being scraped, but there is no usable item data put into the JSON file. This is how I'm running Scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line (return items), as changing it to either yield or print gives me no items scraped in the parse.
This usually means nothing was scraped, no items were extracted.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit of cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com"]
    start_urls = [
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    ]

    rules = [
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    ]

    def parse_items(self, response):
        for title in response.xpath('//*[@id="aspnetForm"]'):
            item = BasicItem()
            item['Headline'] = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = title.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = title.xpath('//*[@id="spnDate"]/text()').extract()
            yield item
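The question does not show items.py; for completeness, a minimal sketch of what BasicItem presumably looks like, assuming only the three fields the spider uses (this is my reconstruction, not the asker's file):
from scrapy.item import Item, Field

class BasicItem(Item):
    # hypothetical reconstruction: only the fields referenced in the spider
    Headline = Field()
    Article = Field()
    Date = Field()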

Scraping iTunes Charts using Scrapy

I am doing the following tutorial on using Scrapy to scrape iTunes charts.
http://davidwalsh.name/python-scrape
The tutorial is slightly outdated, in that some of the syntax used has been deprecated in the current version of Scrapy (e.g. HtmlXPathSelector, BaseSpider). I have been working on completing the tutorial with the current version of Scrapy, but without success.
If anyone knows what I'm doing incorrectly, I would love to understand what I need to work on.
items.py
from scrapy.item import Item, Field


class AppItem(Item):
    app_name = Field()
    category = Field()
    appstore_link = Field()
    img_src = Field()
apple_spider.py
import scrapy
from scrapy.selector import Selector
from apple.items import AppItem


class AppleSpider(scrapy.Spider):
    name = "apple"
    allowed_domains = ["apple.com"]
    start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]

    def parse(self, response):
        apps = response.selector.xpath('//*[@id="main"]/section/ul/li')
        count = 0
        items = []

        for app in apps:
            item = AppItem()
            item['app_name'] = app.select('//h3/a/text()')[count].extract()
            item['appstore_link'] = app.select('//h3/a/@href')[count].extract()
            item['category'] = app.select('//h4/a/text()')[count].extract()
            item['img_src'] = app.select('//a/img/@src')[count].extract()
            items.append(item)
            count += 1

        return items
This is my console message after running scrapy crawl apple:
2015-02-10 13:38:12-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: apple)
2015-02-10 13:38:12-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2015-02-10 13:38:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'apple.spiders', '
SPIDER_MODULES': ['apple.spiders'], 'BOT_NAME': 'apple'}
2015-02-10 13:38:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, We
bService, CoreStats, SpiderState
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, Download
TimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddle
ware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, D
ownloaderStats
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMidd
leware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-10 13:38:13-0500 [apple] INFO: Spider opened
2015-02-10 13:38:13-0500 [apple] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items
/min)
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-10 13:38:13-0500 [apple] DEBUG: Crawled (200) <GET http://www.apple.com/itunes/charts/free-a
pps/> (referer: None)
2015-02-10 13:38:13-0500 [apple] INFO: Closing spider (finished)
2015-02-10 13:38:13-0500 [apple] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 13148,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 271000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 240000)}
2015-02-10 13:38:13-0500 [apple] INFO: Spider closed (finished)
Thanks in advance for any help/advice!
Before reading the technical part: make sure you are not violating the iTunes terms of use.
All of the problems you have are inside the parse() callback:
the main xpath is not correct (there is no ul element directly under the section)
instead of response.selector you can directly use response
the xpath expressions in the loop should be context-specific
The fixed version:
def parse(self, response):
    apps = response.xpath('//*[@id="main"]/section//ul/li')
    for app in apps:
        item = AppItem()
        item['app_name'] = app.xpath('.//h3/a/text()').extract()
        item['appstore_link'] = app.xpath('.//h3/a/@href').extract()
        item['category'] = app.xpath('.//h4/a/text()').extract()
        item['img_src'] = app.xpath('.//a/img/@src').extract()
        yield item
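A quick way to check XPath expressions like these before editing the spider is the interactive scrapy shell; for example (the exact matches depend on the live page markup at the time you run it):
scrapy shell "http://www.apple.com/itunes/charts/free-apps/"
>>> response.xpath('//*[@id="main"]/section//ul/li')          # should return a non-empty list of selectors
>>> response.xpath('//*[@id="main"]/section//ul/li')[0].xpath('.//h3/a/text()').extract()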

Mysql pipeline loading but not working in scrapy

I am trying to load a second pipeline to write entries to a MySQL db. In the log I see it loading, but nothing happens afterwards. Not even any logging. This is my pipeline:
# Mysql
import sys
import MySQLdb
import hashlib
from datetime import datetime  # needed for datetime.now() below; missing from the original snippet
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host="localhost", user="***", passwd="***", db="***", charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        CurrentDateTime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        Md5Hash = hashlib.md5(item['link']).hexdigest()
        try:
            self.cursor.execute("""INSERT INTO apple (article_add_date, article_date, article_title, article_link, article_link_md5, article_summary, article_image_url, article_source) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""", (CurrentDateTime, item['date'], item['title'], item['link'], Md5Hash, item['summary'], item['image'], item['sourcesite']))
            self.conn.commit()
        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
        return item
And this is my log:
scrapy crawl macnn_com
2013-06-20 08:15:53+0200 [scrapy] INFO: Scrapy 0.16.4 started (bot: HungryFeed)
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled item pipelines: MySQLStorePipeline, CleanDateField
2013-06-20 08:15:54+0200 [macnn_com] INFO: Spider opened
2013-06-20 08:15:54+0200 [macnn_com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Crawled (200) <GET http://www.macnn.com> (referer: None)
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Crawled (200) <GET http://www.macnn.com/articles/13/06/19/compatibility.described.as.experimental/> (referer: http://www.macnn.com)
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Scraped from <200 http://www.macnn.com/articles/13/06/19/compatibility.described.as.experimental/>
*** lot of scraping data ***
*** lot of scraping data ***
*** lot of scraping data ***
2013-06-20 08:15:56+0200 [macnn_com] INFO: Closing spider (finished)
2013-06-20 08:15:56+0200 [macnn_com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5711,
'downloader/request_count': 17,
'downloader/request_method_count/GET': 17,
'downloader/response_bytes': 281140,
'downloader/response_count': 17,
'downloader/response_status_count/200': 17,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 6, 20, 6, 15, 56, 685286),
'item_scraped_count': 16,
'log_count/DEBUG': 39,
'log_count/INFO': 4,
'request_depth_max': 1,
'response_received_count': 17,
'scheduler/dequeued': 17,
'scheduler/dequeued/memory': 17,
'scheduler/enqueued': 17,
'scheduler/enqueued/memory': 17,
'start_time': datetime.datetime(2013, 6, 20, 6, 15, 54, 755766)}
2013-06-20 08:15:56+0200 [macnn_com] INFO: Spider closed (finished)
Of course I don't have to mention that I loaded the pipeline in settings.py like:
ITEM_PIPELINES = [
    'HungryFeed.pipelines.CleanDateField',
    'HungryFeed.pipelines.MySQLStorePipeline'
]
Am I missing something here?
This is my first pipeline:
class CleanDateField(object):
    def process_item(self, item, spider):
        from dateutil import parser
        rawdate = item['date']
        # text replace per spider so the parser can recognize the datetime better
        if spider.name == "macnn_com":
            rawdate = rawdate.replace("updated", "").strip()
        dt = parser.parse(rawdate)
        articledate = dt.strftime("%Y-%m-%d %H:%M:%S")
        item['date'] = articledate
        return item
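No accepted fix is quoted here, but one way to confirm whether process_item is reached at all is to log from inside a pipeline (spider.log() is available in Scrapy 0.16). A minimal sketch of such a hypothetical helper pipeline, my addition for debugging only:
class DebugLogPipeline(object):
    # hypothetical pipeline: logs each item to confirm it flows through ITEM_PIPELINES
    def process_item(self, item, spider):
        spider.log("pipeline saw item with link: %s" % item.get('link', ''))
        return item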