Error with links in scrapy - python

I want to scrape some links to a news site and get the full news. However the links are relative
The news site is http://www.puntal.com.ar/v2/
And the links are as well
<div class="article-title">
Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"
</div>
then the relative link is "/v2/article.php?id=187222"
My spider is as follows (edit)
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from urlparse import urljoin
from scrapy.http.request import Request
try:
from urllib.parse import urljoin # Python3.x
except ImportError:
from urlparse import urljoin # Python2.7
from puntalcomar.items import PuntalcomarItem
class PuntalComArSpider(CrawlSpider):
name = 'puntal.com.ar'
allowed_domains = ['http://www.puntal.com.ar/v2/']
start_urls = ['http://www.puntal.com.ar/v2/']
rules = (
Rule(LinkExtractor(allow=(''),), callback="parse", follow=True),
)
def parse_url(self, response):
hxs = Selector(response)
urls = hxs.xpath('//div[#class="article-title"]/a/#href').extract()
print 'enlace relativo ', urls
for url in urls:
urlfull = urljoin('http://www.puntal.com.ar',url
print 'enlace completo ', urlfull
yield Request(urlfull, callback = self.parse_item)
def parse_item(self, response):
hxs = Selector(response)
dates = hxs.xpath('//span[#class="date"]')
title = hxs.xpath('//div[#class="title"]')
subheader = hxs.xpath('//div[#class="subheader"]')
body = hxs.xpath('//div[#class="body"]/p')
items = []
for date in dates:
item = PuntalcomarItem()
item["date"] = date.xpath('text()').extract()
item["title"] = title.xpath("text()").extract()
item["subheader"] = subheader.xpath('text()').extract()
item["body"] = body.xpath("text()").extract()
items.append(item)
return items
But it does not work
I have Linux Mint with Python 2.7.6
Shell:
$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 50497,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 16,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)
I tried the absolute links and correct. I was not really happening.

That i[1:] is strange and is the key problem. There is no need in slicing:
def parse(self, response):
urls = response.xpath('//div[#class="article-title"]/a/#href').extract()
for url in urls:
yield Request(urlparse.urljoin(response.url, url), callback=self.parse_url)
Note that I've also fixed the XPath expression - it needs to be started with // to look for the div elements at any level in the DOM tree.

Related

process_item pipeline not called

I'm new to python and scrapy
After scraping process I tried to save database to mysqlite,
Follow by this src : https://github.com/sunshineatnoon/Scrapy-Amazon-Sqlite( from url)
My problem is database was created successfully but items can't be inserted to database because process_item not called
EDIT
I paste the source code from github link above
setting.py
ITEM_PIPELINES = {
'amazon.pipelines.AmazonPipeline': 300
}
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3
import os
con = None
class AmazonPipeline(object):
def __init__(self):
self.setupDBCon()
self.createTables()
def process_item(self, item, spider):
print('---------------process item----')
self.storeInDb(item)
return item
def setupDBCon(self):
self.con = sqlite3.connect(os.getcwd() + '/test.db')
self.cur = self.con.cursor()
def createTables(self):
self.dropAmazonTable()
self.createAmazonTable()
def dropAmazonTable(self):
#drop amazon table if it exists
self.cur.execute("DROP TABLE IF EXISTS Amazon")
def closeDB(self):
self.con.close()
def __del__(self):
self.closeDB()
def createAmazonTable(self):
self.cur.execute("CREATE TABLE IF NOT EXISTS Amazon(id INTEGER PRIMARY KEY NOT NULL, \
name TEXT, \
path TEXT, \
source TEXT \
)")
self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")
print ('------------------------')
self.con.commit()
def storeInDb(self,item):
# self.cur.execute("INSERT INTO Amazon(\
# name, \
# path, \
# source \
# ) \
# VALUES( ?, ?, ?)", \
# ( \
# item.get('Name',''),
# item.get('Path',''),
# item.get('Source','')
# ))
self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")
print ('------------------------')
print ('Data Stored in Database')
print ('------------------------')
self.con.commit()
spiders/amazonspider.py
import scrapy
import urllib
from amazon.items import AmazonItem
import os
class amazonSpider(scrapy.Spider):
imgcount = 1
name = "amazon"
allowed_domains = ["amazon.com"]
'''
start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=backpack",
"http://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Abackpack&page=2&keywords=backpack&ie=UTF8&qid=1442907452&spIA=B00YCRMZXW,B010HWLMMA"
]
'''
def start_requests(self):
yield scrapy.Request("http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0",self.parse)
for i in range(2,3):
yield scrapy.Request("http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page="+str(i)+"&bbn=10445813011&ie=UTF8&qid=1442910987",self.parse)
def parse(self,response):
#namelist = response.xpath('//a[#class="a-link-normal s-access-detail-page a-text-normal"]/#title').extract()
#htmllist = response.xpath('//a[#class="a-link-normal s-access-detail-page a-text-normal"]/#href').extract()
#imglist = response.xpath('//a[#class="a-link-normal a-text-normal"]/img/#src').extract()
namelist = response.xpath('//a[#class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/#title').extract()
htmllist = response.xpath('//a[#class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/#href').extract()
imglist = response.xpath('//img[#class="s-access-image cfMarker"]/#src').extract()
listlength = len(namelist)
pwd = os.getcwd()+'/'
if not os.path.isdir(pwd+'crawlImages/'):
os.mkdir(pwd+'crawlImages/')
for i in range(0,listlength):
item = AmazonItem()
item['Name'] = namelist[i]
item['Source'] = htmllist[i]
urllib.urlretrieve(imglist[i],pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg")
item['Path'] = pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg"
amazonSpider.imgcount = amazonSpider.imgcount + 1
yield item
Result
after run scrapy crawl amazone
I have test.db created but item haven't inserted (I've checked my sqlite db.test), that mean process_item was not run
build result
2018-09-18 16:38:38 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: amazon)
2018-09-18 16:38:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-18 16:38:38 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'amazon', 'NEWSPIDER_MODULE': 'amazon.spiders', 'SPIDER_MODULES': ['amazon.spiders']}
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
------------------------
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled item pipelines:
['amazon.pipelines.AmazonPipeline']
2018-09-18 16:38:38 [scrapy.core.engine] INFO: Spider opened
2018-09-18 16:38:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-18 16:38:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987> from <GET http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0> from <GET http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> from <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> from <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> (referer: None)
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> (referer: None)
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-18 16:38:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1909,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 140740,
'downloader/response_count': 6,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 9, 18, 9, 38, 41, 53948),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'memusage/max': 52600832,
'memusage/startup': 52600832,
'response_received_count': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2018, 9, 18, 9, 38, 38, 677280)}
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Spider closed (finished)
I've searching around but have no luck
Thanks
If I crawl
https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011
I don't get any result in namelist and htmllist. Urllist is filled.
Checking the html-code:
... <a class="a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal" ...
I found an additional "s-color-twister-title-link" so your specific xpath is not correct. You can add the s-color-twister-title-link
In [9]: response.xpath('//a[#class="a-link-normal s-access-detail-page s-overflow-ellipsis s-
...: color-twister-title-link a-text-normal"]/#title').extract()
Out[9]:
['Anime Anti-theft Backpack, Luminous School Bag, Waterproof Laptop Backpack with USB Charging Port, Unisex 15.6 Inch College Daypack, Starry',
'Anime Luminous Backpack Noctilucent School Bags Daypack USB chargeing Port Laptop Bag Handbag for Boys Girls Men Women',
or you can use a more specific one like:
response.xpath('//a[contains(#class,"s-access-detail-page")]/#title').extract()

pyquery response.body retrieve div elements

I am trying to write a web crawler using scrapy and PyQuery.The full spider code is as follows.
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class gotspider(CrawlSpider):
name='gotspider'
allowed_domains=['fundrazr.com']
start_urls = ['https://fundrazr.com/find?category=Health']
rules = [
Rule(LinkExtractor(allow=('/find/category=Health')), callback='parse',follow=True)
]
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
print(response.body)
The web page skeleton
<div id="header">
<h2 class="title"> Township </h2>
<p><strong>Client: </strong> Township<br>
<strong>Location: </strong>Pennsylvania<br>
<strong>Size: </strong>54,000 SF</p>
</div>
output of the crawler, The crawler fetches the Requesting URL and its hitting the correct web target but the parse_item or parse method is not getting the response. The Response.URL is not printing. I tried to verify this by running the spider without logs scrapy crawl rsscrach--nolog but nothing is printed as logs. The problem is very granular.
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: rsscrach)
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'rsscrach', 'NEWSPIDER_MODULE': 'rsscrach.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['rsscrach.spiders']}
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-26 18:07:12 [scrapy.core.engine] INFO: Spider opened
2017-11-26 18:07:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-26 18:07:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-11-26 18:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/robots.txt> (referer: None)
2017-11-26 18:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/find?category=Health> (referer: None)
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-26 18:07:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 605,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13510,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 26, 10, 7, 15, 46516),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 52465664,
'memusage/startup': 52465664,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 11, 26, 10, 7, 12, 198182)}
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Spider closed (finished)
How do I get the Client, Location and Size of the attributes ?
I made standalone script with Scrapy which test different methods to get data and it works without problem. Maybe it helps you find your problem.
import scrapy
import pyquery
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://fundrazr.com/find?category=Health']
def parse(self, response):
print('--- css 1 ---')
for title in response.css('h2'):
print('>>>', title)
print('--- css 2 ---')
for title in response.css('h2'):
print('>>>', title.extract()) # without _first())
print('>>>', title.css('a').extract_first())
print('>>>', title.css('a ::text').extract_first())
print('-----')
print('--- css 3 ---')
for title in response.css('h2 a ::text'):
print('>>>', title.extract()) # without _first())
print('--- pyquery 1 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2'):
print('>>>', title, title.text, '<<<') # `title.text` gives "\n"
print('--- pyquery 2 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2').text():
print('>>>', title)
print(p('h2').text())
print('--- pyquery 3 ---')
p = pyquery.PyQuery(response.body)
for title in p('h2 a'):
print('>>>', title, title.text)
# ---------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()

Scrapy CrawlSpider is not following Links

I'm trying to crawl a page that uses next buttons to move to new pages using scrapy. I'm using an instance of crawl spider and have defined the Linkextractor to extract new pages to follow. However, the spider just crawls the start url and stops at that. I've added the spider code and the log. Anyone has any idea why the spider is not able to crawl the pages.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from realcommercial.items import RealcommercialItem
from scrapy.selector import Selector
from scrapy.http import Request
class RealCommercial(CrawlSpider):
name = "realcommercial"
allowed_domains = ["realcommercial.com.au"]
start_urls = [
"http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
]
rules = [Rule(LinkExtractor( allow = ['/for-sale/in-vic/list-\d+?activeSort=list-date']),
callback='parse_response',
process_links='process_links',
follow=True),
Rule(LinkExtractor( allow = []),
callback='parse_response',
process_links='process_links',
follow=True)]
def parse_response(self, response):
sel = Selector(response)
sites = sel.xpath("//a[#class='details']")
#items = []
for site in sites:
item = RealcommercialItem()
link = site.xpath('#href').extract()
#print link, '\n\n'
item['link'] = link
link = 'http://www.realcommercial.com.au/' + str(link[0])
#print 'link!!!!!!=', link
new_request = Request(link, callback=self.parse_file_page)
new_request.meta['item'] = item
yield new_request
#items.append(item)
yield item
return
def process_links(self, links):
print 'inside process links'
for i, w in enumerate(links):
print w.url,'\n\n\n'
w.url = "http://www.realcommercial.com.au/" + w.url
print w.url,'\n\n\n'
links[i] = w
return links
def parse_file_page(self, response):
#item passed from request
#print 'parse_file_page!!!'
item = response.meta['item']
#selector
sel = Selector(response)
title = sel.xpath('//*[#id="listing_address"]').extract()
#print title
item['title'] = title
return item
Log
2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, bot
o
2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 're
alcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.
spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
2015-11-29 15:42:57 [scrapy] INFO: Spider opened
2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial
.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=l
ist-date> (referer: None)
2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 30599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)
I got the answer myself. There were two issues:
process_links was "http://www.realcommercial.com.au/" although it was already there. I thought it would give back the relative url.
The regular expression in link extractor was not correct.
I made changes to both of these and it worked.

Scrapy outputs [ into my .json file

A genuine Scrapy and Python noob here so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of the Article. I managed to crawl a single page for one item but the moment I try and expand beyond that it all goes wrong.
my Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem
class BasicSpiderSpider(CrawlSpider):
name = "basic_spider"
allowed_domains = ["news24.com/"]
start_urls = (
'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
)
rules = (Rule (SgmlLinkExtractor(allow=("", ))
, callback="parse_items", follow= True),
)
def parse_items(self, response):
hxs = Selector(response)
titles = hxs.xpath('//*[#id="aspnetForm"]')
items = []
item = BasicItem()
item['Headline'] = titles.xpath('//*[#id="article_special"]//h1/text()').extract()
item["Article"] = titles.xpath('//*[#id="article-body"]/p[1]/text()').extract()
item["Date"] = titles.xpath('//*[#id="spnDate"]/text()').extract()
items.append(item)
return items
I am still getting the same problem, though have noticed that there is a "[" for every time I try and run the spider, to try and figure out what the issue is I have run the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOW
NLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, D
efaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddl
eware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328>
(referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Na
tal emergency services said on Friday.'],
'Date': [u'2015-03-28 07:30'],
'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So, I can see that the Item is being scraped, but there is no usable item data put into the json file. this is how i'm running scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line, (return items) as changing it to either yield or print gives me no items scraped in the parse.
This usually means nothing was scraped, no items were extracted.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class BasicSpiderSpider(CrawlSpider):
name = "basic_spider"
allowed_domains = ["news24.com"]
start_urls = [
'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
]
rules = [
Rule(LinkExtractor(), callback="parse_items", follow=True),
]
def parse_items(self, response):
for title in response.xpath('//*[#id="aspnetForm"]'):
item = BasicItem()
item['Headline'] = title.xpath('//*[#id="article_special"]//h1/text()').extract()
item["Article"] = title.xpath('//*[#id="article-body"]/p[1]/text()').extract()
item["Date"] = title.xpath('//*[#id="spnDate"]/text()').extract()
yield item

Scrapy rule SgmlLinkExtractor not working

how can I get my rule to work in my crawlspider and to follow the links, I added this rule but its not working, nothing gets display but i don't get no error either. I comment out what my domains should look like in the code for my rule.
Rule #1
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
I also tried this rule but did not work nothing happen no errors either: Rule #2
rules = (
Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
)
code
class LinkedPySpider(CrawlSpider):
name = 'LinkedPy'
allowed_domains = ['linkedin.com']
login_page = 'https://www.linkedin.com/uas/login'
start_urls = ["http://www.linkedin.com/csearch/results"]
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
def start_requests(self):
yield Request(
url=self.login_page,
callback=self.login,
dont_filter=True
)
# def init_request(self):
#"""This function is called before crawling starts."""
# return Request(url=self.login_page, callback=self.login)
def login(self, response):
#"""Generate a login request."""
return FormRequest.from_response(response,
formdata={'session_key': 'yescobar2012#gmail.com', 'session_password': 'yescobar01'},
callback=self.check_login_response)
def check_login_response(self, response):
#"""Check the response returned by a login request to see if we aresuccessfully logged in."""
if "Sign Out" in response.body:
self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
# Now the crawling can begin..
self.log('Hi, this is an response page! %s' % response.url)
return Request(url='http://www.linkedin.com/csearch/results')
else:
self.log("\n\n\nFailed, Bad times :(\n\n\n")
# Something went wrong, we couldn't log in, so nothing happens.
def parse_item(self, response):
self.log("\n\n\n We got data! \n\n\n")
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ol[#id=\'result-set\']/li')
items = []
for site in sites:
item = LinkedconvItem()
item['title'] = site.select('h2/a/text()').extract()
item['link'] = site.select('h2/a/#href').extract()
items.append(item)
return items
output
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon
sole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scra
ped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linked
in.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.
linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.
linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi
n.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2171,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 87904,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)
SgmlLinkExtractor uses re to find matches in link URLs.
What you pass in allow= goes through .compile()and then all links in pages is checked with _matches which uses... .search() on the compiled regex
_matches = lambda url, regexs: any((r.search(url) for r in regexs))
See https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py
When I check your regexes in the Python shell, they both works (they return an SRE_Match for URL 1 and 2; I added a failing regex to compare):
>>> import re
>>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
>>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
>>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
>>> regex1.search(url1)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url1)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url1)
>>> regex1.search(url2)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url2)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url2)
>>>
To check if you've got links in the page at all (if everything is not Javascript-generated) I would add a very generic Rule matching every link (set allow=() or do not set allow at all)
See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
But in the end, you may be better off using LinkedIn API for company search:
http://developer.linkedin.com/documents/company-search

Categories