I have an html file demo1.html with code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
“What is this obsession people have with books? They put them in their houses—like they’re trophies.
What do you need it for after you read it?” – Jerry
</body>
</html>
In demo1.html I have added a link to another HTML file, demo2.html.
demo2.html code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
“Tuesday has no feel. Monday has a feel, Friday has a feel, Sunday has a feel…” – Newman
</body>
</html>
I have written a spider that scrapes the plain text from each HTML file and stores it in a text file named after the URL's basename (basename.txt), one file per URL.
My spider code:
from os.path import splitext
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from os.path import basename
import lxml.etree
import lxml.html
FOLLOW = True
class CustomLinkExtractor(LinkExtractor):
    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        self.deny_extensions = [".zip", ".mp4", ".mp3"]  # ignore files with mentioned extensions
def get_plain_html(response_body):
    root = lxml.html.fromstring(response_body)
    lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head", "style")
    text = lxml.html.tostring(root, method="text", encoding='utf-8')
    return text
def get_file_name(url):
    parsed_url = urlparse(url)
    file_name = basename(parsed_url.path.strip('/')) if parsed_url.path.strip('/') else parsed_url.netloc
    return file_name
class WebScraper(CrawlSpider):
    name = "goblin"
    start_urls = [
        'file:///path/to/demo1.html'
    ]
    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=FOLLOW, callback="parse_file"),)
        super(WebScraper, self).__init__(*args, **kwargs)
    def parse_file(self, response):
        try:
            file_name = get_file_name(response.url)
            if hasattr(response, "text"):
                file_name = '{0}.txt'.format(file_name)
                text = get_plain_html(response.body)
                file_path = './{0}'.format(file_name)
                with open(file_path, 'wb') as f_data:
                    f_data.write(text)
        except Exception as ex:
            self.logger.error(ex, exc_info=True)
When I run my spider I can see demo2.html being scraped and the text:
“Tuesday has no feel. Monday has a feel, Friday has a feel, Sunday has a feel…” – Newman
is stored in demo2.html.txt, but my spider does not produce any output for demo1.html, even though it is the URL in the start_urls list.
I am expecting a file demo1.html.txt to be created with text:
“What is this obsession people have with books? They put them in their houses—like they’re trophies.
What do you need it for after you read it?” – Jerry
Note: I have set DEPTH_LIMIT = 1 in settings.py
Scrapy Logs:
2020-06-17 20:33:27 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_project)
2020-06-17 20:33:27 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 7 2019, 10:50:52) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-...-Ubuntu-...
2020-06-17 20:33:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-06-17 20:33:27 [scrapy.crawler] INFO: Overridden settings:
{'AJAXCRAWL_ENABLED': True,
'AUTOTHROTTLE_ENABLED': True,
'BOT_NAME': 'scrapy_project',
'CONCURRENT_REQUESTS': 30,
'COOKIES_ENABLED': False,
'DEPTH_LIMIT': 1,
'DOWNLOAD_MAXSIZE': 5242880,
'NEWSPIDER_MODULE': 'scrapy_project.spiders',
'REACTOR_THREADPOOL_MAXSIZE': 20,
'SPIDER_MODULES': ['scrapy_project.spiders']}
2020-06-17 20:33:27 [scrapy.extensions.telnet] INFO: Telnet Password: *******
2020-06-17 20:33:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2020-06-17 20:33:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy_project.middlewares.FilterResponses',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-17 20:33:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-17 20:33:27 [scrapy.core.engine] INFO: Spider opened
2020-06-17 20:33:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-17 20:33:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-17 20:33:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///path/to/demo1.html> (referer: None)
2020-06-17 20:33:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///path/to/demo2.html> (referer: None)
2020-06-17 20:33:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-17 20:33:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 556,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 646,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 6.091841,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 17, 15, 3, 33, 427522),
'log_count/DEBUG': 2,
'log_count/INFO': 14,
'memusage/max': 1757986816,
'memusage/startup': 1757986816,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 6, 17, 15, 3, 27, 335681)}
2020-06-17 20:33:33 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
Any help would be appreciated :)
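I fixed this by overriding parse_start_url so that it calls my callback. In a CrawlSpider, a Rule's callback is only invoked for the links that the rule extracts; responses for the URLs in start_urls are passed to parse_start_url, which does nothing by default.
See this answer: https://stackoverflow.com/a/15839428/10011503
Complete code with the changes: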
from os.path import splitext
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from os.path import basename
import lxml.etree
import lxml.html
FOLLOW = True
class CustomLinkExtractor(LinkExtractor):
    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        self.deny_extensions = [".zip", ".mp4", ".mp3"]  # ignore files with mentioned extensions
def get_plain_html(response_body):
    root = lxml.html.fromstring(response_body)
    lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head", "style")
    text = lxml.html.tostring(root, method="text", encoding='utf-8')
    return text
def get_file_name(url):
    parsed_url = urlparse(url)
    file_name = basename(parsed_url.path.strip('/')) if parsed_url.path.strip('/') else parsed_url.netloc
    return file_name
class WebScraper(CrawlSpider):
    name = "goblin"
    start_urls = [
        'file:///path/to/demo1.html'
    ]
    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=FOLLOW, callback="parse_file"),)
        super(WebScraper, self).__init__(*args, **kwargs)
    def parse_start_url(self, response):
        # responses for start_urls arrive here, so hand them to the same callback
        return self.parse_file(response)
    def parse_file(self, response):
        try:
            file_name = get_file_name(response.url)
            if hasattr(response, "text"):
                file_name = '{0}.txt'.format(file_name)
                text = get_plain_html(response.body)
                file_path = './{0}'.format(file_name)
                with open(file_path, 'wb') as f_data:
                    f_data.write(text)
        except Exception as ex:
            self.logger.error(ex, exc_info=True)
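The file naming used by the spider can be checked on its own, without running Scrapy. The following is a minimal sketch that reuses the get_file_name logic from the code above; the https://example.com URLs are made up for illustration and are not part of the original crawl.

```python
from os.path import basename
from urllib.parse import urlparse


def get_file_name(url):
    """Return the last path component of a URL, or the host if the path is empty."""
    parsed_url = urlparse(url)
    path = parsed_url.path.strip('/')
    return basename(path) if path else parsed_url.netloc


# Hypothetical example URLs, used only for illustration.
examples = [
    'file:///path/to/demo1.html',
    'https://example.com/',
    'https://example.com/docs/page.html?x=1',
]
names = [get_file_name(u) + '.txt' for u in examples]
print(names)
```

Note that the query string is ignored, so two URLs that differ only in their query parameters would be written to the same file.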
I have the following custom pipeline for downloading JSON files. It was functioning fine until I needed to add the __init__ function, in which I subclass the FilesPipeline class in order to add a few new properties. The pipeline takes URLs that point to API endpoints and downloads their responses. The folders are properly created when running the spider via scrapy crawl myspider, and the two print statements in the file_path function show the correct values (filename and filepath). However, the files are never actually downloaded.
I did find a few similar questions about custom file pipelines and files not downloading (in one, the solution was to yield the items instead of returning them; in another, it was to adjust the ROBOTSTXT_OBEY setting), but neither solution worked for me.
What am I doing wrong (or forgetting to do when subclassing the FilesPipeline)? I've been racking my brain over this issue for a good 3 hours and my google-fu has not yielded any resolutions for my case.
class LocalJsonFilesPipeline(FilesPipeline):
    FILES_STORE = "json_src"
    FILES_URLS_FIELD = "json_url"
    FILES_RESULT_FIELD = "local_json"
    def __init__(self, store_uri, use_response_url=False, filename_regex=None, settings=None):
        # super(LocalJsonFilesPipeline, self).__init__(store_uri)
        self.store_uri = store_uri
        self.use_response_url = use_response_url
        if filename_regex:
            self.filename_regex = re.compile(filename_regex)
        else:
            self.filename_regex = filename_regex
        super(LocalJsonFilesPipeline, self).__init__(store_uri, settings=settings)
    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.spider:
            return BasePipeline()
        store_uri = f'{cls.FILES_STORE}/{crawler.spider.name}'
        settings = crawler.spider.settings
        use_response_url = settings.get('JSON_FILENAME_USE_RESPONSE_URL', False)
        filename_regex = settings.get('JSON_FILENAME_REGEX')
        return cls(store_uri, use_response_url, filename_regex, settings)
    def parse_path(self, value):
        if self.filename_regex:
            try:
                return self.filename_regex.findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        # example: /p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602
        link_path = os.path.splitext(urlparse(value).path)[0]  # omit extension if there is one
        link_params = link_path.rsplit('/', 1)[1]  # preserve the last portion separated by forward-slash (A-77775602)
        return link_params if '=' not in link_params else link_params.split('=', 1)[1]
    def get_media_requests(self, item, info):
        json_url = item.get(self.FILES_URLS_FIELD)
        if json_url:
            filename_url = json_url if not self.use_response_url else item.get('url', '')
            return [Request(json_url, meta={'filename': self.parse_path(filename_url), 'spider': info.spider.name})]
    def file_path(self, request, response=None, info=None):
        final_path = f'{self.FILES_STORE}/{request.meta["spider"]}/{request.meta["filename"]}.json'
        print('url', request.url)
        print('downloading to', final_path)
        return final_path
And the custom settings of my spider
class MockSpider(scrapy.Spider):
    name = 'mock'
    custom_settings = {
        'ITEM_PIPELINES': {
            'mock.pipelines.LocalJsonFilesPipeline': 200
        },
        'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
    }
Log with the level set to debug
C:\Users\Mike\Desktop\scrapy_test\pipeline_test>scrapy crawl testsite
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: pipeline
_test)
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9
.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (
tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)], pyOp
enSSL 19.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows
-7-6.1.7601-SP1
2020-07-19 11:23:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.se
lectreactor.SelectReactor
2020-07-19 11:23:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pipeline_test',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'pipeline_test.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['pipeline_test.spiders']}
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet Password: 0454b083df
d2028a
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled item pipelines:
['pipeline_test.pipelines.LocalJsonFilesPipeline']
2020-07-19 11:23:08 [scrapy.core.engine] INFO: Spider opened
2020-07-19 11:23:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pag
es/min), scraped 0 items (at 0 items/min)
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.[testsite].com/robots.txt> (referer: None)
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails> (re
ferer: None)
2020-07-19 11:23:08 [stdout] INFO: url
2020-07-19 11:23:08 [stdout] INFO: https://[testsite]/vpd/v1/products/pro
d6149174-product/ProductInfo+ProductDetails
2020-07-19 11:23:08 [stdout] INFO: downloading to
2020-07-19 11:23:08 [stdout] INFO: json_src/[testsite]/prod6149174-product.json
2020-07-19 11:23:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails>
{'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/Prod
uctInfo+ProductDetails',
'local_json': [],
'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/
ID=prod6149174-product'}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-19 11:23:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 506,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.468001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 19, 15, 23, 9, 96399),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 14,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 19, 15, 23, 8, 628398)}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Spider closed (finished)
I finally figured out the issue: in this Scrapy version (2.2.1), FilesPipeline does not define its own from_crawler method. It defines from_settings instead (the from_crawler it inherits calls from_settings), so a subclassed/custom FilesPipeline that needs extra parameters should override from_settings rather than from_crawler. Below is my working version of the custom FilesPipeline:
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os
import re
class LocalFilesPipeline(FilesPipeline):
    FILES_STORE = "data_src"
    FILES_URLS_FIELD = "data_url"
    FILES_RESULT_FIELD = "local_file"
    def __init__(self, settings=None):
        """
        Attributes:
            use_response_url    indicates we want to grab the filename from the response url instead of json_url
            filename_regex      regexes to use for grabbing filenames out of urls
            filename_suffixes   suffixes to append to filenames when there are multiple files to download per item
            filename_extension  the file extension to append to each filename in the file_path function
        """
        self.use_response_url = settings.get('FILENAME_USE_RESPONSE_URL', False)
        self.filename_regex = settings.get('FILENAME_REGEX', [])
        self.filename_suffixes = settings.get('FILENAME_SUFFIXES', [])
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')
        if isinstance(self.filename_regex, str):
            self.filename_regex = [self.filename_regex]
        if isinstance(self.filename_suffixes, str):
            self.filename_suffixes = [self.filename_suffixes]
        if self.filename_regex and self.filename_suffixes and len(self.filename_regex) != len(self.filename_suffixes):
            raise ValueError('FILENAME_REGEX and FILENAME_SUFFIXES settings must contain the same number of elements')
        if self.filename_regex:
            for i, f_regex in enumerate(self.filename_regex):
                self.filename_regex[i] = re.compile(f_regex)
        super(LocalFilesPipeline, self).__init__(self.FILES_STORE, settings=settings)
    @classmethod
    def from_settings(cls, settings):
        return cls(settings=settings)
    def parse_path(self, value, index):
        if self.filename_regex:
            try:
                return self.filename_regex[index-1].findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        link_path = os.path.splitext(urlparse(value).path)[0]
        # preserve the last portion separated by forward-slash
        try:
            return link_path.rsplit('/', 1)[1]
        except IndexError:
            return link_path
    def get_media_requests(self, item, info):
        file_urls = item.get(self.FILES_URLS_FIELD)
        requests = []
        if file_urls:
            total_urls = len(file_urls)
            for i, file_url in enumerate(file_urls, 1):
                filename_url = file_url if not self.use_response_url else item.get('url', '')
                filename = self.parse_path(filename_url, i)
                if self.filename_suffixes:
                    current_suffix = self.filename_suffixes[i-1]
                    if current_suffix.startswith('/'):
                        # this will end up creating a separate folder for the different types of files
                        filename += current_suffix
                    else:
                        # this will keep all files in single folder while still making it easy to differentiate each
                        # type of file. this comes in handy when searching for a file by the base name.
                        filename += f'_{current_suffix}'
                elif total_urls > 1:
                    # default to numbering files sequentially in the order they were added to the item
                    filename += f'_file{i}'
                requests.append(Request(file_url, meta={'spider': info.spider.name, 'filename': filename}))
        return requests
    def file_path(self, request, response=None, info=None):
        return f'{request.meta["spider"]}/{request.meta["filename"]}.{self.filename_extension}'
Then, to utilize the pipeline you can set the applicable values in a spider's custom_settings property
custom_settings = {
    'ITEM_PIPELINES': {
        'spins.pipelines.LocalFilesPipeline': 200
    },
    'FILENAME_REGEX': [r'products\/(.+?)\/ProductInfo\+ProductDetails']
}
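The filename derivation that parse_path performs (regex first, then a fallback to the last path segment) can be tried in isolation. This is a minimal sketch, not the pipeline itself: derive_filename is a hypothetical stand-in written for this example, and the example.com URLs are made up.

```python
import os
import re
from urllib.parse import urlparse


def derive_filename(url, pattern=None):
    """Hypothetical stand-in for parse_path: regex first, then last path segment."""
    if pattern is not None:
        matches = re.compile(pattern).findall(url)
        if matches:
            return matches[0]
    # Fallback: drop any extension, then keep the last '/'-separated segment.
    link_path = os.path.splitext(urlparse(url).path)[0]
    return link_path.rsplit('/', 1)[-1]


regex = r'products\/(.+?)\/ProductInfo\+ProductDetails'
api_url = 'https://example.com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails'

by_regex = derive_filename(api_url, regex)
by_fallback = derive_filename('https://example.com/files/report.json')
print(by_regex, by_fallback)
```

With a single capture group, findall returns the captured text rather than the whole match, which is why the regex yields only the product identifier.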
I'm new to Python and Scrapy.
After the scraping process I tried to save the data to an SQLite database,
following this source: https://github.com/sunshineatnoon/Scrapy-Amazon-Sqlite
My problem is that the database is created successfully, but items can't be inserted into it because process_item is not called.
EDIT
I have pasted the source code from the GitHub link above.
settings.py
ITEM_PIPELINES = {
    'amazon.pipelines.AmazonPipeline': 300
}
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3
import os
con = None
class AmazonPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.createTables()
    def process_item(self, item, spider):
        print('---------------process item----')
        self.storeInDb(item)
        return item
    def setupDBCon(self):
        self.con = sqlite3.connect(os.getcwd() + '/test.db')
        self.cur = self.con.cursor()
    def createTables(self):
        self.dropAmazonTable()
        self.createAmazonTable()
    def dropAmazonTable(self):
        #drop amazon table if it exists
        self.cur.execute("DROP TABLE IF EXISTS Amazon")
    def closeDB(self):
        self.con.close()
    def __del__(self):
        self.closeDB()
    def createAmazonTable(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS Amazon(id INTEGER PRIMARY KEY NOT NULL, \
            name TEXT, \
            path TEXT, \
            source TEXT \
            )")
        self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")
        print ('------------------------')
        self.con.commit()
    def storeInDb(self,item):
        # self.cur.execute("INSERT INTO Amazon(\
        # name, \
        # path, \
        # source \
        # ) \
        # VALUES( ?, ?, ?)", \
        # ( \
        # item.get('Name',''),
        # item.get('Path',''),
        # item.get('Source','')
        # ))
        self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")
        print ('------------------------')
        print ('Data Stored in Database')
        print ('------------------------')
        self.con.commit()
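The parameterised insert that is commented out in storeInDb can be exercised on its own to confirm that the SQL and the ? placeholders work, independently of whether process_item is called. This is a minimal sketch under stated assumptions: it uses an in-memory database instead of test.db, a plain dict instead of AmazonItem, and made-up values.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (the pipeline uses test.db).
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS Amazon("
    "id INTEGER PRIMARY KEY NOT NULL, name TEXT, path TEXT, source TEXT)"
)


def store_in_db(item):
    """Insert one item using ? placeholders instead of hard-coded values."""
    cur.execute(
        "INSERT INTO Amazon(name, path, source) VALUES (?, ?, ?)",
        (item.get('Name', ''), item.get('Path', ''), item.get('Source', '')),
    )
    con.commit()


# A plain dict stands in for AmazonItem here; the values are made up.
store_in_db({'Name': 'Backpack', 'Path': '/tmp/1.jpg', 'Source': 'https://example.com/p/1'})
rows = cur.execute("SELECT name, path, source FROM Amazon").fetchall()
print(rows)
```

If this works but the table stays empty during a crawl, the insert itself is not the problem and the spider is most likely not yielding any items.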
spiders/amazonspider.py
import scrapy
import urllib
from amazon.items import AmazonItem
import os
class amazonSpider(scrapy.Spider):
    imgcount = 1
    name = "amazon"
    allowed_domains = ["amazon.com"]
    '''
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=backpack",
    "http://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Abackpack&page=2&keywords=backpack&ie=UTF8&qid=1442907452&spIA=B00YCRMZXW,B010HWLMMA"
    ]
    '''
    def start_requests(self):
        yield scrapy.Request("http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0",self.parse)
        for i in range(2,3):
            yield scrapy.Request("http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page="+str(i)+"&bbn=10445813011&ie=UTF8&qid=1442910987",self.parse)
    def parse(self,response):
        #namelist = response.xpath('//a[@class="a-link-normal s-access-detail-page a-text-normal"]/@title').extract()
        #htmllist = response.xpath('//a[@class="a-link-normal s-access-detail-page a-text-normal"]/@href').extract()
        #imglist = response.xpath('//a[@class="a-link-normal a-text-normal"]/img/@src').extract()
        namelist = response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/@title').extract()
        htmllist = response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/@href').extract()
        imglist = response.xpath('//img[@class="s-access-image cfMarker"]/@src').extract()
        listlength = len(namelist)
        pwd = os.getcwd()+'/'
        if not os.path.isdir(pwd+'crawlImages/'):
            os.mkdir(pwd+'crawlImages/')
        for i in range(0,listlength):
            item = AmazonItem()
            item['Name'] = namelist[i]
            item['Source'] = htmllist[i]
            urllib.urlretrieve(imglist[i],pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg")
            item['Path'] = pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg"
            amazonSpider.imgcount = amazonSpider.imgcount + 1
            yield item
Result
After running scrapy crawl amazon,
test.db is created but no items are inserted (I have checked test.db), which means process_item was not run.
Build result
2018-09-18 16:38:38 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: amazon)
2018-09-18 16:38:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-18 16:38:38 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'amazon', 'NEWSPIDER_MODULE': 'amazon.spiders', 'SPIDER_MODULES': ['amazon.spiders']}
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
------------------------
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled item pipelines:
['amazon.pipelines.AmazonPipeline']
2018-09-18 16:38:38 [scrapy.core.engine] INFO: Spider opened
2018-09-18 16:38:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-18 16:38:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987> from <GET http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0> from <GET http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> from <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> from <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> (referer: None)
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> (referer: None)
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-18 16:38:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1909,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 140740,
'downloader/response_count': 6,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 9, 18, 9, 38, 41, 53948),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'memusage/max': 52600832,
'memusage/startup': 52600832,
'response_received_count': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2018, 9, 18, 9, 38, 38, 677280)}
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Spider closed (finished)
I've been searching around but have had no luck.
Thanks
If I crawl
https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011
I don't get any result in namelist and htmllist. Urllist is filled.
Checking the HTML code:
... <a class="a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal" ...
I found an additional "s-color-twister-title-link" class, so your exact-match XPath no longer matches. You can add s-color-twister-title-link to it:
In [9]: response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal"]/@title').extract()
Out[9]:
['Anime Anti-theft Backpack, Luminous School Bag, Waterproof Laptop Backpack with USB Charging Port, Unisex 15.6 Inch College Daypack, Starry',
'Anime Luminous Backpack Noctilucent School Bags Daypack USB chargeing Port Laptop Bag Handbag for Boys Girls Men Women',
or you can use a more tolerant one that only requires a single class name to be present, like:
response.xpath('//a[contains(@class,"s-access-detail-page")]/@title').extract()
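The difference between an exact match on the class attribute and a contains() match can be reproduced without Scrapy. The sketch below swaps in Python's standard html.parser for the XPath selector so that it runs without dependencies; the markup is made up to mimic the anchor shown above, and TitleCollector is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

# Made-up markup mimicking the anchor shown above; only the class list matters.
HTML = (
    '<a class="a-link-normal s-access-detail-page s-overflow-ellipsis '
    's-color-twister-title-link a-text-normal" title="Backpack A" href="/a"></a>'
    '<a class="a-link-normal a-text-normal" title="Not a product" href="/b"></a>'
)

# The class string the original spider compared against.
OLD_CLASS = 'a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal'


class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.exact = []     # mirrors @class="..." (whole attribute must be equal)
        self.contains = []  # mirrors contains(@class, "s-access-detail-page")

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        cls = attr.get('class') or ''
        if cls == OLD_CLASS:
            self.exact.append(attr.get('title'))
        if 's-access-detail-page' in cls:
            self.contains.append(attr.get('title'))


collector = TitleCollector()
collector.feed(HTML)
print(collector.exact, collector.contains)
```

The exact comparison finds nothing once the site adds one more class, while the substring test still finds the product link. Like XPath's contains(), the substring test would also match a longer class name that merely includes the text.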
I am trying to write a web crawler using Scrapy and PyQuery. The full spider code is as follows.
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class gotspider(CrawlSpider):
    name='gotspider'
    allowed_domains=['fundrazr.com']
    start_urls = ['https://fundrazr.com/find?category=Health']
    rules = [
        Rule(LinkExtractor(allow=('/find/category=Health')), callback='parse',follow=True)
    ]
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print(response.body)
The web page skeleton
<div id="header">
<h2 class="title"> Township </h2>
<p><strong>Client: </strong> Township<br>
<strong>Location: </strong>Pennsylvania<br>
<strong>Size: </strong>54,000 SF</p>
</div>
Output of the crawler: the crawler fetches the requested URL and hits the correct web target, but the parse_item or parse method does not receive the response, and response.url is not printed. I tried to verify this by running the spider without logs (scrapy crawl rsscrach --nolog), but nothing is printed at all.
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: rsscrach)
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'rsscrach', 'NEWSPIDER_MODULE': 'rsscrach.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['rsscrach.spiders']}
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-26 18:07:12 [scrapy.core.engine] INFO: Spider opened
2017-11-26 18:07:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-26 18:07:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-11-26 18:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/robots.txt> (referer: None)
2017-11-26 18:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/find?category=Health> (referer: None)
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-26 18:07:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 605,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13510,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 26, 10, 7, 15, 46516),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 52465664,
'memusage/startup': 52465664,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 11, 26, 10, 7, 12, 198182)}
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Spider closed (finished)
How do I get the Client, Location and Size values?
I made a standalone script with Scrapy which tests different methods of getting the data, and it works without problems. Maybe it will help you find your problem.
import scrapy
import pyquery

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://fundrazr.com/find?category=Health']

    def parse(self, response):
        print('--- css 1 ---')
        for title in response.css('h2'):
            print('>>>', title)
        print('--- css 2 ---')
        for title in response.css('h2'):
            print('>>>', title.extract())  # without _first()
            print('>>>', title.css('a').extract_first())
            print('>>>', title.css('a ::text').extract_first())
            print('-----')
        print('--- css 3 ---')
        for title in response.css('h2 a ::text'):
            print('>>>', title.extract())  # without _first()
        print('--- pyquery 1 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2'):
            print('>>>', title, title.text, '<<<')  # `title.text` gives "\n"
        print('--- pyquery 2 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2').text():
            print('>>>', title)
        print(p('h2').text())
        print('--- pyquery 3 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2 a'):
            print('>>>', title, title.text)

# ---------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
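For the Client, Location and Size values specifically, here is a minimal sketch that works on the skeleton from the question. It uses only the standard library, not Scrapy or PyQuery, and it assumes every value is the text that directly follows a <strong> label. The names PAGE, LabelValueParser and extract_fields are made up for this illustration.

```python
from html.parser import HTMLParser

# The page skeleton from the question, used here as test input.
PAGE = """
<div id="header">
<h2 class="title"> Township </h2>
<p><strong>Client: </strong> Township<br>
<strong>Location: </strong>Pennsylvania<br>
<strong>Size: </strong>54,000 SF</p>
</div>
"""


class LabelValueParser(HTMLParser):
    """Collect '<strong>Label: </strong>value' pairs into a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_strong = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_strong:
            # Text inside <strong> is the label, e.g. "Client:"
            self._label = text.rstrip(":").strip()
        elif self._label is not None:
            # First non-empty text after a label is taken as its value
            self.fields[self._label] = text
            self._label = None


def extract_fields(html):
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = extract_fields(PAGE)
print(fields)
```

Inside a spider you would feed response.text to the same function instead of PAGE.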
I'm trying to crawl a page that uses next buttons to move to new pages using scrapy. I'm using an instance of crawl spider and have defined the Linkextractor to extract new pages to follow. However, the spider just crawls the start url and stops at that. I've added the spider code and the log. Anyone has any idea why the spider is not able to crawl the pages.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from realcommercial.items import RealcommercialItem
from scrapy.selector import Selector
from scrapy.http import Request

class RealCommercial(CrawlSpider):
    name = "realcommercial"
    allowed_domains = ["realcommercial.com.au"]
    start_urls = [
        "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
    ]
    rules = [Rule(LinkExtractor( allow = ['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True),
             Rule(LinkExtractor( allow = []),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True)]

    def parse_response(self, response):
        sel = Selector(response)
        sites = sel.xpath("//a[@class='details']")
        #items = []
        for site in sites:
            item = RealcommercialItem()
            link = site.xpath('@href').extract()
            #print link, '\n\n'
            item['link'] = link
            link = 'http://www.realcommercial.com.au/' + str(link[0])
            #print 'link!!!!!!=', link
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            #items.append(item)
            yield item
        return

    def process_links(self, links):
        print 'inside process links'
        for i, w in enumerate(links):
            print w.url,'\n\n\n'
            w.url = "http://www.realcommercial.com.au/" + w.url
            print w.url,'\n\n\n'
            links[i] = w
        return links

    def parse_file_page(self, response):
        #item passed from request
        #print 'parse_file_page!!!'
        item = response.meta['item']
        #selector
        sel = Selector(response)
        title = sel.xpath('//*[@id="listing_address"]').extract()
        #print title
        item['title'] = title
        return item
Log
2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, bot
o
2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 're
alcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.
spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
2015-11-29 15:42:57 [scrapy] INFO: Spider opened
2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial
.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=l
ist-date> (referer: None)
2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 30599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)
I got the answer myself. There were two issues:
process_links was prepending "http://www.realcommercial.com.au/" to every URL although the links already contained it. I had assumed the link extractor would give back relative URLs.
The regular expression in the link extractor was not correct.
I changed both of these and it worked.
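Both issues can be reproduced without running the spider. The sketch below is only an illustration: next_page is a hypothetical pagination URL modelled on the start URL, and the "fixed" pattern is one possible correction, not necessarily the one that was used. The point is that an unescaped ? after \d+ makes the quantifier lazy instead of matching the literal question mark, and that prefixing the host onto an already absolute URL doubles it.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.realcommercial.com.au/"

# Hypothetical next-page URL, modelled on the start URL from the question.
next_page = (
    "http://www.realcommercial.com.au/for-sale/in-vic/list-2"
    "?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
)

# Pattern from the question: "?" is a regex metacharacter here, so it never
# matches the literal "?" that starts the query string.
broken = re.compile(r"/for-sale/in-vic/list-\d+?activeSort=list-date")

# One possible correction (an assumption): escape the "?" and allow other
# query parameters before activeSort.
fixed = re.compile(r"/for-sale/in-vic/list-\d+\?.*activeSort=list-date")

broken_matches = broken.search(next_page) is not None
fixed_matches = fixed.search(next_page) is not None
print(broken_matches, fixed_matches)

# What process_links did: blind string concatenation on an absolute URL.
doubled = BASE + next_page

# urljoin leaves absolute URLs alone and only resolves relative ones.
safe_absolute = urljoin(BASE, next_page)
safe_relative = urljoin(BASE, "/for-sale/in-vic/list-2")
print(doubled)
print(safe_absolute)
print(safe_relative)
```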
A genuine Scrapy and Python noob here so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of the Article. I managed to crawl a single page for one item but the moment I try and expand beyond that it all goes wrong.
my Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )
    rules = (Rule (SgmlLinkExtractor(allow=("", ))
        , callback="parse_items", follow= True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items
I am still getting the same problem, though I have noticed that a "[" appears every time I run the spider. To try to figure out what the issue is, I ran the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOW
NLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, D
efaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddl
eware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328>
(referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Na
tal emergency services said on Friday.'],
'Date': [u'2015-03-28 07:30'],
'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So I can see that the item is being scraped, but no usable item data ends up in the JSON file. This is how I'm running Scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line (return items), but changing it to either yield or print gives me no items scraped in the parse.
This usually means nothing was scraped, no items were extracted.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit of cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basic.items import BasicItem

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com"]
    start_urls = [
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    ]
    rules = [
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    ]

    def parse_items(self, response):
        for title in response.xpath('//*[@id="aspnetForm"]'):
            item = BasicItem()
            item['Headline'] = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = title.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = title.xpath('//*[@id="spnDate"]/text()').extract()
            yield item
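To see why the trailing slash in allowed_domains matters, here is a rough sketch of the kind of host check that offsite filtering performs. It is a simplified approximation written for illustration, not Scrapy's actual code, and host_regex and is_allowed are hypothetical names. The idea is that each entry is compared against the URL's hostname only, so an entry that contains a "/" can never match.

```python
import re
from urllib.parse import urlparse


def host_regex(allowed_domains):
    # Approximation: accept the domain itself or any subdomain of it.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_allowed(url, allowed_domains):
    # Only the hostname is compared; path and query are ignored.
    host = urlparse(url).hostname or ""
    return bool(host_regex(allowed_domains).search(host))


url = "http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328"

with_slash = is_allowed(url, ["news24.com/"])    # entry from the question
without_slash = is_allowed(url, ["news24.com"])  # corrected entry
print(with_slash, without_slash)
```

Under this assumption every link extracted from the start page is treated as offsite with the original setting, which would explain why the Requests list in the scrapy parse output is empty.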