I've been trying to follow the Scrapy tutorial, but I'm stuck and have no idea where the mistake is.
The spider runs, but no items are crawled.
I get the following output:
C:\Users\xxx\allegro>scrapy crawl AllegroPrices
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: AllegroPrices)
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'allegro.spiders', 'SPIDER_MODULES': ['allegro.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'AllegroPrices'}
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'allegro.middlewares.AllegroSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled item pipelines:
['allegro.pipelines.AllegroPipeline']
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider opened
2017-12-10 22:25:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-10 22:25:15 [AllegroPrices] INFO: Spider opened: AllegroPrices
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-10 22:25:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 527000),
'log_count/INFO': 8,
'start_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 517000)}
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider closed (finished)
My spider file:
# -*- coding: utf-8 -*-
import scrapy
from allegro.items import AllegroItem
class AllegroPrices(scrapy.Spider):
    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]

    # Use working product URLs below
    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        items = AllegroItem()
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/span/text()').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_seller'] = ''.join(seller).strip()
        yield items
Settings:
# -*- coding: utf-8 -*-
# Scrapy settings for allegro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'AllegroPrices'
SPIDER_MODULES = ['allegro.spiders']
NEWSPIDER_MODULE = 'allegro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allegro (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'allegro.middlewares.AllegroSpiderMiddleware': 543,
}
LOG_LEVEL = 'INFO'
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'allegro.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allegro.pipelines.AllegroPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pipeline:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class AllegroPipeline(object):
    def process_item(self, item, spider):
        return item
Items:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class AllegroItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_sale_price = scrapy.Field()
    product_seller = scrapy.Field()
I have no problem running it as a standalone script, without creating a project, and saving to a CSV file.
And I didn't have to change the USER_AGENT.
Maybe there is a problem with some of your settings. You didn't put a URL to the tutorial, so I can't check it.
Or you simply have wrong indentation and start_urls and parse() are not inside the class. Indentation is very important in Python.
BTW: you forgot /a/ in the XPath for the seller.
import scrapy

#class AllegroItem(scrapy.Item):
#    product_name = scrapy.Field()
#    product_sale_price = scrapy.Field()
#    product_seller = scrapy.Field()

class AllegroPrices(scrapy.Spider):

    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]

    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()

        title = title[0].strip()

        print(title, sale_price, seller)

        yield {'title': title, 'price': sale_price, 'seller': seller}

        #items = AllegroItem()
        #items['product_name'] = ''.join(title).strip()
        #items['product_sale_price'] = ''.join(sale_price).strip()
        #items['product_seller'] = ''.join(seller).strip()
        #yield items

# --- run it as a standalone script without a project and save to CSV ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess()
c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv'
})
c.crawl(AllegroPrices)
c.start()
Result in CSV file:
title,price,seller
STAR WARS: EMPIRE AT WAR [2006] DVD BOX,"24,90 zł",CDkingpl
DIABLO II: LORD OF DESTRUCTION 2 PC BIG BOX ENG,"149,00 zł",CDkingpl
HEAVY GEAR II 2 | PC ENG CDkingpl,"19,90 zł",CDkingpl
DIABLO II 2 | PC DVD BOX | ENG,"24,90 zł",CDkingpl
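If you would rather stay inside the generated project instead of using the standalone script, the same CSV export can be requested through Scrapy's built-in feed exporter on the command line (a minimal sketch; it assumes the project-based spider parses correctly once the XPath and indentation issues above are fixed):

scrapy crawl AllegroPrices -o output.csv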
Related
I have the following custom pipeline for downloading JSON files. It was working fine until I needed to add an __init__ function, in which I subclass FilesPipeline in order to add a few new properties. The pipeline takes URLs that point to API endpoints and downloads their responses. The folders are properly created when running the spider via scrapy crawl myspider, and the two print statements in the file_path function show the correct values (filename and filepath). However, the files are never actually downloaded.
I did find a few similar questions about custom file pipelines and files not downloading (for example, one where the solution was to yield the items instead of returning them, and one where the solution was to adjust the ROBOTSTXT_OBEY setting), but those solutions did not work for me.
What am I doing wrong (or forgetting to do) when subclassing FilesPipeline? I've been racking my brain over this issue for a good three hours, and my google-fu has not yielded any resolutions for my case.
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os
import re


class LocalJsonFilesPipeline(FilesPipeline):

    FILES_STORE = "json_src"
    FILES_URLS_FIELD = "json_url"
    FILES_RESULT_FIELD = "local_json"

    def __init__(self, store_uri, use_response_url=False, filename_regex=None, settings=None):
        # super(LocalJsonFilesPipeline, self).__init__(store_uri)
        self.store_uri = store_uri
        self.use_response_url = use_response_url
        if filename_regex:
            self.filename_regex = re.compile(filename_regex)
        else:
            self.filename_regex = filename_regex
        super(LocalJsonFilesPipeline, self).__init__(store_uri, settings=settings)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.spider:
            return BasePipeline()
        store_uri = f'{cls.FILES_STORE}/{crawler.spider.name}'
        settings = crawler.spider.settings
        use_response_url = settings.get('JSON_FILENAME_USE_RESPONSE_URL', False)
        filename_regex = settings.get('JSON_FILENAME_REGEX')
        return cls(store_uri, use_response_url, filename_regex, settings)

    def parse_path(self, value):
        if self.filename_regex:
            try:
                return self.filename_regex.findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        # example: /p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602
        link_path = os.path.splitext(urlparse(value).path)[0]  # omit extension if there is one
        link_params = link_path.rsplit('/', 1)[1]  # preserve the last portion separated by forward-slash (A-77775602)
        return link_params if '=' not in link_params else link_params.split('=', 1)[1]

    def get_media_requests(self, item, info):
        json_url = item.get(self.FILES_URLS_FIELD)
        if json_url:
            filename_url = json_url if not self.use_response_url else item.get('url', '')
            return [Request(json_url, meta={'filename': self.parse_path(filename_url), 'spider': info.spider.name})]

    def file_path(self, request, response=None, info=None):
        final_path = f'{self.FILES_STORE}/{request.meta["spider"]}/{request.meta["filename"]}.json'
        print('url', request.url)
        print('downloading to', final_path)
        return final_path
And the custom settings of my spider
class MockSpider(scrapy.Spider):

    name = 'mock'

    custom_settings = {
        'ITEM_PIPELINES': {
            'mock.pipelines.LocalJsonFilesPipeline': 200
        },
        'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
    }
Log with the level set to debug
C:\Users\Mike\Desktop\scrapy_test\pipeline_test>scrapy crawl testsite
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: pipeline_test)
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-7-6.1.7601-SP1
2020-07-19 11:23:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-19 11:23:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pipeline_test',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'pipeline_test.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['pipeline_test.spiders']}
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet Password: 0454b083dfd2028a
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled item pipelines:
['pipeline_test.pipelines.LocalJsonFilesPipeline']
2020-07-19 11:23:08 [scrapy.core.engine] INFO: Spider opened
2020-07-19 11:23:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.[testsite].com/robots.txt> (referer: None)
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails> (referer: None)
2020-07-19 11:23:08 [stdout] INFO: url
2020-07-19 11:23:08 [stdout] INFO: https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails
2020-07-19 11:23:08 [stdout] INFO: downloading to
2020-07-19 11:23:08 [stdout] INFO: json_src/[testsite]/prod6149174-product.json
2020-07-19 11:23:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails>
{'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
 'local_json': [],
 'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product'}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-19 11:23:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 506,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.468001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 19, 15, 23, 9, 96399),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 14,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 19, 15, 23, 8, 628398)}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Spider closed (finished)
I finally figured out the issue: the FilesPipeline class does not have a from_crawler method, but instead requires a from_settings method when you want to pass extra parameters to a subclassed/custom FilesPipeline. Below is my working version of the custom FilesPipeline.
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os
import re
class LocalFilesPipeline(FilesPipeline):

    FILES_STORE = "data_src"
    FILES_URLS_FIELD = "data_url"
    FILES_RESULT_FIELD = "local_file"

    def __init__(self, settings=None):
        """
        Attributes:
          use_response_url    indicates we want to grab the filename from the response url instead of json_url
          filename_regex      regexes to use for grabbing filenames out of urls
          filename_suffixes   suffixes to append to filenames when there are multiple files to download per item
          filename_extension  the file extension to append to each filename in the file_path function
        """
        self.use_response_url = settings.get('FILENAME_USE_RESPONSE_URL', False)
        self.filename_regex = settings.get('FILENAME_REGEX', [])
        self.filename_suffixes = settings.get('FILENAME_SUFFIXES', [])
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')

        if isinstance(self.filename_regex, str):
            self.filename_regex = [self.filename_regex]
        if isinstance(self.filename_suffixes, str):
            self.filename_suffixes = [self.filename_suffixes]
        if self.filename_regex and self.filename_suffixes and len(self.filename_regex) != len(self.filename_suffixes):
            raise ValueError('FILENAME_REGEX and FILENAME_SUFFIXES settings must contain the same number of elements')

        if self.filename_regex:
            for i, f_regex in enumerate(self.filename_regex):
                self.filename_regex[i] = re.compile(f_regex)

        super(LocalFilesPipeline, self).__init__(self.FILES_STORE, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        return cls(settings=settings)

    def parse_path(self, value, index):
        if self.filename_regex:
            try:
                return self.filename_regex[index-1].findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        link_path = os.path.splitext(urlparse(value).path)[0]
        # preserve the last portion separated by forward-slash
        try:
            return link_path.rsplit('/', 1)[1]
        except IndexError:
            return link_path

    def get_media_requests(self, item, info):
        file_urls = item.get(self.FILES_URLS_FIELD)
        requests = []
        if file_urls:
            total_urls = len(file_urls)
            for i, file_url in enumerate(file_urls, 1):
                filename_url = file_url if not self.use_response_url else item.get('url', '')
                filename = self.parse_path(filename_url, i)
                if self.filename_suffixes:
                    current_suffix = self.filename_suffixes[i-1]
                    if current_suffix.startswith('/'):
                        # this will end up creating a separate folder for the different types of files
                        filename += current_suffix
                    else:
                        # this will keep all files in a single folder while still making it easy to differentiate each
                        # type of file. this comes in handy when searching for a file by the base name.
                        filename += f'_{current_suffix}'
                elif total_urls > 1:
                    # default to numbering files sequentially in the order they were added to the item
                    filename += f'_file{i}'
                requests.append(Request(file_url, meta={'spider': info.spider.name, 'filename': filename}))
        return requests

    def file_path(self, request, response=None, info=None):
        return f'{request.meta["spider"]}/{request.meta["filename"]}.{self.filename_extension}'
Then, to utilize the pipeline you can set the applicable values in a spider's custom_settings property
custom_settings = {
    'ITEM_PIPELINES': {
        'spins.pipelines.LocalFilesPipeline': 200
    },
    'FILENAME_REGEX': [r'products\/(.+?)\/ProductInfo\+ProductDetails']
}
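For reference, this pipeline reads its URLs from the item field named by FILES_URLS_FIELD ("data_url", expected to be a list), optionally reads the page URL from "url" when FILENAME_USE_RESPONSE_URL is enabled, and writes the download results back into "local_file". A minimal sketch of a spider callback yielding a matching item (the endpoint path is illustrative, not from the original spider):

def parse(self, response):
    # 'data_url' feeds get_media_requests(); 'local_file' is filled in
    # by the pipeline after the downloads complete.
    yield {
        'url': response.url,
        'data_url': [response.urljoin('/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails')],
    }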
I am trying to make an eBay spider that goes through each product link on a page and, for each link, visits the product page and does something with it in the parse_link function.
I am scraping this link.
In the parse function it iterates over each link fine and prints each link fine, but it only calls parse_link for one link per page.
I mean: each page has 50 or so products, I extract each product link, and for each link I want to visit the page and do something in the parse_link function.
But for each page, parse_link gets called for only one link (out of the 50 or so links).
Here is the code:
class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    #allowed_domains = ["ebay.com"]
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        global c
        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            item = EbayItem()
            linkse = '.vip ::attr(href)'
            link = attr.css('a.vip ::attr(href)').extract_first()
            c += 1
            print '', 'I AM HERE', link, '\t', c
            yield scrapy.Request(link, callback=self.parse_link, meta={'item': item})

        next_page = '.gspr.next ::attr(href)'
        next_page = response.css(next_page).extract_first()
        print '\nI AM NEXT PAGE\n'
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

    def parse_link(self, response):
        global c2
        c2 += 1
        print '\n\n\tIam in parselink\t', c2
SEE: for every 50 or so links, Scrapy only executes parse_link once. I am printing counts of how many links were extracted and how many times parse_link got executed, using global variables:
shady#shadyD:~/Desktop/ebay$ scrapy crawl ebayspider
ENTER THE URL TO SCRAPE : http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562
2017-05-13 22:44:31 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: ebay)
2017-05-13 22:44:31 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ebay.spiders', 'SPIDER_MODULES': ['ebay.spiders'], 'BOT_NAME': 'ebay'}
2017-05-13 22:44:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-13 22:44:33 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:38079/session {"requiredCapabilities": {}, "desiredCapabilities": {"platform": "ANY", "browserName": "chrome", "version": "", "chromeOptions": {"args": [], "extensions": []}, "javascriptEnabled": true}}
2017-05-13 22:44:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-13 22:44:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-13 22:44:33 [scrapy.core.engine] INFO: Spider opened
2017-05-13 22:44:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-13 22:44:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-13 22:44:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562> (referer: None)
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320B-320BL-320BLN-321B-322BL-325BL-151-9385-/361916086833?hash=item5443e13a31:g:NMwAAOSwX~dWomWJ 1
I AM HERE http://www.ebay.com/itm/257954A1-New-Case-580SL-580SM-580SL-Series-2-Backhoe-Loader-Hydraulic-Pump-/361345120303?hash=item5421d8f82f:g:KQEAAOSwBLlVVP0X 2
I AM HERE http://www.ebay.com/itm/Case-580K-forward-reverse-transmission-shuttle-kit-includ-NEW-PUMP-SEALS-GASKETS-/110777599002?hash=item19cadc041a:g:QBgAAOSwh-1W2GkE 3
I AM HERE http://www.ebay.com/itm/Case-Loader-Backhoe-580L-Hydraulic-Pump-130258A1-130258A2-15-spline-NEW-/361889539361?hash=item54424c2521:g:nzgAAOSw9GhYiQzz 4
I AM HERE http://www.ebay.com/itm/Hitachi-EX60-PLAIN-Excavator-Service-Manual-Shop-Repair-Book-KM-099-00-KM09900-/132118077640?hash=item1ec2d9e0c8:g:DLkAAOxyVLNS6Cj7 5
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-416E-420D-420E-428D-Backhoe-3054c-C4-4-engine-TurboCharger-turbo-/361576953143?hash=item542faa7537:g:I78AAOSw3ihXTZwm 6
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-1200002-120-0002-/131402610746?hash=item1e9834b83a:g:hBUAAOSwpdpVX4DS 7
I AM HERE http://www.ebay.com/itm/Fuel-Cap-Case-Backhoe-Skid-Steer-1845c-1845-1840-1835-1835b-1835c-diesel-or-gas-/132102578279?hash=item1ec1ed6067:g:LCYAAOSwGYVXCDJ4 8
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-312C-312CL-Stepping-Throttle-Motor-247-5207-2475207-/112125482091?hash=item1a1b33146b:g:1wAAAOSw9IpX0HLt 9
I AM HERE http://www.ebay.com/itm/AT179792-John-Deere-Loader-Backhoe-310E-310G-310K-310J-710D-Hydraulic-Pump-NEW-/111290280036?hash=item19e96ae864:g:hxQAAOSw2GlXEW8g 10
I AM HERE http://www.ebay.com/itm/L32129-CASE-580C-480C-Brake-master-cylinder-REPAIR-KIT-480B-580B-530-570-480-430-/112228195723?hash=item1a21525d8b:g:lWEAAOSwux5YRucG 11
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310C-310D-310E-410B-410C-510C-710C-King-pin-Kingpin-kit-T184816-/112266699462?hash=item1a239de2c6:g:~qAAAOSw44BYfmcP 12
I AM HERE http://www.ebay.com/itm/Case-257948A1-580L-580L-580SL-580M-580SM-590SL-590SM-Series-2-Coupler-17-spline-/131506726034?hash=item1e9e696492:g:ZnkAAOSwPgxVTNAx 13
I AM HERE http://www.ebay.com/itm/Construction-Equipment-key-set-John-Deere-Hitachi-JD-JCB-excavator-backhoe-multi-/360445978301?hash=item53ec4126bd:g:1HkAAMXQlUNRLOiF 14
I AM HERE http://www.ebay.com/itm/Case-580C-580E-forward-reverse-transmission-shuttle-kit-includ-NEW-SEALS-GASKETS-/361588374712?hash=item543058bcb8:g:kOYAAOSwDuJW2Gna 15
I AM HERE http://www.ebay.com/itm/John-Deere-300D-310D-315D-TRANSMISSION-REVERSER-SOLENOID-ASSEMBLY-EARLY-AT163601-/361435304759?hash=item5427391337:g:5rsAAOSwnipWXft4 16
I AM HERE http://www.ebay.com/itm/Bobcat-743-Service-Manual-Book-Skid-steer-6566109-/131768685855?hash=item1eae06951f:g:rgcAAOSwQgpW~nqW 17
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320C-312c-330c-325c-1573198-157-3198-panel-/112063225844?hash=item1a177d1ff4:g:BtgAAOSwepZXTfZ~ 18
I AM HERE http://www.ebay.com/itm/Ford-NEW-HOLLAND-Loader-BACKHOE-Hydraulic-pump-550-535-555-D1NN600B-Cessna-/360202190657?hash=item53ddb93f41:g:3gkAAOSwPgxVP5VF 19
I AM HERE http://www.ebay.com/itm/87435827-New-Case-590SL-590SM-Series-1-2-Backhoe-Loader-Hydraulic-oil-Pump-14S-/131992359553?hash=item1ebb5b9281:g:KQEAAOSwBLlVVP0X 20
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-2475227-247-5227-/111677605339?hash=item1a008105db:g:stsAAOSwNSxVX4kG 21
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-938H-950H-962H-416E-Wheel-Loader-Locking-Fuel-Tank-Cap-2849039-/111446084638?hash=item19f2b44c1e:g:u0IAAOxy1klRdqOQ 22
I AM HERE http://www.ebay.com/itm/FORD-BACKHOE-Hydraulic-pump-555C-555D-655D-E7NN600CA-/361376010222?hash=item5423b04fee:g:UdkAAOSwu4BV4J6T 23
I AM HERE http://www.ebay.com/itm/John-Deere-Excavator-AT154524-High-Speed-Solenoid-valve-490E-790ELC-790E-pump-/131623918235?hash=item1ea5659a9b:g:o-EAAOSwo0JWF~PC 24
I AM HERE http://www.ebay.com/itm/John-Deere-350C-450C-Dozer-Loader-Arm-Rest-PAIR-SEAT-/360164308266?hash=item53db77352a:m:m-79tleHP2PC3zD-HqRPMQw 25
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-D3-D3B-D3C-D4B-D4C-D4H-D5C-Dozer-3204-Engine-water-pump-NEW-/112061839578?hash=item1a1767f8da:g:6x0AAOSwIgNXjkNm 26
I AM HERE http://www.ebay.com/itm/International-IH-TD5-OLD-Crawler-Dozer-Seat-cushions-/110840656548?hash=item19ce9e32a4:m:mu5f6-grIZNQVtDoLSDcDJg 27
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D3C-Series-III-D4G-D4H-8E4148-Arm-rests-rest-cushion-Dozer-seat-/131827423319?hash=item1eb186d857:g:JxMAAOSwQaJXRdzW 28
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320C-321C-322C-325C-260-2160-2602160-gauge-/112014409886?hash=item1a1494409e:g:BtgAAOSwepZXTfZ~ 29
I AM HERE http://www.ebay.com/itm/John-Deere-JD-NON-Turbo-Muffler-AT83613-210C-300D-310C-310D-315C-315D-400G-410B-/361917008791?hash=item5443ef4b97:g:U0wAAOSw~CRTpFsn 30
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310D-Shuttle-transmission-Overhaul-Kit-With-Pump-Forward-Reverse-/361916993624?hash=item5443ef1058:g:8cUAAOSwDNdVp7-1 31
I AM HERE http://www.ebay.com/itm/AT318659-AT139444-John-Deere-Loader-Brake-Hydraulic-Pump-NEW-SURPLUS-544E-544G-/132040240495?hash=item1ebe362d6f:g:mRMAAOSwJ7RYWWUF 32
I AM HERE http://www.ebay.com/itm/Hitachi-EX60-PLAIN-Excavator-PARTS-Manual-Book-P10717-P107E16-Machine-Comp-/132110375418?hash=item1ec26459fa:g:rbwAAOSwPe1UAQal 33
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D2-ENGINE-SERVICE-REPAIR-manual-book-D311-212-motor-grader-/360724733057?hash=item53fcde9c81:m:mfYRAKtemeCg_HnjxHAiO0w 34
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-312C-315C-318C-319C-260-2160-2602160-gauge-/131833751423?hash=item1eb1e7677f:g:BtgAAOSwepZXTfZ~ 35
I AM HERE http://www.ebay.com/itm/121335A1-Case-580L-580L-Series-2-Backhoe-Throttle-Cable-BENT-77-75-LONG-BEND-/361891435313?hash=item5442691331:g:lgcAAOSwhOdXogxu 36
I AM HERE http://www.ebay.com/itm/Heavy-Construction-Equipment-21-Key-Set-Cat-Case-Deere-Komatsu-Volvo-Truck-Laser-/111018804148?hash=item19d93c83b4:m:mm5Eephzc48HDdiNjCCaxtg 37
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-320B-322B-325B-throttle-motor-governor-2475232-247-5232-5-pin-/112183024608?hash=item1a1ea11be0:g:4bUAAOSwXeJYESNh 38
I AM HERE http://www.ebay.com/itm/John-Deere-REAR-Window-BOTTOM-300D-310D-310E-410D-410E-510D-710D-Backhoe-T132952-/111788475468?hash=item1a071cc44c:m:mM6nkmXre_mrGj9gBQbSQHQ 39
I AM HERE http://www.ebay.com/itm/JD-John-Deere-200CLC-120CLC-Excavator-Cab-Front-Upper-Glass-Window-4602562-120C-/361479558328?hash=item5429dc54b8:g:WvEAAOSw2s1Uz-er 40
I AM HERE http://www.ebay.com/itm/Hitachi-Excavator-Front-Lower-Glass-Window-4369588-/110718985349?hash=item19c75da485:m:mettchbVo-QopfqTgIqtY3g 41
I AM HERE http://www.ebay.com/itm/Caterpillar-D6M-D6N-D6R-D8R-Suspension-Seat-6W9744-Cat-/361294230211?hash=item541ed072c3:g:3wAAAOSwNSxVULZJ 42
I AM HERE http://www.ebay.com/itm/Komatsu-D20A-3-D20P-7-D21P-7-Dozer-Track-Adjuster-Seal-Kit-909036-WITH-BUSHING-/132165283763?hash=item1ec5aa2fb3:g:-0MAAOSwdzVXl3CN 43
I AM HERE http://www.ebay.com/itm/Locking-Fuel-Cap-John-Deere-310S-310SE-410E-backhoe-AT176378-NEW-310-S-SE-410-E-/361853261989?hash=item54402298a5:g:NUIAAOSwOtdYUEnj 44
I AM HERE http://www.ebay.com/itm/John-Deere-450G-455G-550G-555G-650G-Dozer-Loader-Arm-Rest-rests-/361912161141?hash=item5443a55375:g:7rkAAOSw3xJVVhwe 45
I AM HERE http://www.ebay.com/itm/John-Deere-AT418735-RIGHT-bucket-Handle-CT322-240-250-260-270-Skid-Steer-loader-/112335938162?hash=item1a27be6272:g:A2MAAOSwTM5YyYCc 46
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Tooth-Penetration-Rock-Tip-220-9092-2209092-320C-320D-325C-325D-/361928972291?hash=item5444a5d803:g:nGsAAOxy4YdTV~Qx 47
I AM HERE http://www.ebay.com/itm/John-Deere-AT418734-LEFT-Bucket-Handle-CT322-240-250-260-270-Skid-Steer-loader-/132127244893?hash=item1ec365c25d:g:5doAAOSwax5YyYAH 48
I AM HERE http://www.ebay.com/itm/4N9618-CAT-Caterpillar-977L-966C-235-D6C-3306-ENGINE-caterpiller-dozer-loader-/112360381857?hash=item1a29335da1:g:dLsAAOSwuLZY5lPU 49
I AM HERE http://www.ebay.com/itm/Bobcat-763-763F-Service-Manual-Book-Skid-steer-6900091-repair-shop-book-/131531875901?hash=item1e9fe9263d:g:VUsAAOxyOlhS0EiN 50
I AM NEXT PAGE
2017-05-13 22:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/itm/Bobcat-763-763F-Service-Manual-Book-Skid-steer-6900091-repair-shop-book-/131531875901?hash=item1e9fe9263d:g:VUsAAOxyOlhS0EiN> (referer: http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562)
Iam in parselink 2
2017-05-13 22:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc> (referer: http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562)
I AM HERE http://www.ebay.com/itm/Hitachi-EX120-3-Excavator-Service-Technical-WorkShop-Manual-Shop-KM135E00-/361971788377?hash=item5447332a59:g:uXEAAMXQEgpTERZv 51
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-320B-320BL-320BLN-321B-322BL-325BL-106-0172-/112208711245?hash=item1a20290e4d:g:NMwAAOSwX~dWomWJ 52
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-D4D-Seat-Cushion-Set-Arm-Rest-Dozer-9M6702-8K9100-3K4403-NEW-/111027276253?hash=item19d9bdc9dd:g:taYAAMXQhuVROmSf 53
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-UPPER-RH-LH-85801626-/111004314632?hash=item19d85f6c08:g:kSkAAOxyzHxRL8~e 54
I AM HERE http://www.ebay.com/itm/187-8391-1878391-Caterpillar-Cat-Oil-Cooler-939C-D4C-D5C-933C-D3C-Series-3-/132036431899?hash=item1ebdfc101b:g:VhQAAOSw3YNXYtcn 55
I AM HERE http://www.ebay.com/itm/A137187-CASE-BACKHOE-Power-Steering-pump-480B-580B-530-NEW-A36559-/132028859390?hash=item1ebd8883fe:g:HMsAAOSwzOxUWpVL 56
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-953-7N5538-Exhaust-flex-pipe-EARLY-S-N-/361407787737?hash=item54259532d9:g:n3YAAOSwo6lWHQOL 57
I AM HERE http://www.ebay.com/itm/LINKBELT-Excavator-locking-Fuel-Cap-with-keys-KHH0140-/131504146758?hash=item1e9e420946:g:FHUAAOSwPhdVSLkJ 58
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-D4H-D5H-D6D-EXHAUST-PIPE-LOCKING-RAIN-CAP-5-INCH-/131962111459?hash=item1eb98e05e3:g:0fgAAOSwpLNX9qT1 59
I AM HERE http://www.ebay.com/itm/Caterpillar-CAT-Dozer-D5C-D5G-rear-sprocket-segments-NEW-1979677-1979678-CR6602-/361403972171?hash=item54255afa4b:g:qJsAAOSwLqFV9tkk 60
I AM HERE http://www.ebay.com/itm/John-Deere-4265372-RPM-sensor-110-120-160C-200C-330CLC-490E-790ELC-892E-HITACHI-/131567763291?hash=item1ea20cbf5b:g:PZYAAOSwPcVVup-H 61
I AM HERE http://www.ebay.com/itm/CATERPILLAR-D3B-931B-arm-rests-9C4136-5G2621-/360160327148?hash=item53db3a75ec:m:mY4iFhRua2zcfV6IL5i8csQ 62
I AM HERE http://www.ebay.com/itm/Bobcat-864-Operation-Maintenance-Manual-Book-6900953-operator-skid-steer-Track-/131664897965?hash=item1ea7d6e7ad:g:exkAAOSwcBhWXem~ 63
I AM HERE http://www.ebay.com/itm/Case-550G-650G-750G-850G-1150G-arm-rests-194738A1-144427A1-seat-cushion-crawler-/112393155898?hash=item1a2b27753a:g:GVEAAOSw5L9XDoN- 64
I AM HERE http://www.ebay.com/itm/7834-41-3002-7834-41-3003-Komatsu-PC300-7-PC360-7-PC400-7-Throttle-motor-/132135899267?hash=item1ec3e9d083:g:ulMAAOSw4A5Y1Agl 65
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-955H-Crawler-Loader-Dozer-Parts-Manual-Book-NEW-60A8413-and-up-/361855690487?hash=item544047a6f7:g:FeUAAOSwux5YVDfu 66
I AM HERE http://www.ebay.com/itm/Case-580CK-530-530ck-2wd-Power-Steering-cylinder-A37859-A37509-/111184835276?hash=item19e321f2cc:g:h~QAAOxyGstR8DSu 67
I AM HERE http://www.ebay.com/itm/Case-Backhoe-580-SUPER-L-580L-590SL-Radiator-234876A1-234876A2-Metal-tank-580SL-/111646548306?hash=item19fea72152:g:3igAAOxyI8lR8TnL 68
I AM HERE http://www.ebay.com/itm/Dresser-International-TD7C-TD8C-TD7E-TD12-TD15E-Dozer-Fuel-Cap-701922C2-103768C1-/132062834112?hash=item1ebf8eedc0:g:-CEAAOSwImRYeOug 69
I AM HERE http://www.ebay.com/itm/JD-John-Deere-120-160LC-200LC-230LC-Excavator-Cab-Door-Lower-Glass-4383401-/360651229974?hash=item53f87d0b16:g:fhUAAMXQDfdRqPQ5 70
I AM HERE http://www.ebay.com/itm/New-Holland-LB75b-loader-backhoe-operators-manual-operator-operation-maintenance-/361287895632?hash=item541e6fca50:g:1WAAAOSwAvJW9X~t 71
I AM HERE http://www.ebay.com/itm/Bobcat-743-early-parts-Manual-Book-Skid-steer-loader-6566179-/112084996042?hash=item1a18c94fca:g:wAoAAOxykmZTNY92 72
I AM HERE http://www.ebay.com/itm/Dresser-TD15E-Operator-Maintenance-Manual-International-crawler-dozer-operation-/111385189587?hash=item19ef131cd3:g:qDYAAOSwnQhXohwA 73
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-REAR-BACK-85801632-/360573341694?hash=item53f3d88ffe:g:nDQAAOxyyF5RL9H2 74
I AM HERE http://www.ebay.com/itm/DEERE-160LC-200LC-230LC-330LC-370-GLASS-LOWER-AT214097-/361070972976?hash=item541181d030:m:mettchbVo-QopfqTgIqtY3g 75
I AM HERE http://www.ebay.com/itm/John-Deere-NEW-Turbocharger-turbo-545D-590D-595-495D-EXCAVATOR-JD-RE26342-NEW-/131458659790?hash=item1e9b8bf5ce:g:3c4AAOxyu4dRwzW4 76
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-FRONT-LOWER-LH-85801623-/361342507318?hash=item5421b11936:g:ZbYAAOSwPcVVpsif 77
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-excavator-311B-312-312B-Stepping-Throttle-Motor-247-5231-1190633-/132186922816?hash=item1ec6f45f40:g:hBUAAOSwpdpVX4DS 78
I AM HERE http://www.ebay.com/itm/Cat-Caterpillar-Excavator-Monitor-330C-260-2160-2602160-gauge-/361578440228?hash=item542fc12624:g:BtgAAOSwepZXTfZ~ 79
I AM HERE http://www.ebay.com/itm/John-Deere-210C-310D-Shuttle-Reverser-Overhaul-Kit-With-Pump-Forward-Reverse-/131963132435?hash=item1eb99d9a13:g:8cUAAOSwDNdVp7-1 80
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Multi-Terrain-Skid-Steer-Loader-Suspension-seat-cushion-kit-/360880511219?hash=item54062798f3:m:m5Tt8bBvIax8MVfT4VqcQgA 81
I AM HERE http://www.ebay.com/itm/Case-310G-Crawler-Tractor-4pc-Seat-Cushion-set-/361381166532?hash=item5423fefdc4:g:hzAAAOSwSdZWdHZS 82
I AM HERE http://www.ebay.com/itm/International-IH-500-OLD-Crawler-Dozer-Seat-cushions-/110598250697?hash=item19c02b60c9:g:DQ0AAMXQTT9RwIuh 83
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Excavator-Locking-Fuel-Cap-0963100-key-E110-E120-E70B-E110B-312-/110702080613?hash=item19c65bb265:g:pLwAAOxy2YtRwx2L 84
I AM HERE http://www.ebay.com/itm/Fuel-Cap-Case-Backhoe-Skid-Steer-1845c-1845-1840-1835-1835b-1835c-diesel-or-gas-/132102578719?hash=item1ec1ed621f:g:~IcAAOSwgZ1Xvyk9 85
I AM HERE http://www.ebay.com/itm/87433897-New-Case-580SL-580SM-580SL-Series-1-2-Backhoe-Hydraulic-Pump-14-Spline-/112192774351?hash=item1a1f35e0cf:g:KQEAAOSwBLlVVP0X 86
I AM HERE http://www.ebay.com/itm/Case-580K-580SK-580L-580SL-BACKHOE-Right-Door-Rear-Hinged-Window-Glass-R52882-/111777519523?hash=item1a067597a3:m:mUh405BlfpMRnDzu0J8qEEw 87
I AM HERE http://www.ebay.com/itm/Case-backhoe-door-spring-580E-580K-580SK-580SL-580SL-SERIES-2-580L-F44881-/111485899971?hash=item19f513d4c3:m:mpgpGQ1o0j_2ewhNIMMA53w 88
I AM HERE http://www.ebay.com/itm/FORD-555E-575E-655E-675E-BACKHOE-GLASS-WINDOW-DOOR-LOWER-LH-85801625-/111002325387?hash=item19d841118b:g:HrIAAMXQySpRL9SJ 89
I AM HERE http://www.ebay.com/itm/International-Dresser-TD8E-Dozer-4pc-Seat-Cushion-set-TD8C-IH-/131522416031?hash=item1e9f58cd9f:g:qC0AAOSwqBJXUJIL 90
I AM HERE http://www.ebay.com/itm/John-Deere-450G-550G-650G-Crawler-Dozer-Operators-Manual-Maintenance-OMT163974-/132190364513?hash=item1ec728e361:g:lUAAAOxygPtS59xJ 91
I AM HERE http://www.ebay.com/itm/Heavy-Construction-Equipment-key-set-excavator-bull-dozer-broom-forklift-loaders-/110751342295?hash=item19c94b5ed7:m:mm5Eephzc48HDdiNjCCaxtg 92
I AM HERE http://www.ebay.com/itm/International-IH-Dresser-TD15B-TD15C-Crawler-Loader-Seat-Cushion-set-4-pieces-/111731372191?hash=item1a03b5709f:g:TrAAAOSwDNdVu5He 93
I AM HERE http://www.ebay.com/itm/Caterpillar-Cat-Skid-Steer-loader-Suspension-COMPLETE-Seat-247-247B-more-/131069185959?hash=item1e84550fa7:g:kQYAAOxy4dNSqIYD 94
I AM HERE http://www.ebay.com/itm/John-Deere-JD-Loader-Backhoe-710D-310g-310E-310J-310K-Hydraulic-charge-Pump-/131129131733?hash=item1e87e7c2d5:g:zFQAAOxy9eVRJ9cw 95
I AM HERE http://www.ebay.com/itm/Case-480E-480ELL-LANDSCAPE-Backhoe-4x4-4wd-FRONT-RIM-wheel-New-D126930-12-X-16-5-/360913564299?hash=item54081ff28b:m:mYte9AXdktKLD9H-HOFJthQ 96
I AM HERE http://www.ebay.com/itm/Bobcat-763F-763-Operation-Maintenance-Manual-operator-owner-6900788-/360337555830?hash=item53e5cac176:g:4IQAAOxy4dNSxZHP 97
I AM HERE http://www.ebay.com/itm/Bobcat-753H-753-H-Service-Manual-Book-Skid-steer-loader-6900090-/131522633242?hash=item1e9f5c1e1a:g:1JEAAOxyUrZS-j4Q 98
I AM HERE http://www.ebay.com/itm/John-Deere-JD-550-Crawler-Dozer-Parts-Manual-PC1437-/131985496504?hash=item1ebaf2d9b8:g:GkIAAOSwPgxVLR7f 99
I AM HERE http://www.ebay.com/itm/Case-IH-580D-580SE-580SD-Backhoe-Rear-Closure-Panel-Cab-Glass-Window-CG3116-NEW-/111070117033?hash=item19dc4b7ca9:g:jHEAAOxykVNRwL34 100
I AM NEXT PAGE
2017-05-13 22:44:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/itm/Case-IH-580D-580SE-580SD-Backhoe-Rear-Closure-Panel-Cab-Glass-Window-CG3116-NEW-/111070117033?hash=item19dc4b7ca9:g:jHEAAOxykVNRwL34> (referer: http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc)
Iam in parselink 3
2017-05-13 22:44:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=3&_skc=100&rt=nc> (referer: http://www.ebay.com/sch/m.html?item=132127244893&_ssn=hfinney&_pgn=2&_skc=50&rt=nc)
I AM HERE http://www.ebay.com/itm/John-Deere-Hitachi-Zaxis-110-120-160-200-225-230-Alternator-1812005304-Excavator-/360495635483?hash=item53ef36dc1b:m:mqifohjA-IWXcIg_oWMee1Q 101
I AM HERE http://www.ebay.com/itm/CAT-Caterpillar-955H-Crawler-Loader-Dozer-Parts-Manual-Book-NEW-60A8413-and-up-/361855690487?hash=item54404
EDIT:
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ebay project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ebay'
SPIDER_MODULES = ['ebay.spiders']
NEWSPIDER_MODULE = 'ebay.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ebay (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ebay.middlewares.EbaySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ebay.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ebay.pipelines.EbayPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
items.py
import scrapy
from scrapy.item import Item, Field
class EbayItem(scrapy.Item):
    NAME = scrapy.Field()
    MPN = scrapy.Field()
    ITEMID = scrapy.Field()
    PRICE = scrapy.Field()
    FREIGHT_1_for_quan_1 = scrapy.Field()
    FREIGHT_2_for_quan_2 = scrapy.Field()
    DATE = scrapy.Field()
    QUANTITY = scrapy.Field()
    CATAGORY = scrapy.Field()
    SUBCATAGORY = scrapy.Field()
    SUBCHILDCATAGORY = scrapy.Field()
pipelines.py (although I have not touched this file):
class EbayPipeline(object):
    def process_item(self, item, spider):
        return item
Middleware.py (I have not touched this file either):
from scrapy import signals


class EbaySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Solution: no fix needed; it seems to be working fine.
I quickly ran your code (with only slight modifications, like removing the global vars and replacing EbayItem) and it works fine and visits all the URLs you are creating.
Explanation / what's going on here:
I suspect your scraper is scheduling the URLs in a way that makes it appear as if it is not visiting all the links. But it will, just later.
I suspect you have set CONCURRENT_REQUESTS = 2. That's why Scrapy schedules only 2 of the 51 URLs to be processed next. Among these 2 URLs there is the next-page URL, which creates another 51 requests, and these new requests push the old 49 requests further back in the queue ... and so on it goes until there are no more "next" links.
If you run the scraper long enough, you will see that all links are visited sooner or later. Most probably the 49 "missing" requests that were created first will be visited last.
Also, you can remove the creation of the next_page request to see whether all 50 links on a page are visited.
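If the ordering matters to you and you want the ~50 product pages of the current listing fetched before the spider moves on, one option (a sketch, not required for correctness) is to give the pagination request a lower priority than the product requests, since higher-priority requests are dequeued first (the default priority is 0):

        if next_page:
            # schedule the next listing page behind the product pages
            yield scrapy.Request(urljoin(response.url, next_page),
                                 callback=self.parse, priority=-1)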
I have wasted days trying to get my mind around Scrapy, reading the docs and other Scrapy blogs and Q&As ... and now I am about to do what men hate most: ask for directions ;-) The problem is: my spider opens, fetches the start_urls, but apparently does nothing with them. Instead, it closes immediately, and that was that. Apparently I do not even get to the first self.log() statement.
What I've got so far is this:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *
class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )

    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.
When doing scrapy crawl KiSpider, this results in:
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider)
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (info@defrent.de)', 'DOWNLOAD_DELAY': 0.25}
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 465,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 48998,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)}
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished)
Is it that the login routine should not end with a callback but with some kind of return/yield statement? Or what am I doing wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how every bit connects to the others; Scrapy's docs in particular seem to be written as a reference for people who already know a lot about Scrapy.
Somewhat frustrated greetings
Christopher
rules = (
    # First rule that matches a given link is followed / parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)
You do not need the allow parameter, because there is only one link in the cell selected by the XPath.
I do not understand the regex in the allow parameter, but at the very least you should escape the ?.
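For illustration, here is a minimal sketch of what the corrected pagination rule could look like. The XPath is taken from the question; the commented allow pattern is only there to show the escaped ? and a fixed alternation, and is an assumption about what was intended:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Sketch of the corrected pagination rule: restrict_xpaths alone is enough
# if the pagination cell contains only the next-page link.
pagination_rule = Rule(
    LinkExtractor(
        # If you still want the allow filter, escape the "?" and drop the stray "]":
        # allow=r'Default\.aspx\?pageid=(202|206)&page=\d+',
        restrict_xpaths='//td[@id="ctl04_teaser_next"]',
    ),
    follow=True,
)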
My intention is to invoke the start_requests method to log in to the website and then, after login, scrape the website. Based on the log messages, I see that:
1. start_requests is not invoked.
2. The parse callback is not invoked either.
What actually happens is that the spider only loads the URLs in start_urls.
Question:
Why is the spider not crawling through the other pages (say page 2, 3, 4)?
Why is logging in from the spider not working?
Note:
My method for calculating the page number and building the URL is correct. I verified it.
I referred to this link to write this code: Using loginform with scrapy
My code:
zauba.py (spider)
#!/usr/bin/env python
import math    # needed for math.ceil below
import scrapy  # needed for scrapy.Request below
from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http.request import Request
from loginform import fill_login_form
import logging
logger = logging.getLogger('Zauba')
class zauba(CrawlSpider):
name = 'Zauba'
login_url = 'https://www.zauba.com/user'
login_user = 'scrapybot1@gmail.com'
login_password = 'scrapybot1'
logger.info('zauba')
start_urls = ['https://www.zauba.com/import-gold/p-1-hs-code.html']
def start_requests(self):
logger.info('start_request')
# let's start by sending a first request to login page
yield scrapy.Request(self.login_url, callback = self.parse_login)
def parse_login(self, response):
logger.warning('parse_login')
# got the login page, let's fill the login form...
data, url, method = fill_login_form(response.url, response.body,
self.login_user, self.login_password)
# ... and send a request with our login data
return FormRequest(url, formdata=dict(data),
method=method, callback=self.start_crawl)
def start_crawl(self, response):
logger.warning('start_crawl')
# OK, we're in, let's start crawling the protected pages
for url in self.start_urls:
yield scrapy.Request(url, callback=self.parse)
def parse(self, response):
logger.info('parse')
text = response.xpath('//div[@id="block-system-main"]/div[@class="content"]/div[@style="width:920px; margin-bottom:12px;"]/span/text()').extract_first()
total_entries = int(text.split()[0].replace(',', ''))
total_pages = int(math.ceil((total_entries*1.0)/30))
logger.warning('*************** : ' + str(total_pages))
print('*************** : ' + str(total_pages))
for page in xrange(1, (total_pages + 1)):
url = 'https://www.zauba.com/import-gold/p-' + str(page) + '-hs-code.html'
logger.info('url%d : %s' % (page, url))
yield scrapy.Request(url, callback=self.extract_entries)
def extract_entries(self, response):
logger.warning('extract_entries')
row_trs = response.xpath('//div[@id="block-system-main"]/div[@class="content"]/div/table/tr')
for row_tr in row_trs[1:]:
row_content = row_tr.xpath('.//td/text()').extract()
if len(row_content) == 9:
print row_content
yield {
'date' : row_content[0].replace(' ', ''),
'hs_code' : int(row_content[1]),
'description' : row_content[2],
'origin_country' : row_content[3],
'port_of_discharge' : row_content[4],
'unit' : row_content[5],
'quantity' : int(row_content[6].replace(',', '')),
'value_inr' : int(row_content[7].replace(',', '')),
'per_unit_inr' : int(row_content[8].replace(',', '')),
}
loginform.py
#!/usr/bin/env python
import sys
from argparse import ArgumentParser
from collections import defaultdict
from lxml import html
__version__ = '1.0' # also update setup.py
def _form_score(form):
score = 0
# In case of user/pass or user/pass/remember-me
if len(form.inputs.keys()) in (2, 3):
score += 10
typecount = defaultdict(int)
for x in form.inputs:
type_ = x.type if isinstance(x, html.InputElement) else 'other'
typecount[type_] += 1
if typecount['text'] > 1:
score += 10
if not typecount['text']:
score -= 10
if typecount['password'] == 1:
score += 10
if not typecount['password']:
score -= 10
if typecount['checkbox'] > 1:
score -= 10
if typecount['radio']:
score -= 10
return score
def _pick_form(forms):
"""Return the form most likely to be a login form"""
return sorted(forms, key=_form_score, reverse=True)[0]
def _pick_fields(form):
"""Return the most likely field names for username and password"""
userfield = passfield = emailfield = None
for x in form.inputs:
if not isinstance(x, html.InputElement):
continue
type_ = x.type
if type_ == 'password' and passfield is None:
passfield = x.name
elif type_ == 'text' and userfield is None:
userfield = x.name
elif type_ == 'email' and emailfield is None:
emailfield = x.name
return (userfield or emailfield, passfield)
def submit_value(form):
"""Returns the value for the submit input, if any"""
for x in form.inputs:
if x.type == 'submit' and x.name:
return [(x.name, x.value)]
else:
return []
def fill_login_form(
url,
body,
username,
password,
):
doc = html.document_fromstring(body, base_url=url)
form = _pick_form(doc.xpath('//form'))
(userfield, passfield) = _pick_fields(form)
form.fields[userfield] = username
form.fields[passfield] = password
form_values = form.form_values() + submit_value(form)
return (form_values, form.action or form.base_url, form.method)
def main():
ap = ArgumentParser()
ap.add_argument('-u', '--username', default='username')
ap.add_argument('-p', '--password', default='secret')
ap.add_argument('url')
args = ap.parse_args()
try:
import requests
except ImportError:
print 'requests library is required to use loginform as a tool'
r = requests.get(args.url)
(values, action, method) = fill_login_form(args.url, r.text,
args.username, args.password)
print '''url: {0}
method: {1}
payload:'''.format(action, method)
for (k, v) in values:
print '- {0}: {1}'.format(k, v)
if __name__ == '__main__':
sys.exit(main())
The Log Message:
2016-10-02 23:31:28 [scrapy] INFO: Scrapy 1.1.3 started (bot: scraptest)
2016-10-02 23:31:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraptest.spiders', 'FEED_URI': 'medic.json', 'SPIDER_MODULES': ['scraptest.spiders'], 'BOT_NAME': 'scraptest', 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:39.0) Gecko/20100101 Firefox/39.0', 'FEED_FORMAT': 'json', 'AUTOTHROTTLE_ENABLED': True}
2016-10-02 23:31:28 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2016-10-02 23:31:28 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-02 23:31:28 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-02 23:31:28 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-02 23:31:28 [scrapy] INFO: Spider opened
2016-10-02 23:31:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-02 23:31:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-02 23:31:29 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/robots.txt> (referer: None)
2016-10-02 23:31:38 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/import-gold/p-1-hs-code.html> (referer: None)
2016-10-02 23:31:38 [scrapy] INFO: Closing spider (finished)
2016-10-02 23:31:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 558,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 136267,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 3, 6, 31, 38, 560012),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 3, 6, 31, 28, 927872)}
2016-10-02 23:31:38 [scrapy] INFO: Spider closed (finished)
I figured out the crappy mistake I made!!!!
I didn't place the functions inside the class. That's why things didn't work as expected. Now I have added a level of indentation to all the functions and things started to work fine.
Thanks @user2989777 and @Granitosaurus for coming forward to debug
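For anyone hitting the same symptom: methods only become part of the spider when they are indented inside the class body; at module level they are plain functions, and Scrapy silently falls back to the inherited defaults. A minimal sketch of the difference (the URL is just a placeholder):
import scrapy

class IndentedSpider(scrapy.Spider):
    name = 'indented'
    start_urls = ['https://example.com/']  # placeholder URL for illustration

    # Indented one level inside the class: this overrides Spider.parse
    # and is actually called for each response.
    def parse(self, response):
        yield {'url': response.url}

# If "def parse" sat here at module level (no indentation), it would be an
# ordinary function the spider never sees, and the inherited behaviour would run instead.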
Scrapy already has a form request helper called FormRequest.
In most cases it will find the correct form by itself. You can try:
>>> scrapy shell "https://www.zauba.com/import-gold/p-1-hs-code.html"
from scrapy import FormRequest
login_data = {'name': 'mylogin', 'pass': 'mypass'}
request = FormRequest.from_response(response, formdata=login_data)
print(request.body)
# b'form_build_id=form-Lf7bFJPTN57MZwoXykfyIV0q3wzZEQqtA5s6Ce-bl5Y&form_id=user_login_block&op=Log+in&pass=mypass&name=mylogin'
Once you log in, any requests chained afterwards will have the session cookie attached to them, so you only need to log in once at the beginning of your chain.
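Put together, that flow could look roughly like the sketch below; the URLs and the name/pass field names come from the question, while the spider name and callback names are just illustrative:
import scrapy
from scrapy import FormRequest

class ZaubaLoginSpider(scrapy.Spider):
    name = 'zauba_login'                          # illustrative name
    start_urls = ['https://www.zauba.com/user']   # login page first

    def parse(self, response):
        # Let FormRequest locate the login form on the page and fill it in.
        yield FormRequest.from_response(
            response,
            formdata={'name': 'mylogin', 'pass': 'mypass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookie is now stored by the cookies middleware and is
        # attached to every request chained from here.
        yield scrapy.Request(
            'https://www.zauba.com/import-gold/p-1-hs-code.html',
            callback=self.parse_listing,
        )

    def parse_listing(self, response):
        # ... extract the table rows as in the original spider ...
        pass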
I wrote a small spider, but when I run it, the pipeline is never called.
After debugging for a while, I found the buggy code area.
The logic of the spider is: I crawl the first URL to fetch a cookie, then I crawl the second URL to download the CAPTCHA picture using that cookie, and I post the data I have prepared to the third URL. If the text I read from the picture is wrong, I download the picture again and post to the third URL repeatedly, until I get the right text.
Let me show you the code:
# -*- coding: gbk -*-
import scrapy
from scrapy.http import FormRequest
import json
import os
from datetime import datetime
from scrapy.selector import Selector
from teacherCourse.handlePic import handle
from teacherCourse.items import DetailProfItem
from teacherCourse.items import DetailProfCourseItem
from teacherCourse.items import containItem
class GetTeacherCourseSpider(scrapy.Spider):
name = 'TeacherCourse'
# custom_settings = {
# 'ITEM_PIPELINES': {
# 'teacherCourse.pipelines.TeacherCoursePipeline': 300,
# }
# }
def __init__(self, selXNXQ='', titleCode=''):
self.getUrl = 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx' # first
self.vcodeUrl = 'http://jwxt.dgut.edu.cn/jwweb/sys/ValidateCode.aspx' # second
self.postUrl = 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB_rpt.aspx' # third
self.findSessionId = None # to save the cookies
self.XNXQ = selXNXQ
self.titleCode = titleCode
def start_requests(self):
request = scrapy.Request(self.getUrl,
callback = self.downloadPic)
yield request
def downloadPic(self, response):
# download the picture
# find the session id
self.findSessionId = response.headers.getlist('Set-Cookie')[0].decode().split(";")[0].split("=")
request = scrapy.Request(self.vcodeUrl,
cookies= {self.findSessionId[0]: self.findSessionId[1]},
callback = self.getAndHandleYzm)
yield request
def getAndHandleYzm(self, response):
yzm = handle(response.body)
yield FormRequest(self.postUrl,
formdata={'Sel_XNXQ': '20151',
'sel_zc': '011',
'txt_yzm': yzm,
'type': '2'},
headers={
'Referer': 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx',
'Cookie': self.findSessionId[0] + '=' + self.findSessionId[1],
},
callback=self.parse)
def parse(self, response):
body = response.body.decode('gbk')
num = body.find('alert')
if num != -1:
# means CAPTCHA validation fails, need to re-request the CAPTCHA
yield scrapy.Request(self.vcodeUrl+'?t='+'%.f' % (datetime.now().microsecond / 1000),
headers={
'Referer': 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx',
'Cookie': self.findSessionId[0]+'='+self.findSessionId[1]
},
callback=self.getAndHandleYzm)
else:
# parse data
self.parseData(body)
# item = containItem()
# item['first'] = len(body)
# return item
# the parse data part is a little bit long, but it doesn't matter.
# At the last line, I did yield a item
def parseData(self, body):
# parse body data
sel = Selector(text=body)
# get all the note text data
noteTables = sel.xpath('//table[@style="border:0px;"]').extract()
noteList = [] # to store all the note text
for noteTable in noteTables:
if '<b>' in noteTable:
sele = Selector(text = noteTable)
note = (sele.xpath('//table/tr/td/b/text()').extract())
noteText = (sele.xpath('//table/tr/td/text()').extract())
# combine note and noteText
if not noteText:
noteText.append('')
noteText.append('')
else:
if len(noteText) == 1:
noteText.append('')
noteList.append(noteText)
# get all the course data
courseTables = sel.xpath('//table[@class="page_table"]/tbody').extract()
AllDetailCourse = [] # all the teachers' course
for table in courseTables:
everyTeacherC = [] # every teacher's course
s = Selector(text = table)
trs = s.xpath('//tr').extract()
for tr in trs:
sel = Selector(text = tr)
snum = (sel.xpath('//td[1]/text()').extract())
course = (sel.xpath('//td[2]/text()').extract())
credit = (sel.xpath('//td[3]/text()').extract())
teachWay = (sel.xpath('//td[4]/text()').extract())
courseType = (sel.xpath('//td[5]/text()').extract())
classNum = (sel.xpath('//td[6]/text()').extract())
className = (sel.xpath('//td[7]/text()').extract())
stuNum = (sel.xpath('//td[8]/text()').extract())
week = (sel.xpath('//td[9]/text()').extract())
section = (sel.xpath('//td[10]/text()').extract())
location = (sel.xpath('//td[11]/text()').extract())
tmpList = []
tmpList.append(snum)
tmpList.append(course)
tmpList.append(credit)
tmpList.append(teachWay)
tmpList.append(courseType)
tmpList.append(classNum)
tmpList.append(className)
tmpList.append(stuNum)
tmpList.append(week)
tmpList.append(section)
tmpList.append(location)
# to know whether every variable is empty
detailCourse = []
for each in tmpList:
if not each:
each = ''
else:
each = each[0]
detailCourse.append(each)
everyTeacherC.append(detailCourse)
AllDetailCourse.append(everyTeacherC)
# get department, teacher, gender and title
sel = Selector(text = body)
temp1 = sel.xpath('//*[@group="group"]/table/tr/td/text()').extract()
# fill two tables, which will store in the database
i = 0
# every professor
for each in temp1:
tables = containItem() # all the data in every for loop to send to the pipeline
each = each.replace(u'\xa0', u' ')
each = each.split(' ')
depart = each[0].split('：')
teacher = each[1].split('：')
gender = each[2].split('：')
title = each[3].split('：')
# first table
profItem = DetailProfItem()
profItem['XNXQ'] = self.XNXQ
profItem['department'] = depart[1] # department
profItem['teacher'] = teacher[1] # teacher
profItem['gender'] = gender[1]
profItem['title'] = title[1]
profItem['note1'] = noteList[i][0]
profItem['note2'] = noteList[i][1]
tables['first'] = profItem # add the first table
# second table
# every professor's courses
profCourses = []
for j in range(len(AllDetailCourse[i])): # how many course for every professor
profCourseItem = DetailProfCourseItem() # every course for every professor
profCourseItem['snum'] = AllDetailCourse[i][j][0] # i means i-th professor, j means j-th course, third num means what position of the course
profCourseItem['course'] = AllDetailCourse[i][j][1]
profCourseItem['credit'] = AllDetailCourse[i][j][2]
profCourseItem['teachWay'] = AllDetailCourse[i][j][3]
profCourseItem['courseType'] = AllDetailCourse[i][j][4]
profCourseItem['classNum'] = AllDetailCourse[i][j][5]
profCourseItem['className'] = AllDetailCourse[i][j][6]
profCourseItem['stuNum'] = AllDetailCourse[i][j][7]
profCourseItem['week'] = AllDetailCourse[i][j][8]
profCourseItem['section'] = AllDetailCourse[i][j][9]
profCourseItem['location'] = AllDetailCourse[i][j][10]
profCourses.append(profCourseItem) # every professor's courses
tables['second'] = profCourseItem # add the second table
i += 1
yield tables
Any suggestions would be appreciated!
settings.py: (pipeline part)
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'teacherCourse.pipelines.TeacherCoursePipeline': 300,
}
items.py: (I don't think it matters)
# detail professor course message
class DetailProfCourseItem(scrapy.Item):
snum = scrapy.Field() # serial number
course = scrapy.Field()
credit = scrapy.Field()
teachWay = scrapy.Field()
courseType = scrapy.Field()
classNum = scrapy.Field()
className = scrapy.Field()
stuNum = scrapy.Field()
week = scrapy.Field()
section = scrapy.Field()
location = scrapy.Field()
# the third item which contain first and second item
class containItem(scrapy.Item):
first = scrapy.Field() # for fist table
second = scrapy.Field() # for second table
pipeline code:
class TeacherCoursePipeline(object):
def process_item(self, item, spider):
print('I am called!!!!!')
print(item)
return item
And when I run the spider with scrapy crawl TeacherCourse,
it outputs:
2016-07-19 17:39:18 [scrapy] INFO: Scrapy 1.1.0rc1 started (bot: teacherCourse)
2016-07-19 17:39:18 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'teacherCourse', 'NEWSPIDER_MODULE': 'teacherCourse.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['teacherCourse.spiders']}
2016-07-19 17:39:18 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
2016-07-19 17:39:18 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-19 17:39:18 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-19 17:39:18 [scrapy] INFO: Enabled item pipelines:
['teacherCourse.pipelines.TeacherCoursePipeline']
2016-07-19 17:39:18 [scrapy] INFO: Spider opened
2016-07-19 17:39:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-19 17:39:18 [scrapy] DEBUG: Crawled (404) <GET http://jwxt.dgut.edu.cn/robots.txt> (referer: None)
2016-07-19 17:39:18 [scrapy] DEBUG: Crawled (200) <GET http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx> (referer: None)
2016-07-19 17:39:19 [scrapy] DEBUG: Crawled (200) <GET http://jwxt.dgut.edu.cn/jwweb/sys/ValidateCode.aspx> (referer: http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx)
2016-07-19 17:39:19 [scrapy] DEBUG: Crawled (200) <POST http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB_rpt.aspx> (referer: http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx)
2016-07-19 17:39:19 [scrapy] INFO: Closing spider (finished)
2016-07-19 17:39:19 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1330,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 230886,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 19, 9, 39, 19, 861620),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 4,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 7, 19, 9, 39, 18, 774293)}
2016-07-19 17:39:19 [scrapy] INFO: Spider closed (finished)
The problem seems to be that the parse method only yields scrapy.Request objects, never scrapy.Item instances.
The else: branch calls the generator parseData(body) but doesn't use the data it can produce (namely containItem objects).
One way to solve this is to loop over the generator's results and yield them one by one:
def parse(self, response):
body = response.body.decode('gbk')
num = body.find('alert')
if num != -1:
# means CAPTCHA validation fails, need to re-request the CAPTCHA
yield scrapy.Request(self.vcodeUrl+'?t='+'%.f' % (datetime.now().microsecond / 1000),
headers={
'Referer': 'http://jwxt.dgut.edu.cn/jwweb/ZNPK/TeacherKBFB.aspx',
'Cookie': self.findSessionId[0]+'='+self.findSessionId[1]
},
callback=self.getAndHandleYzm)
else:
# parse data
for i in self.parseData(body):
yield i
# item = containItem()
# item['first'] = len(body)
# return item
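On Python 3 the same delegation can also be written in a single line with yield from, which forwards every item the generator produces; only the else: branch changes:
else:
    # parse data and forward every containItem the generator yields
    yield from self.parseData(body)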