Python / Scrapy: CrawlSpider stops after fetching start_urls

I have wasted days trying to get my mind around Scrapy, reading the docs and other Scrapy blogs and Q&As ... and now I am about to do what men hate most: ask for directions ;-) The problem is: my spider opens, fetches the start_urls, but apparently does nothing with them. Instead it closes immediately, and that was that. Apparently, I do not even get to the first self.log() statement.
What I've got so far is this:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *
class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )
    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = article.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = article.xpath("div[@class='ct2']/text()").extract()
            item['article'] = article.extract()
            item['published'] = article.xpath("div[@class='biblio']/text()").re(r"(\d{2}\.\d{2}\.\d{4}) PIE")
            item['artid'] = article.xpath("div[@class='biblio']/text()").re(r"PIE \[(\d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.
When doing scrapy crawl KiSpider, this results in:
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider)
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (info@defrent.de)', 'DOWNLOAD_DELAY': 0.25}
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 465,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 48998,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)}
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished)
Is it that the login routine should not end with a callback, but with some kind of return/yield statement? Or what else am I doing wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how every bit connects to the others; Scrapy's docs in particular seem to be written as a reference for people who already know a lot about Scrapy.
Somewhat frustrated greetings
Christopher

rules = (
    # First rule that matches a given link is followed / parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)
You do not need the allow parameter, because there is only one link in the tag selected by the XPath.
I do not fully understand the regex in your allow parameter, but at the very least you should escape the ?.
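For reference, a corrected pair of patterns might look like the sketch below. This is only a guess based on the URLs quoted in start_urls, not something tested against the site: the literal ? and . need escaping, the stray ] inside (202|206]) has to go, and the pattern should use lowercase default to match the actual URLs, since Python regexes are case-sensitive by default.
# Hypothetical corrected patterns for the two rules:
allow=r'default\.aspx\?pageid=(202|206)&page=\d+'   # pagination rule
allow=r'default\.aspx\?pageid=299&docid=\d+'        # article rule
As an aside, regarding the return/yield comment in parse_item: a callback may return a list of items, or yield them one at a time as a generator, but yielding the whole items list as a single object fails, because Scrapy then sees a list where it expects an item or request; yield item inside the loop is the usual pattern.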

Related

pyquery response.body retrieve div elements

I am trying to write a web crawler using Scrapy and PyQuery. The full spider code is as follows.
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class gotspider(CrawlSpider):
    name = 'gotspider'
    allowed_domains = ['fundrazr.com']
    start_urls = ['https://fundrazr.com/find?category=Health']
    rules = [
        Rule(LinkExtractor(allow=('/find/category=Health')), callback='parse', follow=True)
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print(response.body)
The web page skeleton
<div id="header">
  <h2 class="title"> Township </h2>
  <p><strong>Client: </strong> Township<br>
     <strong>Location: </strong>Pennsylvania<br>
     <strong>Size: </strong>54,000 SF</p>
</div>
Output of the crawler: the crawler fetches the requested URL and hits the correct web target, but the parse_item or parse method never receives the response, so response.url is not printed. I tried to verify this by running the spider without logs (scrapy crawl rsscrach --nolog), but nothing is printed. The problem is very granular.
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: rsscrach)
2017-11-26 18:07:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'rsscrach', 'NEWSPIDER_MODULE': 'rsscrach.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['rsscrach.spiders']}
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-26 18:07:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-26 18:07:12 [scrapy.core.engine] INFO: Spider opened
2017-11-26 18:07:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-26 18:07:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-11-26 18:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/robots.txt> (referer: None)
2017-11-26 18:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fundrazr.com/find?category=Health> (referer: None)
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-26 18:07:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 605,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13510,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 26, 10, 7, 15, 46516),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 52465664,
'memusage/startup': 52465664,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 11, 26, 10, 7, 12, 198182)}
2017-11-26 18:07:15 [scrapy.core.engine] INFO: Spider closed (finished)
How do I get the Client, Location and Size attributes?
I made a standalone script with Scrapy that tests different methods of getting the data, and it works without problems. Maybe it helps you find your problem.
import scrapy
import pyquery

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://fundrazr.com/find?category=Health']

    def parse(self, response):
        print('--- css 1 ---')
        for title in response.css('h2'):
            print('>>>', title)
        print('--- css 2 ---')
        for title in response.css('h2'):
            print('>>>', title.extract())  # without _first()
            print('>>>', title.css('a').extract_first())
            print('>>>', title.css('a ::text').extract_first())
            print('-----')
        print('--- css 3 ---')
        for title in response.css('h2 a ::text'):
            print('>>>', title.extract())  # without _first()
        print('--- pyquery 1 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2'):
            print('>>>', title, title.text, '<<<')  # `title.text` gives "\n"
        print('--- pyquery 2 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2').text():
            print('>>>', title)
        print(p('h2').text())
        print('--- pyquery 3 ---')
        p = pyquery.PyQuery(response.body)
        for title in p('h2 a'):
            print('>>>', title, title.text)

# ---------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
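One more thing worth checking in the original gotspider, since the standalone script above switches to scrapy.Spider and sidesteps it: the Scrapy docs warn that CrawlSpider uses the parse method internally to implement its rule logic, so a rule with callback='parse' (as in the question) breaks the crawl. A minimal, hypothetical fix is to rename the callback, for example:
rules = [
    # CrawlSpider reserves 'parse' for itself; use another callback name:
    Rule(LinkExtractor(allow=('category=Health',)), callback='parse_page', follow=True),
]

def parse_page(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
Note also that the original pattern '/find/category=Health' cannot match the listing URLs, which contain find?category=Health; the looser 'category=Health' above is only an illustration.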

scrapy - spider module def functions not getting invoked

My intention is to invoke the start_requests method to log in to the website and, after login, scrape the website. Based on the log messages, I see that:
1. start_requests is not invoked.
2. The parse callback function is also not invoked.
What actually happens is that the spider only loads the urls in start_urls.
Question:
Why is the spider not crawling through the other pages (say page 2, 3, 4)?
Why is logging in from the spider not working?
Note:
My method of calculating the page numbers and building the urls is correct. I verified it.
I referred to this link to write this code: Using loginform with scrapy
My code:
zauba.py (spider)
#!/usr/bin/env python
from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http.request import Request
from loginform import fill_login_form
import logging

logger = logging.getLogger('Zauba')

class zauba(CrawlSpider):
    name = 'Zauba'
    login_url = 'https://www.zauba.com/user'
    login_user = 'scrapybot1@gmail.com'
    login_password = 'scrapybot1'
    logger.info('zauba')
    start_urls = ['https://www.zauba.com/import-gold/p-1-hs-code.html']

def start_requests(self):
    logger.info('start_request')
    # let's start by sending a first request to login page
    yield scrapy.Request(self.login_url, callback=self.parse_login)

def parse_login(self, response):
    logger.warning('parse_login')
    # got the login page, let's fill the login form...
    data, url, method = fill_login_form(response.url, response.body,
                                        self.login_user, self.login_password)
    # ... and send a request with our login data
    return FormRequest(url, formdata=dict(data),
                       method=method, callback=self.start_crawl)

def start_crawl(self, response):
    logger.warning('start_crawl')
    # OK, we're in, let's start crawling the protected pages
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    logger.info('parse')
    text = response.xpath('//div[@id="block-system-main"]/div[@class="content"]/div[@style="width:920px; margin-bottom:12px;"]/span/text()').extract_first()
    total_entries = int(text.split()[0].replace(',', ''))
    total_pages = int(math.ceil((total_entries * 1.0) / 30))
    logger.warning('*************** : ' + total_pages)
    print('*************** : ' + total_pages)
    for page in xrange(1, (total_pages + 1)):
        url = 'https://www.zauba.com/import-gold/p-' + page + '-hs-code.html'
        log.msg('url%d : %s' % (pages, url))
        yield scrapy.Request(url, callback=self.extract_entries)

def extract_entries(self, response):
    logger.warning('extract_entries')
    row_trs = response.xpath('//div[@id="block-system-main"]/div[@class="content"]/div/table/tr')
    for row_tr in row_trs[1:]:
        row_content = row_tr.xpath('.//td/text()').extract()
        if (row_content.__len__() == 9):
            print row_content
            yield {
                'date': row_content[0].replace(' ', ''),
                'hs_code': int(row_content[1]),
                'description': row_content[2],
                'origin_country': row_content[3],
                'port_of_discharge': row_content[4],
                'unit': row_content[5],
                'quantity': int(row_content[6].replace(',', '')),
                'value_inr': int(row_content[7].replace(',', '')),
                'per_unit_inr': int(row_content[8].replace(',', '')),
            }
loginform.py
#!/usr/bin/env python
import sys
from argparse import ArgumentParser
from collections import defaultdict
from lxml import html

__version__ = '1.0'  # also update setup.py

def _form_score(form):
    score = 0
    # In case of user/pass or user/pass/remember-me
    if len(form.inputs.keys()) in (2, 3):
        score += 10
    typecount = defaultdict(int)
    for x in form.inputs:
        type_ = (x.type if isinstance(x, html.InputElement) else 'other')
        typecount[type_] += 1
    if typecount['text'] > 1:
        score += 10
    if not typecount['text']:
        score -= 10
    if typecount['password'] == 1:
        score += 10
    if not typecount['password']:
        score -= 10
    if typecount['checkbox'] > 1:
        score -= 10
    if typecount['radio']:
        score -= 10
    return score

def _pick_form(forms):
    """Return the form most likely to be a login form"""
    return sorted(forms, key=_form_score, reverse=True)[0]

def _pick_fields(form):
    """Return the most likely field names for username and password"""
    userfield = passfield = emailfield = None
    for x in form.inputs:
        if not isinstance(x, html.InputElement):
            continue
        type_ = x.type
        if type_ == 'password' and passfield is None:
            passfield = x.name
        elif type_ == 'text' and userfield is None:
            userfield = x.name
        elif type_ == 'email' and emailfield is None:
            emailfield = x.name
    return (userfield or emailfield, passfield)

def submit_value(form):
    """Returns the value for the submit input, if any"""
    for x in form.inputs:
        if x.type == 'submit' and x.name:
            return [(x.name, x.value)]
    else:
        return []

def fill_login_form(
    url,
    body,
    username,
    password,
):
    doc = html.document_fromstring(body, base_url=url)
    form = _pick_form(doc.xpath('//form'))
    (userfield, passfield) = _pick_fields(form)
    form.fields[userfield] = username
    form.fields[passfield] = password
    form_values = form.form_values() + submit_value(form)
    return (form_values, form.action or form.base_url, form.method)

def main():
    ap = ArgumentParser()
    ap.add_argument('-u', '--username', default='username')
    ap.add_argument('-p', '--password', default='secret')
    ap.add_argument('url')
    args = ap.parse_args()
    try:
        import requests
    except ImportError:
        print 'requests library is required to use loginform as a tool'
    r = requests.get(args.url)
    (values, action, method) = fill_login_form(args.url, r.text,
                                               args.username, args.password)
    print '''url: {0}
method: {1}
payload:'''.format(action, method)
    for (k, v) in values:
        print '- {0}: {1}'.format(k, v)

if __name__ == '__main__':
    sys.exit(main())
The Log Message:
2016-10-02 23:31:28 [scrapy] INFO: Scrapy 1.1.3 started (bot: scraptest)
2016-10-02 23:31:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraptest.spiders', 'FEED_URI': 'medic.json', 'SPIDER_MODULES': ['scraptest.spiders'], 'BOT_NAME': 'scraptest', 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:39.0) Gecko/20100101 Firefox/39.0', 'FEED_FORMAT': 'json', 'AUTOTHROTTLE_ENABLED': True}
2016-10-02 23:31:28 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2016-10-02 23:31:28 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-02 23:31:28 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-02 23:31:28 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-02 23:31:28 [scrapy] INFO: Spider opened
2016-10-02 23:31:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-02 23:31:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-02 23:31:29 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/robots.txt> (referer: None)
2016-10-02 23:31:38 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/import-gold/p-1-hs-code.html> (referer: None)
2016-10-02 23:31:38 [scrapy] INFO: Closing spider (finished)
2016-10-02 23:31:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 558,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 136267,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 3, 6, 31, 38, 560012),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 3, 6, 31, 28, 927872)}
2016-10-02 23:31:38 [scrapy] INFO: Spider closed (finished)
I figured out the crappy mistake I made!
I didn't place the functions inside the class. That's why things didn't work as expected. Now that I have indented all the functions into the class body, things have started to work fine.
Thanks @user2989777 and @Granitosaurus for coming forward to debug
Scrapy already has a form request helper called FormRequest.
In most cases it will find the correct form by itself. You can try:
$ scrapy shell "https://www.zauba.com/import-gold/p-1-hs-code.html"
>>> from scrapy import FormRequest
>>> login_data = {'name': 'mylogin', 'pass': 'mypass'}
>>> request = FormRequest.from_response(response, formdata=login_data)
>>> print(request.body)
# b'form_build_id=form-Lf7bFJPTN57MZwoXykfyIV0q3wzZEQqtA5s6Ce-bl5Y&form_id=user_login_block&op=Log+in&pass=mypass&name=mylogin'
Once you log in, any requests chained afterwards will have the session cookie attached to them, so you only need to log in once at the beginning of your chain.
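As a rough sketch of such a chain (a minimal example; the spider name, callback names, and credentials are illustrative, not taken from the question):
import scrapy
from scrapy import FormRequest

class LoginFirstSpider(scrapy.Spider):
    name = 'login_first'
    start_urls = ['https://www.zauba.com/user']  # the login page

    def parse(self, response):
        # Let FormRequest locate and pre-fill the login form on the page:
        yield FormRequest.from_response(
            response,
            formdata={'name': 'mylogin', 'pass': 'mypass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookie obtained above is attached automatically
        # to every request scheduled from here on:
        yield scrapy.Request(
            'https://www.zauba.com/import-gold/p-1-hs-code.html',
            callback=self.parse_listing,
        )

    def parse_listing(self, response):
        # scrape the protected page here
        self.logger.info('Logged-in fetch of %s succeeded', response.url)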

Scrapy CrawlSpider is not following Links

I'm trying to crawl a page that uses next buttons to move to new pages using Scrapy. I'm using an instance of CrawlSpider and have defined the LinkExtractor to extract new pages to follow. However, the spider just crawls the start url and stops there. I've added the spider code and the log. Does anyone have any idea why the spider is not able to crawl the pages?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from realcommercial.items import RealcommercialItem
from scrapy.selector import Selector
from scrapy.http import Request

class RealCommercial(CrawlSpider):
    name = "realcommercial"
    allowed_domains = ["realcommercial.com.au"]
    start_urls = [
        "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
    ]
    rules = [Rule(LinkExtractor(allow=['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True),
             Rule(LinkExtractor(allow=[]),
                  callback='parse_response',
                  process_links='process_links',
                  follow=True)]

    def parse_response(self, response):
        sel = Selector(response)
        sites = sel.xpath("//a[@class='details']")
        #items = []
        for site in sites:
            item = RealcommercialItem()
            link = site.xpath('@href').extract()
            #print link, '\n\n'
            item['link'] = link
            link = 'http://www.realcommercial.com.au/' + str(link[0])
            #print 'link!!!!!!=', link
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            #items.append(item)
            yield item
        return

    def process_links(self, links):
        print 'inside process links'
        for i, w in enumerate(links):
            print w.url, '\n\n\n'
            w.url = "http://www.realcommercial.com.au/" + w.url
            print w.url, '\n\n\n'
            links[i] = w
        return links

    def parse_file_page(self, response):
        #item passed from request
        #print 'parse_file_page!!!'
        item = response.meta['item']
        #selector
        sel = Selector(response)
        title = sel.xpath('//*[@id="listing_address"]').extract()
        #print title
        item['title'] = title
        return item
Log
2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'realcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
2015-11-29 15:42:57 [scrapy] INFO: Spider opened
2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date> (referer: None)
2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 303,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 30599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)
I got the answer myself. There were two issues:
process_links was prepending "http://www.realcommercial.com.au/" although it was already there; I thought the extractor would give back relative urls.
The regular expression in the link extractor was not correct.
I made changes to both of these and it worked.
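For anyone hitting the same wall, the corrected rule presumably looks something like the sketch below (reconstructed from the description above, so treat it as an assumption; the exact query string on the pagination links may differ). The key points are escaping the literal ? (in the original, \d+? parses as a lazy quantifier followed by a literal activeSort, which never matches) and dropping the domain-prepending process_links hook, since the extractor already returns absolute URLs:
rules = [Rule(LinkExtractor(allow=[r'/for-sale/in-vic/list-\d+\?activeSort=list-date']),
              callback='parse_response',
              follow=True)]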

Scrapy outputs [ into my .json file

A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of each article. I managed to crawl a single page for one item, but the moment I try to expand beyond that, it all goes wrong.
my Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )
    rules = (Rule(SgmlLinkExtractor(allow=("", )),
                  callback="parse_items", follow=True),
             )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items
I am still getting the same problem, though I have noticed that there is a "[" every time I try to run the spider. To figure out what the issue is, I have run the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
  'Date': [u'2015-03-28 07:30'],
  'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests -----------------------------------------------------------------
[]
So, I can see that the item is being scraped, but no usable item data is put into the json file. This is how I'm running scrapy:
scrapy crawl basic_spider -o test.json
I've been looking at the last line (return items), as changing it to either yield or print gives me no items scraped in the parse.
This usually means nothing was scraped and no items were extracted.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit of cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com"]
    start_urls = [
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    ]
    rules = [
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    ]

    def parse_items(self, response):
        for title in response.xpath('//*[@id="aspnetForm"]'):
            item = BasicItem()
            item['Headline'] = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = title.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = title.xpath('//*[@id="spnDate"]/text()').extract()
            yield item

Scrapy rule SgmlLinkExtractor not working

How can I get my rule to work in my CrawlSpider and make it follow the links? I added this rule, but it's not working: nothing gets displayed, but I don't get any errors either. In the code below I commented what the URLs matched by my rule should look like.
Rule #1
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
I also tried this rule, but it did not work: nothing happened and no errors either. Rule #2
rules = (
    Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
)
code
class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item', follow=True)
    # looking for domains like in my rule:
    #http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
    #http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': 'yescobar2012@gmail.com', 'session_password': 'yescobar01'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is an response page! %s' % response.url)
            return Request(url='http://www.linkedin.com/csearch/results')
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
output
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2171,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 87904,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)
SgmlLinkExtractor uses re to find matches in link URLs.
What you pass in allow= goes through re.compile(), and then every link found in the pages is checked with _matches, which calls .search() with the compiled regexes:
_matches = lambda url, regexs: any((r.search(url) for r in regexs))
See https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py
When I check your regexes in the Python shell, they both work (each returns an SRE_Match for URL 1 and URL 2; I added a failing regex to compare):
>>> import re
>>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
>>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
>>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
>>> regex1.search(url1)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url1)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url1)
>>> regex1.search(url2)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url2)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url2)
>>>
To check whether you've got links in the page at all (i.e. that everything is not JavaScript-generated), I would add a very generic Rule matching every link (set allow=() or do not set allow at all); see the sketch after the link below.
See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
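For example, a catch-all rule along these lines (a debugging sketch only) follows every link and logs each page it reaches:
rules = (
    # No allow= filter: extract every link, so you can see what the
    # extractor finds on the page at all
    Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
)
If even this rule never triggers parse_item, the links are probably generated by JavaScript and never appear in the downloaded HTML.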
But in the end, you may be better off using the LinkedIn API for company search:
http://developer.linkedin.com/documents/company-search
