I have a working BaseSpider on Scrapy 0.20.0, and I'm trying to count the website URLs it finds and print the total as INFO when the spider finishes (closes). The problem is that I am not able to print this simple integer variable at the end of the session: any print statement in the parse() or parse_item() functions runs far too early, long before the spider is done.
I also looked at this question, but it seems somewhat outdated and it is unclear how to use it properly, i.e. where to put the code (myspider.py, pipelines.py, etc.).
Right now my spider code looks something like this:
class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def parse(self, response):
        ...
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...
This is obviously not working as intended. Any better, simple ideas?
The first answer in the question referred to above works, and there is no need to add anything to pipelines.py. Just add "that answer" to your spider code like this:
# To use "spider_closed" we also need:
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
class MySpider(BaseSpider):
...
foundWebsites = 0
...
def parse(self, response):
...
def parse_item(self, response):
...
if item['website']:
self.foundWebsites += 1
...
def __init__(self):
dispatcher.connect(self.spider_closed, signals.spider_closed)
def spider_closed(self, spider):
if spider is not self:
return
print "Found %d websites in this session.\n\n" % (self.foundWebsites)
I'm using Scrapy to scrape a site that has a login page followed by a set of content pages with sequential integer IDs, pulled up as a URL parameter. This has been successfully running for a while, but the other day I decided to move the code that yields the Requests into a separate method, so that I can call it other places besides the initial load (basically, to dynamically add some more pages to fetch).
And it... just won't call that separate method. It reaches the point where I invoke self.issue_requests(), and proceeds right through it as if the instruction isn't there.
So this (part of the spider class definition, without the separate method) works:
# ...final bit of start_requests():
    yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    global next_priority, input_reqno, REQUEST_RANGE, badreqs
    # go through our desired page numbers
    while len(REQUEST_RANGE) > 0:
        input_reqno = int(REQUEST_RANGE.pop(0))
        if input_reqno not in badreqs:
            yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
            next_priority -= 1

def parse_page(self, response):
    # ...
...however, this slight refactor does not:
# ...final bit of start_requests():
    yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

def issue_requests(self):
    self.logger.debug("Inside issue_requests()!")
    global next_priority, input_reqno, REQUEST_RANGE, badreqs
    # go through our desired page numbers
    while len(REQUEST_RANGE) > 0:
        input_reqno = int(REQUEST_RANGE.pop(0))
        if input_reqno not in badreqs:
            yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
            next_priority -= 1
    return

def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    self.issue_requests()

def parse_page(self, response):
    # ...
Looking at the logs, it reaches the "logged in successfully!" part, but then never gets "inside issue_requests()", and because there are no scrapy Request objects yielded by the generator, its next step is to close the spider, having done nothing.
I've never seen a situation where an object instance just refuses to call a method. You'd expect there to be some failure message if it can't pass control to the method, or if there's (say) a problem with the variable scoping in the method. But for it to silently move on and pretend I never told it to go to issue_requests() is, to me, bizarre. Help!
(this is Python 2.7.18, btw)
You have to yield the requests produced by issue_requests() from parse_login as well:
def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    for req in self.issue_requests():
        yield req
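The underlying reason is that issue_requests() contains yield, which makes it a generator function: calling it only creates a generator object, and none of its body runs until something iterates over it. A minimal standalone illustration in plain Python 2, with no Scrapy involved:

def gen():
    print "side effect"  # does not run at call time
    yield 1

g = gen()  # nothing is printed yet; g is just a generator object
next(g)    # only now does the body execute and print "side effect"

On Python 3 the loop in the answer above could be shortened to yield from self.issue_requests(), but that syntax does not exist on Python 2.7.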
I'm trying to write a scrapy-splash spider that goes to a website, continuously enters fields into a form, and submits the form. In the loop for field in fields[0:10], it works for the first field, gives me the data I want, and writes the files. But for the remaining 9 elements I get no response / the callback function is never called. I tried putting the body of parse in the start_urls function, but got the same result. I hope someone can clear up what I seem to be misunderstanding.
Additional notes: I'm doing this in Jupyter, where I predefine my settings and pass them in when I initialize my process.
Observations: the first field works, and then the next 9 print their respective Lua script instantly. I cannot tell whether this is because the Lua script is not working, or because of something fundamental about how scrapy-splash works. Additionally, print(c) only fires once, so either the callback is not being called or the SplashRequest is never issued.
c = 0
fields = ['']

class MySpider(scrapy.Spider):
    name = 'url'
    start_urls = ['url']

    def __init__(self, *args, **kwargs):
        with open('script.lua', 'r') as script:
            self.LUA_SOURCE = script.read()
            print(self.LUA_SOURCE)

    def parse(self, response):
        for field in fields[0:10]:
            try:
                src = self.LUA_SOURCE % (field)
                print(src)
                yield SplashRequest(url=self.start_urls[0],
                                    callback=self.parse2,
                                    endpoint='execute',
                                    args={'wait': 10,
                                          'timeout': 60,
                                          'lua_source': src},
                                    cache_args=['lua_source'],
                                    )
            except Exception as e:
                print(e)

    def parse2(self, response):
        global c
        c += 1
        print(c)
        try:
            with open(f'text_files/{c}.txt', 'w') as f:
                f.write(response.text)
        except Exception as e:
            print(e)
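For context, the Jupyter setup described in the question ("I predefined my settings and put them in when I initialize my process") would typically look something like the sketch below. This is an assumption, not the asker's actual notebook; the Splash URL and settings values shown are the standard ones from the scrapy-splash README:

from scrapy.crawler import CrawlerProcess

# Hypothetical settings dict; a real setup would point SPLASH_URL at its own Splash instance.
settings = {
    'SPLASH_URL': 'http://localhost:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}

process = CrawlerProcess(settings=settings)
process.crawl(MySpider)
process.start()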
I have a problem with inheritance, probably a beginner's mistake. I made two Scrapy spiders:
from scrapy.spiders import SitemapSpider

class SchemaorgSpider(SitemapSpider):
    name = 'schemaorg'

    def parse(self, response):
        print "parse"
        ...
and
from schemaorg import SchemaorgSpider

class SchemaorgSpider_two(SchemaorgSpider):
    name = 'schemaorg_two'
    sitemap_urls = [
        urltoparse
    ]
    sitemap_rules = [('/stuff/', 'parse_fromrule')]

    def parse_fromrule(self, response):
        print "parsefromrule"
        self.parse(response)
I am basically defining all the logic in parse and then reusing it in all child classes. When I run my second spider, I see only "parsefromrule" and never "parse". This looks like inheritance 101 to me, but it does not work.
What's wrong with it?
Edit: a test without Scrapy that works:
class a(object):
    def aa(self):
        print "hello"

class b(a):
    def bb(self):
        self.aa()

class c(b):
    def cc(self):
        self.aa()

hello = c()
hello.cc()
hello.bb()
hello.aa()
I see all 3 "hello". I am confused why it doesn't work with Scrapy.
Edit 2: if I call self.blabla(response) instead of self.parse(response) I get an error, which means the call to self.parse(response) does resolve to an existing method.
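One thing worth checking (an educated guess, since the body of parse is elided above): if parse contains yield statements, it is a generator function, and self.parse(response) only creates a generator object without executing its body, so the print inside it never runs. This is the same pitfall as in the issue_requests() question earlier on this page. In that case parse_fromrule would need to iterate over and re-yield whatever parse produces, roughly like this:

def parse_fromrule(self, response):
    print "parsefromrule"
    # If parse() is a generator, re-yield its output so its body actually runs.
    for item_or_request in self.parse(response):
        yield item_or_request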
I have a problem: I need to pause the execution of one function for a while without stopping the parsing as a whole. That is, I need a non-blocking pause.
It looks like this:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Here I need some function for sleep only this function like time.sleep(10)

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
The function non_stop_function needs to pause for a while, but it should not block the rest of the output.
If I insert time.sleep(), it stops the whole parser, which is not what I need. Is it possible to pause just one function, using Twisted or something else?
Reason: I need to create a non-blocking function that parses a page of the website every n seconds. It will collect URLs there and then pause for 10 seconds. The URLs that have already been obtained should keep being processed, but the main function needs to sleep.
UPDATE:
Thanks to TkTech and viach. One answer helped me understand how to create a pending Request, and the other showed how to activate it. Both answers complement each other, and together they make an excellent non-blocking pause for Scrapy:
# imports needed for this snippet:
from scrapy import Request
from twisted.internet import reactor
from twisted.internet.defer import Deferred

def call_after_pause(self, response):
    d = Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'https://example.com/',
        callback=self.non_stop_function,
        dont_filter=True))
    return d
And use this function for my request:
yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)
The Request object has a callback parameter; try to use that one for the purpose.
I mean: create a Deferred which wraps self.second_parse_function and the pause.
Here is my dirty and untested example; the changed lines are marked.
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If the approach works for you, then you can create a function which constructs such a Deferred object according to the rule. It could be implemented like the following:
def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
    d = Deferred()
    d.addCallback(fn, *args, **kwargs)
    d.addCallback(pause, seconds=seconds)
    return d
And here is a possible usage:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            # changed
            yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
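For that rate-limiting case, a minimal settings sketch (the 10-second value is only an illustration taken from the question; DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings):

# settings.py
DOWNLOAD_DELAY = 10              # wait roughly 10 s between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # default; spreads the delay between 0.5x and 1.5x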
Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other twisted app. Instead of calling sleep, just return the next request to make and tell twisted to wait a bit. Ex:
from twisted.internet import reactor, defer

def non_stop_function(self, response):
    d = defer.Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'some url',
        callback=self.non_stop_function
    ))
    return d
The asker already provides an answer in the question's update, but I want to give a slightly better version so it's reusable for any request.
# removed...
from twisted.internet import reactor, defer

class MySpider(scrapy.Spider):
    # removed...

    def request_with_pause(self, response):
        d = defer.Deferred()
        reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
            response.url,
            callback=response.meta['callback'],
            dont_filter=True, meta={'dont_proxy': response.meta['dont_proxy']}))
        return d

    def parse(self, response):
        # removed....
        yield scrapy.Request(the_url, meta={
            'time': 86400,
            'callback': self.the_parse,
            'dont_proxy': True
        }, callback=self.request_with_pause)
By way of explanation: Scrapy uses Twisted to manage requests asynchronously, so we need Twisted's tools to make a delayed request too.
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of every modified URL.
For this task I decided to use Scrapy because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are sent for fetching. For this I tried to take help from this link, which suggested using rules and implementing process_request(). I am not able to make use of these; I tried a little, which I have left commented out. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    # sitemap_rules = [(some_rules, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = orginal_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can connect a function to the request_scheduled signal and do what you want in that function. For example:
from scrapy import signals

class MySpider(SitemapSpider):

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...

    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:
{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the "new extensibility features" section of the 1.6.0 release notes).
I've just faced this. Apparently you can't really use process_request here, because sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider; only the latter can take that argument.
After examining the code, it looks like this can be worked around by manually overriding part of the SitemapSpider implementation:
class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...
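For the original goal of this question (appending a word to every sitemap URL), the body of modify_request could look roughly like the sketch below; this is just an illustration, with 'myword' taken from the question:

def modify_request(self, request):
    # Append the extra word to the URL taken from the sitemap.
    return request.replace(url=request.url + 'myword')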