I am trying to use CrawlerRunner to run a spider using Scrapy as follows:
a_crawler = CrawlerRunner(settings)
@defer.inlineCallbacks
def crawl():
    CodeThatGenerateException()
    print("Starting crawler")
    yield a_crawler.crawl(MySpider)
    reactor.stop()
crawl()
reactor.run()
Strangely, the exception raised by the first line of the crawl function is not printed. Nothing happens, and the application hangs instead of stopping.
I cannot figure out what is going on.
Any suggestion is welcome.
Related
I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher
result = None
def set_result(item):
    global result  # without this, the assignment only creates a local variable
    result = item
while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()
    if result:
        break
    sleep(3)
It works the first time, then I get the error. I create a new process object each time, so what's the problem?
By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.
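If you create a process in each iteration, you should call process.start(stop_after_crawl=False) so that the reactor is not stopped. Note that start() then keeps blocking until something else stops the reactor.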
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example on doing that.
I was able to solve this problem like this. process.start() should be called only once.
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher
result = None
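def set_result(item):
    global result  # needed so that the module-level variable is updated
    result = item
while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
# called only once, outside the loop; the loop above needs an exit condition for this line to be reached
process.start()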
Within a given process, once you have called reactor.run() or process.start() you cannot call them again, because the Twisted reactor cannot be restarted. The reactor stops once the script finishes executing.
So the best option is to use separate subprocesses if you need to run the reactor multiple times.
You can move the body of the while loop into a function (say, execute_crawling).
Then you can simply run that function in separate subprocesses. Python's multiprocessing.Process class can be used for this.
The code is given below.
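from multiprocessing import Process
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher
def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result as defined in the question
    process.crawl('my_spider')
    process.start()
if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates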
Ref http://crawl.blog/scrapy-loop/
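One thing the subprocess approach above leaves open is how the parent gets the scraped items back, since a variable set inside the child process is not visible to the parent. Below is a minimal sketch of one way to do it with a multiprocessing.Queue. The helper names (_crawl_worker, crawl_once, crawl_until_result) are made up for illustration, the spider name 'my_spider' is taken from the question, and it assumes the scraped items are picklable and that the code runs inside a Scrapy project.

```python
import time
from multiprocessing import Process, Queue


def _crawl_worker(spider_name, queue):
    """Child-process entry point: run one crawl and send the scraped items back."""
    items = []
    try:
        # Imported here so Scrapy and the Twisted reactor are loaded in the child only.
        from scrapy import signals
        from scrapy.crawler import CrawlerProcess
        from scrapy.utils.project import get_project_settings

        # A named function (not a lambda): signal receivers are held by weak reference.
        def collect(item, response, spider):
            items.append(item)

        process = CrawlerProcess(get_project_settings())
        crawler = process.create_crawler(spider_name)
        crawler.signals.connect(collect, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()  # blocks until the crawl finishes
    finally:
        queue.put(items)  # always reply, so the parent never waits forever


def crawl_once(spider_name):
    """Run a single crawl in a fresh process and return the items it scraped."""
    queue = Queue()
    child = Process(target=_crawl_worker, args=(spider_name, queue))
    child.start()
    items = queue.get()  # read before join(), as the multiprocessing docs advise
    child.join()
    return items


def crawl_until_result(run_once, max_attempts=5, delay=3.0):
    """Call run_once() until it returns a non-empty list or the attempts run out."""
    for attempt in range(max_attempts):
        items = run_once()
        if items:
            return items
        if attempt < max_attempts - 1:
            time.sleep(delay)  # wait before the next crawl, like sleep(3) in the question
    return []
```

Under an `if __name__ == '__main__':` guard, this could be called as crawl_until_result(lambda: crawl_once('my_spider')). Each attempt gets its own process and therefore its own reactor, so ReactorNotRestartable should not come up.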
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater
def sleep(self, *args, seconds):
    """Non blocking sleep callback"""
    return deferLater(reactor, seconds, lambda: None)
process = CrawlerProcess(get_project_settings())
def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred
_crawl(None, MySpider)
process.start()
I faced the ReactorNotRestartable error on AWS Lambda and eventually came to this solution:
By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:
import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo
scrapydo.setup()
# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']
    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())
scrapydo.run_spider(QuotesSpider)
I was able to mitigate this problem using the crochet package, with this simple code based on Christian Aichinger's answer to the duplicate of this question, "Scrapy - Reactor not Restartable".
The spiders are initialized in the main thread, whereas the crawling itself is done in a different thread. I'm using Anaconda (Windows).
import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup
class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']
    def parse(self, response):
        print(response.text)
        for i in range(1,6):
            time.sleep(1)
            print("Spider "+str(self.name)+" waited "+str(i)+" seconds.")
def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider,name=str(number))
setup()
for i in range(1,6):
    time.sleep(1)
    print("Initialization of Spider #"+str(i))
    run_spider(i)
I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.
Spyder seems to work the first time but after that it doesn't. Maybe the reactor stays open and doesn't close?
I would advise you to run scrapers using the subprocess module:
from subprocess import Popen, PIPE
spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
stdout, _ = spider.communicate()  # waits for the crawl; wait() can deadlock when stdout=PIPE fills up
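A slightly fuller sketch of the same idea, for when the spider arguments vary between runs. The helper names build_crawl_command and run_spider are made up for illustration. It assumes the scrapy command is on the PATH and that the script is started from inside the Scrapy project directory.

```python
import subprocess


def build_crawl_command(spider_name, spider_args=None, output_file=None):
    """Build the argument list for `scrapy crawl` (hypothetical helper)."""
    command = ["scrapy", "crawl", spider_name]
    # Each spider argument becomes its own `-a key=value` pair.
    for key, value in (spider_args or {}).items():
        command += ["-a", f"{key}={value}"]
    if output_file is not None:
        command += ["-o", output_file]
    return command


def run_spider(spider_name, spider_args=None, output_file=None, timeout=None):
    """Run one crawl in a fresh process and return (exit code, stdout, stderr)."""
    completed = subprocess.run(
        build_crawl_command(spider_name, spider_args, output_file),
        capture_output=True,  # collects the output without the wait() + PIPE deadlock risk
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr
```

Because every call starts a new Python interpreter, each crawl gets its own reactor, so run_spider can be called in a loop as often as needed.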
If you have a Flask, Django, or FastAPI service that is running into this, and you've tried everything people suggest about forking a new process to run the reactor but none of it seems to work:
Stop what you're doing and go read this: https://github.com/notoriousno/scrapy-flask
Crochet is your best chance of getting this working within gunicorn without writing your own crawler from scratch.
My approach is to use Process from multiprocessing.
#create spider
class PricesSpider(scrapy.Spider):
    name = 'prices'
    allowed_domains = ['index.minfin.com.ua']
    start_urls = ['https://index.minfin.com.ua/ua/markets/fuel/tm/']
    def parse(self, response):
        pass
Then I create a function which runs my spider:
#run spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
def parser():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(PricesSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
Then I create a new Python file, import the function 'parser' into it, and create a schedule for my spider:
#create schedule for spider
import time
import schedule
from import parser  # the module name is missing in the original; import parser from the file that defines it
from multiprocessing import Process
def worker(pars):
    print('Worker starting')
    pr = Process(target=pars)  # run the function that was passed in, in a fresh process
    pr.start()
    pr.join()
def main():
    schedule.every().day.at("15:00").do(worker, parser)
    # schedule.every().day.at("20:21").do(worker, parser)
    # schedule.every().day.at("20:23").do(worker, parser)
    # schedule.every(1).minutes.do(worker, parser)
    print('Spider working now')
    while True:
        schedule.run_pending()
        time.sleep(1)  # avoid a busy loop between checks
if __name__ == '__main__':
    main()
I am trying to debug the following code with a breakpoint set in the settings.py file, but the run finishes without ever hitting that line:
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    # yield runner.crawl(a_spider)
    yield runner.crawl(b_spider)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
Whereas if I run the following code, the breakpoint is hit:
cmdline.execute(("scrapy crawl a_spider -o %s -t csv -L INFO" % (file_path,)).split())
What I'm trying to do is run multiple spiders in a single run. Could anyone help me out with the latter solution? Thanks.
Since scrapy.crawler.CrawlerRunner doesn't load the project settings automatically, you'll need to get the settings object yourself and pass it to the runner.
E.g. you may replace this line of your code:
runner = CrawlerRunner()
with these:
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
See also: Run Scrapy from a script (Scrapy doc)
I have to call the crawler from another python file, for which I use the following code.
def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
On running this, I get the error:
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use as I want to call this method multiple times and want reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?
The first thing I would say is that when you execute Scrapy from an external file, the log level is set to INFO. You should change it to DEBUG to see what is happening if your code doesn't work.
You should change the line:
log.start()
to:
log.start(loglevel=log.DEBUG)
To store everything in the log and generate a text file (for debugging purposes) you can do:
log.start(logfile="file.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
As for the signals issue, with the log level changed to DEBUG you may see some output that helps you fix it. You can also try putting your script into the Scrapy project folder to see whether it still crashes.
If you change the line:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
to:
dispatcher.connect(reactor.stop, signals.spider_closed)
what does it say?
Depending on your Scrapy version, it may be deprecated.
For looping, and for use in Azure Functions with a TimerTrigger, use this task:
from twisted.internet import task
from twisted.internet import reactor
loopTimes = 3
failInTheEnd = False
_loopCounter = 0
def runEverySecond():
    """
    Called at every loop interval.
    """
    global _loopCounter
    if _loopCounter < loopTimes:
        _loopCounter += 1
        print('A new second has passed.')
        return
    if failInTheEnd:
        raise Exception('Failure during loop execution.')
    # We looped enough times.
    loop.stop()
    return
def cbLoopDone(result):
    """
    Called when loop was stopped with success.
    """
    print("Loop done.")
    reactor.stop()
def ebLoopFailed(failure):
    """
    Called when loop execution failed.
    """
    print(failure.getBriefTraceback())
    reactor.stop()
loop = task.LoopingCall(runEverySecond)
# Start looping every 1 second.
loopDeferred = loop.start(1.0)
# Add callbacks for stop and failure.
loopDeferred.addCallback(cbLoopDone)
loopDeferred.addErrback(ebLoopFailed)
reactor.run()
If we want a task to run every X seconds repeatedly, we can use twisted.internet.task.LoopingCall:
from https://docs.twisted.org/en/stable/core/howto/time.html
I have two spiders within the same project. One of them depends on the other running first. They use different pipelines. How can I make sure they are run sequentially?
Straight from the docs: https://doc.scrapy.org/en/1.2/topics/request-response.html
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...
class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)
# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Process, Queue  # multiprocessing.queues.Queue needs an explicit context on Python 3
class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)
    def _item_passed(self, item):
        self.items.append(item)
    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)
# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass  # Do something with item
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use:
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
It seems to me that you cannot call crawler.start() twice: you may have to re-create the crawler if you want it to run a second time.