Running spider multiple times from script - ReactorNotRestartable [duplicate] - python

I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:
from time import sleep

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)
It works the first time, then I get the error. I create the process variable each time, so what's the problem?

By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.
You should call process.start(stop_after_crawl=False) if you create a process in each iteration.
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that.
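For reference, a minimal sketch of the CrawlerRunner approach, loosely based on the pattern in the Scrapy docs (the spider name 'my_spider' and the number of runs are placeholders, not part of the original answer):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # run the spider repeatedly inside one reactor; the reactor starts and stops only once
    for _ in range(3):
        yield runner.crawl('my_spider')
    reactor.stop()

crawl()
reactor.run()  # blocks here until crawl() stops the reactor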

I was able to solve this problem like this. process.start() should be called only once.
from time import sleep

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

For a given process, once you call reactor.run() or process.start(), you cannot rerun those commands; the reactor cannot be restarted. It stops once the script finishes executing.
So the best option is to use separate subprocesses if you need to run the reactor multiple times.
You can move the body of the while loop into a function (say, execute_crawling) and run it in separate subprocesses; Python's multiprocessing Process class can be used for this. The code is given below.
from multiprocessing import Process

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates

Ref http://crawl.blog/scrapy-loop/
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(self, *args, seconds):
    """Non blocking sleep callback"""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)
process.start()

I faced the ReactorNotRestartable error on AWS Lambda, and eventually I came to this solution.
By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)

I was able to mitigate this problem using the crochet package, via this simple code based on Christian Aichinger's answer to the duplicate of this question, Scrapy - Reactor not Restartable.
The spiders' initialization is done in the main thread, whereas the actual crawling is done in a different thread. I'm using Anaconda (Windows).
import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup

class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        for i in range(1, 6):
            time.sleep(1)
            print("Spider " + str(self.name) + " waited " + str(i) + " seconds.")

def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider, name=str(number))

setup()

for i in range(1, 6):
    time.sleep(1)
    print("Initialization of Spider #" + str(i))
    run_spider(i)

I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.
Spyder seems to work the first time but after that it doesn't. Maybe the reactor stays open and doesn't close?

I would advise you to run your scrapers using the subprocess module:
from subprocess import Popen, PIPE
spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
spider.wait()

If you're trying to build a Flask, Django, or FastAPI service that is running into this: you've tried all the things people suggest about forking a new process to run the reactor, and none of it seems to work.
Stop what you're doing and go read this: https://github.com/notoriousno/scrapy-flask
Crochet is your best bet for getting this working within gunicorn without writing your own crawler from scratch.
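For reference, a minimal sketch of that crochet approach. The Flask route, the 60-second timeout, and MySpider are illustrative assumptions, not taken verbatim from the linked repo:

from crochet import setup, wait_for
from flask import Flask, jsonify
from scrapy.crawler import CrawlerRunner

setup()  # must be called once, before the Twisted reactor is touched
app = Flask(__name__)
runner = CrawlerRunner()

@wait_for(timeout=60.0)
def run_crawl():
    # runner.crawl() returns a Deferred; wait_for blocks the calling (Flask) thread until it fires
    return runner.crawl(MySpider)  # MySpider: whatever spider class you already have

@app.route('/crawl')
def crawl_endpoint():
    run_crawl()
    return jsonify({'status': 'finished'})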

My way is multiprocessing, using Process.
# create spider
import scrapy

class PricesSpider(scrapy.Spider):
    name = 'prices'
    allowed_domains = ['index.minfin.com.ua']
    start_urls = ['https://index.minfin.com.ua/ua/markets/fuel/tm/']

    def parse(self, response):
        pass
Then I create a function which runs my spider:
# run spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

def parser():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(PricesSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
Then I create a new Python file, import the parser function there, and create a schedule for my spider:
# create schedule for spider
import schedule
from multiprocessing import Process
from my_parser import parser  # hypothetical module name: import parser() from the file defined above

def worker(pars):
    print('Worker starting')
    pr = Process(target=parser)
    pr.start()
    pr.join()

def main():
    schedule.every().day.at("15:00").do(worker, parser)
    # schedule.every().day.at("20:21").do(worker, parser)
    # schedule.every().day.at("20:23").do(worker, parser)
    # schedule.every(1).minutes.do(worker, parser)
    print('Spider working now')
    while True:
        schedule.run_pending()

if __name__ == '__main__':
    main()

Related

ReactorNotRestartable error in while loop with scrapy


Running Multiple spiders in scrapy for 1 website in parallel?

I want to crawl a website with 2 parts and my script is not as fast as I need.
Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part?
I tried to have 2 different classes and run them:
scrapy crawl firstSpider
scrapy crawl secondSpider
but I think that it is not smart.
I read the documentation of scrapyd but I don't know if it's good for my case.
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
You can read more at: running-multiple-spiders-in-the-same-process.
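That docs page also shows a CrawlerRunner variant that waits for all crawls to finish via runner.join(); roughly as follows (a sketch, reusing the MySpider1/MySpider2 placeholders from above):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()                    # Deferred that fires when every crawl has finished
d.addBoth(lambda _: reactor.stop())
reactor.run()                        # blocks here until both crawls are done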
Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument used in your spider
process.start()
A better solution (if you have multiple spiders) is to get the spiders dynamically and run them:
from scrapy import spiderloader
from scrapy.utils import project
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def crawl():
    settings = project.get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

runner = CrawlerRunner(project.get_project_settings())  # the runner the original snippet refers to but never defines
crawl()
reactor.run()
(Second solution)
Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()

Being able to change the settings while running scrapy from a script

I want to run scrapy from a single script and I want to get all settings from settings.py but I would like to be able to change some of them:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# ### so what I'm missing here is being able to set or override one or two of the settings ###

# 'followall' is the name of one of the spiders of the project.
process.crawl('testspider', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
I wasn't able to use this. I tried the following:
settings = scrapy.settings.Settings()
settings.set('RETRY_TIMES', 10)
but it didn't work.
Note: I'm using the latest version of scrapy.
So in order to override some settings, one way would be to override/set custom_settings, the spider's class-level attribute, in our script.
So I imported the spider's class and then overrode custom_settings:

from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}
So this is the whole script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}
process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('testspider', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
For some reason, the above script didn't work for me. Instead I wrote the following and it works. Posting it in case anyone else runs into the same issue.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.settings.set('RETRY_TIMES', 10, priority='cmdline')
process.crawl('testspider', domain='scrapinghub.com')
process.start()
Ran into this problem myself and had a slightly different solution that uses a modern Python (>=3.5) approach
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = {
    **get_project_settings(),
    'RETRY_TIMES': 2,
}

process = CrawlerProcess(settings)
process.crawl('testspider', domain='scrapinghub.com')
process.start()

Scrapy: how to run two crawlers one after another?

I have two spiders within the same project. One of them depends on the other running first. They use different pipelines. How can I make sure they are run sequentially?
Straight from the docs: https://doc.scrapy.org/en/1.2/topics/request-response.html
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():
    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()
    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010
Thank you.
All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Simply, we can use:
from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1, arg2=val2)
process.start()
Use these arguments inside the spider's __init__ function, where they become available with spider-wide scope.
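For illustration, a minimal sketch of how those keyword arguments reach the spider (SpiderName, arg1, and arg2 mirror the snippet above; storing them as instance attributes is one common way to keep them available):

import scrapy

class SpiderName(scrapy.Spider):
    name = "spider_name"

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # values passed via process.crawl(SpiderName, arg1=val1, arg2=val2) arrive here
        self.arg1 = arg1
        self.arg2 = arg2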
Though I haven't tried it I think the answer can be found within the scrapy documentation. To quote directly from it:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here
From what I gather this is a new development in the library which renders some of the earlier approaches online (such as that in the question) obsolete.
In scrapy 0.19.x you should do this:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal was sent
Note these lines:

settings = get_project_settings()
crawler = Crawler(settings)

Without them, your spider won't use your settings and won't save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
One more way to do so is to just call the command directly from your script:
from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split()) #followall is the spider's name
Copied this answer from my first answer in here:
https://stackoverflow.com/a/19060485/1402286
When multiple crawlers need to be run inside one Python script, the reactor stop needs to be handled with caution, as the reactor can only be stopped once and cannot be restarted.
However, I found while doing my project that using
os.system("scrapy crawl yourspider")
is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.
If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:
import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']
    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)
This is an improvement of "Scrapy throws an error when run using crawlerprocess" and https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087.
First, create your usual spider so that it runs successfully from the command line; it is very important that it actually runs and exports data, images, or files.
Once that is done, do just as pasted in my program: above the spider class definition and below __name__, to invoke the settings.
It gets the necessary settings, which "from scrapy.utils.project import get_project_settings" (recommended by many) failed to do for me.
Both the portions above and below the spider class definition should be there together; only one of them won't run.
The spider will run in the scrapy.cfg folder, not in any other folder.
#spider.py
import sys
sys.path.append(r'D:\ivana\flow')  # folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

# ----------------Typical Spider Program starts here-----------------------------
#   (spider class definition here)
# ----------------Typical Spider Program ends here-------------------------------

if __name__ == "__main__":
    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider)  # FlowSpider is the spider class defined above (class FlowSpider(scrapy.Spider))
    process.start(stop_after_crawl=True)
# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute

def gen_argv(s):
    sys.argv = s.split()

if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()
Put this code in a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)
If you want to run a simple crawl, it's easy; just run the command:
scrapy crawl <spider_name>
There are also options to export and store your results in formats like JSON, XML, or CSV:
scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)
You may want to try it.
