Running Scrapy spiders in a Celery task - python

I have a Django site where a scrape happens when a user requests it, and my code kicks off a Scrapy spider as a standalone script in a new process. Naturally, this doesn't scale as the number of users grows.
Something like this:
class StandAloneSpider(Spider):
    # a regular spider

settings.overrides['LOG_ENABLED'] = True
# more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()
I've decided to use Celery and use workers to queue up the crawl requests.
However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders throw the ReactorNotRestartable error.
Can anyone share any tips on running spiders within the Celery framework?

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/
First, the tasks.py file:
from celery import task

@task()
def crawl_domain(domain_pk):
    from crawl import domain_crawl
    return domain_crawl(domain_pk)
Then the crawl.py file
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings

from spider import DomainSpider
from models import Domain

class DomainCrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        domain = Domain.objects.get(
            pk=domain_pk,
        )
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()

crawler = DomainCrawlerScript()

def domain_crawl(domain_pk):
    crawler.crawl(domain_pk)
The trick here is the "from multiprocessing import Process"; this gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python [my Django webserver is actually using Python 2.4 and my worker servers use Python 2.7].)
In my example here, "DomainSpider" is just a modified Scrapy spider that takes in a list of URLs and sets them as the "start_urls".
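For reference, a minimal sketch of what such a spider could look like (hedged: only the DomainSpider name and the urls-to-start_urls behaviour come from the answer above; the name attribute and the parse body are placeholders):

import scrapy

class DomainSpider(scrapy.Spider):
    name = "domain_spider"

    def __init__(self, urls, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        # the list of page URLs passed in by DomainCrawlerScript becomes start_urls
        self.start_urls = urls

    def parse(self, response):
        # placeholder parse logic
        yield {"url": response.url}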
Hope this helps!

I set CELERYD_MAX_TASKS_PER_CHILD to 1 in the settings file and that took care of the issue. The worker daemon starts a new process after each spider run, which gives each run a fresh reactor.
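For example (a sketch; CELERYD_MAX_TASKS_PER_CHILD is the classic Celery spelling, newer Celery versions use the lowercase worker_max_tasks_per_child form):

# Celery settings (e.g. in your Django settings.py)
CELERYD_MAX_TASKS_PER_CHILD = 1  # recycle the worker process after every task,
                                 # so each spider run gets a fresh Twisted reactor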

Related

Scrapy - Spiders taking too long to shut down

Basically, I have a file named spiders.py in which I configure all my spiders and fire them all using a single crawler. This is the source code of this file:
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from navigator import *

def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)
    process.start()

if __name__ == '__main__':
    main()
What I'm trying to achieve is to fire these spiders from another script using the subprocess module and, after 5 minutes of execution, shut down all the spiders (using only one SIGTERM). The file responsible for this is monitor.py:
from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    os.killpg(spiders_process.pid, signal.SIGTERM)

if __name__ == '__main__':
    main()
When the main thread wakes up, the terminal says 2018-07-19 21:45:09 [scrapy.crawler] INFO: Received SIGTERM, shutting down gracefully. Send again to force. But even after this message, the spiders continue to scrape the web pages. What am I doing wrong?
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?
I believe that when Scrapy receives a SIGTERM it tries to shut down gracefully by first waiting for all sent/scheduled requests to finish. Your best bet is either to limit the number of concurrent requests so it finishes quicker (CONCURRENT_REQUESTS/CONCURRENT_REQUESTS_PER_DOMAIN default to 16 and 8 respectively), or to send two SIGTERMs to instruct Scrapy to do an unclean, immediate exit.
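As an illustration of the second option, here is a sketch built on the monitor.py above (not tested against the asker's setup; the 15-second grace period is an arbitrary choice), sending a second SIGTERM shortly after the first to force the immediate, unclean exit:

from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    os.killpg(spiders_process.pid, signal.SIGTERM)      # first SIGTERM: graceful shutdown
    sleep(15)                                           # grace period (arbitrary)
    if spiders_process.poll() is None:                  # still running?
        os.killpg(spiders_process.pid, signal.SIGTERM)  # second SIGTERM: force immediate exit

if __name__ == '__main__':
    main()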
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?
process.start() starts the Twisted reactor (Twisted's main event loop), which is a blocking call. To work around this and run more code after the reactor has started, you can schedule a function to run inside the loop. The first snippet from this manual should give you an idea: https://twistedmatrix.com/documents/current/core/howto/time.html.
However, if you go that way, you must make sure that the code you schedule is also non-blocking; if you pause the execution of the loop for too long, bad things can start happening. So things like time.sleep() must be rewritten using a Twisted equivalent.
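To make the scheduling idea concrete, a hedged sketch (the spider name is hypothetical) of running a periodic, non-blocking callback inside the same reactor that CrawlerProcess uses, instead of sleeping:

from twisted.internet import task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def heartbeat():
    # must return quickly: anything slow here stalls the whole event loop
    print("crawl still running...")

process = CrawlerProcess(get_project_settings())
process.crawl("some_spider")        # hypothetical spider name from the project

loop = task.LoopingCall(heartbeat)
loop.start(10.0)                    # call heartbeat() every 10 seconds inside the reactor

process.start()                     # blocks, but heartbeat keeps firing while it runs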

Running Multiple spiders in scrapy for 1 website in parallel?

I want to crawl a website with 2 parts and my script is not as fast as I need.
Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part?
I tried to have 2 different classes and run them:
scrapy crawl firstSpider
scrapy crawl secondSpider
but I think that it is not smart.
I read the documentation of scrapyd but I don't know if it's good for my case.
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
You can read more at: running-multiple-spiders-in-the-same-process.
Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # query dvh is a custom argument used in your scrapy spider
process.start()
A better solution (if you have multiple spiders) is to dynamically get the spiders and run them:
from scrapy import spiderloader
from scrapy.utils import project
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def crawl():
    settings = project.get_project_settings()
    runner = CrawlerRunner(settings)  # runs each crawl inside the already-running reactor
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()
(Second solution)
Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()

Calling scrapy from a python script not creating JSON output file

Here's the Python script that I am using to call Scrapy, taken from the answer to
Scrapy crawl from script always blocks script execution after scraping:
def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
Here's my pipelines.py code:
from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
This code is taken from here:
Scrapy :: Issues with JSON export
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file with the expected output is created. But when I execute the Python script it does not create any file, even though the crawler runs (the dumped Scrapy stats are similar to those of the previous run).
I think there's a mistake in the Python script, since the file is created with the first approach. How do I get the script to output the file?
This code worked for me:
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
from twisted.internet import reactor

# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

mySettings = {'LOG_ENABLED': True,
              'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider ourselves
crawlerProcess.crawl(spider)            # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle)  # use this if you need to handle the idle event (restart spider?)

log.start()  # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
A solution that worked for me was to ditch the run script and the internal API, and use the command line & GNU Parallel to parallelize instead.
To run all known spiders, one per core:
scrapy list | parallel --line-buffer scrapy crawl
scrapy list lists all spiders, one per line, which lets us pipe them as arguments appended to a command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes will be printed to stdout mixed, but on a line-by-line basis rather than quarter/half lines being garbled together (for other options look at --group and --ungroup).
NB: obviously this works best on machines that have multiple CPU cores, as by default GNU Parallel will run one job per core. Note that unlike many modern development machines, the cheap AWS EC2 & DigitalOcean tiers only have one virtual CPU core. Therefore, if you wish to run jobs simultaneously on one core you will have to play with the --jobs argument to GNU Parallel. E.g., to run 2 scrapy crawlers per core:
scrapy list | parallel --jobs 200% --line-buffer scrapy crawl

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010
Thank you.
All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Simply, we can use:
from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1, arg2=val2)
process.start()
Use these arguments inside the spider's __init__ function and give them spider-wide scope (e.g. store them on self).
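For illustration, a hedged sketch of how those keyword arguments surface inside the spider (SpiderName, arg1, and arg2 are the placeholder names from the snippet above; the requests are hypothetical):

import scrapy

class SpiderName(scrapy.Spider):
    name = "spider_name"

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # keyword arguments given to process.crawl(SpiderName, arg1=..., arg2=...)
        # arrive here and can be stored on the instance
        self.arg1 = arg1
        self.arg2 = arg2

    def start_requests(self):
        # example use of the arguments (hypothetical URL)
        yield scrapy.Request("https://example.com/%s" % self.arg1, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "tag": self.arg2}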
Though I haven't tried it, I think the answer can be found within the Scrapy documentation. To quote directly from it:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here
From what I gather this is a new development in the library which renders some of the earlier approaches online (such as that in the question) obsolete.
In scrapy 0.19.x you should do this:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal was sent
Note these lines:
settings = get_project_settings()
crawler = Crawler(settings)
Without them, your spider won't use your settings and won't save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
One more way to do so is to just call the command directly from your script:
from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split()) #followall is the spider's name
Copied this answer from my first answer in here:
https://stackoverflow.com/a/19060485/1402286
When multiple crawlers need to be run inside one Python script, the reactor stop needs to be handled with caution, as the reactor can only be stopped once and cannot be restarted.
However, while doing my project I found that using
os.system("scrapy crawl yourspider")
is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.
If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:
import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']
    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)
It is an improvement on
Scrapy throws an error when run using crawlerprocess
and https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087
First, create your usual spider and make sure it runs successfully from the command line. It is very important that it runs and exports data, images, or files.
Once that works, do just as pasted in my program: above the spider class definition and below __name__, to invoke the settings.
It will pick up the necessary settings, which "from scrapy.utils.project import get_project_settings" failed to do for me, even though it is recommended by many.
Both the portion above and the portion below must be present together; only one of them won't run.
The spider will run in the folder containing scrapy.cfg, not in any other folder.
(Tree: a directory-layout screenshot was included in the original post for reference.)
# spider.py
import sys
sys.path.append(r'D:\ivana\flow')  # folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

# ---------------- Typical Spider Program starts here -----------------------------

#   (spider class definition goes here)

# ---------------- Typical Spider Program ends here -------------------------------

if __name__ == "__main__":
    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider)  # FlowSpider is the class defined above: class FlowSpider(scrapy.Spider)
    process.start(stop_after_crawl=True)
# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute

def gen_argv(s):
    sys.argv = s.split()

if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()
Put this code at a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)
If you want to run a simple crawl, it's easy: just run the command
scrapy crawl <spider_name>
There are also options to export your results and store them in formats like JSON, XML, or CSV:
scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)
You may want to try it.

Running Scrapy tasks in Python

My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same Python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

# schedule spider
# crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)

# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process

class CrawlerWorker(Process):

    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass  # Do something with item
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use:
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error:
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
It seems to me that you cannot use the crawler.start() command twice: you may have to re-create the crawler if you want it to run a second time.
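For completeness, a hedged sketch of the approach current Scrapy documentation suggests for running several crawls in one process without ever restarting the reactor: use CrawlerRunner and chain the crawls inside a single reactor run (MySpider here is the placeholder class from the question).

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish; the reactor keeps running throughout
    yield runner.crawl(MySpider)
    yield runner.crawl(MySpider)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called above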
