Scrapy - Spiders taking too long to be shut down - python

Basically, I have a file named spiders.py in which I configure all my spiders and fire them all using a single crawler. This is the source code of this file:
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from navigator import *

def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)
    process.start()

if __name__ == '__main__':
    main()
What I'm trying to achieve is to fire these spiders from another script using the subprocess module and, after 5 minutes of execution, shut down all spiders (using only one SIGTERM). The file responsible for this objective is monitor.py:
from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    os.killpg(spiders_process.pid, signal.SIGTERM)

if __name__ == '__main__':
    main()
When the main thread wakes up, the terminal says:
2018-07-19 21:45:09 [scrapy.crawler] INFO: Received SIGTERM, shutting down gracefully. Send again to force
But even after this message, the spiders continue to scrape web pages. What am I doing wrong?
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?

I believe that when Scrapy receives a SIGTERM it tries to shut down gracefully by first waiting for all sent/scheduled requests to finish. Your best bet is either to limit the number of concurrent requests so it finishes quicker (CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN are 16 and 8 respectively by default) or to send two SIGTERMs to instruct Scrapy to do an unclean, immediate exit.
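For the second option, a minimal sketch of how monitor.py could send a second SIGTERM after a grace period; the 30-second grace period and the poll() check are my assumptions, not part of the original script:
from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    # First SIGTERM: Scrapy starts a graceful shutdown.
    os.killpg(spiders_process.pid, signal.SIGTERM)
    sleep(30)  # arbitrary grace period (an assumption)
    if spiders_process.poll() is None:
        # Still running: the second SIGTERM makes Scrapy exit immediately.
        os.killpg(spiders_process.pid, signal.SIGTERM)

if __name__ == '__main__':
    main()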
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?
process.start() starts the Twisted reactor (Twisted's main event loop), which is a blocking call. To circumvent it and run more code after the reactor has been started, you can schedule a function to be run inside the loop. The first snippet from this manual should give you an idea: https://twistedmatrix.com/documents/current/core/howto/time.html.
However, if you go that way, you must make sure that the code you schedule is also non-blocking; otherwise, if you pause the execution of the loop for too long, bad things can start happening. So things like time.sleep() must be rewritten using a Twisted equivalent.
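For example, here is a hedged sketch of spiders.py stopping itself from inside the event loop after 5 minutes instead of relying on an external SIGTERM. It assumes the default Twisted reactor is in use and that a graceful stop of all crawlers is acceptable:
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)

    # Schedule a graceful stop of every running crawler after 300 seconds.
    # The callback runs inside the reactor loop, so it does not block.
    from twisted.internet import reactor
    reactor.callLater(300, process.stop)

    process.start()

if __name__ == '__main__':
    main()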

Related

scrapy : parallel AND sequential running of spider

I have a scrapy project with multiple spiders. Some take minutes, some take hours, and anything in between - however that elapsed time is usually about the same for each run - so you can assume that scraper X runs for about the same time as scrapers Y and Z together.
What I want to do is, instead of running them all in parallel starting at T0, start scrapers 1, 2, 3 at the start, then chain scrapers 4, 5, 6 after 2 finishes, and 7, 8, 9 after 3 finishes, to smooth out the downstream processing requirements (concurrent database connections etc.).
I think I need to chain the deferreds, and there are some clear examples in the docs, but I'm not sure how to set that up alongside having some run in parallel - current starting code is below (each spider is in its own external file):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
setting = get_project_settings()
process = CrawlerProcess(setting)
process.crawl('scraper1')
process.crawl('scraper2')
process.crawl('scraper3')
...etc...
...etc...
process.start()
Figured out the answer - easier than I thought.
There's no need to worry about stopping the reactor (hence why it is commented out)
Scraper 1 and scraper 2 start at the same time, scraper 3 starts after scraper 2 finishes.
from twisted.internet import defer
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

@defer.inlineCallbacks
def crawl_seq():
    global process
    yield process.crawl('scraper2')
    yield process.crawl('scraper3')
    # reactor.stop()

crawl_seq()
process.crawl('scraper1')
process.start()

Calling Scrapy from another file without threading

I have to call the crawler from another python file, for which I use the following code.
# Imports for the pre-1.0 Scrapy API used below.
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
On running this, I get the error as
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use as I want to call this method multiple times and want reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?
The first thing I would say is that when you're executing Scrapy from an external file, the log level is set to INFO; you should change it to DEBUG to see what's happening if your code doesn't work.
You should change the line:
log.start()
for:
log.start(loglevel=log.DEBUG)
To store everything in the log and generate a text file (for debugging purposes) you can do:
log.start(logfile="file.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
About the signals issue: with the log level changed to DEBUG you may see some output that helps you fix it. You can also try putting your script inside the Scrapy project folder to see if it still crashes.
If you change the line:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
for:
dispatcher.connect(reactor.stop, signals.spider_closed)
What does it say?
Depending on your Scrapy version, it may be deprecated.
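For completeness, a hedged sketch of that dispatcher-based hookup; the import path below is the one bundled with older Scrapy releases and may not exist in newer versions:
# Older Scrapy versions shipped pydispatch under scrapy.xlib; newer ones
# expect the standalone pydispatch package instead (an assumption to verify).
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from twisted.internet import reactor

# Stop the reactor once the spider has closed.
dispatcher.connect(reactor.stop, signals.spider_closed)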
For looping, and for use in Azure Functions with a TimerTrigger, use this task:
from twisted.internet import task
from twisted.internet import reactor

loopTimes = 3
failInTheEnd = False
_loopCounter = 0

def runEverySecond():
    """
    Called at every loop interval.
    """
    global _loopCounter

    if _loopCounter < loopTimes:
        _loopCounter += 1
        print('A new second has passed.')
        return

    if failInTheEnd:
        raise Exception('Failure during loop execution.')

    # We looped enough times.
    loop.stop()
    return

def cbLoopDone(result):
    """
    Called when loop was stopped with success.
    """
    print("Loop done.")
    reactor.stop()

def ebLoopFailed(failure):
    """
    Called when loop execution failed.
    """
    print(failure.getBriefTraceback())
    reactor.stop()

loop = task.LoopingCall(runEverySecond)

# Start looping every 1 second.
loopDeferred = loop.start(1.0)

# Add callbacks for stop and failure.
loopDeferred.addCallback(cbLoopDone)
loopDeferred.addErrback(ebLoopFailed)

reactor.run()
If we want a task to run every X seconds repeatedly, we can use twisted.internet.task.LoopingCall:
from https://docs.twisted.org/en/stable/core/howto/time.html

Running Scrapy spiders in a Celery task

I have a Django site where a scrape happens when a user requests it, and my code kicks off a Scrapy spider standalone script in a new process. Naturally, this doesn't work well as the number of users increases.
Something like this:
class StandAloneSpider(Spider):
    # a regular spider

settings.overrides['LOG_ENABLED'] = True
# more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()
I've decided to use Celery and use workers to queue up the crawl requests.
However, I'm running into issues with Twisted reactors not being able to restart. The first and second spiders run successfully, but subsequent spiders will throw the ReactorNotRestartable error.
Can anyone share any tips on running spiders within the Celery framework?
Okay here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code located here http://snippets.scrapy.org/snippets/13/
First the tasks.py file
from celery import task

@task()
def crawl_domain(domain_pk):
    from crawl import domain_crawl
    return domain_crawl(domain_pk)
Then the crawl.py file
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings

from spider import DomainSpider
from models import Domain

class DomainCrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        domain = Domain.objects.get(
            pk=domain_pk,
        )
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()

crawler = DomainCrawlerScript()

def domain_crawl(domain_pk):
    crawler.crawl(domain_pk)
The trick here is the "from multiprocessing import Process"; this gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically, the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python [my Django webserver is actually using Python 2.4 and my worker servers use Python 2.7].)
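To illustrate how this gets triggered, here is a hedged sketch of enqueueing the task from the Django side; the view name and response text are assumptions, only crawl_domain comes from the answer above:
# Hypothetical Django view: instead of spawning a crawler process directly,
# it pushes a crawl_domain task onto the Celery queue and returns immediately.
from django.http import HttpResponse
from tasks import crawl_domain

def start_crawl(request, domain_pk):
    crawl_domain.delay(domain_pk)  # .delay() enqueues the Celery task
    return HttpResponse("Crawl queued for domain %s" % domain_pk)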
In my example here, "DomainSpider" is just a modified Scrapy spider that takes in a list of URLs and sets them as the "start_urls".
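A minimal sketch of what such a spider might look like; the class body below is my assumption, not the answerer's code, and it uses the modern scrapy.Spider base class:
import scrapy

class DomainSpider(scrapy.Spider):
    name = "domain_spider"

    def __init__(self, urls, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        # The list of URLs passed in becomes the spider's start_urls.
        self.start_urls = urls

    def parse(self, response):
        # Parsing logic goes here.
        pass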
Hope this helps!
I set CELERYD_MAX_TASKS_PER_CHILD to 1 in the settings file and that took care of the issue. The worker daemon starts a new process after each spider run and that takes care of the reactor.
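For reference, a hedged sketch of that setting in the old-style Celery configuration; newer Celery versions spell it worker_max_tasks_per_child:
# Each worker child handles exactly one task, so every spider run gets a
# fresh process and therefore a fresh Twisted reactor.
CELERYD_MAX_TASKS_PER_CHILD = 1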

Running Scrapy tasks in Python

My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)
# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process

class CrawlerWorker(Process):

    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass  # Do something with item
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use:
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
It seems to me that you cannot use the crawler.start() command twice: you may have to re-create it if you want it to run a second time.
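A hedged sketch of that re-creation approach: since the Twisted reactor itself cannot be restarted, the practical way to run it a second time is to build a fresh CrawlerProcess in a fresh process for each run. MySpider and the project settings are taken from the question; everything else is an assumption:
from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def _run_once():
    # A brand-new CrawlerProcess (and reactor) lives and dies with this process.
    crawler = CrawlerProcess(get_project_settings())
    crawler.crawl(MySpider)
    crawler.start()  # blocks until the crawl finishes

def run_spider():
    p = Process(target=_run_once)
    p.start()
    p.join()

run_spider()
run_spider()  # works a second time because each run gets its own reactor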

Using a Python script to start and stop the Google App Engine dev_appserver during continuous integration testing

I'm trying to write a Python script that will enable me to start the Google App Engine dev_appserver using coverage.py, fetch the /test url from the app that I launch, wait for the server to finish returning the page, then shut down the dev_appserver, and then generate a report.
My challenge is how to launch the dev_appserver in the background so that I can do the http fetch and then how to shut down the dev_appserver before generating my report.
I'm heading towards something like this:
# get_gae_coverage.py
# Launch dev_appserver with coverge.py
coverage run --source=./ /usr/local/bin/dev_appserver.py --clear_datastore --use_sqlite .
#Fetch /test
urllib.urlopen('http://localhost:8080/test').read()
# Shutdown dev_appserver somehow
# ??
# Generate coverage report
coverage report
What is the best way to write a python script to do this?
You should go with subprocess Popen:
import os
import signal
import subprocess
import time
import urllib

coverage_proc = subprocess.Popen(
    ['coverage', 'run', your_flag_list],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT)

time.sleep(5)  # Find the correct sleep value

urllib.urlopen('http://localhost:8080/test').read()
time.sleep(1)
os.kill(coverage_proc.pid, signal.SIGINT)
Here you can find another approach to test if the server is up and running:
line = proc.stdout.readline()
while '] Running application' not in line:
    line = proc.stdout.readline()
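A hedged sketch of wrapping that polling in a helper with a timeout, so the CI job does not hang forever if the server never prints the startup line; the marker string is the one from the snippet above, the 60-second timeout is an assumption, and the code follows the Python 2 style of the rest of this answer:
import time

def wait_for_server(proc, marker='] Running application', timeout=60):
    # Poll the dev_appserver's stdout until the startup marker appears.
    # Note: readline() blocks, so the timeout is only checked between lines.
    deadline = time.time() + timeout
    line = proc.stdout.readline()
    while marker not in line:
        if time.time() > deadline:
            raise RuntimeError('dev_appserver did not start in time')
        line = proc.stdout.readline()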
Threading is the way to accomplish this kind of task. Namely, you start the dev_appserver in a thread (or in the main thread) and, while it is running, run and collect the results using the coverage module, then kill the dev_appserver Python process in another thread, and you will have the results from coverage.
Here is a sample snippet, which runs dev_appserver.py in a thread and then waits for 10 seconds before killing the Python process. You can modify the end method so that instead of waiting for 10 seconds, it waits a few seconds (to let the Python process start), then performs the coverage testing, and after it is done, kills the appserver and finishes coverage.
import threading
import subprocess
import time

hold_process = []

def start():
    print 'In the start process'
    proc = subprocess.Popen(['/usr/bin/python', 'dev_appserver.py', 'yourapp'])
    hold_process.append(proc)

def end():
    time.sleep(10)
    proc = hold_process.pop(0)
    print 'Killing the appserver process'
    proc.kill()

t = threading.Thread(name='startprocess', target=start)
t.daemon = True
w = threading.Thread(name='endprocess', target=end)

t.start()
w.start()
t.join()
w.join()
