Calling Scrapy from another file without threading - python

I have to call the crawler from another python file, for which I use the following code.
def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
On running this, I get the following error:
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use, as I want to call this method multiple times and want the reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?

The first thing I would say is that when you're executing Scrapy from an external file the log level is set to INFO; you should change it to DEBUG to see what's happening if your code doesn't work.
You should change the line:
log.start()
to:
log.start(loglevel=log.DEBUG)
To store everything in the log and generate a text file (for debugging purposes) you can do:
log.start(logfile="file.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
About the signals issue: with the log level changed to DEBUG you may see some output that helps you fix it. You can also try to put your script into the Scrapy project folder to see if it still crashes.
If you change the line:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
to:
dispatcher.connect(reactor.stop, signals.spider_closed)
what does it say?
Depending on your Scrapy version it may be deprecated.
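For context, here is a sketch of how the whole function would look with that change, assuming the pre-1.0 Scrapy API used in the question (where Crawler, scrapy.log and scrapy.xlib.pydispatch are still available):

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher
from twisted.internet import reactor

def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    # connect through the dispatcher instead of crawler.signals
    dispatcher.connect(reactor.stop, signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=log.DEBUG)
    reactor.run()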

For looping (and for use in an Azure Function with a timer trigger) you can use a task like this:
from twisted.internet import task
from twisted.internet import reactor

loopTimes = 3
failInTheEnd = False
_loopCounter = 0

def runEverySecond():
    """
    Called at every loop interval.
    """
    global _loopCounter
    if _loopCounter < loopTimes:
        _loopCounter += 1
        print('A new second has passed.')
        return
    if failInTheEnd:
        raise Exception('Failure during loop execution.')
    # We looped enough times.
    loop.stop()
    return

def cbLoopDone(result):
    """
    Called when loop was stopped with success.
    """
    print("Loop done.")
    reactor.stop()

def ebLoopFailed(failure):
    """
    Called when loop execution failed.
    """
    print(failure.getBriefTraceback())
    reactor.stop()

loop = task.LoopingCall(runEverySecond)

# Start looping every 1 second.
loopDeferred = loop.start(1.0)

# Add callbacks for stop and failure.
loopDeferred.addCallback(cbLoopDone)
loopDeferred.addErrback(ebLoopFailed)

reactor.run()
If we want a task to run every X seconds repeatedly, we can use twisted.internet.task.LoopingCall; the snippet above is taken from https://docs.twisted.org/en/stable/core/howto/time.html

Related

Exceptions silently swallowed when using CrawlerRunner of Scrapy

I am trying to use CrawlerRunner to run a spider using Scrapy as follows:
a_crawler = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    CodeThatGenerateException()
    print("Starting crawler")
    yield a_crawler.crawl(MySpider)
    reactor.stop()

crawl()
reactor.run()
Strangely, the exception generated by the first line of the crawl function is not printed; nothing happens, the application hangs and does not stop.
I cannot figure out what is going on.
Any suggestion is welcome.
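With @defer.inlineCallbacks, calling crawl() returns a Deferred, and an exception raised inside the generator is captured in that Deferred instead of propagating; since nothing consumes its errback, the traceback is never printed and reactor.stop() is never reached, so the process hangs. A minimal sketch of one way to surface the failure and always stop the reactor, reusing the names from the snippet above:

import traceback
from twisted.internet import defer, reactor

@defer.inlineCallbacks
def crawl():
    try:
        CodeThatGenerateException()
        print("Starting crawler")
        yield a_crawler.crawl(MySpider)
    except Exception:
        traceback.print_exc()   # surface the exception that would otherwise be swallowed
    finally:
        reactor.stop()

reactor.callWhenRunning(crawl)  # start the generator only once the reactor is running
reactor.run()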

Scrapy - Spiders taking too long to be shut down

Basically, I have a file named spiders.py in which I configure all my spiders and fire them all, using a single crawler. This is the source code of this file:
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from navigator import *
def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)
    process.start()

if __name__ == '__main__':
    main()
What I'm trying to achieve is to fire these spiders from another script, using the subprocess module, and after 5 minutes of execution shut down all spiders (using only one SIGTERM). The file responsible for this objective is monitor.py:
from time import sleep
import os
import signal
import subprocess
def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    os.killpg(spiders_process.pid, signal.SIGTERM)

if __name__ == '__main__':
    main()
When the main thread wakes up, the terminal says 2018-07-19 21:45:09 [scrapy.crawler] INFO: Received SIGTERM, shutting down gracefully. Send again to force. But even after this message, the spiders continue to scrape the web pages. What am I doing wrong?
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?
I believe when Scrapy receives a SIGTERM it tries to shut down gracefully by first waiting to finish all sent/scheduled requests. Your best bet is to either limit the number of concurrent requests so it finishes quicker (CONCURRENT_REQUESTS/CONCURRENT_REQUESTS_PER_DOMAIN are 16 and 8 respectively by default) or to send two SIGTERMs to instruct Scrapy to do an unclean, immediate exit (see the sketch below).
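As a sketch of the second option, monitor.py could send the signal twice with a short pause in between (the 3-second gap is an arbitrary choice):

from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    os.killpg(spiders_process.pid, signal.SIGTERM)  # ask Scrapy to shut down gracefully
    sleep(3)
    os.killpg(spiders_process.pid, signal.SIGTERM)  # second signal forces an immediate exit

if __name__ == '__main__':
    main()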
OBS: Is it possible to fire all spiders inside spiders.py without blocking the main process?
process.start() starts the Twisted reactor (Twisted's main event loop), which is a blocking call. To circumvent it and run more code after the reactor has been started, you can schedule a function to be run inside the loop. The first snippet from this manual should give you an idea: https://twistedmatrix.com/documents/current/core/howto/time.html.
However, if you go that way, you must make sure that the code you schedule is also non-blocking, otherwise if you pause the execution of the loop for too long, bad things can start happening. So things like time.sleep() must be rewritten using a Twisted equivalent.
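As a sketch of that idea, the 5-minute shutdown could be scheduled inside spiders.py itself with reactor.callLater, so no separate monitor process or SIGTERM is needed (this assumes the default Twisted reactor; CrawlerProcess.stop() asks all crawlers to shut down gracefully, and 300 seconds mirrors the sleep in monitor.py):

from twisted.internet import reactor
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)
    # schedule a graceful shutdown 5 minutes after the reactor starts
    reactor.callLater(300, process.stop)
    process.start()  # blocks until all spiders have finished or been stopped

if __name__ == '__main__':
    main()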

Scapy variable sniff stop

I found a similar problem:
(Instance variables not being updated Python when using Multiprocessing),
but still do not know the solution for my task.
The task is to stop a Scapy sniff function after the completion of a test script. The running duration of single test scripts can vary greatly (from a few seconds to hours). My sniff function runs in a separate thread. The test script calls an init function at the beginning, which calls the sniff function from another module.
@classmethod
def SaveFullTrafficPcap(self, TestCase, Termination):
    try:
        Full_Traffic = []
        PktList = []
        FullPcapName = Settings['GeneralSettings']['ResultsPath']+TestCase.TestCaseName +"Full_Traffic_PCAP.pcap"
        #while Term.Termination < 1:
        Full_Traffic = sniff(lfilter=None, iface=str(Settings['GeneralSettings']['EthInterface']), store=True, prn = lambda x: Full_Traffic.append(x), count=0, timeout=Term.Termination)
        print(Full_Traffic)
        wrpcap(FullPcapName, Full_Traffic)
    except(Exception):
        SYS.ABS_print("No full traffic PCAP file written!\n")
At the end of the test script an exit function is called. In the exit function I set the Term.Termination parameter to 1 and wait for 5 seconds, but it doesn't work. The sniff function is stopped by the system and I get no FullPcapName file.
If count or timeout get a value, the code works without problems and I get my FullPcapName file with the complete traffic on my interface.
Does anybody have hints on how I can stop the sniff function regularly after finishing the test script?
Use of the stop_filter argument as specified here worked for me. I've duplicated HenningCash's code below for convenience:
import time, threading
from scapy.all import sniff

e = threading.Event()

def _sniff(e):
    a = sniff(filter="tcp port 80", stop_filter=lambda p: e.is_set())
    print("Stopped after %i packets" % len(a))

print("Start capturing thread")
t = threading.Thread(target=_sniff, args=(e,))
t.start()

time.sleep(3)
print("Try to shutdown capturing...")
e.set()

# This will run until you send a HTTP request somewhere
# There is no way to exit clean if no package is received
while True:
    t.join(2)
    if t.is_alive():
        print("Thread is still running...")
    else:
        break

print("Shutdown complete!")
However, you still have to wait for a final packet to be sniffed, which might not be ideal in your scenario.
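If waiting for that final packet is a problem, sniff's timeout argument (already used in the question's code) can be combined with stop_filter as an upper bound; a sketch, where e is the threading.Event from the snippet above and 600 seconds is an arbitrary limit:

from scapy.all import sniff

# returns either when e is set and one more packet arrives, or after 600 seconds at most
packets = sniff(filter="tcp port 80",
                stop_filter=lambda p: e.is_set(),
                timeout=600)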
Now I have solved the problem with global variables. It is not nice, but it works well.
Nevertheless I am interested in a better solution for the variable sniff stop.
stop_var = None

def stop():
    global stop_var
    stop_var.stop()

def start():
    """
    your code
    """
    global stop_var
    stop_var = AsyncSniffer(**arg)  # arg holds your sniff arguments
    stop_var.start()

start()
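A variant of the same idea without globals keeps the sniffer object in the caller's hands (a sketch; the interface name and output path are placeholders):

from scapy.all import AsyncSniffer, wrpcap

sniffer = AsyncSniffer(iface="eth0", store=True)   # placeholder interface
sniffer.start()
# ... run the test script here ...
sniffer.stop()                                     # stops sniffing and joins the sniffer thread
wrpcap("Full_Traffic_PCAP.pcap", sniffer.results)  # captured packets are kept in .results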

Stopping task.LoopingCall if exception occurs

I'm new to Twisted, and after finally figuring out how deferreds work I'm struggling with tasks. What I want to achieve is to have a script that sends a REST request in a loop; however, if at some point it fails I want to stop the loop. Since I'm using callbacks I can't easily catch exceptions, and because I don't know how to stop the looping from an errback, I'm stuck.
This is the simplified version of my code:
def send_request():
    agent = Agent(reactor)
    req_result = agent.request('GET', some_rest_link)
    req_result.addCallbacks(cp_process_request, cb_process_error)

if __name__ == "__main__":
    list_call = task.LoopingCall(send_request)
    list_call.start(2)
    reactor.run()
To end a task.LoopingCall all you need to do is call stop() on the returned object (list_call in your case).
Somehow you need to make that variable available to your errback (cb_process_error), either by pushing it into a class that cb_process_error is in, via some other class used as a pseudo-global, or by literally using a global; then you simply call list_call.stop() inside the errback.
BTW you said:
Since I'm using callbacks I can't easily catch exceptions
That's not really true. The point of an errback is to deal with exceptions; that's one of the things that literally causes it to be called! Check out my previous deferred answer and see if it makes errbacks any clearer.
The following is a runnable example (... I'm not saying this is the best way to do it, just that it is a way...)
#!/usr/bin/python
from twisted.internet import task
from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.web.client import Agent
from pprint import pprint

class LoopingStuff (object):

    def cp_process_request(self, return_obj):
        print "In callback"
        pprint (return_obj)

    def cb_process_error(self, return_obj):
        print "In Errorback"
        pprint(return_obj)
        self.loopstopper()

    def send_request(self):
        agent = Agent(reactor)
        req_result = agent.request('GET', 'http://google.com')
        req_result.addCallbacks(self.cp_process_request, self.cb_process_error)

def main():
    looping_stuff_holder = LoopingStuff()
    list_call = task.LoopingCall(looping_stuff_holder.send_request)
    looping_stuff_holder.loopstopper = list_call.stop
    list_call.start(2)
    reactor.callLater(10, reactor.stop)
    reactor.run()

if __name__ == '__main__':
    main()
Assuming you can get to google.com, this will fetch pages for 10 seconds. If you change the second arg of agent.request to something like http://127.0.0.1:12999 (assuming that port 12999 will give a connection refused), then you'll see one errback printout (which will also have shut down the LoopingCall) and a 10-second wait until the reactor shuts down.

Running Scrapy tasks in Python

My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)
# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process
class CrawlerWorker(Process):

    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass  # Do something with item
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use:
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
It seems to me that you cannot use the crawler.start() command twice: you may have to re-create the crawler if you want it to run a second time.
