AWS Lambda, Scrapy and catching exceptions - python

I'm running Scrapy as an AWS Lambda function. Inside my function I need a timer to check whether it has been running longer than 1 minute and, if so, run some logic. Here is my code:
def handler():
    x = 60
    watchdog = Watchdog(x)
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()
I took the Watchdog timer class from this answer. The problem is that the code never hits the except Watchdog block; instead, the exception is thrown outside of it:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
    raise self
functions.python.my_scrapy.index.Watchdog: 1
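For reference, the Watchdog class from that answer is roughly the following (a sketch reconstructed from the linked answer and the traceback above; the key point is that the timer fires on its own thread):

from threading import Timer

class Watchdog(Exception):
    def __init__(self, timeout, handler=None):
        self.timeout = timeout
        self.handler = handler if handler is not None else self.defaultHandler
        self.timer = Timer(self.timeout, self.handler)
        self.timer.start()

    def stop(self):
        self.timer.cancel()

    def defaultHandler(self):
        # runs on the Timer's worker thread, not the main thread
        raise self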
I need to catch the exception inside the function. How would I go about that?
PS: I'm very new to Python.

Alright, this question had me going a little crazy; here is why that doesn't work:
What the Watchdog object does is create another thread where the exception is raised but not handled (a try/except in the main thread only catches exceptions raised in the main thread). Luckily, Twisted has some neat features.
You can do it by running the reactor in another thread:
import time
from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

# The reactor runs in a different thread so it doesn't block the script here;
# installSignalHandlers=False because signal handlers can only be installed in the main thread.
Thread(target=reactor.run, args=(False,)).start()
time.sleep(60)  # block the script here instead

# Now check whether it's still scraping.
if reactor.running:
    pass  # still running after 60 seconds: do something
else:
    pass  # finished in time: do something else
I'm using Python 3.7.0.

Twisted has scheduling primitives. For example, this program runs for about 60 seconds:
from twisted.internet import reactor
reactor.callLater(60, reactor.stop)
reactor.run()
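Applied to the original handler, a minimal sketch (assuming the MySpider1/MySpider2 spiders and the CrawlerRunner setup from the question) would schedule the timeout logic with callLater and cancel it if the crawl finishes first:

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

def handler():
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()

    def on_timeout():
        print('Timeout error: process takes longer than 60 seconds.')
        # some other logic here
        reactor.stop()

    timer = reactor.callLater(60, on_timeout)

    def on_finished(result):
        if timer.active():
            timer.cancel()  # the crawl beat the timer, so drop the timeout
            reactor.stop()
        return result

    d.addBoth(on_finished)
    reactor.run()  # blocks until reactor.stop() is called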

Related

How to catch exceptions thrown by functions executed using multiprocessing.Process() (python)

How can I catch exceptions from a process that was executed using multiprocessing.Process()?
Consider the following Python script, which executes a simple failFunction() (which immediately throws a runtime error) inside of a child process using multiprocessing.Process():
#!/usr/bin/env python3
import multiprocessing, time

# this function will be executed in a child process asynchronously
def failFunction():
    raise RuntimeError('trust fall, catch me!')

# execute failFunction() in a child process in the background
process = multiprocessing.Process(
    target = failFunction,
)
process.start()

# <this is where async stuff would happen>
time.sleep(1)

# try (and fail) to catch the exception
try:
    process.join()
except Exception as e:
    print("This won't catch the exception")
As you can see from the following execution, attempting to wrap the .join() does not actually catch the exception:
user@host:~$ python3 example.py
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "example4.py", line 6, in failFunction
    raise RuntimeError('trust fall, catch me!')
RuntimeError: trust fall, catch me!
user@host:~$
How can I update the above script to actually catch the exception from the function that was executed inside of a child process using multiprocessing.Process()?
This can be achieved by overriding the run() method of the multiprocessing.Process class with a try..except statement and setting up a Pipe() to send any exception raised in the child process back to the parent, where it is stored in an _exception instance field:
#!/usr/bin/env python3
import multiprocessing, traceback, time

class Process(multiprocessing.Process):
    def __init__(self, *args, **kwargs):
        multiprocessing.Process.__init__(self, *args, **kwargs)
        self._pconn, self._cconn = multiprocessing.Pipe()
        self._exception = None

    def run(self):
        try:
            multiprocessing.Process.run(self)
            self._cconn.send(None)
        except Exception as e:
            tb = traceback.format_exc()
            self._cconn.send((e, tb))
            # raise e  # You can still raise this exception if you need to

    @property
    def exception(self):
        if self._pconn.poll():
            self._exception = self._pconn.recv()
        return self._exception

# this function will be executed in a child process asynchronously
def failFunction():
    raise RuntimeError('trust fall, catch me!')

# execute failFunction() in a child process in the background
process = Process(
    target = failFunction,
)
process.start()

# <this is where async stuff would happen>
time.sleep(1)

# catch the child process' exception
try:
    process.join()
    if process.exception:
        error, tb = process.exception
        raise error
except Exception as e:
    print("Exception caught!")
Example execution:
user@host:~$ python3 example.py
Exception caught!
user@host:~$
Solution taken from this answer:
https://stackoverflow.com/a/33599967/1174102
This solution does not require the target function having to catch its own exceptions.
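(The traceback is sent through the Pipe as a formatted string because traceback objects themselves cannot be pickled; the exception object survives pickling, but its original traceback does not.)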
It may seem like overkill, but you can use the ProcessPoolExecutor class in the concurrent.futures module to create a process pool of size 1, which is all that is required here. When you submit a "job" to the executor, a Future instance is created representing the state of execution of the process. When you call result() on the Future instance, you block until the process terminates and returns a result (i.e. the target function returns). If the target function throws an exception, you can catch it when you call result():
import concurrent.futures

def failFunction():
    raise RuntimeError('trust fall, catch me!')

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(failFunction)
        try:
            result = future.result()
        except Exception as e:
            print('exception = ', e)
        else:
            print('result = ', result)

if __name__ == '__main__':
    main()
Prints:
exception = trust fall, catch me!
The bonus of using a process pool is that you have a ready-made process available whenever you have additional functions to invoke in a subprocess, as sketched below.
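For instance, a minimal sketch of that reuse (the work function here is hypothetical): the same single worker process serves every submission, and each Future surfaces either the result or the exception.

import concurrent.futures

def work(x):
    # a stand-in for any function you might run in the subprocess
    if x < 0:
        raise ValueError('negative input: %d' % x)
    return x * x

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        # the one worker process handles all three submissions in turn
        for n in (2, 3, -1):
            future = executor.submit(work, n)
            try:
                print('result =', future.result())
            except Exception as e:
                print('exception =', e)

if __name__ == '__main__':
    main()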

(Python) correct use of greenlets

I have a remotely installed BeagleBone Black that needs to control a measurement device and a pan/tilt head, upload measured data, host a telnet server, and more.
I'm using Python 2.7
This is the first project in which I need to program, so a lot of questions come up.
I'd mostly like to know if what I'm doing is a reasonable way of handling what I need and why certain things don't do what I think.
Certain modules need to work together and share data. The best example is the telnet module: for instance, when a telnet user requests the position of the pan/tilt head.
As I understand it, the server is blocking the program, so I use gevent/Greenlets to run it from the "main" script.
Stripped down versions:
teln module
from gevent import monkey; monkey.patch_all()  # patch functions to use gevent
import gevent
import gevent.server
from telnetsrv.green import TelnetHandler, command

__all__ = ["MyTelnetHandler", "start_server"]  # used when module is loaded as "from teln import *"

class MyTelnetHandler(TelnetHandler):
    """Telnet implementation."""
    def writeerror(self, text):
        """Write errors in red, preceded by 'ERROR: '."""
        TelnetHandler.writeerror(self, "\n\x1b[31;5;1mERROR: {}\x1b[0m\n".format(text))

    @command(["exit", "logout", "quit"], hidden=True)
    def dummy(self, params):
        """Disables these commands and gets them out of the "help" listing."""
        pass

def start_server():
    """Server constructor, starts server."""
    server = gevent.server.StreamServer(("", 2323), MyTelnetHandler.streamserver_handle)
    print("server created")
    try:
        server.serve_forever()
    finally:
        server.close()
        print("server finished")

"""Main loop"""
if __name__ == "__main__":
    start_server()
Main script:
#! /usr/bin/env python
# coding: utf-8
from gevent import monkey; monkey.patch_all()  # patch functions to gevent versions
import gevent
from gevent import Greenlet
import teln  # telnet handler
from time import sleep
from sys import exit

"""Main loop"""
if __name__ == "__main__":
    thread_telnet = Greenlet(teln.start_server)
    print("greenlet created")
    thread_telnet.start()
    print("started")
    sleep(10)
    print("done sleeping")
    i = 1
    try:
        while not thread_telnet.ready():
            print("loop running ({:03d})".format(i))
            i += 1
            sleep(1)
    except KeyboardInterrupt:
        print("interrupted")
        thread_telnet.kill()
        print("killed")
        exit()
The final main loop would need to run many more functions.
Questions:
Is this a reasonable way of running processes/functions at the same time?
How do I get a function in the telnet module to call functions from a third module, controlling the head?
How do I make sure that the head isn't being controlled by the telnet module as well as the main script (which runs some kind of schedule)?
In the "def start_server()" function in teln module, two print commands are called when starting and stopping the server. I do not see these appearing in the terminal. What could be happening?
When I open a telnet session from a remote machine, and then close it, I get the following output (program keeps running):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 536, in run
    result = self._run(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/telnetsrv/telnetsrvlib.py", line 815, in inputcooker
    c = self._inputcooker_getc()
  File "/usr/local/lib/python2.7/dist-packages/telnetsrv/telnetsrvlib.py", line 776, in _inputcooker_getc
    ret = self.sock.recv(20)
  File "/usr/local/lib/python2.7/dist-packages/gevent/_socket2.py", line 283, in recv
    self._wait(self._read_event)
  File "/usr/local/lib/python2.7/dist-packages/gevent/_socket2.py", line 182, in _wait
    self.hub.wait(watcher)
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 651, in wait
    result = waiter.get()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 898, in get
    return self.hub.switch()
  File "/usr/local/lib/python2.7/dist-packages/gevent/hub.py", line 630, in switch
    return RawGreenlet.switch(self)
cancel_wait_ex: [Errno 9] File descriptor was closed in another greenlet
Fri Sep 22 09:31:12 2017 <Greenlet at 0xb6987bc0L: <bound method MyTelnetHandler.inputcooker of <teln.MyTelnetHandler instance at 0xb69a1c38>>> failed with cancel_wait_ex
While trying out different things to understand how greenlets work, I have often received similar error messages ("cancel_wait_ex: [Errno 9] File descriptor was closed in another greenlet").
I have searched around but can't find/understand what is happening and what I am supposed to do.
If something goes wrong while running a greenlet, I do not get the exception that points to the problem (for instance when I try to print an integer), but a similar error message as above. How can I see the "original" raised exception?
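For what it's worth, gevent stores a failed greenlet's exception on its .exception attribute and re-raises it in the caller when you call .get(); a minimal sketch (gevent also prints the failure traceback to stderr when the greenlet dies):

import gevent

def boom():
    print(1 + 'a')  # raises TypeError inside the greenlet

g = gevent.spawn(boom)
g.join()            # waits for the greenlet, but swallows the failure
print(g.exception)  # the original exception object

try:
    g.get()         # re-raises the original exception in this greenlet
except TypeError as e:
    print('caught:', e)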

Running several ApplicationSessions non-blockingly using autobahn.asyncio.wamp

I'm trying to run two autobahn.asyncio.wamp.ApplicationSessions in Python at the same time. Previously, I did this using a modification of the autobahn library, as suggested in this post's answer. I now require a somewhat more professional solution.
After googling around for a while, this post appeared quite promising, but it uses the twisted library instead of asyncio. I wasn't able to identify a similar solution for the asyncio branch of the autobahn library, since it doesn't appear to use Reactors.
The main problem I have is that ApplicationRunner.run() is blocking (which is why I previously outsourced it to a thread), so I can't just run a second ApplicationRunner after it.
I do need to access 2 websocket channels at the same time, which I cannot appear to do with a single ApplicationSession.
My Code so far:
from autobahn.asyncio.wamp import ApplicationSession
from autobahn.asyncio.wamp import ApplicationRunner
from asyncio import coroutine
import time

channel1 = 'BTC_LTC'
channel2 = 'BTC_XMR'

class LTCComponent(ApplicationSession):
    def onConnect(self):
        self.join(self.config.realm)

    @coroutine
    def onJoin(self, details):
        def onTicker(*args, **kwargs):
            print('LTCComponent', args, kwargs)
        try:
            yield from self.subscribe(onTicker, channel1)
        except Exception as e:
            print("Could not subscribe to topic:", e)

class XMRComponent(ApplicationSession):
    def onConnect(self):
        self.join(self.config.realm)

    @coroutine
    def onJoin(self, details):
        def onTicker(*args, **kwargs):
            print('XMRComponent', args, kwargs)
        try:
            yield from self.subscribe(onTicker, channel2)
        except Exception as e:
            print("Could not subscribe to topic:", e)

def main():
    runner = ApplicationRunner("wss://api.poloniex.com:443", "realm1", extra={})
    runner.run(LTCComponent)
    runner.run(XMRComponent)  # <- is not being called

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        quit()
    except Exception as e:
        print(time.time(), e)
My knowledge of the autobahn library is limited, and I'm afraid the documentation isn't improving my situation much. Am I overlooking something here? A function or a parameter that would enable me to either combine my components or run them both at once?
Perhaps a similar solution as provided here, which implements an alternative ApplicationRunner ?
Related Topics
Running two ApplicationSessions in twisted
Running Autobahn ApplicationRunner in Thread
Autobahn.wamp.ApplicationSession Source
Autobahn.wamp.Applicationrunner Source
As requested, the traceback from @stovfl's answer using the multithreading code:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/nils/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/nils/git/tools/gemini_wss/t2.py", line 27, in run
    self.appRunner.run(self.__ApplicationSession)
  File "/home/nils/anaconda3/lib/python3.5/site-packages/autobahn-0.14.1-py3.5.egg/autobahn/asyncio/wamp.py", line 143, in run
    transport_factory = WampWebSocketClientFactory(create, url=self.url, serializers=self.serializers)
  File "/home/nils/anaconda3/lib/python3.5/site-packages/autobahn-0.14.1-py3.5.egg/autobahn/asyncio/websocket.py", line 319, in __init__
    WebSocketClientFactory.__init__(self, *args, **kwargs)
  File "/home/nils/anaconda3/lib/python3.5/site-packages/autobahn-0.14.1-py3.5.egg/autobahn/asyncio/websocket.py", line 268, in __init__
    self.loop = loop or asyncio.get_event_loop()
  File "/home/nils/anaconda3/lib/python3.5/asyncio/events.py", line 626, in get_event_loop
    return get_event_loop_policy().get_event_loop()
  File "/home/nils/anaconda3/lib/python3.5/asyncio/events.py", line 572, in get_event_loop
    % threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'Thread-2'.
Exception in thread Thread-1:
**Same as in Thread-2**
...
RuntimeError: There is no current event loop in thread 'Thread-1'.
As I see from the traceback, we only reach Step 2 of 4
From the asyncio docs:
This module provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources
So I drop my first proposal using multithreading.
I could imagine the following three options:
Do it with multiprocessing instead of multithreading
Do it with coroutine inside asyncio loop
Switch between channels in def onJoin(self, details)
Second proposal: the first option, using multiprocessing.
With separate processes I can start two asyncio loops, so appRunner.run(...) should work.
You can use a single ApplicationSession class if the channel is the only difference.
If you need to pass a different ApplicationSession class, add it to args=:
class __ApplicationSession(ApplicationSession):
    # ...
        try:
            yield from self.subscribe(onTicker, self.config.extra['channel'])
        except Exception as e:
            # ...

import multiprocessing as mp
import time

def ApplicationRunner_process(realm, channel):
    appRunner = ApplicationRunner("wss://api.poloniex.com:443", realm, extra={'channel': channel})
    appRunner.run(__ApplicationSession)

if __name__ == "__main__":
    AppRun = [{'process': None, 'channel': 'BTC_LTC'},
              {'process': None, 'channel': 'BTC_XMR'}]
    for app in AppRun:
        app['process'] = mp.Process(target=ApplicationRunner_process, args=('realm1', app['channel']))
        app['process'].start()
        time.sleep(0.1)
    AppRun[0]['process'].join()
    AppRun[1]['process'].join()
Following the approach you linked for twisted, I managed to get the same behaviour with asyncio by setting start_loop=False:
import asyncio
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner

class MyApplicationSession(ApplicationSession):
    def __init__(self, cfg):
        super().__init__(cfg)
        self.cli_id = cfg.extra['cli_id']

    def onJoin(self, details):
        print("session attached", self.cli_id)

# url and realm as in your setup
runner1 = ApplicationRunner(url, realm, extra={'cli_id': 1})
coro1 = runner1.run(MyApplicationSession, start_loop=False)

runner2 = ApplicationRunner(url, realm, extra={'cli_id': 2})
coro2 = runner2.run(MyApplicationSession, start_loop=False)

asyncio.get_event_loop().run_until_complete(coro1)
asyncio.get_event_loop().run_until_complete(coro2)
asyncio.get_event_loop().run_forever()

Pexpect: async expect with several connections

I'm trying to write a script for gathering some information from network devices using Pexpect's async expect (Python 3.5.1 and pexpect from GitHub), and I'm seeing something strange: everything works fine with several devices but fails with some more (usually > 5-6). I wrote this simple script for testing:
import asyncio
import pexpect

@asyncio.coroutine
def test_ssh_expect_async(num):
    print('Task #{0} start'.format(num))
    p = pexpect.spawn('ssh localhost', encoding='utf8')
    # p.logfile = sys.stdout
    yield from p.expect('password', async=True)
    p.sendline('***')
    yield from p.expect(r'@self-VirtualBox\:', async=True)
    p.sendline('uptime')
    yield from p.expect(r'@self-VirtualBox\:', async=True)
    p.sendline('uname -a')
    yield from p.expect(r'@self-VirtualBox\:', async=True)
    p.sendline('ll')
    yield from p.expect(r'@self-VirtualBox\:', async=True)
    print('Task #{0} end'.format(num))

@asyncio.coroutine
def test_loop():
    tasks = []
    for i in range(1, 5):
        tasks.append(test_ssh_expect_async(i))
    yield from asyncio.wait(tasks)
    print('All Tasks done')

print('--------------Async--------------------')
loop = asyncio.get_event_loop()
loop.run_until_complete(test_loop())
If I try to use range(1, 3), for example, I get this:
self@self-VirtualBox:/media/sf_netdev$ python3 simple-test.py
--------------Async--------------------
Task #3 running
Task #1 running
Task #2 running
Task #3 closed
Task #1 closed
Task #2 closed
All Tasks done
But if I increase the upper limit, I get some errors:
self@self-VirtualBox:/media/sf_netdev$ python3 simple-test.py
--------------Async--------------------
Task #3 running
Task #1 running
Task #4 running
Task #2 running
Exception in callback BaseSelectorEventLoop.add_reader(11, <bound method...d=11 polling>>)
handle: <Handle BaseSelectorEventLoop.add_reader(11, <bound method...d=11 polling>>)>
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 234, in add_reader
    key = self._selector.get_key(fd)
  File "/usr/lib/python3.5/selectors.py", line 191, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '11 is not registered'
During handling of the above exception, another exception occurred:
...
Why does this happen? How do I write a working script with async pexpect?
---------------Answer------------
It was a bug: https://github.com/pexpect/pexpect/issues/347. The pexpect team has since fixed it.
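Note that in later pexpect releases the keyword argument is spelled async_ (async became a reserved word in Python 3.7). A minimal sketch of the modern form, assuming the same ssh-to-localhost setup as above:

import asyncio
import pexpect

async def get_uptime():
    p = pexpect.spawn('ssh localhost', encoding='utf8')
    await p.expect('password', async_=True)  # 'async_' replaces the old 'async' kwarg
    p.sendline('***')
    await p.expect(r'@self-VirtualBox\:', async_=True)
    p.sendline('uptime')
    await p.expect(r'@self-VirtualBox\:', async_=True)
    p.close()

asyncio.run(get_uptime())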

How to stop a TimerService from within the LoopingCall

Background
I've recently been part of a project where Twisted was used. We utilized a TimerService to daemonize a process. And yes, I realize this approach may be overkill, but we're trying to stay consistent and use a proven framework. Yesterday, an exception went unhandled within the LoopingCall, which caused the TimerService to fail, but the twistd application kept running (see the twisted enhancement request). To avoid this, we would like to stop the service at the end of a catch-all exception handler.
Question
How can I stop both the TimerService and the twistd application from within the LoopingCall's callable? My concern is that the Linux process keeps running when the TimerService fails to handle an exception, even though the TimerService isn't looping anymore.
For example:
def some_callable():
    try:
        pass  # do stuff
    except SomeSpecificError as ex:
        pass  # handle & log error
    except SomeOtherSpecificError as ex:
        pass  # handle & log error
    except:
        # log sys.exc_info() details
        # stop the service here
        pass
NOTE: The following does not work within the callable.
from twisted.internet import reactor
reactor.stop()
You can't stop the reactor before it starts:
>>> from twisted.internet import reactor
>>> reactor.stop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/internet/base.py", line 570, in stop
    "Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
>>>
However, as long as the reactor is running already, reactor.stop works fine:
>>> from twisted.internet import reactor
>>> reactor.callLater(3, reactor.stop)
<twisted.internet.base.DelayedCall instance at 0xb762d2ec>
>>> reactor.run()
[... pause ...]
>>>
TimerService is a wrapper around LoopingCall. More specifically, when it starts its LoopingCall, it passes now=True to start(). That causes the function to be called immediately the first time, rather than only after the specified interval has elapsed once.
So when TimerService.startService is called, your function is called, and the reactor isn't running yet. On that first call to your function, you can't stop the reactor, because it hasn't been started.
This program:
from twisted.application.internet import TimerService

def foo():
    from twisted.internet import reactor
    reactor.stop()

from twisted.application.service import Application
application = Application("timer stop")
TimerService(3, foo).setServiceParent(application)
produces these results:
exarkun@boson:/tmp$ twistd -ny timerstop.tac
2011-03-08 11:46:19-0500 [-] Log opened.
2011-03-08 11:46:19-0500 [-] using set_wakeup_fd
2011-03-08 11:46:19-0500 [-] twistd 10.2.0+r30835 (/usr/bin/python 2.6.4) starting up.
2011-03-08 11:46:19-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-03-08 11:46:19-0500 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/application/service.py", line 277, in startService
            service.startService()
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/application/internet.py", line 284, in startService
            self._loop.start(self.step, now=True).addErrback(self._failed)
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/internet/task.py", line 163, in start
            self()
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/internet/task.py", line 194, in __call__
            d = defer.maybeDeferred(self.f, *self.a, **self.kw)
        --- <exception caught here> ---
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/internet/defer.py", line 133, in maybeDeferred
            result = f(*args, **kw)
          File "timerstop.py", line 5, in foo
            reactor.stop()
          File "/home/exarkun/Projects/Twisted/branches/simplify-ssl-4905/twisted/internet/base.py", line 570, in stop
            "Can't stop reactor that isn't running.")
        twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
However, this one works fine:
from twisted.application.internet import TimerService

counter = 0

def foo():
    global counter
    if counter == 1:
        from twisted.internet import reactor
        reactor.stop()
    else:
        counter += 1

from twisted.application.service import Application
application = Application("timer stop")
TimerService(3, foo).setServiceParent(application)
And slightly less grossly, so does this one:
from twisted.application.internet import TimerService

def foo():
    from twisted.internet import reactor
    reactor.callWhenRunning(reactor.stop)

from twisted.application.service import Application
application = Application("timer stop")
TimerService(3, foo).setServiceParent(application)
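This version works even on the immediate first call because callWhenRunning runs its callable right away if the reactor is already running and otherwise defers it until startup completes, so reactor.stop() is never invoked on a reactor that hasn't started.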
