Stream scrapy logging output to websocket - python

I am attempting to build an API that will run a Scrapy web spider when requested via a websocket message.
I would like to forward the logging output to the websocket client so it can see what's going on in the (sometimes quite long-running) process. When finished, I will also send the scraped results.
As it is possible to run Scrapy in-process, I would like to do exactly that. I found a solution that will stream an external process to a websocket here, but that doesn't seem right if it's possible to run Scrapy inside the server.
https://tomforb.es/displaying-a-processes-output-on-a-web-page-with-websockets-and-python
There are two ways I can imagine to make this work in Twisted: somehow using a LogObserver, or defining a LogHandler (probably a StreamHandler with StringIO) and then handling the stream in some way in Twisted with autobahn.websocket classes like WebSocketServerProtocol.
Now I am quite stuck and don't know how to connect the ends.
Could someone please provide a short example of how to stream logging output from Twisted logging (avoiding a file if possible) to a websocket client?

I managed to solve this by myself somehow and wanted to let you know how I did it:
The basic idea was to have a process that gets called remotely and output a streaming log to a client, usually a browser.
Instead of building all the nasty details myself, I decided to go with autobahn.ws and crossbar.io, which provide pubsub and RPC via the WAMP protocol (essentially just JSON over WebSockets): exactly what I had planned to build, just way more advanced!
Here is a very basic example:
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.wamp import ApplicationSession
from example.spiders.basic_spider import BasicSpider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import logging


class PublishLogToSessionHandler(logging.Handler):
    def __init__(self, session, channel):
        logging.Handler.__init__(self)
        self.session = session
        self.channel = channel

    def emit(self, record):
        self.session.publish(self.channel, record.getMessage())


class AppSession(ApplicationSession):

    configure_logging(install_root_handler=False)

    @inlineCallbacks
    def onJoin(self, details):
        logging.root.addHandler(PublishLogToSessionHandler(self, 'com.example.crawler.log'))

        # REGISTER a procedure for remote calling
        def crawl(domain):
            runner = CrawlerRunner(get_project_settings())
            runner.crawl("basic", domain=domain)
            return "Running..."

        yield self.register(crawl, 'com.example.crawler.crawl')
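For completeness, the log records published to 'com.example.crawler.log' can be consumed by any WAMP client. Below is a minimal subscriber sketch; the router URL ws://localhost:8080/ws and realm "realm1" are assumptions, so adjust them to your Crossbar deployment:

from autobahn.twisted.wamp import ApplicationSession, ApplicationRunner


class LogSubscriber(ApplicationSession):

    def onJoin(self, details):
        def on_log(message):
            # Each published Scrapy log record arrives here as plain text.
            print("crawler log:", message)

        # Subscribe to the topic the server-side handler publishes to.
        return self.subscribe(on_log, 'com.example.crawler.log')


if __name__ == '__main__':
    runner = ApplicationRunner(u"ws://localhost:8080/ws", u"realm1")
    runner.run(LogSubscriber)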

Related

HTTP server kick-off background python script without blocking

I'd like to be able to trigger a long-running python script via a web request, in bare-bones fashion. Also, I'd like to be able to trigger other copies of the script with different parameters while initial copies are still running.
I've looked at flask, aiohttp, and queueing possibilities. Flask and aiohttp seem to have the least overhead to set up. I plan on executing the existing python script via subprocess.run (however, I did consider refactoring the script into libraries that could be used in the web response function).
With aiohttp, I'm trying something like:
ingestion_service.py:
from aiohttp import web
from pprint import pprint
import time

routes = web.RouteTableDef()


@routes.get("/ingest_pipeline")
async def test_ingest_pipeline(request):
    '''
    Get the job_conf specified from the request and activate the script
    '''
    # subprocess.run the command with lookup of job conf file
    response = web.Response(text=f"Received data ingestion request")
    await response.prepare(request)
    await response.write_eof()
    # eventually this would be subprocess.run call
    time.sleep(80)
    return response


def init_func(argv):
    app = web.Application()
    app.add_routes(routes)
    return app
But though the initial request returns immediately, subsequent requests block until the initial request is complete. I'm running a server via:
python -m aiohttp.web -H localhost -P 8080 ingestion_service:init_func
I know that multithreading and concurrency may provide better solutions than asyncio. In this case, I'm not looking for a robust solution, just something that will allow me to run multiple scripts at once via http request, ideally with minimal memory costs.
OK, there were a couple of issues with what I was doing. Namely, time.sleep() is blocking, so asyncio.sleep() should be used. However, since I'm interested in spawning a subprocess, I can use asyncio.subprocess to do that in a non-blocking fashion.
NB:
asyncio: run one function threaded with multiple requests from websocket clients
https://docs.python.org/3/library/asyncio-subprocess.html
Using these helps, but there's still an issue with the web handler terminating the subprocess. Luckily, there's a solution here:
https://docs.aiohttp.org/en/stable/web_advanced.html
aiojobs has a decorator "atomic" that will protect the process until it is complete. So, code along these lines will function:
from aiojobs.aiohttp import setup, atomic
import asyncio
import os
from aiohttp import web


@atomic
async def ingest_pipeline(request):
    # be careful what you pass through to shell, lest you
    # give away the keys to the kingdom
    shell_command = "[your command here]"
    response_text = f"running {shell_command}"
    response_code = 200
    response = web.Response(text=response_text, status=response_code)
    await response.prepare(request)
    await response.write_eof()
    ingestion_process = await asyncio.create_subprocess_shell(
        shell_command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE)
    stdout, stderr = await ingestion_process.communicate()
    return response


def init_func(argv):
    app = web.Application()
    setup(app)
    app.router.add_get('/ingest_pipeline', ingest_pipeline)
    return app
This is very bare bones, but might help others looking for a quick skeleton for a temporary internal solution.
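To sanity-check that the handlers no longer block one another, here is a rough client sketch (assuming the server above is running on localhost:8080, as in the python -m aiohttp.web invocation earlier) that fires several requests concurrently; each should come back right away while its subprocess keeps running server-side:

import asyncio
import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:

        async def trigger(i):
            # Each request should return quickly even while earlier
            # ingestion subprocesses are still running on the server.
            async with session.get("http://localhost:8080/ingest_pipeline") as resp:
                print(i, resp.status)

        await asyncio.gather(*(trigger(i) for i in range(3)))


asyncio.run(main())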

Running a Tornado Server within a Jupyter Notebook

Taking the standard Tornado demonstration and pushing the IOLoop into a background thread allows querying of the server within a single script. This is useful when the Tornado server is an interactive object (see Dask or similar).
import asyncio
import requests
import tornado.ioloop
import tornado.web
from concurrent.futures import ThreadPoolExecutor


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")


def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
    ])


pool = ThreadPoolExecutor(max_workers=2)
loop = tornado.ioloop.IOLoop()
app = make_app()
app.listen(8888)
fut = pool.submit(loop.start)
print(requests.get("http://localhost:8888"))
The above works just fine in a standard Python script (though it is missing safe shutdown). Jupyter notebooks are an optimal environment for these interactive Tornado servers. However, when it comes to Jupyter this idea breaks down, as there is already an active running loop:
>>> import asyncio
>>> asyncio.get_event_loop()
<_UnixSelectorEventLoop running=True closed=False debug=False>
This is seen when running the above script in a Jupyter notebook: both the server and the request client are trying to open a connection in the same thread, and the code hangs. Building a new asyncio loop and/or Tornado IOLoop does not seem to help, and I suspect I am missing something in Jupyter itself.
The question: Is it possible to have a live Tornado server running in the background within a Jupyter notebook so that standard Python requests or similar can connect to it from the primary thread? I would prefer to avoid asyncio in the code presented to users if possible, due to its relative complexity for novice users.
Based on my recent PR to streamz, here is something that works, similar to your idea:
class InNotebookServer(object):
    def __init__(self, port):
        self.port = port
        self.loop = get_ioloop()
        self.start()

    def _start_server(self):
        from tornado.web import Application, RequestHandler
        from tornado.httpserver import HTTPServer
        from tornado import gen

        class Handler(RequestHandler):
            source = self

            @gen.coroutine
            def get(self):
                self.write('Hello World')

        application = Application([
            ('/', Handler),
        ])
        self.server = HTTPServer(application)
        self.server.listen(self.port)

    def start(self):
        """Start HTTP server and listen"""
        self.loop.add_callback(self._start_server)


_io_loops = []


def get_ioloop():
    from tornado.ioloop import IOLoop
    import threading
    if not _io_loops:
        loop = IOLoop()
        thread = threading.Thread(target=loop.start)
        thread.daemon = True
        thread.start()
        _io_loops.append(loop)
    return _io_loops[0]
To call in the notebook
In [2]: server = InNotebookServer(9005)
In [3]: import requests
requests.get('http://localhost:9005')
Out[3]: <Response [200]>
Part 1: Let's get nested tornado(s)
To find the information you need, you would have had to follow the following breadcrumb trails: start by looking at what is described in the release notes of IPython 7. In particular, it will point you to more information in the async and await sections of the documentation, and to this discussion, which suggests the use of nest_asyncio.
The crux is the following:
A) Either you trick Python into running two nested event loops (this is what nest_asyncio does).
B) You schedule coroutines on the already existing event loop (I'm not sure how to do that with Tornado).
I'm pretty sure you know all that, but I'm sure other readers will appreciate it.
There is unfortunately no way to make it totally transparent to users (well, unless you control the deployment, like on a JupyterHub, and can add these lines to the IPython startup scripts that are automatically loaded). But I think the following is simple enough.
import nest_asyncio
nest_asyncio.apply()
# rest of your tornado setup and start code.
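For concreteness, a minimal sketch of what that cell might look like, reusing the MainHandler from the question; the port is arbitrary, and this assumes a recent Tornado running on the notebook's asyncio loop:

import nest_asyncio
nest_asyncio.apply()

import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")


app = tornado.web.Application([(r"/", MainHandler)])
app.listen(8000)
# No explicit loop.start() here: the notebook's event loop is already
# running, so the server begins serving once the cell finishes.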
Part 2: Gotcha: synchronous code blocks the event loop.
The previous section only takes care of being able to run the Tornado app. But note that any synchronous code will block the event loop; thus, when running print(requests.get("http://localhost:8000")), the server will appear not to work, because you are blocking the event loop, which will resume only when your code finishes executing, and your code is itself waiting for the event loop to resume... (understanding this is an exercise left to the reader). You need to either issue print(requests.get("http://localhost:8000")) from another kernel, or use aiohttp.
Here is how to use aiohttp in a similar way to requests.
import aiohttp
session = aiohttp.ClientSession()
await session.get('http://localhost:8889')
In this case, as aiohttp is non-blocking, things will appear to work properly. You can see some extra IPython magic here, where we autodetect async code and run it on the current event loop.
A cool exercise could be to run requests.get in a loop in another kernel, run sleep(5) in the kernel where Tornado is running, and see that we stop processing requests...
Part 3: Disclaimer and other routes:
This is quite tricky, and I would advise not using it in production; warn your users that this is not the recommended way of doing things.
That does not completely solve your case; you would need to run things not in the main thread, which I'm not sure is possible.
You can also try to play with other loop runners like trio and curio; they might allow you to do stuff you can't with asyncio by default, like nesting, but here be dragons. I highly recommend trio and the multiple blog posts around its creation, especially if you are teaching async.
Enjoy, hope that helped, and please report bugs, as well as things that did work.
You can make the Tornado server run in the background using the %%script --bg magic command. The option --bg tells Jupyter to run the code of the current cell in the background.
Just create a Tornado server in one cell along with the magic command and run that cell.
Example:
%%script python --bg

import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")


def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
    ])


loop = tornado.ioloop.IOLoop.current()
app = make_app()
app.listen(8000)  # 8888 was being used by jupyter in my case
loop.start()
And then you can use requests in a separate cell to connect to the server:
import requests
print(requests.get("http://localhost:8000"))
# prints <Response [200]>
One thing to note here is that if you stop/interrupt the kernel on any cell, the background script will also stop. So you'll have to run this cell again to start the server.

Disable logging for a python module in one context but not another

I have some code that uses the requests module to communicate with a logging API. However, requests itself, through urllib3, does logging. Naturally, I need to disable logging so that requests to the logging API don't cause an infinite loop of logs. So, in the module I do the logging calls in, I do logging.getLogger("requests").setLevel(logging.CRITICAL) to mute routine request logs.
However, this code is intended to load and run arbitrary user code. Since the python logging module apparently uses global state to manage settings for a given logger, I am worried the user's code might turn logging back on and cause problems, for instance if they naively use the requests module in their code without realizing I have disabled logging for it for a reason.
How can I disable logging for the requests module when it is executed from the context of my code, but not affect the state of the logger for the module from the perspective of the user? Some sort of context manager that silences calls to logging for code within the manager would be ideal. Being able to load the requests module with a unique __name__ so the logger uses a different name could also work, though it's a bit convoluted. I can't find a way to do either of these things, though.
Regrettably, the solution will need to handle multiple threads, so procedurally turning off logging, then running the API call, then turning it back on will not work as global state is mutated.
I think I've got a solution for you:
The logging module is built to be thread-safe:
The logging module is intended to be thread-safe without any special work needing to be done by its clients. It achieves this through using threading locks; there is one lock to serialize access to the module's shared data, and each handler also creates a lock to serialize access to its underlying I/O.
Fortunately, it exposes the second lock mentioned through a public API: Handler.acquire() lets you acquire a lock for a particular log handler (and Handler.release() releases it again). Acquiring that lock will block all other threads that try to log a record that would be handled by this handler until the lock is released.
This allows you to manipulate the handler's state in a thread-safe way. The caveat is this: Because it's intended as a lock around the I/O operations of the handler, the lock will only be acquired in emit(). So only once a record makes it through filters and log levels and would be emitted by a particular handler will the lock be acquired. That's why I had to subclass a handler and create the SilencableHandler.
So the idea is this:
Get the topmost logger for the requests module and stop propagation for it
Create your custom SilencableHandler and add it to the requests logger
Use the Silenced context manager to selectively silence the SilencableHandler
main.py
from Queue import Queue
from threading import Thread
from usercode import fetch_url
import logging
import requests
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


class SilencableHandler(logging.StreamHandler):

    def __init__(self, *args, **kwargs):
        self.silenced = False
        return super(SilencableHandler, self).__init__(*args, **kwargs)

    def emit(self, record):
        if not self.silenced:
            super(SilencableHandler, self).emit(record)


requests_logger = logging.getLogger('requests')
requests_logger.propagate = False
requests_handler = SilencableHandler()
requests_logger.addHandler(requests_handler)


class Silenced(object):

    def __init__(self, handler):
        self.handler = handler

    def __enter__(self):
        log.info("Silencing requests logger...")
        self.handler.acquire()
        self.handler.silenced = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.handler.silenced = False
        self.handler.release()
        log.info("Requests logger unsilenced.")


NUM_THREADS = 2
queue = Queue()

URLS = [
    'http://www.stackoverflow.com',
    'http://www.stackexchange.com',
    'http://www.serverfault.com',
    'http://www.superuser.com',
    'http://travel.stackexchange.com',
]

for i in range(NUM_THREADS):
    worker = Thread(target=fetch_url, args=(i, queue,))
    worker.setDaemon(True)
    worker.start()

for url in URLS:
    queue.put(url)

log.info('Starting long API request...')

with Silenced(requests_handler):
    time.sleep(5)
    requests.get('http://www.example.org/api')
    time.sleep(5)
    log.info('Done with long API request.')

queue.join()
usercode.py
import logging
import requests
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def fetch_url(i, q):
    while True:
        url = q.get()
        response = requests.get(url)
        logging.info("{}: {}".format(response.status_code, url))
        time.sleep(i + 2)
        q.task_done()
Example output:
(Notice how the call to http://www.example.org/api isn't logged, and all threads that try to log requests are blocked for the first 10 seconds).
INFO:__main__:Starting long API request...
INFO:__main__:Silencing requests logger...
INFO:__main__:Requests logger unsilenced.
INFO:__main__:Done with long API request.
Starting new HTTP connection (1): www.stackoverflow.com
Starting new HTTP connection (1): www.stackexchange.com
Starting new HTTP connection (1): stackexchange.com
Starting new HTTP connection (1): stackoverflow.com
INFO:root:200: http://www.stackexchange.com
INFO:root:200: http://www.stackoverflow.com
Starting new HTTP connection (1): www.serverfault.com
Starting new HTTP connection (1): serverfault.com
INFO:root:200: http://www.serverfault.com
Starting new HTTP connection (1): www.superuser.com
Starting new HTTP connection (1): superuser.com
INFO:root:200: http://www.superuser.com
Starting new HTTP connection (1): travel.stackexchange.com
INFO:root:200: http://travel.stackexchange.com
Threading code is based on Doug Hellmann's articles on threading and queues.

Gevent async server with blocking requests

I have what I would think is a pretty common use case for Gevent. I need a UDP server that listens for requests, and based on the request submits a POST to an external web service. The external web service essentially only allows one request at a time.
I would like to have an asynchronous UDP server so that data can be immediately retrieved and stored so that I don't miss any requests (this part is easy with the DatagramServer gevent provides). Then I need some way to send requests to the external web service serially, but in such a way that it doesn't ruin the async of the UDP server.
I first tried monkey patching everything and what I ended up with was a quick solution, but one in which my requests to the external web service were not rate limited in any way and which resulted in errors.
It seems like what I need is a single non-blocking worker to send requests to the external web service in serial while the UDP server adds tasks to the queue from which the non-blocking worker is working.
What I need is information on running a gevent server with additional greenlets for other tasks (especially with a queue). I've been using the serve_forever function of the DatagramServer and think that I'll need to use the start method instead, but haven't found much information on how it would fit together.
Thanks,
EDIT
The answer worked very well. I've adapted the UDP server example code with the answer from @mguijarr to produce a working example for my use case:
from __future__ import print_function
from gevent.server import DatagramServer
import gevent.queue
import gevent.monkey
import urllib

gevent.monkey.patch_all()

n = 0


def process_request(q):
    while True:
        request = q.get()
        print(request)
        print(urllib.urlopen('https://test.com').read())


class EchoServer(DatagramServer):
    __q = gevent.queue.Queue()
    __request_processing_greenlet = gevent.spawn(process_request, __q)

    def handle(self, data, address):
        print('%s: got %r' % (address[0], data))
        global n
        n += 1
        print(n)
        self.__q.put(n)
        self.socket.sendto('Received %s bytes' % len(data), address)


if __name__ == '__main__':
    print('Receiving datagrams on :9000')
    EchoServer(':9000').serve_forever()
Here is how I would do it:
Write a function taking a "queue" object as argument; this function will continuously process items from the queue. Each item is supposed to be a request for the web service.
This function could be a module-level function, not part of your DatagramServer instance:
def process_requests(q):
    while True:
        request = q.get()
        # do your magic with 'request'
        ...
In your DatagramServer, make the function run within a greenlet (like a background task):
self.__q = gevent.queue.Queue()
self.__request_processing_greenlet = gevent.spawn(process_requests, self.__q)
when you receive the UDP request in your DatagramServer instance, you push the request to the queue
self.__q.put(request)
This should do what you want. You still call 'serve_forever' on DatagramServer, no problem.

Python. Tornado. Non-blocking xmlrpc client

Basically, we can call XML-RPC handlers in the following way:
import xmlrpclib
s = xmlrpclib.ServerProxy('http://remote_host/rpc/')
print s.system.listmethods()
In tornado we can integrate it like this:
import xmlrpclib
import tornado.web

s = xmlrpclib.ServerProxy('http://remote_host/rpc/')


class MyHandler(tornado.web.RequestHandler):
    def get(self):
        result = s.system.listmethods()
I have the following, somewhat newbie, questions:
Will result = s.system.listmethods() block tornado?
Are there any non-blocking xmlrpc clients around?
How can we achieve result = yield gen.Task(s.system.listmethods)?
1. Yes, it will block Tornado, since xmlrpclib uses blocking Python sockets (as it is).
2. Not that I'm aware of, but I'll provide a solution where you can keep xmlrpclib but have it async.
3. My solution doesn't use tornado.gen.
OK, so one useful library to have in mind whenever you're doing networking and need to write async code is gevent; it's a really good, high-quality library that I would recommend to everyone.
Why is it good and easy to use?
You can write asynchronous code in a synchronous manner (so that makes it easy).
All you have to do is monkey patch with one simple line:
from gevent import monkey; monkey.patch_all()
When using Tornado you need to know two things (that you may already know):
Tornado only supports asynchronous views when acting as an HTTPServer (WSGI isn't supported for async views).
Async views need to terminate the responses by themselves, which you do by using either self.finish() or self.render() (which calls self.finish()).
OK, so here's an example illustrating what you would need, with the necessary gevent integration with Tornado:
# Python imports
import functools

# Gevent imports (monkey patch as early as possible, as described above)
from gevent import monkey; monkey.patch_all()
import gevent

# Tornado imports
import tornado.ioloop
import tornado.web
import tornado.httpserver

# XMLRpc imports
import xmlrpclib


# Asynchronous gevent decorator
def gasync(func):
    @tornado.web.asynchronous
    @functools.wraps(func)
    def f(self, *args, **kwargs):
        return gevent.spawn(func, self, *args, **kwargs)
    return f


# Our XML RPC service
xml_service = xmlrpclib.ServerProxy('http://remote_host/rpc/')


class MyHandler(tornado.web.RequestHandler):
    @gasync
    def get(self):
        # This doesn't block tornado thanks to gevent
        # Which patches all of xmlrpclib's socket calls
        # So they no longer are blocking
        result = xml_service.system.listmethods()

        # Do something here

        # Write response to client
        self.write('hello')
        self.finish()


# Our URL Mappings
handlers = [
    (r"/", MyHandler),
]


def main():
    # Setup app and HTTP server
    application = tornado.web.Application(handlers)
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8000)

    # Start ioloop
    tornado.ioloop.IOLoop.instance().start()


if __name__ == "__main__":
    main()
So give the example a try (adapt it to your needs obviously) and you should be good to go.
No need to write any extra code: gevent does all the work of patching up Python sockets so they can be used asynchronously while you still write code in a synchronous fashion (which is a real bonus).
Hope this helps :)
I do not think so, because Tornado has its own ioloop, while gevent's ioloop is libevent.
So gevent will block Tornado's ioloop.
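As an aside on question 3: if you would rather avoid gevent entirely, one alternative (not taken from either answer above) is to push the blocking xmlrpclib call onto a thread pool with Tornado's run_on_executor and yield the resulting future. A rough sketch, with illustrative names:

from concurrent.futures import ThreadPoolExecutor
import xmlrpclib

import tornado.gen
import tornado.web
from tornado.concurrent import run_on_executor

s = xmlrpclib.ServerProxy('http://remote_host/rpc/')


class MyHandler(tornado.web.RequestHandler):
    executor = ThreadPoolExecutor(max_workers=4)

    @run_on_executor
    def list_methods(self):
        # Runs in a worker thread, so the blocking socket call
        # does not stall the IOLoop.
        return s.system.listmethods()

    @tornado.gen.coroutine
    def get(self):
        result = yield self.list_methods()
        self.write(', '.join(result))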
