Adapting celery.task.http.URL for tornado - python

Celery includes a module that can make asynchronous HTTP requests using amqp or another Celery backend. I am using the tornado-celery producer for asynchronous message publishing; as I understand it, tornado-celery uses pika for this. The question is how to adapt celery.task.http.URL for Tornado (make it non-blocking). There are basically two places that have to be reworked:
HttpDispatch.make_request() has to be implemented with Tornado's asynchronous HTTP client;
URL.get_async(**kw) and URL.post_async(**kw) must be reimplemented with corresponding non-blocking code using the Tornado API. For instance:
class NonBlockingURL(celery.task.http.URL):
    @gen.coroutine
    def post_async(self, **kwargs):
        async_res = yield gen.Task(self.dispatcher.delay,
                                   str(self), 'POST', **kwargs)
        raise gen.Return(async_res)
But I could not figure out how to do it in a proper and concise way, so that it is fully non-blocking and asynchronous. By the way, I am using the amqp backend.
Please provide a good guideline or, even better, an example.

In fact, you have to decide whether to use Tornado's async machinery or a queue like Celery. There is no point in using both: the queue answers quickly about the status of a task, so there is little for Tornado to do while waiting for the queue to respond. To decide between the two solutions, I would say:
Celery: more modular, easy to distribute to different cores or different machines, the tasks can be used by programs other than Tornado, but you have to install and keep extra software running (amqp, Celery workers...)
Async in Tornado: more monolithic, one program does everything, shorter code, only one program to run
To use the async method of Tornado, refer to the documentation.
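For the pure-Tornado route, a coroutine handler can make the HTTP call through AsyncHTTPClient so the IOLoop stays free while waiting. A minimal sketch (the URL and handler name are placeholders, not from the original question):
from tornado import gen, web, ioloop
from tornado.httpclient import AsyncHTTPClient

class FetchHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        client = AsyncHTTPClient()
        # the yield hands control back to the IOLoop until the response arrives
        response = yield client.fetch("http://example.com/")
        self.write(response.body)

if __name__ == "__main__":
    app = web.Application([(r"/fetch", FetchHandler)])
    app.listen(8888)
    ioloop.IOLoop.current().start()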
Here is a short solution using celery and tornado together:
task.py
from celery import Celery, current_task
import time

celery = Celery('tasks', broker='amqp://', backend='amqp')

@celery.task
def MyTask(url, resid):
    for i in range(10):
        time.sleep(1)
        current_task.update_state(state='running', meta={'i': i})
    return 'done'
server.py
import tasks
from tornado.web import Application, RequestHandler

dictasks = {}

class runtask(RequestHandler):
    def post(self):
        i = len(dictasks)
        dictasks[i] = tasks.MyTask.delay(self.get_argument('url', ''), i)
        self.write(str(i))

class chktask(RequestHandler):
    def get(self, i):
        i = int(i)
        if dictasks[i].ready():
            self.write(dictasks[i].result)
            del dictasks[i]
        else:
            self.write(dictasks[i].state + ' i:' + str(dictasks[i].info.get('i', -1)))

application = Application([
    (r"/runtask", runtask),
    (r"/chktask/([0-9]+)", chktask),
    # etc.
])
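To try this out, you would typically start a Celery worker (for example with celery -A tasks worker) and then run the Tornado server; POSTing to /runtask enqueues a task and returns its id, and polling /chktask/<id> reports the running state until the task returns done.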

Related

How does gevent ensure that the same thread-local variables are not shared between multiple coroutines

I have a Python 2 Django project, which is started with gunicorn, and it writes a lot of threading.currentThread().xxxxxx = 'some value' in the code.
Because coroutines reuse the same thread, I am curious how gevent guarantees that a currentThread attribute set in coroutine A (thread 1) will not affect coroutine B (same thread 1).
After all, what the code writes is:
import threading
threading.currentThread().xxxxx = 'ABCD'
Instead of
import gevent
gevent.currentCoroutine().xxxxx = 'ABCD' (illustrating my guess)
Thanks for your help.
It doesn't as far as I'm aware. Normal Gevent coroutines run in the same thread - if you modify something on that thread in one coroutine, it will be modified in the other coroutine as well.
If this is a question about gunicorn, that's a different matter and the following answer has some great detail on that - https://stackoverflow.com/a/41696500/7970018.
You should create a threading.local in the main thread.
After monkey patching, gevent replaces threading.local with gevent.local.local, so you can save data for the current greenlet via:
import threading

threadlocal = threading.local()

def func_in_thread():
    # set data
    setattr(threadlocal, "_key", "_value")
    # do something
    # do something
    getattr(threadlocal, "_key", None)
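A minimal sketch of why this works, assuming gevent's monkey patching is applied before anything else imports threading: after patch_all(), each greenlet sees its own copy of the local's attributes.
from gevent import monkey
monkey.patch_all()  # replaces threading.local with gevent's greenlet-local class

import threading
import gevent

data = threading.local()

def worker(value):
    data.name = value        # each greenlet sets its own 'name'
    gevent.sleep(0)          # yield to the other greenlet
    print(value, data.name)  # still the value set by this greenlet

gevent.joinall([gevent.spawn(worker, "A"), gevent.spawn(worker, "B")])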

Multiprocessing with any popular python webserver

I have used a number of Python webservers including the standard http.server, Flask, Tornado, Dash, Twisted, and CherryPy. I have also read up on Django. As far as I can tell, none of these have anything remotely resembling true multi-threading. With Django, for example, the recommendation is to use Celery, which is a completely separate queue-based task manager. Yes, we can always resort to external queueing, but that implies there is nothing native that is closer to multithreading (in process). I am very aware of the GIL, but at least I would look for something that shares the same code, akin to fork for a C program.
One thought is to try the multiprocessing library. In fact there is a Q&A on that approach with the accepted answer https://stackoverflow.com/a/28149481/1056563. However, that approach seems to be pure TCP/IP sockets: it does not include the important HTTP handling support. That leaves far too much to be re-implemented (including round objects such as the wheel).
Is there any way to merge the multiprocessing approach with an available webserver library such as Twisted, Tornado, Dash, etc.? Otherwise, how do we use their useful HTTP handling capabilities?
Update: We have a mix of workloads:
small/quick responses (sub-millisecond CPU): e.g. a couple of RDBMS calls
moderate compute (double-digit millisecond CPU): e.g. encryption/decryption of audio files
significant compute (hundreds of milliseconds to single-digit seconds): e.g. signal processing of audio and image files
We do need to be able to leverage multiple CPUs on a given machine to concurrently handle the mix of tasks/workloads.
If you need several HTTP web server processes to handle just the HTTP requests, you can use Gunicorn, which creates several instances of your app as child processes.
If you have CPU-bound operations, they will eventually block all HTTP handling, so they should be distributed to other processes. So on startup, each of your HTTP servers creates several child processes which do the heavy tasks.
The scheme is then Gunicorn -> HTTP servers -> CPU-heavy processes.
Example with aiohttp:
from aiohttp import web
import time
import multiprocessing as mp
from random import randint

def cpu_heavy_operation(num):
    """Just some CPU heavy task"""
    if num not in range(1, 10):
        return 0
    return str(num**1000000)[0:10]

def process_worker(q: mp.Queue, name: str):
    """Target function for mp.Process. Better convert it to class"""
    print(f"{name} Started worker process")
    while True:
        i = q.get()
        if i == "STOP":  # poison pill to stop child process gracefully
            break
        else:
            print(f"{name}: {cpu_heavy_operation(i)}")
    print(f"{name} Finished worker process")

async def add_another_worker_process(req: web.Request) -> web.Response:
    """Create another one child process"""
    q = req.app["cpu_bound_q"]
    name = randint(100000, 999999)
    pr = mp.Process(
        daemon=False,
        target=process_worker,
        args=(q, f"CPU-Bound_Pr-New-{name}",),
    )
    pr.start()
    req.app["children_pr"] += 1
    return web.json_response({"New": name, "Children": req.app["children_pr"]})

async def test_endpoint(req: web.Request) -> web.Response:
    """Just endpoint which feeds child processes with tasks"""
    x = req.match_info.get("num")
    req.app["cpu_bound_q"].put(int(x))
    return web.json_response({"num": x})

async def stop_ops(app: web.Application) -> None:
    """To do graceful shutdowns"""
    for i in range(app["children_pr"]):
        app["cpu_bound_q"].put("STOP")
    time.sleep(30)  # give child processes a chance to stop gracefully

async def init_func_standalone(args=None) -> web.Application:
    """Application factory for standalone run"""
    app = web.Application()
    app.router.add_get(r"/test/{num:\d+}", test_endpoint)
    app.router.add_get("/add", add_another_worker_process)
    # create cpu_bound_ops processes block
    cpu_bound_q = mp.Queue()
    prcs = [
        mp.Process(
            daemon=False,
            target=process_worker,
            args=(cpu_bound_q, f"CPU-Bound_Pr-{i}",),
        ) for i in range(4)
    ]
    [i.start() for i in prcs]
    app["children_pr"] = 4  # you should know how many child processes you need to stop gracefully
    app["cpu_bound_q"] = cpu_bound_q  # Queue for cpu bound ops - multiprocessing module
    app.on_cleanup.append(stop_ops)
    return app

async def init_func_gunicorn() -> web.Application:
    """is used to run aiohttp with Gunicorn"""
    app = await init_func_standalone()
    return app

if __name__ == '__main__':
    _app = init_func_standalone()
    web.run_app(_app, host='0.0.0.0', port=9999)
You can see that I use multiprocessing directly; I do it because I like to have more manual control. The other option is to go with concurrent.futures: asyncio has a run_in_executor method, so you just create a pool and submit the CPU-heavy tasks to run_in_executor, wrapping the results with asyncio's ensure_future or create_task as needed.
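A rough sketch of that concurrent.futures alternative (assuming a CPU-bound function like cpu_heavy_operation above and a pool size chosen for your machine):
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy_operation(num):
    # stand-in for the real CPU-bound work
    return str(num ** 1000000)[0:10]

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as pool:
        # run_in_executor offloads each blocking call to a worker process
        futures = [loop.run_in_executor(pool, cpu_heavy_operation, n) for n in range(1, 5)]
        results = await asyncio.gather(*futures)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())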

How do you understand the ioloop in tornado?

I am looking for a way to understand the ioloop in Tornado: I have read the official docs several times, but I can't understand it. Specifically, why does it exist?
from tornado.concurrent import Future
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

def async_fetch_future():
    http_client = AsyncHTTPClient()
    future = Future()
    fetch_future = http_client.fetch(
        "http://mock.kite.com/text")
    fetch_future.add_done_callback(
        lambda f: future.set_result(f.result()))
    return future

response = IOLoop.current().run_sync(async_fetch_future)
# why get current IO of this thread? display IO, hard drive IO, or network IO?
print response.body
I know what IO is: input and output, e.g. reading a hard drive, displaying a graph on the screen, getting keyboard input.
By definition, IOLoop.current() returns the current IO loop of this thread.
There are many IO devices on the laptop running this Python code. Which IO does IOLoop.current() refer to? I have never heard of an IO loop in JavaScript/Node.js.
Furthermore, why should I care about this low-level thing if I just want to do a database query or read a file?
I never heard of IO loop in javascript nodejs.
In node.js, the equivalent concept is the event loop. The node event loop is mostly invisible because all programs use it - it's what's running in between your callbacks.
In Python, most programs don't use an event loop, so when you want one, you have to run it yourself. This can be a Tornado IOLoop, a Twisted Reactor, or an asyncio event loop (all of these are specific types of event loops).
Tornado's IOLoop is perhaps confusingly named - it doesn't do any IO directly. Instead, it coordinates all the different IO (mainly network IO) that may be happening in the program. It may help you to think of it as an "event loop" or "callback runner".
Rather than calling it an IOLoop, maybe EventLoop is a clearer name to understand.
IOLoop.current() doesn't really return an IO device, just a pure Python event loop, which is basically the same as asyncio.get_event_loop() or the underlying event loop in Node.js.
The reason you need an event loop just to do a database query is that you are using an event-driven structure to do the query (in your example, you are doing an HTTP request).
Most of the time you do not need to care about this low-level structure. Instead, you just need to use the async and await keywords.
Let's say there is a lib which supports asynchronous database access:
async def get_user(user_id):
    user = await async_cursor.execute("select * from user where user_id = %s" % user_id)
    return user
Then you just need to use this function in your handler:
class YourHandler(tornado.web.RequestHandler):
    async def get(self):
        user = await get_user(self.get_cookie("user_id"))
        if user is None:
            return self.finish("No such user")
        return self.finish("You are %s" % user.user_name)

is it possible to list all blocked tornado coroutines

I have a "gateway" app written in tornado using #tornado.gen.coroutine to transfer information from one handler to another. I'm trying to do some debugging/status testing. What I'd like to be able to do is enumerate all of the currently blocked/waiting coroutines that are live at a given moment. Is this information accessible somewhere in tornado?
Maybe you mean the ioloop's _handlers dict. Try adding this in a periodic callback:
def print_current_handlers():
    io_loop = ioloop.IOLoop.current()
    print io_loop._handlers
Update: I've checked the source code and now think that there is no simple way to trace currently running gen.coroutines; A. Jesse Jiryu Davis is right!
But you can trace all "async" calls (yields) from coroutines: each yield from a generator goes through IOLoop.add_callback (http://www.tornadoweb.org/en/stable/ioloop.html#callbacks-and-timeouts).
So, by examining io_loop._callbacks you can see which callbacks are in the ioloop right now.
There is a lot of interesting stuff here :) https://github.com/tornadoweb/tornado/blob/master/tornado/gen.py
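A minimal sketch of that idea; note that _callbacks and _handlers are private IOLoop attributes, so this is only for debugging and may not exist in newer Tornado versions:
from tornado import ioloop

def print_pending():
    io_loop = ioloop.IOLoop.current()
    # private attributes, inspected here purely for debugging
    print("callbacks:", getattr(io_loop, "_callbacks", None))
    print("handlers:", getattr(io_loop, "_handlers", None))

# dump the state once a second while the loop is running
ioloop.PeriodicCallback(print_pending, 1000).start()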
No, there isn't, but you could perhaps create your own decorator that wraps gen.coroutine and updates a data structure when the coroutine begins.
import weakref
import functools
from tornado import gen
from tornado.ioloop import IOLoop

all_coroutines = weakref.WeakKeyDictionary()

def tracked_coroutine(fn):
    coro = gen.coroutine(fn)

    @functools.wraps(coro)
    def start(*args, **kwargs):
        future = coro(*args, **kwargs)
        all_coroutines[future] = str(fn)
        return future
    return start

@tracked_coroutine
def five_second_coroutine():
    yield gen.sleep(5)

@tracked_coroutine
def ten_second_coroutine():
    yield gen.sleep(10)

@gen.coroutine
def tracker():
    while True:
        running = list(all_coroutines.values())
        print(running)
        yield gen.sleep(1)

loop = IOLoop.current()
loop.spawn_callback(tracker)
loop.spawn_callback(five_second_coroutine)
loop.spawn_callback(ten_second_coroutine)
loop.start()
If you run this script for a few seconds you'll see two active coroutines, then one, then none.
Note the warning in the docs about the dictionary changing size during iteration; you should catch RuntimeError in tracker to deal with that problem.
This is a bit complex; you might get all you need much more simply by turning on Tornado's logging and using set_blocking_log_threshold.
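For reference, the logging route looks roughly like this (a sketch; the 0.5-second threshold is an arbitrary choice, and set_blocking_log_threshold relies on signals, so it is not available on Windows):
import logging
from tornado.ioloop import IOLoop

logging.basicConfig(level=logging.WARNING)

loop = IOLoop.current()
# log a warning with a stack trace whenever a callback blocks the loop
# for longer than 0.5 seconds
loop.set_blocking_log_threshold(0.5)
loop.start()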

Running twisted reactor in iPython

I'm aware this is normally done with twistd, but I want to use IPython to test out code 'live' on Twisted code.
"How to start twisted's reactor from ipython" asked basically the same thing, but the first solution no longer works with current IPython/Twisted, while the second is also unusable (the thread raises multiple errors).
https://gist.github.com/kived/8721434 has something called TPython which purports to do this, but running it seems to work except that clients never connect to the server (while the same clients work when run in the plain Python shell).
Do I have to use Conch Manhole, or is there a way to get IPython to play nice (probably with _threadedselect)?
For reference, I'm using IPython 5.0.0, Python 2.7.12, and Twisted 16.4.1.
Async code in general can be troublesome to run in a live interpreter. It's best just to run an async script in the background and do your IPython stuff in a separate interpreter. You can intercommunicate using files or TCP. If this went over your head, that's because it's not always simple, and it might be best to avoid the hassle if possible.
However, you'll be happy to know there is an awesome project called crochet for using Twisted in non-async applications. It truly is one of my favorite modules and I'm shocked that it's not more widely used (you can change that ;D though). The crochet module has a run_in_reactor decorator that runs a Twisted reactor in a separate thread managed by crochet itself. Here is a quick class example that executes requests to a Star Wars RESTful API, then stores the JSON responses in a list.
from __future__ import print_function
import json
from twisted.internet import defer, task
from twisted.web.client import getPage
from crochet import run_in_reactor, setup as setup_crochet

setup_crochet()

class StarWarsPeople(object):
    people_id = [_id for _id in range(1, 89)]
    people = []

    @run_in_reactor
    def requestPeople(self):
        """
        Request Star Wars JSON data from the SWAPI site.
        This occurs in a Twisted reactor in a separate thread.
        """
        for _id in self.people_id:
            url = 'http://swapi.co/api/people/{0}'.format(_id).encode('utf-8')
            d = getPage(url)
            d.addCallback(self.appendJSON)

    def appendJSON(self, response):
        """
        A callback which will take the response from the getPage() request,
        convert it to JSON, then append it to self.people, which can be
        accessed outside of the crochet thread.
        """
        response_json = json.loads(response.decode('utf-8'))
        # print(response_json)  # uncomment if you want to see output
        self.people.append(response_json)
Save this in a file (for example swapi.py), open IPython, import the newly created module, then run a quick test like so:
from swapi import StarWarsPeople
testing = StarWarsPeople()
testing.requestPeople()

from time import sleep
for x in range(5):
    print(len(testing.people))
    sleep(2)
As you can see it runs in the background and stuff can still occur in the main thread. You can continue using the iPython interpreter as you usually do. You can even have a manhole running in the background for some cool hacking too!
References
https://crochet.readthedocs.io/en/1.5.0/introduction.html#crochet-use-twisted-anywhere
While this doesn't answer the question I thought I had, it does answer (sort of) the question I posted. Embedding IPython works in the sense that you get access to the business objects with the reactor running.
from twisted.internet import reactor
from twisted.internet.endpoints import serverFromString
from myfactory import MyFactory

class MyClass(object):
    def __init__(self, **kwargs):
        super(MyClass, self).__init__(**kwargs)
        server = serverFromString(reactor, 'tcp:12345')
        server.listen(MyFactory(self))

        def interact():
            import IPython
            IPython.embed()

        reactor.callInThread(interact)

if __name__ == "__main__":
    myclass = MyClass()
    reactor.run()
Call the above with python myclass.py or similar.
