Can celery cooperatively run coroutines as stateful/resumable tasks?

Can celery cooperatively run coroutines as stateful/resumable tasks? - python

I'm currently investigating Celery for use in an video-processing backend. Essentially my problem is as follows:
I have a frontend web server that concurrently processes a large number of video streams (on the order of thousands).
Each stream must be processed independently and in parallel.
Stream processing can be divided into two types of operations:
Frame-by-frame operations (computations that do not need information about the preceding or following frame(s))
Stream-level operations (computations that work on a subset of ordered, adjacent frames)
Given point 3, I need to maintain and update an ordered structure of frames throughout the process and farm computations on subsections of this structure to Celery workers. Initially, I thought about organizing things as follows:
[frontend server] -stream-> [celery worker 1 (greenlet)] --> [celery worker 2 (prefork)]
The idea is that celery worker 1 executes long-running tasks that are primarily I/O-bound. In essence, these tasks would only do the following:
Read a frame from the frontend server
Decode the frame from it's base64 representation
Enqueue it in the aforementioned ordered data structure (a collections.deque object, as it currently stands).
Any CPU-bound operations (i.e. image analysis) are shipped off to celery worker 2.
My problem is as follows:
I would like to execute a coroutine as a task such that I have a long-running tasks from which I can yield so as to not block celery worker 1's operations. In other words, I'd like to be able to do something akin to:
def coroutine(func):
#wraps(func)
def start(*args, **kwargs):
cr = func(*args, **kwargs)
cr.next()
return cr
return start
#coroutine
def my_taks():
stream = deque() # collections.deque
source = MyAsynchronousInputThingy() # something i'll make myself, probably using select
while source.open:
if source.has_data:
stream.append(Frame(source.readline())) # read data, build frame and enqueue to persistent structure
yield # cooperatively interrupt so that other tasks can execute
Is there a way to make a coroutine-based task run indefinitely, ideally producing results as they are yielded?

Primary idea behind Eventlet is that you want to write synchronous code, as with threads, socket.recv() should block current thread until next statement. This style is very easy to read, maintain and reason about while debugging. To make things effective and scalable, behind scenes, Eventlet does the magic to replace seemingly blocking code with green threads and epoll/kqueue/etc mechanisms to wake up those green threads at proper times.
So all you need is execute eventlet.monkey_patch() as soon as possible (e.g. second line in module) and make sure you use pure Python socket operations in MyInputThingy. Forget about asynchronous, just write normal blocking code as you would with threads.
Eventlet makes synchronous code good again.

Related

AIOHTTP: Respond to a POST quickly but process its data in the background

I have a ANPR (automated number plate reading) system. Essentially a few cameras configured. These make HTTP POSTs to locations we configure. Our cocktail of problems are such:
Our script needs to send this data onto multiple, occasionally slow places.
The camera locks up while it's POSTing.
So if my script takes 15 seconds to complete —it can— we might miss a read.
Here's the cut down version of my script at the moment. Apologies for the 3.4 syntax, we've got some old machines on site.
#asyncio.coroutine
def handle_plate_read(request):
post_data = yield from request.post()
print(post_data)
# slow stuff!
return web.Response()
app = web.Application()
app.router.add_post('/', handle_plate_read)
web.run_app(app, port=LISTEN_PORT)
This is functional but can I push back the 200 to the camera early (and disconnect it) and carry on processing the data, or otherwise easily defer that step until after the camera connection is handled?

If I understood your question correctly, you can of course carry on processing the data right after response or deffer it to process even later.
Here's how simple solution can look like:
Instead of doing slow stuff before response, add data needed to
do it to some Queue with unlimited size and return response
imideately.
Run some Task in background to process slow job that would
grab data from queue. Task itself runs parallely with other
coroutines and doesn't block them. (more info)
Since "slow stuff" is probably something CPU-related, you would
need to use run_in_executor with ProcessPoolExecutor to do
slow stuff in other process(es). (more info)
This should work in basic case.
But you should also think over how will it work under heavy load. For example if you grab data for slow stuff quicky, but process it much slower your queue will grow and you'll run out of RAM.
In that case it makes sense to store your data in DB other then in queue (you would probably need to create separate task to store your data in DB without blocking response). asyncio has drivers for many DB, for example.

Is this multi-threaded function asynchronous

I'm afraid I'm still a bit confused (despite checking other threads) whether:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
My initial guess is no to both and that proper asynchronous code should be able to run in one thread - however it can be improved by adding threads for example like so:
So I constructed this toy example:
from threading import *
from queue import Queue
import time
def do_something_with_io_lag(in_work):
out = in_work
# Imagine we do some work that involves sending
# something over the internet and processing the output
# once it arrives
time.sleep(0.5) # simulate IO lag
print("Hello, bee number: ",
str(current_thread().name).replace("Thread-",""))
class WorkerBee(Thread):
def __init__(self, q):
Thread.__init__(self)
self.q = q
def run(self):
while True:
# Get some work from the queue
work_todo = self.q.get()
# This function will simiulate I/O lag
do_something_with_io_lag(work_todo)
# Remove task from the queue
self.q.task_done()
if __name__ == '__main__':
def time_me(nmbr):
number_of_worker_bees = nmbr
worktodo = ['some input for work'] * 50
# Create a queue
q = Queue()
# Fill with work
[q.put(onework) for onework in worktodo]
# Launch processes
for _ in range(number_of_worker_bees):
t = WorkerBee(q)
t.start()
# Block until queue is empty
q.join()
# Run this code in serial mode (just one worker)
%time time_me(nmbr=1)
# Wall time: 25 s
# Basically 50 requests * 0.5 seconds IO lag
# For me everything gets processed by bee number: 59
# Run this code using multi-tasking (launch 50 workers)
%time time_me(nmbr=50)
# Wall time: 507 ms
# Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
# Now everything gets processed by different bees
Is it asynchronous?
To me this code does not seem asynchronous because it is Figure 3 in my example diagram. The I/O call blocks the thread (although we don't feel it because they are blocked in parallel).
However, if this is the case I am confused why requests-futures is considered asynchronous since it is a wrapper around ThreadPoolExecutor:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
future_to_url = {executor.submit(load_url, url, 10): url for url in get_urls()}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
Can this function on just one thread?
Especially when compared to asyncio, which means it can run single-threaded
There are only two ways to have a program on a single processor do
“more than one thing at a time.” Multi-threaded programming is the
simplest and most popular way to do it, but there is another very
different technique, that lets you have nearly all the advantages of
multi-threading, without actually using multiple threads. It’s really
only practical if your program is largely I/O bound. If your program
is processor bound, then pre-emptive scheduled threads are probably
what you really need. Network servers are rarely processor bound,
however.

First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction - an object, that allows you to refer to job result (or exception, which is also a result) in your program after you assigned a job, but before it is completed. It's similar to assigning common function's result to some variable.
Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous" as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see it in the timing results. And you're right, your function due to sleep is blocking, it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you would have one job with sleep and the other one without and run multiple threads, the one without sleep would perform calculations while the other would sleep. When you use single thread, the jobs are performed in in a serial manner one after the other, so when one job sleeps the other jobs wait for it, actually they just don't exist until it's their turn. All this is pretty much proven by your time tests. The thing happened with print has to do with "thread safety", i.e. print uses standard output, which is a single shared resource. So when your multiple threads tried to print at the same time the switching happened inside and you got your strange output. (This also show "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.
Asyncio: To better understand the purpose note the "IO" part, it's not 'async computation', but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about event loop and generators (coroutines). The event loop is the arbiter, that governs the execution of coroutines (and their callbacks), that were registered to the loop. Coroutines are implemented as generators, i.e. functions that allow to perform some actions iteratively, saving state at each iteration and 'returning', and on the next call continuing with the saved state. So basically the event loop is while True: loop, that calls all coroutines/generators, assigned to it, one after another, and they provide result or no-result on each such call - this provides possibility for "asynchronicity". (A simplification, as there's scheduling mechanisms, that optimize this behavior.) The event loop in this situation can run in single thread and if coroutines are non-blocking it will give you true "asynchronicity", but if they are blocking then it's basically a linear execution.
You can achieve the same thing with explicit multithreading, but threads are costly - they require memory to be assigned, switching them takes time, etc. On the other hand asyncio API allows you to abstract from actual implementation and just consider your jobs to be performed asynchronously. It's implementation may be different, it includes calling the OS API and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is it works well for IO due to lower level mechanisms, hardware stuff. On the other hand, performing computation will require explicit breaking of computation algorithm into pieces to use as asyncio coroutine, so a separate thread might be a better decision, as you can launch the whole computation as one there. (I'm not talking about algorithms that are special to parallel computing). But asyncio event loop might be explicitly set to use separate threads for coroutines, so this will be asyncio with multithreading.
Regarding your example, if you'll implement your function with sleep as asyncio coroutine, shedule and run 50 of them single threaded, you'll get time similar to the first time test, i.e. around 25s, as it is blocking. If you will change it to something like yield from [asyncio.sleep][3](0.5) (which is a coroutine itself), shedule and run 50 of them single threaded, it will be called asynchronously. So while one coroutine will sleep the other will be started, and so on. The jobs will complete in time similar to your second multithreaded test, i.e. close to 0.5s. If you will add print here you'll get good output as it will be used by single thread in serial manner, but the output might be in different order then the order of coroutine assignment to the loop, as coroutines could be run in different order. If you will use multiple threads, then the result will obviously be close to the last one anyway.
Simplification: The difference in multythreading and asyncio is in blocking/non-blocking, so basicly blocking multithreading will somewhat come close to non-blocking asyncio, but there're a lot of differences.
Multithreading for computations (i.e. CPU bound code)
Asyncio for input/output (i.e. I/O bound code)
Regarding your original statement:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
I hope that I was able to show, that:
asynchronous code might be both single threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"

I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:
Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.
Sleep. Each bee does so independently of all others, i.e. the sleep duration runs on all, otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.
Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.
Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.

Django - how to set up asynchronous longtime background data processing task?

Newb quesion about Django app design:
Im building reporting engine for my web-site. And I have a big (and getting bigger with time) amounts of data, and some algorithm which must be applied to it. Calculations promise to be heavy on resources, and it would be stupid if they are performed by requests of users. So, I think to put them into background process, which would be executed continuously and from time to time return results, which could be feed to Django views-routine for producing html output by demand.
And my question is - what proper design approach for building such system? Any thoughts?

Celery is one of your best choices. We are using it successfully. It has a powerful scheduling mechanism - you can either schedule tasks as a timed job or trigger tasks in background when user (for example) requests it.
It also provides ways to query for the status of such background tasks and has a number of flow control features. It allows for a very easy distribution of the work - i.e your celery background tasks can be run on a separate machine (this is very useful for example with heroku web/workers split where web process is limited to max 30s per request). It provides various queue backends (it can use database, rabbitMQ or a number of other queuing mechanisms. With simplest setup it can use the same database that your Django site already uses for that (which makes it easy to setup).
And if you are using automated tests it also has a feature that helps with testing - it can be set in "eager" mode, where background tasks are not executed in background - thus giving predictable logic testing.
More info here: http://docs.celeryproject.org:8000/en/latest/django/

You mean the results are returned into a database or do you want to create django-views directly from your independently running code?
If you have large amounts of data I like to use Pythons multiprocessing. You can create a Generator which fills a JoinableQueue with the different tasks to do and a pool of Workers consuming the different Tasks. This way you should be able to maximize the resource utilization on your system.
The multiprocessing module also allows you to do several tasks over the network (e.g. multiprocessing.Manager()). With this in mind you should easily be able to scale things up if you need a second machine to process the data in time.
Example:
This example shows how to spawn multiple processes. The generator function should query the database for all new entries that need heavy lifting. The consumers take the individual items from the queue and do the actual calculations.
import time
from multiprocessing.queues import JoinableQueue
from multiprocessing import Process
QUEUE = JoinableQueue(-1)
def generator():
""" Puts items in the queue. For example query database for all new,
unprocessed entries that need some serious math done.."""
while True:
QUEUE.put("Item")
time.sleep(0.1)
def consumer(consumer_id):
""" Consumes items from the queue... Do your calculations here... """
while True:
item = QUEUE.get()
print "Process %s has done: %s" % (consumer_id, item)
QUEUE.task_done()
p = Process(target=generator)
p.start()
for x in range(0, 2):
w = Process(target=consumer, args=(x,))
w.start()
p.join()
w.join()

Why don't you have a url or python script that triggers whatever sort of calculation you need to have done everytime it's run and then fetch that url or run that script via a cronjob on the server? From what your question was it doesn't seem like you need a whole lot more than that.

Non-blocking, non-concurrent tasks in Python

I am working on an implementation of a very small library in Python that has to be non-blocking.
On some production code, at some point, a call to this library will be done and it needs to do its own work, in its most simple form it would be a callable that needs to pass some information to a service.
This "passing information to a service" is a non-intensive task, probably sending some data to an HTTP service or something similar. It also doesn't need to be concurrent or to share information, however it does need to terminate at some point, possibly with a timeout.
I have used the threading module before and it seems the most appropriate thing to use, but the application where this library will be used is so big that I am worried of hitting the threading limit.
On local testing I was able to hit that limit at around ~2500 threads spawned.
There is a good possibility (given the size of the application) that I can hit that limit easily. It also makes me weary of using a Queue given the memory implications of placing tasks at a high rate in it.
I have also looked at gevent but I couldn't see an example of being able to spawn something that would do some work and terminate without joining. The examples I went through where calling .join() on a spawned Greenlet or on an array of greenlets.
I don't need to know the result of the work being done! It just needs to fire off and try to talk to the HTTP service and die with a sensible timeout if it didn't.
Have I misinterpreted the guides/tutorials for gevent ? Is there any other possibility to spawn a callable in fully non-blocking fashion that can't hit a ~2500 limit?
This is a simple example in Threading that does work as I would expect:
from threading import Thread
class Synchronizer(Thread):
def __init__(self, number):
self.number = number
Thread.__init__(self)
def run(self):
# Simulating some work
import time
time.sleep(5)
print self.number
for i in range(4000): # totally doesn't get past 2,500
sync = Synchronizer(i)
sync.setDaemon(True)
sync.start()
print "spawned a thread, number %s" % i
And this is what I've tried with gevent, where it obviously blocks at the end to
see what the workers did:
def task(pid):
"""
Some non-deterministic task
"""
gevent.sleep(1)
print('Task', pid, 'done')
for i in range(100):
gevent.spawn(task, i)
EDIT:
My problem stemmed out from my lack of familiarity with gevent. While the Thread code was indeed spawning threads, it also prevented the script from terminating while it did some work.
gevent doesn't really do that in the code above, unless you add a .join(). All I had to do to see the gevent code do some work with the spawned greenlets was to make it a long running process. This definitely fixes my problem as the code that needs to spawn the greenlets is done within a framework that is a long running process in itself.

Nothing requires you to call join in gevent, if you're expecting your main thread to last longer than any of your workers.
The only reason for the join call is to make sure the main thread lasts at least as long as all of the workers (so that the program doesn't terminate early).

Why not spawn a subprocess with a connected pipe or similar and then, instead of a callable, just drop your data on the pipe and let the subprocess handle it completely out of band.

As explained in Understanding Asynchronous/Multiprocessing in Python, asyncoro framework supports asynchronous, concurrent processes. You can run tens or hundreds of thousands of concurrent processes; for reference, running 100,000 simple processes takes about 200MB. If you want to, you can mix threads in rest of the system and coroutines with asyncoro (provided threads and coroutines don't share variables, but use coroutine interface functions to send messages etc.).

What's the best pattern to design an asynchronous RPC application using Python, Pika and AMQP?

The producer module of my application is run by users who want to submit work to be done on a small cluster. It sends the subscriptions in JSON form through the RabbitMQ message broker.
I have tried several strategies, and the best so far is the following, which is still not fully working:
Each cluster machine runs a consumer module, which subscribes itself to the AMQP queue and issues a prefetch_count to tell the broker how many tasks it can run at once.
I was able to make it work using SelectConnection from the Pika AMQP library. Both consumer and producer start two channels, one connected to each queue. The producer sends requests on channel [A] and waits for responses in channel [B], and the consumer waits for requests on channel [A] and send responses on channel [B]. It seems, however, that when the consumer runs the callback that calculates the response, it blocks, so I have only one task executed at each consumer at each time.
What I need in the end:
the consumer [A] subscribes his tasks (around 5k each time) to the cluster
the broker dispatches N messages/requests for each consumer, where N is the number of concurrent tasks it can handle
when a single task is finished, the consumer replies to the broker/producer with the result
the producer receives the replies, update the computation status and, in the end, prints some reports
Restrictions:
If another user submits work, all of his tasks will be queued after the previous user (I guess this is automatically true from the queue system, but I haven't thought about the implications on a threaded environment)
Tasks have an order to be submitted, but the order they are replied is not important
UPDATE
I have studied a bit further and my actual problem seems to be that I use a simple function as callback to the pika's SelectConnection.channel.basic_consume() function. My last (unimplemented) idea is to pass a threading function, instead of a regular one, so the callback would not block and the consumer can keep listening.

As you have noticed, your process blocks when it runs a callback. There are several ways to deal with this depending on what your callback does.
If your callback is IO-bound (doing lots of networking or disk IO) you can use either threads or a greenlet-based solution, such as gevent, eventlet, or greenhouse. Keep in mind, though, that Python is limited by the GIL (Global Interpreter Lock), which means that only one piece of python code is ever running in a single python process. This means that if you are doing lots of computation with python code, these solutions will likely not be much faster than what you already have.
Another option would be to implement your consumer as multiple processes using multiprocessing. I have found multiprocessing to be very useful when doing parallel work. You could implement this by either using a Queue, having the parent process being the consumer and farming out work to its children, or by simply starting up multiple processes which each consume on their own. I would suggest, unless your application is highly concurrent (1000s of workers), to simply start multiple workers, each of which consumes from their own connection. This way, you can use the acknowledgement feature of AMQP, so if a consumer dies while still processing a task, the message is sent back to the queue automatically and will be picked up by another worker, rather than simply losing the request.
A last option, if you control the producer and it is also written in Python, is to use a task library like celery to abstract the task/queue workings for you. I have used celery for several large projects and have found it to be very well written. It will also handle the multiple consumer issues for you with the appropriate configuration.

Your setup sounds good to me. And you are right, you can simply set the callback to start a thread and chain that to a separate callback when the thread finishes to queue the response back over Channel B.
Basically, your consumers should have a queue of their own (size of N, amount of parallelism they support). When a request comes in via Channel A, it should store the result in the queue shared between the main thread with Pika and the worker threads in the thread pool. As soon it is queued, pika should respond back with ACK, and your worker thread would wake up and start processing.
Once the worker is done with its work, it would queue the result back on a separate result queue and issue a callback to the main thread to send it back to the consumer.
You should take care and make sure that the worker threads are not interfering with each other if they are using any shared resources, but that's a separate topic.

Being unexperienced in threading, my setup would run multiple consumer processes (the number of which basically being your prefetch count). Each would connect to the two queues and they would process jobs happily, unknowning of eachother's existence.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.