Elegant way to handle queuing up multiple tasks into a single bulkAdd/batch - Python

I use task queue quite extensively in an application.
Most of the time it's using the following pattern:
yield (add_foo_task_async(), add_bar_task_async(), add_baz_task_async())

# add_foo_task_async() etc. are defined like this:
@classmethod
@ndb.tasklet
def add_foo_task_async(cls, param):
    queue = taskqueue.Queue("foo")
    # perform various modifications on params etc...
    params = {
        "param": param,
    }
    task = taskqueue.Task(url=uri_for("tasks/foo_worker"), params=params)
    result = yield queue.add_async(task)
    raise ndb.Return(result)
The problem is that this seems to create a "ladder" of "bulkAdd" calls.
I'd like to improve performance so that there aren't all these ladders.
One solution I'm considering is creating a class where tasks are created and stored in a list. The class would also have an "add_tasks_to_taskqueue" method which queues them all to the actual task queue. One issue, however, is that quite a lot of the tasks I use are queued up in _post_put_hook (so I'd need a way to pass this class everywhere). Another concern is that I use multiple queues at the moment, so I assume I'd need to change that?
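A rough sketch of what I have in mind is below. The TaskBatcher name and its add/flush_async methods are just placeholders, not an existing API; the idea is simply to collect tasks per queue name and issue one bulk add_async per queue.

import collections
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class TaskBatcher(object):
    """Collects tasks per queue name and adds each group in one bulk call."""

    def __init__(self):
        # queue name -> list of pending taskqueue.Task objects
        self._pending = collections.defaultdict(list)

    def add(self, queue_name, task):
        # Only remember the task here; nothing is sent yet.
        self._pending[queue_name].append(task)

    @ndb.tasklet
    def _flush_queue_async(self, queue_name, tasks):
        # add_async accepts a list of tasks, so this is one bulkAdd per queue.
        result = yield taskqueue.Queue(queue_name).add_async(tasks)
        raise ndb.Return(result)

    @ndb.tasklet
    def flush_async(self):
        # Kick off one bulk add per queue, then wait for all of them in parallel.
        futures = [self._flush_queue_async(name, tasks)
                   for name, tasks in self._pending.items()]
        self._pending.clear()
        results = yield futures
        raise ndb.Return(results)

In _post_put_hook I'd then call batcher.add(...) instead of queue.add_async(...), and flush once at the end of the request. Note that a single bulk add is capped (100 tasks per call), so very large batches would need chunking.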
Update
I've seen that the ndb context has some auto-batching code for memcache and urlfetch. Could the proposed solution use a similar approach, where we extend the ndb context (is that possible?) and use something like get_context().add_task_to_batch_queue(task)?
Is there a better/elegant way to handle what I'm trying to achieve?
Thanks

Related

dask clusters with context manager

Consider a simple workflow like this:
from dask.distributed import Client
import time
with Client() as client:
    futs = client.map(time.sleep, list(range(10)))
The above code will submit and almost immediately cancel the futures, since the context manager will close. It's possible to keep the context manager open until the tasks are completed with client.gather; however, that will block further execution in the current process.
I am interested in submitting tasks to multiple clusters (e.g. local and distributed) within the same process, ideally without blocking the current process. It's straightforward to do by explicitly defining different clients and clusters, but is it also possible with context managers (one for each unique client/cluster)?
It might sound like a bit of an anti-pattern, but maybe there is a way to close the cluster only after all futures have completed. I tried fire_and_forget and also tried passing shutdown_on_close=False, but that doesn't seem to be implemented.
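For reference, the explicit (non-context-manager) version I'd use looks roughly like this; the second scheduler address is just a placeholder for whatever the other cluster exposes:

from dask.distributed import Client, LocalCluster, wait
import time

# Explicitly created clients, closed only once the work is done.
local_cluster = LocalCluster(n_workers=2)
local_client = Client(local_cluster)

# Second client pointing at an already-running scheduler (address is a placeholder).
remote_client = Client("tcp://scheduler-address:8786")

local_futs = local_client.map(time.sleep, list(range(5)))
remote_futs = remote_client.map(time.sleep, list(range(5)))

# Both sets of futures run concurrently; blocking happens only where I choose.
wait(local_futs)
wait(remote_futs)

local_client.close()
local_cluster.close()
remote_client.close()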
For some Dask cluster/scheduler types, such as the dask-cloudprovider ECSCluster, the approach described above using the with block and shutdown_on_close=False would work fine.
Both ECSCluster and SLURMCluster are derived from SpecCluster. However, ECSCluster passes its **kwargs (including shutdown_on_close) down to the SpecCluster constructor via this call:
super().__init__(**kwargs)
(see the ECSCluster code here)
SLURMCluster does not: it calls the JobQueueCluster constructor which in turn instantiates SpecCluster with only a subset of its parameters:
super().__init__(
    scheduler=scheduler,
    worker=worker,
    loop=loop,
    security=security,
    silence_logs=silence_logs,
    asynchronous=asynchronous,
    name=name,
)
See the JobQueueCluster code here
Therefore SLURMCluster/JobQueueCluster ignores shutdown_on_close (and the other optional parameters). It looks like an update to JobQueueCluster would be required for your use case.
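As a workaround, you could manage the cluster lifetime explicitly rather than relying on shutdown_on_close. A rough sketch, where the cores/memory values are placeholders for whatever your site requires:

import time
from dask.distributed import Client, wait
from dask_jobqueue import SLURMCluster

# Manage the cluster lifetime by hand instead of a with block.
cluster = SLURMCluster(cores=1, memory="2GB")  # resource values are placeholders
cluster.scale(jobs=2)
client = Client(cluster)

futs = client.map(time.sleep, list(range(10)))

# Close the client and cluster only once all futures have finished.
wait(futs)
client.close()
cluster.close()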

Tracking the progress of a function in another function

I have a Python back end running Flask, and it is going to have a function (or a few functions chained together) which will run several AJAX calls and perform some database operations.
This will take a while, so on the front end I'm looking to poll the server at regular intervals and update the UI as progress is made. A general outline might be something like this:
@app.route('/update', methods=['GET'])
def getUpdate():
    # return a response with the current status of the update

@app.route('/update', methods=['POST'])
def runUpdate():
    # asynchronously call update() and return status

def update():
    # perform ajax calls
    # update database
    # query database
    # ...
I considered WebSockets, but I don't know if that's making things a little too complex for just a simple update in the UI. I know I could also use a module-scoped variable or store the status in a database table, but either of those feels like bad design to me. Is there a simple pattern I can use to achieve this?
Use a database to store the status. If you use something like Redis, you can even do it in real time with pub/sub and WebSockets.
A module-scoped variable is a bad choice: it doesn't scale.
If it is a long running task, consider using a task queue, like rq or celery.
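A minimal sketch of the "status in Redis, poll from the front end" approach, assuming a local Redis instance and RQ for the background work; the endpoint names follow your outline and the status key is arbitrary:

from flask import Flask, jsonify
from redis import Redis
from rq import Queue

app = Flask(__name__)
redis_conn = Redis()
task_queue = Queue(connection=redis_conn)

def update():
    # Long-running work; report progress as it goes.
    redis_conn.set('update_status', 'fetching data')
    # ... perform ajax calls, database writes, queries ...
    redis_conn.set('update_status', 'done')

@app.route('/update', methods=['POST'])
def runUpdate():
    redis_conn.set('update_status', 'queued')
    task_queue.enqueue(update)  # an RQ worker process picks this up
    return jsonify(status='queued'), 202

@app.route('/update', methods=['GET'])
def getUpdate():
    status = redis_conn.get('update_status') or b'idle'
    return jsonify(status=status.decode())

The front end then polls GET /update every few seconds and updates the UI until the status reaches "done".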

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and the worker. The interface is supposed to accept requests from outside and multiplex them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different kinds of requests, which would make one queue per request type difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request queue and one answer queue. I've discarded this idea since it would either block the interface for the answer time of the current worker (making it scale badly), or would require me to send around extra information.
Use one request queue where each message contains a pipe to return the answer to that request. After fixing the issue of being unable to send pipes via pipes, I ran into problems with the pipe closing unless both ends were sent over the connection.
Use one request queue where each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues, and the reduction trick doesn't work in this case.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way to directly access memory from one process in another, as there is with multithreading.
Your best bet is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But be aware that this will only work if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. That means you have to use module-level functions (defined at the top of the file, not belonging to any class).
You can always implement it yourself: Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
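If you would rather stay with plain multiprocessing, one alternative (not the external-queue route above) is to give each worker its own Pipe, which gives you function-like calls without sending pipes or queues through other channels. A minimal sketch, with illustrative names:

import multiprocessing as mp

def worker_loop(conn):
    """Each worker keeps its own state and answers requests over its pipe."""
    calls = 0
    while True:
        request = conn.recv()
        if request == 'stop':
            break
        if request == 'get_calls':
            calls += 1
            conn.send(calls)

class Interface(object):
    def __init__(self, n_workers):
        self._conns = []
        self._procs = []
        for _ in range(n_workers):
            parent_conn, child_conn = mp.Pipe()
            proc = mp.Process(target=worker_loop, args=(child_conn,))
            proc.start()
            self._conns.append(parent_conn)
            self._procs.append(proc)

    def get_worker_calls(self, worker_id):
        # Function-like call: send the request, block for that worker's reply.
        self._conns[worker_id].send('get_calls')
        return self._conns[worker_id].recv()

    def stop(self):
        for conn, proc in zip(self._conns, self._procs):
            conn.send('stop')
            proc.join()

if __name__ == '__main__':
    iface = Interface(2)
    print(iface.get_worker_calls(0))  # 1
    print(iface.get_worker_calls(0))  # 2
    print(iface.get_worker_calls(1))  # 1
    iface.stop()

Requests for worker 0 never touch worker 1, the workers are created once, and new request types are just new message strings rather than new queues.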

Asynchronous object instantiation

How can I make the following object instantiation asynchronous:
class IndexHandler(tornado.web.RequestHandler):
    def get(self, id):
        # Async the following
        data = MyDataFrame(id)
        self.write(data.getFoo())
The MyDataFrame returns a pandas DataFrame object and can take some time depending on the file it has to parse.
MyDataFrame() is a synchronous interface; to use it without blocking you need to do one of two things:
Rewrite it to be asynchronous. You can't really make an __init__ method asynchronous, so you'll need to refactor things into a static factory function instead of a constructor. In most cases this path only makes sense if the method depends on network I/O (and not reading from the filesystem or processing the results on the CPU).
Run it on a worker thread and asynchronously wait for its result on the main thread. From the way you've framed the question, this sounds like the right approach for you. I recommend the concurrent.futures package (in the standard library since Python 3.2; available via pip install futures for 2.x).
This would look something like:
@tornado.gen.coroutine
def get(self, id):
    data = yield executor.submit(MyDataFrame, id)
    self.write(data.getFoo())
where executor is a global instance of ThreadPoolExecutor.
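For completeness, the executor could be set up along these lines (the pool size here is arbitrary):

from concurrent.futures import ThreadPoolExecutor

# Shared by all handlers; MyDataFrame construction runs on these threads.
executor = ThreadPoolExecutor(max_workers=4)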

Advice on backgrounding a task with variables?

I have a Python webapp which accepts some data via POST. The method that is called can take a while to complete (30-60s), so I would like to "background" the method and respond to the user with a "processing" message.
The data is quite sensitive, so I'd prefer not to use any queue-based solutions. I also want to ensure that the backgrounded method doesn't get interrupted should the webapp fail in any way.
My first thought is to fork a process; however, I'm unsure how I can pass variables to it.
I've used Gevent before, which has a handy method: gevent.spawn(function, *args, **kwargs). Is there anything like this that I could use at the process-level?
Any other advice?
The simplest approach would be to use a thread. Pass data to and from a thread with a Queue.
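A rough sketch of that, with illustrative names; threading.Thread takes args/kwargs much like gevent.spawn does:

import threading
import queue  # the Queue module on Python 2

def process_data(data, result_queue):
    # ... the slow 30-60s work goes here ...
    result = {'status': 'done', 'input': data}
    result_queue.put(result)

def handle_post(data):
    result_queue = queue.Queue()
    # Pass variables to the thread the same way gevent.spawn(function, *args) does.
    worker = threading.Thread(target=process_data, args=(data, result_queue))
    worker.start()
    return 'processing'  # respond immediately; check result_queue later

One caveat: a thread still lives inside the web process, so unlike a separate process it won't survive if the webapp itself crashes or restarts.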
