Executing python code in parallel with ndb tasklets


First of all, I know I can use threading to accomplish such a task, like so:
import Queue
import threading

# called by each thread
def do_stuff(q, arg):
    result = heavy_operation(arg)
    q.put(result)

operations = range(1, 10)
q = Queue.Queue()
for op in operations:
    t = threading.Thread(target=do_stuff, args=(q, op))
    t.daemon = True
    t.start()

s = q.get()
print s
However, in Google App Engine there is something called ndb tasklets, and according to the documentation you can execute code in parallel using them:
Tasklets are a way to write concurrently running functions without
threads; tasklets are executed by an event loop and can suspend
themselves blocking for I/O or some other operation using a yield
statement. The notion of a blocking operation is abstracted into the
Future class, but a tasklet may also yield an RPC in order to wait for
that RPC to complete.
Is it possible to accomplish something like the example with threading above?
I already know how to retrieve entities using get_async() (I got it from the examples on the documentation page), but it's very unclear to me how this applies to parallel code execution.
Thanks.

The answer depends on what your heavy_operation really is. If heavy_operation uses RPCs (Remote Procedure Calls, such as datastore access, UrlFetch, etc.), then the answer is yes.
I asked a similar question in "how to understand appengine ndb.tasklet?"; you may find more details there.
Can I put any kind of code inside a function and decorate it with ndb.tasklet, then use it as an async function later? Or must it be an App Engine RPC?
The Answer
Technically yes, but it will not run asynchronously. When you decorate a non-yielding function with @tasklet, its Future's value is computed and set when you call that function. That is, it runs through the entire function when you call it. If you want to achieve asynchronous operation, you must yield on something that does asynchronous work. Generally in GAE it will work its way down to an RPC call.
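ndb itself only runs inside the App Engine runtime, so as a rough analogy (not ndb code) here is the same principle shown with the standard library's asyncio: a coroutine that never suspends runs straight through, and interleaving only happens at suspension points. The function names are made up for the illustration.

```python
import asyncio

order = []

async def no_yield(name):
    # Never awaits anything, so it runs start to finish once it is started.
    for i in range(2):
        order.append((name, i))

async def yields(name):
    for i in range(2):
        order.append((name, i))
        await asyncio.sleep(0)  # suspension point: other tasks may run here

async def main():
    await asyncio.gather(no_yield("a"), no_yield("b"))
    sequential = list(order)
    order.clear()
    await asyncio.gather(yields("c"), yields("d"))
    interleaved = list(order)
    return sequential, interleaved

sequential, interleaved = asyncio.run(main())
print(sequential)   # "a" finishes completely before "b" starts
print(interleaved)  # "c" and "d" alternate at each suspension point
```

In ndb the suspension point would be a yield on a Future or RPC rather than an await, but the consequence is the same: pure computation inside a tasklet does not overlap with anything.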

Related

How to specify a part of code to run in a particular thread in a multithreaded environment in python?

How to achieve something like:
def call_me():
    # doing some stuff which requires distributed locking
    pass

def i_am_calling():
    # other logic
    call_me()
    # other logic
This code runs in a multithreaded environment. How can I arrange things so that only a single thread from the thread pool is responsible for running the call_me() part of i_am_calling()?

It depends on the exact requirements and on the system architecture. One approach is to use a lock so that only one thread or process executes the critical section at a time.
Another is apply_async from the multiprocessing module, which lets you submit different functions (not just repeated calls of the same one) to a pool of worker processes. A function submitted once runs in exactly one worker; you can also bundle tasks ahead of time and submit them to the various workers. There is also pool.apply, which submits a task to the pool but blocks until the result is available; it is equivalent to pool.apply_async(func, args, kwargs).get(). With apply_async you can either call get() on the result or pass a callback instead. Note that pool.apply(f, args) guarantees that only one of the pool's workers executes f(args).
You can also make the call in its own thread using executor.submit from concurrent.futures, which is part of the Python standard library. asyncio can be combined with concurrent.futures so that it awaits functions executed in the thread or process pools that concurrent.futures provides.
If you want to run a routine at regular intervals, you can build on threading.Timer.
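As a minimal sketch of the executor.submit idea, assuming it is acceptable to funnel every call_me() through one dedicated worker thread: a single-worker ThreadPoolExecutor guarantees that the function always runs on the same thread, no matter which pool thread runs i_am_calling(). The executor name and the seen_threads set are made up for the illustration, and this only serializes calls within one process; it does not replace a distributed lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dedicated executor: exactly one thread ever runs call_me().
call_me_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="call-me")
seen_threads = set()

def call_me():
    # Record which thread executed this call (for demonstration only).
    seen_threads.add(threading.current_thread().name)

def i_am_calling():
    # other logic
    call_me_executor.submit(call_me).result()  # wait for the dedicated thread
    # other logic

# Simulate the multithreaded environment with a separate pool of callers.
with ThreadPoolExecutor(max_workers=4) as callers:
    futures = [callers.submit(i_am_calling) for _ in range(8)]
    for future in futures:
        future.result()

call_me_executor.shutdown()
print(seen_threads)  # contains a single thread name
```

Calling .result() keeps i_am_calling() synchronous; dropping it would turn call_me() into a fire-and-forget task.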

Python asyncio: synchronize all access to a shared object

I have a class which processes a bunch of work elements asynchronously (mainly due to overlapping HTTP connection requests) using asyncio. A very simplified example to demonstrate the structure of my code:
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Work:
    ...

    def worker(self, item):
        # do some work on item...
        return

    def queue(self):
        # generate the work items...
        yield from range(100)

    async def run(self):
        with ThreadPoolExecutor(max_workers=10) as executor:
            loop = asyncio.get_event_loop()
            tasks = [
                loop.run_in_executor(executor, self.worker, item)
                for item in self.queue()
            ]
            for result in await asyncio.gather(*tasks):
                pass

work = Work()
asyncio.run(work.run())
In practice, the workers need to access a shared container-like object and call its methods which are not async-safe. For example, let's say the worker method calls a function defined like this:
def func(shared_obj, value):
    for node in shared_obj.filter(value):
        shared_obj.remove(node)
However, calling func from a worker might affect the other asynchronous workers in this or any other function involving the shared object. I know that I need to use some synchronization, such as a global lock, but I don't find its usage easy:
asyncio.Lock can be used only in async functions, so I would have to mark all such function definitions as async
I would also have to await all calls of these functions
await is also usable only in async functions, so eventually all functions between worker and func would be async
if the worker was async, it would not be possible to pass it to loop.run_in_executor (it does not await)
Furthermore, some of the functions where I would have to add async may be generic in the sense that they should be callable from asynchronous as well as "normal" context.
I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock and work with it in a couple of places, without having to further annotate the functions. Also, there is a nice solution to wrap the shared object such that all access is transparently guarded by a lock. I'm wondering if something similar is possible with asyncio...

I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock...
What you are missing is that you're not really using asyncio at all. run_in_executor serves to integrate CPU-bound or legacy sync code into an asyncio application. It works by submitting the function to a ThreadPoolExecutor and returning an awaitable handle which gets resolved once the function completes. This is "async" in the sense of running in the background, but not in the sense that is central to asyncio. An asyncio program is composed of non-blocking pieces that use async/await to suspend execution when data is unavailable and rely on the event loop to efficiently wait for multiple events at once and resume the appropriate async functions.
In other words, as long as you rely on run_in_executor, you are just using threading (more precisely concurrent.futures with a threading executor). You can use a threading.Lock to synchronize between functions, and things will work exactly as if you used threading in the first place.
To get the benefits of asyncio such as scaling to a large number of concurrent tasks or reliable cancellation, you should design your program as async (or mostly async) from the ground up. Then you'll be able to modify shared data atomically simply by doing it between two awaits, or use asyncio.Lock for synchronized modification across awaits.
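A minimal sketch of the threading.Lock route, assuming the workers keep running through run_in_executor. The shared list, the lock name and the modulo filter are stand-ins invented for the example, since the real shared container is not shown.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

shared = list(range(100))        # stand-in for the shared container
shared_lock = threading.Lock()   # guards every access to `shared`

def func(shared_obj, value):
    # Plain synchronous function: no async/await needed to take the lock.
    with shared_lock:
        matching = [n for n in shared_obj if n % 10 == value]
        for node in matching:
            shared_obj.remove(node)

async def run():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=10) as executor:
        tasks = [
            loop.run_in_executor(executor, func, shared, value)
            for value in range(5)
        ]
        await asyncio.gather(*tasks)

asyncio.run(run())
print(len(shared))  # only numbers ending in 5..9 remain
```

Because func stays an ordinary function, it remains callable from both the executor threads and from normal synchronous code.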

Running long blocking calculations in parallel in twisted

I am trying to learn the Twisted framework, but I am not able to get a handle on it.
Say, I have this function.
def long_blocking_call(arg1, arg2):
    # do something
    time.sleep(5)  # simulate blocking call
    return result

results = []
for k, v in args.iteritems():
    r = long_blocking_call(k, v)
    results.append(r)
But I was wondering how I can leverage deferToThread (or something else in the Twisted world) to run long_blocking_call in "parallel".
I found this example: Periodically call deferToThread
But I am not sure whether that runs things in parallel.

deferToThread uses Python's built-in threading support to run the function passed to it in a separate thread (from a thread pool).
So deferToThread has all of the same properties as the built-in threading module when it comes to parallelism. On CPython, threads can run in parallel as long as only one of them is holding the Global Interpreter Lock.
Since there is no universal cause of "blocking" there is also no universal solution to "blocking" - so there's no way to say whether deferToThread will result in parallel execution or not in general. However, a general rule of thumb is that if the blocking comes from I/O it probably will and if it comes from computation it probably won't.
Of course, if it comes from I/O, you might be better off using some other feature from Twisted instead of multithreading.
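The rule of thumb can be illustrated without Twisted, using a plain standard-library thread pool, which is the same mechanism deferToThread builds on. This is a sketch, not Twisted code: time.sleep stands in for blocking I/O, and the inputs are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def long_blocking_call(arg1, arg2):
    time.sleep(0.2)  # stands in for blocking I/O; sleeping releases the GIL
    return (arg1, arg2)

args = {"a": 1, "b": 2, "c": 3, "d": 4}  # hypothetical inputs

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(long_blocking_call, k, v) for k, v in args.items()]
    results = [f.result() for f in futures]
elapsed = time.monotonic() - start

# The four calls overlap, so this takes roughly one sleep interval
# rather than the 0.8 s a sequential loop would need.
print(results, round(elapsed, 2))
```

If the body were a pure-Python computation instead of a sleep, the threads would mostly take turns holding the GIL and the elapsed time would be close to the sequential one.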

Twisted callRemote

I have to make remote calls that can take quite a long time (over 60 seconds). Our entire code relies on processing the return value from callRemote, so that's pretty bad, since we're blocking on I/O the whole time despite using Twisted with 50 worker threads running.
We currently use something like
result = threads.blockingCallFromThread(reactor, callRemote, "method", args)
and get the result and go on, but as its name says, it's blocking, so we cannot wait for several results at the same time.
There's no way I can refactor the whole code to make it asynchronous, so I think the only way is to defer the long I/O tasks to threads.
I'm trying to make the remote calls in threads, but I can't find a way to get the result back from the blocking calls. The remote calls are made and the result is somewhere, but I just can't get a hook on it.
What I'm trying to do currently looks like
reactor.callInThread(callRemote, name, *args, **kw)
which returns an empty Deferred (why?).
I'm trying to put the result in some sort of queue, but it just won't work. How do I do that?

AFAIK, blockingCallFromThread executes code in the reactor's thread. That's why it doesn't work the way you need.
If I understand you properly, you need to move some operation out of the reactor's thread and get the result back into the reactor's thread.
I use deferToThread for the same case.
Example with Deferreds:
import time
from twisted.internet import reactor, threads

def doLongCalculation():
    time.sleep(1)
    return 3

def printResult(x):
    print(x)

# run method in thread and get result as defer.Deferred
d = threads.deferToThread(doLongCalculation)
d.addCallback(printResult)
reactor.run()
Also, you might be interested in threads.deferToThreadPool.
Documentation about threading in Twisted.

Making file-handling code compatible with asyncio

The "traditional" way for a library to take file input is to do something like this:
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
The client code is responsible for opening the file, seeking to the appropriate point (if necessary), and closing it. If the client wants to hand us a pipe or socket (or a StringIO, for that matter), they can do that and it Just Works.
But this isn't compatible with asyncio, which requires a syntax more like this:
def foo(file_obj):
    data = yield from file_obj.read()
    # Do other things here
Naturally, this syntax only works with asyncio objects; trying to use it with traditional file objects makes a mess. The reverse is also true.
Worse, it seems to me there's no way to wrap this yield from inside a traditional .read() method, because we need to yield all the way up to the event loop, not just at the site where the reading happens. The gevent library does do something like this, but I don't see how to adapt their greenlet code into generators.
If I'm writing a library that handles file input, how should I deal with this situation? Do I need two versions of the foo() function? I have many such functions; duplicating all of them is not scalable.
I could tell my client developers to use run_in_executor() or some equivalent, but that feels like working against asyncio instead of with it.

This is one of the downsides of explicit asynchronous frameworks. Unlike gevent, which can monkeypatch synchronous code to make it asynchronous without any code changes, you can't make synchronous code asyncio-compatible without rewriting it to use asyncio.coroutine and yield from (or at least asyncio.Futures and callbacks) all the way down.
There's no way that I know of to have the same function work properly in both an asyncio and a normal, synchronous context. Any code that's asyncio-compatible relies on a running event loop to drive its asynchronous portions, so it won't work in a normal context, and synchronous code will always end up blocking the event loop if it's run in an asyncio context. This is why you generally see asyncio-specific (or at least asynchronous-framework-specific) versions of libraries alongside synchronous versions. There's just no good way to present a unified API that works with both.

Having considered this some more, I've come to the conclusion that it is possible to do this, but it's not exactly beautiful.
Start with the traditional version of foo():
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
We need to pass a file object which will behave "correctly" here. When the file object needs to do I/O, it should follow this process:
It creates a new event.
It creates a closure which, when invoked, performs the necessary I/O and then sets the event.
It hands the closure off to the event loop using call_soon_threadsafe().
It blocks on the event.
Here's some example code:
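import asyncio, threading

# inside the file object class; self.loop is assumed to hold the running event loop
def read(self):
    event = threading.Event()

    def closure():
        # Runs on the event loop thread. A plain callback cannot use
        # `yield from`, so the read is wrapped in a task instead.
        # self.reader is an asyncio StreamReader or similar
        task = self.loop.create_task(self.reader.read())
        task.add_done_callback(on_done)

    def on_done(task):
        self._tmp = task
        event.set()

    self.loop.call_soon_threadsafe(closure)
    event.wait()
    return self._tmp.result()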
We then arrange for foo(file_obj) to be run in an executor (e.g. using run_in_executor() as suggested in the OP).
The nice thing about this technique is that it works even if the author of foo() has no knowledge of asyncio. It also ensures I/O is served on the event loop, which could be desirable in certain circumstances.
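The standard library also offers asyncio.run_coroutine_threadsafe, which packages the same hand-off (schedule on the loop from another thread, then block for the result). Below is a small self-contained sketch of the idea; BlockingReader and FakeAsyncReader are hypothetical names invented for the example, and the fake reader stands in for a real StreamReader.

```python
import asyncio

class BlockingReader:
    """Synchronous facade over an object with an async read() (hypothetical helper)."""

    def __init__(self, async_reader, loop):
        self._reader = async_reader
        self._loop = loop

    def read(self):
        # Schedule the coroutine on the event loop from this worker thread
        # and block until it finishes. Must not be called on the loop thread.
        future = asyncio.run_coroutine_threadsafe(self._reader.read(), self._loop)
        return future.result()

class FakeAsyncReader:
    """Stand-in for an asyncio StreamReader."""

    async def read(self):
        await asyncio.sleep(0)
        return b"hello"

def foo(file_obj):
    # Traditional library code with no knowledge of asyncio.
    data = file_obj.read()
    return data.upper()

async def main():
    loop = asyncio.get_running_loop()
    file_obj = BlockingReader(FakeAsyncReader(), loop)
    # foo() blocks, so it runs in the default executor, not on the loop.
    return await loop.run_in_executor(None, foo, file_obj)

result = asyncio.run(main())
print(result)
```

The important constraint is the same as above: foo() has to run in an executor thread, because blocking on the future from the event loop thread itself would deadlock.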
