Python asyncio - Event loop in a context manager

Since I don't like the approach of using asyncio.run() for various reasons, I wanted to code a contextual loop, since the docs state on several occasions that if you don't go with the canonical asyncio.run() you have to prevent memory leaks yourself. After a bit of research, it seems the Python devs answer this feature request with "We don't need it!". Yet context managers seem perfectly suited to the lower-level asyncio API; see PEP 343 - The "with" Statement, example 10:
This can be used to deterministically close anything with a close
method, be it file, generator, or something else. It can even be used
when the object isn’t guaranteed to require closing (e.g., a function
that accepts an arbitrary iterable)
So can we do it anyway?
Related links:
https://bugs.python.org/issue24795
https://bugs.python.org/issue32875
https://groups.google.com/g/python-tulip/c/8bRLexUzeU4
https://bugs.python.org/issue19860#msg205062
https://github.com/python/asyncio/issues/261

Yes, we can have a context manager for our event loop, even though there seems to be no good practice for subclassing the loop, due to the C implementations. Basically, the idea sketched out below is the following:
TL;DR
Create an object with __enter__ and __exit__ methods so that the with syntax works.
Instead of returning the wrapper object itself, return the loop served by asyncio.
Wrap asyncio's loop.close() so that the loop gets stopped and our __exit__ method is invoked first.
Close everything that could lead to memory leaks, then close the loop.
Side-note
The implementation uses a wrapper object that returns a new loop into an anonymous block statement. Be aware that loop.stop() finalizes the loop, so no further actions should be scheduled after it. Overall, the code below is just a small helper and more of a styling choice in my opinion, especially since it is no real subclass. But I think if someone wants to use the lower-level API without having to finalize everything by hand, here is a possibility.
import asyncio

class OpenLoop:
    def close(self, *args, **kwargs):
        # replaces loop.close(); stopping ends run_forever(), and the
        # actual cleanup and close happen in __exit__
        self._loop.stop()

    def _close_wrapper(self):
        self._close = self._loop.close
        self._loop.close = self.close

    def __enter__(self):
        self._loop = asyncio.new_event_loop()
        self._close_wrapper()
        return self._loop

    def __exit__(self, *exc_info):
        # run the shutdown coroutines on the stopped (but not yet closed) loop
        self._loop.run_until_complete(self._loop.shutdown_asyncgens())
        self._loop.run_until_complete(self._loop.shutdown_default_executor())  # Python 3.9+
        # close other services
        self._close()

if __name__ == '__main__':
    with OpenLoop() as loop:
        loop.call_later(1, loop.close)
        loop.run_forever()
    assert loop.is_closed()
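If the close() interception is not needed, contextlib.contextmanager gives a shorter variant. Here is a minimal sketch under the same Python 3.9+ assumption, where the caller stops the loop explicitly with loop.stop():

import asyncio
import contextlib

@contextlib.contextmanager
def open_loop():
    loop = asyncio.new_event_loop()
    try:
        yield loop
    finally:
        # same cleanup as __exit__ above, then close for real
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.run_until_complete(loop.shutdown_default_executor())  # Python 3.9+
        loop.close()

if __name__ == '__main__':
    with open_loop() as loop:
        loop.call_later(1, loop.stop)
        loop.run_forever()
    assert loop.is_closed()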

Related

Behaviour of class members when threading inside a class

I have some code which sets up an interrupt handler in the main thread and runs a loop in a side thread. This is so I can Ctrl-C the main thread to signal to the loop to gracefully shut down, and this all happens inside one class, which looks like:
import concurrent.futures

class MyClass:
    # non-relevant stuff omitted for brevity
    def run(self):
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(self.my_loop, self.arg_1, self.arg_2)
            try:
                future.result()
            except KeyboardInterrupt:
                self.exit_event.set()  # read in my_loop(), exits after finishing an iteration
                future.result()
This works fine. My question is: are there special types of objects, or characteristics of objects, I should be aware of with this approach, specifically regarding self. members on MyClass? I think it's fine because my_loop is spawned inside MyClass and so no copies of the self. properties are made - initial testing suggests this is the case. I'm really wondering if there are any more exotic objects (e.g. non-pickleable ones, which do work fine here) I should consider?
As this is communication between threads rather than between processes, pickleability does not matter, as nothing is transmitted through queues. Your objects within your class (or outside it) can be anything.
The only thing you need to keep in mind with class members is that you need a lock to protect access to them. If several threads modify a shared member without one, your results will eventually be something unexpected.
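A minimal sketch of that rule, assuming a simple counter stands in for whatever shared state my_loop mutates:

import threading

class MyClass:
    def __init__(self):
        self._lock = threading.Lock()
        self.counter = 0  # shared between the main thread and my_loop

    def my_loop(self):
        for _ in range(100000):
            with self._lock:  # serialize access to the shared member
                self.counter += 1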

Python asyncio: Queue concurrent to normal code

Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I ran into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It has to be a class, as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". In this main loop, data is added to a naive "queue" (a list of HTTP requests) held by the class. At certain intervals, the main loop calls a class method that writes those requests and clears the "queue".
As you would expect, this is IO bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is not really CPU bound either.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this:

class data_storage(...):
    def add(...):
        ...

    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []

def main_loop(storage):
    # do many things
    for batch in dataset:  # simplified example
        # do stuff
        for some_other_loop:
            (...)
            storage.add(results)
        # for example, call each iteration
        storage.write_queue()

if __name__ == "__main__":
    storage = data_storage()
    main_loop(storage)
    ...
In detail: the connector class is from the package 'neo4j-connector', used to post to my Neo4j database. It essentially does the JSON formatting and uses the 'requests' library under the hood.
This works, even without a real queue, since nothing is concurrent.
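For context, here is a rough sketch of what such a connector does under the hood; the endpoint URL and payload shape are illustrative assumptions, not the actual neo4j-connector internals:

import requests

def run(statements):
    # batch the queued statements into one transactional HTTP POST
    payload = {"statements": [{"statement": s} for s in statements]}
    resp = requests.post(
        "http://localhost:7474/db/data/transaction/commit",  # illustrative URL
        json=payload)
    resp.raise_for_status()
    return resp.json()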
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both sides are driven by asyncio. I have only seen this implemented with functions, not classes, so I don't know how to approach it. With functions, my main loop would be a producer coroutine and my write function would become the consumer. Both are started as tasks on the queue and then gathered; I'd initialize only one producer but many consumers.
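For reference, a minimal sketch of that pattern with asyncio.Queue; the asyncio.sleep stands in for the real HTTP post, and the ten consumers match the worker count mentioned above:

import asyncio

async def consumer(queue):
    while True:
        item = await queue.get()
        await asyncio.sleep(0.1)  # stands in for the HTTP post
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    workers = [asyncio.create_task(consumer(queue)) for _ in range(10)]
    for item in range(100):       # stands in for the main loop adding results
        await queue.put(item)
    await queue.join()            # block until every item has been posted
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())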
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread-safe, so I don't think I can just wrap everything in async decorators and make it a coroutine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread as the workers. But it's fine if that's the outcome, as the workers don't do much on the CPU. So technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, from a stackoverflow post, I came up with this small change
class data_storage(...):
    def add(...):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this - in principle - works, I wonder if I could do the following:
class data_storage(...):
    def add(...):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def post(self, items):
        if len(items) > 0:
            self.nr_workers.increase()
            res = self.connector.run(items)
            self.nr_workers.decrease()

    def write_queue(self):
        if self.nr_workers < 10:
            items = self.queue.get(200)  # extract and delete from queue, non-concurrent
            self.post(items)  # add "worker"
for some hypothetical queue and nr_workers objects. Then, at the end of the main loop, have a function that blocks progress until the number of workers is zero and clears, non-concurrently, the rest of the queue.
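As a sketch only, assuming the hypothetical queue and nr_workers objects were built from standard primitives, with a semaphore capping the in-flight workers and a thread pool doing the posts (all names below are illustrative):

import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # illustrative cap

class data_storage:
    def __init__(self, connector):
        self.connector = connector
        self.queue = []
        self._slots = threading.Semaphore(MAX_WORKERS)
        self._pool = ThreadPoolExecutor(MAX_WORKERS)
        self._pending = []

    def _post(self, items):
        try:
            self.connector.run(items)
        finally:
            self._slots.release()  # free the worker slot

    def write_queue(self):
        # spawn a worker only if fewer than MAX_WORKERS posts are in flight
        if self.queue and self._slots.acquire(blocking=False):
            items, self.queue = self.queue[:200], self.queue[200:]
            self._pending.append(self._pool.submit(self._post, items))

    def drain(self):
        # at the end of the main loop: block until all workers have finished
        for f in self._pending:
            f.result()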
This seems like a monumentally bad idea, but I don't know how else to implement it. If it is terrible, I'd like to know before I put more work into it. Do you think it would work?
Otherwise, could you give me any pointers on how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!

How should I use asyncio in a library without interfering with other callers?

I want to write a library that manages child processes with asyncio. I don't want to force my callers to be asynchronous themselves, so I'd prefer to get a new_event_loop, do a run_until_complete, and then close it. Ideally I'd like to do this without conflicting with any other asyncio stuff the caller might be doing.
My problem is that waiting on subprocesses doesn't work unless you call set_event_loop, which attaches the internal watcher. But of course if I do that, I might conflict with other event loops in the caller. A workaround is to cache the caller's current loop (if any), and then call set_event_loop one more time when I'm done to restore the caller's state. That almost works. But if the caller is not an asyncio user, a side effect of calling get_event_loop is that I've now created a global loop that didn't exist before, and Python will print a scary warning if the program exits without calling close on that loop.
The only meta-workaround I can think of is to do an atexit.register callback that closes the global loop. That won't conflict with the caller because close is safe to call more than once, unless the caller has done something crazy like trying to start the global loop during exit. So it's still not perfect.
Is there a perfect solution to this?
What you're trying to achieve looks very much like ProcessPoolExecutor (in concurrent.futures).
Asynchronous caller:
import asyncio
from concurrent.futures import ProcessPoolExecutor

@asyncio.coroutine
def in_process(callback, *args, executor=ProcessPoolExecutor()):
    loop = asyncio.get_event_loop()
    result = yield from loop.run_in_executor(executor, callback, *args)
    return result
Synchronous caller:
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    future = executor.submit(callback, *args)
    result = future.result()
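On modern Python (3.7+), where the @asyncio.coroutine/yield from style is deprecated, the asynchronous caller could be sketched as:

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def in_process(callback, *args, executor=ProcessPoolExecutor()):
    # run the blocking callback in a separate process without blocking the loop
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, callback, *args)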

How does monkey_patch(time=True) affect eventlet.spawn?

Normally, when using green threads, I can write code like:

import eventlet

def myfunc():
    print "in my func"

eventlet.spawn(myfunc)
eventlet.sleep(0)  # then myfunc() is triggered as a green thread
But after using monkey_patch(time=True), I can change the code to:

eventlet.monkey_patch(time=True)
eventlet.spawn(myfunc)  # now myfunc is called immediately

Why don't I need to call eventlet.sleep(0) this time?
And after I write my own sleep function:

def my_sleep(seconds):
    print "oh, my god, my own..."

and set the sleep attribute of the built-in time module to my_sleep, I find that my_sleep is called many times, producing a lot of output. But I can see only one debug thread in Eclipse, and it did not call my_sleep. So the conclusion is that the sleep function is called continually by default, and I think the authors of eventlet knew this, so they developed the monkey_patch() function. Is that right?
Edit: According to the answer of @temoto, CPython cannot reproduce the same result. I think I should add some context about how I found this interesting thing. (Why didn't I add this at first? Typing so many words is not easy, and my English is not that good. ^^)
I found this while remote-debugging OpenStack code using Eclipse.
In nova/network/model.py, a class is written like this:
class NetworkInfoAsyncWrapper(NetworkInfo):
    """Wrapper around NetworkInfo that allows retrieving NetworkInfo
    in an async manner.

    This allows one to start querying for network information before
    you know you will need it. If you have a long-running
    operation, this allows the network model retrieval to occur in the
    background. When you need the data, it will ensure the async
    operation has completed.

    As an example:

    def allocate_net_info(arg1, arg2)
        return call_neutron_to_allocate(arg1, arg2)

    network_info = NetworkInfoAsyncWrapper(allocate_net_info, arg1, arg2)
    [do a long running operation -- real network_info will be retrieved
    in the background]
    [do something with network_info]
    """

    def __init__(self, async_method, *args, **kwargs):
        self._gt = eventlet.spawn(async_method, *args, **kwargs)
        methods = ['json', 'fixed_ips', 'floating_ips']
        for method in methods:
            fn = getattr(self, method)
            wrapper = functools.partial(self._sync_wrapper, fn)
            functools.update_wrapper(wrapper, fn)
            setattr(self, method, wrapper)
When I first debugged this function: after executing

self._gt = eventlet.spawn(async_method, *args, **kwargs)

the callback function async_method was executed at once. But please remember, this is a green thread; it should be triggered by eventlet.sleep(0). I didn't find any code calling sleep(0), so if sleep is perhaps called by Eclipse, then in the real (non-debug) world, who triggers it?
TL;DR: the Eventlet API does not require sleep(0) to start a green thread. spawn(fun) will start the function some time in the future, possibly right now. You should only call sleep(0) to ensure that it starts right now, and even then it's better to use explicit synchronization, like an Event or a Semaphore.
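For example, a minimal sketch of such explicit synchronization: GreenThread.wait() blocks until the spawned function has returned, no sleep(0) required:

import eventlet

def myfunc():
    print("in my func")

gt = eventlet.spawn(myfunc)
gt.wait()  # deterministic: myfunc has finished once this returns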
I can't reproduce this behavior using CPython 2.7.6 and 3.4.3, eventlet 0.17.4, in either IPython or a pure Python console. So it's probably Eclipse that is calling time.sleep in the background.
monkey_patch was introduced as a shortcut to going through all your (and third-party) code and replacing time.sleep with eventlet.sleep, and similarly for the os, socket, etc. modules. It's not related to Eclipse (or anything else) repeating time.sleep calls.
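This also fits the Eclipse theory: after monkey_patch(time=True), any ordinary time.sleep call (including one made by a debugger polling in the background) yields to the eventlet hub, giving a freshly spawned function a chance to run. A minimal sketch:

import eventlet
eventlet.monkey_patch(time=True)
import time

def myfunc():
    print("in my func")

eventlet.spawn(myfunc)
time.sleep(0)  # patched sleep yields to the hub, so myfunc runs here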
But this is an interesting observation, thank you.

Strange blocking behavior with python multiprocessing queue put() and get()

I have written a class in Python 2.7 (under Linux) that uses multiple processes to manipulate a database asynchronously. I encountered a very strange blocking behaviour when using multiprocessing.Queue.put() and multiprocessing.Queue.get() which I can't explain.
Here is a simplified version of what I do:
import time
from multiprocessing import Process, Queue

class MyDB(object):
    def __init__(self):
        self.inqueue = Queue()
        p1 = Process(target=self._worker_process, kwargs={"inqueue": self.inqueue})
        p1.daemon = True
        started = False
        while not started:
            try:
                p1.start()
                started = True
            except:
                time.sleep(1)
        # Sometimes I start a same second process, but it makes no difference to my problem
        p2 = Process(target=self._worker_process, kwargs={"inqueue": self.inqueue})
        # blahblah... (same as above)

    @staticmethod
    def _worker_process(inqueue):
        while True:
            # -------- this blocks despite data having arrived --------
            op = inqueue.get(block=True)
            # do something with the specified operation
            # -------- problem area end --------
            print "if this text gets printed, the problem was solved"

    def delete_parallel(self, key, rawkey=False):
        someid = ...  # blahblah
        # -------- this section blocked when I was posting the question,
        # but for unknown reasons it's fine now --------
        self.inqueue.put({"optype": "delete",
                          "kwargs": {"key": key, "rawkey": rawkey},
                          "callid": someid},
                         block=True)
        # -------- problem area end --------
        print "if you see this text, there was no blocking or the block was released"
If I run the code above inside a test (in which I call delete_parallel on the MyDB object), then everything works. But if I run it in the context of my entire application (importing other stuff, including pygtk), strange things happen:
For some reason self.inqueue.get blocks and never releases despite self.inqueue having the data in its buffer. When I instead call self.inqueue.get(block = False, timeout = 1) then the call finishes by raising Queue.Empty, despite the queue containing data. qsize() returns 1 (suggests that data is there) while empty() returns True (suggests that there is no data).
Now clearly there must be something somewhere else in my application that renders self.inqueue unusable by causing the acquisition of some internal semaphore. However, I don't know what to look for. Eclipse debugging becomes useless once a blocking semaphore is reached.
Edit 8 (cleaning up and summarizing my previous edits): Last time I had a similar problem, it turned out that pygtk was hijacking the global interpreter lock, but I solved it by calling gobject.threads_init() before I called anything else. Could this issue be related?
When I introduce a print "successful reception" after the get() method and execute my application in a terminal, the same behaviour happens at first. When I then terminate by pressing Ctrl+D, I suddenly get the string "successful reception" in between messages. This looks to me like some other process/thread is terminated and releases the lock that blocks the process stuck at get().
Since the process that was stuck terminates later, I still see the message. What kind of process could externally mess with a Queue like that? self.inqueue is only accessed inside my class.
Right now it seems to come down to this queue, which won't return anything despite the data being there: the get() method seems to get stuck when it attempts to receive the actual data from some internal pipe. The last line before my debugger hangs is:

res = self._recv()

which is inside multiprocessing.queues.Queue.get(). Tracking this internal Python machinery further, I find the assignments self._recv = self._reader.recv and self._reader, self._writer = Pipe(duplex=False).
Edit 9
I'm currently trying to hunt down the import that causes it. My application is quite complex, with hundreds of classes and each class importing a lot of other classes, so it's a pretty painful process. I have found a first candidate class which uses 3 different MyDB instances when I track all its imports (but doesn't access MyDB.inqueue at any time, as far as I can tell). The strange thing is, it's basically just a wrapper, and the wrapped class works just fine when imported on its own. That also means it uses MyDB without freezing. As soon as I import the wrapper (which imports that class), I have the blocking issue.
I started rewriting the wrapper by gradually reusing the old code, testing each time I introduce a couple of new lines, until I hopefully see which line causes the problem to return.
multiprocessing.Queue uses an internal feeder thread to maintain its state. If you are using GTK, it will break these threads, so you need to call gobject.threads_init().
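A minimal sketch of the initialization order, assuming the PyGTK-era API mentioned in Edit 8 above (threads_init() must run before any other GTK/GObject work):

import gobject
gobject.threads_init()  # must come before any other GTK/GObject calls

from multiprocessing import Process, Queue
# only now create MyDB, its worker processes, and any GTK widgets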
It should be noted that qsize() only returns an approximate size of the queue. The real size may be anywhere between 0 and the value returned by qsize().
