Set up Dask workers with an event loop for actors - Python

Context
I am trying to instantiate a legacy data extractor on my Dask worker using an actor pattern:
from dask.distributed import Client

client = Client()
connector = Sharepoint(CONF.sources["sharepoint"])
items = connector.enumerate_items()

# extraction
remote_extractor = client.submit(
    SharepointExtractor, CONF.sources["sharepoint"], connector, actor=True
)  # Create Extractor on a worker
extractor = remote_extractor.result()  # Get back a pointer to that object

futures = client.map(
    extractor.job,
    [i for i in items],
    retries=5,
    pure=False,
)
_ = await client.gather(futures)
The first thing the SharepointExtractor does is get an HTTP session from its connector:
class SharepointExtractor:
    def __init__(
        self, conf: ConfigTree, connector: Sharepoint, *args, **kwargs
    ) -> None:
        self.conf = conf
        self.session = connector.session_factory()
.session_factory() basically returns an aiohttp.client.ClientSession enriched with an OAuth token (which motivates the choice of an actor).
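For reference, the factory looks roughly like this (a simplified sketch; the token helper is just a placeholder for the real OAuth logic):

from aiohttp import ClientSession, TCPConnector

class Sharepoint:
    def __init__(self, conf):
        self.conf = conf

    def session_factory(self) -> ClientSession:
        # ClientSession / TCPConnector look up the current event loop at
        # construction time, which is exactly what fails inside the Dask worker.
        token = self._fetch_oauth_token()  # placeholder for the real OAuth call
        return ClientSession(
            connector=TCPConnector(limit=30),
            headers={"Authorization": f"Bearer {token}"},
        )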
The problem
At one point ClientSession's constructor calls asyncio.get_event_loop(), which does not seem to be available in the worker:
...
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/eteel/connectors/rest.py", line 96, in session_factory
connector=TCPConnector(limit=30),
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/aiohttp/connector.py", line 767, in __init__
super().__init__(
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/aiohttp/connector.py", line 234, in __init__
loop = get_running_loop(loop)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/aiohttp/helpers.py", line 287, in get_running_loop
loop = asyncio.get_event_loop()
File "/usr/lib/python3.10/asyncio/events.py", line 656, in get_event_loop
raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Dask-Default-Threads-484036-0'.
Since I am in a dev/local context, from what I understand, I end up with a LocalCluster.
Going async
I naively thought that going async would automagically inject an event loop into the workers.
client = await Client(asynchronous=True)
connector = Sharepoint(CONF.sources["sharepoint"])
items = connector.enumerate_items()

# extraction
remote_extractor = await client.submit(
    SharepointExtractor, CONF.sources["sharepoint"], connector, actor=True
)  # Create Extractor on a worker
extractor = await remote_extractor  # Get back a pointer to that object
But the same error occurs
Setting an event loop explicitly
loop = asyncio.new_event_loop()
client = await Client(
    asynchronous=True, loop=loop
)
This time, the error is slightly more enigmatic
....
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/client.py", line 923, in __init__
self._loop_runner = LoopRunner(loop=loop, asynchronous=asynchronous)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/utils.py", line 451, in __init__
if not loop.asyncio_loop.is_running():
AttributeError: '_UnixSelectorEventLoop' object has no attribute 'asyncio_loop'
(not sure what this constructor expects loop to be)
Do you have examples of Dask actors involving resources from aiohttp (or any other async lib)? How should I set up Dask workers to get an event loop available to my actors?
Edit
Following @mdurant's approach (a kind of singleton-based import of the extractor from an importable module):
extractor = [None]  # module-level singleton slot

def get_extractor(CONF):
    if extractor[0] is None:
        connector = Sharepoint(CONF.sources["sharepoint"])
        extractor[0] = SharepointBis(CONF.sources["sharepoint"], connector)
    return extractor[0]

def workload(CONF, item):
    extractor = get_extractor(CONF)
    return extractor.job(item)

def main():
    client = Client()
    connector = Sharepoint(CONF.sources["sharepoint"])
    items = connector.enumerate_items()

    futures = client.map(
        workload,
        [CONF for _ in range(len(items))],
        [i for i in items],
        retries=5,
        pure=False,
    )
    _ = client.gather(futures)
I still get
2022-12-01 10:05:54,923 - distributed.worker - WARNING - Compute Failed
Key: workload-ffcf0f1a-8aee-41d1-9ad2-f7eea91fa107-41
Function: workload
args: (<eteel.conf.ConfGenerator object at 0x7fae8040d4e0>, 'firex1.sharepoint.com,930e9ef8-6bdf-4484-9883-6aa9965c548f,aed0d0bd-a659-4dbf-bbaa-a56f4efa3b0c')
kwargs: {}
Exception: 'RuntimeError("There is no current event loop in thread \'Dask-Default-Threads-166860-1\'.")'
The same goes with Client(asynchronous=True), which drives me back to my question: how can I have an event loop in a Dask thread? I have a strong intuition that this has something to do with Client(asynchronous=True, loop={this parameter}).

OK, I think there is some confusion going on in this question, so I will do my best to clarify the situation. There are three main points:
- some things cannot be serialised between processes easily or at all
- some objects are expensive to create per process, and it would be nice to only do it once
- the work must happen in an async context
Here is how I would do it. Put this in an importable module.
extractor = [None]

def get_extractor(CONF):
    if extractor[0] is None:
        connector = Sharepoint(CONF.sources["sharepoint"])
        extractor[0] = SharepointExtractor(CONF.sources["sharepoint"], connector)
    return extractor[0]

async def workload(CONF, item):
    extractor = get_extractor(CONF)
    return await extractor.job(item, retries=5)

if __name__ == "__main__":  # or run this elsewhere
    client = ...
    items = ...
    futures = client.map(workload, [CONF] * len(items), items)
    output = client.gather(futures)
I do not know from the OP which parts of the workload are coroutines; I am guessing the .job method - but it should be obvious what I am doing. I note that the original code would not have worked in a simple non-dask session, and it is always best to start off with something that works before trying to daskify it.
On async in dask:
client.map/submit supports coroutine functions, and they will be executed on the same event loop as the main worker. That's all you need here. All the distributed components (worker, scheduler, client) are async, server-like implementations with event loops, but execution of worker code does not normally happen in the same thread as the one running that server.
Client(asynchronous=True) implies that the client is to be constructed and operated on only from within coroutines - and that the client's event loop is in the current thread. This is probably not what you want, unless you know what you are doing.
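To make the first point concrete, here is a minimal, self-contained sketch of mapping a coroutine function with an ordinary synchronous client (the names here are illustrative, not from the original post):

import asyncio
from dask.distributed import Client

async def fetch(x):
    await asyncio.sleep(0.1)  # stands in for real async I/O (e.g. an HTTP call)
    return x * 2

if __name__ == "__main__":
    client = Client()                      # ordinary synchronous client
    futures = client.map(fetch, range(4))  # coroutine functions are accepted
    print(client.gather(futures))          # -> [0, 2, 4, 6]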

Related

Python 3 asyncio with aioboto3 seems sequential

I am porting a simple Python 3 script to AWS Lambda.
The script is simple: it gathers information from a dozen S3 objects and returns the results.
The script used multiprocessing.Pool to gather all the files in parallel. However, multiprocessing cannot be used in an AWS Lambda environment, since /dev/shm is missing.
So I thought instead of writing a dirty multiprocessing.Process / multiprocessing.Queue replacement, I would try asyncio instead.
I am using the latest version of aioboto3 (8.0.5) on Python 3.8.
My problem is that I cannot seem to gain any improvement between a naive sequential download of the files, and an asyncio event loop multiplexing the downloads.
Here are the two versions of my code.
import sys
import asyncio
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import boto3
import aioboto3

BUCKET = 'some-bucket'
KEYS = [
    'some/key/1',
    [...]
    'some/key/10',
]

async def download_aio():
    """Concurrent download of all objects from S3"""
    async with aioboto3.client('s3') as s3:
        objects = [s3.get_object(Bucket=BUCKET, Key=k) for k in KEYS]
        objects = await asyncio.gather(*objects)
        buffers = await asyncio.gather(*[o['Body'].read() for o in objects])

def download():
    """Sequentially download all objects from S3"""
    s3 = boto3.client('s3')
    for key in KEYS:
        object = s3.get_object(Bucket=BUCKET, Key=key)
        object['Body'].read()

def run_sequential():
    download()

def run_concurrent():
    loop = asyncio.get_event_loop()
    # loop.set_default_executor(ProcessPoolExecutor(10))
    # loop.set_default_executor(ThreadPoolExecutor(10))
    loop.run_until_complete(download_aio())
The timings for run_sequential() and run_concurrent() are quite similar (~3 seconds for a dozen 10 MB files).
I am convinced the concurrent version is not actually concurrent, for multiple reasons:
I tried switching to a Process/ThreadPoolExecutor, and I see the processes/threads spawned for the duration of the function, though they are doing nothing
The timing between sequential and concurrent is very close to the same, though my network interface is definitely not saturated, and the CPU is not bound either
The time taken by the concurrent version increases linearly with the number of files.
I am sure something is missing, but I just can't wrap my head around what.
Any ideas?
After losing some hours trying to understand how to use aioboto3 correctly, I decided to just switch to my backup solution.
I ended up rolling my own naive version of multiprocessing.Pool for use within an AWS lambda environment.
If someone stumbles across this thread in the future, here it is. It is far from perfect, but it is easy enough to replace multiprocessing.Pool as-is for my simple cases.
from multiprocessing import Process, Pipe
from multiprocessing.connection import wait

class Pool:
    """Naive implementation of a process pool with mp.Pool API.

    This is useful since multiprocessing.Pool uses a Queue in /dev/shm, which
    is not mounted in an AWS Lambda environment.
    """

    def __init__(self, process_count=1):
        assert process_count >= 1
        self.process_count = process_count

    @staticmethod
    def wrap_pipe(pipe, index, func):
        def wrapper(args):
            try:
                result = func(args)
            except Exception as exc:  # pylint: disable=broad-except
                result = exc
            pipe.send((index, result))
        return wrapper

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        pass

    def map(self, function, arguments):
        pending = list(enumerate(arguments))
        running = []
        finished = [None] * len(pending)

        while pending or running:
            # Fill the running queue with new jobs
            while len(running) < self.process_count:
                if not pending:
                    break
                index, args = pending.pop(0)

                pipe_parent, pipe_child = Pipe(False)
                process = Process(
                    target=Pool.wrap_pipe(pipe_child, index, function),
                    args=(args, ))
                process.start()

                running.append((index, process, pipe_parent))

            # Wait for jobs to finish
            for pipe in wait(list(map(lambda t: t[2], running))):
                index, result = pipe.recv()

                # Remove the finished job from the running list
                running = list(filter(lambda x: x[0] != index, running))

                # Add the result to the finished list
                finished[index] = result

        return finished
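For completeness, a trivial usage example of the class above (mine, not from the original answer):

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(process_count=4) as pool:
        print(pool.map(square, range(8)))  # -> [0, 1, 4, 9, 16, 25, 36, 49]

Note that this relies on the fork start method (the Linux default, which is what Lambda uses), since the wrapped worker function is a closure and cannot be pickled for spawn.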
It's 1.5 years later and aioboto3 is still not well documented or supported.
The multithreading option is good, but async IO is an easier and clearer implementation.
I don't actually know what's wrong with your AIO code; it doesn't even run now, I guess because of library updates. But using aiobotocore, this code worked. My test was with 100 images: the sequential code takes about 8 seconds on average, while the async IO version took less than 2.
With 1000 images it was 17 seconds.
import asyncio
from aiobotocore.session import get_session

async def download_aio(s3, bucket, file_name):
    o = await s3.get_object(Bucket=bucket, Key=file_name)
    x = await o['Body'].read()

async def run_concurrent():
    tasks = []
    session = get_session()
    async with session.create_client('s3') as s3:
        for k in KEYS[:100]:
            tasks.append(asyncio.ensure_future(download_aio(s3, BUCKET, k)))
        await asyncio.gather(*tasks)
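To actually drive that coroutine (assuming BUCKET and KEYS are defined as in the question), something like this should work:

loop = asyncio.get_event_loop()
loop.run_until_complete(run_concurrent())
# or, on Python 3.7+:
# asyncio.run(run_concurrent())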

Racing Two Tasks in Different Event Loops

I am using the Docker SDK, and I am trying to race a task that times out after some number of seconds against another task that waits on a Docker container to finish. In effect, I want to know if a given container finishes within the timeout I've set.
I have the following code to do it (adapted from this post):
container = # ... create container with Docker SDK
timeout = # ... some int
killed = None

# our tasks
async def __timeout():
    await asyncio.sleep(timeout)
    return True

async def __run():
    container.wait()
    return False

# loop and runner
wait_loop = asyncio.new_event_loop()
done, pending = wait_loop.run_until_complete(
    asyncio.wait({__run(), __timeout()}, return_when=asyncio.FIRST_COMPLETED)
)

# result extraction
for task in done:
    if killed is None:
        killed = task.result()
        # ... do something with result

# clean up
for task in pending:
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        wait_loop.run_until_complete(task)

wait_loop.close()
Unfortunately, I keep getting the following error:
File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
return future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
result = coro.throw(exc)
File "/usr/lib/python3.5/asyncio/tasks.py", line 347, in wait
return (yield from _wait(fs, timeout, return_when, loop))
File "/usr/lib/python3.5/asyncio/tasks.py", line 430, in _wait
yield from waiter
File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
yield self # This tells Task to wait for completion.
RuntimeError: Task <Task pending coro=<wait() running at /usr/lib/python3.5/asyncio/tasks.py:347> cb=[_run_until_complete_cb() at /usr/lib/python3.5/asyncio/base_events.py:164]> got Future <Future> pending> attached to a different loop
It seems I can't race with the wait task because it belongs to a different loop. Is there any way I can get around this error so that I can determine which task finishes first?
The problem is simple: there is one default loop in every thread, which is set by asyncio.set_event_loop(loop). You can then get this loop with loop = asyncio.get_event_loop().
So the problem is, mostly, that some packages use asyncio.get_event_loop() by default to get the currently running loop. Take aiohttp as an example:
class aiohttp.ClientSession(*, connector=None, loop=None, cookies=None, headers=None, skip_auto_headers=None, auth=None, json_serialize=json.dumps, version=aiohttp.HttpVersion11, cookie_jar=None, read_timeout=None, conn_timeout=None, timeout=sentinel, raise_for_status=False, connector_owner=True, auto_decompress=True, requote_redirect_url=False, trust_env=False, trace_configs=None)
As you can see, it accepts a loop parameter to specify the running loop, but you can also just leave it blank to use asyncio.get_event_loop() by default.
Your problem is that you are launching coroutines in a newly created loop, but you cannot confirm that all your internal operations are also using this newly created one. Since they may use asyncio.get_event_loop(), they will be attached to another loop, the default loop of the current thread.
In my opinion, you don't really need to create a new loop; let users do that. Just like the example above, accept a loop argument and, if it is None, use the default one.
Otherwise, you need to carefully inspect your code to ensure that every possible coroutine is using the loop you create.
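If you do stay on a single loop, one possible shape for the race (a sketch of my own, not from the original post; container.wait() is a blocking Docker SDK call, so it is pushed to an executor here):

import asyncio

async def wait_or_timeout(container, timeout, loop=None):
    loop = loop or asyncio.get_event_loop()
    # Run the blocking Docker SDK call in the default executor so the loop stays free.
    waiter = loop.run_in_executor(None, container.wait)
    try:
        await asyncio.wait_for(waiter, timeout)
        return False  # container finished within the timeout
    except asyncio.TimeoutError:
        return True   # timed out ("killed")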

Running an event loop within its own thread

I'm playing with Python's new(ish) asyncio stuff, trying to combine its event loop with traditional threading. I have written a class that runs the event loop in its own thread, to isolate it, and then provide a (synchronous) method that runs a coroutine on that loop and returns the result. (I realise this makes it a somewhat pointless example, because it necessarily serialises everything, but it's just as a proof-of-concept).
import asyncio
import aiohttp
from threading import Thread

class Fetcher(object):
    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # FIXME Do I need this? It works either way...
        # asyncio.set_event_loop(self._loop)

        self._session = aiohttp.ClientSession(loop=self._loop)

        self._thread = Thread(target=self._loop.run_forever)
        self._thread.start()

    def __enter__(self):
        return self

    def __exit__(self, *e):
        self._session.close()
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()

    def __call__(self, url:str) -> str:
        # FIXME Can I not get a future from some method of the loop?
        future = asyncio.run_coroutine_threadsafe(self._get_response(url), self._loop)
        return future.result()

    async def _get_response(self, url:str) -> str:
        async with self._session.get(url) as response:
            assert response.status == 200
            return await response.text()

if __name__ == "__main__":
    with Fetcher() as fetcher:
        while True:
            x = input("> ")
            if x.lower() == "exit":
                break

            try:
                print(fetcher(x))
            except Exception as e:
                print(f"WTF? {e.__class__.__name__}")
To avoid this sounding too much like a "Code Review" question: what is the purpose of asyncio.set_event_loop, and do I need it in the above? It works fine with and without it. Moreover, is there a loop-level method to invoke a coroutine and return a future? It seems a bit odd to do this with a module-level function.
You would need to use set_event_loop if you called get_event_loop anywhere and wanted it to return the loop created when you called new_event_loop.
From the docs
If there’s need to set this loop as the event loop for the current context, set_event_loop() must be called explicitly.
Since you do not call get_event_loop anywhere in your example, you can omit the call to set_event_loop.
I might be misinterpreting, but I think the comment by @dirn in the marked answer is incorrect in stating that get_event_loop works from a thread. See the following example:
import asyncio
import threading

async def hello():
    print('started hello')
    await asyncio.sleep(5)
    print('finished hello')

def threaded_func():
    el = asyncio.get_event_loop()
    el.run_until_complete(hello())

thread = threading.Thread(target=threaded_func)
thread.start()
This produces the following error:
RuntimeError: There is no current event loop in thread 'Thread-1'.
It can be fixed by:
- el = asyncio.get_event_loop()
+ el = asyncio.new_event_loop()
The documentation also specifies that this trick (creating an event loop by calling get_event_loop) only works on the main thread:
If there is no current event loop set in the current OS thread, the OS thread is main, and set_event_loop() has not yet been called, asyncio will create a new event loop and set it as the current one.
Finally, the docs also recommend using get_running_loop instead of get_event_loop if you're on version 3.7 or higher.
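A tiny illustration of that recommendation (Python 3.7+):

import asyncio

async def main():
    loop = asyncio.get_running_loop()  # always the loop that is running this coroutine
    await asyncio.sleep(0)
    print(loop)

asyncio.run(main())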

Recipe for tracking in Python coroutines the chain of current awaits in a neat way

Want to know which awaitable a Python asyncio task is currently on?
After some research, I did not find anything in the Python 3 asyncio library for knowing which await or yield from directive a Task instance is currently on.
I want to share this with you: feel free to give some advice on how it can be improved or if something better exists from your own recipes or from the standard library itself :)
import inspect
import linecache
import traceback

def get_awaitable_stack_trace(task, file):
    """
    Get the callstack representing the chain of awaits-yield from directives
    that a Task object is currently on.

    :param task: The Task object.
    :param file: The file-like object on which the callstack is written.
    """
    extracted_list = []
    coro = task._coro
    while True:
        # Get the information on the current coroutine or generator.
        coro_name = coro.__name__
        try:
            frame = coro.cr_frame
        except AttributeError:
            frame = coro.gi_frame

        coro_filename = frame.f_code.co_filename
        await_line_number = frame.f_lineno
        linecache.checkcache(coro_filename)
        line = linecache.getline(coro_filename, await_line_number,
                                 frame.f_globals)

        # Record the stack trace info for this coroutine.
        extracted_list.append(
            (coro_filename, await_line_number, coro_name, line))

        # Get the next awaitable object in the chain.
        try:
            coro = coro.cr_await
        except AttributeError:
            coro = coro.gi_yieldfrom

        if not inspect.isawaitable(coro):
            break

    traceback.print_list(extracted_list, file=file)
Given this code:
async def coro_1():
    try:
        await asyncio.sleep(999)
    except asyncio.CancelledError:
        # Do something.
        pass

async def coro_2():
    await coro_1()
A task that calls coro_2, when passed to get_awaitable_stack_trace, will produce this call stack:
File "D:\Dev\Git\test.py", line 164, in agent
await coro_2()
File "D:\Dev\Git\test.py", line 148, in coro_2
await coro_1()
File "D:\Dev\Git\test.py", line 142, in coro_1
await asyncio.sleep(999)
File "C:\Python35\Lib\asyncio\tasks.py", line 508, in sleep
return (yield from future)
Hope this helps!
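In case it is useful, here is one way to drive the recipe; the agent/main driver below is my reconstruction, not part of the original snippet. The idea is to start the task, give it one loop iteration so it reaches its first real suspension point, then dump the chain:

import asyncio
import sys

async def agent():
    await coro_2()

async def main():
    task = asyncio.ensure_future(agent())
    await asyncio.sleep(0)  # let the task run until it suspends on asyncio.sleep(999)
    get_awaitable_stack_trace(task, sys.stdout)
    task.cancel()
    await task  # coro_1 swallows the CancelledError, so this finishes cleanly

loop = asyncio.new_event_loop()
loop.run_until_complete(main())
loop.close()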

How to execute multiple queries from a list at once in Python?

I'm using Tornado and Postgres. I have some queries (4 or 5) that I appended to a list during the program, and now I want to execute all of them at once.
When I tried to execute them, I got this error:
"DummyFuture does not support blocking for results"
I executed this code:
yield self.db.execute(''.join(queries)).result()
"queries" is list of queries!
This is my connection pool and also Tonado setting:
ioloop = IOLoop.instance()

application.db = momoko.Pool(
    dsn='dbname=xxx user=xxx password=xxxx host=x port=xxxx',
    size=xx,
    ioloop=ioloop,
)

# this is one way to run ioloop in sync
future = application.db.connect()
ioloop.add_future(future, lambda f: ioloop.stop())
ioloop.start()
future.result()  # raises exception on connection error

http_server = HTTPServer(application)
http_server.listen(8888, 'localhost')
ioloop.start()
Don't call result() on a Future in a Tornado coroutine. Get results like this:
@gen.coroutine
def method(self):
    result = yield self.db.execute('...')
Also, I don't think it will work to just join your queries as strings. The outcome won't be valid SQL. Instead:
@gen.coroutine
def method(self):
    results = yield [self.db.execute(q) for q in queries]
