How to retrieve a task from a future in python? - python

Let's say I have the following code to run multiple tasks in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=connections) as executor:
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            executor,
            fun,
            arg
        )
        for i in range(connections)
    ]
    for result in await asyncio.gather(*futures):
        # I want to access the futures task here
        pass
Is it possible to read the futures' task once it has been executed?

Is it possible to read the futures' task once it has been executed?
In asyncio, the word task has a specialized meaning: it refers to a subclass of Future that is specialized for driving coroutines.
In your code, asyncio.gather() returns results, and you also have the futures variable that contains the Future objects which can also be used to access the same results. If you need to access additional information (like the original fun or arg), you can attach it to the appropriate Future or use a dict to map it. For example:
futures = []
for conn in connections:
    fut = loop.run_in_executor(executor, fun, arg)
    fut.conn = conn  # or other info you need
    futures.append(fut)
await asyncio.wait(futures)
# at this point all the futures are done, and you can use future.result()
# to access the result of an individual future, and future.conn to obtain
# the connection the future was created for
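Alternatively, if you prefer not to set attributes on Future objects, the dict-based mapping mentioned above could look like the following minimal sketch (it assumes the same fun, arg and a connections iterable as in the snippet above):
# map each executor future back to the connection it was created for
future_to_conn = {
    loop.run_in_executor(executor, fun, arg): conn
    for conn in connections
}
done, _ = await asyncio.wait(list(future_to_conn))
for fut in done:
    conn = future_to_conn[fut]  # the extra info associated with this future
    result = fut.result()       # the value returned by fun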

Related

Stream processing mixed sync and async items

I have a list of objects to process. Some can be processed immediately, but others need to be processed by first fetching a URL. The organization looks something like:
processed_items = []
for item in list:
    if url := item.get('location'):
        fetched_item = fetch_item_from(url)
        processed_item = process(fetched_item)
    else:
        processed_item = process(item)
    if processed_item:
        processed_items.append(processed_item)
The problem is that there are so many items that the only way to handle this in a memory-efficient way is to process them as they come in. On the other hand, doing them sequentially like this takes forever -- it's much more efficient to make the network requests asynchronously.
In theory, you could save all the items with URLs, then fetch them all at once using tasks and asyncio.gather. I have actually done this and it works. But this list of unfetched items can quickly eat up your memory, since the items are being streamed in, and making a ton of network requests all at once can make the server mad.
I think I'm looking for a result that leaves me with an array like
processed_items = [1, 2, <awaitable>, 3, <awaitable>, ...]
which I can then await the result of.
Is this the right approach? And if so, what's this design pattern called? Any first steps?
Just execute your code above in an asynchronous function - in a way that each item is processed in a separate task - and wrap your "fetch_item_from" function in an async function that uses an asyncio.Semaphore to limit the number of parallel requests to whatever you find optimal - be it 7, 10, 50 or 100.
If the rest of your processing is just CPU intensive you won't need any other async features there.
Actually, if your `fetch_item_from` is not async itself, you can simply use `run_in_executor` - and the nature of the concurrent.futures.Executor itself will limit the number of concurrent requests, without the need to use a Semaphore at all.
import asyncio

MAXREQUESTS = 20

# Use this part if your original `fetch_item_from` is synchronous:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(MAXREQUESTS)

async def fetch_item_from_with_executor(url):
    loop = asyncio.get_running_loop()
    # This is automatically limited to the number of workers in the executor
    return await loop.run_in_executor(executor, fetch_item_from, url)

# Use this part if fetch_item_from is asynchronous itself
semaphore = asyncio.Semaphore(MAXREQUESTS)

async def fetch_item_from_async(url):
    async with semaphore:
        return await fetch_item_from(url)

# common code:
async def process_item(item):
    if url := item.get('location'):
        item = await fetch_item_from_with_executor(url)  # / fetch_item_from_async
    return process(item)

async def main(list_):
    pending_list = [asyncio.create_task(process_item(item)) for item in list_]
    processed_items = []
    while pending_list:
        # The timeout=10 below is optional, and will return control
        # here with the already completed tasks every 10 seconds:
        # this way you can print some progress indicator to see how
        # things are going - or even improve the code so that
        # finished tasks are yielded earlier to be consumed by the callers of "main"
        # in parallel.
        # If the timeout argument is omitted, all items are processed in a single batch.
        done, pending_list = await asyncio.wait(pending_list, timeout=10)
        processed_items.extend(done)
    # retrieve the results from each task and filter out the falsy (None?) ones:
    return [result for item in processed_items if (result := item.result())]

list_ = ...
processed_items = asyncio.run(main(list_))
(Missing above is any error handling - if either fetch_item_from or process can raise an exception, you have to unfold the list comprehension that blindly calls .result() on each task, to separate the tasks that raised from the ones that completed successfully.)
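For reference, a minimal sketch of that unfolding, replacing the final list comprehension inside main (collecting exceptions into an errors list is just one way to handle failures):
    # inside main(), instead of the final list comprehension:
    results = []
    errors = []
    for task in processed_items:
        try:
            result = task.result()  # re-raises anything fetch_item_from or process raised
        except Exception as exc:
            errors.append(exc)      # collect (or log) failures per task
        else:
            if result:              # keep only truthy results, as before
                results.append(result)
    return results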

How to use concurrency as well as multiprocessing to handle the data received through websocket in python?

CODE:
url = 'ws://xx.xx.xx.xx:1234'
ws = create_connection(url)
ws.send(json.dumps(subscribe_msg))
ws.recv()
while True:
    result = ws.recv()
    # handle the result using a different core each time
    handle_parallely(result)
The result = ws.recv() call in the while loop needs to run concurrently, so that ws.recv can be called again without waiting for handle_parallely to return.
handle_parallely needs to run in parallel when it is called.
The data received and its processing is independent of any previous or future data.
You can use a ProcessPoolExecutor from the concurrent.futures module. This could look like:
from concurrent.futures import ProcessPoolExecutor

max_number_of_processes = 4  # just put any number here
futures = []

with ProcessPoolExecutor(max_workers=max_number_of_processes) as executor:
    while True:
        result = ws.recv()
        # handle the result using a different core each time
        future = executor.submit(handle_parallely, result)
        futures.append(future)
        futures = [f for f in futures if not f.done()]
This of course only works if result and handle_parallely are picklable; see this for which types are picklable by default if you run into a PicklingError.
Storing the futures in that list is optional, but maybe you want to keep track of references to them.
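If you also want the return values of handle_parallely without blocking the receive loop, one option (not from the original answer) is a done-callback that pushes results onto a thread-safe queue; a minimal sketch, assuming handle_parallely returns something you want to collect:
import queue

results = queue.Queue()  # hypothetical sink for finished results

def on_done(future):
    # runs in the parent process as soon as the worker finishes
    results.put(future.result())

with ProcessPoolExecutor(max_workers=max_number_of_processes) as executor:
    while True:
        result = ws.recv()
        future = executor.submit(handle_parallely, result)
        future.add_done_callback(on_done)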

Implementing a coroutine in python

Let's say I have a C++ function result_type compute(input_type input), which I have made available to python using cython. My python code executes multiple computations like this:
def compute_total_result():
    inputs = ...
    total_result = ...
    for input in inputs:
        result = compute_python_wrapper(input)
        update_total_result(total_result)
    return total_result
Since the computation takes a long time, I have implemented a C++ thread pool (like this) and written a function std::future<result_type> compute_threaded(input_type input), which returns a future that becomes ready as soon as the thread pool is done executing.
What I would like to do is to use this C++ function in python as well. A simple way to do this would be to wrap the std::future<result_type> including its get() function, wait for all results like this:
def compute_total_results_parallel():
    inputs = ...
    total_result = ...
    futures = []
    for input in inputs:
        futures.append(compute_threaded_python_wrapper(input))
    for future in futures:
        update_total_result(future.get())
    return total_result
I suppose this works well enough in this case, but it becomes very complicated very fast, because I have to pass futures around.
However, I think that conceptually, waiting for these C++ results is no different from waiting for file or network I/O.
To facilitate I/O operations, the python devs introduced the async / await keywords. If my compute_threaded_python_wrapper would be part of asyncio, I could simply rewrite it as
async def compute_total_results_async():
    inputs = ...
    total_result = ...
    for input in inputs:
        result = await compute_threaded_python_wrapper(input)
        update_total_result(total_result)
    return total_result
And I could execute the whole code via result = asyncio.run(compute_total_results_async()).
There are a lot of tutorials regarding async programming in python, but most of them deal with using coroutines where the bedrock seems to be some call into the asyncio package, mostly calling asyncio.sleep(delay) as a proxy for I/O.
My question is: (How) Can I implement coroutines in python, enabling python to await the wrapped future object (There is some mention of a __await__ method returning an iterator)?
First, an inaccuracy in the question needs to be corrected:
If my compute_threaded_python_wrapper would be part of asyncio, I could simply rewrite it as [...]
The rewrite is incorrect: await means "wait until the computation finishes", so the loop as written would execute the code sequentially. A rewrite that actually runs the tasks in parallel would be something like:
# a direct translation of the "parallel" version
async def compute_total_results_async():
    inputs = ...
    total_result = ...
    tasks = []
    # first spawn all the tasks
    for input in inputs:
        tasks.append(
            asyncio.create_task(compute_threaded_python_wrapper(input))
        )
    # and then await them
    for task in tasks:
        update_total_result(await task)
    return total_result
This spawn-all-await-all pattern is so ubiquitous that asyncio provides a helper function, asyncio.gather(), which makes it much shorter, especially when combined with a list comprehension:
# a more idiomatic version
async def compute_total_results_async():
    inputs = ...
    total_result = ...
    results = await asyncio.gather(
        *[compute_threaded_python_wrapper(input) for input in inputs]
    )
    for result in results:
        update_total_result(result)
    return total_result
With that out of the way, we can proceed to the main question:
My question is: (How) Can I implement coroutines in python, enabling python to await the wrapped future object (There is some mention of a __await__ method returning an iterator)?
Yes, awaitable objects are implemented using iterators that yield to indicate suspension. But that is way too low-level a tool for what you actually need. You don't need just any awaitable, but one that works with the asyncio event loop, which has specific expectations of the underlying iterator. You need a mechanism to resume the awaitable when the result is ready, where you again depend on asyncio.
Asyncio already provides awaitable objects that can be externally assigned a value: futures. An asyncio future represents an async value that will become available at some point in the future. They are related to, but not semantically equivalent to, C++ futures, and should not be confused with multi-threaded futures from the concurrent.futures stdlib module.
To create an awaitable object that is activated by something that happens in another thread, you need to create a future, and then start your off-thread task, instructing it to mark the future as completed when it finishes execution. Since asyncio futures are not thread-safe, this must be done using the call_soon_threadsafe event loop method provided by asyncio for such situations. In Python it would be done like this:
def run_async():
    loop = asyncio.get_event_loop()
    future = loop.create_future()

    def on_done(result):
        # when done, notify the future in a thread-safe manner
        loop.call_soon_threadsafe(future.set_result, result)

    # start the worker in a thread owned by the pool
    pool.submit(_worker, on_done)

    # returning a future makes run_async() awaitable, and
    # passable to asyncio.gather() etc.
    return future

def _worker(on_done):
    # this runs in a different thread
    # ... processing goes here ...
    result = ...
    on_done(result)
In your case, the worker would be presumably implemented in Cython combined with C++.
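To make the sketch above concrete, here is one way it might be wired up end to end; pool is assumed to be a concurrent.futures.ThreadPoolExecutor, and compute_total_results_async here is only an illustration of how the awaitable returned by run_async() plugs into the earlier gather()-based code (none of these names come from the original answer):
import asyncio
import concurrent.futures

# the pool referenced by run_async() above; any thread pool will do
pool = concurrent.futures.ThreadPoolExecutor()

async def compute_total_results_async(n):
    # run_async() returns asyncio futures, so they can be passed to gather() directly
    return await asyncio.gather(*[run_async() for _ in range(n)])

# results = asyncio.run(compute_total_results_async(10))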

Is asyncio.wait order guaranteed?

I'm trying to implement fair queuing in my library that is based on asyncio.
In some function, I have a statement like (assume socketX are tasks):
done, pending = await asyncio.wait(
    [socket1, socket2, socket3],
    return_when=asyncio.FIRST_COMPLETED,
)
Now I read the documentation for asyncio.wait many times but it does not contain the information I'm after. Mainly, I'd like to know if:
If socket1, socket2 and socket3 happen to be already ready when I issue the call, is it guaranteed that done will contain them all, or could it return only one (or two)?
In the latter case, does the order of the tasks passed to wait() matter?
I'm trying to work out whether I can just apply fair queuing within the set of done tasks (by picking one and leaving the other tasks for later resolution), or whether I also need to care about the order in which I pass the tasks.
The documentation is kinda silent about this. Any idea?
This is based only on the source code of Python 3.5.
If the futures are already done before calling wait, they will all be placed in the done set:
import asyncio

async def f(n):
    return n

async def main():
    done, pending = await asyncio.wait([f(1), f(2), f(3)], return_when=asyncio.FIRST_COMPLETED)
    print(done)     # prints set of 3 futures
    print(pending)  # prints empty set

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
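For contrast, a minimal sketch (not from the original answer, and wrapping the coroutines in tasks explicitly, as newer Python versions require) in which only one of the tasks is ready by the time wait() returns:
import asyncio

async def fast():
    return "fast"

async def slow():
    await asyncio.sleep(1)
    return "slow"

async def main():
    tasks = [asyncio.create_task(fast()), asyncio.create_task(slow())]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print(len(done))     # 1 -- only the fast task is done
    print(len(pending))  # 1 -- the slow task is still pending

asyncio.run(main())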

How to know which coroutines were done with asyncio.wait()

I have two StreamReader objects and want to read from them in a loop. I'm using asyncio.wait like this:
done, pending = await asyncio.wait(
    [reader.read(1000), freader.read(1000)],
    return_when=asyncio.FIRST_COMPLETED)
Now done.pop() gives me the future that finished first. The problem is I don't know how to find which read() operation completed. I tried putting [reader.read(1000), freader.read(1000)] in a tasks variable and comparing the done future with those. But this seems to be incorrect since the done future is equal to none of the original tasks. So how am I supposed to find which coroutine was finished?
You need to create a separate task for each .read call, and pass those tasks to .wait. You can then check which of the tasks appear in the results.
reader_task = asyncio.ensure_future(reader.read(1000))
...
done, pending = await asyncio.wait(
    [reader_task, ...],
    return_when=asyncio.FIRST_COMPLETED,
)
if reader_task in done:
    ...
...
...
See e.g. this example from the websockets documentation.
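Applied to the two readers from the question, a minimal sketch might look like this (read_first is a hypothetical helper name; reader and freader are assumed to be the two StreamReader objects from the question):
async def read_first(reader, freader):
    # wrap each read in a task so it can be identified afterwards
    reader_task = asyncio.ensure_future(reader.read(1000))
    freader_task = asyncio.ensure_future(freader.read(1000))

    done, pending = await asyncio.wait(
        [reader_task, freader_task],
        return_when=asyncio.FIRST_COMPLETED,
    )

    if reader_task in done:
        print("reader finished first:", reader_task.result())
    if freader_task in done:
        print("freader finished first:", freader_task.result())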
