Stream processing mixed sync and async items - python

I have a list of objects to process. Some can be processed immediately, but others need to be processed by first fetching a URL. The organization looks something like:
processed_items = []
for item in list:
if url := item.get('location'):
fetched_item = fetch_item_from(url)
processed_item = process(fetched_item)
else:
processed_item = process(item)
if processed_item:
processed_items.append(processed_item)
The problem is that there are so many items that the only way to handle this in a memory efficient way is to process these files as they come in. On the other hand, doing them sequentially like this takes forever -- it's much more efficient to make the network requests asynchronously.
In theory, you could save all the items with URLs, then fetch them all at once using tasks and asyncio.gather. I have actually done this and it works. But this list of unfetched items can quickly eat up your memory, since the items are being streamed in, and making a ton of network requests all at once can make the server mad.
I think I'm looking for a result that leaves me with an array like
processed_items = [1, 2, <awaitable>, 3, <awaitable>, ...]
which I can then await the result of.
Is this the right approach? And if so, what's this design pattern called? Any first steps?

Just execute your code above in an asynchronous function - in a way that each item is processed in a separate task, and wrap your "fetch_item_from" function in an async function that uses an asyncio.Semaphore to limit the number of parallel requests to one you find optimal - be it 7, 10, 50 or 100.
If the rest of your processing is just CPU intensive you won't need any other async features there.
Actually, if your `fetch_item_from" is not async itself, you can simply do "run_in_executor" - and the nature of the process.future.Executor itself will limit the amount of concurrent requests, without the need to use a Semaphore are all.
import asyncio
MAXREQUESTS = 20
# Use this part if your original `fetch_item_from` is synchronous:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(MAXREQUESTS)
async def fetch_item_from_with_executor(url):
asyncio.get_running_loop()
# This is automatically limited to the number of workers in the executor
return await asyncio.run_in_executor(executor, fetch_item_from, url)
# Use this part if fetch_item_from is asynchronous itself
semaphore = asyncio.Semaphore(MAXREQUESTS)
async def fetch_item_from_async(url):
with semaphore:
return await fetch_item_from(url)
# common code:
async def process_item(item):
if url := item.get('location'):
item = await fetch_item_from_executor(url) # / fetch_item_from_async
return process(item)
async def main(list_):
pending_list = [asyncio.create_task(item) for item in list_]
processed_items = []
while pending_list:
# The timeout=10 bellow is optional, and will return the control
# here with the already completed tasks each 10 seconds:
# this way you can print some progress indicator to see how
# things are going - or even improve the code so that
# finished tasks are yielded earlier to be consumed by the callers of "main"
# in parallel
# if the timeout argument is omitted, all items are processed in a single batch.
done, pending_list = await asyncio.wait(pending_list, timeout=10)
processed_items.extend(done) # the filter builtin will add just
# retrieve the results from each task and filter out the falsy (None?) ones:
return [result for item in processed_items if (result:=item.result())]
list_= ...
processed_items = asyncio.run(main(list_))
(missing above is any error handling - if either fetch_item_from or process can raise any exception, you have to unfold the list-comprehension which calls .result() in each task blindly to
separate the ones that raised from the ones that completed sucessfully)

Related

How handle concurrent requests with custom priority in Python?

I have a high QPS of requests.
I can only handle ONE request at a time. So the pending request needs to be stored in a local Array/Queue.
When there is contention, I do not want FCFS (first-come-first-serve). Instead I want to process the request based own some custom logic.
Pseudocode like this
def webApiHandler(request):
future = submit(request)
response = wait(future) # Wait time is depending on its priority
return response
What primitives I can use to implement this? Event loop? Asyncio? Threads?
------------edit------------
This is synchronous API call, everything should be handled locally and response ASAP once the computation is done. I do not plan to use job queue like celery.
With your requirements (but why do you not want concurrency ?) what you might want to use is literraly a priority queue, which is a queue with ... priority : info of implementation here , and you can use it in python with the queue module (doc here)
it is sorted by priority, so higher priority are at the end of the queue.
Your implementation will then decide how to value the request and set the priority on this particular request. But two identic priority will be treated as in a queue.
Then, you write a consumer in another thread that will pop the first (or last depending on what you consider top priority) item in the queue.
What you might want to look at, and enable concurrency and extra features is celery, which is a distributed task queue framework. (it allows for queue, priority, and also can be run with a any number of worker (any=1 in your case, but are you really into non-concurrency for high number of request ?).
Example:
import asyncio
import threading
from typing import Dict
# Local queue for pending compute (watchout, may need to replace dict for thread-safe ⚠️)
pending_computes: Dict[int, list[threading.Event]] = {}
# The queue manager to pick a pending compute
async def poll_next():
# 1. A computed example after updating model priority
priorities = [2, 1, 3] # TODO: handle the empty array.
# 2. Find the next compute event reference
next = pending_computes.get(priorities[0]) # TODO: handle empty dict.
# 3. Kick off the next compute
next.set()
# The FastAPI async handler
async def cloud_compute_api(model_id: int, intput: bytes):
# 1. Enqueue current compute as an event.
compute_event = threading.Event()
pending_computes.get(model_id, []).append(compute_event)
# 2. Poll the next pending compute based on.
asyncio.create_task(poll_next)
# 3. Wait until its own compute is set to GO. Wait up to 10 seconds.
compute_event.wait(10)
# 4. compute starts 🚀🚀🚀
res = compute(model_id, intput)
asyncio.create_task(poll_next) # Poll next pending compute, if any
return res

Asyncify string joining in Python

I have the following code snippet which I want to transform into asynchronous code (data tends to be a large Iterable):
transformed_data = (do_some_transformation(d) for d in data)
stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)
I managed to rewrite the do_some_transformation-function to be async so I can do the following:
transformed_data = (await do_some_transformation(d) for d in data)
async_generator = (json.dumps(event, separators=(",", ":")) async for t in transformed_data)
stacked_jsons = ???
What's the best way to incrementally join the jsons produced by the async generator so that the joining process is also asynchronous?
This snippet is part of a larger I/O-bound-application which and has many asynchronous components and thus would profit from asynchifying everything.
The point of str.join is to transform an entire list at once.1 If items arrive incrementally, it can be advantageous to accumulate them one by one.
async def join(by: str, _items: 'AsyncIterable[str]') -> str:
"""Asynchronously joins items with some string"""
result = ""
async for item in _items:
if result and by: # only add the separator between items
result += by
result += item
return result
The async for loop is sufficient to let the async iterable suspend between items so that other tasks may run. The primary advantage of this approach is that even for very many items, this never stalls the event loop for longer than adding the next item.
This utility can directly digest the async generator:
stacked_jsons = join("\n\n", (json.dumps(event, separators=(",", ":")) async for t in transformed_data))
When it is know that the data is small enough that str.join runs in adequate time, one can directly convert the data to a list instead and use str.join:
stacked_jsons = "\n\n".join([json.dumps(event, separators=(",", ":")) async for t in transformed_data])
The [... async for ...] construct is an asynchronous list comprehension. This internally works asynchronously to iterate, but produces a regular list once all items are fetched – only this resulting list is passed to str.join and can be processed synchronously.
1 Even when joining an iterable, str.join will internally turn it into a list first.
More in depth explanation about my comment:
Asyncio is a great tool if your processor has a lot of waiting to do.
For example: when you make request to a db over the network, after the request is sent your cpu just does nothing until it gets an answer.
Using the async await syntax you can have your processor execute other tasks while "waiting" for the current one to finish. this does not mean it runs them in parallel. There is only one task running at a time.
In your case (for what i can see) the cpu never waits for something it is constantly running string operations.
if you want to run these operations in parallel you might want to take a look at ProcesPools.
This is not bound by a single process and core but will spread the processing over several cores to run it in parallel.
from concurrent.futures import ProcessPoolExecutor
def main():
with ProcessPoolExecutor() as executor:
transformed_data = executor.map(do_some_transformation, data) #returns an iterable
stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)
if __name__ == '__main__':
main()
I hope the provided code can help you.
ps.
The if __name__ part is required
edit: i saw your comment about 10k dicts, assume you have 8 cores (ignore multithreading) then each process will only transform 1250 dicts, instead of the 10k your main thread does now. These processes run simultaniously and although the performance increase is not linear it should process them a lot faster.
TL;DR: Consider using producer/consumer pattern, if do_some_transformation is IO bound, and you really want an incremental aggregation.
Of course, async itself only brings an advantage if you actually have any other proper async tasks to begin with.
As #MisterMiyagi said, if do_some_transformation is IO bound and time consuming, firing all transformation as a horde of async tasks can be a good idea.
Example code:
import asyncio
import json
data = ({"large": "data"},) * 3 # large
stacked_jsons = ""
async def transform(d: dict, q: asyncio.Queue) -> None:
# `do_some_transformation`: long IO bound task
await asyncio.sleep(1)
await q.put(d)
# WARNING: incremental concatination of string would be slow,
# since string is immutable.
async def join(q: asyncio.Queue):
global stacked_jsons
while True:
d = await q.get()
stacked_jsons += json.dumps(d, separators=(",", ":")) + "\n\n"
q.task_done()
async def main():
q = asyncio.Queue()
producers = [asyncio.create_task(transform(d, q)) for d in data]
consumer = asyncio.create_task(join(q))
await asyncio.gather(*producers)
await q.join() # Implicitly awaits consumers, too
consumer.cancel()
print(stacked_jsons)
if __name__ == "__main__":
import time
s = time.perf_counter()
asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"{__file__} executed in {elapsed:0.2f} seconds.")
So that do_some_transformation don't block each other. Output:
$ python test.py
{"large":"data"}
{"large":"data"}
{"large":"data"}
test.py executed in 1.00 seconds.
Besides, I don't think incremental concatenation of string is a good idea, since string is immutable and a lot of memory would be wasted ;)
Reference: Async IO in Python: A Complete Walkthrough - Real Python

How to retrieve a task from a future in python?

Let say I have the following code to run multiple task in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=connections) as executor:
loop = asyncio.get_event_loop()
futures = [
loop.run_in_executor(
executor,
fun,
arg
)
for i in range(connections)
]
for result in await asyncio.gather(*futures):
# I want to access the futures task here
pass
Is it possible to read the the futures' task once it has been executed?
Is it possible to read the the futures' task once it has been executed?
In asyncio, the word task has a specialized meaning, referring to a class that subclasses Future specialized for driving coroutines.
In your code, asyncio.gather() returns results, and you also have the futures variable that contains the Future objects which can also be used to access the same results. If you need to access additional information (like the original fun or arg), you can attach it to the appropriate Future or use a dict to map it. For example:
futures = []
for conn in connections:
fut = loop.run_in_executor(executor, fun, arg)
fut.conn = conn # or other info you need
await asyncio.wait(futures)
# at this point all the futures are done, and you can use future.result()
# to access the result of an individual future, and future.conn to obtain
# the connection the future was created for

Is asyncio.wait order guaranteed?

I'm trying to implement fair queuing in my library that is based on asyncio.
In some function, I have a statement like (assume socketX are tasks):
done, pending = asyncio.wait(
[socket1, socket2, socket3],
return_when=asyncio.FIRST_COMPLETED,
)
Now I read the documentation for asyncio.wait many times but it does not contain the information I'm after. Mainly, I'd like to know if:
socket1, socket2 and socket3 happened to be already ready when I issue the call. Is it guaranteed that done will contain them all or could it be that it returns only one (or two) ?
In the second case, does the order of the tasks passed to wait() matter ?
I'm trying to assert if I can just apply fair-queuing in the set of done tasks (by picking one and leaving the other tasks for later resolution) or if I also need to care about the order I pass the tasks in.
The documentation is kinda silent about this. Any idea ?
This is only taken according to the source code of Python 3.5.
If the future is done before calling wait, they will all be placed in the done set:
import asyncio
async def f(n):
return n
async def main():
(done, pending) = await asyncio.wait([f(1), f(2), f(3)], return_when=asyncio.FIRST_COMPLETED)
print(done) # prints set of 3 futures
print(pending) # prints empty set
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

When to use and when not to use Python 3.5 `await` ?

I'm getting the flow of using asyncio in Python 3.5 but I haven't seen a description of what things I should be awaiting and things I should not be or where it would be neglible. Do I just have to use my best judgement in terms of "this is an IO operation and thus should be awaited"?
By default all your code is synchronous. You can make it asynchronous defining functions with async def and "calling" these functions with await. A More correct question would be "When should I write asynchronous code instead of synchronous?". Answer is "When you can benefit from it". In cases when you work with I/O operations as you noted you will usually benefit:
# Synchronous way:
download(url1) # takes 5 sec.
download(url2) # takes 5 sec.
# Total time: 10 sec.
# Asynchronous way:
await asyncio.gather(
async_download(url1), # takes 5 sec.
async_download(url2) # takes 5 sec.
)
# Total time: only 5 sec. (+ little overhead for using asyncio)
Of course, if you created a function that uses asynchronous code, this function should be asynchronous too (should be defined as async def). But any asynchronous function can freely use synchronous code. It makes no sense to cast synchronous code to asynchronous without some reason:
# extract_links(url) should be async because it uses async func async_download() inside
async def extract_links(url):
# async_download() was created async to get benefit of I/O
html = await async_download(url)
# parse() doesn't work with I/O, there's no sense to make it async
links = parse(html)
return links
One very important thing is that any long synchronous operation (> 50 ms, for example, it's hard to say exactly) will freeze all your asynchronous operations for that time:
async def extract_links(url):
data = await download(url)
links = parse(data)
# if search_in_very_big_file() takes much time to process,
# all your running async funcs (somewhere else in code) will be frozen
# you need to avoid this situation
links_found = search_in_very_big_file(links)
You can avoid it calling long running synchronous functions in separate process (and awaiting for result):
executor = ProcessPoolExecutor(2)
async def extract_links(url):
data = await download(url)
links = parse(data)
# Now your main process can handle another async functions while separate process running
links_found = await loop.run_in_executor(executor, search_in_very_big_file, links)
One more example: when you need to use requests in asyncio. requests.get is just synchronous long running function, which you shouldn't call inside async code (again, to avoid freezing). But it's running long because of I/O, not because of long calculations. In that case, you can use ThreadPoolExecutor instead of ProcessPoolExecutor to avoid some multiprocessing overhead:
executor = ThreadPoolExecutor(2)
async def download(url):
response = await loop.run_in_executor(executor, requests.get, url)
return response.text
You do not have much freedom. If you need to call a function you need to find out if this is a usual function or a coroutine. You must use the await keyword if and only if the function you are calling is a coroutine.
If async functions are involved there should be an "event loop" which orchestrates these async functions. Strictly speaking it's not necessary, you can "manually" run the async method sending values to it, but probably you don't want to do it. The event loop keeps track of not-yet-finished coroutines and chooses the next one to continue running. asyncio module provides an implementation of event loop, but this is not the only possible implementation.
Consider these two lines of code:
x = get_x()
do_something_else()
and
x = await aget_x()
do_something_else()
Semantic is absolutely the same: call a method which produces some value, when the value is ready assign it to variable x and do something else. In both cases the do_something_else function will be called only after the previous line of code is finished. It doesn't even mean that before or after or during the execution of asynchronous aget_x method the control will be yielded to event loop.
Still there are some differences:
the second snippet can appear only inside another async function
aget_x function is not usual, but coroutine (that is either declared with async keyword or decorated as coroutine)
aget_x is able to "communicate" with the event loop: that is yield some objects to it. The event loop should be able to interpret these objects as requests to do some operations (f.e. to send a network request and wait for response, or just suspend this coroutine for n seconds). Usual get_x function is not able to communicate with event loop.

Categories