Python asyncio: as_completed in order
TL;DR
Is there a way to wait on multiple futures, and yield from them as they are completed in a given order?
Long story
Imagine you have two data sources. One gives you id -> name mapping, the other gives you id -> age mapping. You want to compute (name, age) -> number_of_ids_with_that_name_and_age.
There is too much data to just load it, but both data sources support paging/iterating and ordering by id.
So you write something like
def iterate_names():
    for page in get_name_page_numbers():
        yield from iterate_name_page(page)  # yields (id, name) pairs
and the same for age, and then you iterate over iterate_names() and iterate_ages().
What is wrong with that? What happens is:
you request one page of names and ages
you get them
you process the data until you reach the end of a page, let's say, ages
you request another page of ages
you process the data until ...
Basically, you are not waiting for any requests while you process data.
You could use asyncio.gather to send all requests and wait for all data, but then:
when the first page arrives, you still wait for the others
you run out of memory
There is asyncio.as_completed which allows you to send all requests and process pages as you get results, but you will get pages out of order, so you will not be able to do the processing.
Ideally, there would be a function that would make the first request, and as the response comes, make the second request and yield results from the first at the same moment.
Is that possible?
There are a lot of things going on in your question; I'll try to get to all of them.
Is there a way to wait on multiple futures, and yield from them as they are completed in a given order?
Yes. Your code can yield from or await any number of futures in sequence. If you are talking about Tasks specifically and you want those tasks to execute concurrently, they simply need to be assigned to the loop (which happens when you call asyncio.ensure_future() or loop.create_task()) and the loop needs to be running.
As for yielding from them in sequence, you can establish what that sequence is in the first place as you create the tasks. In a simple example where you have created all of the tasks/futures before you start to process their results, you could use a list to store the task futures and finally pull from the list:
loop = asyncio.get_event_loop()

tasks_im_waiting_for = []
for thing in things_to_get:
    task = loop.create_task(get_a_thing_coroutine(thing))
    tasks_im_waiting_for.append(task)

@asyncio.coroutine
def process_gotten_things(getter_tasks):
    for task in getter_tasks:
        result = yield from task
        print("We got {}".format(result))

loop.run_until_complete(process_gotten_things(tasks_im_waiting_for))
That example will only process one result at a time, but will still allow any of the scheduled getter tasks to continue doing their thing while it's waiting for the next one in the sequence to complete. If the processing order didn't matter as much and we wanted to process more than one potentially-ready result at a time, then we could use a deque instead of a list, with more than one process_gotten_things task .pop()ing the getter tasks from the deque. If we wanted to get even more advanced, we can do as Vincent suggests in a comment to your question and use an asyncio.Queue instead of a deque. With such a queue, you can have a producer adding tasks to the queue running concurrently with the task-processing consumers.
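As a minimal sketch of that producer/consumer shape with an asyncio.Queue (reusing get_a_thing_coroutine, things_to_get, and loop from the example above; everything else here is illustrative):

@asyncio.coroutine
def producer(queue):
    for thing in things_to_get:
        # Schedule the getter immediately, then hand its task to a consumer.
        yield from queue.put(loop.create_task(get_a_thing_coroutine(thing)))
    yield from queue.put(None)  # sentinel: no more work

@asyncio.coroutine
def consumer(queue):
    while True:
        task = yield from queue.get()
        if task is None:  # producer is done
            break
        result = yield from task
        print("We got {}".format(result))

queue = asyncio.Queue()
loop.run_until_complete(asyncio.gather(producer(queue), consumer(queue)))

With a single consumer, results are still processed in the order the getters were queued; adding more consumer tasks (and one sentinel per consumer) trades that ordering away for more concurrent processing.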
Using a deque or Queue for sequencing futures for processing has a disadvantage, though: you are only processing as many futures concurrently as you have running processor tasks. You could create a new processor task every single time you queued up a new future to be processed, but at that point the queue becomes a completely redundant data structure, because asyncio already gives you a queue-like object where everything added gets processed concurrently: the event loop. For every task we schedule, we can also schedule its processing. Revising the above example:
for thing in things_to_get:
    getter_task = loop.create_task(get_a_thing_coroutine(thing))
    processor_task = loop.create_task(process_gotten_thing(getter_task))
    # Tasks are futures; the processor can await the getter's result
    # as soon as it starts running.
Now let's say that our getter might return multiple things (kind of like your scenario) and each of those things needs some processing. That brings me to a different asyncio design pattern: sub-tasks. Your tasks can schedule other tasks on the event loop. As the event loop runs, the order of your first tasks will still be maintained, but if any one of them ends up waiting on something, there's a chance one of your sub-tasks will get started in the midst of things. Revising the above scenario, we might pass the loop to our coroutine so the coroutine can schedule the tasks that process its results:
for thing in things_to_get:
    task = loop.create_task(get_a_thing_coroutine(thing, loop))

@asyncio.coroutine
def get_a_thing_coroutine(thing, loop):
    results = yield from long_time_database_call(thing)
    subtasks = []
    for result in results:
        subtasks.append(loop.create_task(process_result(result)))
    # With subtasks scheduled in the order we like, wait for them
    # to finish before we consider THIS task complete.
    yield from asyncio.wait(subtasks)
All these advanced patterns start tasks in the order you want, but might finish processing them in any order. If you truly need to process the results in the exact same order that you started getting those results, then stick to a single processor pulling result futures from a sequence or yielding from an asyncio.Queue.
You'll also notice that to ensure tasks start in a predictable order, I explicitly schedule them with loop.create_task(). While asyncio.gather() and asyncio.wait() will happily take coroutine objects and schedule/wrap them as Tasks, they had problems scheduling them in a predictable order as of this writing. See asyncio issue #432.
OK, let's get back to your specific case. You have two separate sources of results, and those results need to be joined together by a common key, an id. The patterns I mentioned for getting things and processing those things don't account for such a problem, and I don't know the perfect pattern for it off the top of my head. I'll go through what I might do to attempt this though.
We need some objects to maintain the state of what we know and what we've done so far for the sake of correlating that knowledge as it grows.
# defaultdicts are great for representing knowledge that an interested
# party might want whether or not we have any knowledge to begin with:
from collections import defaultdict

# Let's start with a place to store our end goal:
name_and_age_to_id_count = defaultdict(int)

# Given we're correlating info from two sources, let's make two places to
# store that info, keyed by what we're joining on: id.
# When we correlate this info, only one side might be known, so use a
# Future on both sides to represent data we may or may not have yet.
id_to_age_future = defaultdict(loop.create_future)
id_to_name_future = defaultdict(loop.create_future)

# As soon as we learn the name or age for an id, we can begin processing
# the joint information, but because this information is coming from
# multiple sources that we want to process concurrently, we need to keep
# track of which ids we've started processing the joint info for.
ids_scheduled_for_processing = set()
We know we'll be getting this information in "pages" via the iterators you mentioned, so let's start there in designing our tasks:
@asyncio.coroutine
def process_name_page(page_number):
    subtasks = []
    for id, name in iterate_name_page(page_number):
        name_future = id_to_name_future[id]
        name_future.set_result(name)
        if id not in ids_scheduled_for_processing:
            age_future = id_to_age_future[id]
            task = loop.create_task(increment_name_age_pair(id, name_future, age_future))
            subtasks.append(task)
            ids_scheduled_for_processing.add(id)
    yield from asyncio.wait(subtasks)
@asyncio.coroutine
def process_age_page(page_number):
    subtasks = []
    for id, age in iterate_age_page(page_number):
        age_future = id_to_age_future[id]
        age_future.set_result(age)
        if id not in ids_scheduled_for_processing:
            name_future = id_to_name_future[id]
            task = loop.create_task(increment_name_age_pair(id, name_future, age_future))
            subtasks.append(task)
            ids_scheduled_for_processing.add(id)
    yield from asyncio.wait(subtasks)
Those coroutines schedule the name/age pair of an id to be processed—more specifically, the name and age futures for an id. Once started, the processor will await both futures' results (joining them, in a sense).
@asyncio.coroutine
def increment_name_age_pair(id, name_future, age_future):
    # This will wait until both futures are resolved and let other tasks
    # work in the meantime:
    pair = ((yield from name_future), (yield from age_future))
    name_and_age_to_id_count[pair] += 1
    # If memory is a concern:
    ids_scheduled_for_processing.discard(id)
    del id_to_age_future[id]
    del id_to_name_future[id]
OK, we've got tasks for getting/iterating the pages and subtasks for processing what's in those pages. Now we need to actually schedule the getting of those pages. Back to your problem: we've got two data sources we want to pull from, and we want to pull from them in parallel. We assume the order of information from one closely correlates to the order of information from the other, so we interleave the processing of both in the event loop.
from itertools import zip_longest

page_processing_tasks = []
# Interleave name and age pages:
for name_page_number, age_page_number in zip_longest(
    get_name_page_numbers(),
    get_age_page_numbers()
):
    # Explicitly schedule these as tasks in the order we want, because
    # gather and wait have non-deterministic scheduling order:
    if name_page_number is not None:
        page_processing_tasks.append(loop.create_task(process_name_page(name_page_number)))
    if age_page_number is not None:
        page_processing_tasks.append(loop.create_task(process_age_page(age_page_number)))
Now that we have scheduled the top level tasks, we can finally actually do the things:
loop.run_until_complete(asyncio.wait(page_processing_tasks))
print(name_and_age_to_id_count)
asyncio may not solve all of your parallel processing woes. You mentioned that "processing" each page takes forever. If it takes forever because it's awaiting responses from a server, then this architecture is a neat, lightweight approach to do what you need (just make sure the I/O is done with asyncio loop-aware tools).
If it takes forever because Python is crunching numbers or moving things around with CPU and memory, asyncio's single-threaded event loop doesn't help you much because only one Python operation is happening at a time. In this scenario, you may want to look into using loop.run_in_executor with a pool of Python interpreter processes if you'd like to stick with asyncio and the sub-task pattern. You could also develop a solution using the concurrent.futures library with a process pool instead of using asyncio.
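For instance, a rough sketch of the run_in_executor route, assuming a hypothetical CPU-bound crunch_numbers function:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch_numbers(data):
    # CPU-bound work; it runs in a separate process, so it doesn't
    # block the event loop.
    return sum(x * x for x in data)

@asyncio.coroutine
def process_in_executor(loop, executor, data):
    result = yield from loop.run_in_executor(executor, crunch_numbers, data)
    print("Crunched: {}".format(result))

loop = asyncio.get_event_loop()
executor = ProcessPoolExecutor()  # defaults to one worker per CPU core
loop.run_until_complete(process_in_executor(loop, executor, range(1000)))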
Note: The example generator you gave might be confusing to some because it uses yield from to delegate generation to an inner generator. It just so happens that asyncio coroutines use the same expression to await a future result and tell the loop it can run other coroutines' code if it wants.
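To make that distinction concrete, here's a minimal side-by-side sketch (the inner helpers are the hypothetical ones from the question):

# Plain generator: "yield from" delegates iteration to an inner generator.
def iterate_names():
    for page in get_name_page_numbers():
        yield from iterate_name_page(page)  # yields each (id, name) pair

# asyncio coroutine: the same expression suspends this coroutine until the
# awaited operation completes, letting the loop run other tasks meanwhile.
@asyncio.coroutine
def fetch_name_page(page_number):
    return (yield from async_request_name_page(page_number))  # hypothetical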
asyncio has no such functionality out of the box, but you can write a simple wrapper around as_completed that yields data in order.
It can be built with a small sliding-window buffer that holds newer completed results while an older result is not yet available.
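Here is a minimal sketch of that idea in the pre-3.5 coroutine style; it bounds how many requests are in flight and collects results in request order (completed-but-not-yet-needed results simply wait inside their Task objects, which act as the sliding-window buffer):

import asyncio
from collections import deque

@asyncio.coroutine
def gather_in_order(coro_factories, loop, window_size=3):
    # coro_factories: callables returning coroutines, so work only
    # starts once there is room in the window.
    results = []
    window = deque()
    for factory in coro_factories:
        window.append(loop.create_task(factory()))
        if len(window) == window_size:
            # Await the oldest task; newer ones keep running meanwhile.
            results.append((yield from window.popleft()))
    while window:  # drain whatever is left, still in order
        results.append((yield from window.popleft()))
    return results

On Python 3.6+ the same shape can be written as an async generator that yields each result as soon as the oldest task completes, instead of returning one list at the end.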
Related
How to process a list of tasks while limiting the number of threads that are started simultaneously by first checking if a session is available
I am currently working on a test system that uses selenium grid for WhatsApp automation. WhatsApp requires a QR code scan to log in, but once the code has been scanned, the session persists as long as the cookies remain saved in the browser's user data directory. I would like to run a series of tests concurrently while making sure that every session is only used by one thread at any given time. I would also like to be able to add additional tests to the queue while tests are being run.

So far I have considered using the ThreadPoolExecutor context manager to limit the maximum number of workers to the maximum number of sessions. Something like this:

import queue
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

def make_queue(questions):
    q = queue.Queue()
    for question in questions:
        q.put(question)
    return q

def test_conversation(q):
    item = q.get()
    # WhatsApp test happens here
    q.task_done()

def run_tests(questions):
    q = make_queue(questions)
    with ThreadPoolExecutor(max_workers=number_of_sessions) as executor:
        test_results = []
        while not q.empty():
            test_results.append(executor.submit(test_conversation, q))
        for f in concurrent.futures.as_completed(test_results):
            pass  # save results somewhere

It does not include a way to make sure that every thread gets its own session, though, and as far as I know I can only send one parameter to the function that the executor calls. I could make some complicated checkout system that works like borrowing books from a library, so that every session can only be checked out once at any given time, but I'm not confident I could make something that is thread safe and works in all cases, even the ones I can't think of until they happen. I am also not sure how I would keep the thing going while adding items to the queue without it locking up my entire application. Would I have to run run_tests() in its own thread? Is there an established way to do this? Any help would be much appreciated.
Python: Multiprocessing Recipe for Queue(s) with Many Consumers
Since I have been wasting a lot of time trying to join() workers in Python multiprocessing architectures that get their tasks from a multiprocessing.Queue, which is at the same time fed by a feeder function through put(): can someone contribute a short but robust recipe for this kind of architecture? Let's say up to 10 feeders and up to 100 workers. Queue items might be large.

I can only guess that in the past I have often had the queue busy and not responding anymore, but mostly I had the task done, only the jobs never joined, or whether they joined or not seemed to depend on arbitrary parameters.

So imagine the following workflow:

One or more feeder() jobs read from input (i.e. disk) and create tasks for the first line of workers. They put the tasks into a queue Q. When there are no more tasks to create, the feeders should join.

Many workers take the tasks out of Q and process them, then put the results into another queue R. When these workers fail to get more jobs from Q and are done with the last task, they should join. Consider that they might see an exception due to an invalid input/task; they should still join.

Another line of workers always takes two results from R, merges them, and puts the result back into R. When there is only one result left in R, all workers of the second line should join.

Finally I can take the last remaining aggregated result out of R and be happy.

I think this setup should include enough aspects to allow for generalization to many different tasks. To avoid blocking of the queue(s), I have already tried out the following ideas:

Use multiple queues, always putting to the smallest queue and getting from a random queue.

Do not block in get(); instead, try getting without waiting, and if that fails, sleep for a short time and retry, keeping track of the number of tries; after a while, give up, assuming the queue is empty.

I can add some code, but maybe this would lead to bug fixing instead of sharing best practices.
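For what it's worth, the usual trick that makes such workers join reliably is a sentinel per consumer, plus draining the result queue before joining (a process that has put large items on a multiprocessing.Queue cannot exit until they are consumed). A minimal sketch of that idea for one feeder and one line of workers; read_tasks_from_disk() and process() are hypothetical:

import multiprocessing as mp

SENTINEL = None  # tasks themselves must never be None

def feeder(Q, n_workers):
    for task in read_tasks_from_disk():  # hypothetical input source
        Q.put(task)
    for _ in range(n_workers):           # one sentinel per worker
        Q.put(SENTINEL)

def worker(Q, R):
    while True:
        task = Q.get()
        if task is SENTINEL:
            R.put(SENTINEL)              # tell the consumer this worker is done
            break
        try:
            R.put(process(task))         # hypothetical processing
        except Exception:
            pass                         # a bad task shouldn't prevent joining

if __name__ == "__main__":
    Q, R = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(Q, R)) for _ in range(4)]
    for w in workers:
        w.start()
    feeder(Q, len(workers))
    results, done = [], 0
    while done < len(workers):           # drain R *before* joining, so the
        item = R.get()                   # workers aren't stuck flushing it
        if item is SENTINEL:
            done += 1
        else:
            results.append(item)
    for w in workers:
        w.join()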
Can you add coroutine to front of event loop queue?
Is there a way to create a task but to have it specifically the next task run in the event loop? Suppose I have an event loop currently running several low priority coroutines. Perhaps a few high priority API request tasks come along and I want to immediately asynchronously make these requests and then yield control back to the tasks previously in the loop. I realize that the latency with a network request is orders of magnitude larger than a few CPU cycles saved by reordering the cooperative tasks in the loop, but nevertheless I am curious if there is a way to achieve this.
"I want to immediately asynchronously make these requests and then yield control back to the tasks previously in the loop."

There is no way to do that in the current asyncio, where all runnable tasks reside in a non-prioritized queue. But there is a deeper issue with the above requirement. Asynchronous tasks potentially yield control to the event loop at every blocking IO call, or more generally at every await. So "immediately" and "asynchronously" don't go together: a truly asynchronous operation cannot be immediate because it has to be suspendable, and when it is suspended, other tasks will proceed. If you really want something to happen immediately, you need to do it synchronously. Other tasks will be blocked anyway, because the synchronous operation will not allow them to run.

This is likely the reason why asyncio doesn't support task prioritization. By their very nature, tasks execute in short slices that can be interleaved in arbitrary ways, so the order in which they execute should not matter in general. In cases when the order does matter, one is expected to use the provided synchronization devices.
Is this multi-threaded function asynchronous
I'm afraid I'm still a bit confused (despite checking other threads) whether:

all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous

My initial guess is no to both, and that proper asynchronous code should be able to run in one thread; however, it can be improved by adding threads. So I constructed this toy example:

from threading import Thread, current_thread
from queue import Queue
import time

def do_something_with_io_lag(in_work):
    out = in_work
    # Imagine we do some work that involves sending
    # something over the internet and processing the output
    # once it arrives
    time.sleep(0.5)  # simulate IO lag
    print("Hello, bee number: ",
          str(current_thread().name).replace("Thread-", ""))

class WorkerBee(Thread):
    def __init__(self, q):
        Thread.__init__(self)
        self.q = q

    def run(self):
        while True:
            # Get some work from the queue
            work_todo = self.q.get()
            # This function will simulate I/O lag
            do_something_with_io_lag(work_todo)
            # Remove task from the queue
            self.q.task_done()

if __name__ == '__main__':
    def time_me(nmbr):
        number_of_worker_bees = nmbr
        worktodo = ['some input for work'] * 50
        # Create a queue
        q = Queue()
        # Fill with work
        [q.put(onework) for onework in worktodo]
        # Launch worker threads
        for _ in range(number_of_worker_bees):
            t = WorkerBee(q)
            t.start()
        # Block until queue is empty
        q.join()

    # Run this code in serial mode (just one worker)
    %time time_me(nmbr=1)
    # Wall time: 25 s
    # Basically 50 requests * 0.5 seconds IO lag
    # For me everything gets processed by bee number: 59

    # Run this code using multi-tasking (launch 50 workers)
    %time time_me(nmbr=50)
    # Wall time: 507 ms
    # Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
    # Now everything gets processed by different bees

Is it asynchronous? To me this code does not seem asynchronous, because it is Figure 3 in my example diagram: the I/O call blocks the thread (although we don't feel it because the threads are blocked in parallel). However, if this is the case, I am confused why requests-futures is considered asynchronous, since it is a wrapper around ThreadPoolExecutor:

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    future_to_url = {executor.submit(load_url, url, 10): url
                     for url in get_urls()}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()

Can this run on just one thread? Especially when compared to asyncio, which means it can run single-threaded:

"There are only two ways to have a program on a single processor do 'more than one thing at a time.' Multi-threaded programming is the simplest and most popular way to do it, but there is another very different technique, that lets you have nearly all the advantages of multi-threading, without actually using multiple threads. It's really only practical if your program is largely I/O bound. If your program is processor bound, then pre-emptive scheduled threads are probably what you really need. Network servers are rarely processor bound, however."
First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction: an object that allows you to refer to a job's result (or exception, which is also a result) in your program after you have assigned the job, but before it is completed. It's similar to assigning a common function's result to some variable.

Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous", as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see it in the timing results. And you're right, your function due to sleep is blocking: it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you had one job with sleep and another without and ran multiple threads, the one without sleep would perform calculations while the other would sleep. When you use a single thread, the jobs are performed in a serial manner one after the other, so when one job sleeps the other jobs wait for it; actually they just don't exist until it's their turn.

All this is pretty much proven by your time tests. The thing that happened with print has to do with "thread safety": print uses standard output, which is a single shared resource. So when your multiple threads tried to print at the same time, the switching happened inside and you got your strange output. (This also shows the "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.

Asyncio: To better understand the purpose, note the "IO" part: it's not 'async computation' but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about the event loop and generators (coroutines). The event loop is the arbiter that governs the execution of coroutines (and their callbacks) that were registered to the loop. Coroutines are implemented as generators, i.e. functions that allow performing some actions iteratively, saving state at each iteration and 'returning', and on the next call continuing with the saved state. So basically the event loop is a while True: loop that calls all the coroutines/generators assigned to it, one after another, and they provide a result or no result on each such call; this provides the possibility for "asynchronicity". (A simplification, as there are scheduling mechanisms that optimize this behavior.)

The event loop in this situation can run in a single thread, and if the coroutines are non-blocking it will give you true "asynchronicity", but if they are blocking then it's basically linear execution. You can achieve the same thing with explicit multithreading, but threads are costly: they require memory to be assigned, switching them takes time, etc. On the other hand, the asyncio API allows you to abstract from the actual implementation and just consider your jobs to be performed asynchronously. The implementation may be different; it includes calling the OS API, and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is, it works well for IO due to lower-level mechanisms, hardware stuff. On the other hand, performing computation will require explicitly breaking the computation algorithm into pieces to use as asyncio coroutines, so a separate thread might be a better decision, as you can launch the whole computation as one there. (I'm not talking about algorithms that are special to parallel computing.) But the asyncio event loop might be explicitly set to use separate threads for coroutines, so this will be asyncio with multithreading.

Regarding your example: if you implement your function with sleep as an asyncio coroutine and schedule and run 50 of them single-threaded, you'll get a time similar to the first time test, i.e. around 25s, as it is blocking. If you change it to something like yield from asyncio.sleep(0.5) (which is a coroutine itself) and schedule and run 50 of them single-threaded, it will be called asynchronously. So while one coroutine sleeps, another will be started, and so on. The jobs will complete in a time similar to your second multithreaded test, i.e. close to 0.5s. If you add print here you'll get good output, as it will be used by a single thread in a serial manner, but the output might be in a different order than the order of coroutine assignment to the loop, as the coroutines could be run in a different order. If you use multiple threads, then the result will obviously be close to the last one anyway.

Simplification: the difference between multithreading and asyncio is in blocking/non-blocking, so basically blocking multithreading will somewhat come close to non-blocking asyncio, but there are a lot of differences:

Multithreading for computations (i.e. CPU-bound code)
Asyncio for input/output (i.e. I/O-bound code)

Regarding your original statement:

all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous

I hope that I was able to show that:

asynchronous code might be both single-threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"
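As a rough illustration of that sleep comparison, a sketch in the same pre-3.5 coroutine style (the timing comment is the expected outcome, not a measurement):

import asyncio

@asyncio.coroutine
def bee(number):
    # Non-blocking sleep: while this coroutine waits, the loop runs others.
    yield from asyncio.sleep(0.5)
    print("Hello, bee number:", number)

loop = asyncio.get_event_loop()
# 50 coroutines, one thread: total time close to 0.5s rather than 25s.
loop.run_until_complete(asyncio.wait([bee(i) for i in range(50)]))
loop.close()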
I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:

Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.

Sleep. Each bee does so independently of all others, i.e. the sleep duration runs on all, otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.

Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.

Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.
using multiple threads in Python
I'm trying to solve a problem where I have many (on the order of ten thousand) URLs and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it takes is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?

I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that); I need advice on which approach to take. My current approach:

data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output

There's no other shared state. I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue. Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you put every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool. Your output will be a list of the return values of your worker function for every item of work in your original array. All the cool multi-processing goodness will happen in the background. There are, of course, other ways of working with the worker pool as well, but this is my favourite one. Happy multi-processing!
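A minimal sketch of that approach, assuming a simple urllib-based worker (the URL list and pool size are placeholders):

from multiprocessing import Pool
from urllib.request import urlopen

def get_URL_data(url):
    # Runs in a worker process; each process downloads independently.
    with urlopen(url) as response:
        return response.read()

if __name__ == "__main__":
    urls = ["http://example.com/page{}".format(i) for i in range(100)]
    with Pool(processes=8) as pool:
        # map() blocks until all work is done and preserves input order.
        output = pool.map(get_URL_data, urls)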
The best approach I can think of in your use case is to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work, and then go get some more work. This way you can finely control the number of threads working on your URLs. So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded. Create a thread pool, which creates the number of threads you specify, fetches work from the WorkQueue, and assigns it to a thread. Each time a thread finishes and returns, you check whether the work queue has more work and accordingly assign work to that thread again. You may also want to put in a hook so that every time work is added to the work queue, it is assigned to a free thread if one is available. A sketch of this pattern follows below.
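Here's a minimal sketch of that pattern using only the standard library (the worker count, queue names, and urllib call are illustrative):

import queue
import threading
from urllib.request import urlopen

work_queue = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        url = work_queue.get()
        if url is None:  # sentinel: no more work for this thread
            break
        try:
            with urlopen(url) as response:
                results.put((url, response.read()))
        except Exception:
            pass  # a failed download shouldn't kill the worker
        finally:
            work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for url in urls_to_download:  # assumed list of URLs
    work_queue.put(url)
for _ in threads:
    work_queue.put(None)  # one sentinel per thread
for t in threads:
    t.join()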
The fastest and most efficient way to do IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one. However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.