Difference between AsyncResult and callback with error_callback in Python pools - python

I am using Pools to kick off worker processes in python3.6. The workers will return True or False after completion, and I was wondering what the difference is between using the AsyncResult returned object or using a callback function to check if the worker returned True or False. From my understanding the callback is called in the main process, the same place I would do the checking anyway.
# Using the AsyncResult way
from multiprocessing import Pool

def check_result(result):
    if result:
        pass  # Successful, do something
    else:
        pass  # Failed

with Pool() as pool:
    result = pool.apply_async(upload, (args,))
    check_result(result.get())
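# Using callbacks
def check_result(result):
    if result:
        pass  # Successful, do something

def err_result(result):
    pass  # Do something

with Pool() as pool:
    pool.apply_async(upload, (args,), callback=check_result, error_callback=err_result)
    # wait for the task; leaving the with block would otherwise terminate the pool
    pool.close()
    pool.join()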
I see that in python3.6 they allow error_callback, so are these two bits of code equivalent? What are the pros and cons of both?
Thanks

The comparison between AsyncResult and callback is somewhat misleading.
Note that you only have callbacks available for asynchronous methods (returning AsyncResult objects), so there is no 'versus' in this story regarding these things.
When you write check_result(result.get()), you don't pass some AsyncResult-object into check_result, but an already awaited normal result, in your case a boolean value (if not an exception). So it's not a difference between AsyncResult and callback, but between manually calling check_result on a result or registering a callback beforehand.
I see that in python3.6 they allow error_callback, so are these two bits of code equivalent? What are the pros and cons of both?
No, these two snippets are not equivalent. error_callback is an exception handler: a False result won't trigger it, but an exception will.
In that case the result argument of err_result is filled with the exception instance. The difference from your first snippet is that an exception there will blow up in your face as soon as you call result.get(), unless you have enclosed the call in a try-except block.
The obvious 'pro' of an error_callback is that you can omit the try-except block; the 'pro' of the regular callback is likewise shorter code. Use both only for tasks that return immediately, such as checking and logging: callbacks run in the pool's result-handling thread in the parent process, so a slow callback delays the handling of every other result.
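To make the difference concrete, here is a minimal sketch. It uses multiprocessing.pool.ThreadPool, which has the same apply_async API as Pool, only so that it runs without pickling or start-method concerns; the upload function and the events list are hypothetical stand-ins, not part of the question's code.

```python
from multiprocessing.pool import ThreadPool  # same apply_async API as Pool

def upload(ok):
    """Hypothetical worker: returns a bool, or raises on an unexpected failure."""
    if ok is None:
        raise ValueError("upload crashed")
    return ok

events = []

def check_result(result):
    # Called for every normal return value, including False
    events.append(("callback", result))

def err_result(exc):
    # Called only when the worker raised; receives the exception instance
    events.append(("error_callback", type(exc).__name__))

with ThreadPool(2) as pool:
    handles = [
        pool.apply_async(upload, (value,),
                         callback=check_result, error_callback=err_result)
        for value in (True, False, None)
    ]
    pool.close()
    pool.join()  # all callbacks have run once join() returns

# Without callbacks, the same exception surfaces when get() is called
try:
    handles[2].get()
except ValueError as exc:
    events.append(("get", str(exc)))
```

The False result goes to check_result, not to err_result; only the raised ValueError reaches err_result, and the same exception is re-raised by get().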

Related

Correct way to parallelize work with asyncio

There are many posts on SO asking specific questions about asyncio, but I cannot grasp the right way on what to use for a given situation.
Let's say I want to parse and crawl a number of web pages in parallel. I can do this in at least 3 different ways with asyncio:
With pool.submit:
with ThreadPoolExecutor(max_workers=10) as pool:
    result_futures = list(map(lambda x: pool.submit(my_func, x), my_list))
    for future in as_completed(result_futures):
        results.append(future.result())
return results
With asyncio.gather:
loop = asyncio.get_running_loop()
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [loop.run_in_executor(pool, my_func, x) for x in my_list]
    results = await asyncio.gather(*futures)
With just pool.map:
with ThreadPoolExecutor(max_workers=10) as pool:
    results = [x for x in pool.map(my_func, arg_list)]
my_func is something like
async def my_func(arg):
    async with aiohttp.ClientSession() as session:
        async with session.post(...):
            ...
Could somebody help me understand what would be the differences between those 3 approaches? I understand that I can, for example, handle exceptions independently in the first one, but any other differences?
None of these. ThreadPoolExecutor and run_in_executor will both execute your code in another thread, regardless of whether you use the asyncio loop to watch their execution. And at that point you might just as well not use asyncio at all: the idea of async is precisely to run everything on a single thread, saving some CPU cycles and avoiding many of the race conditions that emerge in multi-threaded code.
If your my_func is using async correctly, all the way (it looks like it is, but the code is incomplete), you have to create an asyncio Task for each call to your "async defined" function. On that, maybe the shortest path is indeed using asyncio.gather:
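import asyncio
import aiohttp  # plus anything else used inside "my_func"

async def my_func(x):
    ...

my_list = ...

async def main():
    # asyncio.run needs a coroutine, so gather is awaited inside one
    return await asyncio.gather(*(my_func(x) for x in my_list))

results = asyncio.run(main())
And that is all there is to it.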
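A runnable variant of the gather approach, as a sketch only: asyncio.sleep stands in for the aiohttp request (an assumption, so that the example needs no network), and the input list is made up.

```python
import asyncio

async def my_func(x):
    # Stand-in for the HTTP call: asyncio.sleep yields control to the event loop
    await asyncio.sleep(0.01)
    return x * 2

async def main(my_list):
    # One coroutine per item; gather runs them concurrently on a single thread
    # and returns the results in input order
    return await asyncio.gather(*(my_func(x) for x in my_list))

results = asyncio.run(main([1, 2, 3]))
print(results)
```

All three calls wait at the same time, so the total run time is roughly one sleep rather than three.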
Now going back to your code, and checking the differences:
Your code runs almost by chance: you really just passed the async function and its parameters to the thread pool executor. An async function called this way returns immediately, with no work done. That means nothing (apart from the thin boilerplate code that creates the coroutines) is executed in your thread pool. The values returned by the calls that run in the worker threads (i.e. the actual my_func(x) calls) are the coroutines: these are the objects that are to be awaited in the main thread, and that will actually perform the network I/O. That is: your "my_func" is a "coroutine function", and when called it returns immediately with a "coroutine object". Only when the coroutine object is awaited is the code inside "my_func" actually executed.
Now, with that out of the way: in your first snippet you call future.result on the concurrent.futures Future, and that will just give you the coroutine object, so that code does not work. If you wrote results.append(await future.result()) then, yes, if there are no exceptions in the execution, it would work, but it would make all the calls in sequence: "await" suspends the current coroutine until the awaited object resolves, and since the awaits for the other results happen in this same code, they will queue up and be executed in order, with zero parallelism.
Your pool.map code does the same, and your asyncio.gather code is wrong in a different way: the loop.run_in_executor code will take your call and run it on another thread, and gives you an awaitable object which is suitable to be used with gather. However, awaiting on it will return you the "co-routine object", not the result of the HTTP call.
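The point about coroutine objects can be checked with a small sketch. The body of my_func here is a made-up stand-in for the real request.

```python
import asyncio
import inspect
from concurrent.futures import ThreadPoolExecutor

async def my_func(x):
    await asyncio.sleep(0)  # stand-in for the real network call
    return x * 2

with ThreadPoolExecutor(max_workers=2) as pool:
    # The worker thread only *calls* my_func, which creates a coroutine object
    returned = pool.submit(my_func, 21).result()

is_coroutine = inspect.iscoroutine(returned)

# The body of my_func runs only now, when the coroutine is driven by an event loop
value = asyncio.run(returned)
```

Here is_coroutine is True: the thread pool did none of the actual work.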
Your real options regarding getting the exceptions raised in the parallel code are either using asyncio.gather, asyncio.wait or asyncio.as_completed. Check the docs here: https://docs.python.org/3/library/asyncio-task.html
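As one way to collect exceptions with asyncio.gather, the sketch below passes return_exceptions=True so that a failing call is returned as a value instead of aborting the whole batch. The fetch coroutine and its failure condition are invented for illustration.

```python
import asyncio

async def fetch(n):
    """Hypothetical request; number 2 fails on purpose."""
    await asyncio.sleep(0.01)
    if n == 2:
        raise RuntimeError(f"request {n} failed")
    return n

async def main():
    # With return_exceptions=True, exceptions are placed in the result list
    return await asyncio.gather(*(fetch(n) for n in range(4)),
                                return_exceptions=True)

outcomes = asyncio.run(main())
failures = [o for o in outcomes if isinstance(o, Exception)]
```

Without return_exceptions=True, the first exception would instead propagate out of the await on gather.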

How do I make my ThreadPool work better with requests

I currently have this function, which makes an API call; each API call requests different data. I can do up to 300 concurrent API calls at a time.
Doing this does not seem to go fast. Since this is just waiting for the reply, I was wondering how I would make this function faster?
from multiprocessing.pool import ThreadPool
import requests

pool = ThreadPool(processes=500)
variables = VariableBaseDict
for item in variables:
    async_result = pool.apply_async(requests.get(url.json()))
    result = async_result.get()
    #do stuff with result
Your current code is not actually farming any real work off to a worker thread. You are calling requests.get(url.json()) right in the main thread, and then passing the object that returns to pool.apply_async. You should be doing pool.apply_async(requests.get, (url.json(),)) instead. That said, even if you corrected this problem, you are then immediately waiting for the reply to the call, which means you never actually run any calls concurrently. You farm one item off to a thread, wait for it to be done, then wait for the next item.
You need to:
1. Fix the issue where you're accidentally calling requests.get(...) in the main thread.
2. Either use pool.map to farm the list of work off to the worker threads concurrently, or continue using pool.apply_async, but instead of immediately calling async_result.get(), store all the async_result objects in a list, and once you've iterated over variables, iterate over the async_result list and call .get() on each item. That way you actually end up running all the calls concurrently.
So, if you used apply_async, you'd do something like this:
async_results = [pool.apply_async(requests.get, (build_url(item),)) for item in variables]
for ar in async_results:
    result = ar.get()
    # do stuff with result
With pool.map it would be:
results = pool.map(requests.get, [build_url(item) for item in variables])
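Both variants can be tried without a network connection. In this sketch, fake_get is a hypothetical stand-in for requests.get that just sleeps, and the URLs are made up.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_get(url):
    """Hypothetical stand-in for requests.get: sleeps instead of doing network I/O."""
    time.sleep(0.1)
    return f"response for {url}"

urls = [f"https://example.invalid/item/{i}" for i in range(5)]

with ThreadPool(processes=5) as pool:
    start = time.perf_counter()
    # Submit everything first ...
    async_results = [pool.apply_async(fake_get, (url,)) for url in urls]
    # ... and only then collect, so the waits overlap
    results = [ar.get() for ar in async_results]
    elapsed = time.perf_counter() - start

    # pool.map does the same submit-then-collect in a single call
    mapped = pool.map(fake_get, urls)

print(f"{len(results)} calls in {elapsed:.2f}s")
```

With five worker threads the elapsed time should be close to one sleep rather than five, because no .get() is called until every task has been submitted.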

Detect failed tasks in concurrent.futures

I've been using concurrent.futures as it has a simple interface and lets the user easily control the max number of threads/processes. However, it seems like concurrent.futures hides failed tasks and continues the main thread after all tasks have finished or failed.
import concurrent.futures

def f(i):
    return (i + 's')

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    fs = [executor.submit(f, i) for i in range(10)]
    concurrent.futures.wait(fs)
Calling f on any integer leads to a TypeError. However, the whole script runs just fine and exits with code 0. Is there any way to make it throw an exception/error when any thread fails?
Or, is there a better way to limit number of threads/processes without using concurrent.futures?
concurrent.futures.wait will ensure all the tasks completed, but it doesn't check success (something return-ed) vs. failure (exception raised and not caught in worker function). To do that, you need to call .result() on each Future (which will cause it to either re-raise the exception from the task, or produce the return-ed value). There are other methods to check without actually raising in the main thread (e.g. .exception()), but .result() is the most straightforward method.
If you want to make it re-raise, the simplest approach is just to replace the wait() call with:
for fut in concurrent.futures.as_completed(fs):
    fut.result()
which will process results as Futures complete, and promptly raise an Exception if one occurred. Alternatively, you can continue to use wait so all tasks finish before you check for exceptions on any of them, then iterate over fs directly and call .result() on each.
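A small sketch of both ways of checking, using the f from the question: .exception() inspects a finished Future without raising, while .result() re-raises in the calling thread.

```python
import concurrent.futures

def f(i):
    return i + "s"  # int + str raises TypeError

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    fs = [executor.submit(f, i) for i in range(3)]
    concurrent.futures.wait(fs)  # all tasks done, but nothing raised yet

# exception() returns the stored exception (or None on success) without raising
errors = [fut.exception() for fut in fs]

# result() re-raises the stored exception in the calling thread
try:
    fs[0].result()
    raised = None
except TypeError as exc:
    raised = exc
```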
There is another way to do the same with multiprocessing.Pool (for processes) or multiprocessing.pool.ThreadPool (for threads). As far as I know it rethrows any caught exceptions.
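The pool behaviour can be checked with a short sketch using multiprocessing.pool.ThreadPool: map re-raises a worker exception in the caller.

```python
from multiprocessing.pool import ThreadPool

def f(i):
    return i + "s"  # int + str raises TypeError

caught = None
with ThreadPool(processes=4) as pool:
    try:
        pool.map(f, range(10))  # blocks, then re-raises the worker's exception
    except TypeError as exc:
        caught = exc
```

Note that with apply_async the exception is only re-raised once .get() is called on the returned AsyncResult.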

Communication between threads in Python (without using Global Variables)

Let's say we have a main thread which launches two threads for test modules, "test_a" and "test_b".
Both test module threads maintain their state: whether they are done performing the test, whether they encountered an error or warning, or whether they want to update some other information.
How can the main thread get access to this information and act accordingly?
For example, if "test_a" raised an error flag, how will "main" know, and stop the rest of the tests before exiting with an error?
One way to do this is using global variables, but that gets very ugly very soon.
The obvious solution is to share some kind of mutable variable, by passing it in to the thread objects/functions at constructor/start.
The clean way to do this is to build a class with appropriate instance attributes. If you're using a threading.Thread subclass, instead of just a thread function, you can usually use the subclass itself as the place to stick those attributes. But I'll show it with a list just because it's shorter:
import threading

def test_a_func(thread_state):
    # ...
    thread_state[0] = my_error_state
    # ...

def main_thread():
    test_states = [None]
    test_a = threading.Thread(target=test_a_func, args=(test_states,))
    test_a.start()
You can (and usually want to) also pack a Lock or Condition into the mutable state object, so you can properly synchronize between main_thread and test_a.
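As a sketch of the class-based variant with a Lock, assuming a hypothetical TestThread subclass whose attribute names (state_lock, should_fail) are invented for illustration:

```python
import threading

class TestThread(threading.Thread):
    """Hypothetical Thread subclass that carries its own error state."""

    def __init__(self, name, should_fail):
        super().__init__(name=name)
        self.state_lock = threading.Lock()
        self.should_fail = should_fail
        self._error_state = None

    def run(self):
        # ... the real test work would go here ...
        if self.should_fail:
            with self.state_lock:
                self._error_state = f"{self.name} failed"

    @property
    def error(self):
        # Read under the same lock the worker uses for writing
        with self.state_lock:
            return self._error_state

threads = [TestThread("test_a", True), TestThread("test_b", False)]
for t in threads:
    t.start()
for t in threads:
    t.join()

errors = [t.error for t in threads if t.error]
```

The main thread reads each thread's state through the error property, so no globals are involved.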
(Another option is to use a queue.Queue, an os.pipe, etc. to pass information around, but you still need to get that queue or pipe to the child thread—which you do in the exact same way as above.)
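A minimal sketch of the queue option, assuming a hypothetical run_test worker: each thread reports exactly one status through a queue.Queue, and the main thread uses a threading.Event to ask the remaining tests to stop.

```python
import queue
import threading

def run_test(name, fail, reports, stop_event):
    """Hypothetical test body: reports exactly one status to the main thread."""
    for step in range(3):
        if stop_event.is_set():
            reports.put((name, "aborted"))
            return
        if fail and step == 1:
            reports.put((name, "error"))
            return
    reports.put((name, "ok"))

reports = queue.Queue()
stop_event = threading.Event()
workers = [
    threading.Thread(target=run_test, args=("test_a", True, reports, stop_event)),
    threading.Thread(target=run_test, args=("test_b", False, reports, stop_event)),
]
for w in workers:
    w.start()

statuses = {}
while len(statuses) < len(workers):
    name, status = reports.get()  # blocks until some thread reports
    statuses[name] = status
    if status == "error":
        stop_event.set()          # ask the remaining tests to stop early

for w in workers:
    w.join()
```

Whether test_b ends as "ok" or "aborted" depends on timing; the point is only that the queue and the event are passed in as arguments, exactly like the list above.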
However, it's worth considering whether you really need to do this. If you think of test_a and test_b as "jobs", rather than "thread functions", you can just execute those jobs on a pool, and let the pool handle passing results or errors back.
For example:
try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        tests = [executor.submit(job) for job in (test_a, test_b)]
        for test in concurrent.futures.as_completed(tests):
            result = test.result()
except Exception as e:
    pass  # do stuff
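Now, if the test_a function raises an exception, the main thread gets that exception from test.result(). Because that means leaving the with block, the executor is shut down: leaving the block calls shutdown(wait=True), so jobs that are already running or queued are still allowed to finish, and then the worker threads shut down. If the remaining jobs should be skipped instead, cancel the pending futures before leaving the block.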
If you're using 2.5-3.1, you don't have concurrent.futures built in, but you can install the backport off PyPI, or you can rewrite things around multiprocessing.dummy.Pool. (It's slightly more complicated that way, because you have to create a sequence of jobs and call map_async, which gives you back a single AsyncResult for the whole batch, but really that's still pretty simple.)

How to efficiently iterate over multiple generators?

I've got three different generators, which yield data from the web. Therefore, each iteration may take a while until it's done.
I want to mix the calls to the generators, and thought about roundrobin (Found here).
The problem is that every call is blocked until it's done.
Is there a way to loop through all the generators at the same time, without blocking?
You can do this with the iter() method on my ThreadPool class.
pool.iter() yields threaded function return values until all of the decorated+called functions finish executing. Decorate all of your async functions, call them, then loop through pool.iter() to catch the values as they happen.
Example:
import time
from threadpool import ThreadPool

pool = ThreadPool(max_threads=25, catch_returns=True)

# decorate any functions you need to aggregate
# if you're pulling a function from an outside source
# you can still say 'func = pool(func)' or 'pool(func)()'
@pool
def data(ID, start):
    for i in range(start, start+4):
        yield ID, i
        time.sleep(1)

# each of these calls will spawn a thread and return immediately
# make sure you do either pool.finish() or pool.iter()
# otherwise your program will exit before the threads finish
data("generator 1", 5)
data("generator 2", 10)
data("generator 3", 64)

for value in pool.iter():
    # this will print the generators' return values as they yield
    print(value)
In short, no: there's no good way to do this without threads.
Sometimes ORMs are augmented with some kind of peek function or callback that will signal when data is available. Otherwise, you'll need to spawn threads in order to do this. If threads are not an option, you might try switching out your database library for an asynchronous one.
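For a thread-based version that needs only the standard library, here is a sketch: each generator is drained in its own thread into a shared queue.Queue, and the consumer yields items as they arrive. The data generator, the interleave and drain helpers, and the sentinel are all made up for illustration; time.sleep plays the role of the slow web request.

```python
import queue
import threading
import time

def data(ident, start):
    """Stand-in for a slow generator; time.sleep plays the role of the web request."""
    for i in range(start, start + 3):
        time.sleep(0.01)
        yield ident, i

_DONE = object()  # sentinel marking an exhausted generator

def drain(gen, out):
    # Runs in a worker thread: push every item, then signal completion
    for item in gen:
        out.put(item)
    out.put(_DONE)

def interleave(*gens):
    """Yield items from all generators as soon as any of them produces one."""
    out = queue.Queue()
    for gen in gens:
        threading.Thread(target=drain, args=(gen, out), daemon=True).start()
    remaining = len(gens)
    while remaining:
        item = out.get()  # blocks only until *some* generator has data
        if item is _DONE:
            remaining -= 1
        else:
            yield item

values = list(interleave(data("generator 1", 5),
                         data("generator 2", 10),
                         data("generator 3", 64)))
```

The order across generators depends on timing, but the items of each individual generator arrive in their original order.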
