How to parallelize computation with asyncio? (Python)

I have a block of code which takes a long time to execute and is CPU-intensive. I want to run that block several times and use the full power of my CPU for that. Looking at asyncio I understood that it is mainly for asynchronous communication, but it is also a general tool for asynchronous tasks.
In the following example the time.sleep(y) is a placeholder for the code I want to run. In this example every co-routine is executed one after the other and the execution takes about 8 seconds.
import asyncio
import logging
import time

async def _do_compute_intense_stuff(x, y, logger):
    logger.info('Getting it started...')
    for i in range(x):
        time.sleep(y)
    logger.info('Almost done')
    return x * y

logging.basicConfig(format='[%(name)s, %(levelname)s]: %(message)s', level='INFO')
logger = logging.getLogger(__name__)
loop = asyncio.get_event_loop()
co_routines = [
    asyncio.ensure_future(_do_compute_intense_stuff(2, 1, logger.getChild(str(i))))
    for i in range(4)]
logger.info('Made the co-routines')
responses = loop.run_until_complete(asyncio.gather(*co_routines))
logger.info('Loop is done')
print(responses)
When I replace time.sleep(y) with asyncio.sleep(y) it returns nearly immediately. With await asyncio.sleep(y) it takes about 2 seconds.
Is there a way to parallelize my code using this approach or should I use multiprocessing or threading? Would I need to put the time.sleep(y) into a Thread?

Executors use multithreading to accomplish this (or multiprocessing, if you prefer). Asyncio is used to optimize code where you frequently wait on input/output operations, such as writing to files or loading websites.
However, for CPU-heavy operations that don't just rely on waiting for IO, it's recommended to use processes or threads, and in my opinion concurrent.futures provides a very nice wrapper for that, similar to asyncio's.
The reason asyncio.sleep makes your code finish faster is that the sleeps are waited on concurrently: while one coroutine sleeps, the event loop checks the other coroutines to see if they are ready. This doesn't scale to CPU-heavy operations, as there is no IO to wait for.
To change the following example from multiprocessing to multithreading, simply change ProcessPoolExecutor to ThreadPoolExecutor.
Here is a multiprocessing example:
import concurrent.futures
import time

def a(z):
    time.sleep(1)
    return z * 12

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(a, i) for i in range(5)}
        for future in concurrent.futures.as_completed(futures):
            data = future.result()
            print(data)
This is a simplified version of the example provided in the documentation for executors.
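If you want to keep the asyncio structure from the question, here is a minimal sketch combining the two, assuming Python 3.7+ for asyncio.run and asyncio.get_running_loop (compute_intense_stuff stands in for the question's CPU-bound placeholder work):
import asyncio
import concurrent.futures
import time

def compute_intense_stuff(x, y):
    # stand-in for the question's CPU-bound placeholder work
    for _ in range(x):
        time.sleep(y)
    return x * y

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        tasks = [loop.run_in_executor(executor, compute_intense_stuff, 2, 1)
                 for _ in range(4)]
        print(await asyncio.gather(*tasks))  # ~2 seconds total, not 8

if __name__ == '__main__':
    asyncio.run(main())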

Related

Using Threadpool in an Async method without run_in_executor of asyncio.get_event_loop()

Following is my code, which runs a long IO operation from an async method using the thread pool from the concurrent.futures package:
# io_bound/threaded.py
import concurrent.futures as futures
import requests
import threading
import time
import asyncio

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

def sleepy(n):
    time.sleep(n // 2)
    return n * 2

async def ExecuteSleep():
    l = len(data)
    results = []
    # Submit needs explicit mapping of I/p and O/p
    # Output may not be in the same Order
    with futures.ThreadPoolExecutor(max_workers=l) as executor:
        result_futures = {d: executor.submit(sleepy, d) for d in data}
        results = {d: result_futures[d].result() for d in data}
    return results

if __name__ == '__main__':
    print("Starting ...")
    t1 = time.time()
    result = asyncio.run(ExecuteSleep())
    print(result)
    print("Finished ...")
    t2 = time.time()
    print(t2 - t1)
Following is my question:
What could be the potential issue if I run the Threadpool directly without using the following asyncio apis:
loop = asyncio.get_event_loop()
loop.run_in_executor(...)
I have reviewed the docs and run simple test cases; to me this looks perfectly fine, and it will run the IO operation in the background using the custom thread pool, as listed here. I surely can't use pure async/await to receive the output and have to manage calls using the map or submit methods, but besides that I don't see a negative here.
Ideone link of my code https://ideone.com/lDVLFh
What could be the potential issue if I run the Threadpool directly
There is no issue if you just submit stuff to your thread pool and never interact with it or wait for results. But your code does wait for results¹.
The issue is that ExecuteSleep is blocking. Although it's defined as async def, it is async in name only because it doesn't await anything. While it runs, no other asyncio coroutines can run, so it defeats the main benefit of asyncio, which is running multiple coroutines concurrently.
¹ Even if you remove the call to `result()`, the `with` statement will wait for the workers to finish their jobs in order to be able to terminate them. If you wanted the sync functions to run completely in the background, you could make the pool global and not use `with` to manage it.
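If you do want ExecuteSleep to be genuinely non-blocking, here is a minimal sketch of the run_in_executor variant, reusing the question's sleepy and data (execute_sleep_async is a hypothetical name):
import asyncio
import concurrent.futures

async def execute_sleep_async(data):
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(data)) as executor:
        # run_in_executor returns awaitables, so this coroutine suspends
        # instead of blocking the event loop while the workers run
        futs = [loop.run_in_executor(executor, sleepy, d) for d in data]
        results = await asyncio.gather(*futs)
    return dict(zip(data, results))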

How to run two functions with different arguments in parallel in Python within a FastAPI method?

I have two functions which take different arguments as input, and I am looking for a way to run them in parallel at once. Both functions return something, and I want to get the results back as well:
def func1(arg1):
    return res1

def func2(arg2):
    return res2
I have tried using ray, but it's taking more time than executing sequentially:
import ray
ray.init()

# func1 and func2 are decorated with @ray.remote (definitions not shown)
a = func1.remote([1, 2, 3])
b = func2.remote([4, 5, 6])
func1_results, func2_results = ray.get([a, b])
I know we can use multiprocessing, but how can I store the returned results, and is this the correct method? Also, how can I pass arguments to each function separately?
I got this example from another answer:
from multiprocessing import Process
p1 = Process(target=method1) # create a process object p1
p1.start() # starts the process p1
p2 = Process(target=method2)
p2.start()
Note: using multiprocessing directly does not work inside a FastAPI method.
You can use threads for simplicity, but if you really must use multiprocessing you can share results with Queues: pass the queue through parameters, or initialize it outside the scope. (A thread-based sketch follows the example below.)
Following your example...
from multiprocessing import Process, Queue
import os

def func1(arg1, queue):
    queue.put(arg1)

def func2(arg2, queue):
    queue.put(arg2)

if __name__ == '__main__':
    queue = Queue()
    p1 = Process(target=func1, args=([1, 2, 3], queue))
    p2 = Process(target=func2, args=([4, 5, 6], queue))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    queue.put("Done")
    while True:
        msg = queue.get()
        print(msg)
        if msg == "Done":
            break
Output:
[4, 5, 6]
[1, 2, 3]
Done
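For comparison, here is a minimal thread-based sketch of the "use threads for simplicity" option mentioned above, assuming plain functions that just return their results, so no Queue plumbing is needed:
from concurrent.futures import ThreadPoolExecutor

def func1(arg1):
    return arg1

def func2(arg2):
    return arg2

with ThreadPoolExecutor() as executor:
    f1 = executor.submit(func1, [1, 2, 3])
    f2 = executor.submit(func2, [4, 5, 6])
    print(f1.result(), f2.result())  # results come back directly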
If you’re using an up-to-date Python version you can achieve what you want using asyncio, coroutines and tasks.
https://docs.python.org/3/library/asyncio-task.html
import asyncio
import time

async def func1(arg1):
    await asyncio.sleep(2)
    return "1 says " + str(arg1)

async def func2(arg1):
    await asyncio.sleep(3)
    return "2 says " + str(arg1)

async def main():
    task1 = asyncio.create_task(func1('hello'))
    task2 = asyncio.create_task(func2('world'))
    print(f"started at {time.strftime('%X')}")
    print('should take 3 secs, not 5.')
    done = await asyncio.gather(task1, task2)
    print(done)
    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
Sample output. Note that if func1 and func2 were cpu-bound then asyncio would not be able to schedule them so fairly, as mentioned in another answer.
$ python3 user_12_gather.py
started at 19:29:24
should take 3 secs, not 5.
['1 says hello', '2 says world']
finished at 19:29:27
The real answer is that it depends almost entirely on what func1 and func2 will be doing. There are many parallelism options in Python 3 and each is designed for a different use case.
If each function is reading a data stream and doing some relatively quick operation on each item then asyncio will work really well for that use case, because each function will spend most of its time waiting for io completion and asyncio will be able to switch the execution context between them very well and easily.
However, if the functions are CPU-bound (e.g. you're sending each one off to compute pi to 1000 digits) then asyncio won't work well, because it has no opportunity to switch between them. You can improve that by adding await asyncio.sleep(0) as a hint that this would be a good place to switch (sketched below), but fundamentally only one will run at a time.
For the cpu-bound use case, you need more than one execution context to achieve parallelism. The simplest option is threads in a single process, but the global interpreter lock in Python limits the achievable performance with that solution, as you noted.
The other option is to fork each work stream into a separate process and let the kernel handle the concurrency. There are many ways to implement that, ranging from simple fork, to the subprocess module, to the multiprocessing module. Each has its own merits in a trade off between complexity and functionality, but fundamentally they all fork separate processes and let the kernel handle the parallelism.
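To illustrate the asyncio.sleep(0) hint mentioned above, here is a minimal sketch (compute is a made-up stand-in for CPU work); the two coroutines interleave, but they still share a single core:
import asyncio

async def compute(n, name):
    total = 0
    for i in range(n):
        total += i  # stand-in for real CPU work
        if i % 100_000 == 0:
            await asyncio.sleep(0)  # hint: a good place to switch coroutines
    print(name, total)

async def main():
    # both coroutines make progress, but there is no CPU parallelism
    await asyncio.gather(compute(1_000_000, 'a'), compute(1_000_000, 'b'))

asyncio.run(main())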

How to use multiple threadpools in twisted with coroutines

I want to use a separate threadPool in Python Twisted for some special kind of work that is hard on CPU usage.
The functions I want to execute are all decorated with inlineCallbacks and I cannot modify that.
from twisted.python.threadpool import ThreadPool
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def myFunc(i):
    tmp_result = yield intenseCPUwork(i)
    # more work ...
    return result

pool = ThreadPool(0, 2)
for i in range(100):
    pool.callInThread(myFunc, i)
...
But since my functions return a Deferred, they return immediately when called like that in the thread pool, resulting in all 100 calls being executed at once, even though the thread pool has a size of just 2.
How can I ensure that only two of these async functions are running at the same time in Twisted?
The functions I want to execute are all decorated with inlineCallbacks and I cannot modify that.
This means you can't run them in a non-reactor thread. Except for a couple exceptions (mostly having to do with logging or thread management) Twisted APIs are not thread-safe. You can only use them in the same thread as the reactor is running.
If you cannot change the target functions to stop using Twisted APIs, you cannot run them in another thread.
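If the goal is only to cap how many of those functions run at once on the reactor thread, here is a hedged sketch using twisted.internet.defer.DeferredSemaphore (the myFunc stub stands in for the question's function; this limits concurrency but adds no CPU parallelism):
from twisted.internet import defer, task

@defer.inlineCallbacks
def myFunc(i):
    # stand-in for the question's function; yields an already-fired Deferred
    result = yield defer.succeed(i * 2)
    return result

def main(reactor):
    sem = defer.DeferredSemaphore(2)  # at most two in flight at once
    # sem.run acquires a token, calls myFunc(i), and releases the token
    # once the Deferred returned by myFunc fires
    work = [sem.run(myFunc, i) for i in range(100)]
    return defer.gatherResults(work)

task.react(main)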
I believe what you need is async/await support for the intense work; it would look something like this:
import asyncio

async def myFunc(i):
    tmp_result = await intenseCPUwork(i)  # await, not yield, inside an async def
    # more work ...
    return result

loop = asyncio.get_event_loop()
tasks = [
    loop.create_task(myFunc(1)),
    loop.create_task(myFunc(2)),
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

Multithreading inside Multiprocessing in Python

I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16GB RAM and an 8th-gen Intel i7 processor. I tried this on Python 3.7.2 and also on Python 3.8.2.
import concurrent.futures
import time
import numpy as np

# takes a list and multiplies each elem by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y

# multiplies an elem by 2
def double_single_value(x):
    return 2 * x

# define a
a = np.arange(100000000).reshape(100, 1000000)

# function to run multiple threads and multiply each elem by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds, using only multiprocessing. CPU utilization for this piece of code was 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time() - t)
The function below took more than 9 minutes and consumed all the RAM of the system, after which the system killed all the processes. CPU utilization during this piece of code did not reach 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time() - t)
I really want to understand:
1) Why is the code that first splits the work across processes and then runs multi-threading inside each process not faster than the code that uses only multiprocessing?
(I have gone through many posts that describe multiprocessing and multi-threading, and the crux I got is that multi-threading is for I/O processes and multiprocessing for CPU processes.)
2) Is there any better way of doing multi-threading inside multiprocessing for maximum utilization of the allotted cores (or CPUs)?
3) Why did that last piece of code consume all the RAM? Was it due to multi-threading?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different approaches to print 16,000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
For these notes, I modified the test so that it only makes 1,600 prints (replacing the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "multi-threading is for I/O processes and multiprocessing for CPU processes".
You need to figure out, if your program is IO-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random or all together at the same time usually makes things only worse.
Using threading in pure Python for CPU-bound problems is a bad approach, regardless of whether you combine it with multiprocessing. Try to redesign your app to use only multiprocessing, or use third-party libs such as Dask.
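For the question's workload, here is a minimal multiprocessing-only sketch of that redesign, assuming the goal is just doubling every element of a:
import concurrent.futures
import numpy as np

def double_row(row):
    return row * 2  # vectorized NumPy; no inner thread pool needed

if __name__ == '__main__':
    a = np.arange(100000000).reshape(100, 1000000)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        my_results = list(executor.map(double_row, a))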
I believe you figured it out, but I wanted to answer anyway. Obviously, your function double_single_value is CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads makes things worse than a single thread, because the GIL does not let you actually run on multiple threads, so you eventually run on a single thread anyway. Also, a thread may be switched out before finishing its task, and when it gets back its state has to be loaded onto the CPU again, which makes this even slower.
Based on your code, most of it deals with computations (calculations), so it's most encouraged to use multiprocessing to solve your problem, since it's CPU-bound and NOT I/O-bound (I/O-bound being things like sending requests to websites and waiting for a response from the server, or writing to and reading from disk). This is true for Python programming as far as I know. The Python GIL (Global Interpreter Lock) will make your code run slowly, as it is a mutex (a lock) that allows only one thread at a time to take control of the Python interpreter, meaning it won't achieve parallelism, only concurrency. It's very fine to use threading for I/O-bound tasks, where it will outcompete multiprocessing in execution time, but for your case I would encourage multiprocessing, because each Python process gets its own interpreter and memory space, so the GIL won't be a problem for you.
I am not so sure about integrating multithreading with multiprocessing, but what I know is that it can cause inconsistency in the processed results: you will need more boilerplate code for data synchronization if you want the processes to communicate (IPC), and threads are somewhat unpredictable (thus inconsistent at times) since they are scheduled by the OS and can be preempted at any time (pre-emptive scheduling of kernel-level threads due to time sharing). I don't want to stop you from writing that code, but be really sure of what you are doing. You never know, you might propose a solution to it one day.

What kind of problems (if any) would there be combining asyncio with multiprocessing?

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    # pool = multiprocessing.Pool()
    # out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item + 5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1, 2, 3, 4, 5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.ensure_future(example(queue, event, lock)),
        asyncio.ensure_future(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest to examine different approaches. For instance you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use less resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model for instance. It might end up easier and more robust depending on your use case. Specifically you can make your workers die after serving a configured number of requests and be respawned by a master process thereby eliminating much of the negative effects of memory leaks.
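A minimal pre-fork sketch of that worker model (Unix-only; the port and worker count are arbitrary assumptions, and a real master would respawn workers that exit):
import os
import socket

# one listening socket, several workers all calling accept() on it;
# the kernel hands each incoming connection to exactly one worker
sock = socket.socket()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', 8000))
sock.listen(5)

for _ in range(4):  # worker count is an arbitrary assumption
    if os.fork() == 0:
        while True:  # child: serve one connection at a time, forever
            conn, addr = sock.accept()
            conn.sendall(b'handled by pid %d\n' % os.getpid())
            conn.close()

os.wait()  # parent just waits; real code would respawn dead workers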
Based on @dano's answer above, I wrote this function to replace the places where I used to use a multiprocessing pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    by letting the caller drop in a replacement:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
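A hedged usage sketch, reusing the hypothetical analyze_day and list_of_days names from the docstring above:
results = asyncio_friendly_multiproc_map(analyze_day, list_of_days)  # runs analyze_day over list_of_days in 5 processes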
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that Executor is defined in concurrent.futures; I suggest you also have a look there.
