I'm reading in several thousand files at once, and for each file I need to perform some operations before yielding its rows. To increase performance I thought I could use asyncio to perform operations on files (and yield rows) while waiting for new files to be read in.
However from print statements I can see that all the files are opened and gathered, then each file is iterated over (same as would occur without asyncio).
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
import asyncio

async def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

async def async_generator():
    file_outputs = await asyncio.gather(*[open_files(file) for file in files])
    for file_output in file_outputs:
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

async def main():
    async for yield_value in async_generator():
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Output:
opening files
opening files
.
.
.
using open file
using open file
EDIT
Using the code supplied by @user4815162342, I noticed that, although it was 3x quicker, the set of rows yielded from the generator was slightly different than when done without concurrency. I'm unsure as yet whether this is because some yields were missed from each file, or because the files were somehow re-ordered. So I introduced the following changes to the code from user4815162342 and passed a lock into pool.submit():
I should have mentioned when first asking that the ordering of rows in each file, and of the files themselves, is required.
import concurrent.futures
import multiprocessing

def open_files(file, lock):
    with open(file) as file:
        # do stuff (the lock is acquired around the ordered work here)
        print('opening files')
        return x

def generator():
    m = multiprocessing.Manager()
    lock = m.Lock()
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file, lock) for file in files]
    for fut in concurrent.futures.as_completed(file_output_futures):
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
This way my non-concurrent and concurrent approaches yield the same values each time, however I have just lost all the speed gained from using concurrency.
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
There are two issues with your code. The first one is that asyncio.gather() by design waits for all the futures to complete in parallel, and only then returns their results. So the processing you do in the generator is not interspersed with the IO in open_files as was your intention, but only begins after all the calls to open_files have returned. To process async calls as they are done, you should be using something like asyncio.as_completed.
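For illustration, a minimal sketch of that pattern, assuming open_files were a genuinely awaitable coroutine (which, as explained below, it currently isn't), and reusing the files iterable from your code:

import asyncio

async def async_generator():
    # Schedule all reads up front, then consume results in completion order
    # (note: completion order is not submission order)
    tasks = [asyncio.ensure_future(open_files(file)) for file in files]
    for next_done in asyncio.as_completed(tasks):
        file_output = await next_done
        for row in file_output:
            yield row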
The second and more fundamental issue is that, unlike threads which can parallelize synchronous code, asyncio requires everything to be async from the ground up. It's not enough to add async to a function like open_files to make it async. You need to go through the code and replace any blocking calls, such as calls to IO, with equivalent async primitives. For example, connecting to a network port should be done with open_connection, and so on. If your async function doesn't await anything, as appears to be the case with open_files, it will execute exactly like a regular function and you won't get any benefits of asyncio.
Since you use IO on regular files, and operating systems don't expose a portable async interface for regular files, you are unlikely to profit from asyncio. There are libraries like aiofiles that use threads under the hood, but they are as likely to make your code slower as to speed it up, because their nice-looking async APIs involve a lot of internal thread synchronization. To speed up your code, you can use a classic thread pool, which Python exposes through the concurrent.futures module. For example (untested):
import concurrent.futures

def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file) for file in files]
    for fut in file_output_futures:
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
This code is supposed to control a servo from stdin
import asyncio
import sys
import threading
from multiprocessing import Process

async def connect_stdin_stdout():
    loop = asyncio.get_event_loop()
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    w_transport, w_protocol = await loop.connect_write_pipe(asyncio.streams.FlowControlMixin, sys.stdout)
    writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
    return reader, writer

servo0ang = 90

async def main():
    reader, writer = await connect_stdin_stdout()
    while True:
        res = await reader.read(100)
        if not res:
            break
        servo0ang = int(res)

# Main program logic follows:
def runAsync():
    asyncio.run(main())

def servoLoop():
    pwm = Servo()
    while True:
        pwm.setServoPwm('0', servo0ang)

if __name__ == "__main__":
    p = Process(target=servoLoop)
    p.start()
    runAsync()
    p.join()
When I run it, the async function starts but servoLoop doesn't.
It was supposed to turn the servo to the angle specified in stdin. I'm a bit rusty at Python.
The Servo class is from an example program that came with the robot I'm working with and it works there
So, as I said in a comment, you are not sharing servo0ang. You have two processes, and each of them has its own variables. They have the same names and the same initial values, as with a fork in other languages, because the new process starts as a copy of the main one. But they are just two different Python processes with almost nothing to do with each other (one is the parent of the other, so it can join it).
If you need to share data, you either have to send it through pipes connecting the two processes, or create shared memory that both processes can access. (It seems easy, and in Python it is quite easy. But it is also easy to end up with an inefficient polling scheme, like yours seems to be, with its infinite loop polling the value of servo0ang as fast as it can so as not to miss any change. In reality, it would very often be a better idea to wait on a pipe. But well, I won't discuss the principles of your project, just how to do what you want to do, not whether it is a good idea or not.)
In Python, the multiprocessing module has a Value class that creates memory that can be shared among processes on the same machine (with Manager you could even share values among processes on different machines, but that is slower).
from multiprocessing import Process, Value
import time  # I don't like infinite loops without sleep

v = Value('i', 90)  # Creates an integer, with initial value of 90, in shared memory
x = 90              # Just a normal integer, by comparison

print(v.value, x)  # read it
v.value = 80       # Modify it
x = 80

def f(v):
    global x
    while True:
        time.sleep(1)
        v.value = (v.value + 1) % 360
        x = (x + 1) % 360

p = Process(target=f, args=(v,))
p.start()

while True:
    print("New val", v.value, x)
    time.sleep(5)
As you see, the value in the main loop increases by approximately 5 at each iteration, because the process running f increased it by 1 five times in the meantime.
But x in that same loop doesn't change, because the only x that changes is the x of the process that runs f (the same global x, but a different process; it is as if you were running the same program twice in two different windows).
Now, applied to your code
import asyncio
import sys
import threading
import time
from multiprocessing import Process, Value

async def connect_stdin_stdout():
    loop = asyncio.get_event_loop()
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    w_transport, w_protocol = await loop.connect_write_pipe(asyncio.streams.FlowControlMixin, sys.stdout)
    writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
    return reader, writer

servo0ang = Value('i', 90)

async def main():
    reader, writer = await connect_stdin_stdout()
    while True:
        res = await reader.read(100)
        if not res:
            break
        servo0ang.value = int(res)

# Main program logic follows:
def runAsync():
    asyncio.run(main())

class Servo:
    def setServoPwm(self, s, ang):
        time.sleep(1)
        print(f'\033[31m{ang=}\033[m')

def servoLoop():
    pwm = Servo()
    while True:
        pwm.setServoPwm('0', servo0ang.value)

if __name__ == "__main__":
    p = Process(target=servoLoop)
    p.start()
    runAsync()
    p.join()
I used a dummy Servo class that just prints in red the servo0ang value.
Note that I've changed nothing in your code apart from that.
Which means that, no, asyncio.run was not blocking the other process. I still agree with the comments you got that it is never great to combine asyncio and processes. Here you have no other concurrent IO, so your async/await is roughly equivalent to a good old while True: servo0ang.value = int(input()). It is not as if your input could yield to something else; there is nothing else, at least not in this process (if your two processes were communicating through a pipe, that would be different).
But, however convoluted your code may be, it works, and asyncio.run is not blocking the other process. It is just that the other process was endlessly calling setServoPwm with the same constant 90 value, which could never change, since that process was doing nothing with the variable other than calling setServoPwm with it. It was doing nothing to grab a new value from the main process.
With the Value shared memory, there is still nothing to do on that side. But this time, since it is shared memory, it is not in vain to expect the value to change even though nothing changes it in that process.
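As an aside, since the answer mentions waiting on pipes as the usual alternative to polling: a minimal, hypothetical sketch of that pattern with multiprocessing.Pipe (the names and the None sentinel are my own, not from the question) could look like this:

from multiprocessing import Process, Pipe
import time

def servo_loop(conn):
    ang = 90
    while True:
        if conn.poll(1):              # wait up to 1s for a new angle instead of busy-polling
            ang = conn.recv()
            if ang is None:           # sentinel value: stop the loop
                break
        print('servo angle:', ang)    # stand-in for pwm.setServoPwm('0', ang)

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=servo_loop, args=(child_conn,))
    p.start()
    for a in (45, 120, None):         # push a couple of angles from the parent, then stop
        parent_conn.send(a)
        time.sleep(2)
    p.join()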
I have 200 pairs of paths to diff. I wrote a little function that will diff each pair and update a dictionary which itself is one of the arguments to the function. Assume MY_DIFFER is some diffing tool I am calling via subprocess under the hood.
async def do_diff(path1, path2, result):
    result[f"{path1} {path2}"] = MY_DIFFER(path1, path2)
As you can see I have nothing to return from this async function. I am just capturing the result in result.
I call this function in parallel elsewhere using asyncio like so:
path_tuples = [("/path11", "/path12"), ("/path21", "/path22"), ... ]
result = {}

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(do_diff(path1, path2, result) for path1, path2 in path_tuples)
    )
)
Questions:
I don't know where to put await in the do_diff function. But the code seems to work without it as well.
I am not sure if the diffs are really happening in parallel, because when I look at the output of ps -eaf in another terminal, I see only one instance of the underlying tool I am calling at a time.
The speed of execution is the same as when I was doing the diffs sequentially.
So I am clearly doing something wrong. How can I REALLY do the diffs in parallel?
PS: I am in Python 3.6
Remember that asyncio doesn't run things in parallel, it runs things concurrently, using a cooperative multitasking model -- which means that coroutines need to explicitly yield time to other coroutines for them to run. This is what the await command does; it says "go run some other coroutines while I'm waiting for something to finish".
If you're never awaiting on something, you're not getting concurrent execution.
What you want is for your do_diff method to be able to await on the execution of your external tool, but you can't do that with just the subprocess module. You can do that using the run_in_executor method, which arranges to run a synchronous command (e.g., subprocess.run) in a separate thread or process and wait asynchronously for the result. That might look something like:
async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    result[f"{path1} {path2}"] = await loop.run_in_executor(None, MY_DIFFER, path1, path2)
This will by default run MY_DIFFER in a separate thread, although you can utilize a separate process instead by passing an explicit executor as the first argument to run_in_executor.
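For instance, a rough sketch of that process-based variant, assuming MY_DIFFER is a picklable, module-level function (the pool object below is my own addition, not part of the question):

import asyncio
import concurrent.futures

# Hypothetical process pool; MY_DIFFER must be importable/picklable for this to work
process_pool = concurrent.futures.ProcessPoolExecutor()

async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    # Runs MY_DIFFER in a worker process instead of a thread
    result[f"{path1} {path2}"] = await loop.run_in_executor(process_pool, MY_DIFFER, path1, path2)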
Per my comment, solving this with concurrent.futures might look something like this:
import concurrent.futures
import time

# dummy function that just sleeps for 2 seconds
# replace this with your actual code
def do_diff(path1, path2):
    print(f"diffing path {path1} and {path2}")
    time.sleep(2)
    return path1, path2, "information about diff"

# create 200 path tuples for demonstration purposes
path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(200)]

futures = []
with concurrent.futures.ProcessPoolExecutor(max_workers=100) as executor:
    for path1, path2 in path_tuples:
        # submit the job to the executor
        futures.append(executor.submit(do_diff, path1, path2))

    # read the results
    for future in futures:
        print(future.result())
I have the following code snippet which I want to transform into asynchronous code (data tends to be a large Iterable):
transformed_data = (do_some_transformation(d) for d in data)
stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)
I managed to rewrite the do_some_transformation-function to be async so I can do the following:
transformed_data = (await do_some_transformation(d) for d in data)
async_generator = (json.dumps(t, separators=(",", ":")) async for t in transformed_data)
stacked_jsons = ???
What's the best way to incrementally join the jsons produced by the async generator so that the joining process is also asynchronous?
This snippet is part of a larger I/O-bound application which has many asynchronous components and thus would profit from asyncifying everything.
The point of str.join is to transform an entire list at once.[1] If items arrive incrementally, it can be advantageous to accumulate them one by one.
async def join(by: str, _items: 'AsyncIterable[str]') -> str:
    """Asynchronously joins items with some string"""
    result = ""
    async for item in _items:
        if result and by:  # only add the separator between items
            result += by
        result += item
    return result
The async for loop is sufficient to let the async iterable suspend between items so that other tasks may run. The primary advantage of this approach is that even for very many items, this never stalls the event loop for longer than adding the next item.
This utility can directly digest the async generator:
stacked_jsons = await join("\n\n", (json.dumps(t, separators=(",", ":")) async for t in transformed_data))
When it is known that the data is small enough for str.join to run in adequate time, one can instead convert the data to a list directly and use str.join:
stacked_jsons = "\n\n".join([json.dumps(t, separators=(",", ":")) async for t in transformed_data])
The [... async for ...] construct is an asynchronous list comprehension. This internally works asynchronously to iterate, but produces a regular list once all items are fetched – only this resulting list is passed to str.join and can be processed synchronously.
[1] Even when joining an iterable, str.join will internally turn it into a list first.
More in-depth explanation of my comment:
Asyncio is a great tool if your processor has a lot of waiting to do.
For example: when you make a request to a db over the network, after the request is sent your CPU just does nothing until it gets an answer.
Using the async/await syntax you can have your processor execute other tasks while "waiting" for the current one to finish. This does not mean it runs them in parallel; there is only one task running at a time.
In your case (from what I can see) the CPU never waits for anything; it is constantly running string operations.
If you want to run these operations in parallel you might want to take a look at process pools.
A process pool is not bound to a single process and core but will spread the processing over several cores to run it in parallel.
import json
from concurrent.futures import ProcessPoolExecutor

def main():
    with ProcessPoolExecutor() as executor:
        transformed_data = executor.map(do_some_transformation, data)  # returns an iterable
        stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)

if __name__ == '__main__':
    main()
I hope the provided code can help you.
ps.
The if __name__ part is required
edit: I saw your comment about 10k dicts; assuming you have 8 cores (ignoring multithreading), each process will only transform 1250 dicts, instead of the 10k your main thread does now. These processes run simultaneously, and although the performance increase is not linear, it should process them a lot faster.
TL;DR: Consider using producer/consumer pattern, if do_some_transformation is IO bound, and you really want an incremental aggregation.
Of course, async itself only brings an advantage if you actually have any other proper async tasks to begin with.
As @MisterMiyagi said, if do_some_transformation is IO bound and time consuming, firing all transformations as a horde of async tasks can be a good idea.
Example code:
import asyncio
import json

data = ({"large": "data"},) * 3  # large
stacked_jsons = ""

async def transform(d: dict, q: asyncio.Queue) -> None:
    # `do_some_transformation`: long IO bound task
    await asyncio.sleep(1)
    await q.put(d)

# WARNING: incremental concatenation of strings would be slow,
# since strings are immutable.
async def join(q: asyncio.Queue):
    global stacked_jsons
    while True:
        d = await q.get()
        stacked_jsons += json.dumps(d, separators=(",", ":")) + "\n\n"
        q.task_done()

async def main():
    q = asyncio.Queue()
    producers = [asyncio.create_task(transform(d, q)) for d in data]
    consumer = asyncio.create_task(join(q))
    await asyncio.gather(*producers)
    await q.join()  # Implicitly awaits consumers, too
    consumer.cancel()
    print(stacked_jsons)

if __name__ == "__main__":
    import time
    s = time.perf_counter()
    asyncio.run(main())
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")
This way the do_some_transformation calls don't block each other. Output:
$ python test.py
{"large":"data"}
{"large":"data"}
{"large":"data"}
test.py executed in 1.00 seconds.
Besides, I don't think incremental concatenation of strings is a good idea, since strings are immutable and a lot of memory would be wasted ;)
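For what it's worth, a small variation of the join consumer above that collects the serialized chunks in a list and joins them once at the end (the json_parts name and helper are my own sketch) might look like this:

import asyncio
import json

json_parts = []  # collected chunks; joined once at the end instead of repeated +=

async def join_parts(q: asyncio.Queue) -> None:
    while True:
        d = await q.get()
        json_parts.append(json.dumps(d, separators=(",", ":")))
        q.task_done()

# ...then, after `await q.join()` in main(), build the final string once:
# stacked_jsons = "\n\n".join(json_parts)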
Reference: Async IO in Python: A Complete Walkthrough - Real Python
Working on replacing my implementation of a server query tool that uses ThreadPoolExecutors with all-asynchronous calls using asyncio and aiohttp. Most of the transition is straightforward since network calls are non-blocking IO; it's the saving of the responses that has me in a conundrum.
All the examples I am using, even the docs for both libraries, use asyncio.gather() which collects all the awaitable results. In my case, these results can be files in the many GB range, and I don't want to store them in memory.
What's an appropriate way to solve this? Is it to use asyncio.as_completed() and then:
for f in as_completed(aws):
    earliest_result = await f
    # Assumes `loop` defined under `if __name__` block outside coroutine
    loop = get_event_loop()
    # Run the blocking IO in an executor and write to file
    _ = await loop.run_in_executor(None, save_result, earliest_result)
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default) thus making this an asynchronous, multi-threaded program vice an asynchronous, single-threaded program?
Further, does this ensure only 1 earliest_result is being written to file at any time? I don't want the call to await loop.run_in_executor(...) to be running, then another result comes in and I try to write to the same file; I could limit with a semaphore I suppose.
I'd suggest making use of the aiohttp Streaming API. Write your responses directly to disk instead of RAM, and return file names instead of the responses themselves from gather. Doing so won't use a lot of memory at all. This is a small demo of what I mean:
import asyncio

import aiofiles
from aiohttp import ClientSession

async def make_request(session, url):
    response = await session.request(method="GET", url=url)
    filename = url.split('/')[-1]
    async for data in response.content.iter_chunked(1024):
        async with aiofiles.open(filename, "ba") as f:
            await f.write(data)
    return filename

async def main():
    urls = ['https://github.com/Tinche/aiofiles',
            'https://github.com/aio-libs/aiohttp']
    async with ClientSession() as session:
        coros = [make_request(session, url) for url in urls]
        result_files = await asyncio.gather(*coros)
        print(result_files)

asyncio.run(main())
Very clever way of using the asyncio.gather method by @merrydeath.
I tweaked the helper function like below and got a big performance boost:
response = await session.get(url)
filename = url.split('/')[-1]
async with aiofiles.open(filename, "ba") as f:
    await f.write(await response.read())
Results may differ depending on the download connection speed.
In my case, these results can be files in the many GB range, and I don't want to store them in memory.
If I'm correct and in your code a single item of aws means the downloading of a single file, you may face the following problem: while as_completed lets you move data from RAM to HDD as soon as possible, all your aws running in parallel each keep their data (a buffer with the partly downloaded file) in RAM simultaneously.
To avoid this you'll need to use a semaphore to ensure that not too many files are downloading in parallel in the first place, thus preventing RAM overuse.
Here's an example of using a semaphore.
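For context, a minimal sketch of that idea, wrapping a download coroutine like the make_request shown earlier in an asyncio.Semaphore (the limit of 4 and the wrapper name are my own choices):

import asyncio

async def make_request_limited(sem, session, url):
    # Waits here if the limit is reached, so only a few downloads buffer in RAM at once
    async with sem:
        return await make_request(session, url)

# inside your top-level coroutine, before building the coroutine list:
# sem = asyncio.Semaphore(4)   # arbitrary limit of 4 downloads in flight
# coros = [make_request_limited(sem, session, url) for url in urls]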
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default) thus making this an asynchronous, multi-threaded program vice an asynchronous, single-threaded program?
I'm not sure I understand your question, but yes, your code will use threads; however, only save_result will be executed inside those threads. All other code still runs in the single main thread. Nothing bad here.
Further, does this ensure only 1 earliest_result is being written to file at any time?
Yes, it does[*]. To be precise, the await keyword at the last line of your snippet ensures it:
_ = await loop.run_in_executor(None, save_result, earliest_result)
You can read it as: "Start executing run_in_executor asynchronously and suspend execution flow at this line until run_in_executor is done and returns its result".
[*] Yes, as long as you don't run multiple for f in as_completed(aws) loops in parallel in the first place.
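If you did run several such loops concurrently, one hedged way to keep the file writes serialized would be an asyncio.Lock around the executor call (the lock and helper below are my own sketch, not from the question):

import asyncio

async def save_serialized(lock, loop, earliest_result):
    # Only one save_result call runs at a time, even across multiple loops
    async with lock:
        await loop.run_in_executor(None, save_result, earliest_result)

# e.g. create `lock = asyncio.Lock()` once inside your top-level coroutine
# and pass it to every loop that calls save_serialized(...)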
How can I limit the number of concurrent threads in Python?
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
Here is what I have so far:
def process_file(fname):
    # open file and do something

def process_file_thread(queue, fname):
    queue.put(process_file(fname))

def process_all_files(d):
    files = glob.glob(d + '/*')
    q = Queue.Queue()
    for fname in files:
        t = threading.Thread(target=process_file_thread, args=(q, fname))
        t.start()
    q.join()

def main():
    process_all_files('.')
    # Do something after all files have been processed
How can I modify the code so that only 4 threads are run at a time?
Note that I want to wait for all files to be processed and then continue and work on the processed files.
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
That's exactly what a thread pool does: You create jobs, and the pool runs 4 at a time in parallel. You can make things even simpler by using an executor, where you just hand it functions (or other callables) and it hands you back futures for the results. You can build all of this yourself, but you don't have to.*
The stdlib's concurrent.futures module is the easiest way to do this. (For Python 3.1 and earlier, see the backport.) In fact, one of the main examples is very close to what you want to do. But let's adapt it to your exact use case:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        concurrent.futures.wait(fs)
If you wanted process_file to return something, that's almost as easy:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        for f in concurrent.futures.as_completed(fs):
            do_something(f.result())
And if you want to handle exceptions too… well, just look at the example; it's just a try/except around the call to result().
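For instance, a minimal sketch of that exception handling, using the same fs and do_something as above:

for f in concurrent.futures.as_completed(fs):
    try:
        do_something(f.result())
    except Exception as exc:
        # result() re-raises whatever process_file raised; handle or log it here
        print('a file failed to process:', exc)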
* If you want to build them yourself, it's not that hard. The source to multiprocessing.pool is well written and commented, and not that complicated, and most of the hard stuff isn't relevant to threading; the source to concurrent.futures is even simpler.
I used this technique a few times; I think it's a bit ugly though:
import threading

def process_something():
    something = list(get_something)

    def worker():
        while something:
            obj = something.pop()
            # do something with obj

    threads = [threading.Thread(target=worker) for i in range(4)]
    [t.start() for t in threads]
    [t.join() for t in threads]