I'm writing a piece of code that needs to compare a Python set to many other sets and retain the names of the files whose intersection with it meets a minimum length. I currently have a synchronous version but was wondering if it could benefit from async/await. I wanted to start by comparing the loading of the sets, so I wrote a simple script that writes a small set to disk and just reads it back n times. I was surprised to see that the sync version was a lot faster. Is this to be expected? And if not, is there a flaw in the way I have coded it below?
My code is the following:
Synchronous version:
import pickle
import asyncio
import time
import aiofiles

pickle.dump(set(range(1000)), open('set.pkl', 'wb'))

def count():
    print("Started Loading")
    with open('set.pkl', mode='rb') as f:
        contents = pickle.loads(f.read())
    print("Finished Loading")

def main():
    for _ in range(100):
        count()

if __name__ == "__main__":
    s = time.perf_counter()
    main()
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.3f} seconds.")
Asynchronous version:
import pickle
import asyncio
import time
import aiofiles

pickle.dump(set(range(1000)), open('set.pkl', 'wb'))

async def count():
    print("Started Loading")
    async with aiofiles.open('set.pkl', mode='rb') as f:
        contents = pickle.loads(await f.read())
    print("Finished Loading")

async def main():
    await asyncio.gather(*(count() for _ in range(100)))

if __name__ == "__main__":
    s = time.perf_counter()
    asyncio.run(main())
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.3f} seconds.")
Executing them led to:
async.py executed in 0.052 seconds.
sync.py executed in 0.011 seconds.
Asyncio doesn't help in this case because your workload is basically disk-IO bound and CPU bound.
A CPU-bound workload cannot be sped up by asyncio.
A disk-IO-bound workload could benefit from async operation if the disk operation is slow and your program can do other things during that time. This may not be your situation.
So the slower asyncio performance is mainly due to the additional overhead it introduces:
aiofiles is implemented using threads, so each time you ask it to read a file, another thread is instructed to do the read.
The file being read is very small: it fits in about 3 KB, which is under one memory page and smaller than a core's L1 cache, so the computer isn't actually reading anything from disk most of the time; the data is just being moved between parts of memory.
In the async case the data is moved from one core's cache to another's, which is slower than keeping everything within one core's cache. For larger files that are actually read from disk, and with other tasks to attend to, such as reading from sockets, reading different files from disk, and doing some processing concurrently, the async version can be faster, because it uses threads under the hood and some tasks release the GIL, like reading from files and sockets and some processing libraries.
You are still reading files at the same speed in both cases, since you are limited by your drive's read speed; you are only reducing the "dead time" when you are not reading files, and your example has no dead time: it isn't even reading a file from disk.
An exception to the above happens when you are reading data from multiple HDDs and SSDs concurrently, where one thread can never read the data fast enough; there the async version will be faster, because it can read from multiple drives at the same time (assuming your CPU has the cores and IO lanes for it).
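For the asker's actual problem (comparing one set against many pickled sets and keeping the filenames whose intersection is long enough), a plain thread pool is usually the simpler route, since unpickling holds the GIL anyway and the reads themselves are fast. A minimal sketch, for illustration only, assuming the candidate sets live in .pkl files and using a hypothetical min_len threshold:

import pickle
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def intersection_size(path, query_set):
    # load one pickled set and measure its overlap with the query set
    with open(path, 'rb') as f:
        other = pickle.load(f)
    return path, len(query_set & other)

def matching_files(query_set, paths, min_len=10):
    # threads overlap the file reads; unpickling still runs one at a time under the GIL
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: intersection_size(p, query_set), paths)
        return [str(p) for p, size in results if size >= min_len]

if __name__ == "__main__":
    query = set(range(1000))
    print(matching_files(query, Path('.').glob('*.pkl')))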
Related
I have a function readFile that I need to call 8.5 million times (essentially stress-testing a logger to ensure the log rotates correctly). I don't care about the output/result of the function, only that I run it N times as quickly as possible.
My current solution is this:
from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()

    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. Based on my mental math, this spawns ~60 threads per second, which means it will take ~40 hours to finish. Ideally, this would finish within 1-8 hours.
Is this possible? Is the number of iterations simply too high for this to be done in a reasonable span of time?
Oddly enough, when I wrote a test script, I was able to generate a thread about every ~0.0005 seconds, which should equate to ~2000 threads per second, but this is not the case here.
I considered iterating 8500000 / 10 times and spawning a thread which then runs the readFile function 10 times, which should decrease the amount of time by ~90%, but it caused some issues with blocking resources, and I think passing a lock around would be a bit complicated insofar as keeping the function usable by methods that don't incorporate threading.
Any tips?
Based on #blarg's comment, and on multiprocessing scripts I've used before, the following can be considered.
It simply reads the same file based on the size of the list. Here I'm looking at 1M reads.
With 1 core it takes around 50 seconds. With 8 cores it's down to around 22 seconds. This is on a Windows PC, but I use these scripts on Linux EC2 (AWS) instances as well.
Just put this in a Python file and run it:
import os
import time
from multiprocessing import Pool
from itertools import repeat

def readfile(fn):
    f = open(fn, "r")
    f.close()

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        r = pool.starmap(readfile, zip(mylist))
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__ = None

    # use the system cpus or change explicitly
    num_proc = os.cpu_count()
    num_proc = 1

    start = time.time()

    # here you'll want 8.5M, but first test that it works with a smaller number.
    # note this is slow with a low number of reads, meaning 8 cores is slower
    # than 1 core until you reach a certain point, then multiprocessing is worth it
    mylist = ["test.txt"] * 1000000

    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)
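One knob that may help with the per-task overhead mentioned in the comment above is the pool's chunksize argument, so each worker receives a batch of filenames per round trip instead of one at a time. A minimal sketch of that variant, for illustration only (it is not part of the original answer, and test.txt is the same placeholder file):

from multiprocessing import Pool

def readfile(fn):
    # open and immediately close the file, as in the script above
    open(fn, "r").close()

if __name__ == "__main__":
    mylist = ["test.txt"] * 1000000
    with Pool(8) as pool:
        # chunksize batches the work sent to each worker, reducing IPC overhead
        pool.map(readfile, mylist, chunksize=10000)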
I think you should reconsider using subprocess here: if you just want to execute the ls command, I think it's better to use os.system, since it will reduce the resource consumption under your current GIL.
Also, you should add a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption.
from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls " + filename)

def main():
    filename = "test.log"
    threads = set()

    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1)  # put this delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
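As a side note on the batching idea raised in the question (running several reads per thread instead of one thread per read), a bounded pool gives the same effect without creating 8.5 million Thread objects. A minimal sketch, for illustration only, reusing the placeholder test.log and the ls call from above:

import os
from concurrent.futures import ThreadPoolExecutor

def readFile(filename):
    os.system("/usr/bin/ls " + filename)

def main():
    filename = "test.log"
    # a fixed set of worker threads is reused for all 8.5M calls, and the
    # with-block waits for every submitted call to finish before returning
    with ThreadPoolExecutor(max_workers=64) as pool:
        for _ in range(8500000):
            pool.submit(readFile, filename)

if __name__ == "__main__":
    main()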
I'm reading in several thousand files at once, and for each file I need to perform some operations before yielding rows from it. To increase performance I thought I could use asyncio to perform operations on files (and yield rows) whilst waiting for new files to be read in.
However, from print statements I can see that all the files are opened and gathered, and then each file is iterated over (the same as would occur without asyncio).
I feel like I'm missing something quite obvious here that is making my asynchronous attempt synchronous.
import asyncio

async def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

async def async_generator():
    file_outputs = await asyncio.gather(*[open_files(file) for file in files])

    for file_output in file_outputs:
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

async def main():
    async for yield_value in async_generator():
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Output:
opening files
opening files
.
.
.
using open file
using open file
EDIT
Using the code supplied by #user4815162342, I noticed that, although it was 3x quicker, the set of rows yielded from the generator was slightly different from the one produced without concurrency. I'm unsure as yet whether this is because some yields were missed from each file, or whether the files were somehow re-ordered. So I made the following changes to the code from user4815162342 and passed a lock into pool.submit().
I should have mentioned when first asking that the ordering of the rows in each file, and of the files themselves, is required.
import concurrent.futures
import multiprocessing

def open_files(file, lock):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    m = multiprocessing.Manager()
    lock = m.Lock()
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file, lock) for file in files]
    for fut in concurrent.futures.as_completed(file_output_futures):
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
This way my non-concurrent and concurrent approaches yield the same values each time; however, I have lost all the speed gained from using concurrency.
I feel like I'm missing something quite obvious here that is making my asynchronous attempt synchronous.
There are two issues with your code. The first one is that asyncio.gather() by design waits for all the futures to complete in parallel, and only then returns their results. So the processing you do in the generator is not interspersed with the IO in open_files as was your intention, but only begins after all the calls to open_files have returned. To process async calls as they are done, you should be using something like asyncio.as_completed.
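For reference, here is a minimal sketch of the asyncio.as_completed pattern just mentioned, using a toy coroutine for illustration (not taken from the answer); each result is handled as soon as its coroutine finishes rather than after the whole batch:

import asyncio

async def load(name, delay):
    # stand-in for an async "open and process a file" coroutine
    await asyncio.sleep(delay)
    return name

async def main():
    tasks = [load("a", 0.3), load("b", 0.1), load("c", 0.2)]
    # results arrive in completion order: b, c, a
    for coro in asyncio.as_completed(tasks):
        result = await coro
        print("finished", result)

asyncio.run(main())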
The second and more fundamental issue is that, unlike threads which can parallelize synchronous code, asyncio requires everything to be async from the ground up. It's not enough to add async to a function like open_files to make it async. You need to go through the code and replace any blocking calls, such as calls to IO, with equivalent async primitives. For example, connecting to a network port should be done with open_connection, and so on. If your async function doesn't await anything, as appears to be the case with open_files, it will execute exactly like a regular function and you won't get any benefits of asyncio.
Since you use IO on regular files, and operating systems don't expose portable async interface for regular files, you are unlikely to profit from asyncio. There are libraries like aiofiles that use threads under the hood, but they are as likely to make your code slower than to speed it up because their nice-looking async APIs involve a lot of internal thread synchronization. To speed up your code, you can use a classic thread pool, which Python exposes through the concurrent.futures module. For example (untested):
import concurrent.futures

def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file) for file in files]
    for fut in file_output_futures:
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16 GB RAM and an Intel i7 8th Gen processor. I tried this on Python 3.7.2 and also on Python 3.8.2.
import concurrent.futures
import time
Function that takes a list and multiplies each element by 2:
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y
Function that multiplies a single element by 2:
def double_single_value(x):
    return 2 * x
Define the array a:
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)
Function to run multiple threads and multiply each element by 2:
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds. This is using only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time() - t)
The function below took more than 9 minutes, consumed all the RAM of the system, and then the system killed all the processes. CPU utilization during this piece of code was also not up to 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time() - t)
I really want to understand:
1) Why is the code that first splits the work with multiprocessing and then runs multi-threading inside each process not faster than the code that uses only multiprocessing?
(I have gone through many posts that describe multiprocessing and multi-threading, and one of the takeaways is that multi-threading is for I/O processes and multiprocessing for CPU processes.)
2) Is there any better way of doing multi-threading inside multiprocessing for maximum utilization of the allotted cores (or CPU)?
3) Why did that last piece of code consume all the RAM? Was it due to multi-threading?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different approaches to make a print 16000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed, because I thought this method would be better than asyncio in every respect, but it is quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
For these notes, I modified the test so that it only makes 1600 prints (replacing the value 1000 with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "multi-threading is for I/O processes and multiprocessing for CPU processes".
You need to figure out whether your program is IO-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random, or all together at the same time, usually only makes things worse.
Using threading in pure Python for CPU-bound problems is a bad approach, regardless of whether you combine it with multiprocessing. Try to redesign your app to use only multiprocessing, or use third-party libs such as Dask and so on.
I believe you have figured it out, but I wanted to answer anyway. Obviously, your function double_single_value is CPU bound; it has nothing to do with IO. For CPU-bound tasks, using multiple threads makes things worse than using a single thread, because the GIL does not allow you to actually run on multiple threads, so you effectively run on a single thread. Also, a task may be suspended before it finishes, and when execution comes back to it, its data has to be loaded into the CPU cache again, which makes this even slower.
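A quick way to see the effect described above (a sketch for illustration, not from the original answer): a pure-CPU loop run through a thread pool is no faster than running it sequentially, because only one thread holds the interpreter at a time.

import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # pure Python arithmetic, so the GIL is held for the whole call
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 8

    t = time.time()
    sequential = [busy(n) for n in work]
    print("sequential:", time.time() - t)

    t = time.time()
    with ThreadPoolExecutor(max_workers=8) as ex:
        threaded = list(ex.map(busy, work))
    print("threaded:  ", time.time() - t)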
Based on your code, I see that most of it is dealing with computations (calculations), so it's most encouraged to use multiprocessing to solve your problem, since it's CPU-bound and NOT I/O-bound (I/O-bound being things like sending requests to websites and then waiting for a response from the server, or writing to and reading from disk). This is true for Python programming as far as I know. The Python GIL (Global Interpreter Lock) will make your code run slowly, as it is a mutex (a lock) that allows only one thread to take control of the Python interpreter, meaning it won't achieve parallelism but will give you concurrency instead. It is fine to use threading for I/O-bound tasks, because there it will outcompete multiprocessing in execution time, but for your case I would encourage you to use multiprocessing, because each Python process gets its own Python interpreter and memory space, so the GIL won't be a problem for you.
I am not so sure about integrating multithreading with multiprocessing, but what I know is that it can cause inconsistency in the processed results, since you will need more boilerplate code for data synchronization if you want the processes to communicate (IPC), and threads are also somewhat unpredictable (thus inconsistent at times) since they're controlled by the OS, so they can be swapped out at any time (pre-emptive scheduling) for kernel-level threads (due to time sharing). I don't stop you from writing that code, but be really sure of what you are doing. You never know, you might propose a solution to it one day.
I want to fetch whois data for a txt file with 50000 URLs. It works, but takes at least 20 minutes. What can I do to improve the performance of this?
import whois
from concurrent.futures import ThreadPoolExecutor
import threading
import time

pool = ThreadPoolExecutor(max_workers=2500)

def query(domain):
    while True:
        try:
            w = whois.whois(domain)
            fwrite = open("whoISSuccess.txt", "a")
            fwrite.write('\n{0} : {1}'.format(w.domain, w.expiration_date))
            fwrite.close()
        except:
            time.sleep(3)
            continue
        else:
            break

with open('urls.txt') as f:
    for line in f:
        lines = line.rstrip("\n\r")
        pool.submit(query, lines)

pool.shutdown(wait=True)
You can do two things to improve the speed:
Use multiprocessing rather than threading, as Python threads do not really run in parallel, while processes are managed by the OS and truly run in parallel.
Secondly, have each process write to its own file, e.g. <url>.txt, as having all processes write to the same file causes lock contention on the file writes, which significantly slows your program. After all processes have completed you can then aggregate all the files into a single one if that is a critical requirement. Alternatively, you can just keep the whois results in memory and write them out to a file at the end.
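A minimal sketch of that last suggestion (keeping the whois results in memory and writing the file once at the end), reusing the whois call and file names from the question; this is an illustration under those assumptions, not a tested drop-in, and the retry loop is reduced to skipping failed lookups:

import whois
from concurrent.futures import ProcessPoolExecutor

def query(domain):
    try:
        w = whois.whois(domain)
        return '{0} : {1}'.format(w.domain, w.expiration_date)
    except Exception:
        return None  # skip domains whose lookup failed

if __name__ == "__main__":
    with open('urls.txt') as f:
        domains = [line.strip() for line in f if line.strip()]

    with ProcessPoolExecutor() as pool:
        results = list(pool.map(query, domains))

    # a single write at the end avoids contention on the output file
    with open("whoISSuccess.txt", "w") as out:
        out.write('\n'.join(r for r in results if r))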
To begin with, we're given the following piece of code:
from validate_email import validate_email
import time
import os

def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    emails = set()

    with open(email_path) as f:
        for email in f:
            email = email.strip()
            if email in emails:
                continue
            emails.add(email)
            if validate_email(email, verify=True):
                good_emails.write(email + '\n')
            else:
                bad_emails.write(email + '\n')

if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
I expect contacting SMTP servers to be by far the most expensive part of my program when emails.txt contains a large number of lines (>1k). Using some form of parallel or asynchronous I/O should speed this up a lot, since I can wait for multiple servers to respond instead of waiting sequentially.
As far as I have read:
Asynchronous I/O operates by queuing a request for I/O to the file descriptor, tracked independently of the calling process. For a file descriptor that supports asynchronous I/O (raw disk devices typically), a process can call aio_read() (for instance) to request a number of bytes be read from the file descriptor. The system call returns immediately, whether or not the I/O has completed. Some time later, the process then polls the operating system for the completion of the I/O (that is, the buffer is filled with data).
To be sincere, I didn't quite understand how to implement async I/O in my program. Can anybody take a little time and explain the whole process to me?
EDIT, as PArakleta suggested:
from validate_email import validate_email
import time
import os
from multiprocessing import Pool
import itertools

def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)

seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True

def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    with open(email_path, "r") as f:
        for result in Pool().imap_unordered(validate_map,
                                            itertools.ifilter(unique, f)):
            (good, email) = result
            if good:
                good_emails.write(email)
            else:
                bad_emails.write(email)

    good_emails.close()
    bad_emails.close()

if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
You're asking the wrong question
Having looked at the validate_email package, your real problem is that you're not efficiently batching your results. You should be doing the MX lookup only once per domain, then connecting to each MX server once, going through the handshake, and checking all of the addresses for that server in a single batch. Thankfully the validate_email package does the MX result caching for you, but you still need to group the email addresses by server to batch the query to the server itself.
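As a rough illustration of that grouping step (a sketch, not part of the original answer; the batched SMTP conversation itself would still have to be implemented inside validate_email):

from collections import defaultdict

def group_by_domain(emails):
    # bucket addresses by domain so each MX server only needs to be contacted once per batch
    groups = defaultdict(list)
    for email in emails:
        email = email.strip()
        if '@' in email:
            groups[email.rsplit('@', 1)[1].lower()].append(email)
    return groups

# e.g. group_by_domain(open("emails.txt")) -> {'example.com': [...], 'gmail.com': [...], ...}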
You need to edit the validate_email package to implement batching, and then probably give a thread to each domain using the actual threading library rather than multiprocessing.
It's always important to profile your program if it's slow and figure out where it is actually spending the time rather than trying to apply optimisation tricks blindly.
The requested solution
IO is already asynchronous if you are using buffered IO and your use case fits with the OS buffering. The only place you could potentially get some advantage is in read-ahead, but Python already does this if you use iterator access to a file (which you are doing). AsyncIO is an advantage for programs that are moving large amounts of data and have disabled the OS buffers to prevent copying the data twice.
You need to actually profile/benchmark your program to see if it has any room for improvement. If your disks aren't already throughput bound then there is a chance to improve the performance by parallel execution of the processing of each email (address?). The easiest way to check this is probably to check to see if the core running your program is maxed out (i.e. you are CPU bound and not IO bound).
If you are CPU bound then you need to look at threading. Unfortunately Python threading doesn't work in parallel unless you have non-Python work to be done so instead you'll have to use multiprocessing (I'm assuming validate_email is a Python function).
How exactly you proceed depends on where the bottlenecks in your program are and how much of a speed-up you need to get to the point where you are IO bound (since you cannot actually go any faster than that, you can stop optimising when you hit that point).
The emails set object is hard to share because you'll need to lock around it so it's probably best that you keep that in one thread. Looking at the multiprocessing library the easiest mechanism to use is probably Process Pools.
Using this you would need to wrap your file iterable in an itertools.ifilter which discards duplicates, and then feed this into a Pool.imap_unordered and then iterate that result and write into your two output files.
Something like:
with open(email_path) as f:
    for result in Pool().imap_unordered(validate_map,
                                        itertools.ifilter(unique, f)):
        (good, email) = result
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)
The validate_map function should be something simple like:
def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)
The unique function should be something like:
seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True
ETA: I just realised that validate_email is a library which actually contacts SMTP servers. Given that it's not busy in Python code, you can use threading. The threading API, though, is not as convenient as the multiprocessing library, but you can use multiprocessing.dummy to have a thread-based Pool.
If you are CPU bound then it's not really worth having more threads/processes than cores, but since your bottleneck is network IO, you can benefit from many more threads/processes. Since processes are expensive, you want to swap to threads and then crank up the number running in parallel (although you should be polite and not DOS-attack the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool and then call ThreadPool(processes=32).imap_unordered().
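As a rough sketch of that last suggestion (an illustration, not PArakleta's code), with the helper functions restated for completeness; the worker count of 32 is just the figure from the suggestion:

from validate_email import validate_email
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool with the multiprocessing API

def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)

seen_emails = set()

def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True

def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    pool = ThreadPool(processes=32)
    with open(email_path) as f:
        # filter() is the Python 3 spelling of the itertools.ifilter used above
        for good, email in pool.imap_unordered(validate_map, filter(unique, f)):
            if good:
                good_emails.write(email)
            else:
                bad_emails.write(email)
    pool.close()
    pool.join()

    good_emails.close()
    bad_emails.close()

if __name__ == "__main__":
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")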