Multithreaded parsing is slower than sequential - python

I am parsing 4 large XML files using threads, and somehow the multithreaded code is slower than the sequential code?
Here is my multithreaded code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
and this is the "faster" sequential code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)
    for t in thread_list:
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
The sequential code is faster by 10 whole minutes. How is this possible?

Python uses the GIL (Global Interpreter Lock) to ensure that only one thread executes Python code at a time. This is done to prevent data races, among other reasons. It also means, however, that multithreading in the default CPython interpreter will barely give you any speedup, and can even slow your code down, as it did in your case.
To efficiently parallelize your workload, look into Python's multiprocessing module, which launches separate processes that are not affected by each other's GIL.
Here's a SO question on that topic.
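As a rough sketch (not the asker's actual code), the same fan-out could be done with a process pool. Here parse_one_file is a hypothetical stand-in for the per-file work CVEParser does, and cve_file / PersistenceService are the objects from the question:
import os
import multiprocessing

def parse_one_file(file_name):
    # Hypothetical worker: do here whatever CVEParser.run does for one file
    # and return the same list of (path, data) tuples it stores in .result.
    return []

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        for result in pool.imap_unordered(parse_one_file, cve_file):
            for res in result:
                PersistenceService.insert_data_from_file(res[0], res[1])
                os.remove(res[0])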

Where did you read that multithreading, or even multiprocessing, is always faster than sequential? That is simply wrong. Which of the three modes is faster depends heavily on the problem to be solved and on where the bottleneck is.
If the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet.
If the bottleneck is IO, Python multithreading is the way to go: even though only one thread can be active at a time, the others will be waiting for IO completion during that time, and you will get much better throughput (even if the genuinely faster way is normally to poll IO with select when possible).
Only if the bottleneck is CPU processing - which IMHO is not the most common use case - is parallelization over different cores the winner. In Python that means multiprocessing (*). That mainly concerns heavy computations.
In your use case, there is one other potential cause: you wait for the threads in sequence in the join loop. That means that if thread 2 finishes long before thread 0, you will only process its results after thread 0 has ended, which is suboptimal.
This kind of code is often more efficient because it allows processing results as soon as one thread has finished:
active_list = thread_list[:]
while active_list:
    for t in active_list[:]:      # iterate over a copy so removal is safe
        if not t.is_alive():      # this thread has finished
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)               # avoid a busy loop
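As a side note, concurrent.futures gives the same "process whichever finishes first" behaviour without polling. A minimal sketch, assuming a hypothetical parse_one function doing the per-file work on the question's cve_file list:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(parse_one, name) for name in cve_file]
    for fut in concurrent.futures.as_completed(futures):
        res = fut.result()   # available as soon as that particular file is done
        # process res here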
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Technically, this means releasing the Global Interpreter Lock.
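As an illustrative sketch (not part of the original answer), large numpy operations release the GIL, so threads performing them can genuinely overlap on several cores:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.random.rand(2000, 2000)

def multiply(_):
    # numpy releases the GIL during the matrix multiplication itself,
    # so several of these calls can run on different cores at once.
    return a @ a

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(multiply, range(4)))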

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read from one place at a time, and it has to physically move the read/write head back and forth many times to service the different reading threads. This takes far longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

Related

Why don't I get faster run-times with ThreadPoolExecutor?

In order to understand how threads work in Python, I wrote the following simple function:
def sum_list(thelist: list, start: int, end: int):
    s = 0
    for i in range(start, end):
        s += thelist[i]**3 // 10
    return s
Then I created a list and tested how much time it takes to compute its sum:
LISTSIZE = 5000000
big_list = list(range(LISTSIZE))
start = time.perf_counter()
big_sum=sum_list(big_list, 0, LISTSIZE)
print(f"One thread: sum={big_sum}, time={time.perf_counter()-start} sec")
It took about 2 seconds.
Then I tried to partition the computation into threads, such that each thread computes the function on a subset of the list:
THREADCOUNT = 4
SUBLISTSIZE = LISTSIZE // THREADCOUNT
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(THREADCOUNT) as executor:
    futures = [executor.submit(sum_list, big_list, i*SUBLISTSIZE, (i+1)*SUBLISTSIZE) for i in range(THREADCOUNT)]
    big_sum = 0
    for res in concurrent.futures.as_completed(futures):  # collect each result as soon as it is completed
        big_sum += res.result()
    print(f"{THREADCOUNT} threads: sum={big_sum}, time={time.perf_counter()-start} sec")
Since I have a 4-core CPU, I expected it to run 4 times faster. But it did not: it ran in about 1.8 seconds on my Ubuntu machine (and on my Windows machine, with 8 cores, it ran even slower than the single-thread version: about 2.2 seconds).
Is there a way to use ThreadPoolExecutor (or another threads-based mechanism in Python) so that I can compute this function faster?
The problem is that the function you are trying to make faster is CPU-bound, and the Python Global Interpreter Lock (GIL) prevents any performance gain from parallelisation of such code.
In Python, threads are wrappers around genuine OS threads. However, in order to avoid race conditions due to concurrent execution, only one thread can access the Python interpreter to execute bytecode at a time. This restriction is enforced by a lock called the GIL.
Thus in Python, true multithreading cannot be achieved and multiprocessing should be used instead. However, note that the GIL is not held during IO operations (file reading, networking, etc.) or in some library code (numpy, etc.), so those operations can still benefit from Python multithreading.
The function sum_list uses neither of those, so it will not benefit from Python multithreading.
You can use ProcessPoolExecutor to get real parallelism, but this may copy the input list in your case. Multiprocessing is equivalent to launching multiple independent Python interpreters, so the GIL (one per interpreter) is no longer an issue. However, multiprocessing incurs performance penalties from inter-process communication.
Since I have a 4-core CPU, I expected it to run 4 times faster.
ThreadPoolExecutor cannot run your Python code on multiple CPUs at once, so this isn't a sensible expectation: because of the GIL, only one worker thread executes Python bytecode at any given moment.
Threads generally only help when you're IO-bound; large calculations are CPU-bound.
To take advantage of multiple CPUs, you may want to look at ProcessPoolExecutor instead. However, spawning/forking additional processes has a much higher overhead than threading, and any objects sent across the process boundary need to be picklable. Since the workers in your example code all reference the same big_list instance, it may not work well with multiprocessing either: it will copy the entire list into each worker process, even though each worker only intends to use a small segment of the list.
You can refactor it to only send the data you need for the calculation (easy), or you can use shared memory (difficult).
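A minimal sketch of the "send only the data you need" route, assuming the same big_list and a hypothetical sum_sublist helper (each chunk is still pickled into its worker, but only once and only a quarter of the list):
import concurrent.futures

def sum_sublist(sublist):
    # Same cube-and-divide computation, but over a chunk passed by value.
    return sum(x**3 // 10 for x in sublist)

if __name__ == "__main__":
    LISTSIZE = 5_000_000
    THREADCOUNT = 4
    SUBLISTSIZE = LISTSIZE // THREADCOUNT
    big_list = list(range(LISTSIZE))

    chunks = [big_list[i*SUBLISTSIZE:(i+1)*SUBLISTSIZE] for i in range(THREADCOUNT)]
    with concurrent.futures.ProcessPoolExecutor(THREADCOUNT) as executor:
        big_sum = sum(executor.map(sum_sublist, chunks))
    print(big_sum)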

Python multithreading/multiprocessing very slow with concurrent.futures

I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets I read in from CSV, that I want to convert into an IPv4 object and then store in an array.
My base code is as follows and executes in roughly 300ms:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

for y in acls:
    convertToIP(y['srcSubnet'])
If I try with concurrent.futures Threads it works but is 3-4x as slow, as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
Then if I try with concurrent.futures ProcessPoolExecutor, it is 10-15x as slow and the array comes back empty. The code is as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!
If the tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running the tasks in parallel.
You might try the following:
Just create two processes (not threads!!!), one treating the first 5000 subnets, the other treating the other 5000 subnets (a sketch follows at the end of this answer).
There you might be able to see some performance improvement, but the tasks you perform are not that CPU- or IO-intensive, so I'm not sure it will help.
Multithreading in Python, on the other hand, will give no performance improvement at all for tasks that have no IO and that are pure Python code.
The reason is the infamous GIL (global interpreter lock). In Python you can never execute two pieces of Python bytecode in parallel within the same process.
Multithreading in Python still makes sense for tasks that do IO (network accesses), that sleep, or that call modules implemented in C which release the GIL. numpy, for example, releases the GIL and is thus a good candidate for multithreading.
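A minimal sketch of the two-process idea, where convert_chunk is a hypothetical helper and acls is the list the question reads from CSV:
import ipaddress
from multiprocessing import Pool

def convert_chunk(subnets):
    # Convert one half of the subnet strings in a separate process.
    return [ipaddress.ip_network(s) for s in subnets]

if __name__ == "__main__":
    subnets = [y['srcSubnet'] for y in acls]   # acls comes from the question
    half = len(subnets) // 2
    with Pool(2) as pool:
        first, second = pool.map(convert_chunk, [subnets[:half], subnets[half:]])
    aclsConverted = first + second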

Reading CSV with pandas in parallel creates huge memory leak / process zombies

I'm reading +1000 of ~200Mb CSVs in parallel and saving the modified CSV afterwards using pandas. This creates many zombie processes that accumulate to +128Gb of RAM which devastates performance.
csv_data = []
c = zip(a, b)

process_pool = Pool(cpu_count())
for name_and_index in process_pool.starmap(load_and_process_csv, c):
    csv_data.append(name_and_index)

process_pool.terminate()
process_pool.close()
process_pool.join()
This is my current solution. It doesn't seem to cause a problem until you process more than 80 CSVs or so.
PS: Even when the pool has completed, ~96Gb of RAM is still occupied, and you can see the Python processes holding RAM but neither doing anything nor being destroyed. Moreover, I know with certainty that the function the pool executes runs to completion.
I hope that's descriptive enough.
Python's multiprocessing module is process-based. So it is natural that you have many processes.
Worse, these processes do not share memory, but communicate through pickling/unpickling. So they are very slow if large data needs to be transferred between processes, which is happening here.
For this case, because the processing is I/O related, you may have better performance using multithreading with the threading module, if I/O is indeed the bottleneck. Threads share memory, but they also 'share' one CPU core, so it's not guaranteed to run faster; you should try it.
Update: If multithreading does not help, you don't have many options left, because this case runs straight into the critical weakness of Python's parallel processing architecture. You may want to try dask (parallel pandas): http://dask.readthedocs.io/en/latest/
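A rough sketch of the thread-based alternative, reusing load_and_process_csv, a and b from the question (the worker count is an arbitrary choice; threads share memory, so the DataFrames never cross a process boundary):
from concurrent.futures import ThreadPoolExecutor

# Equivalent of starmap(load_and_process_csv, zip(a, b)), but with threads.
with ThreadPoolExecutor(max_workers=8) as executor:
    csv_data = list(executor.map(load_and_process_csv, a, b))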

Threads writing each their own file is slower than writing all files sequentially

I am learning about threading in Python, and wrote a short test program which creates 10 csv files and writes 100k lines to each of them. I assumed it would be faster to let 10 threads each write their own file, but for some reason it is 2x slower than simply writing all the files in sequence.
I think this might have to do with the way the threading is treated by the OS, but I'm not sure. I am running this on Linux.
I would greatly appreciate if someone could shed some light on why this is the case.
Multi-thread version:
import thread, csv

N = 10  # number of threads
exitmutexes = [False]*N

def filewriter(id_):
    with open('files/'+str(id_)+'.csv', 'wb') as f:
        writer = csv.writer(f, delimiter=',')
        for i in xrange(100000):
            writer.writerow(["efweef", "wefwef", "666w6efw", "6555555"])
    exitmutexes[id_] = True

for i in range(N):
    thread.start_new_thread(filewriter, (i,))

while False in exitmutexes:  # checks whether all threads are done
    pass
Note: I have tried to include a sleep in the while-loop so that the main thread is freed up at intervals, but this had no substantial effect.
Regular version:
import time, csv

for i in range(10):
    with open('files2/'+str(i)+'.csv', 'wb') as f:
        writer = csv.writer(f, delimiter=',')
        for i in xrange(100000):
            writer.writerow(["efweef", "wefwef", "666w6efw", "6555555"])
There are several issues:
Due to the Global Interpreter Lock (GIL), Python will not use more than one CPU core at a time for the data-generation part, so data generation won't be sped up by running multiple threads. You would need multiprocessing to improve a CPU-bound operation.
But that's not really the core of the problem here, because the GIL is released when you do I/O like writing to disk. The core of the problem is that you're writing to ten different places at a time, which most likely causes the hard disk head to thrash around as it switches between ten different locations on the disk. Serial writes are almost always fastest on a hard disk.
Even if you have a CPU-bound operation and use multiprocessing, using ten threads won't give you any significant advantage in data generation unless you actually have ten CPU cores. If you use more threads than the number of CPU cores, you'll pay the cost of thread switching, but you'll never speed up the total runtime of a CPU-bound operation.
If you use more threads than available CPUs, the total run time always increases, or at most stays the same. The only reason to use more threads than CPU cores is if you are consuming the results of the threads interactively or in a pipeline with other systems. There are edge cases where you can speed up a poorly designed, I/O-bound program by using threads, but a well-designed single-threaded program will most likely perform just as well or better.
Sounds like the dreaded GIL (Global Interpreter Lock)
"In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)"
This essentially means that, within one Python interpreter (and thus one script), no two threads will execute Python bytecode simultaneously, unless you decide to spawn separate processes.
Consult this page for more details:
https://wiki.python.org/moin/GlobalInterpreterLock

Simultaneous parsing, python

I have a python program that sequentially parses through 30,000+ files.
Is there a way I could break this up into multiple threads (is that the correct terminology?) and parse chunks of that file list at the same time? Say, having 30 workers parsing 1000 files each.
This is easy.
You can create 30 threads explicitly and give each of them 1000 filenames.
But, even simpler, you can create a pool of 30 threads and have them service a queue of 30000 filenames. That gives you automatic load balancing: if some of the files are much bigger than others, you won't have one thread finishing while another one is only 10% done.
The concurrent.futures module gives you a nice way to execute tasks in parallel (including passing arguments to the tasks and receiving results, or even exceptions if you want). If you're using Python 2.x or 3.1, you will need to install the backport futures. Then you just do this:
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    results = executor.map(parse_file, filenames)
Now, 30 workers is probably way too many. You'll overwhelm the hard drive and its drivers and end up having most of your threads waiting for the disk to seek. But a small number may be worth doing. And it's ridiculously easy to tweak max_workers and test the timing and see where the sweet spot is for your system.
If your code is doing more CPU work than I/O work—that is, it spends more time parsing strings and building complicated structures and the like than it does reading from the disk—then threads won't help, at least in CPython, because of the Global Interpreter Lock. But you can solve that by using processes.
From a code point of view, this is trivial: just change ThreadPoolExecutor to ProcessPoolExecutor.
However, if you're returning large or complex data structures, the time spent serializing them across the process boundary may eat into, or even overwhelm, your savings. If that's the case, you can sometimes improve things by batching up larger jobs:
def parse_files(filenames):
    return [parse_file(filename) for filename in filenames]

with concurrent.futures.ProcessPoolExecutor(max_workers=30) as executor:
    results = executor.map(parse_files, grouper(10, filenames))
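grouper isn't defined in the answer; it is the usual "split an iterable into fixed-size chunks" helper. One possible definition matching the grouper(10, filenames) call above:
from itertools import islice

def grouper(n, iterable):
    # Yield successive lists of up to n items each, with no fill values.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk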
But sometimes you probably need to drop to a lower level and use the multiprocessing module, which has features like inter-process memory sharing.
If you can't or don't want to use futures, Python 2.6+ has multiprocessing.Pool for a plain process pool, and a thread pool with the same interface under the name multiprocessing.ThreadPool (not documented) or multiprocessing.dummy.Pool (documented but ugly).
In a trivial case like this, there's really no difference between a plain pool and an executor. And, as mentioned above, in very complicated cases, multiprocessing lets you get under the hood. In the middle, futures is often simpler. But it's worth learning both.
