Why is this multiprocess program running slowly than its not concurrent version?

Why is this multiprocess program running slowly than its not concurrent version? - python

I have made a program for adding a list by dividing them in subparts and using multiprocessing in Python. My code is the following:
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time
def dummyFun(l):
s=0
for i in range(0,len(l)):
s=s+l[i]
return s
def sumaSec(v):
start=time.time()
sT=0
for k in range(0,len(v),10):
vc=v[k:k+10]
print ("vector ",vc)
for item in vc:
sT=sT+item
print ("sequential sum result ",sT)
sT=0
start1=time.time()
print ("sequential version time ",start1-start)
def main():
workers=5
vector=random.sample(range(1,101),100)
print (vector)
sumaSec(vector)
dim=10
sT=0
for k in range(0,len(vector),dim):
vc=vector[k:k+dim]
print (vc)
for item in vc:
sT=sT+item
print ("sub list result ",sT)
sT=0
chunks=(vector[k:k+dim] for k in range(0,len(vector),10))
start=time.time()
with ProcessPoolExecutor(max_workers=workers) as executor:
futures=[executor.submit(dummyFun,chunk) for chunk in chunks]
for future in as_completed(futures):
print (future.result())
start1=time.time()
print (start1-start)
if __name__=="__main__":
main()
The problem is that for the sequential version I got a time of:
0.0009753704071044922
while for the concurrent version my time is:
0.10629010200500488
And when I reduce the number of workers to 2 my time is:
0.08622884750366211
Why is this happening?

The length of your vector is only 100. That is a very small amount of work, so the the fixed cost of starting the process pool is the most significant part of the runtime. For this reason parallelism is most beneficial when there is a lot of work to do. Try a larger vector, like a length of 1 million.
The second problem is that you have each worker do a tiny amount of work: a chunk of size 10. Again, that means the cost of starting a task cannot be amortized over so little work. Use larger chunks. For example, instead of 10 use int(len(vector)/(workers*10)).
Also note that you're creating 5 processes. For a CPU-bound task like this one you ideally want to use the same number of processes as you have physical CPU cores. Either use whatever number of cores your system has, or if you use max_workers=None (the default value) then ProcessPoolExecutor will default to that number for your system. If you use too few processes you're leaving performance on the table, if you use too many then the CPU will have to switch between them and your performance may suffer.

Your chunking is pretty awful for creating multiple tasks.
Creating too many tasks still incurs the time punishment even when your workers are already created.
Maybe this post can help you in your search:
How to parallel sum a loop using multiprocessing in Python

Related

Python Multiprocessing: Writing to file every k iterations

I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving onto the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd
def myfunction(x): # toy example function
return(x**2)
for start in np.arange(0,500,100):
with mp.Pool(mp.cpu_count()) as pool:
out = pool.map(myfunction, np.arange(start, start+100))
pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)

My first comment is that if myfunction is a trivial as the one you have shown, then your performance will be worse using multiprocessing because there is overhead in creating the process pool (which by the way you are unnecessarily creating over and over in each loop iteration) and passing arguments from one process to another.
Assuming that myfunction is pure CPU and after map has returned 100 values there is an opportunity to overlap the writing of the CSV files that you are not taking advantage of (it's not clear how much performance will be improved by concurrent disk writing; it depends on the type of drive you have, head movement, etc.), then a combination of multithreading and multiprocessing could be the solution. The number of processes in your processing pool will be limited to the number of CPU cores given the assumption that myfunction is 100% CPU and does not release the Global Interpreter Lock and therefore cannot take advantage of a pool size greater than the number of CPUs you have. Anyway, that is my assumption. If you are going to be using certain numpy functions for example, then that is an erroneous assumption. On the other hand, it is known that numpy uses multiprocessing for some of its own processing in which case the combination of using numpy and your own multiprocessing could result in worse performance. Your current code is only using numpy for generating ranges. This seems to be a bit of overkill as there are other means of generating ranges. I have taken the liberty of generating the ranges in a slightly different fashion by defining START and STOP values and N_SPLITS, the number of equal (or as equally as possible) divisions of this range as possible and generating tuples of start and stop values that can be converted into ranges. I hope this is not too confusing. But this seemed to be a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, whish is used by the worker to do the CPU intensive calculations and then when the results have been assembled the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd
def worker(process_pool, index, split_range):
out = process_pool.map(myfunction, range(*split_range))
pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)
def myfunction(x): # toy example function
return(x ** 2)
def split(start, stop, n):
k, m = divmod(stop - start, n)
return [(i * k + min(i, m),(i + 1) * k + min(i + 1, m)) for i in range(n)]
def main():
RANGE_START = 0
RANGE_STOP = 500
N_SPLITS = 5
n_processes = min(N_SPLITS, cpu_count())
split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS) # [(0, 100), (100, 200), ... (400, 500)]
process_pool = Pool(n_processes)
thread_pool = ThreadPool(N_SPLITS)
for index, split_range in enumerate(split_ranges):
thread_pool.apply_async(worker, args=(process_pool, index, split_range))
# wait for all threading tasks to complete:
thread_pool.close()
thread_pool.join()
# required for Windows:
if __name__ == '__main__':
main()

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound, rather than CPU bound? (And if so - what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would have been 2 - 10 times faster. Or, potentially I implementing multi-process incorrectly?
## Worker function just included to give some context
def worker(sub_code):
layer = 'in_memory/lyr_{}'.format(sub_code)
arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
arcpy.env.extent = layer
union_name = 'in_memory/union_' + sub_code
arcpy.Union_analysis([fields],
union_name,
"NO_FID", "1 FEET")
#.......Some calculations using cursors
# Templating using Jinjah
context = {}
context['DATE'] = now.strftime("%B %d, %Y")
context['SUB_CD'] = sub_code
context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
# Etc
# Then write the report out using custom function
write_html('template.html', 'output_folder', context)
if __name__ == '__main__':
subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
NUM_CORES = 7
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
for chunk in chunk_list:
jobs = []
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()

There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.

You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)

I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()
Should instead be:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
p.start()
p.join()
Is there a specific reason you are going against the spec of using the multiprocessing library? You are not waiting until the thread terminates before spinning another process up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
return x
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
for x in xrange(1, 11):
res = list(mapper(f,bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small like xrange(1, 6). However, when I increase the range to xrange(1, 10). I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shutdowns the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: Its a simple ngram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random
def f(x):
return x
def ngrams(input_tmp, n):
input = input_tmp.split()
if n > len(input):
n = len(input)
output = []
for i in range(len(input)-n+1):
output.append(input[i:i+n])
return output
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
num = 100000000 #100
rand_list = random.sample(xrange(100000000), num)
rand_str = ' '.join(str(i) for i in rand_list)
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
if __name__ == '__main__':
start = time.time()
foo()
print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g.,num = 100000000). Only 2 CPUs are used and rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.

First, ngrams itself takes a lot of time. While that's happening, it's obviously only one one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.

The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
arg = ngrams(rand_str, n)
res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.

Making itertools.combinations calculations multiprocess in python?

I'm using such algorithm to make some calculations on array of Decimals:
fkn = Decimal('0')
for bits in itertools.combinations(decimals_array, elements_count):
kxn = reduce(operator.mul, bits, Decimal('1'))
fkn += kxn
I'm using Python 3.4 x64.
Decimals have precision>300 (it's a must).
len(decimals_array) is most of the time over 40.
elements_count is most of the time len(decimals_array)/2.
Calculations take very long time.
I wanted to make them multiprocess so first I was thinking about making an array of all combinations and send parts of this array to many processes - but during making of such array I quickly get MemoryError Exception.
Now I'm looking for nicer way to make this code multiprocess.
What is a good way to run this algorithm on multiple cores?
Or maybe there is a better (faster) way to make such calculations?
Thank you in advance for some ideas.

In order to really parallelize this you need to get around combinations() being sequential so that each process can generate its own combinations. The rest of the problem is already paralellizable.
40 choose 20 is about 138 billion combinations so pre-generating that or generating it in each process is going to hurt. With a 20-element list taking around 224 bytes (says sys.getsizeof()) that's 30 something terabytes if you generate the whole thing in one go. No wonder you ran out of memory. You also can't really split a generator across processes; or rather, if you do, each process will get its own copy of the generator.
Solution 1 is to have a process whose sole job is to generate combinations and push them into a queue, possibly in batches to save on IPC overhead, and have the other processes consume combinations from that queue.
Solution 2 is to write a non-sequential version of combinations that returns the Nth combination without computing the rest. This is definitely possible because it's possible with permutations, and combinations are an internally sorted subset of permutations. Then each process in a Pool can generate its own based on a start and step of N - process one counts combination 0, 3, 6..., process two counts combination 1, 4, 7... and so on, for example. This would probably be even slower unless you use C/Cython though.
Solution 3 (or possibly solution 0?) is to go over to the math stackexchange and ask if there's a mathematical rather than computational solution to this problem.

Here is one solution although it isn't very neat.
The idea is to use multiple processes where one process is responsible for one interval. However, since itertools.combinations is sequential, each process has to loop over unnecessary combinations until it reaches the right interval. When the right interval is handled, the process stops. The code is from this book.
import itertools
from tqdm import tqdm
from math import factorial
from multiprocessing import Process
import itertools
def total_combo(n, r):
return factorial(n) // factorial(r) // factorial(n-r)
def cal_combo(var,noCombo,start,end):
data = itertools.combinations(range(var),noCombo)
for i in enumerate(tqdm(data)):
if i[0] >= start:
if i[0] < start+10: print(i)
if i[0] > end: break
if __name__=='__main__':
noCombo=3
var=1000
print(total_combo(var,noCombo),'combinations for',noCombo,'of',var,'variants')
noProc=6
interval=total_combo(var,noCombo)/noProc
if interval%1==0:
print(interval)
procs=[]
for pid in range(noProc):
proc = Process(target=cal_combo, args=(var,noCombo, interval*pid, interval*(pid+1)))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()

Can my multiple processes be better regulated so that they finish at (almost) the same time?

I have the following multiprocessing code
from multiprocessing import Pool
pool = Pool(maxtasksperchild=20)
likelihoods = pool.map_async(do_comparison, itertools.combinations(clusters, 2)).get()
condensed_score_matrix = [1 / float(l) if l != 0 and l < 5 else 10 for l in likelihoods]
spectra_names = [c.get_names()[0] for c in clusters]
pool.close()
The problem with this code is that the different processes do not finish at the same time. I'm using eight processes. There can be 20-30+ minutes between the first process finishing and the last process finishing, with the last process running alone for a big part of that time. It would be much quicker if the workload would be redivided to processes that are finished, so that all cores are used the whole time.
Is there a way to accomplish this?

The way workload is divided can be controlled with the chunksize parameter of map_async.
By omitting it you are currently using the default behavior which is roughly chunksize = num_tasks / (num_processes * 4), so on average each process will only receive 4 chunks.
You can start by setting the chunk size to 1 to validate that it properly distributes workload and then gradually increase it until you stop seeing a performance improvement.

You can try to experiment with .imap_unordered using different chunksize values.
More here.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.