Why does per-process overhead constantly increase for multiprocessing? - python

On a 6-core CPU with 12 logical CPUs, I was counting up to really high numbers in a for-loop, several times over.
To speed things up I was using multiprocessing. I was expecting something like:
number of processes <= number of CPUs: time stays roughly the same
number of processes = number of CPUs + 1: time doubles
What I found instead was a continuous increase in time. I'm confused.
The code was:
#!/usr/bin/python
from multiprocessing import Process, Queue
import random
from timeit import default_timer as timer

def rand_val():
    num = []
    for i in range(200000000):
        num = random.random()
    print('done')

def main():
    for iii in range(15):
        processes = [Process(target=rand_val) for _ in range(iii)]
        start = timer()
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        end = timer()
        print(f'elapsed time: {end - start}')
        print('for ' + str(iii))
        print('')

if __name__ == "__main__":
    main()
    print('done')
result:
elapsed time: 14.9477102 for 1
elapsed time: 15.4961154 for 2
elapsed time: 16.9633134 for 3
elapsed time: 18.723183399999996 for 4
elapsed time: 21.568377299999995 for 5
elapsed time: 24.126758499999994 for 6
elapsed time: 29.142095499999996 for 7
elapsed time: 33.175509300000016 for 8
.
.
.
elapsed time: 44.629786800000005 for 11
elapsed time: 46.22480710000002 for 12
elapsed time: 50.44349420000003 for 13
elapsed time: 54.61919949999998 for 14

There are two wrong assumptions you make:
Processes are not free. Merely adding processes adds overhead to the program.
Processes do not own CPUs. A CPU interleaves execution of several processes.
The first point is why you see some overhead even when there are fewer processes than CPUs. Note that your system usually has several background processes running, so "fewer processes than CPUs" is not a clear-cut condition for a single application.
The second point is why you see the execution time increase gradually when there are more processes than CPUs. Any OS running mainline Python does preemptive multitasking of processes; roughly, this means a process does not block a CPU until it is done, but is paused regularly so that other processes can run.
In effect, this means that several processes can run on one CPU at once. Since the CPU can still only do a fixed amount of work per time, all processes take longer to complete.
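Below is a minimal sketch (mine, not part of the original post) that isolates the first point: the processes do no work at all, so whatever time elapses is purely the cost of spawning, scheduling, and joining them.

from multiprocessing import Process
from timeit import default_timer as timer

def noop():
    pass  # no actual work: any elapsed time is pure process-management overhead

if __name__ == "__main__":
    for n in range(1, 13):
        procs = [Process(target=noop) for _ in range(n)]
        start = timer()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(f'{n} processes: {timer() - start:.3f}s spent on spawn/join alone')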

I don't understand what you are trying to achieve.
You are taking the same work and running it X times, where X is the number of processes in your loop. You should instead be taking the work, dividing it by X, and sending one chunk to each process (see the sketch below).
Anyway, with regard to what you are observing: you are seeing the time it takes to spawn and tear down the separate processes. Python isn't quick at starting new processes.
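As a rough sketch of "divide the work by X" (my code, not the asker's): keep the total number of iterations fixed and give each process an equal slice, so adding processes does not multiply the total work.

from multiprocessing import Pool
from timeit import default_timer as timer
import random

def rand_val(iterations):
    for _ in range(iterations):
        random.random()

if __name__ == "__main__":
    total = 200_000_000
    for workers in range(1, 13):
        start = timer()
        with Pool(workers) as pool:
            # same total work every round, split into one equal slice per worker
            pool.map(rand_val, [total // workers] * workers)
        print(f'{workers} workers: {timer() - start:.2f}s')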

Your test is faulty.
Imagine it takes one day for one farmer to work 10 km^2 of farmland using a single tractor. Why would you expect two farmers working twice the farmland (20 km^2) with two tractors to take less time?
You have 6 CPU cores: your village has 6 tractors, and nobody has the money to buy a private one. As the number of workers (processes) in the village increases, the number of tractors stays the same, so everyone has to share the limited number of tractors.
In an ideal world, two farmers doing twice the work with two tractors would take exactly the same time as one farmer doing one portion of the work, but a real machine has other work to do even when it seems idle: there is task switching, the OS kernel has to run and monitor hardware devices, memory caches need to be flushed and invalidated between CPU cores, your browser needs to run, the village elder is holding a meeting to decide who gets the tractors and when, and so on.
As the number of workers grows beyond the number of tractors, farmers don't just hog the tractors for themselves. Instead they arrange to pass the tractors around every three hours or so, which means the seventh farmer doesn't have to wait two days for their share of tractor time. However, there is a cost to transferring tractors between farms, just as there is a cost for a CPU to switch between processes: switch too frequently and the CPU does little actual work, switch too infrequently and some jobs wait too long before they are even started (resource starvation).
A more sensible test would be to keep the size of farmland constant and just increase the number of farmers. In your code, that would correspond to this change:
def rand_val(num_workers):
    num = []
    for i in range(200000000 // num_workers):  # integer division so range() gets an int
        num = random.random()
    print('done')

def main():
    for iii in range(1, 15):  # start at 1 so the work is never divided by zero workers
        # pass the worker count via args; a lambda can't be pickled under the "spawn" start method
        processes = [Process(target=rand_val, args=(iii,)) for _ in range(iii)]
        ...

Related

How do I take advantage of parallel processing in gcloud?

Google Cloud offers a virtual machine instance with 96 cores.
I thought that with those 96 cores you could divide your program into 95 slices, leave one core to run the main program, and thus run the program 95 times faster.
It's not working out that way however.
I'm running a simple program in Python which simply counts to 20 million.
On my MacBook it takes 4.6 seconds to run this program in serial, and 1.2 seconds when I divide the work into 4 sections and run them in parallel.
My MacBook has the following specs:
macOS 10.14.5
2.2 GHz Intel Core i7 (Mid 2015)
16 GB 1600 MHz DDR3 memory
The machine that gcloud offers is basically just a tad slower.
When I run the program in serial it takes about the same amount of time. When I divide the program into 95 divisions it actually takes more time: 2.6 seconds. When I divide it into 4, it takes 1.4 seconds to complete. The machine has the following specs:
n1-highmem-96 (96 vCPUs, 624 GB memory)
I need the high memory because I'm going to index a corpus of 14 billion words. I'm going to save the locations of the 100,000 most frequent words, and then pickle those locations.
Pickling the files takes an enormous amount of time and will probably consume 90% of the total time.
For this reason I want to pickle the files as infrequently as possible, so if I can keep the locations in Python objects for as long as possible, I will save myself a lot of time.
Here is the python program I'm using just in case it helps:
import os, time

p = print
en = enumerate

def count_2_million(**kwargs):
    p('hey')
    start = kwargs.get("start")
    stop = kwargs.get("stop")
    fork_num = kwargs.get('fork_num')
    for x in range(start, stop):
        pass
    b = time.time()
    c = round(b - kwargs['time'], 3)
    p(f'done fork {fork_num} in {c} seconds')

def divide_range(divisions: int, total: int, idx: int, begin=0):
    sec = total // divisions
    start = (idx * sec) + begin
    if total % divisions != 0 and idx == divisions:
        stop = total
    else:
        stop = start + sec
    return start, stop

def main_fork(func, total, **kwargs):
    forks = 4
    p(f'{forks} forks')
    kwargs['time'] = time.time()
    pids = []
    for i in range(forks):
        start1, stop1 = 0, 0
        if total != -1:
            start1, stop1 = divide_range(4, total, i)
        newpid = os.fork()
        pids.append(newpid)
        kwargs['start'] = start1
        kwargs['stop'] = stop1
        kwargs['fork_num'] = i
        if newpid == 0:
            p(f'fork num {i} {start1} {stop1}')
            child(func, **kwargs)
    return pids

def child(func, **kwargs):
    func(**kwargs)
    os._exit(0)

main_fork(count_2_million, 200_000_000, **{})
In your specific use case, I think one of the solutions is to use clusters.
There are two main types of cluster-computing workloads; in your case, since you need to index 14 billion words, you will need High-throughput computing (HTC).
What is High-throughput Computing?
A type of computing where apps have multiple tasks that are processed independently of each other without a need for the individual compute nodes to communicate. Sometimes these workloads are called embarrassingly parallel or batch workloads
When I run the program in serial it takes the same amount of time. When I divide the program into 95 division it actually takes more time: 2.6 seconds. When I divide the program into 4, it takes 1.4 seconds to complete.
For this part you should check the section of the documentation on recommended architectures and best practices, so you can be sure your setup gets the most out of the job you want done.
There is open-source software such as ElastiCluster that provides cluster management and supports provisioning nodes on Google Compute Engine.
After the cluster is operational you can use HTCondor to analyze and manage the cluster nodes.

Why is this multiprocessing program running more slowly than its non-concurrent version?

I have made a program that sums a list by dividing it into sub-parts and using multiprocessing in Python. My code is the following:
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time

def dummyFun(l):
    s=0
    for i in range(0,len(l)):
        s=s+l[i]
    return s

def sumaSec(v):
    start=time.time()
    sT=0
    for k in range(0,len(v),10):
        vc=v[k:k+10]
        print ("vector ",vc)
        for item in vc:
            sT=sT+item
        print ("sequential sum result ",sT)
        sT=0
    start1=time.time()
    print ("sequential version time ",start1-start)

def main():
    workers=5
    vector=random.sample(range(1,101),100)
    print (vector)
    sumaSec(vector)
    dim=10
    sT=0
    for k in range(0,len(vector),dim):
        vc=vector[k:k+dim]
        print (vc)
        for item in vc:
            sT=sT+item
        print ("sub list result ",sT)
        sT=0
    chunks=(vector[k:k+dim] for k in range(0,len(vector),10))
    start=time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures=[executor.submit(dummyFun,chunk) for chunk in chunks]
        for future in as_completed(futures):
            print (future.result())
    start1=time.time()
    print (start1-start)

if __name__=="__main__":
    main()
The problem is that for the sequential version I got a time of:
0.0009753704071044922
while for the concurrent version my time is:
0.10629010200500488
And when I reduce the number of workers to 2 my time is:
0.08622884750366211
Why is this happening?
The length of your vector is only 100. That is a very small amount of work, so the fixed cost of starting the process pool is the most significant part of the runtime. For this reason parallelism is most beneficial when there is a lot of work to do. Try a larger vector, like a length of 1 million.
The second problem is that you have each worker do a tiny amount of work: a chunk of size 10. Again, that means the cost of starting a task cannot be amortized over so little work. Use larger chunks. For example, instead of 10 use int(len(vector)/(workers*10)).
Also note that you're creating 5 processes. For a CPU-bound task like this one you ideally want to use the same number of processes as you have physical CPU cores. Either use the number of cores your system has, or leave max_workers=None (the default value) and ProcessPoolExecutor will pick the number of CPUs your system reports. If you use too few processes you're leaving performance on the table; if you use too many, the CPU will have to switch between them and your performance may suffer. A sketch combining these suggestions follows.
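Here is a rough sketch of those suggestions put together (the numbers are mine, chosen only for illustration): a vector of one million values, chunks sized so each worker gets a handful of substantial tasks, and max_workers left at its default.

from concurrent.futures import ProcessPoolExecutor, as_completed
import os
import random
import time

def dummyFun(chunk):
    return sum(chunk)

if __name__ == "__main__":
    vector = [random.random() for _ in range(1_000_000)]   # far more work than 100 items
    workers = os.cpu_count()
    chunk_size = max(1, len(vector) // (workers * 10))      # roughly 10 sizeable tasks per worker
    chunks = [vector[k:k + chunk_size] for k in range(0, len(vector), chunk_size)]

    start = time.time()
    with ProcessPoolExecutor() as executor:                 # default max_workers
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        total = sum(f.result() for f in as_completed(futures))
    print(total, time.time() - start)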
Your chunking is pretty awful for creating multiple tasks.
Creating too many tasks still incurs a time penalty even when your workers are already created.
Maybe this post can help you in your search:
How to parallel sum a loop using multiprocessing in Python

Python most accurate method to measure time (ms)

I need to measure the time certain parts of my code take. While executing my code on a powerful server, I get 10 different results.
I tried comparing time measured with time.time(), time.perf_counter(), time.perf_counter_ns(), time.process_time() and time.process_time_ns().
import time

for _ in range(10):
    start = time.perf_counter()
    i = 0
    while i < 100000:
        i = i + 1
    time.sleep(1)
    end = time.perf_counter()
    print(end - start)
I'm expecting that, when executing the same code 10 times, the results would be practically identical (varying by no more than about 1 ms), e.g. 1.041xx every time, rather than ranging from 1.030 s to 1.046 s.
When executing my code on a 16-CPU, 32 GB memory server I'm receiving this result:
1.045549364
1.030857833
1.0466020120000001
1.0309665050000003
1.0464690349999994
1.046397238
1.0309525370000001
1.0312070380000007
1.0307592159999999
1.046095523
I'm expecting the result to be:
1.041549364
1.041857833
1.0416020120000001
1.0419665050000003
1.0414690349999994
1.041397238
1.0419525370000001
1.0412070380000007
1.0417592159999999
1.041095523
Your expectations are wrong. If you want to measure your code's average time consumption, use the timeit module. It executes your code multiple times and averages over the runs.
The reason your code has different runtimes lies in your code:
time.sleep(1) # ensures (3.5+) _at least_ 1000ms are waited, won't be less, might be more
You are calling it in a tight loop, resulting in accumulated differences:
Quote from time.sleep(..) documentation:
Suspend execution of the calling thread for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal’s catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
Changed in version 3.5: The function now sleeps at least secs even if the sleep is interrupted by a signal, except if the signal handler raises an exception (see PEP 475 for the rationale).
Emphasis mine.
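If the goal is a stable figure for the counting loop itself, a small timeit sketch (mine, not the answerer's) would look like this; timeit.repeat takes several independent measurements, and the minimum is usually the least-disturbed one.

import timeit

# measure only the counting loop, without the sleep, 10 separate times
times = timeit.repeat(
    stmt="i = 0\nwhile i < 100000:\n    i = i + 1",
    repeat=10,  # 10 independent measurements
    number=1,   # run the statement once per measurement
)
print(min(times))  # the least-disturbed measurement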
Running the same code does not take the same time on each loop iteration because of the system's scheduling (the system puts your process on hold to run another process, then comes back to it...).

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound rather than CPU bound? (And if so, what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking the work up and running 2-10 of them at once would be 2-10 times faster. Or am I potentially implementing multiprocessing incorrectly?
## Worker function just included to give some context
def worker(sub_code):
    layer = 'in_memory/lyr_{}'.format(sub_code)
    arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
    arcpy.env.extent = layer
    union_name = 'in_memory/union_' + sub_code
    arcpy.Union_analysis([fields],
                         union_name,
                         "NO_FID", "1 FEET")
    #.......Some calculations using cursors

    # Templating using Jinjah
    context = {}
    context['DATE'] = now.strftime("%B %d, %Y")
    context['SUB_CD'] = sub_code
    context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
    # Etc

    # Then write the report out using custom function
    write_html('template.html', 'output_folder', context)

if __name__ == '__main__':
    subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
    NUM_CORES = 7
    chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
    for chunk in chunk_list:
        jobs = []
        for subbasin in chunk:
            p = multiprocessing.Process(target=worker, args=(subbasin,))
            jobs.append(p)
            p.start()
        for process in jobs:
            process.join()
There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.
You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)
I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    jobs.append(p)
    p.start()
for process in jobs:
    process.join()
Should instead be:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    p.start()
    p.join()
Is there a specific reason you are going against the intended usage of the multiprocessing library? You are not waiting for each process to terminate before spinning up another one, which is just going to create a whole bunch of processes that are not managed by the parent calling process.

Can my multiple processes be better regulated so that they finish at (almost) the same time?

I have the following multiprocessing code
from multiprocessing import Pool
import itertools

pool = Pool(maxtasksperchild=20)
likelihoods = pool.map_async(do_comparison, itertools.combinations(clusters, 2)).get()
condensed_score_matrix = [1 / float(l) if l != 0 and l < 5 else 10 for l in likelihoods]
spectra_names = [c.get_names()[0] for c in clusters]
pool.close()
The problem with this code is that the different processes do not finish at the same time. I'm using eight processes. There can be 20-30+ minutes between the first process finishing and the last process finishing, with the last process running alone for a big part of that time. It would be much quicker if the remaining workload were redistributed to the processes that have already finished, so that all cores are used the whole time.
Is there a way to accomplish this?
The way workload is divided can be controlled with the chunksize parameter of map_async.
By omitting it you are currently using the default behavior which is roughly chunksize = num_tasks / (num_processes * 4), so on average each process will only receive 4 chunks.
You can start by setting the chunk size to 1 to validate that it properly distributes workload and then gradually increase it until you stop seeing a performance improvement.
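Here is a rough sketch of where chunksize goes (do_comparison and clusters are stand-ins for the objects in the question):

from multiprocessing import Pool
import itertools

def do_comparison(pair):
    a, b = pair
    return abs(a - b)  # stand-in for the real comparison

if __name__ == "__main__":
    clusters = list(range(100))  # stand-in for the real cluster objects
    pool = Pool(maxtasksperchild=20)
    pairs = itertools.combinations(clusters, 2)
    # chunksize=1 hands out one pair at a time, so a worker that finishes
    # early immediately picks up another pair instead of sitting idle;
    # increase it gradually if per-task dispatch overhead becomes noticeable.
    likelihoods = pool.map_async(do_comparison, pairs, chunksize=1).get()
    pool.close()
    pool.join()
    print(len(likelihoods))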
You can try to experiment with .imap_unordered using different chunksize values.
More here.
