Can one efficiently thread short CPU tasks in python?

I am trying to streamline a program that involves a set of short tasks that can be done in parallel, where the results of the set of tasks must be compared before moving on to the next step (which again involves a set of short tasks, and then another set, etc.). Because each task is so short, multiprocessing is not worthwhile: the set-up time outweighs the gain. I am wondering if there is another way to do these short tasks in parallel that is faster than running them sequentially. The only question I can find on this site that describes this problem for Python references this answer on memory sharing, which I don't think answers my question (or if it does, I could not follow how).
To illustrate what I am hoping to do, consider the problem of summing a bunch of numbers from 0 to N. (Of course this can be solved analytically, my point is to come up with a low-memory but short CPU-intensive task.) First, the linear approach would simply be:
def numbers(a, b):
    return (i for i in range(a, b))

def linear_sum(a):
    return sum(numbers(a[0], a[1]))

n = 2000
linear_sum([0, n+1])
# 2001000
For threading, I want to break the problem into parts that can then be summed separately and then combined, so the idea would be to get a bunch of ranges over which to sum with something like
import numpy as np

def get_ranges(i, Nprocess=3):
    di = i // Nprocess
    j = np.append(np.arange(0, i, di), [i + 1])
    return [(j[k], j[k + 1]) for k in range(len(j) - 1)]
and for some value n >> Nprocess the pseudocode would be something like
values = get_ranges(n)
x = []
for value in values:
    x.append(do_something_parallel(value))
total = sum(x)
The question, then, is how to implement do_something_parallel. For multiprocessing, we can do something like:
from multiprocessing import Pool as ThreadPool

def mpc_thread_sum(i, Nprocess=3):
    values = get_ranges(i)
    pool = ThreadPool(Nprocess)
    results = pool.map(linear_sum, values)
    pool.close()
    pool.join()
    return sum(results)

print(mpc_thread_sum(2000))
# 2001000
The graph below shows the performance of the different approaches described. Is there a way to speed up computations in the region where multiprocessing is still slower than linear, or is this the limit of parallelization under Python's GIL? I suspect the answer may be that I am hitting that limit, but I wanted to ask here to be sure. I tried multiprocessing.dummy, asyncio, threading, and ThreadPoolExecutor (from concurrent.futures). For brevity, I have omitted the code, but all show execution times comparable to the linear approach. All are designed for I/O-bound tasks, so they are constrained by the GIL.

My first observation is that the running time of function numbers can be cut roughly in half by simply defining it as:
def numbers(a, b):
    return range(a, b)
Second, a task that is 100% CPU-intensive, as computing a sum of numbers is, can never run significantly faster with multithreading in pure Python, without the aid of a C-language runtime library (such as numpy), because of contention for the Global Interpreter Lock (GIL), which prevents threads from executing Python bytecode in parallel (and asyncio only uses a single thread to begin with).
Third, the only way you can achieve a performance improvement running pure Python code against a 100% CPU task is with multiprocessing. But there is CPU overhead in creating the process pool, overhead in passing arguments from the main process to the address space in which the pool's processes are running, and overhead again in passing back the results. So for there to be any performance improvement, the worker function, linear_sum, cannot be trivial; it must require enough CPU processing to warrant the additional overhead just mentioned.
The following benchmark runs the worker function, renamed to compute_sum and which now accepts as its argument a range. To further reduce overhead, I have introduced a function split that will take the passed range argument and generate multiple range instances removing the need to use numpy and to generate arrays. The benchmark computes the sum using a single thread (linear), a multithreading pool and a multiprocessing pool and is run twice for n = 2000 and n = 50_000_000. The benchmark displays the elapsed time and total CPU time across all processes.
For n = 2000, multiprocessing, as expected, performs worse than both linear and multithreading. For n = 50_000_000, multiprocessing's total CPU time is a bit higher than for linear and multithreading as is expected due to the additional aforementioned overhead. But now the elapsed time has gone down significantly. For both values of n, multithreading is a loser.
from multiprocessing.pool import Pool, ThreadPool
import time

def split(iterable, n):
    k, m = divmod(len(iterable), n)
    return (iterable[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def compute_sum(r):
    t = time.process_time()
    return (sum(r), time.process_time() - t)

if __name__ == '__main__':
    for n in (2000, 50_000_000):
        r = range(0, n+1)

        t1 = time.time()
        s, cpu = compute_sum(r)
        elapsed = time.time() - t1
        print(f'n = {n}, linear elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')

        t1 = time.time()
        t2 = time.process_time()
        thread_pool = ThreadPool(4)
        s = 0
        for return_value, process_time in thread_pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
        elapsed = time.time() - t1
        cpu = time.process_time() - t2
        print(f'n = {n}, thread pool elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        thread_pool.close()
        thread_pool.join()

        t1 = time.time()
        t2 = time.process_time()
        pool = Pool(4)
        s = 0
        cpu = 0
        for return_value, process_time in pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
            cpu += process_time
        elapsed = time.time() - t1
        cpu += time.process_time() - t2
        print(f'n = {n}, multiprocessing elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        pool.close()
        pool.join()
        print()
Prints:
n = 2000, linear elapsed time = 0.0, total cpu time = 0.0, sum = 2001000
n = 2000, thread pool elapsed time = 0.00700068473815918, total cpu time = 0.015625, sum = 2001000
n = 2000, multiprocessing elapsed time = 0.13200139999389648, total cpu time = 0.015625, sum = 2001000
n = 50000000, linear elapsed time = 2.0311124324798584, total cpu time = 2.03125, sum = 1250000025000000
n = 50000000, thread pool elapsed time = 2.050999164581299, total cpu time = 2.046875, sum = 1250000025000000
n = 50000000, multiprocessing elapsed time = 0.7579991817474365, total cpu time = 2.359375, sum = 1250000025000000
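The split helper used in the benchmark can be checked in isolation. The following is a minimal sketch that repeats the helper as defined above; the values n = 2000 and 4 chunks are simply the ones from the benchmark. Slicing a range returns another range, so the chunks are never materialized as lists.

```python
def split(iterable, n):
    """Return n contiguous slices of a sized, sliceable sequence (same helper as in the benchmark)."""
    k, m = divmod(len(iterable), n)
    return (iterable[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

r = range(0, 2001)
parts = list(split(r, 4))

# Each chunk is itself a range object, so passing it to a worker is cheap.
print(parts)

# The partial sums add up to the sum over the whole range.
print(sum(sum(p) for p in parts) == sum(r))
```

Because the first m chunks receive one extra element, the pieces always cover the input exactly once, even when its length is not divisible by n.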

Related

Can I speed up performance by applying functions to an item in a data object with multiprocessing?

Disclaimer: I have gone through loads of multiprocessing answers on SO and also the documentation, and either the questions were really old (Python 3.x has made tons of improvements since) or I did not find a clear answer. If I have missed something relevant, do point me in the right direction.
I started with a simple function, defined as below in my modules folder, since I am running from a Jupyter Notebook and it seems that, due to conflicts, you can only run multiprocessing on an imported function:
def f(a):
    return a * 100
Built some test data and ran some tests:
from itertools import zip_longest
from multiprocessing import Process, Pool, Array, Queue
from time import time
from modules.test import *
li = [i for i in range(1000000)]
List comprehension: Really Fast
start = time()
tests = [f(i) for i in li]
print(f'Total time {time() - start} s')
>> Total time 0.154066801071167 s
Answer of an SO example here: 11 seconds or so
start = time()
results = []
if __name__ == '__main__':
    jobs = 4
    size = len(li)
    heads = list(range(size//jobs, size, size//jobs)) + [size]
    tails = range(0,size,size//jobs)
    pool = Pool(4)
    for tail,head in zip(tails, heads):
        r = pool.apply_async(f, args=(li[tail:head],))
        results.append(r)
    pool.close()
    pool.join() # wait for the pool to be done
print(f'Total time {time() - start} s')
>>Total time 11.087551593780518 s
And there is Process, which I do not know whether it is applicable to the example above. I am unfamiliar with multiprocessing, but I do understand that there is some overhead in creating new instances and so on, and that as the data grows larger the gain should justify the overhead.
My question is: with the current performance of Python 3.x, is multiprocessing still relevant for operations similar to the above, or something one should even attempt? And if it is, how can it be applied to parallelize the workload?
Most of the examples I have read and understood are for web scraping, where a process sits idle while receiving information and it makes sense to parallelize. But how would one approach it when running computations over something like a list or dictionary?

The reason your example is not performing well is that you are doing two totally different things.
In your list comprehension, you are mapping f onto each element of li.
In the second case, you are splitting your li list into jobs chunks and then applying your function once to each of those chunks, i.e. jobs times in total. And now, in f, a * 100 takes a chunk about a quarter the size of your original list and multiplies it by 100, i.e., it uses the sequence-repetition operator, and so creates a new list 100 times the size of the chunk:
>>> chunk = [1,2,3]
>>> chunk * 10
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>>
So basically, you are comparing apples to oranges.
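To make the difference concrete, here is a small sketch contrasting what f returns for a whole chunk with what the list comprehension computes per element. The three-element chunk is an arbitrary illustration, not data from the question.

```python
def f(a):
    return a * 100

chunk = [1, 2, 3]

# Passing the whole chunk: list * int repeats the list, it does not scale the elements.
repeated = f(chunk)
print(len(repeated))

# Mapping f over the elements, which is what the list comprehension does.
scaled = [f(x) for x in chunk]
print(scaled)
```

In the pool version, each worker therefore has to build and send back a list a hundred times larger than its input, which is where most of the 11 seconds is likely to go.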
However, multiprocessing already comes with an out-of-the box mapping utility. Here is a better comparison, a script called foo.py:
import time
import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
Now, here are some actual performance results:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 0.14193987846374512 seconds
Pool.map took 0.2513458728790283 seconds
(py37) Juans-MBP:test_mp juan$
For this very trivial function, the cost of the inter-process communication will always be higher than the cost of calculating the function serially. So you won't see any gains from multiprocessing. However, a much less trivial function can see gains from multiprocessing.
Here's a trivial example, I simply sleep for a microsecond before multiplying:
import time
import multiprocessing as mp

def f(x):
    time.sleep(0.000001)
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
And now, you see gains commensurate with the number of processes:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 13.175776720046997 seconds
Pool.map took 3.1484851837158203 seconds
Note, on my machine, a single multiplication takes orders of magnitude less time than a microsecond (about 10 nanoseconds):
>>> import timeit
>>> timeit.timeit('100*3', number=int(1e6))*1e-6
1.1292944999993892e-08
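As a hedged aside: CPython folds the constant expression 100*3 at compile time, so the statement above mostly times the loop overhead rather than a multiplication. A minimal sketch that times the actual call to f instead, with f defined as in the first script; the repeat count of one million mirrors the list size and is otherwise arbitrary.

```python
import timeit

def f(x):
    return x * 100

number = 1_000_000

# Pass f and x through globals so timeit calls the real function on a variable argument.
total = timeit.timeit('f(x)', globals={'f': f, 'x': 3}, number=number)
per_call = total / number
print(f"about {per_call * 1e9:.0f} ns per call")
```

Whatever the exact figure on a given machine, comparing it with the per-item cost of sending data between processes is what decides whether a pool can pay off.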

Python dynamic control of the number of processes in a multiprocessing script according to the amount of free RAM and of a parameter from the function

I've got an unusual question for python. I'm using the multiprocessing library to map a function f((dynamic1, dynamic2), fix1, fix2).
import multiprocessing as mp

fix1 = 4
fix2 = 6
# Number of cores to use
N = 6
dynamic_duos = [(a, b) for a in range(5) for b in range(10)]
with mp.Pool(processes=N) as p:
    p.starmap(f, [(dyn, fix1, fix2) for dyn in dynamic_duos])
I would like to dynamically control the number of active processes, because the function sometimes consumes a LOT of RAM. The idea would be to check at every iteration (i.e. before any call of the function f) whether sum(dyn) is below a threshold and whether the amount of free RAM is above a threshold. If both conditions are met, a new process can start and compute the function.
An additional condition would be the maximum number of processes: the number of cores on the PC.
Thanks for the help :)
Edit: Details on the reasons.
Some of the combinations of parameters will have a high RAM consumption (up to 80 GB in one process). I know more or less which ones will use a lot of RAM, and when the program encounters one of them, I would like to wait for the other processes to end, run this high-RAM combination in a single process, and then resume the computation with more processes on the rest of the combinations to map.
Edit: my attempt based on the answer below:
It doesn't work, but it doesn't raise an error. It just completes the program.
# Imports
import itertools
import concurrent.futures
import os
import time

# Parameters
N = int(input("Number of CPUs to use: "))
t0 = 0
tf = 200
s_step = 0.05
max_s = None
folder = "test"
possible_dynamics = [My_class(x) for x in [20, 30, 40, 50, 60]]
dynamics_to_compute = [list(x) for x in itertools.combinations_with_replacement(possible_dynamics , 2)] + [list(x) for x in itertools.combinations_with_replacement(possible_dynamics , 3)]
function_inputs = [(dyn , t0, tf, s_step, max_s, folder) for dyn in dynamics_to_compute]

# -----------
# Computation
# -----------
start = time.time()

# Pool creation and computation
futures = []
pool = concurrent.futures.ProcessPoolExecutor(max_workers = N)
for Obj, t0, tf, s_step, max_s, folder in function_inputs:
    if large_memory(Obj, s_step, max_s):
        concurrent.futures.wait(futures) # wait for all pending tasks
        large_future = pool.submit(compute, Obj, t0, tf,
                                   s_step, max_s, folder)
        large_future.result() # wait for large computation to finish
    else:
        future = pool.submit(compute, Obj, t0, tf,
                             s_step, max_s, folder)
        futures.append(future)

end = time.time()
if round(end-start, 3) < 60:
    print ("Complete - Elapsed time: {} s".format(round(end-start,3)))
else:
    print ("Complete - Elapsed time: {} mn and {} s".format(int((end-start)//60), round((end-start)%60,3)))
os.system("pause")
This is still a simplified example of my code, but the idea is here. It runs in less than 0.2 s, which means it never actually called the function compute.
N.B: Obj is not the actual variable name.

To achieve this, you need to give up on map in order to gain more control over the execution flow of your tasks.
This code implements the algorithm you described at the end of your question. I'd recommend using the concurrent.futures library, as it exposes a neater set of APIs.
import concurrent.futures

pool = concurrent.futures.ProcessPoolExecutor(max_workers=6)
futures = []
for dyn in dynamic_duos:
    if large_memory(dyn, fix1, fix2):
        concurrent.futures.wait(futures)  # wait for all pending tasks
        large_future = pool.submit(f, dyn, fix1, fix2)
        large_future.result()  # wait for large computation to finish
    else:
        future = pool.submit(f, dyn, fix1, fix2)
        futures.append(future)
concurrent.futures.wait(futures)  # wait for the remaining small tasks
pool.shutdown()
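The same pattern can be sketched as a self-contained script. This is an illustration under stated assumptions, not the asker's real workload: compute and large_memory are hypothetical stand-ins, run_gated is a helper name of my own, and a ThreadPoolExecutor is used only so the sketch runs anywhere without a __main__ guard. For real CPU- or memory-heavy work, a ProcessPoolExecutor created under if __name__ == '__main__': can be passed in instead, since both executors share the submit interface.

```python
import concurrent.futures

def run_gated(executor, tasks, fn, is_large):
    """Submit tasks in order; a task flagged by is_large runs alone in the pool."""
    pending = []
    results = []
    for task in tasks:
        if is_large(task):
            # Drain everything already submitted, then run the large task by itself.
            concurrent.futures.wait(pending)
            results.extend(fut.result() for fut in pending)
            pending = []
            results.append(executor.submit(fn, task).result())
        else:
            pending.append(executor.submit(fn, task))
    # Collect the small tasks still in flight at the end.
    results.extend(fut.result() for fut in pending)
    return results

def compute(pair):
    # Hypothetical stand-in for the real function f.
    return sum(pair)

def large_memory(pair):
    # Hypothetical predicate: treat pairs with a big sum as memory-hungry.
    return sum(pair) >= 10

dynamic_duos = [(a, b) for a in range(5) for b in range(10)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = run_gated(executor, dynamic_duos, compute, large_memory)
print(len(results), sum(results))
```

Collecting the pending results before and after each large task, and once more at the end, is what keeps the script from finishing its timing before the small tasks have actually run.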

refactoring to calculate running time of sorting algorithm - python

I wrote this to calculate the average running time of sorting algorithms,
and I was just curious if there was a way to refactor this to something simpler or better.
time = []
for i in range(3):
    start = timeit.default_timer()
    insert_list = []
    for i in range(3000):
        insert_list.append(randint(0,5000))
    sorted_list = merge_sort(insert_list)
    stop = timeit.default_timer()
    time.append(stop - start)
print(sum(time) / len(time))

Try using datetime to measure the running time of your algorithm.
datetime.datetime has a microsecond attribute that can be used if you choose to use datetime.datetime.now()
from datetime import datetime
startTime = datetime.now()
#CODE
print("Time taken:",datetime.now() - startTime)

First, you have to move the for i in range(3000) loop outside of the time measurement. That loop is NOT sorting, so you are actually measuring the population of the dataset as well. And because the data is random, the result also depends on the speed of the random number source; if it draws from the OS entropy pool (e.g. /dev/random, /dev/urandom, or the like, as random.SystemRandom does), it can be very slow in some configurations (e.g., VMs on a shared host or in the cloud). None of this has anything to do with the speed of the sorting algorithm.
time = []
for i in range(3):
    insert_list = []
    for i in range(3000):
        insert_list.append(randint(0,5000))
    start = timeit.default_timer()
    sorted_list = merge_sort(insert_list)
    stop = timeit.default_timer()
    time.append(stop - start)
print(sum(time) / len(time))
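Second, and less important: a wall-clock timer (such as time.time() or datetime.now(), and timeit.default_timer on Python 2 on Unix-like systems) can give unexpected results when the clock changes, e.g. through daylight saving shifts or NTP time adjustments. It is better to use monotonic.monotonic(), which uses the OS's source of monotonic time if possible. Note that this is an external library, not a builtin; on Python 3.3 and later, time.monotonic() and time.perf_counter() are in the standard library, and timeit.default_timer is already time.perf_counter().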
time = []
for i in range(3):
    insert_list = []
    for i in range(3000):
        insert_list.append(randint(0,5000))
    start = monotonic.monotonic()
    sorted_list = merge_sort(insert_list)
    stop = monotonic.monotonic()
    time.append(stop - start)
print(sum(time) / len(time))
Third, measuring each call separately makes the result sensitive to external circumstances. For example, a fast algorithm on a small dataset may finish within the resolution of the clock, so the measurement gets rounded. Instead, make N sorting calls and measure the time of the whole loop, then divide the total time by the number of calls. This comes at the cost of memory, since you have to prepare all N arrays in advance.
N = 3
dataset = []
for i in range(N):
    insert_list = []
    for i in range(3000):
        insert_list.append(randint(0,5000))
    dataset.append(insert_list)
start = monotonic.monotonic()
for insert_list in dataset:
    sorted_list = merge_sort(insert_list)
stop = monotonic.monotonic()
print((stop - start) / N)
Fourth, why not use the timeit.timeit() function? It returns the total time for all N calls, so divide by N to get the average:
N = 3
dataset = [[randint(0, 5000) for j in range(3000)] for i in range(N)]
print(timeit.timeit(lambda: merge_sort(dataset.pop()), number=N) / N)
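Putting these points together, a minimal Python 3 sketch might look like the following. The helper names make_data and average_sort_time are my own, and the built-in sorted() stands in for merge_sort, which is not shown in the question.

```python
import timeit
from random import randint

def make_data(size=3000, upper=5000):
    """Build one random input list; this happens outside the timed region."""
    return [randint(0, upper) for _ in range(size)]

def average_sort_time(sort_fn, runs=3, size=3000):
    """Average seconds per call of sort_fn over `runs` fresh random lists."""
    datasets = [make_data(size) for _ in range(runs)]
    it = iter(datasets)
    # timeit.timeit returns the total time for all runs, so divide by the count.
    total = timeit.timeit(lambda: sort_fn(next(it)), number=runs)
    return total / runs

# sorted() is a stand-in here; pass merge_sort to time the real implementation.
avg = average_sort_time(sorted)
print(f"average: {avg:.6f} s per sort")
```

Each call sorts a fresh unsorted list, which matters for in-place or adaptive algorithms that would otherwise be timed on already sorted input after the first run.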

Is IPC slowing this down?

I understand that there is overhead when using the Multiprocessing module, but this seems to be a high amount and the level of IPC should be fairly low from what I can gather.
Say I generate a large-ish list of random numbers between 1-1000 and want to obtain a list of only the prime numbers. This code is only meant to test multiprocessing on CPU-intensive tasks. Ignore the overall inefficiency of the primality test.
The bulk of the code may look something like this:
from random import SystemRandom
from math import sqrt
from timeit import default_timer as time
from multiprocessing import Pool, Process, Manager, cpu_count

rdev = SystemRandom()
NUM_CNT = 0x5000
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
primes = []

def chunk(l, n):
    i = int(len(l)/float(n))
    for j in range(0, n-1):
        yield l[j*i:j*i+i]
    yield l[n*i-i:]

def is_prime(n):
    if n <= 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
It seems to me that I should be able to split this up among multiple processes. I have 8 logical cores, so I should be able to use cpu_count() as the # of processes.
Serial:
def serial():
    global primes
    primes = []
    for num in nums:
        if is_prime(num):
            primes.append(num) # primes now contain all the values
The following sizes of NUM_CNT correspond to the speed:
0x500 = 0.00100 sec.
0x5000 = 0.01723 sec.
0x50000 = 0.27573 sec.
0x500000 = 4.31746 sec.
This was the way I chose to do the multiprocessing. It uses the chunk() function to split up nums into cpu_count() (roughly equal) parts. It passes each chunk into a new process, which iterates through them, and then assigns it to an entry of a shared dict variable. The IPC should really occur when I assign the value to the shared variable. Why would it occur otherwise?
def loop(ret, id, numbers):
    l_primes = []
    for num in numbers:
        if is_prime(num):
            l_primes.append(num)
    ret[id] = l_primes

def parallel():
    man = Manager()
    ret = man.dict()
    num_procs = cpu_count()
    procs = []
    for i, l in enumerate(chunk(nums, num_procs)):
        p = Process(target=loop, args=(ret, i, l))
        p.daemon = True
        p.start()
        procs.append(p)
    [proc.join() for proc in procs]
    return sum(ret.values(), [])
Again, I expect some overhead, but the time seems to grow much faster than for the serial version.
0x500 = 0.37199 sec.
0x5000 = 0.91906 sec.
0x50000 = 8.38845 sec.
0x500000 = 119.37617 sec.
What is causing it to do this? Is it IPC? The initial setup makes me expect some overhead, but this is just an insane amount.
Edit:
Here's how I'm timing the execution of the functions:
if __name__ == '__main__':
    print(hex(NUM_CNT))
    for func in (serial, parallel):
        t1 = time()
        vals = func()
        t2 = time()
        if vals is None: # serial has no return value
            print(len(primes))
        else: # but parallel does
            print(len(vals))
        print("Took {:.05f} sec.".format(t2 - t1))
The same list of numbers is used each time.
Example output:
0x5000
3442
Took 0.01828 sec.
3442
Took 0.93016 sec.

Hmm. How do you measure time? On my computer, the parallel version is much faster than the serial one.
I'm measuring with time.time() as follows, where tt is an alias for time.time:
t1 = int(round(tt() * 1000))
serial()
t2 = int(round(tt() * 1000))
print(t2 - t1)
parallel()
t3 = int(round(tt() * 1000))
print(t3-t2)
I get, with 0x500000 as input:
5519ms for the serial version
3351ms for the parallel version
I believe that your mistake is caused by including the number generation in the timing of the parallel version, but not in the timing of the serial one.
On my computer, generating the random numbers takes about 45 seconds (it's a very slow process). That could explain the difference between your two values, as I don't think my computer uses a very different architecture.
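One way to check this hypothesis is to time the two phases separately. The following is a minimal serial sketch that reuses is_prime and the SystemRandom generator from the question; the timer alias and the variables t0, t1 and t2 are my own naming.

```python
from math import sqrt
from random import SystemRandom
from timeit import default_timer as timer

def is_prime(n):
    # Same deliberately naive test as in the question (it treats 0, 1 and 2 as prime).
    if n <= 2:
        return True
    if not n % 2:
        return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

NUM_CNT = 0x5000  # small count so the sketch runs quickly

# Phase 1: number generation from the OS entropy source.
t0 = timer()
rdev = SystemRandom()
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
t1 = timer()

# Phase 2: the primality filter on its own.
primes = [n for n in nums if is_prime(n)]
t2 = timer()

print(f"generation: {t1 - t0:.5f} s, primality test: {t2 - t1:.5f} s")
```

If multiprocessing starts its workers by re-importing the module (the spawn start method, which is the default on Windows), module-level statements such as the nums generation run again in every child, which would put that cost inside the parallel timing. Moving the generation under the if __name__ == '__main__': guard avoids that.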

firing a sequence of parallel tasks

For this dask code:
def inc(x):
    return x + 1

for x in range(5):
    array[x] = delayed(inc)(x)
I want to access all the elements in array by executing the delayed tasks. But I can't call array.compute() since array is not a function. If I do
for x in range(5):
    array[x].compute()
then does each task get executed in parallel, or does array[1] get fired only after array[0] terminates? Is there a better way to write this code?

You can use the dask.compute function to compute many delayed values at once:
from dask import delayed, compute
array = [delayed(inc)(i) for i in range(5)]
result = compute(*array)

It's easy to tell if things are executing in parallel if you force them to take a long time. If you run this code:
from time import sleep, time
from dask import delayed

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %s]' % (x, time() - start))
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)
for x in range(5):
    array[x].compute()
It becomes very obvious that the calls happen in sequence. However if you replace the last loop with this:
delayed(array).compute()
then you can see that they are in parallel. On my machine the output looks like this:
[inc(1): 1.00373506546]
[inc(4): 1.00429320335]
[inc(2): 1.00471806526]
[inc(3): 1.00475406647]
[inc(0): 2.00795912743]
Clearly the first four tasks it executed were in parallel. Presumably the default parallelism is set to the number of cores on the machine, because for CPU intensive tasks it's not generally useful to have more.
