The multiprocessing module is quite confusing for Python beginners, especially for those who have just migrated from MATLAB and been made lazy by its parallel computing toolbox. I have the following code, which takes ~80 seconds to run, and I want to shorten this time by using Python's multiprocessing module.
from time import time

xmax = 100000000
start = time()
for x in range(xmax):
    y = ((x+5)**2+x-40)
    if y <= 0xf+1:
        print('Condition met at: ', y, x)
end = time()
tt = end - start  # total time
print('Each iteration took: ', tt/xmax)
print('Total time: ', tt)
This outputs as expected:
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 8.667453265190124e-07
Total time: 86.67453265190125
Since no iteration of the loop depends on the others, I tried to adapt this Server Process example from the official documentation to scan chunks of the range in separate processes. Eventually, with the help of vartec's answer to this question, I prepared the following code. I also updated the code based on Darkonaut's response to the current question.
from time import time
import multiprocessing as mp

def chunker(rng, t):  # this function makes t chunks out of rng
    L = rng[1] - rng[0]
    Lr = L % t
    Lm = L // t
    h = rng[0] - 1
    chunks = []
    for i in range(0, t):
        c = [h+1, h + Lm]
        h += Lm
        chunks.append(c)
    chunks[t-1][1] += Lr + 1
    return chunks

def worker(lock, xrange, return_dict):
    '''worker function'''
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            return_dict['x'].append(x)
            return_dict['y'].append(y)
            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y

if __name__ == '__main__':
    start = time()
    manager = mp.Manager()
    return_dict = manager.dict()
    lock = manager.Lock()
    return_dict['x'] = manager.list()
    return_dict['y'] = manager.list()
    xmax = 100000000
    nw = mp.cpu_count()
    workers = list(range(0, nw))
    chunks = chunker([0, xmax], nw)
    jobs = []
    for i in workers:
        p = mp.Process(target=worker, args=(lock, chunks[i], return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    end = time()
    tt = end - start  # total time
    print('Each iteration took: ', tt/xmax)
    print('Total time: ', tt)
    print(return_dict['x'])
    print(return_dict['y'])
This considerably reduces the run time to ~17 seconds. But my shared variable cannot retrieve any values. Please help me find out which part of the code is going wrong.
The output I get is:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[]
[]
whereas I expect:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[0, 1, 2]
[-15, -3, 11]
The issue in your example is that modifications to standard mutable structures within Manager.dict will not be propagated. I'll first show you how to fix it with Manager, and then show you better options afterwards.
multiprocessing.Manager is a bit heavy, since it uses a separate process just for the Manager, and working on a shared object requires locks for data consistency. If you run this on one machine, there are better options with multiprocessing.Pool, as long as you don't have to run customized Process classes; if you do, multiprocessing.Process together with multiprocessing.Queue would be the common way of doing it.
The quoted parts are from the multiprocessing docs.
Manager
If standard (non-proxy) list or dict objects are contained in a referent, modifications to those mutable values will not be propagated through the manager because the proxy has no way of knowing when the values contained within are modified. However, storing a value in a container proxy (which triggers a __setitem__ on the proxy object) does propagate through the manager and so to effectively modify such an item, one could re-assign the modified value to the container proxy...
In your case this would look like:
def worker(xrange, return_dict, lock):
    """worker function"""
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y
The lock here would be a manager.Lock instance you have to pass along as an argument, since the whole (now) locked operation is not by itself atomic. (Here is an easier example with Manager using Lock.)
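For reference, here is a minimal, self-contained sketch of that pattern (a stand-in for the linked example, not taken from it): several processes increment a shared counter, and the non-atomic read-modify-write is guarded by a manager.Lock.

import multiprocessing as mp

def add_one(lock, shared):
    # the read-modify-write on the proxy is not atomic, so guard it with the lock
    for _ in range(1000):
        with lock:
            shared['count'] = shared['count'] + 1

if __name__ == '__main__':
    with mp.Manager() as manager:
        lock = manager.Lock()
        shared = manager.dict({'count': 0})
        procs = [mp.Process(target=add_one, args=(lock, shared)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(shared['count'])  # 4000 with the lock; typically less without it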
This approach is perhaps less convenient than using nested proxy objects for most use cases, but it also demonstrates a level of control over the synchronization.
Since Python 3.6 proxy objects are nestable:
Changed in version 3.6: Shared objects are capable of being nested. For example, a shared container object such as a shared list can contain other shared objects which will all be managed and synchronized by the SyncManager.
Since Python 3.6 you can fill your manager.dict before starting multiprocessing with manager.list as values and then append directly in the worker without having to reassign.
return_dict['x'] = manager.list()
return_dict['y'] = manager.list()
EDIT:
Here is the full example with Manager:
import time
import multiprocessing as mp
from multiprocessing import Manager, Process
from contextlib import contextmanager
# mp_utils.py from first link in code-snippet for "Pool"
# section below
from mp_utils import calc_batch_sizes, build_batch_ranges

# def context_timer ... see code snippet in "Pool" section below

def worker(batch_range, return_dict, lock):
    """worker function"""
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                return_dict['x'].append(x)
                return_dict['y'].append(y)

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with Manager() as manager:
        lock = manager.Lock()
        return_dict = manager.dict()
        return_dict['x'] = manager.list()
        return_dict['y'] = manager.list()

        tasks = [(batch_range, return_dict, lock)
                 for batch_range in batch_ranges]

        with context_timer():
            pool = [Process(target=worker, args=args)
                    for args in tasks]
            for p in pool:
                p.start()
            for p in pool:
                p.join()

        # Create standard container with data from manager before exiting
        # the manager.
        result = {k: list(v) for k, v in return_dict.items()}

    print(result)
Pool
Most often a multiprocessing.Pool will just do it. You have an additional challenge in your example since you want to distribute iteration over a range.
Your chunker function doesn't manage to divide the range evenly, so the processes don't all get about the same amount of work to do:
chunker((0, 21), 4)
# Out: [[0, 4], [5, 9], [10, 14], [15, 21]] # 4, 4, 4, 6!
For the code below, please grab the code snippet for mp_utils.py from my answer here; it provides two functions to chunk ranges as evenly as possible.
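In case that link is unavailable, here is a rough sketch of what those two helpers could look like, inferred purely from how they are used below (the linked original may differ in its details):

# mp_utils.py (sketch, not the original)
def calc_batch_sizes(n_items, n_workers):
    """Split n_items into n_workers batch sizes that differ by at most 1."""
    quotient, remainder = divmod(n_items, n_workers)
    return [quotient + (1 if i < remainder else 0) for i in range(n_workers)]

def build_batch_ranges(batch_sizes):
    """Turn the batch sizes into consecutive range objects."""
    ranges, start = [], 0
    for size in batch_sizes:
        ranges.append(range(start, start + size))
        start += size
    return ranges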
With multiprocessing.Pool, your worker function just has to return the result, and Pool will take care of transporting it back over internal queues to the parent process. The result will be a list, so you will have to rearrange it into the form you want. Your example could then look like this:
import time
import multiprocessing as mp
from multiprocessing import Pool
from contextlib import contextmanager
from itertools import chain

from mp_utils import calc_batch_sizes, build_batch_ranges

@contextmanager
def context_timer():
    start_time = time.perf_counter()
    yield
    end_time = time.perf_counter()
    total_time = end_time - start_time
    print(f'\nEach iteration took: {total_time / X_MAX:.4f} s')
    print(f'Total time: {total_time:.4f} s\n')

def worker(batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    return result

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with context_timer():
        with Pool(N_WORKERS) as pool:
            results = pool.map(worker, iterable=batch_ranges)

    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results))  # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
Example Output:
[range(0, 12500000), range(12500000, 25000000), range(25000000, 37500000),
range(37500000, 50000000), range(50000000, 62500000), range(62500000, 75000000), range(75000000, 87500000), range(87500000, 100000000)]
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 0.0000 s
Total time: 8.2408 s
results: [[(0, -15), (1, -3), (2, 11)], [], [], [], [], [], [], []]
results sorted: x: (0, 1, 2), y: (-15, -3, 11)
Process finished with exit code 0
If you had multiple arguments for your worker, you would build a "tasks"-list with argument-tuples and replace pool.map(...) with pool.starmap(...iterable=tasks). See the docs for further details on that.
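As a small sketch of that (the two-argument worker and its threshold parameter are made up for this illustration; batch_ranges and N_WORKERS are as in the example above):

def worker(batch_range, threshold):  # hypothetical two-argument worker
    return [(x, (x + 5)**2 + x - 40) for x in batch_range
            if (x + 5)**2 + x - 40 <= threshold]

if __name__ == '__main__':
    tasks = [(batch_range, 0xf + 1) for batch_range in batch_ranges]
    with Pool(N_WORKERS) as pool:
        results = pool.starmap(worker, iterable=tasks)  # one argument-tuple unpacked per call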
Process & Queue
If you can't use multiprocessing.Pool for some reason, you have to take care of inter-process communication (IPC) yourself, by passing a multiprocessing.Queue as an argument to your worker functions in the child processes and letting them enqueue their results to be sent back to the parent.
You will also have to build your Pool-like structure so you can iterate over it to start and join the processes, and you have to get() the results back from the queue. I've written up more about Queue.get usage here.
A solution with this approach could look like this:
def worker(result_queue, batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    result_queue.put(result)  # <--

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    result_queue = mp.Queue()  # <--
    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with context_timer():
        pool = [Process(target=worker, args=(result_queue, batch_range))
                for batch_range in batch_ranges]
        for p in pool:
            p.start()
        results = [result_queue.get() for _ in batch_ranges]
        for p in pool:
            p.join()

    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results))  # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
Related
I am trying to use multiprocessing to speed up dealing with lots of files instead of reading them one by one. I did a test first to learn how it works. Below is my code:
from multiprocessing.pool import Pool
from time import sleep, time

def print_cube(num):
    aa1 = num * num
    aa2 = num * num * num
    return aa1, aa2

def main1():
    start = time()
    x = []
    y = []
    p = Pool(16)
    for j in range(1, 5):
        results = p.apply_async(print_cube, args = (j, ))
        x.append(results.get()[0])
        y.append(results.get()[1])
    end = time()
    return end - start, x, y

def main2():
    start = time()
    x = []
    y = []
    for j in range(1, 5):
        results = print_cube(j)
        x.append(results[0])
        y.append(results[1])
    end = time()
    return end - start, x, y

if __name__ == "__main__":
    print("Method1{0}time : {1}{2}x : {3}{4}y : {5}".format('\n', main1()[0], '\n', main1()[1], '\n', main1()[2]))
    print("Method2{0}time : {1:.6f}{2}x : {3}{4}y : {5}".format('\n', main2()[0], '\n', main2()[1], '\n', main2()[2]))
And the result is:
Method1
time : 0.1549079418182373
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method2
time : 0.000000
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method1 uses multiprocessing and consumes more CPU, but costs more time than method2.
Even if the number of cycles j goes to 5000 or greater, method2 works better than method1. Can anybody tell me what's wrong with my code?
There is overhead in using multiprocessing that you do not otherwise have, such as (1) creating processes, (2) passing arguments to your worker function, which is running in different processes, and (3) passing results back to your main process. Therefore, the worker function must be sufficiently CPU-intensive so that the gains you achieve by running it in parallel offset the additional overhead I just mentioned. Your worker function, print_cube, does not meet that criterion because it is not sufficiently CPU-intensive.
But you are not even running your worker function in parallel.
You are submitting tasks in a loop by calling method multiprocessing.pool.Pool.apply_async, which returns an instance of multiprocessing.pool.AsyncResult, but before you call apply_async again to submit the next task you are calling method get on the AsyncResult and therefore blocking until the first task completes and returns its result before you submit the second task! You must submit all your tasks with apply_async and save the returned AsyncResult instances and only then call get on these instances. Only then will you achieve parallelism. Even then your worker function, print_cube, uses too little CPU to overcome the additional overhead that multiprocessing incurs to be more performant than serial processing.
In the following code I have (1) corrected the multiprocessing code to perform parallelism and to create a pool size of 5 (there is no reason to create a pool with more processes than the number of tasks you will be submitting or the number of CPU processors that you have for purely CPU-bound tasks; that is just additional overhead you are creating for no good reason) and (2) modified print_cube to be very CPU-intensive to demonstrate how multiprocessing could be advantageous (albeit in an artificial way):
from multiprocessing.pool import Pool
from time import sleep, time

def print_cube(num):
    # emulate a CPU-intensive calculation:
    for _ in range(10_000_000):
        aa1 = num * num
        aa2 = num * num * num
    return aa1, aa2

def main1():
    start = time()
    x = []
    y = []
    p = Pool(5)
    # Submit all the tasks and save the AsyncResult instances:
    results = [p.apply_async(print_cube, args = (j, )) for j in range(1, 5)]
    # Now wait for the return values:
    for result in results:
        # Unpack the tuple:
        x_value, y_value = result.get()
        x.append(x_value)
        y.append(y_value)
    end = time()
    return end - start, x, y

def main2():
    start = time()
    x = []
    y = []
    for j in range(1, 5):
        results = print_cube(j)
        x.append(results[0])
        y.append(results[1])
    end = time()
    return end - start, x, y

if __name__ == "__main__":
    print("Method1{0}time : {1}{2}x : {3}{4}y : {5}".format('\n', main1()[0], '\n', main1()[1], '\n', main1()[2]))
    print("Method2{0}time : {1:.6f}{2}x : {3}{4}y : {5}".format('\n', main2()[0], '\n', main2()[1], '\n', main2()[2]))
Prints:
Method1
time : 1.109999656677246
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method2
time : 2.827015
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Important Note
Unless you have a solid state drive, you will probably find that trying to read multiple files in parallel may be counter-productive because of head movement back and forth. This may also be a job better suited for multithreading.
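To make the multithreading suggestion concrete, here is a rough sketch of overlapping the reads with a thread pool (the directory, file pattern, and worker count are placeholders, not from the question):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_one(path):
    # reading is I/O-bound; the GIL is released while waiting on the disk,
    # so threads can overlap the reads
    return path.read_bytes()

if __name__ == '__main__':
    files = sorted(Path('data').glob('*.csv'))  # placeholder location
    with ThreadPoolExecutor(max_workers=8) as executor:
        contents = list(executor.map(read_one, files))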
@Booboo First of all, thank you very much for your detailed and excellent explanation. It helps me a lot to better understand Python's multiprocessing tools, and your code is also a great example. Next time I try to apply multiprocessing, I'll first consider whether the task satisfies the criteria you described. And sorry for the late reply; I ran some experiments.
Second, I ran the code you gave on my computer, and it showed a similar result to yours, where Method1 did cost less time with higher CPU consumption than Method2.
Method1
time : 1.0751237869262695
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method2
time : 3.642306
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Third, as for the note you wrote, the data files are stored on a solid state drive, and I tested the time and CPU consumption of dealing with about 50 * 100 MB csv files with Method1 (with multiprocessing), Method2 (nothing), and Method3 (with multithreading), respectively. Method2 did consume a high percentage of CPU, 50%, but did not reach the maximum the way Method1 could. The result is as follows:
time : 12.527468204498291
time : 59.400668144226074
time : 35.45922660827637
Fourth, below is the example emulating a CPU-intensive calculation:
import threading
from multiprocessing.pool import Pool
from queue import Queue
from time import time

def print_cube(num):
    # emulate a CPU-intensive calculation:
    for _ in range(10_000_000_0):
        aa1 = num * num
        aa2 = num * num * num
    return aa1, aa2

def print_cube_queue(num, q):
    # emulate a CPU-intensive calculation:
    for _ in range(10_000_000_0):
        aa1 = num * num
        aa2 = num * num * num
    q.put((aa1, aa2))
def main1():
    start = time()
    x = []
    y = []
    p = Pool(8)
    # Submit all the tasks and save the AsyncResult instances:
    results = [p.apply_async(print_cube, args = (j, )) for j in range(1, 5)]
    # Now wait for the return values:
    for result in results:
        # Unpack the tuple:
        x_value, y_value = result.get()
        x.append(x_value)
        y.append(y_value)
    end = time()
    return end - start, x, y

def main2():
    start = time()
    x = []
    y = []
    for j in range(1, 5):
        results = print_cube(j)
        x.append(results[0])
        y.append(results[1])
    end = time()
    return end - start, x, y
def main3():
    start = time()
    q = Queue()
    x = []
    y = []
    threads = []
    for j in range(1, 5):
        t = threading.Thread(target=print_cube_queue, args = (j, q))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
    results = []
    for thread in threads:
        x_value, y_value = q.get()
        x.append(x_value)
        y.append(y_value)  # q.get() takes one value out of q in order
    end = time()
    return end - start, x, y
if __name__ == "__main__":
print("Method1{0}time : {1}{2}x : {3}{4}y : {5}".format('\n' ,main1()[0], '\n', main1()[1], '\n', main1()[2]))
print("Method2{0}time : {1:.6f}{2}x : {3}{4}y : {5}".format('\n' ,main2()[0], '\n', main2()[1], '\n', main2()[2]))
print("Method3{0}time : {1:.6f}{2}x : {3}{4}y : {5}".format('\n' ,main3()[0], '\n', main3()[1], '\n', main3()[2]))
And the result is:
Method1
time : 9.838010549545288
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method2
time : 35.850124
x : [1, 4, 9, 16]
y : [1, 8, 27, 64]
Method3
time : 37.191602
x : [4, 16, 9, 1]
y : [8, 1, 64, 27]
I did some searching, and I don't know whether it is because of the GIL or something else.
I am facing an issue I was not able to solve by searching the web.
I am using the minimal code below. The goal is to run some function 'f_sum' several million times by multiprocessing (using the ProcessPoolExecutor). I am adding multiple arguments via a list of tuples 'args'. In addition, the function is supposed to use some sort of data which is the same for all executions (in the example it's just one number). I do not want to add the data to the 'args' tuple for memory reasons.
The only option I found so far is adding the data outside of the "if __name__ == '__main__'" block. This will (for some reason that I do not understand) make the variable available to all processes. However, updating is not possible. Also, I do not really want to define the data outside, because in the actual code it will be based on a data import and might require additional manipulation.
Hope you can help and thanks in advance!
PS: I am using Python 3.7.9 on Win 10.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

data = 0  # supposed to be a large data set & shared among all calculations
num_workers = 6  # number of CPU cores
num_iterations = 10  # supposed to be large number

def f_sum(args):
    (x, y) = args
    print('This is process', x, 'with exponent:', y)
    value = 0
    for i in range(10**y):
        value += i
    return value/10**y + data

def multiprocessing(func, args, workers):
    with ProcessPoolExecutor(workers) as executor:
        results = executor.map(func, args)
    return list(results)

if __name__ == '__main__':
    data = 0.5  # try to update data, should not be part of 'args' due to memory
    args = []
    for k in range(num_iterations):
        args.append((k, np.random.randint(1, 8)))
    result = multiprocessing(f_sum, args, num_workers)
    if np.abs(result[0] - np.round(result[0])) > 0:
        print('data NOT updated')
Edit to original question:
>> Performance Example 1
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import time

data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
data = np.random.randint(0, 100, size=data_size)
# data = np.linspace(0, data_size, data_size+1, dtype=np.uintc)

def f_sum(args):
    (x, y) = args
    print('This is process', x, 'random number:', y, 'last data', data[-1])
    value = 0
    for i in range(num_sum):
        value += i
    result = value - num_sum*(num_sum-1)/2 + data[-1]
    return result

def multiprocessing(func, args, workers):
    with ProcessPoolExecutor(workers) as executor:
        results = executor.map(func, args)
    return list(results)

if __name__ == '__main__':
    t0 = time.time()
    args = []
    for k in range(num_iterations):
        args.append((k, np.random.randint(1, 10)))
    result = multiprocessing(f_sum, args, num_workers)
    print(f'expected result: {data[-1]}, actual result: {np.unique(result)}')
    t1 = time.time()
    print(f'total time: {t1-t0}')
>> Output
This is process 99 random number: 6 last data 9
expected result: 86, actual result: [ 3. 9. 29. 58.]
total time: 11.760863542556763
This leads to an incorrect result if randint is used; for linspace the result is correct.
>> Performance Example 2 - based on proposal in answer
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array
import time

data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
input = np.random.randint(0, 100, size=data_size)
# input = np.linspace(0, data_size, data_size + 1, dtype=np.uintc)

def f_sum(args):
    (x, y) = args
    print('This is process', x, 'random number:', y, 'last data', data[-1])
    value = 0
    for i in range(num_sum):
        value += i
    result = value - num_sum*(num_sum-1)/2 + data[-1]
    return result

def init_pool(the_data):
    global data
    data = the_data

def multiprocessing(func, args, workers, input):
    data = Array('i', input, lock=False)
    with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(data,)) as executor:
        results = list(executor.map(func, args))
    return results

if __name__ == '__main__':
    t0 = time.time()
    args = []
    for k in range(num_iterations):
        args.append((k, np.random.randint(1, 10)))
    result = multiprocessing(f_sum, args, num_workers, input)
    print(f'expected result: {input[-1]}, actual result:{np.unique(result)}')
    t1 = time.time()
    print(f'total time: {t1-t0}')
>> Output
This is process 99 random number: 7 last data 29
expected result: 29, actual result: [29.]
total time: 30.8266122341156
@Booboo
I added two examples to my original question; "Performance Example 2" is based on your code. First interesting finding: my original code actually gives incorrect results if the data array is initialized with random integers. I noticed that each process initializes the data array by itself. Since it is based on random numbers, each process uses a different array for the calculation, different even from the main process. So that use case would not work with this code; in your code it is correct all the time.
If using linspace, however, it works, since this gives the same result each time. The same would be true for the use case where some data is read from a file (which is my actual use case). Example 1 is still about 3x faster than Example 2, and I think the time is mainly used by the initialization of the array in your method.
Regarding memory usage, I don't see a relevant difference in my task manager. Both examples produce a similar increase in memory, even if the shape is different.
I still believe that your method is the correct approach; however, memory usage seems to be similar and the speed is slower in the example above.
The most efficient use of memory would be to use shared memory so that all processes work on the same instance of data. This would be absolutely necessary if the processes updated data. In the example below, since the access to data is read-only and I am using a simple array of integers, I am using multiprocessing.Array with no locking specified. The "trick" is to initialize your pool by specifying the initializer and initargs arguments so that each process in the pool has access to this shared memory. I have made a couple of other changes to the code, which I have commented:
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array, cpu_count  # new imports

def init_pool(the_data):
    global data
    data = the_data

def f_sum(args):
    (x, y) = args
    print('This is process', x, 'with exponent:', y)
    value = 0
    for i in range(10**y):
        value += i
    return value/10**y + len(data)  # just use the length of data for now

def multiprocessing(func, args, workers):
    data = Array('i', range(1000), lock=False)  # read-only, integers 0, 1, 2, ... 999
    with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(data,)) as executor:
        results = list(executor.map(func, args))  # create the list of results here
    print(results)  # so that it can be printed out for demo purposes
    return results

if __name__ == '__main__':
    num_iterations = 10  # supposed to be large number
    #num_workers = 6  # number of CPU cores
    num_workers = cpu_count()  # number of CPU cores
    args = []
    for k in range(num_iterations):
        args.append((k, np.random.randint(1, 8)))
    result = multiprocessing(f_sum, args, num_workers)
    if np.abs(result[0] - np.round(result[0])) > 0:
        print('data NOT updated')
Prints:
This is process 0 with exponent: 2
This is process 1 with exponent: 1
This is process 2 with exponent: 4
This is process 3 with exponent: 3
This is process 4 with exponent: 5
This is process 5 with exponent: 1
This is process 6 with exponent: 5
This is process 7 with exponent: 2
This is process 8 with exponent: 6
This is process 9 with exponent: 6
[1049.5, 1004.5, 5999.5, 1499.5, 50999.5, 1004.5, 50999.5, 1049.5, 500999.5, 500999.5]
data NOT updated
Updated Example 2
You saw my comments to your question concerning Example 1.
Your Example 2 is still not ideal: you have the statement input = np.random.randint(0, 100, size=data_size) as a global being needlessly executed by every process as it is initialized for use in the process pool. Below is an updated solution that also shows one way you can have your worker function work directly with a numpy array that is backed by a multiprocessing.Array instance, so that the numpy array exists in shared memory. You don't have to use this technique for what you are doing, since you are only using numpy to create random numbers (I am not sure why), but it is a useful technique to know. But you should re-run your code after moving the initialization code of input as I have, so it is only executed once.
I don't have the occasion to work with numpy day to day, but I have come to learn that it uses multiprocessing internally for many of its own functions. So it is often not the best match for use with multiprocessing, although that does not seem to be applicable here since, even in the case below, we are just indexing an element of an array and it would not be using a sub-process to accomplish that.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array
import time
import ctypes

data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
# input = np.linspace(0, data_size, data_size + 1, dtype=np.uintc)

def to_shared_array(arr, ctype):
    shared_array = Array(ctype, arr.size, lock=False)
    temp = np.frombuffer(shared_array, dtype=arr.dtype)
    temp[:] = arr.flatten(order='C')
    return shared_array

def to_numpy_array(shared_array, shape):
    '''Create a numpy array backed by a shared memory Array.'''
    arr = np.ctypeslib.as_array(shared_array)
    return arr.reshape(shape)

def f_sum(args):
    (x, y) = args
    print('This is process', x, 'random number:', y, 'last data', data[-1])
    value = 0
    for i in range(num_sum):
        value += i
    result = value - num_sum*(num_sum-1)/2 + data[-1]
    return result

def init_pool(shared_array, shape):
    global data
    data = to_numpy_array(shared_array, shape)

def multiprocessing(func, args, workers, input):
    input = np.random.randint(0, 100, size=data_size)
    shape = input.shape
    shared_array = to_shared_array(input, ctypes.c_long)
    with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(shared_array, shape)) as executor:
        results = list(executor.map(func, args))
    return input, results

if __name__ == '__main__':
    t0 = time.time()
    args = []
    for k in range(num_iterations):
        args.append((k, np.random.randint(1, 10)))
    input, result = multiprocessing(f_sum, args, num_workers, input)
    print(f'expected result: {input[-1]}, actual result:{np.unique(result)}')
    t1 = time.time()
    print(f'total time: {t1-t0}')
I'm trying to learn how to do parallel programming in python. I wrote a simple int square function and then ran it in serial, multi-thread, and multi-process:
import time
import multiprocessing, threading
import random

def calc_square(numbers):
    sq = 0
    for n in numbers:
        sq = n*n

def splita(list, n):
    a = [[] for i in range(n)]
    counter = 0
    for i in range(0, len(list)):
        a[counter].append(list[i])
        if len(a[counter]) == len(list)/n:
            counter = counter + 1
            continue
    return a

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for i in xrange(1000000)]
    print "init completed"

    start_time2 = time.time()
    calc_square(arr)
    end_time2 = time.time()
    print "serial: " + str(end_time2 - start_time2)

    newarr = splita(arr, 8)
    print 'split complete'

    start_time = time.time()
    for i in range(8):
        t1 = threading.Thread(target=calc_square, args=(newarr[i],))
        t1.start()
        t1.join()
    end_time = time.time()
    print "mt: " + str(end_time - start_time)

    start_time = time.time()
    for i in range(8):
        p1 = multiprocessing.Process(target=calc_square, args=(newarr[i],))
        p1.start()
        p1.join()
    end_time = time.time()
    print "mp: " + str(end_time - start_time)
Output:
init completed
serial: 0.0640001296997
split complete
mt: 0.0599999427795
mp: 2.97099995613
However, as you can see, something weird happened and mt is taking the same time as serial and mp is actually taking significantly longer (almost 50 times longer).
What am I doing wrong? Could someone push me in the right direction to learn parallel programming in python?
Edit 01
Looking at the comments, I see that the function not returning anything perhaps seems pointless. The reason I'm even trying this is that previously I tried the following add function:
def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq
I tried returning the sum of each part to a serial adder, so at least I could see some performance improvement over a pure serial implementation. However, I couldn't figure out how to store and use the returned values, and that's the reason I'm trying to figure out something even simpler than that, which is just dividing up the array and running a simple function on it.
Thanks!
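For reference, a minimal sketch (not from the original post, written for Python 3) of one way to collect the returned partial sums, using multiprocessing.Pool instead of bare Process objects:

from multiprocessing import Pool
import random

def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for _ in range(1000000)]
    chunks = [arr[i::8] for i in range(8)]     # 8 roughly equal slices
    with Pool(8) as pool:
        partial_sums = pool.map(addi, chunks)  # one return value per chunk
    print(sum(partial_sums))                   # same as addi(arr)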
I think that multiprocessing takes quite a long time to create and start each process. I have changed the program to make arr 10 times the size and changed the way the processes are started, and there is a slight speed-up:
(Also note: Python 3.)
import time
import multiprocessing, threading
from multiprocessing import Queue
import random

def calc_square_q(numbers, q):
    while q.empty():
        pass
    return calc_square(numbers)

if __name__ == "__main__":
    random.seed(1)  # note how big arr is now vvvvvvv
    arr = [random.randint(1, 11) for i in range(10000000)]
    print("init completed")

    # ...
    # other stuff as before
    # ...

    processes = []
    q = Queue()
    for arrs in newarr:
        processes.append(multiprocessing.Process(target=calc_square_q, args=(arrs, q)))

    print('start processes')
    for p in processes:
        p.start()  # even tho' each process is started it waits...

    print('join processes')
    q.put(None)  # ... for q to become not empty.
    start_time = time.time()
    for p in processes:
        p.join()
    end_time = time.time()
    print("mp: " + str(end_time - start_time))
Also notice above how I create and start the processes in two different loops, and then finally join with the processes in a third loop.
Output:
init completed
serial: 0.53214430809021
split complete
start threads
mt: 0.5551605224609375
start processes
join processes
mp: 0.2800724506378174
Another factor of 10 increase in size of arr:
init completed
serial: 5.8455305099487305
split complete
start threads
mt: 5.411392450332642
start processes
join processes
mp: 1.9705185890197754
And yes, I've also tried this in python 2.7, although Threads seemed slower.
I hope this is not a duplicate question.
I've run the same function in Python 3.4.2 in a simple way and in a multi-processing way and I've found that the simple way is faster. Perhaps my design is not good, but I don't see where the problem lies.
Below is my code:
Common part
import os
import math
from multiprocessing import Process
import timeit

def exponential(number):
    """
    A function that returns exponential
    """
    result = math.exp(number)
    proc = os.getpid()
Simple solution
if __name__ == '__main__':
    start = timeit.default_timer()
    numbers = [5, 10, 20, 30, 40, 50, 60]
    for index, number in enumerate(numbers):
        exponential(number)
    stop = timeit.default_timer()
    duration = stop - start
    print(duration)
Multi-processing solution
if __name__ == '__main__':
    start = timeit.default_timer()
    numbers = [5, 10, 20, 30, 40, 50, 60]
    procs = []
    for index, number in enumerate(numbers):
        proc = Process(target=exponential, args=(number,))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    stop = timeit.default_timer()
    duration = stop - start
    print(duration)
What I see is that the simple solution is faster than the multi-processing one:
Duration with Simple solution: 2.8359994757920504e-05
Duration with Multi processing solution: 0.012581961986143142
Computing math.exp(x), where x < 100 (as it is in your case), is not especially difficult, so computing these values in parallel does not offer a clear advantage.
Remember that when you set up multiple processes, you also incur the overhead of creating a new process, and copying over the memory space, etc.
Finally, there's something to be said about creating a new process for each number in that list. If that list had 100 numbers in it, you'd be creating 100 new processes, which would compete for time on your 4 or 8 cores (depending on your CPU), adding further delays (especially as the computation itself gets more complex). You're better off creating a pool of processes and getting them to incrementally work on your dataset:
import math
import multiprocessing as mp

def slave(qIn, qOut):
    for i, num in iter(qIn.get, None):
        qOut.put((i, math.exp(num)))
    qOut.put(None)

def master():
    numbers = [5, 10, 20, 30, 40, 50, 60]
    qIn, qOut = [mp.Queue() for _ in range(2)]
    procs = [mp.Process(target=slave, args=(qIn, qOut)) for _ in range(mp.cpu_count()-1)]

    for p in procs: p.start()
    for t in enumerate(numbers): qIn.put(t)
    for p in procs: qIn.put(None)

    answer = [None] * len(numbers)
    done = 0
    while done < len(procs):
        t = qOut.get()
        if t is None:
            done += 1
            continue
        i, e = t
        answer[i] = e

    for p in procs: p.terminate()
    return answer
I'm exploring multitasking in Python. After reading this article, I created an example to compare the performance of multithreading and multiprocessing:
import time
from threading import Thread
from multiprocessing import Process

dummy_data = ''.join(['0' for i in range(1048576)])  # around 1MB of data

def do_something(num):
    l = []
    for i in range(num):
        l.append(dummy_data)

def test(use_thread):
    if use_thread: title = 'Thread'
    else: title = 'Process'
    num = 1000
    jobs = []
    for i in range(4):  # the test machine has 4 cores
        if use_thread:
            j = Thread(target=do_something, args=(num,))
        else:
            j = Process(target=do_something, args=(num,))
        jobs.append(j)

    start = time.time()
    for j in jobs: j.start()
    for j in jobs: j.join()
    end = time.time()

    print '{0}: {1}'.format(title, str(end - start))
The results are:
Process: 0.0416989326477
Thread: 0.149359941483
Which means using Process results in better performance since it utilises available cores.
However, if I change the implementation of function do_something to:
def do_something_1(num):
    l = ''.join([dummy_data for i in range(num)])
Using process suddenly performs worse than threading (I reduce the num value to 1000 due to MemoryError):
Process: 14.6903309822
Thread: 4.30753493309
Can anyone explain to me why the second implementation of do_something results in worse performance for Process compared to Thread?