Why is the multi function, which uses a multiprocessing Pool to segment the data and process it across multiple processes, slower (8 seconds) than just calling the map function (6 seconds)?
from multiprocessing import Pool
import timeit

def timer(function):
    def new_function():
        start_time = timeit.default_timer()
        function()
        elapsed = timeit.default_timer() - start_time
        print('Function "{name}" took {time} seconds to complete.'.format(name=function.__name__, time=elapsed))
    return new_function

def cube(n):
    return n*n*n

nums = range(20000000)

if __name__ == '__main__':
    @timer
    def multi():
        pool = Pool()
        res = pool.map(cube, nums)
        pool.close()
        pool.join()

    @timer
    def test():
        a = map(cube, nums)

    multi()
    test()
Because all the dispatching logic behind pool.map creates overhead.
Multiprocessing always creates overhead of some sort, and how much depends heavily on the underlying implementation.
You are running a lot of very simple tasks here, so the overhead caused by the dispatching logic is not compensated by the gain from parallel execution. Try the same test with fewer, more CPU-intensive tasks and you should see different results.
Example
See this modified test. Here, we have a silly cubes function that computes n^3 1000 times.
from multiprocessing import Pool
import timeit

def timer(function):
    def new_function():
        start_time = timeit.default_timer()
        function()
        elapsed = timeit.default_timer() - start_time
        print('Function "{name}" took {time} seconds to complete.'.format(name=function.__name__, time=elapsed))
    return new_function

def cubes(n):
    for _ in range(999):
        n * n * n
    return n * n * n

nums = range(20000)

if __name__ == '__main__':
    @timer
    def multi():
        pool = Pool()
        res = pool.map(cubes, nums)
        pool.close()
        pool.join()

    @timer
    def test():
        # On Python 3, simply calling map() returns an iterator;
        # tuple() collects its values so the work is actually done and timed.
        a = tuple(map(cubes, nums))

    multi()
    test()
We now see multiprocessing is improving our timing:
Function "multi" took 0.6272498000000001 seconds to complete.
Function "test" took 2.130454 seconds to complete.
Related
First, I tried a multithreading solution for this problem and discovered that it is not suitable for this purpose. Then, as the community suggested, I tried a multiprocessing solution to bypass the GIL, and even that performs poorly compared to single-process, single-thread code. Is Python flawed in this domain?
Is the only solution for heavy CPU calculations to drop Python for another language?
I am posting my multiprocessing test code so you can get an impression.
from itertools import cycle
import random
import multiprocessing as mp
import time

# The class that represents the process
class Task(mp.Process):
    def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None):
        mp.Process.__init__(self, group=group, target=target, name=name, args=args, kwargs=kwargs, daemon=daemon)
        self.inputs = []

    def run(self):
        print(f"{self.name} is running")
        for arr in self.inputs:
            arr.sort()

    def add_input(self, arr):
        self.inputs.append(arr)

# A util function to cycle on an iterable a finite number of times.
def finite_cycle(cycle_on, times):
    infinite_cycle = cycle(cycle_on)
    for _ in range(times):
        yield next(infinite_cycle)

# Constants
THOUSAND = 1000
MILION = THOUSAND ** 2
PCNT = 2
TASK_CNT = 50 * THOUSAND

# Main
def main():
    processes = [Task(name=f"p{pid}") for pid in range(PCNT)]
    for pid in finite_cycle(range(PCNT), TASK_CNT):
        processes[pid].add_input([random.randint(1, 10) for _ in range(100)])

    stime = time.time()
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"execution time: {round(time.time() - stime, 2)}")
    print("finish.")

if __name__ == '__main__':
    main()
And this is the single-process, single-thread code, which is faster for every variation of the constants.
def main():
    inputs = [[random.randint(1, 10) for _ in range(100)] for _ in range(TASK_CNT)]

    stime = time.time()
    for arr in inputs:
        arr.sort()
    print(f"execution time: {round(time.time() - stime, 2)}")
    print("finish.")
On my desktop each run method averaged approximately .125 seconds, while the time elapsed between calling the first start method and the start of the first run method was approximately .23 seconds (i.e. 1628456465.1061594 - 1628456464.8741603). Most of that time, I believe, is taken by the serialization/de-serialization of self.inputs. See below, which is the original program with a few timings added.
The point is that multiprocessing has two sources of overhead that the non-multiprocessing program does not have:
Overhead in creating the processes.
Overhead in passing arguments to and getting results back from the process. This involves moving data from one address space to another (via various mechanisms) in many cases unless shared memory is being used.
Multiprocessing therefore only becomes advantageous when the processing itself (the run method in this case) is so CPU-intensive that the aforementioned costs of multiprocessing are offset by being able to "divide and conquer" the problem.
from itertools import cycle
import random
import multiprocessing as mp
import time

# The class that represents the process
class Task(mp.Process):
    def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None):
        mp.Process.__init__(self, group=group, target=target, name=name, args=args, kwargs=kwargs, daemon=daemon)
        self.inputs = []

    def run(self):
        t = time.time()
        print(f"{self.name} is running at:", t)
        for arr in self.inputs:
            arr.sort()
        print('elapsed time =', time.time() - t)

    def add_input(self, arr):
        self.inputs.append(arr)

# A util function to cycle on an iterable a finite number of times.
def finite_cycle(cycle_on, times):
    infinite_cycle = cycle(cycle_on)
    for _ in range(times):
        yield next(infinite_cycle)

# Constants
THOUSAND = 1000
MILION = THOUSAND ** 2
PCNT = 2
TASK_CNT = 50 * THOUSAND

# Main
def main():
    processes = [Task(name=f"p{pid}") for pid in range(PCNT)]
    for pid in finite_cycle(range(PCNT), TASK_CNT):
        processes[pid].add_input([random.randint(1, 10) for _ in range(100)])

    stime = time.time()
    print('stime =', stime)
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"execution time: {round(time.time() - stime, 2)}")
    print("finish.")

if __name__ == '__main__':
    main()
Prints:
stime = 1628456464.8741603
p0 is running at: 1628456465.1061594
elapsed time = 0.1320023536682129
p1 is running at: 1628456465.3201597
elapsed time = 0.11999750137329102
execution time: 0.62
finish.
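One way to separate the two overheads described above is to make sure nothing large has to be handed to the child at all, for example by letting each process generate its own inputs inside run(). The following is only a rough sketch of that variation (it splits TASK_CNT evenly between the processes, and the data generation now happens inside the timed child, so the absolute numbers are not directly comparable), but the start-to-run gap attributable to moving self.inputs should largely disappear:
import random
import time
import multiprocessing as mp

class Task(mp.Process):
    def __init__(self, name, n_inputs):
        super().__init__(name=name)
        self.n_inputs = n_inputs  # only a small integer has to reach the child

    def run(self):
        # Build the data inside the child process instead of shipping it over.
        inputs = [[random.randint(1, 10) for _ in range(100)] for _ in range(self.n_inputs)]
        for arr in inputs:
            arr.sort()

if __name__ == '__main__':
    PCNT, TASK_CNT = 2, 50 * 1000
    processes = [Task(f"p{pid}", TASK_CNT // PCNT) for pid in range(PCNT)]
    stime = time.time()
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"execution time: {round(time.time() - stime, 2)}")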
I would like to know why starting new processes (multiprocessing.Process) in PyPy3 is roughly 2 times slower than doing the same thing in CPython. I would also like to know about any solutions for this.
This is some code I wrote to illustrate the effect:
import multiprocessing as mp
from time import sleep, time

class A(object):
    def __init__(self, *args, **kwargs):
        # do other stuff
        #self.p_conn, self.child_conn = mp.Pipe()
        #self.q = mp.Queue()
        pass

    def do_something(self, i):
        sleep(0.1)
        s = '%s * %s = %s' % (i, i, i*i)
        #self.child_conn.send(s)
        #self.q.put(i**2)

    def run(self):
        processes = []
        for i in range(500):
            p = mp.Process(target=self.do_something, args=(i,))
            processes.append(p)
        [x.start() for x in processes]
        #for i in range(50):
        #    print(self.p_conn.recv())
        #for i in range(50):
        #    print(self.q.get())

if __name__ == '__main__':
    a = A()
    s = time()
    a.run()
    print(f"Took {time()-s} seconds...")
CPython took around 18 seconds while PyPy3 took around 37 seconds to execute the same code. (Other tests also showed that PyPy3 is 2 times slower than CPython at starting Processes in my system...)
I would like to know how to solve the problem efficiently.
The code is running on a Linux system with fork available.
Q1: It seems mp.Process(target=f, args=args) doesn't pickle f, but Pool.apply(f, args) has to; why is that?
The consequence is that Pool cannot work with nested functions.
Q2: The time taken for Process is much smaller than for Pool. Does Process pickle args like Pool does? How can it be so fast?
Q3: The answer to the above 2 questions might be that Process is utilizing copy-on-write at fork. If Process can benefit from it, why can't Pool do the same?
import time
import multiprocessing as mp
import numpy as np

N = 5000

def create_big_object(n):
    # Make sure memory is allocated.
    np.random.seed(1)
    return np.random.rand(n, n)

def f(a):
    pass

def main():
    time.sleep(1)
    my_array = create_big_object(N)

    def g(a):
        pass

    t = time.time()
    # No pickling?
    p = mp.Process(target=g, args=(my_array,))
    p.start()
    p.join()
    print(f'Process took {time.time() - t} sec')

    pool = mp.Pool(2)
    time.sleep(1)
    t = time.time()
    # pool doesn't work with g because it cannot pickle nested functions
    pool.apply(f, args=(my_array,))
    print(f'Pool took {time.time() - t} sec')

if __name__ == '__main__':
    main()

# Process took 0.0038743019104003906 sec
# Pool took 0.5500802993774414 sec
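To illustrate the copy-on-write hypothesis from Q3 with the fork start method (as on the Linux setup described): a Pool worker inherits whatever already exists in the parent's globals when the pool is created. One common workaround, sketched below purely as an illustration and not the only option, is therefore to make the big array reachable as a global instead of passing it as an argument, so only a tiny placeholder has to be pickled per task:
import multiprocessing as mp
import numpy as np

big_array = None  # populated in the parent before the Pool is forked

def f(_):
    # Reads the array inherited via fork instead of receiving a pickled argument.
    return big_array.shape

def main():
    global big_array
    big_array = np.random.rand(5000, 5000)
    with mp.Pool(2) as pool:
        # Only the small placeholder argument crosses the process boundary.
        print(pool.apply(f, args=(None,)))

if __name__ == '__main__':
    main()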
I am trying to run a piece of code using asyncio to reduce the total execution time. Below is my code, which takes around 6 seconds to fully execute.
Normal function calls (approach 1)
from time import time, sleep
import asyncio

def find_div(range_, divide_by):
    lis_ = []
    for i in range(range_):
        if i % divide_by == 0:
            lis_.append(i)
    print("found numbers for range {}, divided by {}".format(range_, divide_by))
    return lis_

if __name__ == "__main__":
    start = time()
    find_div(50800000, 341313)
    find_div(10005200, 32110)
    find_div(50000340, 31238)
    print(time()-start)
The output of the above code is just the total execution time which is 6 secs.
Multithreaded approach (approach 2)
I used multithreading here, but surprisingly the time increased.
from time import time, sleep
import asyncio
import threading

def find_div(range_, divide_by):
    lis_ = []
    for i in range(range_):
        if i % divide_by == 0:
            lis_.append(i)
    print("found numbers for range {}, divided by {}".format(range_, divide_by))
    return lis_

if __name__ == "__main__":
    start = time()
    t1 = threading.Thread(target=find_div, args=(50800000, 341313))
    t2 = threading.Thread(target=find_div, args=(10005200, 32110))
    t3 = threading.Thread(target=find_div, args=(50000340, 31238))
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()
    print(time()-start)
The output of the above code is 12 secs.
Multiprocessing approach (approach 3)
from time import time, sleep
import asyncio
from multiprocessing import Pool

def multi_run_wrapper(args):
    return find_div(*args)

def find_div(range_, divide_by):
    lis_ = []
    for i in range(range_):
        if i % divide_by == 0:
            lis_.append(i)
    print("found numbers for range {}, divided by {}".format(range_, divide_by))
    return lis_

if __name__ == "__main__":
    start = time()
    with Pool(3) as p:
        p.map(multi_run_wrapper, [(50800000, 341313), (10005200, 32110), (50000340, 31238)])
    print(time()-start)
The output of the multiprocessing code is 3 seconds, which is better than the normal function call approach.
Asyncio approach (approach 4)
from time import time, sleep
import asyncio

async def find_div(range_, divide_by):
    lis_ = []
    for i in range(range_):
        if i % divide_by == 0:
            lis_.append(i)
    print("found numbers for range {}, divided by {}".format(range_, divide_by))
    return lis_

async def task():
    tasks = [find_div(50800000, 341313), find_div(10005200, 32110), find_div(50000340, 31238)]
    result = await asyncio.gather(*tasks)
    print(result)

if __name__ == "__main__":
    start = time()
    asyncio.run(task())
    print(time()-start)
The above code also takes around 6 seconds, which is the same as the normal function calls in approach 1.
Problem
Why is my asyncio approach not working as expected and reducing the overall time? What is wrong in the code?
You have code that exclusively uses the CPU.
Code like this cannot be sped up using async.
Async shines when you have tasks that are waiting on something not CPU related, e.g. a network request or reading from disk. This is generally true for all languages.
In Python, the thread-based approach will not help you either, as it still restricts you to a single core rather than true parallel execution. This is due to the Global Interpreter Lock (GIL). The overhead of starting and switching between threads makes it slower than the simple version.
In this regard, threads are similar to async in Python: they only help if the time you spend waiting is not spent mainly on the CPU, or if you are calling code that's not bound by the GIL, e.g. C extensions.
Multiprocessing really does use multiple CPU cores, so it is faster than the normal solution.
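If the goal is to keep an asyncio-style entry point but still get real parallelism for this CPU-bound loop, one common pattern is to hand the work to a process pool through run_in_executor. A rough sketch of that combination, reusing find_div from the question:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from time import time

def find_div(range_, divide_by):
    # Plain synchronous CPU-bound function, as in the question.
    return [i for i in range(range_) if i % divide_by == 0]

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        jobs = [
            loop.run_in_executor(pool, find_div, 50800000, 341313),
            loop.run_in_executor(pool, find_div, 10005200, 32110),
            loop.run_in_executor(pool, find_div, 50000340, 31238),
        ]
        # gather awaits the pool-backed futures while the event loop stays free.
        results = await asyncio.gather(*jobs)
    print([len(r) for r in results])

if __name__ == "__main__":
    start = time()
    asyncio.run(main())
    print(time() - start)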
import asyncio

async def run(time):
    await asyncio.sleep(time)

This code takes 1 min 40 seconds:
from datetime import datetime

# (run inside an async context, e.g. a notebook cell or an async main())
now = datetime.now()
task = []
for i in range(10):
    await run(10)
now1 = datetime.now()
print(now1 - now)
Optimized using asyncio tasks: this takes only 10 seconds.
from datetime import datetime

now = datetime.now()
task = []
for i in range(10):
    task.append(asyncio.create_task(run(10)))
await asyncio.gather(*task)
now1 = datetime.now()
print(now1 - now)
To make my code more "pythonic" and faster, I use multiprocessing and a map function to send it a) the function and b) the range of iterations.
The implemented solution (i.e., calling tqdm directly on the range: tqdm.tqdm(range(0, 30))) does not work with multiprocessing (as formulated in the code below).
The progress bar is displayed from 0 to 100% (when Python reads the code?) but it does not indicate the actual progress of the map function.
How can one display a progress bar that indicates which step the 'map' function is at?
from multiprocessing import Pool
import tqdm
import time

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    p = Pool(2)
    r = p.map(_foo, tqdm.tqdm(range(0, 30)))
    p.close()
    p.join()
Any help or suggestions are welcome...
Use imap instead of map, which returns an iterator of the processed values.
from multiprocessing import Pool
import tqdm
import time

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    with Pool(2) as p:
        r = list(tqdm.tqdm(p.imap(_foo, range(30)), total=30))
Sorry for being late, but if all you need is a concurrent map, I added this functionality in tqdm>=4.42.0:
from tqdm.contrib.concurrent import process_map  # or thread_map
import time

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    r = process_map(_foo, range(0, 30), max_workers=2)
References: https://tqdm.github.io/docs/contrib.concurrent/ and https://github.com/tqdm/tqdm/blob/master/examples/parallel_bars.py
It supports max_workers and chunksize and you can also easily switch from process_map to thread_map.
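As a small usage note, here is the same call with those parameters spelled out (the chunksize value of 5 is arbitrary, chosen only for illustration), plus the thread-based drop-in:
from tqdm.contrib.concurrent import process_map, thread_map
import time

def _foo(my_number):
    time.sleep(1)
    return my_number * my_number

if __name__ == '__main__':
    # chunksize batches several items per task dispatch.
    r = process_map(_foo, range(0, 30), max_workers=2, chunksize=5)
    # Drop-in swap when threads are enough (e.g. the work is I/O-bound).
    r2 = thread_map(_foo, range(0, 30), max_workers=2)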
Solution found. Be careful! Due to multiprocessing, the estimated time (iterations per second, total time, etc.) could be unstable, but the progress bar works perfectly.
Note: the context manager for Pool is only available in Python 3.3+.
from multiprocessing import Pool
import time
from tqdm import *

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    with Pool(processes=2) as p:
        max_ = 30
        with tqdm(total=max_) as pbar:
            for _ in p.imap_unordered(_foo, range(0, max_)):
                pbar.update()
You can use p_tqdm instead.
https://github.com/swansonk14/p_tqdm
from p_tqdm import p_map
import time

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    r = p_map(_foo, list(range(0, 30)))
Based on the answer of Xavi Martínez, I wrote the function imap_unordered_bar. It can be used in the same way as imap_unordered, with the only difference being that a progress bar is shown.
from multiprocessing import Pool
import time
from tqdm import *

def imap_unordered_bar(func, args, n_processes=2):
    p = Pool(n_processes)
    res_list = []
    with tqdm(total=len(args)) as pbar:
        for i, res in enumerate(p.imap_unordered(func, args)):
            pbar.update()
            res_list.append(res)
    p.close()
    p.join()
    return res_list

def _foo(my_number):
    square = my_number * my_number
    time.sleep(1)
    return square

if __name__ == '__main__':
    result = imap_unordered_bar(_foo, range(5))
import multiprocessing as mp
import tqdm

iterable = ...
num_cpu = mp.cpu_count() - 2  # don't use all CPUs

def func(item):
    # your logic
    ...

if __name__ == '__main__':
    with mp.Pool(num_cpu) as p:
        list(tqdm.tqdm(p.imap(func, iterable), total=len(iterable)))
For a progress bar with apply_async, we can use the following code, as suggested in:
https://github.com/tqdm/tqdm/issues/484
import time
import random
from multiprocessing import Pool
from tqdm import tqdm

def myfunc(a):
    time.sleep(random.random())
    return a ** 2

pool = Pool(2)
pbar = tqdm(total=100)

def update(*a):
    pbar.update()

for i in range(pbar.total):
    pool.apply_async(myfunc, args=(i,), callback=update)

pool.close()
pool.join()
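One hedged addition to the snippet above: if a task raises, its callback never fires and the bar stalls short of 100%, so it can help to also pass apply_async's error_callback so failures still advance the count. A sketch of the same loop with that extra hook (the on_error name is just illustrative):
import time
import random
from multiprocessing import Pool
from tqdm import tqdm

def myfunc(a):
    time.sleep(random.random())
    return a ** 2

if __name__ == '__main__':
    pool = Pool(2)
    pbar = tqdm(total=100)

    def update(*a):
        pbar.update()

    def on_error(e):
        # Count failed tasks as well, so the bar still completes.
        pbar.update()

    for i in range(pbar.total):
        pool.apply_async(myfunc, args=(i,), callback=update, error_callback=on_error)

    pool.close()
    pool.join()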
Here is my take for when you need to get results back from your parallel functions. This function does a few things (there is another post of mine that explains it further), but the key point is that there is a queue of pending tasks and a queue of completed tasks. As workers finish each task from the pending queue, they add the result to the completed-tasks queue. You can wrap the check on the completed-tasks queue with the tqdm progress bar. I am not putting the implementation of the do_work() function here; it is not relevant, as the message is to monitor the completed-tasks queue and update the progress bar every time a result comes in.
import pickle
import multiprocessing as mp
import psutil
from tqdm import tqdm

SENTINEL = 'SENTINEL'  # marker telling the workers there is no more work

def par_proc(job_list, num_cpus=None, verbose=False):
    # Get the number of cores
    if not num_cpus:
        num_cpus = psutil.cpu_count(logical=False)

    print('* Parallel processing')
    print('* Running on {} cores'.format(num_cpus))

    # Set-up the queues for sending and receiving data to/from the workers
    tasks_pending = mp.Queue()
    tasks_completed = mp.Queue()

    # Gather processes and results here
    processes = []
    results = []

    # Count tasks
    num_tasks = 0

    # Add the tasks to the queue
    for job in job_list:
        for task in job['tasks']:
            expanded_job = {}
            num_tasks = num_tasks + 1
            expanded_job.update({'func': pickle.dumps(job['func'])})
            expanded_job.update({'task': task})
            tasks_pending.put(expanded_job)

    # Set the number of workers here
    num_workers = min(num_cpus, num_tasks)

    # We need as many sentinels as there are worker processes so that ALL processes exit when there is no more
    # work left to be done.
    for c in range(num_workers):
        tasks_pending.put(SENTINEL)

    print('* Number of tasks: {}'.format(num_tasks))

    # Set-up and start the workers
    for c in range(num_workers):
        p = mp.Process(target=do_work, args=(tasks_pending, tasks_completed, verbose))
        p.name = 'worker' + str(c)
        processes.append(p)
        p.start()

    # Gather the results
    completed_tasks_counter = 0
    with tqdm(total=num_tasks) as bar:
        while completed_tasks_counter < num_tasks:
            results.append(tasks_completed.get())
            completed_tasks_counter = completed_tasks_counter + 1
            bar.update()  # advance the bar by one completed task

    for p in processes:
        p.join()

    return results
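The original answer deliberately omits do_work(), but for completeness here is a hypothetical sketch of a worker loop compatible with the queue protocol above; it assumes the SENTINEL marker and the pickled 'func'/'task' dictionaries exactly as par_proc puts them on the queue, so treat it only as one possible shape:
import pickle

def do_work(tasks_pending, tasks_completed, verbose=False):
    # Keep pulling jobs until the sentinel shows up, then exit so join() can return.
    # SENTINEL and the two queues come from the par_proc() set-up above.
    while True:
        job = tasks_pending.get()
        if job == SENTINEL:
            break
        func = pickle.loads(job['func'])
        result = func(job['task'])
        if verbose:
            print('completed task:', job['task'])
        tasks_completed.put(result)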
Based on "user17242583" answer, I created the following function. It should be as fast as Pool.map and the results are always ordered. Plus, you can pass as many parameters to your function as you want and not just a single iterable.
from multiprocessing import Pool
from functools import partial
from tqdm import tqdm

def imap_tqdm(function, iterable, processes, chunksize=1, desc=None, disable=False, **kwargs):
    """
    Run a function in parallel with a tqdm progress bar and an arbitrary number of arguments.
    Results are always ordered and the performance should be the same as of Pool.map.
    :param function: The function that should be parallelized.
    :param iterable: The iterable passed to the function.
    :param processes: The number of processes used for the parallelization.
    :param chunksize: The iterable is chopped into chunks of this size and submitted to the process pool as separate tasks.
    :param desc: The description displayed by tqdm in the progress bar.
    :param disable: Disables the tqdm progress bar.
    :param kwargs: Any additional arguments that should be passed to the function.
    """
    if kwargs:
        function_wrapper = partial(_wrapper, function=function, **kwargs)
    else:
        function_wrapper = partial(_wrapper, function=function)

    results = [None] * len(iterable)
    with Pool(processes=processes) as p:
        with tqdm(desc=desc, total=len(iterable), disable=disable) as pbar:
            for i, result in p.imap_unordered(function_wrapper, enumerate(iterable), chunksize=chunksize):
                results[i] = result
                pbar.update()
    return results

def _wrapper(enum_iterable, function, **kwargs):
    i = enum_iterable[0]
    result = function(enum_iterable[1], **kwargs)
    return i, result
This approach is simple and it works.
from multiprocessing.pool import ThreadPool
import time
from tqdm import tqdm

def job():
    time.sleep(1)
    pbar.update()

pool = ThreadPool(5)
with tqdm(total=100) as pbar:
    for i in range(100):
        pool.apply_async(job)
    pool.close()
    pool.join()