ThreadPool and Pool for parallel processing - python

Is there a way to use both ThreadPool and Pool in python to parallelise a loop by specifying the number of CPUs and cores you wish to use?
For example I would have a loop execute as:
from multiprocessing.dummy import Pool as ThreadPool
from tqdm import tqdm
import numpy as np
def my_function(x):
    return x + 1
pool = ThreadPool(4)
my_array = np.arange(0, 1e6, 1)
results = list(tqdm(pool.imap(my_function, my_array), total=len(my_array)))
This uses 4 cores, but if I wanted to spread the work across multiple CPUs as well, is there a simple way to adapt the code?

You are confusing a core with a CPU. For most practical purposes the two can be treated as the same thing (let's call them processors from now on).
When you create a thread pool in Python, the threads are user-level threads and all run on the same processor, because of the Global Interpreter Lock (GIL): only one thread can control the Python interpreter at a time. So (Python) threads give you no real parallelism for CPU-intensive tasks.
How to solve this? Easy: spawn multiple Python processes running on different processors, each with its own interpreter. This is what the multiprocessing (mp) module is for; it spawns multiple processes from the parent Python process in which it is called.
You can verify this by running htop (on Linux or macOS) and looking at the number of Python processes. With the mp module, they will all have the same name as the parent script in which the pool.map function is called.
Timing for your code on an 8-core Mac: 39.7 s
Timing for the code below on the same machine: 2.9 s (note: I could use up to 8 cores, but for comparison purposes I am using only 4)
Below is the modified code:
from multiprocessing.dummy import Pool as ThreadPool
from tqdm import tqdm
import numpy as np
import time
import multiprocessing as mp

def my_function(x):
    return x + 1

pool = ThreadPool(4)
my_array = np.arange(0, 1e6, 1)

t1 = time.time()
# results = list(tqdm(pool.imap(my_function, my_array), total=len(my_array)))
pool = mp.Pool(processes=4)  # generally, set to 2*num_cores you have
res = pool.map(my_function, my_array)
print("Time taken = ", time.time() - t1)

multiprocessing.dummy.Pool is simply a ThreadPool, which cannot use multiple cores or CPUs (because of the GIL). You must use multiprocessing.Pool, which runs real OS processes (if you write Pool(N), N is the number of processes; if you omit it, the default is the number of cores your OS reports). The processes pick their arguments off the Pool's internal queue. That way you will use all the CPUs and all the cores in your machine.
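For comparison, a minimal sketch of the two pool types side by side (the function square and the input range are made up for illustration); only the multiprocessing.Pool version can run on several cores at once:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def square(x):
    return x * x

if __name__ == "__main__":
    with ThreadPool(4) as tpool:   # threads: one process, limited by the GIL
        print(tpool.map(square, range(8)))
    with Pool(4) as ppool:         # processes: can use up to 4 cores in parallel
        print(ppool.map(square, range(8)))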

Related

Python Pathos Multiprocessing: How to terminate the pool under a certain condition?

I need to run a bunch of parallel processes, but cannot use the standard multiprocessing package since its serialization with pickle does not work for more complex objects. Therefore I'm currently using pathos.multiprocessing, which uses dill for serialization, and it works flawlessly in that regard.
However, I would like to have an exit condition, so that all processes get terminated once the result of a process meets a certain condition (I'm computing objective values for an optimization problem and I want all processes to stop once the result from a process is worse than any of the previous results).
For the standard multiprocessing package I found this solution (taken from https://stackoverflow.com/a/21491438/15799363). Can I do something similar with pathos.multiprocessing? I couldn't figure out how to pass a callback function to processes with pathos.
from random import random
from multiprocessing import Pool
from time import sleep

def add_something(i):
    # Sleep to simulate the long calculation
    sleep(random() * 30)
    return i + 1

def run_my_process():
    # Create a process pool
    pool = Pool(100)

    # Callback function that checks results and kills the pool
    def check_result(result):
        print(result)
        if result == 90:
            pool.terminate()

    # Start up all of the processes
    for i in range(100):
        pool.apply_async(add_something, args=[i], callback=check_result)

    pool.close()
    pool.join()

if __name__ == '__main__':
    run_my_process()
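One possible workaround is a sketch under an assumption: pathos builds on the multiprocess package, a dill-based fork of multiprocessing whose Pool keeps the same apply_async/callback signature, so the callback approach above can be tried with it directly (the threshold value and sleep times are arbitrary):

# assumption: the `multiprocess` package (pathos's dill-based fork of multiprocessing) is installed
from random import random
from time import sleep
from multiprocess import Pool  # same API as multiprocessing.Pool, but serializes with dill

def add_something(i):
    sleep(random() * 5)  # simulate a long calculation
    return i + 1

def run_my_process():
    pool = Pool(10)

    def check_result(result):
        # callbacks run in the parent process; terminate all workers on the condition
        print(result)
        if result == 9:
            pool.terminate()

    for i in range(10):
        pool.apply_async(add_something, args=[i], callback=check_result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    run_my_process()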

Multithreading inside Multiprocessing in Python

I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16 GB RAM and an Intel i7 8th Gen processor. I tried this on Python 3.7.2 and also on Python 3.8.2.
import concurrent.futures
import time

# takes a list and multiplies each elem by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y

# multiplies an elem by 2
def double_single_value(x):
    return 2 * x

# define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)

# function to run multiple threads and multiply each elem by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds. It uses only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time() - t)
The code below took more than 9 minutes, consumed all the RAM of the system, and then the system killed all the processes. CPU utilization during this piece of code also did not reach 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time() - t)
I really want to understand:
1) Why is the code that first splits the work across processes and then runs multi-threading inside each process not faster than the code that uses only multiprocessing?
(I have gone through many posts describing multiprocessing and multi-threading, and the crux I took away is that multi-threading is for I/O-bound work and multiprocessing is for CPU-bound work. Is that right?)
2) Is there any better way of doing multi-threading inside multiprocessing for maximum utilization of the allotted cores (or CPUs)?
3) Why did that last piece of code consume all the RAM? Was it due to multi-threading?
You can mix concurrency with parallelism.
Why? You may have valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different approaches to print 16,000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
For these notes, I modified the test so that it only makes 1,600 prints (replacing the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "multi-threading is for I/O-bound work and multiprocessing is for CPU-bound work".
You need to figure out whether your program is I/O-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random, or all together at the same time, usually only makes things worse.
Using threading in pure Python for CPU-bound problems is a bad approach, whether you combine it with multiprocessing or not. Try to redesign your app to use only multiprocessing, or use third-party libraries such as Dask.
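For reference, a minimal sketch of the Dask route, assuming Dask with dask.bag is installed; it reuses the question's double_single_value and an arbitrary input size:

# assumption: dask is installed, e.g. pip install "dask[complete]"
import dask.bag as db

def double_single_value(x):
    return 2 * x

if __name__ == "__main__":
    # partition the input and let Dask run it across worker processes
    bag = db.from_sequence(range(100_000), npartitions=8)
    # scheduler="processes" sidesteps the GIL for CPU-bound work
    result = bag.map(double_single_value).compute(scheduler="processes")
    print(result[:5])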
I believe you figured it out, but I wanted to answer anyway. Obviously, your function double_single_value is CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads is worse than using a single thread, because the GIL does not let them actually run in parallel, so you effectively end up running on a single thread anyway. On top of that, a task may be suspended before it finishes and have to be loaded back onto the CPU later, which makes this even slower.
Based on your code, most of it is doing computations (calculations), so multiprocessing is the encouraged way to solve your problem: it is CPU-bound, NOT I/O-bound (things like sending requests to websites and waiting for the server's response, or writing to and reading from disk). As far as Python is concerned, the GIL (Global Interpreter Lock) is a mutex that allows only one thread to control the Python interpreter at a time, so threads give you concurrency rather than parallelism and will make CPU-bound code slow. Threading is perfectly fine for I/O-bound tasks, where it can beat multiprocessing on execution time, but in your case I would encourage multiprocessing, because each Python process gets its own interpreter and memory space, so the GIL won't be a problem for you.
I am not so sure about integrating multithreading with multiprocessing, but from what I know it can cause inconsistency in the processed results: you need extra boilerplate code for data synchronization if you want the processes to communicate (IPC), and threads are somewhat unpredictable (and thus inconsistent at times) because they are scheduled by the OS and can be swapped out at any moment (pre-emptive scheduling of kernel-level threads due to time sharing). I don't want to stop you from writing that code, but be really sure of what you are doing. You never know, you might propose a solution to it one day.

Python (3.7+) multiprocessing: replace Pipe connection between master and workers with asyncio for IO concurrency

Suppose we have the following toy version of a master-worker pipeline for parallel data collection
# pip install gym
import gym
import numpy as np
from multiprocessing import Process, Pipe

def worker(master_conn, worker_conn):
    master_conn.close()
    env = gym.make('Pendulum-v0')
    env.reset()
    while True:
        cmd, data = worker_conn.recv()
        if cmd == 'close':
            worker_conn.close()
            break
        elif cmd == 'step':
            results = env.step(data)
            worker_conn.send(results)

class Master(object):
    def __init__(self):
        self.master_conns, self.worker_conns = zip(*[Pipe() for _ in range(10)])
        self.list_process = [Process(target=worker, args=[master_conn, worker_conn], daemon=True)
                             for master_conn, worker_conn in zip(self.master_conns, self.worker_conns)]
        [p.start() for p in self.list_process]
        [worker_conn.close() for worker_conn in self.worker_conns]

    def go(self, actions):
        [master_conn.send(['step', action]) for master_conn, action in zip(self.master_conns, actions)]
        results = [master_conn.recv() for master_conn in self.master_conns]
        return results

    def close(self):
        [master_conn.send(['close', None]) for master_conn in self.master_conns]
        [p.join() for p in self.list_process]

master = Master()
l = []
T = 1000
for t in range(T):
    actions = np.random.rand(10, 1)
    results = master.go(actions)
    l.append(len(results))
sum(l)
Because of the Pipe connections between the master and each worker, at every time step we have to send a command to the worker through the Pipe, and the worker sends back the results. We need to do this over a long horizon, which can be a bit slow due to the frequent communication.
Therefore, I am wondering whether using the newer Python asyncio feature, combined with Process, to replace Pipe could potentially give a speedup through I/O concurrency, if I understand its functionality correctly.
The multiprocessing module already has a solution for parallel task processing: multiprocessing.Pool
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:    # start 4 worker processes
        print(pool.map(f, range(10)))  # prints "[0, 1, 4, ..., 81]"
You can achieve the same using multiprocessing.Queue. I believe that's how pool.map() is implemented internally.
So, what's the difference between multiprocessing.Queue and multiprocessing.Pipe? A Queue is just a Pipe plus some locking mechanism. Therefore multiple worker processes can share a single Queue (or rather two: one for commands, one for results), whereas with Pipe each process needs its own Pipe (or a pair, or a duplex one), which is exactly what you are doing now.
The only disadvantage of Queue is performance: because all processes share one queue mutex, it doesn't scale well to many processes. If I had to be sure it could handle tens of thousands of items per second, I would choose Pipe, but for the classic parallel task-processing use case I think Queue or just Pool.map() should be fine, because they are much easier to use. (Managing processes can be tricky, and asyncio doesn't make it easier either.)
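For illustration, a minimal sketch of several workers sharing a single pair of Queues, one for commands and one for results; the worker body and message format here are placeholders, not the gym pipeline from the question:

import multiprocessing as mp

def worker(cmd_queue, result_queue):
    # all workers pull commands from the same queue and push results to the other
    while True:
        cmd, data = cmd_queue.get()
        if cmd == 'close':
            break
        elif cmd == 'step':
            result_queue.put(data * 2)  # placeholder for the real per-step work

if __name__ == '__main__':
    cmd_queue, result_queue = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(cmd_queue, result_queue), daemon=True)
             for _ in range(4)]
    [p.start() for p in procs]

    for i in range(8):
        cmd_queue.put(('step', i))
    print(sorted(result_queue.get() for _ in range(8)))

    for _ in procs:
        cmd_queue.put(('close', None))
    [p.join() for p in procs]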
Hope that helps, I'm aware that I've answered a bit different question than you've asked :)

Multiprocessing on a set number of cores

In the code below, where Function is a function to be called, how might I specify the number of processors to be used as 10?
if __name__ == '__main__':
    jobs = []
    for l in lst:
        p = multiprocessing.Process(target=Function, args=(l,))
        jobs.append(p)
        p.start()
This code will completely take over my server, so how do I limit it to ten cores? Should I put it in a loop?
Given that you are essentially mapping a function over a list of values, might I suggest that you use multiprocessing.Pool instead.
This class creates a pool of a limited number of worker processes that can then be used to run a function over a list of inputs, as opposed to Process, where you create one process per function call and then start them all at the same time.
An example of using it in versions of Python < 3.3 would be:
from multiprocessing import Pool
import contextlib

num_threads = 10
with contextlib.closing(Pool(num_threads)) as pool:
    results = pool.map(Function, lst)
If you are using Python 3.3+, the Pool class can be used as a context manager directly and the code simplifies to:
from multiprocessing import Pool

num_threads = 10
with Pool(num_threads) as pool:
    results = pool.map(Function, lst)
Use a process pool to allocate a set number of processes for your task: https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

Python multiprocessing run time per process increases with number of processes

I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more pronounced.
There are no other cpu-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time
from multiprocessing import Pool
from cPickle import dumps, loads

def worker_fn(data):
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    pool = Pool(processes=num_procs)
    data = MyClass()
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
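One common thing to check for that hypothesis (an assumption on my part, since the MyClass code is not shown) is whether OpenBLAS's own threads are oversubscribing the cores when several workers run at once; pinning the BLAS backends to one thread per process before importing numpy can make the per-process timings comparable again. A minimal sketch with a stand-in workload:

import os
# limit the threads used by common BLAS backends; must be set before importing numpy
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from multiprocessing import Pool

def worker_fn(data):
    # each worker now does single-threaded BLAS work, avoiding oversubscription
    return np.dot(data, data.T).sum()

if __name__ == "__main__":
    arrays = [np.random.rand(500, 500) for _ in range(6)]
    with Pool(processes=3) as pool:
        print(pool.map(worker_fn, arrays))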
