I'm trying to run two functions in Python 3 in parallel. They each take about 30ms. Unfortunately, after writing a test script, I've found that the startup time to get the processes running in the background is over 100ms, which is a high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python 3 (with lower overhead, ideally in the ones or tens of milliseconds) where I can still get the results of the functions in the main process? Any guidance would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np

def t(s):
    return (time.perf_counter() - s) * 1000

def run0(s):
    print(f"Time to reach run0: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.ones((1,4))

def run1(s):
    print(f"Time to reach run1: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.zeros((1,5))

def main():
    s = time.perf_counter()
    with mp.Pool(processes=2) as p:
        print(f"Time to init pool: {t(s):.2f}ms")
        f0 = p.apply_async(run0, args=(time.perf_counter(),))
        f1 = p.apply_async(run1, args=(time.perf_counter(),))
        r0 = f0.get()
        r1 = f1.get()
    print(r0, r1)
    print(f"Time to run end-to-end: {t(s):.2f}ms")

if __name__ == "__main__":
    main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to reduce the times on the 2nd and 3rd lines by a factor of 10-20. I know that is a lot, and if it is not possible, that is perfectly fine, but I was wondering whether anybody more knowledgeable knows of any methods. Thanks!
Several points to consider:
"Time to init pool" is misleading. At that point the child processes haven't finished starting; the main process has only initiated their startup. Once the workers have actually started, "Time to reach run" should drop because it no longer includes process startup. If you have a long-lived pool of workers, you only pay the startup cost once.
The startup cost of the interpreter is often dominated by imports. In this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another import that can be slow is the automatic import of site, but skipping that one makes other imports difficult.
You're on macOS and can switch to using "fork" instead of "spawn", which should be much faster, but it fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries).
example:
import multiprocessing as mp
import time
# import numpy as np

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    s = time.perf_counter()
    with mp.Pool(processes=1) as p:
        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() -s)*1000:.2f}ms")
        # first job 166ms with numpy; 85ms without; 45ms on linux (wsl2 ubuntu 20.04) with fork
        s = time.perf_counter()
        p.apply_async(run).get()
        print(f"after startup job time: {(time.perf_counter() -s)*1000:.2f}ms")
        # second job about 30ms every time

if __name__ == "__main__":
    main()
You can switch to Python 3.11+, as it has a faster startup time (and faster everything), but as your application grows you will get even slower startup times compared to your toy example.
One option is to run your application inside a Linux Docker image so you can use fork to avoid the spawn overhead (though the copy-on-write (COW) overhead will still be visible).
The ultimate solution? Don't write your application in Python (or any other language with a VM or a garbage collector). Python multiprocessing isn't made for small, fast tasks but for long-running ones; if you need that low a startup time, write it in C or C++.
If you have to use Python, then you should reuse your workers to "absorb" this startup time in a much larger task time.
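The advice to reuse workers can be sketched roughly as follows. This is a minimal illustration rather than the original poster's code: work, make_pool and timed_ms are hypothetical names, the 30ms sleep stands in for the real task, and "fork" is only requested where the platform offers it.

```python
import multiprocessing as mp
import time


def work(x):
    # Hypothetical stand-in for a ~30 ms task.
    time.sleep(0.03)
    return x * 2


def make_pool(processes=2):
    # Assumption: prefer "fork" where the platform offers it (POSIX only);
    # otherwise fall back to the platform default start method.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    pool = mp.get_context(method).Pool(processes=processes)
    # Warm-up job: wait for a result once so that most of the worker
    # startup cost is paid here rather than on the first real request.
    pool.map(work, range(processes))
    return pool


def timed_ms(fn, *args):
    # Run fn(*args) and return its result plus the elapsed milliseconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    pool = make_pool()
    try:
        # The same long-lived pool serves every batch.
        for batch in ([1, 2], [3, 4]):
            results, ms = timed_ms(pool.map, work, batch)
            print(results, f"{ms:.2f}ms")
    finally:
        pool.close()
        pool.join()
```

With a pool created once at program start, each later batch should cost roughly the task time plus IPC, not the interpreter startup measured in the question.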
Related
I'm using a Python API for a proprietary software package to run numerical simulations. I need to do quite a few, so I have tried to speed things up by using multiprocessing.Pool() to run simulations in parallel. The simulations are independent, and the function passed to multiprocessing.Pool() returns nothing; the simulation results are saved to disk. As far as I understand, this should be similar to opening X terminals and running a call to the API from each.
Using multiprocessing starts off well: I can see all processors running at 100%, which is expected for the simulations. However, after a while the processes seem to die. Eventually I end up with no active processes but still some simulations that have not started. I think the problem is that the API is sometimes a little buggy. Certain errors cause the Python kernel to crash, and I think this is likely what is happening with my multiprocessing.Pool().
Is there a way that I can add a new process for each one that dies, so that there will always be processes in the pool? For now I can manually run the individual simulations that cause problems.
Below is a minimal working example, but I am not sure how to reproduce an error that causes the kernel to crash, so it is not of much use.
from multiprocessing import Pool
from multiprocessing import cpu_count
import time

def test_function(a,b):
    "Takes in two variables to justify starmap, pause,return nothing"
    print(f'running case {a}')
    # api(a,b) - Runs a simulation and saves output to disk
    # include error that "randomly" crashes python console/process
    time.sleep(5)

if __name__ == '__main__':
    case_names = list(range(60))
    b = 'b'
    inputs = [(a,b) for a in case_names] #All the inputs in order needed by run_wdi
    start_time = time.time()
    # no_processes = cpu_count()
    no_processes = min(cpu_count(),len(inputs))
    print(f"Using {no_processes} processes on {cpu_count()} cpu's")
    # with Pool(processes=no_processes) as pool:
    with Pool() as pool:
        result = pool.starmap(test_function, inputs)
    end_time = time.time()
    print(f'Total time {end_time-start_time}')
I am using concurrent.futures module to do multiprocessing and multithreading. I am running it on a 8 core machine with 16GB RAM, intel i7 8th Gen processor. I tried this on Python 3.7.2 and even on Python 3.8.2
import concurrent.futures
import time

# takes list and multiply each elem by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 *elem)
    return y

# multiply an elem by 2
def double_single_value(x):
    return 2* x

# define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)

# function to run multiple thread and multiple each elem by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds. It uses only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time()-t)
The code below took more than 9 minutes and consumed all the RAM of the system, after which the system killed all the processes. CPU utilization during this piece of code also does not reach 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time()-t)
I really want to understand:
1) Why is the code that first splits the work across processes and then uses multithreading not faster than the code that uses only multiprocessing?
(I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes ? )
2) Is there a better way of doing multithreading inside multiprocessing for maximum utilization of the allotted cores (or CPUs)?
3) Why did the last piece of code consume all the RAM? Was it due to multithreading?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different workarounds to make a print 16000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
To make this comment, I modified the test so that it only makes 1600 prints (modifying the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
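The gap between await asyncio.sleep(1) and time.sleep(1) reported in these notes comes from the blocking call never yielding to the event loop. A small sketch of that effect, using hypothetical helper names (cooperative, blocking, run_many, compare) and much shorter delays than the tests above:

```python
import asyncio
import time


async def cooperative(delay):
    # Yields to the event loop, so other tasks can run while this one waits.
    await asyncio.sleep(delay)


async def blocking(delay):
    # time.sleep never yields: the whole event loop stalls for `delay`.
    time.sleep(delay)


async def run_many(task, n, delay):
    # Schedule n copies of `task` together and time the whole batch.
    start = time.perf_counter()
    await asyncio.gather(*(task(delay) for _ in range(n)))
    return time.perf_counter() - start


def compare(n=5, delay=0.05):
    concurrent = asyncio.run(run_many(cooperative, n, delay))
    serial = asyncio.run(run_many(blocking, n, delay))
    return concurrent, serial


if __name__ == "__main__":
    concurrent, serial = compare()
    print(f"asyncio.sleep: {concurrent:.2f}s, time.sleep: {serial:.2f}s")
```

The cooperative batch should finish in roughly one delay, while the blocking batch takes about n delays, which matches the 16 sec versus 160 sec pattern described above.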
As you say: "I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes".
You need to figure out whether your program is I/O-bound or CPU-bound, then apply the appropriate method to solve your problem. Applying various methods at random, or all of them at the same time, usually only makes things worse.
Using threading in pure Python for CPU-bound problems is a bad approach regardless of whether you also use multiprocessing. Try to redesign your app to use only multiprocessing, or use third-party libraries such as Dask.
I believe you have figured it out, but I wanted to answer. Your function double_single_value is clearly CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads will be worse than using a single thread, because the GIL does not allow threads to actually run in parallel, so you effectively run on a single thread. In addition, a thread may be switched out before finishing its task, and when it is resumed its state has to be loaded onto the CPU again, which makes things even slower.
Based on your code, most of the work is computation, so I would encourage you to use multiprocessing, since the problem is CPU-bound and NOT I/O-bound (I/O-bound work is things like sending requests to websites and waiting for the server's response, or reading from and writing to disk). This is true for Python as far as I know. The Python GIL (Global Interpreter Lock) will make threaded code run slowly here: it is a mutex (a lock) that allows only one thread at a time to control the Python interpreter, so threads give you concurrency but not parallelism. Threading is perfectly fine for I/O-bound tasks, where it can beat multiprocessing on execution time, but in your case I would use multiprocessing, because each Python process gets its own interpreter and memory space, so the GIL won't be a problem.
I am not so sure about combining multithreading with multiprocessing, but as far as I know it can cause inconsistency in the processed results: you will need more boilerplate code for data synchronization if you want the processes to communicate (IPC), and kernel-level threads are somewhat unpredictable because they are scheduled by the OS and can be preempted at any time (time sharing). I won't stop you from writing that code, but be really sure of what you are doing. Who knows, you might come up with a good solution for it one day.
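To make the thread-versus-process point concrete, here is a hedged sketch that runs the same CPU-bound function under both executor types. The names count_primes, fork_context and run_with are made up for illustration, and requesting "fork" where it exists is an assumption to keep the sketch independent of how the script is launched.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    # Deliberately naive pure-Python loop: CPU-bound, holds the GIL throughout.
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def fork_context():
    # Assumption: use "fork" where available, else the platform default.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    return mp.get_context(method)


def run_with(executor_cls, limits, **kwargs):
    # Same work, same API; only the kind of worker changes.
    with executor_cls(max_workers=2, **kwargs) as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":
    import time

    limits = [100_000, 100_000]
    for name, cls, kwargs in (
        ("threads", ThreadPoolExecutor, {}),
        ("processes", ProcessPoolExecutor, {"mp_context": fork_context()}),
    ):
        start = time.perf_counter()
        run_with(cls, limits, **kwargs)
        print(f"{name}: {time.perf_counter() - start:.2f}s")
```

Both variants return the same counts; on a standard GIL-enabled CPython build with at least two cores, the process version would be expected to finish sooner, because only one thread at a time can execute the pure-Python loop.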
multiprocessing pool.map works nicely on my old PC but does not work on the new PC.
It hangs in the call to
def wait(self,timeout=None):
    self._event.wait(timeout)
at which time the cpu utilization drops to zero% with no further response like it has gone to sleep.
I wrote a simple test.py as follows
import multiprocessing as mp

letters = ['A','B','C']

def doit(letter):
    for i in range(1000):
        print(str(letter) + ' ' + str(i))

if __name__ == '__main__':
    pool = mp.Pool()
    pool.map(doit,letters)
This works on the old PC with i7-7700k(4cores,8logical), python365-64bit, Win10Pro, PyCharm2018.1 where the stdout displays letters and numbers in non-sequential order as expected.
Though this same code does not work on the new build i9-7960(16core-32logical), python37-64bit, Win10Pro, PyCharm2018.3
New PC bios version has not been updated from 2017/11 (4 months older)
pool.py appears to be the same on both machines (2006-2008 R Oudkerk)
The codeline where it hangs in the 'wait' function is ...
self._event.wait(timeout)
Any help please on where I might look next to find the cause.
Thanks in advance.
....
EDIT::
My further interpretation -
1. GIL (Global interpreter Lock) is not relevant here as this relates to multi-threading only, not multiprocessing.
2. multiprocessing.manager is unnecessary here as the code is consuming static input and producing independent output. So pool.close and pool.join are not required either, as I am not post-process joining results
3. This link is a good introduction to multiprocessing though I don't see a solution in here.
https://docs.python.org/2/library/multiprocessing.html#windows
I am learning multiprocessing from "Introducing Python", which has the following example to demonstrate multiprocessing:
import os
import multiprocessing as mp

def do_this(what):
    whoami(what)

def whoami(what):
    print(f"Process {os.getpid()} says: {what}.")

if __name__ == "__main__":
    whoami("I'm the main program.")
    for i in range(4):
        p = mp.Process(target=do_this, args=(f"I'm function {i}",))
        p.start()
Running it gives:
Process 2197 says: I'm the main program..
Process 2294 says: I'm function 1.
Process 2293 says: I'm function 0.
Process 2295 says: I'm function 2.
Process 2296 says: I'm function 3.
However, it's easily achieved by a single process:
def do_this(what):
    whoami(what)

def whoami(what):
    print(f"Process {os.getpid()} says: {what}.")

if __name__ == "__main__":
    whoami("I'm the main program.")
    for i in range(4):
        do_this(f"I'm function {i}")
Process 2197 says: I'm the main program..
Process 2197 says: I'm function 0.
Process 2197 says: I'm function 1.
Process 2197 says: I'm function 2.
Process 2197 says: I'm function 3.
I am trying my best to grasp the idea of multiprocessing and what problem it solves.
In the above case, what is the extra benefit of multiprocessing?
The idea behind multiprocessing is that you can take a problem that requires a lot of math to run, and split the workload between multiple computing systems.
This is often done within a single computer, but can also happen over a network of computers. In the case of python a "multi-process" is executed within a single computer.
The way this works is that modern CPUs have several cores. Each core is like its own processor, in that it can process a single thread at a time.
The reason cpu's are divided into cores is because it's hard to make a single core faster, but it's easy to add more cores, which in turn gives you more total processing power.
The problem with this is that each core can only execute a single thread at a time. So if your program is entirely single threaded it doesn't matter how many cores you have it will only run at the speed of the single core it's on.
Dividing your Python script like you did above separates it into several processes that can run independently on different cores. Each core processes the task you give it, and the final answer is combined and printed to the screen.
In your example there really is no benefit to using multiprocessing, because you aren't doing a significant amount of work to slow the program down. But say you had massive arrays that required expensive math: dividing such an array into parts and distributing those parts to different processes would make the overall program run faster.
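As a rough sketch of that divide-and-combine idea, the example below splits a large range into contiguous pieces, hands each piece to a worker process, and adds up the partial results. The function names (sum_of_squares, split, parallel_sum_of_squares) are hypothetical, n is assumed to be positive, and the workload is only a stand-in for "expensive math".

```python
import multiprocessing as mp


def sum_of_squares(bounds):
    # Work on one contiguous slice of the problem.
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split(n, parts):
    # Split range(n) into at most `parts` contiguous (start, stop) pieces.
    # Assumes n > 0.
    step = -(-n // parts)  # ceiling division
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def parallel_sum_of_squares(n, processes=4):
    # Assumption: use "fork" where available, else the platform default.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    with mp.get_context(method).Pool(processes) as pool:
        # Each worker computes a partial sum; the parent combines them.
        return sum(pool.map(sum_of_squares, split(n, processes)))


if __name__ == "__main__":
    print(parallel_sum_of_squares(10_000_000))
```

For a job as small as the whoami example, the cost of starting processes outweighs any gain; the split only pays off once each piece takes noticeably longer than process startup and result transfer.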
I am trying to reduce the memory requirements of my python 3 code. Right now each iteration of the for loop requires more memory than the last one.
I wrote a small piece of code that has the same behaviour as my project:
import numpy as np
from multiprocessing import Pool
from itertools import repeat

def simulation(steps, y): # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results

def f(steps, y): # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20 # amount of times a random sample is taken
    y = np.ones(5) # dummy variable to show that the next iteration of the code depends on the previous one
    total_results = np.zeros((0,2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))
    print(total_results, y)

if __name__ == "__main__":
    main()
For each iteration of the for loop the threads in simulation() each have a memory usage equal to the total memory used by my code.
Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?
Ideally I would want my code to only copy the memory it requires to execute f() while I can save the results in memory.
Though the script does use quite a bit of memory even with the "smaller" example values, the answer to
Does Python clone my entire environment each time the parallel
processes are run, including the variables not required by f()? How
can I prevent this behaviour?
is that it does in a way clone the environment with forking a new process, but if copy-on-write semantics are available, no actual physical memory needs to be copied until it is written to. For example on this system
% uname -a
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
COW seems to be available and in use, but this may not be the case on other systems. On Windows this is strictly different as a new Python interpreter is executed from .exe instead of forking. Since you mention using htop, you're using some flavour of UNIX or UNIX like system, and you get COW semantics.
For each iteration of the for loop the processes in simulation() each
have a memory usage equal to the total memory used by my code.
The spawned processes will display almost identical values of RSS, but this can be misleading, because mostly they occupy the same actual physical memory mapped to multiple processes, if writes do not occur. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". This submitting happens over IPC and submitted data will be copied. In your example the IPC and 2**20 function calls also dominate the CPU usage. Replacing the mapping with a single vectorized multiplication in simulation took the script's runtime from around 150s to 0.66s on this machine.
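The answer does not show its vectorized replacement, so the following is only one plausible form of it, written under that assumption. simulation_vectorized is a hypothetical name; it draws all the random samples in one call instead of making 2**20 separate calls to f() through the pool.

```python
import numpy as np


def simulation_vectorized(steps, y):
    # One draw of shape (steps, 2) replaces `steps` separate calls to f();
    # column 0 plays the role of y*a and column 1 the role of y*b.
    samples = np.random.random((steps, 2))
    return samples * y


if __name__ == "__main__":
    results = simulation_vectorized(2**20, 1.0)
    print(results.shape)
```

Because the result is already a (steps, 2) array, it can be passed to np.vstack in main() the same way the list of tuples from starmap was, with no IPC and no per-sample Python function call.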
We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:
import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil

def read_arr(arr, done, stop):
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)

def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()
Output on this machine:
% python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...
Now if we replace the function read_arr with a function that modifies the array:
def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)
the results are quite different:
Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...
The for-loop does indeed require more memory after each iteration, but that's obvious: it stacks the total_results from the mapping, so it has to allocate space for a new array to hold both the old results and the new and free the now unused array of old results.
It may help to understand the difference between a thread and a process in an operating system; see "What is the difference between a process and a thread".
In the for loop there are processes, not threads. Threads share the address space of the process that created them; processes have their own address space.
You can print the process ID with os.getpid().
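A short sketch of the address-space difference described above, using hypothetical names (counter, bump, demo). A thread and a child process both run the same function; only the thread's change is visible to the parent afterwards.

```python
import multiprocessing as mp
import os
import threading

counter = [0]  # module-level state, used to show what is and is not shared


def bump():
    counter[0] += 1
    print(f"pid {os.getpid()} sees counter = {counter[0]}")


def demo():
    counter[0] = 0

    # A thread runs in the parent's address space, so its write is visible here.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    after_thread = counter[0]

    # A child process works on its own copy; the parent's counter is untouched.
    # Assumption: use "fork" where available, else the platform default.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    p = mp.get_context(method).Process(target=bump)
    p.start()
    p.join()
    after_process = counter[0]

    return after_thread, after_process


if __name__ == "__main__":
    print(demo())
```

demo() returns (1, 1): the thread's increment shows up in the parent, the child process's increment does not, and the two printed pids differ.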