Start a new multiprocessing.pool() process for each one that dies - python

I'm using a Python API for a piece of proprietary software to run numerical simulations. I need to run quite a few, so I have tried to speed things up using multiprocessing.Pool() to run simulations in parallel. The simulations are independent and the function passed to multiprocessing.Pool() returns nothing, but the simulation results are saved to disk. As far as I understand, this should be similar to opening X terminals and running a call to the API from each.
Using multiprocessing starts off well: I can see all processors running at 100%, which is expected for the simulations. However, after a while the processes seem to die. Eventually I end up with no active processes but still have simulations that have not started. I think the problem is that the API is sometimes a little buggy, and certain errors cause the Python kernel to crash. I think this is likely what is happening with my multiprocessing.Pool().
Is there a way that I can add a new process for each one that dies, so that there will always be processes in the pool? For now I can run the individual simulations that give problems manually.
Below is a minimal working example, but I am not sure how to reproduce an error that causes the kernel to crash, so it is not of much use.
from multiprocessing import Pool
from multiprocessing import cpu_count
import time


def test_function(a, b):
    """Takes in two variables to justify starmap, pauses, returns nothing."""
    print(f'running case {a}')
    # api(a, b) - runs a simulation and saves the output to disk
    # include an error that "randomly" crashes the Python console/process
    time.sleep(5)


if __name__ == '__main__':
    case_names = list(range(60))
    b = 'b'
    inputs = [(a, b) for a in case_names]  # All the inputs in the order needed by run_wdi
    start_time = time.time()
    # no_processes = cpu_count()
    no_processes = min(cpu_count(), len(inputs))
    print(f"Using {no_processes} processes on {cpu_count()} cpu's")
    # with Pool(processes=no_processes) as pool:
    with Pool() as pool:
        result = pool.starmap(test_function, inputs)
    end_time = time.time()
    print(f'Total time {end_time - start_time}')
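One way to guarantee that a crashed worker is always replaced is to skip Pool and supervise plain Process objects yourself: keep a list of live workers, and whenever one exits (cleanly or because the API took the interpreter down), start the next pending case. This is a minimal sketch based on the example above; the api(a, b) call is still a placeholder:

from multiprocessing import Process, cpu_count
import time


def test_function(a, b):
    print(f'running case {a}')
    # api(a, b) would go here; if it crashes, only this one process dies
    time.sleep(5)


if __name__ == '__main__':
    pending = [(a, 'b') for a in range(60)]
    running = []
    max_workers = min(cpu_count(), len(pending))
    while pending or running:
        alive = []
        for p in running:
            if p.is_alive():
                alive.append(p)
            else:
                p.join()  # reap the finished or crashed worker, whatever its exit code
        running = alive
        # Top the pool of workers back up to max_workers.
        while pending and len(running) < max_workers:
            p = Process(target=test_function, args=pending.pop(0))
            p.start()
            running.append(p)
        time.sleep(0.5)

Cases that crash are simply not retried here; if an output file is missing afterwards, you know which ones to rerun manually.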

Related

Faster Startup of Processes Python

I'm trying to run two functions in Python 3 in parallel. They both take about 30 ms, and unfortunately, after writing a testing script, I've found that the startup time to get the processes running in the background is over 100 ms, which is a pretty high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python 3 (with lower overhead, ideally in the ones or tens of milliseconds) where I can still get the results of the functions in the main process? Any guidance on this would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np


def t(s):
    return (time.perf_counter() - s) * 1000


def run0(s):
    print(f"Time to reach run0: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.ones((1, 4))


def run1(s):
    print(f"Time to reach run1: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.zeros((1, 5))


def main():
    s = time.perf_counter()
    with mp.Pool(processes=2) as p:
        print(f"Time to init pool: {t(s):.2f}ms")
        f0 = p.apply_async(run0, args=(time.perf_counter(),))
        f1 = p.apply_async(run1, args=(time.perf_counter(),))
        r0 = f0.get()
        r1 = f1.get()
        print(r0, r1)
        print(f"Time to run end-to-end: {t(s):.2f}ms")


if __name__ == "__main__":
    main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to decrease the times on the 2nd and 3rd lines by a factor of 10-20x. I know that is a lot, and if it is not possible, that is perfectly fine, but I was just wondering if anybody more knowledgeable would know any methods. Thanks!
Several points to consider:
"Time to init pool" is misleading: the child processes haven't finished starting, only the main process has initiated their startup. Once the workers have actually started, "Time to reach run" should drop, because it no longer includes process startup. If you have a long-lived pool of workers, you only pay the startup cost once.
Startup cost of the interpreter is often dominated by imports. In this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another import that can be slow is the automatic import of site, but skipping it makes other imports difficult.
You're on macOS and can switch to using "fork" instead of "spawn", which should be much faster, but it fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries); a start-method sketch follows the example below.
example:
import multiprocessing as mp
import time
# import numpy as np


def run():
    time.sleep(0.03)
    return "whatever"


def main():
    s = time.perf_counter()
    with mp.Pool(processes=1) as p:
        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() - s) * 1000:.2f}ms")
        # first job 166ms with numpy; 85ms without; 45ms on linux (wsl2 ubuntu 20.04) with fork
        s = time.perf_counter()
        p.apply_async(run).get()
        print(f"after startup job time: {(time.perf_counter() - s) * 1000:.2f}ms")
        # second job about 30ms every time


if __name__ == "__main__":
    main()
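For the macOS point above, the switch itself is a one-liner in the main guard. This is a sketch; note that fork on macOS is officially discouraged because some system libraries are not fork-safe:

import multiprocessing as mp
import time


def run():
    time.sleep(0.03)
    return "whatever"


def main():
    with mp.Pool(processes=1) as p:
        print(p.apply_async(run).get())


if __name__ == "__main__":
    # Must be called once, before any Pool or Process is created.
    mp.set_start_method("fork")
    main()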
You can switch to Python 3.11+, which has a faster startup time (and faster everything), but as your application grows you will see even slower startup times compared to your toy example.
One option is to run your application inside a Linux Docker image so you can use fork and avoid the spawn overhead (though the copy-on-write overhead will still be visible).
The ultimate solution? Don't write your application in Python (or any other language with a VM or a garbage collector). Python multiprocessing isn't made for small fast tasks but for long-running tasks; if you need that low a startup time, write it in C or C++.
If you have to use Python, then you should reuse your workers so that the startup time is "absorbed" by a much larger task time.

object is not iterable with multiprocessing?

I keep getting TypeErrors about a list not being callable, although I do receive the printed output in my terminal... what is calling the list if we are in a loop?
import multiprocessing

def work(page):
    #-------------------------
    # make obj of page and do something
    grabthis = Some_class1(page)
    f = Someclass_2(grabthis, page)
    output = f.extract()
    print(output)

pages = 'PDFPAGES'
# set page
save = []
for page in pages:
    go = work(page)
    start = multiprocessing.Process(target=go)
    start.start()
    save.append(start)
    if go == 'norun':
        continue
for items in save:
    start.join()
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
TypeError: 'list' object is not callable
what is the correct way to iterate a bunch of files through multiprocessing or threading?
See the comment posted by Michael Butscher, as it is most likely that your intention is to have the function work process pages either concurrently (multithreading) or in parallel (multiprocessing). The difference is that with multithreading each thread must acquire the Global Interpreter Lock (GIL) before it can run Python code, so no two threads will ever be executing Python code at the same time (in parallel). This is not too much of an issue if work is (1) mostly I/O bound and releases the GIL while it is waiting for I/O (or a network request) to complete, so that most of the time it is just waiting, or (2) executing C/C++ code (which some library modules use for their implementation) that releases the GIL. Otherwise you have CPU-intensive processing to do, in which case multiprocessing is the way to go. However, multiprocessing has additional overhead that serial processing does not, i.e. the creation of processes and the moving of data across processes (different address spaces). So unless work is significantly CPU-intensive, a multiprocessing solution will run more slowly than a serial one.
Let's assume work is such that multiprocessing is the correct approach, and that in addition to its high CPU requirements there is also a fair amount of waiting involved. Then it might be advantageous to create more processes than you have CPU cores, since the processes will from time to time go into a wait state and allow other processes to run. But if there is little or no I/O involved, you gain nothing by creating more processes than you have CPU cores. Let's assume the latter, and let N be the number of CPU cores you have and M the number of pages you have to process. If M <= N, then you could create a process for each page as you are doing, since you do not seem to be returning a value back from work (but using a multiprocessing pool is probably simpler):
from multiprocessing import Process

def work(page):
    #-------------------------
    # make obj of page and do something
    grabthis = Some_class1(page)
    f = Someclass_2(grabthis, page)
    output = f.extract()
    print(output)

pages = 'PDFPAGES'

# Required for Windows or any platform that uses the *spawn* method to
# create new processes:
if __name__ == '__main__':
    processes = []
    for page in pages:
        p = Process(target=work, args=(page,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
But if M > N, i.e. you have more pages to process than CPU cores, or if work needs to return a result back to the main process, I would then use a multiprocessing pool, which is also suitable even if M <= N:
from multiprocessing import Pool, cpu_count

def work(page):
    #-------------------------
    # make obj of page and do something
    grabthis = Some_class1(page)
    f = Someclass_2(grabthis, page)
    output = f.extract()
    print(output)

pages = 'PDFPAGES'

# Required for Windows or any platform that uses the *spawn* method to
# create new processes:
if __name__ == '__main__':
    # This will create a pool whose size is never more than the number of
    # CPU cores you have or the number of pages you have to process:
    pool_size = min(cpu_count(), len(pages))
    pool = Pool(pool_size)
    pool.map(work, pages)  # or results = pool.map(work, pages) if `work` returns something
    # Cleanup pool:
    pool.close()
    pool.join()
But if work is mostly I/O bound, then use a multithreading pool. The pool size can be quite large, but you should still keep it to a reasonable size (200?):
from multiprocessing.pool import ThreadPool

def work(page):
    #-------------------------
    # make obj of page and do something
    grabthis = Some_class1(page)
    f = Someclass_2(grabthis, page)
    output = f.extract()
    print(output)

pages = 'PDFPAGES'

if __name__ == '__main__':
    # This will create a pool whose size is never more than 200 or
    # the number of pages you have to process:
    pool_size = min(200, len(pages))
    pool = ThreadPool(pool_size)
    pool.map(work, pages)  # or results = pool.map(work, pages) if `work` returns something
    # Cleanup pool:
    pool.close()
    pool.join()
Note
The above are just generalizations. If your work function is mostly iterating over files as you say, then multithreading might be the best approach. But your disk has a maximum data rate, so creating more threads will not help performance beyond a point. Moreover, if you don't have a solid-state drive, the extra head movement caused by reading multiple files concurrently can hurt performance, and two threads may run more slowly than the serial approach. You could start with a pool size of 2, see if it improves performance, and then slowly increase the pool size. The only problem is that your operating system probably caches disk data, so that when you rerun the code with a different pool size it will run faster just due to the caching. You would either need to find a way of purging the disk cache between runs or reboot between runs.
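As a rough way to follow that advice, you can time the same workload at a few pool sizes and watch the trend; because of the disk-cache effect described above, treat the absolute numbers with suspicion. A sketch, with work and pages as defined earlier:

import time
from multiprocessing.pool import ThreadPool

def benchmark(work, pages):
    # Time the same workload at increasing pool sizes. OS disk caching can
    # make later runs look faster than they really are, so ideally purge the
    # cache or reboot between measurements.
    for pool_size in (1, 2, 4, 8):
        pool = ThreadPool(pool_size)
        start = time.time()
        pool.map(work, pages)
        pool.close()
        pool.join()
        print('pool size %d: %.2f seconds' % (pool_size, time.time() - start))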

Python multiprocessing unexpected high memory usage

In our system we have started to experience problems with a task that used to work fine, but now it seems to hang and uses a high amount of memory (so other tasks fail and raise MemoryError).
Context: the example code below had no problem, but for a new database huge_dataframe became a lot bigger. The thing is that if I run both parts separately it works; running them together in process_data_task leads to the problem. Running under Python 2.7 on Linux.
I suspect it is something to do with fork(), but how could each sub-process take up so much memory? huge_dataframe is deleted before starting multiprocessing. It is also strange that do_recalculations hangs only when it is called inside process_data_task (child not joining?), but it doesn't throw any exception. Any explanation or idea for troubleshooting?
import gc
from multiprocessing import Pool, cpu_count

def process_data_task():
    # part 1, high memory usage
    huge_dataframe = retrieve_data()
    process_table(huge_dataframe)
    # added this trying to fix the problem but it didn't help
    del huge_dataframe
    gc.collect()
    # part 2, heavy CPU usage, multiprocessing
    do_recalculations()  # recalculate items in parallel

# Multiprocessing done here
def do_recalculations():
    processes = cpu_count()
    items_to_update = [...]  # query database
    work_total_list = chunkify(items_to_update, processes)
    p = Pool(processes)
    result = p.map(sub_process_func, work_total_list)
    p.close()
    p.join()
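One mitigation that is sometimes suggested for this kind of fork() memory blow-up (an assumption here, not something from the original post) is to run part 1 in its own short-lived child process, so the huge dataframe never lives in the parent that later forks the Pool workers. A minimal sketch reusing the names above:

from multiprocessing import Process

def _load_and_process():
    huge_dataframe = retrieve_data()
    process_table(huge_dataframe)

def process_data_task():
    # Part 1 runs in a throwaway child; when it exits, its memory is returned
    # to the OS, so the Pool workers in part 2 fork from a lean parent.
    part1 = Process(target=_load_and_process)
    part1.start()
    part1.join()
    # part 2, heavy CPU usage, multiprocessing
    do_recalculations()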

Python multiprocessing run time per process increases with number of processes

I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, the run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more pronounced.
There are no other CPU-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time

def worker_fn(data):
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads
    pool = Pool(processes=num_procs)
    data = MyClass()
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
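If OpenBLAS oversubscription really is the culprit (the post only suspects it), a common check is to cap the BLAS thread pools to one thread per process. The environment variables must be set before numpy is first imported, e.g. at the very top of the main script:

import os

# Cap the implicit BLAS/OpenMP thread pools so each worker process runs
# single-threaded matrix operations; must run before numpy is imported.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np

If the per-process run times then stop growing with the number of workers, the slowdown was thread contention rather than multiprocessing itself.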

multiprocessing.pool context and load balancing

I've encountered some unexpected behaviour of the python multiprocessing Pool class.
Here are my questions:
1) When does Pool create its context, which is later used for serialization? The example below runs fine as long as the Pool object is created after the Container definition. If you swap the Pool initializations, a serialization error occurs. In my production code I would like to initialize the Pool well before defining the container class. Is it possible to refresh the Pool "context", or to achieve this in another way?
2) Does Pool have its own load balancing mechanism and if so how does it work?
If I run a similar example on my i7 machine with the pool of 8 processes I get the following results:
- For a light evaluation function, Pool favours using only one process for the computation. It creates 8 processes as requested, but most of the time only one is used (I printed the pid from inside and also see this in htop).
- For a heavy evaluation function the behaviour is as expected. It uses all 8 processes equally.
3) When using Pool I always see 4 more processes than I requested (i.e. for Pool(processes=2) I see 6 new processes). What is their role?
I use Linux with Python 2.7.2
from multiprocessing import Pool
from datetime import datetime

POWER = 10

def eval_power(container):
    for power in xrange(2, POWER):
        container.val **= power
    return container

#processes = Pool(processes=2)

class Container(object):
    def __init__(self, value):
        self.val = value

processes = Pool(processes=2)

if __name__ == "__main__":
    cont = [Container(foo) for foo in xrange(20)]
    then = datetime.now()
    processes.map(eval_power, cont)
    now = datetime.now()
    print "Eval time:", now - then
EDIT - TO BAKURIU
1) I was afraid that that's the case.
2) I don't understand what the Linux scheduler has to do with Python assigning computations to processes. My situation can be illustrated by the example below:
from multiprocessing import Pool
from os import getpid
from collections import Counter

def light_func(ind):
    return getpid()

def heavy_func(ind):
    for foo in xrange(1000000):
        ind += foo
    return getpid()

if __name__ == "__main__":
    list_ = range(100)
    pool = Pool(4)
    l_func = pool.map(light_func, list_)
    h_func = pool.map(heavy_func, list_)
    print "light func:", Counter(l_func)
    print "heavy func:", Counter(h_func)
On my i5 machine (4 threads) I get the following results:
light func: Counter({2967: 100})
heavy func: Counter({2969: 28, 2967: 28, 2968: 23, 2970: 21})
It seems that the situation is as I've described it. However, I still don't understand why Python does it this way. My guess would be that it tries to minimise communication costs, but the mechanism it uses for load balancing is still unknown to me. The documentation isn't very helpful either; the multiprocessing module is rather poorly documented.
3) If I run the above code I get the 4 extra processes as described before. The screenshot comes from htop: http://i.stack.imgur.com/PldmM.png
The Pool object creates the subprocesses during the call to __init__, hence you must define Container before. By the way, I wouldn't include all the code in a single file; I would use a module to implement Container and the other utilities and write a small file that launches the main program.
The Pool does exactly what is described in the documentation. In particular, it has no control over the scheduling of the processes, hence what you see is what Linux's scheduler thinks is right. For small computations the tasks take so little time that the scheduler doesn't bother parallelizing them (this probably gives better performance due to core affinity etc.).
Could you show this with an example and what you see in the task manager? I think they may be the processes that handle the queue inside the Pool, but I'm not sure. On my machine I can see only the main process plus the two subprocesses.
Update on point 2:
The Pool object simply puts the tasks into a queue, and the child processes get the arguments from this queue. If a task takes almost no time to execute, the Linux scheduler lets the same process keep running (hence it consumes more items from the queue). If execution takes a long time, the scheduler switches processes, and thus the other child processes also get to run.
In your case a single process is consuming all the items because the computation takes so little time that it has already finished them all before the other child processes are ready.
As I said, Pool doesn't do anything about balancing the work of the subprocesses. It's simply a queue and a bunch of workers: the pool puts items in the queue and the processes get the items and compute the results. AFAIK the only thing it does to control the queue is to group a certain number of tasks into a single queue item (the chunksize; see the documentation), but there is no guarantee about which process will grab which chunk. Everything else is left to the OS.
On my machine the results are less extreme. Two processes get about twice as many calls as the other two for the light computation, while for the heavy one all have more or less the same number of items processed. Probably on different OSes and/or hardware we would obtain different results.
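To make the chunking concrete: map accepts a chunksize argument, and passing chunksize=1 makes the pool hand the tasks out one at a time instead of grouping them into larger chunks. Whether that actually spreads near-instant tasks evenly still depends on the OS scheduler, as described above, but it is the only knob Pool itself exposes. A sketch against the light_func example:

from multiprocessing import Pool
from os import getpid
from collections import Counter

def light_func(ind):
    return getpid()

if __name__ == "__main__":
    pool = Pool(4)
    # The default chunksize groups the 100 tiny calls into larger chunks;
    # chunksize=1 puts each call into the queue as a separate item.
    spread = pool.map(light_func, range(100), chunksize=1)
    print "light func, chunksize=1:", Counter(spread)
    pool.close()
    pool.join()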
