Unexpected behavior with multiprocessing pool imap_unordered method - python

I have a process that I'm trying to parallelize with the multiprocessing library. Each process does a few things, but one of the things each process does is to call a function that runs an optimization. Now, 90% of the time each optimization will complete in less than 1 minute, but if it doesn't, it might never converge. There is an internal mechanism that terminates the optimization function if it doesn't converge after say, 20,000 iterations, but that can take a long time (~1hr).
Right now my code looks something like this:
pool = multiprocessing.Pool()
imap_it = pool.imap_unordered(synth_worker, arg_tuples)
while 1:
    try:
        result = imap_it.next(timeout=120)
        process_result(result)
    except StopIteration:
        break
    except multiprocessing.TimeoutError:
        pass
pool.close()
pool.join()
This seems to work fairly well, as I am able to process the first 90% of results quickly, but towards the end the longer-running jobs start to bog things down and only one completes every ~10 minutes. The weird thing is that once it gets to this point and I run top in the terminal, it looks like only 5-6 worker processes are running, even though I have 24 CPUs in my Pool; I see all 24 CPUs engaged at the beginning of the run. Meanwhile there are still over 100 remaining tasks that haven't finished (or even started).
I'm not using any of the built-in chunking options of imap_unordered, so my understanding is that as soon as a worker finishes a task it should pick up a new one. Why would there be fewer parallel processes running than there are remaining tasks, unless all remaining jobs were already allocated to the 5-6 jammed workers that are waiting for their non-convergent optimizations to hit the 20,000-iteration cutoff? Any ideas? Should I try another method to iterate over the results of my Pool as they come in?
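For reference, here is the relevant call with the chunksize spelled out; imap_unordered already defaults to a chunksize of 1, so this only makes the no-chunking assumption explicit:
# chunksize=1 is the default for imap_unordered; shown explicitly here
imap_it = pool.imap_unordered(synth_worker, arg_tuples, chunksize=1)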
Output of top at the beginning of a run here
and output of top once the jobs seem to get stuck here
and this is what stdout looks like when the jobs start to get stuck.

Related

Python multiprocessing: dealing with 2000 processes

Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to set the parallel thread count (hard-coded) and loop through threading several times till completion? This way, after a few experiments, I will be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
regressTuple = [(x,) for x in regressList]
processes = []
for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))
for process in processes:
    process.start()
for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spin up is not limited by the number of cores on your system, but by the ulimit for your user id, which controls the total number of processes that can be launched under that user id (see the quick check below).
The number of cores determines how many of those launched processes can actually run in parallel at any one time.
Crashing of your system can be due to the fact that the target function these processes run is doing something heavy and resource-intensive that the system cannot handle when multiple processes run simultaneously, or because the nprocs limit on the system has been exhausted and the kernel is not able to spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine, because creating a new process is not a lightweight task: generating the PID, allocating memory, setting up the address space, scheduling the process, context switching, and managing its entire life cycle all happen in the background. So it is a heavy operation for the kernel to create a new process.
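A quick way to check both limits mentioned above (a sketch; resource.RLIMIT_NPROC is only available on Unix-like systems):
import os
import resource

# Per-user process limit (the ulimit referred to above) and logical core count.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (soft/hard):", soft, hard)
print("logical cores:", os.cpu_count())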
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spinning up more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process the input data
    ...

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency, and adding system performance tests to the run queue would be an overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
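For what it's worth, the closest built-ins only report the logical CPU count, and as noted they say nothing about memory:
import os
import multiprocessing

# Both report logical CPUs; neither knows how much memory each task will need.
print(os.cpu_count())
print(multiprocessing.cpu_count())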
How can I refactor the code to use an input variable to set the parallel thread count (hard-coded) and loop through threading several times till completion? This way, after a few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
This is probably more of a sysadmin task of balancing the bottlenecks against the queue length, but generally, if your tasks are IO-bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn onto the road. The tasks will then fight with each other for the next block of IO.
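A rough sketch of that sizing heuristic (Unix-only, via sysconf; the ~3MB figure is just the ballpark from above):
import os

# Available physical memory divided by a ~3 MB per-task guess.
avail_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
print("tasks that might fit at once:", avail_bytes // (3 * 1024 * 1024))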

subprocess.Popen() performance degrades as process count rises

I have an application that runs and manages many services using subprocess.Popen(). Each of these services runs until it is explicitly told to come down. I'm noticing that the time to return from the subprocess.Popen() call increases at a fairly linear rate as more processes are spawned by the arbiter.
My basic code looks like this:
import subprocess
import time

process_list = []
for command in command_list:
    start_tm = time.time()
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    end_tm = time.time()
    print(end_tm - start_tm)
    process_list.append(process)
I'm seeing that the print of end_tm-start_tm increases as I spawn more and more processes. The services run by each command can be in any order and I see the same behavior. The time increase isn't completely linear but I keep seeing a pattern: the first process takes ~0.005 seconds to spawn, the 10th takes ~0.125 seconds, the 20th takes ~0.35 seconds, and so on.
My arbiter process runs upwards of 100 subprocesses. I could split it up so that multiple arbiters run with a smaller number of subprocesses each but I want to understand what the issue is first. Is the overhead of one process owning many subprocesses so great that each additional subprocess adds to the return time of subprocess.Popen()? Is there anything I could do to potentially mitigate that?
EDIT: I split my single arbiter process into two. In my previous test, my arbiter was running 64 processes. I created two separate configurations of my arbiter that each ran 32 processes. I ran the first arbiter, let it completely start all 32 processes, then kicked off the second arbiter.
In both cases, the first process again took ~0.005 seconds to start up and the 32nd and final process took about ~0.45 seconds to start up. In my previous test of a single arbiter with 64 processes, the first process took ~0.005 seconds to start while the 64th would take roughly 0.85 seconds.
Not a direct answer to why you are seeing this, but I would highly recommend changing strategy and using a ThreadPoolExecutor to manage your subprocesses and system resources.
Since your system can't effectively manage more processes than it has system threads, I would try:
>>> import subprocess
>>> from concurrent.futures import ThreadPoolExecutor
>>> from multiprocessing import cpu_count
>>> with ThreadPoolExecutor(max_workers=cpu_count()) as pool:
...     results = pool.map(lambda cmd: subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE), command_list)
I find the API to be easy, the resource management to be quite effective, and there are fewer "gotchas".
https://docs.python.org/3.6/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor.shutdown
Although this is for 3.6, the API is largely the same once you pip install the futures backport for 2.7.

RuntimeError: can't start new thread, threading library

So, in this code I am testing some multithreading to speed up my code. If I have a large number of tasks in the queue I get a RuntimeError: can't start new thread error. For example, range(0, 100) works, but range(0, 1000) won't. I am using threading.Semaphore(4) and it works correctly, only processing 4 threads at a time; I have tested this. I know why I am getting this error: even though I am using threading.Semaphore, it still technically starts all the threads at the beginning and just pauses them until it's each thread's turn to run, and starting 1000 threads at the same time is too much for the PC to handle. Is there any way to fix this problem? (Also, yes, I know about the GIL.)
def thread_test():
    threads = []
    for task in self.tasks:
        t = threading.Thread(target=utils.compareV2.run_compare_temp, args=task)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

for x in range(0, 100):
    self.tasks.append(("arg1", "arg2"))
thread_test()
Instead of starting 1000 threads and then only letting 4 do any work at a time, start 4 threads and let them all do work.
What are the extra 996 threads buying you?
They're using memory and putting pressure on the system's scheduler.
The reason you get the RuntimeError is probably that you've run out of memory for the call stacks for all of those threads. The default limit varies by platform but it's probably something close to 8MiB. A thousand of those and you're up around 8GiB... just for stack space.
You can reduce the amount of memory available for the call stack of each thread with threading.stack_size(...). This will let you fit more threads on your system (but be sure you don't set it below the amount of stack space you actually need or you'll crash your process).
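For example (the 512 KiB value is just an illustrative guess; too small a stack will crash your process):
import threading

# Must be called before the threads are created; applies to threads started afterwards.
threading.stack_size(512 * 1024)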
Most likely, what you actually want is a number of processes close to the number of physical cores on your host or a threading system that's more light-weight than what the threading module gives you.
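Here is a minimal sketch of the fixed-worker approach suggested above, assuming the utils.compareV2.run_compare_temp function from the question is importable: four threads pull tasks from a queue instead of creating one thread per task.
import threading
import queue
import utils.compareV2  # assumed importable, as in the question

def worker(task_queue):
    while True:
        task = task_queue.get()
        if task is None:                     # sentinel: no more work
            break
        utils.compareV2.run_compare_temp(*task)

task_queue = queue.Queue()
for _ in range(1000):
    task_queue.put(("arg1", "arg2"))
for _ in range(4):
    task_queue.put(None)                     # one sentinel per worker

workers = [threading.Thread(target=worker, args=(task_queue,)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()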

Python: keep using the same threads for a long-running process

I have a model that I want to keep running in parallel for over 6 hours:
pool = multiprocessing.Pool(10)
for subModel in self.model:
    pool.apply_async(self.compute, (subModel, features))
pool.close()
pool.join()
The problem is that this is too slow right now, as I have to call pool = multiprocessing.Pool(10) and pool.join() each time, constructing and tearing down the 10 workers while the model is run tons of times over the 6 hours.
The solution, I believe, is to keep 10 workers running in the background, so that whenever new data comes in it goes into the model right away, without wasting a lot of time creating and destroying workers.
Is there a way in Python to have such a long-running pool of workers, so that you don't need to start and stop it over and over again?
You don't need to close() and join() (and destroy) the Pool after submitting one set of tasks to it. If you want to make sure that your apply_async() has completed before going on, just call apply() instead, and use the same Pool next time. Or, if you can do other stuff while waiting for the tasks, save the result object returned by apply_async and call wait() on it once you can't go on without it being complete.
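A minimal sketch of that idea, with placeholder compute/model/features standing in for the question's self.compute, self.model, and features: the Pool is built once and every batch of work reuses it.
import multiprocessing

def compute(sub_model, features):
    return sub_model, len(features)              # placeholder for the real work

if __name__ == "__main__":
    model = range(50)                            # stand-in for self.model
    pool = multiprocessing.Pool(10)              # created once, not per batch

    for features in ({"a": 1}, {"b": 2}):        # each new batch reuses the same pool
        async_results = [pool.apply_async(compute, (sub_model, features))
                         for sub_model in model]
        results = [r.get() for r in async_results]   # block only when the results are needed

    pool.close()                                 # only at the very end of the run
    pool.join()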

Python multiprocessing pool number of jobs not correct

I wrote a Python program to launch parallel processes (16) using a pool, to process some files. At the beginning of the run, the number of processes stays at 16 until almost all files get processed. Then, for some reason I don't understand, when there are only a few files left, only one process runs at a time, which makes processing take much longer than necessary. Could you help with this?
Force map() to use a chunksize of 1 instead of letting it guess the best value by itself, e.g.:
pool = Pool(16)
pool.map(func, iterable, chunksize=1)
This should (in theory) guarantee the best distribution of load among workers until the end of the input data.
See here
Before starting to execute the work that you pass to apply_async/map_async on a Pool, Python assigns a piece of the work to each worker.
For example, let's say that you have 8 files to process and you start a Pool with 4 workers.
Before the file processing starts, two specific files will be assigned to each worker. This means that if some worker finishes its job earlier than the others, it will simply "have a break" and will not start helping the others.
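A small experiment (a sketch, with chunksize=2 passed explicitly rather than relying on the default heuristic) makes the chunking visible: each printed PID shows which worker handled which file.
import multiprocessing
import os
import time

def process_file(name):
    time.sleep(0.5)                              # pretend each file takes a while
    return name, os.getpid()                     # report which worker handled it

if __name__ == "__main__":
    files = ["file_%d" % i for i in range(8)]
    with multiprocessing.Pool(4) as pool:
        # chunksize=2 hands each worker two files at a time;
        # chunksize=1 hands files out one by one as workers free up.
        for name, pid in pool.map(process_file, files, chunksize=2):
            print(name, "was processed by PID", pid)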
